freebsd-dev

Author	SHA1	Message	Date
Andriy Gapon	1db5f1724b	fix an architectural problem introduced in r320156, ZFS ABD import The implementation of ZFS refcount_t uses the emulated illumos mutex (the sx lock) and the waiting memory allocation when ZFS_DEBUG is enabled. This makes refcount_t unsuitable for use in GEOM g_up thread where sleeping is prohibited. When importing the ABD change I modified vdev_geom using illumos vdev_disk as an example. As a result, I added a call to abd_return_buf in vdev_geom_io_intr. The latter is called on g_up thread while the former uses refcount_t. This change fixes the problem by deferring the abd_return_buf call to the previously unused vdev_geom_io_done that is called on a ZFS zio taskqueue thread where sleeping is allowed. A side bonus of this change is that now a vdev zio has a pointer to its corresponding bio while the zio is active. Reported by: Shawn Webb <shawn.webb@hardenedbsd.org> Tested by: Shawn Webb <shawn.webb@hardenedbsd.org> MFC after: 1 week X-MFC with: r320156	2017-06-28 13:59:20 +00:00
Andriy Gapon	c20b00c6af	zfs: port vdev_file part of illumos change 3306 3306 zdb should be able to issue reads in parallel illumos/illumos-gate/31d7e8fa33fae995f558673adb22641b5aa8b6e1 https://www.illumos.org/issues/3306 The upstream change was made before we started to import upstream commits individually. It was imported into the illumos vendor area as r242733. That commit was MFV-ed in r260138, but as the commit message says vdev_file.c was left intact. This commit actually implements the parallel I/O for vdev_file using a taskqueue with multiple thread. This implementation does not depend on the illumos or FreeBSD bio interface at all, but uses zio_t to pass around all the relevent data. So, the code looks a bit different from the upstream. This commit also incorporates ZoL commit zfsonlinux/zfs/bc25c9325b0e5ced897b9820dad239539d561ec9 that fixed https://github.com/zfsonlinux/zfs/issues/2270 We need to use a dedicated taskqueue for exactly the same reason as ZoL as we do not implement TASKQ_DYNAMIC. Obtained from: illumos, ZFS on Linux MFC after: 2 weeks	2017-06-26 09:10:09 +00:00
Andriy Gapon	ee2d3c0a5b	fix gcc-specific fallout from r320156, MFV of r318946, ZFS ABD Reported by: jhibbits MFC after: 1 week X-MFC with: r320156	2017-06-23 08:42:53 +00:00
Andriy Gapon	3385c74539	MFV r319950: 5220 L2ARC does not support devices that do not provide 512B access FreeBSD note: the actual change has been in FreeBSD since r297848. This commit accounts for integration of that change with subsequent changes, especially r320156 (MFV of r318946) and r314274. illumos/illumos-gate@403a8da73c `403a8da73c` https://www.illumos.org/issues/5220 There are disk devices that have logical sector size larger than 512B, for example 4KB. That is, their physical sector size is larger than 512B and they do not provide emulation for 512B sector sizes. For such devices both a data offset and a data size must be properly aligned. L2ARC should arrange that because it uses physical I/O. zio_vdev_io_start() performs a necessary transformation if io_size is not aligned to vdev_ashift, but that is done only for logical I/O. Something similar should be done in L2ARC code. * a temporary write buffer should be allocated if the original buffer is not going to be compressed and its size is not aligned * size of a temporary compression buffer should be ashift aligned * for the reads, if a size of a target buffer is not sufficiently large and it is not aligned then a temporary read buffer should be allocated Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Andriy Gapon <avg@FreeBSD.org> MFC after: 3 weeks	2017-06-22 17:10:34 +00:00
Andriy Gapon	ae5ec64b88	MFV r319742: 8056 zfs send size estimate is inaccurate for some zvols illumos/illumos-gate@0255edcc85 `0255edcc85` https://www.illumos.org/issues/8056 The send size estimate for a zvol can be too low, if the size of the record headers (dmu_replay_record_t's) is a significant portion of the size. This is typically the case when the data is highly compressible, especially with embedded blocks. The problem is that dmu_adjust_send_estimate_for_indirects() assumes that blocks are the size of the "recordsize" property (128KB). However, for zvols, the blocks are the size of the "volblocksize" property (8KB). Therefore, we estimate that there will be 16x less record headers than there really will be. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Paul Dagnelie <pcd@delphix.com> MFC after: 3 weeks	2017-06-22 16:58:09 +00:00
Andriy Gapon	e70097b50f	MFV r318947: 7578 Fix/improve some aspects of ZIL writing. FreeBSD note: this commit removes small differences between what mav committed to FreeBSD in r308782 and what ended up committed to illumos after addressing all review comments. illumos/illumos-gate@c5ee46810f `c5ee46810f` https://www.illumos.org/issues/7578 After some ZIL changes 6 years ago zil_slog_limit got partially broken due to zl_itx_list_sz not updated when async itx'es upgraded to sync. Actually because of other changes about that time zl_itx_list_sz is not really required to implement the functionality, so this patch removes some unneeded broken code and variables. Original idea of zil_slog_limit was to reduce chance of SLOG abuse by single heavy logger, that increased latency for other (more latency critical) loggers, by pushing heavy log out into the main pool instead of SLOG. Beside huge latency increase for heavy writers, this implementation caused double write of all data, since the log records were explicitly prepared for SLOG. Since we now have I/O scheduler, I've found it can be much more efficient to reduce priority of heavy logger SLOG writes from ZIO_PRIORITY_SYNC_WRITE to ZIO_PRIORITY_ASYNC_WRITE, while still leave them on SLOG. Existing ZIL implementation had problem with space efficiency when it has to write large chunks of data into log blocks of limited size. In some cases efficiency stopped to almost as low as 50%. In case of ZIL stored on spinning rust, that also reduced log write speed in half, since head had to uselessly fly over allocated but not written areas. This change improves the situation by offloading problematic operations from z_log_write() to zil_lwb_commit(), which knows real situation of log blocks allocation and can split large requests into pieces much more efficiently. Also as side effect it removes one of two data copy operations done by ZIL code WR_COPIED case. While there, untangle and unify code of z_log_write() functions. Also zfs_log_write() alike to zvol_log_write() can now handle writes crossing block boundary, that may also improve efficiency if ZPL is made to do that. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Steven Hartland <steven.hartland@multiplay.co.uk> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Alexander Motin <mav@FreeBSD.org> MFC after: 3 weeks	2017-06-22 16:52:22 +00:00
Andriy Gapon	a4a2976d8a	fix several fallouts from r320156, ZFS ABD import All of the problems were related to the FreeBSD-only features. One was caused by a mismerge in the zfsbootcfg support code. All others were in the TRIM support code. MFC after: 1 week X-MFC with: r320156	2017-06-21 08:12:07 +00:00
Andriy Gapon	ebf3b53dac	fix several fallouts from r320156, ZFS ABD import All of the problems were related to the FreeBSD-only features. One was caused by a mismerge in the zfsbootcfg support code. All others were in the TRIM support code. Reported by: ken, O. Hartmann <ohartmann@walstatt.org>, Trond Endrestøl <Trond.Endrestol@fagskolen.gjovik.no> MFC after: 1 week X-MFC with: r320156	2017-06-21 08:10:45 +00:00
Andriy Gapon	f9cdbaba8d	MFV r318946: 8021 ARC buf data scatter-ization illumos/illumos-gate@770499e185 `770499e185` https://www.illumos.org/issues/8021 The ARC buf data project (known simply as "ABD" since its genesis in the ZoL community) changes the way the ARC allocates `b_pdata` memory from using linear `void *` buffers to using scatter/gather lists of fixed-size 1KB chunks. This improves ZFS's performance by helping to defragment the address space occupied by the ARC, in particular for cases where compressed ARC is enabled. It could also ease future work to allocate pages directly from `segkpm` for minimal- overhead memory allocations, bypassing the `kmem` subsystem. This is essentially the same change as the one which recently landed in ZFS on Linux, although they made some platform-specific changes while adapting this work to their codebase: 1. Implemented the equivalent of the `segkpm` suggestion for future work mentioned above to bypass issues that they've had with the Linux kernel memory allocator. 2. Changed the internal representation of the ABD's scatter/gather list so it could be used to pass I/O directly into Linux block device drivers. (This feature is not available in the illumos block device interface yet.) FreeBSD notes: - the actual (default) chunk size is 4KB (despite the text above saying 1KB) - we can try to reimplement ABDs, so that they are not permanently mapped into the KVA unless explicitly requested, especially on platforms with scarce KVA - we can try to use unmapped I/O and avoid intermediate allocation of a linear, virtual memory mapped buffer - we can try to avoid extra data copying by referring to chunks / pages in the original ABD Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Chris Williamson <chris.williamson@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Dan Kimmel <dan.kimmel@delphix.com> MFC after: 3 weeks	2017-06-20 17:39:24 +00:00
Andriy Gapon	42ce346fcc	revert r315852 which introduced zio_buf_alloc_nowait for use in vdev_queue_aggregate I think that the change is still good, but reconciling it with a planned merge of the ARC buf data scatter-ization is a bit more tedious than I can handle. MFC after: 17 days	2017-06-20 16:55:30 +00:00
Andriy Gapon	602cf4e4a7	MFV r319951: 8311 ZFS_READONLY is a little too strict illumos/illumos-gate@2889ec41c0 `2889ec41c0` https://www.illumos.org/issues/8311 Description: There was a misunderstanding about the enforcement details of the "Read-only" flag introduced for SMB/CIFS compatibility, way back in 2007 in the Sun PSARC 2007/315 case. The original authors thought enforcement of the READONLY flag should work similarly as the IMMUTABLE flag. Unfortunately, that enforcement is incompatible with the expectations of Windows applications using this feature through the SMB service. Applications assume (and the MS File System Algorithms MS-FSA confirms they should) that an SMB client can: (a) Open an SMB handle on a file with read/write access, (b) Set the DOS attributes to include the READONLY flag, (c) continue to have write access via that handle. This access model is essentially the same as a Unix/POSIX application that creates a file (with read/write access), uses fchmod() to change the file mode to something not granting write access (i.e. 0444), and then continues to write that file using the open handle it got before the mode change. Currently, the SMB server works-around this problem in a way that will become difficult to maintain as we implement support for SMB3 persistent handles, so SMB depends on this fix. I've written a test program that can be used to demonstrate this problem, and added it to zfs-tests (tests/functional/acl/cifs/cifs_attr_004_pos). It currently fails, but will pass when this problem fixed. Steps to Reproduce: Run the test program on a ZFS file system. Expected Results: Pass Actual Results: Fail. Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Approved by: Prakash Surya <prakash.surya@delphix.com> Author: Gordon Ross <gwr@nexenta.com> MFC after: 2 weeks	2017-06-14 16:55:47 +00:00
Andriy Gapon	7d506d0d57	MFV r319948: 5428 provide fts(), reallocarray(), and strtonum() illumos/illumos-gate@4585130b25 `4585130b25` https://www.illumos.org/issues/5428 Most of the upstream change is not applicable to FreeBSD. Only the renaming of strtonum to zfs_strtonum is relevant to us. And we already had it partially done. Reviewed by: Robert Mustacchi <rm@joyent.com> Approved by: Joshua M. Clulow <josh@sysmgr.org> Author: Yuri Pankov <yuri.pankov@nexenta.com> MFC after: 1 week	2017-06-14 16:42:38 +00:00
Andriy Gapon	b8d341fe26	MFV r319945,r319946: 8264 want support for promoting datasets in libzfs_core illumos/illumos-gate@a4b8c9aa65 `a4b8c9aa65` https://www.illumos.org/issues/8264 Oddly there is a lzc_clone function, but no lzc_promote function. Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan McDonald <danmcd@kebe.com> Approved by: Dan McDonald <danmcd@kebe.com> Author: Andrew Stormont <astormont@racktopsystems.com> MFC after: 1 week	2017-06-14 16:31:36 +00:00
Andriy Gapon	667002fa27	MFV r319741: 8156 dbuf_evict_notify() does not need dbuf_evict_lock illumos/illumos-gate@dbfd9f9300 `dbfd9f9300` https://www.illumos.org/issues/8156 dbuf_evict_notify() holds the dbuf_evict_lock while checking if it should do the eviction itself (because the evict thread is not able to keep up). This can result in massive lock contention. It isn't necessary to hold the lock, because if we make the wrong choice occasionally, nothing bad will happen. Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com> MFC after: 1 week	2017-06-09 15:28:57 +00:00
Andriy Gapon	5f9cf93878	MFV r319739: 8005 poor performance of 1MB writes on certain RAID-Z configurations illumos/illumos-gate@5b06278253 `5b06278253` https://www.illumos.org/issues/8005 RAID-Z requires that space be allocated in multiples of P+1 sectors, because this is the minimum size block that can have the required amount of parity. Thus blocks on RAIDZ1 must be allocated in a multiple of 2 sectors; on RAIDZ2 multiple of 3; and on RAIDZ3 multiple of 4. A sector is a unit of 2^ashift bytes, typically 512B or 4KB. To satisfy this constraint, the allocation size is rounded up to the proper multiple, resulting in up to 3 "pad sectors" at the end of some blocks. The contents of these pad sectors are not used, so we do not need to read or write these sectors. However, some storage hardware performs much worse (around 1/2 as fast) on mostly-contiguous writes when there are small gaps of non-overwritten data between the writes. Therefore, ZFS creates "optional" zio's when writing RAID-Z blocks that include pad sectors. If writing a pad sector will fill the gap between two (required) writes, we will issue the optional zio, thus doubling performance. The gap-filling performance improvement was introduced in July 2009. Writing the optional zio is done by the io aggregation code in vdev_queue.c. The problem is that it is also subject to the limit on the size of aggregate writes, zfs_vdev_aggregation_limit, which is by default 128KB. For a given block, if the amount of data plus padding written to a leaf device exceeds zfs_vdev_aggregation_limit, the optional zio will not be written, resulting in a ~2x performance degradation. The problem occurs only for certain values of ashift, compressed block size, and RAID-Z configuration (number of parity and data disks). It cannot occur with the default recordsize=128KB. If compression is enabled, all configurations with recordsize=1MB or larger will be impacted to some degree. The problem notably occurs with recordsize=1MB, compression=off, with 10 disks in a RAIDZ2 or RAIDZ3 group (with 512B or 4KB sectors). Therefore Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com> MFC after: 10 days	2017-06-09 15:27:22 +00:00
Andriy Gapon	9f141f8d71	MFV r319738: 8155 simplify dmu_write_policy handling of pre-compressed buffers illumos/illumos-gate@adaec86ad2 `adaec86ad2` https://www.illumos.org/issues/8155 When writing pre-compressed buffers, arc_write() requires that the compression algorithm used to compress the buffer matches the compression algorithm requested by the zio_prop_t, which is set by dmu_write_policy(). This makes dmu_write_policy() and its callers a bit more complicated. We can simplify this by making arc_write() trust the caller to supply the type of pre-compressed buffer that it wants to write, and override the compression setting in the zio_prop_t. Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com> MFC after: 10 days	2017-06-09 15:26:03 +00:00
Andriy Gapon	1628f75af1	zfs_lookup: fix bogus arguments to lookup of "snapshot" directory When a parent directory lookup is done at the root of a snapshot mounted under .zfs/snapshot directory, we need to look up that directory in the parent filesystem. We achieve that by doing a VOP_LOOKUP operation on a .zfs vnode with "snapshot" as a target name. But previously we also passed ISDOTDOT flag to the lookup and, because of that, the lookup actually returned the parent of the .zfs vnode, that is, a root vnode of the parent filesystem. Reported by: lev Tested by: lev MFC after: 3 days	2017-05-29 06:30:34 +00:00
Konstantin Belousov	03311f117b	Use whole mnt_stat.f_fsid bits for st_dev. Since ino64 expanded dev_t to 64bit, make VOP_GETATTR(9) provide all bits of mnt_stat.f_fsid as va_fsid for vnodes on filesystems which use f_fsid. In particular, NFSv3 and sometimes NFSv4, and ZFS use this method or reporting st_dev by stat(2). Provide a new helper vn_fsid() to avoid duplicating code to copy f_fsid to va_fsid. Note that the change is mostly cosmetic. Its motivation is to avoid sign-extension of f_fsid[0] into 64bit dev_t value which happens after dev_t becomes 64bit.. Reviewed by: avg(zfs), rmacklem (nfs) (both for previous version) Sponsored by: The FreeBSD Foundation	2017-05-27 17:00:30 +00:00
Andriy Gapon	32ecf81aff	MFV r318944: 8265 Reserve send stream flag for large dnode feature illumos/illumos-gate@bc83969fdb `bc83969fdb` https://www.illumos.org/issues/8265 Reserve bit 23 in the zfs send stream flags for the large dnode feature which has been implemented for Linux. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Brian Behlendorf <behlendorf1@llnl.gov> MFC after: 1 week	2017-05-26 12:08:38 +00:00
Andriy Gapon	a51eb0a964	MFV r318942: 8166 zpool scrub thinks it repaired offline device illumos/illumos-gate@2d2f193a21 `2d2f193a21` https://www.illumos.org/issues/8166 If we do a scrub while a leaf device is offline (via "zpool offline"), we will inadvertently clear the DTL (dirty time log) of the offline device, even though it is still damaged. When the device comes back online, we will incompletely resilver it, thinking that the scrub repaired blocks written before the scrub was started. The incomplete resilver can lead to data loss if there is a subsequent failure of a different leaf device. The fix is to never clear the DTL of offline devices. Note that if a device is onlined while a scrub is in progress, the scrub will be restarted. The problem can be worked around by running "zpool scrub" after "zpool online". See also https://github.com/zfsonlinux/zfs/issues/5806 Reviewed by: George Wilson george.wilson@delphix.com Reviewed by: Brad Lewis <brad.lewis@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com>	2017-05-26 12:04:21 +00:00
Andriy Gapon	2cd05c2473	MFV r318934: 8070 Add some ZFS comments illumos/illumos-gate@40713f2b24 `40713f2b24` https://www.illumos.org/issues/8070 Add some ZFS comments left by various developers at different times Reviewed by: Yuri Pankov <yuri.pankov@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Alan Somers <asomers@gmail.com> MFC after: 1 week	2017-05-26 11:49:42 +00:00
Andriy Gapon	0a07ea0e2f	MFV r318931: 8063 verify that we do not attempt to access inactive txg illumos/illumos-gate@b7b2590dd9 `b7b2590dd9` https://www.illumos.org/issues/8063 A standard practice in ZFS is to keep track of "per-txg" state. Any of the 3 active TXG's (open, quiescing, syncing) can have different values for this state. We should assert that we do not attempt to modify other (inactive) TXG's. Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com> MFC after: 2 weeks	2017-05-26 11:37:11 +00:00
Andriy Gapon	28c5e43e36	MFV r318929: 7786 zfs`vdev_online() needs better notification about state changes illumos/illumos-gate@5f368aef86 `5f368aef86` https://www.illumos.org/issues/7786 Currently, vdev_online() will only post sysevent if previous state was "offline". It should also post the event when the state changes from "removed" or "faulted" to "healthy" or "degraded". This will fix the following scenario: - pull disk from slot A - check that hotspare has taken its place (if available) - insert disk into slot B - check that hotspare moved back to "avail" state (if spare was used) The problem here is that we don't get any ESC_ZFS_VDEV_* notification and fail to update the vdev FRU. Reviewed by: Matthew Ahrens mahrens@delphix.com Reviewed by: George Wilson george.wilson@delphix.com Approved by: Albert Lee <trisk@forkgnu.org> Author: Yuri Pankov <yuri.pankov@nexenta.com> MFC after: 1 week	2017-05-26 11:33:34 +00:00
Andriy Gapon	9c2a3c861f	MFV r318927: 8025 dbuf_read() creates unnecessary zio_root() for bonus buf illumos/illumos-gate@def4fac588 `def4fac588` https://www.illumos.org/issues/8025 dbuf_read() creates a zio_root() to track and wait for all the zio's that may happen as part of this call. However, if the blkptr_t for this buffer is NULL or a hole, we will not create any more zio's, so this zio_root() is unnecessary. This is always the case when calling dbuf_read() on a bonus buffer, because it has no blkptr (it's part of the containing dnode). For workloads that read a lot of bonus buffers (e.g. file creation and removal), creating and destroying these unnecessary zio's can decrease performance by around 3%. Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2017-05-26 11:30:55 +00:00
Andriy Gapon	ebaf416f95	MFV r316929: 6914 kernel virtual memory fragmentation leads to hang illumos/illumos-gate@af868f46a5 `af868f46a5` https://www.illumos.org/issues/6914 FreeBSD note: only a ZFS part of the change is merged, changes to the VM subsystem are not ported (obviously). Also, now that FreeBSD has vmem(9) we don't have to ifdef-out the code that uses it. MFC after: 2 weeks	2017-05-26 11:23:16 +00:00
Andriy Gapon	8629ec8394	arc_init: make code closer to upstream by introducing 'allmem' variable All the differences in calculations are kept. A comment about arc_max being 1/2 of all memory is fixed to reflect the actual code that uses 5/8 as a factor. MFC after: 1 week	2017-05-26 11:05:56 +00:00
Andriy Gapon	cf781c9b60	zfs_putpages: assert that sa_bulk_update() must succeed Same as the upstream does in r316927. MFC after: 1 week	2017-05-26 10:37:55 +00:00
Andriy Gapon	04b7c6b337	MFV r316928: 7256 low probability race in zfs_get_data illumos/illumos-gate@0c94e1af67 `0c94e1af67` https://www.illumos.org/issues/7256 error = dmu_sync(zio, lr->lr_common.lrc_txg, zfs_get_done, zgd); ASSERT(error \|\| lr->lr_length <= zp->z_blksz); It's possible, although extremely rare, that the zfs_get_done() callback is executed before dmu_sync() returns. In that case the znode's range lock is dropped and the znode is unreferenced. Thus, the assertion can access some invalid or wrong data via the zp pointer. size variable caches the correct value of z_blksz and can be safely used here. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Andriy Gapon <andriy.gapon@clusterhq.com> MFC after: 1 week	2017-05-26 10:31:05 +00:00
Andriy Gapon	7a94dd7aee	MFC r316924: 8061 sa_find_idx_tab can be declared more type-safely illumos/illumos-gate@7f0bdb4257 `7f0bdb4257` https://www.illumos.org/issues/8061 sa_find_idx_tab() is declared as taking and returning "void *" parameters. These can be declared to be the specific types. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Chris Williamson <chris.williamson@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com> MFC after: 1 week	2017-05-26 10:27:35 +00:00
Andriy Gapon	8816c0bb48	MFV r316925: 6101 attempt to lzc_create() a filesystem under a volume results in a panic illumos/illumos-gate@b127fe3c05 `b127fe3c05` https://www.illumos.org/issues/6101 lzc_create(), or more correctly, zfs_ioc_create() does not reject an attempt to create a filesystem as a child of a volume, instead it proceeds to a crash. A crash stack obtained on FreeBSD: page fault while in kernel mode zap_leaf_lookup() fzap_lookup() zap_lookup_norm() zap_lookup() zfs_get_zplprop() zfs_fill_zplprops_impl() zfs_ioc_create() zfsdev_ioctl() devfs_ioctl_f() kern_ioctl() sys_ioctl() This crash happened with a kernel without debugging assertions. The immediate cause of crash appears to an attempt to interpret a zvol object as a zap object. For filesystems: #define MASTER_NODE_OBJ 1 For zvols: #define ZVOL_OBJ 1ULL #define ZVOL_ZAP_OBJ 2ULL So, I see two problems here: 1. an attempt to create a filesystem under a zvol should be rejected as early as possible, maybe in zfs_fill_zplprops() 2. maybe zap_lookup / zap_lockdir should reject objects that are not of one of the zap object types Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Andriy Gapon <avg@FreeBSD.org> MFC after: 2 weeks	2017-05-24 22:34:54 +00:00
Andriy Gapon	e73f9f8a49	MFV r316923: 8026 retire zfs_throttle_delay and zfs_throttle_resolution illumos/illumos-gate@6b03625981 `6b03625981` https://www.illumos.org/issues/8026 zfs_throttle_delay and zfs_throttle_resolution became disused since the new write throttling mechanism was introduced. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Andriy Gapon <avg@FreeBSD.org> MFC after: 1 week	2017-05-24 22:32:56 +00:00
Andriy Gapon	9fe5e04dfc	MFC r316921: 8027 tighten up dsl_pool_dirty_delta illumos/illumos-gate@313ae1e182 `313ae1e182` https://www.illumos.org/issues/8027 dsl_pool_dirty_delta() should not wake up waiters when dp->dp_dirty_total == zfs_dirty_data_max, because they wait for dp_dirty_total to fall strictly below the threshold. It's probably very rare for that condition to occur, but it's better to have more accurate code. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Andriy Gapon <avg@FreeBSD.org> MFC after: 1 week	2017-05-24 22:27:48 +00:00
Andriy Gapon	e1b8f10a5e	MFV r316920: 8023 Panic destroying a metaslab deferred range tree illumos/illumos-gate@3991b535a8 `3991b535a8` https://www.illumos.org/issues/8023 $C ffffff0011bc0970 vpanic() ffffff0011bc0a00 strlog() ffffff0011bc0a30 range_tree_destroy+0x72(ffffff043769ad00) ffffff0011bc0a70 metaslab_fini+0xd5(ffffff0449acf380) ffffff0011bc0ab0 vdev_metaslab_fini+0x56(ffffff0462bae800) ffffff0011bc0af0 spa_unload+0x9b(ffffff03e3dac000) ffffff0011bc0b70 spa_export_common+0x115(ffffff047f4b4000, 2, 0, 0, 0) ffffff0011bc0b90 spa_destroy+0x1d(ffffff047f4b4000) ffffff0011bc0bd0 zfs_ioc_pool_destroy+0x20(ffffff047f4b4000) ffffff0011bc0c80 zfsdev_ioctl+0x4d7(11400000000, 5a01, 8040190, 100003, ffffff03e1956b10, ffffff0011bc0e68) ffffff0011bc0cc0 cdev_ioctl+0x39(11400000000, 5a01, 8040190, 100003, ffffff03e1956b10, ffffff0011bc0e68) ffffff0011bc0d10 spec_ioctl+0x60(ffffff03d9153b00, 5a01, 8040190, 100003, ffffff03e1956b10, ffffff0011bc0e68, 0) ffffff0011bc0da0 fop_ioctl+0x55(ffffff03d9153b00, 5a01, 8040190, 100003, ffffff03e1956b10, ffffff0011bc0e68, 0) ffffff0011bc0ec0 ioctl+0x9b(3, 5a01, 8040190) ffffff0011bc0f10 _sys_sysenter_post_swapgs+0x149() Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: George Wilson <george.wilson@delphix.com> MFC after: 2 weeks	2017-05-24 22:25:26 +00:00
Andriy Gapon	5386d7295a	MFV r316917: 7968 multi-threaded spa_sync() illumos/illumos-gate@94c2d0eb22 `94c2d0eb22` https://www.illumos.org/issues/7968 spa_sync() iterates over all the dirty dnodes and processes each of them by calling dnode_sync(). If there are many dirty dnodes (e.g. because we created or removed a lot of files), the single thread of spa_sync() calling dnode_sync() can become a bottleneck. Additionally, if many dnodes are dirtied concurrently in open context (e.g. due to concurrent file creation), the os_lock will experience lock contention via dnode_setdirty(). The solution is to track dirty dnodes on a multilist_t, and for spa_sync() to use separate threads to process each of the sublists in the multilist. On the concurrent file creation microbenchmark, the performance improvement from dnode_setdirty() is up to 7%. Additionally, the wall clock time spent in spa_sync() is reduced to 15%-40% of the single-threaded case. In terms of cost/ reward, once the other bottlenecks are addressed, fixing this bug will provide a medium-large performance gain and require a medium amount of effort to implement. Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com> MFC after: 3 weeks	2017-05-24 22:21:24 +00:00
Andriy Gapon	2ba631553c	MFV r316916: 7970 zfs_arc_num_sublists_per_state should be common to all multilists illumos/illumos-gate@10fbdecb05 `10fbdecb05` https://www.illumos.org/issues/7970 The global tunable zfs_arc_num_sublists_per_state is used by the ARC and the dbuf cache, and other users are planned. We should change this tunable to be common to all multilists. Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com> MFC after: 3 weeks	2017-05-24 22:15:16 +00:00
Andriy Gapon	1d7634429c	MFC r316915: 7801 add more by-dnode routines (lint) illumos/illumos-gate@411be58a6e `411be58a6e` MFC after: 24 days X-MFC with: r318823	2017-05-24 21:52:20 +00:00
Andriy Gapon	31fd119cc2	MFC r316914: 7801 add more by-dnode routines illumos/illumos-gate@b0c42cd470 `b0c42cd470` https://www.illumos.org/issues/7801 Add _by_dnode() routines for accessing objects given their dnode_t , this is more efficient than accessing the object by (objset_t *, uint64_t object). This change converts some but not all of the existing consumers. As performance-sensitive code paths are discovered they should be converted to use these routines. Ported from: `0eef1bde31` Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: bzzz77 <bzzz.tomas@gmail.com> MFC after: 24 days	2017-05-24 21:49:21 +00:00
Andriy Gapon	5aab788866	MFC r316913: 7869 panic in bpobj_space(): null pointer dereference illumos/illumos-gate@a3905a4592 `a3905a4592` https://www.illumos.org/issues/7869 The issue fixed by this patch is a race condition in the deadlist code. A thread executing an administrative command that uses `dsl_deadlist_space_range()` holds the lock of the whole `deadlist_t` to protect the access of all its entries that the deadlist contains in an avl tree. Sync threads trying to insert a new entry in the deadlist (through `dsl_deadlist_insert()` -> `dle_enqueue()`) do not hold the deadlist lock at that moment. If the `dle_bpobj` is the empty bpobj (our sentinel value), we close and reopen it. Between these two operations, it is possible for the `dsl_deadlist_space_range()` thread to dereference that bpobj which is `NULL` during that window. Threads should hold the a deadlist's `dl_lock` when they manipulate its internal data so scenarios like the one above are avoided. In addition, threads should also hold the bpobj lock whenever they are allocating the subobj list of a bpobj, and not just when they actually insert the subobj to the list. This way we can avoid potential memory leaks. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: George Melikov <mail@gmelikov.ru> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Dan McDonald <danmcd@omniti.com> Author: Serapheim Dimitropoulos <serapheim@delphix.com> MFC after: 2 weeks	2017-05-24 21:45:52 +00:00
Andriy Gapon	930f1af491	MFC r316912: 7793 ztest fails assertion in dmu_tx_willuse_space illumos/illumos-gate@61e255ce72 `61e255ce72` https://www.illumos.org/issues/7793 Background information: This assertion about tx_space_* verifies that we are not dirtying more stuff than we thought we would. We “need” to know how much we will dirty so that we can check if we should fail this transaction with ENOSPC/EDQUOT, in dmu_tx_assign(). While the transaction is open (i.e. between dmu_tx_assign() and dmu_tx_commit() — typically less than a millisecond), we call dbuf_dirty() on the exact blocks that will be modified. Once this happens, the temporary accounting in tx_space_* is unnecessary, because we know exactly what blocks are newly dirtied; we call dnode_willuse_space() to track this more exact accounting. The fundamental problem causing this bug is that dmu_tx_hold_() relies on the current state in the DMU (e.g. dn_nlevels) to predict how much will be dirtied by this transaction, but this state can change before we actually perform the transaction (i.e. call dbuf_dirty()). This bug will be fixed by removing the assertion that the tx_space_ accounting is perfectly accurate (i.e. we never dirty more than was predicted by dmu_tx_hold_()). By removing the requirement that this accounting be perfectly accurate, we can also vastly simplify it, e.g. removing most of the logic in dmu_tx_count_(). The new tx space accounting will be very approximate, and may be more or less than what is actually dirtied. It will still be used to determine if this transaction will put us over quota. Transactions that are marked by dmu_tx_mark_netfree() will be excepted from this check. We won’t make an attempt to determine how much space will be freed by the transaction — this was rarely accurate enough to determine if a transaction should be permitted when we are over quota, which is why dmu_tx_mark_netfree() was introduced in 2014. We also won’t attempt to give “credit” when overwriting existing blocks, if those blocks may be freed. This allows us to remove the do_free_accounting logic in dbuf_dirty(), and associated routines. This Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com> MFC after: 3 weeks	2017-05-24 21:43:34 +00:00
Andriy Gapon	3a9c923927	MFC r316907: 1300 filename normalization doesn't work for removes illumos/illumos-gate@1c17160ac5 `1c17160ac5` https://www.illumos.org/issues/1300 FreeBSD note: recent FreeBSD was not affected by the issue fixed as the name cache is completely bypassed when normalization is enabled. The change is imported for the sake of ZAP infrastructure modifications. Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Kevin Crowe <kevin.crowe@nexenta.com> MFC after: 3 weeks	2017-05-24 21:29:31 +00:00
Alan Somers	7ac72c256f	vdev_geom may associate multiple vdevs per g_consumer vdev_geom.c currently uses the g_consumer's private field to point to a vdev_t. That way, a GEOM event can cause a change to a ZFS vdev. For example, when you remove a disk, the vdev's status will change to REMOVED. However, vdev_geom will sometimes attach multiple vdevs to the same GEOM consumer. If this happens, then geom events will only be propagated to one of the vdevs. Fix this by storing a linked list of vdevs in g_consumer's private field. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c * g_consumer.private now stores a linked list of vdev pointers associated with the consumer instead of just a single vdev pointer. * Change vdev_geom_set_physpath's signature to more closely match vdev_geom_set_rotation_rate * Don't bother calling g_access in vdev_geom_set_physpath. It's guaranteed that we've already accessed the consumer by the time we get here. * Don't call vdev_geom_set_physpath in vdev_geom_attach. Instead, call it in vdev_geom_open, after we know that the open has succeeded. PR: 218634 Reviewed by: gibbs MFC after: 1 week Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D10391	2017-05-11 16:26:56 +00:00
Josh Paetzel	ba13ab83f2	Fix misport of compressed ZFS send/recv from 317414 Reported by: Michael Jung <mikej@mikej.com> Reviewed by: avg	2017-05-01 12:56:12 +00:00
Mark Johnston	babf030fd6	Get rid of some ifdef soup in the fasttrap ioctl handler. No functional change intended. MFC after: 1 week	2017-04-28 22:25:22 +00:00
Josh Paetzel	285d85ab04	MFV 316905 7740 fix for 6513 only works in hole punching case, not truncation illumos/illumos-gate@7de35a3ed0 `7de35a3ed0` https://www.illumos.org/issues/7740 The problem is that dbuf_findbp will return ENOENT if the block it's trying to find is beyond the end of the file. If that happens, we assume there is no birth time, and so we lose that information when we write out new blkptrs. We should teach dbuf_findbp to look for things that are beyond the current end, but not beyond the absolute end of the file. To verify, create a large file, truncate it to a short length, and then write beyond the end. Check with zdb to make sure that there are no holes with birth time zero (will appear as gaps). Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Paul Dagnelie <pcd@delphix.com>	2017-04-28 02:11:29 +00:00
Josh Paetzel	358f157522	MFV 316900 7743 per-vdev-zaps have no initialize path on upgrade illumos/illumos-gate@555da5111b `555da5111b` https://www.illumos.org/issues/7743 When loading a pool that had been created before the existance of per-vdev zaps, on a system that knows about per-vdev zaps, the per-vdev zaps will not be allocated and initialized. This appears to be because the logic that would have done so, in spa_sync_config_object(), is not reached under normal operation. It is only reached if spa_config_dirty_list is non-empty. The fix is to add another `AVZ_ACTION_` enum that will allow this code to be reached when we detect that we're loading an old pool, even when there are no dirty configs. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Don Brady <don.brady@intel.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Paul Dagnelie <pcd@delphix.com>	2017-04-27 23:31:38 +00:00
Josh Paetzel	8ad5797208	MFV 316898 7613 ms_freetree[4] is only used in syncing context illumos/illumos-gate@5f14577801 `5f14577801` https://www.illumos.org/issues/7613 metaslab_t:ms_freetree[TXG_SIZE] is only used in syncing context. We should replace it with two trees: the freeing tree (ranges that we are freeing this syncing txg) and the freed tree (ranges which have been freed this txg). Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com>	2017-04-27 22:00:03 +00:00
Josh Paetzel	e3cb0e99f8	MFV 316897 7586 remove #ifdef __lint hack from dmu.h illumos/illumos-gate@4ba5b96163 `4ba5b96163` https://www.illumos.org/issues/7586 The #ifdef __lint in dmu.h is ugly, and it would be nice not to duplicate it if we add other inline functions into header files in ZFS, especially since it is difficult to make any other solution work across all compilation targets. We should switch to disabling the lint flags that are failing instead. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Dan Kimmel <dan.kimmel@delphix.com>	2017-04-27 21:11:57 +00:00
Josh Paetzel	011275233c	MFV 316896 7580 ztest failure in dbuf_read_impl illumos/illumos-gate@1a01181fdc `1a01181fdc` https://www.illumos.org/issues/7580 We need to prevent any reader whenever we're about the zero out all the blkptrs. To do this we need to grab the dn_struct_rwlock as writer in dbuf_write_children_ready and free_children just prior to calling bzero. Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: George Wilson <george.wilson@delphix.com>	2017-04-27 16:38:28 +00:00
Josh Paetzel	fa88c78914	MFV 316895 7606 dmu_objset_find_dp() takes a long time while importing pool illumos/illumos-gate@7588687e6b `7588687e6b` https://www.illumos.org/issues/7606 When importing a pool with a large number of filesystems within the same parent filesystem, we see that dmu_objset_find_dp() takes a long time. It is called from 3 places: spa_check_logs(), spa_ld_claim_log_blocks(), and spa_load_verify(). There are several ways to improve performance here: 1. We don't really need to do spa_check_logs() or spa_ld_claim_log_blocks() if the pool was closed cleanly. 2. spa_load_verify() uses dmu_objset_find_dp() to check that no datasets have too long of names. 3. dmu_objset_find_dp() is slow because it's doing zap_value_search() (which is O(N sibling datasets)) to determine the name of each dsl_dir when it's opened. In this case we actually know the name when we are opening it, so we can provide it and avoid the lookup. This change implements fix #3 from the above list; i.e. make dmu_objset_find_dp() provide the name of the dataset so that we don't have to search for it. Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prashanth Sreenivasa <prashksp@gmail.com> Approved by: Gordon Ross <gordon.w.ross@gmail.com> Author: Matthew Ahrens <mahrens@delphix.com>	2017-04-27 15:10:45 +00:00
Josh Paetzel	c78abb8b50	MFV 316894 7252 7628 compressed zfs send / receive illumos/illumos-gate@5602294fda `5602294fda` https://www.illumos.org/issues/7252 This feature includes code to allow a system with compressed ARC enabled to send data in its compressed form straight out of the ARC, and receive data in its compressed form directly into the ARC. https://www.illumos.org/issues/7628 We should have longer, more readable versions of the ZFS send / recv options. 7628 create long versions of ZFS send / receive options Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: David Quigley <dpquigl@davequigley.com> Reviewed by: Thomas Caputi <tcaputi@datto.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Dan Kimmel <dan.kimmel@delphix.com>	2017-04-25 17:57:43 +00:00

1 2 3 4 5 ...

1410 Commits