commit 50c957f702
Author: Ned Bass <bass6@llnl.gov>
Date: Wed Mar 16 18:25:34 2016 -0700
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced from 33 to two, one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill blocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
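To make the two conventions concrete (DNODE_SHIFT is the real 512-byte slot
shift; the two size macros are hypothetical illustrations, not necessarily
the names used in the code):

    #define DNODE_SHIFT 9   /* 512 bytes: the minimum dnode size == one slot */

    /* On-disk convention: dn_extra_slots == 0 means a legacy 512-byte dnode. */
    #define DNODE_PHYS_SIZE(dnp) (((dnp)->dn_extra_slots + 1) << DNODE_SHIFT)

    /* In-memory convention: a 1k dnode has dn_num_slots == 2. */
    #define DNODE_T_SIZE(dn) ((dn)->dn_num_slots << DNODE_SHIFT)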
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode size.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interface names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
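For illustration, allocating an object with a 1k dnode might look like this
(a sketch assuming the ZFS-on-Linux signatures, where the dnodesize parameter
follows bonuslen and DN_BONUS_SIZE() gives the largest bonus buffer a dnode
of that size can hold):

    uint64_t obj = dmu_object_alloc_dnsize(os, DMU_OT_PLAIN_FILE_CONTENTS,
        0, DMU_OT_SA, DN_BONUS_SIZE(1024), 1024, tx);

    /* Equivalent legacy call; dnodesize 0 selects a 512-byte dnode: */
    uint64_t obj2 = dmu_object_alloc(os, DMU_OT_PLAIN_FILE_CONTENTS,
        0, DMU_OT_SA, DN_OLD_MAX_BONUSLEN, tx);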
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by a large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
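A sketch of the resulting iteration pattern over the dnode_phys_t array of
one dnode block (not the literal code; epb is the number of 512-byte slots
per block):

    int i;
    for (i = 0; i < epb; i += dnp[i].dn_extra_slots + 1) {
        if (dnp[i].dn_type == DMU_OT_NONE)
            continue;   /* free slot */
        /* dnp[i] heads a dnode occupying dn_extra_slots + 1 slots */
    }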
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
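A sketch of how a receiver derives the dnode size from the record (field and
constant names as in the ZFS-on-Linux version of this work; treat them as
approximate):

    /* drr_dn_slots == 0 in the stream means the legacy 512-byte dnode. */
    int dn_slots = (drro->drr_dn_slots != 0) ?
        drro->drr_dn_slots : DNODE_MIN_SLOTS;
    int dnodesize = dn_slots << DNODE_SHIFT;   /* slots are 512 bytes each */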
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
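A sketch of the packing (the shift constant follows from "uppermost 8 bits"
of the 64-bit field; the macro name is illustrative):

    #define LR_FOID_SLOT_SHIFT 56   /* slot count lives in bits 56..63 */

    uint64_t objid = lr_foid & ((1ULL << LR_FOID_SLOT_SHIFT) - 1);
    int slots = (int)(lr_foid >> LR_FOID_SLOT_SHIFT);   /* 0 => legacy size */
    int dnodesize = slots << DNODE_SHIFT;               /* 512-byte slots */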
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
9580 Add a hash-table on top of nvlist to speed-up operations
illumos/illumos-gate@2ec7644aab
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
illumos/illumos-gate@843c2111b1
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@48dd5e630c
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@9ca527c3d3
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
Datasets that are deeply nested (~100 levels) are impractical. We now put
a limit of 50 levels on newly created datasets. Existing datasets should
work without a problem.
illumos/illumos-gate@5ac95da7d6
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
The current space map encoding has the following disadvantages:
[1] Assuming a 512-byte sector size, each entry can represent a segment of at
most 16MB. This makes the encoding very inefficient for large regions of space.
[2] As vdev-wide space maps have started to be used by new features (e.g.
device removal, zpool checkpoint), we've started imposing limits on the
vdevs that can be used with them based on the maximum addressable offset
(currently 64PB for a top-level vdev).
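(For reference: in the old one-word encoding the run-length field is 15 bits
and the offset field 47 bits, so with 512-byte sectors an entry covers at
most 2^15 * 512 bytes = 16MB, and the largest addressable offset is
2^47 * 512 bytes = 64PB, which is where both limits above come from.)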
The new encoding remains backward compatible with the old one. The introduced
two-word entry format, besides extending the limits imposed by the single-word
layout, also includes a vdev field and some extra padding after its prefix.
The extra padding after the prefix is reserved for future use (e.g.
new prefixes for future encodings or new fields for flags). The new vdev field
not only makes the space maps more self-descriptive, but also opens the door
to pool-wide space maps.
One final important note is that the number of bits used for vdevs in
blkptrs is reduced to 24. That was decided because we don't know of any
setups that use more than 16M vdevs for the time being, we wanted the vdev
field to fit in the space map, and in addition it gives us some extra bits
in dva_t.
illumos/illumos-gate@17f11284b4
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
The change adds a -t <name> option to zpool create and a -t option to zpool
import in its form with an old name and a new name. This allows importing
(or creating) a pool under a name that's different from its real,
permanent name without affecting that name. This is useful when working
with VM images or images of other physical systems if they happen to
have a ZFS pool with the same name as the host system.
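For example (a sketch of the intended usage; device names illustrative):
# zpool create -t tmppool datapool da0
creates a pool whose permanent name is "datapool" but imports it as
"tmppool", while
# zpool import -t datapool tmppool
imports the pool "datapool" under the temporary name "tmppool".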
The changes come from ZoL with some small tweaks.
The porting has been done by julian.
The change is being submitted to OpenZFS:
https://github.com/openzfs/openzfs/pull/600
Submitted by: julian
Reviewed by: smh
MFC after: 2 weeks
Sponsored by: Panzura (porting)
Differential Revision: https://reviews.freebsd.org/D14972
9235 rename zpool_rewind_policy_t to zpool_load_policy_t
illumos/illumos-gate@5dafeea3eb
We want to be able to pass various settings during import/open of a pool,
which are not only related to rewind. Instead of adding a new policy and
duplicating a bunch of code, we should just rename rewind_policy to a more
generic term like load_policy.
For instance, we'd like to set spa->spa_import_flags from the nvlist,
rather than from a flags parameter passed to spa_import, as in some cases we want
those flags not only for the import case, but also for the open case. One
such flag could be ZFS_IMPORT_MISSING_LOG (as used in zdb) which would
allow zfs to open a pool when logs are missing.
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
illumos/illumos-gate@8671400134
Storage Pool Checkpoint (aka zpool checkpoint) can be thought of as a
“pool-wide snapshot” (or a variation of extreme rewind that doesn’t corrupt
your data). It remembers
the entire state of the pool at the point that it was taken and the user
can revert back to it later or discard it. Its generic use case is an
administrator that is about to perform a set of destructive actions to ZFS
as part of a critical procedure. She takes a checkpoint of the pool before
performing the actions, then rewinds back to it if one of them fails or puts
the pool into an unexpected state. Otherwise, she discards it. With the
assumption that no one else is making modifications to ZFS, she basically
wraps all these actions into a “high-level transaction”.
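A sketch of that flow using the new commands (flag spellings as in the
zpool(8) of this work; treat the details as illustrative):
# zpool checkpoint tank
... perform the destructive actions ...
# zpool export tank
# zpool import --rewind-to-checkpoint tank
rewinds to the checkpointed state, while
# zpool checkpoint -d tank
discards the checkpoint once it is no longer needed.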
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
illumos/illumos-gate@5cabbc6b49
https://www.illumos.org/issues/7614
This project allows top-level vdevs to be removed from the storage pool with
“zpool remove”, reducing the total amount of storage in the pool. This
operation copies all allocated regions of the device to be removed onto other
devices, recording the mapping from old to new location. After the removal is
complete, read and free operations to the removed (now “indirect”) vdev must
be remapped and performed at the new location on disk. The indirect mapping
table is kept in memory whenever the pool is loaded, so there is minimal
performance overhead when doing operations on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become “obsolete” because they are no longer used by any block pointers in
the pool. An entry becomes obsolete when all the blocks that use it are
freed. An entry can also become obsolete when all the snapshots that
reference it are deleted, and the block pointers that reference it have been
“remapped” in all filesystems/zvols (and clones). Whenever an indirect block
is written, all the block pointers in it will be “remapped” to their new
(concrete) locations if possible. This process can be accelerated by using
the “zfs remap” command to proactively rewrite all indirect blocks that
reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of the data
that is copied. This makes the process much faster, but if it were used on
redundant vdevs (i.e. mirror or raidz vdevs), it would be possible to copy
the wrong data, when we have the correct data on e.g. the other side of the
mirror. Therefore, mirror and raidz devices can not be removed.
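A sketch of the corresponding commands (the device name is illustrative; zfs
remap as described above):
# zpool remove tank da1
starts copying the vdev's allocated regions to the remaining vdevs,
# zpool status tank
reports removal progress and lists removed vdevs as indirect, and
# zfs remap tank/fs
proactively remaps block pointers that still reference indirect vdevs.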
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Prashanth Sreenivasa <pks@delphix.com>
illumos/illumos-gate@7855d95b30
https://www.illumos.org/issues/7446
Since we support whole-disk configuration for the boot pool, we also need
whole-disk support with UEFI boot, and for this zpool create should create an
EFI system partition.
I have borrowed the idea from Oracle Solaris, and am introducing a zpool
create -B switch to provide a way to specify that a boot partition should be
created.
However, there is still a question of how big the system partition should be.
For the time being, I have set the default size to 256MB (that's the minimum
size for FAT32 with 4k blocks). To support custom sizes, a "bootsize"
property, settable at creation time, is introduced, so a custom size can be
set as: zpool create -B -o bootsize=34MB rpool c0t0d0
After the pool is created, the "bootsize" property is read-only. When the -B
switch is not used, bootsize defaults to 0 and is shown in zpool get output
with value ''. Older zfs/zpool implementations ignore this property.
https://www.illumos.org/rb/r/219/
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Approved by: Dan McDonald <danmcd@kebe.com>
Author: Toomas Soome <tsoome@me.com>
This commit makes no sense for FreeBSD, which is why I blocked the option,
but it is good to stay closer to upstream.
illumos/illumos-gate@48bbca8168
https://www.illumos.org/issues/7812
This change removes all gendered language that did not refer specifically
to an individual person or pet. The convention taken was to use
variations on "they" when referring to users and/or human beings, while
using "it" when referring to code, functions, and/or libraries.
Additionally, we took the liberty to fix up any whitespace issues that
were found in any files that were already being modified.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Daniel Hoffman <dj.hoffman@delphix.com>
illumos/illumos-gate@f37ae9a714
https://www.illumos.org/issues/8713
If we're creating a pool with version >= SPA_VERSION_DSL_SCRUB (v11) we need to
account for additional space needed by the origin dataset which will also be
snapshotted: "poolname"+"/"+"$ORIGIN"+"@"+"$ORIGIN".
Enforce this limit in pool_namecheck().
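A sketch of the bound being enforced (ORIGIN_DIR_NAME is the "$ORIGIN"
literal used by the DSL; the exact expression in the code may differ):

    /* 2 extra bytes: one for '/' and one for '@'. */
    if (strlen(pool) + 2 * strlen(ORIGIN_DIR_NAME) + 2 >= MAXNAMELEN)
        return (-1);   /* too long once the origin snapshot is accounted for */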
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: loli10K <ezomori.nozomu@gmail.com>
MFC after: 1 week
7431 ZFS Channel Programs
illumos/illumos-gate@dfc115332c
https://www.illumos.org/issues/7431
ZFS channel programs (ZCP) adds support for performing compound ZFS
administrative actions via Lua scripts in a sandboxed environment (with time
and memory limits).
This initial commit includes both base support for running ZCP scripts, and a
small initial library of API calls which support getting properties and
listing, destroying, and promoting datasets.
Testing: in addition to the included unit tests, channel programs have been in
use at Delphix for several months for batch destroying filesystems. The
dsl_destroy_snaps_nvl() call has also been replaced with a channel program.
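To give a flavor of the interface, a program is run against a pool roughly
like this (a sketch; see zfs-program.8 for the authoritative syntax):
# zfs program tank /root/destroy_snap.lua tank/fs@snap1
The script receives its arguments through the Lua varargs and calls into the
zfs.* Lua API (e.g. zfs.sync.destroy) to perform the administrative actions.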
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Chris Williamson <chris.williamson@delphix.com>
8552 ZFS LUA code uses floating point math
illumos/illumos-gate@916c8d8811
https://www.illumos.org/issues/8552
In the Lua interpreter used by "zfs program", the Lua format() function
accidentally includes support for '%f' and friends, which can cause
compilation problems when building on platforms that don't support
floating-point math in the kernel (e.g. sparc). Support for '%f' and friends
(%f %e %E %g %G) should be removed, since there's no way to supply a
floating-point value anyway (all numbers in ZFS Lua are int64_t's).
Reviewed by: Yuri Pankov <yuripv@gmx.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
8590 memory leak in dsl_destroy_snapshots_nvl()
illumos/illumos-gate@e6ab4525d1
https://www.illumos.org/issues/8590
In dsl_destroy_snapshots_nvl(), "snaps_normalized" is not freed after it is
added to "arg".
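The fix follows the usual nvlist ownership pattern: fnvlist_add_nvlist()
stores a copy, so the caller must free the local list. A minimal sketch (the
key name is illustrative):

    nvlist_t *snaps_normalized = fnvlist_alloc();
    /* ... populate snaps_normalized ... */
    fnvlist_add_nvlist(arg, "snaps", snaps_normalized);
    nvlist_free(snaps_normalized);   /* safe: the nvlist was copied into arg */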
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
FreeBSD notes:
- zfs-program.8 manual page is taken almost as is from the vendor repository,
no FreeBSD-ification done
- fixed multiple instances of NULL being used where an integer is expected
- replaced ETIME and ECHRNG with ETIMEDOUT and EDOM respectively
This commit adds a modified version of Lua 5.2.4 under
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/lua, mirroring the
upstream. See README.zfs in that directory for the description of Lua
customizations.
See zfs-program.8 on how to use the new feature.
MFC after: 5 weeks
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D12528
illumos 4185 ("add new cryptographic checksums to ZFS: SHA-512,
Skein, Edon-R") was intentionally merged only partially in r289422,
without adding support for skein, sha512 and edonr on FreeBSD.
Support for skein and sha512 was added later on, but edonr is still not
implemented in FreeBSD.
Prior to this commit zfs(8) correctly rejected edonr, but with an error
message that claimed support:
fk@r500 ~ $zfs set checksum=edonr tank
cannot set property for 'tank': 'checksum' must be one of 'on | off | fletcher2 | fletcher4 | sha256 | sha512 | skein | edonr'
PR: 204055
Submitted by: Fabian Keil
Approved by: allanjude
Obtained from: ElectroBSD
MFC after: 1 week
illumos/illumos-gate@770499e185
https://www.illumos.org/issues/8021
The ARC buf data project (known simply as "ABD" since its genesis in the ZoL
community) changes the way the ARC allocates `b_pdata` memory from using linear
`void *` buffers to using scatter/gather lists of fixed-size 1KB chunks. This
improves ZFS's performance by helping to defragment the address space occupied
by the ARC, in particular for cases where compressed ARC is enabled. It could
also ease future work to allocate pages directly from `segkpm` for minimal-
overhead memory allocations, bypassing the `kmem` subsystem.
This is essentially the same change as the one which recently landed in ZFS on
Linux, although they made some platform-specific changes while adapting this
work to their codebase:
1. Implemented the equivalent of the `segkpm` suggestion for future work
mentioned above to bypass issues that they've had with the Linux kernel memory
allocator.
2. Changed the internal representation of the ABD's scatter/gather list so it
could be used to pass I/O directly into Linux block device drivers. (This
feature is not available in the illumos block device interface yet.)
FreeBSD notes:
- the actual (default) chunk size is 4KB (despite the text above saying 1KB)
- we can try to reimplement ABDs, so that they are not permanently
mapped into the KVA unless explicitly requested, especially on
platforms with scarce KVA
- we can try to use unmapped I/O and avoid intermediate allocation of a
linear, virtual memory mapped buffer
- we can try to avoid extra data copying by referring to chunks / pages
in the original ABD
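To illustrate the consumer-facing change, linear buffers give way to opaque
abd_t handles (a sketch using the core ABD calls as they appear in the
ZoL/illumos ABD work; treat the signatures as approximate):

    abd_t *abd = abd_alloc(size, B_FALSE);   /* scatter ABD built from chunks */
    abd_copy_from_buf(abd, src, size);       /* fill from a linear buffer */
    /* ... hand the abd to the ARC / zio layers instead of a void * ... */
    abd_free(abd);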
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Dan Kimmel <dan.kimmel@delphix.com>
MFC after: 3 weeks
7386 zfs get does not work properly with bookmarks
illumos/illumos-gate@edb901aab9
https://www.illumos.org/issues/7386
The zfs get command does not work with the bookmark parameter while it works
properly with both filesystem and snapshot:
# zfs get -t all -r creation rpool/test
NAME PROPERTY VALUE SOURCE
rpool/test creation Fri Sep 16 15:00 2016 -
rpool/test@snap creation Fri Sep 16 15:00 2016 -
rpool/test#bkmark creation Fri Sep 16 15:00 2016 -
# zfs get -t all -r creation rpool/test@snap
NAME PROPERTY VALUE SOURCE
rpool/test@snap creation Fri Sep 16 15:00 2016 -
# zfs get -t all -r creation rpool/test#bkmark
cannot open 'rpool/test#bkmark': invalid dataset name
#
The zfs get command should be modified to work properly with bookmarks too.
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Marcel Telka <marcel@telka.sk>
illumos/illumos-gate@759e89be35
https://github.com/illumos/illumos-gate/commit/759e89be359f2af635e4122d147df56bce948773
https://www.illumos.org/issues/6447
I got a patch from someone who uses nvpair code outside of illumos. It fixes a
couple of gcc warnings/bugs for him.
1. silence uninitialized use warnings
2. add parentheses around assignment used as truth value
3. fix printf format specifier (ll is for integers only)
4. strstr, strspn, strcspn, and strcmp are declared in string.h, not
strings.h.
5. avoid scanning integer into boolean variable
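Generic C illustrations of the patterns behind items 2, 3, and 5 (not the
actual diff):

    int c, count = 0;
    while ((c = getc(fp)) != EOF)       /* 2: parenthesize the assignment */
        count++;

    printf("%lld\n", (long long)value); /* 3: 'll' pairs with integer types */

    int tmp;
    (void) sscanf(buf, "%d", &tmp);     /* 5: scan into an int, ... */
    boolean_t flag = (tmp != 0);        /*    ... then convert to the boolean */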
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Steve Dougherty <sdougherty@barracuda.com>
illumos/illumos-gate@9adfa60d48
https://github.com/illumos/illumos-gate/commit/9adfa60d484ce2435f5af77cc99dcd4e692b6660
https://www.illumos.org/issues/6314
Callers of dsl_dataset_name pass a buffer of size ZFS_MAXNAMELEN, but
dsl_dataset_name copies the dataset's name PLUS the snapshot name to it,
resulting in a max of 2 * ZFS_MAXNAMELEN + '@'.
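A sketch of the sizing issue (dsl_dataset_name() writes
"<dataset>@<snapshot>" into the caller's buffer):

    /* What callers passed (too small in the worst case): */
    char buf[ZFS_MAXNAMELEN];
    /* What the worst case actually needs: dataset + '@' + snapshot: */
    char full[2 * ZFS_MAXNAMELEN + 1];
    dsl_dataset_name(ds, full);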
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
Those changes were found to confuse the FreeBSD libc ACL code, which doesn't
differentiate ACLs for directories and files, and to make it report ACLs for
all directories created after those patches as non-trivial. On the other
hand, these changes were considered wrong from the POSIX and NFSv4 points of
view. Until further investigation is done upstream, revert those changes
locally in preparation for the FreeBSD 11.0 release.
Approved by: re (hrs)
Support for the new hashing algorithms in ZFS was introduced in r289422.
However, it was disconnected because FreeBSD lacked implementations of
SHA-512 (truncated to 256 bits) and Skein.
These implementations were introduced in r300921 and r300966 respectively.
This commit connects them to ZFS and enables these new checksum algorithms.
These new algorithms are not supported by the boot blocks, so do not use them
on your root dataset if you boot from ZFS.
Relnotes: yes
Sponsored by: ScaleEngine Inc.
some additional considerations
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Author: Kevin Crowe <kevin.crowe@nexenta.com>
openzfs/openzfs@d316fffc9c
wrong value in the comparison, leading to incorrectly setting the new
value.
This has been observed in the ZFS code. Without this we can lose track of
the reference count in a zrlock object.
We should move to use the generic atomic functions; however, as this has
been observed, I would prefer to have this fix working first, then move to
the generic functions.
PR: 204037
Sponsored by: ABT Systems Ltd
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@45818ee124
This is only a partial merge of the respective ZFS infrastructure changes.
At this moment the FreeBSD kernel does not have those crypto algorithms, so
the parts of the code that enable them are commented out. When they are
implemented, it will be trivial to plug them in.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@9c3fd1216f
For more info, see:
- slides http://www.slideshare.net/MatthewAhrens/openzfs-send-and-receive
- video https://www.youtube.com/watch?v=iY44jPMvxog
- manpage changes (for zfs receive -s and zfs send -t)
- upcoming talk at the OpenZFS Developer Summit
The TL;DR is:
Use "zfs receive -s" to save the partially received state on failure.
On failure, get the receive token with "zfs get receive_resume_token <fs>"
Resume the send with "zfs send -t <token_value>"
Relnotes: yes
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@ca0cc3918a
A ZFS feature flag (large blocks) tracks its refcount as the number of
datasets that have ever used the feature. Several features of this type
are planned to be added (new checksum functions). This code should be made
common infrastructure rather than duplicating the code for each feature.
Since the upstream for cddl code is now illumos not sun, mechanically
convert all sun #ifdef's to illumos #ifdef's which have been used in all
newer code for some time.
Also do a manual pass to correct the use of #ifdef comments as per style(9),
as well as a few uses of #if defined(__FreeBSD__) vs #ifndef illumos.
MFC after: 1 month
Sponsored by: Multiplay
ZFS large block support.
Please note that booting from datasets that have a recordsize greater
than 128KB is not supported (but it's okay to enable the feature on
the pool). This *may* remain unchanged because of memory constraints.
A limited safety belt is provided for the mounted root filesystem, but
caution is advised.
Illumos issue:
5027 zfs large block support
MFC after: 1 month
Change dn->dn_dbufs from linked list to AVL tree.
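The conversion follows the standard illumos AVL pattern; roughly (the
comparator keys on the dbuf's level and block id, and the embedded avl_node_t
lives in db_link; treat the details as a sketch):

    avl_create(&dn->dn_dbufs, dbuf_compare,
        sizeof (dmu_buf_impl_t), offsetof(dmu_buf_impl_t, db_link));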
Illumos issues:
4873 zvol unmap calls can take a very long time for larger datasets
MFC after: 2 weeks
Import Illumos changes to address the following Illumos issues:
4976 zfs should only avoid writing to a failing non-redundant
top-level vdev
4978 ztest fails in get_metaslab_refcount()
4979 extend free space histogram to device and pool
4980 metaslabs should have a fragmentation metric
4981 remove fragmented ops vector from block allocator
4982 space_map object should proactively upgrade when feature
is enabled
4984 device selection should use fragmentation metric
MFC after: 2 weeks
This includes:
o All directories named *ia64*
o All files named *ia64*
o All ia64-specific code guarded by __ia64__
o All ia64-specific makefile logic
o Mention of ia64 in comments and documentation
This excludes:
o Everything under contrib/
o Everything under crypto/
o sys/xen/interface
o sys/sys/elf_common.h
Discussed at: BSDcan