Update vendor-sys/opensolaris to last OpenSolaris state (13149:b23a4dab3d50)

Add ZFS bits to vendor-sys/opensolaris

Obtained from: https://hg.openindiana.org/upstream/oracle/onnv-gate
2012-07-18 08:12:04 +00:00
commit 39f5422299
parent e27f30edd4
236 changed files with 178767 additions and 110 deletions
--- a/OPENSOLARIS.LICENSE
+++ b/OPENSOLARIS.LICENSE
@@ -0,0 +1,384 @@
+Unless otherwise noted, all files in this distribution are released
+under the Common Development and Distribution License (CDDL).
+Exceptions are noted within the associated source files.
+
+--------------------------------------------------------------------
+
+
+COMMON DEVELOPMENT AND DISTRIBUTION LICENSE Version 1.0
+
+1. Definitions.
+
+    1.1. "Contributor" means each individual or entity that creates
+         or contributes to the creation of Modifications.
+
+    1.2. "Contributor Version" means the combination of the Original
+         Software, prior Modifications used by a Contributor (if any),
+         and the Modifications made by that particular Contributor.
+
+    1.3. "Covered Software" means (a) the Original Software, or (b)
+         Modifications, or (c) the combination of files containing
+         Original Software with files containing Modifications, in
+         each case including portions thereof.
+
+    1.4. "Executable" means the Covered Software in any form other
+         than Source Code.
+
+    1.5. "Initial Developer" means the individual or entity that first
+         makes Original Software available under this License.
+
+    1.6. "Larger Work" means a work which combines Covered Software or
+         portions thereof with code not governed by the terms of this
+         License.
+
+    1.7. "License" means this document.
+
+    1.8. "Licensable" means having the right to grant, to the maximum
+         extent possible, whether at the time of the initial grant or
+         subsequently acquired, any and all of the rights conveyed
+         herein.
+
+    1.9. "Modifications" means the Source Code and Executable form of
+         any of the following:
+
+        A. Any file that results from an addition to, deletion from or
+           modification of the contents of a file containing Original
+           Software or previous Modifications;
+
+        B. Any new file that contains any part of the Original
+           Software or previous Modifications; or
+
+        C. Any new file that is contributed or otherwise made
+           available under the terms of this License.
+
+    1.10. "Original Software" means the Source Code and Executable
+          form of computer software code that is originally released
+          under this License.
+
+    1.11. "Patent Claims" means any patent claim(s), now owned or
+          hereafter acquired, including without limitation, method,
+          process, and apparatus claims, in any patent Licensable by
+          grantor.
+
+    1.12. "Source Code" means (a) the common form of computer software
+          code in which modifications are made and (b) associated
+          documentation included in or with such code.
+
+    1.13. "You" (or "Your") means an individual or a legal entity
+          exercising rights under, and complying with all of the terms
+          of, this License.  For legal entities, "You" includes any
+          entity which controls, is controlled by, or is under common
+          control with You.  For purposes of this definition,
+          "control" means (a) the power, direct or indirect, to cause
+          the direction or management of such entity, whether by
+          contract or otherwise, or (b) ownership of more than fifty
+          percent (50%) of the outstanding shares or beneficial
+          ownership of such entity.
+
+2. License Grants.
+
+    2.1. The Initial Developer Grant.
+
+    Conditioned upon Your compliance with Section 3.1 below and
+    subject to third party intellectual property claims, the Initial
+    Developer hereby grants You a world-wide, royalty-free,
+    non-exclusive license:
+
+        (a) under intellectual property rights (other than patent or
+            trademark) Licensable by Initial Developer, to use,
+            reproduce, modify, display, perform, sublicense and
+            distribute the Original Software (or portions thereof),
+            with or without Modifications, and/or as part of a Larger
+            Work; and
+
+        (b) under Patent Claims infringed by the making, using or
+            selling of Original Software, to make, have made, use,
+            practice, sell, and offer for sale, and/or otherwise
+            dispose of the Original Software (or portions thereof).
+
+        (c) The licenses granted in Sections 2.1(a) and (b) are
+            effective on the date Initial Developer first distributes
+            or otherwise makes the Original Software available to a
+            third party under the terms of this License.
+
+        (d) Notwithstanding Section 2.1(b) above, no patent license is
+            granted: (1) for code that You delete from the Original
+            Software, or (2) for infringements caused by: (i) the
+            modification of the Original Software, or (ii) the
+            combination of the Original Software with other software
+            or devices.
+
+    2.2. Contributor Grant.
+
+    Conditioned upon Your compliance with Section 3.1 below and
+    subject to third party intellectual property claims, each
+    Contributor hereby grants You a world-wide, royalty-free,
+    non-exclusive license:
+
+        (a) under intellectual property rights (other than patent or
+            trademark) Licensable by Contributor to use, reproduce,
+            modify, display, perform, sublicense and distribute the
+            Modifications created by such Contributor (or portions
+            thereof), either on an unmodified basis, with other
+            Modifications, as Covered Software and/or as part of a
+            Larger Work; and
+
+        (b) under Patent Claims infringed by the making, using, or
+            selling of Modifications made by that Contributor either
+            alone and/or in combination with its Contributor Version
+            (or portions of such combination), to make, use, sell,
+            offer for sale, have made, and/or otherwise dispose of:
+            (1) Modifications made by that Contributor (or portions
+            thereof); and (2) the combination of Modifications made by
+            that Contributor with its Contributor Version (or portions
+            of such combination).
+
+        (c) The licenses granted in Sections 2.2(a) and 2.2(b) are
+            effective on the date Contributor first distributes or
+            otherwise makes the Modifications available to a third
+            party.
+
+        (d) Notwithstanding Section 2.2(b) above, no patent license is
+            granted: (1) for any code that Contributor has deleted
+            from the Contributor Version; (2) for infringements caused
+            by: (i) third party modifications of Contributor Version,
+            or (ii) the combination of Modifications made by that
+            Contributor with other software (except as part of the
+            Contributor Version) or other devices; or (3) under Patent
+            Claims infringed by Covered Software in the absence of
+            Modifications made by that Contributor.
+
+3. Distribution Obligations.
+
+    3.1. Availability of Source Code.
+
+    Any Covered Software that You distribute or otherwise make
+    available in Executable form must also be made available in Source
+    Code form and that Source Code form must be distributed only under
+    the terms of this License.  You must include a copy of this
+    License with every copy of the Source Code form of the Covered
+    Software You distribute or otherwise make available.  You must
+    inform recipients of any such Covered Software in Executable form
+    as to how they can obtain such Covered Software in Source Code
+    form in a reasonable manner on or through a medium customarily
+    used for software exchange.
+
+    3.2. Modifications.
+
+    The Modifications that You create or to which You contribute are
+    governed by the terms of this License.  You represent that You
+    believe Your Modifications are Your original creation(s) and/or
+    You have sufficient rights to grant the rights conveyed by this
+    License.
+
+    3.3. Required Notices.
+
+    You must include a notice in each of Your Modifications that
+    identifies You as the Contributor of the Modification.  You may
+    not remove or alter any copyright, patent or trademark notices
+    contained within the Covered Software, or any notices of licensing
+    or any descriptive text giving attribution to any Contributor or
+    the Initial Developer.
+
+    3.4. Application of Additional Terms.
+
+    You may not offer or impose any terms on any Covered Software in
+    Source Code form that alters or restricts the applicable version
+    of this License or the recipients' rights hereunder.  You may
+    choose to offer, and to charge a fee for, warranty, support,
+    indemnity or liability obligations to one or more recipients of
+    Covered Software.  However, you may do so only on Your own behalf,
+    and not on behalf of the Initial Developer or any Contributor.
+    You must make it absolutely clear that any such warranty, support,
+    indemnity or liability obligation is offered by You alone, and You
+    hereby agree to indemnify the Initial Developer and every
+    Contributor for any liability incurred by the Initial Developer or
+    such Contributor as a result of warranty, support, indemnity or
+    liability terms You offer.
+
+    3.5. Distribution of Executable Versions.
+
+    You may distribute the Executable form of the Covered Software
+    under the terms of this License or under the terms of a license of
+    Your choice, which may contain terms different from this License,
+    provided that You are in compliance with the terms of this License
+    and that the license for the Executable form does not attempt to
+    limit or alter the recipient's rights in the Source Code form from
+    the rights set forth in this License.  If You distribute the
+    Covered Software in Executable form under a different license, You
+    must make it absolutely clear that any terms which differ from
+    this License are offered by You alone, not by the Initial
+    Developer or Contributor.  You hereby agree to indemnify the
+    Initial Developer and every Contributor for any liability incurred
+    by the Initial Developer or such Contributor as a result of any
+    such terms You offer.
+
+    3.6. Larger Works.
+
+    You may create a Larger Work by combining Covered Software with
+    other code not governed by the terms of this License and
+    distribute the Larger Work as a single product.  In such a case,
+    You must make sure the requirements of this License are fulfilled
+    for the Covered Software.
+
+4. Versions of the License.
+
+    4.1. New Versions.
+
+    Sun Microsystems, Inc. is the initial license steward and may
+    publish revised and/or new versions of this License from time to
+    time.  Each version will be given a distinguishing version number.
+    Except as provided in Section 4.3, no one other than the license
+    steward has the right to modify this License.
+
+    4.2. Effect of New Versions.
+
+    You may always continue to use, distribute or otherwise make the
+    Covered Software available under the terms of the version of the
+    License under which You originally received the Covered Software.
+    If the Initial Developer includes a notice in the Original
+    Software prohibiting it from being distributed or otherwise made
+    available under any subsequent version of the License, You must
+    distribute and make the Covered Software available under the terms
+    of the version of the License under which You originally received
+    the Covered Software.  Otherwise, You may also choose to use,
+    distribute or otherwise make the Covered Software available under
+    the terms of any subsequent version of the License published by
+    the license steward.
+
+    4.3. Modified Versions.
+
+    When You are an Initial Developer and You want to create a new
+    license for Your Original Software, You may create and use a
+    modified version of this License if You: (a) rename the license
+    and remove any references to the name of the license steward
+    (except to note that the license differs from this License); and
+    (b) otherwise make it clear that the license contains terms which
+    differ from this License.
+
+5. DISCLAIMER OF WARRANTY.
+
+    COVERED SOFTWARE IS PROVIDED UNDER THIS LICENSE ON AN "AS IS"
+    BASIS, WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED,
+    INCLUDING, WITHOUT LIMITATION, WARRANTIES THAT THE COVERED
+    SOFTWARE IS FREE OF DEFECTS, MERCHANTABLE, FIT FOR A PARTICULAR
+    PURPOSE OR NON-INFRINGING.  THE ENTIRE RISK AS TO THE QUALITY AND
+    PERFORMANCE OF THE COVERED SOFTWARE IS WITH YOU.  SHOULD ANY
+    COVERED SOFTWARE PROVE DEFECTIVE IN ANY RESPECT, YOU (NOT THE
+    INITIAL DEVELOPER OR ANY OTHER CONTRIBUTOR) ASSUME THE COST OF ANY
+    NECESSARY SERVICING, REPAIR OR CORRECTION.  THIS DISCLAIMER OF
+    WARRANTY CONSTITUTES AN ESSENTIAL PART OF THIS LICENSE.  NO USE OF
+    ANY COVERED SOFTWARE IS AUTHORIZED HEREUNDER EXCEPT UNDER THIS
+    DISCLAIMER.
+
+6. TERMINATION.
+
+    6.1. This License and the rights granted hereunder will terminate
+    automatically if You fail to comply with terms herein and fail to
+    cure such breach within 30 days of becoming aware of the breach.
+    Provisions which, by their nature, must remain in effect beyond
+    the termination of this License shall survive.
+
+    6.2. If You assert a patent infringement claim (excluding
+    declaratory judgment actions) against Initial Developer or a
+    Contributor (the Initial Developer or Contributor against whom You
+    assert such claim is referred to as "Participant") alleging that
+    the Participant Software (meaning the Contributor Version where
+    the Participant is a Contributor or the Original Software where
+    the Participant is the Initial Developer) directly or indirectly
+    infringes any patent, then any and all rights granted directly or
+    indirectly to You by such Participant, the Initial Developer (if
+    the Initial Developer is not the Participant) and all Contributors
+    under Sections 2.1 and/or 2.2 of this License shall, upon 60 days
+    notice from Participant terminate prospectively and automatically
+    at the expiration of such 60 day notice period, unless if within
+    such 60 day period You withdraw Your claim with respect to the
+    Participant Software against such Participant either unilaterally
+    or pursuant to a written agreement with Participant.
+
+    6.3. In the event of termination under Sections 6.1 or 6.2 above,
+    all end user licenses that have been validly granted by You or any
+    distributor hereunder prior to termination (excluding licenses
+    granted to You by any distributor) shall survive termination.
+
+7. LIMITATION OF LIABILITY.
+
+    UNDER NO CIRCUMSTANCES AND UNDER NO LEGAL THEORY, WHETHER TORT
+    (INCLUDING NEGLIGENCE), CONTRACT, OR OTHERWISE, SHALL YOU, THE
+    INITIAL DEVELOPER, ANY OTHER CONTRIBUTOR, OR ANY DISTRIBUTOR OF
+    COVERED SOFTWARE, OR ANY SUPPLIER OF ANY OF SUCH PARTIES, BE
+    LIABLE TO ANY PERSON FOR ANY INDIRECT, SPECIAL, INCIDENTAL, OR
+    CONSEQUENTIAL DAMAGES OF ANY CHARACTER INCLUDING, WITHOUT
+    LIMITATION, DAMAGES FOR LOST PROFITS, LOSS OF GOODWILL, WORK
+    STOPPAGE, COMPUTER FAILURE OR MALFUNCTION, OR ANY AND ALL OTHER
+    COMMERCIAL DAMAGES OR LOSSES, EVEN IF SUCH PARTY SHALL HAVE BEEN
+    INFORMED OF THE POSSIBILITY OF SUCH DAMAGES.  THIS LIMITATION OF
+    LIABILITY SHALL NOT APPLY TO LIABILITY FOR DEATH OR PERSONAL
+    INJURY RESULTING FROM SUCH PARTY'S NEGLIGENCE TO THE EXTENT
+    APPLICABLE LAW PROHIBITS SUCH LIMITATION.  SOME JURISDICTIONS DO
+    NOT ALLOW THE EXCLUSION OR LIMITATION OF INCIDENTAL OR
+    CONSEQUENTIAL DAMAGES, SO THIS EXCLUSION AND LIMITATION MAY NOT
+    APPLY TO YOU.
+
+8. U.S. GOVERNMENT END USERS.
+
+    The Covered Software is a "commercial item," as that term is
+    defined in 48 C.F.R. 2.101 (Oct. 1995), consisting of "commercial
+    computer software" (as that term is defined at 48
+    C.F.R. 252.227-7014(a)(1)) and "commercial computer software
+    documentation" as such terms are used in 48 C.F.R. 12.212
+    (Sept. 1995).  Consistent with 48 C.F.R. 12.212 and 48
+    C.F.R. 227.7202-1 through 227.7202-4 (June 1995), all
+    U.S. Government End Users acquire Covered Software with only those
+    rights set forth herein.  This U.S. Government Rights clause is in
+    lieu of, and supersedes, any other FAR, DFAR, or other clause or
+    provision that addresses Government rights in computer software
+    under this License.
+
+9. MISCELLANEOUS.
+
+    This License represents the complete agreement concerning subject
+    matter hereof.  If any provision of this License is held to be
+    unenforceable, such provision shall be reformed only to the extent
+    necessary to make it enforceable.  This License shall be governed
+    by the law of the jurisdiction specified in a notice contained
+    within the Original Software (except to the extent applicable law,
+    if any, provides otherwise), excluding such jurisdiction's
+    conflict-of-law provisions.  Any litigation relating to this
+    License shall be subject to the jurisdiction of the courts located
+    in the jurisdiction and venue specified in a notice contained
+    within the Original Software, with the losing party responsible
+    for costs, including, without limitation, court costs and
+    reasonable attorneys' fees and expenses.  The application of the
+    United Nations Convention on Contracts for the International Sale
+    of Goods is expressly excluded.  Any law or regulation which
+    provides that the language of a contract shall be construed
+    against the drafter shall not apply to this License.  You agree
+    that You alone are responsible for compliance with the United
+    States export administration regulations (and the export control
+    laws and regulation of any other countries) when You use,
+    distribute or otherwise make available any Covered Software.
+
+10. RESPONSIBILITY FOR CLAIMS.
+
+    As between Initial Developer and the Contributors, each party is
+    responsible for claims and damages arising, directly or
+    indirectly, out of its utilization of rights under this License
+    and You agree to work with Initial Developer and Contributors to
+    distribute such responsibility on an equitable basis.  Nothing
+    herein is intended or shall be deemed to constitute any admission
+    of liability.
+
+--------------------------------------------------------------------
+
+NOTICE PURSUANT TO SECTION 9 OF THE COMMON DEVELOPMENT AND
+DISTRIBUTION LICENSE (CDDL)
+
+For Covered Software in this distribution, this License shall
+be governed by the laws of the State of California (excluding
+conflict-of-law provisions).
+
+Any litigation relating to this License shall be subject to the
+jurisdiction of the Federal Courts of the Northern District of
+California and the state courts of the State of California, with
+venue lying in Santa Clara County, California.
--- a/common/acl/acl_common.c
+++ b/common/acl/acl_common.c
--- a/common/acl/acl_common.h
+++ b/common/acl/acl_common.h
@@ -0,0 +1,59 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#ifndef	_ACL_COMMON_H
+#define	_ACL_COMMON_H
+
+#include <sys/types.h>
+#include <sys/acl.h>
+#include <sys/stat.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+extern ace_t trivial_acl[6];
+
+extern int acltrivial(const char *);
+extern void adjust_ace_pair(ace_t *pair, mode_t mode);
+extern void adjust_ace_pair_common(void *, size_t, size_t, mode_t);
+extern int ace_trivial(ace_t *acep, int aclcnt);
+extern int ace_trivial_common(void *, int,
+    uint64_t (*walk)(void *, uint64_t, int aclcnt, uint16_t *, uint16_t *,
+    uint32_t *mask));
+extern acl_t *acl_alloc(acl_type_t);
+extern void acl_free(acl_t *aclp);
+extern int acl_translate(acl_t *aclp, int target_flavor,
+    int isdir, uid_t owner, gid_t group);
+void ksort(caddr_t v, int n, int s, int (*f)());
+int cmp2acls(void *a, void *b);
+int acl_trivial_create(mode_t mode, ace_t **acl, int *count);
+void acl_trivial_access_masks(mode_t mode, uint32_t *allow0, uint32_t *deny1,
+    uint32_t *deny2, uint32_t *owner, uint32_t *group, uint32_t *everyone);
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif /* _ACL_COMMON_H */
--- a/common/atomic/amd64/atomic.s
+++ b/common/atomic/amd64/atomic.s
@@ -0,0 +1,573 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+
+/*
+ * Copyright (c) 2004, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+	.file	"atomic.s"
+
+#include <sys/asm_linkage.h>
+
+#if defined(_KERNEL)
+	/*
+	 * Legacy kernel interfaces; they will go away (eventually).
+	 */
+	ANSI_PRAGMA_WEAK2(cas8,atomic_cas_8,function)
+	ANSI_PRAGMA_WEAK2(cas32,atomic_cas_32,function)
+	ANSI_PRAGMA_WEAK2(cas64,atomic_cas_64,function)
+	ANSI_PRAGMA_WEAK2(caslong,atomic_cas_ulong,function)
+	ANSI_PRAGMA_WEAK2(casptr,atomic_cas_ptr,function)
+	ANSI_PRAGMA_WEAK2(atomic_and_long,atomic_and_ulong,function)
+	ANSI_PRAGMA_WEAK2(atomic_or_long,atomic_or_ulong,function)
+#endif
+
+	ENTRY(atomic_inc_8)
+	ALTENTRY(atomic_inc_uchar)
+	lock
+	incb	(%rdi)
+	ret
+	SET_SIZE(atomic_inc_uchar)
+	SET_SIZE(atomic_inc_8)
+
+	ENTRY(atomic_inc_16)
+	ALTENTRY(atomic_inc_ushort)
+	lock
+	incw	(%rdi)
+	ret
+	SET_SIZE(atomic_inc_ushort)
+	SET_SIZE(atomic_inc_16)
+
+	ENTRY(atomic_inc_32)
+	ALTENTRY(atomic_inc_uint)
+	lock
+	incl	(%rdi)
+	ret
+	SET_SIZE(atomic_inc_uint)
+	SET_SIZE(atomic_inc_32)
+
+	ENTRY(atomic_inc_64)
+	ALTENTRY(atomic_inc_ulong)
+	lock
+	incq	(%rdi)
+	ret
+	SET_SIZE(atomic_inc_ulong)
+	SET_SIZE(atomic_inc_64)
+
+	ENTRY(atomic_inc_8_nv)
+	ALTENTRY(atomic_inc_uchar_nv)
+	xorl	%eax, %eax	/ clear upper bits of %eax return register
+	incb	%al		/ %al = 1
+	lock
+	  xaddb	%al, (%rdi)	/ %al = old value, (%rdi) = new value
+	incb	%al		/ return new value
+	ret
+	SET_SIZE(atomic_inc_uchar_nv)
+	SET_SIZE(atomic_inc_8_nv)
+
+	ENTRY(atomic_inc_16_nv)
+	ALTENTRY(atomic_inc_ushort_nv)
+	xorl	%eax, %eax	/ clear upper bits of %eax return register
+	incw	%ax		/ %ax = 1
+	lock
+	  xaddw	%ax, (%rdi)	/ %ax = old value, (%rdi) = new value
+	incw	%ax		/ return new value
+	ret
+	SET_SIZE(atomic_inc_ushort_nv)
+	SET_SIZE(atomic_inc_16_nv)
+
+	ENTRY(atomic_inc_32_nv)
+	ALTENTRY(atomic_inc_uint_nv)
+	xorl	%eax, %eax	/ %eax = 0
+	incl	%eax		/ %eax = 1
+	lock
+	  xaddl	%eax, (%rdi)	/ %eax = old value, (%rdi) = new value
+	incl	%eax		/ return new value
+	ret
+	SET_SIZE(atomic_inc_uint_nv)
+	SET_SIZE(atomic_inc_32_nv)
+
+	ENTRY(atomic_inc_64_nv)
+	ALTENTRY(atomic_inc_ulong_nv)
+	xorq	%rax, %rax	/ %rax = 0
+	incq	%rax		/ %rax = 1
+	lock
+	  xaddq	%rax, (%rdi)	/ %rax = old value, (%rdi) = new value
+	incq	%rax		/ return new value
+	ret
+	SET_SIZE(atomic_inc_ulong_nv)
+	SET_SIZE(atomic_inc_64_nv)
+
+	ENTRY(atomic_dec_8)
+	ALTENTRY(atomic_dec_uchar)
+	lock
+	decb	(%rdi)
+	ret
+	SET_SIZE(atomic_dec_uchar)
+	SET_SIZE(atomic_dec_8)
+
+	ENTRY(atomic_dec_16)
+	ALTENTRY(atomic_dec_ushort)
+	lock
+	decw	(%rdi)
+	ret
+	SET_SIZE(atomic_dec_ushort)
+	SET_SIZE(atomic_dec_16)
+
+	ENTRY(atomic_dec_32)
+	ALTENTRY(atomic_dec_uint)
+	lock
+	decl	(%rdi)
+	ret
+	SET_SIZE(atomic_dec_uint)
+	SET_SIZE(atomic_dec_32)
+
+	ENTRY(atomic_dec_64)
+	ALTENTRY(atomic_dec_ulong)
+	lock
+	decq	(%rdi)
+	ret
+	SET_SIZE(atomic_dec_ulong)
+	SET_SIZE(atomic_dec_64)
+
+	ENTRY(atomic_dec_8_nv)
+	ALTENTRY(atomic_dec_uchar_nv)
+	xorl	%eax, %eax	/ clear upper bits of %eax return register
+	decb	%al		/ %al = -1
+	lock
+	  xaddb	%al, (%rdi)	/ %al = old value, (%rdi) = new value
+	decb	%al		/ return new value
+	ret
+	SET_SIZE(atomic_dec_uchar_nv)
+	SET_SIZE(atomic_dec_8_nv)
+
+	ENTRY(atomic_dec_16_nv)
+	ALTENTRY(atomic_dec_ushort_nv)
+	xorl	%eax, %eax	/ clear upper bits of %eax return register
+	decw	%ax		/ %ax = -1
+	lock
+	  xaddw	%ax, (%rdi)	/ %ax = old value, (%rdi) = new value
+	decw	%ax		/ return new value
+	ret
+	SET_SIZE(atomic_dec_ushort_nv)
+	SET_SIZE(atomic_dec_16_nv)
+
+	ENTRY(atomic_dec_32_nv)
+	ALTENTRY(atomic_dec_uint_nv)
+	xorl	%eax, %eax	/ %eax = 0
+	decl	%eax		/ %eax = -1
+	lock
+	  xaddl	%eax, (%rdi)	/ %eax = old value, (%rdi) = new value
+	decl	%eax		/ return new value
+	ret
+	SET_SIZE(atomic_dec_uint_nv)
+	SET_SIZE(atomic_dec_32_nv)
+
+	ENTRY(atomic_dec_64_nv)
+	ALTENTRY(atomic_dec_ulong_nv)
+	xorq	%rax, %rax	/ %rax = 0
+	decq	%rax		/ %rax = -1
+	lock
+	  xaddq	%rax, (%rdi)	/ %rax = old value, (%rdi) = new value
+	decq	%rax		/ return new value
+	ret
+	SET_SIZE(atomic_dec_ulong_nv)
+	SET_SIZE(atomic_dec_64_nv)
+
+	ENTRY(atomic_add_8)
+	ALTENTRY(atomic_add_char)
+	lock
+	addb	%sil, (%rdi)
+	ret
+	SET_SIZE(atomic_add_char)
+	SET_SIZE(atomic_add_8)
+
+	ENTRY(atomic_add_16)
+	ALTENTRY(atomic_add_short)
+	lock
+	addw	%si, (%rdi)
+	ret
+	SET_SIZE(atomic_add_short)
+	SET_SIZE(atomic_add_16)
+
+	ENTRY(atomic_add_32)
+	ALTENTRY(atomic_add_int)
+	lock
+	addl	%esi, (%rdi)
+	ret
+	SET_SIZE(atomic_add_int)
+	SET_SIZE(atomic_add_32)
+
+	ENTRY(atomic_add_64)
+	ALTENTRY(atomic_add_ptr)
+	ALTENTRY(atomic_add_long)
+	lock
+	addq	%rsi, (%rdi)
+	ret
+	SET_SIZE(atomic_add_long)
+	SET_SIZE(atomic_add_ptr)
+	SET_SIZE(atomic_add_64)
+
+	ENTRY(atomic_or_8)
+	ALTENTRY(atomic_or_uchar)
+	lock
+	orb	%sil, (%rdi)
+	ret
+	SET_SIZE(atomic_or_uchar)
+	SET_SIZE(atomic_or_8)
+
+	ENTRY(atomic_or_16)
+	ALTENTRY(atomic_or_ushort)
+	lock
+	orw	%si, (%rdi)
+	ret
+	SET_SIZE(atomic_or_ushort)
+	SET_SIZE(atomic_or_16)
+
+	ENTRY(atomic_or_32)
+	ALTENTRY(atomic_or_uint)
+	lock
+	orl	%esi, (%rdi)
+	ret
+	SET_SIZE(atomic_or_uint)
+	SET_SIZE(atomic_or_32)
+
+	ENTRY(atomic_or_64)
+	ALTENTRY(atomic_or_ulong)
+	lock
+	orq	%rsi, (%rdi)
+	ret
+	SET_SIZE(atomic_or_ulong)
+	SET_SIZE(atomic_or_64)
+
+	ENTRY(atomic_and_8)
+	ALTENTRY(atomic_and_uchar)
+	lock
+	andb	%sil, (%rdi)
+	ret
+	SET_SIZE(atomic_and_uchar)
+	SET_SIZE(atomic_and_8)
+
+	ENTRY(atomic_and_16)
+	ALTENTRY(atomic_and_ushort)
+	lock
+	andw	%si, (%rdi)
+	ret
+	SET_SIZE(atomic_and_ushort)
+	SET_SIZE(atomic_and_16)
+
+	ENTRY(atomic_and_32)
+	ALTENTRY(atomic_and_uint)
+	lock
+	andl	%esi, (%rdi)
+	ret
+	SET_SIZE(atomic_and_uint)
+	SET_SIZE(atomic_and_32)
+
+	ENTRY(atomic_and_64)
+	ALTENTRY(atomic_and_ulong)
+	lock
+	andq	%rsi, (%rdi)
+	ret
+	SET_SIZE(atomic_and_ulong)
+	SET_SIZE(atomic_and_64)
+
+	ENTRY(atomic_add_8_nv)
+	ALTENTRY(atomic_add_char_nv)
+	movzbl	%sil, %eax		/ %al = delta addend, clear upper bits
+	lock
+	  xaddb	%sil, (%rdi)		/ %sil = old value, (%rdi) = sum
+	addb	%sil, %al		/ new value = original value + delta
+	ret
+	SET_SIZE(atomic_add_char_nv)
+	SET_SIZE(atomic_add_8_nv)
+
+	ENTRY(atomic_add_16_nv)
+	ALTENTRY(atomic_add_short_nv)
+	movzwl	%si, %eax		/ %ax = delta addend, clean upper bits
+	lock
+	  xaddw	%si, (%rdi)		/ %si = old value, (%rdi) = sum
+	addw	%si, %ax		/ new value = original value + delta
+	ret
+	SET_SIZE(atomic_add_short_nv)
+	SET_SIZE(atomic_add_16_nv)
+
+	ENTRY(atomic_add_32_nv)
+	ALTENTRY(atomic_add_int_nv)
+	mov	%esi, %eax		/ %eax = delta addend
+	lock
+	  xaddl	%esi, (%rdi)		/ %esi = old value, (%rdi) = sum
+	add	%esi, %eax		/ new value = original value + delta
+	ret
+	SET_SIZE(atomic_add_int_nv)
+	SET_SIZE(atomic_add_32_nv)
+
+	ENTRY(atomic_add_64_nv)
+	ALTENTRY(atomic_add_ptr_nv)
+	ALTENTRY(atomic_add_long_nv)
+	mov	%rsi, %rax		/ %rax = delta addend
+	lock
+	  xaddq	%rsi, (%rdi)		/ %rsi = old value, (%rdi) = sum
+	addq	%rsi, %rax		/ new value = original value + delta
+	ret
+	SET_SIZE(atomic_add_long_nv)
+	SET_SIZE(atomic_add_ptr_nv)
+	SET_SIZE(atomic_add_64_nv)
+
+	ENTRY(atomic_and_8_nv)
+	ALTENTRY(atomic_and_uchar_nv)
+	movb	(%rdi), %al	/ %al = old value
+1:
+	movb	%sil, %cl
+	andb	%al, %cl	/ %cl = new value
+	lock
+	cmpxchgb %cl, (%rdi)	/ try to stick it in
+	jne	1b
+	movzbl	%cl, %eax	/ return new value
+	ret
+	SET_SIZE(atomic_and_uchar_nv)
+	SET_SIZE(atomic_and_8_nv)
+
+	ENTRY(atomic_and_16_nv)
+	ALTENTRY(atomic_and_ushort_nv)
+	movw	(%rdi), %ax	/ %ax = old value
+1:
+	movw	%si, %cx
+	andw	%ax, %cx	/ %cx = new value
+	lock
+	cmpxchgw %cx, (%rdi)	/ try to stick it in
+	jne	1b
+	movzwl	%cx, %eax	/ return new value
+	ret
+	SET_SIZE(atomic_and_ushort_nv)
+	SET_SIZE(atomic_and_16_nv)
+
+	ENTRY(atomic_and_32_nv)
+	ALTENTRY(atomic_and_uint_nv)
+	movl	(%rdi), %eax
+1:
+	movl	%esi, %ecx
+	andl	%eax, %ecx
+	lock
+	cmpxchgl %ecx, (%rdi)
+	jne	1b
+	movl	%ecx, %eax
+	ret
+	SET_SIZE(atomic_and_uint_nv)
+	SET_SIZE(atomic_and_32_nv)
+
+	ENTRY(atomic_and_64_nv)
+	ALTENTRY(atomic_and_ulong_nv)
+	movq	(%rdi), %rax
+1:
+	movq	%rsi, %rcx
+	andq	%rax, %rcx
+	lock
+	cmpxchgq %rcx, (%rdi)
+	jne	1b
+	movq	%rcx, %rax
+	ret
+	SET_SIZE(atomic_and_ulong_nv)
+	SET_SIZE(atomic_and_64_nv)
+
+	ENTRY(atomic_or_8_nv)
+	ALTENTRY(atomic_or_uchar_nv)
+	movb	(%rdi), %al	/ %al = old value
+1:
+	movb	%sil, %cl
+	orb	%al, %cl	/ %cl = new value
+	lock
+	cmpxchgb %cl, (%rdi)	/ try to stick it in
+	jne	1b
+	movzbl	%cl, %eax	/ return new value
+	ret
+	SET_SIZE(atomic_or_uchar_nv)
+	SET_SIZE(atomic_or_8_nv)
+
+	ENTRY(atomic_or_16_nv)
+	ALTENTRY(atomic_or_ushort_nv)
+	movw	(%rdi), %ax	/ %ax = old value
+1:
+	movw	%si, %cx
+	orw	%ax, %cx	/ %cx = new value
+	lock
+	cmpxchgw %cx, (%rdi)	/ try to stick it in
+	jne	1b
+	movzwl	%cx, %eax	/ return new value
+	ret
+	SET_SIZE(atomic_or_ushort_nv)
+	SET_SIZE(atomic_or_16_nv)
+
+	ENTRY(atomic_or_32_nv)
+	ALTENTRY(atomic_or_uint_nv)
+	movl	(%rdi), %eax
+1:
+	movl	%esi, %ecx
+	orl	%eax, %ecx
+	lock
+	cmpxchgl %ecx, (%rdi)
+	jne	1b
+	movl	%ecx, %eax
+	ret
+	SET_SIZE(atomic_or_uint_nv)
+	SET_SIZE(atomic_or_32_nv)
+
+	ENTRY(atomic_or_64_nv)
+	ALTENTRY(atomic_or_ulong_nv)
+	movq	(%rdi), %rax
+1:
+	movq	%rsi, %rcx
+	orq	%rax, %rcx
+	lock
+	cmpxchgq %rcx, (%rdi)
+	jne	1b
+	movq	%rcx, %rax
+	ret
+	SET_SIZE(atomic_or_ulong_nv)
+	SET_SIZE(atomic_or_64_nv)
+
+	ENTRY(atomic_cas_8)
+	ALTENTRY(atomic_cas_uchar)
+	movzbl	%sil, %eax
+	lock
+	cmpxchgb %dl, (%rdi)
+	ret
+	SET_SIZE(atomic_cas_uchar)
+	SET_SIZE(atomic_cas_8)
+
+	ENTRY(atomic_cas_16)
+	ALTENTRY(atomic_cas_ushort)
+	movzwl	%si, %eax
+	lock
+	cmpxchgw %dx, (%rdi)
+	ret
+	SET_SIZE(atomic_cas_ushort)
+	SET_SIZE(atomic_cas_16)
+
+	ENTRY(atomic_cas_32)
+	ALTENTRY(atomic_cas_uint)
+	movl	%esi, %eax
+	lock
+	cmpxchgl %edx, (%rdi)
+	ret
+	SET_SIZE(atomic_cas_uint)
+	SET_SIZE(atomic_cas_32)
+
+	ENTRY(atomic_cas_64)
+	ALTENTRY(atomic_cas_ulong)
+	ALTENTRY(atomic_cas_ptr)
+	movq	%rsi, %rax
+	lock
+	cmpxchgq %rdx, (%rdi)
+	ret
+	SET_SIZE(atomic_cas_ptr)
+	SET_SIZE(atomic_cas_ulong)
+	SET_SIZE(atomic_cas_64)
+
+	ENTRY(atomic_swap_8)
+	ALTENTRY(atomic_swap_uchar)
+	movzbl	%sil, %eax
+	lock
+	xchgb %al, (%rdi)
+	ret
+	SET_SIZE(atomic_swap_uchar)
+	SET_SIZE(atomic_swap_8)
+
+	ENTRY(atomic_swap_16)
+	ALTENTRY(atomic_swap_ushort)
+	movzwl	%si, %eax
+	lock
+	xchgw %ax, (%rdi)
+	ret
+	SET_SIZE(atomic_swap_ushort)
+	SET_SIZE(atomic_swap_16)
+
+	ENTRY(atomic_swap_32)
+	ALTENTRY(atomic_swap_uint)
+	movl	%esi, %eax
+	lock
+	xchgl %eax, (%rdi)
+	ret
+	SET_SIZE(atomic_swap_uint)
+	SET_SIZE(atomic_swap_32)
+
+	ENTRY(atomic_swap_64)
+	ALTENTRY(atomic_swap_ulong)
+	ALTENTRY(atomic_swap_ptr)
+	movq	%rsi, %rax
+	lock
+	xchgq %rax, (%rdi)
+	ret
+	SET_SIZE(atomic_swap_ptr)
+	SET_SIZE(atomic_swap_ulong)
+	SET_SIZE(atomic_swap_64)
+
+	ENTRY(atomic_set_long_excl)
+	xorl	%eax, %eax
+	lock
+	btsq	%rsi, (%rdi)
+	jnc	1f
+	decl	%eax			/ return -1
+1:
+	ret
+	SET_SIZE(atomic_set_long_excl)
+
+	ENTRY(atomic_clear_long_excl)
+	xorl	%eax, %eax
+	lock
+	btrq	%rsi, (%rdi)
+	jc	1f
+	decl	%eax			/ return -1
+1:
+	ret
+	SET_SIZE(atomic_clear_long_excl)
+
+#if !defined(_KERNEL)
+
+	/*
+	 * NOTE: membar_enter and membar_exit are identical routines.
+	 * We define them separately, instead of using ALTENTRY
+	 * definitions to alias them together, so that DTrace and
+	 * debuggers will see a unique address for each, allowing
+	 * more accurate tracing.
+	 */
+
+	ENTRY(membar_enter)
+	mfence
+	ret
+	SET_SIZE(membar_enter)
+
+	ENTRY(membar_exit)
+	mfence
+	ret
+	SET_SIZE(membar_exit)
+
+	ENTRY(membar_producer)
+	sfence
+	ret
+	SET_SIZE(membar_producer)
+
+	ENTRY(membar_consumer)
+	lfence
+	ret
+	SET_SIZE(membar_consumer)
+
+#endif	/* !_KERNEL */
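Note on the amd64 file above: the `*_nv` AND/OR routines all share one shape. They load the old value, compute the new value in a scratch register, attempt a `lock cmpxchg`, and retry if another CPU changed the word in the meantime. The following is a minimal C11 sketch of that retry loop, assuming a hosted compiler with `<stdatomic.h>`. The name `sketch_or_64_nv` is hypothetical and is not part of the imported source.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical C11 counterpart of atomic_or_64_nv: OR `bits` into *target
 * and return the new value. This illustrates the retry loop only; it is
 * not the vendor implementation.
 */
static uint64_t
sketch_or_64_nv(_Atomic uint64_t *target, uint64_t bits)
{
	uint64_t old, newval;

	assert(target != NULL);
	old = atomic_load(target);		/* like: movq (%rdi), %rax */
	do {
		newval = old | bits;		/* like: orq %rax, %rcx */
		/*
		 * On failure the compare-exchange stores the value it
		 * found into `old`, much as cmpxchg reloads %rax, so the
		 * next pass recomputes newval from fresh data.
		 */
	} while (!atomic_compare_exchange_weak(target, &old, newval));

	return (newval);			/* like: movq %rcx, %rax */
}
```

The weak compare-exchange is used because the loop already handles failure. A strong variant would behave the same way here.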
--- a/common/atomic/i386/atomic.s
+++ b/common/atomic/i386/atomic.s
@ -0,0 +1,720 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+
+/*
+ * Copyright 2010 Sun Microsystems, Inc.  All rights reserved.
+ * Use is subject to license terms.
+ */
+
+	.file	"atomic.s"
+
+#include <sys/asm_linkage.h>
+
+#if defined(_KERNEL)
+	/*
+	 * Legacy kernel interfaces; they will go away (eventually).
+	 */
+	ANSI_PRAGMA_WEAK2(cas8,atomic_cas_8,function)
+	ANSI_PRAGMA_WEAK2(cas32,atomic_cas_32,function)
+	ANSI_PRAGMA_WEAK2(cas64,atomic_cas_64,function)
+	ANSI_PRAGMA_WEAK2(caslong,atomic_cas_ulong,function)
+	ANSI_PRAGMA_WEAK2(casptr,atomic_cas_ptr,function)
+	ANSI_PRAGMA_WEAK2(atomic_and_long,atomic_and_ulong,function)
+	ANSI_PRAGMA_WEAK2(atomic_or_long,atomic_or_ulong,function)
+#endif
+
+	ENTRY(atomic_inc_8)
+	ALTENTRY(atomic_inc_uchar)
+	movl	4(%esp), %eax
+	lock
+	incb	(%eax)
+	ret
+	SET_SIZE(atomic_inc_uchar)
+	SET_SIZE(atomic_inc_8)
+
+	ENTRY(atomic_inc_16)
+	ALTENTRY(atomic_inc_ushort)
+	movl	4(%esp), %eax
+	lock
+	incw	(%eax)
+	ret
+	SET_SIZE(atomic_inc_ushort)
+	SET_SIZE(atomic_inc_16)
+
+	ENTRY(atomic_inc_32)
+	ALTENTRY(atomic_inc_uint)
+	ALTENTRY(atomic_inc_ulong)
+	movl	4(%esp), %eax
+	lock
+	incl	(%eax)
+	ret
+	SET_SIZE(atomic_inc_ulong)
+	SET_SIZE(atomic_inc_uint)
+	SET_SIZE(atomic_inc_32)
+
+	ENTRY(atomic_inc_8_nv)
+	ALTENTRY(atomic_inc_uchar_nv)
+	movl	4(%esp), %edx	/ %edx = target address
+	xorl	%eax, %eax	/ clear upper bits of %eax
+	incb	%al		/ %al = 1
+	lock
+	  xaddb	%al, (%edx)	/ %al = old value, inc (%edx)
+	incb	%al	/ return new value
+	ret
+	SET_SIZE(atomic_inc_uchar_nv)
+	SET_SIZE(atomic_inc_8_nv)
+
+	ENTRY(atomic_inc_16_nv)
+	ALTENTRY(atomic_inc_ushort_nv)
+	movl	4(%esp), %edx	/ %edx = target address
+	xorl	%eax, %eax	/ clear upper bits of %eax
+	incw	%ax		/ %ax = 1
+	lock
+	  xaddw	%ax, (%edx)	/ %ax = old value, inc (%edx)
+	incw	%ax		/ return new value
+	ret
+	SET_SIZE(atomic_inc_ushort_nv)
+	SET_SIZE(atomic_inc_16_nv)
+
+	ENTRY(atomic_inc_32_nv)
+	ALTENTRY(atomic_inc_uint_nv)
+	ALTENTRY(atomic_inc_ulong_nv)
+	movl	4(%esp), %edx	/ %edx = target address
+	xorl	%eax, %eax	/ %eax = 0
+	incl	%eax		/ %eax = 1
+	lock
+	  xaddl	%eax, (%edx)	/ %eax = old value, inc (%edx)
+	incl	%eax		/ return new value
+	ret
+	SET_SIZE(atomic_inc_ulong_nv)
+	SET_SIZE(atomic_inc_uint_nv)
+	SET_SIZE(atomic_inc_32_nv)
+
+	/*
+	 * NOTE: If atomic_inc_64 and atomic_inc_64_nv are ever
+	 * separated, you need to also edit the libc i386 platform
+	 * specific mapfile and remove the NODYNSORT attribute
+	 * from atomic_inc_64_nv.
+	 */
+	ENTRY(atomic_inc_64)
+	ALTENTRY(atomic_inc_64_nv)
+	pushl	%edi
+	pushl	%ebx
+	movl	12(%esp), %edi	/ %edi = target address
+	movl	(%edi), %eax
+	movl	4(%edi), %edx	/ %edx:%eax = old value
+1:
+	xorl	%ebx, %ebx
+	xorl	%ecx, %ecx
+	incl	%ebx		/ %ecx:%ebx = 1
+	addl	%eax, %ebx
+	adcl	%edx, %ecx	/ add in the carry from inc
+	lock
+	cmpxchg8b (%edi)	/ try to stick it in
+	jne	1b
+	movl	%ebx, %eax
+	movl	%ecx, %edx	/ return new value
+	popl	%ebx
+	popl	%edi
+	ret
+	SET_SIZE(atomic_inc_64_nv)
+	SET_SIZE(atomic_inc_64)
+
+	ENTRY(atomic_dec_8)
+	ALTENTRY(atomic_dec_uchar)
+	movl	4(%esp), %eax
+	lock
+	decb	(%eax)
+	ret
+	SET_SIZE(atomic_dec_uchar)
+	SET_SIZE(atomic_dec_8)
+
+	ENTRY(atomic_dec_16)
+	ALTENTRY(atomic_dec_ushort)
+	movl	4(%esp), %eax
+	lock
+	decw	(%eax)
+	ret
+	SET_SIZE(atomic_dec_ushort)
+	SET_SIZE(atomic_dec_16)
+
+	ENTRY(atomic_dec_32)
+	ALTENTRY(atomic_dec_uint)
+	ALTENTRY(atomic_dec_ulong)
+	movl	4(%esp), %eax
+	lock
+	decl	(%eax)
+	ret
+	SET_SIZE(atomic_dec_ulong)
+	SET_SIZE(atomic_dec_uint)
+	SET_SIZE(atomic_dec_32)
+
+	ENTRY(atomic_dec_8_nv)
+	ALTENTRY(atomic_dec_uchar_nv)
+	movl	4(%esp), %edx	/ %edx = target address
+	xorl	%eax, %eax	/ zero upper bits of %eax
+	decb	%al		/ %al = -1
+	lock
+	  xaddb	%al, (%edx)	/ %al = old value, dec (%edx)
+	decb	%al		/ return new value
+	ret
+	SET_SIZE(atomic_dec_uchar_nv)
+	SET_SIZE(atomic_dec_8_nv)
+
+	ENTRY(atomic_dec_16_nv)
+	ALTENTRY(atomic_dec_ushort_nv)
+	movl	4(%esp), %edx	/ %edx = target address
+	xorl	%eax, %eax	/ zero upper bits of %eax
+	decw	%ax		/ %ax = -1
+	lock
+	  xaddw	%ax, (%edx)	/ %ax = old value, dec (%edx)
+	decw	%ax		/ return new value
+	ret
+	SET_SIZE(atomic_dec_ushort_nv)
+	SET_SIZE(atomic_dec_16_nv)
+
+	ENTRY(atomic_dec_32_nv)
+	ALTENTRY(atomic_dec_uint_nv)
+	ALTENTRY(atomic_dec_ulong_nv)
+	movl	4(%esp), %edx	/ %edx = target address
+	xorl	%eax, %eax	/ %eax = 0
+	decl	%eax		/ %eax = -1
+	lock
+	  xaddl	%eax, (%edx)	/ %eax = old value, dec (%edx)
+	decl	%eax		/ return new value
+	ret
+	SET_SIZE(atomic_dec_ulong_nv)
+	SET_SIZE(atomic_dec_uint_nv)
+	SET_SIZE(atomic_dec_32_nv)
+
+	/*
+	 * NOTE: If atomic_dec_64 and atomic_dec_64_nv are ever
+	 * separated, it is important to edit the libc i386 platform
+	 * specific mapfile and remove the NODYNSORT attribute
+	 * from atomic_dec_64_nv.
+	 */
+	ENTRY(atomic_dec_64)
+	ALTENTRY(atomic_dec_64_nv)
+	pushl	%edi
+	pushl	%ebx
+	movl	12(%esp), %edi	/ %edi = target address
+	movl	(%edi), %eax
+	movl	4(%edi), %edx	/ %edx:%eax = old value
+1:
+	xorl	%ebx, %ebx
+	xorl	%ecx, %ecx
+	not	%ecx
+	not	%ebx		/ %ecx:%ebx = -1
+	addl	%eax, %ebx
+	adcl	%edx, %ecx	/ add in the carry from the low word
+	lock
+	cmpxchg8b (%edi)	/ try to stick it in
+	jne	1b
+	movl	%ebx, %eax
+	movl	%ecx, %edx	/ return new value
+	popl	%ebx
+	popl	%edi
+	ret
+	SET_SIZE(atomic_dec_64_nv)
+	SET_SIZE(atomic_dec_64)
+
+	ENTRY(atomic_add_8)
+	ALTENTRY(atomic_add_char)
+	movl	4(%esp), %eax
+	movl	8(%esp), %ecx
+	lock
+	addb	%cl, (%eax)
+	ret
+	SET_SIZE(atomic_add_char)
+	SET_SIZE(atomic_add_8)
+
+	ENTRY(atomic_add_16)
+	ALTENTRY(atomic_add_short)
+	movl	4(%esp), %eax
+	movl	8(%esp), %ecx
+	lock
+	addw	%cx, (%eax)
+	ret
+	SET_SIZE(atomic_add_short)
+	SET_SIZE(atomic_add_16)
+
+	ENTRY(atomic_add_32)
+	ALTENTRY(atomic_add_int)
+	ALTENTRY(atomic_add_ptr)
+	ALTENTRY(atomic_add_long)
+	movl	4(%esp), %eax
+	movl	8(%esp), %ecx
+	lock
+	addl	%ecx, (%eax)
+	ret
+	SET_SIZE(atomic_add_long)
+	SET_SIZE(atomic_add_ptr)
+	SET_SIZE(atomic_add_int)
+	SET_SIZE(atomic_add_32)
+
+	ENTRY(atomic_or_8)
+	ALTENTRY(atomic_or_uchar)
+	movl	4(%esp), %eax
+	movb	8(%esp), %cl
+	lock
+	orb	%cl, (%eax)
+	ret
+	SET_SIZE(atomic_or_uchar)
+	SET_SIZE(atomic_or_8)
+
+	ENTRY(atomic_or_16)
+	ALTENTRY(atomic_or_ushort)
+	movl	4(%esp), %eax
+	movw	8(%esp), %cx
+	lock
+	orw	%cx, (%eax)
+	ret
+	SET_SIZE(atomic_or_ushort)
+	SET_SIZE(atomic_or_16)
+
+	ENTRY(atomic_or_32)
+	ALTENTRY(atomic_or_uint)
+	ALTENTRY(atomic_or_ulong)
+	movl	4(%esp), %eax
+	movl	8(%esp), %ecx
+	lock
+	orl	%ecx, (%eax)
+	ret
+	SET_SIZE(atomic_or_ulong)
+	SET_SIZE(atomic_or_uint)
+	SET_SIZE(atomic_or_32)
+
+	ENTRY(atomic_and_8)
+	ALTENTRY(atomic_and_uchar)
+	movl	4(%esp), %eax
+	movb	8(%esp), %cl
+	lock
+	andb	%cl, (%eax)
+	ret
+	SET_SIZE(atomic_and_uchar)
+	SET_SIZE(atomic_and_8)
+
+	ENTRY(atomic_and_16)
+	ALTENTRY(atomic_and_ushort)
+	movl	4(%esp), %eax
+	movw	8(%esp), %cx
+	lock
+	andw	%cx, (%eax)
+	ret
+	SET_SIZE(atomic_and_ushort)
+	SET_SIZE(atomic_and_16)
+
+	ENTRY(atomic_and_32)
+	ALTENTRY(atomic_and_uint)
+	ALTENTRY(atomic_and_ulong)
+	movl	4(%esp), %eax
+	movl	8(%esp), %ecx
+	lock
+	andl	%ecx, (%eax)
+	ret
+	SET_SIZE(atomic_and_ulong)
+	SET_SIZE(atomic_and_uint)
+	SET_SIZE(atomic_and_32)
+
+	ENTRY(atomic_add_8_nv)
+	ALTENTRY(atomic_add_char_nv)
+	movl	4(%esp), %edx	/ %edx = target address
+	movb	8(%esp), %cl	/ %cl = delta
+	movzbl	%cl, %eax	/ %al = delta, zero extended
+	lock
+	  xaddb	%cl, (%edx)	/ %cl = old value, (%edx) = sum
+	addb	%cl, %al	/ return old value plus delta
+	ret
+	SET_SIZE(atomic_add_char_nv)
+	SET_SIZE(atomic_add_8_nv)
+
+	ENTRY(atomic_add_16_nv)
+	ALTENTRY(atomic_add_short_nv)
+	movl	4(%esp), %edx	/ %edx = target address
+	movw	8(%esp), %cx	/ %cx = delta
+	movzwl	%cx, %eax	/ %ax = delta, zero extended
+	lock
+	  xaddw	%cx, (%edx)	/ %cx = old value, (%edx) = sum
+	addw	%cx, %ax	/ return old value plus delta
+	ret
+	SET_SIZE(atomic_add_short_nv)
+	SET_SIZE(atomic_add_16_nv)
+
+	ENTRY(atomic_add_32_nv)
+	ALTENTRY(atomic_add_int_nv)
+	ALTENTRY(atomic_add_ptr_nv)
+	ALTENTRY(atomic_add_long_nv)
+	movl	4(%esp), %edx	/ %edx = target address
+	movl	8(%esp), %eax	/ %eax = delta
+	movl	%eax, %ecx	/ %ecx = delta
+	lock
+	  xaddl	%eax, (%edx)	/ %eax = old value, (%edx) = sum
+	addl	%ecx, %eax	/ return old value plus delta
+	ret
+	SET_SIZE(atomic_add_long_nv)
+	SET_SIZE(atomic_add_ptr_nv)
+	SET_SIZE(atomic_add_int_nv)
+	SET_SIZE(atomic_add_32_nv)
+
+	/*
+	 * NOTE: If atomic_add_64 and atomic_add_64_nv are ever
+	 * separated, it is important to edit the libc i386 platform
+	 * specific mapfile and remove the NODYNSORT attribute
+	 * from atomic_add_64_nv.
+	 */
+	ENTRY(atomic_add_64)
+	ALTENTRY(atomic_add_64_nv)
+	pushl	%edi
+	pushl	%ebx
+	movl	12(%esp), %edi	/ %edi = target address
+	movl	(%edi), %eax
+	movl	4(%edi), %edx	/ %edx:%eax = old value
+1:
+	movl	16(%esp), %ebx
+	movl	20(%esp), %ecx	/ %ecx:%ebx = delta
+	addl	%eax, %ebx
+	adcl	%edx, %ecx	/ %ecx:%ebx = new value
+	lock
+	cmpxchg8b (%edi)	/ try to stick it in
+	jne	1b
+	movl	%ebx, %eax
+	movl	%ecx, %edx	/ return new value
+	popl	%ebx
+	popl	%edi
+	ret
+	SET_SIZE(atomic_add_64_nv)
+	SET_SIZE(atomic_add_64)
+
+	ENTRY(atomic_or_8_nv)
+	ALTENTRY(atomic_or_uchar_nv)
+	movl	4(%esp), %edx	/ %edx = target address
+	movb	(%edx), %al	/ %al = old value
+1:
+	movl	8(%esp), %ecx	/ %ecx = delta
+	orb	%al, %cl	/ %cl = new value
+	lock
+	cmpxchgb %cl, (%edx)	/ try to stick it in
+	jne	1b
+	movzbl	%cl, %eax	/ return new value
+	ret
+	SET_SIZE(atomic_or_uchar_nv)
+	SET_SIZE(atomic_or_8_nv)
+
+	ENTRY(atomic_or_16_nv)
+	ALTENTRY(atomic_or_ushort_nv)
+	movl	4(%esp), %edx	/ %edx = target address
+	movw	(%edx), %ax	/ %ax = old value
+1:
+	movl	8(%esp), %ecx	/ %ecx = delta
+	orw	%ax, %cx	/ %cx = new value
+	lock
+	cmpxchgw %cx, (%edx)	/ try to stick it in
+	jne	1b
+	movzwl	%cx, %eax	/ return new value
+	ret
+	SET_SIZE(atomic_or_ushort_nv)
+	SET_SIZE(atomic_or_16_nv)
+
+	ENTRY(atomic_or_32_nv)
+	ALTENTRY(atomic_or_uint_nv)
+	ALTENTRY(atomic_or_ulong_nv)
+	movl	4(%esp), %edx	/ %edx = target address
+	movl	(%edx), %eax	/ %eax = old value
+1:
+	movl	8(%esp), %ecx	/ %ecx = delta
+	orl	%eax, %ecx	/ %ecx = new value
+	lock
+	cmpxchgl %ecx, (%edx)	/ try to stick it in
+	jne	1b
+	movl	%ecx, %eax	/ return new value
+	ret
+	SET_SIZE(atomic_or_ulong_nv)
+	SET_SIZE(atomic_or_uint_nv)
+	SET_SIZE(atomic_or_32_nv)
+
+	/*
+	 * NOTE: If atomic_or_64 and atomic_or_64_nv are ever
+	 * separated, it is important to edit the libc i386 platform
+	 * specific mapfile and remove the NODYNSORT attribute
+	 * from atomic_or_64_nv.
+	 */
+	ENTRY(atomic_or_64)
+	ALTENTRY(atomic_or_64_nv)
+	pushl	%edi
+	pushl	%ebx
+	movl	12(%esp), %edi	/ %edi = target address
+	movl	(%edi), %eax
+	movl	4(%edi), %edx	/ %edx:%eax = old value
+1:
+	movl	16(%esp), %ebx
+	movl	20(%esp), %ecx	/ %ecx:%ebx = delta
+	orl	%eax, %ebx
+	orl	%edx, %ecx	/ %ecx:%ebx = new value
+	lock
+	cmpxchg8b (%edi)	/ try to stick it in
+	jne	1b
+	movl	%ebx, %eax
+	movl	%ecx, %edx	/ return new value
+	popl	%ebx
+	popl	%edi
+	ret
+	SET_SIZE(atomic_or_64_nv)
+	SET_SIZE(atomic_or_64)
+
+	ENTRY(atomic_and_8_nv)
+	ALTENTRY(atomic_and_uchar_nv)
+	movl	4(%esp), %edx	/ %edx = target address
+	movb	(%edx), %al	/ %al = old value
+1:
+	movl	8(%esp), %ecx	/ %ecx = delta
+	andb	%al, %cl	/ %cl = new value
+	lock
+	cmpxchgb %cl, (%edx)	/ try to stick it in
+	jne	1b
+	movzbl	%cl, %eax	/ return new value
+	ret
+	SET_SIZE(atomic_and_uchar_nv)
+	SET_SIZE(atomic_and_8_nv)
+
+	ENTRY(atomic_and_16_nv)
+	ALTENTRY(atomic_and_ushort_nv)
+	movl	4(%esp), %edx	/ %edx = target address
+	movw	(%edx), %ax	/ %ax = old value
+1:
+	movl	8(%esp), %ecx	/ %ecx = delta
+	andw	%ax, %cx	/ %cx = new value
+	lock
+	cmpxchgw %cx, (%edx)	/ try to stick it in
+	jne	1b
+	movzwl	%cx, %eax	/ return new value
+	ret
+	SET_SIZE(atomic_and_ushort_nv)
+	SET_SIZE(atomic_and_16_nv)
+
+	ENTRY(atomic_and_32_nv)
+	ALTENTRY(atomic_and_uint_nv)
+	ALTENTRY(atomic_and_ulong_nv)
+	movl	4(%esp), %edx	/ %edx = target address
+	movl	(%edx), %eax	/ %eax = old value
+1:
+	movl	8(%esp), %ecx	/ %ecx = delta
+	andl	%eax, %ecx	/ %ecx = new value
+	lock
+	cmpxchgl %ecx, (%edx)	/ try to stick it in
+	jne	1b
+	movl	%ecx, %eax	/ return new value
+	ret
+	SET_SIZE(atomic_and_ulong_nv)
+	SET_SIZE(atomic_and_uint_nv)
+	SET_SIZE(atomic_and_32_nv)
+
+	/*
+	 * NOTE: If atomic_and_64 and atomic_and_64_nv are ever
+	 * separated, it is important to edit the libc i386 platform
+	 * specific mapfile and remove the NODYNSORT attribute
+	 * from atomic_and_64_nv.
+	 */
+	ENTRY(atomic_and_64)
+	ALTENTRY(atomic_and_64_nv)
+	pushl	%edi
+	pushl	%ebx
+	movl	12(%esp), %edi	/ %edi = target address
+	movl	(%edi), %eax
+	movl	4(%edi), %edx	/ %edx:%eax = old value
+1:
+	movl	16(%esp), %ebx
+	movl	20(%esp), %ecx	/ %ecx:%ebx = delta
+	andl	%eax, %ebx
+	andl	%edx, %ecx	/ %ecx:%ebx = new value
+	lock
+	cmpxchg8b (%edi)	/ try to stick it in
+	jne	1b
+	movl	%ebx, %eax
+	movl	%ecx, %edx	/ return new value
+	popl	%ebx
+	popl	%edi
+	ret
+	SET_SIZE(atomic_and_64_nv)
+	SET_SIZE(atomic_and_64)
+
+	ENTRY(atomic_cas_8)
+	ALTENTRY(atomic_cas_uchar)
+	movl	4(%esp), %edx
+	movzbl	8(%esp), %eax
+	movb	12(%esp), %cl
+	lock
+	cmpxchgb %cl, (%edx)
+	ret
+	SET_SIZE(atomic_cas_uchar)
+	SET_SIZE(atomic_cas_8)
+
+	ENTRY(atomic_cas_16)
+	ALTENTRY(atomic_cas_ushort)
+	movl	4(%esp), %edx
+	movzwl	8(%esp), %eax
+	movw	12(%esp), %cx
+	lock
+	cmpxchgw %cx, (%edx)
+	ret
+	SET_SIZE(atomic_cas_ushort)
+	SET_SIZE(atomic_cas_16)
+
+	ENTRY(atomic_cas_32)
+	ALTENTRY(atomic_cas_uint)
+	ALTENTRY(atomic_cas_ulong)
+	ALTENTRY(atomic_cas_ptr)
+	movl	4(%esp), %edx
+	movl	8(%esp), %eax
+	movl	12(%esp), %ecx
+	lock
+	cmpxchgl %ecx, (%edx)
+	ret
+	SET_SIZE(atomic_cas_ptr)
+	SET_SIZE(atomic_cas_ulong)
+	SET_SIZE(atomic_cas_uint)
+	SET_SIZE(atomic_cas_32)
+
+	ENTRY(atomic_cas_64)
+	pushl	%ebx
+	pushl	%esi
+	movl	12(%esp), %esi
+	movl	16(%esp), %eax
+	movl	20(%esp), %edx
+	movl	24(%esp), %ebx
+	movl	28(%esp), %ecx
+	lock
+	cmpxchg8b (%esi)
+	popl	%esi
+	popl	%ebx
+	ret
+	SET_SIZE(atomic_cas_64)
+
+	ENTRY(atomic_swap_8)
+	ALTENTRY(atomic_swap_uchar)
+	movl	4(%esp), %edx
+	movzbl	8(%esp), %eax
+	lock
+	xchgb	%al, (%edx)
+	ret
+	SET_SIZE(atomic_swap_uchar)
+	SET_SIZE(atomic_swap_8)
+
+	ENTRY(atomic_swap_16)
+	ALTENTRY(atomic_swap_ushort)
+	movl	4(%esp), %edx
+	movzwl	8(%esp), %eax
+	lock
+	xchgw	%ax, (%edx)
+	ret
+	SET_SIZE(atomic_swap_ushort)
+	SET_SIZE(atomic_swap_16)
+
+	ENTRY(atomic_swap_32)
+	ALTENTRY(atomic_swap_uint)
+	ALTENTRY(atomic_swap_ptr)
+	ALTENTRY(atomic_swap_ulong)
+	movl	4(%esp), %edx
+	movl	8(%esp), %eax
+	lock
+	xchgl	%eax, (%edx)
+	ret
+	SET_SIZE(atomic_swap_ulong)
+	SET_SIZE(atomic_swap_ptr)
+	SET_SIZE(atomic_swap_uint)
+	SET_SIZE(atomic_swap_32)
+
+	ENTRY(atomic_swap_64)
+	pushl	%esi
+	pushl	%ebx
+	movl	12(%esp), %esi
+	movl	16(%esp), %ebx
+	movl	20(%esp), %ecx
+	movl	(%esi), %eax
+	movl	4(%esi), %edx	/ %edx:%eax = old value
+1:
+	lock
+	cmpxchg8b (%esi)
+	jne	1b
+	popl	%ebx
+	popl	%esi
+	ret
+	SET_SIZE(atomic_swap_64)
+
+	ENTRY(atomic_set_long_excl)
+	movl	4(%esp), %edx	/ %edx = target address
+	movl	8(%esp), %ecx	/ %ecx = bit id
+	xorl	%eax, %eax
+	lock
+	btsl	%ecx, (%edx)
+	jnc	1f
+	decl	%eax		/ return -1
+1:
+	ret
+	SET_SIZE(atomic_set_long_excl)
+
+	ENTRY(atomic_clear_long_excl)
+	movl	4(%esp), %edx	/ %edx = target address
+	movl	8(%esp), %ecx	/ %ecx = bit id
+	xorl	%eax, %eax
+	lock
+	btrl	%ecx, (%edx)
+	jc	1f
+	decl	%eax		/ return -1
+1:
+	ret
+	SET_SIZE(atomic_clear_long_excl)
+
+#if !defined(_KERNEL)
+
+	/*
+	 * NOTE: membar_enter, membar_exit, membar_producer, and
+	 * membar_consumer are all identical routines. We define them
+	 * separately, instead of using ALTENTRY definitions to alias them
+	 * together, so that DTrace and debuggers will see a unique address
+	 * for them, allowing more accurate tracing.
+	 */
+
+
+	ENTRY(membar_enter)
+	lock
+	xorl	$0, (%esp)
+	ret
+	SET_SIZE(membar_enter)
+
+	ENTRY(membar_exit)
+	lock
+	xorl	$0, (%esp)
+	ret
+	SET_SIZE(membar_exit)
+
+	ENTRY(membar_producer)
+	lock
+	xorl	$0, (%esp)
+	ret
+	SET_SIZE(membar_producer)
+
+	ENTRY(membar_consumer)
+	lock
+	xorl	$0, (%esp)
+	ret
+	SET_SIZE(membar_consumer)
+
+#endif	/* !_KERNEL */
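Note on `atomic_set_long_excl` and `atomic_clear_long_excl` above: both return 0 when the caller is the one that changed the bit and -1 when the bit was already in the requested state, which is what the carry-flag test after `lock bts` / `lock btr` decides. The sketch below shows that contract in C11, assuming `<stdatomic.h>` is available. The `sketch_*` names are hypothetical. Unlike the assembly, the sketch only accepts bit indexes that fall inside a single `unsigned long`.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Hypothetical helper: set bit `bit` in *target.
 * Returns 0 if this call set it, -1 if it was already set.
 */
static int
sketch_set_long_excl(_Atomic unsigned long *target, unsigned int bit)
{
	unsigned long mask, old;

	assert(target != NULL);
	assert(bit < sizeof (unsigned long) * CHAR_BIT);
	mask = 1UL << bit;
	old = atomic_fetch_or(target, mask);	/* previous word value */
	return ((old & mask) ? -1 : 0);		/* old bit plays the role of CF */
}

/*
 * Hypothetical helper: clear bit `bit` in *target.
 * Returns 0 if this call cleared it, -1 if it was already clear.
 */
static int
sketch_clear_long_excl(_Atomic unsigned long *target, unsigned int bit)
{
	unsigned long mask, old;

	assert(target != NULL);
	assert(bit < sizeof (unsigned long) * CHAR_BIT);
	mask = 1UL << bit;
	old = atomic_fetch_and(target, ~mask);	/* previous word value */
	return ((old & mask) ? 0 : -1);
}
```

A caller could use the pair as a simple try-lock on a flag word: whoever gets 0 back from the set call owns the bit until it calls the clear routine.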
--- a/common/atomic/sparc/atomic.s
+++ b/common/atomic/sparc/atomic.s
@ -0,0 +1,801 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+
+/*
+ * Copyright 2008 Sun Microsystems, Inc.  All rights reserved.
+ * Use is subject to license terms.
+ */
+
+	.file	"atomic.s"
+
+#include <sys/asm_linkage.h>
+
+#if defined(_KERNEL)
+	/*
+	 * Legacy kernel interfaces; they will go away (eventually).
+	 */
+	ANSI_PRAGMA_WEAK2(cas8,atomic_cas_8,function)
+	ANSI_PRAGMA_WEAK2(cas32,atomic_cas_32,function)
+	ANSI_PRAGMA_WEAK2(cas64,atomic_cas_64,function)
+	ANSI_PRAGMA_WEAK2(caslong,atomic_cas_ulong,function)
+	ANSI_PRAGMA_WEAK2(casptr,atomic_cas_ptr,function)
+	ANSI_PRAGMA_WEAK2(atomic_and_long,atomic_and_ulong,function)
+	ANSI_PRAGMA_WEAK2(atomic_or_long,atomic_or_ulong,function)
+	ANSI_PRAGMA_WEAK2(swapl,atomic_swap_32,function)
+#endif
+
+	/*
+	 * NOTE: If atomic_inc_8 and atomic_inc_8_nv are ever
+	 * separated, you need to also edit the libc sparc platform
+	 * specific mapfile and remove the NODYNSORT attribute
+	 * from atomic_inc_8_nv.
+	 */
+	ENTRY(atomic_inc_8)
+	ALTENTRY(atomic_inc_8_nv)
+	ALTENTRY(atomic_inc_uchar)
+	ALTENTRY(atomic_inc_uchar_nv)
+	ba	add_8
+	  add	%g0, 1, %o1
+	SET_SIZE(atomic_inc_uchar_nv)
+	SET_SIZE(atomic_inc_uchar)
+	SET_SIZE(atomic_inc_8_nv)
+	SET_SIZE(atomic_inc_8)
+
+	/*
+	 * NOTE: If atomic_dec_8 and atomic_dec_8_nv are ever
+	 * separated, you need to also edit the libc sparc platform
+	 * specific mapfile and remove the NODYNSORT attribute
+	 * from atomic_dec_8_nv.
+	 */
+	ENTRY(atomic_dec_8)
+	ALTENTRY(atomic_dec_8_nv)
+	ALTENTRY(atomic_dec_uchar)
+	ALTENTRY(atomic_dec_uchar_nv)
+	ba	add_8
+	  sub	%g0, 1, %o1
+	SET_SIZE(atomic_dec_uchar_nv)
+	SET_SIZE(atomic_dec_uchar)
+	SET_SIZE(atomic_dec_8_nv)
+	SET_SIZE(atomic_dec_8)
+
+	/*
+	 * NOTE: If atomic_add_8 and atomic_add_8_nv are ever
+	 * separated, you need to also edit the libc sparc platform
+	 * specific mapfile and remove the NODYNSORT attribute
+	 * from atomic_add_8_nv.
+	 */
+	ENTRY(atomic_add_8)
+	ALTENTRY(atomic_add_8_nv)
+	ALTENTRY(atomic_add_char)
+	ALTENTRY(atomic_add_char_nv)
+add_8:
+	and	%o0, 0x3, %o4		! %o4 = byte offset, left-to-right
+	xor	%o4, 0x3, %g1		! %g1 = byte offset, right-to-left
+	sll	%g1, 3, %g1		! %g1 = bit offset, right-to-left
+	set	0xff, %o3		! %o3 = mask
+	sll	%o3, %g1, %o3		! %o3 = shifted to bit offset
+	sll	%o1, %g1, %o1		! %o1 = shifted to bit offset
+	and	%o1, %o3, %o1		! %o1 = single byte value
+	andn	%o0, 0x3, %o0		! %o0 = word address
+	ld	[%o0], %o2		! read old value
+1:
+	add	%o2, %o1, %o5		! add value to the old value
+	and	%o5, %o3, %o5		! clear other bits
+	andn	%o2, %o3, %o4		! clear target bits
+	or	%o4, %o5, %o5		! insert the new value
+	cas	[%o0], %o2, %o5
+	cmp	%o2, %o5
+	bne,a,pn %icc, 1b
+	  mov	%o5, %o2		! %o2 = old value
+	add	%o2, %o1, %o5
+	and	%o5, %o3, %o5
+	retl
+	srl	%o5, %g1, %o0		! %o0 = new value
+	SET_SIZE(atomic_add_char_nv)
+	SET_SIZE(atomic_add_char)
+	SET_SIZE(atomic_add_8_nv)
+	SET_SIZE(atomic_add_8)
+
+	/*
+	 * NOTE: If atomic_inc_16 and atomic_inc_16_nv are ever
+	 * separated, you need to also edit the libc sparc platform
+	 * specific mapfile and remove the NODYNSORT attribute
+	 * from atomic_inc_16_nv.
+	 */
+	ENTRY(atomic_inc_16)
+	ALTENTRY(atomic_inc_16_nv)
+	ALTENTRY(atomic_inc_ushort)
+	ALTENTRY(atomic_inc_ushort_nv)
+	ba	add_16
+	  add	%g0, 1, %o1
+	SET_SIZE(atomic_inc_ushort_nv)
+	SET_SIZE(atomic_inc_ushort)
+	SET_SIZE(atomic_inc_16_nv)
+	SET_SIZE(atomic_inc_16)
+
+	/*
+	 * NOTE: If atomic_dec_16 and atomic_dec_16_nv are ever
+	 * separated, you need to also edit the libc sparc platform
+	 * specific mapfile and remove the NODYNSORT attribute
+	 * from atomic_dec_16_nv.
+	 */
+	ENTRY(atomic_dec_16)
+	ALTENTRY(atomic_dec_16_nv)
+	ALTENTRY(atomic_dec_ushort)
+	ALTENTRY(atomic_dec_ushort_nv)
+	ba	add_16
+	  sub	%g0, 1, %o1
+	SET_SIZE(atomic_dec_ushort_nv)
+	SET_SIZE(atomic_dec_ushort)
+	SET_SIZE(atomic_dec_16_nv)
+	SET_SIZE(atomic_dec_16)
+
+	/*
+	 * NOTE: If atomic_add_16 and atomic_add_16_nv are ever
+	 * separated, you need to also edit the libc sparc platform
+	 * specific mapfile and remove the NODYNSORT attribute
+	 * from atomic_add_16_nv.
+	 */
+	ENTRY(atomic_add_16)
+	ALTENTRY(atomic_add_16_nv)
+	ALTENTRY(atomic_add_short)
+	ALTENTRY(atomic_add_short_nv)
+add_16:
+	and	%o0, 0x2, %o4		! %o4 = byte offset, left-to-right
+	xor	%o4, 0x2, %g1		! %g1 = byte offset, right-to-left
+	sll	%o4, 3, %o4		! %o4 = bit offset, left-to-right
+	sll	%g1, 3, %g1		! %g1 = bit offset, right-to-left
+	sethi	%hi(0xffff0000), %o3	! %o3 = mask
+	srl	%o3, %o4, %o3		! %o3 = shifted to bit offset
+	sll	%o1, %g1, %o1		! %o1 = shifted to bit offset
+	and	%o1, %o3, %o1		! %o1 = single short value
+	andn	%o0, 0x2, %o0		! %o0 = word address
+	! if low-order bit is 1, we will properly get an alignment fault here
+	ld	[%o0], %o2		! read old value
+1:
+	add	%o1, %o2, %o5		! add value to the old value
+	and	%o5, %o3, %o5		! clear other bits
+	andn	%o2, %o3, %o4		! clear target bits
+	or	%o4, %o5, %o5		! insert the new value
+	cas	[%o0], %o2, %o5
+	cmp	%o2, %o5
+	bne,a,pn %icc, 1b
+	  mov	%o5, %o2		! %o2 = old value
+	add	%o1, %o2, %o5
+	and	%o5, %o3, %o5
+	retl
+	srl	%o5, %g1, %o0		! %o0 = new value
+	SET_SIZE(atomic_add_short_nv)
+	SET_SIZE(atomic_add_short)
+	SET_SIZE(atomic_add_16_nv)
+	SET_SIZE(atomic_add_16)
+
+	/*
+	 * NOTE: If atomic_inc_32 and atomic_inc_32_nv are ever
+	 * separated, you need to also edit the libc sparc platform
+	 * specific mapfile and remove the NODYNSORT attribute
+	 * from atomic_inc_32_nv.
+	 */
+	ENTRY(atomic_inc_32)
+	ALTENTRY(atomic_inc_32_nv)
+	ALTENTRY(atomic_inc_uint)
+	ALTENTRY(atomic_inc_uint_nv)
+	ALTENTRY(atomic_inc_ulong)
+	ALTENTRY(atomic_inc_ulong_nv)
+	ba	add_32
+	  add	%g0, 1, %o1
+	SET_SIZE(atomic_inc_ulong_nv)
+	SET_SIZE(atomic_inc_ulong)
+	SET_SIZE(atomic_inc_uint_nv)
+	SET_SIZE(atomic_inc_uint)
+	SET_SIZE(atomic_inc_32_nv)
+	SET_SIZE(atomic_inc_32)
+
+	/*
+	 * NOTE: If atomic_dec_32 and atomic_dec_32_nv are ever
+	 * separated, you need to also edit the libc sparc platform
+	 * specific mapfile and remove the NODYNSORT attribute
+	 * from atomic_dec_32_nv.
+	 */
+	ENTRY(atomic_dec_32)
+	ALTENTRY(atomic_dec_32_nv)
+	ALTENTRY(atomic_dec_uint)
+	ALTENTRY(atomic_dec_uint_nv)
+	ALTENTRY(atomic_dec_ulong)
+	ALTENTRY(atomic_dec_ulong_nv)
+	ba	add_32
+	  sub	%g0, 1, %o1
+	SET_SIZE(atomic_dec_ulong_nv)
+	SET_SIZE(atomic_dec_ulong)
+	SET_SIZE(atomic_dec_uint_nv)
+	SET_SIZE(atomic_dec_uint)
+	SET_SIZE(atomic_dec_32_nv)
+	SET_SIZE(atomic_dec_32)
+
+	/*
+	 * NOTE: If atomic_add_32 and atomic_add_32_nv are ever
+	 * separated, you need to also edit the libc sparc platform
+	 * specific mapfile and remove the NODYNSORT attribute
+	 * from atomic_add_32_nv.
+	 */
+	ENTRY(atomic_add_32)
+	ALTENTRY(atomic_add_32_nv)
+	ALTENTRY(atomic_add_int)
+	ALTENTRY(atomic_add_int_nv)
+	ALTENTRY(atomic_add_ptr)
+	ALTENTRY(atomic_add_ptr_nv)
+	ALTENTRY(atomic_add_long)
+	ALTENTRY(atomic_add_long_nv)
+add_32:
+	ld	[%o0], %o2
+1:
+	add	%o2, %o1, %o3
+	cas	[%o0], %o2, %o3
+	cmp	%o2, %o3
+	bne,a,pn %icc, 1b
+	  mov	%o3, %o2
+	retl
+	add	%o2, %o1, %o0		! return new value
+	SET_SIZE(atomic_add_long_nv)
+	SET_SIZE(atomic_add_long)
+	SET_SIZE(atomic_add_ptr_nv)
+	SET_SIZE(atomic_add_ptr)
+	SET_SIZE(atomic_add_int_nv)
+	SET_SIZE(atomic_add_int)
+	SET_SIZE(atomic_add_32_nv)
+	SET_SIZE(atomic_add_32)
+
+	/*
+	 * NOTE: If atomic_inc_64 and atomic_inc_64_nv are ever
+	 * separated, you need to also edit the libc sparc platform
+	 * specific mapfile and remove the NODYNSORT attribute
+	 * from atomic_inc_64_nv.
+	 */
+	ENTRY(atomic_inc_64)
+	ALTENTRY(atomic_inc_64_nv)
+	ba	add_64
+	  add	%g0, 1, %o1
+	SET_SIZE(atomic_inc_64_nv)
+	SET_SIZE(atomic_inc_64)
+
+	/*
+	 * NOTE: If atomic_dec_64 and atomic_dec_64_nv are ever
+	 * separated, you need to also edit the libc sparc platform
+	 * specific mapfile and remove the NODYNSORT attribute
+	 * from atomic_dec_64_nv.
+	 */
+	ENTRY(atomic_dec_64)
+	ALTENTRY(atomic_dec_64_nv)
+	ba	add_64
+	  sub	%g0, 1, %o1
+	SET_SIZE(atomic_dec_64_nv)
+	SET_SIZE(atomic_dec_64)
+
+	/*
+	 * NOTE: If atomic_add_64 and atomic_add_64_nv are ever
+	 * separated, you need to also edit the libc sparc platform
+	 * specific mapfile and remove the NODYNSORT attribute
+	 * from atomic_add_64_nv.
+	 */
+	ENTRY(atomic_add_64)
+	ALTENTRY(atomic_add_64_nv)
+	sllx	%o1, 32, %o1		! upper 32 in %o1, lower in %o2
+	srl	%o2, 0, %o2
+	add	%o1, %o2, %o1		! convert 2 32-bit args into 1 64-bit
+add_64:
+	ldx	[%o0], %o2
+1:
+	add	%o2, %o1, %o3
+	casx	[%o0], %o2, %o3
+	cmp	%o2, %o3
+	bne,a,pn %xcc, 1b
+	  mov	%o3, %o2
+	add	%o2, %o1, %o1		! return lower 32-bits in %o1
+	retl
+	srlx	%o1, 32, %o0		! return upper 32-bits in %o0
+	SET_SIZE(atomic_add_64_nv)
+	SET_SIZE(atomic_add_64)
+
+	/*
+	 * NOTE: If atomic_or_8 and atomic_or_8_nv are ever
+	 * separated, you need to also edit the libc sparc platform
+	 * specific mapfile and remove the NODYNSORT attribute
+	 * from atomic_or_8_nv.
+	 */
+	ENTRY(atomic_or_8)
+	ALTENTRY(atomic_or_8_nv)
+	ALTENTRY(atomic_or_uchar)
+	ALTENTRY(atomic_or_uchar_nv)
+	and	%o0, 0x3, %o4		! %o4 = byte offset, left-to-right
+	xor	%o4, 0x3, %g1		! %g1 = byte offset, right-to-left
+	sll	%g1, 3, %g1		! %g1 = bit offset, right-to-left
+	set	0xff, %o3		! %o3 = mask
+	sll	%o3, %g1, %o3		! %o3 = shifted to bit offset
+	sll	%o1, %g1, %o1		! %o1 = shifted to bit offset
+	and	%o1, %o3, %o1		! %o1 = single byte value
+	andn	%o0, 0x3, %o0		! %o0 = word address
+	ld	[%o0], %o2		! read old value
+1:
+	or	%o2, %o1, %o5		! or in the new value
+	cas	[%o0], %o2, %o5
+	cmp	%o2, %o5
+	bne,a,pn %icc, 1b
+	  mov	%o5, %o2		! %o2 = old value
+	or	%o2, %o1, %o5
+	and	%o5, %o3, %o5
+	retl
+	srl	%o5, %g1, %o0		! %o0 = new value
+	SET_SIZE(atomic_or_uchar_nv)
+	SET_SIZE(atomic_or_uchar)
+	SET_SIZE(atomic_or_8_nv)
+	SET_SIZE(atomic_or_8)
+
+	/*
+	 * NOTE: If atomic_or_16 and atomic_or_16_nv are ever
+	 * separated, you need to also edit the libc sparc platform
+	 * specific mapfile and remove the NODYNSORT attribute
+	 * from atomic_or_16_nv.
+	 */
+	ENTRY(atomic_or_16)
+	ALTENTRY(atomic_or_16_nv)
+	ALTENTRY(atomic_or_ushort)
+	ALTENTRY(atomic_or_ushort_nv)
+	and	%o0, 0x2, %o4		! %o4 = byte offset, left-to-right
+	xor	%o4, 0x2, %g1		! %g1 = byte offset, right-to-left
+	sll	%o4, 3, %o4		! %o4 = bit offset, left-to-right
+	sll	%g1, 3, %g1		! %g1 = bit offset, right-to-left
+	sethi	%hi(0xffff0000), %o3	! %o3 = mask
+	srl	%o3, %o4, %o3		! %o3 = shifted to bit offset
+	sll	%o1, %g1, %o1		! %o1 = shifted to bit offset
+	and	%o1, %o3, %o1		! %o1 = single short value
+	andn	%o0, 0x2, %o0		! %o0 = word address
+	! if low-order bit is 1, we will properly get an alignment fault here
+	ld	[%o0], %o2		! read old value
+1:
+	or	%o2, %o1, %o5		! or in the new value
+	cas	[%o0], %o2, %o5
+	cmp	%o2, %o5
+	bne,a,pn %icc, 1b
+	  mov	%o5, %o2		! %o2 = old value
+	or	%o2, %o1, %o5		! or in the new value
+	and	%o5, %o3, %o5
+	retl
+	srl	%o5, %g1, %o0		! %o0 = new value
+	SET_SIZE(atomic_or_ushort_nv)
+	SET_SIZE(atomic_or_ushort)
+	SET_SIZE(atomic_or_16_nv)
+	SET_SIZE(atomic_or_16)
+
+	/*
+	 * NOTE: If atomic_or_32 and atomic_or_32_nv are ever
+	 * separated, you need to also edit the libc sparc platform
+	 * specific mapfile and remove the NODYNSORT attribute
+	 * from atomic_or_32_nv.
+	 */
+	ENTRY(atomic_or_32)
+	ALTENTRY(atomic_or_32_nv)
+	ALTENTRY(atomic_or_uint)
+	ALTENTRY(atomic_or_uint_nv)
+	ALTENTRY(atomic_or_ulong)
+	ALTENTRY(atomic_or_ulong_nv)
+	ld	[%o0], %o2
+1:
+	or	%o2, %o1, %o3
+	cas	[%o0], %o2, %o3
+	cmp	%o2, %o3
+	bne,a,pn %icc, 1b
+	  mov	%o3, %o2
+	retl
+	or	%o2, %o1, %o0		! return new value
+	SET_SIZE(atomic_or_ulong_nv)
+	SET_SIZE(atomic_or_ulong)
+	SET_SIZE(atomic_or_uint_nv)
+	SET_SIZE(atomic_or_uint)
+	SET_SIZE(atomic_or_32_nv)
+	SET_SIZE(atomic_or_32)
+
+	/*
+	 * NOTE: If atomic_or_64 and atomic_or_64_nv are ever
+	 * separated, you need to also edit the libc sparc platform
+	 * specific mapfile and remove the NODYNSORT attribute
+	 * from atomic_or_64_nv.
+	 */
+	ENTRY(atomic_or_64)
+	ALTENTRY(atomic_or_64_nv)
+	sllx	%o1, 32, %o1		! upper 32 in %o1, lower in %o2
+	srl	%o2, 0, %o2
+	add	%o1, %o2, %o1		! convert 2 32-bit args into 1 64-bit
+	ldx	[%o0], %o2
+1:
+	or	%o2, %o1, %o3
+	casx	[%o0], %o2, %o3
+	cmp	%o2, %o3
+	bne,a,pn %xcc, 1b
+	  mov	%o3, %o2
+	or	%o2, %o1, %o1		! return lower 32-bits in %o1
+	retl
+	srlx	%o1, 32, %o0		! return upper 32-bits in %o0
+	SET_SIZE(atomic_or_64_nv)
+	SET_SIZE(atomic_or_64)
+
+	/*
+	 * NOTE: If atomic_and_8 and atomic_and_8_nv are ever
+	 * separated, you need to also edit the libc sparc platform
+	 * specific mapfile and remove the NODYNSORT attribute
+	 * from atomic_and_8_nv.
+	 */
+	ENTRY(atomic_and_8)
+	ALTENTRY(atomic_and_8_nv)
+	ALTENTRY(atomic_and_uchar)
+	ALTENTRY(atomic_and_uchar_nv)
+	and	%o0, 0x3, %o4		! %o4 = byte offset, left-to-right
+	xor	%o4, 0x3, %g1		! %g1 = byte offset, right-to-left
+	sll	%g1, 3, %g1		! %g1 = bit offset, right-to-left
+	set	0xff, %o3		! %o3 = mask
+	sll	%o3, %g1, %o3		! %o3 = shifted to bit offset
+	sll	%o1, %g1, %o1		! %o1 = shifted to bit offset
+	orn	%o1, %o3, %o1		! all ones in other bytes
+	andn	%o0, 0x3, %o0		! %o0 = word address
+	ld	[%o0], %o2		! read old value
+1:
+	and	%o2, %o1, %o5		! and in the new value
+	cas	[%o0], %o2, %o5
+	cmp	%o2, %o5
+	bne,a,pn %icc, 1b
+	  mov	%o5, %o2		! %o2 = old value
+	and	%o2, %o1, %o5
+	and	%o5, %o3, %o5
+	retl
+	srl	%o5, %g1, %o0		! %o0 = new value
+	SET_SIZE(atomic_and_uchar_nv)
+	SET_SIZE(atomic_and_uchar)
+	SET_SIZE(atomic_and_8_nv)
+	SET_SIZE(atomic_and_8)
+
+	/*
+	 * NOTE: If atomic_and_16 and atomic_and_16_nv are ever
+	 * separated, you need to also edit the libc sparc platform
+	 * specific mapfile and remove the NODYNSORT attribute
+	 * from atomic_and_16_nv.
+	 */
+	ENTRY(atomic_and_16)
+	ALTENTRY(atomic_and_16_nv)
+	ALTENTRY(atomic_and_ushort)
+	ALTENTRY(atomic_and_ushort_nv)
+	and	%o0, 0x2, %o4		! %o4 = byte offset, left-to-right
+	xor	%o4, 0x2, %g1		! %g1 = byte offset, right-to-left
+	sll	%o4, 3, %o4		! %o4 = bit offset, left-to-right
+	sll	%g1, 3, %g1		! %g1 = bit offset, right-to-left
+	sethi	%hi(0xffff0000), %o3	! %o3 = mask
+	srl	%o3, %o4, %o3		! %o3 = shifted to bit offset
+	sll	%o1, %g1, %o1		! %o1 = shifted to bit offset
+	orn	%o1, %o3, %o1		! all ones in the other half
+	andn	%o0, 0x2, %o0		! %o0 = word address
+	! if low-order bit is 1, we will properly get an alignment fault here
+	ld	[%o0], %o2		! read old value
+1:
+	and	%o2, %o1, %o5		! and in the new value
+	cas	[%o0], %o2, %o5
+	cmp	%o2, %o5
+	bne,a,pn %icc, 1b
+	  mov	%o5, %o2		! %o2 = old value
+	and	%o2, %o1, %o5
+	and	%o5, %o3, %o5
+	retl
+	srl	%o5, %g1, %o0		! %o0 = new value
+	SET_SIZE(atomic_and_ushort_nv)
+	SET_SIZE(atomic_and_ushort)
+	SET_SIZE(atomic_and_16_nv)
+	SET_SIZE(atomic_and_16)
+
+	/*
+	 * NOTE: If atomic_and_32 and atomic_and_32_nv are ever
+	 * separated, you need to also edit the libc sparc platform
+	 * specific mapfile and remove the NODYNSORT attribute
+	 * from atomic_and_32_nv.
+	 */
+	ENTRY(atomic_and_32)
+	ALTENTRY(atomic_and_32_nv)
+	ALTENTRY(atomic_and_uint)
+	ALTENTRY(atomic_and_uint_nv)
+	ALTENTRY(atomic_and_ulong)
+	ALTENTRY(atomic_and_ulong_nv)
+	ld	[%o0], %o2
+1:
+	and	%o2, %o1, %o3
+	cas	[%o0], %o2, %o3
+	cmp	%o2, %o3
+	bne,a,pn %icc, 1b
+	  mov	%o3, %o2
+	retl
+	and	%o2, %o1, %o0		! return new value
+	SET_SIZE(atomic_and_ulong_nv)
+	SET_SIZE(atomic_and_ulong)
+	SET_SIZE(atomic_and_uint_nv)
+	SET_SIZE(atomic_and_uint)
+	SET_SIZE(atomic_and_32_nv)
+	SET_SIZE(atomic_and_32)
+
+	/*
+	 * NOTE: If atomic_and_64 and atomic_and_64_nv are ever
+	 * separated, you need to also edit the libc sparc platform
+	 * specific mapfile and remove the NODYNSORT attribute
+	 * from atomic_and_64_nv.
+	 */
+	ENTRY(atomic_and_64)
+	ALTENTRY(atomic_and_64_nv)
+	sllx	%o1, 32, %o1		! upper 32 in %o1, lower in %o2
+	srl	%o2, 0, %o2
+	add	%o1, %o2, %o1		! convert 2 32-bit args into 1 64-bit
+	ldx	[%o0], %o2
+1:
+	and	%o2, %o1, %o3
+	casx	[%o0], %o2, %o3
+	cmp	%o2, %o3
+	bne,a,pn %xcc, 1b
+	  mov	%o3, %o2
+	and	%o2, %o1, %o1		! return lower 32-bits in %o1
+	retl
+	srlx	%o1, 32, %o0		! return upper 32-bits in %o0
+	SET_SIZE(atomic_and_64_nv)
+	SET_SIZE(atomic_and_64)
+
+	ENTRY(atomic_cas_8)
+	ALTENTRY(atomic_cas_uchar)
+	and	%o0, 0x3, %o4		! %o4 = byte offset, left-to-right
+	xor	%o4, 0x3, %g1		! %g1 = byte offset, right-to-left
+	sll	%g1, 3, %g1		! %g1 = bit offset, right-to-left
+	set	0xff, %o3		! %o3 = mask
+	sll	%o3, %g1, %o3		! %o3 = shifted to bit offset
+	sll	%o1, %g1, %o1		! %o1 = shifted to bit offset
+	and	%o1, %o3, %o1		! %o1 = single byte value
+	sll	%o2, %g1, %o2		! %o2 = shifted to bit offset
+	and	%o2, %o3, %o2		! %o2 = single byte value
+	andn	%o0, 0x3, %o0		! %o0 = word address
+	ld	[%o0], %o4		! read old value
+1:
+	andn	%o4, %o3, %o4		! clear target bits
+	or	%o4, %o2, %o5		! insert the new value
+	or	%o4, %o1, %o4		! insert the comparison value
+	cas	[%o0], %o4, %o5
+	cmp	%o4, %o5		! did we succeed?
+	be,pt	%icc, 2f
+	  and	%o5, %o3, %o4		! isolate the old value
+	cmp	%o1, %o4		! should we have succeeded?
+	be,a,pt	%icc, 1b		! yes, try again
+	  mov	%o5, %o4		! %o4 = old value
+2:
+	retl
+	srl	%o4, %g1, %o0		! %o0 = old value
+	SET_SIZE(atomic_cas_uchar)
+	SET_SIZE(atomic_cas_8)
+
+	ENTRY(atomic_cas_16)
+	ALTENTRY(atomic_cas_ushort)
+	and	%o0, 0x2, %o4		! %o4 = byte offset, left-to-right
+	xor	%o4, 0x2, %g1		! %g1 = byte offset, right-to-left
+	sll	%o4, 3, %o4		! %o4 = bit offset, left-to-right
+	sll	%g1, 3, %g1		! %g1 = bit offset, right-to-left
+	sethi	%hi(0xffff0000), %o3	! %o3 = mask
+	srl	%o3, %o4, %o3		! %o3 = shifted to bit offset
+	sll	%o1, %g1, %o1		! %o1 = shifted to bit offset
+	and	%o1, %o3, %o1		! %o1 = single short value
+	sll	%o2, %g1, %o2		! %o2 = shifted to bit offset
+	and	%o2, %o3, %o2		! %o2 = single short value
+	andn	%o0, 0x2, %o0		! %o0 = word address
+	! if low-order bit is 1, we will properly get an alignment fault here
+	ld	[%o0], %o4		! read old value
+1:
+	andn	%o4, %o3, %o4		! clear target bits
+	or	%o4, %o2, %o5		! insert the new value
+	or	%o4, %o1, %o4		! insert the comparison value
+	cas	[%o0], %o4, %o5
+	cmp	%o4, %o5		! did we succeed?
+	be,pt	%icc, 2f
+	  and	%o5, %o3, %o4		! isolate the old value
+	cmp	%o1, %o4		! should we have succeeded?
+	be,a,pt	%icc, 1b		! yes, try again
+	  mov	%o5, %o4		! %o4 = old value
+2:
+	retl
+	srl	%o4, %g1, %o0		! %o0 = old value
+	SET_SIZE(atomic_cas_ushort)
+	SET_SIZE(atomic_cas_16)
+
+	ENTRY(atomic_cas_32)
+	ALTENTRY(atomic_cas_uint)
+	ALTENTRY(atomic_cas_ptr)
+	ALTENTRY(atomic_cas_ulong)
+	cas	[%o0], %o1, %o2
+	retl
+	mov	%o2, %o0
+	SET_SIZE(atomic_cas_ulong)
+	SET_SIZE(atomic_cas_ptr)
+	SET_SIZE(atomic_cas_uint)
+	SET_SIZE(atomic_cas_32)
+
+	ENTRY(atomic_cas_64)
+	sllx	%o1, 32, %o1		! cmp's upper 32 in %o1, lower in %o2
+	srl	%o2, 0, %o2		! convert 2 32-bit args into 1 64-bit
+	add	%o1, %o2, %o1
+	sllx	%o3, 32, %o2		! newval upper 32 in %o3, lower in %o4
+	srl	%o4, 0, %o4		! setup %o2 to have newval
+	add	%o2, %o4, %o2
+	casx	[%o0], %o1, %o2
+	srl	%o2, 0, %o1		! return lower 32-bits in %o1
+	retl
+	srlx	%o2, 32, %o0		! return upper 32-bits in %o0
+	SET_SIZE(atomic_cas_64)
+
+	ENTRY(atomic_swap_8)
+	ALTENTRY(atomic_swap_uchar)
+	and	%o0, 0x3, %o4		! %o4 = byte offset, left-to-right
+	xor	%o4, 0x3, %g1		! %g1 = byte offset, right-to-left
+	sll	%g1, 3, %g1		! %g1 = bit offset, right-to-left
+	set	0xff, %o3		! %o3 = mask
+	sll	%o3, %g1, %o3		! %o3 = shifted to bit offset
+	sll	%o1, %g1, %o1		! %o1 = shifted to bit offset
+	and	%o1, %o3, %o1		! %o1 = single byte value
+	andn	%o0, 0x3, %o0		! %o0 = word address
+	ld	[%o0], %o2		! read old value
+1:
+	andn	%o2, %o3, %o5		! clear target bits
+	or	%o5, %o1, %o5		! insert the new value
+	cas	[%o0], %o2, %o5
+	cmp	%o2, %o5
+	bne,a,pn %icc, 1b
+	  mov	%o5, %o2		! %o2 = old value
+	and	%o5, %o3, %o5
+	retl
+	srl	%o5, %g1, %o0		! %o0 = old value
+	SET_SIZE(atomic_swap_uchar)
+	SET_SIZE(atomic_swap_8)
+
+	ENTRY(atomic_swap_16)
+	ALTENTRY(atomic_swap_ushort)
+	and	%o0, 0x2, %o4		! %o4 = byte offset, left-to-right
+	xor	%o4, 0x2, %g1		! %g1 = byte offset, right-to-left
+	sll	%o4, 3, %o4		! %o4 = bit offset, left-to-right
+	sll	%g1, 3, %g1		! %g1 = bit offset, right-to-left
+	sethi	%hi(0xffff0000), %o3	! %o3 = mask
+	srl	%o3, %o4, %o3		! %o3 = shifted to bit offset
+	sll	%o1, %g1, %o1		! %o1 = shifted to bit offset
+	and	%o1, %o3, %o1		! %o1 = single short value
+	andn	%o0, 0x2, %o0		! %o0 = word address
+	! if low-order bit is 1, we will properly get an alignment fault here
+	ld	[%o0], %o2		! read old value
+1:
+	andn	%o2, %o3, %o5		! clear target bits
+	or	%o5, %o1, %o5		! insert the new value
+	cas	[%o0], %o2, %o5
+	cmp	%o2, %o5
+	bne,a,pn %icc, 1b
+	  mov	%o5, %o2		! %o2 = old value
+	and	%o5, %o3, %o5
+	retl
+	srl	%o5, %g1, %o0		! %o0 = old value
+	SET_SIZE(atomic_swap_ushort)
+	SET_SIZE(atomic_swap_16)
+
+	ENTRY(atomic_swap_32)
+	ALTENTRY(atomic_swap_uint)
+	ALTENTRY(atomic_swap_ptr)
+	ALTENTRY(atomic_swap_ulong)
+	ld	[%o0], %o2
+1:
+	mov	%o1, %o3
+	cas	[%o0], %o2, %o3
+	cmp	%o2, %o3
+	bne,a,pn %icc, 1b
+	  mov	%o3, %o2
+	retl
+	mov	%o3, %o0
+	SET_SIZE(atomic_swap_ulong)
+	SET_SIZE(atomic_swap_ptr)
+	SET_SIZE(atomic_swap_uint)
+	SET_SIZE(atomic_swap_32)
+
+	ENTRY(atomic_swap_64)
+	sllx	%o1, 32, %o1		! upper 32 in %o1, lower in %o2
+	srl	%o2, 0, %o2
+	add	%o1, %o2, %o1		! convert 2 32-bit args into 1 64-bit
+	ldx	[%o0], %o2
+1:
+	mov	%o1, %o3
+	casx	[%o0], %o2, %o3
+	cmp	%o2, %o3
+	bne,a,pn %xcc, 1b
+	  mov	%o3, %o2
+	srl	%o3, 0, %o1		! return lower 32-bits in %o1
+	retl
+	srlx	%o3, 32, %o0		! return upper 32-bits in %o0
+	SET_SIZE(atomic_swap_64)
+
+	ENTRY(atomic_set_long_excl)
+	mov	1, %o3
+	slln	%o3, %o1, %o3
+	ldn	[%o0], %o2
+1:
+	andcc	%o2, %o3, %g0		! test if the bit is set
+	bnz,a,pn %ncc, 2f		! if so, then fail out
+	  mov	-1, %o0
+	or	%o2, %o3, %o4		! set the bit, and try to commit it
+	casn	[%o0], %o2, %o4
+	cmp	%o2, %o4
+	bne,a,pn %ncc, 1b		! failed to commit, try again
+	  mov	%o4, %o2
+	mov	%g0, %o0
+2:
+	retl
+	nop
+	SET_SIZE(atomic_set_long_excl)
+
+	ENTRY(atomic_clear_long_excl)
+	mov	1, %o3
+	slln	%o3, %o1, %o3
+	ldn	[%o0], %o2
+1:
+	andncc	%o3, %o2, %g0		! test if the bit is clear
+	bnz,a,pn %ncc, 2f		! if so, then fail out
+	  mov	-1, %o0
+	andn	%o2, %o3, %o4		! clear the bit, and try to commit it
+	casn	[%o0], %o2, %o4
+	cmp	%o2, %o4
+	bne,a,pn %ncc, 1b		! failed to commit, try again
+	  mov	%o4, %o2
+	mov	%g0, %o0
+2:
+	retl
+	nop
+	SET_SIZE(atomic_clear_long_excl)
+
+#if !defined(_KERNEL)
+
+	/*
+	 * Spitfires and Blackbirds have a problem with membars in the
+	 * delay slot (SF_ERRATA_51).  For safety's sake, we assume
+	 * that the whole world needs the workaround.
+	 */
+	ENTRY(membar_enter)
+	membar	#StoreLoad|#StoreStore
+	retl
+	nop
+	SET_SIZE(membar_enter)
+
+	ENTRY(membar_exit)
+	membar	#LoadStore|#StoreStore
+	retl
+	nop
+	SET_SIZE(membar_exit)
+
+	ENTRY(membar_producer)
+	membar	#StoreStore
+	retl
+	nop
+	SET_SIZE(membar_producer)
+
+	ENTRY(membar_consumer)
+	membar	#LoadLoad
+	retl
+	nop
+	SET_SIZE(membar_consumer)
+
+#endif	/* !_KERNEL */
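Reviewer note: the sub-word routines above (atomic_cas_8, atomic_swap_8 and the 16-bit variants) all emulate a byte or halfword operation with a word-wide cas on the enclosing aligned word. The following is a minimal C sketch of the atomic_cas_8 retry logic, assuming C11 <stdatomic.h> in place of the SPARC cas instruction. The name sketch_cas_8 and the be_index parameter (byte position counted from the most significant byte, as on big-endian SPARC) are hypothetical and not part of the imported code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical helper: compare-and-swap one byte of a 32-bit word using
 * only a word-wide CAS.  Returns the byte value found in memory.
 */
static uint8_t
sketch_cas_8(_Atomic uint32_t *word, unsigned be_index, uint8_t cmp,
    uint8_t newval)
{
	unsigned shift = (be_index ^ 0x3u) << 3;	/* bit offset, right-to-left */
	uint32_t mask = (uint32_t)0xff << shift;
	uint32_t c = (uint32_t)cmp << shift;
	uint32_t n = (uint32_t)newval << shift;
	uint32_t old = atomic_load(word);

	for (;;) {
		uint32_t rest = old & ~mask;		/* clear target bits */
		uint32_t expected = rest | c;		/* insert the comparison value */

		if (atomic_compare_exchange_strong(word, &expected, rest | n))
			return (cmp);			/* swapped: old byte was cmp */
		/* expected now holds the word actually found in memory */
		if ((expected & mask) != c)
			return ((uint8_t)((expected & mask) >> shift));
		old = expected;		/* only the neighbouring bytes changed; retry */
	}
}
```

As in the assembly, a failed word-wide CAS is retried only when the target byte still equals the comparison value, that is, when the failure was caused by a concurrent change to a neighbouring byte.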
--- a/common/avl/avl.c
+++ b/common/avl/avl.c
--- a/common/list/list.c
+++ b/common/list/list.c
@ -0,0 +1,251 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2003, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+/*
+ * Generic doubly-linked list implementation
+ */
+
+#include <sys/list.h>
+#include <sys/list_impl.h>
+#include <sys/types.h>
+#include <sys/sysmacros.h>
+#ifdef _KERNEL
+#include <sys/debug.h>
+#else
+#include <assert.h>
+#define	ASSERT(a)	assert(a)
+#endif
+
+#ifdef lint
+extern list_node_t *list_d2l(list_t *list, void *obj);
+#else
+#define	list_d2l(a, obj) ((list_node_t *)(((char *)obj) + (a)->list_offset))
+#endif
+#define	list_object(a, node) ((void *)(((char *)node) - (a)->list_offset))
+#define	list_empty(a) ((a)->list_head.list_next == &(a)->list_head)
+
+#define	list_insert_after_node(list, node, object) {	\
+	list_node_t *lnew = list_d2l(list, object);	\
+	lnew->list_prev = (node);			\
+	lnew->list_next = (node)->list_next;		\
+	(node)->list_next->list_prev = lnew;		\
+	(node)->list_next = lnew;			\
+}
+
+#define	list_insert_before_node(list, node, object) {	\
+	list_node_t *lnew = list_d2l(list, object);	\
+	lnew->list_next = (node);			\
+	lnew->list_prev = (node)->list_prev;		\
+	(node)->list_prev->list_next = lnew;		\
+	(node)->list_prev = lnew;			\
+}
+
+#define	list_remove_node(node)					\
+	(node)->list_prev->list_next = (node)->list_next;	\
+	(node)->list_next->list_prev = (node)->list_prev;	\
+	(node)->list_next = (node)->list_prev = NULL
+
+void
+list_create(list_t *list, size_t size, size_t offset)
+{
+	ASSERT(list);
+	ASSERT(size > 0);
+	ASSERT(size >= offset + sizeof (list_node_t));
+
+	list->list_size = size;
+	list->list_offset = offset;
+	list->list_head.list_next = list->list_head.list_prev =
+	    &list->list_head;
+}
+
+void
+list_destroy(list_t *list)
+{
+	list_node_t *node = &list->list_head;
+
+	ASSERT(list);
+	ASSERT(list->list_head.list_next == node);
+	ASSERT(list->list_head.list_prev == node);
+
+	node->list_next = node->list_prev = NULL;
+}
+
+void
+list_insert_after(list_t *list, void *object, void *nobject)
+{
+	if (object == NULL) {
+		list_insert_head(list, nobject);
+	} else {
+		list_node_t *lold = list_d2l(list, object);
+		list_insert_after_node(list, lold, nobject);
+	}
+}
+
+void
+list_insert_before(list_t *list, void *object, void *nobject)
+{
+	if (object == NULL) {
+		list_insert_tail(list, nobject);
+	} else {
+		list_node_t *lold = list_d2l(list, object);
+		list_insert_before_node(list, lold, nobject);
+	}
+}
+
+void
+list_insert_head(list_t *list, void *object)
+{
+	list_node_t *lold = &list->list_head;
+	list_insert_after_node(list, lold, object);
+}
+
+void
+list_insert_tail(list_t *list, void *object)
+{
+	list_node_t *lold = &list->list_head;
+	list_insert_before_node(list, lold, object);
+}
+
+void
+list_remove(list_t *list, void *object)
+{
+	list_node_t *lold = list_d2l(list, object);
+	ASSERT(!list_empty(list));
+	ASSERT(lold->list_next != NULL);
+	list_remove_node(lold);
+}
+
+void *
+list_remove_head(list_t *list)
+{
+	list_node_t *head = list->list_head.list_next;
+	if (head == &list->list_head)
+		return (NULL);
+	list_remove_node(head);
+	return (list_object(list, head));
+}
+
+void *
+list_remove_tail(list_t *list)
+{
+	list_node_t *tail = list->list_head.list_prev;
+	if (tail == &list->list_head)
+		return (NULL);
+	list_remove_node(tail);
+	return (list_object(list, tail));
+}
+
+void *
+list_head(list_t *list)
+{
+	if (list_empty(list))
+		return (NULL);
+	return (list_object(list, list->list_head.list_next));
+}
+
+void *
+list_tail(list_t *list)
+{
+	if (list_empty(list))
+		return (NULL);
+	return (list_object(list, list->list_head.list_prev));
+}
+
+void *
+list_next(list_t *list, void *object)
+{
+	list_node_t *node = list_d2l(list, object);
+
+	if (node->list_next != &list->list_head)
+		return (list_object(list, node->list_next));
+
+	return (NULL);
+}
+
+void *
+list_prev(list_t *list, void *object)
+{
+	list_node_t *node = list_d2l(list, object);
+
+	if (node->list_prev != &list->list_head)
+		return (list_object(list, node->list_prev));
+
+	return (NULL);
+}
+
+/*
+ * Append the src list to the tail of the dst list, leaving src empty.
+ */
+void
+list_move_tail(list_t *dst, list_t *src)
+{
+	list_node_t *dstnode = &dst->list_head;
+	list_node_t *srcnode = &src->list_head;
+
+	ASSERT(dst->list_size == src->list_size);
+	ASSERT(dst->list_offset == src->list_offset);
+
+	if (list_empty(src))
+		return;
+
+	dstnode->list_prev->list_next = srcnode->list_next;
+	srcnode->list_next->list_prev = dstnode->list_prev;
+	dstnode->list_prev = srcnode->list_prev;
+	srcnode->list_prev->list_next = dstnode;
+
+	/* empty src list */
+	srcnode->list_next = srcnode->list_prev = srcnode;
+}
+
+void
+list_link_replace(list_node_t *lold, list_node_t *lnew)
+{
+	ASSERT(list_link_active(lold));
+	ASSERT(!list_link_active(lnew));
+
+	lnew->list_next = lold->list_next;
+	lnew->list_prev = lold->list_prev;
+	lold->list_prev->list_next = lnew;
+	lold->list_next->list_prev = lnew;
+	lold->list_next = lold->list_prev = NULL;
+}
+
+void
+list_link_init(list_node_t *link)
+{
+	link->list_next = NULL;
+	link->list_prev = NULL;
+}
+
+int
+list_link_active(list_node_t *link)
+{
+	return (link->list_next != NULL);
+}
+
+int
+list_is_empty(list_t *list)
+{
+	return (list_empty(list));
+}
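Reviewer note: list.c implements an intrusive list. The list_node_t lives inside the caller's object, and list_d2l()/list_object() convert between object and node pointers by adding or subtracting list_offset. The following is a minimal self-contained sketch of that idea. All sk_* names and the job_t type are hypothetical stand-ins, not the <sys/list.h> API.

```c
#include <assert.h>
#include <stddef.h>

typedef struct sk_node {
	struct sk_node *next;
	struct sk_node *prev;
} sk_node_t;

typedef struct sk_list {
	size_t offset;		/* offset of the embedded node in each object */
	sk_node_t head;		/* sentinel; points to itself when empty */
} sk_list_t;

/* Example payload with an embedded link (hypothetical). */
typedef struct job {
	int id;
	sk_node_t link;
} job_t;

static void
sk_create(sk_list_t *l, size_t offset)
{
	l->offset = offset;
	l->head.next = l->head.prev = &l->head;
}

/* object -> node, mirroring list_d2l() */
static sk_node_t *
sk_d2l(sk_list_t *l, void *obj)
{
	return ((sk_node_t *)((char *)obj + l->offset));
}

/* node -> object, mirroring list_object() */
static void *
sk_object(sk_list_t *l, sk_node_t *n)
{
	return ((char *)n - l->offset);
}

static void
sk_insert_tail(sk_list_t *l, void *obj)
{
	sk_node_t *n = sk_d2l(l, obj);

	n->next = &l->head;
	n->prev = l->head.prev;
	l->head.prev->next = n;
	l->head.prev = n;
}

static void *
sk_head(sk_list_t *l)
{
	if (l->head.next == &l->head)
		return (NULL);
	return (sk_object(l, l->head.next));
}

static void *
sk_next(sk_list_t *l, void *obj)
{
	sk_node_t *n = sk_d2l(l, obj);

	if (n->next == &l->head)
		return (NULL);
	return (sk_object(l, n->next));
}
```

A caller would initialise the list with sk_create(&l, offsetof(job_t, link)) and then walk it with sk_head()/sk_next(), much as list_create(), list_head() and list_next() are used with the real implementation.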
--- a/common/nvpair/nvpair.c
+++ b/common/nvpair/nvpair.c
--- a/common/nvpair/nvpair_alloc_fixed.c
+++ b/common/nvpair/nvpair_alloc_fixed.c
@ -0,0 +1,120 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+
+/*
+ * Copyright 2006 Sun Microsystems, Inc.  All rights reserved.
+ * Use is subject to license terms.
+ */
+
+#pragma ident	"%Z%%M%	%I%	%E% SMI"
+
+#include <sys/stropts.h>
+#include <sys/isa_defs.h>
+#include <sys/nvpair.h>
+#include <sys/sysmacros.h>
+#if defined(_KERNEL) && !defined(_BOOT)
+#include <sys/varargs.h>
+#else
+#include <stdarg.h>
+#include <strings.h>
+#endif
+
+/*
+ * This allocator is very simple.
+ *  - it uses a pre-allocated buffer for memory allocations.
+ *  - it does _not_ free memory in the pre-allocated buffer.
+ *
+ * The reason for the selected implementation is simplicity.
+ * This allocator is designed for use in interrupt context, where
+ * the caller may not wait for free memory.
+ */
+
+/* pre-allocated buffer for memory allocations */
+typedef struct nvbuf {
+	uintptr_t	nvb_buf;	/* address of pre-allocated buffer */
+	uintptr_t 	nvb_lim;	/* limit address in the buffer */
+	uintptr_t	nvb_cur;	/* current address in the buffer */
+} nvbuf_t;
+
+/*
+ * Initialize the pre-allocated buffer allocator. The caller needs to supply
+ *
+ *   buf	address of pre-allocated buffer
+ *   bufsz	size of pre-allocated buffer
+ *
+ * nv_fixed_init() calculates the remaining members of nvbuf_t.
+ */
+static int
+nv_fixed_init(nv_alloc_t *nva, va_list valist)
+{
+	uintptr_t base = va_arg(valist, uintptr_t);
+	uintptr_t lim = base + va_arg(valist, size_t);
+	nvbuf_t *nvb = (nvbuf_t *)P2ROUNDUP(base, sizeof (uintptr_t));
+
+	if (base == 0 || (uintptr_t)&nvb[1] > lim)
+		return (EINVAL);
+
+	nvb->nvb_buf = (uintptr_t)&nvb[0];
+	nvb->nvb_cur = (uintptr_t)&nvb[1];
+	nvb->nvb_lim = lim;
+	nva->nva_arg = nvb;
+
+	return (0);
+}
+
+static void *
+nv_fixed_alloc(nv_alloc_t *nva, size_t size)
+{
+	nvbuf_t *nvb = nva->nva_arg;
+	uintptr_t new = nvb->nvb_cur;
+
+	if (size == 0 || new + size > nvb->nvb_lim)
+		return (NULL);
+
+	nvb->nvb_cur = P2ROUNDUP(new + size, sizeof (uintptr_t));
+
+	return ((void *)new);
+}
+
+/*ARGSUSED*/
+static void
+nv_fixed_free(nv_alloc_t *nva, void *buf, size_t size)
+{
+	/* don't free memory in the pre-allocated buffer */
+}
+
+static void
+nv_fixed_reset(nv_alloc_t *nva)
+{
+	nvbuf_t *nvb = nva->nva_arg;
+
+	nvb->nvb_cur = (uintptr_t)&nvb[1];
+}
+
+const nv_alloc_ops_t nv_fixed_ops_def = {
+	nv_fixed_init,	/* nv_ao_init() */
+	NULL,		/* nv_ao_fini() */
+	nv_fixed_alloc,	/* nv_ao_alloc() */
+	nv_fixed_free,	/* nv_ao_free() */
+	nv_fixed_reset	/* nv_ao_reset() */
+};
+
+const nv_alloc_ops_t *nv_fixed_ops = &nv_fixed_ops_def;
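Reviewer note: nv_fixed_alloc() is a bump allocator. It hands out the current address, advances it by the pointer-aligned request size, never frees, and can only be reset as a whole. The following is a minimal sketch of the same scheme. The sk_* names are hypothetical, and unlike nvbuf_t, which is stored at the start of the caller's buffer, this sketch keeps its bookkeeping in a separate struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round x up to a multiple of a (a must be a power of two). */
#define	SK_ROUNDUP(x, a) \
	(((uintptr_t)(x) + ((uintptr_t)(a) - 1)) & ~((uintptr_t)(a) - 1))

typedef struct sk_bump {
	uintptr_t base;		/* first usable (aligned) address */
	uintptr_t cur;		/* next address to hand out */
	uintptr_t lim;		/* one past the end of the buffer */
} sk_bump_t;

static void
sk_bump_init(sk_bump_t *b, void *buf, size_t bufsz)
{
	b->base = SK_ROUNDUP(buf, sizeof (uintptr_t));
	b->cur = b->base;
	b->lim = (uintptr_t)buf + bufsz;
}

static void *
sk_bump_alloc(sk_bump_t *b, size_t size)
{
	uintptr_t p = b->cur;

	if (size == 0 || p + size > b->lim)
		return (NULL);		/* out of space; never blocks */

	b->cur = SK_ROUNDUP(p + size, sizeof (uintptr_t));
	return ((void *)p);
}

/* Individual frees are no-ops; the whole buffer is recycled at once. */
static void
sk_bump_reset(sk_bump_t *b)
{
	b->cur = b->base;
}
```

Because allocation is a bounds check plus an addition, it cannot sleep, which fits the interrupt-context use case described in the file's block comment.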
--- a/common/unicode/u8_textprep.c
+++ b/common/unicode/u8_textprep.c
--- a/common/zfs/zfs_comutil.c
+++ b/common/zfs/zfs_comutil.c
@ -0,0 +1,202 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2008, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+/*
+ * This file is intended for functions that ought to be common between user
+ * land (libzfs) and the kernel. When many common routines need to be shared
+ * then a separate file should be created.
+ */
+
+#if defined(_KERNEL)
+#include <sys/systm.h>
+#else
+#include <string.h>
+#endif
+
+#include <sys/types.h>
+#include <sys/fs/zfs.h>
+#include <sys/int_limits.h>
+#include <sys/nvpair.h>
+#include "zfs_comutil.h"
+
+/*
+ * Are there allocatable vdevs?
+ */
+boolean_t
+zfs_allocatable_devs(nvlist_t *nv)
+{
+	uint64_t is_log;
+	uint_t c;
+	nvlist_t **child;
+	uint_t children;
+
+	if (nvlist_lookup_nvlist_array(nv, ZPOOL_CONFIG_CHILDREN,
+	    &child, &children) != 0) {
+		return (B_FALSE);
+	}
+	for (c = 0; c < children; c++) {
+		is_log = 0;
+		(void) nvlist_lookup_uint64(child[c], ZPOOL_CONFIG_IS_LOG,
+		    &is_log);
+		if (!is_log)
+			return (B_TRUE);
+	}
+	return (B_FALSE);
+}
+
+void
+zpool_get_rewind_policy(nvlist_t *nvl, zpool_rewind_policy_t *zrpp)
+{
+	nvlist_t *policy;
+	nvpair_t *elem;
+	char *nm;
+
+	/* Defaults */
+	zrpp->zrp_request = ZPOOL_NO_REWIND;
+	zrpp->zrp_maxmeta = 0;
+	zrpp->zrp_maxdata = UINT64_MAX;
+	zrpp->zrp_txg = UINT64_MAX;
+
+	if (nvl == NULL)
+		return;
+
+	elem = NULL;
+	while ((elem = nvlist_next_nvpair(nvl, elem)) != NULL) {
+		nm = nvpair_name(elem);
+		if (strcmp(nm, ZPOOL_REWIND_POLICY) == 0) {
+			if (nvpair_value_nvlist(elem, &policy) == 0)
+				zpool_get_rewind_policy(policy, zrpp);
+			return;
+		} else if (strcmp(nm, ZPOOL_REWIND_REQUEST) == 0) {
+			if (nvpair_value_uint32(elem, &zrpp->zrp_request) == 0)
+				if (zrpp->zrp_request & ~ZPOOL_REWIND_POLICIES)
+					zrpp->zrp_request = ZPOOL_NO_REWIND;
+		} else if (strcmp(nm, ZPOOL_REWIND_REQUEST_TXG) == 0) {
+			(void) nvpair_value_uint64(elem, &zrpp->zrp_txg);
+		} else if (strcmp(nm, ZPOOL_REWIND_META_THRESH) == 0) {
+			(void) nvpair_value_uint64(elem, &zrpp->zrp_maxmeta);
+		} else if (strcmp(nm, ZPOOL_REWIND_DATA_THRESH) == 0) {
+			(void) nvpair_value_uint64(elem, &zrpp->zrp_maxdata);
+		}
+	}
+	if (zrpp->zrp_request == 0)
+		zrpp->zrp_request = ZPOOL_NO_REWIND;
+}
+
+typedef struct zfs_version_spa_map {
+	int	version_zpl;
+	int	version_spa;
+} zfs_version_spa_map_t;
+
+/*
+ * Keep this table in monotonically increasing version number order.
+ */
+static zfs_version_spa_map_t zfs_version_table[] = {
+	{ZPL_VERSION_INITIAL, SPA_VERSION_INITIAL},
+	{ZPL_VERSION_DIRENT_TYPE, SPA_VERSION_INITIAL},
+	{ZPL_VERSION_FUID, SPA_VERSION_FUID},
+	{ZPL_VERSION_USERSPACE, SPA_VERSION_USERSPACE},
+	{ZPL_VERSION_SA, SPA_VERSION_SA},
+	{0, 0}
+};
+
+/*
+ * Return the max zpl version for a corresponding spa version.
+ * -1 is returned if no mapping exists.
+ */
+int
+zfs_zpl_version_map(int spa_version)
+{
+	int i;
+	int version = -1;
+
+	for (i = 0; zfs_version_table[i].version_spa; i++) {
+		if (spa_version >= zfs_version_table[i].version_spa)
+			version = zfs_version_table[i].version_zpl;
+	}
+
+	return (version);
+}
+
+/*
+ * Return the min spa version for a corresponding zpl version.
+ * -1 is returned if no mapping exists.
+ */
+int
+zfs_spa_version_map(int zpl_version)
+{
+	int i;
+	int version = -1;
+
+	for (i = 0; zfs_version_table[i].version_zpl; i++) {
+		if (zfs_version_table[i].version_zpl >= zpl_version)
+			return (zfs_version_table[i].version_spa);
+	}
+
+	return (version);
+}
+
+const char *zfs_history_event_names[LOG_END] = {
+	"invalid event",
+	"pool create",
+	"vdev add",
+	"pool remove",
+	"pool destroy",
+	"pool export",
+	"pool import",
+	"vdev attach",
+	"vdev replace",
+	"vdev detach",
+	"vdev online",
+	"vdev offline",
+	"vdev upgrade",
+	"pool clear",
+	"pool scrub",
+	"pool property set",
+	"create",
+	"clone",
+	"destroy",
+	"destroy_begin_sync",
+	"inherit",
+	"property set",
+	"quota set",
+	"permission update",
+	"permission remove",
+	"permission who remove",
+	"promote",
+	"receive",
+	"rename",
+	"reservation set",
+	"replay_inc_sync",
+	"replay_full_sync",
+	"rollback",
+	"snapshot",
+	"filesystem version upgrade",
+	"refquota set",
+	"refreservation set",
+	"pool scrub done",
+	"user hold",
+	"user release",
+	"pool split",
+};
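Reviewer note: the two lookups over zfs_version_table differ in direction. zfs_zpl_version_map() keeps the last row whose SPA version does not exceed the argument, while zfs_spa_version_map() returns the first row whose ZPL version is at least the argument. The following sketch reproduces both loops over a stand-in table. The numeric values are illustrative placeholders rather than the real ZPL_VERSION_* and SPA_VERSION_* constants, and the sk_* names are hypothetical.

```c
#include <assert.h>

typedef struct sk_map {
	int zpl;
	int spa;
} sk_map_t;

/* Placeholder values; must stay in increasing order, {0, 0} terminates. */
static const sk_map_t sk_table[] = {
	{ 1, 1 },
	{ 2, 1 },
	{ 3, 9 },
	{ 4, 15 },
	{ 5, 24 },
	{ 0, 0 }
};

/* Highest zpl version usable with the given spa version, or -1. */
static int
sk_zpl_for_spa(int spa)
{
	int i;
	int version = -1;

	for (i = 0; sk_table[i].spa; i++) {
		if (spa >= sk_table[i].spa)
			version = sk_table[i].zpl;
	}
	return (version);
}

/* Lowest spa version that supports the given zpl version, or -1. */
static int
sk_spa_for_zpl(int zpl)
{
	int i;

	for (i = 0; sk_table[i].zpl; i++) {
		if (sk_table[i].zpl >= zpl)
			return (sk_table[i].spa);
	}
	return (-1);
}
```

Both loops rely on the table being kept in increasing order, which is why the comment above zfs_version_table asks for monotonically increasing version numbers.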
--- a/common/zfs/zfs_comutil.h
+++ b/common/zfs/zfs_comutil.h
@ -0,0 +1,46 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2008, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#ifndef	_ZFS_COMUTIL_H
+#define	_ZFS_COMUTIL_H
+
+#include <sys/fs/zfs.h>
+#include <sys/types.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+extern boolean_t zfs_allocatable_devs(nvlist_t *);
+extern void zpool_get_rewind_policy(nvlist_t *, zpool_rewind_policy_t *);
+
+extern int zfs_zpl_version_map(int spa_version);
+extern int zfs_spa_version_map(int zpl_version);
+extern const char *zfs_history_event_names[LOG_END];
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif	/* _ZFS_COMUTIL_H */
--- a/common/zfs/zfs_deleg.c
+++ b/common/zfs/zfs_deleg.c
@ -0,0 +1,237 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2007, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#if defined(_KERNEL)
+#include <sys/systm.h>
+#include <sys/sunddi.h>
+#include <sys/ctype.h>
+#else
+#include <stdio.h>
+#include <unistd.h>
+#include <strings.h>
+#include <libnvpair.h>
+#include <ctype.h>
+#endif
+/* XXX includes zfs_context.h, so why bother with the above? */
+#include <sys/dsl_deleg.h>
+#include "zfs_prop.h"
+#include "zfs_deleg.h"
+#include "zfs_namecheck.h"
+
+/*
+ * permission table
+ *
+ * Keep this table in sorted order
+ *
+ * This table is used for displaying all permissions for
+ * zfs allow
+ */
+
+zfs_deleg_perm_tab_t zfs_deleg_perm_tab[] = {
+	{ZFS_DELEG_PERM_ALLOW, ZFS_DELEG_NOTE_ALLOW},
+	{ZFS_DELEG_PERM_CLONE, ZFS_DELEG_NOTE_CLONE },
+	{ZFS_DELEG_PERM_CREATE, ZFS_DELEG_NOTE_CREATE },
+	{ZFS_DELEG_PERM_DESTROY, ZFS_DELEG_NOTE_DESTROY },
+	{ZFS_DELEG_PERM_MOUNT, ZFS_DELEG_NOTE_MOUNT },
+	{ZFS_DELEG_PERM_PROMOTE, ZFS_DELEG_NOTE_PROMOTE },
+	{ZFS_DELEG_PERM_RECEIVE, ZFS_DELEG_NOTE_RECEIVE },
+	{ZFS_DELEG_PERM_RENAME, ZFS_DELEG_NOTE_RENAME },
+	{ZFS_DELEG_PERM_ROLLBACK, ZFS_DELEG_NOTE_ROLLBACK },
+	{ZFS_DELEG_PERM_SNAPSHOT, ZFS_DELEG_NOTE_SNAPSHOT },
+	{ZFS_DELEG_PERM_SHARE, ZFS_DELEG_NOTE_SHARE },
+	{ZFS_DELEG_PERM_SEND, ZFS_DELEG_NOTE_NONE },
+	{ZFS_DELEG_PERM_USERPROP, ZFS_DELEG_NOTE_USERPROP },
+	{ZFS_DELEG_PERM_USERQUOTA, ZFS_DELEG_NOTE_USERQUOTA },
+	{ZFS_DELEG_PERM_GROUPQUOTA, ZFS_DELEG_NOTE_GROUPQUOTA },
+	{ZFS_DELEG_PERM_USERUSED, ZFS_DELEG_NOTE_USERUSED },
+	{ZFS_DELEG_PERM_GROUPUSED, ZFS_DELEG_NOTE_GROUPUSED },
+	{ZFS_DELEG_PERM_HOLD, ZFS_DELEG_NOTE_HOLD },
+	{ZFS_DELEG_PERM_RELEASE, ZFS_DELEG_NOTE_RELEASE },
+	{ZFS_DELEG_PERM_DIFF, ZFS_DELEG_NOTE_DIFF},
+	{NULL, ZFS_DELEG_NOTE_NONE }
+};
+
+static int
+zfs_valid_permission_name(const char *perm)
+{
+	if (zfs_deleg_canonicalize_perm(perm))
+		return (0);
+
+	return (permset_namecheck(perm, NULL, NULL));
+}
+
+const char *
+zfs_deleg_canonicalize_perm(const char *perm)
+{
+	int i;
+	zfs_prop_t prop;
+
+	for (i = 0; zfs_deleg_perm_tab[i].z_perm != NULL; i++) {
+		if (strcmp(perm, zfs_deleg_perm_tab[i].z_perm) == 0)
+			return (perm);
+	}
+
+	prop = zfs_name_to_prop(perm);
+	if (prop != ZPROP_INVAL && zfs_prop_delegatable(prop))
+		return (zfs_prop_to_name(prop));
+	return (NULL);
+
+}
+
+static int
+zfs_validate_who(char *who)
+{
+	char *p;
+
+	if (who[2] != ZFS_DELEG_FIELD_SEP_CHR)
+		return (-1);
+
+	switch (who[0]) {
+	case ZFS_DELEG_USER:
+	case ZFS_DELEG_GROUP:
+	case ZFS_DELEG_USER_SETS:
+	case ZFS_DELEG_GROUP_SETS:
+		if (who[1] != ZFS_DELEG_LOCAL && who[1] != ZFS_DELEG_DESCENDENT)
+			return (-1);
+		for (p = &who[3]; *p; p++)
+			if (!isdigit(*p))
+				return (-1);
+		break;
+
+	case ZFS_DELEG_NAMED_SET:
+	case ZFS_DELEG_NAMED_SET_SETS:
+		if (who[1] != ZFS_DELEG_NA)
+			return (-1);
+		return (permset_namecheck(&who[3], NULL, NULL));
+
+	case ZFS_DELEG_CREATE:
+	case ZFS_DELEG_CREATE_SETS:
+		if (who[1] != ZFS_DELEG_NA)
+			return (-1);
+		if (who[3] != '\0')
+			return (-1);
+		break;
+
+	case ZFS_DELEG_EVERYONE:
+	case ZFS_DELEG_EVERYONE_SETS:
+		if (who[1] != ZFS_DELEG_LOCAL && who[1] != ZFS_DELEG_DESCENDENT)
+			return (-1);
+		if (who[3] != '\0')
+			return (-1);
+		break;
+
+	default:
+		return (-1);
+	}
+
+	return (0);
+}
+
+int
+zfs_deleg_verify_nvlist(nvlist_t *nvp)
+{
+	nvpair_t *who, *perm_name;
+	nvlist_t *perms;
+	int error;
+
+	if (nvp == NULL)
+		return (-1);
+
+	who = nvlist_next_nvpair(nvp, NULL);
+	if (who == NULL)
+		return (-1);
+
+	do {
+		if (zfs_validate_who(nvpair_name(who)))
+			return (-1);
+
+		error = nvlist_lookup_nvlist(nvp, nvpair_name(who), &perms);
+
+		if (error && error != ENOENT)
+			return (-1);
+		if (error == ENOENT)
+			continue;
+
+		perm_name = nvlist_next_nvpair(perms, NULL);
+		if (perm_name == NULL) {
+			return (-1);
+		}
+		do {
+			error = zfs_valid_permission_name(
+			    nvpair_name(perm_name));
+			if (error)
+				return (-1);
+		} while (perm_name = nvlist_next_nvpair(perms, perm_name));
+	} while (who = nvlist_next_nvpair(nvp, who));
+	return (0);
+}
+
+/*
+ * Construct the base attribute name.  The base attribute names
+ * are the "key" to locate the jump objects which contain the actual
+ * permissions.  The base attribute names are encoded based on
+ * type of entry and whether it is a local or descendent permission.
+ *
+ * Arguments:
+ * attr - attribute name return string, attribute is assumed to be
+ *        ZFS_MAX_DELEG_NAME long.
+ * type - type of entry to construct
+ * inheritchr - inheritance type (local, descendent, or NA for create and
+ *                               permission set definitions)
+ * data - is either a permission set name or a 64 bit uid/gid.
+ */
+void
+zfs_deleg_whokey(char *attr, zfs_deleg_who_type_t type,
+    char inheritchr, void *data)
+{
+	int len = ZFS_MAX_DELEG_NAME;
+	uint64_t *id = data;
+
+	switch (type) {
+	case ZFS_DELEG_USER:
+	case ZFS_DELEG_GROUP:
+	case ZFS_DELEG_USER_SETS:
+	case ZFS_DELEG_GROUP_SETS:
+		(void) snprintf(attr, len, "%c%c%c%lld", type, inheritchr,
+		    ZFS_DELEG_FIELD_SEP_CHR, (longlong_t)*id);
+		break;
+	case ZFS_DELEG_NAMED_SET_SETS:
+	case ZFS_DELEG_NAMED_SET:
+		(void) snprintf(attr, len, "%c-%c%s", type,
+		    ZFS_DELEG_FIELD_SEP_CHR, (char *)data);
+		break;
+	case ZFS_DELEG_CREATE:
+	case ZFS_DELEG_CREATE_SETS:
+		(void) snprintf(attr, len, "%c-%c", type,
+		    ZFS_DELEG_FIELD_SEP_CHR);
+		break;
+	case ZFS_DELEG_EVERYONE:
+	case ZFS_DELEG_EVERYONE_SETS:
+		(void) snprintf(attr, len, "%c%c%c", type, inheritchr,
+		    ZFS_DELEG_FIELD_SEP_CHR);
+		break;
+	default:
+		ASSERT(!"bad zfs_deleg_who_type_t");
+	}
+}
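The key format built by zfs_deleg_whokey() and checked by zfs_validate_who() can be illustrated with a minimal standalone sketch of the user case only. The separator and inheritance characters are copied from zfs_deleg.h in this commit. DEMO_WHO_USER is an assumed stand-in for ZFS_DELEG_USER, which comes from sys/fs/zfs.h and is not shown here, and the demo_* names are hypothetical, not part of the imported code.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdio.h>

/* Values copied from zfs_deleg.h in this commit. */
#define	DELEG_FIELD_SEP_CHR	'$'
#define	DELEG_LOCAL		'l'
#define	DELEG_DESCENDENT	'd'
#define	MAX_DELEG_NAME		128

/* Assumption: stand-in for ZFS_DELEG_USER, defined in sys/fs/zfs.h. */
#define	DEMO_WHO_USER		'u'

/* Build "<type><inherit>$<id>", like the USER case of zfs_deleg_whokey(). */
static void
demo_user_whokey(char *attr, char inheritchr, uint64_t id)
{
	assert(attr != NULL);
	(void) snprintf(attr, MAX_DELEG_NAME, "%c%c%c%lld", DEMO_WHO_USER,
	    inheritchr, DELEG_FIELD_SEP_CHR, (long long)id);
}

/* Mirror the USER checks of zfs_validate_who(): 0 if valid, -1 if not. */
static int
demo_validate_user_who(const char *who)
{
	const char *p;

	/* Check characters in order so short strings are never over-read. */
	if (who[0] != DEMO_WHO_USER)
		return (-1);
	if (who[1] != DELEG_LOCAL && who[1] != DELEG_DESCENDENT)
		return (-1);
	if (who[2] != DELEG_FIELD_SEP_CHR)
		return (-1);
	for (p = &who[3]; *p != '\0'; p++)
		if (!isdigit((unsigned char)*p))
			return (-1);
	return (0);
}
```

Unlike the imported zfs_validate_who(), this sketch tests who[0] and who[1] before who[2], so a key shorter than three characters is rejected without reading past its terminator.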
--- a/common/zfs/zfs_deleg.h
+++ b/common/zfs/zfs_deleg.h
@ -0,0 +1,85 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2007, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#ifndef	_ZFS_DELEG_H
+#define	_ZFS_DELEG_H
+
+#include <sys/fs/zfs.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+#define	ZFS_DELEG_SET_NAME_CHR		'@'		/* set name lead char */
+#define	ZFS_DELEG_FIELD_SEP_CHR		'$'		/* field separator */
+
+/*
+ * Max name length for a delegation attribute
+ */
+#define	ZFS_MAX_DELEG_NAME	128
+
+#define	ZFS_DELEG_LOCAL		'l'
+#define	ZFS_DELEG_DESCENDENT	'd'
+#define	ZFS_DELEG_NA		'-'
+
+typedef enum {
+	ZFS_DELEG_NOTE_CREATE,
+	ZFS_DELEG_NOTE_DESTROY,
+	ZFS_DELEG_NOTE_SNAPSHOT,
+	ZFS_DELEG_NOTE_ROLLBACK,
+	ZFS_DELEG_NOTE_CLONE,
+	ZFS_DELEG_NOTE_PROMOTE,
+	ZFS_DELEG_NOTE_RENAME,
+	ZFS_DELEG_NOTE_RECEIVE,
+	ZFS_DELEG_NOTE_ALLOW,
+	ZFS_DELEG_NOTE_USERPROP,
+	ZFS_DELEG_NOTE_MOUNT,
+	ZFS_DELEG_NOTE_SHARE,
+	ZFS_DELEG_NOTE_USERQUOTA,
+	ZFS_DELEG_NOTE_GROUPQUOTA,
+	ZFS_DELEG_NOTE_USERUSED,
+	ZFS_DELEG_NOTE_GROUPUSED,
+	ZFS_DELEG_NOTE_HOLD,
+	ZFS_DELEG_NOTE_RELEASE,
+	ZFS_DELEG_NOTE_DIFF,
+	ZFS_DELEG_NOTE_NONE
+} zfs_deleg_note_t;
+
+typedef struct zfs_deleg_perm_tab {
+	char *z_perm;
+	zfs_deleg_note_t z_note;
+} zfs_deleg_perm_tab_t;
+
+extern zfs_deleg_perm_tab_t zfs_deleg_perm_tab[];
+
+int zfs_deleg_verify_nvlist(nvlist_t *nvlist);
+void zfs_deleg_whokey(char *attr, zfs_deleg_who_type_t type,
+    char inheritchr, void *data);
+const char *zfs_deleg_canonicalize_perm(const char *perm);
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif	/* _ZFS_DELEG_H */
--- a/common/zfs/zfs_fletcher.c
+++ b/common/zfs/zfs_fletcher.c
@ -0,0 +1,246 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
+ * Use is subject to license terms.
+ */
+
+/*
+ * Fletcher Checksums
+ * ------------------
+ *
+ * ZFS's 2nd and 4th order Fletcher checksums are defined by the following
+ * recurrence relations:
+ *
+ *	a  = a    + f
+ *	 i    i-1    i-1
+ *
+ *	b  = b    + a
+ *	 i    i-1    i
+ *
+ *	c  = c    + b		(fletcher-4 only)
+ *	 i    i-1    i
+ *
+ *	d  = d    + c		(fletcher-4 only)
+ *	 i    i-1    i
+ *
+ * Where
+ *	a_0 = b_0 = c_0 = d_0 = 0
+ * and
+ *	f_0 .. f_(n-1) are the input data.
+ *
+ * Using standard techniques, these translate into the following series:
+ *
+ *	     __n_			     __n_
+ *	     \   |			     \   |
+ *	a  =  >     f			b  =  >     i * f
+ *	 n   /___|   n - i		 n   /___|	 n - i
+ *	     i = 1			     i = 1
+ *
+ *
+ *	     __n_			     __n_
+ *	     \   |  i*(i+1)		     \   |  i*(i+1)*(i+2)
+ *	c  =  >     ------- f		d  =  >     ------------- f
+ *	 n   /___|     2     n - i	 n   /___|	  6	   n - i
+ *	     i = 1			     i = 1
+ *
+ * For fletcher-2, the f_is are 64-bit, and [ab]_i are 64-bit accumulators.
+ * Since the additions are done mod (2^64), errors in the high bits may not
+ * be noticed.  For this reason, fletcher-2 is deprecated.
+ *
+ * For fletcher-4, the f_is are 32-bit, and [abcd]_i are 64-bit accumulators.
+ * A conservative estimate of how big the buffer can get before we overflow
+ * can be computed using f_i = 0xffffffff for all i:
+ *
+ * % bc
+ *  f=2^32-1;d=0; for (i = 1; d<2^64; i++) { d += f*i*(i+1)*(i+2)/6 }; (i-1)*4
+ * 2264
+ *  quit
+ * %
+ *
+ * So blocks of up to 2k will not overflow.  Our largest block size is
+ * 128k, which has 32k 4-byte words, so we can compute the largest possible
+ * accumulators, then divide by 2^64 to figure the max amount of overflow:
+ *
+ * % bc
+ *  a=b=c=d=0; f=2^32-1; for (i=1; i<=32*1024; i++) { a+=f; b+=a; c+=b; d+=c }
+ *  a/2^64;b/2^64;c/2^64;d/2^64
+ * 0
+ * 0
+ * 1365
+ * 11186858
+ *  quit
+ * %
+ *
+ * So a and b cannot overflow.  To make sure each bit of input has some
+ * effect on the contents of c and d, we can look at what the factors of
+ * the coefficients in the equations for c_n and d_n are.  The number of 2s
+ * in the factors determines the lowest set bit in the multiplier.  Running
+ * through the cases for n*(n+1)/2 reveals that the highest power of 2 is
+ * 2^14, and for n*(n+1)*(n+2)/6 it is 2^15.  So while some data may overflow
+ * the 64-bit accumulators, every bit of every f_i affects every accumulator,
+ * even for 128k blocks.
+ *
+ * If we wanted to make a stronger version of fletcher4 (fletcher4c?),
+ * we could do our calculations mod (2^32 - 1) by adding in the carries
+ * periodically, and store the number of carries in the top 32 bits.
+ *
+ * --------------------
+ * Checksum Performance
+ * --------------------
+ *
+ * There are two interesting components to checksum performance: cached and
+ * uncached performance.  With cached data, fletcher-2 is about four times
+ * faster than fletcher-4.  With uncached data, the performance difference is
+ * negligible, since the cost of a cache fill dominates the processing time.
+ * Even though fletcher-4 is slower than fletcher-2, it is still a pretty
+ * efficient pass over the data.
+ *
+ * In normal operation, the data which is being checksummed is in a buffer
+ * which has been filled either by:
+ *
+ *	1. a compression step, which will be mostly cached, or
+ *	2. a bcopy() or copyin(), which will be uncached (because the
+ *	   copy is cache-bypassing).
+ *
+ * For both cached and uncached data, both fletcher checksums are much faster
+ * than sha-256, and slower than 'off', which doesn't touch the data at all.
+ */
+
+#include <sys/types.h>
+#include <sys/sysmacros.h>
+#include <sys/byteorder.h>
+#include <sys/zio.h>
+#include <sys/spa.h>
+
+void
+fletcher_2_native(const void *buf, uint64_t size, zio_cksum_t *zcp)
+{
+	const uint64_t *ip = buf;
+	const uint64_t *ipend = ip + (size / sizeof (uint64_t));
+	uint64_t a0, b0, a1, b1;
+
+	for (a0 = b0 = a1 = b1 = 0; ip < ipend; ip += 2) {
+		a0 += ip[0];
+		a1 += ip[1];
+		b0 += a0;
+		b1 += a1;
+	}
+
+	ZIO_SET_CHECKSUM(zcp, a0, a1, b0, b1);
+}
+
+void
+fletcher_2_byteswap(const void *buf, uint64_t size, zio_cksum_t *zcp)
+{
+	const uint64_t *ip = buf;
+	const uint64_t *ipend = ip + (size / sizeof (uint64_t));
+	uint64_t a0, b0, a1, b1;
+
+	for (a0 = b0 = a1 = b1 = 0; ip < ipend; ip += 2) {
+		a0 += BSWAP_64(ip[0]);
+		a1 += BSWAP_64(ip[1]);
+		b0 += a0;
+		b1 += a1;
+	}
+
+	ZIO_SET_CHECKSUM(zcp, a0, a1, b0, b1);
+}
+
+void
+fletcher_4_native(const void *buf, uint64_t size, zio_cksum_t *zcp)
+{
+	const uint32_t *ip = buf;
+	const uint32_t *ipend = ip + (size / sizeof (uint32_t));
+	uint64_t a, b, c, d;
+
+	for (a = b = c = d = 0; ip < ipend; ip++) {
+		a += ip[0];
+		b += a;
+		c += b;
+		d += c;
+	}
+
+	ZIO_SET_CHECKSUM(zcp, a, b, c, d);
+}
+
+void
+fletcher_4_byteswap(const void *buf, uint64_t size, zio_cksum_t *zcp)
+{
+	const uint32_t *ip = buf;
+	const uint32_t *ipend = ip + (size / sizeof (uint32_t));
+	uint64_t a, b, c, d;
+
+	for (a = b = c = d = 0; ip < ipend; ip++) {
+		a += BSWAP_32(ip[0]);
+		b += a;
+		c += b;
+		d += c;
+	}
+
+	ZIO_SET_CHECKSUM(zcp, a, b, c, d);
+}
+
+void
+fletcher_4_incremental_native(const void *buf, uint64_t size,
+    zio_cksum_t *zcp)
+{
+	const uint32_t *ip = buf;
+	const uint32_t *ipend = ip + (size / sizeof (uint32_t));
+	uint64_t a, b, c, d;
+
+	a = zcp->zc_word[0];
+	b = zcp->zc_word[1];
+	c = zcp->zc_word[2];
+	d = zcp->zc_word[3];
+
+	for (; ip < ipend; ip++) {
+		a += ip[0];
+		b += a;
+		c += b;
+		d += c;
+	}
+
+	ZIO_SET_CHECKSUM(zcp, a, b, c, d);
+}
+
+void
+fletcher_4_incremental_byteswap(const void *buf, uint64_t size,
+    zio_cksum_t *zcp)
+{
+	const uint32_t *ip = buf;
+	const uint32_t *ipend = ip + (size / sizeof (uint32_t));
+	uint64_t a, b, c, d;
+
+	a = zcp->zc_word[0];
+	b = zcp->zc_word[1];
+	c = zcp->zc_word[2];
+	d = zcp->zc_word[3];
+
+	for (; ip < ipend; ip++) {
+		a += BSWAP_32(ip[0]);
+		b += a;
+		c += b;
+		d += c;
+	}
+
+	ZIO_SET_CHECKSUM(zcp, a, b, c, d);
+}
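The incremental variants above resume from the accumulator state stored in the checksum, so feeding a buffer in pieces should give the same result as one pass over the whole buffer. The following standalone sketch illustrates that property. demo_cksum_t is a hypothetical stand-in for zio_cksum_t, and the demo_* functions take a word count rather than a byte size, so this is not the imported API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for zio_cksum_t: four 64-bit words. */
typedef struct demo_cksum {
	uint64_t word[4];
} demo_cksum_t;

/* Continue a fletcher-4 sum from the state already held in *zcp. */
static void
demo_fletcher_4_incremental(const uint32_t *ip, size_t nwords,
    demo_cksum_t *zcp)
{
	uint64_t a = zcp->word[0], b = zcp->word[1];
	uint64_t c = zcp->word[2], d = zcp->word[3];
	size_t i;

	assert(ip != NULL || nwords == 0);
	for (i = 0; i < nwords; i++) {
		a += ip[i];
		b += a;
		c += b;
		d += c;
	}

	zcp->word[0] = a;
	zcp->word[1] = b;
	zcp->word[2] = c;
	zcp->word[3] = d;
}

/* One-shot checksum: start from the all-zero state, then accumulate. */
static void
demo_fletcher_4(const uint32_t *ip, size_t nwords, demo_cksum_t *zcp)
{
	zcp->word[0] = zcp->word[1] = zcp->word[2] = zcp->word[3] = 0;
	demo_fletcher_4_incremental(ip, nwords, zcp);
}
```

For the input words 1, 2, 3, 4 the accumulators end at a = 10, b = 20, c = 35, d = 56, which agrees with the closed-form series in the block comment (for example b = 1*4 + 2*3 + 3*2 + 4*1 = 20).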
--- a/common/zfs/zfs_fletcher.h
+++ b/common/zfs/zfs_fletcher.h
@ -0,0 +1,53 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
+ * Use is subject to license terms.
+ */
+
+#ifndef	_ZFS_FLETCHER_H
+#define	_ZFS_FLETCHER_H
+
+#include <sys/types.h>
+#include <sys/spa.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+/*
+ * fletcher checksum functions
+ */
+
+void fletcher_2_native(const void *, uint64_t, zio_cksum_t *);
+void fletcher_2_byteswap(const void *, uint64_t, zio_cksum_t *);
+void fletcher_4_native(const void *, uint64_t, zio_cksum_t *);
+void fletcher_4_byteswap(const void *, uint64_t, zio_cksum_t *);
+void fletcher_4_incremental_native(const void *, uint64_t,
+    zio_cksum_t *);
+void fletcher_4_incremental_byteswap(const void *, uint64_t,
+    zio_cksum_t *);
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif	/* _ZFS_FLETCHER_H */
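The header pairs each native routine with a byteswap routine. One way to read that pairing, sketched below under stated assumptions, is that the byteswap form applied to data stored in the opposite byte order yields the same checksum as the native form applied to the host-order data. demo_bswap32() is a hypothetical replacement for BSWAP_32, and demo_fletcher_4() writes to a plain array instead of a zio_cksum_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for BSWAP_32: reverse the four bytes of x. */
static uint32_t
demo_bswap32(uint32_t x)
{
	return ((x >> 24) | ((x >> 8) & 0x0000ff00U) |
	    ((x << 8) & 0x00ff0000U) | (x << 24));
}

/* Fletcher-4 over nwords 32-bit words; swap each word first if asked. */
static void
demo_fletcher_4(const uint32_t *ip, size_t nwords, int byteswap,
    uint64_t out[4])
{
	uint64_t a = 0, b = 0, c = 0, d = 0;
	size_t i;

	assert(out != NULL);
	for (i = 0; i < nwords; i++) {
		a += byteswap ? demo_bswap32(ip[i]) : ip[i];
		b += a;
		c += b;
		d += c;
	}

	out[0] = a;
	out[1] = b;
	out[2] = c;
	out[3] = d;
}
```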
--- a/common/zfs/zfs_namecheck.c
+++ b/common/zfs/zfs_namecheck.c
@ -0,0 +1,345 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
+ * Use is subject to license terms.
+ */
+
+/*
+ * Common name validation routines for ZFS.  These routines are shared by the
+ * userland code as well as the ioctl() layer to ensure that we don't
+ * inadvertently expose a hole through direct ioctl()s that never gets tested.
+ * In userland, however, we want significantly more information about _why_ the
+ * name is invalid.  In the kernel, we only care whether it's valid or not.
+ * Each routine therefore takes a 'namecheck_err_t' which describes exactly why
+ * the name failed to validate.
+ *
+ * Each function returns 0 on success, -1 on error.
+ */
+
+#if defined(_KERNEL)
+#include <sys/systm.h>
+#else
+#include <string.h>
+#endif
+
+#include <sys/param.h>
+#include <sys/nvpair.h>
+#include "zfs_namecheck.h"
+#include "zfs_deleg.h"
+
+static int
+valid_char(char c)
+{
+	return ((c >= 'a' && c <= 'z') ||
+	    (c >= 'A' && c <= 'Z') ||
+	    (c >= '0' && c <= '9') ||
+	    c == '-' || c == '_' || c == '.' || c == ':' || c == ' ');
+}
+
+/*
+ * Snapshot names must be made up of alphanumeric characters plus the following
+ * characters:
+ *
+ * 	[-_.: ]
+ */
+int
+snapshot_namecheck(const char *path, namecheck_err_t *why, char *what)
+{
+	const char *loc;
+
+	if (strlen(path) >= MAXNAMELEN) {
+		if (why)
+			*why = NAME_ERR_TOOLONG;
+		return (-1);
+	}
+
+	if (path[0] == '\0') {
+		if (why)
+			*why = NAME_ERR_EMPTY_COMPONENT;
+		return (-1);
+	}
+
+	for (loc = path; *loc; loc++) {
+		if (!valid_char(*loc)) {
+			if (why) {
+				*why = NAME_ERR_INVALCHAR;
+				*what = *loc;
+			}
+			return (-1);
+		}
+	}
+	return (0);
+}
+
+
+/*
+ * Permission set names must start with the character '@', followed by a
+ * name obeying the same character restrictions as snapshot names.  The whole
+ * name (including the '@') must be shorter than ZFS_PERMSET_MAXLEN (64).
+ */
+int
+permset_namecheck(const char *path, namecheck_err_t *why, char *what)
+{
+	if (strlen(path) >= ZFS_PERMSET_MAXLEN) {
+		if (why)
+			*why = NAME_ERR_TOOLONG;
+		return (-1);
+	}
+
+	if (path[0] != '@') {
+		if (why) {
+			*why = NAME_ERR_NO_AT;
+			*what = path[0];
+		}
+		return (-1);
+	}
+
+	return (snapshot_namecheck(&path[1], why, what));
+}
+
+/*
+ * Dataset names must be of the following form:
+ *
+ * 	[component][/]*[component][@component]
+ *
+ * Where each component is made up of alphanumeric characters plus the following
+ * characters:
+ *
+ * 	[-_.:%]
+ *
+ * We allow '%' here as we use that character internally to create unique
+ * names for temporary clones (for online recv).
+ */
+int
+dataset_namecheck(const char *path, namecheck_err_t *why, char *what)
+{
+	const char *loc, *end;
+	int found_snapshot;
+
+	/*
+	 * Make sure the name is not too long.
+	 *
+	 * ZFS_MAXNAMELEN is the maximum dataset length used in the userland
+	 * which is the same as MAXNAMELEN used in the kernel.
+	 * If the ZFS_MAXNAMELEN value is changed, make sure to clean up all
+	 * places using MAXNAMELEN.
+	 */
+
+	if (strlen(path) >= MAXNAMELEN) {
+		if (why)
+			*why = NAME_ERR_TOOLONG;
+		return (-1);
+	}
+
+	/* Explicitly check for a leading slash.  */
+	if (path[0] == '/') {
+		if (why)
+			*why = NAME_ERR_LEADING_SLASH;
+		return (-1);
+	}
+
+	if (path[0] == '\0') {
+		if (why)
+			*why = NAME_ERR_EMPTY_COMPONENT;
+		return (-1);
+	}
+
+	loc = path;
+	found_snapshot = 0;
+	for (;;) {
+		/* Find the end of this component */
+		end = loc;
+		while (*end != '/' && *end != '@' && *end != '\0')
+			end++;
+
+		if (*end == '\0' && end[-1] == '/') {
+			/* trailing slashes are not allowed */
+			if (why)
+				*why = NAME_ERR_TRAILING_SLASH;
+			return (-1);
+		}
+
+		/* Zero-length components are not allowed */
+		if (loc == end) {
+			if (why) {
+				/*
+				 * Make sure this is really a zero-length
+				 * component and not a '@@'.
+				 */
+				if (*end == '@' && found_snapshot) {
+					*why = NAME_ERR_MULTIPLE_AT;
+				} else {
+					*why = NAME_ERR_EMPTY_COMPONENT;
+				}
+			}
+
+			return (-1);
+		}
+
+		/* Validate the contents of this component */
+		while (loc != end) {
+			if (!valid_char(*loc) && *loc != '%') {
+				if (why) {
+					*why = NAME_ERR_INVALCHAR;
+					*what = *loc;
+				}
+				return (-1);
+			}
+			loc++;
+		}
+
+		/* If we've reached the end of the string, we're OK */
+		if (*end == '\0')
+			return (0);
+
+		if (*end == '@') {
+			/*
+			 * If we've found an @ symbol, indicate that we're in
+			 * the snapshot component, and report a second '@'
+			 * character as an error.
+			 */
+			if (found_snapshot) {
+				if (why)
+					*why = NAME_ERR_MULTIPLE_AT;
+				return (-1);
+			}
+
+			found_snapshot = 1;
+		}
+
+		/*
+		 * If there is a '/' in a snapshot name
+		 * then report an error
+		 */
+		if (*end == '/' && found_snapshot) {
+			if (why)
+				*why = NAME_ERR_TRAILING_SLASH;
+			return (-1);
+		}
+
+		/* Update to the next component */
+		loc = end + 1;
+	}
+}
+
+
+/*
+ * mountpoint names must be of the following form:
+ *
+ *	/[component][/]*[component][/]
+ */
+int
+mountpoint_namecheck(const char *path, namecheck_err_t *why)
+{
+	const char *start, *end;
+
+	/*
+	 * Make sure none of the mountpoint component names are too long.
+	 * If a component name is too long then the mkdir of the mountpoint
+	 * will fail but then the mountpoint property will be set to a value
+	 * that can never be mounted.  Better to fail before setting the prop.
+	 * Extra slashes are OK, they will be tossed by the mountpoint mkdir.
+	 */
+
+	if (path == NULL || *path != '/') {
+		if (why)
+			*why = NAME_ERR_LEADING_SLASH;
+		return (-1);
+	}
+
+	/* Skip leading slash  */
+	start = &path[1];
+	do {
+		end = start;
+		while (*end != '/' && *end != '\0')
+			end++;
+
+		if (end - start >= MAXNAMELEN) {
+			if (why)
+				*why = NAME_ERR_TOOLONG;
+			return (-1);
+		}
+		start = end + 1;
+
+	} while (*end != '\0');
+
+	return (0);
+}
+
+/*
+ * For pool names, we have the same set of valid characters as described in
+ * dataset names, with the additional restriction that the pool name must begin
+ * with a letter.  The names 'raidz' and 'mirror' are reserved, as are names
+ * that look like disk names (a 'c' followed by a digit).
+ */
+int
+pool_namecheck(const char *pool, namecheck_err_t *why, char *what)
+{
+	const char *c;
+
+	/*
+	 * Make sure the name is not too long.
+	 *
+	 * ZPOOL_MAXNAMELEN is the maximum pool length used in the userland
+	 * which is the same as MAXNAMELEN used in the kernel.
+	 * If the ZPOOL_MAXNAMELEN value is changed, make sure to clean up all
+	 * places using MAXNAMELEN.
+	 */
+	if (strlen(pool) >= MAXNAMELEN) {
+		if (why)
+			*why = NAME_ERR_TOOLONG;
+		return (-1);
+	}
+
+	c = pool;
+	while (*c != '\0') {
+		if (!valid_char(*c)) {
+			if (why) {
+				*why = NAME_ERR_INVALCHAR;
+				*what = *c;
+			}
+			return (-1);
+		}
+		c++;
+	}
+
+	if (!(*pool >= 'a' && *pool <= 'z') &&
+	    !(*pool >= 'A' && *pool <= 'Z')) {
+		if (why)
+			*why = NAME_ERR_NOLETTER;
+		return (-1);
+	}
+
+	if (strcmp(pool, "mirror") == 0 || strcmp(pool, "raidz") == 0) {
+		if (why)
+			*why = NAME_ERR_RESERVED;
+		return (-1);
+	}
+
+	if (pool[0] == 'c' && (pool[1] >= '0' && pool[1] <= '9')) {
+		if (why)
+			*why = NAME_ERR_DISKLIKE;
+		return (-1);
+	}
+
+	return (0);
+}
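The order of the checks in pool_namecheck() decides which error a caller sees when a name breaks more than one rule. The standalone sketch below mirrors that order, leaving out the length check. The demo_* names and the demo_pool_err_t enum are hypothetical and only loosely follow namecheck_err_t.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical result codes, loosely following namecheck_err_t. */
typedef enum {
	DEMO_POOL_OK,
	DEMO_POOL_INVALCHAR,
	DEMO_POOL_NOLETTER,
	DEMO_POOL_RESERVED,
	DEMO_POOL_DISKLIKE
} demo_pool_err_t;

/* Same character set as valid_char() in zfs_namecheck.c. */
static int
demo_valid_char(char c)
{
	return ((c >= 'a' && c <= 'z') ||
	    (c >= 'A' && c <= 'Z') ||
	    (c >= '0' && c <= '9') ||
	    c == '-' || c == '_' || c == '.' || c == ':' || c == ' ');
}

/* Apply the pool name rules in the same order as pool_namecheck(). */
static demo_pool_err_t
demo_pool_check(const char *pool)
{
	const char *c;

	assert(pool != NULL);
	for (c = pool; *c != '\0'; c++)
		if (!demo_valid_char(*c))
			return (DEMO_POOL_INVALCHAR);

	if (!(*pool >= 'a' && *pool <= 'z') &&
	    !(*pool >= 'A' && *pool <= 'Z'))
		return (DEMO_POOL_NOLETTER);

	if (strcmp(pool, "mirror") == 0 || strcmp(pool, "raidz") == 0)
		return (DEMO_POOL_RESERVED);

	if (pool[0] == 'c' && pool[1] >= '0' && pool[1] <= '9')
		return (DEMO_POOL_DISKLIKE);

	return (DEMO_POOL_OK);
}
```

Because the character scan runs first, a name such as "1ta/nk" reports an invalid character rather than a missing leading letter, and an empty name falls through to the leading-letter check.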
--- a/common/zfs/zfs_namecheck.h
+++ b/common/zfs/zfs_namecheck.h
@ -0,0 +1,58 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
+ * Use is subject to license terms.
+ */
+
+#ifndef	_ZFS_NAMECHECK_H
+#define	_ZFS_NAMECHECK_H
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+typedef enum {
+	NAME_ERR_LEADING_SLASH,		/* name begins with leading slash */
+	NAME_ERR_EMPTY_COMPONENT,	/* name contains an empty component */
+	NAME_ERR_TRAILING_SLASH,	/* name ends with a slash */
+	NAME_ERR_INVALCHAR,		/* invalid character found */
+	NAME_ERR_MULTIPLE_AT,		/* multiple '@' characters found */
+	NAME_ERR_NOLETTER,		/* pool doesn't begin with a letter */
+	NAME_ERR_RESERVED,		/* entire name is reserved */
+	NAME_ERR_DISKLIKE,		/* reserved disk name (c[0-9].*) */
+	NAME_ERR_TOOLONG,		/* name is too long */
+	NAME_ERR_NO_AT,			/* permission set is missing '@' */
+} namecheck_err_t;
+
+#define	ZFS_PERMSET_MAXLEN	64
+
+int pool_namecheck(const char *, namecheck_err_t *, char *);
+int dataset_namecheck(const char *, namecheck_err_t *, char *);
+int mountpoint_namecheck(const char *, namecheck_err_t *);
+int snapshot_namecheck(const char *, namecheck_err_t *, char *);
+int permset_namecheck(const char *, namecheck_err_t *, char *);
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif	/* _ZFS_NAMECHECK_H */
--- a/common/zfs/zfs_prop.c
+++ b/common/zfs/zfs_prop.c
@ -0,0 +1,595 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+/* Portions Copyright 2010 Robert Milkowski */
+
+#include <sys/zio.h>
+#include <sys/spa.h>
+#include <sys/u8_textprep.h>
+#include <sys/zfs_acl.h>
+#include <sys/zfs_ioctl.h>
+#include <sys/zfs_znode.h>
+
+#include "zfs_prop.h"
+#include "zfs_deleg.h"
+
+#if defined(_KERNEL)
+#include <sys/systm.h>
+#else
+#include <stdlib.h>
+#include <string.h>
+#include <ctype.h>
+#endif
+
+static zprop_desc_t zfs_prop_table[ZFS_NUM_PROPS];
+
+/* Note this is indexed by zfs_userquota_prop_t, keep the order the same */
+const char *zfs_userquota_prop_prefixes[] = {
+	"userused@",
+	"userquota@",
+	"groupused@",
+	"groupquota@"
+};
+
+zprop_desc_t *
+zfs_prop_get_table(void)
+{
+	return (zfs_prop_table);
+}
+
+void
+zfs_prop_init(void)
+{
+	static zprop_index_t checksum_table[] = {
+		{ "on",		ZIO_CHECKSUM_ON },
+		{ "off",	ZIO_CHECKSUM_OFF },
+		{ "fletcher2",	ZIO_CHECKSUM_FLETCHER_2 },
+		{ "fletcher4",	ZIO_CHECKSUM_FLETCHER_4 },
+		{ "sha256",	ZIO_CHECKSUM_SHA256 },
+		{ NULL }
+	};
+
+	static zprop_index_t dedup_table[] = {
+		{ "on",		ZIO_CHECKSUM_ON },
+		{ "off",	ZIO_CHECKSUM_OFF },
+		{ "verify",	ZIO_CHECKSUM_ON | ZIO_CHECKSUM_VERIFY },
+		{ "sha256",	ZIO_CHECKSUM_SHA256 },
+		{ "sha256,verify",
+				ZIO_CHECKSUM_SHA256 | ZIO_CHECKSUM_VERIFY },
+		{ NULL }
+	};
+
+	static zprop_index_t compress_table[] = {
+		{ "on",		ZIO_COMPRESS_ON },
+		{ "off",	ZIO_COMPRESS_OFF },
+		{ "lzjb",	ZIO_COMPRESS_LZJB },
+		{ "gzip",	ZIO_COMPRESS_GZIP_6 },	/* gzip default */
+		{ "gzip-1",	ZIO_COMPRESS_GZIP_1 },
+		{ "gzip-2",	ZIO_COMPRESS_GZIP_2 },
+		{ "gzip-3",	ZIO_COMPRESS_GZIP_3 },
+		{ "gzip-4",	ZIO_COMPRESS_GZIP_4 },
+		{ "gzip-5",	ZIO_COMPRESS_GZIP_5 },
+		{ "gzip-6",	ZIO_COMPRESS_GZIP_6 },
+		{ "gzip-7",	ZIO_COMPRESS_GZIP_7 },
+		{ "gzip-8",	ZIO_COMPRESS_GZIP_8 },
+		{ "gzip-9",	ZIO_COMPRESS_GZIP_9 },
+		{ "zle",	ZIO_COMPRESS_ZLE },
+		{ NULL }
+	};
+
+	static zprop_index_t snapdir_table[] = {
+		{ "hidden",	ZFS_SNAPDIR_HIDDEN },
+		{ "visible",	ZFS_SNAPDIR_VISIBLE },
+		{ NULL }
+	};
+
+	static zprop_index_t acl_inherit_table[] = {
+		{ "discard",	ZFS_ACL_DISCARD },
+		{ "noallow",	ZFS_ACL_NOALLOW },
+		{ "restricted",	ZFS_ACL_RESTRICTED },
+		{ "passthrough", ZFS_ACL_PASSTHROUGH },
+		{ "secure",	ZFS_ACL_RESTRICTED }, /* bkwrd compatibility */
+		{ "passthrough-x", ZFS_ACL_PASSTHROUGH_X },
+		{ NULL }
+	};
+
+	static zprop_index_t case_table[] = {
+		{ "sensitive",		ZFS_CASE_SENSITIVE },
+		{ "insensitive",	ZFS_CASE_INSENSITIVE },
+		{ "mixed",		ZFS_CASE_MIXED },
+		{ NULL }
+	};
+
+	static zprop_index_t copies_table[] = {
+		{ "1",		1 },
+		{ "2",		2 },
+		{ "3",		3 },
+		{ NULL }
+	};
+
+	/*
+	 * Use the unique flags we have to send to u8_strcmp() and/or
+	 * u8_textprep() to represent the various normalization property
+	 * values.
+	 */
+	static zprop_index_t normalize_table[] = {
+		{ "none",	0 },
+		{ "formD",	U8_TEXTPREP_NFD },
+		{ "formKC",	U8_TEXTPREP_NFKC },
+		{ "formC",	U8_TEXTPREP_NFC },
+		{ "formKD",	U8_TEXTPREP_NFKD },
+		{ NULL }
+	};
+
+	static zprop_index_t version_table[] = {
+		{ "1",		1 },
+		{ "2",		2 },
+		{ "3",		3 },
+		{ "4",		4 },
+		{ "5",		5 },
+		{ "current",	ZPL_VERSION },
+		{ NULL }
+	};
+
+	static zprop_index_t boolean_table[] = {
+		{ "off",	0 },
+		{ "on",		1 },
+		{ NULL }
+	};
+
+	static zprop_index_t logbias_table[] = {
+		{ "latency",	ZFS_LOGBIAS_LATENCY },
+		{ "throughput",	ZFS_LOGBIAS_THROUGHPUT },
+		{ NULL }
+	};
+
+	static zprop_index_t canmount_table[] = {
+		{ "off",	ZFS_CANMOUNT_OFF },
+		{ "on",		ZFS_CANMOUNT_ON },
+		{ "noauto",	ZFS_CANMOUNT_NOAUTO },
+		{ NULL }
+	};
+
+	static zprop_index_t cache_table[] = {
+		{ "none",	ZFS_CACHE_NONE },
+		{ "metadata",	ZFS_CACHE_METADATA },
+		{ "all",	ZFS_CACHE_ALL },
+		{ NULL }
+	};
+
+	static zprop_index_t sync_table[] = {
+		{ "standard",	ZFS_SYNC_STANDARD },
+		{ "always",	ZFS_SYNC_ALWAYS },
+		{ "disabled",	ZFS_SYNC_DISABLED },
+		{ NULL }
+	};
+
+	/* inherit index properties */
+	zprop_register_index(ZFS_PROP_SYNC, "sync", ZFS_SYNC_STANDARD,
+	    PROP_INHERIT, ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME,
+	    "standard | always | disabled", "SYNC",
+	    sync_table);
+	zprop_register_index(ZFS_PROP_CHECKSUM, "checksum",
+	    ZIO_CHECKSUM_DEFAULT, PROP_INHERIT, ZFS_TYPE_FILESYSTEM |
+	    ZFS_TYPE_VOLUME,
+	    "on | off | fletcher2 | fletcher4 | sha256", "CHECKSUM",
+	    checksum_table);
+	zprop_register_index(ZFS_PROP_DEDUP, "dedup", ZIO_CHECKSUM_OFF,
+	    PROP_INHERIT, ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME,
+	    "on | off | verify | sha256[,verify]", "DEDUP",
+	    dedup_table);
+	zprop_register_index(ZFS_PROP_COMPRESSION, "compression",
+	    ZIO_COMPRESS_DEFAULT, PROP_INHERIT,
+	    ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME,
+	    "on | off | lzjb | gzip | gzip-[1-9] | zle", "COMPRESS",
+	    compress_table);
+	zprop_register_index(ZFS_PROP_SNAPDIR, "snapdir", ZFS_SNAPDIR_HIDDEN,
+	    PROP_INHERIT, ZFS_TYPE_FILESYSTEM,
+	    "hidden | visible", "SNAPDIR", snapdir_table);
+	zprop_register_index(ZFS_PROP_ACLINHERIT, "aclinherit",
+	    ZFS_ACL_RESTRICTED, PROP_INHERIT, ZFS_TYPE_FILESYSTEM,
+	    "discard | noallow | restricted | passthrough | passthrough-x",
+	    "ACLINHERIT", acl_inherit_table);
+	zprop_register_index(ZFS_PROP_COPIES, "copies", 1, PROP_INHERIT,
+	    ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME,
+	    "1 | 2 | 3", "COPIES", copies_table);
+	zprop_register_index(ZFS_PROP_PRIMARYCACHE, "primarycache",
+	    ZFS_CACHE_ALL, PROP_INHERIT,
+	    ZFS_TYPE_FILESYSTEM | ZFS_TYPE_SNAPSHOT | ZFS_TYPE_VOLUME,
+	    "all | none | metadata", "PRIMARYCACHE", cache_table);
+	zprop_register_index(ZFS_PROP_SECONDARYCACHE, "secondarycache",
+	    ZFS_CACHE_ALL, PROP_INHERIT,
+	    ZFS_TYPE_FILESYSTEM | ZFS_TYPE_SNAPSHOT | ZFS_TYPE_VOLUME,
+	    "all | none | metadata", "SECONDARYCACHE", cache_table);
+	zprop_register_index(ZFS_PROP_LOGBIAS, "logbias", ZFS_LOGBIAS_LATENCY,
+	    PROP_INHERIT, ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME,
+	    "latency | throughput", "LOGBIAS", logbias_table);
+
+	/* inherit index (boolean) properties */
+	zprop_register_index(ZFS_PROP_ATIME, "atime", 1, PROP_INHERIT,
+	    ZFS_TYPE_FILESYSTEM, "on | off", "ATIME", boolean_table);
+	zprop_register_index(ZFS_PROP_DEVICES, "devices", 1, PROP_INHERIT,
+	    ZFS_TYPE_FILESYSTEM | ZFS_TYPE_SNAPSHOT, "on | off", "DEVICES",
+	    boolean_table);
+	zprop_register_index(ZFS_PROP_EXEC, "exec", 1, PROP_INHERIT,
+	    ZFS_TYPE_FILESYSTEM | ZFS_TYPE_SNAPSHOT, "on | off", "EXEC",
+	    boolean_table);
+	zprop_register_index(ZFS_PROP_SETUID, "setuid", 1, PROP_INHERIT,
+	    ZFS_TYPE_FILESYSTEM | ZFS_TYPE_SNAPSHOT, "on | off", "SETUID",
+	    boolean_table);
+	zprop_register_index(ZFS_PROP_READONLY, "readonly", 0, PROP_INHERIT,
+	    ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME, "on | off", "RDONLY",
+	    boolean_table);
+	zprop_register_index(ZFS_PROP_ZONED, "zoned", 0, PROP_INHERIT,
+	    ZFS_TYPE_FILESYSTEM, "on | off", "ZONED", boolean_table);
+	zprop_register_index(ZFS_PROP_XATTR, "xattr", 1, PROP_INHERIT,
+	    ZFS_TYPE_FILESYSTEM | ZFS_TYPE_SNAPSHOT, "on | off", "XATTR",
+	    boolean_table);
+	zprop_register_index(ZFS_PROP_VSCAN, "vscan", 0, PROP_INHERIT,
+	    ZFS_TYPE_FILESYSTEM, "on | off", "VSCAN",
+	    boolean_table);
+	zprop_register_index(ZFS_PROP_NBMAND, "nbmand", 0, PROP_INHERIT,
+	    ZFS_TYPE_FILESYSTEM | ZFS_TYPE_SNAPSHOT, "on | off", "NBMAND",
+	    boolean_table);
+
+	/* default index properties */
+	zprop_register_index(ZFS_PROP_VERSION, "version", 0, PROP_DEFAULT,
+	    ZFS_TYPE_FILESYSTEM | ZFS_TYPE_SNAPSHOT,
+	    "1 | 2 | 3 | 4 | 5 | current", "VERSION", version_table);
+	zprop_register_index(ZFS_PROP_CANMOUNT, "canmount", ZFS_CANMOUNT_ON,
+	    PROP_DEFAULT, ZFS_TYPE_FILESYSTEM, "on | off | noauto",
+	    "CANMOUNT", canmount_table);
+
+	/* readonly index (boolean) properties */
+	zprop_register_index(ZFS_PROP_MOUNTED, "mounted", 0, PROP_READONLY,
+	    ZFS_TYPE_FILESYSTEM, "yes | no", "MOUNTED", boolean_table);
+	zprop_register_index(ZFS_PROP_DEFER_DESTROY, "defer_destroy", 0,
+	    PROP_READONLY, ZFS_TYPE_SNAPSHOT, "yes | no", "DEFER_DESTROY",
+	    boolean_table);
+
+	/* set once index properties */
+	zprop_register_index(ZFS_PROP_NORMALIZE, "normalization", 0,
+	    PROP_ONETIME, ZFS_TYPE_FILESYSTEM | ZFS_TYPE_SNAPSHOT,
+	    "none | formC | formD | formKC | formKD", "NORMALIZATION",
+	    normalize_table);
+	zprop_register_index(ZFS_PROP_CASE, "casesensitivity",
+	    ZFS_CASE_SENSITIVE, PROP_ONETIME, ZFS_TYPE_FILESYSTEM |
+	    ZFS_TYPE_SNAPSHOT,
+	    "sensitive | insensitive | mixed", "CASE", case_table);
+
+	/* set once index (boolean) properties */
+	zprop_register_index(ZFS_PROP_UTF8ONLY, "utf8only", 0, PROP_ONETIME,
+	    ZFS_TYPE_FILESYSTEM | ZFS_TYPE_SNAPSHOT,
+	    "on | off", "UTF8ONLY", boolean_table);
+
+	/* string properties */
+	zprop_register_string(ZFS_PROP_ORIGIN, "origin", NULL, PROP_READONLY,
+	    ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME, "<snapshot>", "ORIGIN");
+	zprop_register_string(ZFS_PROP_MOUNTPOINT, "mountpoint", "/",
+	    PROP_INHERIT, ZFS_TYPE_FILESYSTEM, "<path> | legacy | none",
+	    "MOUNTPOINT");
+	zprop_register_string(ZFS_PROP_SHARENFS, "sharenfs", "off",
+	    PROP_INHERIT, ZFS_TYPE_FILESYSTEM, "on | off | share(1M) options",
+	    "SHARENFS");
+	zprop_register_string(ZFS_PROP_TYPE, "type", NULL, PROP_READONLY,
+	    ZFS_TYPE_DATASET, "filesystem | volume | snapshot", "TYPE");
+	zprop_register_string(ZFS_PROP_SHARESMB, "sharesmb", "off",
+	    PROP_INHERIT, ZFS_TYPE_FILESYSTEM,
+	    "on | off | sharemgr(1M) options", "SHARESMB");
+	zprop_register_string(ZFS_PROP_MLSLABEL, "mlslabel",
+	    ZFS_MLSLABEL_DEFAULT, PROP_INHERIT, ZFS_TYPE_DATASET,
+	    "<sensitivity label>", "MLSLABEL");
+
+	/* readonly number properties */
+	zprop_register_number(ZFS_PROP_USED, "used", 0, PROP_READONLY,
+	    ZFS_TYPE_DATASET, "<size>", "USED");
+	zprop_register_number(ZFS_PROP_AVAILABLE, "available", 0, PROP_READONLY,
+	    ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME, "<size>", "AVAIL");
+	zprop_register_number(ZFS_PROP_REFERENCED, "referenced", 0,
+	    PROP_READONLY, ZFS_TYPE_DATASET, "<size>", "REFER");
+	zprop_register_number(ZFS_PROP_COMPRESSRATIO, "compressratio", 0,
+	    PROP_READONLY, ZFS_TYPE_DATASET,
+	    "<1.00x or higher if compressed>", "RATIO");
+	zprop_register_number(ZFS_PROP_VOLBLOCKSIZE, "volblocksize",
+	    ZVOL_DEFAULT_BLOCKSIZE, PROP_ONETIME,
+	    ZFS_TYPE_VOLUME, "512 to 128k, power of 2",	"VOLBLOCK");
+	zprop_register_number(ZFS_PROP_USEDSNAP, "usedbysnapshots", 0,
+	    PROP_READONLY, ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME, "<size>",
+	    "USEDSNAP");
+	zprop_register_number(ZFS_PROP_USEDDS, "usedbydataset", 0,
+	    PROP_READONLY, ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME, "<size>",
+	    "USEDDS");
+	zprop_register_number(ZFS_PROP_USEDCHILD, "usedbychildren", 0,
+	    PROP_READONLY, ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME, "<size>",
+	    "USEDCHILD");
+	zprop_register_number(ZFS_PROP_USEDREFRESERV, "usedbyrefreservation", 0,
+	    PROP_READONLY,
+	    ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME, "<size>", "USEDREFRESERV");
+	zprop_register_number(ZFS_PROP_USERREFS, "userrefs", 0, PROP_READONLY,
+	    ZFS_TYPE_SNAPSHOT, "<count>", "USERREFS");
+
+	/* default number properties */
+	zprop_register_number(ZFS_PROP_QUOTA, "quota", 0, PROP_DEFAULT,
+	    ZFS_TYPE_FILESYSTEM, "<size> | none", "QUOTA");
+	zprop_register_number(ZFS_PROP_RESERVATION, "reservation", 0,
+	    PROP_DEFAULT, ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME,
+	    "<size> | none", "RESERV");
+	zprop_register_number(ZFS_PROP_VOLSIZE, "volsize", 0, PROP_DEFAULT,
+	    ZFS_TYPE_VOLUME, "<size>", "VOLSIZE");
+	zprop_register_number(ZFS_PROP_REFQUOTA, "refquota", 0, PROP_DEFAULT,
+	    ZFS_TYPE_FILESYSTEM, "<size> | none", "REFQUOTA");
+	zprop_register_number(ZFS_PROP_REFRESERVATION, "refreservation", 0,
+	    PROP_DEFAULT, ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME,
+	    "<size> | none", "REFRESERV");
+
+	/* inherit number properties */
+	zprop_register_number(ZFS_PROP_RECORDSIZE, "recordsize",
+	    SPA_MAXBLOCKSIZE, PROP_INHERIT,
+	    ZFS_TYPE_FILESYSTEM, "512 to 128k, power of 2", "RECSIZE");
+
+	/* hidden properties */
+	zprop_register_hidden(ZFS_PROP_CREATETXG, "createtxg", PROP_TYPE_NUMBER,
+	    PROP_READONLY, ZFS_TYPE_DATASET, "CREATETXG");
+	zprop_register_hidden(ZFS_PROP_NUMCLONES, "numclones", PROP_TYPE_NUMBER,
+	    PROP_READONLY, ZFS_TYPE_SNAPSHOT, "NUMCLONES");
+	zprop_register_hidden(ZFS_PROP_NAME, "name", PROP_TYPE_STRING,
+	    PROP_READONLY, ZFS_TYPE_DATASET, "NAME");
+	zprop_register_hidden(ZFS_PROP_ISCSIOPTIONS, "iscsioptions",
+	    PROP_TYPE_STRING, PROP_INHERIT, ZFS_TYPE_VOLUME, "ISCSIOPTIONS");
+	zprop_register_hidden(ZFS_PROP_STMF_SHAREINFO, "stmf_sbd_lu",
+	    PROP_TYPE_STRING, PROP_INHERIT, ZFS_TYPE_VOLUME,
+	    "STMF_SBD_LU");
+	zprop_register_hidden(ZFS_PROP_GUID, "guid", PROP_TYPE_NUMBER,
+	    PROP_READONLY, ZFS_TYPE_DATASET, "GUID");
+	zprop_register_hidden(ZFS_PROP_USERACCOUNTING, "useraccounting",
+	    PROP_TYPE_NUMBER, PROP_READONLY, ZFS_TYPE_DATASET,
+	    "USERACCOUNTING");
+	zprop_register_hidden(ZFS_PROP_UNIQUE, "unique", PROP_TYPE_NUMBER,
+	    PROP_READONLY, ZFS_TYPE_DATASET, "UNIQUE");
+	zprop_register_hidden(ZFS_PROP_OBJSETID, "objsetid", PROP_TYPE_NUMBER,
+	    PROP_READONLY, ZFS_TYPE_DATASET, "OBJSETID");
+
+	/*
+	 * Property to be removed once libbe is integrated
+	 */
+	zprop_register_hidden(ZFS_PROP_PRIVATE, "priv_prop",
+	    PROP_TYPE_NUMBER, PROP_READONLY, ZFS_TYPE_FILESYSTEM,
+	    "PRIV_PROP");
+
+	/* oddball properties */
+	zprop_register_impl(ZFS_PROP_CREATION, "creation", PROP_TYPE_NUMBER, 0,
+	    NULL, PROP_READONLY, ZFS_TYPE_DATASET,
+	    "<date>", "CREATION", B_FALSE, B_TRUE, NULL);
+}
+
+boolean_t
+zfs_prop_delegatable(zfs_prop_t prop)
+{
+	zprop_desc_t *pd = &zfs_prop_table[prop];
+
+	/* The mlslabel property is never delegatable. */
+	if (prop == ZFS_PROP_MLSLABEL)
+		return (B_FALSE);
+
+	return (pd->pd_attr != PROP_READONLY);
+}
+
+/*
+ * Given a zfs dataset property name, returns the corresponding property ID.
+ */
+zfs_prop_t
+zfs_name_to_prop(const char *propname)
+{
+	return (zprop_name_to_prop(propname, ZFS_TYPE_DATASET));
+}
+
+/*
+ * For user property names, we allow all lowercase alphanumeric characters, plus
+ * a few useful punctuation characters.
+ */
+static int
+valid_char(char c)
+{
+	return ((c >= 'a' && c <= 'z') ||
+	    (c >= '0' && c <= '9') ||
+	    c == '-' || c == '_' || c == '.' || c == ':');
+}
+
+/*
+ * Returns true if this is a valid user-defined property (one with a ':').
+ */
+boolean_t
+zfs_prop_user(const char *name)
+{
+	int i;
+	char c;
+	boolean_t foundsep = B_FALSE;
+
+	for (i = 0; i < strlen(name); i++) {
+		c = name[i];
+		if (!valid_char(c))
+			return (B_FALSE);
+		if (c == ':')
+			foundsep = B_TRUE;
+	}
+
+	if (!foundsep)
+		return (B_FALSE);
+
+	return (B_TRUE);
+}
+
+/*
+ * Returns true if this is a valid userspace-type property (one with a '@').
+ * Note that after the @, any character is valid (e.g., another @, for SID
+ * user@domain).
+ */
+boolean_t
+zfs_prop_userquota(const char *name)
+{
+	zfs_userquota_prop_t prop;
+
+	for (prop = 0; prop < ZFS_NUM_USERQUOTA_PROPS; prop++) {
+		if (strncmp(name, zfs_userquota_prop_prefixes[prop],
+		    strlen(zfs_userquota_prop_prefixes[prop])) == 0) {
+			return (B_TRUE);
+		}
+	}
+
+	return (B_FALSE);
+}
+
+/*
+ * Tables of index types, plus functions to convert between the user view
+ * (strings) and internal representation (uint64_t).
+ */
+int
+zfs_prop_string_to_index(zfs_prop_t prop, const char *string, uint64_t *index)
+{
+	return (zprop_string_to_index(prop, string, index, ZFS_TYPE_DATASET));
+}
+
+int
+zfs_prop_index_to_string(zfs_prop_t prop, uint64_t index, const char **string)
+{
+	return (zprop_index_to_string(prop, index, string, ZFS_TYPE_DATASET));
+}
+
+uint64_t
+zfs_prop_random_value(zfs_prop_t prop, uint64_t seed)
+{
+	return (zprop_random_value(prop, seed, ZFS_TYPE_DATASET));
+}
+
+/*
+ * Returns TRUE if the property applies to any of the given dataset types.
+ */
+boolean_t
+zfs_prop_valid_for_type(int prop, zfs_type_t types)
+{
+	return (zprop_valid_for_type(prop, types));
+}
+
+zprop_type_t
+zfs_prop_get_type(zfs_prop_t prop)
+{
+	return (zfs_prop_table[prop].pd_proptype);
+}
+
+/*
+ * Returns TRUE if the property is readonly or set-once (PROP_ONETIME).
+ */
+boolean_t
+zfs_prop_readonly(zfs_prop_t prop)
+{
+	return (zfs_prop_table[prop].pd_attr == PROP_READONLY ||
+	    zfs_prop_table[prop].pd_attr == PROP_ONETIME);
+}
+
+/*
+ * Returns TRUE if the property is only allowed to be set once.
+ */
+boolean_t
+zfs_prop_setonce(zfs_prop_t prop)
+{
+	return (zfs_prop_table[prop].pd_attr == PROP_ONETIME);
+}
+
+const char *
+zfs_prop_default_string(zfs_prop_t prop)
+{
+	return (zfs_prop_table[prop].pd_strdefault);
+}
+
+uint64_t
+zfs_prop_default_numeric(zfs_prop_t prop)
+{
+	return (zfs_prop_table[prop].pd_numdefault);
+}
+
+/*
+ * Given a dataset property ID, returns the corresponding name.
+ * Assumes the zfs dataset property ID is valid.
+ */
+const char *
+zfs_prop_to_name(zfs_prop_t prop)
+{
+	return (zfs_prop_table[prop].pd_name);
+}
+
+/*
+ * Returns TRUE if the property is inheritable (PROP_INHERIT or PROP_ONETIME).
+ */
+boolean_t
+zfs_prop_inheritable(zfs_prop_t prop)
+{
+	return (zfs_prop_table[prop].pd_attr == PROP_INHERIT ||
+	    zfs_prop_table[prop].pd_attr == PROP_ONETIME);
+}
+
+#ifndef _KERNEL
+
+/*
+ * Returns a string describing the set of acceptable values for the given
+ * zfs property, or NULL if it cannot be set.
+ */
+const char *
+zfs_prop_values(zfs_prop_t prop)
+{
+	return (zfs_prop_table[prop].pd_values);
+}
+
+/*
+ * Returns TRUE if this property is a string type.  Note that index types
+ * (compression, checksum) are treated as strings in userland, even though they
+ * are stored numerically on disk.
+ */
+int
+zfs_prop_is_string(zfs_prop_t prop)
+{
+	return (zfs_prop_table[prop].pd_proptype == PROP_TYPE_STRING ||
+	    zfs_prop_table[prop].pd_proptype == PROP_TYPE_INDEX);
+}
+
+/*
+ * Returns the column header for the given property.  Used only in
+ * 'zfs list -o', but centralized here with the other property information.
+ */
+const char *
+zfs_prop_column_name(zfs_prop_t prop)
+{
+	return (zfs_prop_table[prop].pd_colname);
+}
+
+/*
+ * Returns whether the given property should be displayed right-justified for
+ * 'zfs list'.
+ */
+boolean_t
+zfs_prop_align_right(zfs_prop_t prop)
+{
+	return (zfs_prop_table[prop].pd_rightalign);
+}
+
+#endif
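The user-property name check in zfs_prop.c above (valid_char() plus zfs_prop_user()) can be tried outside the ZFS tree with a minimal standalone sketch. The names demo_valid_char and demo_prop_user are hypothetical stand-ins rather than part of the vendored source, and plain int replaces boolean_t so the snippet compiles without the Solaris headers.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for valid_char(): accepts lowercase letters,
 * digits, and the punctuation characters '-', '_', '.', and ':'.
 */
static int
demo_valid_char(char c)
{
	return ((c >= 'a' && c <= 'z') ||
	    (c >= '0' && c <= '9') ||
	    c == '-' || c == '_' || c == '.' || c == ':');
}

/*
 * Hypothetical stand-in for zfs_prop_user(): returns 1 if every character
 * is valid and at least one ':' separator is present, 0 otherwise.
 */
static int
demo_prop_user(const char *name)
{
	int foundsep = 0;

	assert(name != NULL);
	for (; *name != '\0'; name++) {
		if (!demo_valid_char(*name))
			return (0);
		if (*name == ':')
			foundsep = 1;
	}
	return (foundsep);
}
```

Under these rules a name such as "com.example:backup" would be accepted, while "compression" (no ':') and "Com:x" (uppercase letter) would be rejected. The sketch walks the string once instead of calling strlen() on every iteration; that is a choice made for the example only, not a change to the imported code.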
--- a/common/zfs/zfs_prop.h
+++ b/common/zfs/zfs_prop.h
@ -0,0 +1,129 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright 2010 Sun Microsystems, Inc.  All rights reserved.
+ * Use is subject to license terms.
+ */
+
+#ifndef	_ZFS_PROP_H
+#define	_ZFS_PROP_H
+
+#include <sys/fs/zfs.h>
+#include <sys/types.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+/*
+ * For index types (e.g. compression and checksum), we want the numeric value
+ * in the kernel, but the string value in userland.
+ */
+typedef enum {
+	PROP_TYPE_NUMBER,	/* numeric value */
+	PROP_TYPE_STRING,	/* string value */
+	PROP_TYPE_INDEX		/* numeric value indexed by string */
+} zprop_type_t;
+
+typedef enum {
+	PROP_DEFAULT,
+	PROP_READONLY,
+	PROP_INHERIT,
+	/*
+	 * ONETIME properties are a sort of conglomeration of READONLY
+	 * and INHERIT.  They can be set only during object creation,
+	 * after that they are READONLY.  If not explicitly set during
+	 * creation, they can be inherited.
+	 */
+	PROP_ONETIME
+} zprop_attr_t;
+
+typedef struct zfs_index {
+	const char *pi_name;
+	uint64_t pi_value;
+} zprop_index_t;
+
+typedef struct {
+	const char *pd_name;		/* human-readable property name */
+	int pd_propnum;			/* property number */
+	zprop_type_t pd_proptype;	/* number, string, or index */
+	const char *pd_strdefault;	/* default for strings */
+	uint64_t pd_numdefault;		/* for boolean / index / number */
+	zprop_attr_t pd_attr;		/* default, readonly, inherit */
+	int pd_types;			/* bitfield of valid dataset types */
+					/* fs | vol | snap; or pool */
+	const char *pd_values;		/* string telling acceptable values */
+	const char *pd_colname;		/* column header for "zfs list" */
+	boolean_t pd_rightalign;	/* column alignment for "zfs list" */
+	boolean_t pd_visible;		/* do we list this property with the */
+					/* "zfs get" help message */
+	const zprop_index_t *pd_table;	/* for index properties, a table */
+					/* defining the possible values */
+	size_t pd_table_size;		/* number of entries in pd_table[] */
+} zprop_desc_t;
+
+/*
+ * zfs dataset property functions
+ */
+void zfs_prop_init(void);
+zprop_type_t zfs_prop_get_type(zfs_prop_t);
+boolean_t zfs_prop_delegatable(zfs_prop_t prop);
+zprop_desc_t *zfs_prop_get_table(void);
+
+/*
+ * zpool property functions
+ */
+void zpool_prop_init(void);
+zprop_type_t zpool_prop_get_type(zpool_prop_t);
+zprop_desc_t *zpool_prop_get_table(void);
+
+/*
+ * Common routines to initialize property tables
+ */
+void zprop_register_impl(int, const char *, zprop_type_t, uint64_t,
+    const char *, zprop_attr_t, int, const char *, const char *,
+    boolean_t, boolean_t, const zprop_index_t *);
+void zprop_register_string(int, const char *, const char *,
+    zprop_attr_t attr, int, const char *, const char *);
+void zprop_register_number(int, const char *, uint64_t, zprop_attr_t, int,
+    const char *, const char *);
+void zprop_register_index(int, const char *, uint64_t, zprop_attr_t, int,
+    const char *, const char *, const zprop_index_t *);
+void zprop_register_hidden(int, const char *, zprop_type_t, zprop_attr_t,
+    int, const char *);
+
+/*
+ * Common routines for zfs and zpool property management
+ */
+int zprop_iter_common(zprop_func, void *, boolean_t, boolean_t, zfs_type_t);
+int zprop_name_to_prop(const char *, zfs_type_t);
+int zprop_string_to_index(int, const char *, uint64_t *, zfs_type_t);
+int zprop_index_to_string(int, uint64_t, const char **, zfs_type_t);
+uint64_t zprop_random_value(int, uint64_t, zfs_type_t);
+const char *zprop_values(int, zfs_type_t);
+size_t zprop_width(int, boolean_t *, zfs_type_t);
+boolean_t zprop_valid_for_type(int, zfs_type_t);
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif	/* _ZFS_PROP_H */
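The PROP_ONETIME comment in zfs_prop.h above describes an attribute that behaves as both READONLY and INHERIT. A small sketch of how the predicates in zfs_prop.c treat each attribute, assuming a local mirror of zprop_attr_t (demo_attr_t and the demo_* functions are hypothetical names used only for illustration):

```c
#include <assert.h>

/* Local mirror of zprop_attr_t so the sketch compiles on its own. */
typedef enum {
	DEMO_PROP_DEFAULT,
	DEMO_PROP_READONLY,
	DEMO_PROP_INHERIT,
	DEMO_PROP_ONETIME
} demo_attr_t;

/* Mirrors zfs_prop_readonly(): ONETIME is readonly once the object exists. */
static int
demo_readonly(demo_attr_t attr)
{
	assert(attr <= DEMO_PROP_ONETIME);
	return (attr == DEMO_PROP_READONLY || attr == DEMO_PROP_ONETIME);
}

/* Mirrors zfs_prop_inheritable(): ONETIME can be inherited if never set. */
static int
demo_inheritable(demo_attr_t attr)
{
	assert(attr <= DEMO_PROP_ONETIME);
	return (attr == DEMO_PROP_INHERIT || attr == DEMO_PROP_ONETIME);
}

/* Mirrors zfs_prop_setonce(): only ONETIME qualifies. */
static int
demo_setonce(demo_attr_t attr)
{
	assert(attr <= DEMO_PROP_ONETIME);
	return (attr == DEMO_PROP_ONETIME);
}
```

With these definitions, DEMO_PROP_ONETIME is the only attribute for which all three predicates return 1, which matches the "conglomeration of READONLY and INHERIT" wording in the header comment.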
--- a/common/zfs/zpool_prop.c
+++ b/common/zfs/zpool_prop.c
@ -0,0 +1,202 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2007, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#include <sys/zio.h>
+#include <sys/spa.h>
+#include <sys/zfs_acl.h>
+#include <sys/zfs_ioctl.h>
+#include <sys/fs/zfs.h>
+
+#include "zfs_prop.h"
+
+#if defined(_KERNEL)
+#include <sys/systm.h>
+#else
+#include <stdlib.h>
+#include <string.h>
+#include <ctype.h>
+#endif
+
+static zprop_desc_t zpool_prop_table[ZPOOL_NUM_PROPS];
+
+zprop_desc_t *
+zpool_prop_get_table(void)
+{
+	return (zpool_prop_table);
+}
+
+void
+zpool_prop_init(void)
+{
+	static zprop_index_t boolean_table[] = {
+		{ "off",	0},
+		{ "on",		1},
+		{ NULL }
+	};
+
+	static zprop_index_t failuremode_table[] = {
+		{ "wait",	ZIO_FAILURE_MODE_WAIT },
+		{ "continue",	ZIO_FAILURE_MODE_CONTINUE },
+		{ "panic",	ZIO_FAILURE_MODE_PANIC },
+		{ NULL }
+	};
+
+	/* string properties */
+	zprop_register_string(ZPOOL_PROP_ALTROOT, "altroot", NULL, PROP_DEFAULT,
+	    ZFS_TYPE_POOL, "<path>", "ALTROOT");
+	zprop_register_string(ZPOOL_PROP_BOOTFS, "bootfs", NULL, PROP_DEFAULT,
+	    ZFS_TYPE_POOL, "<filesystem>", "BOOTFS");
+	zprop_register_string(ZPOOL_PROP_CACHEFILE, "cachefile", NULL,
+	    PROP_DEFAULT, ZFS_TYPE_POOL, "<file> | none", "CACHEFILE");
+
+	/* readonly number properties */
+	zprop_register_number(ZPOOL_PROP_SIZE, "size", 0, PROP_READONLY,
+	    ZFS_TYPE_POOL, "<size>", "SIZE");
+	zprop_register_number(ZPOOL_PROP_FREE, "free", 0, PROP_READONLY,
+	    ZFS_TYPE_POOL, "<size>", "FREE");
+	zprop_register_number(ZPOOL_PROP_ALLOCATED, "allocated", 0,
+	    PROP_READONLY, ZFS_TYPE_POOL, "<size>", "ALLOC");
+	zprop_register_number(ZPOOL_PROP_CAPACITY, "capacity", 0, PROP_READONLY,
+	    ZFS_TYPE_POOL, "<size>", "CAP");
+	zprop_register_number(ZPOOL_PROP_GUID, "guid", 0, PROP_READONLY,
+	    ZFS_TYPE_POOL, "<guid>", "GUID");
+	zprop_register_number(ZPOOL_PROP_HEALTH, "health", 0, PROP_READONLY,
+	    ZFS_TYPE_POOL, "<state>", "HEALTH");
+	zprop_register_number(ZPOOL_PROP_DEDUPRATIO, "dedupratio", 0,
+	    PROP_READONLY, ZFS_TYPE_POOL, "<1.00x or higher if deduped>",
+	    "DEDUP");
+
+	/* default number properties */
+	zprop_register_number(ZPOOL_PROP_VERSION, "version", SPA_VERSION,
+	    PROP_DEFAULT, ZFS_TYPE_POOL, "<version>", "VERSION");
+	zprop_register_number(ZPOOL_PROP_DEDUPDITTO, "dedupditto", 0,
+	    PROP_DEFAULT, ZFS_TYPE_POOL, "<threshold (min 100)>", "DEDUPDITTO");
+
+	/* default index (boolean) properties */
+	zprop_register_index(ZPOOL_PROP_DELEGATION, "delegation", 1,
+	    PROP_DEFAULT, ZFS_TYPE_POOL, "on | off", "DELEGATION",
+	    boolean_table);
+	zprop_register_index(ZPOOL_PROP_AUTOREPLACE, "autoreplace", 0,
+	    PROP_DEFAULT, ZFS_TYPE_POOL, "on | off", "REPLACE", boolean_table);
+	zprop_register_index(ZPOOL_PROP_LISTSNAPS, "listsnapshots", 0,
+	    PROP_DEFAULT, ZFS_TYPE_POOL, "on | off", "LISTSNAPS",
+	    boolean_table);
+	zprop_register_index(ZPOOL_PROP_AUTOEXPAND, "autoexpand", 0,
+	    PROP_DEFAULT, ZFS_TYPE_POOL, "on | off", "EXPAND", boolean_table);
+	zprop_register_index(ZPOOL_PROP_READONLY, "readonly", 0,
+	    PROP_DEFAULT, ZFS_TYPE_POOL, "on | off", "RDONLY", boolean_table);
+
+	/* default index properties */
+	zprop_register_index(ZPOOL_PROP_FAILUREMODE, "failmode",
+	    ZIO_FAILURE_MODE_WAIT, PROP_DEFAULT, ZFS_TYPE_POOL,
+	    "wait | continue | panic", "FAILMODE", failuremode_table);
+
+	/* hidden properties */
+	zprop_register_hidden(ZPOOL_PROP_NAME, "name", PROP_TYPE_STRING,
+	    PROP_READONLY, ZFS_TYPE_POOL, "NAME");
+}
+
+/*
+ * Given a pool property name, returns the corresponding property ID.
+ */
+zpool_prop_t
+zpool_name_to_prop(const char *propname)
+{
+	return (zprop_name_to_prop(propname, ZFS_TYPE_POOL));
+}
+
+/*
+ * Given a pool property ID, returns the corresponding name.
+ * Assumes the pool property ID is valid.
+ */
+const char *
+zpool_prop_to_name(zpool_prop_t prop)
+{
+	return (zpool_prop_table[prop].pd_name);
+}
+
+zprop_type_t
+zpool_prop_get_type(zpool_prop_t prop)
+{
+	return (zpool_prop_table[prop].pd_proptype);
+}
+
+boolean_t
+zpool_prop_readonly(zpool_prop_t prop)
+{
+	return (zpool_prop_table[prop].pd_attr == PROP_READONLY);
+}
+
+const char *
+zpool_prop_default_string(zpool_prop_t prop)
+{
+	return (zpool_prop_table[prop].pd_strdefault);
+}
+
+uint64_t
+zpool_prop_default_numeric(zpool_prop_t prop)
+{
+	return (zpool_prop_table[prop].pd_numdefault);
+}
+
+int
+zpool_prop_string_to_index(zpool_prop_t prop, const char *string,
+    uint64_t *index)
+{
+	return (zprop_string_to_index(prop, string, index, ZFS_TYPE_POOL));
+}
+
+int
+zpool_prop_index_to_string(zpool_prop_t prop, uint64_t index,
+    const char **string)
+{
+	return (zprop_index_to_string(prop, index, string, ZFS_TYPE_POOL));
+}
+
+uint64_t
+zpool_prop_random_value(zpool_prop_t prop, uint64_t seed)
+{
+	return (zprop_random_value(prop, seed, ZFS_TYPE_POOL));
+}
+
+#ifndef _KERNEL
+
+const char *
+zpool_prop_values(zpool_prop_t prop)
+{
+	return (zpool_prop_table[prop].pd_values);
+}
+
+const char *
+zpool_prop_column_name(zpool_prop_t prop)
+{
+	return (zpool_prop_table[prop].pd_colname);
+}
+
+boolean_t
+zpool_prop_align_right(zpool_prop_t prop)
+{
+	return (zpool_prop_table[prop].pd_rightalign);
+}
+#endif
--- a/common/zfs/zprop_common.c
+++ b/common/zfs/zprop_common.c
@ -0,0 +1,426 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright 2010 Sun Microsystems, Inc.  All rights reserved.
+ * Use is subject to license terms.
+ */
+
+/*
+ * Common routines used by zfs and zpool property management.
+ */
+
+#include <sys/zio.h>
+#include <sys/spa.h>
+#include <sys/zfs_acl.h>
+#include <sys/zfs_ioctl.h>
+#include <sys/zfs_znode.h>
+#include <sys/fs/zfs.h>
+
+#include "zfs_prop.h"
+#include "zfs_deleg.h"
+
+#if defined(_KERNEL)
+#include <sys/systm.h>
+#include <util/qsort.h>
+#else
+#include <stdlib.h>
+#include <string.h>
+#include <ctype.h>
+#endif
+
+static zprop_desc_t *
+zprop_get_proptable(zfs_type_t type)
+{
+	if (type == ZFS_TYPE_POOL)
+		return (zpool_prop_get_table());
+	else
+		return (zfs_prop_get_table());
+}
+
+static int
+zprop_get_numprops(zfs_type_t type)
+{
+	if (type == ZFS_TYPE_POOL)
+		return (ZPOOL_NUM_PROPS);
+	else
+		return (ZFS_NUM_PROPS);
+}
+
+void
+zprop_register_impl(int prop, const char *name, zprop_type_t type,
+    uint64_t numdefault, const char *strdefault, zprop_attr_t attr,
+    int objset_types, const char *values, const char *colname,
+    boolean_t rightalign, boolean_t visible, const zprop_index_t *idx_tbl)
+{
+	zprop_desc_t *prop_tbl = zprop_get_proptable(objset_types);
+	zprop_desc_t *pd;
+
+	pd = &prop_tbl[prop];
+
+	ASSERT(pd->pd_name == NULL || pd->pd_name == name);
+	ASSERT(name != NULL);
+	ASSERT(colname != NULL);
+
+	pd->pd_name = name;
+	pd->pd_propnum = prop;
+	pd->pd_proptype = type;
+	pd->pd_numdefault = numdefault;
+	pd->pd_strdefault = strdefault;
+	pd->pd_attr = attr;
+	pd->pd_types = objset_types;
+	pd->pd_values = values;
+	pd->pd_colname = colname;
+	pd->pd_rightalign = rightalign;
+	pd->pd_visible = visible;
+	pd->pd_table = idx_tbl;
+	pd->pd_table_size = 0;
+	while (idx_tbl && (idx_tbl++)->pi_name != NULL)
+		pd->pd_table_size++;
+}
+
+void
+zprop_register_string(int prop, const char *name, const char *def,
+    zprop_attr_t attr, int objset_types, const char *values,
+    const char *colname)
+{
+	zprop_register_impl(prop, name, PROP_TYPE_STRING, 0, def, attr,
+	    objset_types, values, colname, B_FALSE, B_TRUE, NULL);
+
+}
+
+void
+zprop_register_number(int prop, const char *name, uint64_t def,
+    zprop_attr_t attr, int objset_types, const char *values,
+    const char *colname)
+{
+	zprop_register_impl(prop, name, PROP_TYPE_NUMBER, def, NULL, attr,
+	    objset_types, values, colname, B_TRUE, B_TRUE, NULL);
+}
+
+void
+zprop_register_index(int prop, const char *name, uint64_t def,
+    zprop_attr_t attr, int objset_types, const char *values,
+    const char *colname, const zprop_index_t *idx_tbl)
+{
+	zprop_register_impl(prop, name, PROP_TYPE_INDEX, def, NULL, attr,
+	    objset_types, values, colname, B_TRUE, B_TRUE, idx_tbl);
+}
+
+void
+zprop_register_hidden(int prop, const char *name, zprop_type_t type,
+    zprop_attr_t attr, int objset_types, const char *colname)
+{
+	zprop_register_impl(prop, name, type, 0, NULL, attr,
+	    objset_types, NULL, colname, B_FALSE, B_FALSE, NULL);
+}
+
+
+/*
+ * A comparison function we can use to order indexes into property tables.
+ */
+static int
+zprop_compare(const void *arg1, const void *arg2)
+{
+	const zprop_desc_t *p1 = *((zprop_desc_t **)arg1);
+	const zprop_desc_t *p2 = *((zprop_desc_t **)arg2);
+	boolean_t p1ro, p2ro;
+
+	p1ro = (p1->pd_attr == PROP_READONLY);
+	p2ro = (p2->pd_attr == PROP_READONLY);
+
+	if (p1ro == p2ro)
+		return (strcmp(p1->pd_name, p2->pd_name));
+
+	return (p1ro ? -1 : 1);
+}
+
+/*
+ * Iterate over all properties in the given property table, calling back
+ * into the specified function for each property. We will continue to
+ * iterate until we either reach the end or the callback function returns
+ * something other than ZPROP_CONT.
+ */
+int
+zprop_iter_common(zprop_func func, void *cb, boolean_t show_all,
+    boolean_t ordered, zfs_type_t type)
+{
+	int i, num_props, size, prop;
+	zprop_desc_t *prop_tbl;
+	zprop_desc_t **order;
+
+	prop_tbl = zprop_get_proptable(type);
+	num_props = zprop_get_numprops(type);
+	size = num_props * sizeof (zprop_desc_t *);
+
+#if defined(_KERNEL)
+	order = kmem_alloc(size, KM_SLEEP);
+#else
+	if ((order = malloc(size)) == NULL)
+		return (ZPROP_CONT);
+#endif
+
+	for (int j = 0; j < num_props; j++)
+		order[j] = &prop_tbl[j];
+
+	if (ordered) {
+		qsort((void *)order, num_props, sizeof (zprop_desc_t *),
+		    zprop_compare);
+	}
+
+	prop = ZPROP_CONT;
+	for (i = 0; i < num_props; i++) {
+		if ((order[i]->pd_visible || show_all) &&
+		    (func(order[i]->pd_propnum, cb) != ZPROP_CONT)) {
+			prop = order[i]->pd_propnum;
+			break;
+		}
+	}
+
+#if defined(_KERNEL)
+	kmem_free(order, size);
+#else
+	free(order);
+#endif
+	return (prop);
+}
+
+static boolean_t
+propname_match(const char *p, size_t len, zprop_desc_t *prop_entry)
+{
+	const char *propname = prop_entry->pd_name;
+#ifndef _KERNEL
+	const char *colname = prop_entry->pd_colname;
+	int c;
+#endif
+
+	if (len == strlen(propname) &&
+	    strncmp(p, propname, len) == 0)
+		return (B_TRUE);
+
+#ifndef _KERNEL
+	if (colname == NULL || len != strlen(colname))
+		return (B_FALSE);
+
+	for (c = 0; c < len; c++)
+		if (p[c] != tolower(colname[c]))
+			break;
+
+	return (colname[c] == '\0');
+#else
+	return (B_FALSE);
+#endif
+}
+
+typedef struct name_to_prop_cb {
+	const char *propname;
+	zprop_desc_t *prop_tbl;
+} name_to_prop_cb_t;
+
+static int
+zprop_name_to_prop_cb(int prop, void *cb_data)
+{
+	name_to_prop_cb_t *data = cb_data;
+
+	if (propname_match(data->propname, strlen(data->propname),
+	    &data->prop_tbl[prop]))
+		return (prop);
+
+	return (ZPROP_CONT);
+}
+
+int
+zprop_name_to_prop(const char *propname, zfs_type_t type)
+{
+	int prop;
+	name_to_prop_cb_t cb_data;
+
+	cb_data.propname = propname;
+	cb_data.prop_tbl = zprop_get_proptable(type);
+
+	prop = zprop_iter_common(zprop_name_to_prop_cb, &cb_data,
+	    B_TRUE, B_FALSE, type);
+
+	return (prop == ZPROP_CONT ? ZPROP_INVAL : prop);
+}
+
+int
+zprop_string_to_index(int prop, const char *string, uint64_t *index,
+    zfs_type_t type)
+{
+	zprop_desc_t *prop_tbl;
+	const zprop_index_t *idx_tbl;
+	int i;
+
+	if (prop == ZPROP_INVAL || prop == ZPROP_CONT)
+		return (-1);
+
+	ASSERT(prop < zprop_get_numprops(type));
+	prop_tbl = zprop_get_proptable(type);
+	if ((idx_tbl = prop_tbl[prop].pd_table) == NULL)
+		return (-1);
+
+	for (i = 0; idx_tbl[i].pi_name != NULL; i++) {
+		if (strcmp(string, idx_tbl[i].pi_name) == 0) {
+			*index = idx_tbl[i].pi_value;
+			return (0);
+		}
+	}
+
+	return (-1);
+}
+
+int
+zprop_index_to_string(int prop, uint64_t index, const char **string,
+    zfs_type_t type)
+{
+	zprop_desc_t *prop_tbl;
+	const zprop_index_t *idx_tbl;
+	int i;
+
+	if (prop == ZPROP_INVAL || prop == ZPROP_CONT)
+		return (-1);
+
+	ASSERT(prop < zprop_get_numprops(type));
+	prop_tbl = zprop_get_proptable(type);
+	if ((idx_tbl = prop_tbl[prop].pd_table) == NULL)
+		return (-1);
+
+	for (i = 0; idx_tbl[i].pi_name != NULL; i++) {
+		if (idx_tbl[i].pi_value == index) {
+			*string = idx_tbl[i].pi_name;
+			return (0);
+		}
+	}
+
+	return (-1);
+}
+
+/*
+ * Return a random valid property value.  Used by ztest.
+ */
+uint64_t
+zprop_random_value(int prop, uint64_t seed, zfs_type_t type)
+{
+	zprop_desc_t *prop_tbl;
+	const zprop_index_t *idx_tbl;
+
+	ASSERT((uint_t)prop < zprop_get_numprops(type));
+	prop_tbl = zprop_get_proptable(type);
+	idx_tbl = prop_tbl[prop].pd_table;
+
+	if (idx_tbl == NULL)
+		return (seed);
+
+	return (idx_tbl[seed % prop_tbl[prop].pd_table_size].pi_value);
+}
+
+const char *
+zprop_values(int prop, zfs_type_t type)
+{
+	zprop_desc_t *prop_tbl;
+
+	ASSERT(prop != ZPROP_INVAL && prop != ZPROP_CONT);
+	ASSERT(prop < zprop_get_numprops(type));
+
+	prop_tbl = zprop_get_proptable(type);
+
+	return (prop_tbl[prop].pd_values);
+}
+
+/*
+ * Returns TRUE if the property applies to any of the given dataset types.
+ */
+boolean_t
+zprop_valid_for_type(int prop, zfs_type_t type)
+{
+	zprop_desc_t *prop_tbl;
+
+	if (prop == ZPROP_INVAL || prop == ZPROP_CONT)
+		return (B_FALSE);
+
+	ASSERT(prop < zprop_get_numprops(type));
+	prop_tbl = zprop_get_proptable(type);
+	return ((prop_tbl[prop].pd_types & type) != 0);
+}
+
+#ifndef _KERNEL
+
+/*
+ * Determines the minimum width for the column, and indicates whether it's fixed
+ * or not.  Only string columns and 'creation' are non-fixed.
+ */
+size_t
+zprop_width(int prop, boolean_t *fixed, zfs_type_t type)
+{
+	zprop_desc_t *prop_tbl, *pd;
+	const zprop_index_t *idx;
+	size_t ret;
+	int i;
+
+	ASSERT(prop != ZPROP_INVAL && prop != ZPROP_CONT);
+	ASSERT(prop < zprop_get_numprops(type));
+
+	prop_tbl = zprop_get_proptable(type);
+	pd = &prop_tbl[prop];
+
+	*fixed = B_TRUE;
+
+	/*
+	 * Start with the width of the column name.
+	 */
+	ret = strlen(pd->pd_colname);
+
+	/*
+	 * For fixed-width values, make sure the width is large enough to hold
+	 * any possible value.
+	 */
+	switch (pd->pd_proptype) {
+	case PROP_TYPE_NUMBER:
+		/*
+		 * The maximum length of a human-readable number is 5 characters
+		 * ("20.4M", for example).
+		 */
+		if (ret < 5)
+			ret = 5;
+		/*
+		 * 'creation' is handled specially because it's a number
+		 * internally, but displayed as a date string.
+		 */
+		if (prop == ZFS_PROP_CREATION)
+			*fixed = B_FALSE;
+		break;
+	case PROP_TYPE_INDEX:
+		idx = prop_tbl[prop].pd_table;
+		for (i = 0; idx[i].pi_name != NULL; i++) {
+			if (strlen(idx[i].pi_name) > ret)
+				ret = strlen(idx[i].pi_name);
+		}
+		break;
+
+	case PROP_TYPE_STRING:
+		*fixed = B_FALSE;
+		break;
+	}
+
+	return (ret);
+}
+
+#endif
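The index-table handling in zprop_common.c above (the size count in zprop_register_impl() and the lookups in zprop_string_to_index() and zprop_index_to_string()) relies on tables terminated by an entry whose name is NULL. A minimal sketch of that pattern, assuming a local copy of the table entry layout; demo_index_t, the demo_* functions, and the numeric values in demo_failmode_table are illustrative placeholders, not the real ZIO_FAILURE_MODE_* constants:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of zprop_index_t: a name paired with its numeric value. */
typedef struct {
	const char *pi_name;
	uint64_t pi_value;
} demo_index_t;

/* Illustrative table; the numeric values are placeholders. */
static const demo_index_t demo_failmode_table[] = {
	{ "wait",	0 },
	{ "continue",	1 },
	{ "panic",	2 },
	{ NULL,		0 }
};

/* Counts entries up to (not including) the NULL-named terminator. */
static size_t
demo_table_size(const demo_index_t *tbl)
{
	size_t n = 0;

	while (tbl != NULL && (tbl++)->pi_name != NULL)
		n++;
	return (n);
}

/* Name to value: returns 0 and stores the value, or -1 if not found. */
static int
demo_string_to_index(const demo_index_t *tbl, const char *string,
    uint64_t *index)
{
	int i;

	assert(tbl != NULL && string != NULL && index != NULL);
	for (i = 0; tbl[i].pi_name != NULL; i++) {
		if (strcmp(string, tbl[i].pi_name) == 0) {
			*index = tbl[i].pi_value;
			return (0);
		}
	}
	return (-1);
}

/* Value to name: returns 0 and stores the name, or -1 if not found. */
static int
demo_index_to_string(const demo_index_t *tbl, uint64_t index,
    const char **string)
{
	int i;

	assert(tbl != NULL && string != NULL);
	for (i = 0; tbl[i].pi_name != NULL; i++) {
		if (tbl[i].pi_value == index) {
			*string = tbl[i].pi_name;
			return (0);
		}
	}
	return (-1);
}
```

Both lookups are linear scans, which seems reasonable here because the tables registered in this commit hold only a handful of entries each. On a failed lookup the output argument is left untouched, as in the imported functions.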
--- a/uts/common/Makefile.files
+++ b/uts/common/Makefile.files
--- a/uts/common/dtrace/dtrace.c
+++ b/uts/common/dtrace/dtrace.c
@ -20,12 +20,9 @@
 */

 /*
- * Copyright 2008 Sun Microsystems, Inc.  All rights reserved.
- * Use is subject to license terms.
+ * Copyright (c) 2003, 2010, Oracle and/or its affiliates. All rights reserved.
 */

-#pragma ident	"%Z%%M%	%I%	%E% SMI"
-
 /*
 * DTrace - Dynamic Tracing for Solaris
 *
@ -186,7 +183,9 @@ static dtrace_ecb_t	*dtrace_ecb_create_cache; /* cached created ECB */
 static dtrace_genid_t	dtrace_probegen;	/* current probe generation */
 static dtrace_helpers_t *dtrace_deferred_pid;	/* deferred helper list */
 static dtrace_enabling_t *dtrace_retained;	/* list of retained enablings */
+static dtrace_genid_t	dtrace_retained_gen;	/* current retained enab gen */
 static dtrace_dynvar_t	dtrace_dynhash_sink;	/* end of dynamic hash chains */
+static int		dtrace_dynvar_failclean; /* dynvars failed to clean */

 /*
 * DTrace Locking
@ -240,10 +239,16 @@ static void
 dtrace_nullop(void)
 {}

+static int
+dtrace_enable_nullop(void)
+{
+	return (0);
+}
+
 static dtrace_pops_t	dtrace_provider_ops = {
 	(void (*)(void *, const dtrace_probedesc_t *))dtrace_nullop,
 	(void (*)(void *, struct modctl *))dtrace_nullop,
-	(void (*)(void *, dtrace_id_t, void *))dtrace_nullop,
+	(int (*)(void *, dtrace_id_t, void *))dtrace_enable_nullop,
 	(void (*)(void *, dtrace_id_t, void *))dtrace_nullop,
 	(void (*)(void *, dtrace_id_t, void *))dtrace_nullop,
 	(void (*)(void *, dtrace_id_t, void *))dtrace_nullop,
@ -427,6 +432,7 @@ dtrace_load##bits(uintptr_t addr)					\
 #define	DTRACE_DYNHASH_SINK	1
 #define	DTRACE_DYNHASH_VALID	2

+#define	DTRACE_MATCH_FAIL	-1
 #define	DTRACE_MATCH_NEXT	0
 #define	DTRACE_MATCH_DONE	1
 #define	DTRACE_ANCHORED(probe)	((probe)->dtpr_func[0] != '\0')
@ -1182,12 +1188,12 @@ dtrace_dynvar_clean(dtrace_dstate_t *dstate)
 {
 	dtrace_dynvar_t *dirty;
 	dtrace_dstate_percpu_t *dcpu;
-	int i, work = 0;
+	dtrace_dynvar_t **rinsep;
+	int i, j, work = 0;

 	for (i = 0; i < NCPU; i++) {
 		dcpu = &dstate->dtds_percpu[i];
-
-		ASSERT(dcpu->dtdsc_rinsing == NULL);
+		rinsep = &dcpu->dtdsc_rinsing;

 		/*
 		 * If the dirty list is NULL, there is no dirty work to do.
@ -1195,14 +1201,62 @@ dtrace_dynvar_clean(dtrace_dstate_t *dstate)
 		if (dcpu->dtdsc_dirty == NULL)
 			continue;

-		/*
-		 * If the clean list is non-NULL, then we're not going to do
-		 * any work for this CPU -- it means that there has not been
-		 * a dtrace_dynvar() allocation on this CPU (or from this CPU)
-		 * since the last time we cleaned house.
-		 */
-		if (dcpu->dtdsc_clean != NULL)
+		if (dcpu->dtdsc_rinsing != NULL) {
+			/*
+			 * If the rinsing list is non-NULL, then it is because
+			 * this CPU was selected to accept another CPU's
+			 * dirty list -- and since that time, dirty buffers
+			 * have accumulated.  This is a highly unlikely
+			 * condition, but we choose to ignore the dirty
+			 * buffers -- they'll be picked up by a future cleanse.
+			 */
 			continue;
+		}
+
+		if (dcpu->dtdsc_clean != NULL) {
+			/*
+			 * If the clean list is non-NULL, then we're in a
+			 * situation where a CPU has done deallocations (we
+			 * have a non-NULL dirty list) but no allocations (we
+			 * also have a non-NULL clean list).  We can't simply
+			 * move the dirty list into the clean list on this
+			 * CPU, yet we also don't want to allow this condition
+			 * to persist, lest a short clean list prevent a
+			 * massive dirty list from being cleaned (which in
+			 * turn could lead to otherwise avoidable dynamic
+			 * drops).  To deal with this, we look for some CPU
+			 * with a NULL clean list, NULL dirty list, and NULL
+			 * rinsing list -- and then we borrow this CPU to
+			 * rinse our dirty list.
+			 */
+			for (j = 0; j < NCPU; j++) {
+				dtrace_dstate_percpu_t *rinser;
+
+				rinser = &dstate->dtds_percpu[j];
+
+				if (rinser->dtdsc_rinsing != NULL)
+					continue;
+
+				if (rinser->dtdsc_dirty != NULL)
+					continue;
+
+				if (rinser->dtdsc_clean != NULL)
+					continue;
+
+				rinsep = &rinser->dtdsc_rinsing;
+				break;
+			}
+
+			if (j == NCPU) {
+				/*
+				 * We were unable to find another CPU that
+				 * could accept this dirty list -- we are
+				 * therefore unable to clean it now.
+				 */
+				dtrace_dynvar_failclean++;
+				continue;
+			}
+		}

 		work = 1;

@ -1219,7 +1273,7 @@ dtrace_dynvar_clean(dtrace_dstate_t *dstate)
 			 * on a hash chain, either the dirty list or the
 			 * rinsing list for some CPU must be non-NULL.)
 			 */
-			dcpu->dtdsc_rinsing = dirty;
+			*rinsep = dirty;
 			dtrace_membar_producer();
 		} while (dtrace_casptr(&dcpu->dtdsc_dirty,
 		    dirty, NULL) != dirty);
@ -1650,7 +1704,7 @@ retry:
 			ASSERT(clean->dtdv_hashval == DTRACE_DYNHASH_FREE);

 			/*
-			 * Now we'll move the clean list to the free list.
+			 * Now we'll move the clean list to our free list.
 			 * It's impossible for this to fail:  the only way
 			 * the free list can be updated is through this
 			 * code path, and only one CPU can own the clean list.
@ -1663,6 +1717,7 @@ retry:
 			 * owners of the clean lists out before resetting
 			 * the clean lists.
 			 */
+			dcpu = &dstate->dtds_percpu[me];
 			rval = dtrace_casptr(&dcpu->dtdsc_free, NULL, clean);
 			ASSERT(rval == NULL);
 			goto retry;
@ -3600,7 +3655,7 @@ dtrace_dif_subr(uint_t subr, uint_t rd, uint64_t *regs,
 		int64_t index = (int64_t)tupregs[1].dttk_value;
 		int64_t remaining = (int64_t)tupregs[2].dttk_value;
 		size_t len = dtrace_strlen((char *)s, size);
-		int64_t i = 0;
+		int64_t i;

 		if (!dtrace_canload(s, len + 1, mstate, vstate)) {
 			regs[rd] = NULL;
@ -6655,7 +6710,7 @@ dtrace_match(const dtrace_probekey_t *pkp, uint32_t priv, uid_t uid,
 {
 	dtrace_probe_t template, *probe;
 	dtrace_hash_t *hash = NULL;
-	int len, best = INT_MAX, nmatched = 0;
+	int len, rc, best = INT_MAX, nmatched = 0;
 	dtrace_id_t i;

 	ASSERT(MUTEX_HELD(&dtrace_lock));
@ -6667,7 +6722,8 @@ dtrace_match(const dtrace_probekey_t *pkp, uint32_t priv, uid_t uid,
 	if (pkp->dtpk_id != DTRACE_IDNONE) {
 		if ((probe = dtrace_probe_lookup_id(pkp->dtpk_id)) != NULL &&
 		    dtrace_match_probe(probe, pkp, priv, uid, zoneid) > 0) {
-			(void) (*matched)(probe, arg);
+			if ((*matched)(probe, arg) == DTRACE_MATCH_FAIL)
+				return (DTRACE_MATCH_FAIL);
 			nmatched++;
 		}
 		return (nmatched);
@ -6714,8 +6770,12 @@ dtrace_match(const dtrace_probekey_t *pkp, uint32_t priv, uid_t uid,

 			nmatched++;

-			if ((*matched)(probe, arg) != DTRACE_MATCH_NEXT)
+			if ((rc = (*matched)(probe, arg)) !=
+			    DTRACE_MATCH_NEXT) {
+				if (rc == DTRACE_MATCH_FAIL)
+					return (DTRACE_MATCH_FAIL);
 				break;
+			}
 		}

 		return (nmatched);
@ -6734,8 +6794,11 @@ dtrace_match(const dtrace_probekey_t *pkp, uint32_t priv, uid_t uid,

 		nmatched++;

-		if ((*matched)(probe, arg) != DTRACE_MATCH_NEXT)
+		if ((rc = (*matched)(probe, arg)) != DTRACE_MATCH_NEXT) {
+			if (rc == DTRACE_MATCH_FAIL)
+				return (DTRACE_MATCH_FAIL);
 			break;
+		}
 	}

 	return (nmatched);
@ -6955,7 +7018,7 @@ dtrace_unregister(dtrace_provider_id_t id)
 	dtrace_probe_t *probe, *first = NULL;

 	if (old->dtpv_pops.dtps_enable ==
-	    (void (*)(void *, dtrace_id_t, void *))dtrace_nullop) {
+	    (int (*)(void *, dtrace_id_t, void *))dtrace_enable_nullop) {
 		/*
 		 * If DTrace itself is the provider, we're called with locks
 		 * already held.
@ -7101,7 +7164,7 @@ dtrace_invalidate(dtrace_provider_id_t id)
 	dtrace_provider_t *pvp = (dtrace_provider_t *)id;

 	ASSERT(pvp->dtpv_pops.dtps_enable !=
-	    (void (*)(void *, dtrace_id_t, void *))dtrace_nullop);
+	    (int (*)(void *, dtrace_id_t, void *))dtrace_enable_nullop);

 	mutex_enter(&dtrace_provider_lock);
 	mutex_enter(&dtrace_lock);
@ -7142,7 +7205,7 @@ dtrace_condense(dtrace_provider_id_t id)
 	 * Make sure this isn't the dtrace provider itself.
 	 */
 	ASSERT(prov->dtpv_pops.dtps_enable !=
-	    (void (*)(void *, dtrace_id_t, void *))dtrace_nullop);
+	    (int (*)(void *, dtrace_id_t, void *))dtrace_enable_nullop);

 	mutex_enter(&dtrace_provider_lock);
 	mutex_enter(&dtrace_lock);
@ -8103,7 +8166,7 @@ dtrace_difo_validate(dtrace_difo_t *dp, dtrace_vstate_t *vstate, uint_t nregs,
 			break;

 		default:
-			err += efunc(dp->dtdo_len - 1, "bad return size");
+			err += efunc(dp->dtdo_len - 1, "bad return size\n");
 		}
 	}

@ -9096,7 +9159,7 @@ dtrace_ecb_add(dtrace_state_t *state, dtrace_probe_t *probe)
 	return (ecb);
 }

-static void
+static int
 dtrace_ecb_enable(dtrace_ecb_t *ecb)
 {
 	dtrace_probe_t *probe = ecb->dte_probe;
@ -9109,7 +9172,7 @@ dtrace_ecb_enable(dtrace_ecb_t *ecb)
 		/*
 		 * This is the NULL probe -- there's nothing to do.
 		 */
-		return;
+		return (0);
 	}

 	if (probe->dtpr_ecb == NULL) {
@ -9123,8 +9186,8 @@ dtrace_ecb_enable(dtrace_ecb_t *ecb)
 		if (ecb->dte_predicate != NULL)
 			probe->dtpr_predcache = ecb->dte_predicate->dtp_cacheid;

-		prov->dtpv_pops.dtps_enable(prov->dtpv_arg,
-		    probe->dtpr_id, probe->dtpr_arg);
+		return (prov->dtpv_pops.dtps_enable(prov->dtpv_arg,
+		    probe->dtpr_id, probe->dtpr_arg));
 	} else {
 		/*
 		 * This probe is already active.  Swing the last pointer to
@ -9137,6 +9200,7 @@ dtrace_ecb_enable(dtrace_ecb_t *ecb)
 		probe->dtpr_predcache = 0;

 		dtrace_sync();
+		return (0);
 	}
 }

@ -9920,7 +9984,9 @@ dtrace_ecb_create_enable(dtrace_probe_t *probe, void *arg)
 	if ((ecb = dtrace_ecb_create(state, probe, enab)) == NULL)
 		return (DTRACE_MATCH_DONE);

-	dtrace_ecb_enable(ecb);
+	if (dtrace_ecb_enable(ecb) < 0)
+		return (DTRACE_MATCH_FAIL);
+
 	return (DTRACE_MATCH_NEXT);
 }

@ -10557,6 +10623,7 @@ dtrace_enabling_destroy(dtrace_enabling_t *enab)
 		ASSERT(enab->dten_vstate->dtvs_state != NULL);
 		ASSERT(enab->dten_vstate->dtvs_state->dts_nretained > 0);
 		enab->dten_vstate->dtvs_state->dts_nretained--;
+		dtrace_retained_gen++;
 	}

 	if (enab->dten_prev == NULL) {
@ -10599,6 +10666,7 @@ dtrace_enabling_retain(dtrace_enabling_t *enab)
 		return (ENOSPC);

 	state->dts_nretained++;
+	dtrace_retained_gen++;

 	if (dtrace_retained == NULL) {
 		dtrace_retained = enab;
@ -10713,7 +10781,7 @@ static int
 dtrace_enabling_match(dtrace_enabling_t *enab, int *nmatched)
 {
 	int i = 0;
-	int matched = 0;
+	int total_matched = 0, matched = 0;

 	ASSERT(MUTEX_HELD(&cpu_lock));
 	ASSERT(MUTEX_HELD(&dtrace_lock));
@ -10724,7 +10792,14 @@ dtrace_enabling_match(dtrace_enabling_t *enab, int *nmatched)
 		enab->dten_current = ep;
 		enab->dten_error = 0;

-		matched += dtrace_probe_enable(&ep->dted_probe, enab);
+		/*
+		 * If a provider failed to enable a probe then get out and
+		 * let the consumer know we failed.
+		 */
+		if ((matched = dtrace_probe_enable(&ep->dted_probe, enab)) < 0)
+			return (EBUSY);
+
+		total_matched += matched;

 		if (enab->dten_error != 0) {
 			/*
@ -10752,7 +10827,7 @@ dtrace_enabling_match(dtrace_enabling_t *enab, int *nmatched)

 	enab->dten_probegen = dtrace_probegen;
 	if (nmatched != NULL)
-		*nmatched = matched;
+		*nmatched = total_matched;

 	return (0);
 }
@ -10766,13 +10841,22 @@ dtrace_enabling_matchall(void)
 	mutex_enter(&dtrace_lock);

 	/*
-	 * Because we can be called after dtrace_detach() has been called, we
-	 * cannot assert that there are retained enablings.  We can safely
-	 * load from dtrace_retained, however:  the taskq_destroy() at the
-	 * end of dtrace_detach() will block pending our completion.
+	 * Iterate over all retained enablings to see if any probes match
+	 * against them.  We only perform this operation on enablings for which
+	 * we have sufficient permissions by virtue of being in the global zone
+	 * or in the same zone as the DTrace client.  Because we can be called
+	 * after dtrace_detach() has been called, we cannot assert that there
+	 * are retained enablings.  We can safely load from dtrace_retained,
+	 * however:  the taskq_destroy() at the end of dtrace_detach() will
+	 * block pending our completion.
 	 */
-	for (enab = dtrace_retained; enab != NULL; enab = enab->dten_next)
-		(void) dtrace_enabling_match(enab, NULL);
+	for (enab = dtrace_retained; enab != NULL; enab = enab->dten_next) {
+		cred_t *cr = enab->dten_vstate->dtvs_state->dts_cred.dcr_cred;
+
+		if (INGLOBALZONE(curproc) ||
+		    (cr != NULL && getzoneid() == crgetzoneid(cr)))
+			(void) dtrace_enabling_match(enab, NULL);
+	}

 	mutex_exit(&dtrace_lock);
 	mutex_exit(&cpu_lock);
@ -10830,6 +10914,7 @@ dtrace_enabling_provide(dtrace_provider_t *prv)
 {
 	int i, all = 0;
 	dtrace_probedesc_t desc;
+	dtrace_genid_t gen;

 	ASSERT(MUTEX_HELD(&dtrace_lock));
 	ASSERT(MUTEX_HELD(&dtrace_provider_lock));
@ -10840,15 +10925,25 @@ dtrace_enabling_provide(dtrace_provider_t *prv)
 	}

 	do {
-		dtrace_enabling_t *enab = dtrace_retained;
+		dtrace_enabling_t *enab;
 		void *parg = prv->dtpv_arg;

-		for (; enab != NULL; enab = enab->dten_next) {
+retry:
+		gen = dtrace_retained_gen;
+		for (enab = dtrace_retained; enab != NULL;
+		    enab = enab->dten_next) {
 			for (i = 0; i < enab->dten_ndesc; i++) {
 				desc = enab->dten_desc[i]->dted_probe;
 				mutex_exit(&dtrace_lock);
 				prv->dtpv_pops.dtps_provide(parg, &desc);
 				mutex_enter(&dtrace_lock);
+				/*
+				 * Process the retained enablings again if
+				 * they have changed while we weren't holding
+				 * dtrace_lock.
+				 */
+				if (gen != dtrace_retained_gen)
+					goto retry;
 			}
 		}
 	} while (all && (prv = prv->dtpv_next) != NULL);
@ -10970,7 +11065,8 @@ dtrace_dof_copyin(uintptr_t uarg, int *errp)

 	dof = kmem_alloc(hdr.dofh_loadsz, KM_SLEEP);

-	if (copyin((void *)uarg, dof, hdr.dofh_loadsz) != 0) {
+	if (copyin((void *)uarg, dof, hdr.dofh_loadsz) != 0 ||
+	    dof->dofh_loadsz != hdr.dofh_loadsz) {
 		kmem_free(dof, hdr.dofh_loadsz);
 		*errp = EFAULT;
 		return (NULL);
@ -11698,6 +11794,13 @@ dtrace_dof_slurp(dof_hdr_t *dof, dtrace_vstate_t *vstate, cred_t *cr,
 			}
 		}

+		if (DOF_SEC_ISLOADABLE(sec->dofs_type) &&
+		    !(sec->dofs_flags & DOF_SECF_LOAD)) {
+			dtrace_dof_error(dof, "loadable section with load "
+			    "flag unset");
+			return (-1);
+		}
+
 		if (!(sec->dofs_flags & DOF_SECF_LOAD))
 			continue; /* just ignore non-loadable sections */

@ -14390,7 +14493,8 @@ dtrace_open(dev_t *devp, int flag, int otyp, cred_t *cred_p)
 	 * If this wasn't an open with the "helper" minor, then it must be
 	 * the "dtrace" minor.
 	 */
-	ASSERT(getminor(*devp) == DTRACEMNRN_DTRACE);
+	if (getminor(*devp) != DTRACEMNRN_DTRACE)
+		return (ENXIO);

 	/*
 	 * If no DTRACE_PRIV_* bits are set in the credential, then the
@ -14427,7 +14531,7 @@ dtrace_open(dev_t *devp, int flag, int otyp, cred_t *cred_p)
 	mutex_exit(&cpu_lock);

 	if (state == NULL) {
-		if (--dtrace_opens == 0)
+		if (--dtrace_opens == 0 && dtrace_anon.dta_enabling == NULL)
 			(void) kdi_dtrace_set(KDI_DTSET_DTRACE_DEACTIVATE);
 		mutex_exit(&dtrace_lock);
 		return (EAGAIN);
@ -14463,7 +14567,12 @@ dtrace_close(dev_t dev, int flag, int otyp, cred_t *cred_p)

 	dtrace_state_destroy(state);
 	ASSERT(dtrace_opens > 0);
-	if (--dtrace_opens == 0)
+
+	/*
+	 * Only relinquish control of the kernel debugger interface when there
+	 * are no consumers and no anonymous enablings.
+	 */
+	if (--dtrace_opens == 0 && dtrace_anon.dta_enabling == NULL)
 		(void) kdi_dtrace_set(KDI_DTSET_DTRACE_DEACTIVATE);

 	mutex_exit(&dtrace_lock);
@ -15458,7 +15567,8 @@ static struct dev_ops dtrace_ops = {
 	nodev,			/* reset */
 	&dtrace_cb_ops,		/* driver operations */
 	NULL,			/* bus operations */
-	nodev			/* dev power */
+	nodev,			/* dev power */
+	ddi_quiesce_not_needed,		/* quiesce */
 };

 static struct modldrv modldrv = {
--- a/uts/common/dtrace/fasttrap.c
+++ b/uts/common/dtrace/fasttrap.c
@ -20,11 +20,10 @@
 */

 /*
- * Copyright 2008 Sun Microsystems, Inc.  All rights reserved.
+ * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.
 */

-#pragma ident	"%Z%%M%	%I%	%E% SMI"

 #include <sys/atomic.h>
 #include <sys/errno.h>
@ -876,7 +875,7 @@ fasttrap_disable_callbacks(void)
 }

 /*ARGSUSED*/
-static void
+static int
 fasttrap_pid_enable(void *arg, dtrace_id_t id, void *parg)
 {
 	fasttrap_probe_t *probe = parg;
@ -904,7 +903,7 @@ fasttrap_pid_enable(void *arg, dtrace_id_t id, void *parg)
 	 * provider can't go away while we're in this code path.
 	 */
 	if (probe->ftp_prov->ftp_retired)
-		return;
+		return (0);

 	/*
 	 * If we can't find the process, it may be that we're in the context of
@ -913,7 +912,7 @@ fasttrap_pid_enable(void *arg, dtrace_id_t id, void *parg)
 	 */
 	if ((p = sprlock(probe->ftp_pid)) == NULL) {
 		if ((curproc->p_flag & SFORKING) == 0)
-			return;
+			return (0);

 		mutex_enter(&pidlock);
 		p = prfind(probe->ftp_pid);
@ -975,7 +974,7 @@ fasttrap_pid_enable(void *arg, dtrace_id_t id, void *parg)
 			 * drop our reference on the trap table entry.
 			 */
 			fasttrap_disable_callbacks();
-			return;
+			return (0);
 		}
 	}

@ -983,6 +982,7 @@ fasttrap_pid_enable(void *arg, dtrace_id_t id, void *parg)
 	sprunlock(p);

 	probe->ftp_enabled = 1;
+	return (0);
 }

 /*ARGSUSED*/
@ -1946,7 +1946,8 @@ fasttrap_ioctl(dev_t dev, int cmd, intptr_t arg, int md, cred_t *cr, int *rv)

 		probe = kmem_alloc(size, KM_SLEEP);

-		if (copyin(uprobe, probe, size) != 0) {
+		if (copyin(uprobe, probe, size) != 0 ||
+		    probe->ftps_noffs != noffs) {
 			kmem_free(probe, size);
 			return (EFAULT);
 		}
@ -2044,13 +2045,6 @@ err:
 			    tp->ftt_proc->ftpc_acount != 0)
 				break;

-			/*
-			 * The count of active providers can only be
-			 * decremented (i.e. to zero) during exec, exit, and
-			 * removal of a meta provider so it should be
-			 * impossible to drop the count during this operation().
-			 */
-			ASSERT(tp->ftt_proc->ftpc_acount != 0);
 			tp = tp->ftt_next;
 		}

@ -2346,7 +2340,8 @@ static struct dev_ops fasttrap_ops = {
 	nodev,			/* reset */
 	&fasttrap_cb_ops,	/* driver operations */
 	NULL,			/* bus operations */
-	nodev			/* dev power */
+	nodev,			/* dev power */
+	ddi_quiesce_not_needed,		/* quiesce */
 };

 /*
--- a/uts/common/dtrace/lockstat.c
+++ b/uts/common/dtrace/lockstat.c
@ -19,11 +19,10 @@
 * CDDL HEADER END
 */
 /*
- * Copyright 2008 Sun Microsystems, Inc.  All rights reserved.
+ * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.
 */

-#pragma ident	"%Z%%M%	%I%	%E% SMI"

 #include <sys/types.h>
 #include <sys/param.h>
@ -84,7 +83,7 @@ static kmutex_t		lockstat_test;	/* for testing purposes only */
 static dtrace_provider_id_t lockstat_id;

 /*ARGSUSED*/
-static void
+static int
 lockstat_enable(void *arg, dtrace_id_t id, void *parg)
 {
 	lockstat_probe_t *probe = parg;
@ -103,6 +102,7 @@ lockstat_enable(void *arg, dtrace_id_t id, void *parg)
 	 */
 	mutex_enter(&lockstat_test);
 	mutex_exit(&lockstat_test);
+	return (0);
 }

 /*ARGSUSED*/
@ -310,11 +310,13 @@ static struct dev_ops lockstat_ops = {
 	nulldev,		/* reset */
 	&lockstat_cb_ops,	/* cb_ops */
 	NULL,			/* bus_ops */
+	NULL,			/* power */
+	ddi_quiesce_not_needed,		/* quiesce */
 };

 static struct modldrv modldrv = {
 	&mod_driverops,		/* Type of module.  This one is a driver */
-	"Lock Statistics %I%",	/* name of module */
+	"Lock Statistics",	/* name of module */
 	&lockstat_ops,		/* driver ops */
 };

--- a/uts/common/dtrace/profile.c
+++ b/uts/common/dtrace/profile.c
@ -19,11 +19,10 @@
 * CDDL HEADER END
 */
 /*
- * Copyright 2007 Sun Microsystems, Inc.  All rights reserved.
+ * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.
 */

-#pragma ident	"%Z%%M%	%I%	%E% SMI"

 #include <sys/errno.h>
 #include <sys/stat.h>
@ -361,7 +360,7 @@ profile_offline(void *arg, cpu_t *cpu, void *oarg)
 }

 /*ARGSUSED*/
-static void
+static int
 profile_enable(void *arg, dtrace_id_t id, void *parg)
 {
 	profile_probe_t *prof = parg;
@ -391,6 +390,7 @@ profile_enable(void *arg, dtrace_id_t id, void *parg)
 	} else {
 		prof->prof_cyclic = cyclic_add_omni(&omni);
 	}
+	return (0);
 }

 /*ARGSUSED*/
@ -539,7 +539,8 @@ static struct dev_ops profile_ops = {
 	nodev,			/* reset */
 	&profile_cb_ops,	/* driver operations */
 	NULL,			/* bus operations */
-	nodev			/* dev power */
+	nodev,			/* dev power */
+	ddi_quiesce_not_needed,		/* quiesce */
 };

 /*
--- a/uts/common/dtrace/sdt_subr.c
+++ b/uts/common/dtrace/sdt_subr.c
@ -19,12 +19,9 @@
 * CDDL HEADER END
 */
 /*
- * Copyright 2008 Sun Microsystems, Inc.  All rights reserved.
- * Use is subject to license terms.
+ * Copyright (c) 2004, 2010, Oracle and/or its affiliates. All rights reserved.
 */

-#pragma ident	"%Z%%M%	%I%	%E% SMI"
-
 #include <sys/sdt_impl.h>

 static dtrace_pattr_t vtrace_attr = {
@ -43,6 +40,14 @@ static dtrace_pattr_t info_attr = {
 { DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_ISA },
 };

+static dtrace_pattr_t fc_attr = {
+{ DTRACE_STABILITY_EVOLVING, DTRACE_STABILITY_EVOLVING, DTRACE_CLASS_ISA },
+{ DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_UNKNOWN },
+{ DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_UNKNOWN },
+{ DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_ISA },
+{ DTRACE_STABILITY_EVOLVING, DTRACE_STABILITY_EVOLVING, DTRACE_CLASS_ISA },
+};
+
 static dtrace_pattr_t fpu_attr = {
 { DTRACE_STABILITY_EVOLVING, DTRACE_STABILITY_EVOLVING, DTRACE_CLASS_ISA },
 { DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_UNKNOWN },
@ -83,6 +88,14 @@ static dtrace_pattr_t xpv_attr = {
 { DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_PLATFORM },
 };

+static dtrace_pattr_t iscsi_attr = {
+{ DTRACE_STABILITY_EVOLVING, DTRACE_STABILITY_EVOLVING, DTRACE_CLASS_ISA },
+{ DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_UNKNOWN },
+{ DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_UNKNOWN },
+{ DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_ISA },
+{ DTRACE_STABILITY_EVOLVING, DTRACE_STABILITY_EVOLVING, DTRACE_CLASS_ISA },
+};
+
 sdt_provider_t sdt_providers[] = {
 	{ "vtrace", "__vtrace_", &vtrace_attr, 0 },
 	{ "sysinfo", "__cpu_sysinfo_", &info_attr, 0 },
@ -91,11 +104,17 @@ sdt_provider_t sdt_providers[] = {
 	{ "sched", "__sched_", &stab_attr, 0 },
 	{ "proc", "__proc_", &stab_attr, 0 },
 	{ "io", "__io_", &stab_attr, 0 },
+	{ "ip", "__ip_", &stab_attr, 0 },
+	{ "tcp", "__tcp_", &stab_attr, 0 },
+	{ "udp", "__udp_", &stab_attr, 0 },
 	{ "mib", "__mib_", &stab_attr, 0 },
 	{ "fsinfo", "__fsinfo_", &fsinfo_attr, 0 },
+	{ "iscsi", "__iscsi_", &iscsi_attr, 0 },
 	{ "nfsv3", "__nfsv3_", &stab_attr, 0 },
 	{ "nfsv4", "__nfsv4_", &stab_attr, 0 },
 	{ "xpv", "__xpv_", &xpv_attr, 0 },
+	{ "fc", "__fc_", &fc_attr, 0 },
+	{ "srp", "__srp_", &fc_attr, 0 },
 	{ "sysevent", "__sysevent_", &stab_attr, 0 },
 	{ "sdt", NULL, &sdt_attr, 0 },
 	{ NULL }
@ -169,6 +188,73 @@ sdt_argdesc_t sdt_args[] = {
 	{ "fsinfo", NULL, 0, 0, "vnode_t *", "fileinfo_t *" },
 	{ "fsinfo", NULL, 1, 1, "int", "int" },

+	{ "iscsi", "async-send", 0, 0, "idm_conn_t *", "conninfo_t *" },
+	{ "iscsi", "async-send", 1, 1, "iscsi_async_evt_hdr_t *",
+	    "iscsiinfo_t *" },
+	{ "iscsi", "login-command", 0, 0, "idm_conn_t *", "conninfo_t *" },
+	{ "iscsi", "login-command", 1, 1, "iscsi_login_hdr_t *",
+	    "iscsiinfo_t *" },
+	{ "iscsi", "login-response", 0, 0, "idm_conn_t *", "conninfo_t *" },
+	{ "iscsi", "login-response", 1, 1, "iscsi_login_rsp_hdr_t *",
+	    "iscsiinfo_t *" },
+	{ "iscsi", "logout-command", 0, 0, "idm_conn_t *", "conninfo_t *" },
+	{ "iscsi", "logout-command", 1, 1, "iscsi_logout_hdr_t *",
+	    "iscsiinfo_t *" },
+	{ "iscsi", "logout-response", 0, 0, "idm_conn_t *", "conninfo_t *" },
+	{ "iscsi", "logout-response", 1, 1, "iscsi_logout_rsp_hdr_t *",
+	    "iscsiinfo_t *" },
+	{ "iscsi", "data-request", 0, 0, "idm_conn_t *", "conninfo_t *" },
+	{ "iscsi", "data-request", 1, 1, "iscsi_rtt_hdr_t *",
+	    "iscsiinfo_t *" },
+	{ "iscsi", "data-send", 0, 0, "idm_conn_t *", "conninfo_t *" },
+	{ "iscsi", "data-send", 1, 1, "iscsi_data_rsp_hdr_t *",
+	    "iscsiinfo_t *" },
+	{ "iscsi", "data-receive", 0, 0, "idm_conn_t *", "conninfo_t *" },
+	{ "iscsi", "data-receive", 1, 1, "iscsi_data_hdr_t *",
+	    "iscsiinfo_t *" },
+	{ "iscsi", "nop-send", 0, 0, "idm_conn_t *", "conninfo_t *" },
+	{ "iscsi", "nop-send", 1, 1, "iscsi_nop_in_hdr_t *", "iscsiinfo_t *" },
+	{ "iscsi", "nop-receive", 0, 0, "idm_conn_t *", "conninfo_t *" },
+	{ "iscsi", "nop-receive", 1, 1, "iscsi_nop_out_hdr_t *",
+	    "iscsiinfo_t *" },
+	{ "iscsi", "scsi-command", 0, 0, "idm_conn_t *", "conninfo_t *" },
+	{ "iscsi", "scsi-command", 1, 1, "iscsi_scsi_cmd_hdr_t *",
+	    "iscsiinfo_t *" },
+	{ "iscsi", "scsi-command", 2, 2, "scsi_task_t *", "scsicmd_t *" },
+	{ "iscsi", "scsi-response", 0, 0, "idm_conn_t *", "conninfo_t *" },
+	{ "iscsi", "scsi-response", 1, 1, "iscsi_scsi_rsp_hdr_t *",
+	    "iscsiinfo_t *" },
+	{ "iscsi", "task-command", 0, 0, "idm_conn_t *", "conninfo_t *" },
+	{ "iscsi", "task-command", 1, 1, "iscsi_scsi_task_mgt_hdr_t *",
+	    "iscsiinfo_t *" },
+	{ "iscsi", "task-response", 0, 0, "idm_conn_t *", "conninfo_t *" },
+	{ "iscsi", "task-response", 1, 1, "iscsi_scsi_task_mgt_rsp_hdr_t *",
+	    "iscsiinfo_t *" },
+	{ "iscsi", "text-command", 0, 0, "idm_conn_t *", "conninfo_t *" },
+	{ "iscsi", "text-command", 1, 1, "iscsi_text_hdr_t *",
+	    "iscsiinfo_t *" },
+	{ "iscsi", "text-response", 0, 0, "idm_conn_t *", "conninfo_t *" },
+	{ "iscsi", "text-response", 1, 1, "iscsi_text_rsp_hdr_t *",
+	    "iscsiinfo_t *" },
+	{ "iscsi", "xfer-start", 0, 0, "idm_conn_t *", "conninfo_t *" },
+	{ "iscsi", "xfer-start", 1, 0, "idm_conn_t *", "iscsiinfo_t *" },
+	{ "iscsi", "xfer-start", 2, 1, "uintptr_t", "xferinfo_t *" },
+	{ "iscsi", "xfer-start", 3, 2, "uint32_t" },
+	{ "iscsi", "xfer-start", 4, 3, "uintptr_t" },
+	{ "iscsi", "xfer-start", 5, 4, "uint32_t" },
+	{ "iscsi", "xfer-start", 6, 5, "uint32_t" },
+	{ "iscsi", "xfer-start", 7, 6, "uint32_t" },
+	{ "iscsi", "xfer-start", 8, 7, "int" },
+	{ "iscsi", "xfer-done", 0, 0, "idm_conn_t *", "conninfo_t *" },
+	{ "iscsi", "xfer-done", 1, 0, "idm_conn_t *", "iscsiinfo_t *" },
+	{ "iscsi", "xfer-done", 2, 1, "uintptr_t", "xferinfo_t *" },
+	{ "iscsi", "xfer-done", 3, 2, "uint32_t" },
+	{ "iscsi", "xfer-done", 4, 3, "uintptr_t" },
+	{ "iscsi", "xfer-done", 5, 4, "uint32_t" },
+	{ "iscsi", "xfer-done", 6, 5, "uint32_t" },
+	{ "iscsi", "xfer-done", 7, 6, "uint32_t" },
+	{ "iscsi", "xfer-done", 8, 7, "int" },
+
 	{ "nfsv3", "op-getattr-start", 0, 0, "struct svc_req *",
 	    "conninfo_t *" },
 	{ "nfsv3", "op-getattr-start", 1, 1, "nfsv3oparg_t *",
@ -788,6 +874,75 @@ sdt_argdesc_t sdt_args[] = {
 	    "nfsv4cbinfo_t *" },
 	{ "nfsv4", "cb-recall-done", 2, 2, "CB_RECALL4res *" },

+	{ "ip", "send", 0, 0, "mblk_t *", "pktinfo_t *" },
+	{ "ip", "send", 1, 1, "conn_t *", "csinfo_t *" },
+	{ "ip", "send", 2, 2, "void_ip_t *", "ipinfo_t *" },
+	{ "ip", "send", 3, 3, "__dtrace_ipsr_ill_t *", "ifinfo_t *" },
+	{ "ip", "send", 4, 4, "ipha_t *", "ipv4info_t *" },
+	{ "ip", "send", 5, 5, "ip6_t *", "ipv6info_t *" },
+	{ "ip", "send", 6, 6, "int" }, /* used by __dtrace_ipsr_ill_t */
+	{ "ip", "receive", 0, 0, "mblk_t *", "pktinfo_t *" },
+	{ "ip", "receive", 1, 1, "conn_t *", "csinfo_t *" },
+	{ "ip", "receive", 2, 2, "void_ip_t *", "ipinfo_t *" },
+	{ "ip", "receive", 3, 3, "__dtrace_ipsr_ill_t *", "ifinfo_t *" },
+	{ "ip", "receive", 4, 4, "ipha_t *", "ipv4info_t *" },
+	{ "ip", "receive", 5, 5, "ip6_t *", "ipv6info_t *" },
+	{ "ip", "receive", 6, 6, "int" }, /* used by __dtrace_ipsr_ill_t */
+
+	{ "tcp", "connect-established", 0, 0, "mblk_t *", "pktinfo_t *" },
+	{ "tcp", "connect-established", 1, 1, "ip_xmit_attr_t *",
+	    "csinfo_t *" },
+	{ "tcp", "connect-established", 2, 2, "void_ip_t *", "ipinfo_t *" },
+	{ "tcp", "connect-established", 3, 3, "tcp_t *", "tcpsinfo_t *" },
+	{ "tcp", "connect-established", 4, 4, "tcph_t *", "tcpinfo_t *" },
+	{ "tcp", "connect-refused", 0, 0, "mblk_t *", "pktinfo_t *" },
+	{ "tcp", "connect-refused", 1, 1, "ip_xmit_attr_t *", "csinfo_t *" },
+	{ "tcp", "connect-refused", 2, 2, "void_ip_t *", "ipinfo_t *" },
+	{ "tcp", "connect-refused", 3, 3, "tcp_t *", "tcpsinfo_t *" },
+	{ "tcp", "connect-refused", 4, 4, "tcph_t *", "tcpinfo_t *" },
+	{ "tcp", "connect-request", 0, 0, "mblk_t *", "pktinfo_t *" },
+	{ "tcp", "connect-request", 1, 1, "ip_xmit_attr_t *", "csinfo_t *" },
+	{ "tcp", "connect-request", 2, 2, "void_ip_t *", "ipinfo_t *" },
+	{ "tcp", "connect-request", 3, 3, "tcp_t *", "tcpsinfo_t *" },
+	{ "tcp", "connect-request", 4, 4, "tcph_t *", "tcpinfo_t *" },
+	{ "tcp", "accept-established", 0, 0, "mblk_t *", "pktinfo_t *" },
+	{ "tcp", "accept-established", 1, 1, "ip_xmit_attr_t *", "csinfo_t *" },
+	{ "tcp", "accept-established", 2, 2, "void_ip_t *", "ipinfo_t *" },
+	{ "tcp", "accept-established", 3, 3, "tcp_t *", "tcpsinfo_t *" },
+	{ "tcp", "accept-established", 4, 4, "tcph_t *", "tcpinfo_t *" },
+	{ "tcp", "accept-refused", 0, 0, "mblk_t *", "pktinfo_t *" },
+	{ "tcp", "accept-refused", 1, 1, "ip_xmit_attr_t *", "csinfo_t *" },
+	{ "tcp", "accept-refused", 2, 2, "void_ip_t *", "ipinfo_t *" },
+	{ "tcp", "accept-refused", 3, 3, "tcp_t *", "tcpsinfo_t *" },
+	{ "tcp", "accept-refused", 4, 4, "tcph_t *", "tcpinfo_t *" },
+	{ "tcp", "state-change", 0, 0, "void", "void" },
+	{ "tcp", "state-change", 1, 1, "ip_xmit_attr_t *", "csinfo_t *" },
+	{ "tcp", "state-change", 2, 2, "void", "void" },
+	{ "tcp", "state-change", 3, 3, "tcp_t *", "tcpsinfo_t *" },
+	{ "tcp", "state-change", 4, 4, "void", "void" },
+	{ "tcp", "state-change", 5, 5, "int32_t", "tcplsinfo_t *" },
+	{ "tcp", "send", 0, 0, "mblk_t *", "pktinfo_t *" },
+	{ "tcp", "send", 1, 1, "ip_xmit_attr_t *", "csinfo_t *" },
+	{ "tcp", "send", 2, 2, "__dtrace_tcp_void_ip_t *", "ipinfo_t *" },
+	{ "tcp", "send", 3, 3, "tcp_t *", "tcpsinfo_t *" },
+	{ "tcp", "send", 4, 4, "__dtrace_tcp_tcph_t *", "tcpinfo_t *" },
+	{ "tcp", "receive", 0, 0, "mblk_t *", "pktinfo_t *" },
+	{ "tcp", "receive", 1, 1, "ip_xmit_attr_t *", "csinfo_t *" },
+	{ "tcp", "receive", 2, 2, "__dtrace_tcp_void_ip_t *", "ipinfo_t *" },
+	{ "tcp", "receive", 3, 3, "tcp_t *", "tcpsinfo_t *" },
+	{ "tcp", "receive", 4, 4, "__dtrace_tcp_tcph_t *", "tcpinfo_t *" },
+
+	{ "udp", "send", 0, 0, "mblk_t *", "pktinfo_t *" },
+	{ "udp", "send", 1, 1, "ip_xmit_attr_t *", "csinfo_t *" },
+	{ "udp", "send", 2, 2, "void_ip_t *", "ipinfo_t *" },
+	{ "udp", "send", 3, 3, "udp_t *", "udpsinfo_t *" },
+	{ "udp", "send", 4, 4, "udpha_t *", "udpinfo_t *" },
+	{ "udp", "receive", 0, 0, "mblk_t *", "pktinfo_t *" },
+	{ "udp", "receive", 1, 1, "ip_xmit_attr_t *", "csinfo_t *" },
+	{ "udp", "receive", 2, 2, "void_ip_t *", "ipinfo_t *" },
+	{ "udp", "receive", 3, 3, "udp_t *", "udpsinfo_t *" },
+	{ "udp", "receive", 4, 4, "udpha_t *", "udpinfo_t *" },
+
 	{ "sysevent", "post", 0, 0, "evch_bind_t *", "syseventchaninfo_t *" },
 	{ "sysevent", "post", 1, 1, "sysevent_impl_t *", "syseventinfo_t *" },

@ -848,6 +1003,154 @@ sdt_argdesc_t sdt_args[] = {
 	{ "xpv", "setvcpucontext-end", 0, 0, "int" },
 	{ "xpv", "setvcpucontext-start", 0, 0, "domid_t" },
 	{ "xpv", "setvcpucontext-start", 1, 1, "vcpu_guest_context_t *" },
+
+	{ "srp", "service-up", 0, 0, "srpt_session_t *", "conninfo_t *" },
+	{ "srp", "service-up", 1, 0, "srpt_session_t *", "srp_portinfo_t *" },
+	{ "srp", "service-down", 0, 0, "srpt_session_t *", "conninfo_t *" },
+	{ "srp", "service-down", 1, 0, "srpt_session_t *",
+	    "srp_portinfo_t *" },
+	{ "srp", "login-command", 0, 0, "srpt_session_t *", "conninfo_t *" },
+	{ "srp", "login-command", 1, 0, "srpt_session_t *",
+	    "srp_portinfo_t *" },
+	{ "srp", "login-command", 2, 1, "srp_login_req_t *",
+	    "srp_logininfo_t *" },
+	{ "srp", "login-response", 0, 0, "srpt_session_t *", "conninfo_t *" },
+	{ "srp", "login-response", 1, 0, "srpt_session_t *",
+	    "srp_portinfo_t *" },
+	{ "srp", "login-response", 2, 1, "srp_login_rsp_t *",
+	    "srp_logininfo_t *" },
+	{ "srp", "login-response", 3, 2, "srp_login_rej_t *" },
+	{ "srp", "logout-command", 0, 0, "srpt_channel_t *", "conninfo_t *" },
+	{ "srp", "logout-command", 1, 0, "srpt_channel_t *",
+	    "srp_portinfo_t *" },
+	{ "srp", "task-command", 0, 0, "srpt_channel_t *", "conninfo_t *" },
+	{ "srp", "task-command", 1, 0, "srpt_channel_t *",
+	    "srp_portinfo_t *" },
+	{ "srp", "task-command", 2, 1, "srp_cmd_req_t *", "srp_taskinfo_t *" },
+	{ "srp", "task-response", 0, 0, "srpt_channel_t *", "conninfo_t *" },
+	{ "srp", "task-response", 1, 0, "srpt_channel_t *",
+	    "srp_portinfo_t *" },
+	{ "srp", "task-response", 2, 1, "srp_rsp_t *", "srp_taskinfo_t *" },
+	{ "srp", "task-response", 3, 2, "scsi_task_t *" },
+	{ "srp", "task-response", 4, 3, "int8_t" },
+	{ "srp", "scsi-command", 0, 0, "srpt_channel_t *", "conninfo_t *" },
+	{ "srp", "scsi-command", 1, 0, "srpt_channel_t *",
+	    "srp_portinfo_t *" },
+	{ "srp", "scsi-command", 2, 1, "scsi_task_t *", "scsicmd_t *" },
+	{ "srp", "scsi-command", 3, 2, "srp_cmd_req_t *", "srp_taskinfo_t *" },
+	{ "srp", "scsi-response", 0, 0, "srpt_channel_t *", "conninfo_t *" },
+	{ "srp", "scsi-response", 1, 0, "srpt_channel_t *",
+	    "srp_portinfo_t *" },
+	{ "srp", "scsi-response", 2, 1, "srp_rsp_t *", "srp_taskinfo_t *" },
+	{ "srp", "scsi-response", 3, 2, "scsi_task_t *" },
+	{ "srp", "scsi-response", 4, 3, "int8_t" },
+	{ "srp", "xfer-start", 0, 0, "srpt_channel_t *", "conninfo_t *" },
+	{ "srp", "xfer-start", 1, 0, "srpt_channel_t *",
+	    "srp_portinfo_t *" },
+	{ "srp", "xfer-start", 2, 1, "ibt_wr_ds_t *", "xferinfo_t *" },
+	{ "srp", "xfer-start", 3, 2, "srpt_iu_t *", "srp_taskinfo_t *" },
+	{ "srp", "xfer-start", 4, 3, "ibt_send_wr_t *" },
+	{ "srp", "xfer-start", 5, 4, "uint32_t" },
+	{ "srp", "xfer-start", 6, 5, "uint32_t" },
+	{ "srp", "xfer-start", 7, 6, "uint32_t" },
+	{ "srp", "xfer-start", 8, 7, "uint32_t" },
+	{ "srp", "xfer-done", 0, 0, "srpt_channel_t *", "conninfo_t *" },
+	{ "srp", "xfer-done", 1, 0, "srpt_channel_t *",
+	    "srp_portinfo_t *" },
+	{ "srp", "xfer-done", 2, 1, "ibt_wr_ds_t *", "xferinfo_t *" },
+	{ "srp", "xfer-done", 3, 2, "srpt_iu_t *", "srp_taskinfo_t *" },
+	{ "srp", "xfer-done", 4, 3, "ibt_send_wr_t *" },
+	{ "srp", "xfer-done", 5, 4, "uint32_t" },
+	{ "srp", "xfer-done", 6, 5, "uint32_t" },
+	{ "srp", "xfer-done", 7, 6, "uint32_t" },
+	{ "srp", "xfer-done", 8, 7, "uint32_t" },
+
+	{ "fc", "link-up",   0, 0, "fct_i_local_port_t *", "conninfo_t *" },
+	{ "fc", "link-down", 0, 0, "fct_i_local_port_t *", "conninfo_t *" },
+	{ "fc", "fabric-login-start", 0, 0, "fct_i_local_port_t *",
+	    "conninfo_t *" },
+	{ "fc", "fabric-login-start", 1, 0, "fct_i_local_port_t *",
+	    "fc_port_info_t *" },
+	{ "fc", "fabric-login-end", 0, 0, "fct_i_local_port_t *",
+	    "conninfo_t *" },
+	{ "fc", "fabric-login-end", 1, 0, "fct_i_local_port_t *",
+	    "fc_port_info_t *" },
+	{ "fc", "rport-login-start", 0, 0, "fct_cmd_t *",
+	    "conninfo_t *" },
+	{ "fc", "rport-login-start", 1, 1, "fct_local_port_t *",
+	    "fc_port_info_t *" },
+	{ "fc", "rport-login-start", 2, 2, "fct_i_remote_port_t *",
+	    "fc_port_info_t *" },
+	{ "fc", "rport-login-start", 3, 3, "int", "int" },
+	{ "fc", "rport-login-end", 0, 0, "fct_cmd_t *",
+	    "conninfo_t *" },
+	{ "fc", "rport-login-end", 1, 1, "fct_local_port_t *",
+	    "fc_port_info_t *" },
+	{ "fc", "rport-login-end", 2, 2, "fct_i_remote_port_t *",
+	    "fc_port_info_t *" },
+	{ "fc", "rport-login-end", 3, 3, "int", "int" },
+	{ "fc", "rport-login-end", 4, 4, "int", "int" },
+	{ "fc", "rport-logout-start", 0, 0, "fct_cmd_t *",
+	    "conninfo_t *" },
+	{ "fc", "rport-logout-start", 1, 1, "fct_local_port_t *",
+	    "fc_port_info_t *" },
+	{ "fc", "rport-logout-start", 2, 2, "fct_i_remote_port_t *",
+	    "fc_port_info_t *" },
+	{ "fc", "rport-logout-start", 3, 3, "int", "int" },
+	{ "fc", "rport-logout-end", 0, 0, "fct_cmd_t *",
+	    "conninfo_t *" },
+	{ "fc", "rport-logout-end", 1, 1, "fct_local_port_t *",
+	    "fc_port_info_t *" },
+	{ "fc", "rport-logout-end", 2, 2, "fct_i_remote_port_t *",
+	    "fc_port_info_t *" },
+	{ "fc", "rport-logout-end", 3, 3, "int", "int" },
+	{ "fc", "scsi-command", 0, 0, "fct_cmd_t *",
+	    "conninfo_t *" },
+	{ "fc", "scsi-command", 1, 1, "fct_i_local_port_t *",
+	    "fc_port_info_t *" },
+	{ "fc", "scsi-command", 2, 2, "scsi_task_t *",
+	    "scsicmd_t *" },
+	{ "fc", "scsi-command", 3, 3, "fct_i_remote_port_t *",
+	    "fc_port_info_t *" },
+	{ "fc", "scsi-response", 0, 0, "fct_cmd_t *",
+	    "conninfo_t *" },
+	{ "fc", "scsi-response", 1, 1, "fct_i_local_port_t *",
+	    "fc_port_info_t *" },
+	{ "fc", "scsi-response", 2, 2, "scsi_task_t *",
+	    "scsicmd_t *" },
+	{ "fc", "scsi-response", 3, 3, "fct_i_remote_port_t *",
+	    "fc_port_info_t *" },
+	{ "fc", "xfer-start", 0, 0, "fct_cmd_t *",
+	    "conninfo_t *" },
+	{ "fc", "xfer-start", 1, 1, "fct_i_local_port_t *",
+	    "fc_port_info_t *" },
+	{ "fc", "xfer-start", 2, 2, "scsi_task_t *",
+	    "scsicmd_t *" },
+	{ "fc", "xfer-start", 3, 3, "fct_i_remote_port_t *",
+	    "fc_port_info_t *" },
+	{ "fc", "xfer-start", 4, 4, "stmf_data_buf_t *",
+	    "fc_xferinfo_t *" },
+	{ "fc", "xfer-done", 0, 0, "fct_cmd_t *",
+	    "conninfo_t *" },
+	{ "fc", "xfer-done", 1, 1, "fct_i_local_port_t *",
+	    "fc_port_info_t *" },
+	{ "fc", "xfer-done", 2, 2, "scsi_task_t *",
+	    "scsicmd_t *" },
+	{ "fc", "xfer-done", 3, 3, "fct_i_remote_port_t *",
+	    "fc_port_info_t *" },
+	{ "fc", "xfer-done", 4, 4, "stmf_data_buf_t *",
+	    "fc_xferinfo_t *" },
+	{ "fc", "rscn-receive", 0, 0, "fct_i_local_port_t *",
+	    "conninfo_t *" },
+	{ "fc", "rscn-receive", 1, 1, "int", "int" },
+	{ "fc", "abts-receive", 0, 0, "fct_cmd_t *",
+	    "conninfo_t *" },
+	{ "fc", "abts-receive", 1, 1, "fct_i_local_port_t *",
+	    "fc_port_info_t *" },
+	{ "fc", "abts-receive", 2, 2, "fct_i_remote_port_t *",
+	    "fc_port_info_t *" },
+
+
 	{ NULL }
 };

--- a/uts/common/dtrace/systrace.c
+++ b/uts/common/dtrace/systrace.c
@ -19,11 +19,10 @@
 * CDDL HEADER END
 */
 /*
- * Copyright 2006 Sun Microsystems, Inc.  All rights reserved.
+ * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.
 */

-#pragma ident	"%Z%%M%	%I%	%E% SMI"

 #include <sys/dtrace.h>
 #include <sys/systrace.h>
@ -141,7 +140,7 @@ systrace_destroy(void *arg, dtrace_id_t id, void *parg)
 }

 /*ARGSUSED*/
-static void
+static int
 systrace_enable(void *arg, dtrace_id_t id, void *parg)
 {
 	int sysnum = SYSTRACE_SYSNUM((uintptr_t)parg);
@ -162,7 +161,7 @@ systrace_enable(void *arg, dtrace_id_t id, void *parg)

 	if (enabled) {
 		ASSERT(sysent[sysnum].sy_callc == dtrace_systrace_syscall);
-		return;
+		return (0);
 	}

 	(void) casptr(&sysent[sysnum].sy_callc,
@ -173,6 +172,7 @@ systrace_enable(void *arg, dtrace_id_t id, void *parg)
 	    (void *)systrace_sysent32[sysnum].stsy_underlying,
 	    (void *)dtrace_systrace_syscall32);
 #endif
+	return (0);
 }

 /*ARGSUSED*/
@ -336,7 +336,8 @@ static struct dev_ops systrace_ops = {
 	nodev,			/* reset */
 	&systrace_cb_ops,	/* driver operations */
 	NULL,			/* bus operations */
-	nodev			/* dev power */
+	nodev,			/* dev power */
+	ddi_quiesce_not_needed,		/* quiesce */
 };

 /*
--- a/uts/common/fs/gfs.c
+++ b/uts/common/fs/gfs.c
--- a/uts/common/fs/vnode.c
+++ b/uts/common/fs/vnode.c
--- a/uts/common/fs/zfs/arc.c
+++ b/uts/common/fs/zfs/arc.c
--- a/uts/common/fs/zfs/bplist.c
+++ b/uts/common/fs/zfs/bplist.c
@ -0,0 +1,69 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#include <sys/bplist.h>
+#include <sys/zfs_context.h>
+
+
+void
+bplist_create(bplist_t *bpl)
+{
+	mutex_init(&bpl->bpl_lock, NULL, MUTEX_DEFAULT, NULL);
+	list_create(&bpl->bpl_list, sizeof (bplist_entry_t),
+	    offsetof(bplist_entry_t, bpe_node));
+}
+
+void
+bplist_destroy(bplist_t *bpl)
+{
+	list_destroy(&bpl->bpl_list);
+	mutex_destroy(&bpl->bpl_lock);
+}
+
+void
+bplist_append(bplist_t *bpl, const blkptr_t *bp)
+{
+	bplist_entry_t *bpe = kmem_alloc(sizeof (*bpe), KM_SLEEP);
+
+	mutex_enter(&bpl->bpl_lock);
+	bpe->bpe_blk = *bp;
+	list_insert_tail(&bpl->bpl_list, bpe);
+	mutex_exit(&bpl->bpl_lock);
+}
+
+void
+bplist_iterate(bplist_t *bpl, bplist_itor_t *func, void *arg, dmu_tx_t *tx)
+{
+	bplist_entry_t *bpe;
+
+	mutex_enter(&bpl->bpl_lock);
+	while ((bpe = list_head(&bpl->bpl_list)) != NULL) {
+		list_remove(&bpl->bpl_list, bpe);
+		mutex_exit(&bpl->bpl_lock);
+		func(arg, &bpe->bpe_blk, tx);
+		kmem_free(bpe, sizeof (*bpe));
+		mutex_enter(&bpl->bpl_lock);
+	}
+	mutex_exit(&bpl->bpl_lock);
+}
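
Editorial note on bplist_iterate() above: it drains the list by removing the head under the lock, dropping the lock while the callback runs, and retaking it before looking at the list again. The following is a minimal userland sketch of that pattern, assuming POSIX threads. The names queue_t, entry_t, queue_append, queue_drain and sum_cb are hypothetical stand-ins and are not part of the imported code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userland stand-ins; this is not the kernel bplist_t API. */
typedef struct entry {
	struct entry *next;
	int value;
} entry_t;

typedef struct queue {
	pthread_mutex_t lock;
	entry_t *head;
	entry_t *tail;
} queue_t;

typedef void (*visit_fn)(void *arg, int value);

static void
queue_init(queue_t *q)
{
	pthread_mutex_init(&q->lock, NULL);
	q->head = q->tail = NULL;
}

static void
queue_append(queue_t *q, int value)
{
	/* Allocate before taking the lock, as bplist_append() does. */
	entry_t *e = malloc(sizeof (*e));

	assert(e != NULL);
	e->next = NULL;
	e->value = value;

	pthread_mutex_lock(&q->lock);
	if (q->tail != NULL)
		q->tail->next = e;
	else
		q->head = e;
	q->tail = e;
	pthread_mutex_unlock(&q->lock);
}

static void
queue_drain(queue_t *q, visit_fn func, void *arg)
{
	entry_t *e;

	pthread_mutex_lock(&q->lock);
	while ((e = q->head) != NULL) {
		/* Unlink the head while the lock is still held. */
		q->head = e->next;
		if (q->head == NULL)
			q->tail = NULL;
		/* Drop the lock so the callback may block or append. */
		pthread_mutex_unlock(&q->lock);
		func(arg, e->value);
		free(e);
		pthread_mutex_lock(&q->lock);
	}
	pthread_mutex_unlock(&q->lock);
}

/* Example callback: accumulate the visited values into *arg. */
static void
sum_cb(void *arg, int value)
{
	*(int *)arg += value;
}
```

Because each entry is unlinked before the lock is released, the callback never runs on an entry another thread can still reach, and entries appended during the callback are picked up by the same drain loop.
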
--- a/uts/common/fs/zfs/bpobj.c
+++ b/uts/common/fs/zfs/bpobj.c
@ -0,0 +1,495 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#include <sys/bpobj.h>
+#include <sys/zfs_context.h>
+#include <sys/refcount.h>
+
+uint64_t
+bpobj_alloc(objset_t *os, int blocksize, dmu_tx_t *tx)
+{
+	int size;
+
+	if (spa_version(dmu_objset_spa(os)) < SPA_VERSION_BPOBJ_ACCOUNT)
+		size = BPOBJ_SIZE_V0;
+	else if (spa_version(dmu_objset_spa(os)) < SPA_VERSION_DEADLISTS)
+		size = BPOBJ_SIZE_V1;
+	else
+		size = sizeof (bpobj_phys_t);
+
+	return (dmu_object_alloc(os, DMU_OT_BPOBJ, blocksize,
+	    DMU_OT_BPOBJ_HDR, size, tx));
+}
+
+void
+bpobj_free(objset_t *os, uint64_t obj, dmu_tx_t *tx)
+{
+	int64_t i;
+	bpobj_t bpo;
+	dmu_object_info_t doi;
+	int epb;
+	dmu_buf_t *dbuf = NULL;
+
+	VERIFY3U(0, ==, bpobj_open(&bpo, os, obj));
+
+	mutex_enter(&bpo.bpo_lock);
+
+	if (!bpo.bpo_havesubobj || bpo.bpo_phys->bpo_subobjs == 0)
+		goto out;
+
+	VERIFY3U(0, ==, dmu_object_info(os, bpo.bpo_phys->bpo_subobjs, &doi));
+	epb = doi.doi_data_block_size / sizeof (uint64_t);
+
+	for (i = bpo.bpo_phys->bpo_num_subobjs - 1; i >= 0; i--) {
+		uint64_t *objarray;
+		uint64_t offset, blkoff;
+
+		offset = i * sizeof (uint64_t);
+		blkoff = P2PHASE(i, epb);
+
+		if (dbuf == NULL || dbuf->db_offset > offset) {
+			if (dbuf)
+				dmu_buf_rele(dbuf, FTAG);
+			VERIFY3U(0, ==, dmu_buf_hold(os,
+			    bpo.bpo_phys->bpo_subobjs, offset, FTAG, &dbuf, 0));
+		}
+
+		ASSERT3U(offset, >=, dbuf->db_offset);
+		ASSERT3U(offset, <, dbuf->db_offset + dbuf->db_size);
+
+		objarray = dbuf->db_data;
+		bpobj_free(os, objarray[blkoff], tx);
+	}
+	if (dbuf) {
+		dmu_buf_rele(dbuf, FTAG);
+		dbuf = NULL;
+	}
+	VERIFY3U(0, ==, dmu_object_free(os, bpo.bpo_phys->bpo_subobjs, tx));
+
+out:
+	mutex_exit(&bpo.bpo_lock);
+	bpobj_close(&bpo);
+
+	VERIFY3U(0, ==, dmu_object_free(os, obj, tx));
+}
+
+int
+bpobj_open(bpobj_t *bpo, objset_t *os, uint64_t object)
+{
+	dmu_object_info_t doi;
+	int err;
+
+	err = dmu_object_info(os, object, &doi);
+	if (err)
+		return (err);
+
+	bzero(bpo, sizeof (*bpo));
+	mutex_init(&bpo->bpo_lock, NULL, MUTEX_DEFAULT, NULL);
+
+	ASSERT(bpo->bpo_dbuf == NULL);
+	ASSERT(bpo->bpo_phys == NULL);
+	ASSERT(object != 0);
+	ASSERT3U(doi.doi_type, ==, DMU_OT_BPOBJ);
+	ASSERT3U(doi.doi_bonus_type, ==, DMU_OT_BPOBJ_HDR);
+
+	err = dmu_bonus_hold(os, object, bpo, &bpo->bpo_dbuf);
+	if (err)
+		return (err);
+
+	bpo->bpo_os = os;
+	bpo->bpo_object = object;
+	bpo->bpo_epb = doi.doi_data_block_size >> SPA_BLKPTRSHIFT;
+	bpo->bpo_havecomp = (doi.doi_bonus_size > BPOBJ_SIZE_V0);
+	bpo->bpo_havesubobj = (doi.doi_bonus_size > BPOBJ_SIZE_V1);
+	bpo->bpo_phys = bpo->bpo_dbuf->db_data;
+	return (0);
+}
+
+void
+bpobj_close(bpobj_t *bpo)
+{
+	/* Lame workaround for closing a bpobj that was never opened. */
+	if (bpo->bpo_object == 0)
+		return;
+
+	dmu_buf_rele(bpo->bpo_dbuf, bpo);
+	if (bpo->bpo_cached_dbuf != NULL)
+		dmu_buf_rele(bpo->bpo_cached_dbuf, bpo);
+	bpo->bpo_dbuf = NULL;
+	bpo->bpo_phys = NULL;
+	bpo->bpo_cached_dbuf = NULL;
+	bpo->bpo_object = 0;
+
+	mutex_destroy(&bpo->bpo_lock);
+}
+
+static int
+bpobj_iterate_impl(bpobj_t *bpo, bpobj_itor_t func, void *arg, dmu_tx_t *tx,
+    boolean_t free)
+{
+	dmu_object_info_t doi;
+	int epb;
+	int64_t i;
+	int err = 0;
+	dmu_buf_t *dbuf = NULL;
+
+	mutex_enter(&bpo->bpo_lock);
+
+	if (free)
+		dmu_buf_will_dirty(bpo->bpo_dbuf, tx);
+
+	for (i = bpo->bpo_phys->bpo_num_blkptrs - 1; i >= 0; i--) {
+		blkptr_t *bparray;
+		blkptr_t *bp;
+		uint64_t offset, blkoff;
+
+		offset = i * sizeof (blkptr_t);
+		blkoff = P2PHASE(i, bpo->bpo_epb);
+
+		if (dbuf == NULL || dbuf->db_offset > offset) {
+			if (dbuf)
+				dmu_buf_rele(dbuf, FTAG);
+			err = dmu_buf_hold(bpo->bpo_os, bpo->bpo_object, offset,
+			    FTAG, &dbuf, 0);
+			if (err)
+				break;
+		}
+
+		ASSERT3U(offset, >=, dbuf->db_offset);
+		ASSERT3U(offset, <, dbuf->db_offset + dbuf->db_size);
+
+		bparray = dbuf->db_data;
+		bp = &bparray[blkoff];
+		err = func(arg, bp, tx);
+		if (err)
+			break;
+		if (free) {
+			bpo->bpo_phys->bpo_bytes -=
+			    bp_get_dsize_sync(dmu_objset_spa(bpo->bpo_os), bp);
+			ASSERT3S(bpo->bpo_phys->bpo_bytes, >=, 0);
+			if (bpo->bpo_havecomp) {
+				bpo->bpo_phys->bpo_comp -= BP_GET_PSIZE(bp);
+				bpo->bpo_phys->bpo_uncomp -= BP_GET_UCSIZE(bp);
+			}
+			bpo->bpo_phys->bpo_num_blkptrs--;
+			ASSERT3S(bpo->bpo_phys->bpo_num_blkptrs, >=, 0);
+		}
+	}
+	if (dbuf) {
+		dmu_buf_rele(dbuf, FTAG);
+		dbuf = NULL;
+	}
+	if (free) {
+		i++;
+		VERIFY3U(0, ==, dmu_free_range(bpo->bpo_os, bpo->bpo_object,
+		    i * sizeof (blkptr_t), -1ULL, tx));
+	}
+	if (err || !bpo->bpo_havesubobj || bpo->bpo_phys->bpo_subobjs == 0)
+		goto out;
+
+	ASSERT(bpo->bpo_havecomp);
+	err = dmu_object_info(bpo->bpo_os, bpo->bpo_phys->bpo_subobjs, &doi);
+	if (err) {
+		mutex_exit(&bpo->bpo_lock);
+		return (err);
+	}
+	epb = doi.doi_data_block_size / sizeof (uint64_t);
+
+	for (i = bpo->bpo_phys->bpo_num_subobjs - 1; i >= 0; i--) {
+		uint64_t *objarray;
+		uint64_t offset, blkoff;
+		bpobj_t sublist;
+		uint64_t used_before, comp_before, uncomp_before;
+		uint64_t used_after, comp_after, uncomp_after;
+
+		offset = i * sizeof (uint64_t);
+		blkoff = P2PHASE(i, epb);
+
+		if (dbuf == NULL || dbuf->db_offset > offset) {
+			if (dbuf)
+				dmu_buf_rele(dbuf, FTAG);
+			err = dmu_buf_hold(bpo->bpo_os,
+			    bpo->bpo_phys->bpo_subobjs, offset, FTAG, &dbuf, 0);
+			if (err)
+				break;
+		}
+
+		ASSERT3U(offset, >=, dbuf->db_offset);
+		ASSERT3U(offset, <, dbuf->db_offset + dbuf->db_size);
+
+		objarray = dbuf->db_data;
+		err = bpobj_open(&sublist, bpo->bpo_os, objarray[blkoff]);
+		if (err)
+			break;
+		if (free) {
+			err = bpobj_space(&sublist,
+			    &used_before, &comp_before, &uncomp_before);
+			if (err)
+				break;
+		}
+		err = bpobj_iterate_impl(&sublist, func, arg, tx, free);
+		if (free) {
+			VERIFY3U(0, ==, bpobj_space(&sublist,
+			    &used_after, &comp_after, &uncomp_after));
+			bpo->bpo_phys->bpo_bytes -= used_before - used_after;
+			ASSERT3S(bpo->bpo_phys->bpo_bytes, >=, 0);
+			bpo->bpo_phys->bpo_comp -= comp_before - comp_after;
+			bpo->bpo_phys->bpo_uncomp -=
+			    uncomp_before - uncomp_after;
+		}
+
+		bpobj_close(&sublist);
+		if (err)
+			break;
+		if (free) {
+			err = dmu_object_free(bpo->bpo_os,
+			    objarray[blkoff], tx);
+			if (err)
+				break;
+			bpo->bpo_phys->bpo_num_subobjs--;
+			ASSERT3S(bpo->bpo_phys->bpo_num_subobjs, >=, 0);
+		}
+	}
+	if (dbuf) {
+		dmu_buf_rele(dbuf, FTAG);
+		dbuf = NULL;
+	}
+	if (free) {
+		VERIFY3U(0, ==, dmu_free_range(bpo->bpo_os,
+		    bpo->bpo_phys->bpo_subobjs,
+		    (i + 1) * sizeof (uint64_t), -1ULL, tx));
+	}
+
+out:
+	/* If there are no entries, there should be no bytes. */
+	ASSERT(bpo->bpo_phys->bpo_num_blkptrs > 0 ||
+	    (bpo->bpo_havesubobj && bpo->bpo_phys->bpo_num_subobjs > 0) ||
+	    bpo->bpo_phys->bpo_bytes == 0);
+
+	mutex_exit(&bpo->bpo_lock);
+	return (err);
+}
+
+/*
+ * Iterate and remove the entries.  If func returns nonzero, iteration
+ * will stop and that entry will not be removed.
+ */
+int
+bpobj_iterate(bpobj_t *bpo, bpobj_itor_t func, void *arg, dmu_tx_t *tx)
+{
+	return (bpobj_iterate_impl(bpo, func, arg, tx, B_TRUE));
+}
+
+/*
+ * Iterate the entries.  If func returns nonzero, iteration will stop.
+ */
+int
+bpobj_iterate_nofree(bpobj_t *bpo, bpobj_itor_t func, void *arg, dmu_tx_t *tx)
+{
+	return (bpobj_iterate_impl(bpo, func, arg, tx, B_FALSE));
+}
+
+void
+bpobj_enqueue_subobj(bpobj_t *bpo, uint64_t subobj, dmu_tx_t *tx)
+{
+	bpobj_t subbpo;
+	uint64_t used, comp, uncomp, subsubobjs;
+
+	ASSERT(bpo->bpo_havesubobj);
+	ASSERT(bpo->bpo_havecomp);
+
+	VERIFY3U(0, ==, bpobj_open(&subbpo, bpo->bpo_os, subobj));
+	VERIFY3U(0, ==, bpobj_space(&subbpo, &used, &comp, &uncomp));
+
+	if (used == 0) {
+		/* No point in having an empty subobj. */
+		bpobj_close(&subbpo);
+		bpobj_free(bpo->bpo_os, subobj, tx);
+		return;
+	}
+
+	dmu_buf_will_dirty(bpo->bpo_dbuf, tx);
+	if (bpo->bpo_phys->bpo_subobjs == 0) {
+		bpo->bpo_phys->bpo_subobjs = dmu_object_alloc(bpo->bpo_os,
+		    DMU_OT_BPOBJ_SUBOBJ, SPA_MAXBLOCKSIZE, DMU_OT_NONE, 0, tx);
+	}
+
+	mutex_enter(&bpo->bpo_lock);
+	dmu_write(bpo->bpo_os, bpo->bpo_phys->bpo_subobjs,
+	    bpo->bpo_phys->bpo_num_subobjs * sizeof (subobj),
+	    sizeof (subobj), &subobj, tx);
+	bpo->bpo_phys->bpo_num_subobjs++;
+
+	/*
+	 * If subobj has only one block of subobjs, then move subobj's
+	 * subobjs to bpo's subobj list directly.  This reduces
+	 * recursion in bpobj_iterate due to nested subobjs.
+	 */
+	subsubobjs = subbpo.bpo_phys->bpo_subobjs;
+	if (subsubobjs != 0) {
+		dmu_object_info_t doi;
+
+		VERIFY3U(0, ==, dmu_object_info(bpo->bpo_os, subsubobjs, &doi));
+		if (doi.doi_max_offset == doi.doi_data_block_size) {
+			dmu_buf_t *subdb;
+			uint64_t numsubsub = subbpo.bpo_phys->bpo_num_subobjs;
+
+			VERIFY3U(0, ==, dmu_buf_hold(bpo->bpo_os, subsubobjs,
+			    0, FTAG, &subdb, 0));
+			dmu_write(bpo->bpo_os, bpo->bpo_phys->bpo_subobjs,
+			    bpo->bpo_phys->bpo_num_subobjs * sizeof (subobj),
+			    numsubsub * sizeof (subobj), subdb->db_data, tx);
+			dmu_buf_rele(subdb, FTAG);
+			bpo->bpo_phys->bpo_num_subobjs += numsubsub;
+
+			dmu_buf_will_dirty(subbpo.bpo_dbuf, tx);
+			subbpo.bpo_phys->bpo_subobjs = 0;
+			VERIFY3U(0, ==, dmu_object_free(bpo->bpo_os,
+			    subsubobjs, tx));
+		}
+	}
+	bpo->bpo_phys->bpo_bytes += used;
+	bpo->bpo_phys->bpo_comp += comp;
+	bpo->bpo_phys->bpo_uncomp += uncomp;
+	mutex_exit(&bpo->bpo_lock);
+
+	bpobj_close(&subbpo);
+}
+
+void
+bpobj_enqueue(bpobj_t *bpo, const blkptr_t *bp, dmu_tx_t *tx)
+{
+	blkptr_t stored_bp = *bp;
+	uint64_t offset;
+	int blkoff;
+	blkptr_t *bparray;
+
+	ASSERT(!BP_IS_HOLE(bp));
+
+	/* We never need the fill count. */
+	stored_bp.blk_fill = 0;
+
+	/* The bpobj will compress better if we can leave off the checksum. */
+	if (!BP_GET_DEDUP(bp))
+		bzero(&stored_bp.blk_cksum, sizeof (stored_bp.blk_cksum));
+
+	mutex_enter(&bpo->bpo_lock);
+
+	offset = bpo->bpo_phys->bpo_num_blkptrs * sizeof (stored_bp);
+	blkoff = P2PHASE(bpo->bpo_phys->bpo_num_blkptrs, bpo->bpo_epb);
+
+	if (bpo->bpo_cached_dbuf == NULL ||
+	    offset < bpo->bpo_cached_dbuf->db_offset ||
+	    offset >= bpo->bpo_cached_dbuf->db_offset +
+	    bpo->bpo_cached_dbuf->db_size) {
+		if (bpo->bpo_cached_dbuf)
+			dmu_buf_rele(bpo->bpo_cached_dbuf, bpo);
+		VERIFY3U(0, ==, dmu_buf_hold(bpo->bpo_os, bpo->bpo_object,
+		    offset, bpo, &bpo->bpo_cached_dbuf, 0));
+	}
+
+	dmu_buf_will_dirty(bpo->bpo_cached_dbuf, tx);
+	bparray = bpo->bpo_cached_dbuf->db_data;
+	bparray[blkoff] = stored_bp;
+
+	dmu_buf_will_dirty(bpo->bpo_dbuf, tx);
+	bpo->bpo_phys->bpo_num_blkptrs++;
+	bpo->bpo_phys->bpo_bytes +=
+	    bp_get_dsize_sync(dmu_objset_spa(bpo->bpo_os), bp);
+	if (bpo->bpo_havecomp) {
+		bpo->bpo_phys->bpo_comp += BP_GET_PSIZE(bp);
+		bpo->bpo_phys->bpo_uncomp += BP_GET_UCSIZE(bp);
+	}
+	mutex_exit(&bpo->bpo_lock);
+}
+
+struct space_range_arg {
+	spa_t *spa;
+	uint64_t mintxg;
+	uint64_t maxtxg;
+	uint64_t used;
+	uint64_t comp;
+	uint64_t uncomp;
+};
+
+/* ARGSUSED */
+static int
+space_range_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
+{
+	struct space_range_arg *sra = arg;
+
+	if (bp->blk_birth > sra->mintxg && bp->blk_birth <= sra->maxtxg) {
+		sra->used += bp_get_dsize_sync(sra->spa, bp);
+		sra->comp += BP_GET_PSIZE(bp);
+		sra->uncomp += BP_GET_UCSIZE(bp);
+	}
+	return (0);
+}
+
+int
+bpobj_space(bpobj_t *bpo, uint64_t *usedp, uint64_t *compp, uint64_t *uncompp)
+{
+	mutex_enter(&bpo->bpo_lock);
+
+	*usedp = bpo->bpo_phys->bpo_bytes;
+	if (bpo->bpo_havecomp) {
+		*compp = bpo->bpo_phys->bpo_comp;
+		*uncompp = bpo->bpo_phys->bpo_uncomp;
+		mutex_exit(&bpo->bpo_lock);
+		return (0);
+	} else {
+		mutex_exit(&bpo->bpo_lock);
+		return (bpobj_space_range(bpo, 0, UINT64_MAX,
+		    usedp, compp, uncompp));
+	}
+}
+
+/*
+ * Return the amount of space in the bpobj which is:
+ * mintxg < blk_birth <= maxtxg
+ */
+int
+bpobj_space_range(bpobj_t *bpo, uint64_t mintxg, uint64_t maxtxg,
+    uint64_t *usedp, uint64_t *compp, uint64_t *uncompp)
+{
+	struct space_range_arg sra = { 0 };
+	int err;
+
+	/*
+	 * As an optimization, if they want the whole txg range, just
+	 * get bpo_bytes rather than iterating over the bps.
+	 */
+	if (mintxg < TXG_INITIAL && maxtxg == UINT64_MAX && bpo->bpo_havecomp)
+		return (bpobj_space(bpo, usedp, compp, uncompp));
+
+	sra.spa = dmu_objset_spa(bpo->bpo_os);
+	sra.mintxg = mintxg;
+	sra.maxtxg = maxtxg;
+
+	err = bpobj_iterate_nofree(bpo, space_range_cb, &sra, NULL);
+	*usedp = sra.used;
+	*compp = sra.comp;
+	*uncompp = sra.uncomp;
+	return (err);
+}
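
Editorial note on the indexing used throughout bpobj.c above: entry i lives at byte offset i * sizeof (entry) within the object, and P2PHASE(i, epb) gives its index inside whichever block is currently held. A small sketch of that arithmetic follows. The slot_t type and locate_entry function are hypothetical, and the 128-byte entry size and 16 KiB block size in the usage note are illustrative assumptions only.

```c
#include <assert.h>
#include <stdint.h>

/* Phase of x within a power-of-two alignment, as in <sys/sysmacros.h>. */
#define	P2PHASE(x, align)	((x) & ((align) - 1))

/* Hypothetical result type for this sketch. */
typedef struct slot {
	uint64_t byte_offset;		/* offset of the entry in the object */
	uint64_t index_in_block;	/* index in the held block's array */
} slot_t;

static slot_t
locate_entry(uint64_t i, uint64_t entry_size, uint64_t entries_per_block)
{
	slot_t s;

	/* The mask trick only works when the count is a power of two. */
	assert(entries_per_block != 0 &&
	    (entries_per_block & (entries_per_block - 1)) == 0);

	s.byte_offset = i * entry_size;
	s.index_in_block = P2PHASE(i, entries_per_block);
	return (s);
}
```

With 128-byte entries in a 16384-byte block there are 128 entries per block, so entry 130 sits at byte offset 16640 and is element 2 of the second block.
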
--- a/uts/common/fs/zfs/dbuf.c
+++ b/uts/common/fs/zfs/dbuf.c
--- a/uts/common/fs/zfs/ddt.c
+++ b/uts/common/fs/zfs/ddt.c
--- a/uts/common/fs/zfs/ddt_zap.c
+++ b/uts/common/fs/zfs/ddt_zap.c
@ -0,0 +1,157 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+
+/*
+ * Copyright (c) 2009, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#include <sys/zfs_context.h>
+#include <sys/spa.h>
+#include <sys/zio.h>
+#include <sys/ddt.h>
+#include <sys/zap.h>
+#include <sys/dmu_tx.h>
+#include <util/sscanf.h>
+
+int ddt_zap_leaf_blockshift = 12;
+int ddt_zap_indirect_blockshift = 12;
+
+static int
+ddt_zap_create(objset_t *os, uint64_t *objectp, dmu_tx_t *tx, boolean_t prehash)
+{
+	zap_flags_t flags = ZAP_FLAG_HASH64 | ZAP_FLAG_UINT64_KEY;
+
+	if (prehash)
+		flags |= ZAP_FLAG_PRE_HASHED_KEY;
+
+	*objectp = zap_create_flags(os, 0, flags, DMU_OT_DDT_ZAP,
+	    ddt_zap_leaf_blockshift, ddt_zap_indirect_blockshift,
+	    DMU_OT_NONE, 0, tx);
+
+	return (*objectp == 0 ? ENOTSUP : 0);
+}
+
+static int
+ddt_zap_destroy(objset_t *os, uint64_t object, dmu_tx_t *tx)
+{
+	return (zap_destroy(os, object, tx));
+}
+
+static int
+ddt_zap_lookup(objset_t *os, uint64_t object, ddt_entry_t *dde)
+{
+	uchar_t cbuf[sizeof (dde->dde_phys) + 1];
+	uint64_t one, csize;
+	int error;
+
+	error = zap_length_uint64(os, object, (uint64_t *)&dde->dde_key,
+	    DDT_KEY_WORDS, &one, &csize);
+	if (error)
+		return (error);
+
+	ASSERT(one == 1);
+	ASSERT(csize <= sizeof (cbuf));
+
+	error = zap_lookup_uint64(os, object, (uint64_t *)&dde->dde_key,
+	    DDT_KEY_WORDS, 1, csize, cbuf);
+	if (error)
+		return (error);
+
+	ddt_decompress(cbuf, dde->dde_phys, csize, sizeof (dde->dde_phys));
+
+	return (0);
+}
+
+static void
+ddt_zap_prefetch(objset_t *os, uint64_t object, ddt_entry_t *dde)
+{
+	(void) zap_prefetch_uint64(os, object, (uint64_t *)&dde->dde_key,
+	    DDT_KEY_WORDS);
+}
+
+static int
+ddt_zap_update(objset_t *os, uint64_t object, ddt_entry_t *dde, dmu_tx_t *tx)
+{
+	uchar_t cbuf[sizeof (dde->dde_phys) + 1];
+	uint64_t csize;
+
+	csize = ddt_compress(dde->dde_phys, cbuf,
+	    sizeof (dde->dde_phys), sizeof (cbuf));
+
+	return (zap_update_uint64(os, object, (uint64_t *)&dde->dde_key,
+	    DDT_KEY_WORDS, 1, csize, cbuf, tx));
+}
+
+static int
+ddt_zap_remove(objset_t *os, uint64_t object, ddt_entry_t *dde, dmu_tx_t *tx)
+{
+	return (zap_remove_uint64(os, object, (uint64_t *)&dde->dde_key,
+	    DDT_KEY_WORDS, tx));
+}
+
+static int
+ddt_zap_walk(objset_t *os, uint64_t object, ddt_entry_t *dde, uint64_t *walk)
+{
+	zap_cursor_t zc;
+	zap_attribute_t za;
+	int error;
+
+	zap_cursor_init_serialized(&zc, os, object, *walk);
+	if ((error = zap_cursor_retrieve(&zc, &za)) == 0) {
+		uchar_t cbuf[sizeof (dde->dde_phys) + 1];
+		uint64_t csize = za.za_num_integers;
+		ASSERT(za.za_integer_length == 1);
+		error = zap_lookup_uint64(os, object, (uint64_t *)za.za_name,
+		    DDT_KEY_WORDS, 1, csize, cbuf);
+		ASSERT(error == 0);
+		if (error == 0) {
+			ddt_decompress(cbuf, dde->dde_phys, csize,
+			    sizeof (dde->dde_phys));
+			dde->dde_key = *(ddt_key_t *)za.za_name;
+		}
+		zap_cursor_advance(&zc);
+		*walk = zap_cursor_serialize(&zc);
+	}
+	zap_cursor_fini(&zc);
+	return (error);
+}
+
+static uint64_t
+ddt_zap_count(objset_t *os, uint64_t object)
+{
+	uint64_t count = 0;
+
+	VERIFY(zap_count(os, object, &count) == 0);
+
+	return (count);
+}
+
+const ddt_ops_t ddt_zap_ops = {
+	"zap",
+	ddt_zap_create,
+	ddt_zap_destroy,
+	ddt_zap_lookup,
+	ddt_zap_prefetch,
+	ddt_zap_update,
+	ddt_zap_remove,
+	ddt_zap_walk,
+	ddt_zap_count,
+};
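
Editorial note on ddt_zap_ops above: the table is a named set of function pointers, so callers can use a storage backend without knowing which implementation sits behind it. The sketch below shows the same shape in plain C. The store_ops_t type and the array_* backend are hypothetical and much simpler than ddt_ops_t.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical operations table; far smaller than the real ddt_ops_t. */
typedef struct store_ops {
	const char *name;
	int (*put)(uint64_t key, uint64_t val);
	int (*get)(uint64_t key, uint64_t *valp);
	uint64_t (*count)(void);
} store_ops_t;

#define	SLOTS	8

static uint64_t keys[SLOTS];
static uint64_t vals[SLOTS];
static uint64_t nused;

static int
array_get(uint64_t key, uint64_t *valp)
{
	for (uint64_t i = 0; i < nused; i++) {
		if (keys[i] == key) {
			*valp = vals[i];
			return (0);
		}
	}
	return (ENOENT);
}

static int
array_put(uint64_t key, uint64_t val)
{
	/* Update in place if the key already exists. */
	for (uint64_t i = 0; i < nused; i++) {
		if (keys[i] == key) {
			vals[i] = val;
			return (0);
		}
	}
	if (nused == SLOTS)
		return (ENOSPC);
	keys[nused] = key;
	vals[nused] = val;
	nused++;
	return (0);
}

static uint64_t
array_count(void)
{
	return (nused);
}

/* Positional initializers, in the same style as ddt_zap_ops. */
static const store_ops_t array_ops = {
	"array",
	array_put,
	array_get,
	array_count,
};
```

A caller holds only a const store_ops_t pointer and dispatches through it, for example ops->put(7, 70) followed by ops->get(7, &v).
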
--- a/uts/common/fs/zfs/dmu.c
+++ b/uts/common/fs/zfs/dmu.c
--- a/uts/common/fs/zfs/dmu_diff.c
+++ b/uts/common/fs/zfs/dmu_diff.c
@ -0,0 +1,221 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#include <sys/dmu.h>
+#include <sys/dmu_impl.h>
+#include <sys/dmu_tx.h>
+#include <sys/dbuf.h>
+#include <sys/dnode.h>
+#include <sys/zfs_context.h>
+#include <sys/dmu_objset.h>
+#include <sys/dmu_traverse.h>
+#include <sys/dsl_dataset.h>
+#include <sys/dsl_dir.h>
+#include <sys/dsl_pool.h>
+#include <sys/dsl_synctask.h>
+#include <sys/zfs_ioctl.h>
+#include <sys/zap.h>
+#include <sys/zio_checksum.h>
+#include <sys/zfs_znode.h>
+
+struct diffarg {
+	struct vnode *da_vp;		/* file to which we are reporting */
+	offset_t *da_offp;
+	int da_err;			/* error that stopped diff search */
+	dmu_diff_record_t da_ddr;
+};
+
+static int
+write_record(struct diffarg *da)
+{
+	ssize_t resid; /* have to get resid to get detailed errno */
+
+	if (da->da_ddr.ddr_type == DDR_NONE) {
+		da->da_err = 0;
+		return (0);
+	}
+
+	da->da_err = vn_rdwr(UIO_WRITE, da->da_vp, (caddr_t)&da->da_ddr,
+	    sizeof (da->da_ddr), 0, UIO_SYSSPACE, FAPPEND,
+	    RLIM64_INFINITY, CRED(), &resid);
+	*da->da_offp += sizeof (da->da_ddr);
+	return (da->da_err);
+}
+
+static int
+report_free_dnode_range(struct diffarg *da, uint64_t first, uint64_t last)
+{
+	ASSERT(first <= last);
+	if (da->da_ddr.ddr_type != DDR_FREE ||
+	    first != da->da_ddr.ddr_last + 1) {
+		if (write_record(da) != 0)
+			return (da->da_err);
+		da->da_ddr.ddr_type = DDR_FREE;
+		da->da_ddr.ddr_first = first;
+		da->da_ddr.ddr_last = last;
+		return (0);
+	}
+	da->da_ddr.ddr_last = last;
+	return (0);
+}
+
+static int
+report_dnode(struct diffarg *da, uint64_t object, dnode_phys_t *dnp)
+{
+	ASSERT(dnp != NULL);
+	if (dnp->dn_type == DMU_OT_NONE)
+		return (report_free_dnode_range(da, object, object));
+
+	if (da->da_ddr.ddr_type != DDR_INUSE ||
+	    object != da->da_ddr.ddr_last + 1) {
+		if (write_record(da) != 0)
+			return (da->da_err);
+		da->da_ddr.ddr_type = DDR_INUSE;
+		da->da_ddr.ddr_first = da->da_ddr.ddr_last = object;
+		return (0);
+	}
+	da->da_ddr.ddr_last = object;
+	return (0);
+}
+
+#define	DBP_SPAN(dnp, level)				  \
+	(((uint64_t)dnp->dn_datablkszsec) << (SPA_MINBLOCKSHIFT + \
+	(level) * (dnp->dn_indblkshift - SPA_BLKPTRSHIFT)))
+
+/* ARGSUSED */
+static int
+diff_cb(spa_t *spa, zilog_t *zilog, const blkptr_t *bp, arc_buf_t *pbuf,
+    const zbookmark_t *zb, const dnode_phys_t *dnp, void *arg)
+{
+	struct diffarg *da = arg;
+	int err = 0;
+
+	if (issig(JUSTLOOKING) && issig(FORREAL))
+		return (EINTR);
+
+	if (zb->zb_object != DMU_META_DNODE_OBJECT)
+		return (0);
+
+	if (bp == NULL) {
+		uint64_t span = DBP_SPAN(dnp, zb->zb_level);
+		uint64_t dnobj = (zb->zb_blkid * span) >> DNODE_SHIFT;
+
+		err = report_free_dnode_range(da, dnobj,
+		    dnobj + (span >> DNODE_SHIFT) - 1);
+		if (err)
+			return (err);
+	} else if (zb->zb_level == 0) {
+		dnode_phys_t *blk;
+		arc_buf_t *abuf;
+		uint32_t aflags = ARC_WAIT;
+		int blksz = BP_GET_LSIZE(bp);
+		int i;
+
+		if (dsl_read(NULL, spa, bp, pbuf,
+		    arc_getbuf_func, &abuf, ZIO_PRIORITY_ASYNC_READ,
+		    ZIO_FLAG_CANFAIL, &aflags, zb) != 0)
+			return (EIO);
+
+		blk = abuf->b_data;
+		for (i = 0; i < blksz >> DNODE_SHIFT; i++) {
+			uint64_t dnobj = (zb->zb_blkid <<
+			    (DNODE_BLOCK_SHIFT - DNODE_SHIFT)) + i;
+			err = report_dnode(da, dnobj, blk+i);
+			if (err)
+				break;
+		}
+		(void) arc_buf_remove_ref(abuf, &abuf);
+		if (err)
+			return (err);
+		/* Don't care about the data blocks */
+		return (TRAVERSE_VISIT_NO_CHILDREN);
+	}
+	return (0);
+}
+
+int
+dmu_diff(objset_t *tosnap, objset_t *fromsnap, struct vnode *vp, offset_t *offp)
+{
+	struct diffarg da;
+	dsl_dataset_t *ds = tosnap->os_dsl_dataset;
+	dsl_dataset_t *fromds = fromsnap->os_dsl_dataset;
+	dsl_dataset_t *findds;
+	dsl_dataset_t *relds;
+	int err = 0;
+
+	/* make certain we are looking at snapshots */
+	if (!dsl_dataset_is_snapshot(ds) || !dsl_dataset_is_snapshot(fromds))
+		return (EINVAL);
+
+	/* fromsnap must be earlier and from the same lineage as tosnap */
+	if (fromds->ds_phys->ds_creation_txg >= ds->ds_phys->ds_creation_txg)
+		return (EXDEV);
+
+	relds = NULL;
+	findds = ds;
+
+	while (fromds->ds_dir != findds->ds_dir) {
+		dsl_pool_t *dp = ds->ds_dir->dd_pool;
+
+		if (!dsl_dir_is_clone(findds->ds_dir)) {
+			if (relds)
+				dsl_dataset_rele(relds, FTAG);
+			return (EXDEV);
+		}
+
+		rw_enter(&dp->dp_config_rwlock, RW_READER);
+		err = dsl_dataset_hold_obj(dp,
+		    findds->ds_dir->dd_phys->dd_origin_obj, FTAG, &findds);
+		rw_exit(&dp->dp_config_rwlock);
+
+		if (relds)
+			dsl_dataset_rele(relds, FTAG);
+
+		if (err)
+			return (EXDEV);
+
+		relds = findds;
+	}
+
+	if (relds)
+		dsl_dataset_rele(relds, FTAG);
+
+	da.da_vp = vp;
+	da.da_offp = offp;
+	da.da_ddr.ddr_type = DDR_NONE;
+	da.da_ddr.ddr_first = da.da_ddr.ddr_last = 0;
+	da.da_err = 0;
+
+	err = traverse_dataset(ds, fromds->ds_phys->ds_creation_txg,
+	    TRAVERSE_PRE | TRAVERSE_PREFETCH_METADATA, diff_cb, &da);
+
+	if (err) {
+		da.da_err = err;
+	} else {
+		/* write_record() sets da.da_err, which we return below */
+		(void) write_record(&da);
+	}
+
+	return (da.da_err);
+}
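
Note on DBP_SPAN in dmu_diff.c above: the macro gives the number of bytes of the meta-dnode object covered by one block pointer at a given level, and diff_cb() turns that span into a range of object numbers when it meets a hole. What follows is a minimal standalone sketch of that arithmetic and is not part of the commit. The `ex_` names are hypothetical, and the shift constants are assumptions chosen to match the usual ZFS header values (512-byte sectors, 128-byte block pointers, 512-byte dnodes).

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values; the real ones come from the ZFS headers. */
#define	EX_SPA_MINBLOCKSHIFT	9	/* 512-byte sectors */
#define	EX_SPA_BLKPTRSHIFT	7	/* 128-byte block pointers */
#define	EX_DNODE_SHIFT		9	/* 512-byte dnodes */

/* Bytes of the object covered by one block pointer at `level`. */
uint64_t
ex_dbp_span(uint64_t datablkszsec, int indblkshift, int level)
{
	assert(level >= 0 && indblkshift >= EX_SPA_BLKPTRSHIFT);
	return (datablkszsec << (EX_SPA_MINBLOCKSHIFT +
	    level * (indblkshift - EX_SPA_BLKPTRSHIFT)));
}

/* First and last object number covered by a hole at (level, blkid). */
void
ex_hole_to_dnode_range(uint64_t datablkszsec, int indblkshift, int level,
    uint64_t blkid, uint64_t *first, uint64_t *last)
{
	uint64_t span = ex_dbp_span(datablkszsec, indblkshift, level);

	*first = (blkid * span) >> EX_DNODE_SHIFT;
	*last = *first + (span >> EX_DNODE_SHIFT) - 1;
}
```

Assuming 16K meta-dnode data blocks (datablkszsec = 32) and 16K indirect blocks (indblkshift = 14), a level-0 hole at block 3 covers objects 96 through 127, and a level-1 hole at block 1 covers objects 4096 through 8191.
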
--- a/uts/common/fs/zfs/dmu_object.c
+++ b/uts/common/fs/zfs/dmu_object.c
@ -0,0 +1,196 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#include <sys/dmu.h>
+#include <sys/dmu_objset.h>
+#include <sys/dmu_tx.h>
+#include <sys/dnode.h>
+
+uint64_t
+dmu_object_alloc(objset_t *os, dmu_object_type_t ot, int blocksize,
+    dmu_object_type_t bonustype, int bonuslen, dmu_tx_t *tx)
+{
+	uint64_t object;
+	uint64_t L2_dnode_count = DNODES_PER_BLOCK <<
+	    (DMU_META_DNODE(os)->dn_indblkshift - SPA_BLKPTRSHIFT);
+	dnode_t *dn = NULL;
+	int restarted = B_FALSE;
+
+	mutex_enter(&os->os_obj_lock);
+	for (;;) {
+		object = os->os_obj_next;
+		/*
+		 * Each time we polish off an L2 bp worth of dnodes
+		 * (2^13 objects), move to another L2 bp that's still
+		 * reasonably sparse (at most 1/4 full).  Look from the
+		 * beginning once, but after that keep looking from here.
+		 * If we can't find one, just keep going from here.
+		 */
+		if (P2PHASE(object, L2_dnode_count) == 0) {
+			uint64_t offset = restarted ? object << DNODE_SHIFT : 0;
+			int error = dnode_next_offset(DMU_META_DNODE(os),
+			    DNODE_FIND_HOLE,
+			    &offset, 2, DNODES_PER_BLOCK >> 2, 0);
+			restarted = B_TRUE;
+			if (error == 0)
+				object = offset >> DNODE_SHIFT;
+		}
+		os->os_obj_next = ++object;
+
+		/*
+		 * XXX We should check for an i/o error here and return
+		 * up to our caller.  Actually we should pre-read it in
+		 * dmu_tx_assign(), but there is currently no mechanism
+		 * to do so.
+		 */
+		(void) dnode_hold_impl(os, object, DNODE_MUST_BE_FREE,
+		    FTAG, &dn);
+		if (dn)
+			break;
+
+		if (dmu_object_next(os, &object, B_TRUE, 0) == 0)
+			os->os_obj_next = object - 1;
+	}
+
+	dnode_allocate(dn, ot, blocksize, 0, bonustype, bonuslen, tx);
+	dnode_rele(dn, FTAG);
+
+	mutex_exit(&os->os_obj_lock);
+
+	dmu_tx_add_new_object(tx, os, object);
+	return (object);
+}
+
+int
+dmu_object_claim(objset_t *os, uint64_t object, dmu_object_type_t ot,
+    int blocksize, dmu_object_type_t bonustype, int bonuslen, dmu_tx_t *tx)
+{
+	dnode_t *dn;
+	int err;
+
+	if (object == DMU_META_DNODE_OBJECT && !dmu_tx_private_ok(tx))
+		return (EBADF);
+
+	err = dnode_hold_impl(os, object, DNODE_MUST_BE_FREE, FTAG, &dn);
+	if (err)
+		return (err);
+	dnode_allocate(dn, ot, blocksize, 0, bonustype, bonuslen, tx);
+	dnode_rele(dn, FTAG);
+
+	dmu_tx_add_new_object(tx, os, object);
+	return (0);
+}
+
+int
+dmu_object_reclaim(objset_t *os, uint64_t object, dmu_object_type_t ot,
+    int blocksize, dmu_object_type_t bonustype, int bonuslen)
+{
+	dnode_t *dn;
+	dmu_tx_t *tx;
+	int nblkptr;
+	int err;
+
+	if (object == DMU_META_DNODE_OBJECT)
+		return (EBADF);
+
+	err = dnode_hold_impl(os, object, DNODE_MUST_BE_ALLOCATED,
+	    FTAG, &dn);
+	if (err)
+		return (err);
+
+	if (dn->dn_type == ot && dn->dn_datablksz == blocksize &&
+	    dn->dn_bonustype == bonustype && dn->dn_bonuslen == bonuslen) {
+		/* nothing is changing, this is a noop */
+		dnode_rele(dn, FTAG);
+		return (0);
+	}
+
+	if (bonustype == DMU_OT_SA) {
+		nblkptr = 1;
+	} else {
+		nblkptr = 1 + ((DN_MAX_BONUSLEN - bonuslen) >> SPA_BLKPTRSHIFT);
+	}
+
+	/*
+	 * If we are losing blkptrs or changing the block size this must
+	 * be a new file instance.   We must clear out the previous file
+	 * contents before we can change this type of metadata in the dnode.
+	 */
+	if (dn->dn_nblkptr > nblkptr || dn->dn_datablksz != blocksize) {
+		err = dmu_free_long_range(os, object, 0, DMU_OBJECT_END);
+		if (err)
+			goto out;
+	}
+
+	tx = dmu_tx_create(os);
+	dmu_tx_hold_bonus(tx, object);
+	err = dmu_tx_assign(tx, TXG_WAIT);
+	if (err) {
+		dmu_tx_abort(tx);
+		goto out;
+	}
+
+	dnode_reallocate(dn, ot, blocksize, bonustype, bonuslen, tx);
+
+	dmu_tx_commit(tx);
+out:
+	dnode_rele(dn, FTAG);
+
+	return (err);
+}
+
+int
+dmu_object_free(objset_t *os, uint64_t object, dmu_tx_t *tx)
+{
+	dnode_t *dn;
+	int err;
+
+	ASSERT(object != DMU_META_DNODE_OBJECT || dmu_tx_private_ok(tx));
+
+	err = dnode_hold_impl(os, object, DNODE_MUST_BE_ALLOCATED,
+	    FTAG, &dn);
+	if (err)
+		return (err);
+
+	ASSERT(dn->dn_type != DMU_OT_NONE);
+	dnode_free_range(dn, 0, DMU_OBJECT_END, tx);
+	dnode_free(dn, tx);
+	dnode_rele(dn, FTAG);
+
+	return (0);
+}
+
+int
+dmu_object_next(objset_t *os, uint64_t *objectp, boolean_t hole, uint64_t txg)
+{
+	uint64_t offset = (*objectp + 1) << DNODE_SHIFT;
+	int error;
+
+	error = dnode_next_offset(DMU_META_DNODE(os),
+	    (hole ? DNODE_FIND_HOLE : 0), &offset, 0, DNODES_PER_BLOCK, txg);
+
+	*objectp = offset >> DNODE_SHIFT;
+
+	return (error);
+}
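
Note on dmu_object.c above: object numbers are indices into the meta-dnode, so dmu_object_next() and dmu_object_alloc() convert between object numbers and byte offsets by shifting, and dmu_object_alloc() uses P2PHASE to detect when the allocator crosses into a new L2 block pointer. A minimal sketch of those conversions follows. It is not part of the commit; the `ex_` names are hypothetical, and the dnode shift of 9 and the mask form of P2PHASE are assumptions based on the usual header definitions.

```c
#include <assert.h>
#include <stdint.h>

#define	EX_DNODE_SHIFT		9	/* assumed: 512-byte dnodes */
#define	EX_P2PHASE(x, align)	((x) & ((align) - 1))

/* Byte offset of an object's dnode within the meta-dnode. */
uint64_t
ex_object_to_offset(uint64_t object)
{
	return (object << EX_DNODE_SHIFT);
}

/* Object number whose dnode contains the given byte offset. */
uint64_t
ex_offset_to_object(uint64_t offset)
{
	return (offset >> EX_DNODE_SHIFT);
}

/* Offset at which a dmu_object_next()-style search begins. */
uint64_t
ex_next_search_offset(uint64_t object)
{
	return ((object + 1) << EX_DNODE_SHIFT);
}

/* Nonzero when `object` is the first dnode under a new L2 pointer. */
int
ex_at_l2_boundary(uint64_t object, uint64_t l2_dnode_count)
{
	/* The mask trick only works for a power-of-two count. */
	assert(l2_dnode_count != 0 &&
	    (l2_dnode_count & (l2_dnode_count - 1)) == 0);
	return (EX_P2PHASE(object, l2_dnode_count) == 0);
}
```

For example, object 5 sits at byte offset 2560, and a search for the object after it starts at offset 3072.
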
--- a/uts/common/fs/zfs/dmu_objset.c
+++ b/uts/common/fs/zfs/dmu_objset.c
--- a/uts/common/fs/zfs/dmu_send.c
+++ b/uts/common/fs/zfs/dmu_send.c
--- a/uts/common/fs/zfs/dmu_traverse.c
+++ b/uts/common/fs/zfs/dmu_traverse.c
@ -0,0 +1,482 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#include <sys/zfs_context.h>
+#include <sys/dmu_objset.h>
+#include <sys/dmu_traverse.h>
+#include <sys/dsl_dataset.h>
+#include <sys/dsl_dir.h>
+#include <sys/dsl_pool.h>
+#include <sys/dnode.h>
+#include <sys/spa.h>
+#include <sys/zio.h>
+#include <sys/dmu_impl.h>
+#include <sys/sa.h>
+#include <sys/sa_impl.h>
+#include <sys/callb.h>
+
+int zfs_pd_blks_max = 100;
+
+typedef struct prefetch_data {
+	kmutex_t pd_mtx;
+	kcondvar_t pd_cv;
+	int pd_blks_max;
+	int pd_blks_fetched;
+	int pd_flags;
+	boolean_t pd_cancel;
+	boolean_t pd_exited;
+} prefetch_data_t;
+
+typedef struct traverse_data {
+	spa_t *td_spa;
+	uint64_t td_objset;
+	blkptr_t *td_rootbp;
+	uint64_t td_min_txg;
+	int td_flags;
+	prefetch_data_t *td_pfd;
+	blkptr_cb_t *td_func;
+	void *td_arg;
+} traverse_data_t;
+
+static int traverse_dnode(traverse_data_t *td, const dnode_phys_t *dnp,
+    arc_buf_t *buf, uint64_t objset, uint64_t object);
+
+static int
+traverse_zil_block(zilog_t *zilog, blkptr_t *bp, void *arg, uint64_t claim_txg)
+{
+	traverse_data_t *td = arg;
+	zbookmark_t zb;
+
+	if (bp->blk_birth == 0)
+		return (0);
+
+	if (claim_txg == 0 && bp->blk_birth >= spa_first_txg(td->td_spa))
+		return (0);
+
+	SET_BOOKMARK(&zb, td->td_objset, ZB_ZIL_OBJECT, ZB_ZIL_LEVEL,
+	    bp->blk_cksum.zc_word[ZIL_ZC_SEQ]);
+
+	(void) td->td_func(td->td_spa, zilog, bp, NULL, &zb, NULL, td->td_arg);
+
+	return (0);
+}
+
+static int
+traverse_zil_record(zilog_t *zilog, lr_t *lrc, void *arg, uint64_t claim_txg)
+{
+	traverse_data_t *td = arg;
+
+	if (lrc->lrc_txtype == TX_WRITE) {
+		lr_write_t *lr = (lr_write_t *)lrc;
+		blkptr_t *bp = &lr->lr_blkptr;
+		zbookmark_t zb;
+
+		if (bp->blk_birth == 0)
+			return (0);
+
+		if (claim_txg == 0 || bp->blk_birth < claim_txg)
+			return (0);
+
+		SET_BOOKMARK(&zb, td->td_objset, lr->lr_foid,
+		    ZB_ZIL_LEVEL, lr->lr_offset / BP_GET_LSIZE(bp));
+
+		(void) td->td_func(td->td_spa, zilog, bp, NULL, &zb, NULL,
+		    td->td_arg);
+	}
+	return (0);
+}
+
+static void
+traverse_zil(traverse_data_t *td, zil_header_t *zh)
+{
+	uint64_t claim_txg = zh->zh_claim_txg;
+	zilog_t *zilog;
+
+	/*
+	 * We only want to visit blocks that have been claimed but not yet
+	 * replayed; plus, in read-only mode, blocks that are already stable.
+	 */
+	if (claim_txg == 0 && spa_writeable(td->td_spa))
+		return;
+
+	zilog = zil_alloc(spa_get_dsl(td->td_spa)->dp_meta_objset, zh);
+
+	(void) zil_parse(zilog, traverse_zil_block, traverse_zil_record, td,
+	    claim_txg);
+
+	zil_free(zilog);
+}
+
+static int
+traverse_visitbp(traverse_data_t *td, const dnode_phys_t *dnp,
+    arc_buf_t *pbuf, blkptr_t *bp, const zbookmark_t *zb)
+{
+	zbookmark_t czb;
+	int err = 0, lasterr = 0;
+	arc_buf_t *buf = NULL;
+	prefetch_data_t *pd = td->td_pfd;
+	boolean_t hard = td->td_flags & TRAVERSE_HARD;
+
+	if (bp->blk_birth == 0) {
+		err = td->td_func(td->td_spa, NULL, NULL, pbuf, zb, dnp,
+		    td->td_arg);
+		return (err);
+	}
+
+	if (bp->blk_birth <= td->td_min_txg)
+		return (0);
+
+	if (pd && !pd->pd_exited &&
+	    ((pd->pd_flags & TRAVERSE_PREFETCH_DATA) ||
+	    BP_GET_TYPE(bp) == DMU_OT_DNODE || BP_GET_LEVEL(bp) > 0)) {
+		mutex_enter(&pd->pd_mtx);
+		ASSERT(pd->pd_blks_fetched >= 0);
+		while (pd->pd_blks_fetched == 0 && !pd->pd_exited)
+			cv_wait(&pd->pd_cv, &pd->pd_mtx);
+		pd->pd_blks_fetched--;
+		cv_broadcast(&pd->pd_cv);
+		mutex_exit(&pd->pd_mtx);
+	}
+
+	if (td->td_flags & TRAVERSE_PRE) {
+		err = td->td_func(td->td_spa, NULL, bp, pbuf, zb, dnp,
+		    td->td_arg);
+		if (err == TRAVERSE_VISIT_NO_CHILDREN)
+			return (0);
+		if (err)
+			return (err);
+	}
+
+	if (BP_GET_LEVEL(bp) > 0) {
+		uint32_t flags = ARC_WAIT;
+		int i;
+		blkptr_t *cbp;
+		int epb = BP_GET_LSIZE(bp) >> SPA_BLKPTRSHIFT;
+
+		err = dsl_read(NULL, td->td_spa, bp, pbuf,
+		    arc_getbuf_func, &buf,
+		    ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_CANFAIL, &flags, zb);
+		if (err)
+			return (err);
+
+		/* recursively visitbp() blocks below this */
+		cbp = buf->b_data;
+		for (i = 0; i < epb; i++, cbp++) {
+			SET_BOOKMARK(&czb, zb->zb_objset, zb->zb_object,
+			    zb->zb_level - 1,
+			    zb->zb_blkid * epb + i);
+			err = traverse_visitbp(td, dnp, buf, cbp, &czb);
+			if (err) {
+				if (!hard)
+					break;
+				lasterr = err;
+			}
+		}
+	} else if (BP_GET_TYPE(bp) == DMU_OT_DNODE) {
+		uint32_t flags = ARC_WAIT;
+		int i;
+		int epb = BP_GET_LSIZE(bp) >> DNODE_SHIFT;
+
+		err = dsl_read(NULL, td->td_spa, bp, pbuf,
+		    arc_getbuf_func, &buf,
+		    ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_CANFAIL, &flags, zb);
+		if (err)
+			return (err);
+
+		/* recursively visitbp() blocks below this */
+		dnp = buf->b_data;
+		for (i = 0; i < epb; i++, dnp++) {
+			err = traverse_dnode(td, dnp, buf, zb->zb_objset,
+			    zb->zb_blkid * epb + i);
+			if (err) {
+				if (!hard)
+					break;
+				lasterr = err;
+			}
+		}
+	} else if (BP_GET_TYPE(bp) == DMU_OT_OBJSET) {
+		uint32_t flags = ARC_WAIT;
+		objset_phys_t *osp;
+		dnode_phys_t *dnp;
+
+		err = dsl_read_nolock(NULL, td->td_spa, bp,
+		    arc_getbuf_func, &buf,
+		    ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_CANFAIL, &flags, zb);
+		if (err)
+			return (err);
+
+		osp = buf->b_data;
+		dnp = &osp->os_meta_dnode;
+		err = traverse_dnode(td, dnp, buf, zb->zb_objset,
+		    DMU_META_DNODE_OBJECT);
+		if (err && hard) {
+			lasterr = err;
+			err = 0;
+		}
+		if (err == 0 && arc_buf_size(buf) >= sizeof (objset_phys_t)) {
+			dnp = &osp->os_userused_dnode;
+			err = traverse_dnode(td, dnp, buf, zb->zb_objset,
+			    DMU_USERUSED_OBJECT);
+		}
+		if (err && hard) {
+			lasterr = err;
+			err = 0;
+		}
+		if (err == 0 && arc_buf_size(buf) >= sizeof (objset_phys_t)) {
+			dnp = &osp->os_groupused_dnode;
+			err = traverse_dnode(td, dnp, buf, zb->zb_objset,
+			    DMU_GROUPUSED_OBJECT);
+		}
+	}
+
+	if (buf)
+		(void) arc_buf_remove_ref(buf, &buf);
+
+	if (err == 0 && lasterr == 0 && (td->td_flags & TRAVERSE_POST)) {
+		err = td->td_func(td->td_spa, NULL, bp, pbuf, zb, dnp,
+		    td->td_arg);
+	}
+
+	return (err != 0 ? err : lasterr);
+}
+
+static int
+traverse_dnode(traverse_data_t *td, const dnode_phys_t *dnp,
+    arc_buf_t *buf, uint64_t objset, uint64_t object)
+{
+	int j, err = 0, lasterr = 0;
+	zbookmark_t czb;
+	boolean_t hard = (td->td_flags & TRAVERSE_HARD);
+
+	for (j = 0; j < dnp->dn_nblkptr; j++) {
+		SET_BOOKMARK(&czb, objset, object, dnp->dn_nlevels - 1, j);
+		err = traverse_visitbp(td, dnp, buf,
+		    (blkptr_t *)&dnp->dn_blkptr[j], &czb);
+		if (err) {
+			if (!hard)
+				break;
+			lasterr = err;
+		}
+	}
+
+	if (dnp->dn_flags & DNODE_FLAG_SPILL_BLKPTR) {
+		SET_BOOKMARK(&czb, objset,
+		    object, 0, DMU_SPILL_BLKID);
+		err = traverse_visitbp(td, dnp, buf,
+		    (blkptr_t *)&dnp->dn_spill, &czb);
+		if (err) {
+			if (!hard)
+				return (err);
+			lasterr = err;
+		}
+	}
+	return (err != 0 ? err : lasterr);
+}
+
+/* ARGSUSED */
+static int
+traverse_prefetcher(spa_t *spa, zilog_t *zilog, const blkptr_t *bp,
+    arc_buf_t *pbuf, const zbookmark_t *zb, const dnode_phys_t *dnp,
+    void *arg)
+{
+	prefetch_data_t *pfd = arg;
+	uint32_t aflags = ARC_NOWAIT | ARC_PREFETCH;
+
+	ASSERT(pfd->pd_blks_fetched >= 0);
+	if (pfd->pd_cancel)
+		return (EINTR);
+
+	if (bp == NULL || !((pfd->pd_flags & TRAVERSE_PREFETCH_DATA) ||
+	    BP_GET_TYPE(bp) == DMU_OT_DNODE || BP_GET_LEVEL(bp) > 0) ||
+	    BP_GET_TYPE(bp) == DMU_OT_INTENT_LOG)
+		return (0);
+
+	mutex_enter(&pfd->pd_mtx);
+	while (!pfd->pd_cancel && pfd->pd_blks_fetched >= pfd->pd_blks_max)
+		cv_wait(&pfd->pd_cv, &pfd->pd_mtx);
+	pfd->pd_blks_fetched++;
+	cv_broadcast(&pfd->pd_cv);
+	mutex_exit(&pfd->pd_mtx);
+
+	(void) dsl_read(NULL, spa, bp, pbuf, NULL, NULL,
+	    ZIO_PRIORITY_ASYNC_READ,
+	    ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE,
+	    &aflags, zb);
+
+	return (0);
+}
+
+static void
+traverse_prefetch_thread(void *arg)
+{
+	traverse_data_t *td_main = arg;
+	traverse_data_t td = *td_main;
+	zbookmark_t czb;
+
+	td.td_func = traverse_prefetcher;
+	td.td_arg = td_main->td_pfd;
+	td.td_pfd = NULL;
+
+	SET_BOOKMARK(&czb, td.td_objset,
+	    ZB_ROOT_OBJECT, ZB_ROOT_LEVEL, ZB_ROOT_BLKID);
+	(void) traverse_visitbp(&td, NULL, NULL, td.td_rootbp, &czb);
+
+	mutex_enter(&td_main->td_pfd->pd_mtx);
+	td_main->td_pfd->pd_exited = B_TRUE;
+	cv_broadcast(&td_main->td_pfd->pd_cv);
+	mutex_exit(&td_main->td_pfd->pd_mtx);
+}
+
+/*
+ * NB: dataset must not be changing on-disk (eg, is a snapshot or we are
+ * in syncing context).
+ */
+static int
+traverse_impl(spa_t *spa, dsl_dataset_t *ds, blkptr_t *rootbp,
+    uint64_t txg_start, int flags, blkptr_cb_t func, void *arg)
+{
+	traverse_data_t td;
+	prefetch_data_t pd = { 0 };
+	zbookmark_t czb;
+	int err;
+
+	td.td_spa = spa;
+	td.td_objset = ds ? ds->ds_object : 0;
+	td.td_rootbp = rootbp;
+	td.td_min_txg = txg_start;
+	td.td_func = func;
+	td.td_arg = arg;
+	td.td_pfd = &pd;
+	td.td_flags = flags;
+
+	pd.pd_blks_max = zfs_pd_blks_max;
+	pd.pd_flags = flags;
+	mutex_init(&pd.pd_mtx, NULL, MUTEX_DEFAULT, NULL);
+	cv_init(&pd.pd_cv, NULL, CV_DEFAULT, NULL);
+
+	/* See comment on ZIL traversal in dsl_scan_visitds. */
+	if (ds != NULL && !dsl_dataset_is_snapshot(ds)) {
+		objset_t *os;
+
+		err = dmu_objset_from_ds(ds, &os);
+		if (err)
+			return (err);
+
+		traverse_zil(&td, &os->os_zil_header);
+	}
+
+	if (!(flags & TRAVERSE_PREFETCH) ||
+	    0 == taskq_dispatch(system_taskq, traverse_prefetch_thread,
+	    &td, TQ_NOQUEUE))
+		pd.pd_exited = B_TRUE;
+
+	SET_BOOKMARK(&czb, td.td_objset,
+	    ZB_ROOT_OBJECT, ZB_ROOT_LEVEL, ZB_ROOT_BLKID);
+	err = traverse_visitbp(&td, NULL, NULL, rootbp, &czb);
+
+	mutex_enter(&pd.pd_mtx);
+	pd.pd_cancel = B_TRUE;
+	cv_broadcast(&pd.pd_cv);
+	while (!pd.pd_exited)
+		cv_wait(&pd.pd_cv, &pd.pd_mtx);
+	mutex_exit(&pd.pd_mtx);
+
+	mutex_destroy(&pd.pd_mtx);
+	cv_destroy(&pd.pd_cv);
+
+	return (err);
+}
+
+/*
+ * NB: dataset must not be changing on-disk (eg, is a snapshot or we are
+ * in syncing context).
+ */
+int
+traverse_dataset(dsl_dataset_t *ds, uint64_t txg_start, int flags,
+    blkptr_cb_t func, void *arg)
+{
+	return (traverse_impl(ds->ds_dir->dd_pool->dp_spa, ds,
+	    &ds->ds_phys->ds_bp, txg_start, flags, func, arg));
+}
+
+/*
+ * NB: pool must not be changing on-disk (eg, from zdb or sync context).
+ */
+int
+traverse_pool(spa_t *spa, uint64_t txg_start, int flags,
+    blkptr_cb_t func, void *arg)
+{
+	int err, lasterr = 0;
+	uint64_t obj;
+	dsl_pool_t *dp = spa_get_dsl(spa);
+	objset_t *mos = dp->dp_meta_objset;
+	boolean_t hard = (flags & TRAVERSE_HARD);
+
+	/* visit the MOS */
+	err = traverse_impl(spa, NULL, spa_get_rootblkptr(spa),
+	    txg_start, flags, func, arg);
+	if (err)
+		return (err);
+
+	/* visit each dataset */
+	for (obj = 1; err == 0 || (err != ESRCH && hard);
+	    err = dmu_object_next(mos, &obj, FALSE, txg_start)) {
+		dmu_object_info_t doi;
+
+		err = dmu_object_info(mos, obj, &doi);
+		if (err) {
+			if (!hard)
+				return (err);
+			lasterr = err;
+			continue;
+		}
+
+		if (doi.doi_type == DMU_OT_DSL_DATASET) {
+			dsl_dataset_t *ds;
+			uint64_t txg = txg_start;
+
+			rw_enter(&dp->dp_config_rwlock, RW_READER);
+			err = dsl_dataset_hold_obj(dp, obj, FTAG, &ds);
+			rw_exit(&dp->dp_config_rwlock);
+			if (err) {
+				if (!hard)
+					return (err);
+				lasterr = err;
+				continue;
+			}
+			if (ds->ds_phys->ds_prev_snap_txg > txg)
+				txg = ds->ds_phys->ds_prev_snap_txg;
+			err = traverse_dataset(ds, txg, flags, func, arg);
+			dsl_dataset_rele(ds, FTAG);
+			if (err) {
+				if (!hard)
+					return (err);
+				lasterr = err;
+			}
+		}
+	}
+	if (err == ESRCH)
+		err = 0;
+	return (err != 0 ? err : lasterr);
+}
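
Note on dmu_traverse.c above: traverse_visitbp(), traverse_dnode() and traverse_pool() share one error-handling pattern. Without TRAVERSE_HARD the loop stops at the first error; with it, the loop keeps going and reports the last error seen. The sketch below isolates that pattern. It is not part of the commit, and every `ex_` name (including the toy callback that fails on negative items) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

int ex_visits;	/* callback invocations, kept only for illustration */

/* Toy callback: a negative item "fails" with its absolute value. */
int
ex_visit(int item)
{
	ex_visits++;
	return (item < 0 ? -item : 0);
}

/*
 * Visit every item.  If `hard` is zero, stop at the first error.
 * Otherwise continue, remembering the most recent error.
 */
int
ex_visit_all(const int *items, size_t n, int hard)
{
	int err = 0, lasterr = 0;
	size_t i;

	assert(items != NULL || n == 0);
	for (i = 0; i < n; i++) {
		err = ex_visit(items[i]);
		if (err) {
			if (!hard)
				break;
			lasterr = err;
		}
	}
	/* Same final expression as the traversal code uses. */
	return (err != 0 ? err : lasterr);
}
```

With items { 1, -5, 2 }, both modes return 5, but the non-hard call makes two visits while the hard call makes all three.
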
--- a/uts/common/fs/zfs/dmu_tx.c
+++ b/uts/common/fs/zfs/dmu_tx.c
--- a/uts/common/fs/zfs/dmu_zfetch.c
+++ b/uts/common/fs/zfs/dmu_zfetch.c
@ -0,0 +1,724 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
+ * Use is subject to license terms.
+ */
+
+#include <sys/zfs_context.h>
+#include <sys/dnode.h>
+#include <sys/dmu_objset.h>
+#include <sys/dmu_zfetch.h>
+#include <sys/dmu.h>
+#include <sys/dbuf.h>
+#include <sys/kstat.h>
+
+/*
+ * I'm against tune-ables, but these should probably exist as tweakable globals
+ * until we can get this working the way we want it to.
+ */
+
+int zfs_prefetch_disable = 0;
+
+/* max # of streams per zfetch */
+uint32_t	zfetch_max_streams = 8;
+/* min time before stream reclaim */
+uint32_t	zfetch_min_sec_reap = 2;
+/* max number of blocks to fetch at a time */
+uint32_t	zfetch_block_cap = 256;
+/* number of bytes in an array_read at which we stop prefetching (1MB) */
+uint64_t	zfetch_array_rd_sz = 1024 * 1024;
+
+/* forward decls for static routines */
+static int		dmu_zfetch_colinear(zfetch_t *, zstream_t *);
+static void		dmu_zfetch_dofetch(zfetch_t *, zstream_t *);
+static uint64_t		dmu_zfetch_fetch(dnode_t *, uint64_t, uint64_t);
+static uint64_t		dmu_zfetch_fetchsz(dnode_t *, uint64_t, uint64_t);
+static int		dmu_zfetch_find(zfetch_t *, zstream_t *, int);
+static int		dmu_zfetch_stream_insert(zfetch_t *, zstream_t *);
+static zstream_t	*dmu_zfetch_stream_reclaim(zfetch_t *);
+static void		dmu_zfetch_stream_remove(zfetch_t *, zstream_t *);
+static int		dmu_zfetch_streams_equal(zstream_t *, zstream_t *);
+
+typedef struct zfetch_stats {
+	kstat_named_t zfetchstat_hits;
+	kstat_named_t zfetchstat_misses;
+	kstat_named_t zfetchstat_colinear_hits;
+	kstat_named_t zfetchstat_colinear_misses;
+	kstat_named_t zfetchstat_stride_hits;
+	kstat_named_t zfetchstat_stride_misses;
+	kstat_named_t zfetchstat_reclaim_successes;
+	kstat_named_t zfetchstat_reclaim_failures;
+	kstat_named_t zfetchstat_stream_resets;
+	kstat_named_t zfetchstat_stream_noresets;
+	kstat_named_t zfetchstat_bogus_streams;
+} zfetch_stats_t;
+
+static zfetch_stats_t zfetch_stats = {
+	{ "hits",			KSTAT_DATA_UINT64 },
+	{ "misses",			KSTAT_DATA_UINT64 },
+	{ "colinear_hits",		KSTAT_DATA_UINT64 },
+	{ "colinear_misses",		KSTAT_DATA_UINT64 },
+	{ "stride_hits",		KSTAT_DATA_UINT64 },
+	{ "stride_misses",		KSTAT_DATA_UINT64 },
+	{ "reclaim_successes",		KSTAT_DATA_UINT64 },
+	{ "reclaim_failures",		KSTAT_DATA_UINT64 },
+	{ "streams_resets",		KSTAT_DATA_UINT64 },
+	{ "streams_noresets",		KSTAT_DATA_UINT64 },
+	{ "bogus_streams",		KSTAT_DATA_UINT64 },
+};
+
+#define	ZFETCHSTAT_INCR(stat, val) \
+	atomic_add_64(&zfetch_stats.stat.value.ui64, (val));
+
+#define	ZFETCHSTAT_BUMP(stat)		ZFETCHSTAT_INCR(stat, 1);
+
+kstat_t		*zfetch_ksp;
+
+/*
+ * Given a zfetch structure and a zstream structure, determine whether the
+ * blocks to be read are part of a co-linear pair of existing prefetch
+ * streams.  If a set is found, coalesce the streams, removing one, and
+ * configure the prefetch so it looks for a strided access pattern.
+ *
+ * In other words: if we find two sequential access streams that are
+ * the same length and distance N apart, and this read is N from the
+ * last stream, then we are probably in a strided access pattern.  So
+ * combine the two sequential streams into a single strided stream.
+ *
+ * If no co-linear streams are found, return 0 (1 if a pair was coalesced).
+ */
+static int
+dmu_zfetch_colinear(zfetch_t *zf, zstream_t *zh)
+{
+	zstream_t	*z_walk;
+	zstream_t	*z_comp;
+
+	if (! rw_tryenter(&zf->zf_rwlock, RW_WRITER))
+		return (0);
+
+	if (zh == NULL) {
+		rw_exit(&zf->zf_rwlock);
+		return (0);
+	}
+
+	for (z_walk = list_head(&zf->zf_stream); z_walk;
+	    z_walk = list_next(&zf->zf_stream, z_walk)) {
+		for (z_comp = list_next(&zf->zf_stream, z_walk); z_comp;
+		    z_comp = list_next(&zf->zf_stream, z_comp)) {
+			int64_t		diff;
+
+			if (z_walk->zst_len != z_walk->zst_stride ||
+			    z_comp->zst_len != z_comp->zst_stride) {
+				continue;
+			}
+
+			diff = z_comp->zst_offset - z_walk->zst_offset;
+			if (z_comp->zst_offset + diff == zh->zst_offset) {
+				z_walk->zst_offset = zh->zst_offset;
+				z_walk->zst_direction = diff < 0 ? -1 : 1;
+				z_walk->zst_stride =
+				    diff * z_walk->zst_direction;
+				z_walk->zst_ph_offset =
+				    zh->zst_offset + z_walk->zst_stride;
+				dmu_zfetch_stream_remove(zf, z_comp);
+				mutex_destroy(&z_comp->zst_lock);
+				kmem_free(z_comp, sizeof (zstream_t));
+
+				dmu_zfetch_dofetch(zf, z_walk);
+
+				rw_exit(&zf->zf_rwlock);
+				return (1);
+			}
+
+			diff = z_walk->zst_offset - z_comp->zst_offset;
+			if (z_walk->zst_offset + diff == zh->zst_offset) {
+				z_walk->zst_offset = zh->zst_offset;
+				z_walk->zst_direction = diff < 0 ? -1 : 1;
+				z_walk->zst_stride =
+				    diff * z_walk->zst_direction;
+				z_walk->zst_ph_offset =
+				    zh->zst_offset + z_walk->zst_stride;
+				dmu_zfetch_stream_remove(zf, z_comp);
+				mutex_destroy(&z_comp->zst_lock);
+				kmem_free(z_comp, sizeof (zstream_t));
+
+				dmu_zfetch_dofetch(zf, z_walk);
+
+				rw_exit(&zf->zf_rwlock);
+				return (1);
+			}
+		}
+	}
+
+	rw_exit(&zf->zf_rwlock);
+	return (0);
+}
+
+/*
+ * Given a zstream_t, determine the bounds of the prefetch.  Then call the
+ * routine that actually prefetches the individual blocks.
+ */
+static void
+dmu_zfetch_dofetch(zfetch_t *zf, zstream_t *zs)
+{
+	uint64_t	prefetch_tail;
+	uint64_t	prefetch_limit;
+	uint64_t	prefetch_ofst;
+	uint64_t	prefetch_len;
+	uint64_t	blocks_fetched;
+
+	zs->zst_stride = MAX((int64_t)zs->zst_stride, zs->zst_len);
+	zs->zst_cap = MIN(zfetch_block_cap, 2 * zs->zst_cap);
+
+	prefetch_tail = MAX((int64_t)zs->zst_ph_offset,
+	    (int64_t)(zs->zst_offset + zs->zst_stride));
+	/*
+	 * XXX: use a faster division method?
+	 */
+	prefetch_limit = zs->zst_offset + zs->zst_len +
+	    (zs->zst_cap * zs->zst_stride) / zs->zst_len;
+
+	while (prefetch_tail < prefetch_limit) {
+		prefetch_ofst = zs->zst_offset + zs->zst_direction *
+		    (prefetch_tail - zs->zst_offset);
+
+		prefetch_len = zs->zst_len;
+
+		/*
+		 * Don't prefetch beyond the end of the file, if working
+		 * backwards.
+		 */
+		if ((zs->zst_direction == ZFETCH_BACKWARD) &&
+		    (prefetch_ofst > prefetch_tail)) {
+			prefetch_len += prefetch_ofst;
+			prefetch_ofst = 0;
+		}
+
+		/* don't prefetch more than we're supposed to */
+		if (prefetch_len > zs->zst_len)
+			break;
+
+		blocks_fetched = dmu_zfetch_fetch(zf->zf_dnode,
+		    prefetch_ofst, zs->zst_len);
+
+		prefetch_tail += zs->zst_stride;
+		/* stop if we've run out of stuff to prefetch */
+		if (blocks_fetched < zs->zst_len)
+			break;
+	}
+	zs->zst_ph_offset = prefetch_tail;
+	zs->zst_last = ddi_get_lbolt();
+}
+
+void
+zfetch_init(void)
+{
+
+	zfetch_ksp = kstat_create("zfs", 0, "zfetchstats", "misc",
+	    KSTAT_TYPE_NAMED, sizeof (zfetch_stats) / sizeof (kstat_named_t),
+	    KSTAT_FLAG_VIRTUAL);
+
+	if (zfetch_ksp != NULL) {
+		zfetch_ksp->ks_data = &zfetch_stats;
+		kstat_install(zfetch_ksp);
+	}
+}
+
+void
+zfetch_fini(void)
+{
+	if (zfetch_ksp != NULL) {
+		kstat_delete(zfetch_ksp);
+		zfetch_ksp = NULL;
+	}
+}
+
+/*
+ * This takes a pointer to a zfetch structure and a dnode.  It performs the
+ * necessary setup for the zfetch structure, grokking data from the
+ * associated dnode.
+ */
+void
+dmu_zfetch_init(zfetch_t *zf, dnode_t *dno)
+{
+	if (zf == NULL) {
+		return;
+	}
+
+	zf->zf_dnode = dno;
+	zf->zf_stream_cnt = 0;
+	zf->zf_alloc_fail = 0;
+
+	list_create(&zf->zf_stream, sizeof (zstream_t),
+	    offsetof(zstream_t, zst_node));
+
+	rw_init(&zf->zf_rwlock, NULL, RW_DEFAULT, NULL);
+}
+
+/*
+ * This function computes the actual size, in blocks, that can be prefetched,
+ * and fetches it.
+ */
+static uint64_t
+dmu_zfetch_fetch(dnode_t *dn, uint64_t blkid, uint64_t nblks)
+{
+	uint64_t	fetchsz;
+	uint64_t	i;
+
+	fetchsz = dmu_zfetch_fetchsz(dn, blkid, nblks);
+
+	for (i = 0; i < fetchsz; i++) {
+		dbuf_prefetch(dn, blkid + i);
+	}
+
+	return (fetchsz);
+}
+
+/*
+ * This function returns the number of blocks that would be prefetched, based
+ * upon the supplied dnode, blockid, and nblks.  This is used so that we can
+ * update streams in place, and then prefetch with their old value after the
+ * fact.  This way, we can delay the prefetch, but subsequent accesses to the
+ * stream won't result in the same data being prefetched multiple times.
+ */
+static uint64_t
+dmu_zfetch_fetchsz(dnode_t *dn, uint64_t blkid, uint64_t nblks)
+{
+	uint64_t	fetchsz;
+
+	if (blkid > dn->dn_maxblkid) {
+		return (0);
+	}
+
+	/* compute fetch size */
+	if (blkid + nblks + 1 > dn->dn_maxblkid) {
+		fetchsz = (dn->dn_maxblkid - blkid) + 1;
+		ASSERT(blkid + fetchsz - 1 <= dn->dn_maxblkid);
+	} else {
+		fetchsz = nblks;
+	}
+
+
+	return (fetchsz);
+}
+
+/*
+ * Given a zfetch and a zstream structure, see if there is an associated zstream
+ * for this block read.  If so, it starts a prefetch for the stream it
+ * located and returns true; otherwise it returns false.
+ */
+static int
+dmu_zfetch_find(zfetch_t *zf, zstream_t *zh, int prefetched)
+{
+	zstream_t	*zs;
+	int64_t		diff;
+	int		reset = !prefetched;
+	int		rc = 0;
+
+	if (zh == NULL)
+		return (0);
+
+	/*
+	 * XXX: This locking strategy is a bit coarse; however, its impact has
+	 * yet to be tested.  If this turns out to be an issue, it can be
+	 * modified in a number of different ways.
+	 */
+
+	rw_enter(&zf->zf_rwlock, RW_READER);
+top:
+
+	for (zs = list_head(&zf->zf_stream); zs;
+	    zs = list_next(&zf->zf_stream, zs)) {
+
+		/*
+		 * XXX - should this be an assert?
+		 */
+		if (zs->zst_len == 0) {
+			/* bogus stream */
+			ZFETCHSTAT_BUMP(zfetchstat_bogus_streams);
+			continue;
+		}
+
+		/*
+		 * We hit this case when we are in a strided prefetch stream:
+		 * we will read "len" blocks before "striding".
+		 */
+		if (zh->zst_offset >= zs->zst_offset &&
+		    zh->zst_offset < zs->zst_offset + zs->zst_len) {
+			if (prefetched) {
+				/* already fetched */
+				ZFETCHSTAT_BUMP(zfetchstat_stride_hits);
+				rc = 1;
+				goto out;
+			} else {
+				ZFETCHSTAT_BUMP(zfetchstat_stride_misses);
+			}
+		}
+
+		/*
+		 * This is the forward sequential read case: we increment
+		 * len by one each time we hit here, so we will enter this
+		 * case on every read.
+		 */
+		if (zh->zst_offset == zs->zst_offset + zs->zst_len) {
+
+			reset = !prefetched && zs->zst_len > 1;
+
+			mutex_enter(&zs->zst_lock);
+
+			if (zh->zst_offset != zs->zst_offset + zs->zst_len) {
+				mutex_exit(&zs->zst_lock);
+				goto top;
+			}
+			zs->zst_len += zh->zst_len;
+			diff = zs->zst_len - zfetch_block_cap;
+			if (diff > 0) {
+				zs->zst_offset += diff;
+				zs->zst_len = zs->zst_len > diff ?
+				    zs->zst_len - diff : 0;
+			}
+			zs->zst_direction = ZFETCH_FORWARD;
+
+			break;
+
+		/*
+		 * Same as above, but reading backwards through the file.
+		 */
+		} else if (zh->zst_offset == zs->zst_offset - zh->zst_len) {
+			/* backwards sequential access */
+
+			reset = !prefetched && zs->zst_len > 1;
+
+			mutex_enter(&zs->zst_lock);
+
+			if (zh->zst_offset != zs->zst_offset - zh->zst_len) {
+				mutex_exit(&zs->zst_lock);
+				goto top;
+			}
+
+			zs->zst_offset = zs->zst_offset > zh->zst_len ?
+			    zs->zst_offset - zh->zst_len : 0;
+			zs->zst_ph_offset = zs->zst_ph_offset > zh->zst_len ?
+			    zs->zst_ph_offset - zh->zst_len : 0;
+			zs->zst_len += zh->zst_len;
+
+			diff = zs->zst_len - zfetch_block_cap;
+			if (diff > 0) {
+				zs->zst_ph_offset = zs->zst_ph_offset > diff ?
+				    zs->zst_ph_offset - diff : 0;
+				zs->zst_len = zs->zst_len > diff ?
+				    zs->zst_len - diff : zs->zst_len;
+			}
+			zs->zst_direction = ZFETCH_BACKWARD;
+
+			break;
+
+		} else if ((zh->zst_offset - zs->zst_offset - zs->zst_stride <
+		    zs->zst_len) && (zs->zst_len != zs->zst_stride)) {
+			/* strided forward access */
+
+			mutex_enter(&zs->zst_lock);
+
+			if ((zh->zst_offset - zs->zst_offset - zs->zst_stride >=
+			    zs->zst_len) || (zs->zst_len == zs->zst_stride)) {
+				mutex_exit(&zs->zst_lock);
+				goto top;
+			}
+
+			zs->zst_offset += zs->zst_stride;
+			zs->zst_direction = ZFETCH_FORWARD;
+
+			break;
+
+		} else if ((zh->zst_offset - zs->zst_offset + zs->zst_stride <
+		    zs->zst_len) && (zs->zst_len != zs->zst_stride)) {
+			/* strided reverse access */
+
+			mutex_enter(&zs->zst_lock);
+
+			if ((zh->zst_offset - zs->zst_offset + zs->zst_stride >=
+			    zs->zst_len) || (zs->zst_len == zs->zst_stride)) {
+				mutex_exit(&zs->zst_lock);
+				goto top;
+			}
+
+			zs->zst_offset = zs->zst_offset > zs->zst_stride ?
+			    zs->zst_offset - zs->zst_stride : 0;
+			zs->zst_ph_offset = (zs->zst_ph_offset >
+			    (2 * zs->zst_stride)) ?
+			    (zs->zst_ph_offset - (2 * zs->zst_stride)) : 0;
+			zs->zst_direction = ZFETCH_BACKWARD;
+
+			break;
+		}
+	}
+
+	if (zs) {
+		if (reset) {
+			zstream_t *remove = zs;
+
+			ZFETCHSTAT_BUMP(zfetchstat_stream_resets);
+			rc = 0;
+			mutex_exit(&zs->zst_lock);
+			rw_exit(&zf->zf_rwlock);
+			rw_enter(&zf->zf_rwlock, RW_WRITER);
+			/*
+			 * Relocate the stream, in case someone removes
+			 * it while we were acquiring the WRITER lock.
+			 */
+			for (zs = list_head(&zf->zf_stream); zs;
+			    zs = list_next(&zf->zf_stream, zs)) {
+				if (zs == remove) {
+					dmu_zfetch_stream_remove(zf, zs);
+					mutex_destroy(&zs->zst_lock);
+					kmem_free(zs, sizeof (zstream_t));
+					break;
+				}
+			}
+		} else {
+			ZFETCHSTAT_BUMP(zfetchstat_stream_noresets);
+			rc = 1;
+			dmu_zfetch_dofetch(zf, zs);
+			mutex_exit(&zs->zst_lock);
+		}
+	}
+out:
+	rw_exit(&zf->zf_rwlock);
+	return (rc);
+}
+
+/*
+ * Clean-up state associated with a zfetch structure.  This frees allocated
+ * structure members, empties the zf_stream list, and generally makes things
+ * nice.  This doesn't free the zfetch_t itself; that's left to the caller.
+ */
+void
+dmu_zfetch_rele(zfetch_t *zf)
+{
+	zstream_t	*zs;
+	zstream_t	*zs_next;
+
+	ASSERT(!RW_LOCK_HELD(&zf->zf_rwlock));
+
+	for (zs = list_head(&zf->zf_stream); zs; zs = zs_next) {
+		zs_next = list_next(&zf->zf_stream, zs);
+
+		list_remove(&zf->zf_stream, zs);
+		mutex_destroy(&zs->zst_lock);
+		kmem_free(zs, sizeof (zstream_t));
+	}
+	list_destroy(&zf->zf_stream);
+	rw_destroy(&zf->zf_rwlock);
+
+	zf->zf_dnode = NULL;
+}
+
+/*
+ * Given a zfetch and zstream structure, insert the zstream structure into the
+ * stream list contained within the zfetch structure.  Perform the appropriate
+ * book-keeping.  It is possible that another thread has inserted a stream which
+ * matches one that we are about to insert, so we must be sure to check for this
+ * case.  If one is found, return failure, and let the caller clean up the
+ * duplicate.
+ */
+static int
+dmu_zfetch_stream_insert(zfetch_t *zf, zstream_t *zs)
+{
+	zstream_t	*zs_walk;
+	zstream_t	*zs_next;
+
+	ASSERT(RW_WRITE_HELD(&zf->zf_rwlock));
+
+	for (zs_walk = list_head(&zf->zf_stream); zs_walk; zs_walk = zs_next) {
+		zs_next = list_next(&zf->zf_stream, zs_walk);
+
+		if (dmu_zfetch_streams_equal(zs_walk, zs)) {
+			return (0);
+		}
+	}
+
+	list_insert_head(&zf->zf_stream, zs);
+	zf->zf_stream_cnt++;
+	return (1);
+}
+
+
+/*
+ * Walk the list of zstreams in the given zfetch, find an old one (by time), and
+ * reclaim it for use by the caller.
+ */
+static zstream_t *
+dmu_zfetch_stream_reclaim(zfetch_t *zf)
+{
+	zstream_t	*zs;
+
+	if (! rw_tryenter(&zf->zf_rwlock, RW_WRITER))
+		return (0);
+
+	for (zs = list_head(&zf->zf_stream); zs;
+	    zs = list_next(&zf->zf_stream, zs)) {
+
+		if (((ddi_get_lbolt() - zs->zst_last)/hz) > zfetch_min_sec_reap)
+			break;
+	}
+
+	if (zs) {
+		dmu_zfetch_stream_remove(zf, zs);
+		mutex_destroy(&zs->zst_lock);
+		bzero(zs, sizeof (zstream_t));
+	} else {
+		zf->zf_alloc_fail++;
+	}
+	rw_exit(&zf->zf_rwlock);
+
+	return (zs);
+}
+
+/*
+ * Given a zfetch and zstream structure, remove the zstream structure from its
+ * container in the zfetch structure.  Perform the appropriate book-keeping.
+ */
+static void
+dmu_zfetch_stream_remove(zfetch_t *zf, zstream_t *zs)
+{
+	ASSERT(RW_WRITE_HELD(&zf->zf_rwlock));
+
+	list_remove(&zf->zf_stream, zs);
+	zf->zf_stream_cnt--;
+}
+
+static int
+dmu_zfetch_streams_equal(zstream_t *zs1, zstream_t *zs2)
+{
+	if (zs1->zst_offset != zs2->zst_offset)
+		return (0);
+
+	if (zs1->zst_len != zs2->zst_len)
+		return (0);
+
+	if (zs1->zst_stride != zs2->zst_stride)
+		return (0);
+
+	if (zs1->zst_ph_offset != zs2->zst_ph_offset)
+		return (0);
+
+	if (zs1->zst_cap != zs2->zst_cap)
+		return (0);
+
+	if (zs1->zst_direction != zs2->zst_direction)
+		return (0);
+
+	return (1);
+}
+
+/*
+ * This is the prefetch entry point.  It calls all of the other dmu_zfetch
+ * routines to create, delete, find, or operate upon prefetch streams.
+ */
+void
+dmu_zfetch(zfetch_t *zf, uint64_t offset, uint64_t size, int prefetched)
+{
+	zstream_t	zst;
+	zstream_t	*newstream;
+	int		fetched;
+	int		inserted;
+	unsigned int	blkshft;
+	uint64_t	blksz;
+
+	if (zfs_prefetch_disable)
+		return;
+
+	/* non-power-of-2 blocksz means a single block -- nothing to do */
+	if (!zf->zf_dnode->dn_datablkshift)
+		return;
+
+	/* convert offset and size into blockid and nblocks */
+	blkshft = zf->zf_dnode->dn_datablkshift;
+	blksz = (1 << blkshft);
+
+	bzero(&zst, sizeof (zstream_t));
+	zst.zst_offset = offset >> blkshft;
+	zst.zst_len = (P2ROUNDUP(offset + size, blksz) -
+	    P2ALIGN(offset, blksz)) >> blkshft;
+
+	fetched = dmu_zfetch_find(zf, &zst, prefetched);
+	if (fetched) {
+		ZFETCHSTAT_BUMP(zfetchstat_hits);
+	} else {
+		ZFETCHSTAT_BUMP(zfetchstat_misses);
+		if ((fetched = dmu_zfetch_colinear(zf, &zst)) != 0) {
+			ZFETCHSTAT_BUMP(zfetchstat_colinear_hits);
+		} else {
+			ZFETCHSTAT_BUMP(zfetchstat_colinear_misses);
+		}
+	}
+
+	if (!fetched) {
+		newstream = dmu_zfetch_stream_reclaim(zf);
+
+		/*
+		 * we still couldn't find a stream, drop the lock, and allocate
+		 * one if possible.  Otherwise, give up and go home.
+		 */
+		if (newstream) {
+			ZFETCHSTAT_BUMP(zfetchstat_reclaim_successes);
+		} else {
+			uint64_t	maxblocks;
+			uint32_t	max_streams;
+			uint32_t	cur_streams;
+
+			ZFETCHSTAT_BUMP(zfetchstat_reclaim_failures);
+			cur_streams = zf->zf_stream_cnt;
+			maxblocks = zf->zf_dnode->dn_maxblkid;
+
+			max_streams = MIN(zfetch_max_streams,
+			    (maxblocks / zfetch_block_cap));
+			if (max_streams == 0) {
+				max_streams++;
+			}
+
+			if (cur_streams >= max_streams) {
+				return;
+			}
+			newstream = kmem_zalloc(sizeof (zstream_t), KM_SLEEP);
+		}
+
+		newstream->zst_offset = zst.zst_offset;
+		newstream->zst_len = zst.zst_len;
+		newstream->zst_stride = zst.zst_len;
+		newstream->zst_ph_offset = zst.zst_len + zst.zst_offset;
+		newstream->zst_cap = zst.zst_len;
+		newstream->zst_direction = ZFETCH_FORWARD;
+		newstream->zst_last = ddi_get_lbolt();
+
+		mutex_init(&newstream->zst_lock, NULL, MUTEX_DEFAULT, NULL);
+
+		rw_enter(&zf->zf_rwlock, RW_WRITER);
+		inserted = dmu_zfetch_stream_insert(zf, newstream);
+		rw_exit(&zf->zf_rwlock);
+
+		if (!inserted) {
+			mutex_destroy(&newstream->zst_lock);
+			kmem_free(newstream, sizeof (zstream_t));
+		}
+	}
+}
--- a/uts/common/fs/zfs/dnode.c
+++ b/uts/common/fs/zfs/dnode.c
--- a/uts/common/fs/zfs/dnode_sync.c
+++ b/uts/common/fs/zfs/dnode_sync.c
@ -0,0 +1,693 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#include <sys/zfs_context.h>
+#include <sys/dbuf.h>
+#include <sys/dnode.h>
+#include <sys/dmu.h>
+#include <sys/dmu_tx.h>
+#include <sys/dmu_objset.h>
+#include <sys/dsl_dataset.h>
+#include <sys/spa.h>
+
+static void
+dnode_increase_indirection(dnode_t *dn, dmu_tx_t *tx)
+{
+	dmu_buf_impl_t *db;
+	int txgoff = tx->tx_txg & TXG_MASK;
+	int nblkptr = dn->dn_phys->dn_nblkptr;
+	int old_toplvl = dn->dn_phys->dn_nlevels - 1;
+	int new_level = dn->dn_next_nlevels[txgoff];
+	int i;
+
+	rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
+
+	/* this dnode can't be paged out because it's dirty */
+	ASSERT(dn->dn_phys->dn_type != DMU_OT_NONE);
+	ASSERT(RW_WRITE_HELD(&dn->dn_struct_rwlock));
+	ASSERT(new_level > 1 && dn->dn_phys->dn_nlevels > 0);
+
+	db = dbuf_hold_level(dn, dn->dn_phys->dn_nlevels, 0, FTAG);
+	ASSERT(db != NULL);
+
+	dn->dn_phys->dn_nlevels = new_level;
+	dprintf("os=%p obj=%llu, increase to %d\n", dn->dn_objset,
+	    dn->dn_object, dn->dn_phys->dn_nlevels);
+
+	/* check for existing blkptrs in the dnode */
+	for (i = 0; i < nblkptr; i++)
+		if (!BP_IS_HOLE(&dn->dn_phys->dn_blkptr[i]))
+			break;
+	if (i != nblkptr) {
+		/* transfer dnode's block pointers to new indirect block */
+		(void) dbuf_read(db, NULL, DB_RF_MUST_SUCCEED|DB_RF_HAVESTRUCT);
+		ASSERT(db->db.db_data);
+		ASSERT(arc_released(db->db_buf));
+		ASSERT3U(sizeof (blkptr_t) * nblkptr, <=, db->db.db_size);
+		bcopy(dn->dn_phys->dn_blkptr, db->db.db_data,
+		    sizeof (blkptr_t) * nblkptr);
+		arc_buf_freeze(db->db_buf);
+	}
+
+	/* set dbuf's parent pointers to new indirect buf */
+	for (i = 0; i < nblkptr; i++) {
+		dmu_buf_impl_t *child = dbuf_find(dn, old_toplvl, i);
+
+		if (child == NULL)
+			continue;
+#ifdef	DEBUG
+		DB_DNODE_ENTER(child);
+		ASSERT3P(DB_DNODE(child), ==, dn);
+		DB_DNODE_EXIT(child);
+#endif	/* DEBUG */
+		if (child->db_parent && child->db_parent != dn->dn_dbuf) {
+			ASSERT(child->db_parent->db_level == db->db_level);
+			ASSERT(child->db_blkptr !=
+			    &dn->dn_phys->dn_blkptr[child->db_blkid]);
+			mutex_exit(&child->db_mtx);
+			continue;
+		}
+		ASSERT(child->db_parent == NULL ||
+		    child->db_parent == dn->dn_dbuf);
+
+		child->db_parent = db;
+		dbuf_add_ref(db, child);
+		if (db->db.db_data)
+			child->db_blkptr = (blkptr_t *)db->db.db_data + i;
+		else
+			child->db_blkptr = NULL;
+		dprintf_dbuf_bp(child, child->db_blkptr,
+		    "changed db_blkptr to new indirect %s", "");
+
+		mutex_exit(&child->db_mtx);
+	}
+
+	bzero(dn->dn_phys->dn_blkptr, sizeof (blkptr_t) * nblkptr);
+
+	dbuf_rele(db, FTAG);
+
+	rw_exit(&dn->dn_struct_rwlock);
+}
+
+static int
+free_blocks(dnode_t *dn, blkptr_t *bp, int num, dmu_tx_t *tx)
+{
+	dsl_dataset_t *ds = dn->dn_objset->os_dsl_dataset;
+	uint64_t bytesfreed = 0;
+	int i, blocks_freed = 0;
+
+	dprintf("ds=%p obj=%llx num=%d\n", ds, dn->dn_object, num);
+
+	for (i = 0; i < num; i++, bp++) {
+		if (BP_IS_HOLE(bp))
+			continue;
+
+		bytesfreed += dsl_dataset_block_kill(ds, bp, tx, B_FALSE);
+		ASSERT3U(bytesfreed, <=, DN_USED_BYTES(dn->dn_phys));
+		bzero(bp, sizeof (blkptr_t));
+		blocks_freed += 1;
+	}
+	dnode_diduse_space(dn, -bytesfreed);
+	return (blocks_freed);
+}
+
+#ifdef ZFS_DEBUG
+static void
+free_verify(dmu_buf_impl_t *db, uint64_t start, uint64_t end, dmu_tx_t *tx)
+{
+	int off, num;
+	int i, err, epbs;
+	uint64_t txg = tx->tx_txg;
+	dnode_t *dn;
+
+	DB_DNODE_ENTER(db);
+	dn = DB_DNODE(db);
+	epbs = dn->dn_phys->dn_indblkshift - SPA_BLKPTRSHIFT;
+	off = start - (db->db_blkid * 1<<epbs);
+	num = end - start + 1;
+
+	ASSERT3U(off, >=, 0);
+	ASSERT3U(num, >=, 0);
+	ASSERT3U(db->db_level, >, 0);
+	ASSERT3U(db->db.db_size, ==, 1 << dn->dn_phys->dn_indblkshift);
+	ASSERT3U(off+num, <=, db->db.db_size >> SPA_BLKPTRSHIFT);
+	ASSERT(db->db_blkptr != NULL);
+
+	for (i = off; i < off+num; i++) {
+		uint64_t *buf;
+		dmu_buf_impl_t *child;
+		dbuf_dirty_record_t *dr;
+		int j;
+
+		ASSERT(db->db_level == 1);
+
+		rw_enter(&dn->dn_struct_rwlock, RW_READER);
+		err = dbuf_hold_impl(dn, db->db_level-1,
+		    (db->db_blkid << epbs) + i, TRUE, FTAG, &child);
+		rw_exit(&dn->dn_struct_rwlock);
+		if (err == ENOENT)
+			continue;
+		ASSERT(err == 0);
+		ASSERT(child->db_level == 0);
+		dr = child->db_last_dirty;
+		while (dr && dr->dr_txg > txg)
+			dr = dr->dr_next;
+		ASSERT(dr == NULL || dr->dr_txg == txg);
+
+		/* data_old better be zeroed */
+		if (dr) {
+			buf = dr->dt.dl.dr_data->b_data;
+			for (j = 0; j < child->db.db_size >> 3; j++) {
+				if (buf[j] != 0) {
+					panic("freed data not zero: "
+					    "child=%p i=%d off=%d num=%d\n",
+					    (void *)child, i, off, num);
+				}
+			}
+		}
+
+		/*
+		 * db_data better be zeroed unless it's dirty in a
+		 * future txg.
+		 */
+		mutex_enter(&child->db_mtx);
+		buf = child->db.db_data;
+		if (buf != NULL && child->db_state != DB_FILL &&
+		    child->db_last_dirty == NULL) {
+			for (j = 0; j < child->db.db_size >> 3; j++) {
+				if (buf[j] != 0) {
+					panic("freed data not zero: "
+					    "child=%p i=%d off=%d num=%d\n",
+					    (void *)child, i, off, num);
+				}
+			}
+		}
+		mutex_exit(&child->db_mtx);
+
+		dbuf_rele(child, FTAG);
+	}
+	DB_DNODE_EXIT(db);
+}
+#endif
+
+#define	ALL -1
+
+static int
+free_children(dmu_buf_impl_t *db, uint64_t blkid, uint64_t nblks, int trunc,
+    dmu_tx_t *tx)
+{
+	dnode_t *dn;
+	blkptr_t *bp;
+	dmu_buf_impl_t *subdb;
+	uint64_t start, end, dbstart, dbend, i;
+	int epbs, shift, err;
+	int all = TRUE;
+	int blocks_freed = 0;
+
+	/*
+	 * There is a small possibility that this block will not be cached:
+	 *   1 - if level > 1 and there are no children with level <= 1
+	 *   2 - if we didn't get a dirty hold (because this block had just
+	 *	 finished being written -- and so had no holds), and then this
+	 *	 block got evicted before we got here.
+	 */
+	if (db->db_state != DB_CACHED)
+		(void) dbuf_read(db, NULL, DB_RF_MUST_SUCCEED);
+
+	dbuf_release_bp(db);
+	bp = (blkptr_t *)db->db.db_data;
+
+	DB_DNODE_ENTER(db);
+	dn = DB_DNODE(db);
+	epbs = dn->dn_phys->dn_indblkshift - SPA_BLKPTRSHIFT;
+	shift = (db->db_level - 1) * epbs;
+	dbstart = db->db_blkid << epbs;
+	start = blkid >> shift;
+	if (dbstart < start) {
+		bp += start - dbstart;
+		all = FALSE;
+	} else {
+		start = dbstart;
+	}
+	dbend = ((db->db_blkid + 1) << epbs) - 1;
+	end = (blkid + nblks - 1) >> shift;
+	if (dbend <= end)
+		end = dbend;
+	else if (all)
+		all = trunc;
+	ASSERT3U(start, <=, end);
+
+	if (db->db_level == 1) {
+		FREE_VERIFY(db, start, end, tx);
+		blocks_freed = free_blocks(dn, bp, end-start+1, tx);
+		arc_buf_freeze(db->db_buf);
+		ASSERT(all || blocks_freed == 0 || db->db_last_dirty);
+		DB_DNODE_EXIT(db);
+		return (all ? ALL : blocks_freed);
+	}
+
+	for (i = start; i <= end; i++, bp++) {
+		if (BP_IS_HOLE(bp))
+			continue;
+		rw_enter(&dn->dn_struct_rwlock, RW_READER);
+		err = dbuf_hold_impl(dn, db->db_level-1, i, TRUE, FTAG, &subdb);
+		ASSERT3U(err, ==, 0);
+		rw_exit(&dn->dn_struct_rwlock);
+
+		if (free_children(subdb, blkid, nblks, trunc, tx) == ALL) {
+			ASSERT3P(subdb->db_blkptr, ==, bp);
+			blocks_freed += free_blocks(dn, bp, 1, tx);
+		} else {
+			all = FALSE;
+		}
+		dbuf_rele(subdb, FTAG);
+	}
+	DB_DNODE_EXIT(db);
+	arc_buf_freeze(db->db_buf);
+#ifdef ZFS_DEBUG
+	bp -= (end-start)+1;
+	for (i = start; i <= end; i++, bp++) {
+		if (i == start && blkid != 0)
+			continue;
+		else if (i == end && !trunc)
+			continue;
+		ASSERT3U(bp->blk_birth, ==, 0);
+	}
+#endif
+	ASSERT(all || blocks_freed == 0 || db->db_last_dirty);
+	return (all ? ALL : blocks_freed);
+}
+
+/*
+ * dnode_sync_free_range: Traverse the indicated range of the provided file
+ * and "free" all the blocks contained there.
+ */
+static void
+dnode_sync_free_range(dnode_t *dn, uint64_t blkid, uint64_t nblks, dmu_tx_t *tx)
+{
+	blkptr_t *bp = dn->dn_phys->dn_blkptr;
+	dmu_buf_impl_t *db;
+	int trunc, start, end, shift, i, err;
+	int dnlevel = dn->dn_phys->dn_nlevels;
+
+	if (blkid > dn->dn_phys->dn_maxblkid)
+		return;
+
+	ASSERT(dn->dn_phys->dn_maxblkid < UINT64_MAX);
+	trunc = blkid + nblks > dn->dn_phys->dn_maxblkid;
+	if (trunc)
+		nblks = dn->dn_phys->dn_maxblkid - blkid + 1;
+
+	/* There are no indirect blocks in the object */
+	if (dnlevel == 1) {
+		if (blkid >= dn->dn_phys->dn_nblkptr) {
+			/* this range was never made persistent */
+			return;
+		}
+		ASSERT3U(blkid + nblks, <=, dn->dn_phys->dn_nblkptr);
+		(void) free_blocks(dn, bp + blkid, nblks, tx);
+		if (trunc) {
+			uint64_t off = (dn->dn_phys->dn_maxblkid + 1) *
+			    (dn->dn_phys->dn_datablkszsec << SPA_MINBLOCKSHIFT);
+			dn->dn_phys->dn_maxblkid = (blkid ? blkid - 1 : 0);
+			ASSERT(off < dn->dn_phys->dn_maxblkid ||
+			    dn->dn_phys->dn_maxblkid == 0 ||
+			    dnode_next_offset(dn, 0, &off, 1, 1, 0) != 0);
+		}
+		return;
+	}
+
+	shift = (dnlevel - 1) * (dn->dn_phys->dn_indblkshift - SPA_BLKPTRSHIFT);
+	start = blkid >> shift;
+	ASSERT(start < dn->dn_phys->dn_nblkptr);
+	end = (blkid + nblks - 1) >> shift;
+	bp += start;
+	for (i = start; i <= end; i++, bp++) {
+		if (BP_IS_HOLE(bp))
+			continue;
+		rw_enter(&dn->dn_struct_rwlock, RW_READER);
+		err = dbuf_hold_impl(dn, dnlevel-1, i, TRUE, FTAG, &db);
+		ASSERT3U(err, ==, 0);
+		rw_exit(&dn->dn_struct_rwlock);
+
+		if (free_children(db, blkid, nblks, trunc, tx) == ALL) {
+			ASSERT3P(db->db_blkptr, ==, bp);
+			(void) free_blocks(dn, bp, 1, tx);
+		}
+		dbuf_rele(db, FTAG);
+	}
+	if (trunc) {
+		uint64_t off = (dn->dn_phys->dn_maxblkid + 1) *
+		    (dn->dn_phys->dn_datablkszsec << SPA_MINBLOCKSHIFT);
+		dn->dn_phys->dn_maxblkid = (blkid ? blkid - 1 : 0);
+		ASSERT(off < dn->dn_phys->dn_maxblkid ||
+		    dn->dn_phys->dn_maxblkid == 0 ||
+		    dnode_next_offset(dn, 0, &off, 1, 1, 0) != 0);
+	}
+}
+
+/*
+ * Try to kick all the dnode's dbufs out of the cache...
+ */
+void
+dnode_evict_dbufs(dnode_t *dn)
+{
+	int progress;
+	int pass = 0;
+
+	do {
+		dmu_buf_impl_t *db, marker;
+		int evicting = FALSE;
+
+		progress = FALSE;
+		mutex_enter(&dn->dn_dbufs_mtx);
+		list_insert_tail(&dn->dn_dbufs, &marker);
+		db = list_head(&dn->dn_dbufs);
+		for (; db != &marker; db = list_head(&dn->dn_dbufs)) {
+			list_remove(&dn->dn_dbufs, db);
+			list_insert_tail(&dn->dn_dbufs, db);
+#ifdef	DEBUG
+			DB_DNODE_ENTER(db);
+			ASSERT3P(DB_DNODE(db), ==, dn);
+			DB_DNODE_EXIT(db);
+#endif	/* DEBUG */
+
+			mutex_enter(&db->db_mtx);
+			if (db->db_state == DB_EVICTING) {
+				progress = TRUE;
+				evicting = TRUE;
+				mutex_exit(&db->db_mtx);
+			} else if (refcount_is_zero(&db->db_holds)) {
+				progress = TRUE;
+				dbuf_clear(db); /* exits db_mtx for us */
+			} else {
+				mutex_exit(&db->db_mtx);
+			}
+
+		}
+		list_remove(&dn->dn_dbufs, &marker);
+		/*
+		 * NB: we need to drop dn_dbufs_mtx between passes so
+		 * that any DB_EVICTING dbufs can make progress.
+		 * Ideally, we would have some cv we could wait on, but
+		 * since we don't, just wait a bit to give the other
+		 * thread a chance to run.
+		 */
+		mutex_exit(&dn->dn_dbufs_mtx);
+		if (evicting)
+			delay(1);
+		pass++;
+		ASSERT(pass < 100); /* sanity check */
+	} while (progress);
+
+	rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
+	if (dn->dn_bonus && refcount_is_zero(&dn->dn_bonus->db_holds)) {
+		mutex_enter(&dn->dn_bonus->db_mtx);
+		dbuf_evict(dn->dn_bonus);
+		dn->dn_bonus = NULL;
+	}
+	rw_exit(&dn->dn_struct_rwlock);
+}
+
+static void
+dnode_undirty_dbufs(list_t *list)
+{
+	dbuf_dirty_record_t *dr;
+
+	while ((dr = list_head(list)) != NULL) {
+		dmu_buf_impl_t *db = dr->dr_dbuf;
+		uint64_t txg = dr->dr_txg;
+
+		if (db->db_level != 0)
+			dnode_undirty_dbufs(&dr->dt.di.dr_children);
+
+		mutex_enter(&db->db_mtx);
+		/* XXX - use dbuf_undirty()? */
+		list_remove(list, dr);
+		ASSERT(db->db_last_dirty == dr);
+		db->db_last_dirty = NULL;
+		db->db_dirtycnt -= 1;
+		if (db->db_level == 0) {
+			ASSERT(db->db_blkid == DMU_BONUS_BLKID ||
+			    dr->dt.dl.dr_data == db->db_buf);
+			dbuf_unoverride(dr);
+		}
+		kmem_free(dr, sizeof (dbuf_dirty_record_t));
+		dbuf_rele_and_unlock(db, (void *)(uintptr_t)txg);
+	}
+}
+
+static void
+dnode_sync_free(dnode_t *dn, dmu_tx_t *tx)
+{
+	int txgoff = tx->tx_txg & TXG_MASK;
+
+	ASSERT(dmu_tx_is_syncing(tx));
+
+	/*
+	 * Our contents should have been freed in dnode_sync() by the
+	 * free range record inserted by the caller of dnode_free().
+	 */
+	ASSERT3U(DN_USED_BYTES(dn->dn_phys), ==, 0);
+	ASSERT(BP_IS_HOLE(dn->dn_phys->dn_blkptr));
+
+	dnode_undirty_dbufs(&dn->dn_dirty_records[txgoff]);
+	dnode_evict_dbufs(dn);
+	ASSERT3P(list_head(&dn->dn_dbufs), ==, NULL);
+
+	/*
+	 * XXX - It would be nice to assert this, but we may still
+	 * have residual holds from async evictions from the arc...
+	 *
+	 * zfs_obj_to_path() also depends on this being
+	 * commented out.
+	 *
+	 * ASSERT3U(refcount_count(&dn->dn_holds), ==, 1);
+	 */
+
+	/* Undirty next bits */
+	dn->dn_next_nlevels[txgoff] = 0;
+	dn->dn_next_indblkshift[txgoff] = 0;
+	dn->dn_next_blksz[txgoff] = 0;
+
+	/* ASSERT(blkptrs are zero); */
+	ASSERT(dn->dn_phys->dn_type != DMU_OT_NONE);
+	ASSERT(dn->dn_type != DMU_OT_NONE);
+
+	ASSERT(dn->dn_free_txg > 0);
+	if (dn->dn_allocated_txg != dn->dn_free_txg)
+		dbuf_will_dirty(dn->dn_dbuf, tx);
+	bzero(dn->dn_phys, sizeof (dnode_phys_t));
+
+	mutex_enter(&dn->dn_mtx);
+	dn->dn_type = DMU_OT_NONE;
+	dn->dn_maxblkid = 0;
+	dn->dn_allocated_txg = 0;
+	dn->dn_free_txg = 0;
+	dn->dn_have_spill = B_FALSE;
+	mutex_exit(&dn->dn_mtx);
+
+	ASSERT(dn->dn_object != DMU_META_DNODE_OBJECT);
+
+	dnode_rele(dn, (void *)(uintptr_t)tx->tx_txg);
+	/*
+	 * Now that we've released our hold, the dnode may
+	 * be evicted, so we mustn't access it.
+	 */
+}
+
+/*
+ * Write out the dnode's dirty buffers.
+ */
+void
+dnode_sync(dnode_t *dn, dmu_tx_t *tx)
+{
+	free_range_t *rp;
+	dnode_phys_t *dnp = dn->dn_phys;
+	int txgoff = tx->tx_txg & TXG_MASK;
+	list_t *list = &dn->dn_dirty_records[txgoff];
+	static const dnode_phys_t zerodn = { 0 };
+	boolean_t kill_spill = B_FALSE;
+
+	ASSERT(dmu_tx_is_syncing(tx));
+	ASSERT(dnp->dn_type != DMU_OT_NONE || dn->dn_allocated_txg);
+	ASSERT(dnp->dn_type != DMU_OT_NONE ||
+	    bcmp(dnp, &zerodn, DNODE_SIZE) == 0);
+	DNODE_VERIFY(dn);
+
+	ASSERT(dn->dn_dbuf == NULL || arc_released(dn->dn_dbuf->db_buf));
+
+	if (dmu_objset_userused_enabled(dn->dn_objset) &&
+	    !DMU_OBJECT_IS_SPECIAL(dn->dn_object)) {
+		mutex_enter(&dn->dn_mtx);
+		dn->dn_oldused = DN_USED_BYTES(dn->dn_phys);
+		dn->dn_oldflags = dn->dn_phys->dn_flags;
+		dn->dn_phys->dn_flags |= DNODE_FLAG_USERUSED_ACCOUNTED;
+		mutex_exit(&dn->dn_mtx);
+		dmu_objset_userquota_get_ids(dn, B_FALSE, tx);
+	} else {
+		/* Once we account for it, we should always account for it. */
+		ASSERT(!(dn->dn_phys->dn_flags &
+		    DNODE_FLAG_USERUSED_ACCOUNTED));
+	}
+
+	mutex_enter(&dn->dn_mtx);
+	if (dn->dn_allocated_txg == tx->tx_txg) {
+		/* The dnode is newly allocated or reallocated */
+		if (dnp->dn_type == DMU_OT_NONE) {
+			/* this is a first alloc, not a realloc */
+			dnp->dn_nlevels = 1;
+			dnp->dn_nblkptr = dn->dn_nblkptr;
+		}
+
+		dnp->dn_type = dn->dn_type;
+		dnp->dn_bonustype = dn->dn_bonustype;
+		dnp->dn_bonuslen = dn->dn_bonuslen;
+	}
+
+	ASSERT(dnp->dn_nlevels > 1 ||
+	    BP_IS_HOLE(&dnp->dn_blkptr[0]) ||
+	    BP_GET_LSIZE(&dnp->dn_blkptr[0]) ==
+	    dnp->dn_datablkszsec << SPA_MINBLOCKSHIFT);
+
+	if (dn->dn_next_blksz[txgoff]) {
+		ASSERT(P2PHASE(dn->dn_next_blksz[txgoff],
+		    SPA_MINBLOCKSIZE) == 0);
+		ASSERT(BP_IS_HOLE(&dnp->dn_blkptr[0]) ||
+		    dn->dn_maxblkid == 0 || list_head(list) != NULL ||
+		    avl_last(&dn->dn_ranges[txgoff]) ||
+		    dn->dn_next_blksz[txgoff] >> SPA_MINBLOCKSHIFT ==
+		    dnp->dn_datablkszsec);
+		dnp->dn_datablkszsec =
+		    dn->dn_next_blksz[txgoff] >> SPA_MINBLOCKSHIFT;
+		dn->dn_next_blksz[txgoff] = 0;
+	}
+
+	if (dn->dn_next_bonuslen[txgoff]) {
+		if (dn->dn_next_bonuslen[txgoff] == DN_ZERO_BONUSLEN)
+			dnp->dn_bonuslen = 0;
+		else
+			dnp->dn_bonuslen = dn->dn_next_bonuslen[txgoff];
+		ASSERT(dnp->dn_bonuslen <= DN_MAX_BONUSLEN);
+		dn->dn_next_bonuslen[txgoff] = 0;
+	}
+
+	if (dn->dn_next_bonustype[txgoff]) {
+		ASSERT(dn->dn_next_bonustype[txgoff] < DMU_OT_NUMTYPES);
+		dnp->dn_bonustype = dn->dn_next_bonustype[txgoff];
+		dn->dn_next_bonustype[txgoff] = 0;
+	}
+
+	/*
+	 * We will either remove a spill block when a file is being removed
+	 * or we have been asked to remove it.
+	 */
+	if (dn->dn_rm_spillblk[txgoff] ||
+	    ((dnp->dn_flags & DNODE_FLAG_SPILL_BLKPTR) &&
+	    dn->dn_free_txg > 0 && dn->dn_free_txg <= tx->tx_txg)) {
+		if ((dnp->dn_flags & DNODE_FLAG_SPILL_BLKPTR))
+			kill_spill = B_TRUE;
+		dn->dn_rm_spillblk[txgoff] = 0;
+	}
+
+	if (dn->dn_next_indblkshift[txgoff]) {
+		ASSERT(dnp->dn_nlevels == 1);
+		dnp->dn_indblkshift = dn->dn_next_indblkshift[txgoff];
+		dn->dn_next_indblkshift[txgoff] = 0;
+	}
+
+	/*
+	 * Just take the live (open-context) values for checksum and compress.
+	 * Strictly speaking it's a future leak, but nothing bad happens if we
+	 * start using the new checksum or compress algorithm a little early.
+	 */
+	dnp->dn_checksum = dn->dn_checksum;
+	dnp->dn_compress = dn->dn_compress;
+
+	mutex_exit(&dn->dn_mtx);
+
+	if (kill_spill) {
+		(void) free_blocks(dn, &dn->dn_phys->dn_spill, 1, tx);
+		mutex_enter(&dn->dn_mtx);
+		dnp->dn_flags &= ~DNODE_FLAG_SPILL_BLKPTR;
+		mutex_exit(&dn->dn_mtx);
+	}
+
+	/* process all the "freed" ranges in the file */
+	while ((rp = avl_last(&dn->dn_ranges[txgoff])) != NULL) {
+		dnode_sync_free_range(dn, rp->fr_blkid, rp->fr_nblks, tx);
+		/* grab the mutex so we don't race with dnode_block_freed() */
+		mutex_enter(&dn->dn_mtx);
+		avl_remove(&dn->dn_ranges[txgoff], rp);
+		mutex_exit(&dn->dn_mtx);
+		kmem_free(rp, sizeof (free_range_t));
+	}
+
+	if (dn->dn_free_txg > 0 && dn->dn_free_txg <= tx->tx_txg) {
+		dnode_sync_free(dn, tx);
+		return;
+	}
+
+	if (dn->dn_next_nblkptr[txgoff]) {
+		/* this should only happen on a realloc */
+		ASSERT(dn->dn_allocated_txg == tx->tx_txg);
+		if (dn->dn_next_nblkptr[txgoff] > dnp->dn_nblkptr) {
+			/* zero the new blkptrs we are gaining */
+			bzero(dnp->dn_blkptr + dnp->dn_nblkptr,
+			    sizeof (blkptr_t) *
+			    (dn->dn_next_nblkptr[txgoff] - dnp->dn_nblkptr));
+#ifdef ZFS_DEBUG
+		} else {
+			int i;
+			ASSERT(dn->dn_next_nblkptr[txgoff] < dnp->dn_nblkptr);
+			/* the blkptrs we are losing better be unallocated */
+			for (i = dn->dn_next_nblkptr[txgoff];
+			    i < dnp->dn_nblkptr; i++)
+				ASSERT(BP_IS_HOLE(&dnp->dn_blkptr[i]));
+#endif
+		}
+		mutex_enter(&dn->dn_mtx);
+		dnp->dn_nblkptr = dn->dn_next_nblkptr[txgoff];
+		dn->dn_next_nblkptr[txgoff] = 0;
+		mutex_exit(&dn->dn_mtx);
+	}
+
+	if (dn->dn_next_nlevels[txgoff]) {
+		dnode_increase_indirection(dn, tx);
+		dn->dn_next_nlevels[txgoff] = 0;
+	}
+
+	dbuf_sync_list(list, tx);
+
+	if (!DMU_OBJECT_IS_SPECIAL(dn->dn_object)) {
+		ASSERT3P(list_head(list), ==, NULL);
+		dnode_rele(dn, (void *)(uintptr_t)tx->tx_txg);
+	}
+
+	/*
+	 * Although we have dropped our reference to the dnode, it
+	 * can't be evicted until it's written, and we haven't yet
+	 * initiated the IO for the dnode's dbuf.
+	 */
+}
--- a/uts/common/fs/zfs/dsl_dataset.c
+++ b/uts/common/fs/zfs/dsl_dataset.c
--- a/uts/common/fs/zfs/dsl_deadlist.c
+++ b/uts/common/fs/zfs/dsl_deadlist.c
@ -0,0 +1,474 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#include <sys/dsl_dataset.h>
+#include <sys/dmu.h>
+#include <sys/refcount.h>
+#include <sys/zap.h>
+#include <sys/zfs_context.h>
+#include <sys/dsl_pool.h>
+
+static int
+dsl_deadlist_compare(const void *arg1, const void *arg2)
+{
+	const dsl_deadlist_entry_t *dle1 = arg1;
+	const dsl_deadlist_entry_t *dle2 = arg2;
+
+	if (dle1->dle_mintxg < dle2->dle_mintxg)
+		return (-1);
+	else if (dle1->dle_mintxg > dle2->dle_mintxg)
+		return (+1);
+	else
+		return (0);
+}
+
+static void
+dsl_deadlist_load_tree(dsl_deadlist_t *dl)
+{
+	zap_cursor_t zc;
+	zap_attribute_t za;
+
+	ASSERT(!dl->dl_oldfmt);
+	if (dl->dl_havetree)
+		return;
+
+	avl_create(&dl->dl_tree, dsl_deadlist_compare,
+	    sizeof (dsl_deadlist_entry_t),
+	    offsetof(dsl_deadlist_entry_t, dle_node));
+	for (zap_cursor_init(&zc, dl->dl_os, dl->dl_object);
+	    zap_cursor_retrieve(&zc, &za) == 0;
+	    zap_cursor_advance(&zc)) {
+		dsl_deadlist_entry_t *dle = kmem_alloc(sizeof (*dle), KM_SLEEP);
+		dle->dle_mintxg = strtonum(za.za_name, NULL);
+		VERIFY3U(0, ==, bpobj_open(&dle->dle_bpobj, dl->dl_os,
+		    za.za_first_integer));
+		avl_add(&dl->dl_tree, dle);
+	}
+	zap_cursor_fini(&zc);
+	dl->dl_havetree = B_TRUE;
+}
+
+void
+dsl_deadlist_open(dsl_deadlist_t *dl, objset_t *os, uint64_t object)
+{
+	dmu_object_info_t doi;
+
+	mutex_init(&dl->dl_lock, NULL, MUTEX_DEFAULT, NULL);
+	dl->dl_os = os;
+	dl->dl_object = object;
+	VERIFY3U(0, ==, dmu_bonus_hold(os, object, dl, &dl->dl_dbuf));
+	dmu_object_info_from_db(dl->dl_dbuf, &doi);
+	if (doi.doi_type == DMU_OT_BPOBJ) {
+		dmu_buf_rele(dl->dl_dbuf, dl);
+		dl->dl_dbuf = NULL;
+		dl->dl_oldfmt = B_TRUE;
+		VERIFY3U(0, ==, bpobj_open(&dl->dl_bpobj, os, object));
+		return;
+	}
+
+	dl->dl_oldfmt = B_FALSE;
+	dl->dl_phys = dl->dl_dbuf->db_data;
+	dl->dl_havetree = B_FALSE;
+}
+
+void
+dsl_deadlist_close(dsl_deadlist_t *dl)
+{
+	void *cookie = NULL;
+	dsl_deadlist_entry_t *dle;
+
+	if (dl->dl_oldfmt) {
+		dl->dl_oldfmt = B_FALSE;
+		bpobj_close(&dl->dl_bpobj);
+		return;
+	}
+
+	if (dl->dl_havetree) {
+		while ((dle = avl_destroy_nodes(&dl->dl_tree, &cookie))
+		    != NULL) {
+			bpobj_close(&dle->dle_bpobj);
+			kmem_free(dle, sizeof (*dle));
+		}
+		avl_destroy(&dl->dl_tree);
+	}
+	dmu_buf_rele(dl->dl_dbuf, dl);
+	mutex_destroy(&dl->dl_lock);
+	dl->dl_dbuf = NULL;
+	dl->dl_phys = NULL;
+}
+
+uint64_t
+dsl_deadlist_alloc(objset_t *os, dmu_tx_t *tx)
+{
+	if (spa_version(dmu_objset_spa(os)) < SPA_VERSION_DEADLISTS)
+		return (bpobj_alloc(os, SPA_MAXBLOCKSIZE, tx));
+	return (zap_create(os, DMU_OT_DEADLIST, DMU_OT_DEADLIST_HDR,
+	    sizeof (dsl_deadlist_phys_t), tx));
+}
+
+void
+dsl_deadlist_free(objset_t *os, uint64_t dlobj, dmu_tx_t *tx)
+{
+	dmu_object_info_t doi;
+	zap_cursor_t zc;
+	zap_attribute_t za;
+
+	VERIFY3U(0, ==, dmu_object_info(os, dlobj, &doi));
+	if (doi.doi_type == DMU_OT_BPOBJ) {
+		bpobj_free(os, dlobj, tx);
+		return;
+	}
+
+	for (zap_cursor_init(&zc, os, dlobj);
+	    zap_cursor_retrieve(&zc, &za) == 0;
+	    zap_cursor_advance(&zc))
+		bpobj_free(os, za.za_first_integer, tx);
+	zap_cursor_fini(&zc);
+	VERIFY3U(0, ==, dmu_object_free(os, dlobj, tx));
+}
+
+void
+dsl_deadlist_insert(dsl_deadlist_t *dl, const blkptr_t *bp, dmu_tx_t *tx)
+{
+	dsl_deadlist_entry_t dle_tofind;
+	dsl_deadlist_entry_t *dle;
+	avl_index_t where;
+
+	if (dl->dl_oldfmt) {
+		bpobj_enqueue(&dl->dl_bpobj, bp, tx);
+		return;
+	}
+
+	dsl_deadlist_load_tree(dl);
+
+	dmu_buf_will_dirty(dl->dl_dbuf, tx);
+	mutex_enter(&dl->dl_lock);
+	dl->dl_phys->dl_used +=
+	    bp_get_dsize_sync(dmu_objset_spa(dl->dl_os), bp);
+	dl->dl_phys->dl_comp += BP_GET_PSIZE(bp);
+	dl->dl_phys->dl_uncomp += BP_GET_UCSIZE(bp);
+	mutex_exit(&dl->dl_lock);
+
+	dle_tofind.dle_mintxg = bp->blk_birth;
+	dle = avl_find(&dl->dl_tree, &dle_tofind, &where);
+	if (dle == NULL)
+		dle = avl_nearest(&dl->dl_tree, where, AVL_BEFORE);
+	else
+		dle = AVL_PREV(&dl->dl_tree, dle);
+	bpobj_enqueue(&dle->dle_bpobj, bp, tx);
+}
+
+/*
+ * Insert new key in deadlist, which must be > all current entries.
+ * mintxg is not inclusive.
+ */
+void
+dsl_deadlist_add_key(dsl_deadlist_t *dl, uint64_t mintxg, dmu_tx_t *tx)
+{
+	uint64_t obj;
+	dsl_deadlist_entry_t *dle;
+
+	if (dl->dl_oldfmt)
+		return;
+
+	dsl_deadlist_load_tree(dl);
+
+	dle = kmem_alloc(sizeof (*dle), KM_SLEEP);
+	dle->dle_mintxg = mintxg;
+	obj = bpobj_alloc(dl->dl_os, SPA_MAXBLOCKSIZE, tx);
+	VERIFY3U(0, ==, bpobj_open(&dle->dle_bpobj, dl->dl_os, obj));
+	avl_add(&dl->dl_tree, dle);
+
+	VERIFY3U(0, ==, zap_add_int_key(dl->dl_os, dl->dl_object,
+	    mintxg, obj, tx));
+}
+
+/*
+ * Remove this key, merging its entries into the previous key.
+ */
+void
+dsl_deadlist_remove_key(dsl_deadlist_t *dl, uint64_t mintxg, dmu_tx_t *tx)
+{
+	dsl_deadlist_entry_t dle_tofind;
+	dsl_deadlist_entry_t *dle, *dle_prev;
+
+	if (dl->dl_oldfmt)
+		return;
+
+	dsl_deadlist_load_tree(dl);
+
+	dle_tofind.dle_mintxg = mintxg;
+	dle = avl_find(&dl->dl_tree, &dle_tofind, NULL);
+	dle_prev = AVL_PREV(&dl->dl_tree, dle);
+
+	bpobj_enqueue_subobj(&dle_prev->dle_bpobj,
+	    dle->dle_bpobj.bpo_object, tx);
+
+	avl_remove(&dl->dl_tree, dle);
+	bpobj_close(&dle->dle_bpobj);
+	kmem_free(dle, sizeof (*dle));
+
+	VERIFY3U(0, ==, zap_remove_int(dl->dl_os, dl->dl_object, mintxg, tx));
+}
+
+/*
+ * Walk ds's snapshots to regenerate the ZAP & AVL.
+ */
+static void
+dsl_deadlist_regenerate(objset_t *os, uint64_t dlobj,
+    uint64_t mrs_obj, dmu_tx_t *tx)
+{
+	dsl_deadlist_t dl;
+	dsl_pool_t *dp = dmu_objset_pool(os);
+
+	dsl_deadlist_open(&dl, os, dlobj);
+	if (dl.dl_oldfmt) {
+		dsl_deadlist_close(&dl);
+		return;
+	}
+
+	while (mrs_obj != 0) {
+		dsl_dataset_t *ds;
+		VERIFY3U(0, ==, dsl_dataset_hold_obj(dp, mrs_obj, FTAG, &ds));
+		dsl_deadlist_add_key(&dl, ds->ds_phys->ds_prev_snap_txg, tx);
+		mrs_obj = ds->ds_phys->ds_prev_snap_obj;
+		dsl_dataset_rele(ds, FTAG);
+	}
+	dsl_deadlist_close(&dl);
+}
+
+uint64_t
+dsl_deadlist_clone(dsl_deadlist_t *dl, uint64_t maxtxg,
+    uint64_t mrs_obj, dmu_tx_t *tx)
+{
+	dsl_deadlist_entry_t *dle;
+	uint64_t newobj;
+
+	newobj = dsl_deadlist_alloc(dl->dl_os, tx);
+
+	if (dl->dl_oldfmt) {
+		dsl_deadlist_regenerate(dl->dl_os, newobj, mrs_obj, tx);
+		return (newobj);
+	}
+
+	dsl_deadlist_load_tree(dl);
+
+	for (dle = avl_first(&dl->dl_tree); dle;
+	    dle = AVL_NEXT(&dl->dl_tree, dle)) {
+		uint64_t obj;
+
+		if (dle->dle_mintxg >= maxtxg)
+			break;
+
+		obj = bpobj_alloc(dl->dl_os, SPA_MAXBLOCKSIZE, tx);
+		VERIFY3U(0, ==, zap_add_int_key(dl->dl_os, newobj,
+		    dle->dle_mintxg, obj, tx));
+	}
+	return (newobj);
+}
+
+void
+dsl_deadlist_space(dsl_deadlist_t *dl,
+    uint64_t *usedp, uint64_t *compp, uint64_t *uncompp)
+{
+	if (dl->dl_oldfmt) {
+		VERIFY3U(0, ==, bpobj_space(&dl->dl_bpobj,
+		    usedp, compp, uncompp));
+		return;
+	}
+
+	mutex_enter(&dl->dl_lock);
+	*usedp = dl->dl_phys->dl_used;
+	*compp = dl->dl_phys->dl_comp;
+	*uncompp = dl->dl_phys->dl_uncomp;
+	mutex_exit(&dl->dl_lock);
+}
+
+/*
+ * return space used in the range (mintxg, maxtxg].
+ * Includes maxtxg, does not include mintxg.
+ * mintxg and maxtxg must both be keys in the deadlist (unless maxtxg is
+ * UINT64_MAX).
+ */
+void
+dsl_deadlist_space_range(dsl_deadlist_t *dl, uint64_t mintxg, uint64_t maxtxg,
+    uint64_t *usedp, uint64_t *compp, uint64_t *uncompp)
+{
+	dsl_deadlist_entry_t dle_tofind;
+	dsl_deadlist_entry_t *dle;
+	avl_index_t where;
+
+	if (dl->dl_oldfmt) {
+		VERIFY3U(0, ==, bpobj_space_range(&dl->dl_bpobj,
+		    mintxg, maxtxg, usedp, compp, uncompp));
+		return;
+	}
+
+	dsl_deadlist_load_tree(dl);
+	*usedp = *compp = *uncompp = 0;
+
+	dle_tofind.dle_mintxg = mintxg;
+	dle = avl_find(&dl->dl_tree, &dle_tofind, &where);
+	/*
+	 * If we don't find this mintxg, there shouldn't be anything
+	 * after it either.
+	 */
+	ASSERT(dle != NULL ||
+	    avl_nearest(&dl->dl_tree, where, AVL_AFTER) == NULL);
+	for (; dle && dle->dle_mintxg < maxtxg;
+	    dle = AVL_NEXT(&dl->dl_tree, dle)) {
+		uint64_t used, comp, uncomp;
+
+		VERIFY3U(0, ==, bpobj_space(&dle->dle_bpobj,
+		    &used, &comp, &uncomp));
+
+		*usedp += used;
+		*compp += comp;
+		*uncompp += uncomp;
+	}
+}
+
+static void
+dsl_deadlist_insert_bpobj(dsl_deadlist_t *dl, uint64_t obj, uint64_t birth,
+    dmu_tx_t *tx)
+{
+	dsl_deadlist_entry_t dle_tofind;
+	dsl_deadlist_entry_t *dle;
+	avl_index_t where;
+	uint64_t used, comp, uncomp;
+	bpobj_t bpo;
+
+	VERIFY3U(0, ==, bpobj_open(&bpo, dl->dl_os, obj));
+	VERIFY3U(0, ==, bpobj_space(&bpo, &used, &comp, &uncomp));
+	bpobj_close(&bpo);
+
+	dsl_deadlist_load_tree(dl);
+
+	dmu_buf_will_dirty(dl->dl_dbuf, tx);
+	mutex_enter(&dl->dl_lock);
+	dl->dl_phys->dl_used += used;
+	dl->dl_phys->dl_comp += comp;
+	dl->dl_phys->dl_uncomp += uncomp;
+	mutex_exit(&dl->dl_lock);
+
+	dle_tofind.dle_mintxg = birth;
+	dle = avl_find(&dl->dl_tree, &dle_tofind, &where);
+	if (dle == NULL)
+		dle = avl_nearest(&dl->dl_tree, where, AVL_BEFORE);
+	bpobj_enqueue_subobj(&dle->dle_bpobj, obj, tx);
+}
+
+static int
+dsl_deadlist_insert_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
+{
+	dsl_deadlist_t *dl = arg;
+	dsl_deadlist_insert(dl, bp, tx);
+	return (0);
+}
+
+/*
+ * Merge the deadlist pointed to by 'obj' into dl.  obj will be left as
+ * an empty deadlist.
+ */
+void
+dsl_deadlist_merge(dsl_deadlist_t *dl, uint64_t obj, dmu_tx_t *tx)
+{
+	zap_cursor_t zc;
+	zap_attribute_t za;
+	dmu_buf_t *bonus;
+	dsl_deadlist_phys_t *dlp;
+	dmu_object_info_t doi;
+
+	VERIFY3U(0, ==, dmu_object_info(dl->dl_os, obj, &doi));
+	if (doi.doi_type == DMU_OT_BPOBJ) {
+		bpobj_t bpo;
+		VERIFY3U(0, ==, bpobj_open(&bpo, dl->dl_os, obj));
+		VERIFY3U(0, ==, bpobj_iterate(&bpo,
+		    dsl_deadlist_insert_cb, dl, tx));
+		bpobj_close(&bpo);
+		return;
+	}
+
+	for (zap_cursor_init(&zc, dl->dl_os, obj);
+	    zap_cursor_retrieve(&zc, &za) == 0;
+	    zap_cursor_advance(&zc)) {
+		uint64_t mintxg = strtonum(za.za_name, NULL);
+		dsl_deadlist_insert_bpobj(dl, za.za_first_integer, mintxg, tx);
+		VERIFY3U(0, ==, zap_remove_int(dl->dl_os, obj, mintxg, tx));
+	}
+	zap_cursor_fini(&zc);
+
+	VERIFY3U(0, ==, dmu_bonus_hold(dl->dl_os, obj, FTAG, &bonus));
+	dlp = bonus->db_data;
+	dmu_buf_will_dirty(bonus, tx);
+	bzero(dlp, sizeof (*dlp));
+	dmu_buf_rele(bonus, FTAG);
+}
+
+/*
+ * Remove entries on dl that are >= mintxg, and put them on the bpobj.
+ */
+void
+dsl_deadlist_move_bpobj(dsl_deadlist_t *dl, bpobj_t *bpo, uint64_t mintxg,
+    dmu_tx_t *tx)
+{
+	dsl_deadlist_entry_t dle_tofind;
+	dsl_deadlist_entry_t *dle;
+	avl_index_t where;
+
+	ASSERT(!dl->dl_oldfmt);
+	dmu_buf_will_dirty(dl->dl_dbuf, tx);
+	dsl_deadlist_load_tree(dl);
+
+	dle_tofind.dle_mintxg = mintxg;
+	dle = avl_find(&dl->dl_tree, &dle_tofind, &where);
+	if (dle == NULL)
+		dle = avl_nearest(&dl->dl_tree, where, AVL_AFTER);
+	while (dle) {
+		uint64_t used, comp, uncomp;
+		dsl_deadlist_entry_t *dle_next;
+
+		bpobj_enqueue_subobj(bpo, dle->dle_bpobj.bpo_object, tx);
+
+		VERIFY3U(0, ==, bpobj_space(&dle->dle_bpobj,
+		    &used, &comp, &uncomp));
+		mutex_enter(&dl->dl_lock);
+		ASSERT3U(dl->dl_phys->dl_used, >=, used);
+		ASSERT3U(dl->dl_phys->dl_comp, >=, comp);
+		ASSERT3U(dl->dl_phys->dl_uncomp, >=, uncomp);
+		dl->dl_phys->dl_used -= used;
+		dl->dl_phys->dl_comp -= comp;
+		dl->dl_phys->dl_uncomp -= uncomp;
+		mutex_exit(&dl->dl_lock);
+
+		VERIFY3U(0, ==, zap_remove_int(dl->dl_os, dl->dl_object,
+		    dle->dle_mintxg, tx));
+
+		dle_next = AVL_NEXT(&dl->dl_tree, dle);
+		avl_remove(&dl->dl_tree, dle);
+		bpobj_close(&dle->dle_bpobj);
+		kmem_free(dle, sizeof (*dle));
+		dle = dle_next;
+	}
+}
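
Note on the bucket lookup in dsl_deadlist_insert() above: the avl_find() / avl_nearest(AVL_BEFORE) / AVL_PREV() sequence picks the entry whose dle_mintxg is the greatest key strictly less than the block's birth txg, since mintxg is not inclusive. The sketch below restates that rule over a plain sorted array. It is only an illustration under stated assumptions: bucket_for_birth() is a hypothetical name, the array stands in for the AVL tree, and the -1 return covers a case the kernel code does not handle explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative stand-in for the AVL lookup in dsl_deadlist_insert().
 * 'keys' is sorted ascending; each key plays the role of a
 * non-inclusive dle_mintxg.  Returns the index of the greatest key
 * strictly less than 'birth', or -1 if no key qualifies.
 * bucket_for_birth() is a hypothetical name, not part of ZFS.
 */
static int
bucket_for_birth(const uint64_t *keys, size_t nkeys, uint64_t birth)
{
	int found = -1;
	size_t i;

	assert(keys != NULL || nkeys == 0);
	for (i = 0; i < nkeys; i++) {
		/* An exact match belongs to the previous bucket. */
		if (keys[i] >= birth)
			break;
		found = (int)i;
	}
	return (found);
}
```

With keys {0, 10, 20}, a block born in txg 10 lands in the bucket keyed 0 (the AVL_PREV() case), while a block born in txg 11 lands in the bucket keyed 10.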
--- a/uts/common/fs/zfs/dsl_deleg.c
+++ b/uts/common/fs/zfs/dsl_deleg.c
@ -0,0 +1,746 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2007, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+/*
+ * DSL permissions are stored in a two level zap attribute
+ * mechanism.   The first level identifies the "class" of
+ * entry.  The class is identified by the first 2 letters of
+ * the attribute.  The second letter "l" or "d" identifies whether
+ * it is a local or descendent permission.  The first letter
+ * identifies the type of entry.
+ *
+ * ul$<id>    identifies permissions granted locally for this userid.
+ * ud$<id>    identifies permissions granted on descendent datasets for
+ *            this userid.
+ * Ul$<id>    identifies permission sets granted locally for this userid.
+ * Ud$<id>    identifies permission sets granted on descendent datasets for
+ *            this userid.
+ * gl$<id>    identifies permissions granted locally for this groupid.
+ * gd$<id>    identifies permissions granted on descendent datasets for
+ *            this groupid.
+ * Gl$<id>    identifies permission sets granted locally for this groupid.
+ * Gd$<id>    identifies permission sets granted on descendent datasets for
+ *            this groupid.
+ * el$        identifies permissions granted locally for everyone.
+ * ed$        identifies permissions granted on descendent datasets
+ *            for everyone.
+ * El$        identifies permission sets granted locally for everyone.
+ * Ed$        identifies permission sets granted to descendent datasets for
+ *            everyone.
+ * c-$        identifies permission to create at dataset creation time.
+ * C-$        identifies permission sets to grant locally at dataset creation
+ *            time.
+ * s-$@<name> permissions defined in specified set @<name>
+ * S-$@<name> Sets defined in named set @<name>
+ *
+ * Each of the above entities points to another zap attribute that contains one
+ * attribute for each allowed permission, such as create, destroy,...
+ * All of the "upper" case class types will specify permission set names
+ * rather than permissions.
+ *
+ * Basically it looks something like this:
+ * ul$12 -> ZAP OBJ -> permissions...
+ *
+ * The ZAP OBJ is referred to as the jump object.
+ */
+
+#include <sys/dmu.h>
+#include <sys/dmu_objset.h>
+#include <sys/dmu_tx.h>
+#include <sys/dsl_dataset.h>
+#include <sys/dsl_dir.h>
+#include <sys/dsl_prop.h>
+#include <sys/dsl_synctask.h>
+#include <sys/dsl_deleg.h>
+#include <sys/spa.h>
+#include <sys/zap.h>
+#include <sys/fs/zfs.h>
+#include <sys/cred.h>
+#include <sys/sunddi.h>
+
+#include "zfs_deleg.h"
+
+/*
+ * Validate that user is allowed to delegate specified permissions.
+ *
+ * In order to delegate "create" you must have "create"
+ * and "allow".
+ */
+int
+dsl_deleg_can_allow(char *ddname, nvlist_t *nvp, cred_t *cr)
+{
+	nvpair_t *whopair = NULL;
+	int error;
+
+	if ((error = dsl_deleg_access(ddname, ZFS_DELEG_PERM_ALLOW, cr)) != 0)
+		return (error);
+
+	while (whopair = nvlist_next_nvpair(nvp, whopair)) {
+		nvlist_t *perms;
+		nvpair_t *permpair = NULL;
+
+		VERIFY(nvpair_value_nvlist(whopair, &perms) == 0);
+
+		while (permpair = nvlist_next_nvpair(perms, permpair)) {
+			const char *perm = nvpair_name(permpair);
+
+			if (strcmp(perm, ZFS_DELEG_PERM_ALLOW) == 0)
+				return (EPERM);
+
+			if ((error = dsl_deleg_access(ddname, perm, cr)) != 0)
+				return (error);
+		}
+	}
+	return (0);
+}
+
+/*
+ * Validate that user is allowed to unallow specified permissions.  They
+ * must have the 'allow' permission, and even then can only unallow
+ * perms for their uid.
+ */
+int
+dsl_deleg_can_unallow(char *ddname, nvlist_t *nvp, cred_t *cr)
+{
+	nvpair_t *whopair = NULL;
+	int error;
+	char idstr[32];
+
+	if ((error = dsl_deleg_access(ddname, ZFS_DELEG_PERM_ALLOW, cr)) != 0)
+		return (error);
+
+	(void) snprintf(idstr, sizeof (idstr), "%lld",
+	    (longlong_t)crgetuid(cr));
+
+	while (whopair = nvlist_next_nvpair(nvp, whopair)) {
+		zfs_deleg_who_type_t type = nvpair_name(whopair)[0];
+
+		if (type != ZFS_DELEG_USER &&
+		    type != ZFS_DELEG_USER_SETS)
+			return (EPERM);
+
+		if (strcmp(idstr, &nvpair_name(whopair)[3]) != 0)
+			return (EPERM);
+	}
+	return (0);
+}
+
+static void
+dsl_deleg_set_sync(void *arg1, void *arg2, dmu_tx_t *tx)
+{
+	dsl_dir_t *dd = arg1;
+	nvlist_t *nvp = arg2;
+	objset_t *mos = dd->dd_pool->dp_meta_objset;
+	nvpair_t *whopair = NULL;
+	uint64_t zapobj = dd->dd_phys->dd_deleg_zapobj;
+
+	if (zapobj == 0) {
+		dmu_buf_will_dirty(dd->dd_dbuf, tx);
+		zapobj = dd->dd_phys->dd_deleg_zapobj = zap_create(mos,
+		    DMU_OT_DSL_PERMS, DMU_OT_NONE, 0, tx);
+	}
+
+	while (whopair = nvlist_next_nvpair(nvp, whopair)) {
+		const char *whokey = nvpair_name(whopair);
+		nvlist_t *perms;
+		nvpair_t *permpair = NULL;
+		uint64_t jumpobj;
+
+		VERIFY(nvpair_value_nvlist(whopair, &perms) == 0);
+
+		if (zap_lookup(mos, zapobj, whokey, 8, 1, &jumpobj) != 0) {
+			jumpobj = zap_create(mos, DMU_OT_DSL_PERMS,
+			    DMU_OT_NONE, 0, tx);
+			VERIFY(zap_update(mos, zapobj,
+			    whokey, 8, 1, &jumpobj, tx) == 0);
+		}
+
+		while (permpair = nvlist_next_nvpair(perms, permpair)) {
+			const char *perm = nvpair_name(permpair);
+			uint64_t n = 0;
+
+			VERIFY(zap_update(mos, jumpobj,
+			    perm, 8, 1, &n, tx) == 0);
+			spa_history_log_internal(LOG_DS_PERM_UPDATE,
+			    dd->dd_pool->dp_spa, tx,
+			    "%s %s dataset = %llu", whokey, perm,
+			    dd->dd_phys->dd_head_dataset_obj);
+		}
+	}
+}
+
+static void
+dsl_deleg_unset_sync(void *arg1, void *arg2, dmu_tx_t *tx)
+{
+	dsl_dir_t *dd = arg1;
+	nvlist_t *nvp = arg2;
+	objset_t *mos = dd->dd_pool->dp_meta_objset;
+	nvpair_t *whopair = NULL;
+	uint64_t zapobj = dd->dd_phys->dd_deleg_zapobj;
+
+	if (zapobj == 0)
+		return;
+
+	while (whopair = nvlist_next_nvpair(nvp, whopair)) {
+		const char *whokey = nvpair_name(whopair);
+		nvlist_t *perms;
+		nvpair_t *permpair = NULL;
+		uint64_t jumpobj;
+
+		if (nvpair_value_nvlist(whopair, &perms) != 0) {
+			if (zap_lookup(mos, zapobj, whokey, 8,
+			    1, &jumpobj) == 0) {
+				(void) zap_remove(mos, zapobj, whokey, tx);
+				VERIFY(0 == zap_destroy(mos, jumpobj, tx));
+			}
+			spa_history_log_internal(LOG_DS_PERM_WHO_REMOVE,
+			    dd->dd_pool->dp_spa, tx,
+			    "%s dataset = %llu", whokey,
+			    dd->dd_phys->dd_head_dataset_obj);
+			continue;
+		}
+
+		if (zap_lookup(mos, zapobj, whokey, 8, 1, &jumpobj) != 0)
+			continue;
+
+		while (permpair = nvlist_next_nvpair(perms, permpair)) {
+			const char *perm = nvpair_name(permpair);
+			uint64_t n = 0;
+
+			(void) zap_remove(mos, jumpobj, perm, tx);
+			if (zap_count(mos, jumpobj, &n) == 0 && n == 0) {
+				(void) zap_remove(mos, zapobj,
+				    whokey, tx);
+				VERIFY(0 == zap_destroy(mos,
+				    jumpobj, tx));
+			}
+			spa_history_log_internal(LOG_DS_PERM_REMOVE,
+			    dd->dd_pool->dp_spa, tx,
+			    "%s %s dataset = %llu", whokey, perm,
+			    dd->dd_phys->dd_head_dataset_obj);
+		}
+	}
+}
+
+int
+dsl_deleg_set(const char *ddname, nvlist_t *nvp, boolean_t unset)
+{
+	dsl_dir_t *dd;
+	int error;
+	nvpair_t *whopair = NULL;
+	int blocks_modified = 0;
+
+	error = dsl_dir_open(ddname, FTAG, &dd, NULL);
+	if (error)
+		return (error);
+
+	if (spa_version(dmu_objset_spa(dd->dd_pool->dp_meta_objset)) <
+	    SPA_VERSION_DELEGATED_PERMS) {
+		dsl_dir_close(dd, FTAG);
+		return (ENOTSUP);
+	}
+
+	while (whopair = nvlist_next_nvpair(nvp, whopair))
+		blocks_modified++;
+
+	error = dsl_sync_task_do(dd->dd_pool, NULL,
+	    unset ? dsl_deleg_unset_sync : dsl_deleg_set_sync,
+	    dd, nvp, blocks_modified);
+	dsl_dir_close(dd, FTAG);
+
+	return (error);
+}
+
+/*
+ * Find all 'allow' permissions from a given point and then continue
+ * traversing up to the root.
+ *
+ * This function constructs an nvlist of nvlists.
+ * Each setpoint is an nvlist composed of an nvlist of an nvlist
+ * of the individual users/groups/everyone/create
+ * permissions.
+ *
+ * The nvlist will look like this.
+ *
+ * { source fsname -> { whokeys { permissions,...}, ...}}
+ *
+ * The fsname nvpairs will be arranged in a bottom up order.  For example,
+ * if we have the following structure a/b/c then the nvpairs for the fsnames
+ * will be ordered a/b/c, a/b, a.
+ */
+int
+dsl_deleg_get(const char *ddname, nvlist_t **nvp)
+{
+	dsl_dir_t *dd, *startdd;
+	dsl_pool_t *dp;
+	int error;
+	objset_t *mos;
+
+	error = dsl_dir_open(ddname, FTAG, &startdd, NULL);
+	if (error)
+		return (error);
+
+	dp = startdd->dd_pool;
+	mos = dp->dp_meta_objset;
+
+	VERIFY(nvlist_alloc(nvp, NV_UNIQUE_NAME, KM_SLEEP) == 0);
+
+	rw_enter(&dp->dp_config_rwlock, RW_READER);
+	for (dd = startdd; dd != NULL; dd = dd->dd_parent) {
+		zap_cursor_t basezc;
+		zap_attribute_t baseza;
+		nvlist_t *sp_nvp;
+		uint64_t n;
+		char source[MAXNAMELEN];
+
+		if (dd->dd_phys->dd_deleg_zapobj &&
+		    (zap_count(mos, dd->dd_phys->dd_deleg_zapobj,
+		    &n) == 0) && n) {
+			VERIFY(nvlist_alloc(&sp_nvp,
+			    NV_UNIQUE_NAME, KM_SLEEP) == 0);
+		} else {
+			continue;
+		}
+
+		for (zap_cursor_init(&basezc, mos,
+		    dd->dd_phys->dd_deleg_zapobj);
+		    zap_cursor_retrieve(&basezc, &baseza) == 0;
+		    zap_cursor_advance(&basezc)) {
+			zap_cursor_t zc;
+			zap_attribute_t za;
+			nvlist_t *perms_nvp;
+
+			ASSERT(baseza.za_integer_length == 8);
+			ASSERT(baseza.za_num_integers == 1);
+
+			VERIFY(nvlist_alloc(&perms_nvp,
+			    NV_UNIQUE_NAME, KM_SLEEP) == 0);
+			for (zap_cursor_init(&zc, mos, baseza.za_first_integer);
+			    zap_cursor_retrieve(&zc, &za) == 0;
+			    zap_cursor_advance(&zc)) {
+				VERIFY(nvlist_add_boolean(perms_nvp,
+				    za.za_name) == 0);
+			}
+			zap_cursor_fini(&zc);
+			VERIFY(nvlist_add_nvlist(sp_nvp, baseza.za_name,
+			    perms_nvp) == 0);
+			nvlist_free(perms_nvp);
+		}
+
+		zap_cursor_fini(&basezc);
+
+		dsl_dir_name(dd, source);
+		VERIFY(nvlist_add_nvlist(*nvp, source, sp_nvp) == 0);
+		nvlist_free(sp_nvp);
+	}
+	rw_exit(&dp->dp_config_rwlock);
+
+	dsl_dir_close(startdd, FTAG);
+	return (0);
+}
+
+/*
+ * Routines for dsl_deleg_access() -- access checking.
+ */
+typedef struct perm_set {
+	avl_node_t	p_node;
+	boolean_t	p_matched;
+	char		p_setname[ZFS_MAX_DELEG_NAME];
+} perm_set_t;
+
+static int
+perm_set_compare(const void *arg1, const void *arg2)
+{
+	const perm_set_t *node1 = arg1;
+	const perm_set_t *node2 = arg2;
+	int val;
+
+	val = strcmp(node1->p_setname, node2->p_setname);
+	if (val == 0)
+		return (0);
+	return (val > 0 ? 1 : -1);
+}
+
+/*
+ * Determine whether a specified permission exists.
+ *
+ * First the base attribute has to be retrieved, e.g. ul$12.
+ * Once the base object has been retrieved, the actual permission
+ * is looked up in the zap object the base object points to.
+ *
+ * Return 0 if permission exists, ENOENT if there is no whokey, EPERM if
+ * there is no perm in that jumpobj.
+ */
+static int
+dsl_check_access(objset_t *mos, uint64_t zapobj,
+    char type, char checkflag, void *valp, const char *perm)
+{
+	int error;
+	uint64_t jumpobj, zero;
+	char whokey[ZFS_MAX_DELEG_NAME];
+
+	zfs_deleg_whokey(whokey, type, checkflag, valp);
+	error = zap_lookup(mos, zapobj, whokey, 8, 1, &jumpobj);
+	if (error == 0) {
+		error = zap_lookup(mos, jumpobj, perm, 8, 1, &zero);
+		if (error == ENOENT)
+			error = EPERM;
+	}
+	return (error);
+}
+
+/*
+ * check a specified user/group for a requested permission
+ */
+static int
+dsl_check_user_access(objset_t *mos, uint64_t zapobj, const char *perm,
+    int checkflag, cred_t *cr)
+{
+	const	gid_t *gids;
+	int	ngids;
+	int	i;
+	uint64_t id;
+
+	/* check for user */
+	id = crgetuid(cr);
+	if (dsl_check_access(mos, zapobj,
+	    ZFS_DELEG_USER, checkflag, &id, perm) == 0)
+		return (0);
+
+	/* check for user's primary group */
+	id = crgetgid(cr);
+	if (dsl_check_access(mos, zapobj,
+	    ZFS_DELEG_GROUP, checkflag, &id, perm) == 0)
+		return (0);
+
+	/* check for everyone entry */
+	id = -1;
+	if (dsl_check_access(mos, zapobj,
+	    ZFS_DELEG_EVERYONE, checkflag, &id, perm) == 0)
+		return (0);
+
+	/* check each supplemental group user is a member of */
+	ngids = crgetngroups(cr);
+	gids = crgetgroups(cr);
+	for (i = 0; i != ngids; i++) {
+		id = gids[i];
+		if (dsl_check_access(mos, zapobj,
+		    ZFS_DELEG_GROUP, checkflag, &id, perm) == 0)
+			return (0);
+	}
+
+	return (EPERM);
+}
+
+/*
+ * Iterate over the sets specified in the specified zapobj
+ * and load them into the permsets avl tree.
+ */
+static int
+dsl_load_sets(objset_t *mos, uint64_t zapobj,
+    char type, char checkflag, void *valp, avl_tree_t *avl)
+{
+	zap_cursor_t zc;
+	zap_attribute_t za;
+	perm_set_t *permnode;
+	avl_index_t idx;
+	uint64_t jumpobj;
+	int error;
+	char whokey[ZFS_MAX_DELEG_NAME];
+
+	zfs_deleg_whokey(whokey, type, checkflag, valp);
+
+	error = zap_lookup(mos, zapobj, whokey, 8, 1, &jumpobj);
+	if (error != 0)
+		return (error);
+
+	for (zap_cursor_init(&zc, mos, jumpobj);
+	    zap_cursor_retrieve(&zc, &za) == 0;
+	    zap_cursor_advance(&zc)) {
+		permnode = kmem_alloc(sizeof (perm_set_t), KM_SLEEP);
+		(void) strlcpy(permnode->p_setname, za.za_name,
+		    sizeof (permnode->p_setname));
+		permnode->p_matched = B_FALSE;
+
+		if (avl_find(avl, permnode, &idx) == NULL) {
+			avl_insert(avl, permnode, idx);
+		} else {
+			kmem_free(permnode, sizeof (perm_set_t));
+		}
+	}
+	zap_cursor_fini(&zc);
+	return (0);
+}
+
+/*
+ * Load all permission sets that apply to the user identified by cred.
+ */
+static void
+dsl_load_user_sets(objset_t *mos, uint64_t zapobj, avl_tree_t *avl,
+    char checkflag, cred_t *cr)
+{
+	const	gid_t *gids;
+	int	ngids, i;
+	uint64_t id;
+
+	id = crgetuid(cr);
+	(void) dsl_load_sets(mos, zapobj,
+	    ZFS_DELEG_USER_SETS, checkflag, &id, avl);
+
+	id = crgetgid(cr);
+	(void) dsl_load_sets(mos, zapobj,
+	    ZFS_DELEG_GROUP_SETS, checkflag, &id, avl);
+
+	(void) dsl_load_sets(mos, zapobj,
+	    ZFS_DELEG_EVERYONE_SETS, checkflag, NULL, avl);
+
+	ngids = crgetngroups(cr);
+	gids = crgetgroups(cr);
+	for (i = 0; i != ngids; i++) {
+		id = gids[i];
+		(void) dsl_load_sets(mos, zapobj,
+		    ZFS_DELEG_GROUP_SETS, checkflag, &id, avl);
+	}
+}
+
+/*
+ * Check if user has requested permission.
+ */
+int
+dsl_deleg_access_impl(dsl_dataset_t *ds, const char *perm, cred_t *cr)
+{
+	dsl_dir_t *dd;
+	dsl_pool_t *dp;
+	void *cookie;
+	int	error;
+	char	checkflag;
+	objset_t *mos;
+	avl_tree_t permsets;
+	perm_set_t *setnode;
+
+	dp = ds->ds_dir->dd_pool;
+	mos = dp->dp_meta_objset;
+
+	if (dsl_delegation_on(mos) == B_FALSE)
+		return (ECANCELED);
+
+	if (spa_version(dmu_objset_spa(dp->dp_meta_objset)) <
+	    SPA_VERSION_DELEGATED_PERMS)
+		return (EPERM);
+
+	if (dsl_dataset_is_snapshot(ds)) {
+		/*
+		 * Snapshots are treated as descendents only,
+		 * local permissions do not apply.
+		 */
+		checkflag = ZFS_DELEG_DESCENDENT;
+	} else {
+		checkflag = ZFS_DELEG_LOCAL;
+	}
+
+	avl_create(&permsets, perm_set_compare, sizeof (perm_set_t),
+	    offsetof(perm_set_t, p_node));
+
+	rw_enter(&dp->dp_config_rwlock, RW_READER);
+	for (dd = ds->ds_dir; dd != NULL; dd = dd->dd_parent,
+	    checkflag = ZFS_DELEG_DESCENDENT) {
+		uint64_t zapobj;
+		boolean_t expanded;
+
+		/*
+		 * If not in global zone then make sure
+		 * the zoned property is set
+		 */
+		if (!INGLOBALZONE(curproc)) {
+			uint64_t zoned;
+
+			if (dsl_prop_get_dd(dd,
+			    zfs_prop_to_name(ZFS_PROP_ZONED),
+			    8, 1, &zoned, NULL, B_FALSE) != 0)
+				break;
+			if (!zoned)
+				break;
+		}
+		zapobj = dd->dd_phys->dd_deleg_zapobj;
+
+		if (zapobj == 0)
+			continue;
+
+		dsl_load_user_sets(mos, zapobj, &permsets, checkflag, cr);
+again:
+		expanded = B_FALSE;
+		for (setnode = avl_first(&permsets); setnode;
+		    setnode = AVL_NEXT(&permsets, setnode)) {
+			if (setnode->p_matched == B_TRUE)
+				continue;
+
+			/* See if this set directly grants this permission */
+			error = dsl_check_access(mos, zapobj,
+			    ZFS_DELEG_NAMED_SET, 0, setnode->p_setname, perm);
+			if (error == 0)
+				goto success;
+			if (error == EPERM)
+				setnode->p_matched = B_TRUE;
+
+			/* See if this set includes other sets */
+			error = dsl_load_sets(mos, zapobj,
+			    ZFS_DELEG_NAMED_SET_SETS, 0,
+			    setnode->p_setname, &permsets);
+			if (error == 0)
+				setnode->p_matched = expanded = B_TRUE;
+		}
+		/*
+		 * If we expanded any sets, that will define more sets,
+		 * which we need to check.
+		 */
+		if (expanded)
+			goto again;
+
+		error = dsl_check_user_access(mos, zapobj, perm, checkflag, cr);
+		if (error == 0)
+			goto success;
+	}
+	error = EPERM;
+success:
+	rw_exit(&dp->dp_config_rwlock);
+
+	cookie = NULL;
+	while ((setnode = avl_destroy_nodes(&permsets, &cookie)) != NULL)
+		kmem_free(setnode, sizeof (perm_set_t));
+
+	return (error);
+}
+
+int
+dsl_deleg_access(const char *dsname, const char *perm, cred_t *cr)
+{
+	dsl_dataset_t *ds;
+	int error;
+
+	error = dsl_dataset_hold(dsname, FTAG, &ds);
+	if (error)
+		return (error);
+
+	error = dsl_deleg_access_impl(ds, perm, cr);
+	dsl_dataset_rele(ds, FTAG);
+
+	return (error);
+}
+
+/*
+ * Other routines.
+ */
+
+static void
+copy_create_perms(dsl_dir_t *dd, uint64_t pzapobj,
+    boolean_t dosets, uint64_t uid, dmu_tx_t *tx)
+{
+	objset_t *mos = dd->dd_pool->dp_meta_objset;
+	uint64_t jumpobj, pjumpobj;
+	uint64_t zapobj = dd->dd_phys->dd_deleg_zapobj;
+	zap_cursor_t zc;
+	zap_attribute_t za;
+	char whokey[ZFS_MAX_DELEG_NAME];
+
+	zfs_deleg_whokey(whokey,
+	    dosets ? ZFS_DELEG_CREATE_SETS : ZFS_DELEG_CREATE,
+	    ZFS_DELEG_LOCAL, NULL);
+	if (zap_lookup(mos, pzapobj, whokey, 8, 1, &pjumpobj) != 0)
+		return;
+
+	if (zapobj == 0) {
+		dmu_buf_will_dirty(dd->dd_dbuf, tx);
+		zapobj = dd->dd_phys->dd_deleg_zapobj = zap_create(mos,
+		    DMU_OT_DSL_PERMS, DMU_OT_NONE, 0, tx);
+	}
+
+	zfs_deleg_whokey(whokey,
+	    dosets ? ZFS_DELEG_USER_SETS : ZFS_DELEG_USER,
+	    ZFS_DELEG_LOCAL, &uid);
+	if (zap_lookup(mos, zapobj, whokey, 8, 1, &jumpobj) == ENOENT) {
+		jumpobj = zap_create(mos, DMU_OT_DSL_PERMS, DMU_OT_NONE, 0, tx);
+		VERIFY(zap_add(mos, zapobj, whokey, 8, 1, &jumpobj, tx) == 0);
+	}
+
+	for (zap_cursor_init(&zc, mos, pjumpobj);
+	    zap_cursor_retrieve(&zc, &za) == 0;
+	    zap_cursor_advance(&zc)) {
+		uint64_t zero = 0;
+		ASSERT(za.za_integer_length == 8 && za.za_num_integers == 1);
+
+		VERIFY(zap_update(mos, jumpobj, za.za_name,
+		    8, 1, &zero, tx) == 0);
+	}
+	zap_cursor_fini(&zc);
+}
+
+/*
+ * Set all create-time permissions on the new dataset.
+ */
+void
+dsl_deleg_set_create_perms(dsl_dir_t *sdd, dmu_tx_t *tx, cred_t *cr)
+{
+	dsl_dir_t *dd;
+	uint64_t uid = crgetuid(cr);
+
+	if (spa_version(dmu_objset_spa(sdd->dd_pool->dp_meta_objset)) <
+	    SPA_VERSION_DELEGATED_PERMS)
+		return;
+
+	for (dd = sdd->dd_parent; dd != NULL; dd = dd->dd_parent) {
+		uint64_t pzapobj = dd->dd_phys->dd_deleg_zapobj;
+
+		if (pzapobj == 0)
+			continue;
+
+		copy_create_perms(sdd, pzapobj, B_FALSE, uid, tx);
+		copy_create_perms(sdd, pzapobj, B_TRUE, uid, tx);
+	}
+}
+
+int
+dsl_deleg_destroy(objset_t *mos, uint64_t zapobj, dmu_tx_t *tx)
+{
+	zap_cursor_t zc;
+	zap_attribute_t za;
+
+	if (zapobj == 0)
+		return (0);
+
+	for (zap_cursor_init(&zc, mos, zapobj);
+	    zap_cursor_retrieve(&zc, &za) == 0;
+	    zap_cursor_advance(&zc)) {
+		ASSERT(za.za_integer_length == 8 && za.za_num_integers == 1);
+		VERIFY(0 == zap_destroy(mos, za.za_first_integer, tx));
+	}
+	zap_cursor_fini(&zc);
+	VERIFY(0 == zap_destroy(mos, zapobj, tx));
+	return (0);
+}
+
+boolean_t
+dsl_delegation_on(objset_t *os)
+{
+	return (!!spa_delegation(os->os_spa));
+}
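
Note on the who-key layout used throughout dsl_deleg.c above: the header comment describes per-id keys of the form <type><l|d>$<id> (for example ul$12), and dsl_deleg_can_unallow() compares the caller's uid string against the key starting at offset 3. The sketch below shows both steps in isolation. It is an assumption-laden illustration, not the real zfs_deleg_whokey(): sketch_id_whokey() and sketch_whokey_has_id() are hypothetical names, and only the user/group key shapes from the comment are covered.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not the real zfs_deleg_whokey(): format a
 * per-id who-key as <type><l|d>$<id>, following the layout in the
 * dsl_deleg.c header comment (e.g. "ul$12").
 * Returns the snprintf() result.
 */
static int
sketch_id_whokey(char *buf, size_t len, char type, char inherit, uint64_t id)
{
	assert(buf != NULL && len > 0);
	return (snprintf(buf, len, "%c%c$%llu", type, inherit,
	    (unsigned long long)id));
}

/*
 * Return 1 if the id portion of a who-key (everything from offset 3)
 * equals idstr, mirroring the strcmp() in dsl_deleg_can_unallow().
 */
static int
sketch_whokey_has_id(const char *whokey, const char *idstr)
{
	return (strlen(whokey) >= 3 && strcmp(&whokey[3], idstr) == 0);
}
```

For uid 12 with a local grant, the formatted key is "ul$12", and comparing from offset 3 against "12" succeeds, which is the check that restricts unallow to the caller's own uid.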
--- a/uts/common/fs/zfs/dsl_dir.c
+++ b/uts/common/fs/zfs/dsl_dir.c
--- a/uts/common/fs/zfs/dsl_pool.c
+++ b/uts/common/fs/zfs/dsl_pool.c
@ -0,0 +1,848 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#include <sys/dsl_pool.h>
+#include <sys/dsl_dataset.h>
+#include <sys/dsl_prop.h>
+#include <sys/dsl_dir.h>
+#include <sys/dsl_synctask.h>
+#include <sys/dsl_scan.h>
+#include <sys/dnode.h>
+#include <sys/dmu_tx.h>
+#include <sys/dmu_objset.h>
+#include <sys/arc.h>
+#include <sys/zap.h>
+#include <sys/zio.h>
+#include <sys/zfs_context.h>
+#include <sys/fs/zfs.h>
+#include <sys/zfs_znode.h>
+#include <sys/spa_impl.h>
+#include <sys/dsl_deadlist.h>
+
+int zfs_no_write_throttle = 0;
+int zfs_write_limit_shift = 3;			/* 1/8th of physical memory */
+int zfs_txg_synctime_ms = 1000;		/* target millisecs to sync a txg */
+
+uint64_t zfs_write_limit_min = 32 << 20;	/* min write limit is 32MB */
+uint64_t zfs_write_limit_max = 0;		/* max data payload per txg */
+uint64_t zfs_write_limit_inflated = 0;
+uint64_t zfs_write_limit_override = 0;
+
+kmutex_t zfs_write_limit_lock;
+
+static pgcnt_t old_physmem = 0;
+
+int
+dsl_pool_open_special_dir(dsl_pool_t *dp, const char *name, dsl_dir_t **ddp)
+{
+	uint64_t obj;
+	int err;
+
+	err = zap_lookup(dp->dp_meta_objset,
+	    dp->dp_root_dir->dd_phys->dd_child_dir_zapobj,
+	    name, sizeof (obj), 1, &obj);
+	if (err)
+		return (err);
+
+	return (dsl_dir_open_obj(dp, obj, name, dp, ddp));
+}
+
+static dsl_pool_t *
+dsl_pool_open_impl(spa_t *spa, uint64_t txg)
+{
+	dsl_pool_t *dp;
+	blkptr_t *bp = spa_get_rootblkptr(spa);
+
+	dp = kmem_zalloc(sizeof (dsl_pool_t), KM_SLEEP);
+	dp->dp_spa = spa;
+	dp->dp_meta_rootbp = *bp;
+	rw_init(&dp->dp_config_rwlock, NULL, RW_DEFAULT, NULL);
+	dp->dp_write_limit = zfs_write_limit_min;
+	txg_init(dp, txg);
+
+	txg_list_create(&dp->dp_dirty_datasets,
+	    offsetof(dsl_dataset_t, ds_dirty_link));
+	txg_list_create(&dp->dp_dirty_dirs,
+	    offsetof(dsl_dir_t, dd_dirty_link));
+	txg_list_create(&dp->dp_sync_tasks,
+	    offsetof(dsl_sync_task_group_t, dstg_node));
+	list_create(&dp->dp_synced_datasets, sizeof (dsl_dataset_t),
+	    offsetof(dsl_dataset_t, ds_synced_link));
+
+	mutex_init(&dp->dp_lock, NULL, MUTEX_DEFAULT, NULL);
+
+	dp->dp_vnrele_taskq = taskq_create("zfs_vn_rele_taskq", 1, minclsyspri,
+	    1, 4, 0);
+
+	return (dp);
+}
+
+int
+dsl_pool_open(spa_t *spa, uint64_t txg, dsl_pool_t **dpp)
+{
+	int err;
+	dsl_pool_t *dp = dsl_pool_open_impl(spa, txg);
+	dsl_dir_t *dd;
+	dsl_dataset_t *ds;
+	uint64_t obj;
+
+	rw_enter(&dp->dp_config_rwlock, RW_WRITER);
+	err = dmu_objset_open_impl(spa, NULL, &dp->dp_meta_rootbp,
+	    &dp->dp_meta_objset);
+	if (err)
+		goto out;
+
+	err = zap_lookup(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
+	    DMU_POOL_ROOT_DATASET, sizeof (uint64_t), 1,
+	    &dp->dp_root_dir_obj);
+	if (err)
+		goto out;
+
+	err = dsl_dir_open_obj(dp, dp->dp_root_dir_obj,
+	    NULL, dp, &dp->dp_root_dir);
+	if (err)
+		goto out;
+
+	err = dsl_pool_open_special_dir(dp, MOS_DIR_NAME, &dp->dp_mos_dir);
+	if (err)
+		goto out;
+
+	if (spa_version(spa) >= SPA_VERSION_ORIGIN) {
+		err = dsl_pool_open_special_dir(dp, ORIGIN_DIR_NAME, &dd);
+		if (err)
+			goto out;
+		err = dsl_dataset_hold_obj(dp, dd->dd_phys->dd_head_dataset_obj,
+		    FTAG, &ds);
+		if (err == 0) {
+			err = dsl_dataset_hold_obj(dp,
+			    ds->ds_phys->ds_prev_snap_obj, dp,
+			    &dp->dp_origin_snap);
+			dsl_dataset_rele(ds, FTAG);
+		}
+		dsl_dir_close(dd, dp);
+		if (err)
+			goto out;
+	}
+
+	if (spa_version(spa) >= SPA_VERSION_DEADLISTS) {
+		err = dsl_pool_open_special_dir(dp, FREE_DIR_NAME,
+		    &dp->dp_free_dir);
+		if (err)
+			goto out;
+
+		err = zap_lookup(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
+		    DMU_POOL_FREE_BPOBJ, sizeof (uint64_t), 1, &obj);
+		if (err)
+			goto out;
+		VERIFY3U(0, ==, bpobj_open(&dp->dp_free_bpobj,
+		    dp->dp_meta_objset, obj));
+	}
+
+	err = zap_lookup(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
+	    DMU_POOL_TMP_USERREFS, sizeof (uint64_t), 1,
+	    &dp->dp_tmp_userrefs_obj);
+	if (err == ENOENT)
+		err = 0;
+	if (err)
+		goto out;
+
+	err = dsl_scan_init(dp, txg);
+
+out:
+	rw_exit(&dp->dp_config_rwlock);
+	if (err)
+		dsl_pool_close(dp);
+	else
+		*dpp = dp;
+
+	return (err);
+}
+
+void
+dsl_pool_close(dsl_pool_t *dp)
+{
+	/* drop our references from dsl_pool_open() */
+
+	/*
+	 * Since we held the origin_snap from "syncing" context (which
+	 * includes pool-opening context), it actually only got a "ref"
+	 * and not a hold, so just drop that here.
+	 */
+	if (dp->dp_origin_snap)
+		dsl_dataset_drop_ref(dp->dp_origin_snap, dp);
+	if (dp->dp_mos_dir)
+		dsl_dir_close(dp->dp_mos_dir, dp);
+	if (dp->dp_free_dir)
+		dsl_dir_close(dp->dp_free_dir, dp);
+	if (dp->dp_root_dir)
+		dsl_dir_close(dp->dp_root_dir, dp);
+
+	bpobj_close(&dp->dp_free_bpobj);
+
+	/* undo the dmu_objset_open_impl(mos) from dsl_pool_open() */
+	if (dp->dp_meta_objset)
+		dmu_objset_evict(dp->dp_meta_objset);
+
+	txg_list_destroy(&dp->dp_dirty_datasets);
+	txg_list_destroy(&dp->dp_sync_tasks);
+	txg_list_destroy(&dp->dp_dirty_dirs);
+	list_destroy(&dp->dp_synced_datasets);
+
+	arc_flush(dp->dp_spa);
+	txg_fini(dp);
+	dsl_scan_fini(dp);
+	rw_destroy(&dp->dp_config_rwlock);
+	mutex_destroy(&dp->dp_lock);
+	taskq_destroy(dp->dp_vnrele_taskq);
+	if (dp->dp_blkstats)
+		kmem_free(dp->dp_blkstats, sizeof (zfs_all_blkstats_t));
+	kmem_free(dp, sizeof (dsl_pool_t));
+}
+
+dsl_pool_t *
+dsl_pool_create(spa_t *spa, nvlist_t *zplprops, uint64_t txg)
+{
+	int err;
+	dsl_pool_t *dp = dsl_pool_open_impl(spa, txg);
+	dmu_tx_t *tx = dmu_tx_create_assigned(dp, txg);
+	objset_t *os;
+	dsl_dataset_t *ds;
+	uint64_t obj;
+
+	/* create and open the MOS (meta-objset) */
+	dp->dp_meta_objset = dmu_objset_create_impl(spa,
+	    NULL, &dp->dp_meta_rootbp, DMU_OST_META, tx);
+
+	/* create the pool directory */
+	err = zap_create_claim(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
+	    DMU_OT_OBJECT_DIRECTORY, DMU_OT_NONE, 0, tx);
+	ASSERT3U(err, ==, 0);
+
+	/* Initialize scan structures */
+	VERIFY3U(0, ==, dsl_scan_init(dp, txg));
+
+	/* create and open the root dir */
+	dp->dp_root_dir_obj = dsl_dir_create_sync(dp, NULL, NULL, tx);
+	VERIFY(0 == dsl_dir_open_obj(dp, dp->dp_root_dir_obj,
+	    NULL, dp, &dp->dp_root_dir));
+
+	/* create and open the meta-objset dir */
+	(void) dsl_dir_create_sync(dp, dp->dp_root_dir, MOS_DIR_NAME, tx);
+	VERIFY(0 == dsl_pool_open_special_dir(dp,
+	    MOS_DIR_NAME, &dp->dp_mos_dir));
+
+	if (spa_version(spa) >= SPA_VERSION_DEADLISTS) {
+		/* create and open the free dir */
+		(void) dsl_dir_create_sync(dp, dp->dp_root_dir,
+		    FREE_DIR_NAME, tx);
+		VERIFY(0 == dsl_pool_open_special_dir(dp,
+		    FREE_DIR_NAME, &dp->dp_free_dir));
+
+		/* create and open the free bpobj */
+		obj = bpobj_alloc(dp->dp_meta_objset, SPA_MAXBLOCKSIZE, tx);
+		VERIFY(zap_add(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
+		    DMU_POOL_FREE_BPOBJ, sizeof (uint64_t), 1, &obj, tx) == 0);
+		VERIFY3U(0, ==, bpobj_open(&dp->dp_free_bpobj,
+		    dp->dp_meta_objset, obj));
+	}
+
+	if (spa_version(spa) >= SPA_VERSION_DSL_SCRUB)
+		dsl_pool_create_origin(dp, tx);
+
+	/* create the root dataset */
+	obj = dsl_dataset_create_sync_dd(dp->dp_root_dir, NULL, 0, tx);
+
+	/* create the root objset */
+	VERIFY(0 == dsl_dataset_hold_obj(dp, obj, FTAG, &ds));
+	os = dmu_objset_create_impl(dp->dp_spa, ds,
+	    dsl_dataset_get_blkptr(ds), DMU_OST_ZFS, tx);
+#ifdef _KERNEL
+	zfs_create_fs(os, kcred, zplprops, tx);
+#endif
+	dsl_dataset_rele(ds, FTAG);
+
+	dmu_tx_commit(tx);
+
+	return (dp);
+}
+
+static int
+deadlist_enqueue_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
+{
+	dsl_deadlist_t *dl = arg;
+	dsl_deadlist_insert(dl, bp, tx);
+	return (0);
+}
+
+void
+dsl_pool_sync(dsl_pool_t *dp, uint64_t txg)
+{
+	zio_t *zio;
+	dmu_tx_t *tx;
+	dsl_dir_t *dd;
+	dsl_dataset_t *ds;
+	dsl_sync_task_group_t *dstg;
+	objset_t *mos = dp->dp_meta_objset;
+	hrtime_t start, write_time;
+	uint64_t data_written;
+	int err;
+
+	/*
+	 * We need to copy dp_space_towrite[] before doing
+	 * dsl_sync_task_group_sync(), because
+	 * dsl_dataset_snapshot_reserve_space() will increase
+	 * dp_space_towrite but not actually write anything.
+	 */
+	data_written = dp->dp_space_towrite[txg & TXG_MASK];
+
+	tx = dmu_tx_create_assigned(dp, txg);
+
+	dp->dp_read_overhead = 0;
+	start = gethrtime();
+
+	zio = zio_root(dp->dp_spa, NULL, NULL, ZIO_FLAG_MUSTSUCCEED);
+	while (ds = txg_list_remove(&dp->dp_dirty_datasets, txg)) {
+		/*
+		 * We must not sync any non-MOS datasets twice, because
+		 * we may have taken a snapshot of them.  However, we
+		 * may sync newly-created datasets on pass 2.
+		 */
+		ASSERT(!list_link_active(&ds->ds_synced_link));
+		list_insert_tail(&dp->dp_synced_datasets, ds);
+		dsl_dataset_sync(ds, zio, tx);
+	}
+	DTRACE_PROBE(pool_sync__1setup);
+	err = zio_wait(zio);
+
+	write_time = gethrtime() - start;
+	ASSERT(err == 0);
+	DTRACE_PROBE(pool_sync__2rootzio);
+
+	for (ds = list_head(&dp->dp_synced_datasets); ds;
+	    ds = list_next(&dp->dp_synced_datasets, ds))
+		dmu_objset_do_userquota_updates(ds->ds_objset, tx);
+
+	/*
+	 * Sync the datasets again to push out the changes due to
+	 * userspace updates.  This must be done before we process the
+	 * sync tasks, because that could cause a snapshot of a dataset
+	 * whose ds_bp will be rewritten when we do this 2nd sync.
+	 */
+	zio = zio_root(dp->dp_spa, NULL, NULL, ZIO_FLAG_MUSTSUCCEED);
+	while (ds = txg_list_remove(&dp->dp_dirty_datasets, txg)) {
+		ASSERT(list_link_active(&ds->ds_synced_link));
+		dmu_buf_rele(ds->ds_dbuf, ds);
+		dsl_dataset_sync(ds, zio, tx);
+	}
+	err = zio_wait(zio);
+
+	/*
+	 * Move dead blocks from the pending deadlist to the on-disk
+	 * deadlist.
+	 */
+	for (ds = list_head(&dp->dp_synced_datasets); ds;
+	    ds = list_next(&dp->dp_synced_datasets, ds)) {
+		bplist_iterate(&ds->ds_pending_deadlist,
+		    deadlist_enqueue_cb, &ds->ds_deadlist, tx);
+	}
+
+	while (dstg = txg_list_remove(&dp->dp_sync_tasks, txg)) {
+		/*
+		 * No more sync tasks should have been added while we
+		 * were syncing.
+		 */
+		ASSERT(spa_sync_pass(dp->dp_spa) == 1);
+		dsl_sync_task_group_sync(dstg, tx);
+	}
+	DTRACE_PROBE(pool_sync__3task);
+
+	start = gethrtime();
+	while (dd = txg_list_remove(&dp->dp_dirty_dirs, txg))
+		dsl_dir_sync(dd, tx);
+	write_time += gethrtime() - start;
+
+	start = gethrtime();
+	if (list_head(&mos->os_dirty_dnodes[txg & TXG_MASK]) != NULL ||
+	    list_head(&mos->os_free_dnodes[txg & TXG_MASK]) != NULL) {
+		zio = zio_root(dp->dp_spa, NULL, NULL, ZIO_FLAG_MUSTSUCCEED);
+		dmu_objset_sync(mos, zio, tx);
+		err = zio_wait(zio);
+		ASSERT(err == 0);
+		dprintf_bp(&dp->dp_meta_rootbp, "meta objset rootbp is %s", "");
+		spa_set_rootblkptr(dp->dp_spa, &dp->dp_meta_rootbp);
+	}
+	write_time += gethrtime() - start;
+	DTRACE_PROBE2(pool_sync__4io, hrtime_t, write_time,
+	    hrtime_t, dp->dp_read_overhead);
+	write_time -= dp->dp_read_overhead;
+
+	dmu_tx_commit(tx);
+
+	dp->dp_space_towrite[txg & TXG_MASK] = 0;
+	ASSERT(dp->dp_tempreserved[txg & TXG_MASK] == 0);
+
+	/*
+	 * If the write limit max has not been explicitly set, set it
+	 * to a fraction of available physical memory (default 1/8th).
+	 * Note that we must inflate the limit because the spa
+	 * inflates write sizes to account for data replication.
+	 * Check this each sync phase to catch changing memory size.
+	 */
+	if (physmem != old_physmem && zfs_write_limit_shift) {
+		mutex_enter(&zfs_write_limit_lock);
+		old_physmem = physmem;
+		zfs_write_limit_max = ptob(physmem) >> zfs_write_limit_shift;
+		zfs_write_limit_inflated = MAX(zfs_write_limit_min,
+		    spa_get_asize(dp->dp_spa, zfs_write_limit_max));
+		mutex_exit(&zfs_write_limit_lock);
+	}
+
+	/*
+	 * Attempt to keep the sync time consistent by adjusting the
+	 * amount of write traffic allowed into each transaction group.
+	 * Weight the throughput calculation towards the current value:
+	 * 	thru = 3/4 old_thru + 1/4 new_thru
+	 *
+	 * Note: write_time is in nanosecs, so write_time/MICROSEC
+	 * yields millisecs
+	 */
+	ASSERT(zfs_write_limit_min > 0);
+	if (data_written > zfs_write_limit_min / 8 && write_time > MICROSEC) {
+		uint64_t throughput = data_written / (write_time / MICROSEC);
+
+		if (dp->dp_throughput)
+			dp->dp_throughput = throughput / 4 +
+			    3 * dp->dp_throughput / 4;
+		else
+			dp->dp_throughput = throughput;
+		dp->dp_write_limit = MIN(zfs_write_limit_inflated,
+		    MAX(zfs_write_limit_min,
+		    dp->dp_throughput * zfs_txg_synctime_ms));
+	}
+}
+
+void
+dsl_pool_sync_done(dsl_pool_t *dp, uint64_t txg)
+{
+	dsl_dataset_t *ds;
+	objset_t *os;
+
+	while (ds = list_head(&dp->dp_synced_datasets)) {
+		list_remove(&dp->dp_synced_datasets, ds);
+		os = ds->ds_objset;
+		zil_clean(os->os_zil, txg);
+		ASSERT(!dmu_objset_is_dirty(os, txg));
+		dmu_buf_rele(ds->ds_dbuf, ds);
+	}
+	ASSERT(!dmu_objset_is_dirty(dp->dp_meta_objset, txg));
+}
+
+/*
+ * TRUE if the current thread is the tx_sync_thread or if we
+ * are being called from SPA context during pool initialization.
+ */
+int
+dsl_pool_sync_context(dsl_pool_t *dp)
+{
+	return (curthread == dp->dp_tx.tx_sync_thread ||
+	    spa_get_dsl(dp->dp_spa) == NULL);
+}
+
+uint64_t
+dsl_pool_adjustedsize(dsl_pool_t *dp, boolean_t netfree)
+{
+	uint64_t space, resv;
+
+	/*
+	 * Reserve about 1.6% (1/64), or at least 32MB, for allocation
+	 * efficiency.
+	 * XXX The intent log is not accounted for, so it must fit
+	 * within this slop.
+	 *
+	 * If we're trying to assess whether it's OK to do a free,
+	 * cut the reservation in half to allow forward progress
+	 * (e.g. make it possible to rm(1) files from a full pool).
+	 */
+	space = spa_get_dspace(dp->dp_spa);
+	resv = MAX(space >> 6, SPA_MINDEVSIZE >> 1);
+	if (netfree)
+		resv >>= 1;
+
+	return (space - resv);
+}
+
+int
+dsl_pool_tempreserve_space(dsl_pool_t *dp, uint64_t space, dmu_tx_t *tx)
+{
+	uint64_t reserved = 0;
+	uint64_t write_limit = (zfs_write_limit_override ?
+	    zfs_write_limit_override : dp->dp_write_limit);
+
+	if (zfs_no_write_throttle) {
+		atomic_add_64(&dp->dp_tempreserved[tx->tx_txg & TXG_MASK],
+		    space);
+		return (0);
+	}
+
+	/*
+	 * Check to see if we have exceeded the maximum allowed IO for
+	 * this transaction group.  We can do this without locks since
+	 * a little slop here is ok.  Note that we do the reserved check
+	 * with only half the requested reserve: this is because the
+	 * reserve requests are worst-case, and we really don't want to
+	 * throttle based off of worst-case estimates.
+	 */
+	if (write_limit > 0) {
+		reserved = dp->dp_space_towrite[tx->tx_txg & TXG_MASK]
+		    + dp->dp_tempreserved[tx->tx_txg & TXG_MASK] / 2;
+
+		if (reserved && reserved > write_limit)
+			return (ERESTART);
+	}
+
+	atomic_add_64(&dp->dp_tempreserved[tx->tx_txg & TXG_MASK], space);
+
+	/*
+	 * If this transaction group is over 7/8ths capacity, delay
+	 * the caller 1 clock tick.  This will slow down the "fill"
+	 * rate until the sync process can catch up with us.
+	 */
+	if (reserved && reserved > (write_limit - (write_limit >> 3)))
+		txg_delay(dp, tx->tx_txg, 1);
+
+	return (0);
+}
+
+void
+dsl_pool_tempreserve_clear(dsl_pool_t *dp, int64_t space, dmu_tx_t *tx)
+{
+	ASSERT(dp->dp_tempreserved[tx->tx_txg & TXG_MASK] >= space);
+	atomic_add_64(&dp->dp_tempreserved[tx->tx_txg & TXG_MASK], -space);
+}
+
+void
+dsl_pool_memory_pressure(dsl_pool_t *dp)
+{
+	uint64_t space_inuse = 0;
+	int i;
+
+	if (dp->dp_write_limit == zfs_write_limit_min)
+		return;
+
+	for (i = 0; i < TXG_SIZE; i++) {
+		space_inuse += dp->dp_space_towrite[i];
+		space_inuse += dp->dp_tempreserved[i];
+	}
+	dp->dp_write_limit = MAX(zfs_write_limit_min,
+	    MIN(dp->dp_write_limit, space_inuse / 4));
+}
+
+void
+dsl_pool_willuse_space(dsl_pool_t *dp, int64_t space, dmu_tx_t *tx)
+{
+	if (space > 0) {
+		mutex_enter(&dp->dp_lock);
+		dp->dp_space_towrite[tx->tx_txg & TXG_MASK] += space;
+		mutex_exit(&dp->dp_lock);
+	}
+}
+
+/* ARGSUSED */
+static int
+upgrade_clones_cb(spa_t *spa, uint64_t dsobj, const char *dsname, void *arg)
+{
+	dmu_tx_t *tx = arg;
+	dsl_dataset_t *ds, *prev = NULL;
+	int err;
+	dsl_pool_t *dp = spa_get_dsl(spa);
+
+	err = dsl_dataset_hold_obj(dp, dsobj, FTAG, &ds);
+	if (err)
+		return (err);
+
+	while (ds->ds_phys->ds_prev_snap_obj != 0) {
+		err = dsl_dataset_hold_obj(dp, ds->ds_phys->ds_prev_snap_obj,
+		    FTAG, &prev);
+		if (err) {
+			dsl_dataset_rele(ds, FTAG);
+			return (err);
+		}
+
+		if (prev->ds_phys->ds_next_snap_obj != ds->ds_object)
+			break;
+		dsl_dataset_rele(ds, FTAG);
+		ds = prev;
+		prev = NULL;
+	}
+
+	if (prev == NULL) {
+		prev = dp->dp_origin_snap;
+
+		/*
+		 * The $ORIGIN can't have any data, or the accounting
+		 * will be wrong.
+		 */
+		ASSERT(prev->ds_phys->ds_bp.blk_birth == 0);
+
+		/* The origin doesn't get attached to itself */
+		if (ds->ds_object == prev->ds_object) {
+			dsl_dataset_rele(ds, FTAG);
+			return (0);
+		}
+
+		dmu_buf_will_dirty(ds->ds_dbuf, tx);
+		ds->ds_phys->ds_prev_snap_obj = prev->ds_object;
+		ds->ds_phys->ds_prev_snap_txg = prev->ds_phys->ds_creation_txg;
+
+		dmu_buf_will_dirty(ds->ds_dir->dd_dbuf, tx);
+		ds->ds_dir->dd_phys->dd_origin_obj = prev->ds_object;
+
+		dmu_buf_will_dirty(prev->ds_dbuf, tx);
+		prev->ds_phys->ds_num_children++;
+
+		if (ds->ds_phys->ds_next_snap_obj == 0) {
+			ASSERT(ds->ds_prev == NULL);
+			VERIFY(0 == dsl_dataset_hold_obj(dp,
+			    ds->ds_phys->ds_prev_snap_obj, ds, &ds->ds_prev));
+		}
+	}
+
+	ASSERT(ds->ds_dir->dd_phys->dd_origin_obj == prev->ds_object);
+	ASSERT(ds->ds_phys->ds_prev_snap_obj == prev->ds_object);
+
+	if (prev->ds_phys->ds_next_clones_obj == 0) {
+		dmu_buf_will_dirty(prev->ds_dbuf, tx);
+		prev->ds_phys->ds_next_clones_obj =
+		    zap_create(dp->dp_meta_objset,
+		    DMU_OT_NEXT_CLONES, DMU_OT_NONE, 0, tx);
+	}
+	VERIFY(0 == zap_add_int(dp->dp_meta_objset,
+	    prev->ds_phys->ds_next_clones_obj, ds->ds_object, tx));
+
+	dsl_dataset_rele(ds, FTAG);
+	if (prev != dp->dp_origin_snap)
+		dsl_dataset_rele(prev, FTAG);
+	return (0);
+}
+
+void
+dsl_pool_upgrade_clones(dsl_pool_t *dp, dmu_tx_t *tx)
+{
+	ASSERT(dmu_tx_is_syncing(tx));
+	ASSERT(dp->dp_origin_snap != NULL);
+
+	VERIFY3U(0, ==, dmu_objset_find_spa(dp->dp_spa, NULL, upgrade_clones_cb,
+	    tx, DS_FIND_CHILDREN));
+}
+
+/* ARGSUSED */
+static int
+upgrade_dir_clones_cb(spa_t *spa, uint64_t dsobj, const char *dsname, void *arg)
+{
+	dmu_tx_t *tx = arg;
+	dsl_dataset_t *ds;
+	dsl_pool_t *dp = spa_get_dsl(spa);
+	objset_t *mos = dp->dp_meta_objset;
+
+	VERIFY3U(0, ==, dsl_dataset_hold_obj(dp, dsobj, FTAG, &ds));
+
+	if (ds->ds_dir->dd_phys->dd_origin_obj) {
+		dsl_dataset_t *origin;
+
+		VERIFY3U(0, ==, dsl_dataset_hold_obj(dp,
+		    ds->ds_dir->dd_phys->dd_origin_obj, FTAG, &origin));
+
+		if (origin->ds_dir->dd_phys->dd_clones == 0) {
+			dmu_buf_will_dirty(origin->ds_dir->dd_dbuf, tx);
+			origin->ds_dir->dd_phys->dd_clones = zap_create(mos,
+			    DMU_OT_DSL_CLONES, DMU_OT_NONE, 0, tx);
+		}
+
+		VERIFY3U(0, ==, zap_add_int(dp->dp_meta_objset,
+		    origin->ds_dir->dd_phys->dd_clones, dsobj, tx));
+
+		dsl_dataset_rele(origin, FTAG);
+	}
+
+	dsl_dataset_rele(ds, FTAG);
+	return (0);
+}
+
+void
+dsl_pool_upgrade_dir_clones(dsl_pool_t *dp, dmu_tx_t *tx)
+{
+	ASSERT(dmu_tx_is_syncing(tx));
+	uint64_t obj;
+
+	(void) dsl_dir_create_sync(dp, dp->dp_root_dir, FREE_DIR_NAME, tx);
+	VERIFY(0 == dsl_pool_open_special_dir(dp,
+	    FREE_DIR_NAME, &dp->dp_free_dir));
+
+	/*
+	 * We can't use bpobj_alloc(), because spa_version() still
+	 * returns the old version, and we need a new-version bpobj with
+	 * subobj support.  So call dmu_object_alloc() directly.
+	 */
+	obj = dmu_object_alloc(dp->dp_meta_objset, DMU_OT_BPOBJ,
+	    SPA_MAXBLOCKSIZE, DMU_OT_BPOBJ_HDR, sizeof (bpobj_phys_t), tx);
+	VERIFY3U(0, ==, zap_add(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
+	    DMU_POOL_FREE_BPOBJ, sizeof (uint64_t), 1, &obj, tx));
+	VERIFY3U(0, ==, bpobj_open(&dp->dp_free_bpobj,
+	    dp->dp_meta_objset, obj));
+
+	VERIFY3U(0, ==, dmu_objset_find_spa(dp->dp_spa, NULL,
+	    upgrade_dir_clones_cb, tx, DS_FIND_CHILDREN));
+}
+
+void
+dsl_pool_create_origin(dsl_pool_t *dp, dmu_tx_t *tx)
+{
+	uint64_t dsobj;
+	dsl_dataset_t *ds;
+
+	ASSERT(dmu_tx_is_syncing(tx));
+	ASSERT(dp->dp_origin_snap == NULL);
+
+	/* create the origin dir, ds, & snap-ds */
+	rw_enter(&dp->dp_config_rwlock, RW_WRITER);
+	dsobj = dsl_dataset_create_sync(dp->dp_root_dir, ORIGIN_DIR_NAME,
+	    NULL, 0, kcred, tx);
+	VERIFY(0 == dsl_dataset_hold_obj(dp, dsobj, FTAG, &ds));
+	dsl_dataset_snapshot_sync(ds, ORIGIN_DIR_NAME, tx);
+	VERIFY(0 == dsl_dataset_hold_obj(dp, ds->ds_phys->ds_prev_snap_obj,
+	    dp, &dp->dp_origin_snap));
+	dsl_dataset_rele(ds, FTAG);
+	rw_exit(&dp->dp_config_rwlock);
+}
+
+taskq_t *
+dsl_pool_vnrele_taskq(dsl_pool_t *dp)
+{
+	return (dp->dp_vnrele_taskq);
+}
+
+/*
+ * Walk through the pool-wide zap object of temporary snapshot user holds
+ * and release them.
+ */
+void
+dsl_pool_clean_tmp_userrefs(dsl_pool_t *dp)
+{
+	zap_attribute_t za;
+	zap_cursor_t zc;
+	objset_t *mos = dp->dp_meta_objset;
+	uint64_t zapobj = dp->dp_tmp_userrefs_obj;
+
+	if (zapobj == 0)
+		return;
+	ASSERT(spa_version(dp->dp_spa) >= SPA_VERSION_USERREFS);
+
+	for (zap_cursor_init(&zc, mos, zapobj);
+	    zap_cursor_retrieve(&zc, &za) == 0;
+	    zap_cursor_advance(&zc)) {
+		char *htag;
+		uint64_t dsobj;
+
+		htag = strchr(za.za_name, '-');
+		*htag = '\0';
+		++htag;
+		dsobj = strtonum(za.za_name, NULL);
+		(void) dsl_dataset_user_release_tmp(dp, dsobj, htag, B_FALSE);
+	}
+	zap_cursor_fini(&zc);
+}
+
+/*
+ * Create the pool-wide zap object for storing temporary snapshot holds.
+ */
+void
+dsl_pool_user_hold_create_obj(dsl_pool_t *dp, dmu_tx_t *tx)
+{
+	objset_t *mos = dp->dp_meta_objset;
+
+	ASSERT(dp->dp_tmp_userrefs_obj == 0);
+	ASSERT(dmu_tx_is_syncing(tx));
+
+	dp->dp_tmp_userrefs_obj = zap_create(mos, DMU_OT_USERREFS,
+	    DMU_OT_NONE, 0, tx);
+
+	VERIFY(zap_add(mos, DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_TMP_USERREFS,
+	    sizeof (uint64_t), 1, &dp->dp_tmp_userrefs_obj, tx) == 0);
+}
+
+static int
+dsl_pool_user_hold_rele_impl(dsl_pool_t *dp, uint64_t dsobj,
+    const char *tag, uint64_t *now, dmu_tx_t *tx, boolean_t holding)
+{
+	objset_t *mos = dp->dp_meta_objset;
+	uint64_t zapobj = dp->dp_tmp_userrefs_obj;
+	char *name;
+	int error;
+
+	ASSERT(spa_version(dp->dp_spa) >= SPA_VERSION_USERREFS);
+	ASSERT(dmu_tx_is_syncing(tx));
+
+	/*
+	 * If the pool was created prior to SPA_VERSION_USERREFS, the
+	 * zap object for temporary holds might not exist yet.
+	 */
+	if (zapobj == 0) {
+		if (holding) {
+			dsl_pool_user_hold_create_obj(dp, tx);
+			zapobj = dp->dp_tmp_userrefs_obj;
+		} else {
+			return (ENOENT);
+		}
+	}
+
+	name = kmem_asprintf("%llx-%s", (u_longlong_t)dsobj, tag);
+	if (holding)
+		error = zap_add(mos, zapobj, name, 8, 1, now, tx);
+	else
+		error = zap_remove(mos, zapobj, name, tx);
+	strfree(name);
+
+	return (error);
+}
+
+/*
+ * Add a temporary hold for the given dataset object and tag.
+ */
+int
+dsl_pool_user_hold(dsl_pool_t *dp, uint64_t dsobj, const char *tag,
+    uint64_t *now, dmu_tx_t *tx)
+{
+	return (dsl_pool_user_hold_rele_impl(dp, dsobj, tag, now, tx, B_TRUE));
+}
+
+/*
+ * Release a temporary hold for the given dataset object and tag.
+ */
+int
+dsl_pool_user_release(dsl_pool_t *dp, uint64_t dsobj, const char *tag,
+    dmu_tx_t *tx)
+{
+	return (dsl_pool_user_hold_rele_impl(dp, dsobj, tag, NULL,
+	    tx, B_FALSE));
+}
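The write-throttle arithmetic at the end of dsl_pool_sync() above can be exercised on its own. The following is a minimal sketch, not the code from this commit: throttle_t, throttle_update() and the th_ field names are hypothetical stand-ins for the dsl_pool_t fields and the zfs_write_limit_* tunables, and the caller supplies the minimum limit, the inflated maximum and the target sync time as plain arguments.

```c
#include <assert.h>
#include <stdint.h>

#define	MICROSEC	1000000ULL
#define	MAX(a, b)	((a) > (b) ? (a) : (b))
#define	MIN(a, b)	((a) < (b) ? (a) : (b))

/* Hypothetical stand-in for the throttle-related fields of dsl_pool_t. */
typedef struct throttle {
	uint64_t th_throughput;		/* bytes per millisecond; 0 = no history */
	uint64_t th_write_limit;	/* bytes admitted into the next txg */
} throttle_t;

/*
 * Fold one sync's measurements into the running throughput estimate
 * (3/4 old value, 1/4 new sample) and derive the next write limit,
 * clamped to [limit_min, limit_inflated].  Samples that moved too
 * little data or took under a millisecond are ignored.
 */
void
throttle_update(throttle_t *th, uint64_t data_written, uint64_t write_time_ns,
    uint64_t limit_min, uint64_t limit_inflated, uint64_t synctime_ms)
{
	uint64_t sample;

	assert(limit_min > 0);
	if (data_written <= limit_min / 8 || write_time_ns <= MICROSEC)
		return;

	/* nanoseconds / MICROSEC yields milliseconds */
	sample = data_written / (write_time_ns / MICROSEC);

	if (th->th_throughput)
		th->th_throughput = sample / 4 + 3 * th->th_throughput / 4;
	else
		th->th_throughput = sample;

	th->th_write_limit = MIN(limit_inflated,
	    MAX(limit_min, th->th_throughput * synctime_ms));
}
```

With a 32MB minimum, a 1GB inflated maximum and a 1000 ms target, a first sample of 64MB written in 500 ms gives a limit of roughly 134MB. A slower second sample pulls the estimate down by only a quarter of the difference, which is the smoothing the weighting is meant to provide.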
--- a/uts/common/fs/zfs/dsl_prop.c
+++ b/uts/common/fs/zfs/dsl_prop.c
--- a/uts/common/fs/zfs/dsl_scan.c
+++ b/uts/common/fs/zfs/dsl_scan.c
--- a/uts/common/fs/zfs/dsl_synctask.c
+++ b/uts/common/fs/zfs/dsl_synctask.c
@ -0,0 +1,240 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#include <sys/dmu.h>
+#include <sys/dmu_tx.h>
+#include <sys/dsl_pool.h>
+#include <sys/dsl_dir.h>
+#include <sys/dsl_synctask.h>
+#include <sys/metaslab.h>
+
+#define	DST_AVG_BLKSHIFT 14
+
+/* ARGSUSED */
+static int
+dsl_null_checkfunc(void *arg1, void *arg2, dmu_tx_t *tx)
+{
+	return (0);
+}
+
+dsl_sync_task_group_t *
+dsl_sync_task_group_create(dsl_pool_t *dp)
+{
+	dsl_sync_task_group_t *dstg;
+
+	dstg = kmem_zalloc(sizeof (dsl_sync_task_group_t), KM_SLEEP);
+	list_create(&dstg->dstg_tasks, sizeof (dsl_sync_task_t),
+	    offsetof(dsl_sync_task_t, dst_node));
+	dstg->dstg_pool = dp;
+
+	return (dstg);
+}
+
+void
+dsl_sync_task_create(dsl_sync_task_group_t *dstg,
+    dsl_checkfunc_t *checkfunc, dsl_syncfunc_t *syncfunc,
+    void *arg1, void *arg2, int blocks_modified)
+{
+	dsl_sync_task_t *dst;
+
+	if (checkfunc == NULL)
+		checkfunc = dsl_null_checkfunc;
+	dst = kmem_zalloc(sizeof (dsl_sync_task_t), KM_SLEEP);
+	dst->dst_checkfunc = checkfunc;
+	dst->dst_syncfunc = syncfunc;
+	dst->dst_arg1 = arg1;
+	dst->dst_arg2 = arg2;
+	list_insert_tail(&dstg->dstg_tasks, dst);
+
+	dstg->dstg_space += blocks_modified << DST_AVG_BLKSHIFT;
+}
+
+int
+dsl_sync_task_group_wait(dsl_sync_task_group_t *dstg)
+{
+	dmu_tx_t *tx;
+	uint64_t txg;
+	dsl_sync_task_t *dst;
+
+top:
+	tx = dmu_tx_create_dd(dstg->dstg_pool->dp_mos_dir);
+	VERIFY(0 == dmu_tx_assign(tx, TXG_WAIT));
+
+	txg = dmu_tx_get_txg(tx);
+
+	/* Do a preliminary error check. */
+	dstg->dstg_err = 0;
+	rw_enter(&dstg->dstg_pool->dp_config_rwlock, RW_READER);
+	for (dst = list_head(&dstg->dstg_tasks); dst;
+	    dst = list_next(&dstg->dstg_tasks, dst)) {
+#ifdef ZFS_DEBUG
+		/*
+		 * Only check half the time; otherwise the sync-context
+		 * check will almost never fail.
+		 */
+		if (spa_get_random(2) == 0)
+			continue;
+#endif
+		dst->dst_err =
+		    dst->dst_checkfunc(dst->dst_arg1, dst->dst_arg2, tx);
+		if (dst->dst_err)
+			dstg->dstg_err = dst->dst_err;
+	}
+	rw_exit(&dstg->dstg_pool->dp_config_rwlock);
+
+	if (dstg->dstg_err) {
+		dmu_tx_commit(tx);
+		return (dstg->dstg_err);
+	}
+
+	/*
+	 * We don't generally have many sync tasks, so pay the price of
+	 * add_tail to get the tasks executed in the right order.
+	 */
+	VERIFY(0 == txg_list_add_tail(&dstg->dstg_pool->dp_sync_tasks,
+	    dstg, txg));
+
+	dmu_tx_commit(tx);
+
+	txg_wait_synced(dstg->dstg_pool, txg);
+
+	if (dstg->dstg_err == EAGAIN) {
+		txg_wait_synced(dstg->dstg_pool, txg + TXG_DEFER_SIZE);
+		goto top;
+	}
+
+	return (dstg->dstg_err);
+}
+
+void
+dsl_sync_task_group_nowait(dsl_sync_task_group_t *dstg, dmu_tx_t *tx)
+{
+	uint64_t txg;
+
+	dstg->dstg_nowaiter = B_TRUE;
+	txg = dmu_tx_get_txg(tx);
+	/*
+	 * We don't generally have many sync tasks, so pay the price of
+	 * add_tail to get the tasks executed in the right order.
+	 */
+	VERIFY(0 == txg_list_add_tail(&dstg->dstg_pool->dp_sync_tasks,
+	    dstg, txg));
+}
+
+void
+dsl_sync_task_group_destroy(dsl_sync_task_group_t *dstg)
+{
+	dsl_sync_task_t *dst;
+
+	while (dst = list_head(&dstg->dstg_tasks)) {
+		list_remove(&dstg->dstg_tasks, dst);
+		kmem_free(dst, sizeof (dsl_sync_task_t));
+	}
+	kmem_free(dstg, sizeof (dsl_sync_task_group_t));
+}
+
+void
+dsl_sync_task_group_sync(dsl_sync_task_group_t *dstg, dmu_tx_t *tx)
+{
+	dsl_sync_task_t *dst;
+	dsl_pool_t *dp = dstg->dstg_pool;
+	uint64_t quota, used;
+
+	ASSERT3U(dstg->dstg_err, ==, 0);
+
+	/*
+	 * Check for sufficient space.  We just check against what's
+	 * on-disk; we don't want any in-flight accounting to get in our
+	 * way, because open context may have already used up various
+	 * in-core limits (arc_tempreserve, dsl_pool_tempreserve).
+	 */
+	quota = dsl_pool_adjustedsize(dp, B_FALSE) -
+	    metaslab_class_get_deferred(spa_normal_class(dp->dp_spa));
+	used = dp->dp_root_dir->dd_phys->dd_used_bytes;
+	/* MOS space is triple-dittoed, so we multiply by 3. */
+	if (dstg->dstg_space > 0 && used + dstg->dstg_space * 3 > quota) {
+		dstg->dstg_err = ENOSPC;
+		return;
+	}
+
+	/*
+	 * Check for errors by calling checkfuncs.
+	 */
+	rw_enter(&dp->dp_config_rwlock, RW_WRITER);
+	for (dst = list_head(&dstg->dstg_tasks); dst;
+	    dst = list_next(&dstg->dstg_tasks, dst)) {
+		dst->dst_err =
+		    dst->dst_checkfunc(dst->dst_arg1, dst->dst_arg2, tx);
+		if (dst->dst_err)
+			dstg->dstg_err = dst->dst_err;
+	}
+
+	if (dstg->dstg_err == 0) {
+		/*
+		 * Execute sync tasks.
+		 */
+		for (dst = list_head(&dstg->dstg_tasks); dst;
+		    dst = list_next(&dstg->dstg_tasks, dst)) {
+			dst->dst_syncfunc(dst->dst_arg1, dst->dst_arg2, tx);
+		}
+	}
+	rw_exit(&dp->dp_config_rwlock);
+
+	if (dstg->dstg_nowaiter)
+		dsl_sync_task_group_destroy(dstg);
+}
+
+int
+dsl_sync_task_do(dsl_pool_t *dp,
+    dsl_checkfunc_t *checkfunc, dsl_syncfunc_t *syncfunc,
+    void *arg1, void *arg2, int blocks_modified)
+{
+	dsl_sync_task_group_t *dstg;
+	int err;
+
+	ASSERT(spa_writeable(dp->dp_spa));
+
+	dstg = dsl_sync_task_group_create(dp);
+	dsl_sync_task_create(dstg, checkfunc, syncfunc,
+	    arg1, arg2, blocks_modified);
+	err = dsl_sync_task_group_wait(dstg);
+	dsl_sync_task_group_destroy(dstg);
+	return (err);
+}
+
+void
+dsl_sync_task_do_nowait(dsl_pool_t *dp,
+    dsl_checkfunc_t *checkfunc, dsl_syncfunc_t *syncfunc,
+    void *arg1, void *arg2, int blocks_modified, dmu_tx_t *tx)
+{
+	dsl_sync_task_group_t *dstg;
+
+	if (!spa_writeable(dp->dp_spa))
+		return;
+
+	dstg = dsl_sync_task_group_create(dp);
+	dsl_sync_task_create(dstg, checkfunc, syncfunc,
+	    arg1, arg2, blocks_modified);
+	dsl_sync_task_group_nowait(dstg, tx);
+}
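dsl_sync_task_group_sync() above runs every task's checkfunc first and calls the syncfuncs only if none of the checks failed. The sketch below shows that two-phase pattern in isolation, under simplifying assumptions: task_t, task_group_run() and the two callbacks are hypothetical, tasks live in a plain array rather than a list_t, and no locking, transaction or space accounting is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef int (*checkfunc_t)(void *arg);
typedef void (*syncfunc_t)(void *arg);

/* Hypothetical, simplified analogue of dsl_sync_task_t. */
typedef struct task {
	checkfunc_t t_check;	/* may be NULL, meaning "always ok" */
	syncfunc_t t_sync;
	void *t_arg;
	int t_err;
} task_t;

/* Example callbacks: refuse negative counters, then bump the counter. */
int
check_nonneg(void *arg)
{
	return (*(int *)arg < 0 ? EINVAL : 0);
}

void
sync_increment(void *arg)
{
	(*(int *)arg)++;
}

/*
 * Phase 1: run every check and remember the last failure.
 * Phase 2: run the sync functions only if all checks passed, so a
 * group either applies completely or not at all.
 */
int
task_group_run(task_t *tasks, size_t ntasks)
{
	int group_err = 0;
	size_t i;

	for (i = 0; i < ntasks; i++) {
		tasks[i].t_err = tasks[i].t_check != NULL ?
		    tasks[i].t_check(tasks[i].t_arg) : 0;
		if (tasks[i].t_err)
			group_err = tasks[i].t_err;
	}

	if (group_err == 0) {
		for (i = 0; i < ntasks; i++) {
			assert(tasks[i].t_sync != NULL);
			tasks[i].t_sync(tasks[i].t_arg);
		}
	}
	return (group_err);
}
```

Because the checks run before any sync function, a single failing task leaves all the others unapplied. The per-task t_err values show which task objected.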
--- a/uts/common/fs/zfs/gzip.c
+++ b/uts/common/fs/zfs/gzip.c
@ -0,0 +1,69 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+
+/*
+ * Copyright 2007 Sun Microsystems, Inc.  All rights reserved.
+ * Use is subject to license terms.
+ */
+
+#pragma ident	"%Z%%M%	%I%	%E% SMI"
+
+#include <sys/debug.h>
+#include <sys/types.h>
+#include <sys/zmod.h>
+
+#ifdef _KERNEL
+#include <sys/systm.h>
+#else
+#include <strings.h>
+#endif
+
+size_t
+gzip_compress(void *s_start, void *d_start, size_t s_len, size_t d_len, int n)
+{
+	size_t dstlen = d_len;
+
+	ASSERT(d_len <= s_len);
+
+	if (z_compress_level(d_start, &dstlen, s_start, s_len, n) != Z_OK) {
+		if (d_len != s_len)
+			return (s_len);
+
+		bcopy(s_start, d_start, s_len);
+		return (s_len);
+	}
+
+	return (dstlen);
+}
+
+/*ARGSUSED*/
+int
+gzip_decompress(void *s_start, void *d_start, size_t s_len, size_t d_len, int n)
+{
+	size_t dstlen = d_len;
+
+	ASSERT(d_len >= s_len);
+
+	if (z_uncompress(d_start, &dstlen, s_start, s_len) != Z_OK)
+		return (-1);
+
+	return (0);
+}
--- a/uts/common/fs/zfs/lzjb.c
+++ b/uts/common/fs/zfs/lzjb.c
@ -0,0 +1,123 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+
+/*
+ * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+/*
+ * We keep our own copy of this algorithm for 3 main reasons:
+ *	1. If we didn't, anyone modifying common/os/compress.c would
+ *         directly break our on disk format
+ *	2. Our version of lzjb does not have a number of checks that the
+ *         common/os version needs and uses
+ *	3. We initialize the lempel to ensure deterministic results,
+ *	   so that identical blocks can always be deduplicated.
+ * In particular, we are adding the "feature" that compress() can
+ * take a destination buffer size and returns the compressed length, or the
+ * source length if compression would overflow the destination buffer.
+ */
+
+#include <sys/types.h>
+
+#define	MATCH_BITS	6
+#define	MATCH_MIN	3
+#define	MATCH_MAX	((1 << MATCH_BITS) + (MATCH_MIN - 1))
+#define	OFFSET_MASK	((1 << (16 - MATCH_BITS)) - 1)
+#define	LEMPEL_SIZE	1024
+
+/*ARGSUSED*/
+size_t
+lzjb_compress(void *s_start, void *d_start, size_t s_len, size_t d_len, int n)
+{
+	uchar_t *src = s_start;
+	uchar_t *dst = d_start;
+	uchar_t *cpy, *copymap;
+	int copymask = 1 << (NBBY - 1);
+	int mlen, offset, hash;
+	uint16_t *hp;
+	uint16_t lempel[LEMPEL_SIZE] = { 0 };
+
+	while (src < (uchar_t *)s_start + s_len) {
+		if ((copymask <<= 1) == (1 << NBBY)) {
+			if (dst >= (uchar_t *)d_start + d_len - 1 - 2 * NBBY)
+				return (s_len);
+			copymask = 1;
+			copymap = dst;
+			*dst++ = 0;
+		}
+		if (src > (uchar_t *)s_start + s_len - MATCH_MAX) {
+			*dst++ = *src++;
+			continue;
+		}
+		hash = (src[0] << 16) + (src[1] << 8) + src[2];
+		hash += hash >> 9;
+		hash += hash >> 5;
+		hp = &lempel[hash & (LEMPEL_SIZE - 1)];
+		offset = (intptr_t)(src - *hp) & OFFSET_MASK;
+		*hp = (uint16_t)(uintptr_t)src;
+		cpy = src - offset;
+		if (cpy >= (uchar_t *)s_start && cpy != src &&
+		    src[0] == cpy[0] && src[1] == cpy[1] && src[2] == cpy[2]) {
+			*copymap |= copymask;
+			for (mlen = MATCH_MIN; mlen < MATCH_MAX; mlen++)
+				if (src[mlen] != cpy[mlen])
+					break;
+			*dst++ = ((mlen - MATCH_MIN) << (NBBY - MATCH_BITS)) |
+			    (offset >> NBBY);
+			*dst++ = (uchar_t)offset;
+			src += mlen;
+		} else {
+			*dst++ = *src++;
+		}
+	}
+	return (dst - (uchar_t *)d_start);
+}
+
+/*ARGSUSED*/
+int
+lzjb_decompress(void *s_start, void *d_start, size_t s_len, size_t d_len, int n)
+{
+	uchar_t *src = s_start;
+	uchar_t *dst = d_start;
+	uchar_t *d_end = (uchar_t *)d_start + d_len;
+	uchar_t *cpy, copymap;
+	int copymask = 1 << (NBBY - 1);
+
+	while (dst < d_end) {
+		if ((copymask <<= 1) == (1 << NBBY)) {
+			copymask = 1;
+			copymap = *src++;
+		}
+		if (copymap & copymask) {
+			int mlen = (src[0] >> (NBBY - MATCH_BITS)) + MATCH_MIN;
+			int offset = ((src[0] << NBBY) | src[1]) & OFFSET_MASK;
+			src += 2;
+			if ((cpy = dst - offset) < (uchar_t *)d_start)
+				return (-1);
+			while (--mlen >= 0 && dst < d_end)
+				*dst++ = *cpy++;
+		} else {
+			*dst++ = *src++;
+		}
+	}
+	return (0);
+}
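
Note on the match encoding used by lzjb_compress() and lzjb_decompress() above: each match is written as two bytes, with (mlen - MATCH_MIN) in the top MATCH_BITS bits and the backward offset in the remaining 10 bits. The following is a minimal userland sketch of that packing only, not part of the imported file. The helper names token_pack() and token_unpack() are hypothetical, and NBBY is defined locally on the assumption that it is 8, as in the system headers.

```c
#include <assert.h>
#include <stdint.h>

#ifndef NBBY
#define	NBBY		8	/* assumed: bits per byte */
#endif
#define	MATCH_BITS	6
#define	MATCH_MIN	3
#define	MATCH_MAX	((1 << MATCH_BITS) + (MATCH_MIN - 1))
#define	OFFSET_MASK	((1 << (16 - MATCH_BITS)) - 1)

/* Hypothetical helper: pack a (length, offset) match into two bytes. */
static void
token_pack(int mlen, int offset, uint8_t out[2])
{
	assert(mlen >= MATCH_MIN && mlen <= MATCH_MAX);
	assert(offset > 0 && offset <= OFFSET_MASK);
	/* High 6 bits: length minus MATCH_MIN; low 2 bits: offset bits 9..8. */
	out[0] = (uint8_t)(((mlen - MATCH_MIN) << (NBBY - MATCH_BITS)) |
	    (offset >> NBBY));
	/* Second byte: offset bits 7..0. */
	out[1] = (uint8_t)offset;
}

/* Hypothetical helper: the inverse, mirroring the decompressor's math. */
static void
token_unpack(const uint8_t in[2], int *mlen, int *offset)
{
	*mlen = (in[0] >> (NBBY - MATCH_BITS)) + MATCH_MIN;
	*offset = ((in[0] << NBBY) | in[1]) & OFFSET_MASK;
}
```

With these constants a match can cover 3 to 66 bytes and reach back at most 1023 bytes, which is why the compressor masks the offset with OFFSET_MASK.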
--- a/uts/common/fs/zfs/metaslab.c
+++ b/uts/common/fs/zfs/metaslab.c
--- a/uts/common/fs/zfs/refcount.c
+++ b/uts/common/fs/zfs/refcount.c
@ -0,0 +1,223 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#include <sys/zfs_context.h>
+#include <sys/refcount.h>
+
+#ifdef	ZFS_DEBUG
+
+#ifdef _KERNEL
+int reference_tracking_enable = FALSE; /* runs out of memory too easily */
+#else
+int reference_tracking_enable = TRUE;
+#endif
+int reference_history = 4; /* tunable */
+
+static kmem_cache_t *reference_cache;
+static kmem_cache_t *reference_history_cache;
+
+void
+refcount_init(void)
+{
+	reference_cache = kmem_cache_create("reference_cache",
+	    sizeof (reference_t), 0, NULL, NULL, NULL, NULL, NULL, 0);
+
+	reference_history_cache = kmem_cache_create("reference_history_cache",
+	    sizeof (uint64_t), 0, NULL, NULL, NULL, NULL, NULL, 0);
+}
+
+void
+refcount_fini(void)
+{
+	kmem_cache_destroy(reference_cache);
+	kmem_cache_destroy(reference_history_cache);
+}
+
+void
+refcount_create(refcount_t *rc)
+{
+	mutex_init(&rc->rc_mtx, NULL, MUTEX_DEFAULT, NULL);
+	list_create(&rc->rc_list, sizeof (reference_t),
+	    offsetof(reference_t, ref_link));
+	list_create(&rc->rc_removed, sizeof (reference_t),
+	    offsetof(reference_t, ref_link));
+	rc->rc_count = 0;
+	rc->rc_removed_count = 0;
+}
+
+void
+refcount_destroy_many(refcount_t *rc, uint64_t number)
+{
+	reference_t *ref;
+
+	ASSERT(rc->rc_count == number);
+	while (ref = list_head(&rc->rc_list)) {
+		list_remove(&rc->rc_list, ref);
+		kmem_cache_free(reference_cache, ref);
+	}
+	list_destroy(&rc->rc_list);
+
+	while (ref = list_head(&rc->rc_removed)) {
+		list_remove(&rc->rc_removed, ref);
+		kmem_cache_free(reference_history_cache, ref->ref_removed);
+		kmem_cache_free(reference_cache, ref);
+	}
+	list_destroy(&rc->rc_removed);
+	mutex_destroy(&rc->rc_mtx);
+}
+
+void
+refcount_destroy(refcount_t *rc)
+{
+	refcount_destroy_many(rc, 0);
+}
+
+int
+refcount_is_zero(refcount_t *rc)
+{
+	ASSERT(rc->rc_count >= 0);
+	return (rc->rc_count == 0);
+}
+
+int64_t
+refcount_count(refcount_t *rc)
+{
+	ASSERT(rc->rc_count >= 0);
+	return (rc->rc_count);
+}
+
+int64_t
+refcount_add_many(refcount_t *rc, uint64_t number, void *holder)
+{
+	reference_t *ref;
+	int64_t count;
+
+	if (reference_tracking_enable) {
+		ref = kmem_cache_alloc(reference_cache, KM_SLEEP);
+		ref->ref_holder = holder;
+		ref->ref_number = number;
+	}
+	mutex_enter(&rc->rc_mtx);
+	ASSERT(rc->rc_count >= 0);
+	if (reference_tracking_enable)
+		list_insert_head(&rc->rc_list, ref);
+	rc->rc_count += number;
+	count = rc->rc_count;
+	mutex_exit(&rc->rc_mtx);
+
+	return (count);
+}
+
+int64_t
+refcount_add(refcount_t *rc, void *holder)
+{
+	return (refcount_add_many(rc, 1, holder));
+}
+
+int64_t
+refcount_remove_many(refcount_t *rc, uint64_t number, void *holder)
+{
+	reference_t *ref;
+	int64_t count;
+
+	mutex_enter(&rc->rc_mtx);
+	ASSERT(rc->rc_count >= number);
+
+	if (!reference_tracking_enable) {
+		rc->rc_count -= number;
+		count = rc->rc_count;
+		mutex_exit(&rc->rc_mtx);
+		return (count);
+	}
+
+	for (ref = list_head(&rc->rc_list); ref;
+	    ref = list_next(&rc->rc_list, ref)) {
+		if (ref->ref_holder == holder && ref->ref_number == number) {
+			list_remove(&rc->rc_list, ref);
+			if (reference_history > 0) {
+				ref->ref_removed =
+				    kmem_cache_alloc(reference_history_cache,
+				    KM_SLEEP);
+				list_insert_head(&rc->rc_removed, ref);
+				rc->rc_removed_count++;
+				if (rc->rc_removed_count >= reference_history) {
+					ref = list_tail(&rc->rc_removed);
+					list_remove(&rc->rc_removed, ref);
+					kmem_cache_free(reference_history_cache,
+					    ref->ref_removed);
+					kmem_cache_free(reference_cache, ref);
+					rc->rc_removed_count--;
+				}
+			} else {
+				kmem_cache_free(reference_cache, ref);
+			}
+			rc->rc_count -= number;
+			count = rc->rc_count;
+			mutex_exit(&rc->rc_mtx);
+			return (count);
+		}
+	}
+	panic("No such hold %p on refcount %llx", holder,
+	    (u_longlong_t)(uintptr_t)rc);
+	return (-1);
+}
+
+int64_t
+refcount_remove(refcount_t *rc, void *holder)
+{
+	return (refcount_remove_many(rc, 1, holder));
+}
+
+void
+refcount_transfer(refcount_t *dst, refcount_t *src)
+{
+	int64_t count, removed_count;
+	list_t list, removed;
+
+	list_create(&list, sizeof (reference_t),
+	    offsetof(reference_t, ref_link));
+	list_create(&removed, sizeof (reference_t),
+	    offsetof(reference_t, ref_link));
+
+	mutex_enter(&src->rc_mtx);
+	count = src->rc_count;
+	removed_count = src->rc_removed_count;
+	src->rc_count = 0;
+	src->rc_removed_count = 0;
+	list_move_tail(&list, &src->rc_list);
+	list_move_tail(&removed, &src->rc_removed);
+	mutex_exit(&src->rc_mtx);
+
+	mutex_enter(&dst->rc_mtx);
+	dst->rc_count += count;
+	dst->rc_removed_count += removed_count;
+	list_move_tail(&dst->rc_list, &list);
+	list_move_tail(&dst->rc_removed, &removed);
+	mutex_exit(&dst->rc_mtx);
+
+	list_destroy(&list);
+	list_destroy(&removed);
+}
+
+#endif	/* ZFS_DEBUG */
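
Note on the debug refcount above: when reference_tracking_enable is set, every hold is recorded with its holder tag and size, and a remove must match both. The following is a simplified userland sketch of that idea, under stated assumptions: the names hold_t, tracked_count_t, tc_add() and tc_remove() are hypothetical, there is no locking and no removed-history list, and a missing hold returns -1 where the kernel code panics.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for reference_t: one recorded hold. */
typedef struct hold {
	struct hold	*next;
	const void	*holder;
	uint64_t	number;
} hold_t;

/* Hypothetical stand-in for refcount_t, without the mutex. */
typedef struct tracked_count {
	hold_t	*holds;
	int64_t	count;
} tracked_count_t;

/* Record a hold of 'number' references on behalf of 'holder'. */
static int64_t
tc_add(tracked_count_t *tc, uint64_t number, const void *holder)
{
	hold_t *h = malloc(sizeof (*h));

	assert(h != NULL);
	h->holder = holder;
	h->number = number;
	h->next = tc->holds;
	tc->holds = h;
	tc->count += (int64_t)number;
	return (tc->count);
}

/*
 * Drop a hold.  Both the holder and the number must match a recorded
 * hold.  Returns the new count, or -1 if no such hold exists.
 */
static int64_t
tc_remove(tracked_count_t *tc, uint64_t number, const void *holder)
{
	hold_t **hp;

	for (hp = &tc->holds; *hp != NULL; hp = &(*hp)->next) {
		hold_t *h = *hp;
		if (h->holder == holder && h->number == number) {
			*hp = h->next;
			free(h);
			tc->count -= (int64_t)number;
			return (tc->count);
		}
	}
	return (-1);
}
```

Matching on the (holder, number) pair is what lets a debug build point at the exact caller whose release does not correspond to any earlier hold.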
--- a/uts/common/fs/zfs/rrwlock.c
+++ b/uts/common/fs/zfs/rrwlock.c
@ -0,0 +1,264 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
+ * Use is subject to license terms.
+ */
+
+#include <sys/refcount.h>
+#include <sys/rrwlock.h>
+
+/*
+ * This file contains the implementation of a re-entrant read
+ * reader/writer lock (aka "rrwlock").
+ *
+ * This is a normal reader/writer lock with the additional feature
+ * of allowing threads who have already obtained a read lock to
+ * re-enter another read lock (re-entrant read) - even if there are
+ * waiting writers.
+ *
+ * Callers who have not obtained a read lock give waiting writers priority.
+ *
+ * The rrwlock_t lock does not allow re-entrant writers, nor does it
+ * allow a re-entrant mix of reads and writes (that is, it does not
+ * allow a caller who has already obtained a read lock to be able to
+ * then grab a write lock without first dropping all read locks, and
+ * vice versa).
+ *
+ * The rrwlock_t uses tsd (thread specific data) to keep a list of
+ * nodes (rrw_node_t), where each node keeps track of which specific
+ * lock (rrw_node_t::rn_rrl) the thread has grabbed.  Since re-entering
+ * should be rare, a thread that grabs multiple reads on the same rrwlock_t
+ * will store multiple rrw_node_ts of the same 'rn_rrl'. Nodes on the
+ * tsd list can represent a different rrwlock_t.  This allows a thread
+ * to enter multiple and unique rrwlock_ts for read locks at the same time.
+ *
+ * Since using tsd exposes some overhead, the rrwlock_t only needs to
+ * keep tsd data when writers are waiting.  If no writers are waiting, then
+ * a reader just bumps the anonymous read count (rr_anon_rcount) - no tsd
+ * is needed.  Once a writer attempts to grab the lock, readers then
+ * keep tsd data and bump the linked readers count (rr_linked_rcount).
+ *
+ * If there are waiting writers and there are anonymous readers, then a
+ * reader doesn't know if it is a re-entrant lock. But since it may be one,
+ * we allow the read to proceed (otherwise it could deadlock).  Since once
+ * waiting writers are active, readers no longer bump the anonymous count,
+ * the anonymous readers will eventually flush themselves out.  At this point,
+ * readers will be able to tell if they are a re-entrant lock (have a
+ * rrw_node_t entry for the lock) or not. If they are a re-entrant lock, then
+ * we must let them proceed.  If they are not, then the reader blocks for the
+ * waiting writers.  Hence, we do not starve writers.
+ */
+
+/* global key for TSD */
+uint_t rrw_tsd_key;
+
+typedef struct rrw_node {
+	struct rrw_node	*rn_next;
+	rrwlock_t	*rn_rrl;
+} rrw_node_t;
+
+static rrw_node_t *
+rrn_find(rrwlock_t *rrl)
+{
+	rrw_node_t *rn;
+
+	if (refcount_count(&rrl->rr_linked_rcount) == 0)
+		return (NULL);
+
+	for (rn = tsd_get(rrw_tsd_key); rn != NULL; rn = rn->rn_next) {
+		if (rn->rn_rrl == rrl)
+			return (rn);
+	}
+	return (NULL);
+}
+
+/*
+ * Add a node to the head of the singly linked list.
+ */
+static void
+rrn_add(rrwlock_t *rrl)
+{
+	rrw_node_t *rn;
+
+	rn = kmem_alloc(sizeof (*rn), KM_SLEEP);
+	rn->rn_rrl = rrl;
+	rn->rn_next = tsd_get(rrw_tsd_key);
+	VERIFY(tsd_set(rrw_tsd_key, rn) == 0);
+}
+
+/*
+ * If a node is found for 'rrl', then remove the node from this
+ * thread's list and return TRUE; otherwise return FALSE.
+ */
+static boolean_t
+rrn_find_and_remove(rrwlock_t *rrl)
+{
+	rrw_node_t *rn;
+	rrw_node_t *prev = NULL;
+
+	if (refcount_count(&rrl->rr_linked_rcount) == 0)
+		return (B_FALSE);
+
+	for (rn = tsd_get(rrw_tsd_key); rn != NULL; rn = rn->rn_next) {
+		if (rn->rn_rrl == rrl) {
+			if (prev)
+				prev->rn_next = rn->rn_next;
+			else
+				VERIFY(tsd_set(rrw_tsd_key, rn->rn_next) == 0);
+			kmem_free(rn, sizeof (*rn));
+			return (B_TRUE);
+		}
+		prev = rn;
+	}
+	return (B_FALSE);
+}
+
+void
+rrw_init(rrwlock_t *rrl)
+{
+	mutex_init(&rrl->rr_lock, NULL, MUTEX_DEFAULT, NULL);
+	cv_init(&rrl->rr_cv, NULL, CV_DEFAULT, NULL);
+	rrl->rr_writer = NULL;
+	refcount_create(&rrl->rr_anon_rcount);
+	refcount_create(&rrl->rr_linked_rcount);
+	rrl->rr_writer_wanted = B_FALSE;
+}
+
+void
+rrw_destroy(rrwlock_t *rrl)
+{
+	mutex_destroy(&rrl->rr_lock);
+	cv_destroy(&rrl->rr_cv);
+	ASSERT(rrl->rr_writer == NULL);
+	refcount_destroy(&rrl->rr_anon_rcount);
+	refcount_destroy(&rrl->rr_linked_rcount);
+}
+
+static void
+rrw_enter_read(rrwlock_t *rrl, void *tag)
+{
+	mutex_enter(&rrl->rr_lock);
+#if !defined(DEBUG) && defined(_KERNEL)
+	if (!rrl->rr_writer && !rrl->rr_writer_wanted) {
+		rrl->rr_anon_rcount.rc_count++;
+		mutex_exit(&rrl->rr_lock);
+		return;
+	}
+	DTRACE_PROBE(zfs__rrwfastpath__rdmiss);
+#endif
+	ASSERT(rrl->rr_writer != curthread);
+	ASSERT(refcount_count(&rrl->rr_anon_rcount) >= 0);
+
+	while (rrl->rr_writer || (rrl->rr_writer_wanted &&
+	    refcount_is_zero(&rrl->rr_anon_rcount) &&
+	    rrn_find(rrl) == NULL))
+		cv_wait(&rrl->rr_cv, &rrl->rr_lock);
+
+	if (rrl->rr_writer_wanted) {
+		/* may or may not be a re-entrant enter */
+		rrn_add(rrl);
+		(void) refcount_add(&rrl->rr_linked_rcount, tag);
+	} else {
+		(void) refcount_add(&rrl->rr_anon_rcount, tag);
+	}
+	ASSERT(rrl->rr_writer == NULL);
+	mutex_exit(&rrl->rr_lock);
+}
+
+static void
+rrw_enter_write(rrwlock_t *rrl)
+{
+	mutex_enter(&rrl->rr_lock);
+	ASSERT(rrl->rr_writer != curthread);
+
+	while (refcount_count(&rrl->rr_anon_rcount) > 0 ||
+	    refcount_count(&rrl->rr_linked_rcount) > 0 ||
+	    rrl->rr_writer != NULL) {
+		rrl->rr_writer_wanted = B_TRUE;
+		cv_wait(&rrl->rr_cv, &rrl->rr_lock);
+	}
+	rrl->rr_writer_wanted = B_FALSE;
+	rrl->rr_writer = curthread;
+	mutex_exit(&rrl->rr_lock);
+}
+
+void
+rrw_enter(rrwlock_t *rrl, krw_t rw, void *tag)
+{
+	if (rw == RW_READER)
+		rrw_enter_read(rrl, tag);
+	else
+		rrw_enter_write(rrl);
+}
+
+void
+rrw_exit(rrwlock_t *rrl, void *tag)
+{
+	mutex_enter(&rrl->rr_lock);
+#if !defined(DEBUG) && defined(_KERNEL)
+	if (!rrl->rr_writer && rrl->rr_linked_rcount.rc_count == 0) {
+		rrl->rr_anon_rcount.rc_count--;
+		if (rrl->rr_anon_rcount.rc_count == 0)
+			cv_broadcast(&rrl->rr_cv);
+		mutex_exit(&rrl->rr_lock);
+		return;
+	}
+	DTRACE_PROBE(zfs__rrwfastpath__exitmiss);
+#endif
+	ASSERT(!refcount_is_zero(&rrl->rr_anon_rcount) ||
+	    !refcount_is_zero(&rrl->rr_linked_rcount) ||
+	    rrl->rr_writer != NULL);
+
+	if (rrl->rr_writer == NULL) {
+		int64_t count;
+		if (rrn_find_and_remove(rrl))
+			count = refcount_remove(&rrl->rr_linked_rcount, tag);
+		else
+			count = refcount_remove(&rrl->rr_anon_rcount, tag);
+		if (count == 0)
+			cv_broadcast(&rrl->rr_cv);
+	} else {
+		ASSERT(rrl->rr_writer == curthread);
+		ASSERT(refcount_is_zero(&rrl->rr_anon_rcount) &&
+		    refcount_is_zero(&rrl->rr_linked_rcount));
+		rrl->rr_writer = NULL;
+		cv_broadcast(&rrl->rr_cv);
+	}
+	mutex_exit(&rrl->rr_lock);
+}
+
+boolean_t
+rrw_held(rrwlock_t *rrl, krw_t rw)
+{
+	boolean_t held;
+
+	mutex_enter(&rrl->rr_lock);
+	if (rw == RW_WRITER) {
+		held = (rrl->rr_writer == curthread);
+	} else {
+		held = (!refcount_is_zero(&rrl->rr_anon_rcount) ||
+		    !refcount_is_zero(&rrl->rr_linked_rcount));
+	}
+	mutex_exit(&rrl->rr_lock);
+
+	return (held);
+}
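
Note on the reader admission rule in rrw_enter_read() above: the wait condition can be read as a pure predicate over four facts. The sketch below restates it in userland C for illustration only. The struct rrw_state_t and both function names are hypothetical, and caller_has_node stands in for the result of rrn_find().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the lock state seen by an arriving reader. */
typedef struct rrw_state {
	bool	writer_held;		/* rr_writer != NULL */
	bool	writer_wanted;		/* rr_writer_wanted */
	int64_t	anon_readers;		/* rr_anon_rcount */
	bool	caller_has_node;	/* rrn_find(rrl) != NULL */
} rrw_state_t;

/*
 * A reader waits if a writer holds the lock, or if a writer is waiting,
 * all anonymous readers have drained, and the caller is known not to be
 * re-entering.
 */
static bool
reader_must_wait(const rrw_state_t *s)
{
	assert(s->anon_readers >= 0);
	return (s->writer_held ||
	    (s->writer_wanted && s->anon_readers == 0 &&
	    !s->caller_has_node));
}

/* An admitted reader is tracked in tsd only while a writer is waiting. */
static bool
reader_is_linked(const rrw_state_t *s)
{
	return (s->writer_wanted);
}
```

The middle clause is the one that avoids deadlock: while anonymous readers remain, a new reader might be one of them re-entering, so it is let through.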
--- a/uts/common/fs/zfs/sa.c
+++ b/uts/common/fs/zfs/sa.c
--- a/uts/common/fs/zfs/sha256.c
+++ b/uts/common/fs/zfs/sha256.c
@ -0,0 +1,50 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
+ * Use is subject to license terms.
+ */
+#include <sys/zfs_context.h>
+#include <sys/zio.h>
+#include <sys/sha2.h>
+
+void
+zio_checksum_SHA256(const void *buf, uint64_t size, zio_cksum_t *zcp)
+{
+	SHA2_CTX ctx;
+	zio_cksum_t tmp;
+
+	SHA2Init(SHA256, &ctx);
+	SHA2Update(&ctx, buf, size);
+	SHA2Final(&tmp, &ctx);
+
+	/*
+	 * A prior implementation of this function had a
+	 * private SHA256 implementation that always wrote things out in
+	 * Big Endian and there wasn't a byteswap variant of it.
+	 * To preserve on-disk compatibility we need to force that
+	 * behaviour.
+	 */
+	zcp->zc_word[0] = BE_64(tmp.zc_word[0]);
+	zcp->zc_word[1] = BE_64(tmp.zc_word[1]);
+	zcp->zc_word[2] = BE_64(tmp.zc_word[2]);
+	zcp->zc_word[3] = BE_64(tmp.zc_word[3]);
+}
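
Note on the BE_64() conversions above: the effect is that each checksum word is interpreted in big-endian byte order whatever the host byte order is. The following portable sketch shows an equivalent conversion for illustration. The names store_be64() and host_to_be64() are hypothetical and are not the macros the kernel uses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Write 'v' into out[0..7], most significant byte first. */
static void
store_be64(uint8_t *out, uint64_t v)
{
	int i;

	assert(out != NULL);
	for (i = 0; i < 8; i++)
		out[i] = (uint8_t)(v >> (56 - 8 * i));
}

/*
 * Return a value whose in-memory bytes are the big-endian encoding of
 * 'v'.  This is a byte swap on little-endian hosts and a no-op on
 * big-endian hosts.
 */
static uint64_t
host_to_be64(uint64_t v)
{
	uint8_t b[8];
	uint64_t out;

	store_be64(b, v);
	memcpy(&out, b, sizeof (out));
	return (out);
}
```

Because the conversion is its own inverse on either kind of host, applying it twice returns the original value.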
--- a/uts/common/fs/zfs/spa.c
+++ b/uts/common/fs/zfs/spa.c
--- a/uts/common/fs/zfs/spa_config.c
+++ b/uts/common/fs/zfs/spa_config.c
@ -0,0 +1,487 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+
+/*
+ * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#include <sys/spa.h>
+#include <sys/spa_impl.h>
+#include <sys/nvpair.h>
+#include <sys/uio.h>
+#include <sys/fs/zfs.h>
+#include <sys/vdev_impl.h>
+#include <sys/zfs_ioctl.h>
+#include <sys/utsname.h>
+#include <sys/systeminfo.h>
+#include <sys/sunddi.h>
+#ifdef _KERNEL
+#include <sys/kobj.h>
+#include <sys/zone.h>
+#endif
+
+/*
+ * Pool configuration repository.
+ *
+ * Pool configuration is stored as a packed nvlist on the filesystem.  By
+ * default, all pools are stored in /etc/zfs/zpool.cache and loaded on boot
+ * (when the ZFS module is loaded).  Pools can also have the 'cachefile'
+ * property set that allows them to be stored in an alternate location under
+ * the control of external software.
+ *
+ * For each cache file, we have a single nvlist which holds all the
+ * configuration information.  When the module loads, we read this information
+ * from /etc/zfs/zpool.cache and populate the SPA namespace.  This namespace is
+ * maintained independently in spa.c.  Whenever the namespace is modified, or
+ * the configuration of a pool is changed, we call spa_config_sync(), which
+ * walks through all the active pools and writes the configuration to disk.
+ */
+
+static uint64_t spa_config_generation = 1;
+
+/*
+ * This can be overridden in userland to preserve an alternate namespace for
+ * userland pools when doing testing.
+ */
+const char *spa_config_path = ZPOOL_CACHE;
+
+/*
+ * Called when the module is first loaded, this routine loads the configuration
+ * file into the SPA namespace.  It does not actually open or load the pools; it
+ * only populates the namespace.
+ */
+void
+spa_config_load(void)
+{
+	void *buf = NULL;
+	nvlist_t *nvlist, *child;
+	nvpair_t *nvpair;
+	char *pathname;
+	struct _buf *file;
+	uint64_t fsize;
+
+	/*
+	 * Open the configuration file.
+	 */
+	pathname = kmem_alloc(MAXPATHLEN, KM_SLEEP);
+
+	(void) snprintf(pathname, MAXPATHLEN, "%s%s",
+	    (rootdir != NULL) ? "./" : "", spa_config_path);
+
+	file = kobj_open_file(pathname);
+
+	kmem_free(pathname, MAXPATHLEN);
+
+	if (file == (struct _buf *)-1)
+		return;
+
+	if (kobj_get_filesize(file, &fsize) != 0)
+		goto out;
+
+	buf = kmem_alloc(fsize, KM_SLEEP);
+
+	/*
+	 * Read the nvlist from the file.
+	 */
+	if (kobj_read_file(file, buf, fsize, 0) < 0)
+		goto out;
+
+	/*
+	 * Unpack the nvlist.
+	 */
+	if (nvlist_unpack(buf, fsize, &nvlist, KM_SLEEP) != 0)
+		goto out;
+
+	/*
+	 * Iterate over all elements in the nvlist, creating a new spa_t for
+	 * each one with the specified configuration.
+	 */
+	mutex_enter(&spa_namespace_lock);
+	nvpair = NULL;
+	while ((nvpair = nvlist_next_nvpair(nvlist, nvpair)) != NULL) {
+		if (nvpair_type(nvpair) != DATA_TYPE_NVLIST)
+			continue;
+
+		VERIFY(nvpair_value_nvlist(nvpair, &child) == 0);
+
+		if (spa_lookup(nvpair_name(nvpair)) != NULL)
+			continue;
+		(void) spa_add(nvpair_name(nvpair), child, NULL);
+	}
+	mutex_exit(&spa_namespace_lock);
+
+	nvlist_free(nvlist);
+
+out:
+	if (buf != NULL)
+		kmem_free(buf, fsize);
+
+	kobj_close_file(file);
+}
+
+static void
+spa_config_write(spa_config_dirent_t *dp, nvlist_t *nvl)
+{
+	size_t buflen;
+	char *buf;
+	vnode_t *vp;
+	int oflags = FWRITE | FTRUNC | FCREAT | FOFFMAX;
+	char *temp;
+
+	/*
+	 * If the nvlist is empty (NULL), then remove the old cachefile.
+	 */
+	if (nvl == NULL) {
+		(void) vn_remove(dp->scd_path, UIO_SYSSPACE, RMFILE);
+		return;
+	}
+
+	/*
+	 * Pack the configuration into a buffer.
+	 */
+	VERIFY(nvlist_size(nvl, &buflen, NV_ENCODE_XDR) == 0);
+
+	buf = kmem_alloc(buflen, KM_SLEEP);
+	temp = kmem_zalloc(MAXPATHLEN, KM_SLEEP);
+
+	VERIFY(nvlist_pack(nvl, &buf, &buflen, NV_ENCODE_XDR,
+	    KM_SLEEP) == 0);
+
+	/*
+	 * Write the configuration to disk.  We need to do the traditional
+	 * 'write to temporary file, sync, move over original' to make sure we
+	 * always have a consistent view of the data.
+	 */
+	(void) snprintf(temp, MAXPATHLEN, "%s.tmp", dp->scd_path);
+
+	if (vn_open(temp, UIO_SYSSPACE, oflags, 0644, &vp, CRCREAT, 0) == 0) {
+		if (vn_rdwr(UIO_WRITE, vp, buf, buflen, 0, UIO_SYSSPACE,
+		    0, RLIM64_INFINITY, kcred, NULL) == 0 &&
+		    VOP_FSYNC(vp, FSYNC, kcred, NULL) == 0) {
+			(void) vn_rename(temp, dp->scd_path, UIO_SYSSPACE);
+		}
+		(void) VOP_CLOSE(vp, oflags, 1, 0, kcred, NULL);
+		VN_RELE(vp);
+	}
+
+	(void) vn_remove(temp, UIO_SYSSPACE, RMFILE);
+
+	kmem_free(buf, buflen);
+	kmem_free(temp, MAXPATHLEN);
+}
+
+/*
+ * Synchronize pool configuration to disk.  This must be called with the
+ * namespace lock held.
+ */
+void
+spa_config_sync(spa_t *target, boolean_t removing, boolean_t postsysevent)
+{
+	spa_config_dirent_t *dp, *tdp;
+	nvlist_t *nvl;
+
+	ASSERT(MUTEX_HELD(&spa_namespace_lock));
+
+	if (rootdir == NULL || !(spa_mode_global & FWRITE))
+		return;
+
+	/*
+	 * Iterate over all cachefiles for the pool, past or present.  When the
+	 * cachefile is changed, the new one is pushed onto this list, allowing
+	 * us to update previous cachefiles that no longer contain this pool.
+	 */
+	for (dp = list_head(&target->spa_config_list); dp != NULL;
+	    dp = list_next(&target->spa_config_list, dp)) {
+		spa_t *spa = NULL;
+		if (dp->scd_path == NULL)
+			continue;
+
+		/*
+		 * Iterate over all pools, adding any matching pools to 'nvl'.
+		 */
+		nvl = NULL;
+		while ((spa = spa_next(spa)) != NULL) {
+			if (spa == target && removing)
+				continue;
+
+			mutex_enter(&spa->spa_props_lock);
+			tdp = list_head(&spa->spa_config_list);
+			if (spa->spa_config == NULL ||
+			    tdp->scd_path == NULL ||
+			    strcmp(tdp->scd_path, dp->scd_path) != 0) {
+				mutex_exit(&spa->spa_props_lock);
+				continue;
+			}
+
+			if (nvl == NULL)
+				VERIFY(nvlist_alloc(&nvl, NV_UNIQUE_NAME,
+				    KM_SLEEP) == 0);
+
+			VERIFY(nvlist_add_nvlist(nvl, spa->spa_name,
+			    spa->spa_config) == 0);
+			mutex_exit(&spa->spa_props_lock);
+		}
+
+		spa_config_write(dp, nvl);
+		nvlist_free(nvl);
+	}
+
+	/*
+	 * Remove any config entries older than the current one.
+	 */
+	dp = list_head(&target->spa_config_list);
+	while ((tdp = list_next(&target->spa_config_list, dp)) != NULL) {
+		list_remove(&target->spa_config_list, tdp);
+		if (tdp->scd_path != NULL)
+			spa_strfree(tdp->scd_path);
+		kmem_free(tdp, sizeof (spa_config_dirent_t));
+	}
+
+	spa_config_generation++;
+
+	if (postsysevent)
+		spa_event_notify(target, NULL, ESC_ZFS_CONFIG_SYNC);
+}
+
+/*
+ * Sigh.  Inside a local zone, we don't have access to /etc/zfs/zpool.cache,
+ * and we don't want to allow the local zone to see all the pools anyway.
+ * So we have to invent the ZFS_IOC_CONFIG ioctl to grab the configuration
+ * information for all pools visible within the zone.
+ */
+nvlist_t *
+spa_all_configs(uint64_t *generation)
+{
+	nvlist_t *pools;
+	spa_t *spa = NULL;
+
+	if (*generation == spa_config_generation)
+		return (NULL);
+
+	VERIFY(nvlist_alloc(&pools, NV_UNIQUE_NAME, KM_SLEEP) == 0);
+
+	mutex_enter(&spa_namespace_lock);
+	while ((spa = spa_next(spa)) != NULL) {
+		if (INGLOBALZONE(curproc) ||
+		    zone_dataset_visible(spa_name(spa), NULL)) {
+			mutex_enter(&spa->spa_props_lock);
+			VERIFY(nvlist_add_nvlist(pools, spa_name(spa),
+			    spa->spa_config) == 0);
+			mutex_exit(&spa->spa_props_lock);
+		}
+	}
+	*generation = spa_config_generation;
+	mutex_exit(&spa_namespace_lock);
+
+	return (pools);
+}
+
+void
+spa_config_set(spa_t *spa, nvlist_t *config)
+{
+	mutex_enter(&spa->spa_props_lock);
+	if (spa->spa_config != NULL)
+		nvlist_free(spa->spa_config);
+	spa->spa_config = config;
+	mutex_exit(&spa->spa_props_lock);
+}
+
+/*
+ * Generate the pool's configuration based on the current in-core state.
+ * We infer whether to generate a complete config or just one top-level config
+ * based on whether vd is the root vdev.
+ */
+nvlist_t *
+spa_config_generate(spa_t *spa, vdev_t *vd, uint64_t txg, int getstats)
+{
+	nvlist_t *config, *nvroot;
+	vdev_t *rvd = spa->spa_root_vdev;
+	unsigned long hostid = 0;
+	boolean_t locked = B_FALSE;
+	uint64_t split_guid;
+
+	if (vd == NULL) {
+		vd = rvd;
+		locked = B_TRUE;
+		spa_config_enter(spa, SCL_CONFIG | SCL_STATE, FTAG, RW_READER);
+	}
+
+	ASSERT(spa_config_held(spa, SCL_CONFIG | SCL_STATE, RW_READER) ==
+	    (SCL_CONFIG | SCL_STATE));
+
+	/*
+	 * If txg is -1, report the current value of spa->spa_config_txg.
+	 */
+	if (txg == -1ULL)
+		txg = spa->spa_config_txg;
+
+	VERIFY(nvlist_alloc(&config, NV_UNIQUE_NAME, KM_SLEEP) == 0);
+
+	VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_VERSION,
+	    spa_version(spa)) == 0);
+	VERIFY(nvlist_add_string(config, ZPOOL_CONFIG_POOL_NAME,
+	    spa_name(spa)) == 0);
+	VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_POOL_STATE,
+	    spa_state(spa)) == 0);
+	VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_POOL_TXG,
+	    txg) == 0);
+	VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_POOL_GUID,
+	    spa_guid(spa)) == 0);
+#ifdef	_KERNEL
+	hostid = zone_get_hostid(NULL);
+#else	/* _KERNEL */
+	/*
+	 * We're emulating the system's hostid in userland, so we can't use
+	 * zone_get_hostid().
+	 */
+	(void) ddi_strtoul(hw_serial, NULL, 10, &hostid);
+#endif	/* _KERNEL */
+	if (hostid != 0) {
+		VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_HOSTID,
+		    hostid) == 0);
+	}
+	VERIFY(nvlist_add_string(config, ZPOOL_CONFIG_HOSTNAME,
+	    utsname.nodename) == 0);
+
+	if (vd != rvd) {
+		VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_TOP_GUID,
+		    vd->vdev_top->vdev_guid) == 0);
+		VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_GUID,
+		    vd->vdev_guid) == 0);
+		if (vd->vdev_isspare)
+			VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_IS_SPARE,
+			    1ULL) == 0);
+		if (vd->vdev_islog)
+			VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_IS_LOG,
+			    1ULL) == 0);
+		vd = vd->vdev_top;		/* label contains top config */
+	} else {
+		/*
+		 * Only add the (potentially large) split information
+		 * in the mos config, and not in the vdev labels
+		 */
+		if (spa->spa_config_splitting != NULL)
+			VERIFY(nvlist_add_nvlist(config, ZPOOL_CONFIG_SPLIT,
+			    spa->spa_config_splitting) == 0);
+	}
+
+	/*
+	 * Add the top-level config.  We even add this on pools which
+	 * don't support holes in the namespace.
+	 */
+	vdev_top_config_generate(spa, config);
+
+	/*
+	 * If we're splitting, record the original pool's guid.
+	 */
+	if (spa->spa_config_splitting != NULL &&
+	    nvlist_lookup_uint64(spa->spa_config_splitting,
+	    ZPOOL_CONFIG_SPLIT_GUID, &split_guid) == 0) {
+		VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_SPLIT_GUID,
+		    split_guid) == 0);
+	}
+
+	nvroot = vdev_config_generate(spa, vd, getstats, 0);
+	VERIFY(nvlist_add_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, nvroot) == 0);
+	nvlist_free(nvroot);
+
+	if (getstats && spa_load_state(spa) == SPA_LOAD_NONE) {
+		ddt_histogram_t *ddh;
+		ddt_stat_t *dds;
+		ddt_object_t *ddo;
+
+		ddh = kmem_zalloc(sizeof (ddt_histogram_t), KM_SLEEP);
+		ddt_get_dedup_histogram(spa, ddh);
+		VERIFY(nvlist_add_uint64_array(config,
+		    ZPOOL_CONFIG_DDT_HISTOGRAM,
+		    (uint64_t *)ddh, sizeof (*ddh) / sizeof (uint64_t)) == 0);
+		kmem_free(ddh, sizeof (ddt_histogram_t));
+
+		ddo = kmem_zalloc(sizeof (ddt_object_t), KM_SLEEP);
+		ddt_get_dedup_object_stats(spa, ddo);
+		VERIFY(nvlist_add_uint64_array(config,
+		    ZPOOL_CONFIG_DDT_OBJ_STATS,
+		    (uint64_t *)ddo, sizeof (*ddo) / sizeof (uint64_t)) == 0);
+		kmem_free(ddo, sizeof (ddt_object_t));
+
+		dds = kmem_zalloc(sizeof (ddt_stat_t), KM_SLEEP);
+		ddt_get_dedup_stats(spa, dds);
+		VERIFY(nvlist_add_uint64_array(config,
+		    ZPOOL_CONFIG_DDT_STATS,
+		    (uint64_t *)dds, sizeof (*dds) / sizeof (uint64_t)) == 0);
+		kmem_free(dds, sizeof (ddt_stat_t));
+	}
+
+	if (locked)
+		spa_config_exit(spa, SCL_CONFIG | SCL_STATE, FTAG);
+
+	return (config);
+}
+
+/*
+ * Update all disk labels, generate a fresh config based on the current
+ * in-core state, and sync the global config cache (do not sync the config
+ * cache if this is a booting rootpool).
+ */
+void
+spa_config_update(spa_t *spa, int what)
+{
+	vdev_t *rvd = spa->spa_root_vdev;
+	uint64_t txg;
+	int c;
+
+	ASSERT(MUTEX_HELD(&spa_namespace_lock));
+
+	spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
+	txg = spa_last_synced_txg(spa) + 1;
+	if (what == SPA_CONFIG_UPDATE_POOL) {
+		vdev_config_dirty(rvd);
+	} else {
+		/*
+		 * If we have top-level vdevs that were added but have
+		 * not yet been prepared for allocation, do that now.
+		 * (It's safe now because the config cache is up to date,
+		 * so it will be able to translate the new DVAs.)
+		 * See comments in spa_vdev_add() for full details.
+		 */
+		for (c = 0; c < rvd->vdev_children; c++) {
+			vdev_t *tvd = rvd->vdev_child[c];
+			if (tvd->vdev_ms_array == 0)
+				vdev_metaslab_set_size(tvd);
+			vdev_expand(tvd, txg);
+		}
+	}
+	spa_config_exit(spa, SCL_ALL, FTAG);
+
+	/*
+	 * Wait for the mosconfig to be regenerated and synced.
+	 */
+	txg_wait_synced(spa->spa_dsl_pool, txg);
+
+	/*
+	 * Update the global config cache to reflect the new mosconfig.
+	 */
+	if (!spa->spa_is_root)
+		spa_config_sync(spa, B_FALSE, what != SPA_CONFIG_UPDATE_POOL);
+
+	if (what == SPA_CONFIG_UPDATE_POOL)
+		spa_config_update(spa, SPA_CONFIG_UPDATE_VDEVS);
+}
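
Note on spa_config_write() above: its comment describes the traditional "write to temporary file, sync, move over original" sequence. The following is a userland POSIX sketch of the same pattern, for illustration only. The function name write_file_atomically() is hypothetical, the 4096-byte path buffer is an assumption standing in for MAXPATHLEN, and a short write is simply treated as failure.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Replace 'path' with 'len' bytes from 'buf' so that a reader sees
 * either the old contents or the new, never a partial file.
 * Returns 0 on success, -1 on failure.
 */
static int
write_file_atomically(const char *path, const void *buf, size_t len)
{
	char temp[4096];
	int fd, ok;

	assert(path != NULL && buf != NULL);
	if (snprintf(temp, sizeof (temp), "%s.tmp", path) >=
	    (int)sizeof (temp))
		return (-1);

	fd = open(temp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return (-1);

	/* Write and flush to stable storage before renaming. */
	ok = (write(fd, buf, len) == (ssize_t)len) && (fsync(fd) == 0);
	(void) close(fd);

	if (ok && rename(temp, path) == 0)
		return (0);

	/* Leave the original untouched and clean up the temporary. */
	(void) unlink(temp);
	return (-1);
}
```

The rename is the commit point. The kernel routine removes the temporary file unconditionally afterwards, which is harmless after a successful rename because the temporary name no longer exists.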
--- a/uts/common/fs/zfs/spa_errlog.c
+++ b/uts/common/fs/zfs/spa_errlog.c
@ -0,0 +1,403 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2006, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+/*
+ * Routines to manage the on-disk persistent error log.
+ *
+ * Each pool stores a log of all logical data errors seen during normal
+ * operation.  This is actually the union of two distinct logs: the last log,
+ * and the current log.  All errors seen are logged to the current log.  When a
+ * scrub completes, the current log becomes the last log, the last log is thrown
+ * out, and the current log is reinitialized.  This way, if an error is somehow
+ * corrected, a new scrub will show that it no longer exists, and it will be
+ * deleted from the log when the scrub completes.
+ *
+ * The log is stored using a ZAP object whose key is a string form of the
+ * zbookmark tuple (objset, object, level, blkid), and whose value is an
+ * optional 'objset:object' human-readable string describing the data.  When an
+ * error is first logged, this string will be empty, indicating that no name is
+ * known.  This prevents us from having to issue a potentially large amount of
+ * I/O to discover the object name during an error path.  Instead, we do the
+ * calculation when the data is requested, storing the result so future queries
+ * will be faster.
+ *
+ * This log is then shipped into an nvlist where the key is the dataset name and
+ * the value is the object name.  Userland is then responsible for uniquifying
+ * this list and displaying it to the user.
+ */
+
+#include <sys/dmu_tx.h>
+#include <sys/spa.h>
+#include <sys/spa_impl.h>
+#include <sys/zap.h>
+#include <sys/zio.h>
+
+
+/*
+ * Convert a bookmark to a string.
+ */
+static void
+bookmark_to_name(zbookmark_t *zb, char *buf, size_t len)
+{
+	(void) snprintf(buf, len, "%llx:%llx:%llx:%llx",
+	    (u_longlong_t)zb->zb_objset, (u_longlong_t)zb->zb_object,
+	    (u_longlong_t)zb->zb_level, (u_longlong_t)zb->zb_blkid);
+}
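+
+/*
+ * Example (illustrative values, not taken from a real pool): a bookmark
+ * with objset 0x15, object 0x2a, level 0 and blkid 0x3 is stored under the
+ * ZAP key "15:2a:0:3".  Each field is printed by "%llx" as lowercase hex
+ * with no "0x" prefix, and the fields are separated by colons.
+ */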
+
+/*
+ * Convert a string to a bookmark
+ */
+#ifdef _KERNEL
+static void
+name_to_bookmark(char *buf, zbookmark_t *zb)
+{
+	zb->zb_objset = strtonum(buf, &buf);
+	ASSERT(*buf == ':');
+	zb->zb_object = strtonum(buf + 1, &buf);
+	ASSERT(*buf == ':');
+	zb->zb_level = (int)strtonum(buf + 1, &buf);
+	ASSERT(*buf == ':');
+	zb->zb_blkid = strtonum(buf + 1, &buf);
+	ASSERT(*buf == '\0');
+}
+#endif
+
+/*
+ * Log an uncorrectable error to the persistent error log.  We add it to the
+ * spa's list of pending errors.  The changes are actually synced out to disk
+ * during spa_errlog_sync().
+ */
+void
+spa_log_error(spa_t *spa, zio_t *zio)
+{
+	zbookmark_t *zb = &zio->io_logical->io_bookmark;
+	spa_error_entry_t search;
+	spa_error_entry_t *new;
+	avl_tree_t *tree;
+	avl_index_t where;
+
+	/*
+	 * If we are trying to import a pool, ignore any errors, as we won't be
+	 * writing to the pool any time soon.
+	 */
+	if (spa_load_state(spa) == SPA_LOAD_TRYIMPORT)
+		return;
+
+	mutex_enter(&spa->spa_errlist_lock);
+
+	/*
+	 * If we have had a request to rotate the log, log it to the next list
+	 * instead of the current one.
+	 */
+	if (spa->spa_scrub_active || spa->spa_scrub_finished)
+		tree = &spa->spa_errlist_scrub;
+	else
+		tree = &spa->spa_errlist_last;
+
+	search.se_bookmark = *zb;
+	if (avl_find(tree, &search, &where) != NULL) {
+		mutex_exit(&spa->spa_errlist_lock);
+		return;
+	}
+
+	new = kmem_zalloc(sizeof (spa_error_entry_t), KM_SLEEP);
+	new->se_bookmark = *zb;
+	avl_insert(tree, new, where);
+
+	mutex_exit(&spa->spa_errlist_lock);
+}
+
+/*
+ * Return the number of errors currently in the error log.  This is actually the
+ * sum of both the last log and the current log, since we don't know the union
+ * of these logs until we reach userland.
+ */
+uint64_t
+spa_get_errlog_size(spa_t *spa)
+{
+	uint64_t total = 0, count;
+
+	mutex_enter(&spa->spa_errlog_lock);
+	if (spa->spa_errlog_scrub != 0 &&
+	    zap_count(spa->spa_meta_objset, spa->spa_errlog_scrub,
+	    &count) == 0)
+		total += count;
+
+	if (spa->spa_errlog_last != 0 && !spa->spa_scrub_finished &&
+	    zap_count(spa->spa_meta_objset, spa->spa_errlog_last,
+	    &count) == 0)
+		total += count;
+	mutex_exit(&spa->spa_errlog_lock);
+
+	mutex_enter(&spa->spa_errlist_lock);
+	total += avl_numnodes(&spa->spa_errlist_last);
+	total += avl_numnodes(&spa->spa_errlist_scrub);
+	mutex_exit(&spa->spa_errlist_lock);
+
+	return (total);
+}
+
+#ifdef _KERNEL
+static int
+process_error_log(spa_t *spa, uint64_t obj, void *addr, size_t *count)
+{
+	zap_cursor_t zc;
+	zap_attribute_t za;
+	zbookmark_t zb;
+
+	if (obj == 0)
+		return (0);
+
+	for (zap_cursor_init(&zc, spa->spa_meta_objset, obj);
+	    zap_cursor_retrieve(&zc, &za) == 0;
+	    zap_cursor_advance(&zc)) {
+
+		if (*count == 0) {
+			zap_cursor_fini(&zc);
+			return (ENOMEM);
+		}
+
+		name_to_bookmark(za.za_name, &zb);
+
+		if (copyout(&zb, (char *)addr +
+		    (*count - 1) * sizeof (zbookmark_t),
+		    sizeof (zbookmark_t)) != 0) {
+			/* release the cursor on this error path as well */
+			zap_cursor_fini(&zc);
+			return (EFAULT);
+		}
+
+		*count -= 1;
+	}
+
+	zap_cursor_fini(&zc);
+
+	return (0);
+}
+
+static int
+process_error_list(avl_tree_t *list, void *addr, size_t *count)
+{
+	spa_error_entry_t *se;
+
+	for (se = avl_first(list); se != NULL; se = AVL_NEXT(list, se)) {
+
+		if (*count == 0)
+			return (ENOMEM);
+
+		if (copyout(&se->se_bookmark, (char *)addr +
+		    (*count - 1) * sizeof (zbookmark_t),
+		    sizeof (zbookmark_t)) != 0)
+			return (EFAULT);
+
+		*count -= 1;
+	}
+
+	return (0);
+}
+#endif
+
+/*
+ * Copy all known errors to userland as an array of bookmarks.  This is
+ * actually a union of the on-disk last log and current log, as well as any
+ * pending error requests.
+ *
+ * Because the act of reading the on-disk log could cause errors to be
+ * generated, we have two separate locks: one for the error log and one for the
+ * in-core error lists.  We only need the error list lock to log an error, so
+ * we grab the error log lock while we read the on-disk logs, and only pick up
+ * the error list lock when we are finished.
+ */
+int
+spa_get_errlog(spa_t *spa, void *uaddr, size_t *count)
+{
+	int ret = 0;
+
+#ifdef _KERNEL
+	mutex_enter(&spa->spa_errlog_lock);
+
+	ret = process_error_log(spa, spa->spa_errlog_scrub, uaddr, count);
+
+	if (!ret && !spa->spa_scrub_finished)
+		ret = process_error_log(spa, spa->spa_errlog_last, uaddr,
+		    count);
+
+	mutex_enter(&spa->spa_errlist_lock);
+	if (!ret)
+		ret = process_error_list(&spa->spa_errlist_scrub, uaddr,
+		    count);
+	if (!ret)
+		ret = process_error_list(&spa->spa_errlist_last, uaddr,
+		    count);
+	mutex_exit(&spa->spa_errlist_lock);
+
+	mutex_exit(&spa->spa_errlog_lock);
+#endif
+
+	return (ret);
+}
+
+/*
+ * Called when a scrub completes.  This simply sets a bit which tells which AVL
+ * tree to add new errors to.  spa_errlog_sync() is responsible for actually
+ * syncing the changes to the underlying objects.
+ */
+void
+spa_errlog_rotate(spa_t *spa)
+{
+	mutex_enter(&spa->spa_errlist_lock);
+	spa->spa_scrub_finished = B_TRUE;
+	mutex_exit(&spa->spa_errlist_lock);
+}
+
+/*
+ * Discard any pending errors from the spa_t.  Called when unloading a faulted
+ * pool, as the errors encountered during the open cannot be synced to disk.
+ */
+void
+spa_errlog_drain(spa_t *spa)
+{
+	spa_error_entry_t *se;
+	void *cookie;
+
+	mutex_enter(&spa->spa_errlist_lock);
+
+	cookie = NULL;
+	while ((se = avl_destroy_nodes(&spa->spa_errlist_last,
+	    &cookie)) != NULL)
+		kmem_free(se, sizeof (spa_error_entry_t));
+	cookie = NULL;
+	while ((se = avl_destroy_nodes(&spa->spa_errlist_scrub,
+	    &cookie)) != NULL)
+		kmem_free(se, sizeof (spa_error_entry_t));
+
+	mutex_exit(&spa->spa_errlist_lock);
+}
+
+/*
+ * Process a list of errors into the current on-disk log.
+ */
+static void
+sync_error_list(spa_t *spa, avl_tree_t *t, uint64_t *obj, dmu_tx_t *tx)
+{
+	spa_error_entry_t *se;
+	char buf[64];
+	void *cookie;
+
+	if (avl_numnodes(t) != 0) {
+		/* create log if necessary */
+		if (*obj == 0)
+			*obj = zap_create(spa->spa_meta_objset,
+			    DMU_OT_ERROR_LOG, DMU_OT_NONE,
+			    0, tx);
+
+		/* add errors to the current log */
+		for (se = avl_first(t); se != NULL; se = AVL_NEXT(t, se)) {
+			char *name = se->se_name ? se->se_name : "";
+
+			bookmark_to_name(&se->se_bookmark, buf, sizeof (buf));
+
+			(void) zap_update(spa->spa_meta_objset,
+			    *obj, buf, 1, strlen(name) + 1, name, tx);
+		}
+
+		/* purge the error list */
+		cookie = NULL;
+		while ((se = avl_destroy_nodes(t, &cookie)) != NULL)
+			kmem_free(se, sizeof (spa_error_entry_t));
+	}
+}
+
+/*
+ * Sync the error log out to disk.  This is a little tricky because the act of
+ * writing the error log requires the spa_errlist_lock.  So, we need to lock the
+ * error lists, take a copy of the lists, and then reinitialize them.  Then, we
+ * drop the error list lock and take the error log lock, at which point we
+ * do the errlog processing.  Then, if we encounter an I/O error during this
+ * process, we can successfully add the error to the list.  Note that this will
+ * result in the perpetual recycling of errors, but it is an unlikely situation
+ * and not a performance critical operation.
+ */
+void
+spa_errlog_sync(spa_t *spa, uint64_t txg)
+{
+	dmu_tx_t *tx;
+	avl_tree_t scrub, last;
+	int scrub_finished;
+
+	mutex_enter(&spa->spa_errlist_lock);
+
+	/*
+	 * Bail out early under normal circumstances.
+	 */
+	if (avl_numnodes(&spa->spa_errlist_scrub) == 0 &&
+	    avl_numnodes(&spa->spa_errlist_last) == 0 &&
+	    !spa->spa_scrub_finished) {
+		mutex_exit(&spa->spa_errlist_lock);
+		return;
+	}
+
+	spa_get_errlists(spa, &last, &scrub);
+	scrub_finished = spa->spa_scrub_finished;
+	spa->spa_scrub_finished = B_FALSE;
+
+	mutex_exit(&spa->spa_errlist_lock);
+	mutex_enter(&spa->spa_errlog_lock);
+
+	tx = dmu_tx_create_assigned(spa->spa_dsl_pool, txg);
+
+	/*
+	 * Sync out the current list of errors.
+	 */
+	sync_error_list(spa, &last, &spa->spa_errlog_last, tx);
+
+	/*
+	 * Rotate the log if necessary.
+	 */
+	if (scrub_finished) {
+		if (spa->spa_errlog_last != 0)
+			VERIFY(dmu_object_free(spa->spa_meta_objset,
+			    spa->spa_errlog_last, tx) == 0);
+		spa->spa_errlog_last = spa->spa_errlog_scrub;
+		spa->spa_errlog_scrub = 0;
+
+		sync_error_list(spa, &scrub, &spa->spa_errlog_last, tx);
+	}
+
+	/*
+	 * Sync out any pending scrub errors.
+	 */
+	sync_error_list(spa, &scrub, &spa->spa_errlog_scrub, tx);
+
+	/*
+	 * Update the MOS to reflect the new values.
+	 */
+	(void) zap_update(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
+	    DMU_POOL_ERRLOG_LAST, sizeof (uint64_t), 1,
+	    &spa->spa_errlog_last, tx);
+	(void) zap_update(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
+	    DMU_POOL_ERRLOG_SCRUB, sizeof (uint64_t), 1,
+	    &spa->spa_errlog_scrub, tx);
+
+	dmu_tx_commit(tx);
+
+	mutex_exit(&spa->spa_errlog_lock);
+}
--- a/uts/common/fs/zfs/spa_history.c
+++ b/uts/common/fs/zfs/spa_history.c
@ -0,0 +1,502 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+
+/*
+ * Copyright (c) 2006, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#include <sys/spa.h>
+#include <sys/spa_impl.h>
+#include <sys/zap.h>
+#include <sys/dsl_synctask.h>
+#include <sys/dmu_tx.h>
+#include <sys/dmu_objset.h>
+#include <sys/utsname.h>
+#include <sys/cmn_err.h>
+#include <sys/sunddi.h>
+#include "zfs_comutil.h"
+#ifdef _KERNEL
+#include <sys/zone.h>
+#endif
+
+/*
+ * Routines to manage the on-disk history log.
+ *
+ * The history log is stored as a dmu object containing
+ * <packed record length, record nvlist> tuples.
+ *
+ * Where "record nvlist" is an nvlist containing uint64_ts and strings, and
+ * "packed record length" is the packed length of the "record nvlist" stored
+ * as a little-endian uint64_t.
+ *
+ * The log is implemented as a ring buffer, though the original creation
+ * of the pool ('zpool create') is never overwritten.
+ *
+ * The history log is tracked as object 'spa_t::spa_history'.  The bonus buffer
+ * of 'spa_history' stores the offsets for logging/retrieving history as
+ * 'spa_history_phys_t'.  'sh_pool_create_len' is the ending offset in bytes of
+ * where the 'zpool create' record is stored.  This allows us to never
+ * overwrite the original creation of the pool.  'sh_phys_max_off' is the
+ * physical ending offset in bytes of the log.  This tells you the length of
+ * the buffer. 'sh_eof' is the logical EOF (in bytes).  Whenever a record
+ * is added, 'sh_eof' is incremented by the size of the record.
+ * 'sh_eof' is never decremented.  'sh_bof' is the logical BOF (in bytes).
+ * This is where the consumer should start reading from after reading in
+ * the 'zpool create' portion of the log.
+ *
+ * 'sh_records_lost' keeps track of how many records have been overwritten
+ * and permanently lost.
+ */
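+
+/*
+ * Worked example of the logical-to-physical mapping done by
+ * spa_history_log_to_phys() (illustrative values, not from a real pool):
+ * with sh_pool_create_len = 100 and sh_phys_max_off = 1100, the ring
+ * occupies physical bytes [100, 1100), so phys_len = 1000.  Logical offset
+ * 1250 then maps to (1250 - 100) % 1000 + 100 = 250, and the bytes in
+ * [0, 100) holding the 'zpool create' record are never overwritten.
+ */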
+
+/* convert a logical offset to physical */
+static uint64_t
+spa_history_log_to_phys(uint64_t log_off, spa_history_phys_t *shpp)
+{
+	uint64_t phys_len;
+
+	phys_len = shpp->sh_phys_max_off - shpp->sh_pool_create_len;
+	return ((log_off - shpp->sh_pool_create_len) % phys_len
+	    + shpp->sh_pool_create_len);
+}
+
+void
+spa_history_create_obj(spa_t *spa, dmu_tx_t *tx)
+{
+	dmu_buf_t *dbp;
+	spa_history_phys_t *shpp;
+	objset_t *mos = spa->spa_meta_objset;
+
+	ASSERT(spa->spa_history == 0);
+	spa->spa_history = dmu_object_alloc(mos, DMU_OT_SPA_HISTORY,
+	    SPA_MAXBLOCKSIZE, DMU_OT_SPA_HISTORY_OFFSETS,
+	    sizeof (spa_history_phys_t), tx);
+
+	VERIFY(zap_add(mos, DMU_POOL_DIRECTORY_OBJECT,
+	    DMU_POOL_HISTORY, sizeof (uint64_t), 1,
+	    &spa->spa_history, tx) == 0);
+
+	VERIFY(0 == dmu_bonus_hold(mos, spa->spa_history, FTAG, &dbp));
+	ASSERT(dbp->db_size >= sizeof (spa_history_phys_t));
+
+	shpp = dbp->db_data;
+	dmu_buf_will_dirty(dbp, tx);
+
+	/*
+	 * Figure out maximum size of history log.  We set it at
+	 * 1% of pool size, with a max of 32MB and min of 128KB.
+	 */
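+	/*
+	 * Worked example (illustrative sizes): 1 GiB of normal-class dspace
+	 * gives 1 GiB / 100, or about 10.2 MiB.  Anything of 3.125 GiB or
+	 * more is capped at 32<<20 bytes, and anything of 12.5 MiB or less
+	 * is raised to 128<<10 bytes.
+	 */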
+	shpp->sh_phys_max_off =
+	    metaslab_class_get_dspace(spa_normal_class(spa)) / 100;
+	shpp->sh_phys_max_off = MIN(shpp->sh_phys_max_off, 32<<20);
+	shpp->sh_phys_max_off = MAX(shpp->sh_phys_max_off, 128<<10);
+
+	dmu_buf_rele(dbp, FTAG);
+}
+
+/*
+ * Change 'sh_bof' to the beginning of the next record.
+ */
+static int
+spa_history_advance_bof(spa_t *spa, spa_history_phys_t *shpp)
+{
+	objset_t *mos = spa->spa_meta_objset;
+	uint64_t firstread, reclen, phys_bof;
+	char buf[sizeof (reclen)];
+	int err;
+
+	phys_bof = spa_history_log_to_phys(shpp->sh_bof, shpp);
+	firstread = MIN(sizeof (reclen), shpp->sh_phys_max_off - phys_bof);
+
+	if ((err = dmu_read(mos, spa->spa_history, phys_bof, firstread,
+	    buf, DMU_READ_PREFETCH)) != 0)
+		return (err);
+	if (firstread != sizeof (reclen)) {
+		if ((err = dmu_read(mos, spa->spa_history,
+		    shpp->sh_pool_create_len, sizeof (reclen) - firstread,
+		    buf + firstread, DMU_READ_PREFETCH)) != 0)
+			return (err);
+	}
+
+	reclen = LE_64(*((uint64_t *)buf));
+	shpp->sh_bof += reclen + sizeof (reclen);
+	shpp->sh_records_lost++;
+	return (0);
+}
+
+static int
+spa_history_write(spa_t *spa, void *buf, uint64_t len, spa_history_phys_t *shpp,
+    dmu_tx_t *tx)
+{
+	uint64_t firstwrite, phys_eof;
+	objset_t *mos = spa->spa_meta_objset;
+	int err;
+
+	ASSERT(MUTEX_HELD(&spa->spa_history_lock));
+
+	/* see if we need to reset logical BOF */
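+	/*
+	 * Worked example (illustrative values): with a 1000-byte ring and
+	 * sh_eof - sh_bof = 900, there are 100 free bytes.  A write of
+	 * len = 100 satisfies 100 <= 100, so the oldest record is dropped
+	 * by advancing sh_bof, and this repeats until strictly more than
+	 * len bytes are free.
+	 */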
+	while (shpp->sh_phys_max_off - shpp->sh_pool_create_len -
+	    (shpp->sh_eof - shpp->sh_bof) <= len) {
+		if ((err = spa_history_advance_bof(spa, shpp)) != 0) {
+			return (err);
+		}
+	}
+
+	phys_eof = spa_history_log_to_phys(shpp->sh_eof, shpp);
+	firstwrite = MIN(len, shpp->sh_phys_max_off - phys_eof);
+	shpp->sh_eof += len;
+	dmu_write(mos, spa->spa_history, phys_eof, firstwrite, buf, tx);
+
+	len -= firstwrite;
+	if (len > 0) {
+		/* write out the rest at the beginning of physical file */
+		dmu_write(mos, spa->spa_history, shpp->sh_pool_create_len,
+		    len, (char *)buf + firstwrite, tx);
+	}
+
+	return (0);
+}
+
+static char *
+spa_history_zone()
+{
+#ifdef _KERNEL
+	return (curproc->p_zone->zone_name);
+#else
+	return ("global");
+#endif
+}
+
+/*
+ * Write out a history event.
+ */
+/*ARGSUSED*/
+static void
+spa_history_log_sync(void *arg1, void *arg2, dmu_tx_t *tx)
+{
+	spa_t		*spa = arg1;
+	history_arg_t	*hap = arg2;
+	const char	*history_str = hap->ha_history_str;
+	objset_t	*mos = spa->spa_meta_objset;
+	dmu_buf_t	*dbp;
+	spa_history_phys_t *shpp;
+	size_t		reclen;
+	uint64_t	le_len;
+	nvlist_t	*nvrecord;
+	char		*record_packed = NULL;
+	int		ret;
+
+	/*
+	 * If we have an older pool that doesn't have a command
+	 * history object, create it now.
+	 */
+	mutex_enter(&spa->spa_history_lock);
+	if (!spa->spa_history)
+		spa_history_create_obj(spa, tx);
+	mutex_exit(&spa->spa_history_lock);
+
+	/*
+	 * Get the offset of where we need to write via the bonus buffer.
+	 * Update the offset when the write completes.
+	 */
+	VERIFY(0 == dmu_bonus_hold(mos, spa->spa_history, FTAG, &dbp));
+	shpp = dbp->db_data;
+
+	dmu_buf_will_dirty(dbp, tx);
+
+#ifdef ZFS_DEBUG
+	{
+		dmu_object_info_t doi;
+		dmu_object_info_from_db(dbp, &doi);
+		ASSERT3U(doi.doi_bonus_type, ==, DMU_OT_SPA_HISTORY_OFFSETS);
+	}
+#endif
+
+	VERIFY(nvlist_alloc(&nvrecord, NV_UNIQUE_NAME, KM_SLEEP) == 0);
+	VERIFY(nvlist_add_uint64(nvrecord, ZPOOL_HIST_TIME,
+	    gethrestime_sec()) == 0);
+	VERIFY(nvlist_add_uint64(nvrecord, ZPOOL_HIST_WHO, hap->ha_uid) == 0);
+	if (hap->ha_zone != NULL)
+		VERIFY(nvlist_add_string(nvrecord, ZPOOL_HIST_ZONE,
+		    hap->ha_zone) == 0);
+#ifdef _KERNEL
+	VERIFY(nvlist_add_string(nvrecord, ZPOOL_HIST_HOST,
+	    utsname.nodename) == 0);
+#endif
+	if (hap->ha_log_type == LOG_CMD_POOL_CREATE ||
+	    hap->ha_log_type == LOG_CMD_NORMAL) {
+		VERIFY(nvlist_add_string(nvrecord, ZPOOL_HIST_CMD,
+		    history_str) == 0);
+
+		zfs_dbgmsg("command: %s", history_str);
+	} else {
+		VERIFY(nvlist_add_uint64(nvrecord, ZPOOL_HIST_INT_EVENT,
+		    hap->ha_event) == 0);
+		VERIFY(nvlist_add_uint64(nvrecord, ZPOOL_HIST_TXG,
+		    tx->tx_txg) == 0);
+		VERIFY(nvlist_add_string(nvrecord, ZPOOL_HIST_INT_STR,
+		    history_str) == 0);
+
+		zfs_dbgmsg("internal %s pool:%s txg:%llu %s",
+		    zfs_history_event_names[hap->ha_event], spa_name(spa),
+		    (longlong_t)tx->tx_txg, history_str);
+
+	}
+
+	VERIFY(nvlist_size(nvrecord, &reclen, NV_ENCODE_XDR) == 0);
+	record_packed = kmem_alloc(reclen, KM_SLEEP);
+
+	VERIFY(nvlist_pack(nvrecord, &record_packed, &reclen,
+	    NV_ENCODE_XDR, KM_SLEEP) == 0);
+
+	mutex_enter(&spa->spa_history_lock);
+	if (hap->ha_log_type == LOG_CMD_POOL_CREATE)
+		VERIFY(shpp->sh_eof == shpp->sh_pool_create_len);
+
+	/* write out the packed length as little endian */
+	le_len = LE_64((uint64_t)reclen);
+	ret = spa_history_write(spa, &le_len, sizeof (le_len), shpp, tx);
+	if (!ret)
+		ret = spa_history_write(spa, record_packed, reclen, shpp, tx);
+
+	if (!ret && hap->ha_log_type == LOG_CMD_POOL_CREATE) {
+		shpp->sh_pool_create_len += sizeof (le_len) + reclen;
+		shpp->sh_bof = shpp->sh_pool_create_len;
+	}
+
+	mutex_exit(&spa->spa_history_lock);
+	nvlist_free(nvrecord);
+	kmem_free(record_packed, reclen);
+	dmu_buf_rele(dbp, FTAG);
+
+	strfree(hap->ha_history_str);
+	if (hap->ha_zone != NULL)
+		strfree(hap->ha_zone);
+	kmem_free(hap, sizeof (history_arg_t));
+}
+
+/*
+ * Write out a history event.
+ */
+int
+spa_history_log(spa_t *spa, const char *history_str, history_log_type_t what)
+{
+	history_arg_t *ha;
+	int err = 0;
+	dmu_tx_t *tx;
+
+	ASSERT(what != LOG_INTERNAL);
+
+	tx = dmu_tx_create_dd(spa_get_dsl(spa)->dp_mos_dir);
+	err = dmu_tx_assign(tx, TXG_WAIT);
+	if (err) {
+		dmu_tx_abort(tx);
+		return (err);
+	}
+
+	ha = kmem_alloc(sizeof (history_arg_t), KM_SLEEP);
+	ha->ha_history_str = strdup(history_str);
+	ha->ha_zone = strdup(spa_history_zone());
+	ha->ha_log_type = what;
+	ha->ha_uid = crgetuid(CRED());
+
+	/* Kick this off asynchronously; errors are ignored. */
+	dsl_sync_task_do_nowait(spa_get_dsl(spa), NULL,
+	    spa_history_log_sync, spa, ha, 0, tx);
+	dmu_tx_commit(tx);
+
+	/* spa_history_log_sync will free ha and strings */
+	return (err);
+}
+
+/*
+ * Read out the command history.
+ */
+int
+spa_history_get(spa_t *spa, uint64_t *offp, uint64_t *len, char *buf)
+{
+	objset_t *mos = spa->spa_meta_objset;
+	dmu_buf_t *dbp;
+	uint64_t read_len, phys_read_off, phys_eof;
+	uint64_t leftover = 0;
+	spa_history_phys_t *shpp;
+	int err;
+
+	/*
+	 * If the command history doesn't exist (older pool),
+	 * that's ok, just return ENOENT.
+	 */
+	if (!spa->spa_history)
+		return (ENOENT);
+
+	/*
+	 * The history is logged asynchronously, so when they request
+	 * the first chunk of history, make sure everything has been
+	 * synced to disk so that we get it.
+	 */
+	if (*offp == 0 && spa_writeable(spa))
+		txg_wait_synced(spa_get_dsl(spa), 0);
+
+	if ((err = dmu_bonus_hold(mos, spa->spa_history, FTAG, &dbp)) != 0)
+		return (err);
+	shpp = dbp->db_data;
+
+#ifdef ZFS_DEBUG
+	{
+		dmu_object_info_t doi;
+		dmu_object_info_from_db(dbp, &doi);
+		ASSERT3U(doi.doi_bonus_type, ==, DMU_OT_SPA_HISTORY_OFFSETS);
+	}
+#endif
+
+	mutex_enter(&spa->spa_history_lock);
+	phys_eof = spa_history_log_to_phys(shpp->sh_eof, shpp);
+
+	if (*offp < shpp->sh_pool_create_len) {
+		/* read in just the zpool create history */
+		phys_read_off = *offp;
+		read_len = MIN(*len, shpp->sh_pool_create_len -
+		    phys_read_off);
+	} else {
+		/*
+		 * Need to reset passed in offset to BOF if the passed in
+		 * offset has since been overwritten.
+		 */
+		*offp = MAX(*offp, shpp->sh_bof);
+		phys_read_off = spa_history_log_to_phys(*offp, shpp);
+
+		/*
+		 * Read up to the minimum of what the user passed down or
+		 * the EOF (physical or logical).  If we hit physical EOF,
+		 * use 'leftover' to read from the physical BOF.
+		 */
+		if (phys_read_off <= phys_eof) {
+			read_len = MIN(*len, phys_eof - phys_read_off);
+		} else {
+			read_len = MIN(*len,
+			    shpp->sh_phys_max_off - phys_read_off);
+			if (phys_read_off + *len > shpp->sh_phys_max_off) {
+				leftover = MIN(*len - read_len,
+				    phys_eof - shpp->sh_pool_create_len);
+			}
+		}
+	}
+
+	/* offset for consumer to use next */
+	*offp += read_len + leftover;
+
+	/* tell the consumer how much you actually read */
+	*len = read_len + leftover;
+
+	if (read_len == 0) {
+		mutex_exit(&spa->spa_history_lock);
+		dmu_buf_rele(dbp, FTAG);
+		return (0);
+	}
+
+	err = dmu_read(mos, spa->spa_history, phys_read_off, read_len, buf,
+	    DMU_READ_PREFETCH);
+	if (leftover && err == 0) {
+		err = dmu_read(mos, spa->spa_history, shpp->sh_pool_create_len,
+		    leftover, buf + read_len, DMU_READ_PREFETCH);
+	}
+	mutex_exit(&spa->spa_history_lock);
+
+	dmu_buf_rele(dbp, FTAG);
+	return (err);
+}
+
+static void
+log_internal(history_internal_events_t event, spa_t *spa,
+    dmu_tx_t *tx, const char *fmt, va_list adx)
+{
+	history_arg_t *ha;
+
+	/*
+	 * If this is part of creating a pool, not everything is
+	 * initialized yet, so don't bother logging the internal events.
+	 */
+	if (tx->tx_txg == TXG_INITIAL)
+		return;
+
+	ha = kmem_alloc(sizeof (history_arg_t), KM_SLEEP);
+	ha->ha_history_str = kmem_alloc(vsnprintf(NULL, 0, fmt, adx) + 1,
+	    KM_SLEEP);
+
+	(void) vsprintf(ha->ha_history_str, fmt, adx);
+
+	ha->ha_log_type = LOG_INTERNAL;
+	ha->ha_event = event;
+	ha->ha_zone = NULL;
+	ha->ha_uid = 0;
+
+	if (dmu_tx_is_syncing(tx)) {
+		spa_history_log_sync(spa, ha, tx);
+	} else {
+		dsl_sync_task_do_nowait(spa_get_dsl(spa), NULL,
+		    spa_history_log_sync, spa, ha, 0, tx);
+	}
+	/* spa_history_log_sync() will free ha and strings */
+}
+
+void
+spa_history_log_internal(history_internal_events_t event, spa_t *spa,
+    dmu_tx_t *tx, const char *fmt, ...)
+{
+	dmu_tx_t *htx = tx;
+	va_list adx;
+
+	/* create a tx if we didn't get one */
+	if (tx == NULL) {
+		htx = dmu_tx_create_dd(spa_get_dsl(spa)->dp_mos_dir);
+		if (dmu_tx_assign(htx, TXG_WAIT) != 0) {
+			dmu_tx_abort(htx);
+			return;
+		}
+	}
+
+	va_start(adx, fmt);
+	log_internal(event, spa, htx, fmt, adx);
+	va_end(adx);
+
+	/* if we didn't get a tx from the caller, commit the one we made */
+	if (tx == NULL)
+		dmu_tx_commit(htx);
+}
+
+void
+spa_history_log_version(spa_t *spa, history_internal_events_t event)
+{
+#ifdef _KERNEL
+	uint64_t current_vers = spa_version(spa);
+
+	if (current_vers >= SPA_VERSION_ZPOOL_HISTORY) {
+		spa_history_log_internal(event, spa, NULL,
+		    "pool spa %llu; zfs spa %llu; zpl %d; uts %s %s %s %s",
+		    (u_longlong_t)current_vers, SPA_VERSION, ZPL_VERSION,
+		    utsname.nodename, utsname.release, utsname.version,
+		    utsname.machine);
+	}
+	cmn_err(CE_CONT, "!%s version %llu pool %s using %llu",
+	    event == LOG_POOL_IMPORT ? "imported" :
+	    event == LOG_POOL_CREATE ? "created" : "accessed",
+	    (u_longlong_t)current_vers, spa_name(spa), SPA_VERSION);
+#endif
+}
--- a/uts/common/fs/zfs/spa_misc.c
+++ b/uts/common/fs/zfs/spa_misc.c
--- a/uts/common/fs/zfs/space_map.c
+++ b/uts/common/fs/zfs/space_map.c
@ -0,0 +1,616 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
+ * Use is subject to license terms.
+ */
+
+#include <sys/zfs_context.h>
+#include <sys/spa.h>
+#include <sys/dmu.h>
+#include <sys/zio.h>
+#include <sys/space_map.h>
+
+/*
+ * Space map routines.
+ * NOTE: caller is responsible for all locking.
+ */
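+
+/*
+ * AVL comparator for segments.  Segments are half-open ranges
+ * [ss_start, ss_end), and any two that overlap compare as equal, so
+ * avl_find() with a search segment returns a stored segment overlapping it.
+ * For example, [0, 10) and [5, 15) compare as 0, while [0, 10) and
+ * [10, 20) compare as -1 because they only touch.
+ */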
+static int
+space_map_seg_compare(const void *x1, const void *x2)
+{
+	const space_seg_t *s1 = x1;
+	const space_seg_t *s2 = x2;
+
+	if (s1->ss_start < s2->ss_start) {
+		if (s1->ss_end > s2->ss_start)
+			return (0);
+		return (-1);
+	}
+	if (s1->ss_start > s2->ss_start) {
+		if (s1->ss_start < s2->ss_end)
+			return (0);
+		return (1);
+	}
+	return (0);
+}
+
+void
+space_map_create(space_map_t *sm, uint64_t start, uint64_t size, uint8_t shift,
+	kmutex_t *lp)
+{
+	bzero(sm, sizeof (*sm));
+
+	cv_init(&sm->sm_load_cv, NULL, CV_DEFAULT, NULL);
+
+	avl_create(&sm->sm_root, space_map_seg_compare,
+	    sizeof (space_seg_t), offsetof(struct space_seg, ss_node));
+
+	sm->sm_start = start;
+	sm->sm_size = size;
+	sm->sm_shift = shift;
+	sm->sm_lock = lp;
+}
+
+void
+space_map_destroy(space_map_t *sm)
+{
+	ASSERT(!sm->sm_loaded && !sm->sm_loading);
+	VERIFY3U(sm->sm_space, ==, 0);
+	avl_destroy(&sm->sm_root);
+	cv_destroy(&sm->sm_load_cv);
+}
+
+void
+space_map_add(space_map_t *sm, uint64_t start, uint64_t size)
+{
+	avl_index_t where;
+	space_seg_t ssearch, *ss_before, *ss_after, *ss;
+	uint64_t end = start + size;
+	int merge_before, merge_after;
+
+	ASSERT(MUTEX_HELD(sm->sm_lock));
+	VERIFY(size != 0);
+	VERIFY3U(start, >=, sm->sm_start);
+	VERIFY3U(end, <=, sm->sm_start + sm->sm_size);
+	VERIFY(sm->sm_space + size <= sm->sm_size);
+	VERIFY(P2PHASE(start, 1ULL << sm->sm_shift) == 0);
+	VERIFY(P2PHASE(size, 1ULL << sm->sm_shift) == 0);
+
+	ssearch.ss_start = start;
+	ssearch.ss_end = end;
+	ss = avl_find(&sm->sm_root, &ssearch, &where);
+
+	if (ss != NULL && ss->ss_start <= start && ss->ss_end >= end) {
+		zfs_panic_recover("zfs: allocating allocated segment "
+		    "(offset=%llu size=%llu)\n",
+		    (longlong_t)start, (longlong_t)size);
+		return;
+	}
+
+	/* Make sure we don't overlap with either of our neighbors */
+	VERIFY(ss == NULL);
+
+	ss_before = avl_nearest(&sm->sm_root, where, AVL_BEFORE);
+	ss_after = avl_nearest(&sm->sm_root, where, AVL_AFTER);
+
+	merge_before = (ss_before != NULL && ss_before->ss_end == start);
+	merge_after = (ss_after != NULL && ss_after->ss_start == end);
+
+	if (merge_before && merge_after) {
+		avl_remove(&sm->sm_root, ss_before);
+		if (sm->sm_pp_root) {
+			avl_remove(sm->sm_pp_root, ss_before);
+			avl_remove(sm->sm_pp_root, ss_after);
+		}
+		ss_after->ss_start = ss_before->ss_start;
+		kmem_free(ss_before, sizeof (*ss_before));
+		ss = ss_after;
+	} else if (merge_before) {
+		ss_before->ss_end = end;
+		if (sm->sm_pp_root)
+			avl_remove(sm->sm_pp_root, ss_before);
+		ss = ss_before;
+	} else if (merge_after) {
+		ss_after->ss_start = start;
+		if (sm->sm_pp_root)
+			avl_remove(sm->sm_pp_root, ss_after);
+		ss = ss_after;
+	} else {
+		ss = kmem_alloc(sizeof (*ss), KM_SLEEP);
+		ss->ss_start = start;
+		ss->ss_end = end;
+		avl_insert(&sm->sm_root, ss, where);
+	}
+
+	if (sm->sm_pp_root)
+		avl_add(sm->sm_pp_root, ss);
+
+	sm->sm_space += size;
+}
+
+void
+space_map_remove(space_map_t *sm, uint64_t start, uint64_t size)
+{
+	avl_index_t where;
+	space_seg_t ssearch, *ss, *newseg;
+	uint64_t end = start + size;
+	int left_over, right_over;
+
+	ASSERT(MUTEX_HELD(sm->sm_lock));
+	VERIFY(size != 0);
+	VERIFY(P2PHASE(start, 1ULL << sm->sm_shift) == 0);
+	VERIFY(P2PHASE(size, 1ULL << sm->sm_shift) == 0);
+
+	ssearch.ss_start = start;
+	ssearch.ss_end = end;
+	ss = avl_find(&sm->sm_root, &ssearch, &where);
+
+	/* Make sure we completely overlap with someone */
+	if (ss == NULL) {
+		zfs_panic_recover("zfs: freeing free segment "
+		    "(offset=%llu size=%llu)",
+		    (longlong_t)start, (longlong_t)size);
+		return;
+	}
+	VERIFY3U(ss->ss_start, <=, start);
+	VERIFY3U(ss->ss_end, >=, end);
+	VERIFY(sm->sm_space - size <= sm->sm_size);
+
+	left_over = (ss->ss_start != start);
+	right_over = (ss->ss_end != end);
+
+	if (sm->sm_pp_root)
+		avl_remove(sm->sm_pp_root, ss);
+
+	if (left_over && right_over) {
+		newseg = kmem_alloc(sizeof (*newseg), KM_SLEEP);
+		newseg->ss_start = end;
+		newseg->ss_end = ss->ss_end;
+		ss->ss_end = start;
+		avl_insert_here(&sm->sm_root, newseg, ss, AVL_AFTER);
+		if (sm->sm_pp_root)
+			avl_add(sm->sm_pp_root, newseg);
+	} else if (left_over) {
+		ss->ss_end = start;
+	} else if (right_over) {
+		ss->ss_start = end;
+	} else {
+		avl_remove(&sm->sm_root, ss);
+		kmem_free(ss, sizeof (*ss));
+		ss = NULL;
+	}
+
+	if (sm->sm_pp_root && ss != NULL)
+		avl_add(sm->sm_pp_root, ss);
+
+	sm->sm_space -= size;
+}
+
+boolean_t
+space_map_contains(space_map_t *sm, uint64_t start, uint64_t size)
+{
+	avl_index_t where;
+	space_seg_t ssearch, *ss;
+	uint64_t end = start + size;
+
+	ASSERT(MUTEX_HELD(sm->sm_lock));
+	VERIFY(size != 0);
+	VERIFY(P2PHASE(start, 1ULL << sm->sm_shift) == 0);
+	VERIFY(P2PHASE(size, 1ULL << sm->sm_shift) == 0);
+
+	ssearch.ss_start = start;
+	ssearch.ss_end = end;
+	ss = avl_find(&sm->sm_root, &ssearch, &where);
+
+	return (ss != NULL && ss->ss_start <= start && ss->ss_end >= end);
+}
+
+void
+space_map_vacate(space_map_t *sm, space_map_func_t *func, space_map_t *mdest)
+{
+	space_seg_t *ss;
+	void *cookie = NULL;
+
+	ASSERT(MUTEX_HELD(sm->sm_lock));
+
+	while ((ss = avl_destroy_nodes(&sm->sm_root, &cookie)) != NULL) {
+		if (func != NULL)
+			func(mdest, ss->ss_start, ss->ss_end - ss->ss_start);
+		kmem_free(ss, sizeof (*ss));
+	}
+	sm->sm_space = 0;
+}
+
+void
+space_map_walk(space_map_t *sm, space_map_func_t *func, space_map_t *mdest)
+{
+	space_seg_t *ss;
+
+	ASSERT(MUTEX_HELD(sm->sm_lock));
+
+	for (ss = avl_first(&sm->sm_root); ss; ss = AVL_NEXT(&sm->sm_root, ss))
+		func(mdest, ss->ss_start, ss->ss_end - ss->ss_start);
+}
+
+/*
+ * Wait for any in-progress space_map_load() to complete.
+ */
+void
+space_map_load_wait(space_map_t *sm)
+{
+	ASSERT(MUTEX_HELD(sm->sm_lock));
+
+	while (sm->sm_loading) {
+		ASSERT(!sm->sm_loaded);
+		cv_wait(&sm->sm_load_cv, sm->sm_lock);
+	}
+}
+
+/*
+ * Note: space_map_load() will drop sm_lock across dmu_read() calls.
+ * The caller must be OK with this.
+ */
+int
+space_map_load(space_map_t *sm, space_map_ops_t *ops, uint8_t maptype,
+	space_map_obj_t *smo, objset_t *os)
+{
+	uint64_t *entry, *entry_map, *entry_map_end;
+	uint64_t bufsize, size, offset, end, space;
+	uint64_t mapstart = sm->sm_start;
+	int error = 0;
+
+	ASSERT(MUTEX_HELD(sm->sm_lock));
+	ASSERT(!sm->sm_loaded);
+	ASSERT(!sm->sm_loading);
+
+	sm->sm_loading = B_TRUE;
+	end = smo->smo_objsize;
+	space = smo->smo_alloc;
+
+	ASSERT(sm->sm_ops == NULL);
+	VERIFY3U(sm->sm_space, ==, 0);
+
+	if (maptype == SM_FREE) {
+		space_map_add(sm, sm->sm_start, sm->sm_size);
+		space = sm->sm_size - space;
+	}
+
+	bufsize = 1ULL << SPACE_MAP_BLOCKSHIFT;
+	entry_map = zio_buf_alloc(bufsize);
+
+	mutex_exit(sm->sm_lock);
+	if (end > bufsize)
+		dmu_prefetch(os, smo->smo_object, bufsize, end - bufsize);
+	mutex_enter(sm->sm_lock);
+
+	for (offset = 0; offset < end; offset += bufsize) {
+		size = MIN(end - offset, bufsize);
+		VERIFY(P2PHASE(size, sizeof (uint64_t)) == 0);
+		VERIFY(size != 0);
+
+		dprintf("object=%llu  offset=%llx  size=%llx\n",
+		    smo->smo_object, offset, size);
+
+		mutex_exit(sm->sm_lock);
+		error = dmu_read(os, smo->smo_object, offset, size, entry_map,
+		    DMU_READ_PREFETCH);
+		mutex_enter(sm->sm_lock);
+		if (error != 0)
+			break;
+
+		entry_map_end = entry_map + (size / sizeof (uint64_t));
+		for (entry = entry_map; entry < entry_map_end; entry++) {
+			uint64_t e = *entry;
+
+			if (SM_DEBUG_DECODE(e))		/* Skip debug entries */
+				continue;
+
+			(SM_TYPE_DECODE(e) == maptype ?
+			    space_map_add : space_map_remove)(sm,
+			    (SM_OFFSET_DECODE(e) << sm->sm_shift) + mapstart,
+			    SM_RUN_DECODE(e) << sm->sm_shift);
+		}
+	}
+
+	if (error == 0) {
+		VERIFY3U(sm->sm_space, ==, space);
+
+		sm->sm_loaded = B_TRUE;
+		sm->sm_ops = ops;
+		if (ops != NULL)
+			ops->smop_load(sm);
+	} else {
+		space_map_vacate(sm, NULL, NULL);
+	}
+
+	zio_buf_free(entry_map, bufsize);
+
+	sm->sm_loading = B_FALSE;
+
+	cv_broadcast(&sm->sm_load_cv);
+
+	return (error);
+}
+
+void
+space_map_unload(space_map_t *sm)
+{
+	ASSERT(MUTEX_HELD(sm->sm_lock));
+
+	if (sm->sm_loaded && sm->sm_ops != NULL)
+		sm->sm_ops->smop_unload(sm);
+
+	sm->sm_loaded = B_FALSE;
+	sm->sm_ops = NULL;
+
+	space_map_vacate(sm, NULL, NULL);
+}
+
+uint64_t
+space_map_maxsize(space_map_t *sm)
+{
+	ASSERT(sm->sm_ops != NULL);
+	return (sm->sm_ops->smop_max(sm));
+}
+
+uint64_t
+space_map_alloc(space_map_t *sm, uint64_t size)
+{
+	uint64_t start;
+
+	start = sm->sm_ops->smop_alloc(sm, size);
+	if (start != -1ULL)
+		space_map_remove(sm, start, size);
+	return (start);
+}
+
+void
+space_map_claim(space_map_t *sm, uint64_t start, uint64_t size)
+{
+	sm->sm_ops->smop_claim(sm, start, size);
+	space_map_remove(sm, start, size);
+}
+
+void
+space_map_free(space_map_t *sm, uint64_t start, uint64_t size)
+{
+	space_map_add(sm, start, size);
+	sm->sm_ops->smop_free(sm, start, size);
+}
+
+/*
+ * Note: space_map_sync() will drop sm_lock across dmu_write() calls.
+ */
+void
+space_map_sync(space_map_t *sm, uint8_t maptype,
+	space_map_obj_t *smo, objset_t *os, dmu_tx_t *tx)
+{
+	spa_t *spa = dmu_objset_spa(os);
+	void *cookie = NULL;
+	space_seg_t *ss;
+	uint64_t bufsize, start, size, run_len;
+	uint64_t *entry, *entry_map, *entry_map_end;
+
+	ASSERT(MUTEX_HELD(sm->sm_lock));
+
+	if (sm->sm_space == 0)
+		return;
+
+	dprintf("object %4llu, txg %llu, pass %d, %c, count %lu, space %llx\n",
+	    smo->smo_object, dmu_tx_get_txg(tx), spa_sync_pass(spa),
+	    maptype == SM_ALLOC ? 'A' : 'F', avl_numnodes(&sm->sm_root),
+	    sm->sm_space);
+
+	if (maptype == SM_ALLOC)
+		smo->smo_alloc += sm->sm_space;
+	else
+		smo->smo_alloc -= sm->sm_space;
+
+	bufsize = (8 + avl_numnodes(&sm->sm_root)) * sizeof (uint64_t);
+	bufsize = MIN(bufsize, 1ULL << SPACE_MAP_BLOCKSHIFT);
+	entry_map = zio_buf_alloc(bufsize);
+	entry_map_end = entry_map + (bufsize / sizeof (uint64_t));
+	entry = entry_map;
+
+	*entry++ = SM_DEBUG_ENCODE(1) |
+	    SM_DEBUG_ACTION_ENCODE(maptype) |
+	    SM_DEBUG_SYNCPASS_ENCODE(spa_sync_pass(spa)) |
+	    SM_DEBUG_TXG_ENCODE(dmu_tx_get_txg(tx));
+
+	while ((ss = avl_destroy_nodes(&sm->sm_root, &cookie)) != NULL) {
+		size = ss->ss_end - ss->ss_start;
+		start = (ss->ss_start - sm->sm_start) >> sm->sm_shift;
+
+		sm->sm_space -= size;
+		size >>= sm->sm_shift;
+
+		while (size) {
+			run_len = MIN(size, SM_RUN_MAX);
+
+			if (entry == entry_map_end) {
+				mutex_exit(sm->sm_lock);
+				dmu_write(os, smo->smo_object, smo->smo_objsize,
+				    bufsize, entry_map, tx);
+				mutex_enter(sm->sm_lock);
+				smo->smo_objsize += bufsize;
+				entry = entry_map;
+			}
+
+			*entry++ = SM_OFFSET_ENCODE(start) |
+			    SM_TYPE_ENCODE(maptype) |
+			    SM_RUN_ENCODE(run_len);
+
+			start += run_len;
+			size -= run_len;
+		}
+		kmem_free(ss, sizeof (*ss));
+	}
+
+	if (entry != entry_map) {
+		size = (entry - entry_map) * sizeof (uint64_t);
+		mutex_exit(sm->sm_lock);
+		dmu_write(os, smo->smo_object, smo->smo_objsize,
+		    size, entry_map, tx);
+		mutex_enter(sm->sm_lock);
+		smo->smo_objsize += size;
+	}
+
+	zio_buf_free(entry_map, bufsize);
+
+	VERIFY3U(sm->sm_space, ==, 0);
+}
+
+void
+space_map_truncate(space_map_obj_t *smo, objset_t *os, dmu_tx_t *tx)
+{
+	VERIFY(dmu_free_range(os, smo->smo_object, 0, -1ULL, tx) == 0);
+
+	smo->smo_objsize = 0;
+	smo->smo_alloc = 0;
+}
+
+/*
+ * Space map reference trees.
+ *
+ * A space map is a collection of integers.  Every integer is either
+ * in the map, or it's not.  A space map reference tree generalizes
+ * the idea: it allows its members to have arbitrary reference counts,
+ * as opposed to the implicit reference count of 0 or 1 in a space map.
+ * This representation comes in handy when computing the union or
+ * intersection of multiple space maps.  For example, the union of
+ * N space maps is the subset of the reference tree with refcnt >= 1.
+ * The intersection of N space maps is the subset with refcnt >= N.
+ *
+ * [It's very much like a Fourier transform.  Unions and intersections
+ * are hard to perform in the 'space map domain', so we convert the maps
+ * into the 'reference count domain', where it's trivial, then invert.]
+ *
+ * vdev_dtl_reassess() uses computations of this form to determine
+ * DTL_MISSING and DTL_OUTAGE for interior vdevs -- e.g. a RAID-Z vdev
+ * has an outage wherever refcnt >= vdev_nparity + 1, and a mirror vdev
+ * has an outage wherever refcnt >= vdev_children.
+ */
+static int
+space_map_ref_compare(const void *x1, const void *x2)
+{
+	const space_ref_t *sr1 = x1;
+	const space_ref_t *sr2 = x2;
+
+	if (sr1->sr_offset < sr2->sr_offset)
+		return (-1);
+	if (sr1->sr_offset > sr2->sr_offset)
+		return (1);
+
+	if (sr1 < sr2)
+		return (-1);
+	if (sr1 > sr2)
+		return (1);
+
+	return (0);
+}
+
+void
+space_map_ref_create(avl_tree_t *t)
+{
+	avl_create(t, space_map_ref_compare,
+	    sizeof (space_ref_t), offsetof(space_ref_t, sr_node));
+}
+
+void
+space_map_ref_destroy(avl_tree_t *t)
+{
+	space_ref_t *sr;
+	void *cookie = NULL;
+
+	while ((sr = avl_destroy_nodes(t, &cookie)) != NULL)
+		kmem_free(sr, sizeof (*sr));
+
+	avl_destroy(t);
+}
+
+static void
+space_map_ref_add_node(avl_tree_t *t, uint64_t offset, int64_t refcnt)
+{
+	space_ref_t *sr;
+
+	sr = kmem_alloc(sizeof (*sr), KM_SLEEP);
+	sr->sr_offset = offset;
+	sr->sr_refcnt = refcnt;
+
+	avl_add(t, sr);
+}
+
+void
+space_map_ref_add_seg(avl_tree_t *t, uint64_t start, uint64_t end,
+	int64_t refcnt)
+{
+	space_map_ref_add_node(t, start, refcnt);
+	space_map_ref_add_node(t, end, -refcnt);
+}
+
+/*
+ * Convert (or add) a space map into a reference tree.
+ */
+void
+space_map_ref_add_map(avl_tree_t *t, space_map_t *sm, int64_t refcnt)
+{
+	space_seg_t *ss;
+
+	ASSERT(MUTEX_HELD(sm->sm_lock));
+
+	for (ss = avl_first(&sm->sm_root); ss; ss = AVL_NEXT(&sm->sm_root, ss))
+		space_map_ref_add_seg(t, ss->ss_start, ss->ss_end, refcnt);
+}
+
+/*
+ * Convert a reference tree into a space map.  The space map will contain
+ * all members of the reference tree for which refcnt >= minref.
+ */
+void
+space_map_ref_generate_map(avl_tree_t *t, space_map_t *sm, int64_t minref)
+{
+	uint64_t start = -1ULL;
+	int64_t refcnt = 0;
+	space_ref_t *sr;
+
+	ASSERT(MUTEX_HELD(sm->sm_lock));
+
+	space_map_vacate(sm, NULL, NULL);
+
+	for (sr = avl_first(t); sr != NULL; sr = AVL_NEXT(t, sr)) {
+		refcnt += sr->sr_refcnt;
+		if (refcnt >= minref) {
+			if (start == -1ULL) {
+				start = sr->sr_offset;
+			}
+		} else {
+			if (start != -1ULL) {
+				uint64_t end = sr->sr_offset;
+				ASSERT(start <= end);
+				if (end > start)
+					space_map_add(sm, start, end - start);
+				start = -1ULL;
+			}
+		}
+	}
+	ASSERT(refcnt == 0);
+	ASSERT(start == -1ULL);
+}
--- a/uts/common/fs/zfs/sys/arc.h
+++ b/uts/common/fs/zfs/sys/arc.h
@ -0,0 +1,142 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#ifndef	_SYS_ARC_H
+#define	_SYS_ARC_H
+
+#include <sys/zfs_context.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+#include <sys/zio.h>
+#include <sys/dmu.h>
+#include <sys/spa.h>
+
+typedef struct arc_buf_hdr arc_buf_hdr_t;
+typedef struct arc_buf arc_buf_t;
+typedef void arc_done_func_t(zio_t *zio, arc_buf_t *buf, void *private);
+typedef int arc_evict_func_t(void *private);
+
+/* generic arc_done_func_t's which you can use */
+arc_done_func_t arc_bcopy_func;
+arc_done_func_t arc_getbuf_func;
+
+struct arc_buf {
+	arc_buf_hdr_t		*b_hdr;
+	arc_buf_t		*b_next;
+	kmutex_t		b_evict_lock;
+	krwlock_t		b_data_lock;
+	void			*b_data;
+	arc_evict_func_t	*b_efunc;
+	void			*b_private;
+};
+
+typedef enum arc_buf_contents {
+	ARC_BUFC_DATA,				/* buffer contains data */
+	ARC_BUFC_METADATA,			/* buffer contains metadata */
+	ARC_BUFC_NUMTYPES
+} arc_buf_contents_t;
+/*
+ * These are the flags we pass into calls to the arc
+ */
+#define	ARC_WAIT	(1 << 1)	/* perform I/O synchronously */
+#define	ARC_NOWAIT	(1 << 2)	/* perform I/O asynchronously */
+#define	ARC_PREFETCH	(1 << 3)	/* I/O is a prefetch */
+#define	ARC_CACHED	(1 << 4)	/* I/O was already in cache */
+#define	ARC_L2CACHE	(1 << 5)	/* cache in L2ARC */
+
+/*
+ * The following breakdowns of arc_size exist for kstat only.
+ */
+typedef enum arc_space_type {
+	ARC_SPACE_DATA,
+	ARC_SPACE_HDRS,
+	ARC_SPACE_L2HDRS,
+	ARC_SPACE_OTHER,
+	ARC_SPACE_NUMTYPES
+} arc_space_type_t;
+
+void arc_space_consume(uint64_t space, arc_space_type_t type);
+void arc_space_return(uint64_t space, arc_space_type_t type);
+void *arc_data_buf_alloc(uint64_t space);
+void arc_data_buf_free(void *buf, uint64_t space);
+arc_buf_t *arc_buf_alloc(spa_t *spa, int size, void *tag,
+    arc_buf_contents_t type);
+arc_buf_t *arc_loan_buf(spa_t *spa, int size);
+void arc_return_buf(arc_buf_t *buf, void *tag);
+void arc_loan_inuse_buf(arc_buf_t *buf, void *tag);
+void arc_buf_add_ref(arc_buf_t *buf, void *tag);
+int arc_buf_remove_ref(arc_buf_t *buf, void *tag);
+int arc_buf_size(arc_buf_t *buf);
+void arc_release(arc_buf_t *buf, void *tag);
+int arc_release_bp(arc_buf_t *buf, void *tag, blkptr_t *bp, spa_t *spa,
+    zbookmark_t *zb);
+int arc_released(arc_buf_t *buf);
+int arc_has_callback(arc_buf_t *buf);
+void arc_buf_freeze(arc_buf_t *buf);
+void arc_buf_thaw(arc_buf_t *buf);
+#ifdef ZFS_DEBUG
+int arc_referenced(arc_buf_t *buf);
+#endif
+
+int arc_read(zio_t *pio, spa_t *spa, const blkptr_t *bp, arc_buf_t *pbuf,
+    arc_done_func_t *done, void *private, int priority, int zio_flags,
+    uint32_t *arc_flags, const zbookmark_t *zb);
+int arc_read_nolock(zio_t *pio, spa_t *spa, const blkptr_t *bp,
+    arc_done_func_t *done, void *private, int priority, int flags,
+    uint32_t *arc_flags, const zbookmark_t *zb);
+zio_t *arc_write(zio_t *pio, spa_t *spa, uint64_t txg,
+    blkptr_t *bp, arc_buf_t *buf, boolean_t l2arc, const zio_prop_t *zp,
+    arc_done_func_t *ready, arc_done_func_t *done, void *private,
+    int priority, int zio_flags, const zbookmark_t *zb);
+
+void arc_set_callback(arc_buf_t *buf, arc_evict_func_t *func, void *private);
+int arc_buf_evict(arc_buf_t *buf);
+
+void arc_flush(spa_t *spa);
+void arc_tempreserve_clear(uint64_t reserve);
+int arc_tempreserve_space(uint64_t reserve, uint64_t txg);
+
+void arc_init(void);
+void arc_fini(void);
+
+/*
+ * Level 2 ARC
+ */
+
+void l2arc_add_vdev(spa_t *spa, vdev_t *vd);
+void l2arc_remove_vdev(vdev_t *vd);
+boolean_t l2arc_vdev_present(vdev_t *vd);
+void l2arc_init(void);
+void l2arc_fini(void);
+void l2arc_start(void);
+void l2arc_stop(void);
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif /* _SYS_ARC_H */
--- a/uts/common/fs/zfs/sys/bplist.h
+++ b/uts/common/fs/zfs/sys/bplist.h
@ -0,0 +1,57 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#ifndef	_SYS_BPLIST_H
+#define	_SYS_BPLIST_H
+
+#include <sys/zfs_context.h>
+#include <sys/spa.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+typedef struct bplist_entry {
+	blkptr_t	bpe_blk;
+	list_node_t	bpe_node;
+} bplist_entry_t;
+
+typedef struct bplist {
+	kmutex_t	bpl_lock;
+	list_t		bpl_list;
+} bplist_t;
+
+typedef int bplist_itor_t(void *arg, const blkptr_t *bp, dmu_tx_t *tx);
+
+void bplist_create(bplist_t *bpl);
+void bplist_destroy(bplist_t *bpl);
+void bplist_append(bplist_t *bpl, const blkptr_t *bp);
+void bplist_iterate(bplist_t *bpl, bplist_itor_t *func,
+    void *arg, dmu_tx_t *tx);
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif /* _SYS_BPLIST_H */
--- a/uts/common/fs/zfs/sys/bpobj.h
+++ b/uts/common/fs/zfs/sys/bpobj.h
@ -0,0 +1,91 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#ifndef	_SYS_BPOBJ_H
+#define	_SYS_BPOBJ_H
+
+#include <sys/dmu.h>
+#include <sys/spa.h>
+#include <sys/txg.h>
+#include <sys/zio.h>
+#include <sys/zfs_context.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+typedef struct bpobj_phys {
+	/*
+	 * This is the bonus buffer for the dead lists.  The object's
+	 * contents are an array of bpo_num_blkptrs blkptr_t's, representing
+	 * a total of bpo_bytes physical space.
+	 */
+	uint64_t	bpo_num_blkptrs;
+	uint64_t	bpo_bytes;
+	uint64_t	bpo_comp;
+	uint64_t	bpo_uncomp;
+	uint64_t	bpo_subobjs;
+	uint64_t	bpo_num_subobjs;
+} bpobj_phys_t;
+
+#define	BPOBJ_SIZE_V0	(2 * sizeof (uint64_t))
+#define	BPOBJ_SIZE_V1	(4 * sizeof (uint64_t))
+
+typedef struct bpobj {
+	kmutex_t	bpo_lock;
+	objset_t	*bpo_os;
+	uint64_t	bpo_object;
+	int		bpo_epb;
+	uint8_t		bpo_havecomp;
+	uint8_t		bpo_havesubobj;
+	bpobj_phys_t	*bpo_phys;
+	dmu_buf_t	*bpo_dbuf;
+	dmu_buf_t	*bpo_cached_dbuf;
+} bpobj_t;
+
+typedef int bpobj_itor_t(void *arg, const blkptr_t *bp, dmu_tx_t *tx);
+
+uint64_t bpobj_alloc(objset_t *mos, int blocksize, dmu_tx_t *tx);
+void bpobj_free(objset_t *os, uint64_t obj, dmu_tx_t *tx);
+
+int bpobj_open(bpobj_t *bpo, objset_t *mos, uint64_t object);
+void bpobj_close(bpobj_t *bpo);
+
+int bpobj_iterate(bpobj_t *bpo, bpobj_itor_t func, void *arg, dmu_tx_t *tx);
+int bpobj_iterate_nofree(bpobj_t *bpo, bpobj_itor_t func, void *, dmu_tx_t *);
+int bpobj_iterate_dbg(bpobj_t *bpo, uint64_t *itorp, blkptr_t *bp);
+
+void bpobj_enqueue_subobj(bpobj_t *bpo, uint64_t subobj, dmu_tx_t *tx);
+void bpobj_enqueue(bpobj_t *bpo, const blkptr_t *bp, dmu_tx_t *tx);
+
+int bpobj_space(bpobj_t *bpo,
+    uint64_t *usedp, uint64_t *compp, uint64_t *uncompp);
+int bpobj_space_range(bpobj_t *bpo, uint64_t mintxg, uint64_t maxtxg,
+    uint64_t *usedp, uint64_t *compp, uint64_t *uncompp);
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif /* _SYS_BPOBJ_H */
--- a/uts/common/fs/zfs/sys/dbuf.h
+++ b/uts/common/fs/zfs/sys/dbuf.h
@ -0,0 +1,375 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#ifndef	_SYS_DBUF_H
+#define	_SYS_DBUF_H
+
+#include <sys/dmu.h>
+#include <sys/spa.h>
+#include <sys/txg.h>
+#include <sys/zio.h>
+#include <sys/arc.h>
+#include <sys/zfs_context.h>
+#include <sys/refcount.h>
+#include <sys/zrlock.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+#define	IN_DMU_SYNC 2
+
+/*
+ * define flags for dbuf_read
+ */
+
+#define	DB_RF_MUST_SUCCEED	(1 << 0)
+#define	DB_RF_CANFAIL		(1 << 1)
+#define	DB_RF_HAVESTRUCT	(1 << 2)
+#define	DB_RF_NOPREFETCH	(1 << 3)
+#define	DB_RF_NEVERWAIT		(1 << 4)
+#define	DB_RF_CACHED		(1 << 5)
+
+/*
+ * The simplified state transition diagram for dbufs looks like:
+ *
+ *		+----> READ ----+
+ *		|		|
+ *		|		V
+ *  (alloc)-->UNCACHED	     CACHED-->EVICTING-->(free)
+ *		|		^	 ^
+ *		|		|	 |
+ *		+----> FILL ----+	 |
+ *		|			 |
+ *		|			 |
+ *		+--------> NOFILL -------+
+ */
+typedef enum dbuf_states {
+	DB_UNCACHED,
+	DB_FILL,
+	DB_NOFILL,
+	DB_READ,
+	DB_CACHED,
+	DB_EVICTING
+} dbuf_states_t;
+
+struct dnode;
+struct dmu_tx;
+
+/*
+ * level = 0 means the user data
+ * level = 1 means the single indirect block
+ * etc.
+ */
+
+struct dmu_buf_impl;
+
+typedef enum override_states {
+	DR_NOT_OVERRIDDEN,
+	DR_IN_DMU_SYNC,
+	DR_OVERRIDDEN
+} override_states_t;
+
+typedef struct dbuf_dirty_record {
+	/* link on our parent's dirty list */
+	list_node_t dr_dirty_node;
+
+	/* transaction group this data will sync in */
+	uint64_t dr_txg;
+
+	/* zio of outstanding write IO */
+	zio_t *dr_zio;
+
+	/* pointer back to our dbuf */
+	struct dmu_buf_impl *dr_dbuf;
+
+	/* pointer to next dirty record */
+	struct dbuf_dirty_record *dr_next;
+
+	/* pointer to parent dirty record */
+	struct dbuf_dirty_record *dr_parent;
+
+	union dirty_types {
+		struct dirty_indirect {
+
+			/* protect access to list */
+			kmutex_t dr_mtx;
+
+			/* Our list of dirty children */
+			list_t dr_children;
+		} di;
+		struct dirty_leaf {
+
+			/*
+			 * dr_data is set when we dirty the buffer
+			 * so that we can retain the pointer even if it
+			 * gets COW'd in a subsequent transaction group.
+			 */
+			arc_buf_t *dr_data;
+			blkptr_t dr_overridden_by;
+			override_states_t dr_override_state;
+			uint8_t dr_copies;
+		} dl;
+	} dt;
+} dbuf_dirty_record_t;
+
+typedef struct dmu_buf_impl {
+	/*
+	 * The following members are immutable, with the exception of
+	 * db.db_data, which is protected by db_mtx.
+	 */
+
+	/* the publicly visible structure */
+	dmu_buf_t db;
+
+	/* the objset we belong to */
+	struct objset *db_objset;
+
+	/*
+	 * handle to safely access the dnode we belong to (NULL when evicted)
+	 */
+	struct dnode_handle *db_dnode_handle;
+
+	/*
+	 * our parent buffer; if the dnode points to us directly,
+	 * db_parent == db_dnode_handle->dnh_dnode->dn_dbuf
+	 * only accessed by sync thread ???
+	 * (NULL when evicted)
+	 * May change from NULL to non-NULL under the protection of db_mtx
+	 * (see dbuf_check_blkptr())
+	 */
+	struct dmu_buf_impl *db_parent;
+
+	/*
+	 * link for hash table of all dmu_buf_impl_t's
+	 */
+	struct dmu_buf_impl *db_hash_next;
+
+	/* our block number */
+	uint64_t db_blkid;
+
+	/*
+	 * Pointer to the blkptr_t which points to us. May be NULL if we
+	 * don't have one yet. (NULL when evicted)
+	 */
+	blkptr_t *db_blkptr;
+
+	/*
+	 * Our indirection level.  Data buffers have db_level==0.
+	 * Indirect buffers which point to data buffers have
+	 * db_level==1. etc.  Buffers which contain dnodes have
+	 * db_level==0, since the dnodes are stored in a file.
+	 */
+	uint8_t db_level;
+
+	/* db_mtx protects the members below */
+	kmutex_t db_mtx;
+
+	/*
+	 * Current state of the buffer
+	 */
+	dbuf_states_t db_state;
+
+	/*
+	 * Refcount accessed by dmu_buf_{hold,rele}.
+	 * If nonzero, the buffer can't be destroyed.
+	 * Protected by db_mtx.
+	 */
+	refcount_t db_holds;
+
+	/* buffer holding our data */
+	arc_buf_t *db_buf;
+
+	kcondvar_t db_changed;
+	dbuf_dirty_record_t *db_data_pending;
+
+	/* pointer to most recent dirty record for this buffer */
+	dbuf_dirty_record_t *db_last_dirty;
+
+	/*
+	 * Our link on the owner dnode's dn_dbufs list.
+	 * Protected by its dn_dbufs_mtx.
+	 */
+	list_node_t db_link;
+
+	/* Data which is unique to data (leaf) blocks: */
+
+	/* stuff we store for the user (see dmu_buf_set_user) */
+	void *db_user_ptr;
+	void **db_user_data_ptr_ptr;
+	dmu_buf_evict_func_t *db_evict_func;
+
+	uint8_t db_immediate_evict;
+	uint8_t db_freed_in_flight;
+
+	uint8_t db_dirtycnt;
+} dmu_buf_impl_t;
+
+/* Note: the dbuf hash table is exposed only for the mdb module */
+#define	DBUF_MUTEXES 256
+#define	DBUF_HASH_MUTEX(h, idx) (&(h)->hash_mutexes[(idx) & (DBUF_MUTEXES-1)])
+typedef struct dbuf_hash_table {
+	uint64_t hash_table_mask;
+	dmu_buf_impl_t **hash_table;
+	kmutex_t hash_mutexes[DBUF_MUTEXES];
+} dbuf_hash_table_t;
+
+
+uint64_t dbuf_whichblock(struct dnode *di, uint64_t offset);
+
+dmu_buf_impl_t *dbuf_create_tlib(struct dnode *dn, char *data);
+void dbuf_create_bonus(struct dnode *dn);
+int dbuf_spill_set_blksz(dmu_buf_t *db, uint64_t blksz, dmu_tx_t *tx);
+void dbuf_spill_hold(struct dnode *dn, dmu_buf_impl_t **dbp, void *tag);
+
+void dbuf_rm_spill(struct dnode *dn, dmu_tx_t *tx);
+
+dmu_buf_impl_t *dbuf_hold(struct dnode *dn, uint64_t blkid, void *tag);
+dmu_buf_impl_t *dbuf_hold_level(struct dnode *dn, int level, uint64_t blkid,
+    void *tag);
+int dbuf_hold_impl(struct dnode *dn, uint8_t level, uint64_t blkid, int create,
+    void *tag, dmu_buf_impl_t **dbp);
+
+void dbuf_prefetch(struct dnode *dn, uint64_t blkid);
+
+void dbuf_add_ref(dmu_buf_impl_t *db, void *tag);
+uint64_t dbuf_refcount(dmu_buf_impl_t *db);
+
+void dbuf_rele(dmu_buf_impl_t *db, void *tag);
+void dbuf_rele_and_unlock(dmu_buf_impl_t *db, void *tag);
+
+dmu_buf_impl_t *dbuf_find(struct dnode *dn, uint8_t level, uint64_t blkid);
+
+int dbuf_read(dmu_buf_impl_t *db, zio_t *zio, uint32_t flags);
+void dbuf_will_dirty(dmu_buf_impl_t *db, dmu_tx_t *tx);
+void dbuf_fill_done(dmu_buf_impl_t *db, dmu_tx_t *tx);
+void dmu_buf_will_not_fill(dmu_buf_t *db, dmu_tx_t *tx);
+void dmu_buf_will_fill(dmu_buf_t *db, dmu_tx_t *tx);
+void dmu_buf_fill_done(dmu_buf_t *db, dmu_tx_t *tx);
+void dbuf_assign_arcbuf(dmu_buf_impl_t *db, arc_buf_t *buf, dmu_tx_t *tx);
+dbuf_dirty_record_t *dbuf_dirty(dmu_buf_impl_t *db, dmu_tx_t *tx);
+arc_buf_t *dbuf_loan_arcbuf(dmu_buf_impl_t *db);
+
+void dbuf_clear(dmu_buf_impl_t *db);
+void dbuf_evict(dmu_buf_impl_t *db);
+
+void dbuf_setdirty(dmu_buf_impl_t *db, dmu_tx_t *tx);
+void dbuf_unoverride(dbuf_dirty_record_t *dr);
+void dbuf_sync_list(list_t *list, dmu_tx_t *tx);
+void dbuf_release_bp(dmu_buf_impl_t *db);
+
+void dbuf_free_range(struct dnode *dn, uint64_t start, uint64_t end,
+    struct dmu_tx *);
+
+void dbuf_new_size(dmu_buf_impl_t *db, int size, dmu_tx_t *tx);
+
+#define	DB_DNODE(_db)		((_db)->db_dnode_handle->dnh_dnode)
+#define	DB_DNODE_LOCK(_db)	((_db)->db_dnode_handle->dnh_zrlock)
+#define	DB_DNODE_ENTER(_db)	(zrl_add(&DB_DNODE_LOCK(_db)))
+#define	DB_DNODE_EXIT(_db)	(zrl_remove(&DB_DNODE_LOCK(_db)))
+#define	DB_DNODE_HELD(_db)	(!zrl_is_zero(&DB_DNODE_LOCK(_db)))
+#define	DB_GET_SPA(_spa_p, _db) {		\
+	dnode_t *__dn;				\
+	DB_DNODE_ENTER(_db);			\
+	__dn = DB_DNODE(_db);			\
+	*(_spa_p) = __dn->dn_objset->os_spa;	\
+	DB_DNODE_EXIT(_db);			\
+}
+#define	DB_GET_OBJSET(_os_p, _db) {		\
+	dnode_t *__dn;				\
+	DB_DNODE_ENTER(_db);			\
+	__dn = DB_DNODE(_db);			\
+	*(_os_p) = __dn->dn_objset;		\
+	DB_DNODE_EXIT(_db);			\
+}
+
+void dbuf_init(void);
+void dbuf_fini(void);
+
+boolean_t dbuf_is_metadata(dmu_buf_impl_t *db);
+
+#define	DBUF_IS_METADATA(_db)	\
+	(dbuf_is_metadata(_db))
+
+#define	DBUF_GET_BUFC_TYPE(_db)	\
+	(DBUF_IS_METADATA(_db) ? ARC_BUFC_METADATA : ARC_BUFC_DATA)
+
+#define	DBUF_IS_CACHEABLE(_db)						\
+	((_db)->db_objset->os_primary_cache == ZFS_CACHE_ALL ||		\
+	(DBUF_IS_METADATA(_db) &&					\
+	((_db)->db_objset->os_primary_cache == ZFS_CACHE_METADATA)))
+
+#define	DBUF_IS_L2CACHEABLE(_db)					\
+	((_db)->db_objset->os_secondary_cache == ZFS_CACHE_ALL ||	\
+	(DBUF_IS_METADATA(_db) &&					\
+	((_db)->db_objset->os_secondary_cache == ZFS_CACHE_METADATA)))
+
+#ifdef ZFS_DEBUG
+
+/*
+ * There should be a ## between the string literal and fmt, to make it
+ * clear that we're joining two strings together, but gcc does not
+ * support that preprocessor token.
+ */
+#define	dprintf_dbuf(dbuf, fmt, ...) do { \
+	if (zfs_flags & ZFS_DEBUG_DPRINTF) { \
+	char __db_buf[32]; \
+	uint64_t __db_obj = (dbuf)->db.db_object; \
+	if (__db_obj == DMU_META_DNODE_OBJECT) \
+		(void) strcpy(__db_buf, "mdn"); \
+	else \
+		(void) snprintf(__db_buf, sizeof (__db_buf), "%lld", \
+		    (u_longlong_t)__db_obj); \
+	dprintf_ds((dbuf)->db_objset->os_dsl_dataset, \
+	    "obj=%s lvl=%u blkid=%lld " fmt, \
+	    __db_buf, (dbuf)->db_level, \
+	    (u_longlong_t)(dbuf)->db_blkid, __VA_ARGS__); \
+	} \
+_NOTE(CONSTCOND) } while (0)
+
+#define	dprintf_dbuf_bp(db, bp, fmt, ...) do {			\
+	if (zfs_flags & ZFS_DEBUG_DPRINTF) {			\
+	char *__blkbuf = kmem_alloc(BP_SPRINTF_LEN, KM_SLEEP);	\
+	sprintf_blkptr(__blkbuf, bp);				\
+	dprintf_dbuf(db, fmt " %s\n", __VA_ARGS__, __blkbuf);	\
+	kmem_free(__blkbuf, BP_SPRINTF_LEN);			\
+	}							\
+_NOTE(CONSTCOND) } while (0)
+
+#define	DBUF_VERIFY(db)	dbuf_verify(db)
+
+#else
+
+#define	dprintf_dbuf(db, fmt, ...)
+#define	dprintf_dbuf_bp(db, bp, fmt, ...)
+#define	DBUF_VERIFY(db)
+
+#endif
+
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif /* _SYS_DBUF_H */
--- a/uts/common/fs/zfs/sys/ddt.h
+++ b/uts/common/fs/zfs/sys/ddt.h
@ -0,0 +1,246 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2009, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#ifndef _SYS_DDT_H
+#define	_SYS_DDT_H
+
+#include <sys/sysmacros.h>
+#include <sys/types.h>
+#include <sys/fs/zfs.h>
+#include <sys/zio.h>
+#include <sys/dmu.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+/*
+ * On-disk DDT formats, in the desired search order (newest version first).
+ */
+enum ddt_type {
+	DDT_TYPE_ZAP = 0,
+	DDT_TYPES
+};
+
+/*
+ * DDT classes, in the desired search order (highest replication level first).
+ */
+enum ddt_class {
+	DDT_CLASS_DITTO = 0,
+	DDT_CLASS_DUPLICATE,
+	DDT_CLASS_UNIQUE,
+	DDT_CLASSES
+};
+
+#define	DDT_TYPE_CURRENT		0
+
+#define	DDT_COMPRESS_BYTEORDER_MASK	0x80
+#define	DDT_COMPRESS_FUNCTION_MASK	0x7f
+
+/*
+ * On-disk ddt entry:  key (name) and physical storage (value).
+ */
+typedef struct ddt_key {
+	zio_cksum_t	ddk_cksum;	/* 256-bit block checksum */
+	uint64_t	ddk_prop;	/* LSIZE, PSIZE, compression */
+} ddt_key_t;
+
+/*
+ * ddk_prop layout:
+ *
+ *	+-------+-------+-------+-------+-------+-------+-------+-------+
+ *	|   0	|   0	|   0	| comp	|     PSIZE	|     LSIZE	|
+ *	+-------+-------+-------+-------+-------+-------+-------+-------+
+ */
+#define	DDK_GET_LSIZE(ddk)	\
+	BF64_GET_SB((ddk)->ddk_prop, 0, 16, SPA_MINBLOCKSHIFT, 1)
+#define	DDK_SET_LSIZE(ddk, x)	\
+	BF64_SET_SB((ddk)->ddk_prop, 0, 16, SPA_MINBLOCKSHIFT, 1, x)
+
+#define	DDK_GET_PSIZE(ddk)	\
+	BF64_GET_SB((ddk)->ddk_prop, 16, 16, SPA_MINBLOCKSHIFT, 1)
+#define	DDK_SET_PSIZE(ddk, x)	\
+	BF64_SET_SB((ddk)->ddk_prop, 16, 16, SPA_MINBLOCKSHIFT, 1, x)
+
+#define	DDK_GET_COMPRESS(ddk)		BF64_GET((ddk)->ddk_prop, 32, 8)
+#define	DDK_SET_COMPRESS(ddk, x)	BF64_SET((ddk)->ddk_prop, 32, 8, x)
+
+#define	DDT_KEY_WORDS	(sizeof (ddt_key_t) / sizeof (uint64_t))
+
+typedef struct ddt_phys {
+	dva_t		ddp_dva[SPA_DVAS_PER_BP];
+	uint64_t	ddp_refcnt;
+	uint64_t	ddp_phys_birth;
+} ddt_phys_t;
+
+enum ddt_phys_type {
+	DDT_PHYS_DITTO = 0,
+	DDT_PHYS_SINGLE = 1,
+	DDT_PHYS_DOUBLE = 2,
+	DDT_PHYS_TRIPLE = 3,
+	DDT_PHYS_TYPES
+};
+
+/*
+ * In-core ddt entry
+ */
+struct ddt_entry {
+	ddt_key_t	dde_key;
+	ddt_phys_t	dde_phys[DDT_PHYS_TYPES];
+	zio_t		*dde_lead_zio[DDT_PHYS_TYPES];
+	void		*dde_repair_data;
+	enum ddt_type	dde_type;
+	enum ddt_class	dde_class;
+	uint8_t		dde_loading;
+	uint8_t		dde_loaded;
+	kcondvar_t	dde_cv;
+	avl_node_t	dde_node;
+};
+
+/*
+ * In-core ddt
+ */
+struct ddt {
+	kmutex_t	ddt_lock;
+	avl_tree_t	ddt_tree;
+	avl_tree_t	ddt_repair_tree;
+	enum zio_checksum ddt_checksum;
+	spa_t		*ddt_spa;
+	objset_t	*ddt_os;
+	uint64_t	ddt_stat_object;
+	uint64_t	ddt_object[DDT_TYPES][DDT_CLASSES];
+	ddt_histogram_t	ddt_histogram[DDT_TYPES][DDT_CLASSES];
+	ddt_histogram_t	ddt_histogram_cache[DDT_TYPES][DDT_CLASSES];
+	ddt_object_t	ddt_object_stats[DDT_TYPES][DDT_CLASSES];
+	avl_node_t	ddt_node;
+};
+
+/*
+ * In-core and on-disk bookmark for DDT walks
+ */
+typedef struct ddt_bookmark {
+	uint64_t	ddb_class;
+	uint64_t	ddb_type;
+	uint64_t	ddb_checksum;
+	uint64_t	ddb_cursor;
+} ddt_bookmark_t;
+
+/*
+ * Ops vector to access a specific DDT object type.
+ */
+typedef struct ddt_ops {
+	char ddt_op_name[32];
+	int (*ddt_op_create)(objset_t *os, uint64_t *object, dmu_tx_t *tx,
+	    boolean_t prehash);
+	int (*ddt_op_destroy)(objset_t *os, uint64_t object, dmu_tx_t *tx);
+	int (*ddt_op_lookup)(objset_t *os, uint64_t object, ddt_entry_t *dde);
+	void (*ddt_op_prefetch)(objset_t *os, uint64_t object,
+	    ddt_entry_t *dde);
+	int (*ddt_op_update)(objset_t *os, uint64_t object, ddt_entry_t *dde,
+	    dmu_tx_t *tx);
+	int (*ddt_op_remove)(objset_t *os, uint64_t object, ddt_entry_t *dde,
+	    dmu_tx_t *tx);
+	int (*ddt_op_walk)(objset_t *os, uint64_t object, ddt_entry_t *dde,
+	    uint64_t *walk);
+	uint64_t (*ddt_op_count)(objset_t *os, uint64_t object);
+} ddt_ops_t;
+
+#define	DDT_NAMELEN	80
+
+extern void ddt_object_name(ddt_t *ddt, enum ddt_type type,
+    enum ddt_class class, char *name);
+extern int ddt_object_walk(ddt_t *ddt, enum ddt_type type,
+    enum ddt_class class, uint64_t *walk, ddt_entry_t *dde);
+extern uint64_t ddt_object_count(ddt_t *ddt, enum ddt_type type,
+    enum ddt_class class);
+extern int ddt_object_info(ddt_t *ddt, enum ddt_type type,
+    enum ddt_class class, dmu_object_info_t *);
+extern boolean_t ddt_object_exists(ddt_t *ddt, enum ddt_type type,
+    enum ddt_class class);
+
+extern void ddt_bp_fill(const ddt_phys_t *ddp, blkptr_t *bp,
+    uint64_t txg);
+extern void ddt_bp_create(enum zio_checksum checksum, const ddt_key_t *ddk,
+    const ddt_phys_t *ddp, blkptr_t *bp);
+
+extern void ddt_key_fill(ddt_key_t *ddk, const blkptr_t *bp);
+
+extern void ddt_phys_fill(ddt_phys_t *ddp, const blkptr_t *bp);
+extern void ddt_phys_clear(ddt_phys_t *ddp);
+extern void ddt_phys_addref(ddt_phys_t *ddp);
+extern void ddt_phys_decref(ddt_phys_t *ddp);
+extern void ddt_phys_free(ddt_t *ddt, ddt_key_t *ddk, ddt_phys_t *ddp,
+    uint64_t txg);
+extern ddt_phys_t *ddt_phys_select(const ddt_entry_t *dde, const blkptr_t *bp);
+extern uint64_t ddt_phys_total_refcnt(const ddt_entry_t *dde);
+
+extern void ddt_stat_add(ddt_stat_t *dst, const ddt_stat_t *src, uint64_t neg);
+
+extern void ddt_histogram_add(ddt_histogram_t *dst, const ddt_histogram_t *src);
+extern void ddt_histogram_stat(ddt_stat_t *dds, const ddt_histogram_t *ddh);
+extern boolean_t ddt_histogram_empty(const ddt_histogram_t *ddh);
+extern void ddt_get_dedup_object_stats(spa_t *spa, ddt_object_t *ddo);
+extern void ddt_get_dedup_histogram(spa_t *spa, ddt_histogram_t *ddh);
+extern void ddt_get_dedup_stats(spa_t *spa, ddt_stat_t *dds_total);
+
+extern uint64_t ddt_get_dedup_dspace(spa_t *spa);
+extern uint64_t ddt_get_pool_dedup_ratio(spa_t *spa);
+
+extern int ddt_ditto_copies_needed(ddt_t *ddt, ddt_entry_t *dde,
+    ddt_phys_t *ddp_willref);
+extern int ddt_ditto_copies_present(ddt_entry_t *dde);
+
+extern size_t ddt_compress(void *src, uchar_t *dst, size_t s_len, size_t d_len);
+extern void ddt_decompress(uchar_t *src, void *dst, size_t s_len, size_t d_len);
+
+extern ddt_t *ddt_select(spa_t *spa, const blkptr_t *bp);
+extern void ddt_enter(ddt_t *ddt);
+extern void ddt_exit(ddt_t *ddt);
+extern ddt_entry_t *ddt_lookup(ddt_t *ddt, const blkptr_t *bp, boolean_t add);
+extern void ddt_prefetch(spa_t *spa, const blkptr_t *bp);
+extern void ddt_remove(ddt_t *ddt, ddt_entry_t *dde);
+
+extern boolean_t ddt_class_contains(spa_t *spa, enum ddt_class max_class,
+    const blkptr_t *bp);
+
+extern ddt_entry_t *ddt_repair_start(ddt_t *ddt, const blkptr_t *bp);
+extern void ddt_repair_done(ddt_t *ddt, ddt_entry_t *dde);
+
+extern int ddt_entry_compare(const void *x1, const void *x2);
+
+extern void ddt_create(spa_t *spa);
+extern int ddt_load(spa_t *spa);
+extern void ddt_unload(spa_t *spa);
+extern void ddt_sync(spa_t *spa, uint64_t txg);
+extern int ddt_walk(spa_t *spa, ddt_bookmark_t *ddb, ddt_entry_t *dde);
+extern int ddt_object_update(ddt_t *ddt, enum ddt_type type,
+    enum ddt_class class, ddt_entry_t *dde, dmu_tx_t *tx);
+
+extern const ddt_ops_t ddt_zap_ops;
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif	/* _SYS_DDT_H */
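Note on the ddk_prop layout in sys/ddt.h above: LSIZE occupies bits 0-15, PSIZE bits 16-31, and the compression function bits 32-39. The following is a minimal standalone sketch of that packing, not the ZFS implementation. It assumes that the BF64_GET_SB/BF64_SET_SB macros (defined outside this header) store a size as (size >> shift) - bias, and that SPA_MINBLOCKSHIFT is 9. All sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value of SPA_MINBLOCKSHIFT (512-byte units); not shown in ddt.h. */
#define	SKETCH_MINBLOCKSHIFT	9

/* Extract the len-bit field that starts at bit 'low'. */
static uint64_t
sketch_bf64_get(uint64_t x, int low, int len)
{
	return ((x >> low) & ((1ULL << len) - 1));
}

/* Return x with the len-bit field at bit 'low' replaced by val. */
static uint64_t
sketch_bf64_set(uint64_t x, int low, int len, uint64_t val)
{
	uint64_t mask = ((1ULL << len) - 1) << low;

	return ((x & ~mask) | ((val << low) & mask));
}

/* Pack sizes (in bytes) and a compression id into one 64-bit word. */
static uint64_t
sketch_prop_pack(uint64_t lsize, uint64_t psize, uint8_t comp)
{
	uint64_t prop = 0;

	/* Sizes are stored in 512-byte units, biased by 1. */
	prop = sketch_bf64_set(prop, 0, 16,
	    (lsize >> SKETCH_MINBLOCKSHIFT) - 1);
	prop = sketch_bf64_set(prop, 16, 16,
	    (psize >> SKETCH_MINBLOCKSHIFT) - 1);
	prop = sketch_bf64_set(prop, 32, 8, comp);
	return (prop);
}

static uint64_t
sketch_prop_lsize(uint64_t prop)
{
	return ((sketch_bf64_get(prop, 0, 16) + 1) << SKETCH_MINBLOCKSHIFT);
}

static uint64_t
sketch_prop_psize(uint64_t prop)
{
	return ((sketch_bf64_get(prop, 16, 16) + 1) << SKETCH_MINBLOCKSHIFT);
}

static uint64_t
sketch_prop_compress(uint64_t prop)
{
	return (sketch_bf64_get(prop, 32, 8));
}
```

Under these assumptions, sketch_prop_pack(131072, 4096, 2) yields 0x2000700ff, and the three getters return 131072, 4096 and 2 again. The bias of 1 lets a 16-bit field describe sizes from 512 bytes up to 32 MB in 512-byte steps.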
--- a/uts/common/fs/zfs/sys/dmu.h
+++ b/uts/common/fs/zfs/sys/dmu.h
@ -0,0 +1,740 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+/* Portions Copyright 2010 Robert Milkowski */
+
+#ifndef	_SYS_DMU_H
+#define	_SYS_DMU_H
+
+/*
+ * This file describes the interface that the DMU provides for its
+ * consumers.
+ *
+ * The DMU also interacts with the SPA.  That interface is described in
+ * dmu_spa.h.
+ */
+
+#include <sys/inttypes.h>
+#include <sys/types.h>
+#include <sys/param.h>
+#include <sys/cred.h>
+#include <sys/time.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+struct uio;
+struct xuio;
+struct page;
+struct vnode;
+struct spa;
+struct zilog;
+struct zio;
+struct blkptr;
+struct zap_cursor;
+struct dsl_dataset;
+struct dsl_pool;
+struct dnode;
+struct drr_begin;
+struct drr_end;
+struct zbookmark;
+struct spa;
+struct nvlist;
+struct arc_buf;
+struct zio_prop;
+struct sa_handle;
+
+typedef struct objset objset_t;
+typedef struct dmu_tx dmu_tx_t;
+typedef struct dsl_dir dsl_dir_t;
+
+typedef enum dmu_object_type {
+	DMU_OT_NONE,
+	/* general: */
+	DMU_OT_OBJECT_DIRECTORY,	/* ZAP */
+	DMU_OT_OBJECT_ARRAY,		/* UINT64 */
+	DMU_OT_PACKED_NVLIST,		/* UINT8 (XDR by nvlist_pack/unpack) */
+	DMU_OT_PACKED_NVLIST_SIZE,	/* UINT64 */
+	DMU_OT_BPOBJ,			/* UINT64 */
+	DMU_OT_BPOBJ_HDR,		/* UINT64 */
+	/* spa: */
+	DMU_OT_SPACE_MAP_HEADER,	/* UINT64 */
+	DMU_OT_SPACE_MAP,		/* UINT64 */
+	/* zil: */
+	DMU_OT_INTENT_LOG,		/* UINT64 */
+	/* dmu: */
+	DMU_OT_DNODE,			/* DNODE */
+	DMU_OT_OBJSET,			/* OBJSET */
+	/* dsl: */
+	DMU_OT_DSL_DIR,			/* UINT64 */
+	DMU_OT_DSL_DIR_CHILD_MAP,	/* ZAP */
+	DMU_OT_DSL_DS_SNAP_MAP,		/* ZAP */
+	DMU_OT_DSL_PROPS,		/* ZAP */
+	DMU_OT_DSL_DATASET,		/* UINT64 */
+	/* zpl: */
+	DMU_OT_ZNODE,			/* ZNODE */
+	DMU_OT_OLDACL,			/* Old ACL */
+	DMU_OT_PLAIN_FILE_CONTENTS,	/* UINT8 */
+	DMU_OT_DIRECTORY_CONTENTS,	/* ZAP */
+	DMU_OT_MASTER_NODE,		/* ZAP */
+	DMU_OT_UNLINKED_SET,		/* ZAP */
+	/* zvol: */
+	DMU_OT_ZVOL,			/* UINT8 */
+	DMU_OT_ZVOL_PROP,		/* ZAP */
+	/* other; for testing only! */
+	DMU_OT_PLAIN_OTHER,		/* UINT8 */
+	DMU_OT_UINT64_OTHER,		/* UINT64 */
+	DMU_OT_ZAP_OTHER,		/* ZAP */
+	/* new object types: */
+	DMU_OT_ERROR_LOG,		/* ZAP */
+	DMU_OT_SPA_HISTORY,		/* UINT8 */
+	DMU_OT_SPA_HISTORY_OFFSETS,	/* spa_his_phys_t */
+	DMU_OT_POOL_PROPS,		/* ZAP */
+	DMU_OT_DSL_PERMS,		/* ZAP */
+	DMU_OT_ACL,			/* ACL */
+	DMU_OT_SYSACL,			/* SYSACL */
+	DMU_OT_FUID,			/* FUID table (Packed NVLIST UINT8) */
+	DMU_OT_FUID_SIZE,		/* FUID table size UINT64 */
+	DMU_OT_NEXT_CLONES,		/* ZAP */
+	DMU_OT_SCAN_QUEUE,		/* ZAP */
+	DMU_OT_USERGROUP_USED,		/* ZAP */
+	DMU_OT_USERGROUP_QUOTA,		/* ZAP */
+	DMU_OT_USERREFS,		/* ZAP */
+	DMU_OT_DDT_ZAP,			/* ZAP */
+	DMU_OT_DDT_STATS,		/* ZAP */
+	DMU_OT_SA,			/* System attr */
+	DMU_OT_SA_MASTER_NODE,		/* ZAP */
+	DMU_OT_SA_ATTR_REGISTRATION,	/* ZAP */
+	DMU_OT_SA_ATTR_LAYOUTS,		/* ZAP */
+	DMU_OT_SCAN_XLATE,		/* ZAP */
+	DMU_OT_DEDUP,			/* fake dedup BP from ddt_bp_create() */
+	DMU_OT_DEADLIST,		/* ZAP */
+	DMU_OT_DEADLIST_HDR,		/* UINT64 */
+	DMU_OT_DSL_CLONES,		/* ZAP */
+	DMU_OT_BPOBJ_SUBOBJ,		/* UINT64 */
+	DMU_OT_NUMTYPES
+} dmu_object_type_t;
+
+typedef enum dmu_objset_type {
+	DMU_OST_NONE,
+	DMU_OST_META,
+	DMU_OST_ZFS,
+	DMU_OST_ZVOL,
+	DMU_OST_OTHER,			/* For testing only! */
+	DMU_OST_ANY,			/* Be careful! */
+	DMU_OST_NUMTYPES
+} dmu_objset_type_t;
+
+void byteswap_uint64_array(void *buf, size_t size);
+void byteswap_uint32_array(void *buf, size_t size);
+void byteswap_uint16_array(void *buf, size_t size);
+void byteswap_uint8_array(void *buf, size_t size);
+void zap_byteswap(void *buf, size_t size);
+void zfs_oldacl_byteswap(void *buf, size_t size);
+void zfs_acl_byteswap(void *buf, size_t size);
+void zfs_znode_byteswap(void *buf, size_t size);
+
+#define	DS_FIND_SNAPSHOTS	(1<<0)
+#define	DS_FIND_CHILDREN	(1<<1)
+
+/*
+ * The maximum number of bytes that can be accessed as part of one
+ * operation, including metadata.
+ */
+#define	DMU_MAX_ACCESS (10<<20) /* 10MB */
+#define	DMU_MAX_DELETEBLKCNT (20480) /* ~5MB of indirect blocks */
+
+#define	DMU_USERUSED_OBJECT	(-1ULL)
+#define	DMU_GROUPUSED_OBJECT	(-2ULL)
+#define	DMU_DEADLIST_OBJECT	(-3ULL)
+
+/*
+ * artificial blkids for bonus buffer and spill blocks
+ */
+#define	DMU_BONUS_BLKID		(-1ULL)
+#define	DMU_SPILL_BLKID		(-2ULL)
+/*
+ * Public routines to create, destroy, open, and close objsets.
+ */
+int dmu_objset_hold(const char *name, void *tag, objset_t **osp);
+int dmu_objset_own(const char *name, dmu_objset_type_t type,
+    boolean_t readonly, void *tag, objset_t **osp);
+void dmu_objset_rele(objset_t *os, void *tag);
+void dmu_objset_disown(objset_t *os, void *tag);
+int dmu_objset_open_ds(struct dsl_dataset *ds, objset_t **osp);
+
+int dmu_objset_evict_dbufs(objset_t *os);
+int dmu_objset_create(const char *name, dmu_objset_type_t type, uint64_t flags,
+    void (*func)(objset_t *os, void *arg, cred_t *cr, dmu_tx_t *tx), void *arg);
+int dmu_objset_clone(const char *name, struct dsl_dataset *clone_origin,
+    uint64_t flags);
+int dmu_objset_destroy(const char *name, boolean_t defer);
+int dmu_snapshots_destroy(char *fsname, char *snapname, boolean_t defer);
+int dmu_objset_snapshot(char *fsname, char *snapname, char *tag,
+    struct nvlist *props, boolean_t recursive, boolean_t temporary, int fd);
+int dmu_objset_rename(const char *name, const char *newname,
+    boolean_t recursive);
+int dmu_objset_find(char *name, int func(const char *, void *), void *arg,
+    int flags);
+void dmu_objset_byteswap(void *buf, size_t size);
+
+typedef struct dmu_buf {
+	uint64_t db_object;		/* object that this buffer is part of */
+	uint64_t db_offset;		/* byte offset in this object */
+	uint64_t db_size;		/* size of buffer in bytes */
+	void *db_data;			/* data in buffer */
+} dmu_buf_t;
+
+typedef void dmu_buf_evict_func_t(struct dmu_buf *db, void *user_ptr);
+
+/*
+ * The names of zap entries in the DIRECTORY_OBJECT of the MOS.
+ */
+#define	DMU_POOL_DIRECTORY_OBJECT	1
+#define	DMU_POOL_CONFIG			"config"
+#define	DMU_POOL_ROOT_DATASET		"root_dataset"
+#define	DMU_POOL_SYNC_BPOBJ		"sync_bplist"
+#define	DMU_POOL_ERRLOG_SCRUB		"errlog_scrub"
+#define	DMU_POOL_ERRLOG_LAST		"errlog_last"
+#define	DMU_POOL_SPARES			"spares"
+#define	DMU_POOL_DEFLATE		"deflate"
+#define	DMU_POOL_HISTORY		"history"
+#define	DMU_POOL_PROPS			"pool_props"
+#define	DMU_POOL_L2CACHE		"l2cache"
+#define	DMU_POOL_TMP_USERREFS		"tmp_userrefs"
+#define	DMU_POOL_DDT			"DDT-%s-%s-%s"
+#define	DMU_POOL_DDT_STATS		"DDT-statistics"
+#define	DMU_POOL_CREATION_VERSION	"creation_version"
+#define	DMU_POOL_SCAN			"scan"
+#define	DMU_POOL_FREE_BPOBJ		"free_bpobj"
+
+/*
+ * Allocate an object from this objset.  The range of object numbers
+ * available is (0, DN_MAX_OBJECT).  Object 0 is the meta-dnode.
+ *
+ * The transaction must be assigned to a txg.  The newly allocated
+ * object will be "held" in the transaction (ie. you can modify the
+ * newly allocated object in this transaction).
+ *
+ * dmu_object_alloc() chooses an object and returns it in *objectp.
+ *
+ * dmu_object_claim() allocates a specific object number.  If that
+ * number is already allocated, it fails and returns EEXIST.
+ *
+ * Return 0 on success, or ENOSPC or EEXIST as specified above.
+ */
+uint64_t dmu_object_alloc(objset_t *os, dmu_object_type_t ot,
+    int blocksize, dmu_object_type_t bonus_type, int bonus_len, dmu_tx_t *tx);
+int dmu_object_claim(objset_t *os, uint64_t object, dmu_object_type_t ot,
+    int blocksize, dmu_object_type_t bonus_type, int bonus_len, dmu_tx_t *tx);
+int dmu_object_reclaim(objset_t *os, uint64_t object, dmu_object_type_t ot,
+    int blocksize, dmu_object_type_t bonustype, int bonuslen);
+
+/*
+ * Free an object from this objset.
+ *
+ * The object's data will be freed as well (ie. you don't need to call
+ * dmu_free(object, 0, -1, tx)).
+ *
+ * The object need not be held in the transaction.
+ *
+ * If there are any holds on this object's buffers (via dmu_buf_hold()),
+ * or tx holds on the object (via dmu_tx_hold_object()), you can not
+ * free it; it fails and returns EBUSY.
+ *
+ * If the object is not allocated, it fails and returns ENOENT.
+ *
+ * Return 0 on success, or EBUSY or ENOENT as specified above.
+ */
+int dmu_object_free(objset_t *os, uint64_t object, dmu_tx_t *tx);
+
+/*
+ * Find the next allocated or free object.
+ *
+ * The objectp parameter is in-out.  It will be updated to be the next
+ * object which is allocated.  Ignore objects which have not been
+ * modified since txg.
+ *
+ * XXX Can only be called on an objset with no dirty data.
+ *
+ * Returns 0 on success, or ENOENT if there are no more objects.
+ */
+int dmu_object_next(objset_t *os, uint64_t *objectp,
+    boolean_t hole, uint64_t txg);
+
+/*
+ * Set the data blocksize for an object.
+ *
+ * The object cannot have any blocks allocated beyond the first.  If
+ * the first block is allocated already, the new size must be greater
+ * than the current block size.  If these conditions are not met,
+ * ENOTSUP will be returned.
+ *
+ * Returns 0 on success, or EBUSY if there are any holds on the object
+ * contents, or ENOTSUP as described above.
+ */
+int dmu_object_set_blocksize(objset_t *os, uint64_t object, uint64_t size,
+    int ibs, dmu_tx_t *tx);
+
+/*
+ * Set the checksum property on a dnode.  The new checksum algorithm will
+ * apply to all newly written blocks; existing blocks will not be affected.
+ */
+void dmu_object_set_checksum(objset_t *os, uint64_t object, uint8_t checksum,
+    dmu_tx_t *tx);
+
+/*
+ * Set the compress property on a dnode.  The new compression algorithm will
+ * apply to all newly written blocks; existing blocks will not be affected.
+ */
+void dmu_object_set_compress(objset_t *os, uint64_t object, uint8_t compress,
+    dmu_tx_t *tx);
+
+/*
+ * Decide how to write a block: checksum, compression, number of copies, etc.
+ */
+#define	WP_NOFILL	0x1
+#define	WP_DMU_SYNC	0x2
+#define	WP_SPILL	0x4
+
+void dmu_write_policy(objset_t *os, struct dnode *dn, int level, int wp,
+    struct zio_prop *zp);
+/*
+ * The bonus data is accessed more or less like a regular buffer.
+ * You must dmu_bonus_hold() to get the buffer, which will give you a
+ * dmu_buf_t with db_offset==-1ULL, and db_size = the size of the bonus
+ * data.  As with any normal buffer, you must call dmu_buf_read() to
+ * read db_data, dmu_buf_will_dirty() before modifying it, and the
+ * object must be held in an assigned transaction before calling
+ * dmu_buf_will_dirty.  You may use dmu_buf_set_user() on the bonus
+ * buffer as well.  You must release your hold with dmu_buf_rele().
+ */
+int dmu_bonus_hold(objset_t *os, uint64_t object, void *tag, dmu_buf_t **);
+int dmu_bonus_max(void);
+int dmu_set_bonus(dmu_buf_t *, int, dmu_tx_t *);
+int dmu_set_bonustype(dmu_buf_t *, dmu_object_type_t, dmu_tx_t *);
+dmu_object_type_t dmu_get_bonustype(dmu_buf_t *);
+int dmu_rm_spill(objset_t *, uint64_t, dmu_tx_t *);
+
+/*
+ * Special spill buffer support used by "SA" framework
+ */
+
+int dmu_spill_hold_by_bonus(dmu_buf_t *bonus, void *tag, dmu_buf_t **dbp);
+int dmu_spill_hold_by_dnode(struct dnode *dn, uint32_t flags,
+    void *tag, dmu_buf_t **dbp);
+int dmu_spill_hold_existing(dmu_buf_t *bonus, void *tag, dmu_buf_t **dbp);
+
+/*
+ * Obtain the DMU buffer from the specified object which contains the
+ * specified offset.  dmu_buf_hold() puts a "hold" on the buffer, so
+ * that it will remain in memory.  You must release the hold with
+ * dmu_buf_rele().  You mustn't access the dmu_buf_t after releasing your
+ * hold.  You must have a hold on any dmu_buf_t* you pass to the DMU.
+ *
+ * You must call dmu_buf_read, dmu_buf_will_dirty, or dmu_buf_will_fill
+ * on the returned buffer before reading or writing the buffer's
+ * db_data.  The comments for those routines describe what particular
+ * operations are valid after calling them.
+ *
+ * The object number must be a valid, allocated object number.
+ */
+int dmu_buf_hold(objset_t *os, uint64_t object, uint64_t offset,
+    void *tag, dmu_buf_t **, int flags);
+void dmu_buf_add_ref(dmu_buf_t *db, void* tag);
+void dmu_buf_rele(dmu_buf_t *db, void *tag);
+uint64_t dmu_buf_refcount(dmu_buf_t *db);
+
+/*
+ * dmu_buf_hold_array holds the DMU buffers which contain all bytes in a
+ * range of an object.  A pointer to an array of dmu_buf_t*'s is
+ * returned (in *dbpp).
+ *
+ * dmu_buf_rele_array releases the hold on an array of dmu_buf_t*'s, and
+ * frees the array.  The hold on the array of buffers MUST be released
+ * with dmu_buf_rele_array.  You can NOT release the hold on each buffer
+ * individually with dmu_buf_rele.
+ */
+int dmu_buf_hold_array_by_bonus(dmu_buf_t *db, uint64_t offset,
+    uint64_t length, int read, void *tag, int *numbufsp, dmu_buf_t ***dbpp);
+void dmu_buf_rele_array(dmu_buf_t **, int numbufs, void *tag);
+
+/*
+ * Returns NULL on success, or the existing user ptr if it's already
+ * been set.
+ *
+ * user_ptr is for use by the user and can be obtained via dmu_buf_get_user().
+ *
+ * user_data_ptr_ptr should be NULL, or a pointer to a pointer which
+ * will be set to db->db_data when you are allowed to access it.  Note
+ * that db->db_data (the pointer) can change when you do dmu_buf_read(),
+ * dmu_buf_tryupgrade(), dmu_buf_will_dirty(), or dmu_buf_will_fill().
+ * *user_data_ptr_ptr will be set to the new value when it changes.
+ *
+ * If non-NULL, pageout func will be called when this buffer is being
+ * excised from the cache, so that you can clean up the data structure
+ * pointed to by user_ptr.
+ *
+ * dmu_evict_user() will call the pageout func for all buffers in an
+ * objset with a given pageout func.
+ */
+void *dmu_buf_set_user(dmu_buf_t *db, void *user_ptr, void *user_data_ptr_ptr,
+    dmu_buf_evict_func_t *pageout_func);
+/*
+ * set_user_ie is the same as set_user, but request immediate eviction
+ * when hold count goes to zero.
+ */
+void *dmu_buf_set_user_ie(dmu_buf_t *db, void *user_ptr,
+    void *user_data_ptr_ptr, dmu_buf_evict_func_t *pageout_func);
+void *dmu_buf_update_user(dmu_buf_t *db_fake, void *old_user_ptr,
+    void *user_ptr, void *user_data_ptr_ptr,
+    dmu_buf_evict_func_t *pageout_func);
+void dmu_evict_user(objset_t *os, dmu_buf_evict_func_t *func);
+
+/*
+ * Returns the user_ptr set with dmu_buf_set_user(), or NULL if not set.
+ */
+void *dmu_buf_get_user(dmu_buf_t *db);
+
+/*
+ * Indicate that you are going to modify the buffer's data (db_data).
+ *
+ * The transaction (tx) must be assigned to a txg (ie. you've called
+ * dmu_tx_assign()).  The buffer's object must be held in the tx
+ * (ie. you've called dmu_tx_hold_object(tx, db->db_object)).
+ */
+void dmu_buf_will_dirty(dmu_buf_t *db, dmu_tx_t *tx);
+
+/*
+ * Tells if the given dbuf is freeable.
+ */
+boolean_t dmu_buf_freeable(dmu_buf_t *);
+
+/*
+ * You must create a transaction, then hold the objects which you will
+ * (or might) modify as part of this transaction.  Then you must assign
+ * the transaction to a transaction group.  Once the transaction has
+ * been assigned, you can modify buffers which belong to held objects as
+ * part of this transaction.  You can't modify buffers before the
+ * transaction has been assigned; you can't modify buffers which don't
+ * belong to objects which this transaction holds; you can't hold
+ * objects once the transaction has been assigned.  You may hold an
+ * object which you are going to free (with dmu_object_free()), but you
+ * don't have to.
+ *
+ * You can abort the transaction before it has been assigned.
+ *
+ * Note that you may hold buffers (with dmu_buf_hold) at any time,
+ * regardless of transaction state.
+ */
+
+#define	DMU_NEW_OBJECT	(-1ULL)
+#define	DMU_OBJECT_END	(-1ULL)
+
+dmu_tx_t *dmu_tx_create(objset_t *os);
+void dmu_tx_hold_write(dmu_tx_t *tx, uint64_t object, uint64_t off, int len);
+void dmu_tx_hold_free(dmu_tx_t *tx, uint64_t object, uint64_t off,
+    uint64_t len);
+void dmu_tx_hold_zap(dmu_tx_t *tx, uint64_t object, int add, const char *name);
+void dmu_tx_hold_bonus(dmu_tx_t *tx, uint64_t object);
+void dmu_tx_hold_spill(dmu_tx_t *tx, uint64_t object);
+void dmu_tx_hold_sa(dmu_tx_t *tx, struct sa_handle *hdl, boolean_t may_grow);
+void dmu_tx_hold_sa_create(dmu_tx_t *tx, int total_size);
+void dmu_tx_abort(dmu_tx_t *tx);
+int dmu_tx_assign(dmu_tx_t *tx, uint64_t txg_how);
+void dmu_tx_wait(dmu_tx_t *tx);
+void dmu_tx_commit(dmu_tx_t *tx);
+
+/*
+ * To register a commit callback, dmu_tx_callback_register() must be called.
+ *
+ * dcb_data is a pointer to caller private data that is passed on as a
+ * callback parameter. The caller is responsible for properly allocating and
+ * freeing it.
+ *
+ * When registering a callback, the transaction must be already created, but
+ * it cannot be committed or aborted. It can be assigned to a txg or not.
+ *
+ * The callback will be called after the transaction has been safely written
+ * to stable storage and will also be called if the dmu_tx is aborted.
+ * If there is any error which prevents the transaction from being committed to
+ * disk, the callback will be called with a value of error != 0.
+ */
+typedef void dmu_tx_callback_func_t(void *dcb_data, int error);
+
+void dmu_tx_callback_register(dmu_tx_t *tx, dmu_tx_callback_func_t *dcb_func,
+    void *dcb_data);
+
+/*
+ * Free up the data blocks for a defined range of a file.  If size is
+ * zero, the range from offset to end-of-file is freed.
+ */
+int dmu_free_range(objset_t *os, uint64_t object, uint64_t offset,
+	uint64_t size, dmu_tx_t *tx);
+int dmu_free_long_range(objset_t *os, uint64_t object, uint64_t offset,
+	uint64_t size);
+int dmu_free_object(objset_t *os, uint64_t object);
+
+/*
+ * Convenience functions.
+ *
+ * Canfail routines will return 0 on success, or an errno if there is a
+ * nonrecoverable I/O error.
+ */
+#define	DMU_READ_PREFETCH	0 /* prefetch */
+#define	DMU_READ_NO_PREFETCH	1 /* don't prefetch */
+int dmu_read(objset_t *os, uint64_t object, uint64_t offset, uint64_t size,
+	void *buf, uint32_t flags);
+void dmu_write(objset_t *os, uint64_t object, uint64_t offset, uint64_t size,
+	const void *buf, dmu_tx_t *tx);
+void dmu_prealloc(objset_t *os, uint64_t object, uint64_t offset, uint64_t size,
+	dmu_tx_t *tx);
+int dmu_read_uio(objset_t *os, uint64_t object, struct uio *uio, uint64_t size);
+int dmu_write_uio(objset_t *os, uint64_t object, struct uio *uio, uint64_t size,
+    dmu_tx_t *tx);
+int dmu_write_uio_dbuf(dmu_buf_t *zdb, struct uio *uio, uint64_t size,
+    dmu_tx_t *tx);
+int dmu_write_pages(objset_t *os, uint64_t object, uint64_t offset,
+    uint64_t size, struct page *pp, dmu_tx_t *tx);
+struct arc_buf *dmu_request_arcbuf(dmu_buf_t *handle, int size);
+void dmu_return_arcbuf(struct arc_buf *buf);
+void dmu_assign_arcbuf(dmu_buf_t *handle, uint64_t offset, struct arc_buf *buf,
+    dmu_tx_t *tx);
+int dmu_xuio_init(struct xuio *uio, int niov);
+void dmu_xuio_fini(struct xuio *uio);
+int dmu_xuio_add(struct xuio *uio, struct arc_buf *abuf, offset_t off,
+    size_t n);
+int dmu_xuio_cnt(struct xuio *uio);
+struct arc_buf *dmu_xuio_arcbuf(struct xuio *uio, int i);
+void dmu_xuio_clear(struct xuio *uio, int i);
+void xuio_stat_wbuf_copied();
+void xuio_stat_wbuf_nocopy();
+
+extern int zfs_prefetch_disable;
+
+/*
+ * Asynchronously try to read in the data.
+ */
+void dmu_prefetch(objset_t *os, uint64_t object, uint64_t offset,
+    uint64_t len);
+
+typedef struct dmu_object_info {
+	/* All sizes are in bytes unless otherwise indicated. */
+	uint32_t doi_data_block_size;
+	uint32_t doi_metadata_block_size;
+	dmu_object_type_t doi_type;
+	dmu_object_type_t doi_bonus_type;
+	uint64_t doi_bonus_size;
+	uint8_t doi_indirection;		/* 2 = dnode->indirect->data */
+	uint8_t doi_checksum;
+	uint8_t doi_compress;
+	uint8_t doi_pad[5];
+	uint64_t doi_physical_blocks_512;	/* data + metadata, 512b blks */
+	uint64_t doi_max_offset;
+	uint64_t doi_fill_count;		/* number of non-empty blocks */
+} dmu_object_info_t;
+
+typedef void arc_byteswap_func_t(void *buf, size_t size);
+
+typedef struct dmu_object_type_info {
+	arc_byteswap_func_t	*ot_byteswap;
+	boolean_t		ot_metadata;
+	char			*ot_name;
+} dmu_object_type_info_t;
+
+extern const dmu_object_type_info_t dmu_ot[DMU_OT_NUMTYPES];
+
+/*
+ * Get information on a DMU object.
+ *
+ * Return 0 on success or ENOENT if object is not allocated.
+ *
+ * If doi is NULL, just indicates whether the object exists.
+ */
+int dmu_object_info(objset_t *os, uint64_t object, dmu_object_info_t *doi);
+void dmu_object_info_from_dnode(struct dnode *dn, dmu_object_info_t *doi);
+void dmu_object_info_from_db(dmu_buf_t *db, dmu_object_info_t *doi);
+void dmu_object_size_from_db(dmu_buf_t *db, uint32_t *blksize,
+    u_longlong_t *nblk512);
+
+typedef struct dmu_objset_stats {
+	uint64_t dds_num_clones; /* number of clones of this */
+	uint64_t dds_creation_txg;
+	uint64_t dds_guid;
+	dmu_objset_type_t dds_type;
+	uint8_t dds_is_snapshot;
+	uint8_t dds_inconsistent;
+	char dds_origin[MAXNAMELEN];
+} dmu_objset_stats_t;
+
+/*
+ * Get stats on a dataset.
+ */
+void dmu_objset_fast_stat(objset_t *os, dmu_objset_stats_t *stat);
+
+/*
+ * Add entries to the nvlist for all the objset's properties.  See
+ * zfs_prop_table[] and zfs(1m) for details on the properties.
+ */
+void dmu_objset_stats(objset_t *os, struct nvlist *nv);
+
+/*
+ * Get the space usage statistics for statvfs().
+ *
+ * refdbytes is the amount of space "referenced" by this objset.
+ * availbytes is the amount of space available to this objset, taking
+ * into account quotas & reservations, assuming that no other objsets
+ * use the space first.  These values correspond to the 'referenced' and
+ * 'available' properties, described in the zfs(1m) manpage.
+ *
+ * usedobjs and availobjs are the number of objects currently allocated,
+ * and available.
+ */
+void dmu_objset_space(objset_t *os, uint64_t *refdbytesp, uint64_t *availbytesp,
+    uint64_t *usedobjsp, uint64_t *availobjsp);
+
+/*
+ * The fsid_guid is a 56-bit ID that can change to avoid collisions.
+ * (Contrast with the ds_guid which is a 64-bit ID that will never
+ * change, so there is a small probability that it will collide.)
+ */
+uint64_t dmu_objset_fsid_guid(objset_t *os);
+
+/*
+ * Get the [cm]time for an objset's snapshot dir
+ */
+timestruc_t dmu_objset_snap_cmtime(objset_t *os);
+
+int dmu_objset_is_snapshot(objset_t *os);
+
+extern struct spa *dmu_objset_spa(objset_t *os);
+extern struct zilog *dmu_objset_zil(objset_t *os);
+extern struct dsl_pool *dmu_objset_pool(objset_t *os);
+extern struct dsl_dataset *dmu_objset_ds(objset_t *os);
+extern void dmu_objset_name(objset_t *os, char *buf);
+extern dmu_objset_type_t dmu_objset_type(objset_t *os);
+extern uint64_t dmu_objset_id(objset_t *os);
+extern uint64_t dmu_objset_syncprop(objset_t *os);
+extern uint64_t dmu_objset_logbias(objset_t *os);
+extern int dmu_snapshot_list_next(objset_t *os, int namelen, char *name,
+    uint64_t *id, uint64_t *offp, boolean_t *case_conflict);
+extern int dmu_snapshot_realname(objset_t *os, char *name, char *real,
+    int maxlen, boolean_t *conflict);
+extern int dmu_dir_list_next(objset_t *os, int namelen, char *name,
+    uint64_t *idp, uint64_t *offp);
+
+typedef int objset_used_cb_t(dmu_object_type_t bonustype,
+    void *bonus, uint64_t *userp, uint64_t *groupp);
+extern void dmu_objset_register_type(dmu_objset_type_t ost,
+    objset_used_cb_t *cb);
+extern void dmu_objset_set_user(objset_t *os, void *user_ptr);
+extern void *dmu_objset_get_user(objset_t *os);
+
+/*
+ * Return the txg number for the given assigned transaction.
+ */
+uint64_t dmu_tx_get_txg(dmu_tx_t *tx);
+
+/*
+ * Synchronous write.
+ * If a parent zio is provided this function initiates a write on the
+ * provided buffer as a child of the parent zio.
+ * In the absence of a parent zio, the write is completed synchronously.
+ * At write completion, blk is filled with the bp of the written block.
+ * Note that while the data covered by this function will be on stable
+ * storage when the write completes this new data does not become a
+ * permanent part of the file until the associated transaction commits.
+ */
+
+/*
+ * {zfs,zvol,ztest}_get_done() args
+ */
+typedef struct zgd {
+	struct zilog	*zgd_zilog;
+	struct blkptr	*zgd_bp;
+	dmu_buf_t	*zgd_db;
+	struct rl	*zgd_rl;
+	void		*zgd_private;
+} zgd_t;
+
+typedef void dmu_sync_cb_t(zgd_t *arg, int error);
+int dmu_sync(struct zio *zio, uint64_t txg, dmu_sync_cb_t *done, zgd_t *zgd);
+
+/*
+ * Find the next hole or data block in the file, starting at *off.
+ * Return found offset in *off. Return ESRCH for end of file.
+ */
+int dmu_offset_next(objset_t *os, uint64_t object, boolean_t hole,
+    uint64_t *off);
+
+/*
+ * Initial setup and final teardown.
+ */
+extern void dmu_init(void);
+extern void dmu_fini(void);
+
+typedef void (*dmu_traverse_cb_t)(objset_t *os, void *arg, struct blkptr *bp,
+    uint64_t object, uint64_t offset, int len);
+void dmu_traverse_objset(objset_t *os, uint64_t txg_start,
+    dmu_traverse_cb_t cb, void *arg);
+
+int dmu_sendbackup(objset_t *tosnap, objset_t *fromsnap, boolean_t fromorigin,
+    struct vnode *vp, offset_t *off);
+
+typedef struct dmu_recv_cookie {
+	/*
+	 * This structure is opaque!
+	 *
+	 * If logical and real are different, we are receiving the stream
+	 * into the "real" temporary clone, and then switching it with
+	 * the "logical" target.
+	 */
+	struct dsl_dataset *drc_logical_ds;
+	struct dsl_dataset *drc_real_ds;
+	struct drr_begin *drc_drrb;
+	char *drc_tosnap;
+	char *drc_top_ds;
+	boolean_t drc_newfs;
+	boolean_t drc_force;
+} dmu_recv_cookie_t;
+
+int dmu_recv_begin(char *tofs, char *tosnap, char *topds, struct drr_begin *,
+    boolean_t force, objset_t *origin, dmu_recv_cookie_t *);
+int dmu_recv_stream(dmu_recv_cookie_t *drc, struct vnode *vp, offset_t *voffp,
+    int cleanup_fd, uint64_t *action_handlep);
+int dmu_recv_end(dmu_recv_cookie_t *drc);
+
+int dmu_diff(objset_t *tosnap, objset_t *fromsnap, struct vnode *vp,
+    offset_t *off);
+
+/* CRC64 table */
+#define	ZFS_CRC64_POLY	0xC96C5795D7870F42ULL	/* ECMA-182, reflected form */
+extern uint64_t zfs_crc64_table[256];
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif	/* _SYS_DMU_H */
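Note on the CRC64 declarations at the end of dmu.h: ZFS_CRC64_POLY is described as the reflected form of the ECMA-182 polynomial, and zfs_crc64_table has one slot per byte value. The following is a minimal sketch of how such a table can be filled, assuming the usual bit-at-a-time construction for a reflected CRC. The names crc64_table_demo and crc64_table_fill are hypothetical and do not appear in the header.

```c
#include <assert.h>
#include <stdint.h>

#define	ZFS_CRC64_POLY	0xC96C5795D7870F42ULL	/* copied from dmu.h */

/* Hypothetical stand-in for zfs_crc64_table. */
static uint64_t crc64_table_demo[256];

/*
 * For each byte value, shift right eight times and XOR in the
 * polynomial whenever the bit shifted out was set.
 */
static void
crc64_table_fill(uint64_t *table)
{
	for (int i = 0; i < 256; i++) {
		uint64_t v = (uint64_t)i;

		for (int j = 0; j < 8; j++)
			v = (v >> 1) ^ (-(v & 1) & ZFS_CRC64_POLY);
		table[i] = v;
	}
}
```

With this construction, entry 128 comes out equal to the polynomial itself, which makes a cheap sanity check that the table has been initialized.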
--- a/uts/common/fs/zfs/sys/dmu_impl.h
+++ b/uts/common/fs/zfs/sys/dmu_impl.h
@ -0,0 +1,272 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright 2010 Sun Microsystems, Inc.  All rights reserved.
+ * Use is subject to license terms.
+ */
+
+#ifndef _SYS_DMU_IMPL_H
+#define	_SYS_DMU_IMPL_H
+
+#include <sys/txg_impl.h>
+#include <sys/zio.h>
+#include <sys/dnode.h>
+#include <sys/zfs_context.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+/*
+ * This is the locking strategy for the DMU.  Numbers in parentheses are
+ * cases that use that lock order, referenced below:
+ *
+ * ARC is self-contained
+ * bplist is self-contained
+ * refcount is self-contained
+ * txg is self-contained (hopefully!)
+ * zst_lock
+ * zf_rwlock
+ *
+ * XXX try to improve evicting path?
+ *
+ * dp_config_rwlock > os_obj_lock > dn_struct_rwlock >
+ * 	dn_dbufs_mtx > hash_mutexes > db_mtx > dd_lock > leafs
+ *
+ * dp_config_rwlock
+ *    must be held before: everything
+ *    protects dd namespace changes
+ *    protects property changes globally
+ *    held from:
+ *    	dsl_dir_open/r:
+ *    	dsl_dir_create_sync/w:
+ *    	dsl_dir_sync_destroy/w:
+ *    	dsl_dir_rename_sync/w:
+ *    	dsl_prop_changed_notify/r:
+ *
+ * os_obj_lock
+ *   must be held before:
+ *   	everything except dp_config_rwlock
+ *   protects os_obj_next
+ *   held from:
+ *   	dmu_object_alloc: dn_dbufs_mtx, db_mtx, hash_mutexes, dn_struct_rwlock
+ *
+ * dn_struct_rwlock
+ *   must be held before:
+ *   	everything except dp_config_rwlock and os_obj_lock
+ *   protects structure of dnode (eg. nlevels)
+ *   	db_blkptr can change when syncing out change to nlevels
+ *   	dn_maxblkid
+ *   	dn_nlevels
+ *   	dn_*blksz*
+ *   	phys nlevels, maxblkid, physical blkptr_t's (?)
+ *   held from:
+ *   	callers of dbuf_read_impl, dbuf_hold[_impl], dbuf_prefetch
+ *   	dmu_object_info_from_dnode: dn_dirty_mtx (dn_datablksz)
+ *   	dmu_tx_count_free:
+ *   	dbuf_read_impl: db_mtx, dmu_zfetch()
+ *   	dmu_zfetch: zf_rwlock/r, zst_lock, dbuf_prefetch()
+ *   	dbuf_new_size: db_mtx
+ *   	dbuf_dirty: db_mtx
+ *	dbuf_findbp: (callers, phys? - the real need)
+ *	dbuf_create: dn_dbufs_mtx, hash_mutexes, db_mtx (phys?)
+ *	dbuf_prefetch: dn_dirty_mtx, hash_mutexes, db_mtx, dn_dbufs_mtx
+ *	dbuf_hold_impl: hash_mutexes, db_mtx, dn_dbufs_mtx, dbuf_findbp()
+ *	dnode_sync/w (increase_indirection): db_mtx (phys)
+ *	dnode_set_blksz/w: dn_dbufs_mtx (dn_*blksz*)
+ *	dnode_new_blkid/w: (dn_maxblkid)
+ *	dnode_free_range/w: dn_dirty_mtx (dn_maxblkid)
+ *	dnode_next_offset: (phys)
+ *
+ * dn_dbufs_mtx
+ *    must be held before:
+ *    	db_mtx, hash_mutexes
+ *    protects:
+ *    	dn_dbufs
+ *    	dn_evicted
+ *    held from:
+ *    	dmu_evict_user: db_mtx (dn_dbufs)
+ *    	dbuf_free_range: db_mtx (dn_dbufs)
+ *    	dbuf_remove_ref: db_mtx, callees:
+ *    		dbuf_hash_remove: hash_mutexes, db_mtx
+ *    	dbuf_create: hash_mutexes, db_mtx (dn_dbufs)
+ *    	dnode_set_blksz: (dn_dbufs)
+ *
+ * hash_mutexes (global)
+ *   must be held before:
+ *   	db_mtx
+ *   protects dbuf_hash_table (global) and db_hash_next
+ *   held from:
+ *   	dbuf_find: db_mtx
+ *   	dbuf_hash_insert: db_mtx
+ *   	dbuf_hash_remove: db_mtx
+ *
+ * db_mtx (meta-leaf)
+ *   must be held before:
+ *   	dn_mtx, dn_dirty_mtx, dd_lock (leaf mutexes)
+ *   protects:
+ *   	db_state
+ * 	db_holds
+ * 	db_buf
+ * 	db_changed
+ * 	db_data_pending
+ * 	db_dirtied
+ * 	db_link
+ * 	db_dirty_node (??)
+ * 	db_dirtycnt
+ * 	db_d.*
+ * 	db.*
+ *   held from:
+ * 	dbuf_dirty: dn_mtx, dn_dirty_mtx
+ * 	dbuf_dirty->dsl_dir_willuse_space: dd_lock
+ * 	dbuf_dirty->dbuf_new_block->dsl_dataset_block_freeable: dd_lock
+ * 	dbuf_undirty: dn_dirty_mtx (db_d)
+ * 	dbuf_write_done: dn_dirty_mtx (db_state)
+ * 	dbuf_*
+ * 	dmu_buf_update_user: none (db_d)
+ * 	dmu_evict_user: none (db_d) (maybe can eliminate)
+ *   	dbuf_find: none (db_holds)
+ *   	dbuf_hash_insert: none (db_holds)
+ *   	dmu_buf_read_array_impl: none (db_state, db_changed)
+ *   	dmu_sync: none (db_dirty_node, db_d)
+ *   	dnode_reallocate: none (db)
+ *
+ * dn_mtx (leaf)
+ *   protects:
+ *   	dn_dirty_dbufs
+ *   	dn_ranges
+ *   	phys accounting
+ * 	dn_allocated_txg
+ * 	dn_free_txg
+ * 	dn_assigned_txg
+ * 	dd_assigned_tx
+ * 	dn_notxholds
+ * 	dn_dirtyctx
+ * 	dn_dirtyctx_firstset
+ * 	(dn_phys copy fields?)
+ * 	(dn_phys contents?)
+ *   held from:
+ *   	dnode_*
+ *   	dbuf_dirty: none
+ *   	dbuf_sync: none (phys accounting)
+ *   	dbuf_undirty: none (dn_ranges, dn_dirty_dbufs)
+ *   	dbuf_write_done: none (phys accounting)
+ *   	dmu_object_info_from_dnode: none (accounting)
+ *   	dmu_tx_commit: none
+ *   	dmu_tx_hold_object_impl: none
+ *   	dmu_tx_try_assign: dn_notxholds(cv)
+ *   	dmu_tx_unassign: none
+ *
+ * dd_lock
+ *    must be held before:
+ *      ds_lock
+ *      ancestors' dd_lock
+ *    protects:
+ *    	dd_prop_cbs
+ *    	dd_sync_*
+ *    	dd_used_bytes
+ *    	dd_tempreserved
+ *    	dd_space_towrite
+ *    	dd_myname
+ *    	dd_phys accounting?
+ *    held from:
+ *    	dsl_dir_*
+ *    	dsl_prop_changed_notify: none (dd_prop_cbs)
+ *    	dsl_prop_register: none (dd_prop_cbs)
+ *    	dsl_prop_unregister: none (dd_prop_cbs)
+ *    	dsl_dataset_block_freeable: none (dd_sync_*)
+ *
+ * os_lock (leaf)
+ *   protects:
+ *   	os_dirty_dnodes
+ *   	os_free_dnodes
+ *   	os_dnodes
+ *   	os_downgraded_dbufs
+ *   	dn_dirtyblksz
+ *   	dn_dirty_link
+ *   held from:
+ *   	dnode_create: none (os_dnodes)
+ *   	dnode_destroy: none (os_dnodes)
+ *   	dnode_setdirty: none (dn_dirtyblksz, os_*_dnodes)
+ *   	dnode_free: none (dn_dirtyblksz, os_*_dnodes)
+ *
+ * ds_lock
+ *    protects:
+ *    	ds_objset
+ *    	ds_open_refcount
+ *    	ds_snapname
+ *    	ds_phys accounting
+ *	ds_phys userrefs zapobj
+ *	ds_reserved
+ *    held from:
+ *    	dsl_dataset_*
+ *
+ * dr_mtx (leaf)
+ *    protects:
+ *	dr_children
+ *    held from:
+ *	dbuf_dirty
+ *	dbuf_undirty
+ *	dbuf_sync_indirect
+ *	dnode_new_blkid
+ */
+
+struct objset;
+struct dmu_pool;
+
+typedef struct dmu_xuio {
+	int next;
+	int cnt;
+	struct arc_buf **bufs;
+	iovec_t *iovp;
+} dmu_xuio_t;
+
+typedef struct xuio_stats {
+	/* loaned yet not returned arc_buf */
+	kstat_named_t xuiostat_onloan_rbuf;
+	kstat_named_t xuiostat_onloan_wbuf;
+	/* whether a copy is made when loaning out a read buffer */
+	kstat_named_t xuiostat_rbuf_copied;
+	kstat_named_t xuiostat_rbuf_nocopy;
+	/* whether a copy is made when assigning a write buffer */
+	kstat_named_t xuiostat_wbuf_copied;
+	kstat_named_t xuiostat_wbuf_nocopy;
+} xuio_stats_t;
+
+static xuio_stats_t xuio_stats = {
+	{ "onloan_read_buf",	KSTAT_DATA_UINT64 },
+	{ "onloan_write_buf",	KSTAT_DATA_UINT64 },
+	{ "read_buf_copied",	KSTAT_DATA_UINT64 },
+	{ "read_buf_nocopy",	KSTAT_DATA_UINT64 },
+	{ "write_buf_copied",	KSTAT_DATA_UINT64 },
+	{ "write_buf_nocopy",	KSTAT_DATA_UINT64 }
+};
+
+#define	XUIOSTAT_INCR(stat, val)	\
+	atomic_add_64(&xuio_stats.stat.value.ui64, (val))
+#define	XUIOSTAT_BUMP(stat)	XUIOSTAT_INCR(stat, 1)
+
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif	/* _SYS_DMU_IMPL_H */
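Note on the locking comment in dmu_impl.h: the chain "dp_config_rwlock > os_obj_lock > dn_struct_rwlock > dn_dbufs_mtx > hash_mutexes > db_mtx > dd_lock > leafs" reads as an acquisition order, with locks to the left taken first. The sketch below encodes that chain as ranks so an ordering can be checked. It is a user-space illustration only; the enum and the helper dmu_lock_order_ok are hypothetical and are not how the DMU enforces the order.

```c
#include <assert.h>
#include <stdbool.h>

/* Ranks follow the chain in the comment; a lower rank is acquired first. */
enum dmu_lock_rank {
	RANK_DP_CONFIG_RWLOCK,
	RANK_OS_OBJ_LOCK,
	RANK_DN_STRUCT_RWLOCK,
	RANK_DN_DBUFS_MTX,
	RANK_HASH_MUTEXES,
	RANK_DB_MTX,
	RANK_DD_LOCK,
	RANK_LEAF
};

/*
 * Return true if a thread already holding `held` may go on to take `next`.
 * Same-rank nesting (for example dd_lock before an ancestor's dd_lock)
 * is not modelled here.
 */
static bool
dmu_lock_order_ok(enum dmu_lock_rank held, enum dmu_lock_rank next)
{
	return (held < next);
}
```

For example, taking db_mtx while holding dn_struct_rwlock passes the check, while taking dn_dbufs_mtx while holding db_mtx does not.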
--- a/uts/common/fs/zfs/sys/dmu_objset.h
+++ b/uts/common/fs/zfs/sys/dmu_objset.h
@ -0,0 +1,183 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+/* Portions Copyright 2010 Robert Milkowski */
+
+#ifndef	_SYS_DMU_OBJSET_H
+#define	_SYS_DMU_OBJSET_H
+
+#include <sys/spa.h>
+#include <sys/arc.h>
+#include <sys/txg.h>
+#include <sys/zfs_context.h>
+#include <sys/dnode.h>
+#include <sys/zio.h>
+#include <sys/zil.h>
+#include <sys/sa.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+extern krwlock_t os_lock;
+
+struct dsl_dataset;
+struct dmu_tx;
+
+#define	OBJSET_PHYS_SIZE 2048
+#define	OBJSET_OLD_PHYS_SIZE 1024
+
+#define	OBJSET_BUF_HAS_USERUSED(buf) \
+	(arc_buf_size(buf) > OBJSET_OLD_PHYS_SIZE)
+
+#define	OBJSET_FLAG_USERACCOUNTING_COMPLETE	(1ULL<<0)
+
+typedef struct objset_phys {
+	dnode_phys_t os_meta_dnode;
+	zil_header_t os_zil_header;
+	uint64_t os_type;
+	uint64_t os_flags;
+	char os_pad[OBJSET_PHYS_SIZE - sizeof (dnode_phys_t)*3 -
+	    sizeof (zil_header_t) - sizeof (uint64_t)*2];
+	dnode_phys_t os_userused_dnode;
+	dnode_phys_t os_groupused_dnode;
+} objset_phys_t;
+
+struct objset {
+	/* Immutable: */
+	struct dsl_dataset *os_dsl_dataset;
+	spa_t *os_spa;
+	arc_buf_t *os_phys_buf;
+	objset_phys_t *os_phys;
+	/*
+	 * The following "special" dnodes have no parent and are exempt from
+	 * dnode_move(), but they root their descendents in this objset using
+	 * handles anyway, so that all access to dnodes from dbufs consistently
+	 * uses handles.
+	 */
+	dnode_handle_t os_meta_dnode;
+	dnode_handle_t os_userused_dnode;
+	dnode_handle_t os_groupused_dnode;
+	zilog_t *os_zil;
+
+	/* can change, under dsl_dir's locks: */
+	uint8_t os_checksum;
+	uint8_t os_compress;
+	uint8_t os_copies;
+	uint8_t os_dedup_checksum;
+	uint8_t os_dedup_verify;
+	uint8_t os_logbias;
+	uint8_t os_primary_cache;
+	uint8_t os_secondary_cache;
+	uint8_t os_sync;
+
+	/* no lock needed: */
+	struct dmu_tx *os_synctx; /* XXX sketchy */
+	blkptr_t *os_rootbp;
+	zil_header_t os_zil_header;
+	list_t os_synced_dnodes;
+	uint64_t os_flags;
+
+	/* Protected by os_obj_lock */
+	kmutex_t os_obj_lock;
+	uint64_t os_obj_next;
+
+	/* Protected by os_lock */
+	kmutex_t os_lock;
+	list_t os_dirty_dnodes[TXG_SIZE];
+	list_t os_free_dnodes[TXG_SIZE];
+	list_t os_dnodes;
+	list_t os_downgraded_dbufs;
+
+	/* stuff we store for the user */
+	kmutex_t os_user_ptr_lock;
+	void *os_user_ptr;
+
+	/* SA layout/attribute registration */
+	sa_os_t *os_sa;
+};
+
+#define	DMU_META_OBJSET		0
+#define	DMU_META_DNODE_OBJECT	0
+#define	DMU_OBJECT_IS_SPECIAL(obj) ((int64_t)(obj) <= 0)
+#define	DMU_META_DNODE(os)	((os)->os_meta_dnode.dnh_dnode)
+#define	DMU_USERUSED_DNODE(os)	((os)->os_userused_dnode.dnh_dnode)
+#define	DMU_GROUPUSED_DNODE(os)	((os)->os_groupused_dnode.dnh_dnode)
+
+#define	DMU_OS_IS_L2CACHEABLE(os)				\
+	((os)->os_secondary_cache == ZFS_CACHE_ALL ||		\
+	(os)->os_secondary_cache == ZFS_CACHE_METADATA)
+
+/* called from zpl */
+int dmu_objset_hold(const char *name, void *tag, objset_t **osp);
+int dmu_objset_own(const char *name, dmu_objset_type_t type,
+    boolean_t readonly, void *tag, objset_t **osp);
+void dmu_objset_rele(objset_t *os, void *tag);
+void dmu_objset_disown(objset_t *os, void *tag);
+int dmu_objset_from_ds(struct dsl_dataset *ds, objset_t **osp);
+
+int dmu_objset_create(const char *name, dmu_objset_type_t type, uint64_t flags,
+    void (*func)(objset_t *os, void *arg, cred_t *cr, dmu_tx_t *tx), void *arg);
+int dmu_objset_clone(const char *name, struct dsl_dataset *clone_origin,
+    uint64_t flags);
+int dmu_objset_destroy(const char *name, boolean_t defer);
+int dmu_objset_snapshot(char *fsname, char *snapname, char *tag,
+    struct nvlist *props, boolean_t recursive, boolean_t temporary, int fd);
+void dmu_objset_stats(objset_t *os, nvlist_t *nv);
+void dmu_objset_fast_stat(objset_t *os, dmu_objset_stats_t *stat);
+void dmu_objset_space(objset_t *os, uint64_t *refdbytesp, uint64_t *availbytesp,
+    uint64_t *usedobjsp, uint64_t *availobjsp);
+uint64_t dmu_objset_fsid_guid(objset_t *os);
+int dmu_objset_find(char *name, int func(const char *, void *), void *arg,
+    int flags);
+int dmu_objset_find_spa(spa_t *spa, const char *name,
+    int func(spa_t *, uint64_t, const char *, void *), void *arg, int flags);
+int dmu_objset_prefetch(const char *name, void *arg);
+void dmu_objset_byteswap(void *buf, size_t size);
+int dmu_objset_evict_dbufs(objset_t *os);
+timestruc_t dmu_objset_snap_cmtime(objset_t *os);
+
+/* called from dsl */
+void dmu_objset_sync(objset_t *os, zio_t *zio, dmu_tx_t *tx);
+boolean_t dmu_objset_is_dirty(objset_t *os, uint64_t txg);
+boolean_t dmu_objset_is_dirty_anywhere(objset_t *os);
+objset_t *dmu_objset_create_impl(spa_t *spa, struct dsl_dataset *ds,
+    blkptr_t *bp, dmu_objset_type_t type, dmu_tx_t *tx);
+int dmu_objset_open_impl(spa_t *spa, struct dsl_dataset *ds, blkptr_t *bp,
+    objset_t **osp);
+void dmu_objset_evict(objset_t *os);
+void dmu_objset_do_userquota_updates(objset_t *os, dmu_tx_t *tx);
+void dmu_objset_userquota_get_ids(dnode_t *dn, boolean_t before, dmu_tx_t *tx);
+boolean_t dmu_objset_userused_enabled(objset_t *os);
+int dmu_objset_userspace_upgrade(objset_t *os);
+boolean_t dmu_objset_userspace_present(objset_t *os);
+
+void dmu_objset_init(void);
+void dmu_objset_fini(void);
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif /* _SYS_DMU_OBJSET_H */
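Note on objset_phys_t in dmu_objset.h: the os_pad expression sizes the padding so that the two accounting dnodes occupy the last 1024 bytes of the 2048-byte structure. That appears to be why OBJSET_BUF_HAS_USERUSED can decide from the buffer size alone. The sketch below reproduces the layout with stand-in types. The 512-byte dnode matches DNODE_SIZE from dnode.h; the 192-byte ZIL header is an assumption, and the resulting offset does not depend on it as long as the compiler adds no padding of its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	OBJSET_PHYS_SIZE	2048	/* copied from dmu_objset.h */
#define	OBJSET_OLD_PHYS_SIZE	1024	/* copied from dmu_objset.h */

/* Stand-in types; the ZIL header size is an assumption. */
typedef struct { uint8_t dn_bytes[512]; } demo_dnode_phys_t;
typedef struct { uint64_t zh_words[24]; } demo_zil_header_t;

/* Same field order and pad expression as objset_phys_t. */
typedef struct demo_objset_phys {
	demo_dnode_phys_t os_meta_dnode;
	demo_zil_header_t os_zil_header;
	uint64_t os_type;
	uint64_t os_flags;
	char os_pad[OBJSET_PHYS_SIZE - sizeof (demo_dnode_phys_t)*3 -
	    sizeof (demo_zil_header_t) - sizeof (uint64_t)*2];
	demo_dnode_phys_t os_userused_dnode;
	demo_dnode_phys_t os_groupused_dnode;
} demo_objset_phys_t;
```

The pad absorbs whatever the first four fields leave unused, so os_userused_dnode always starts at OBJSET_PHYS_SIZE minus two dnodes, which is 1024.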
--- a/uts/common/fs/zfs/sys/dmu_traverse.h
+++ b/uts/common/fs/zfs/sys/dmu_traverse.h
@ -0,0 +1,64 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#ifndef	_SYS_DMU_TRAVERSE_H
+#define	_SYS_DMU_TRAVERSE_H
+
+#include <sys/zfs_context.h>
+#include <sys/spa.h>
+#include <sys/zio.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+struct dnode_phys;
+struct dsl_dataset;
+struct zilog;
+struct arc_buf;
+
+typedef int (blkptr_cb_t)(spa_t *spa, zilog_t *zilog, const blkptr_t *bp,
+    struct arc_buf *pbuf, const zbookmark_t *zb, const struct dnode_phys *dnp,
+    void *arg);
+
+#define	TRAVERSE_PRE			(1<<0)
+#define	TRAVERSE_POST			(1<<1)
+#define	TRAVERSE_PREFETCH_METADATA	(1<<2)
+#define	TRAVERSE_PREFETCH_DATA		(1<<3)
+#define	TRAVERSE_PREFETCH (TRAVERSE_PREFETCH_METADATA | TRAVERSE_PREFETCH_DATA)
+#define	TRAVERSE_HARD			(1<<4)
+
+/* Special traverse error return value to indicate skipping of children */
+#define	TRAVERSE_VISIT_NO_CHILDREN	-1
+
+int traverse_dataset(struct dsl_dataset *ds,
+    uint64_t txg_start, int flags, blkptr_cb_t func, void *arg);
+int traverse_pool(spa_t *spa,
+    uint64_t txg_start, int flags, blkptr_cb_t func, void *arg);
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif /* _SYS_DMU_TRAVERSE_H */
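Note on the TRAVERSE_* flags in dmu_traverse.h: each flag is a single bit, and TRAVERSE_PREFETCH is the union of the metadata and data prefetch bits. The sketch below shows how a caller-supplied flags word can be tested against them. The helper traverse_wants_prefetch is hypothetical and only illustrates the bit arithmetic, not the traversal code itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Copied from dmu_traverse.h. */
#define	TRAVERSE_PRE			(1<<0)
#define	TRAVERSE_POST			(1<<1)
#define	TRAVERSE_PREFETCH_METADATA	(1<<2)
#define	TRAVERSE_PREFETCH_DATA		(1<<3)
#define	TRAVERSE_PREFETCH (TRAVERSE_PREFETCH_METADATA | TRAVERSE_PREFETCH_DATA)
#define	TRAVERSE_HARD			(1<<4)

/* Hypothetical helper: true if either prefetch bit is set in flags. */
static bool
traverse_wants_prefetch(int flags)
{
	return ((flags & TRAVERSE_PREFETCH) != 0);
}
```

Because the mask covers both bits, a caller that sets only TRAVERSE_PREFETCH_DATA still satisfies the check.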
--- a/uts/common/fs/zfs/sys/dmu_tx.h
+++ b/uts/common/fs/zfs/sys/dmu_tx.h
@ -0,0 +1,148 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright 2010 Sun Microsystems, Inc.  All rights reserved.
+ * Use is subject to license terms.
+ */
+
+#ifndef	_SYS_DMU_TX_H
+#define	_SYS_DMU_TX_H
+
+#include <sys/inttypes.h>
+#include <sys/dmu.h>
+#include <sys/txg.h>
+#include <sys/refcount.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+struct dmu_buf_impl;
+struct dmu_tx_hold;
+struct dnode_link;
+struct dsl_pool;
+struct dnode;
+struct dsl_dir;
+
+struct dmu_tx {
+	/*
+	 * No synchronization is needed because a tx can only be handled
+	 * by one thread.
+	 */
+	list_t tx_holds; /* list of dmu_tx_hold_t */
+	objset_t *tx_objset;
+	struct dsl_dir *tx_dir;
+	struct dsl_pool *tx_pool;
+	uint64_t tx_txg;
+	uint64_t tx_lastsnap_txg;
+	uint64_t tx_lasttried_txg;
+	txg_handle_t tx_txgh;
+	void *tx_tempreserve_cookie;
+	struct dmu_tx_hold *tx_needassign_txh;
+	list_t tx_callbacks; /* list of dmu_tx_callback_t on this dmu_tx */
+	uint8_t tx_anyobj;
+	int tx_err;
+#ifdef ZFS_DEBUG
+	uint64_t tx_space_towrite;
+	uint64_t tx_space_tofree;
+	uint64_t tx_space_tooverwrite;
+	uint64_t tx_space_tounref;
+	refcount_t tx_space_written;
+	refcount_t tx_space_freed;
+#endif
+};
+
+enum dmu_tx_hold_type {
+	THT_NEWOBJECT,
+	THT_WRITE,
+	THT_BONUS,
+	THT_FREE,
+	THT_ZAP,
+	THT_SPACE,
+	THT_SPILL,
+	THT_NUMTYPES
+};
+
+typedef struct dmu_tx_hold {
+	dmu_tx_t *txh_tx;
+	list_node_t txh_node;
+	struct dnode *txh_dnode;
+	uint64_t txh_space_towrite;
+	uint64_t txh_space_tofree;
+	uint64_t txh_space_tooverwrite;
+	uint64_t txh_space_tounref;
+	uint64_t txh_memory_tohold;
+	uint64_t txh_fudge;
+#ifdef ZFS_DEBUG
+	enum dmu_tx_hold_type txh_type;
+	uint64_t txh_arg1;
+	uint64_t txh_arg2;
+#endif
+} dmu_tx_hold_t;
+
+typedef struct dmu_tx_callback {
+	list_node_t		dcb_node;    /* linked to tx_callbacks list */
+	dmu_tx_callback_func_t	*dcb_func;   /* caller function pointer */
+	void			*dcb_data;   /* caller private data */
+} dmu_tx_callback_t;
+
+/*
+ * These routines are defined in dmu.h, and are called by the user.
+ */
+dmu_tx_t *dmu_tx_create(objset_t *dd);
+int dmu_tx_assign(dmu_tx_t *tx, uint64_t txg_how);
+void dmu_tx_commit(dmu_tx_t *tx);
+void dmu_tx_abort(dmu_tx_t *tx);
+uint64_t dmu_tx_get_txg(dmu_tx_t *tx);
+void dmu_tx_wait(dmu_tx_t *tx);
+
+void dmu_tx_callback_register(dmu_tx_t *tx, dmu_tx_callback_func_t *dcb_func,
+    void *dcb_data);
+void dmu_tx_do_callbacks(list_t *cb_list, int error);
+
+/*
+ * These routines are defined in dmu_spa.h, and are called by the SPA.
+ */
+extern dmu_tx_t *dmu_tx_create_assigned(struct dsl_pool *dp, uint64_t txg);
+
+/*
+ * These routines are only called by the DMU.
+ */
+dmu_tx_t *dmu_tx_create_dd(dsl_dir_t *dd);
+int dmu_tx_is_syncing(dmu_tx_t *tx);
+int dmu_tx_private_ok(dmu_tx_t *tx);
+void dmu_tx_add_new_object(dmu_tx_t *tx, objset_t *os, uint64_t object);
+void dmu_tx_willuse_space(dmu_tx_t *tx, int64_t delta);
+void dmu_tx_dirty_buf(dmu_tx_t *tx, struct dmu_buf_impl *db);
+int dmu_tx_holds(dmu_tx_t *tx, uint64_t object);
+void dmu_tx_hold_space(dmu_tx_t *tx, uint64_t space);
+
+#ifdef ZFS_DEBUG
+#define	DMU_TX_DIRTY_BUF(tx, db)	dmu_tx_dirty_buf(tx, db)
+#else
+#define	DMU_TX_DIRTY_BUF(tx, db)
+#endif
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif	/* _SYS_DMU_TX_H */
--- a/uts/common/fs/zfs/sys/dmu_zfetch.h
+++ b/uts/common/fs/zfs/sys/dmu_zfetch.h
@ -0,0 +1,76 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
+ * Use is subject to license terms.
+ */
+
+#ifndef	_DFETCH_H
+#define	_DFETCH_H
+
+#include <sys/zfs_context.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+extern uint64_t	zfetch_array_rd_sz;
+
+struct dnode;				/* so we can reference dnode */
+
+typedef enum zfetch_dirn {
+	ZFETCH_FORWARD = 1,		/* prefetch increasing block numbers */
+	ZFETCH_BACKWARD	= -1		/* prefetch decreasing block numbers */
+} zfetch_dirn_t;
+
+typedef struct zstream {
+	uint64_t	zst_offset;	/* offset of starting block in range */
+	uint64_t	zst_len;	/* length of range, in blocks */
+	zfetch_dirn_t	zst_direction;	/* direction of prefetch */
+	uint64_t	zst_stride;	/* length of stride, in blocks */
+	uint64_t	zst_ph_offset;	/* prefetch offset, in blocks */
+	uint64_t	zst_cap;	/* prefetch limit (cap), in blocks */
+	kmutex_t	zst_lock;	/* protects stream */
+	clock_t		zst_last;	/* lbolt of last prefetch */
+	avl_node_t	zst_node;	/* embed avl node here */
+} zstream_t;
+
+typedef struct zfetch {
+	krwlock_t	zf_rwlock;	/* protects zfetch structure */
+	list_t		zf_stream;	/* list of zstream_t's */
+	struct dnode	*zf_dnode;	/* dnode that owns this zfetch */
+	uint32_t	zf_stream_cnt;	/* # of active streams */
+	uint64_t	zf_alloc_fail;	/* # of failed attempts to alloc strm */
+} zfetch_t;
+
+void		zfetch_init(void);
+void		zfetch_fini(void);
+
+void		dmu_zfetch_init(zfetch_t *, struct dnode *);
+void		dmu_zfetch_rele(zfetch_t *);
+void		dmu_zfetch(zfetch_t *, uint64_t, uint64_t, int);
+
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif	/* _DFETCH_H */
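Note on dnode.h, the next file in this change: its "Derived constants" section builds every size from a handful of shift values. The sketch below evaluates those macros so the resulting numbers are visible. SPA_MINBLOCKSHIFT and SPA_BLKPTRSHIFT come from sys/spa.h, which is not shown in this part of the diff; the values 9 and 7 used here are assumptions (512-byte minimum block, 128-byte block pointer).

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values from sys/spa.h. */
#define	SPA_MINBLOCKSHIFT	9
#define	SPA_BLKPTRSHIFT		7

/* Copied from dnode.h. */
#define	DNODE_SHIFT		9
#define	DN_MIN_INDBLKSHIFT	10
#define	DN_MAX_INDBLKSHIFT	14
#define	DNODE_BLOCK_SHIFT	14
#define	DNODE_CORE_SIZE		64
#define	DN_MAX_OFFSET_SHIFT	64

#define	DNODE_SIZE	(1 << DNODE_SHIFT)
#define	DN_MAX_NBLKPTR	((DNODE_SIZE - DNODE_CORE_SIZE) >> SPA_BLKPTRSHIFT)
#define	DN_MAX_BONUSLEN	(DNODE_SIZE - DNODE_CORE_SIZE - (1 << SPA_BLKPTRSHIFT))
#define	DNODES_PER_BLOCK_SHIFT	(DNODE_BLOCK_SHIFT - DNODE_SHIFT)
#define	DNODES_PER_BLOCK	(1ULL << DNODES_PER_BLOCK_SHIFT)
#define	DNODES_PER_LEVEL_SHIFT	(DN_MAX_INDBLKSHIFT - SPA_BLKPTRSHIFT)
#define	DNODES_PER_LEVEL	(1ULL << DNODES_PER_LEVEL_SHIFT)
#define	DN_MAX_LEVELS	(2 + ((DN_MAX_OFFSET_SHIFT - SPA_MINBLOCKSHIFT) / \
	(DN_MIN_INDBLKSHIFT - SPA_BLKPTRSHIFT)))

/* Compile-time record of the values under the assumptions above. */
enum {
	DEMO_DNODE_SIZE = DNODE_SIZE,		/* 512 bytes per dnode */
	DEMO_MAX_NBLKPTR = DN_MAX_NBLKPTR,	/* (512 - 64) / 128 */
	DEMO_MAX_BONUSLEN = DN_MAX_BONUSLEN,	/* 512 - 64 - 128 */
	DEMO_MAX_LEVELS = DN_MAX_LEVELS		/* 2 + 55 / 3 */
};
```

Under these assumptions a dnode holds at most 3 block pointers or 320 bytes of bonus data, a 16k dnode block holds 32 dnodes, and the integer division in DN_MAX_LEVELS truncates 55/3 to 18 before the +2 rounds it up to 20.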
--- a/uts/common/fs/zfs/sys/dnode.h
+++ b/uts/common/fs/zfs/sys/dnode.h
@ -0,0 +1,329 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#ifndef	_SYS_DNODE_H
+#define	_SYS_DNODE_H
+
+#include <sys/zfs_context.h>
+#include <sys/avl.h>
+#include <sys/spa.h>
+#include <sys/txg.h>
+#include <sys/zio.h>
+#include <sys/refcount.h>
+#include <sys/dmu_zfetch.h>
+#include <sys/zrlock.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+/*
+ * dnode_hold() flags.
+ */
+#define	DNODE_MUST_BE_ALLOCATED	1
+#define	DNODE_MUST_BE_FREE	2
+
+/*
+ * dnode_next_offset() flags.
+ */
+#define	DNODE_FIND_HOLE		1
+#define	DNODE_FIND_BACKWARDS	2
+#define	DNODE_FIND_HAVELOCK	4
+
+/*
+ * Fixed constants.
+ */
+#define	DNODE_SHIFT		9	/* 512 bytes */
+#define	DN_MIN_INDBLKSHIFT	10	/* 1k */
+#define	DN_MAX_INDBLKSHIFT	14	/* 16k */
+#define	DNODE_BLOCK_SHIFT	14	/* 16k */
+#define	DNODE_CORE_SIZE		64	/* 64 bytes for dnode sans blkptrs */
+#define	DN_MAX_OBJECT_SHIFT	48	/* 2^48 objects (zfs_fid_t limit) */
+#define	DN_MAX_OFFSET_SHIFT	64	/* 2^64 bytes in a dnode */
+
+/*
+ * dnode id flags
+ *
+ * Note: a file will never ever have its
+ * ids moved from bonus->spill
+ * and only in a crypto environment would it be on spill
+ */
+#define	DN_ID_CHKED_BONUS	0x1
+#define	DN_ID_CHKED_SPILL	0x2
+#define	DN_ID_OLD_EXIST		0x4
+#define	DN_ID_NEW_EXIST		0x8
+
+/*
+ * Derived constants.
+ */
+#define	DNODE_SIZE	(1 << DNODE_SHIFT)
+#define	DN_MAX_NBLKPTR	((DNODE_SIZE - DNODE_CORE_SIZE) >> SPA_BLKPTRSHIFT)
+#define	DN_MAX_BONUSLEN	(DNODE_SIZE - DNODE_CORE_SIZE - (1 << SPA_BLKPTRSHIFT))
+#define	DN_MAX_OBJECT	(1ULL << DN_MAX_OBJECT_SHIFT)
+#define	DN_ZERO_BONUSLEN	(DN_MAX_BONUSLEN + 1)
+#define	DN_KILL_SPILLBLK (1)
+
+#define	DNODES_PER_BLOCK_SHIFT	(DNODE_BLOCK_SHIFT - DNODE_SHIFT)
+#define	DNODES_PER_BLOCK	(1ULL << DNODES_PER_BLOCK_SHIFT)
+#define	DNODES_PER_LEVEL_SHIFT	(DN_MAX_INDBLKSHIFT - SPA_BLKPTRSHIFT)
+#define	DNODES_PER_LEVEL	(1ULL << DNODES_PER_LEVEL_SHIFT)
+
+/* The +2 here is a cheesy way to round up */
+#define	DN_MAX_LEVELS	(2 + ((DN_MAX_OFFSET_SHIFT - SPA_MINBLOCKSHIFT) / \
+	(DN_MIN_INDBLKSHIFT - SPA_BLKPTRSHIFT)))
+
+#define	DN_BONUS(dnp)	((void*)((dnp)->dn_bonus + \
+	(((dnp)->dn_nblkptr - 1) * sizeof (blkptr_t))))
+
+#define	DN_USED_BYTES(dnp) (((dnp)->dn_flags & DNODE_FLAG_USED_BYTES) ? \
+	(dnp)->dn_used : (dnp)->dn_used << SPA_MINBLOCKSHIFT)
+
+#define	EPB(blkshift, typeshift)	(1 << (blkshift - typeshift))
+
+struct dmu_buf_impl;
+struct objset;
+struct zio;
+
+enum dnode_dirtycontext {
+	DN_UNDIRTIED,
+	DN_DIRTY_OPEN,
+	DN_DIRTY_SYNC
+};
+
+/* Is dn_used in bytes?  if not, it's in multiples of SPA_MINBLOCKSIZE */
+#define	DNODE_FLAG_USED_BYTES		(1<<0)
+#define	DNODE_FLAG_USERUSED_ACCOUNTED	(1<<1)
+
+/* Does dnode have a SA spill blkptr in bonus? */
+#define	DNODE_FLAG_SPILL_BLKPTR	(1<<2)
+
+typedef struct dnode_phys {
+	uint8_t dn_type;		/* dmu_object_type_t */
+	uint8_t dn_indblkshift;		/* log2(indirect block size) */
+	uint8_t dn_nlevels;		/* 1=dn_blkptr->data blocks */
+	uint8_t dn_nblkptr;		/* length of dn_blkptr */
+	uint8_t dn_bonustype;		/* type of data in bonus buffer */
+	uint8_t	dn_checksum;		/* ZIO_CHECKSUM type */
+	uint8_t	dn_compress;		/* ZIO_COMPRESS type */
+	uint8_t dn_flags;		/* DNODE_FLAG_* */
+	uint16_t dn_datablkszsec;	/* data block size in 512b sectors */
+	uint16_t dn_bonuslen;		/* length of dn_bonus */
+	uint8_t dn_pad2[4];
+
+	/* accounting is protected by dn_dirty_mtx */
+	uint64_t dn_maxblkid;		/* largest allocated block ID */
+	uint64_t dn_used;		/* bytes (or sectors) of disk space */
+
+	uint64_t dn_pad3[4];
+
+	blkptr_t dn_blkptr[1];
+	uint8_t dn_bonus[DN_MAX_BONUSLEN - sizeof (blkptr_t)];
+	blkptr_t dn_spill;
+} dnode_phys_t;
+
+typedef struct dnode {
+	/*
+	 * dn_struct_rwlock protects the structure of the dnode,
+	 * including the number of levels of indirection (dn_nlevels),
+	 * dn_maxblkid, and dn_next_*
+	 */
+	krwlock_t dn_struct_rwlock;
+
+	/* Our link on dn_objset->os_dnodes list; protected by os_lock.  */
+	list_node_t dn_link;
+
+	/* immutable: */
+	struct objset *dn_objset;
+	uint64_t dn_object;
+	struct dmu_buf_impl *dn_dbuf;
+	struct dnode_handle *dn_handle;
+	dnode_phys_t *dn_phys; /* pointer into dn->dn_dbuf->db.db_data */
+
+	/*
+	 * Copies of stuff in dn_phys.  They're valid in the open
+	 * context (eg. even before the dnode is first synced).
+	 * Where necessary, these are protected by dn_struct_rwlock.
+	 */
+	dmu_object_type_t dn_type;	/* object type */
+	uint16_t dn_bonuslen;		/* bonus length */
+	uint8_t dn_bonustype;		/* bonus type */
+	uint8_t dn_nblkptr;		/* number of blkptrs (immutable) */
+	uint8_t dn_checksum;		/* ZIO_CHECKSUM type */
+	uint8_t dn_compress;		/* ZIO_COMPRESS type */
+	uint8_t dn_nlevels;
+	uint8_t dn_indblkshift;
+	uint8_t dn_datablkshift;	/* zero if blksz not power of 2! */
+	uint8_t dn_moved;		/* Has this dnode been moved? */
+	uint16_t dn_datablkszsec;	/* in 512b sectors */
+	uint32_t dn_datablksz;		/* in bytes */
+	uint64_t dn_maxblkid;
+	uint8_t dn_next_nblkptr[TXG_SIZE];
+	uint8_t dn_next_nlevels[TXG_SIZE];
+	uint8_t dn_next_indblkshift[TXG_SIZE];
+	uint8_t dn_next_bonustype[TXG_SIZE];
+	uint8_t dn_rm_spillblk[TXG_SIZE];	/* for removing spill blk */
+	uint16_t dn_next_bonuslen[TXG_SIZE];
+	uint32_t dn_next_blksz[TXG_SIZE];	/* next block size in bytes */
+
+	/* protected by dn_dbufs_mtx; declared here to fill 32-bit hole */
+	uint32_t dn_dbufs_count;	/* count of dn_dbufs */
+
+	/* protected by os_lock: */
+	list_node_t dn_dirty_link[TXG_SIZE];	/* next on dataset's dirty */
+
+	/* protected by dn_mtx: */
+	kmutex_t dn_mtx;
+	list_t dn_dirty_records[TXG_SIZE];
+	avl_tree_t dn_ranges[TXG_SIZE];
+	uint64_t dn_allocated_txg;
+	uint64_t dn_free_txg;
+	uint64_t dn_assigned_txg;
+	kcondvar_t dn_notxholds;
+	enum dnode_dirtycontext dn_dirtyctx;
+	uint8_t *dn_dirtyctx_firstset;		/* dbg: contents meaningless */
+
+	/* protected by own devices */
+	refcount_t dn_tx_holds;
+	refcount_t dn_holds;
+
+	kmutex_t dn_dbufs_mtx;
+	list_t dn_dbufs;		/* descendent dbufs */
+
+	/* protected by dn_struct_rwlock */
+	struct dmu_buf_impl *dn_bonus;	/* bonus buffer dbuf */
+
+	boolean_t dn_have_spill;	/* have spill or are spilling */
+
+	/* parent IO for current sync write */
+	zio_t *dn_zio;
+
+	/* used in syncing context */
+	uint64_t dn_oldused;	/* old phys used bytes */
+	uint64_t dn_oldflags;	/* old phys dn_flags */
+	uint64_t dn_olduid, dn_oldgid;
+	uint64_t dn_newuid, dn_newgid;
+	int dn_id_flags;
+
+	/* holds prefetch structure */
+	struct zfetch	dn_zfetch;
+} dnode_t;
+
+/*
+ * Adds a level of indirection between the dbuf and the dnode to avoid
+ * iterating descendent dbufs in dnode_move(). Handles are not allocated
+ * individually, but as an array of child dnodes in dnode_hold_impl().
+ */
+typedef struct dnode_handle {
+	/* Protects dnh_dnode from modification by dnode_move(). */
+	zrlock_t dnh_zrlock;
+	dnode_t *dnh_dnode;
+} dnode_handle_t;
+
+typedef struct dnode_children {
+	size_t dnc_count;		/* number of children */
+	dnode_handle_t dnc_children[1];	/* sized dynamically */
+} dnode_children_t;
+
+typedef struct free_range {
+	avl_node_t fr_node;
+	uint64_t fr_blkid;
+	uint64_t fr_nblks;
+} free_range_t;
+
+dnode_t *dnode_special_open(struct objset *dd, dnode_phys_t *dnp,
+    uint64_t object, dnode_handle_t *dnh);
+void dnode_special_close(dnode_handle_t *dnh);
+
+void dnode_setbonuslen(dnode_t *dn, int newsize, dmu_tx_t *tx);
+void dnode_setbonus_type(dnode_t *dn, dmu_object_type_t, dmu_tx_t *tx);
+void dnode_rm_spill(dnode_t *dn, dmu_tx_t *tx);
+
+int dnode_hold(struct objset *dd, uint64_t object,
+    void *ref, dnode_t **dnp);
+int dnode_hold_impl(struct objset *dd, uint64_t object, int flag,
+    void *ref, dnode_t **dnp);
+boolean_t dnode_add_ref(dnode_t *dn, void *ref);
+void dnode_rele(dnode_t *dn, void *ref);
+void dnode_setdirty(dnode_t *dn, dmu_tx_t *tx);
+void dnode_sync(dnode_t *dn, dmu_tx_t *tx);
+void dnode_allocate(dnode_t *dn, dmu_object_type_t ot, int blocksize, int ibs,
+    dmu_object_type_t bonustype, int bonuslen, dmu_tx_t *tx);
+void dnode_reallocate(dnode_t *dn, dmu_object_type_t ot, int blocksize,
+    dmu_object_type_t bonustype, int bonuslen, dmu_tx_t *tx);
+void dnode_free(dnode_t *dn, dmu_tx_t *tx);
+void dnode_byteswap(dnode_phys_t *dnp);
+void dnode_buf_byteswap(void *buf, size_t size);
+void dnode_verify(dnode_t *dn);
+int dnode_set_blksz(dnode_t *dn, uint64_t size, int ibs, dmu_tx_t *tx);
+uint64_t dnode_current_max_length(dnode_t *dn);
+void dnode_free_range(dnode_t *dn, uint64_t off, uint64_t len, dmu_tx_t *tx);
+void dnode_clear_range(dnode_t *dn, uint64_t blkid,
+    uint64_t nblks, dmu_tx_t *tx);
+void dnode_diduse_space(dnode_t *dn, int64_t space);
+void dnode_willuse_space(dnode_t *dn, int64_t space, dmu_tx_t *tx);
+void dnode_new_blkid(dnode_t *dn, uint64_t blkid, dmu_tx_t *tx, boolean_t);
+uint64_t dnode_block_freed(dnode_t *dn, uint64_t blkid);
+void dnode_init(void);
+void dnode_fini(void);
+int dnode_next_offset(dnode_t *dn, int flags, uint64_t *off,
+    int minlvl, uint64_t blkfill, uint64_t txg);
+void dnode_evict_dbufs(dnode_t *dn);
+
+#ifdef ZFS_DEBUG
+
+/*
+ * There should be a ## between the string literal and fmt, to make it
+ * clear that we're joining two strings together, but gcc does not
+ * support that use of the ## preprocessor token.
+ */
+#define	dprintf_dnode(dn, fmt, ...) do { \
+	if (zfs_flags & ZFS_DEBUG_DPRINTF) { \
+	char __db_buf[32]; \
+	uint64_t __db_obj = (dn)->dn_object; \
+	if (__db_obj == DMU_META_DNODE_OBJECT) \
+		(void) strcpy(__db_buf, "mdn"); \
+	else \
+		(void) snprintf(__db_buf, sizeof (__db_buf), "%lld", \
+		    (u_longlong_t)__db_obj);\
+	dprintf_ds((dn)->dn_objset->os_dsl_dataset, "obj=%s " fmt, \
+	    __db_buf, __VA_ARGS__); \
+	} \
+_NOTE(CONSTCOND) } while (0)
+
+#define	DNODE_VERIFY(dn)		dnode_verify(dn)
+#define	FREE_VERIFY(db, start, end, tx)	free_verify(db, start, end, tx)
+
+#else
+
+#define	dprintf_dnode(dn, fmt, ...)
+#define	DNODE_VERIFY(dn)
+#define	FREE_VERIFY(db, start, end, tx)
+
+#endif
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif	/* _SYS_DNODE_H */
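Reviewer note (illustrative only, not part of the imported header): dnode_children_t ends in a one-element array marked "sized dynamically", and the comment on dnode_handle_t says handles are allocated as one array rather than individually. The sketch below shows one common way to size such an allocation. All demo_* names are hypothetical stand-ins, and calloc() stands in for the kernel allocator; the real allocation in dnode_hold_impl() may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-ins for illustration only; the real types are in the header above. */
typedef struct demo_handle {
	uint64_t dh_lock;	/* placeholder for zrlock_t */
	void *dh_dnode;		/* placeholder for dnode_t * */
} demo_handle_t;

typedef struct demo_children {
	size_t dc_count;		/* number of children */
	demo_handle_t dc_children[1];	/* sized dynamically */
} demo_children_t;

/* Bytes needed for the header plus `count` handles (count >= 1). */
static size_t
demo_children_size(size_t count)
{
	return (offsetof(demo_children_t, dc_children) +
	    count * sizeof (demo_handle_t));
}

/* Allocate a zeroed children array; returns NULL on failure. */
static demo_children_t *
demo_children_alloc(size_t count)
{
	demo_children_t *dc = calloc(1, demo_children_size(count));

	if (dc != NULL)
		dc->dc_count = count;
	return (dc);
}
```

Computing the size from offsetof() rather than sizeof (demo_children_t) avoids counting the placeholder element twice. Whoever frees the array has to recompute the same size from dc_count if the allocator requires it, as kmem_free() does.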
--- a/uts/common/fs/zfs/sys/dsl_dataset.h
+++ b/uts/common/fs/zfs/sys/dsl_dataset.h
@ -0,0 +1,283 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#ifndef	_SYS_DSL_DATASET_H
+#define	_SYS_DSL_DATASET_H
+
+#include <sys/dmu.h>
+#include <sys/spa.h>
+#include <sys/txg.h>
+#include <sys/zio.h>
+#include <sys/bplist.h>
+#include <sys/dsl_synctask.h>
+#include <sys/zfs_context.h>
+#include <sys/dsl_deadlist.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+struct dsl_dataset;
+struct dsl_dir;
+struct dsl_pool;
+
+#define	DS_FLAG_INCONSISTENT	(1ULL<<0)
+#define	DS_IS_INCONSISTENT(ds)	\
+	((ds)->ds_phys->ds_flags & DS_FLAG_INCONSISTENT)
+/*
+ * NB: nopromote cannot yet be set, but we want support for it in this
+ * on-disk version, so that we don't need to upgrade for it later.  It
+ * will be needed when we implement 'zfs split' (where the split-off
+ * clone should not be promoted).
+ */
+#define	DS_FLAG_NOPROMOTE	(1ULL<<1)
+
+/*
+ * DS_FLAG_UNIQUE_ACCURATE is set if ds_unique_bytes has been correctly
+ * calculated for head datasets (starting with SPA_VERSION_UNIQUE_ACCURATE,
+ * refquota/refreservations).
+ */
+#define	DS_FLAG_UNIQUE_ACCURATE	(1ULL<<2)
+
+/*
+ * DS_FLAG_DEFER_DESTROY is set after 'zfs destroy -d' has been called
+ * on a dataset. This allows the dataset to be destroyed using 'zfs release'.
+ */
+#define	DS_FLAG_DEFER_DESTROY	(1ULL<<3)
+#define	DS_IS_DEFER_DESTROY(ds)	\
+	((ds)->ds_phys->ds_flags & DS_FLAG_DEFER_DESTROY)
+
+/*
+ * DS_FLAG_CI_DATASET is set if the dataset contains a file system whose
+ * name lookups should be performed case-insensitively.
+ */
+#define	DS_FLAG_CI_DATASET	(1ULL<<16)
+
+typedef struct dsl_dataset_phys {
+	uint64_t ds_dir_obj;		/* DMU_OT_DSL_DIR */
+	uint64_t ds_prev_snap_obj;	/* DMU_OT_DSL_DATASET */
+	uint64_t ds_prev_snap_txg;
+	uint64_t ds_next_snap_obj;	/* DMU_OT_DSL_DATASET */
+	uint64_t ds_snapnames_zapobj;	/* DMU_OT_DSL_DS_SNAP_MAP 0 for snaps */
+	uint64_t ds_num_children;	/* clone/snap children; ==0 for head */
+	uint64_t ds_creation_time;	/* seconds since 1970 */
+	uint64_t ds_creation_txg;
+	uint64_t ds_deadlist_obj;	/* DMU_OT_DEADLIST */
+	uint64_t ds_used_bytes;
+	uint64_t ds_compressed_bytes;
+	uint64_t ds_uncompressed_bytes;
+	uint64_t ds_unique_bytes;	/* only relevant to snapshots */
+	/*
+	 * The ds_fsid_guid is a 56-bit ID that can change to avoid
+	 * collisions.  The ds_guid is a 64-bit ID that will never
+	 * change, so there is a small probability that it will collide.
+	 */
+	uint64_t ds_fsid_guid;
+	uint64_t ds_guid;
+	uint64_t ds_flags;		/* DS_FLAG_* */
+	blkptr_t ds_bp;
+	uint64_t ds_next_clones_obj;	/* DMU_OT_DSL_CLONES */
+	uint64_t ds_props_obj;		/* DMU_OT_DSL_PROPS for snaps */
+	uint64_t ds_userrefs_obj;	/* DMU_OT_USERREFS */
+	uint64_t ds_pad[5]; /* pad out to 320 bytes for good measure */
+} dsl_dataset_phys_t;
+
+typedef struct dsl_dataset {
+	/* Immutable: */
+	struct dsl_dir *ds_dir;
+	dsl_dataset_phys_t *ds_phys;
+	dmu_buf_t *ds_dbuf;
+	uint64_t ds_object;
+	uint64_t ds_fsid_guid;
+
+	/* only used in syncing context, only valid for non-snapshots: */
+	struct dsl_dataset *ds_prev;
+
+	/* has internal locking: */
+	dsl_deadlist_t ds_deadlist;
+	bplist_t ds_pending_deadlist;
+
+	/* to protect against multiple concurrent incremental recv */
+	kmutex_t ds_recvlock;
+
+	/* protected by lock on pool's dp_dirty_datasets list */
+	txg_node_t ds_dirty_link;
+	list_node_t ds_synced_link;
+
+	/*
+	 * ds_phys->ds_<accounting> is also protected by ds_lock.
+	 * Protected by ds_lock:
+	 */
+	kmutex_t ds_lock;
+	objset_t *ds_objset;
+	uint64_t ds_userrefs;
+
+	/*
+	 * ds_owner is protected by the ds_rwlock and the ds_lock
+	 */
+	krwlock_t ds_rwlock;
+	kcondvar_t ds_exclusive_cv;
+	void *ds_owner;
+
+	/* no locking; only for making guesses */
+	uint64_t ds_trysnap_txg;
+
+	/* for objset_open() */
+	kmutex_t ds_opening_lock;
+
+	uint64_t ds_reserved;	/* cached refreservation */
+	uint64_t ds_quota;	/* cached refquota */
+
+	/* Protected by ds_lock; keep at end of struct for better locality */
+	char ds_snapname[MAXNAMELEN];
+} dsl_dataset_t;
+
+struct dsl_ds_destroyarg {
+	dsl_dataset_t *ds;		/* ds to destroy */
+	dsl_dataset_t *rm_origin;	/* also remove our origin? */
+	boolean_t is_origin_rm;		/* set if removing origin snap */
+	boolean_t defer;		/* destroy -d requested? */
+	boolean_t releasing;		/* destroying due to release? */
+	boolean_t need_prep;		/* do we need to retry due to EBUSY? */
+};
+
+/*
+ * The max length of a temporary tag prefix is the number of hex digits
+ * required to express UINT64_MAX plus one for the hyphen.
+ */
+#define	MAX_TAG_PREFIX_LEN	17
+
+struct dsl_ds_holdarg {
+	dsl_sync_task_group_t *dstg;
+	char *htag;
+	char *snapname;
+	boolean_t recursive;
+	boolean_t gotone;
+	boolean_t temphold;
+	char failed[MAXPATHLEN];
+};
+
+#define	dsl_dataset_is_snapshot(ds) \
+	((ds)->ds_phys->ds_num_children != 0)
+
+#define	DS_UNIQUE_IS_ACCURATE(ds)	\
+	(((ds)->ds_phys->ds_flags & DS_FLAG_UNIQUE_ACCURATE) != 0)
+
+int dsl_dataset_hold(const char *name, void *tag, dsl_dataset_t **dsp);
+int dsl_dataset_hold_obj(struct dsl_pool *dp, uint64_t dsobj,
+    void *tag, dsl_dataset_t **);
+int dsl_dataset_own(const char *name, boolean_t inconsistentok,
+    void *tag, dsl_dataset_t **dsp);
+int dsl_dataset_own_obj(struct dsl_pool *dp, uint64_t dsobj,
+    boolean_t inconsistentok, void *tag, dsl_dataset_t **dsp);
+void dsl_dataset_name(dsl_dataset_t *ds, char *name);
+void dsl_dataset_rele(dsl_dataset_t *ds, void *tag);
+void dsl_dataset_disown(dsl_dataset_t *ds, void *tag);
+void dsl_dataset_drop_ref(dsl_dataset_t *ds, void *tag);
+boolean_t dsl_dataset_tryown(dsl_dataset_t *ds, boolean_t inconsistentok,
+    void *tag);
+void dsl_dataset_make_exclusive(dsl_dataset_t *ds, void *tag);
+void dsl_register_onexit_hold_cleanup(dsl_dataset_t *ds, const char *htag,
+    minor_t minor);
+uint64_t dsl_dataset_create_sync(dsl_dir_t *pds, const char *lastname,
+    dsl_dataset_t *origin, uint64_t flags, cred_t *, dmu_tx_t *);
+uint64_t dsl_dataset_create_sync_dd(dsl_dir_t *dd, dsl_dataset_t *origin,
+    uint64_t flags, dmu_tx_t *tx);
+int dsl_dataset_destroy(dsl_dataset_t *ds, void *tag, boolean_t defer);
+int dsl_snapshots_destroy(char *fsname, char *snapname, boolean_t defer);
+dsl_checkfunc_t dsl_dataset_destroy_check;
+dsl_syncfunc_t dsl_dataset_destroy_sync;
+dsl_checkfunc_t dsl_dataset_snapshot_check;
+dsl_syncfunc_t dsl_dataset_snapshot_sync;
+dsl_syncfunc_t dsl_dataset_user_hold_sync;
+int dsl_dataset_rename(char *name, const char *newname, boolean_t recursive);
+int dsl_dataset_promote(const char *name, char *conflsnap);
+int dsl_dataset_clone_swap(dsl_dataset_t *clone, dsl_dataset_t *origin_head,
+    boolean_t force);
+int dsl_dataset_user_hold(char *dsname, char *snapname, char *htag,
+    boolean_t recursive, boolean_t temphold, int cleanup_fd);
+int dsl_dataset_user_hold_for_send(dsl_dataset_t *ds, char *htag,
+    boolean_t temphold);
+int dsl_dataset_user_release(char *dsname, char *snapname, char *htag,
+    boolean_t recursive);
+int dsl_dataset_user_release_tmp(struct dsl_pool *dp, uint64_t dsobj,
+    char *htag, boolean_t retry);
+int dsl_dataset_get_holds(const char *dsname, nvlist_t **nvp);
+
+blkptr_t *dsl_dataset_get_blkptr(dsl_dataset_t *ds);
+void dsl_dataset_set_blkptr(dsl_dataset_t *ds, blkptr_t *bp, dmu_tx_t *tx);
+
+spa_t *dsl_dataset_get_spa(dsl_dataset_t *ds);
+
+boolean_t dsl_dataset_modified_since_lastsnap(dsl_dataset_t *ds);
+
+void dsl_dataset_sync(dsl_dataset_t *os, zio_t *zio, dmu_tx_t *tx);
+
+void dsl_dataset_block_born(dsl_dataset_t *ds, const blkptr_t *bp,
+    dmu_tx_t *tx);
+int dsl_dataset_block_kill(dsl_dataset_t *ds, const blkptr_t *bp,
+    dmu_tx_t *tx, boolean_t async);
+boolean_t dsl_dataset_block_freeable(dsl_dataset_t *ds, const blkptr_t *bp,
+    uint64_t blk_birth);
+uint64_t dsl_dataset_prev_snap_txg(dsl_dataset_t *ds);
+
+void dsl_dataset_dirty(dsl_dataset_t *ds, dmu_tx_t *tx);
+void dsl_dataset_stats(dsl_dataset_t *os, nvlist_t *nv);
+void dsl_dataset_fast_stat(dsl_dataset_t *ds, dmu_objset_stats_t *stat);
+void dsl_dataset_space(dsl_dataset_t *ds,
+    uint64_t *refdbytesp, uint64_t *availbytesp,
+    uint64_t *usedobjsp, uint64_t *availobjsp);
+uint64_t dsl_dataset_fsid_guid(dsl_dataset_t *ds);
+
+int dsl_dsobj_to_dsname(char *pname, uint64_t obj, char *buf);
+
+int dsl_dataset_check_quota(dsl_dataset_t *ds, boolean_t check_quota,
+    uint64_t asize, uint64_t inflight, uint64_t *used,
+    uint64_t *ref_rsrv);
+int dsl_dataset_set_quota(const char *dsname, zprop_source_t source,
+    uint64_t quota);
+dsl_syncfunc_t dsl_dataset_set_quota_sync;
+int dsl_dataset_set_reservation(const char *dsname, zprop_source_t source,
+    uint64_t reservation);
+
+int dsl_destroy_inconsistent(const char *dsname, void *arg);
+
+#ifdef ZFS_DEBUG
+#define	dprintf_ds(ds, fmt, ...) do { \
+	if (zfs_flags & ZFS_DEBUG_DPRINTF) { \
+	char *__ds_name = kmem_alloc(MAXNAMELEN, KM_SLEEP); \
+	dsl_dataset_name(ds, __ds_name); \
+	dprintf("ds=%s " fmt, __ds_name, __VA_ARGS__); \
+	kmem_free(__ds_name, MAXNAMELEN); \
+	} \
+_NOTE(CONSTCOND) } while (0)
+#else
+#define	dprintf_ds(ds, fmt, ...)
+#endif
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif /* _SYS_DSL_DATASET_H */
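Reviewer note (illustrative only, not part of the imported header): the comment on MAX_TAG_PREFIX_LEN derives 17 from the 16 hex digits of UINT64_MAX plus one hyphen. The sketch below checks that arithmetic. The "<hex id>-" layout and the demo_* names are assumptions made for illustration; the header does not say how the prefix is actually built.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define	DEMO_MAX_TAG_PREFIX_LEN	17	/* mirrors MAX_TAG_PREFIX_LEN above */

/*
 * Format a hypothetical "<hex id>-" prefix into buf and return its
 * length, excluding the terminating NUL.
 */
static size_t
demo_tag_prefix(char *buf, size_t buflen, uint64_t id)
{
	(void) snprintf(buf, buflen, "%" PRIx64 "-", id);
	return (strlen(buf));
}
```

A caller would size its buffer as DEMO_MAX_TAG_PREFIX_LEN + 1 to leave room for the terminator. Smaller ids produce shorter prefixes, so 17 is an upper bound rather than a fixed width.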
--- a/uts/common/fs/zfs/sys/dsl_deadlist.h
+++ b/uts/common/fs/zfs/sys/dsl_deadlist.h
@ -0,0 +1,87 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#ifndef	_SYS_DSL_DEADLIST_H
+#define	_SYS_DSL_DEADLIST_H
+
+#include <sys/bpobj.h>
+#include <sys/zfs_context.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+struct dmu_buf;
+struct dsl_dataset;
+
+typedef struct dsl_deadlist_phys {
+	uint64_t dl_used;
+	uint64_t dl_comp;
+	uint64_t dl_uncomp;
+	uint64_t dl_pad[37]; /* pad out to 320 bytes for future expansion */
+} dsl_deadlist_phys_t;
+
+typedef struct dsl_deadlist {
+	objset_t *dl_os;
+	uint64_t dl_object;
+	avl_tree_t dl_tree;
+	boolean_t dl_havetree;
+	struct dmu_buf *dl_dbuf;
+	dsl_deadlist_phys_t *dl_phys;
+	kmutex_t dl_lock;
+
+	/* if it's the old on-disk format: */
+	bpobj_t dl_bpobj;
+	boolean_t dl_oldfmt;
+} dsl_deadlist_t;
+
+typedef struct dsl_deadlist_entry {
+	avl_node_t dle_node;
+	uint64_t dle_mintxg;
+	bpobj_t dle_bpobj;
+} dsl_deadlist_entry_t;
+
+void dsl_deadlist_open(dsl_deadlist_t *dl, objset_t *os, uint64_t object);
+void dsl_deadlist_close(dsl_deadlist_t *dl);
+uint64_t dsl_deadlist_alloc(objset_t *os, dmu_tx_t *tx);
+void dsl_deadlist_free(objset_t *os, uint64_t dlobj, dmu_tx_t *tx);
+void dsl_deadlist_insert(dsl_deadlist_t *dl, const blkptr_t *bp, dmu_tx_t *tx);
+void dsl_deadlist_add_key(dsl_deadlist_t *dl, uint64_t mintxg, dmu_tx_t *tx);
+void dsl_deadlist_remove_key(dsl_deadlist_t *dl, uint64_t mintxg, dmu_tx_t *tx);
+uint64_t dsl_deadlist_clone(dsl_deadlist_t *dl, uint64_t maxtxg,
+    uint64_t mrs_obj, dmu_tx_t *tx);
+void dsl_deadlist_space(dsl_deadlist_t *dl,
+    uint64_t *usedp, uint64_t *compp, uint64_t *uncompp);
+void dsl_deadlist_space_range(dsl_deadlist_t *dl,
+    uint64_t mintxg, uint64_t maxtxg,
+    uint64_t *usedp, uint64_t *compp, uint64_t *uncompp);
+void dsl_deadlist_merge(dsl_deadlist_t *dl, uint64_t obj, dmu_tx_t *tx);
+void dsl_deadlist_move_bpobj(dsl_deadlist_t *dl, bpobj_t *bpo, uint64_t mintxg,
+    dmu_tx_t *tx);
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif /* _SYS_DSL_DEADLIST_H */
--- a/uts/common/fs/zfs/sys/dsl_deleg.h
+++ b/uts/common/fs/zfs/sys/dsl_deleg.h
@ -0,0 +1,78 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2007, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#ifndef	_SYS_DSL_DELEG_H
+#define	_SYS_DSL_DELEG_H
+
+#include <sys/dmu.h>
+#include <sys/dsl_pool.h>
+#include <sys/zfs_context.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+#define	ZFS_DELEG_PERM_NONE		""
+#define	ZFS_DELEG_PERM_CREATE		"create"
+#define	ZFS_DELEG_PERM_DESTROY		"destroy"
+#define	ZFS_DELEG_PERM_SNAPSHOT		"snapshot"
+#define	ZFS_DELEG_PERM_ROLLBACK		"rollback"
+#define	ZFS_DELEG_PERM_CLONE		"clone"
+#define	ZFS_DELEG_PERM_PROMOTE		"promote"
+#define	ZFS_DELEG_PERM_RENAME		"rename"
+#define	ZFS_DELEG_PERM_MOUNT		"mount"
+#define	ZFS_DELEG_PERM_SHARE		"share"
+#define	ZFS_DELEG_PERM_SEND		"send"
+#define	ZFS_DELEG_PERM_RECEIVE		"receive"
+#define	ZFS_DELEG_PERM_ALLOW		"allow"
+#define	ZFS_DELEG_PERM_USERPROP		"userprop"
+#define	ZFS_DELEG_PERM_VSCAN		"vscan"
+#define	ZFS_DELEG_PERM_USERQUOTA	"userquota"
+#define	ZFS_DELEG_PERM_GROUPQUOTA	"groupquota"
+#define	ZFS_DELEG_PERM_USERUSED		"userused"
+#define	ZFS_DELEG_PERM_GROUPUSED	"groupused"
+#define	ZFS_DELEG_PERM_HOLD		"hold"
+#define	ZFS_DELEG_PERM_RELEASE		"release"
+#define	ZFS_DELEG_PERM_DIFF		"diff"
+
+/*
+ * Note: the names of properties that are marked delegatable are also
+ * valid delegated permissions
+ */
+
+int dsl_deleg_get(const char *ddname, nvlist_t **nvp);
+int dsl_deleg_set(const char *ddname, nvlist_t *nvp, boolean_t unset);
+int dsl_deleg_access(const char *ddname, const char *perm, cred_t *cr);
+int dsl_deleg_access_impl(struct dsl_dataset *ds, const char *perm, cred_t *cr);
+void dsl_deleg_set_create_perms(dsl_dir_t *dd, dmu_tx_t *tx, cred_t *cr);
+int dsl_deleg_can_allow(char *ddname, nvlist_t *nvp, cred_t *cr);
+int dsl_deleg_can_unallow(char *ddname, nvlist_t *nvp, cred_t *cr);
+int dsl_deleg_destroy(objset_t *os, uint64_t zapobj, dmu_tx_t *tx);
+boolean_t dsl_delegation_on(objset_t *os);
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif	/* _SYS_DSL_DELEG_H */
--- a/uts/common/fs/zfs/sys/dsl_dir.h
+++ b/uts/common/fs/zfs/sys/dsl_dir.h
@ -0,0 +1,167 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#ifndef	_SYS_DSL_DIR_H
+#define	_SYS_DSL_DIR_H
+
+#include <sys/dmu.h>
+#include <sys/dsl_pool.h>
+#include <sys/dsl_synctask.h>
+#include <sys/refcount.h>
+#include <sys/zfs_context.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+struct dsl_dataset;
+
+typedef enum dd_used {
+	DD_USED_HEAD,
+	DD_USED_SNAP,
+	DD_USED_CHILD,
+	DD_USED_CHILD_RSRV,
+	DD_USED_REFRSRV,
+	DD_USED_NUM
+} dd_used_t;
+
+#define	DD_FLAG_USED_BREAKDOWN (1<<0)
+
+typedef struct dsl_dir_phys {
+	uint64_t dd_creation_time; /* not actually used */
+	uint64_t dd_head_dataset_obj;
+	uint64_t dd_parent_obj;
+	uint64_t dd_origin_obj;
+	uint64_t dd_child_dir_zapobj;
+	/*
+	 * how much space our children are accounting for; for leaf
+	 * datasets, == physical space used by fs + snaps
+	 */
+	uint64_t dd_used_bytes;
+	uint64_t dd_compressed_bytes;
+	uint64_t dd_uncompressed_bytes;
+	/* Administrative quota setting */
+	uint64_t dd_quota;
+	/* Administrative reservation setting */
+	uint64_t dd_reserved;
+	uint64_t dd_props_zapobj;
+	uint64_t dd_deleg_zapobj; /* dataset delegation permissions */
+	uint64_t dd_flags;
+	uint64_t dd_used_breakdown[DD_USED_NUM];
+	uint64_t dd_clones; /* dsl_dir objects */
+	uint64_t dd_pad[13]; /* pad out to 256 bytes for good measure */
+} dsl_dir_phys_t;
+
+struct dsl_dir {
+	/* These are immutable; no lock needed: */
+	uint64_t dd_object;
+	dsl_dir_phys_t *dd_phys;
+	dmu_buf_t *dd_dbuf;
+	dsl_pool_t *dd_pool;
+
+	/* protected by lock on pool's dp_dirty_dirs list */
+	txg_node_t dd_dirty_link;
+
+	/* protected by dp_config_rwlock */
+	dsl_dir_t *dd_parent;
+
+	/* Protected by dd_lock */
+	kmutex_t dd_lock;
+	list_t dd_prop_cbs; /* list of dsl_prop_cb_record_t's */
+	timestruc_t dd_snap_cmtime; /* last time snapshot namespace changed */
+	uint64_t dd_origin_txg;
+
+	/* gross estimate of space used by in-flight tx's */
+	uint64_t dd_tempreserved[TXG_SIZE];
+	/* amount of space we expect to write; == amount of dirty data */
+	int64_t dd_space_towrite[TXG_SIZE];
+
+	/* protected by dd_lock; keep at end of struct for better locality */
+	char dd_myname[MAXNAMELEN];
+};
+
+void dsl_dir_close(dsl_dir_t *dd, void *tag);
+int dsl_dir_open(const char *name, void *tag, dsl_dir_t **, const char **tail);
+int dsl_dir_open_spa(spa_t *spa, const char *name, void *tag, dsl_dir_t **,
+    const char **tailp);
+int dsl_dir_open_obj(dsl_pool_t *dp, uint64_t ddobj,
+    const char *tail, void *tag, dsl_dir_t **);
+void dsl_dir_name(dsl_dir_t *dd, char *buf);
+int dsl_dir_namelen(dsl_dir_t *dd);
+uint64_t dsl_dir_create_sync(dsl_pool_t *dp, dsl_dir_t *pds,
+    const char *name, dmu_tx_t *tx);
+dsl_checkfunc_t dsl_dir_destroy_check;
+dsl_syncfunc_t dsl_dir_destroy_sync;
+void dsl_dir_stats(dsl_dir_t *dd, nvlist_t *nv);
+uint64_t dsl_dir_space_available(dsl_dir_t *dd,
+    dsl_dir_t *ancestor, int64_t delta, int ondiskonly);
+void dsl_dir_dirty(dsl_dir_t *dd, dmu_tx_t *tx);
+void dsl_dir_sync(dsl_dir_t *dd, dmu_tx_t *tx);
+int dsl_dir_tempreserve_space(dsl_dir_t *dd, uint64_t mem,
+    uint64_t asize, uint64_t fsize, uint64_t usize, void **tr_cookiep,
+    dmu_tx_t *tx);
+void dsl_dir_tempreserve_clear(void *tr_cookie, dmu_tx_t *tx);
+void dsl_dir_willuse_space(dsl_dir_t *dd, int64_t space, dmu_tx_t *tx);
+void dsl_dir_diduse_space(dsl_dir_t *dd, dd_used_t type,
+    int64_t used, int64_t compressed, int64_t uncompressed, dmu_tx_t *tx);
+void dsl_dir_transfer_space(dsl_dir_t *dd, int64_t delta,
+    dd_used_t oldtype, dd_used_t newtype, dmu_tx_t *tx);
+int dsl_dir_set_quota(const char *ddname, zprop_source_t source,
+    uint64_t quota);
+int dsl_dir_set_reservation(const char *ddname, zprop_source_t source,
+    uint64_t reservation);
+int dsl_dir_rename(dsl_dir_t *dd, const char *newname);
+int dsl_dir_transfer_possible(dsl_dir_t *sdd, dsl_dir_t *tdd, uint64_t space);
+int dsl_dir_set_reservation_check(void *arg1, void *arg2, dmu_tx_t *tx);
+boolean_t dsl_dir_is_clone(dsl_dir_t *dd);
+void dsl_dir_new_refreservation(dsl_dir_t *dd, struct dsl_dataset *ds,
+    uint64_t reservation, cred_t *cr, dmu_tx_t *tx);
+void dsl_dir_snap_cmtime_update(dsl_dir_t *dd);
+timestruc_t dsl_dir_snap_cmtime(dsl_dir_t *dd);
+
+/* internal reserved dir names */
+#define	MOS_DIR_NAME "$MOS"
+#define	ORIGIN_DIR_NAME "$ORIGIN"
+#define	XLATION_DIR_NAME "$XLATION"
+#define	FREE_DIR_NAME "$FREE"
+
+#ifdef ZFS_DEBUG
+#define	dprintf_dd(dd, fmt, ...) do { \
+	if (zfs_flags & ZFS_DEBUG_DPRINTF) { \
+	char *__ds_name = kmem_alloc(MAXNAMELEN + strlen(MOS_DIR_NAME) + 1, \
+	    KM_SLEEP); \
+	dsl_dir_name(dd, __ds_name); \
+	dprintf("dd=%s " fmt, __ds_name, __VA_ARGS__); \
+	kmem_free(__ds_name, MAXNAMELEN + strlen(MOS_DIR_NAME) + 1); \
+	} \
+_NOTE(CONSTCOND) } while (0)
+#else
+#define	dprintf_dd(dd, fmt, ...)
+#endif
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif /* _SYS_DSL_DIR_H */
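Reviewer note (illustrative only, not part of the imported headers): the pad comments in dsl_dataset_phys_t, dsl_deadlist_phys_t and dsl_dir_phys_t claim on-disk sizes of 320, 320 and 256 bytes. The sketch below re-derives those totals at compile time using stand-in structs with the same field counts. It assumes blkptr_t occupies 128 bytes, which this diff does not show, and every demo_* name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: blkptr_t is 128 bytes; modeled here as 16 64-bit words. */
typedef struct demo_blkptr {
	uint64_t bp_words[16];
} demo_blkptr_t;

#define	DEMO_DD_USED_NUM	5	/* dd_used_t values before DD_USED_NUM */

typedef struct demo_dataset_phys {
	uint64_t ds_before_bp[16];	/* ds_dir_obj through ds_flags */
	demo_blkptr_t ds_bp;
	uint64_t ds_after_bp[3];	/* ds_next_clones_obj and later */
	uint64_t ds_pad[5];
} demo_dataset_phys_t;

typedef struct demo_deadlist_phys {
	uint64_t dl_counters[3];	/* dl_used, dl_comp, dl_uncomp */
	uint64_t dl_pad[37];
} demo_deadlist_phys_t;

typedef struct demo_dir_phys {
	uint64_t dd_scalars[13];	/* dd_creation_time through dd_flags */
	uint64_t dd_used_breakdown[DEMO_DD_USED_NUM];
	uint64_t dd_clones;
	uint64_t dd_pad[13];
} demo_dir_phys_t;

/* Compile-time checks (C11); a mismatch stops the build. */
_Static_assert(sizeof (demo_dataset_phys_t) == 320, "dataset phys size");
_Static_assert(sizeof (demo_deadlist_phys_t) == 320, "deadlist phys size");
_Static_assert(sizeof (demo_dir_phys_t) == 256, "dir phys size");
```

Because every member is a 64-bit word, no compiler padding is involved and the totals follow directly from the word counts: 40, 40 and 32 words respectively.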
--- a/uts/common/fs/zfs/sys/dsl_pool.h
+++ b/uts/common/fs/zfs/sys/dsl_pool.h
@ -0,0 +1,151 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#ifndef	_SYS_DSL_POOL_H
+#define	_SYS_DSL_POOL_H
+
+#include <sys/spa.h>
+#include <sys/txg.h>
+#include <sys/txg_impl.h>
+#include <sys/zfs_context.h>
+#include <sys/zio.h>
+#include <sys/dnode.h>
+#include <sys/ddt.h>
+#include <sys/arc.h>
+#include <sys/bpobj.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+struct objset;
+struct dsl_dir;
+struct dsl_dataset;
+struct dsl_pool;
+struct dmu_tx;
+struct dsl_scan;
+
+/* These macros are for indexing into the zfs_all_blkstats_t. */
+#define	DMU_OT_DEFERRED	DMU_OT_NONE
+#define	DMU_OT_TOTAL	DMU_OT_NUMTYPES
+
+typedef struct zfs_blkstat {
+	uint64_t	zb_count;
+	uint64_t	zb_asize;
+	uint64_t	zb_lsize;
+	uint64_t	zb_psize;
+	uint64_t	zb_gangs;
+	uint64_t	zb_ditto_2_of_2_samevdev;
+	uint64_t	zb_ditto_2_of_3_samevdev;
+	uint64_t	zb_ditto_3_of_3_samevdev;
+} zfs_blkstat_t;
+
+typedef struct zfs_all_blkstats {
+	zfs_blkstat_t	zab_type[DN_MAX_LEVELS + 1][DMU_OT_TOTAL + 1];
+} zfs_all_blkstats_t;
+
+
+typedef struct dsl_pool {
+	/* Immutable */
+	spa_t *dp_spa;
+	struct objset *dp_meta_objset;
+	struct dsl_dir *dp_root_dir;
+	struct dsl_dir *dp_mos_dir;
+	struct dsl_dir *dp_free_dir;
+	struct dsl_dataset *dp_origin_snap;
+	uint64_t dp_root_dir_obj;
+	struct taskq *dp_vnrele_taskq;
+
+	/* No lock needed - sync context only */
+	blkptr_t dp_meta_rootbp;
+	list_t dp_synced_datasets;
+	hrtime_t dp_read_overhead;
+	uint64_t dp_throughput; /* bytes per millisec */
+	uint64_t dp_write_limit;
+	uint64_t dp_tmp_userrefs_obj;
+	bpobj_t dp_free_bpobj;
+
+	struct dsl_scan *dp_scan;
+
+	/* Uses dp_lock */
+	kmutex_t dp_lock;
+	uint64_t dp_space_towrite[TXG_SIZE];
+	uint64_t dp_tempreserved[TXG_SIZE];
+
+	/* Has its own locking */
+	tx_state_t dp_tx;
+	txg_list_t dp_dirty_datasets;
+	txg_list_t dp_dirty_dirs;
+	txg_list_t dp_sync_tasks;
+
+	/*
+	 * Protects administrative changes (properties, namespace).
+	 * It is only held for write in syncing context.  Therefore
+	 * syncing context does not need to ever have it for read, since
+	 * nobody else could possibly have it for write.
+	 */
+	krwlock_t dp_config_rwlock;
+
+	zfs_all_blkstats_t *dp_blkstats;
+} dsl_pool_t;
+
+int dsl_pool_open(spa_t *spa, uint64_t txg, dsl_pool_t **dpp);
+void dsl_pool_close(dsl_pool_t *dp);
+dsl_pool_t *dsl_pool_create(spa_t *spa, nvlist_t *zplprops, uint64_t txg);
+void dsl_pool_sync(dsl_pool_t *dp, uint64_t txg);
+void dsl_pool_sync_done(dsl_pool_t *dp, uint64_t txg);
+int dsl_pool_sync_context(dsl_pool_t *dp);
+uint64_t dsl_pool_adjustedsize(dsl_pool_t *dp, boolean_t netfree);
+uint64_t dsl_pool_adjustedfree(dsl_pool_t *dp, boolean_t netfree);
+int dsl_pool_tempreserve_space(dsl_pool_t *dp, uint64_t space, dmu_tx_t *tx);
+void dsl_pool_tempreserve_clear(dsl_pool_t *dp, int64_t space, dmu_tx_t *tx);
+void dsl_pool_memory_pressure(dsl_pool_t *dp);
+void dsl_pool_willuse_space(dsl_pool_t *dp, int64_t space, dmu_tx_t *tx);
+void dsl_free(dsl_pool_t *dp, uint64_t txg, const blkptr_t *bpp);
+void dsl_free_sync(zio_t *pio, dsl_pool_t *dp, uint64_t txg,
+    const blkptr_t *bpp);
+int dsl_read(zio_t *pio, spa_t *spa, const blkptr_t *bpp, arc_buf_t *pbuf,
+    arc_done_func_t *done, void *private, int priority, int zio_flags,
+    uint32_t *arc_flags, const zbookmark_t *zb);
+int dsl_read_nolock(zio_t *pio, spa_t *spa, const blkptr_t *bpp,
+    arc_done_func_t *done, void *private, int priority, int zio_flags,
+    uint32_t *arc_flags, const zbookmark_t *zb);
+void dsl_pool_create_origin(dsl_pool_t *dp, dmu_tx_t *tx);
+void dsl_pool_upgrade_clones(dsl_pool_t *dp, dmu_tx_t *tx);
+void dsl_pool_upgrade_dir_clones(dsl_pool_t *dp, dmu_tx_t *tx);
+
+taskq_t *dsl_pool_vnrele_taskq(dsl_pool_t *dp);
+
+extern int dsl_pool_user_hold(dsl_pool_t *dp, uint64_t dsobj,
+    const char *tag, uint64_t *now, dmu_tx_t *tx);
+extern int dsl_pool_user_release(dsl_pool_t *dp, uint64_t dsobj,
+    const char *tag, dmu_tx_t *tx);
+extern void dsl_pool_clean_tmp_userrefs(dsl_pool_t *dp);
+int dsl_pool_open_special_dir(dsl_pool_t *dp, const char *name, dsl_dir_t **);
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif /* _SYS_DSL_POOL_H */
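Reviewer note (illustrative only, not part of the imported header): zfs_all_blkstats_t is declared with one extra row (index DN_MAX_LEVELS) and one extra column (index DMU_OT_TOTAL). One plausible way to use those extra slots is as running totals across levels and types, and the sketch below assumes that reading. The constants and all demo_* names are hypothetical stand-ins, and the real accounting code may differ.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define	DEMO_MAX_LEVELS	6	/* hypothetical stand-in for DN_MAX_LEVELS */
#define	DEMO_NUMTYPES	10	/* hypothetical stand-in for DMU_OT_NUMTYPES */
#define	DEMO_OT_TOTAL	DEMO_NUMTYPES

typedef struct demo_blkstat {
	uint64_t zb_count;
	uint64_t zb_asize;
} demo_blkstat_t;

typedef struct demo_all_blkstats {
	demo_blkstat_t zab_type[DEMO_MAX_LEVELS + 1][DEMO_OT_TOTAL + 1];
} demo_all_blkstats_t;

/*
 * Account one block in four cells: (level, total), (level, type),
 * (all levels, total) and (all levels, type).
 */
static void
demo_count_block(demo_all_blkstats_t *zab, int level, int type,
    uint64_t asize)
{
	int i;

	for (i = 0; i < 4; i++) {
		int l = (i < 2) ? level : DEMO_MAX_LEVELS;
		int t = (i & 1) ? type : DEMO_OT_TOTAL;
		demo_blkstat_t *zb = &zab->zab_type[l][t];

		zb->zb_count++;
		zb->zb_asize += asize;
	}
}
```

Under this scheme a reader can fetch a per-type or pool-wide total from a single cell instead of summing the table on every query.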
--- a/uts/common/fs/zfs/sys/dsl_prop.h
+++ b/uts/common/fs/zfs/sys/dsl_prop.h
@ -0,0 +1,119 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#ifndef	_SYS_DSL_PROP_H
+#define	_SYS_DSL_PROP_H
+
+#include <sys/dmu.h>
+#include <sys/dsl_pool.h>
+#include <sys/zfs_context.h>
+#include <sys/dsl_synctask.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+struct dsl_dataset;
+struct dsl_dir;
+
+/* The callback func may not call into the DMU or DSL! */
+typedef void (dsl_prop_changed_cb_t)(void *arg, uint64_t newval);
+
+typedef struct dsl_prop_cb_record {
+	list_node_t cbr_node; /* link on dd_prop_cbs */
+	struct dsl_dataset *cbr_ds;
+	const char *cbr_propname;
+	dsl_prop_changed_cb_t *cbr_func;
+	void *cbr_arg;
+} dsl_prop_cb_record_t;
+
+typedef struct dsl_props_arg {
+	nvlist_t *pa_props;
+	zprop_source_t pa_source;
+} dsl_props_arg_t;
+
+typedef struct dsl_prop_set_arg {
+	const char *psa_name;
+	zprop_source_t psa_source;
+	int psa_intsz;
+	int psa_numints;
+	const void *psa_value;
+
+	/*
+	 * Used to handle the special requirements of the quota and reservation
+	 * properties.
+	 */
+	uint64_t psa_effective_value;
+} dsl_prop_setarg_t;
+
+int dsl_prop_register(struct dsl_dataset *ds, const char *propname,
+    dsl_prop_changed_cb_t *callback, void *cbarg);
+int dsl_prop_unregister(struct dsl_dataset *ds, const char *propname,
+    dsl_prop_changed_cb_t *callback, void *cbarg);
+int dsl_prop_numcb(struct dsl_dataset *ds);
+
+int dsl_prop_get(const char *ddname, const char *propname,
+    int intsz, int numints, void *buf, char *setpoint);
+int dsl_prop_get_integer(const char *ddname, const char *propname,
+    uint64_t *valuep, char *setpoint);
+int dsl_prop_get_all(objset_t *os, nvlist_t **nvp);
+int dsl_prop_get_received(objset_t *os, nvlist_t **nvp);
+int dsl_prop_get_ds(struct dsl_dataset *ds, const char *propname,
+    int intsz, int numints, void *buf, char *setpoint);
+int dsl_prop_get_dd(struct dsl_dir *dd, const char *propname,
+    int intsz, int numints, void *buf, char *setpoint,
+    boolean_t snapshot);
+
+dsl_syncfunc_t dsl_props_set_sync;
+int dsl_prop_set(const char *ddname, const char *propname,
+    zprop_source_t source, int intsz, int numints, const void *buf);
+int dsl_props_set(const char *dsname, zprop_source_t source, nvlist_t *nvl);
+void dsl_dir_prop_set_uint64_sync(dsl_dir_t *dd, const char *name, uint64_t val,
+    dmu_tx_t *tx);
+
+void dsl_prop_setarg_init_uint64(dsl_prop_setarg_t *psa, const char *propname,
+    zprop_source_t source, uint64_t *value);
+int dsl_prop_predict_sync(dsl_dir_t *dd, dsl_prop_setarg_t *psa);
+#ifdef	ZFS_DEBUG
+void dsl_prop_check_prediction(dsl_dir_t *dd, dsl_prop_setarg_t *psa);
+#define	DSL_PROP_CHECK_PREDICTION(dd, psa)	\
+	dsl_prop_check_prediction((dd), (psa))
+#else
+#define	DSL_PROP_CHECK_PREDICTION(dd, psa)	/* nothing */
+#endif
+
+/* flag first receive on or after SPA_VERSION_RECVD_PROPS */
+boolean_t dsl_prop_get_hasrecvd(objset_t *os);
+void dsl_prop_set_hasrecvd(objset_t *os);
+void dsl_prop_unset_hasrecvd(objset_t *os);
+
+void dsl_prop_nvlist_add_uint64(nvlist_t *nv, zfs_prop_t prop, uint64_t value);
+void dsl_prop_nvlist_add_string(nvlist_t *nv,
+    zfs_prop_t prop, const char *value);
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif	/* _SYS_DSL_PROP_H */
--- a/uts/common/fs/zfs/sys/dsl_scan.h
+++ b/uts/common/fs/zfs/sys/dsl_scan.h
@ -0,0 +1,108 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#ifndef	_SYS_DSL_SCAN_H
+#define	_SYS_DSL_SCAN_H
+
+#include <sys/zfs_context.h>
+#include <sys/zio.h>
+#include <sys/ddt.h>
+#include <sys/bplist.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+struct objset;
+struct dsl_dir;
+struct dsl_dataset;
+struct dsl_pool;
+struct dmu_tx;
+
+/*
+ * All members of this structure must be uint64_t, for byteswap
+ * purposes.
+ */
+typedef struct dsl_scan_phys {
+	uint64_t scn_func; /* pool_scan_func_t */
+	uint64_t scn_state; /* dsl_scan_state_t */
+	uint64_t scn_queue_obj;
+	uint64_t scn_min_txg;
+	uint64_t scn_max_txg;
+	uint64_t scn_cur_min_txg;
+	uint64_t scn_cur_max_txg;
+	uint64_t scn_start_time;
+	uint64_t scn_end_time;
+	uint64_t scn_to_examine; /* total bytes to be scanned */
+	uint64_t scn_examined; /* bytes scanned so far */
+	uint64_t scn_to_process;
+	uint64_t scn_processed;
+	uint64_t scn_errors;	/* scan I/O error count */
+	uint64_t scn_ddt_class_max;
+	ddt_bookmark_t scn_ddt_bookmark;
+	zbookmark_t scn_bookmark;
+	uint64_t scn_flags; /* dsl_scan_flags_t */
+} dsl_scan_phys_t;
+
+#define	SCAN_PHYS_NUMINTS (sizeof (dsl_scan_phys_t) / sizeof (uint64_t))
+
+typedef enum dsl_scan_flags {
+	DSF_VISIT_DS_AGAIN = 1<<0,
+} dsl_scan_flags_t;
+
+typedef struct dsl_scan {
+	struct dsl_pool *scn_dp;
+
+	boolean_t scn_pausing;
+	uint64_t scn_restart_txg;
+	uint64_t scn_sync_start_time;
+	zio_t *scn_zio_root;
+
+	/* for debugging / information */
+	uint64_t scn_visited_this_txg;
+
+	dsl_scan_phys_t scn_phys;
+} dsl_scan_t;
+
+int dsl_scan_init(struct dsl_pool *dp, uint64_t txg);
+void dsl_scan_fini(struct dsl_pool *dp);
+void dsl_scan_sync(struct dsl_pool *, dmu_tx_t *);
+int dsl_scan_cancel(struct dsl_pool *);
+int dsl_scan(struct dsl_pool *, pool_scan_func_t);
+void dsl_resilver_restart(struct dsl_pool *, uint64_t txg);
+boolean_t dsl_scan_resilvering(struct dsl_pool *dp);
+boolean_t dsl_dataset_unstable(struct dsl_dataset *ds);
+void dsl_scan_ddt_entry(dsl_scan_t *scn, enum zio_checksum checksum,
+    ddt_entry_t *dde, dmu_tx_t *tx);
+void dsl_scan_ds_destroyed(struct dsl_dataset *ds, struct dmu_tx *tx);
+void dsl_scan_ds_snapshotted(struct dsl_dataset *ds, struct dmu_tx *tx);
+void dsl_scan_ds_clone_swapped(struct dsl_dataset *ds1, struct dsl_dataset *ds2,
+    struct dmu_tx *tx);
+boolean_t dsl_scan_active(dsl_scan_t *scn);
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif /* _SYS_DSL_SCAN_H */
--- a/uts/common/fs/zfs/sys/dsl_synctask.h
+++ b/uts/common/fs/zfs/sys/dsl_synctask.h
@ -0,0 +1,79 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#ifndef	_SYS_DSL_SYNCTASK_H
+#define	_SYS_DSL_SYNCTASK_H
+
+#include <sys/txg.h>
+#include <sys/zfs_context.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+struct dsl_pool;
+
+typedef int (dsl_checkfunc_t)(void *, void *, dmu_tx_t *);
+typedef void (dsl_syncfunc_t)(void *, void *, dmu_tx_t *);
+
+typedef struct dsl_sync_task {
+	list_node_t dst_node;
+	dsl_checkfunc_t *dst_checkfunc;
+	dsl_syncfunc_t *dst_syncfunc;
+	void *dst_arg1;
+	void *dst_arg2;
+	int dst_err;
+} dsl_sync_task_t;
+
+typedef struct dsl_sync_task_group {
+	txg_node_t dstg_node;
+	list_t dstg_tasks;
+	struct dsl_pool *dstg_pool;
+	uint64_t dstg_txg;
+	int dstg_err;
+	int dstg_space;
+	boolean_t dstg_nowaiter;
+} dsl_sync_task_group_t;
+
+dsl_sync_task_group_t *dsl_sync_task_group_create(struct dsl_pool *dp);
+void dsl_sync_task_create(dsl_sync_task_group_t *dstg,
+    dsl_checkfunc_t *, dsl_syncfunc_t *,
+    void *arg1, void *arg2, int blocks_modified);
+int dsl_sync_task_group_wait(dsl_sync_task_group_t *dstg);
+void dsl_sync_task_group_nowait(dsl_sync_task_group_t *dstg, dmu_tx_t *tx);
+void dsl_sync_task_group_destroy(dsl_sync_task_group_t *dstg);
+void dsl_sync_task_group_sync(dsl_sync_task_group_t *dstg, dmu_tx_t *tx);
+
+int dsl_sync_task_do(struct dsl_pool *dp,
+    dsl_checkfunc_t *checkfunc, dsl_syncfunc_t *syncfunc,
+    void *arg1, void *arg2, int blocks_modified);
+void dsl_sync_task_do_nowait(struct dsl_pool *dp,
+    dsl_checkfunc_t *checkfunc, dsl_syncfunc_t *syncfunc,
+    void *arg1, void *arg2, int blocks_modified, dmu_tx_t *tx);
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif /* _SYS_DSL_SYNCTASK_H */
--- a/uts/common/fs/zfs/sys/metaslab.h
+++ b/uts/common/fs/zfs/sys/metaslab.h
@ -0,0 +1,80 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#ifndef _SYS_METASLAB_H
+#define	_SYS_METASLAB_H
+
+#include <sys/spa.h>
+#include <sys/space_map.h>
+#include <sys/txg.h>
+#include <sys/zio.h>
+#include <sys/avl.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+extern space_map_ops_t *zfs_metaslab_ops;
+
+extern metaslab_t *metaslab_init(metaslab_group_t *mg, space_map_obj_t *smo,
+    uint64_t start, uint64_t size, uint64_t txg);
+extern void metaslab_fini(metaslab_t *msp);
+extern void metaslab_sync(metaslab_t *msp, uint64_t txg);
+extern void metaslab_sync_done(metaslab_t *msp, uint64_t txg);
+extern void metaslab_sync_reassess(metaslab_group_t *mg);
+
+#define	METASLAB_HINTBP_FAVOR	0x0
+#define	METASLAB_HINTBP_AVOID	0x1
+#define	METASLAB_GANG_HEADER	0x2
+
+extern int metaslab_alloc(spa_t *spa, metaslab_class_t *mc, uint64_t psize,
+    blkptr_t *bp, int ncopies, uint64_t txg, blkptr_t *hintbp, int flags);
+extern void metaslab_free(spa_t *spa, const blkptr_t *bp, uint64_t txg,
+    boolean_t now);
+extern int metaslab_claim(spa_t *spa, const blkptr_t *bp, uint64_t txg);
+
+extern metaslab_class_t *metaslab_class_create(spa_t *spa,
+    space_map_ops_t *ops);
+extern void metaslab_class_destroy(metaslab_class_t *mc);
+extern int metaslab_class_validate(metaslab_class_t *mc);
+
+extern void metaslab_class_space_update(metaslab_class_t *mc,
+    int64_t alloc_delta, int64_t defer_delta,
+    int64_t space_delta, int64_t dspace_delta);
+extern uint64_t metaslab_class_get_alloc(metaslab_class_t *mc);
+extern uint64_t metaslab_class_get_space(metaslab_class_t *mc);
+extern uint64_t metaslab_class_get_dspace(metaslab_class_t *mc);
+extern uint64_t metaslab_class_get_deferred(metaslab_class_t *mc);
+
+extern metaslab_group_t *metaslab_group_create(metaslab_class_t *mc,
+    vdev_t *vd);
+extern void metaslab_group_destroy(metaslab_group_t *mg);
+extern void metaslab_group_activate(metaslab_group_t *mg);
+extern void metaslab_group_passivate(metaslab_group_t *mg);
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif	/* _SYS_METASLAB_H */
--- a/uts/common/fs/zfs/sys/metaslab_impl.h
+++ b/uts/common/fs/zfs/sys/metaslab_impl.h
@ -0,0 +1,89 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
+ * Use is subject to license terms.
+ */
+
+#ifndef _SYS_METASLAB_IMPL_H
+#define	_SYS_METASLAB_IMPL_H
+
+#include <sys/metaslab.h>
+#include <sys/space_map.h>
+#include <sys/vdev.h>
+#include <sys/txg.h>
+#include <sys/avl.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+struct metaslab_class {
+	spa_t			*mc_spa;
+	metaslab_group_t	*mc_rotor;
+	space_map_ops_t		*mc_ops;
+	uint64_t		mc_aliquot;
+	uint64_t		mc_alloc;	/* total allocated space */
+	uint64_t		mc_deferred;	/* total deferred frees */
+	uint64_t		mc_space;	/* total space (alloc + free) */
+	uint64_t		mc_dspace;	/* total deflated space */
+};
+
+struct metaslab_group {
+	kmutex_t		mg_lock;
+	avl_tree_t		mg_metaslab_tree;
+	uint64_t		mg_aliquot;
+	uint64_t		mg_bonus_area;
+	int64_t			mg_bias;
+	int64_t			mg_activation_count;
+	metaslab_class_t	*mg_class;
+	vdev_t			*mg_vd;
+	metaslab_group_t	*mg_prev;
+	metaslab_group_t	*mg_next;
+};
+
+/*
+ * Each metaslab's free space is tracked in a space map object in the MOS,
+ * which is only updated in syncing context.  Each time we sync a txg,
+ * we append the allocs and frees from that txg to the space map object.
+ * When the txg is done syncing, metaslab_sync_done() updates ms_smo
+ * to ms_smo_syncing.  Everything in ms_smo is always safe to allocate.
+ */
+struct metaslab {
+	kmutex_t	ms_lock;	/* metaslab lock		*/
+	space_map_obj_t	ms_smo;		/* synced space map object	*/
+	space_map_obj_t	ms_smo_syncing;	/* syncing space map object	*/
+	space_map_t	ms_allocmap[TXG_SIZE];  /* allocated this txg	*/
+	space_map_t	ms_freemap[TXG_SIZE];	/* freed this txg	*/
+	space_map_t	ms_defermap[TXG_DEFER_SIZE]; /* deferred frees	*/
+	space_map_t	ms_map;		/* in-core free space map	*/
+	int64_t		ms_deferspace;	/* sum of ms_defermap[] space	*/
+	uint64_t	ms_weight;	/* weight vs. others in group	*/
+	metaslab_group_t *ms_group;	/* metaslab group		*/
+	avl_node_t	ms_group_node;	/* node in metaslab group tree	*/
+	txg_node_t	ms_txg_node;	/* per-txg dirty metaslab links	*/
+};
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif	/* _SYS_METASLAB_IMPL_H */
--- a/uts/common/fs/zfs/sys/refcount.h
+++ b/uts/common/fs/zfs/sys/refcount.h
@ -0,0 +1,107 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#ifndef	_SYS_REFCOUNT_H
+#define	_SYS_REFCOUNT_H
+
+#include <sys/inttypes.h>
+#include <sys/list.h>
+#include <sys/zfs_context.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+/*
+ * If the reference is held only by the calling function and not any
+ * particular object, use FTAG (which is a string) for the holder_tag.
+ * Otherwise, use the object that holds the reference.
+ */
+#define	FTAG ((char *)__func__)
+
+#ifdef	ZFS_DEBUG
+typedef struct reference {
+	list_node_t ref_link;
+	void *ref_holder;
+	uint64_t ref_number;
+	uint8_t *ref_removed;
+} reference_t;
+
+typedef struct refcount {
+	kmutex_t rc_mtx;
+	list_t rc_list;
+	list_t rc_removed;
+	int64_t rc_count;
+	int64_t rc_removed_count;
+} refcount_t;
+
+/* Note: refcount_t must be initialized with refcount_create() */
+
+void refcount_create(refcount_t *rc);
+void refcount_destroy(refcount_t *rc);
+void refcount_destroy_many(refcount_t *rc, uint64_t number);
+int refcount_is_zero(refcount_t *rc);
+int64_t refcount_count(refcount_t *rc);
+int64_t refcount_add(refcount_t *rc, void *holder_tag);
+int64_t refcount_remove(refcount_t *rc, void *holder_tag);
+int64_t refcount_add_many(refcount_t *rc, uint64_t number, void *holder_tag);
+int64_t refcount_remove_many(refcount_t *rc, uint64_t number, void *holder_tag);
+void refcount_transfer(refcount_t *dst, refcount_t *src);
+
+void refcount_init(void);
+void refcount_fini(void);
+
+#else	/* ZFS_DEBUG */
+
+typedef struct refcount {
+	uint64_t rc_count;
+} refcount_t;
+
+#define	refcount_create(rc) ((rc)->rc_count = 0)
+#define	refcount_destroy(rc) ((rc)->rc_count = 0)
+#define	refcount_destroy_many(rc, number) ((rc)->rc_count = 0)
+#define	refcount_is_zero(rc) ((rc)->rc_count == 0)
+#define	refcount_count(rc) ((rc)->rc_count)
+#define	refcount_add(rc, holder) atomic_add_64_nv(&(rc)->rc_count, 1)
+#define	refcount_remove(rc, holder) atomic_add_64_nv(&(rc)->rc_count, -1)
+#define	refcount_add_many(rc, number, holder) \
+	atomic_add_64_nv(&(rc)->rc_count, number)
+#define	refcount_remove_many(rc, number, holder) \
+	atomic_add_64_nv(&(rc)->rc_count, -number)
+#define	refcount_transfer(dst, src) { \
+	uint64_t __tmp = (src)->rc_count; \
+	atomic_add_64(&(src)->rc_count, -__tmp); \
+	atomic_add_64(&(dst)->rc_count, __tmp); \
+}
+
+#define	refcount_init()
+#define	refcount_fini()
+
+#endif	/* ZFS_DEBUG */
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif /* _SYS_REFCOUNT_H */
--- a/uts/common/fs/zfs/sys/rrwlock.h
+++ b/uts/common/fs/zfs/sys/rrwlock.h
@ -0,0 +1,80 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright 2007 Sun Microsystems, Inc.  All rights reserved.
+ * Use is subject to license terms.
+ */
+
+#ifndef	_SYS_RR_RW_LOCK_H
+#define	_SYS_RR_RW_LOCK_H
+
+#pragma ident	"%Z%%M%	%I%	%E% SMI"
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+#include <sys/inttypes.h>
+#include <sys/zfs_context.h>
+#include <sys/refcount.h>
+
+/*
+ * A reader-writer lock implementation that allows re-entrant reads, but
+ * still gives writers priority on "new" reads.
+ *
+ * See rrwlock.c for more details about the implementation.
+ *
+ * Fields of the rrwlock_t structure:
+ * - rr_lock: protects modification and reading of rrwlock_t fields
+ * - rr_cv: cv for waking up readers or waiting writers
+ * - rr_writer: thread id of the current writer
+ * - rr_anon_rcount: number of active anonymous readers
+ * - rr_linked_rcount: total number of non-anonymous active readers
+ * - rr_writer_wanted: a writer wants the lock
+ */
+typedef struct rrwlock {
+	kmutex_t	rr_lock;
+	kcondvar_t	rr_cv;
+	kthread_t	*rr_writer;
+	refcount_t	rr_anon_rcount;
+	refcount_t	rr_linked_rcount;
+	boolean_t	rr_writer_wanted;
+} rrwlock_t;
+
+/*
+ * 'tag' is used in reference count tracking.  The
+ * 'tag' must be the same in a rrw_enter() as in its
+ * corresponding rrw_exit().
+ */
+void rrw_init(rrwlock_t *rrl);
+void rrw_destroy(rrwlock_t *rrl);
+void rrw_enter(rrwlock_t *rrl, krw_t rw, void *tag);
+void rrw_exit(rrwlock_t *rrl, void *tag);
+boolean_t rrw_held(rrwlock_t *rrl, krw_t rw);
+
+#define	RRW_READ_HELD(x)	rrw_held(x, RW_READER)
+#define	RRW_WRITE_HELD(x)	rrw_held(x, RW_WRITER)
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif	/* _SYS_RR_RW_LOCK_H */
--- a/uts/common/fs/zfs/sys/sa.h
+++ b/uts/common/fs/zfs/sys/sa.h
@ -0,0 +1,170 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#ifndef	_SYS_SA_H
+#define	_SYS_SA_H
+
+#include <sys/dmu.h>
+
+/*
+ * Currently available byteswap functions.
+ * If at all possible, new attributes should use
+ * one of the already defined byteswap functions.
+ * If a new byteswap function is added then the
+ * ZPL/Pool version will need to be bumped.
+ */
+
+typedef enum sa_bswap_type {
+	SA_UINT64_ARRAY,
+	SA_UINT32_ARRAY,
+	SA_UINT16_ARRAY,
+	SA_UINT8_ARRAY,
+	SA_ACL,
+} sa_bswap_type_t;
+
+typedef uint16_t	sa_attr_type_t;
+
+/*
+ * Attribute to register support for.
+ */
+typedef struct sa_attr_reg {
+	char 			*sa_name;	/* attribute name */
+	uint16_t 		sa_length;
+	sa_bswap_type_t		sa_byteswap;	/* bswap function enum */
+	sa_attr_type_t 		sa_attr; /* filled in during registration */
+} sa_attr_reg_t;
+
+
+typedef void (sa_data_locator_t)(void **, uint32_t *, uint32_t,
+    boolean_t, void *userptr);
+
+/*
+ * array of attributes to store.
+ *
+ * This array should be treated as opaque/private data.
+ * The SA_ADD_BULK_ATTR() macro should be used for manipulating
+ * the array.
+ *
+ * When sa_replace_all_by_template() is used the attributes
+ * will be stored in the order defined in the array, except that
+ * the attributes may be split between the bonus and the spill buffer
+ *
+ */
+typedef struct sa_bulk_attr {
+	void			*sa_data;
+	sa_data_locator_t	*sa_data_func;
+	uint16_t		sa_length;
+	sa_attr_type_t		sa_attr;
+	/* the following are private to the sa framework */
+	void 			*sa_addr;
+	uint16_t		sa_buftype;
+	uint16_t		sa_size;
+} sa_bulk_attr_t;
+
+
+/*
+ * special macro for adding entries for bulk attr support
+ * b - sa_bulk_attr_t array
+ * idx - integer index that will be incremented during each add
+ * attr - attribute to manipulate
+ * func - function for accessing data.
+ * data - pointer to data.
+ * len - length of data
+ */
+
+#define	SA_ADD_BULK_ATTR(b, idx, attr, func, data, len) \
+{ \
+	b[idx].sa_attr = attr;\
+	b[idx].sa_data_func = func; \
+	b[idx].sa_data = data; \
+	b[idx++].sa_length = len; \
+}
+
+typedef struct sa_os sa_os_t;
+
+typedef enum sa_handle_type {
+	SA_HDL_SHARED,
+	SA_HDL_PRIVATE
+} sa_handle_type_t;
+
+struct sa_handle;
+typedef void *sa_lookup_tab_t;
+typedef struct sa_handle sa_handle_t;
+
+typedef void (sa_update_cb_t)(sa_handle_t *, dmu_tx_t *tx);
+
+int sa_handle_get(objset_t *, uint64_t, void *userp,
+    sa_handle_type_t, sa_handle_t **);
+int sa_handle_get_from_db(objset_t *, dmu_buf_t *, void *userp,
+    sa_handle_type_t, sa_handle_t **);
+void sa_handle_destroy(sa_handle_t *);
+int sa_buf_hold(objset_t *, uint64_t, void *, dmu_buf_t **);
+void sa_buf_rele(dmu_buf_t *, void *);
+int sa_lookup(sa_handle_t *, sa_attr_type_t, void *buf, uint32_t buflen);
+int sa_update(sa_handle_t *, sa_attr_type_t, void *buf,
+    uint32_t buflen, dmu_tx_t *);
+int sa_remove(sa_handle_t *, sa_attr_type_t, dmu_tx_t *);
+int sa_bulk_lookup(sa_handle_t *, sa_bulk_attr_t *, int count);
+int sa_bulk_lookup_locked(sa_handle_t *, sa_bulk_attr_t *, int count);
+int sa_bulk_update(sa_handle_t *, sa_bulk_attr_t *, int count, dmu_tx_t *);
+int sa_size(sa_handle_t *, sa_attr_type_t, int *);
+int sa_update_from_cb(sa_handle_t *, sa_attr_type_t,
+    uint32_t buflen, sa_data_locator_t *, void *userdata, dmu_tx_t *);
+void sa_object_info(sa_handle_t *, dmu_object_info_t *);
+void sa_object_size(sa_handle_t *, uint32_t *, u_longlong_t *);
+void sa_update_user(sa_handle_t *, sa_handle_t *);
+void *sa_get_userdata(sa_handle_t *);
+void sa_set_userp(sa_handle_t *, void *);
+dmu_buf_t *sa_get_db(sa_handle_t *);
+uint64_t sa_handle_object(sa_handle_t *);
+boolean_t sa_attr_would_spill(sa_handle_t *, sa_attr_type_t, int size);
+void sa_register_update_callback(objset_t *, sa_update_cb_t *);
+int sa_setup(objset_t *, uint64_t, sa_attr_reg_t *, int, sa_attr_type_t **);
+void sa_tear_down(objset_t *);
+int sa_replace_all_by_template(sa_handle_t *, sa_bulk_attr_t *,
+    int, dmu_tx_t *);
+int sa_replace_all_by_template_locked(sa_handle_t *, sa_bulk_attr_t *,
+    int, dmu_tx_t *);
+boolean_t sa_enabled(objset_t *);
+void sa_cache_init(void);
+void sa_cache_fini(void);
+int sa_set_sa_object(objset_t *, uint64_t);
+int sa_hdrsize(void *);
+void sa_handle_lock(sa_handle_t *);
+void sa_handle_unlock(sa_handle_t *);
+
+#ifdef _KERNEL
+int sa_lookup_uio(sa_handle_t *, sa_attr_type_t, uio_t *);
+#endif
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif	/* _SYS_SA_H */
--- a/uts/common/fs/zfs/sys/sa_impl.h
+++ b/uts/common/fs/zfs/sys/sa_impl.h
@ -0,0 +1,287 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#ifndef	_SYS_SA_IMPL_H
+#define	_SYS_SA_IMPL_H
+
+#include <sys/dmu.h>
+#include <sys/refcount.h>
+#include <sys/list.h>
+
+/*
+ * Array of known attributes and their
+ * various characteristics.
+ */
+typedef struct sa_attr_table {
+	sa_attr_type_t	sa_attr;
+	uint8_t sa_registered;
+	uint16_t sa_length;
+	sa_bswap_type_t sa_byteswap;
+	char *sa_name;
+} sa_attr_table_t;
+
+/*
+ * Zap attribute format for attribute registration
+ *
+ * 64      56      48      40      32      24      16      8       0
+ * +-------+-------+-------+-------+-------+-------+-------+-------+
+ * |        unused         |      len      | bswap |   attr num    |
+ * +-------+-------+-------+-------+-------+-------+-------+-------+
+ *
+ * Zap attribute format for layout information.
+ *
+ * layout information is stored as an array of attribute numbers
+ * The name of the attribute is the layout number (0, 1, 2, ...)
+ *
+ * 16       0
+ * +---- ---+
+ * | attr # |
+ * +--------+
+ * | attr # |
+ * +--- ----+
+ *  ......
+ *
+ */
+
+#define	ATTR_BSWAP(x)	BF32_GET(x, 16, 8)
+#define	ATTR_LENGTH(x)	BF32_GET(x, 24, 16)
+#define	ATTR_NUM(x)	BF32_GET(x, 0, 16)
+#define	ATTR_ENCODE(x, attr, length, bswap) \
+{ \
+	BF64_SET(x, 24, 16, length); \
+	BF64_SET(x, 16, 8, bswap); \
+	BF64_SET(x, 0, 16, attr); \
+}
+
+#define	TOC_OFF(x)		BF32_GET(x, 0, 23)
+#define	TOC_ATTR_PRESENT(x)	BF32_GET(x, 31, 1)
+#define	TOC_LEN_IDX(x)		BF32_GET(x, 24, 4)
+#define	TOC_ATTR_ENCODE(x, len_idx, offset) \
+{ \
+	BF32_SET(x, 31, 1, 1); \
+	BF32_SET(x, 24, 7, len_idx); \
+	BF32_SET(x, 0, 24, offset); \
+}
+
+#define	SA_LAYOUTS	"LAYOUTS"
+#define	SA_REGISTRY	"REGISTRY"
+
+/*
+ * Each unique layout will have its own table
+ * sa_lot (layout_table)
+ */
+typedef struct sa_lot {
+	avl_node_t lot_num_node;
+	avl_node_t lot_hash_node;
+	uint64_t lot_num;
+	uint64_t lot_hash;
+	sa_attr_type_t *lot_attrs;	/* array of attr #'s */
+	uint32_t lot_var_sizes;	/* how many aren't fixed size */
+	uint32_t lot_attr_count;	/* total attr count */
+	list_t 	lot_idx_tab;	/* should be only a couple of entries */
+	int	lot_instance;	/* used with lot_hash to identify entry */
+} sa_lot_t;
+
+/* index table of offsets */
+typedef struct sa_idx_tab {
+	list_node_t	sa_next;
+	sa_lot_t	*sa_layout;
+	uint16_t	*sa_variable_lengths;
+	refcount_t	sa_refcount;
+	uint32_t	*sa_idx_tab;	/* array of offsets */
+} sa_idx_tab_t;
+
+/*
+ * Since the offset/index information into the actual data
+ * will usually be identical we can share that information with
+ * all handles that have the exact same offsets.
+ *
+ * You would typically only have a large number of different tables of
+ * contents if you had several variable-sized attributes.
+ *
+ * Two AVL trees are used to track the attribute layout numbers.
+ * One is keyed by number and will be consulted when a DMU_OT_SA
+ * object is first read.  The second tree is keyed by the hash signature
+ * of the attributes and will be consulted when an attribute is added
+ * to determine if we already have an instance of that layout.  Both
+ * of these trees are interconnected.  The only difference is that
+ * when an entry is found in the "hash" tree the list of attributes will
+ * need to be compared against the list of attributes you have in hand.
+ * The assumption is that typically attributes will just be updated and
+ * adding a completely new attribute is a very rare operation.
+ */
+struct sa_os {
+	kmutex_t 	sa_lock;
+	boolean_t	sa_need_attr_registration;
+	boolean_t	sa_force_spill;
+	uint64_t	sa_master_obj;
+	uint64_t	sa_reg_attr_obj;
+	uint64_t	sa_layout_attr_obj;
+	int		sa_num_attrs;
+	sa_attr_table_t *sa_attr_table;	 /* private attr table */
+	sa_update_cb_t	*sa_update_cb;
+	avl_tree_t	sa_layout_num_tree;  /* keyed by layout number */
+	avl_tree_t	sa_layout_hash_tree; /* keyed by layout hash value */
+	int		sa_user_table_sz;
+	sa_attr_type_t	*sa_user_table; /* user name->attr mapping table */
+};
+
+/*
+ * header for all bonus and spill buffers.
+ * The header has a fixed portion with a variable number
+ * of "lengths" depending on the number of variable sized
+ * attributes which are determined by the "layout number"
+ */
+
+#define	SA_MAGIC	0x2F505A  /* ZFS SA */
+typedef struct sa_hdr_phys {
+	uint32_t sa_magic;
+	uint16_t sa_layout_info;  /* Encoded with hdrsize and layout number */
+	uint16_t sa_lengths[1];	/* optional sizes for variable length attrs */
+	/* ... Data follows the lengths.  */
+} sa_hdr_phys_t;
+
+/*
+ * sa_hdr_phys -> sa_layout_info
+ *
+ * 16      10       0
+ * +--------+-------+
+ * | hdrsz  |layout |
+ * +--------+-------+
+ *
+ * Bits 0-9 are the layout number
+ * Bits 10-15 are the size of the header.
+ * The hdrsize is the number * 8
+ *
+ * For example.
+ * hdrsz of 1 ==> 8 byte header
+ *          2 ==> 16 byte header
+ *
+ */
+
+#define	SA_HDR_LAYOUT_NUM(hdr) BF32_GET(hdr->sa_layout_info, 0, 10)
+#define	SA_HDR_SIZE(hdr) BF32_GET_SB(hdr->sa_layout_info, 10, 16, 3, 0)
+#define	SA_HDR_LAYOUT_INFO_ENCODE(x, num, size) \
+{ \
+	BF32_SET_SB(x, 10, 6, 3, 0, size); \
+	BF32_SET(x, 0, 10, num); \
+}
+
+typedef enum sa_buf_type {
+	SA_BONUS = 1,
+	SA_SPILL = 2
+} sa_buf_type_t;
+
+typedef enum sa_data_op {
+	SA_LOOKUP,
+	SA_UPDATE,
+	SA_ADD,
+	SA_REPLACE,
+	SA_REMOVE
+} sa_data_op_t;
+
+/*
+ * Opaque handle used for most sa functions
+ *
+ * This needs to be kept as small as possible.
+ */
+
+struct sa_handle {
+	kmutex_t	sa_lock;
+	dmu_buf_t	*sa_bonus;
+	dmu_buf_t	*sa_spill;
+	objset_t	*sa_os;
+	void 		*sa_userp;
+	sa_idx_tab_t	*sa_bonus_tab;	 /* idx of bonus */
+	sa_idx_tab_t	*sa_spill_tab; /* only present if spill activated */
+};
+
+#define	SA_GET_DB(hdl, type)	\
+	(dmu_buf_impl_t *)((type == SA_BONUS) ? hdl->sa_bonus : hdl->sa_spill)
+
+#define	SA_GET_HDR(hdl, type) \
+	((sa_hdr_phys_t *)((dmu_buf_impl_t *)(SA_GET_DB(hdl, \
+	type))->db.db_data))
+
+#define	SA_IDX_TAB_GET(hdl, type) \
+	(type == SA_BONUS ? hdl->sa_bonus_tab : hdl->sa_spill_tab)
+
+#define	IS_SA_BONUSTYPE(a)	\
+	((a == DMU_OT_SA) ? B_TRUE : B_FALSE)
+
+#define	SA_BONUSTYPE_FROM_DB(db) \
+	(dmu_get_bonustype((dmu_buf_t *)db))
+
+#define	SA_BLKPTR_SPACE	(DN_MAX_BONUSLEN - sizeof (blkptr_t))
+
+#define	SA_LAYOUT_NUM(x, type) \
+	((!IS_SA_BONUSTYPE(type) ? 0 : (((IS_SA_BONUSTYPE(type)) && \
+	((SA_HDR_LAYOUT_NUM(x)) == 0)) ? 1 : SA_HDR_LAYOUT_NUM(x))))
+
+
+#define	SA_REGISTERED_LEN(sa, attr) sa->sa_attr_table[attr].sa_length
+
+#define	SA_ATTR_LEN(sa, idx, attr, hdr) ((SA_REGISTERED_LEN(sa, attr) == 0) ?\
+	hdr->sa_lengths[TOC_LEN_IDX(idx->sa_idx_tab[attr])] : \
+	SA_REGISTERED_LEN(sa, attr))
+
+#define	SA_SET_HDR(hdr, num, size) \
+	{ \
+		hdr->sa_magic = SA_MAGIC; \
+		SA_HDR_LAYOUT_INFO_ENCODE(hdr->sa_layout_info, num, size); \
+	}
+
+#define	SA_ATTR_INFO(sa, idx, hdr, attr, bulk, type, hdl) \
+	{ \
+		bulk.sa_size = SA_ATTR_LEN(sa, idx, attr, hdr); \
+		bulk.sa_buftype = type; \
+		bulk.sa_addr = \
+		    (void *)((uintptr_t)TOC_OFF(idx->sa_idx_tab[attr]) + \
+		    (uintptr_t)hdr); \
+}
+
+#define	SA_HDR_SIZE_MATCH_LAYOUT(hdr, tb) \
+	(SA_HDR_SIZE(hdr) == (sizeof (sa_hdr_phys_t) + \
+	(tb->lot_var_sizes > 1 ? P2ROUNDUP((tb->lot_var_sizes - 1) * \
+	sizeof (uint16_t), 8) : 0)))
+
+int sa_add_impl(sa_handle_t *, sa_attr_type_t,
+    uint32_t, sa_data_locator_t, void *, dmu_tx_t *);
+
+void sa_register_update_callback_locked(objset_t *, sa_update_cb_t *);
+int sa_size_locked(sa_handle_t *, sa_attr_type_t, int *);
+
+void sa_default_locator(void **, uint32_t *, uint32_t, boolean_t, void *);
+int sa_attr_size(sa_os_t *, sa_idx_tab_t *, sa_attr_type_t,
+    uint16_t *, sa_hdr_phys_t *);
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif	/* _SYS_SA_IMPL_H */
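
The `sa_layout_info` packing described in sa_impl.h above can be sketched in plain C, without the BF32 macros. This is only an illustration, assuming the layout implied by `SA_HDR_LAYOUT_INFO_ENCODE` (layout number in the low 10 bits, header size divided by 8 in the next 6 bits). The helper names `sa_layout_info_encode`, `sa_layout_info_num` and `sa_layout_info_hdrsize` are hypothetical and do not exist in the ZFS sources.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; only the bit layout is taken from sa_impl.h. */
#define	LAYOUT_BITS	10	/* bits 0-9: layout number */
#define	HDRSZ_BITS	6	/* bits 10-15: header size in 8-byte units */

/* Pack a layout number and a header size (in bytes) into 16 bits. */
static uint16_t
sa_layout_info_encode(uint16_t layout_num, uint16_t hdr_bytes)
{
	assert(layout_num < (1U << LAYOUT_BITS));
	assert(hdr_bytes % 8 == 0);
	assert((hdr_bytes >> 3) < (1U << HDRSZ_BITS));
	return ((uint16_t)(((hdr_bytes >> 3) << LAYOUT_BITS) | layout_num));
}

/* Recover the layout number from the low 10 bits. */
static uint16_t
sa_layout_info_num(uint16_t info)
{
	return ((uint16_t)(info & ((1U << LAYOUT_BITS) - 1)));
}

/* Recover the header size in bytes: stored value times 8. */
static uint16_t
sa_layout_info_hdrsize(uint16_t info)
{
	return ((uint16_t)((info >> LAYOUT_BITS) << 3));
}
```

With this packing, layout number 2 and an 8-byte header encode to 0x0402. Because sizes are stored in 8-byte units, the 6-bit field can describe headers of at most 63 * 8 = 504 bytes.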
--- a/uts/common/fs/zfs/sys/spa.h
+++ b/uts/common/fs/zfs/sys/spa.h
@ -0,0 +1,706 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#ifndef _SYS_SPA_H
+#define	_SYS_SPA_H
+
+#include <sys/avl.h>
+#include <sys/zfs_context.h>
+#include <sys/nvpair.h>
+#include <sys/sysmacros.h>
+#include <sys/types.h>
+#include <sys/fs/zfs.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+/*
+ * Forward references that lots of things need.
+ */
+typedef struct spa spa_t;
+typedef struct vdev vdev_t;
+typedef struct metaslab metaslab_t;
+typedef struct metaslab_group metaslab_group_t;
+typedef struct metaslab_class metaslab_class_t;
+typedef struct zio zio_t;
+typedef struct zilog zilog_t;
+typedef struct spa_aux_vdev spa_aux_vdev_t;
+typedef struct ddt ddt_t;
+typedef struct ddt_entry ddt_entry_t;
+struct dsl_pool;
+
+/*
+ * General-purpose 32-bit and 64-bit bitfield encodings.
+ */
+#define	BF32_DECODE(x, low, len)	P2PHASE((x) >> (low), 1U << (len))
+#define	BF64_DECODE(x, low, len)	P2PHASE((x) >> (low), 1ULL << (len))
+#define	BF32_ENCODE(x, low, len)	(P2PHASE((x), 1U << (len)) << (low))
+#define	BF64_ENCODE(x, low, len)	(P2PHASE((x), 1ULL << (len)) << (low))
+
+#define	BF32_GET(x, low, len)		BF32_DECODE(x, low, len)
+#define	BF64_GET(x, low, len)		BF64_DECODE(x, low, len)
+
+#define	BF32_SET(x, low, len, val)	\
+	((x) ^= BF32_ENCODE(((x) >> (low)) ^ (val), low, len))
+#define	BF64_SET(x, low, len, val)	\
+	((x) ^= BF64_ENCODE(((x) >> (low)) ^ (val), low, len))
+
+#define	BF32_GET_SB(x, low, len, shift, bias)	\
+	((BF32_GET(x, low, len) + (bias)) << (shift))
+#define	BF64_GET_SB(x, low, len, shift, bias)	\
+	((BF64_GET(x, low, len) + (bias)) << (shift))
+
+#define	BF32_SET_SB(x, low, len, shift, bias, val)	\
+	BF32_SET(x, low, len, ((val) >> (shift)) - (bias))
+#define	BF64_SET_SB(x, low, len, shift, bias, val)	\
+	BF64_SET(x, low, len, ((val) >> (shift)) - (bias))
+
+/*
+ * We currently support nine block sizes, from 512 bytes to 128K.
+ * We could go higher, but the benefits are near-zero and the cost
+ * of COWing a giant block to modify one byte would become excessive.
+ */
+#define	SPA_MINBLOCKSHIFT	9
+#define	SPA_MAXBLOCKSHIFT	17
+#define	SPA_MINBLOCKSIZE	(1ULL << SPA_MINBLOCKSHIFT)
+#define	SPA_MAXBLOCKSIZE	(1ULL << SPA_MAXBLOCKSHIFT)
+
+#define	SPA_BLOCKSIZES		(SPA_MAXBLOCKSHIFT - SPA_MINBLOCKSHIFT + 1)
+
+/*
+ * Size of block to hold the configuration data (a packed nvlist)
+ */
+#define	SPA_CONFIG_BLOCKSIZE	(1 << 14)
+
+/*
+ * The DVA size encodings for LSIZE and PSIZE support blocks up to 32MB.
+ * The ASIZE encoding should be at least 64 times larger (6 more bits)
+ * to support up to 4-way RAID-Z mirror mode with worst-case gang block
+ * overhead, three DVAs per bp, plus one more bit in case we do anything
+ * else that expands the ASIZE.
+ */
+#define	SPA_LSIZEBITS		16	/* LSIZE up to 32M (2^16 * 512)	*/
+#define	SPA_PSIZEBITS		16	/* PSIZE up to 32M (2^16 * 512)	*/
+#define	SPA_ASIZEBITS		24	/* ASIZE up to 64 times larger	*/
+
+/*
+ * All SPA data is represented by 128-bit data virtual addresses (DVAs).
+ * The members of the dva_t should be considered opaque outside the SPA.
+ */
+typedef struct dva {
+	uint64_t	dva_word[2];
+} dva_t;
+
+/*
+ * Each block has a 256-bit checksum -- strong enough for cryptographic hashes.
+ */
+typedef struct zio_cksum {
+	uint64_t	zc_word[4];
+} zio_cksum_t;
+
+/*
+ * Each block is described by its DVAs, time of birth, checksum, etc.
+ * The word-by-word, bit-by-bit layout of the blkptr is as follows:
+ *
+ *	64	56	48	40	32	24	16	8	0
+ *	+-------+-------+-------+-------+-------+-------+-------+-------+
+ * 0	|		vdev1		| GRID  |	  ASIZE		|
+ *	+-------+-------+-------+-------+-------+-------+-------+-------+
+ * 1	|G|			 offset1				|
+ *	+-------+-------+-------+-------+-------+-------+-------+-------+
+ * 2	|		vdev2		| GRID  |	  ASIZE		|
+ *	+-------+-------+-------+-------+-------+-------+-------+-------+
+ * 3	|G|			 offset2				|
+ *	+-------+-------+-------+-------+-------+-------+-------+-------+
+ * 4	|		vdev3		| GRID  |	  ASIZE		|
+ *	+-------+-------+-------+-------+-------+-------+-------+-------+
+ * 5	|G|			 offset3				|
+ *	+-------+-------+-------+-------+-------+-------+-------+-------+
+ * 6	|BDX|lvl| type	| cksum | comp	|     PSIZE	|     LSIZE	|
+ *	+-------+-------+-------+-------+-------+-------+-------+-------+
+ * 7	|			padding					|
+ *	+-------+-------+-------+-------+-------+-------+-------+-------+
+ * 8	|			padding					|
+ *	+-------+-------+-------+-------+-------+-------+-------+-------+
+ * 9	|			physical birth txg			|
+ *	+-------+-------+-------+-------+-------+-------+-------+-------+
+ * a	|			logical birth txg			|
+ *	+-------+-------+-------+-------+-------+-------+-------+-------+
+ * b	|			fill count				|
+ *	+-------+-------+-------+-------+-------+-------+-------+-------+
+ * c	|			checksum[0]				|
+ *	+-------+-------+-------+-------+-------+-------+-------+-------+
+ * d	|			checksum[1]				|
+ *	+-------+-------+-------+-------+-------+-------+-------+-------+
+ * e	|			checksum[2]				|
+ *	+-------+-------+-------+-------+-------+-------+-------+-------+
+ * f	|			checksum[3]				|
+ *	+-------+-------+-------+-------+-------+-------+-------+-------+
+ *
+ * Legend:
+ *
+ * vdev		virtual device ID
+ * offset	offset into virtual device
+ * LSIZE	logical size
+ * PSIZE	physical size (after compression)
+ * ASIZE	allocated size (including RAID-Z parity and gang block headers)
+ * GRID		RAID-Z layout information (reserved for future use)
+ * cksum	checksum function
+ * comp		compression function
+ * G		gang block indicator
+ * B		byteorder (endianness)
+ * D		dedup
+ * X		unused
+ * lvl		level of indirection
+ * type		DMU object type
+ * phys birth	txg of block allocation; zero if same as logical birth txg
+ * log. birth	transaction group in which the block was logically born
+ * fill count	number of non-zero blocks under this bp
+ * checksum[4]	256-bit checksum of the data this bp describes
+ */
+#define	SPA_BLKPTRSHIFT	7		/* blkptr_t is 128 bytes	*/
+#define	SPA_DVAS_PER_BP	3		/* Number of DVAs in a bp	*/
+
+typedef struct blkptr {
+	dva_t		blk_dva[SPA_DVAS_PER_BP]; /* Data Virtual Addresses */
+	uint64_t	blk_prop;	/* size, compression, type, etc	    */
+	uint64_t	blk_pad[2];	/* Extra space for the future	    */
+	uint64_t	blk_phys_birth;	/* txg when block was allocated	    */
+	uint64_t	blk_birth;	/* transaction group at birth	    */
+	uint64_t	blk_fill;	/* fill count			    */
+	zio_cksum_t	blk_cksum;	/* 256-bit checksum		    */
+} blkptr_t;
+
+/*
+ * Macros to get and set fields in a bp or DVA.
+ */
+#define	DVA_GET_ASIZE(dva)	\
+	BF64_GET_SB((dva)->dva_word[0], 0, 24, SPA_MINBLOCKSHIFT, 0)
+#define	DVA_SET_ASIZE(dva, x)	\
+	BF64_SET_SB((dva)->dva_word[0], 0, 24, SPA_MINBLOCKSHIFT, 0, x)
+
+#define	DVA_GET_GRID(dva)	BF64_GET((dva)->dva_word[0], 24, 8)
+#define	DVA_SET_GRID(dva, x)	BF64_SET((dva)->dva_word[0], 24, 8, x)
+
+#define	DVA_GET_VDEV(dva)	BF64_GET((dva)->dva_word[0], 32, 32)
+#define	DVA_SET_VDEV(dva, x)	BF64_SET((dva)->dva_word[0], 32, 32, x)
+
+#define	DVA_GET_OFFSET(dva)	\
+	BF64_GET_SB((dva)->dva_word[1], 0, 63, SPA_MINBLOCKSHIFT, 0)
+#define	DVA_SET_OFFSET(dva, x)	\
+	BF64_SET_SB((dva)->dva_word[1], 0, 63, SPA_MINBLOCKSHIFT, 0, x)
+
+#define	DVA_GET_GANG(dva)	BF64_GET((dva)->dva_word[1], 63, 1)
+#define	DVA_SET_GANG(dva, x)	BF64_SET((dva)->dva_word[1], 63, 1, x)
+
+#define	BP_GET_LSIZE(bp)	\
+	BF64_GET_SB((bp)->blk_prop, 0, 16, SPA_MINBLOCKSHIFT, 1)
+#define	BP_SET_LSIZE(bp, x)	\
+	BF64_SET_SB((bp)->blk_prop, 0, 16, SPA_MINBLOCKSHIFT, 1, x)
+
+#define	BP_GET_PSIZE(bp)	\
+	BF64_GET_SB((bp)->blk_prop, 16, 16, SPA_MINBLOCKSHIFT, 1)
+#define	BP_SET_PSIZE(bp, x)	\
+	BF64_SET_SB((bp)->blk_prop, 16, 16, SPA_MINBLOCKSHIFT, 1, x)
+
+#define	BP_GET_COMPRESS(bp)		BF64_GET((bp)->blk_prop, 32, 8)
+#define	BP_SET_COMPRESS(bp, x)		BF64_SET((bp)->blk_prop, 32, 8, x)
+
+#define	BP_GET_CHECKSUM(bp)		BF64_GET((bp)->blk_prop, 40, 8)
+#define	BP_SET_CHECKSUM(bp, x)		BF64_SET((bp)->blk_prop, 40, 8, x)
+
+#define	BP_GET_TYPE(bp)			BF64_GET((bp)->blk_prop, 48, 8)
+#define	BP_SET_TYPE(bp, x)		BF64_SET((bp)->blk_prop, 48, 8, x)
+
+#define	BP_GET_LEVEL(bp)		BF64_GET((bp)->blk_prop, 56, 5)
+#define	BP_SET_LEVEL(bp, x)		BF64_SET((bp)->blk_prop, 56, 5, x)
+
+#define	BP_GET_PROP_BIT_61(bp)		BF64_GET((bp)->blk_prop, 61, 1)
+#define	BP_SET_PROP_BIT_61(bp, x)	BF64_SET((bp)->blk_prop, 61, 1, x)
+
+#define	BP_GET_DEDUP(bp)		BF64_GET((bp)->blk_prop, 62, 1)
+#define	BP_SET_DEDUP(bp, x)		BF64_SET((bp)->blk_prop, 62, 1, x)
+
+#define	BP_GET_BYTEORDER(bp)		(0 - BF64_GET((bp)->blk_prop, 63, 1))
+#define	BP_SET_BYTEORDER(bp, x)		BF64_SET((bp)->blk_prop, 63, 1, x)
+
+#define	BP_PHYSICAL_BIRTH(bp)		\
+	((bp)->blk_phys_birth ? (bp)->blk_phys_birth : (bp)->blk_birth)
+
+#define	BP_SET_BIRTH(bp, logical, physical)	\
+{						\
+	(bp)->blk_birth = (logical);		\
+	(bp)->blk_phys_birth = ((logical) == (physical) ? 0 : (physical)); \
+}
+
+#define	BP_GET_ASIZE(bp)	\
+	(DVA_GET_ASIZE(&(bp)->blk_dva[0]) + DVA_GET_ASIZE(&(bp)->blk_dva[1]) + \
+		DVA_GET_ASIZE(&(bp)->blk_dva[2]))
+
+#define	BP_GET_UCSIZE(bp) \
+	((BP_GET_LEVEL(bp) > 0 || dmu_ot[BP_GET_TYPE(bp)].ot_metadata) ? \
+	BP_GET_PSIZE(bp) : BP_GET_LSIZE(bp))
+
+#define	BP_GET_NDVAS(bp)	\
+	(!!DVA_GET_ASIZE(&(bp)->blk_dva[0]) + \
+	!!DVA_GET_ASIZE(&(bp)->blk_dva[1]) + \
+	!!DVA_GET_ASIZE(&(bp)->blk_dva[2]))
+
+#define	BP_COUNT_GANG(bp)	\
+	(DVA_GET_GANG(&(bp)->blk_dva[0]) + \
+	DVA_GET_GANG(&(bp)->blk_dva[1]) + \
+	DVA_GET_GANG(&(bp)->blk_dva[2]))
+
+#define	DVA_EQUAL(dva1, dva2)	\
+	((dva1)->dva_word[1] == (dva2)->dva_word[1] && \
+	(dva1)->dva_word[0] == (dva2)->dva_word[0])
+
+#define	BP_EQUAL(bp1, bp2)	\
+	(BP_PHYSICAL_BIRTH(bp1) == BP_PHYSICAL_BIRTH(bp2) &&	\
+	DVA_EQUAL(&(bp1)->blk_dva[0], &(bp2)->blk_dva[0]) &&	\
+	DVA_EQUAL(&(bp1)->blk_dva[1], &(bp2)->blk_dva[1]) &&	\
+	DVA_EQUAL(&(bp1)->blk_dva[2], &(bp2)->blk_dva[2]))
+
+#define	ZIO_CHECKSUM_EQUAL(zc1, zc2) \
+	(0 == (((zc1).zc_word[0] - (zc2).zc_word[0]) | \
+	((zc1).zc_word[1] - (zc2).zc_word[1]) | \
+	((zc1).zc_word[2] - (zc2).zc_word[2]) | \
+	((zc1).zc_word[3] - (zc2).zc_word[3])))
+
+#define	DVA_IS_VALID(dva)	(DVA_GET_ASIZE(dva) != 0)
+
+#define	ZIO_SET_CHECKSUM(zcp, w0, w1, w2, w3)	\
+{						\
+	(zcp)->zc_word[0] = w0;			\
+	(zcp)->zc_word[1] = w1;			\
+	(zcp)->zc_word[2] = w2;			\
+	(zcp)->zc_word[3] = w3;			\
+}
+
+#define	BP_IDENTITY(bp)		(&(bp)->blk_dva[0])
+#define	BP_IS_GANG(bp)		DVA_GET_GANG(BP_IDENTITY(bp))
+#define	BP_IS_HOLE(bp)		((bp)->blk_birth == 0)
+
+/* BP_IS_RAIDZ(bp) assumes no block compression */
+#define	BP_IS_RAIDZ(bp)		(DVA_GET_ASIZE(&(bp)->blk_dva[0]) > \
+				BP_GET_PSIZE(bp))
+
+#define	BP_ZERO(bp)				\
+{						\
+	(bp)->blk_dva[0].dva_word[0] = 0;	\
+	(bp)->blk_dva[0].dva_word[1] = 0;	\
+	(bp)->blk_dva[1].dva_word[0] = 0;	\
+	(bp)->blk_dva[1].dva_word[1] = 0;	\
+	(bp)->blk_dva[2].dva_word[0] = 0;	\
+	(bp)->blk_dva[2].dva_word[1] = 0;	\
+	(bp)->blk_prop = 0;			\
+	(bp)->blk_pad[0] = 0;			\
+	(bp)->blk_pad[1] = 0;			\
+	(bp)->blk_phys_birth = 0;		\
+	(bp)->blk_birth = 0;			\
+	(bp)->blk_fill = 0;			\
+	ZIO_SET_CHECKSUM(&(bp)->blk_cksum, 0, 0, 0, 0);	\
+}
+
+/*
+ * Note: the byteorder is either 0 or -1, both of which are palindromes.
+ * This simplifies the endianness handling a bit.
+ */
+#ifdef _BIG_ENDIAN
+#define	ZFS_HOST_BYTEORDER	(0ULL)
+#else
+#define	ZFS_HOST_BYTEORDER	(-1ULL)
+#endif
+
+#define	BP_SHOULD_BYTESWAP(bp)	(BP_GET_BYTEORDER(bp) != ZFS_HOST_BYTEORDER)
+
+#define	BP_SPRINTF_LEN	320
+
+/*
+ * This macro allows code sharing between zfs, libzpool, and mdb.
+ * 'func' is either snprintf() or mdb_snprintf().
+ * 'ws' (whitespace) can be ' ' for single-line format, '\n' for multi-line.
+ */
+#define	SPRINTF_BLKPTR(func, ws, buf, bp, type, checksum, compress)	\
+{									\
+	static const char *copyname[] =					\
+	    { "zero", "single", "double", "triple" };			\
+	int size = BP_SPRINTF_LEN;					\
+	int len = 0;							\
+	int copies = 0;							\
+									\
+	if (bp == NULL) {						\
+		len = func(buf + len, size - len, "<NULL>");		\
+	} else if (BP_IS_HOLE(bp)) {					\
+		len = func(buf + len, size - len, "<hole>");		\
+	} else {							\
+		for (int d = 0; d < BP_GET_NDVAS(bp); d++) {		\
+			const dva_t *dva = &bp->blk_dva[d];		\
+			if (DVA_IS_VALID(dva))				\
+				copies++;				\
+			len += func(buf + len, size - len,		\
+			    "DVA[%d]=<%llu:%llx:%llx>%c", d,		\
+			    (u_longlong_t)DVA_GET_VDEV(dva),		\
+			    (u_longlong_t)DVA_GET_OFFSET(dva),		\
+			    (u_longlong_t)DVA_GET_ASIZE(dva),		\
+			    ws);					\
+		}							\
+		if (BP_IS_GANG(bp) &&					\
+		    DVA_GET_ASIZE(&bp->blk_dva[2]) <=			\
+		    DVA_GET_ASIZE(&bp->blk_dva[1]) / 2)			\
+			copies--;					\
+		len += func(buf + len, size - len,			\
+		    "[L%llu %s] %s %s %s %s %s %s%c"			\
+		    "size=%llxL/%llxP birth=%lluL/%lluP fill=%llu%c"	\
+		    "cksum=%llx:%llx:%llx:%llx",			\
+		    (u_longlong_t)BP_GET_LEVEL(bp),			\
+		    type,						\
+		    checksum,						\
+		    compress,						\
+		    BP_GET_BYTEORDER(bp) == 0 ? "BE" : "LE",		\
+		    BP_IS_GANG(bp) ? "gang" : "contiguous",		\
+		    BP_GET_DEDUP(bp) ? "dedup" : "unique",		\
+		    copyname[copies],					\
+		    ws,							\
+		    (u_longlong_t)BP_GET_LSIZE(bp),			\
+		    (u_longlong_t)BP_GET_PSIZE(bp),			\
+		    (u_longlong_t)bp->blk_birth,			\
+		    (u_longlong_t)BP_PHYSICAL_BIRTH(bp),		\
+		    (u_longlong_t)bp->blk_fill,				\
+		    ws,							\
+		    (u_longlong_t)bp->blk_cksum.zc_word[0],		\
+		    (u_longlong_t)bp->blk_cksum.zc_word[1],		\
+		    (u_longlong_t)bp->blk_cksum.zc_word[2],		\
+		    (u_longlong_t)bp->blk_cksum.zc_word[3]);		\
+	}								\
+	ASSERT(len < size);						\
+}
+
+#include <sys/dmu.h>
+
+#define	BP_GET_BUFC_TYPE(bp)						\
+	(((BP_GET_LEVEL(bp) > 0) || (dmu_ot[BP_GET_TYPE(bp)].ot_metadata)) ? \
+	ARC_BUFC_METADATA : ARC_BUFC_DATA)
+
+typedef enum spa_import_type {
+	SPA_IMPORT_EXISTING,
+	SPA_IMPORT_ASSEMBLE
+} spa_import_type_t;
+
+/* state manipulation functions */
+extern int spa_open(const char *pool, spa_t **, void *tag);
+extern int spa_open_rewind(const char *pool, spa_t **, void *tag,
+    nvlist_t *policy, nvlist_t **config);
+extern int spa_get_stats(const char *pool, nvlist_t **config,
+    char *altroot, size_t buflen);
+extern int spa_create(const char *pool, nvlist_t *config, nvlist_t *props,
+    const char *history_str, nvlist_t *zplprops);
+extern int spa_import_rootpool(char *devpath, char *devid);
+extern int spa_import(const char *pool, nvlist_t *config, nvlist_t *props,
+    uint64_t flags);
+extern nvlist_t *spa_tryimport(nvlist_t *tryconfig);
+extern int spa_destroy(char *pool);
+extern int spa_export(char *pool, nvlist_t **oldconfig, boolean_t force,
+    boolean_t hardforce);
+extern int spa_reset(char *pool);
+extern void spa_async_request(spa_t *spa, int flag);
+extern void spa_async_unrequest(spa_t *spa, int flag);
+extern void spa_async_suspend(spa_t *spa);
+extern void spa_async_resume(spa_t *spa);
+extern spa_t *spa_inject_addref(char *pool);
+extern void spa_inject_delref(spa_t *spa);
+extern void spa_scan_stat_init(spa_t *spa);
+extern int spa_scan_get_stats(spa_t *spa, pool_scan_stat_t *ps);
+
+#define	SPA_ASYNC_CONFIG_UPDATE	0x01
+#define	SPA_ASYNC_REMOVE	0x02
+#define	SPA_ASYNC_PROBE		0x04
+#define	SPA_ASYNC_RESILVER_DONE	0x08
+#define	SPA_ASYNC_RESILVER	0x10
+#define	SPA_ASYNC_AUTOEXPAND	0x20
+#define	SPA_ASYNC_REMOVE_DONE	0x40
+#define	SPA_ASYNC_REMOVE_STOP	0x80
+
+/*
+ * Controls the behavior of spa_vdev_remove().
+ */
+#define	SPA_REMOVE_UNSPARE	0x01
+#define	SPA_REMOVE_DONE		0x02
+
+/* device manipulation */
+extern int spa_vdev_add(spa_t *spa, nvlist_t *nvroot);
+extern int spa_vdev_attach(spa_t *spa, uint64_t guid, nvlist_t *nvroot,
+    int replacing);
+extern int spa_vdev_detach(spa_t *spa, uint64_t guid, uint64_t pguid,
+    int replace_done);
+extern int spa_vdev_remove(spa_t *spa, uint64_t guid, boolean_t unspare);
+extern boolean_t spa_vdev_remove_active(spa_t *spa);
+extern int spa_vdev_setpath(spa_t *spa, uint64_t guid, const char *newpath);
+extern int spa_vdev_setfru(spa_t *spa, uint64_t guid, const char *newfru);
+extern int spa_vdev_split_mirror(spa_t *spa, char *newname, nvlist_t *config,
+    nvlist_t *props, boolean_t exp);
+
+/* spare state (which is global across all pools) */
+extern void spa_spare_add(vdev_t *vd);
+extern void spa_spare_remove(vdev_t *vd);
+extern boolean_t spa_spare_exists(uint64_t guid, uint64_t *pool, int *refcnt);
+extern void spa_spare_activate(vdev_t *vd);
+
+/* L2ARC state (which is global across all pools) */
+extern void spa_l2cache_add(vdev_t *vd);
+extern void spa_l2cache_remove(vdev_t *vd);
+extern boolean_t spa_l2cache_exists(uint64_t guid, uint64_t *pool);
+extern void spa_l2cache_activate(vdev_t *vd);
+extern void spa_l2cache_drop(spa_t *spa);
+
+/* scanning */
+extern int spa_scan(spa_t *spa, pool_scan_func_t func);
+extern int spa_scan_stop(spa_t *spa);
+
+/* spa syncing */
+extern void spa_sync(spa_t *spa, uint64_t txg); /* only for DMU use */
+extern void spa_sync_allpools(void);
+
+/*
+ * DEFERRED_FREE must be large enough that regular blocks are not
+ * deferred.  XXX so can't we change it back to 1?
+ */
+#define	SYNC_PASS_DEFERRED_FREE	2	/* defer frees after this pass */
+#define	SYNC_PASS_DONT_COMPRESS	4	/* don't compress after this pass */
+#define	SYNC_PASS_REWRITE	1	/* rewrite new bps after this pass */
+
+/* spa namespace global mutex */
+extern kmutex_t spa_namespace_lock;
+
+/*
+ * SPA configuration functions in spa_config.c
+ */
+
+#define	SPA_CONFIG_UPDATE_POOL	0
+#define	SPA_CONFIG_UPDATE_VDEVS	1
+
+extern void spa_config_sync(spa_t *, boolean_t, boolean_t);
+extern void spa_config_load(void);
+extern nvlist_t *spa_all_configs(uint64_t *);
+extern void spa_config_set(spa_t *spa, nvlist_t *config);
+extern nvlist_t *spa_config_generate(spa_t *spa, vdev_t *vd, uint64_t txg,
+    int getstats);
+extern void spa_config_update(spa_t *spa, int what);
+
+/*
+ * Miscellaneous SPA routines in spa_misc.c
+ */
+
+/* Namespace manipulation */
+extern spa_t *spa_lookup(const char *name);
+extern spa_t *spa_add(const char *name, nvlist_t *config, const char *altroot);
+extern void spa_remove(spa_t *spa);
+extern spa_t *spa_next(spa_t *prev);
+
+/* Refcount functions */
+extern void spa_open_ref(spa_t *spa, void *tag);
+extern void spa_close(spa_t *spa, void *tag);
+extern boolean_t spa_refcount_zero(spa_t *spa);
+
+#define	SCL_NONE	0x00
+#define	SCL_CONFIG	0x01
+#define	SCL_STATE	0x02
+#define	SCL_L2ARC	0x04		/* hack until L2ARC 2.0 */
+#define	SCL_ALLOC	0x08
+#define	SCL_ZIO		0x10
+#define	SCL_FREE	0x20
+#define	SCL_VDEV	0x40
+#define	SCL_LOCKS	7
+#define	SCL_ALL		((1 << SCL_LOCKS) - 1)
+#define	SCL_STATE_ALL	(SCL_STATE | SCL_L2ARC | SCL_ZIO)
+
+/* Pool configuration locks */
+extern int spa_config_tryenter(spa_t *spa, int locks, void *tag, krw_t rw);
+extern void spa_config_enter(spa_t *spa, int locks, void *tag, krw_t rw);
+extern void spa_config_exit(spa_t *spa, int locks, void *tag);
+extern int spa_config_held(spa_t *spa, int locks, krw_t rw);
+
+/* Pool vdev add/remove lock */
+extern uint64_t spa_vdev_enter(spa_t *spa);
+extern uint64_t spa_vdev_config_enter(spa_t *spa);
+extern void spa_vdev_config_exit(spa_t *spa, vdev_t *vd, uint64_t txg,
+    int error, char *tag);
+extern int spa_vdev_exit(spa_t *spa, vdev_t *vd, uint64_t txg, int error);
+
+/* Pool vdev state change lock */
+extern void spa_vdev_state_enter(spa_t *spa, int oplock);
+extern int spa_vdev_state_exit(spa_t *spa, vdev_t *vd, int error);
+
+/* Log state */
+typedef enum spa_log_state {
+	SPA_LOG_UNKNOWN = 0,	/* unknown log state */
+	SPA_LOG_MISSING,	/* missing log(s) */
+	SPA_LOG_CLEAR,		/* clear the log(s) */
+	SPA_LOG_GOOD,		/* log(s) are good */
+} spa_log_state_t;
+
+extern spa_log_state_t spa_get_log_state(spa_t *spa);
+extern void spa_set_log_state(spa_t *spa, spa_log_state_t state);
+extern int spa_offline_log(spa_t *spa);
+
+/* Log claim callback */
+extern void spa_claim_notify(zio_t *zio);
+
+/* Accessor functions */
+extern boolean_t spa_shutting_down(spa_t *spa);
+extern struct dsl_pool *spa_get_dsl(spa_t *spa);
+extern blkptr_t *spa_get_rootblkptr(spa_t *spa);
+extern void spa_set_rootblkptr(spa_t *spa, const blkptr_t *bp);
+extern void spa_altroot(spa_t *, char *, size_t);
+extern int spa_sync_pass(spa_t *spa);
+extern char *spa_name(spa_t *spa);
+extern uint64_t spa_guid(spa_t *spa);
+extern uint64_t spa_last_synced_txg(spa_t *spa);
+extern uint64_t spa_first_txg(spa_t *spa);
+extern uint64_t spa_syncing_txg(spa_t *spa);
+extern uint64_t spa_version(spa_t *spa);
+extern pool_state_t spa_state(spa_t *spa);
+extern spa_load_state_t spa_load_state(spa_t *spa);
+extern uint64_t spa_freeze_txg(spa_t *spa);
+extern uint64_t spa_get_asize(spa_t *spa, uint64_t lsize);
+extern uint64_t spa_get_dspace(spa_t *spa);
+extern void spa_update_dspace(spa_t *spa);
+extern uint64_t spa_version(spa_t *spa);
+extern boolean_t spa_deflate(spa_t *spa);
+extern metaslab_class_t *spa_normal_class(spa_t *spa);
+extern metaslab_class_t *spa_log_class(spa_t *spa);
+extern int spa_max_replication(spa_t *spa);
+extern int spa_prev_software_version(spa_t *spa);
+extern int spa_busy(void);
+extern uint8_t spa_get_failmode(spa_t *spa);
+extern boolean_t spa_suspended(spa_t *spa);
+extern uint64_t spa_bootfs(spa_t *spa);
+extern uint64_t spa_delegation(spa_t *spa);
+extern objset_t *spa_meta_objset(spa_t *spa);
+
+/* Miscellaneous support routines */
+extern int spa_rename(const char *oldname, const char *newname);
+extern spa_t *spa_by_guid(uint64_t pool_guid, uint64_t device_guid);
+extern boolean_t spa_guid_exists(uint64_t pool_guid, uint64_t device_guid);
+extern char *spa_strdup(const char *);
+extern void spa_strfree(char *);
+extern uint64_t spa_get_random(uint64_t range);
+extern uint64_t spa_generate_guid(spa_t *spa);
+extern void sprintf_blkptr(char *buf, const blkptr_t *bp);
+extern void spa_freeze(spa_t *spa);
+extern void spa_upgrade(spa_t *spa, uint64_t version);
+extern void spa_evict_all(void);
+extern vdev_t *spa_lookup_by_guid(spa_t *spa, uint64_t guid,
+    boolean_t l2cache);
+extern boolean_t spa_has_spare(spa_t *, uint64_t guid);
+extern uint64_t dva_get_dsize_sync(spa_t *spa, const dva_t *dva);
+extern uint64_t bp_get_dsize_sync(spa_t *spa, const blkptr_t *bp);
+extern uint64_t bp_get_dsize(spa_t *spa, const blkptr_t *bp);
+extern boolean_t spa_has_slogs(spa_t *spa);
+extern boolean_t spa_is_root(spa_t *spa);
+extern boolean_t spa_writeable(spa_t *spa);
+
+extern int spa_mode(spa_t *spa);
+extern uint64_t strtonum(const char *str, char **nptr);
+
+/* history logging */
+typedef enum history_log_type {
+	LOG_CMD_POOL_CREATE,
+	LOG_CMD_NORMAL,
+	LOG_INTERNAL
+} history_log_type_t;
+
+typedef struct history_arg {
+	char *ha_history_str;
+	history_log_type_t ha_log_type;
+	history_internal_events_t ha_event;
+	char *ha_zone;
+	uid_t ha_uid;
+} history_arg_t;
+
+extern char *spa_his_ievent_table[];
+
+extern void spa_history_create_obj(spa_t *spa, dmu_tx_t *tx);
+extern int spa_history_get(spa_t *spa, uint64_t *offset, uint64_t *len_read,
+    char *his_buf);
+extern int spa_history_log(spa_t *spa, const char *his_buf,
+    history_log_type_t what);
+extern void spa_history_log_internal(history_internal_events_t event,
+    spa_t *spa, dmu_tx_t *tx, const char *fmt, ...);
+extern void spa_history_log_version(spa_t *spa, history_internal_events_t evt);
+
+/* error handling */
+struct zbookmark;
+extern void spa_log_error(spa_t *spa, zio_t *zio);
+extern void zfs_ereport_post(const char *class, spa_t *spa, vdev_t *vd,
+    zio_t *zio, uint64_t stateoroffset, uint64_t length);
+extern void zfs_post_remove(spa_t *spa, vdev_t *vd);
+extern void zfs_post_state_change(spa_t *spa, vdev_t *vd);
+extern void zfs_post_autoreplace(spa_t *spa, vdev_t *vd);
+extern uint64_t spa_get_errlog_size(spa_t *spa);
+extern int spa_get_errlog(spa_t *spa, void *uaddr, size_t *count);
+extern void spa_errlog_rotate(spa_t *spa);
+extern void spa_errlog_drain(spa_t *spa);
+extern void spa_errlog_sync(spa_t *spa, uint64_t txg);
+extern void spa_get_errlists(spa_t *spa, avl_tree_t *last, avl_tree_t *scrub);
+
+/* vdev cache */
+extern void vdev_cache_stat_init(void);
+extern void vdev_cache_stat_fini(void);
+
+/* Initialization and termination */
+extern void spa_init(int flags);
+extern void spa_fini(void);
+extern void spa_boot_init(void);
+
+/* properties */
+extern int spa_prop_set(spa_t *spa, nvlist_t *nvp);
+extern int spa_prop_get(spa_t *spa, nvlist_t **nvp);
+extern void spa_prop_clear_bootfs(spa_t *spa, uint64_t obj, dmu_tx_t *tx);
+extern void spa_configfile_set(spa_t *, nvlist_t *, boolean_t);
+
+/* asynchronous event notification */
+extern void spa_event_notify(spa_t *spa, vdev_t *vdev, const char *name);
+
+#ifdef ZFS_DEBUG
+#define	dprintf_bp(bp, fmt, ...) do {				\
+	if (zfs_flags & ZFS_DEBUG_DPRINTF) { 			\
+	char *__blkbuf = kmem_alloc(BP_SPRINTF_LEN, KM_SLEEP);	\
+	sprintf_blkptr(__blkbuf, (bp));				\
+	dprintf(fmt " %s\n", __VA_ARGS__, __blkbuf);		\
+	kmem_free(__blkbuf, BP_SPRINTF_LEN);			\
+	} \
+_NOTE(CONSTCOND) } while (0)
+#else
+#define	dprintf_bp(bp, fmt, ...)
+#endif
+
+extern int spa_mode_global;			/* mode, e.g. FREAD | FWRITE */
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif	/* _SYS_SPA_H */
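
The shift-and-bias encoding used by `BP_GET_LSIZE` and `BP_SET_LSIZE` in spa.h above can be illustrated with ordinary functions instead of macros. The sketch below is an assumption-laden rewrite, not the header's implementation: `bf64_get`, `bf64_set`, `prop_set_lsize` and `prop_get_lsize` are hypothetical names, and the mask-based set is a simpler equivalent of the XOR form that `BF64_SET` uses. It assumes 0 < len < 64.

```c
#include <assert.h>
#include <stdint.h>

#define	MINBLOCKSHIFT	9	/* mirrors SPA_MINBLOCKSHIFT: 512-byte units */

/* Extract the len-bit field that starts at bit low. */
static uint64_t
bf64_get(uint64_t x, int low, int len)
{
	return ((x >> low) & ((1ULL << len) - 1));
}

/* Return x with the len-bit field at bit low replaced by val. */
static uint64_t
bf64_set(uint64_t x, int low, int len, uint64_t val)
{
	uint64_t mask = ((1ULL << len) - 1) << low;

	return ((x & ~mask) | ((val << low) & mask));
}

/* LSIZE lives in bits 0-15 of blk_prop, stored as (bytes >> 9) - 1. */
static uint64_t
prop_set_lsize(uint64_t prop, uint64_t bytes)
{
	assert(bytes != 0);
	assert(bytes % (1ULL << MINBLOCKSHIFT) == 0);
	return (bf64_set(prop, 0, 16, (bytes >> MINBLOCKSHIFT) - 1));
}

/* Undo the bias first, then the shift, to get the size in bytes. */
static uint64_t
prop_get_lsize(uint64_t prop)
{
	return ((bf64_get(prop, 0, 16) + 1) << MINBLOCKSHIFT);
}
```

The bias of 1 means a stored value of 0 stands for one 512-byte sector, so a 16-bit field reaches 2^16 * 512 bytes = 32 MB, matching the comment on `SPA_LSIZEBITS`. A 128K block is stored as 255.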
--- a/uts/common/fs/zfs/sys/spa_boot.h
+++ b/uts/common/fs/zfs/sys/spa_boot.h
@ -0,0 +1,42 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
+ * Use is subject to license terms.
+ */
+
+#ifndef _SYS_SPA_BOOT_H
+#define	_SYS_SPA_BOOT_H
+
+#include <sys/nvpair.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+extern char *spa_get_bootprop(char *prop);
+extern void spa_free_bootprop(char *prop);
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif	/* _SYS_SPA_BOOT_H */
--- a/uts/common/fs/zfs/sys/spa_impl.h
+++ b/uts/common/fs/zfs/sys/spa_impl.h
@ -0,0 +1,235 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ */
+
+#ifndef _SYS_SPA_IMPL_H
+#define	_SYS_SPA_IMPL_H
+
+#include <sys/spa.h>
+#include <sys/vdev.h>
+#include <sys/metaslab.h>
+#include <sys/dmu.h>
+#include <sys/dsl_pool.h>
+#include <sys/uberblock_impl.h>
+#include <sys/zfs_context.h>
+#include <sys/avl.h>
+#include <sys/refcount.h>
+#include <sys/bplist.h>
+#include <sys/bpobj.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+typedef struct spa_error_entry {
+	zbookmark_t	se_bookmark;
+	char		*se_name;
+	avl_node_t	se_avl;
+} spa_error_entry_t;
+
+typedef struct spa_history_phys {
+	uint64_t sh_pool_create_len;	/* ending offset of zpool create */
+	uint64_t sh_phys_max_off;	/* physical EOF */
+	uint64_t sh_bof;		/* logical BOF */
+	uint64_t sh_eof;		/* logical EOF */
+	uint64_t sh_records_lost;	/* num of records overwritten */
+} spa_history_phys_t;
+
+struct spa_aux_vdev {
+	uint64_t	sav_object;		/* MOS object for device list */
+	nvlist_t	*sav_config;		/* cached device config */
+	vdev_t		**sav_vdevs;		/* devices */
+	int		sav_count;		/* number of devices */
+	boolean_t	sav_sync;		/* sync the device list */
+	nvlist_t	**sav_pending;		/* pending device additions */
+	uint_t		sav_npending;		/* # pending devices */
+};
+
+typedef struct spa_config_lock {
+	kmutex_t	scl_lock;
+	kthread_t	*scl_writer;
+	int		scl_write_wanted;
+	kcondvar_t	scl_cv;
+	refcount_t	scl_count;
+} spa_config_lock_t;
+
+typedef struct spa_config_dirent {
+	list_node_t	scd_link;
+	char		*scd_path;
+} spa_config_dirent_t;
+
+enum zio_taskq_type {
+	ZIO_TASKQ_ISSUE = 0,
+	ZIO_TASKQ_ISSUE_HIGH,
+	ZIO_TASKQ_INTERRUPT,
+	ZIO_TASKQ_INTERRUPT_HIGH,
+	ZIO_TASKQ_TYPES
+};
+
+/*
+ * State machine for the zpool-poolname process.  The state transitions
+ * are done as follows:
+ *
+ *	From		   To			Routine
+ *	PROC_NONE	-> PROC_CREATED		spa_activate()
+ *	PROC_CREATED	-> PROC_ACTIVE		spa_thread()
+ *	PROC_ACTIVE	-> PROC_DEACTIVATE	spa_deactivate()
+ *	PROC_DEACTIVATE	-> PROC_GONE		spa_thread()
+ *	PROC_GONE	-> PROC_NONE		spa_deactivate()
+ */
+typedef enum spa_proc_state {
+	SPA_PROC_NONE,		/* spa_proc = &p0, no process created */
+	SPA_PROC_CREATED,	/* spa_activate() has proc, is waiting */
+	SPA_PROC_ACTIVE,	/* taskqs created, spa_proc set */
+	SPA_PROC_DEACTIVATE,	/* spa_deactivate() requests process exit */
+	SPA_PROC_GONE		/* spa_thread() is exiting, spa_proc = &p0 */
+} spa_proc_state_t;
+
+struct spa {
+	/*
+	 * Fields protected by spa_namespace_lock.
+	 */
+	char		spa_name[MAXNAMELEN];	/* pool name */
+	avl_node_t	spa_avl;		/* node in spa_namespace_avl */
+	nvlist_t	*spa_config;		/* last synced config */
+	nvlist_t	*spa_config_syncing;	/* currently syncing config */
+	nvlist_t	*spa_config_splitting;	/* config for splitting */
+	nvlist_t	*spa_load_info;		/* info and errors from load */
+	uint64_t	spa_config_txg;		/* txg of last config change */
+	int		spa_sync_pass;		/* iterate-to-convergence */
+	pool_state_t	spa_state;		/* pool state */
+	int		spa_inject_ref;		/* injection references */
+	uint8_t		spa_sync_on;		/* sync threads are running */
+	spa_load_state_t spa_load_state;	/* current load operation */
+	uint64_t	spa_import_flags;	/* import specific flags */
+	taskq_t		*spa_zio_taskq[ZIO_TYPES][ZIO_TASKQ_TYPES];
+	dsl_pool_t	*spa_dsl_pool;
+	metaslab_class_t *spa_normal_class;	/* normal data class */
+	metaslab_class_t *spa_log_class;	/* intent log data class */
+	uint64_t	spa_first_txg;		/* first txg after spa_open() */
+	uint64_t	spa_final_txg;		/* txg of export/destroy */
+	uint64_t	spa_freeze_txg;		/* freeze pool at this txg */
+	uint64_t	spa_load_max_txg;	/* best initial ub_txg */
+	uint64_t	spa_claim_max_txg;	/* highest claimed birth txg */
+	timespec_t	spa_loaded_ts;		/* 1st successful open time */
+	objset_t	*spa_meta_objset;	/* copy of dp->dp_meta_objset */
+	txg_list_t	spa_vdev_txg_list;	/* per-txg dirty vdev list */
+	vdev_t		*spa_root_vdev;		/* top-level vdev container */
+	uint64_t	spa_load_guid;		/* initial guid for spa_load */
+	list_t		spa_config_dirty_list;	/* vdevs with dirty config */
+	list_t		spa_state_dirty_list;	/* vdevs with dirty state */
+	spa_aux_vdev_t	spa_spares;		/* hot spares */
+	spa_aux_vdev_t	spa_l2cache;		/* L2ARC cache devices */
+	uint64_t	spa_config_object;	/* MOS object for pool config */
+	uint64_t	spa_config_generation;	/* config generation number */
+	uint64_t	spa_syncing_txg;	/* txg currently syncing */
+	bpobj_t		spa_deferred_bpobj;	/* deferred-free bplist */
+	bplist_t	spa_free_bplist[TXG_SIZE]; /* bplist of stuff to free */
+	uberblock_t	spa_ubsync;		/* last synced uberblock */
+	uberblock_t	spa_uberblock;		/* current uberblock */
+	boolean_t	spa_extreme_rewind;	/* rewind past deferred frees */
+	uint64_t	spa_last_io;		/* lbolt of last non-scan I/O */
+	kmutex_t	spa_scrub_lock;		/* resilver/scrub lock */
+	uint64_t	spa_scrub_inflight;	/* in-flight scrub I/Os */
+	kcondvar_t	spa_scrub_io_cv;	/* scrub I/O completion */
+	uint8_t		spa_scrub_active;	/* active or suspended? */
+	uint8_t		spa_scrub_type;		/* type of scrub we're doing */
+	uint8_t		spa_scrub_finished;	/* indicator to rotate logs */
+	uint8_t		spa_scrub_started;	/* started since last boot */
+	uint8_t		spa_scrub_reopen;	/* scrub doing vdev_reopen */
+	uint64_t	spa_scan_pass_start;	/* start time per pass/reboot */
+	uint64_t	spa_scan_pass_exam;	/* examined bytes per pass */
+	kmutex_t	spa_async_lock;		/* protect async state */
+	kthread_t	*spa_async_thread;	/* thread doing async task */
+	int		spa_async_suspended;	/* async tasks suspended */
+	kcondvar_t	spa_async_cv;		/* wait for thread_exit() */
+	uint16_t	spa_async_tasks;	/* async task mask */
+	char		*spa_root;		/* alternate root directory */
+	uint64_t	spa_ena;		/* spa-wide ereport ENA */
+	int		spa_last_open_failed;	/* error if last open failed */
+	uint64_t	spa_last_ubsync_txg;	/* "best" uberblock txg */
+	uint64_t	spa_last_ubsync_txg_ts;	/* timestamp from that ub */
+	uint64_t	spa_load_txg;		/* ub txg that loaded */
+	uint64_t	spa_load_txg_ts;	/* timestamp from that ub */
+	uint64_t	spa_load_meta_errors;	/* verify metadata err count */
+	uint64_t	spa_load_data_errors;	/* verify data err count */
+	uint64_t	spa_verify_min_txg;	/* start txg of verify scrub */
+	kmutex_t	spa_errlog_lock;	/* error log lock */
+	uint64_t	spa_errlog_last;	/* last error log object */
+	uint64_t	spa_errlog_scrub;	/* scrub error log object */
+	kmutex_t	spa_errlist_lock;	/* error list/ereport lock */
+	avl_tree_t	spa_errlist_last;	/* last error list */
+	avl_tree_t	spa_errlist_scrub;	/* scrub error list */
+	uint64_t	spa_deflate;		/* should we deflate? */
+	uint64_t	spa_history;		/* history object */
+	kmutex_t	spa_history_lock;	/* history lock */
+	vdev_t		*spa_pending_vdev;	/* pending vdev additions */
+	kmutex_t	spa_props_lock;		/* property lock */
+	uint64_t	spa_pool_props_object;	/* object for properties */
+	uint64_t	spa_bootfs;		/* default boot filesystem */
+	uint64_t	spa_failmode;		/* failure mode for the pool */
+	uint64_t	spa_delegation;		/* delegation on/off */
+	list_t		spa_config_list;	/* previous cache file(s) */
+	zio_t		*spa_async_zio_root;	/* root of all async I/O */
+	zio_t		*spa_suspend_zio_root;	/* root of all suspended I/O */
+	kmutex_t	spa_suspend_lock;	/* protects suspend_zio_root */
+	kcondvar_t	spa_suspend_cv;		/* notification of resume */
+	uint8_t		spa_suspended;		/* pool is suspended */
+	uint8_t		spa_claiming;		/* pool is doing zil_claim() */
+	boolean_t	spa_is_root;		/* pool is root */
+	int		spa_minref;		/* num refs when first opened */
+	int		spa_mode;		/* FREAD | FWRITE */
+	spa_log_state_t spa_log_state;		/* log state */
+	uint64_t	spa_autoexpand;		/* lun expansion on/off */
+	ddt_t		*spa_ddt[ZIO_CHECKSUM_FUNCTIONS]; /* in-core DDTs */
+	uint64_t	spa_ddt_stat_object;	/* DDT statistics */
+	uint64_t	spa_dedup_ditto;	/* dedup ditto threshold */
+	uint64_t	spa_dedup_checksum;	/* default dedup checksum */
+	uint64_t	spa_dspace;		/* dspace in normal class */
+	kmutex_t	spa_vdev_top_lock;	/* dueling offline/remove */
+	kmutex_t	spa_proc_lock;		/* protects spa_proc* */
+	kcondvar_t	spa_proc_cv;		/* spa_proc_state transitions */
+	spa_proc_state_t spa_proc_state;	/* see definition */
+	struct proc	*spa_proc;		/* "zpool-poolname" process */
+	uint64_t	spa_did;		/* if procp != p0, did of t1 */
+	boolean_t	spa_autoreplace;	/* autoreplace set in open */
+	int		spa_vdev_locks;		/* locks grabbed */
+	uint64_t	spa_creation_version;	/* version at pool creation */
+	uint64_t	spa_prev_software_version;
+	/*
+	 * spa_refcount & spa_config_lock must be the last elements
+	 * because refcount_t changes size based on compilation options.
+	 * In order for the MDB module to function correctly, the other
+	 * fields must remain in the same location.
+	 */
+	spa_config_lock_t spa_config_lock[SCL_LOCKS]; /* config changes */
+	refcount_t	spa_refcount;		/* number of opens */
+};
+
+extern const char *spa_config_path;
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif	/* _SYS_SPA_IMPL_H */
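Reviewer note on the spa_proc_state comment in spa_impl.h: the transition table can be read as a small validity check. The following is a minimal standalone sketch, not code from this commit; every `ex_`/`EX_` name is hypothetical, and it only mirrors the From/To columns of the table (the Routine column is not modelled).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of spa_proc_state_t, for illustration only. */
typedef enum ex_proc_state {
	EX_PROC_NONE,
	EX_PROC_CREATED,
	EX_PROC_ACTIVE,
	EX_PROC_DEACTIVATE,
	EX_PROC_GONE
} ex_proc_state_t;

/* Returns true if from -> to is one of the five transitions in the table. */
static bool
ex_proc_transition_ok(ex_proc_state_t from, ex_proc_state_t to)
{
	switch (from) {
	case EX_PROC_NONE:
		return (to == EX_PROC_CREATED);
	case EX_PROC_CREATED:
		return (to == EX_PROC_ACTIVE);
	case EX_PROC_ACTIVE:
		return (to == EX_PROC_DEACTIVATE);
	case EX_PROC_DEACTIVATE:
		return (to == EX_PROC_GONE);
	case EX_PROC_GONE:
		return (to == EX_PROC_NONE);
	}
	return (false);
}

/* Advance the state, asserting that the transition is a listed one. */
static ex_proc_state_t
ex_proc_advance(ex_proc_state_t from, ex_proc_state_t to)
{
	assert(ex_proc_transition_ok(from, to));
	return (to);
}
```

The table describes a single cycle, so each state has exactly one legal successor; anything else (for example NONE straight to ACTIVE) is rejected by the sketch.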
--- a/uts/common/fs/zfs/sys/space_map.h
+++ b/uts/common/fs/zfs/sys/space_map.h
@ -0,0 +1,179 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
+ * Use is subject to license terms.
+ */
+
+#ifndef _SYS_SPACE_MAP_H
+#define	_SYS_SPACE_MAP_H
+
+#include <sys/avl.h>
+#include <sys/dmu.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+typedef struct space_map_ops space_map_ops_t;
+
+typedef struct space_map {
+	avl_tree_t	sm_root;	/* AVL tree of map segments */
+	uint64_t	sm_space;	/* sum of all segments in the map */
+	uint64_t	sm_start;	/* start of map */
+	uint64_t	sm_size;	/* size of map */
+	uint8_t		sm_shift;	/* unit shift */
+	uint8_t		sm_pad[3];	/* unused */
+	uint8_t		sm_loaded;	/* map loaded? */
+	uint8_t		sm_loading;	/* map loading? */
+	kcondvar_t	sm_load_cv;	/* map load completion */
+	space_map_ops_t	*sm_ops;	/* space map block picker ops vector */
+	avl_tree_t	*sm_pp_root;	/* picker-private AVL tree */
+	void		*sm_ppd;	/* picker-private data */
+	kmutex_t	*sm_lock;	/* pointer to lock that protects map */
+} space_map_t;
+
+typedef struct space_seg {
+	avl_node_t	ss_node;	/* AVL node */
+	avl_node_t	ss_pp_node;	/* AVL picker-private node */
+	uint64_t	ss_start;	/* starting offset of this segment */
+	uint64_t	ss_end;		/* ending offset (non-inclusive) */
+} space_seg_t;
+
+typedef struct space_ref {
+	avl_node_t	sr_node;	/* AVL node */
+	uint64_t	sr_offset;	/* offset (start or end) */
+	int64_t		sr_refcnt;	/* associated reference count */
+} space_ref_t;
+
+typedef struct space_map_obj {
+	uint64_t	smo_object;	/* on-disk space map object */
+	uint64_t	smo_objsize;	/* size of the object */
+	uint64_t	smo_alloc;	/* space allocated from the map */
+} space_map_obj_t;
+
+struct space_map_ops {
+	void	(*smop_load)(space_map_t *sm);
+	void	(*smop_unload)(space_map_t *sm);
+	uint64_t (*smop_alloc)(space_map_t *sm, uint64_t size);
+	void	(*smop_claim)(space_map_t *sm, uint64_t start, uint64_t size);
+	void	(*smop_free)(space_map_t *sm, uint64_t start, uint64_t size);
+	uint64_t (*smop_max)(space_map_t *sm);
+	boolean_t (*smop_fragmented)(space_map_t *sm);
+};
+
+/*
+ * debug entry
+ *
+ *    1      3         10                     50
+ *  ,---+--------+------------+---------------------------------.
+ *  | 1 | action |  syncpass  |        txg (lower bits)         |
+ *  `---+--------+------------+---------------------------------'
+ *   63  62    60 59        50 49                               0
+ *
+ *
+ *
+ * non-debug entry
+ *
+ *    1               47                   1           15
+ *  ,-----------------------------------------------------------.
+ *  | 0 |   offset (sm_shift units)    | type |       run       |
+ *  `-----------------------------------------------------------'
+ *   63  62                          16   15   14               0
+ */
+
+/* All this stuff takes and returns bytes */
+#define	SM_RUN_DECODE(x)	(BF64_DECODE(x, 0, 15) + 1)
+#define	SM_RUN_ENCODE(x)	BF64_ENCODE((x) - 1, 0, 15)
+#define	SM_TYPE_DECODE(x)	BF64_DECODE(x, 15, 1)
+#define	SM_TYPE_ENCODE(x)	BF64_ENCODE(x, 15, 1)
+#define	SM_OFFSET_DECODE(x)	BF64_DECODE(x, 16, 47)
+#define	SM_OFFSET_ENCODE(x)	BF64_ENCODE(x, 16, 47)
+#define	SM_DEBUG_DECODE(x)	BF64_DECODE(x, 63, 1)
+#define	SM_DEBUG_ENCODE(x)	BF64_ENCODE(x, 63, 1)
+
+#define	SM_DEBUG_ACTION_DECODE(x)	BF64_DECODE(x, 60, 3)
+#define	SM_DEBUG_ACTION_ENCODE(x)	BF64_ENCODE(x, 60, 3)
+
+#define	SM_DEBUG_SYNCPASS_DECODE(x)	BF64_DECODE(x, 50, 10)
+#define	SM_DEBUG_SYNCPASS_ENCODE(x)	BF64_ENCODE(x, 50, 10)
+
+#define	SM_DEBUG_TXG_DECODE(x)		BF64_DECODE(x, 0, 50)
+#define	SM_DEBUG_TXG_ENCODE(x)		BF64_ENCODE(x, 0, 50)
+
+#define	SM_RUN_MAX			SM_RUN_DECODE(~0ULL)
+
+#define	SM_ALLOC	0x0
+#define	SM_FREE		0x1
+
+/*
+ * The data for a given space map can be kept on blocks of any size.
+ * Larger blocks entail fewer i/o operations, but they also cause the
+ * DMU to keep more data in-core, and also to waste more i/o bandwidth
+ * when only a few blocks have changed since the last transaction group.
+ * This could use a lot more research, but for now, set the freelist
+ * block size to 4k (2^12).
+ */
+#define	SPACE_MAP_BLOCKSHIFT	12
+
+typedef void space_map_func_t(space_map_t *sm, uint64_t start, uint64_t size);
+
+extern void space_map_create(space_map_t *sm, uint64_t start, uint64_t size,
+    uint8_t shift, kmutex_t *lp);
+extern void space_map_destroy(space_map_t *sm);
+extern void space_map_add(space_map_t *sm, uint64_t start, uint64_t size);
+extern void space_map_remove(space_map_t *sm, uint64_t start, uint64_t size);
+extern boolean_t space_map_contains(space_map_t *sm,
+    uint64_t start, uint64_t size);
+extern void space_map_vacate(space_map_t *sm,
+    space_map_func_t *func, space_map_t *mdest);
+extern void space_map_walk(space_map_t *sm,
+    space_map_func_t *func, space_map_t *mdest);
+
+extern void space_map_load_wait(space_map_t *sm);
+extern int space_map_load(space_map_t *sm, space_map_ops_t *ops,
+    uint8_t maptype, space_map_obj_t *smo, objset_t *os);
+extern void space_map_unload(space_map_t *sm);
+
+extern uint64_t space_map_alloc(space_map_t *sm, uint64_t size);
+extern void space_map_claim(space_map_t *sm, uint64_t start, uint64_t size);
+extern void space_map_free(space_map_t *sm, uint64_t start, uint64_t size);
+extern uint64_t space_map_maxsize(space_map_t *sm);
+
+extern void space_map_sync(space_map_t *sm, uint8_t maptype,
+    space_map_obj_t *smo, objset_t *os, dmu_tx_t *tx);
+extern void space_map_truncate(space_map_obj_t *smo,
+    objset_t *os, dmu_tx_t *tx);
+
+extern void space_map_ref_create(avl_tree_t *t);
+extern void space_map_ref_destroy(avl_tree_t *t);
+extern void space_map_ref_add_seg(avl_tree_t *t,
+    uint64_t start, uint64_t end, int64_t refcnt);
+extern void space_map_ref_add_map(avl_tree_t *t,
+    space_map_t *sm, int64_t refcnt);
+extern void space_map_ref_generate_map(avl_tree_t *t,
+    space_map_t *sm, int64_t minref);
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif	/* _SYS_SPACE_MAP_H */
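Reviewer note on the non-debug entry layout in space_map.h: the SM_*_ENCODE/DECODE macros place a 15-bit run at bit 0 (stored as run - 1), a 1-bit type at bit 15, and a 47-bit offset at bit 16, with bit 63 left clear. The sketch below packs and unpacks one entry. It is standalone and hedged: the `EX_BF64_*` macros are local stand-ins that assume BF64_ENCODE/BF64_DECODE simply place or extract a `len`-bit field starting at bit `low`, and `ex_sm_entry_pack` is a hypothetical helper, not a function from this commit.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed semantics of BF64_DECODE/BF64_ENCODE: a len-bit field at bit low. */
#define	EX_BF64_DECODE(x, low, len) \
	(((uint64_t)(x) >> (low)) & ((1ULL << (len)) - 1))
#define	EX_BF64_ENCODE(x, low, len) \
	(((uint64_t)(x) & ((1ULL << (len)) - 1)) << (low))

/* Same field positions as the SM_* macros in the header. */
#define	EX_SM_RUN_DECODE(x)	(EX_BF64_DECODE(x, 0, 15) + 1)
#define	EX_SM_RUN_ENCODE(x)	EX_BF64_ENCODE((x) - 1, 0, 15)
#define	EX_SM_TYPE_DECODE(x)	EX_BF64_DECODE(x, 15, 1)
#define	EX_SM_TYPE_ENCODE(x)	EX_BF64_ENCODE(x, 15, 1)
#define	EX_SM_OFFSET_DECODE(x)	EX_BF64_DECODE(x, 16, 47)
#define	EX_SM_OFFSET_ENCODE(x)	EX_BF64_ENCODE(x, 16, 47)
#define	EX_SM_DEBUG_DECODE(x)	EX_BF64_DECODE(x, 63, 1)
#define	EX_SM_RUN_MAX		EX_SM_RUN_DECODE(~0ULL)

#define	EX_SM_ALLOC	0x0
#define	EX_SM_FREE	0x1

/* Hypothetical helper: build one non-debug entry (debug bit stays 0). */
static uint64_t
ex_sm_entry_pack(uint64_t offset, uint64_t type, uint64_t run)
{
	/* The run field stores run - 1, so valid runs are 1..EX_SM_RUN_MAX. */
	assert(run >= 1 && run <= EX_SM_RUN_MAX);
	return (EX_SM_OFFSET_ENCODE(offset) |
	    EX_SM_TYPE_ENCODE(type) |
	    EX_SM_RUN_ENCODE(run));
}
```

Storing run - 1 is what lets a 15-bit field describe runs of up to 2^15 units; a zero-length run is not representable.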
--- a/uts/common/fs/zfs/sys/txg.h
+++ b/uts/common/fs/zfs/sys/txg.h
@ -0,0 +1,131 @@
+/*
+ * CDDL HEADER START
+ *
+ * The contents of this file are subject to the terms of the
+ * Common Development and Distribution License (the "License").
+ * You may not use this file except in compliance with the License.
+ *
+ * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
+ * or http://www.opensolaris.org/os/licensing.
+ * See the License for the specific language governing permissions
+ * and limitations under the License.
+ *
+ * When distributing Covered Code, include this CDDL HEADER in each
+ * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
+ * If applicable, add the following below this CDDL HEADER, with the
+ * fields enclosed by brackets "[]" replaced with your own identifying
+ * information: Portions Copyright [yyyy] [name of copyright owner]
+ *
+ * CDDL HEADER END
+ */
+/*
+ * Copyright 2010 Sun Microsystems, Inc.  All rights reserved.
+ * Use is subject to license terms.
+ */
+
+#ifndef _SYS_TXG_H
+#define	_SYS_TXG_H
+
+#include <sys/spa.h>
+#include <sys/zfs_context.h>
+
+#ifdef	__cplusplus
+extern "C" {
+#endif
+
+#define	TXG_CONCURRENT_STATES	3	/* open, quiescing, syncing	*/
+#define	TXG_SIZE		4		/* next power of 2	*/
+#define	TXG_MASK		(TXG_SIZE - 1)	/* mask for size	*/
+#define	TXG_INITIAL		TXG_SIZE	/* initial txg 		*/
+#define	TXG_IDX			(txg & TXG_MASK)
+
+/* Number of txgs worth of frees we defer adding to in-core spacemaps */
+#define	TXG_DEFER_SIZE		2
+
+#define	TXG_WAIT		1ULL
+#define	TXG_NOWAIT		2ULL
+
+typedef struct tx_cpu tx_cpu_t;
+
+typedef struct txg_handle {
+	tx_cpu_t	*th_cpu;
+	uint64_t	th_txg;
+} txg_handle_t;
+
+typedef struct txg_node {
+	struct txg_node	*tn_next[TXG_SIZE];
+	uint8_t		tn_member[TXG_SIZE];
+} txg_node_t;
+
+typedef struct txg_list {
+	kmutex_t	tl_lock;
+	size_t		tl_offset;
+	txg_node_t	*tl_head[TXG_SIZE];
+} txg_list_t;
+
+struct dsl_pool;
+
+extern void txg_init(struct dsl_pool *dp, uint64_t txg);
+extern void txg_fini(struct dsl_pool *dp);
+extern void txg_sync_start(struct dsl_pool *dp);
+extern void txg_sync_stop(struct dsl_pool *dp);
+extern uint64_t txg_hold_open(struct dsl_pool *dp, txg_handle_t *txghp);
+extern void txg_rele_to_quiesce(txg_handle_t *txghp);
+extern void txg_rele_to_sync(txg_handle_t *txghp);
+extern void txg_register_callbacks(txg_handle_t *txghp, list_t *tx_callbacks);
+
+/*
+ * Delay the caller by the specified number of ticks or until
+ * the txg closes (whichever comes first).  This is intended
+ * to be used to throttle writers when the system nears its
+ * capacity.
+ */
+extern void txg_delay(struct dsl_pool *dp, uint64_t txg, int ticks);
+
+/*
+ * Wait until the given transaction group has finished syncing.
+ * Try to make this happen as soon as possible (e.g. kick off any
+ * necessary syncs immediately).  If txg==0, wait for the currently open
+ * txg to finish syncing.
+ */
+extern void txg_wait_synced(struct dsl_pool *dp, uint64_t txg);
+
+/*
+ * Wait until the given transaction group, or one after it, is
+ * the open transaction group.  Try to make this happen as soon
+ * as possible (e.g. kick off any necessary syncs immediately).
+ * If txg == 0, wait for the next open txg.
+ */
+extern void txg_wait_open(struct dsl_pool *dp, uint64_t txg);
+
+/*
+ * Returns TRUE if we are "backed up" waiting for the syncing
+ * transaction to complete; otherwise returns FALSE.
+ */
+extern boolean_t txg_stalled(struct dsl_pool *dp);
+
+/* returns TRUE if someone is waiting for the next txg to sync */
+extern boolean_t txg_sync_waiting(struct dsl_pool *dp);
+
+/*
+ * Per-txg object lists.
+ */
+
+#define	TXG_CLEAN(txg)	((txg) - 1)
+
+extern void txg_list_create(txg_list_t *tl, size_t offset);
+extern void txg_list_destroy(txg_list_t *tl);
+extern int txg_list_empty(txg_list_t *tl, uint64_t txg);
+extern int txg_list_add(txg_list_t *tl, void *p, uint64_t txg);
+extern int txg_list_add_tail(txg_list_t *tl, void *p, uint64_t txg);
+extern void *txg_list_remove(txg_list_t *tl, uint64_t txg);
+extern void *txg_list_remove_this(txg_list_t *tl, void *p, uint64_t txg);
+extern int txg_list_member(txg_list_t *tl, void *p, uint64_t txg);
+extern void *txg_list_head(txg_list_t *tl, uint64_t txg);
+extern void *txg_list_next(txg_list_t *tl, void *p, uint64_t txg);
+
+#ifdef	__cplusplus
+}
+#endif
+
+#endif	/* _SYS_TXG_H */
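Reviewer note on the TXG_* constants in txg.h: TXG_SIZE is the next power of two above TXG_CONCURRENT_STATES, so `txg & TXG_MASK` works as a ring index and the open, quiescing, and syncing txgs always land in different slots. A minimal standalone sketch of that indexing follows; the `ex_`/`EX_` names are hypothetical, and treating the three live txgs as consecutive numbers is an assumption made for the illustration.

```c
#include <assert.h>
#include <stdint.h>

#define	EX_TXG_SIZE	4			/* next power of 2 above 3 */
#define	EX_TXG_MASK	(EX_TXG_SIZE - 1)
#define	EX_TXG_CLEAN(txg)	((txg) - 1)

/* Slot used for per-txg state, same arithmetic as txg & TXG_MASK. */
static unsigned
ex_txg_slot(uint64_t txg)
{
	return ((unsigned)(txg & EX_TXG_MASK));
}

/*
 * Returns 1 if the syncing txg and the two txgs after it (assumed here
 * to be the quiescing and open txgs) occupy three distinct slots.
 */
static int
ex_txg_slots_distinct(uint64_t syncing)
{
	unsigned s = ex_txg_slot(syncing);
	unsigned q = ex_txg_slot(syncing + 1);
	unsigned o = ex_txg_slot(syncing + 2);

	/* The mask trick only works when the size is a power of two. */
	assert((EX_TXG_SIZE & EX_TXG_MASK) == 0);
	return (s != q && q != o && s != o);
}
```

With four slots and at most three live txgs, the slot of `EX_TXG_CLEAN(txg)` is the one not shared with the txg after `txg` or the one after that, which is consistent with arrays such as `spa_free_bplist[TXG_SIZE]` being indexed per txg.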