Update vendor-sys/opensolaris to the last OpenSolaris state (13149:b23a4dab3d50)

Add ZFS bits to vendor-sys/opensolaris

Obtained from:	https://hg.openindiana.org/upstream/oracle/onnv-gate
mm 2012-07-18 08:12:04 +00:00
parent e27f30edd4
commit 39f5422299
236 changed files with 178767 additions and 110 deletions

384
OPENSOLARIS.LICENSE Normal file

@@ -0,0 +1,384 @@
Unless otherwise noted, all files in this distribution are released
under the Common Development and Distribution License (CDDL).
Exceptions are noted within the associated source files.
--------------------------------------------------------------------
COMMON DEVELOPMENT AND DISTRIBUTION LICENSE Version 1.0
1. Definitions.
1.1. "Contributor" means each individual or entity that creates
or contributes to the creation of Modifications.
1.2. "Contributor Version" means the combination of the Original
Software, prior Modifications used by a Contributor (if any),
and the Modifications made by that particular Contributor.
1.3. "Covered Software" means (a) the Original Software, or (b)
Modifications, or (c) the combination of files containing
Original Software with files containing Modifications, in
each case including portions thereof.
1.4. "Executable" means the Covered Software in any form other
than Source Code.
1.5. "Initial Developer" means the individual or entity that first
makes Original Software available under this License.
1.6. "Larger Work" means a work which combines Covered Software or
portions thereof with code not governed by the terms of this
License.
1.7. "License" means this document.
1.8. "Licensable" means having the right to grant, to the maximum
extent possible, whether at the time of the initial grant or
subsequently acquired, any and all of the rights conveyed
herein.
1.9. "Modifications" means the Source Code and Executable form of
any of the following:
A. Any file that results from an addition to, deletion from or
modification of the contents of a file containing Original
Software or previous Modifications;
B. Any new file that contains any part of the Original
Software or previous Modifications; or
C. Any new file that is contributed or otherwise made
available under the terms of this License.
1.10. "Original Software" means the Source Code and Executable
form of computer software code that is originally released
under this License.
1.11. "Patent Claims" means any patent claim(s), now owned or
hereafter acquired, including without limitation, method,
process, and apparatus claims, in any patent Licensable by
grantor.
1.12. "Source Code" means (a) the common form of computer software
code in which modifications are made and (b) associated
documentation included in or with such code.
1.13. "You" (or "Your") means an individual or a legal entity
exercising rights under, and complying with all of the terms
of, this License. For legal entities, "You" includes any
entity which controls, is controlled by, or is under common
control with You. For purposes of this definition,
"control" means (a) the power, direct or indirect, to cause
the direction or management of such entity, whether by
contract or otherwise, or (b) ownership of more than fifty
percent (50%) of the outstanding shares or beneficial
ownership of such entity.
2. License Grants.
2.1. The Initial Developer Grant.
Conditioned upon Your compliance with Section 3.1 below and
subject to third party intellectual property claims, the Initial
Developer hereby grants You a world-wide, royalty-free,
non-exclusive license:
(a) under intellectual property rights (other than patent or
trademark) Licensable by Initial Developer, to use,
reproduce, modify, display, perform, sublicense and
distribute the Original Software (or portions thereof),
with or without Modifications, and/or as part of a Larger
Work; and
(b) under Patent Claims infringed by the making, using or
selling of Original Software, to make, have made, use,
practice, sell, and offer for sale, and/or otherwise
dispose of the Original Software (or portions thereof).
(c) The licenses granted in Sections 2.1(a) and (b) are
effective on the date Initial Developer first distributes
or otherwise makes the Original Software available to a
third party under the terms of this License.
(d) Notwithstanding Section 2.1(b) above, no patent license is
granted: (1) for code that You delete from the Original
Software, or (2) for infringements caused by: (i) the
modification of the Original Software, or (ii) the
combination of the Original Software with other software
or devices.
2.2. Contributor Grant.
Conditioned upon Your compliance with Section 3.1 below and
subject to third party intellectual property claims, each
Contributor hereby grants You a world-wide, royalty-free,
non-exclusive license:
(a) under intellectual property rights (other than patent or
trademark) Licensable by Contributor to use, reproduce,
modify, display, perform, sublicense and distribute the
Modifications created by such Contributor (or portions
thereof), either on an unmodified basis, with other
Modifications, as Covered Software and/or as part of a
Larger Work; and
(b) under Patent Claims infringed by the making, using, or
selling of Modifications made by that Contributor either
alone and/or in combination with its Contributor Version
(or portions of such combination), to make, use, sell,
offer for sale, have made, and/or otherwise dispose of:
(1) Modifications made by that Contributor (or portions
thereof); and (2) the combination of Modifications made by
that Contributor with its Contributor Version (or portions
of such combination).
(c) The licenses granted in Sections 2.2(a) and 2.2(b) are
effective on the date Contributor first distributes or
otherwise makes the Modifications available to a third
party.
(d) Notwithstanding Section 2.2(b) above, no patent license is
granted: (1) for any code that Contributor has deleted
from the Contributor Version; (2) for infringements caused
by: (i) third party modifications of Contributor Version,
or (ii) the combination of Modifications made by that
Contributor with other software (except as part of the
Contributor Version) or other devices; or (3) under Patent
Claims infringed by Covered Software in the absence of
Modifications made by that Contributor.
3. Distribution Obligations.
3.1. Availability of Source Code.
Any Covered Software that You distribute or otherwise make
available in Executable form must also be made available in Source
Code form and that Source Code form must be distributed only under
the terms of this License. You must include a copy of this
License with every copy of the Source Code form of the Covered
Software You distribute or otherwise make available. You must
inform recipients of any such Covered Software in Executable form
as to how they can obtain such Covered Software in Source Code
form in a reasonable manner on or through a medium customarily
used for software exchange.
3.2. Modifications.
The Modifications that You create or to which You contribute are
governed by the terms of this License. You represent that You
believe Your Modifications are Your original creation(s) and/or
You have sufficient rights to grant the rights conveyed by this
License.
3.3. Required Notices.
You must include a notice in each of Your Modifications that
identifies You as the Contributor of the Modification. You may
not remove or alter any copyright, patent or trademark notices
contained within the Covered Software, or any notices of licensing
or any descriptive text giving attribution to any Contributor or
the Initial Developer.
3.4. Application of Additional Terms.
You may not offer or impose any terms on any Covered Software in
Source Code form that alters or restricts the applicable version
of this License or the recipients' rights hereunder. You may
choose to offer, and to charge a fee for, warranty, support,
indemnity or liability obligations to one or more recipients of
Covered Software. However, you may do so only on Your own behalf,
and not on behalf of the Initial Developer or any Contributor.
You must make it absolutely clear that any such warranty, support,
indemnity or liability obligation is offered by You alone, and You
hereby agree to indemnify the Initial Developer and every
Contributor for any liability incurred by the Initial Developer or
such Contributor as a result of warranty, support, indemnity or
liability terms You offer.
3.5. Distribution of Executable Versions.
You may distribute the Executable form of the Covered Software
under the terms of this License or under the terms of a license of
Your choice, which may contain terms different from this License,
provided that You are in compliance with the terms of this License
and that the license for the Executable form does not attempt to
limit or alter the recipient's rights in the Source Code form from
the rights set forth in this License. If You distribute the
Covered Software in Executable form under a different license, You
must make it absolutely clear that any terms which differ from
this License are offered by You alone, not by the Initial
Developer or Contributor. You hereby agree to indemnify the
Initial Developer and every Contributor for any liability incurred
by the Initial Developer or such Contributor as a result of any
such terms You offer.
3.6. Larger Works.
You may create a Larger Work by combining Covered Software with
other code not governed by the terms of this License and
distribute the Larger Work as a single product. In such a case,
You must make sure the requirements of this License are fulfilled
for the Covered Software.
4. Versions of the License.
4.1. New Versions.
Sun Microsystems, Inc. is the initial license steward and may
publish revised and/or new versions of this License from time to
time. Each version will be given a distinguishing version number.
Except as provided in Section 4.3, no one other than the license
steward has the right to modify this License.
4.2. Effect of New Versions.
You may always continue to use, distribute or otherwise make the
Covered Software available under the terms of the version of the
License under which You originally received the Covered Software.
If the Initial Developer includes a notice in the Original
Software prohibiting it from being distributed or otherwise made
available under any subsequent version of the License, You must
distribute and make the Covered Software available under the terms
of the version of the License under which You originally received
the Covered Software. Otherwise, You may also choose to use,
distribute or otherwise make the Covered Software available under
the terms of any subsequent version of the License published by
the license steward.
4.3. Modified Versions.
When You are an Initial Developer and You want to create a new
license for Your Original Software, You may create and use a
modified version of this License if You: (a) rename the license
and remove any references to the name of the license steward
(except to note that the license differs from this License); and
(b) otherwise make it clear that the license contains terms which
differ from this License.
5. DISCLAIMER OF WARRANTY.
COVERED SOFTWARE IS PROVIDED UNDER THIS LICENSE ON AN "AS IS"
BASIS, WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED,
INCLUDING, WITHOUT LIMITATION, WARRANTIES THAT THE COVERED
SOFTWARE IS FREE OF DEFECTS, MERCHANTABLE, FIT FOR A PARTICULAR
PURPOSE OR NON-INFRINGING. THE ENTIRE RISK AS TO THE QUALITY AND
PERFORMANCE OF THE COVERED SOFTWARE IS WITH YOU. SHOULD ANY
COVERED SOFTWARE PROVE DEFECTIVE IN ANY RESPECT, YOU (NOT THE
INITIAL DEVELOPER OR ANY OTHER CONTRIBUTOR) ASSUME THE COST OF ANY
NECESSARY SERVICING, REPAIR OR CORRECTION. THIS DISCLAIMER OF
WARRANTY CONSTITUTES AN ESSENTIAL PART OF THIS LICENSE. NO USE OF
ANY COVERED SOFTWARE IS AUTHORIZED HEREUNDER EXCEPT UNDER THIS
DISCLAIMER.
6. TERMINATION.
6.1. This License and the rights granted hereunder will terminate
automatically if You fail to comply with terms herein and fail to
cure such breach within 30 days of becoming aware of the breach.
Provisions which, by their nature, must remain in effect beyond
the termination of this License shall survive.
6.2. If You assert a patent infringement claim (excluding
declaratory judgment actions) against Initial Developer or a
Contributor (the Initial Developer or Contributor against whom You
assert such claim is referred to as "Participant") alleging that
the Participant Software (meaning the Contributor Version where
the Participant is a Contributor or the Original Software where
the Participant is the Initial Developer) directly or indirectly
infringes any patent, then any and all rights granted directly or
indirectly to You by such Participant, the Initial Developer (if
the Initial Developer is not the Participant) and all Contributors
under Sections 2.1 and/or 2.2 of this License shall, upon 60 days
notice from Participant terminate prospectively and automatically
at the expiration of such 60 day notice period, unless if within
such 60 day period You withdraw Your claim with respect to the
Participant Software against such Participant either unilaterally
or pursuant to a written agreement with Participant.
6.3. In the event of termination under Sections 6.1 or 6.2 above,
all end user licenses that have been validly granted by You or any
distributor hereunder prior to termination (excluding licenses
granted to You by any distributor) shall survive termination.
7. LIMITATION OF LIABILITY.
UNDER NO CIRCUMSTANCES AND UNDER NO LEGAL THEORY, WHETHER TORT
(INCLUDING NEGLIGENCE), CONTRACT, OR OTHERWISE, SHALL YOU, THE
INITIAL DEVELOPER, ANY OTHER CONTRIBUTOR, OR ANY DISTRIBUTOR OF
COVERED SOFTWARE, OR ANY SUPPLIER OF ANY OF SUCH PARTIES, BE
LIABLE TO ANY PERSON FOR ANY INDIRECT, SPECIAL, INCIDENTAL, OR
CONSEQUENTIAL DAMAGES OF ANY CHARACTER INCLUDING, WITHOUT
LIMITATION, DAMAGES FOR LOST PROFITS, LOSS OF GOODWILL, WORK
STOPPAGE, COMPUTER FAILURE OR MALFUNCTION, OR ANY AND ALL OTHER
COMMERCIAL DAMAGES OR LOSSES, EVEN IF SUCH PARTY SHALL HAVE BEEN
INFORMED OF THE POSSIBILITY OF SUCH DAMAGES. THIS LIMITATION OF
LIABILITY SHALL NOT APPLY TO LIABILITY FOR DEATH OR PERSONAL
INJURY RESULTING FROM SUCH PARTY'S NEGLIGENCE TO THE EXTENT
APPLICABLE LAW PROHIBITS SUCH LIMITATION. SOME JURISDICTIONS DO
NOT ALLOW THE EXCLUSION OR LIMITATION OF INCIDENTAL OR
CONSEQUENTIAL DAMAGES, SO THIS EXCLUSION AND LIMITATION MAY NOT
APPLY TO YOU.
8. U.S. GOVERNMENT END USERS.
The Covered Software is a "commercial item," as that term is
defined in 48 C.F.R. 2.101 (Oct. 1995), consisting of "commercial
computer software" (as that term is defined at 48
C.F.R. 252.227-7014(a)(1)) and "commercial computer software
documentation" as such terms are used in 48 C.F.R. 12.212
(Sept. 1995). Consistent with 48 C.F.R. 12.212 and 48
C.F.R. 227.7202-1 through 227.7202-4 (June 1995), all
U.S. Government End Users acquire Covered Software with only those
rights set forth herein. This U.S. Government Rights clause is in
lieu of, and supersedes, any other FAR, DFAR, or other clause or
provision that addresses Government rights in computer software
under this License.
9. MISCELLANEOUS.
This License represents the complete agreement concerning subject
matter hereof. If any provision of this License is held to be
unenforceable, such provision shall be reformed only to the extent
necessary to make it enforceable. This License shall be governed
by the law of the jurisdiction specified in a notice contained
within the Original Software (except to the extent applicable law,
if any, provides otherwise), excluding such jurisdiction's
conflict-of-law provisions. Any litigation relating to this
License shall be subject to the jurisdiction of the courts located
in the jurisdiction and venue specified in a notice contained
within the Original Software, with the losing party responsible
for costs, including, without limitation, court costs and
reasonable attorneys' fees and expenses. The application of the
United Nations Convention on Contracts for the International Sale
of Goods is expressly excluded. Any law or regulation which
provides that the language of a contract shall be construed
against the drafter shall not apply to this License. You agree
that You alone are responsible for compliance with the United
States export administration regulations (and the export control
laws and regulation of any other countries) when You use,
distribute or otherwise make available any Covered Software.
10. RESPONSIBILITY FOR CLAIMS.
As between Initial Developer and the Contributors, each party is
responsible for claims and damages arising, directly or
indirectly, out of its utilization of rights under this License
and You agree to work with Initial Developer and Contributors to
distribute such responsibility on an equitable basis. Nothing
herein is intended or shall be deemed to constitute any admission
of liability.
--------------------------------------------------------------------
NOTICE PURSUANT TO SECTION 9 OF THE COMMON DEVELOPMENT AND
DISTRIBUTION LICENSE (CDDL)
For Covered Software in this distribution, this License shall
be governed by the laws of the State of California (excluding
conflict-of-law provisions).
Any litigation relating to this License shall be subject to the
jurisdiction of the Federal Courts of the Northern District of
California and the state courts of the State of California, with
venue lying in Santa Clara County, California.

1755
common/acl/acl_common.c Normal file

File diff suppressed because it is too large

59
common/acl/acl_common.h Normal file

@@ -0,0 +1,59 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#ifndef _ACL_COMMON_H
#define _ACL_COMMON_H
#include <sys/types.h>
#include <sys/acl.h>
#include <sys/stat.h>
#ifdef __cplusplus
extern "C" {
#endif
extern ace_t trivial_acl[6];
extern int acltrivial(const char *);
extern void adjust_ace_pair(ace_t *pair, mode_t mode);
extern void adjust_ace_pair_common(void *, size_t, size_t, mode_t);
extern int ace_trivial(ace_t *acep, int aclcnt);
extern int ace_trivial_common(void *, int,
uint64_t (*walk)(void *, uint64_t, int aclcnt, uint16_t *, uint16_t *,
uint32_t *mask));
extern acl_t *acl_alloc(acl_type_t);
extern void acl_free(acl_t *aclp);
extern int acl_translate(acl_t *aclp, int target_flavor,
int isdir, uid_t owner, gid_t group);
void ksort(caddr_t v, int n, int s, int (*f)());
int cmp2acls(void *a, void *b);
int acl_trivial_create(mode_t mode, ace_t **acl, int *count);
void acl_trivial_access_masks(mode_t mode, uint32_t *allow0, uint32_t *deny1,
uint32_t *deny2, uint32_t *owner, uint32_t *group, uint32_t *everyone);
#ifdef __cplusplus
}
#endif
#endif /* _ACL_COMMON_H */
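For orientation, a minimal C sketch (not part of this import) of how the trivial-ACL helpers above might be driven: it builds the six-entry trivial ACL for mode 0644 and dumps each entry. The ace_t field names (a_access_mask, a_flags, a_type) are assumed from the Solaris <sys/acl.h>, and the entries are assumed to come from malloc().

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/acl.h>
#include "acl_common.h"

int
main(void)
{
	ace_t *acl;
	int count, i;

	/* Derive the canonical allow/deny entry pairs from the mode. */
	if (acl_trivial_create(0644, &acl, &count) != 0)
		return (1);
	for (i = 0; i < count; i++)
		printf("entry %d: mask=0x%x flags=0x%x type=%u\n", i,
		    (unsigned)acl[i].a_access_mask,
		    (unsigned)acl[i].a_flags,
		    (unsigned)acl[i].a_type);
	free(acl);		/* assumption: allocated with malloc() */
	return (0);
}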

573
common/atomic/amd64/atomic.s Normal file

@@ -0,0 +1,573 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2004, 2010, Oracle and/or its affiliates. All rights reserved.
*/
.file "atomic.s"
#include <sys/asm_linkage.h>
#if defined(_KERNEL)
/*
* Legacy kernel interfaces; they will go away (eventually).
*/
ANSI_PRAGMA_WEAK2(cas8,atomic_cas_8,function)
ANSI_PRAGMA_WEAK2(cas32,atomic_cas_32,function)
ANSI_PRAGMA_WEAK2(cas64,atomic_cas_64,function)
ANSI_PRAGMA_WEAK2(caslong,atomic_cas_ulong,function)
ANSI_PRAGMA_WEAK2(casptr,atomic_cas_ptr,function)
ANSI_PRAGMA_WEAK2(atomic_and_long,atomic_and_ulong,function)
ANSI_PRAGMA_WEAK2(atomic_or_long,atomic_or_ulong,function)
#endif
ENTRY(atomic_inc_8)
ALTENTRY(atomic_inc_uchar)
lock
incb (%rdi)
ret
SET_SIZE(atomic_inc_uchar)
SET_SIZE(atomic_inc_8)
ENTRY(atomic_inc_16)
ALTENTRY(atomic_inc_ushort)
lock
incw (%rdi)
ret
SET_SIZE(atomic_inc_ushort)
SET_SIZE(atomic_inc_16)
ENTRY(atomic_inc_32)
ALTENTRY(atomic_inc_uint)
lock
incl (%rdi)
ret
SET_SIZE(atomic_inc_uint)
SET_SIZE(atomic_inc_32)
ENTRY(atomic_inc_64)
ALTENTRY(atomic_inc_ulong)
lock
incq (%rdi)
ret
SET_SIZE(atomic_inc_ulong)
SET_SIZE(atomic_inc_64)
ENTRY(atomic_inc_8_nv)
ALTENTRY(atomic_inc_uchar_nv)
xorl %eax, %eax / clear upper bits of %eax return register
incb %al / %al = 1
lock
xaddb %al, (%rdi) / %al = old value, (%rdi) = new value
incb %al / return new value
ret
SET_SIZE(atomic_inc_uchar_nv)
SET_SIZE(atomic_inc_8_nv)
ENTRY(atomic_inc_16_nv)
ALTENTRY(atomic_inc_ushort_nv)
xorl %eax, %eax / clear upper bits of %eax return register
incw %ax / %ax = 1
lock
xaddw %ax, (%rdi) / %ax = old value, (%rdi) = new value
incw %ax / return new value
ret
SET_SIZE(atomic_inc_ushort_nv)
SET_SIZE(atomic_inc_16_nv)
ENTRY(atomic_inc_32_nv)
ALTENTRY(atomic_inc_uint_nv)
xorl %eax, %eax / %eax = 0
incl %eax / %eax = 1
lock
xaddl %eax, (%rdi) / %eax = old value, (%rdi) = new value
incl %eax / return new value
ret
SET_SIZE(atomic_inc_uint_nv)
SET_SIZE(atomic_inc_32_nv)
ENTRY(atomic_inc_64_nv)
ALTENTRY(atomic_inc_ulong_nv)
xorq %rax, %rax / %rax = 0
incq %rax / %rax = 1
lock
xaddq %rax, (%rdi) / %rax = old value, (%rdi) = new value
incq %rax / return new value
ret
SET_SIZE(atomic_inc_ulong_nv)
SET_SIZE(atomic_inc_64_nv)
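The *_nv ("new value") increments above all follow one pattern: xadd atomically adds and hands back the old value in the source register, so the routine applies the increment once more in a plain register to produce the return value. A minimal C sketch of the same pattern (not part of this file), assuming GCC's __atomic builtins, which compile to lock xadd on x86:

#include <stdint.h>

static inline uint32_t
atomic_inc_32_nv_c(volatile uint32_t *p)
{
	/* fetch_add returns the old value; re-apply the delta for "new" */
	return (__atomic_fetch_add(p, 1, __ATOMIC_SEQ_CST) + 1);
}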
ENTRY(atomic_dec_8)
ALTENTRY(atomic_dec_uchar)
lock
decb (%rdi)
ret
SET_SIZE(atomic_dec_uchar)
SET_SIZE(atomic_dec_8)
ENTRY(atomic_dec_16)
ALTENTRY(atomic_dec_ushort)
lock
decw (%rdi)
ret
SET_SIZE(atomic_dec_ushort)
SET_SIZE(atomic_dec_16)
ENTRY(atomic_dec_32)
ALTENTRY(atomic_dec_uint)
lock
decl (%rdi)
ret
SET_SIZE(atomic_dec_uint)
SET_SIZE(atomic_dec_32)
ENTRY(atomic_dec_64)
ALTENTRY(atomic_dec_ulong)
lock
decq (%rdi)
ret
SET_SIZE(atomic_dec_ulong)
SET_SIZE(atomic_dec_64)
ENTRY(atomic_dec_8_nv)
ALTENTRY(atomic_dec_uchar_nv)
xorl %eax, %eax / clear upper bits of %eax return register
decb %al / %al = -1
lock
xaddb %al, (%rdi) / %al = old value, (%rdi) = new value
decb %al / return new value
ret
SET_SIZE(atomic_dec_uchar_nv)
SET_SIZE(atomic_dec_8_nv)
ENTRY(atomic_dec_16_nv)
ALTENTRY(atomic_dec_ushort_nv)
xorl %eax, %eax / clear upper bits of %eax return register
decw %ax / %ax = -1
lock
xaddw %ax, (%rdi) / %ax = old value, (%rdi) = new value
decw %ax / return new value
ret
SET_SIZE(atomic_dec_ushort_nv)
SET_SIZE(atomic_dec_16_nv)
ENTRY(atomic_dec_32_nv)
ALTENTRY(atomic_dec_uint_nv)
xorl %eax, %eax / %eax = 0
decl %eax / %eax = -1
lock
xaddl %eax, (%rdi) / %eax = old value, (%rdi) = new value
decl %eax / return new value
ret
SET_SIZE(atomic_dec_uint_nv)
SET_SIZE(atomic_dec_32_nv)
ENTRY(atomic_dec_64_nv)
ALTENTRY(atomic_dec_ulong_nv)
xorq %rax, %rax / %rax = 0
decq %rax / %rax = -1
lock
xaddq %rax, (%rdi) / %rax = old value, (%rdi) = new value
decq %rax / return new value
ret
SET_SIZE(atomic_dec_ulong_nv)
SET_SIZE(atomic_dec_64_nv)
ENTRY(atomic_add_8)
ALTENTRY(atomic_add_char)
lock
addb %sil, (%rdi)
ret
SET_SIZE(atomic_add_char)
SET_SIZE(atomic_add_8)
ENTRY(atomic_add_16)
ALTENTRY(atomic_add_short)
lock
addw %si, (%rdi)
ret
SET_SIZE(atomic_add_short)
SET_SIZE(atomic_add_16)
ENTRY(atomic_add_32)
ALTENTRY(atomic_add_int)
lock
addl %esi, (%rdi)
ret
SET_SIZE(atomic_add_int)
SET_SIZE(atomic_add_32)
ENTRY(atomic_add_64)
ALTENTRY(atomic_add_ptr)
ALTENTRY(atomic_add_long)
lock
addq %rsi, (%rdi)
ret
SET_SIZE(atomic_add_long)
SET_SIZE(atomic_add_ptr)
SET_SIZE(atomic_add_64)
ENTRY(atomic_or_8)
ALTENTRY(atomic_or_uchar)
lock
orb %sil, (%rdi)
ret
SET_SIZE(atomic_or_uchar)
SET_SIZE(atomic_or_8)
ENTRY(atomic_or_16)
ALTENTRY(atomic_or_ushort)
lock
orw %si, (%rdi)
ret
SET_SIZE(atomic_or_ushort)
SET_SIZE(atomic_or_16)
ENTRY(atomic_or_32)
ALTENTRY(atomic_or_uint)
lock
orl %esi, (%rdi)
ret
SET_SIZE(atomic_or_uint)
SET_SIZE(atomic_or_32)
ENTRY(atomic_or_64)
ALTENTRY(atomic_or_ulong)
lock
orq %rsi, (%rdi)
ret
SET_SIZE(atomic_or_ulong)
SET_SIZE(atomic_or_64)
ENTRY(atomic_and_8)
ALTENTRY(atomic_and_uchar)
lock
andb %sil, (%rdi)
ret
SET_SIZE(atomic_and_uchar)
SET_SIZE(atomic_and_8)
ENTRY(atomic_and_16)
ALTENTRY(atomic_and_ushort)
lock
andw %si, (%rdi)
ret
SET_SIZE(atomic_and_ushort)
SET_SIZE(atomic_and_16)
ENTRY(atomic_and_32)
ALTENTRY(atomic_and_uint)
lock
andl %esi, (%rdi)
ret
SET_SIZE(atomic_and_uint)
SET_SIZE(atomic_and_32)
ENTRY(atomic_and_64)
ALTENTRY(atomic_and_ulong)
lock
andq %rsi, (%rdi)
ret
SET_SIZE(atomic_and_ulong)
SET_SIZE(atomic_and_64)
ENTRY(atomic_add_8_nv)
ALTENTRY(atomic_add_char_nv)
movzbl %sil, %eax / %al = delta addend, clear upper bits
lock
xaddb %sil, (%rdi) / %sil = old value, (%rdi) = sum
addb %sil, %al / new value = original value + delta
ret
SET_SIZE(atomic_add_char_nv)
SET_SIZE(atomic_add_8_nv)
ENTRY(atomic_add_16_nv)
ALTENTRY(atomic_add_short_nv)
movzwl %si, %eax / %ax = delta addend, clear upper bits
lock
xaddw %si, (%rdi) / %si = old value, (%rdi) = sum
addw %si, %ax / new value = original value + delta
ret
SET_SIZE(atomic_add_short_nv)
SET_SIZE(atomic_add_16_nv)
ENTRY(atomic_add_32_nv)
ALTENTRY(atomic_add_int_nv)
mov %esi, %eax / %eax = delta addend
lock
xaddl %esi, (%rdi) / %esi = old value, (%rdi) = sum
add %esi, %eax / new value = original value + delta
ret
SET_SIZE(atomic_add_int_nv)
SET_SIZE(atomic_add_32_nv)
ENTRY(atomic_add_64_nv)
ALTENTRY(atomic_add_ptr_nv)
ALTENTRY(atomic_add_long_nv)
mov %rsi, %rax / %rax = delta addend
lock
xaddq %rsi, (%rdi) / %rsi = old value, (%rdi) = sum
addq %rsi, %rax / new value = original value + delta
ret
SET_SIZE(atomic_add_long_nv)
SET_SIZE(atomic_add_ptr_nv)
SET_SIZE(atomic_add_64_nv)
ENTRY(atomic_and_8_nv)
ALTENTRY(atomic_and_uchar_nv)
movb (%rdi), %al / %al = old value
1:
movb %sil, %cl
andb %al, %cl / %cl = new value
lock
cmpxchgb %cl, (%rdi) / try to stick it in
jne 1b
movzbl %cl, %eax / return new value
ret
SET_SIZE(atomic_and_uchar_nv)
SET_SIZE(atomic_and_8_nv)
ENTRY(atomic_and_16_nv)
ALTENTRY(atomic_and_ushort_nv)
movw (%rdi), %ax / %ax = old value
1:
movw %si, %cx
andw %ax, %cx / %cx = new value
lock
cmpxchgw %cx, (%rdi) / try to stick it in
jne 1b
movzwl %cx, %eax / return new value
ret
SET_SIZE(atomic_and_ushort_nv)
SET_SIZE(atomic_and_16_nv)
ENTRY(atomic_and_32_nv)
ALTENTRY(atomic_and_uint_nv)
movl (%rdi), %eax
1:
movl %esi, %ecx
andl %eax, %ecx
lock
cmpxchgl %ecx, (%rdi)
jne 1b
movl %ecx, %eax
ret
SET_SIZE(atomic_and_uint_nv)
SET_SIZE(atomic_and_32_nv)
ENTRY(atomic_and_64_nv)
ALTENTRY(atomic_and_ulong_nv)
movq (%rdi), %rax
1:
movq %rsi, %rcx
andq %rax, %rcx
lock
cmpxchgq %rcx, (%rdi)
jne 1b
movq %rcx, %rax
ret
SET_SIZE(atomic_and_ulong_nv)
SET_SIZE(atomic_and_64_nv)
ENTRY(atomic_or_8_nv)
ALTENTRY(atomic_or_uchar_nv)
movb (%rdi), %al / %al = old value
1:
movb %sil, %cl
orb %al, %cl / %cl = new value
lock
cmpxchgb %cl, (%rdi) / try to stick it in
jne 1b
movzbl %cl, %eax / return new value
ret
SET_SIZE(atomic_or_uchar_nv)
SET_SIZE(atomic_or_8_nv)
ENTRY(atomic_or_16_nv)
ALTENTRY(atomic_or_ushort_nv)
movw (%rdi), %ax / %ax = old value
1:
movw %si, %cx
orw %ax, %cx / %cx = new value
lock
cmpxchgw %cx, (%rdi) / try to stick it in
jne 1b
movzwl %cx, %eax / return new value
ret
SET_SIZE(atomic_or_ushort_nv)
SET_SIZE(atomic_or_16_nv)
ENTRY(atomic_or_32_nv)
ALTENTRY(atomic_or_uint_nv)
movl (%rdi), %eax
1:
movl %esi, %ecx
orl %eax, %ecx
lock
cmpxchgl %ecx, (%rdi)
jne 1b
movl %ecx, %eax
ret
SET_SIZE(atomic_or_uint_nv)
SET_SIZE(atomic_or_32_nv)
ENTRY(atomic_or_64_nv)
ALTENTRY(atomic_or_ulong_nv)
movq (%rdi), %rax
1:
movq %rsi, %rcx
orq %rax, %rcx
lock
cmpxchgq %rcx, (%rdi)
jne 1b
movq %rcx, %rax
ret
SET_SIZE(atomic_or_ulong_nv)
SET_SIZE(atomic_or_64_nv)
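Unlike add, x86 has no fetch-and-AND/OR instruction that returns a value, so the bitwise *_nv routines above fall back to a cmpxchg retry loop: compute the candidate result, attempt the swap, and retry on failure. A C sketch of the loop (not part of this file), assuming GCC's __atomic builtins; note that __atomic_compare_exchange_n refreshes the expected value on failure, just as cmpxchg leaves the observed value in %eax:

#include <stdint.h>

static inline uint32_t
atomic_or_32_nv_c(volatile uint32_t *p, uint32_t bits)
{
	uint32_t oldv = *p, newv;

	do {
		newv = oldv | bits;	/* candidate result */
	} while (!__atomic_compare_exchange_n(p, &oldv, newv, 0,
	    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST));
	return (newv);
}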
ENTRY(atomic_cas_8)
ALTENTRY(atomic_cas_uchar)
movzbl %sil, %eax
lock
cmpxchgb %dl, (%rdi)
ret
SET_SIZE(atomic_cas_uchar)
SET_SIZE(atomic_cas_8)
ENTRY(atomic_cas_16)
ALTENTRY(atomic_cas_ushort)
movzwl %si, %eax
lock
cmpxchgw %dx, (%rdi)
ret
SET_SIZE(atomic_cas_ushort)
SET_SIZE(atomic_cas_16)
ENTRY(atomic_cas_32)
ALTENTRY(atomic_cas_uint)
movl %esi, %eax
lock
cmpxchgl %edx, (%rdi)
ret
SET_SIZE(atomic_cas_uint)
SET_SIZE(atomic_cas_32)
ENTRY(atomic_cas_64)
ALTENTRY(atomic_cas_ulong)
ALTENTRY(atomic_cas_ptr)
movq %rsi, %rax
lock
cmpxchgq %rdx, (%rdi)
ret
SET_SIZE(atomic_cas_ptr)
SET_SIZE(atomic_cas_ulong)
SET_SIZE(atomic_cas_64)
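cmpxchg leaves whatever value was found at the target in %rax, so atomic_cas_* returns the old value rather than a success flag; callers detect success by comparing the return against the value they expected. A hypothetical caller (not part of this file) using the atomic_ops(3C) interface from <atomic.h>, here a lock-free stack push:

#include <atomic.h>

typedef struct node {
	struct node *next;
} node_t;

static void
push(node_t *volatile *headp, node_t *n)
{
	node_t *oldh;

	do {
		oldh = *headp;
		n->next = oldh;
		/* the swap landed iff the value found was the one expected */
	} while (atomic_cas_ptr((void *)headp, oldh, n) != oldh);
}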
ENTRY(atomic_swap_8)
ALTENTRY(atomic_swap_uchar)
movzbl %sil, %eax
lock
xchgb %al, (%rdi)
ret
SET_SIZE(atomic_swap_uchar)
SET_SIZE(atomic_swap_8)
ENTRY(atomic_swap_16)
ALTENTRY(atomic_swap_ushort)
movzwl %si, %eax
lock
xchgw %ax, (%rdi)
ret
SET_SIZE(atomic_swap_ushort)
SET_SIZE(atomic_swap_16)
ENTRY(atomic_swap_32)
ALTENTRY(atomic_swap_uint)
movl %esi, %eax
lock
xchgl %eax, (%rdi)
ret
SET_SIZE(atomic_swap_uint)
SET_SIZE(atomic_swap_32)
ENTRY(atomic_swap_64)
ALTENTRY(atomic_swap_ulong)
ALTENTRY(atomic_swap_ptr)
movq %rsi, %rax
lock
xchgq %rax, (%rdi)
ret
SET_SIZE(atomic_swap_ptr)
SET_SIZE(atomic_swap_ulong)
SET_SIZE(atomic_swap_64)
ENTRY(atomic_set_long_excl)
xorl %eax, %eax
lock
btsq %rsi, (%rdi)
jnc 1f
decl %eax / return -1
1:
ret
SET_SIZE(atomic_set_long_excl)
ENTRY(atomic_clear_long_excl)
xorl %eax, %eax
lock
btrq %rsi, (%rdi)
jc 1f
decl %eax / return -1
1:
ret
SET_SIZE(atomic_clear_long_excl)
#if !defined(_KERNEL)
/*
* NOTE: membar_enter and membar_exit are identical routines.
* We define them separately, instead of using ALTENTRY
* definitions to alias them together, so that DTrace and
* debuggers will see a unique address for them, allowing
* more accurate tracing.
*/
ENTRY(membar_enter)
mfence
ret
SET_SIZE(membar_enter)
ENTRY(membar_exit)
mfence
ret
SET_SIZE(membar_exit)
ENTRY(membar_producer)
sfence
ret
SET_SIZE(membar_producer)
ENTRY(membar_consumer)
lfence
ret
SET_SIZE(membar_consumer)
#endif /* !_KERNEL */
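On amd64 the barriers specialize: membar_enter/membar_exit are full fences (mfence), while membar_producer orders only stores (sfence) and membar_consumer only loads (lfence). A hedged sketch (not part of this file) of the classic pairing the producer/consumer barriers exist for, using the membar_ops(3C) interfaces from <atomic.h>:

#include <atomic.h>

static int payload;
static volatile int ready;

void
produce(int v)
{
	payload = v;
	membar_producer();	/* publish payload before the flag */
	ready = 1;
}

int
consume(void)
{
	while (ready == 0)
		;
	membar_consumer();	/* order the flag load before reading payload */
	return (payload);
}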

720
common/atomic/i386/atomic.s Normal file

@@ -0,0 +1,720 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2010 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
.file "atomic.s"
#include <sys/asm_linkage.h>
#if defined(_KERNEL)
/*
* Legacy kernel interfaces; they will go away (eventually).
*/
ANSI_PRAGMA_WEAK2(cas8,atomic_cas_8,function)
ANSI_PRAGMA_WEAK2(cas32,atomic_cas_32,function)
ANSI_PRAGMA_WEAK2(cas64,atomic_cas_64,function)
ANSI_PRAGMA_WEAK2(caslong,atomic_cas_ulong,function)
ANSI_PRAGMA_WEAK2(casptr,atomic_cas_ptr,function)
ANSI_PRAGMA_WEAK2(atomic_and_long,atomic_and_ulong,function)
ANSI_PRAGMA_WEAK2(atomic_or_long,atomic_or_ulong,function)
#endif
ENTRY(atomic_inc_8)
ALTENTRY(atomic_inc_uchar)
movl 4(%esp), %eax
lock
incb (%eax)
ret
SET_SIZE(atomic_inc_uchar)
SET_SIZE(atomic_inc_8)
ENTRY(atomic_inc_16)
ALTENTRY(atomic_inc_ushort)
movl 4(%esp), %eax
lock
incw (%eax)
ret
SET_SIZE(atomic_inc_ushort)
SET_SIZE(atomic_inc_16)
ENTRY(atomic_inc_32)
ALTENTRY(atomic_inc_uint)
ALTENTRY(atomic_inc_ulong)
movl 4(%esp), %eax
lock
incl (%eax)
ret
SET_SIZE(atomic_inc_ulong)
SET_SIZE(atomic_inc_uint)
SET_SIZE(atomic_inc_32)
ENTRY(atomic_inc_8_nv)
ALTENTRY(atomic_inc_uchar_nv)
movl 4(%esp), %edx / %edx = target address
xorl %eax, %eax / clear upper bits of %eax
incb %al / %al = 1
lock
xaddb %al, (%edx) / %al = old value, inc (%edx)
incb %al / return new value
ret
SET_SIZE(atomic_inc_uchar_nv)
SET_SIZE(atomic_inc_8_nv)
ENTRY(atomic_inc_16_nv)
ALTENTRY(atomic_inc_ushort_nv)
movl 4(%esp), %edx / %edx = target address
xorl %eax, %eax / clear upper bits of %eax
incw %ax / %ax = 1
lock
xaddw %ax, (%edx) / %ax = old value, inc (%edx)
incw %ax / return new value
ret
SET_SIZE(atomic_inc_ushort_nv)
SET_SIZE(atomic_inc_16_nv)
ENTRY(atomic_inc_32_nv)
ALTENTRY(atomic_inc_uint_nv)
ALTENTRY(atomic_inc_ulong_nv)
movl 4(%esp), %edx / %edx = target address
xorl %eax, %eax / %eax = 0
incl %eax / %eax = 1
lock
xaddl %eax, (%edx) / %eax = old value, inc (%edx)
incl %eax / return new value
ret
SET_SIZE(atomic_inc_ulong_nv)
SET_SIZE(atomic_inc_uint_nv)
SET_SIZE(atomic_inc_32_nv)
/*
* NOTE: If atomic_inc_64 and atomic_inc_64_nv are ever
* separated, you need to also edit the libc i386 platform
* specific mapfile and remove the NODYNSORT attribute
* from atomic_inc_64_nv.
*/
ENTRY(atomic_inc_64)
ALTENTRY(atomic_inc_64_nv)
pushl %edi
pushl %ebx
movl 12(%esp), %edi / %edi = target address
movl (%edi), %eax
movl 4(%edi), %edx / %edx:%eax = old value
1:
xorl %ebx, %ebx
xorl %ecx, %ecx
incl %ebx / %ecx:%ebx = 1
addl %eax, %ebx
adcl %edx, %ecx / add in the carry from inc
lock
cmpxchg8b (%edi) / try to stick it in
jne 1b
movl %ebx, %eax
movl %ecx, %edx / return new value
popl %ebx
popl %edi
ret
SET_SIZE(atomic_inc_64_nv)
SET_SIZE(atomic_inc_64)
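i386 has no 64-bit xadd, so atomic_inc_64 and its _nv twin share one cmpxchg8b retry loop over the %edx:%eax / %ecx:%ebx register pairs. The same loop in C (a sketch, not part of this file), assuming GCC's __atomic builtins, which can inline an 8-byte CAS via cmpxchg8b on i386:

#include <stdint.h>

static inline uint64_t
atomic_inc_64_nv_c(volatile uint64_t *p)
{
	uint64_t oldv = *p;

	/* on failure the builtin reloads oldv, mirroring cmpxchg8b */
	while (!__atomic_compare_exchange_n(p, &oldv, oldv + 1, 0,
	    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
		;
	return (oldv + 1);
}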
ENTRY(atomic_dec_8)
ALTENTRY(atomic_dec_uchar)
movl 4(%esp), %eax
lock
decb (%eax)
ret
SET_SIZE(atomic_dec_uchar)
SET_SIZE(atomic_dec_8)
ENTRY(atomic_dec_16)
ALTENTRY(atomic_dec_ushort)
movl 4(%esp), %eax
lock
decw (%eax)
ret
SET_SIZE(atomic_dec_ushort)
SET_SIZE(atomic_dec_16)
ENTRY(atomic_dec_32)
ALTENTRY(atomic_dec_uint)
ALTENTRY(atomic_dec_ulong)
movl 4(%esp), %eax
lock
decl (%eax)
ret
SET_SIZE(atomic_dec_ulong)
SET_SIZE(atomic_dec_uint)
SET_SIZE(atomic_dec_32)
ENTRY(atomic_dec_8_nv)
ALTENTRY(atomic_dec_uchar_nv)
movl 4(%esp), %edx / %edx = target address
xorl %eax, %eax / zero upper bits of %eax
decb %al / %al = -1
lock
xaddb %al, (%edx) / %al = old value, dec (%edx)
decb %al / return new value
ret
SET_SIZE(atomic_dec_uchar_nv)
SET_SIZE(atomic_dec_8_nv)
ENTRY(atomic_dec_16_nv)
ALTENTRY(atomic_dec_ushort_nv)
movl 4(%esp), %edx / %edx = target address
xorl %eax, %eax / zero upper bits of %eax
decw %ax / %ax = -1
lock
xaddw %ax, (%edx) / %ax = old value, dec (%edx)
decw %ax / return new value
ret
SET_SIZE(atomic_dec_ushort_nv)
SET_SIZE(atomic_dec_16_nv)
ENTRY(atomic_dec_32_nv)
ALTENTRY(atomic_dec_uint_nv)
ALTENTRY(atomic_dec_ulong_nv)
movl 4(%esp), %edx / %edx = target address
xorl %eax, %eax / %eax = 0
decl %eax / %eax = -1
lock
xaddl %eax, (%edx) / %eax = old value, dec (%edx)
decl %eax / return new value
ret
SET_SIZE(atomic_dec_ulong_nv)
SET_SIZE(atomic_dec_uint_nv)
SET_SIZE(atomic_dec_32_nv)
/*
* NOTE: If atomic_dec_64 and atomic_dec_64_nv are ever
* separated, it is important to edit the libc i386 platform
* specific mapfile and remove the NODYNSORT attribute
* from atomic_dec_64_nv.
*/
ENTRY(atomic_dec_64)
ALTENTRY(atomic_dec_64_nv)
pushl %edi
pushl %ebx
movl 12(%esp), %edi / %edi = target address
movl (%edi), %eax
movl 4(%edi), %edx / %edx:%eax = old value
1:
xorl %ebx, %ebx
xorl %ecx, %ecx
not %ecx
not %ebx / %ecx:%ebx = -1
addl %eax, %ebx
adcl %edx, %ecx / add in the carry from the low-word add
lock
cmpxchg8b (%edi) / try to stick it in
jne 1b
movl %ebx, %eax
movl %ecx, %edx / return new value
popl %ebx
popl %edi
ret
SET_SIZE(atomic_dec_64_nv)
SET_SIZE(atomic_dec_64)
ENTRY(atomic_add_8)
ALTENTRY(atomic_add_char)
movl 4(%esp), %eax
movl 8(%esp), %ecx
lock
addb %cl, (%eax)
ret
SET_SIZE(atomic_add_char)
SET_SIZE(atomic_add_8)
ENTRY(atomic_add_16)
ALTENTRY(atomic_add_short)
movl 4(%esp), %eax
movl 8(%esp), %ecx
lock
addw %cx, (%eax)
ret
SET_SIZE(atomic_add_short)
SET_SIZE(atomic_add_16)
ENTRY(atomic_add_32)
ALTENTRY(atomic_add_int)
ALTENTRY(atomic_add_ptr)
ALTENTRY(atomic_add_long)
movl 4(%esp), %eax
movl 8(%esp), %ecx
lock
addl %ecx, (%eax)
ret
SET_SIZE(atomic_add_long)
SET_SIZE(atomic_add_ptr)
SET_SIZE(atomic_add_int)
SET_SIZE(atomic_add_32)
ENTRY(atomic_or_8)
ALTENTRY(atomic_or_uchar)
movl 4(%esp), %eax
movb 8(%esp), %cl
lock
orb %cl, (%eax)
ret
SET_SIZE(atomic_or_uchar)
SET_SIZE(atomic_or_8)
ENTRY(atomic_or_16)
ALTENTRY(atomic_or_ushort)
movl 4(%esp), %eax
movw 8(%esp), %cx
lock
orw %cx, (%eax)
ret
SET_SIZE(atomic_or_ushort)
SET_SIZE(atomic_or_16)
ENTRY(atomic_or_32)
ALTENTRY(atomic_or_uint)
ALTENTRY(atomic_or_ulong)
movl 4(%esp), %eax
movl 8(%esp), %ecx
lock
orl %ecx, (%eax)
ret
SET_SIZE(atomic_or_ulong)
SET_SIZE(atomic_or_uint)
SET_SIZE(atomic_or_32)
ENTRY(atomic_and_8)
ALTENTRY(atomic_and_uchar)
movl 4(%esp), %eax
movb 8(%esp), %cl
lock
andb %cl, (%eax)
ret
SET_SIZE(atomic_and_uchar)
SET_SIZE(atomic_and_8)
ENTRY(atomic_and_16)
ALTENTRY(atomic_and_ushort)
movl 4(%esp), %eax
movw 8(%esp), %cx
lock
andw %cx, (%eax)
ret
SET_SIZE(atomic_and_ushort)
SET_SIZE(atomic_and_16)
ENTRY(atomic_and_32)
ALTENTRY(atomic_and_uint)
ALTENTRY(atomic_and_ulong)
movl 4(%esp), %eax
movl 8(%esp), %ecx
lock
andl %ecx, (%eax)
ret
SET_SIZE(atomic_and_ulong)
SET_SIZE(atomic_and_uint)
SET_SIZE(atomic_and_32)
ENTRY(atomic_add_8_nv)
ALTENTRY(atomic_add_char_nv)
movl 4(%esp), %edx / %edx = target address
movb 8(%esp), %cl / %cl = delta
movzbl %cl, %eax / %al = delta, zero extended
lock
xaddb %cl, (%edx) / %cl = old value, (%edx) = sum
addb %cl, %al / return old value plus delta
ret
SET_SIZE(atomic_add_char_nv)
SET_SIZE(atomic_add_8_nv)
ENTRY(atomic_add_16_nv)
ALTENTRY(atomic_add_short_nv)
movl 4(%esp), %edx / %edx = target address
movw 8(%esp), %cx / %cx = delta
movzwl %cx, %eax / %ax = delta, zero extended
lock
xaddw %cx, (%edx) / %cx = old value, (%edx) = sum
addw %cx, %ax / return old value plus delta
ret
SET_SIZE(atomic_add_short_nv)
SET_SIZE(atomic_add_16_nv)
ENTRY(atomic_add_32_nv)
ALTENTRY(atomic_add_int_nv)
ALTENTRY(atomic_add_ptr_nv)
ALTENTRY(atomic_add_long_nv)
movl 4(%esp), %edx / %edx = target address
movl 8(%esp), %eax / %eax = delta
movl %eax, %ecx / %ecx = delta
lock
xaddl %eax, (%edx) / %eax = old value, (%edx) = sum
addl %ecx, %eax / return old value plus delta
ret
SET_SIZE(atomic_add_long_nv)
SET_SIZE(atomic_add_ptr_nv)
SET_SIZE(atomic_add_int_nv)
SET_SIZE(atomic_add_32_nv)
/*
* NOTE: If atomic_add_64 and atomic_add_64_nv are ever
* separated, it is important to edit the libc i386 platform
* specific mapfile and remove the NODYNSORT attribute
* from atomic_add_64_nv.
*/
ENTRY(atomic_add_64)
ALTENTRY(atomic_add_64_nv)
pushl %edi
pushl %ebx
movl 12(%esp), %edi / %edi = target address
movl (%edi), %eax
movl 4(%edi), %edx / %edx:%eax = old value
1:
movl 16(%esp), %ebx
movl 20(%esp), %ecx / %ecx:%ebx = delta
addl %eax, %ebx
adcl %edx, %ecx / %ecx:%ebx = new value
lock
cmpxchg8b (%edi) / try to stick it in
jne 1b
movl %ebx, %eax
movl %ecx, %edx / return new value
popl %ebx
popl %edi
ret
SET_SIZE(atomic_add_64_nv)
SET_SIZE(atomic_add_64)
ENTRY(atomic_or_8_nv)
ALTENTRY(atomic_or_uchar_nv)
movl 4(%esp), %edx / %edx = target address
movb (%edx), %al / %al = old value
1:
movl 8(%esp), %ecx / %ecx = delta
orb %al, %cl / %cl = new value
lock
cmpxchgb %cl, (%edx) / try to stick it in
jne 1b
movzbl %cl, %eax / return new value
ret
SET_SIZE(atomic_or_uchar_nv)
SET_SIZE(atomic_or_8_nv)
ENTRY(atomic_or_16_nv)
ALTENTRY(atomic_or_ushort_nv)
movl 4(%esp), %edx / %edx = target address
movw (%edx), %ax / %ax = old value
1:
movl 8(%esp), %ecx / %ecx = delta
orw %ax, %cx / %cx = new value
lock
cmpxchgw %cx, (%edx) / try to stick it in
jne 1b
movzwl %cx, %eax / return new value
ret
SET_SIZE(atomic_or_ushort_nv)
SET_SIZE(atomic_or_16_nv)
ENTRY(atomic_or_32_nv)
ALTENTRY(atomic_or_uint_nv)
ALTENTRY(atomic_or_ulong_nv)
movl 4(%esp), %edx / %edx = target address
movl (%edx), %eax / %eax = old value
1:
movl 8(%esp), %ecx / %ecx = delta
orl %eax, %ecx / %ecx = new value
lock
cmpxchgl %ecx, (%edx) / try to stick it in
jne 1b
movl %ecx, %eax / return new value
ret
SET_SIZE(atomic_or_ulong_nv)
SET_SIZE(atomic_or_uint_nv)
SET_SIZE(atomic_or_32_nv)
/*
* NOTE: If atomic_or_64 and atomic_or_64_nv are ever
* separated, it is important to edit the libc i386 platform
* specific mapfile and remove the NODYNSORT attribute
* from atomic_or_64_nv.
*/
ENTRY(atomic_or_64)
ALTENTRY(atomic_or_64_nv)
pushl %edi
pushl %ebx
movl 12(%esp), %edi / %edi = target address
movl (%edi), %eax
movl 4(%edi), %edx / %edx:%eax = old value
1:
movl 16(%esp), %ebx
movl 20(%esp), %ecx / %ecx:%ebx = delta
orl %eax, %ebx
orl %edx, %ecx / %ecx:%ebx = new value
lock
cmpxchg8b (%edi) / try to stick it in
jne 1b
movl %ebx, %eax
movl %ecx, %edx / return new value
popl %ebx
popl %edi
ret
SET_SIZE(atomic_or_64_nv)
SET_SIZE(atomic_or_64)
ENTRY(atomic_and_8_nv)
ALTENTRY(atomic_and_uchar_nv)
movl 4(%esp), %edx / %edx = target address
movb (%edx), %al / %al = old value
1:
movl 8(%esp), %ecx / %ecx = delta
andb %al, %cl / %cl = new value
lock
cmpxchgb %cl, (%edx) / try to stick it in
jne 1b
movzbl %cl, %eax / return new value
ret
SET_SIZE(atomic_and_uchar_nv)
SET_SIZE(atomic_and_8_nv)
ENTRY(atomic_and_16_nv)
ALTENTRY(atomic_and_ushort_nv)
movl 4(%esp), %edx / %edx = target address
movw (%edx), %ax / %ax = old value
1:
movl 8(%esp), %ecx / %ecx = delta
andw %ax, %cx / %cx = new value
lock
cmpxchgw %cx, (%edx) / try to stick it in
jne 1b
movzwl %cx, %eax / return new value
ret
SET_SIZE(atomic_and_ushort_nv)
SET_SIZE(atomic_and_16_nv)
ENTRY(atomic_and_32_nv)
ALTENTRY(atomic_and_uint_nv)
ALTENTRY(atomic_and_ulong_nv)
movl 4(%esp), %edx / %edx = target address
movl (%edx), %eax / %eax = old value
1:
movl 8(%esp), %ecx / %ecx = delta
andl %eax, %ecx / %ecx = new value
lock
cmpxchgl %ecx, (%edx) / try to stick it in
jne 1b
movl %ecx, %eax / return new value
ret
SET_SIZE(atomic_and_ulong_nv)
SET_SIZE(atomic_and_uint_nv)
SET_SIZE(atomic_and_32_nv)
/*
* NOTE: If atomic_and_64 and atomic_and_64_nv are ever
* separated, it is important to edit the libc i386 platform
* specific mapfile and remove the NODYNSORT attribute
* from atomic_and_64_nv.
*/
ENTRY(atomic_and_64)
ALTENTRY(atomic_and_64_nv)
pushl %edi
pushl %ebx
movl 12(%esp), %edi / %edi = target address
movl (%edi), %eax
movl 4(%edi), %edx / %edx:%eax = old value
1:
movl 16(%esp), %ebx
movl 20(%esp), %ecx / %ecx:%ebx = delta
andl %eax, %ebx
andl %edx, %ecx / %ecx:%ebx = new value
lock
cmpxchg8b (%edi) / try to stick it in
jne 1b
movl %ebx, %eax
movl %ecx, %edx / return new value
popl %ebx
popl %edi
ret
SET_SIZE(atomic_and_64_nv)
SET_SIZE(atomic_and_64)
ENTRY(atomic_cas_8)
ALTENTRY(atomic_cas_uchar)
movl 4(%esp), %edx
movzbl 8(%esp), %eax
movb 12(%esp), %cl
lock
cmpxchgb %cl, (%edx)
ret
SET_SIZE(atomic_cas_uchar)
SET_SIZE(atomic_cas_8)
ENTRY(atomic_cas_16)
ALTENTRY(atomic_cas_ushort)
movl 4(%esp), %edx
movzwl 8(%esp), %eax
movw 12(%esp), %cx
lock
cmpxchgw %cx, (%edx)
ret
SET_SIZE(atomic_cas_ushort)
SET_SIZE(atomic_cas_16)
ENTRY(atomic_cas_32)
ALTENTRY(atomic_cas_uint)
ALTENTRY(atomic_cas_ulong)
ALTENTRY(atomic_cas_ptr)
movl 4(%esp), %edx
movl 8(%esp), %eax
movl 12(%esp), %ecx
lock
cmpxchgl %ecx, (%edx)
ret
SET_SIZE(atomic_cas_ptr)
SET_SIZE(atomic_cas_ulong)
SET_SIZE(atomic_cas_uint)
SET_SIZE(atomic_cas_32)
ENTRY(atomic_cas_64)
pushl %ebx
pushl %esi
movl 12(%esp), %esi
movl 16(%esp), %eax
movl 20(%esp), %edx
movl 24(%esp), %ebx
movl 28(%esp), %ecx
lock
cmpxchg8b (%esi)
popl %esi
popl %ebx
ret
SET_SIZE(atomic_cas_64)
ENTRY(atomic_swap_8)
ALTENTRY(atomic_swap_uchar)
movl 4(%esp), %edx
movzbl 8(%esp), %eax
lock
xchgb %al, (%edx)
ret
SET_SIZE(atomic_swap_uchar)
SET_SIZE(atomic_swap_8)
ENTRY(atomic_swap_16)
ALTENTRY(atomic_swap_ushort)
movl 4(%esp), %edx
movzwl 8(%esp), %eax
lock
xchgw %ax, (%edx)
ret
SET_SIZE(atomic_swap_ushort)
SET_SIZE(atomic_swap_16)
ENTRY(atomic_swap_32)
ALTENTRY(atomic_swap_uint)
ALTENTRY(atomic_swap_ptr)
ALTENTRY(atomic_swap_ulong)
movl 4(%esp), %edx
movl 8(%esp), %eax
lock
xchgl %eax, (%edx)
ret
SET_SIZE(atomic_swap_ulong)
SET_SIZE(atomic_swap_ptr)
SET_SIZE(atomic_swap_uint)
SET_SIZE(atomic_swap_32)
ENTRY(atomic_swap_64)
pushl %esi
pushl %ebx
movl 12(%esp), %esi
movl 16(%esp), %ebx
movl 20(%esp), %ecx
movl (%esi), %eax
movl 4(%esi), %edx / %edx:%eax = old value
1:
lock
cmpxchg8b (%esi)
jne 1b
popl %ebx
popl %esi
ret
SET_SIZE(atomic_swap_64)
ENTRY(atomic_set_long_excl)
movl 4(%esp), %edx / %edx = target address
movl 8(%esp), %ecx / %ecx = bit id
xorl %eax, %eax
lock
btsl %ecx, (%edx)
jnc 1f
decl %eax / return -1
1:
ret
SET_SIZE(atomic_set_long_excl)
ENTRY(atomic_clear_long_excl)
movl 4(%esp), %edx / %edx = target address
movl 8(%esp), %ecx / %ecx = bit id
xorl %eax, %eax
lock
btrl %ecx, (%edx)
jc 1f
decl %eax / return -1
1:
ret
SET_SIZE(atomic_clear_long_excl)
#if !defined(_KERNEL)
/*
* NOTE: membar_enter, membar_exit, membar_producer, and
* membar_consumer are all identical routines. We define them
* separately, instead of using ALTENTRY definitions to alias them
* together, so that DTrace and debuggers will see a unique address
* for them, allowing more accurate tracing.
*/
ENTRY(membar_enter)
lock
xorl $0, (%esp)
ret
SET_SIZE(membar_enter)
ENTRY(membar_exit)
lock
xorl $0, (%esp)
ret
SET_SIZE(membar_exit)
ENTRY(membar_producer)
lock
xorl $0, (%esp)
ret
SET_SIZE(membar_producer)
ENTRY(membar_consumer)
lock
xorl $0, (%esp)
ret
SET_SIZE(membar_consumer)
#endif /* !_KERNEL */
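All four i386 membars above are the same instruction: a locked no-op read-modify-write of the word at the top of the stack. Baseline i386 has no mfence/sfence/lfence, but any lock-prefixed RMW is a full barrier, and xorl $0 changes nothing while still paying the serializing cost. A sketch of the idiom (not part of this file), assuming GCC-style inline assembly:

static inline void
membar_full_i386(void)
{
	/* locked no-op RMW on the stack acts as a full memory fence */
	__asm__ __volatile__("lock; xorl $0, (%%esp)" ::: "memory", "cc");
}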

801
common/atomic/sparc/atomic.s Normal file

@@ -0,0 +1,801 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2008 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
.file "atomic.s"
#include <sys/asm_linkage.h>
#if defined(_KERNEL)
/*
* Legacy kernel interfaces; they will go away (eventually).
*/
ANSI_PRAGMA_WEAK2(cas8,atomic_cas_8,function)
ANSI_PRAGMA_WEAK2(cas32,atomic_cas_32,function)
ANSI_PRAGMA_WEAK2(cas64,atomic_cas_64,function)
ANSI_PRAGMA_WEAK2(caslong,atomic_cas_ulong,function)
ANSI_PRAGMA_WEAK2(casptr,atomic_cas_ptr,function)
ANSI_PRAGMA_WEAK2(atomic_and_long,atomic_and_ulong,function)
ANSI_PRAGMA_WEAK2(atomic_or_long,atomic_or_ulong,function)
ANSI_PRAGMA_WEAK2(swapl,atomic_swap_32,function)
#endif
/*
* NOTE: If atomic_inc_8 and atomic_inc_8_nv are ever
* separated, you need to also edit the libc sparc platform
* specific mapfile and remove the NODYNSORT attribute
* from atomic_inc_8_nv.
*/
ENTRY(atomic_inc_8)
ALTENTRY(atomic_inc_8_nv)
ALTENTRY(atomic_inc_uchar)
ALTENTRY(atomic_inc_uchar_nv)
ba add_8
add %g0, 1, %o1
SET_SIZE(atomic_inc_uchar_nv)
SET_SIZE(atomic_inc_uchar)
SET_SIZE(atomic_inc_8_nv)
SET_SIZE(atomic_inc_8)
/*
* NOTE: If atomic_dec_8 and atomic_dec_8_nv are ever
* separated, you need to also edit the libc sparc platform
* specific mapfile and remove the NODYNSORT attribute
* from atomic_dec_8_nv.
*/
ENTRY(atomic_dec_8)
ALTENTRY(atomic_dec_8_nv)
ALTENTRY(atomic_dec_uchar)
ALTENTRY(atomic_dec_uchar_nv)
ba add_8
sub %g0, 1, %o1
SET_SIZE(atomic_dec_uchar_nv)
SET_SIZE(atomic_dec_uchar)
SET_SIZE(atomic_dec_8_nv)
SET_SIZE(atomic_dec_8)
/*
* NOTE: If atomic_add_8 and atomic_add_8_nv are ever
* separated, you need to also edit the libc sparc platform
* specific mapfile and remove the NODYNSORT attribute
* from atomic_add_8_nv.
*/
ENTRY(atomic_add_8)
ALTENTRY(atomic_add_8_nv)
ALTENTRY(atomic_add_char)
ALTENTRY(atomic_add_char_nv)
add_8:
and %o0, 0x3, %o4 ! %o4 = byte offset, left-to-right
xor %o4, 0x3, %g1 ! %g1 = byte offset, right-to-left
sll %g1, 3, %g1 ! %g1 = bit offset, right-to-left
set 0xff, %o3 ! %o3 = mask
sll %o3, %g1, %o3 ! %o3 = shifted to bit offset
sll %o1, %g1, %o1 ! %o1 = shifted to bit offset
and %o1, %o3, %o1 ! %o1 = single byte value
andn %o0, 0x3, %o0 ! %o0 = word address
ld [%o0], %o2 ! read old value
1:
add %o2, %o1, %o5 ! add value to the old value
and %o5, %o3, %o5 ! clear other bits
andn %o2, %o3, %o4 ! clear target bits
or %o4, %o5, %o5 ! insert the new value
cas [%o0], %o2, %o5
cmp %o2, %o5
bne,a,pn %icc, 1b
mov %o5, %o2 ! %o2 = old value
add %o2, %o1, %o5
and %o5, %o3, %o5
retl
srl %o5, %g1, %o0 ! %o0 = new value
SET_SIZE(atomic_add_char_nv)
SET_SIZE(atomic_add_char)
SET_SIZE(atomic_add_8_nv)
SET_SIZE(atomic_add_8)
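SPARC cas operates only on aligned 32-bit words (casx on 64-bit), so the byte routines above work on the containing word: compute the byte's lane within the big-endian word, shift the operand into that lane, and CAS-loop, discarding any carry out of the lane. The same technique in C (a sketch, not part of this file), assuming a big-endian layout and GCC's __atomic builtins:

#include <stdint.h>

static uint8_t
atomic_add_8_nv_c(volatile uint8_t *p, int8_t delta)
{
	volatile uint32_t *word =
	    (volatile uint32_t *)((uintptr_t)p & ~(uintptr_t)3);
	int shift = (3 - ((uintptr_t)p & 3)) * 8;	/* big-endian lane */
	uint32_t mask = (uint32_t)0xff << shift;
	uint32_t oldw, neww;

	do {
		oldw = *word;
		/* add within the lane; carries out of the byte are masked */
		neww = (oldw & ~mask) |
		    ((oldw + ((uint32_t)(uint8_t)delta << shift)) & mask);
	} while (!__atomic_compare_exchange_n(word, &oldw, neww, 0,
	    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST));
	return ((uint8_t)((neww & mask) >> shift));
}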
/*
* NOTE: If atomic_inc_16 and atomic_inc_16_nv are ever
* separated, you need to also edit the libc sparc platform
* specific mapfile and remove the NODYNSORT attribute
* from atomic_inc_16_nv.
*/
ENTRY(atomic_inc_16)
ALTENTRY(atomic_inc_16_nv)
ALTENTRY(atomic_inc_ushort)
ALTENTRY(atomic_inc_ushort_nv)
ba add_16
add %g0, 1, %o1
SET_SIZE(atomic_inc_ushort_nv)
SET_SIZE(atomic_inc_ushort)
SET_SIZE(atomic_inc_16_nv)
SET_SIZE(atomic_inc_16)
/*
* NOTE: If atomic_dec_16 and atomic_dec_16_nv are ever
* separated, you need to also edit the libc sparc platform
* specific mapfile and remove the NODYNSORT attribute
* from atomic_dec_16_nv.
*/
ENTRY(atomic_dec_16)
ALTENTRY(atomic_dec_16_nv)
ALTENTRY(atomic_dec_ushort)
ALTENTRY(atomic_dec_ushort_nv)
ba add_16
sub %g0, 1, %o1
SET_SIZE(atomic_dec_ushort_nv)
SET_SIZE(atomic_dec_ushort)
SET_SIZE(atomic_dec_16_nv)
SET_SIZE(atomic_dec_16)
/*
* NOTE: If atomic_add_16 and atomic_add_16_nv are ever
* separated, you need to also edit the libc sparc platform
* specific mapfile and remove the NODYNSORT attribute
* from atomic_add_16_nv.
*/
ENTRY(atomic_add_16)
ALTENTRY(atomic_add_16_nv)
ALTENTRY(atomic_add_short)
ALTENTRY(atomic_add_short_nv)
add_16:
and %o0, 0x2, %o4 ! %o4 = byte offset, left-to-right
xor %o4, 0x2, %g1 ! %g1 = byte offset, right-to-left
sll %o4, 3, %o4 ! %o4 = bit offset, left-to-right
sll %g1, 3, %g1 ! %g1 = bit offset, right-to-left
sethi %hi(0xffff0000), %o3 ! %o3 = mask
srl %o3, %o4, %o3 ! %o3 = shifted to bit offset
sll %o1, %g1, %o1 ! %o1 = shifted to bit offset
and %o1, %o3, %o1 ! %o1 = single short value
andn %o0, 0x2, %o0 ! %o0 = word address
! if low-order bit is 1, we will properly get an alignment fault here
ld [%o0], %o2 ! read old value
1:
add %o1, %o2, %o5 ! add value to the old value
and %o5, %o3, %o5 ! clear other bits
andn %o2, %o3, %o4 ! clear target bits
or %o4, %o5, %o5 ! insert the new value
cas [%o0], %o2, %o5
cmp %o2, %o5
bne,a,pn %icc, 1b
mov %o5, %o2 ! %o2 = old value
add %o1, %o2, %o5
and %o5, %o3, %o5
retl
srl %o5, %g1, %o0 ! %o0 = new value
SET_SIZE(atomic_add_short_nv)
SET_SIZE(atomic_add_short)
SET_SIZE(atomic_add_16_nv)
SET_SIZE(atomic_add_16)
/*
* NOTE: If atomic_inc_32 and atomic_inc_32_nv are ever
* separated, you need to also edit the libc sparc platform
* specific mapfile and remove the NODYNSORT attribute
* from atomic_inc_32_nv.
*/
ENTRY(atomic_inc_32)
ALTENTRY(atomic_inc_32_nv)
ALTENTRY(atomic_inc_uint)
ALTENTRY(atomic_inc_uint_nv)
ALTENTRY(atomic_inc_ulong)
ALTENTRY(atomic_inc_ulong_nv)
ba add_32
add %g0, 1, %o1
SET_SIZE(atomic_inc_ulong_nv)
SET_SIZE(atomic_inc_ulong)
SET_SIZE(atomic_inc_uint_nv)
SET_SIZE(atomic_inc_uint)
SET_SIZE(atomic_inc_32_nv)
SET_SIZE(atomic_inc_32)
/*
* NOTE: If atomic_dec_32 and atomic_dec_32_nv are ever
* separated, you need to also edit the libc sparc platform
* specific mapfile and remove the NODYNSORT attribute
* from atomic_dec_32_nv.
*/
ENTRY(atomic_dec_32)
ALTENTRY(atomic_dec_32_nv)
ALTENTRY(atomic_dec_uint)
ALTENTRY(atomic_dec_uint_nv)
ALTENTRY(atomic_dec_ulong)
ALTENTRY(atomic_dec_ulong_nv)
ba add_32
sub %g0, 1, %o1
SET_SIZE(atomic_dec_ulong_nv)
SET_SIZE(atomic_dec_ulong)
SET_SIZE(atomic_dec_uint_nv)
SET_SIZE(atomic_dec_uint)
SET_SIZE(atomic_dec_32_nv)
SET_SIZE(atomic_dec_32)
/*
* NOTE: If atomic_add_32 and atomic_add_32_nv are ever
* separated, you need to also edit the libc sparc platform
* specific mapfile and remove the NODYNSORT attribute
* from atomic_add_32_nv.
*/
ENTRY(atomic_add_32)
ALTENTRY(atomic_add_32_nv)
ALTENTRY(atomic_add_int)
ALTENTRY(atomic_add_int_nv)
ALTENTRY(atomic_add_ptr)
ALTENTRY(atomic_add_ptr_nv)
ALTENTRY(atomic_add_long)
ALTENTRY(atomic_add_long_nv)
add_32:
ld [%o0], %o2
1:
add %o2, %o1, %o3
cas [%o0], %o2, %o3
cmp %o2, %o3
bne,a,pn %icc, 1b
mov %o3, %o2
retl
add %o2, %o1, %o0 ! return new value
SET_SIZE(atomic_add_long_nv)
SET_SIZE(atomic_add_long)
SET_SIZE(atomic_add_ptr_nv)
SET_SIZE(atomic_add_ptr)
SET_SIZE(atomic_add_int_nv)
SET_SIZE(atomic_add_int)
SET_SIZE(atomic_add_32_nv)
SET_SIZE(atomic_add_32)
/*
* NOTE: If atomic_inc_64 and atomic_inc_64_nv are ever
* separated, you need to also edit the libc sparc platform
* specific mapfile and remove the NODYNSORT attribute
* from atomic_inc_64_nv.
*/
ENTRY(atomic_inc_64)
ALTENTRY(atomic_inc_64_nv)
ba add_64
add %g0, 1, %o1
SET_SIZE(atomic_inc_64_nv)
SET_SIZE(atomic_inc_64)
/*
* NOTE: If atomic_dec_64 and atomic_dec_64_nv are ever
* separated, you need to also edit the libc sparc platform
* specific mapfile and remove the NODYNSORT attribute
* from atomic_dec_64_nv.
*/
ENTRY(atomic_dec_64)
ALTENTRY(atomic_dec_64_nv)
ba add_64
sub %g0, 1, %o1
SET_SIZE(atomic_dec_64_nv)
SET_SIZE(atomic_dec_64)
/*
* NOTE: If atomic_add_64 and atomic_add_64_nv are ever
* separated, you need to also edit the libc sparc platform
* specific mapfile and remove the NODYNSORT attribute
* from atomic_add_64_nv.
*/
ENTRY(atomic_add_64)
ALTENTRY(atomic_add_64_nv)
sllx %o1, 32, %o1 ! upper 32 in %o1, lower in %o2
srl %o2, 0, %o2
add %o1, %o2, %o1 ! convert 2 32-bit args into 1 64-bit
add_64:
ldx [%o0], %o2
1:
add %o2, %o1, %o3
casx [%o0], %o2, %o3
cmp %o2, %o3
bne,a,pn %xcc, 1b
mov %o3, %o2
add %o2, %o1, %o1 ! return lower 32-bits in %o1
retl
srlx %o1, 32, %o0 ! return upper 32-bits in %o0
SET_SIZE(atomic_add_64_nv)
SET_SIZE(atomic_add_64)
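Under the 32-bit SPARC ABI a 64-bit argument arrives as a register pair (%o1 high half, %o2 low half) and is returned the same way, so the sllx/srl/add prologue and srlx epilogue above just pack and unpack that pair around the casx loop. In C terms (hypothetical helper names, not part of this file):

#include <stdint.h>

static inline uint64_t
pack64(uint32_t hi, uint32_t lo)
{
	/* the srl %o2, 0, %o2 in the asm clears stray high bits of lo */
	return (((uint64_t)hi << 32) + lo);
}

static inline uint32_t
hi32(uint64_t v)
{
	return ((uint32_t)(v >> 32));	/* returned in %o0 */
}

static inline uint32_t
lo32(uint64_t v)
{
	return ((uint32_t)v);		/* returned in %o1 */
}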
/*
* NOTE: If atomic_or_8 and atomic_or_8_nv are ever
* separated, you need to also edit the libc sparc platform
* specific mapfile and remove the NODYNSORT attribute
* from atomic_or_8_nv.
*/
ENTRY(atomic_or_8)
ALTENTRY(atomic_or_8_nv)
ALTENTRY(atomic_or_uchar)
ALTENTRY(atomic_or_uchar_nv)
and %o0, 0x3, %o4 ! %o4 = byte offset, left-to-right
xor %o4, 0x3, %g1 ! %g1 = byte offset, right-to-left
sll %g1, 3, %g1 ! %g1 = bit offset, right-to-left
set 0xff, %o3 ! %o3 = mask
sll %o3, %g1, %o3 ! %o3 = shifted to bit offset
sll %o1, %g1, %o1 ! %o1 = shifted to bit offset
and %o1, %o3, %o1 ! %o1 = single byte value
andn %o0, 0x3, %o0 ! %o0 = word address
ld [%o0], %o2 ! read old value
1:
or %o2, %o1, %o5 ! or in the new value
cas [%o0], %o2, %o5
cmp %o2, %o5
bne,a,pn %icc, 1b
mov %o5, %o2 ! %o2 = old value
or %o2, %o1, %o5
and %o5, %o3, %o5
retl
srl %o5, %g1, %o0 ! %o0 = new value
SET_SIZE(atomic_or_uchar_nv)
SET_SIZE(atomic_or_uchar)
SET_SIZE(atomic_or_8_nv)
SET_SIZE(atomic_or_8)
/*
* NOTE: If atomic_or_16 and atomic_or_16_nv are ever
* separated, you need to also edit the libc sparc platform
* specific mapfile and remove the NODYNSORT attribute
* from atomic_or_16_nv.
*/
ENTRY(atomic_or_16)
ALTENTRY(atomic_or_16_nv)
ALTENTRY(atomic_or_ushort)
ALTENTRY(atomic_or_ushort_nv)
and %o0, 0x2, %o4 ! %o4 = byte offset, left-to-right
xor %o4, 0x2, %g1 ! %g1 = byte offset, right-to-left
sll %o4, 3, %o4 ! %o4 = bit offset, left-to-right
sll %g1, 3, %g1 ! %g1 = bit offset, right-to-left
sethi %hi(0xffff0000), %o3 ! %o3 = mask
srl %o3, %o4, %o3 ! %o3 = shifted to bit offset
sll %o1, %g1, %o1 ! %o1 = shifted to bit offset
and %o1, %o3, %o1 ! %o1 = single short value
andn %o0, 0x2, %o0 ! %o0 = word address
! if low-order bit is 1, we will properly get an alignment fault here
ld [%o0], %o2 ! read old value
1:
or %o2, %o1, %o5 ! or in the new value
cas [%o0], %o2, %o5
cmp %o2, %o5
bne,a,pn %icc, 1b
mov %o5, %o2 ! %o2 = old value
or %o2, %o1, %o5 ! or in the new value
and %o5, %o3, %o5
retl
srl %o5, %g1, %o0 ! %o0 = new value
SET_SIZE(atomic_or_ushort_nv)
SET_SIZE(atomic_or_ushort)
SET_SIZE(atomic_or_16_nv)
SET_SIZE(atomic_or_16)
/*
* NOTE: If atomic_or_32 and atomic_or_32_nv are ever
* separated, you need to also edit the libc sparc platform
* specific mapfile and remove the NODYNSORT attribute
* from atomic_or_32_nv.
*/
ENTRY(atomic_or_32)
ALTENTRY(atomic_or_32_nv)
ALTENTRY(atomic_or_uint)
ALTENTRY(atomic_or_uint_nv)
ALTENTRY(atomic_or_ulong)
ALTENTRY(atomic_or_ulong_nv)
ld [%o0], %o2
1:
or %o2, %o1, %o3
cas [%o0], %o2, %o3
cmp %o2, %o3
bne,a,pn %icc, 1b
mov %o3, %o2
retl
or %o2, %o1, %o0 ! return new value
SET_SIZE(atomic_or_ulong_nv)
SET_SIZE(atomic_or_ulong)
SET_SIZE(atomic_or_uint_nv)
SET_SIZE(atomic_or_uint)
SET_SIZE(atomic_or_32_nv)
SET_SIZE(atomic_or_32)
/*
* NOTE: If atomic_or_64 and atomic_or_64_nv are ever
* separated, you need to also edit the libc sparc platform
* specific mapfile and remove the NODYNSORT attribute
* from atomic_or_64_nv.
*/
ENTRY(atomic_or_64)
ALTENTRY(atomic_or_64_nv)
sllx %o1, 32, %o1 ! upper 32 in %o1, lower in %o2
srl %o2, 0, %o2
add %o1, %o2, %o1 ! convert 2 32-bit args into 1 64-bit
ldx [%o0], %o2
1:
or %o2, %o1, %o3
casx [%o0], %o2, %o3
cmp %o2, %o3
bne,a,pn %xcc, 1b
mov %o3, %o2
or %o2, %o1, %o1 ! return lower 32-bits in %o1
retl
srlx %o1, 32, %o0 ! return upper 32-bits in %o0
SET_SIZE(atomic_or_64_nv)
SET_SIZE(atomic_or_64)
/*
* NOTE: If atomic_and_8 and atomic_and_8_nv are ever
* separated, you need to also edit the libc sparc platform
* specific mapfile and remove the NODYNSORT attribute
* from atomic_and_8_nv.
*/
ENTRY(atomic_and_8)
ALTENTRY(atomic_and_8_nv)
ALTENTRY(atomic_and_uchar)
ALTENTRY(atomic_and_uchar_nv)
and %o0, 0x3, %o4 ! %o4 = byte offset, left-to-right
xor %o4, 0x3, %g1 ! %g1 = byte offset, right-to-left
sll %g1, 3, %g1 ! %g1 = bit offset, right-to-left
set 0xff, %o3 ! %o3 = mask
sll %o3, %g1, %o3 ! %o3 = shifted to bit offset
sll %o1, %g1, %o1 ! %o1 = shifted to bit offset
orn %o1, %o3, %o1 ! all ones in other bytes
andn %o0, 0x3, %o0 ! %o0 = word address
ld [%o0], %o2 ! read old value
1:
and %o2, %o1, %o5 ! and in the new value
cas [%o0], %o2, %o5
cmp %o2, %o5
bne,a,pn %icc, 1b
mov %o5, %o2 ! %o2 = old value
and %o2, %o1, %o5
and %o5, %o3, %o5
retl
srl %o5, %g1, %o0 ! %o0 = new value
SET_SIZE(atomic_and_uchar_nv)
SET_SIZE(atomic_and_uchar)
SET_SIZE(atomic_and_8_nv)
SET_SIZE(atomic_and_8)
/*
* NOTE: If atomic_and_16 and atomic_and_16_nv are ever
* separated, you need to also edit the libc sparc platform
* specific mapfile and remove the NODYNSORT attribute
* from atomic_and_16_nv.
*/
ENTRY(atomic_and_16)
ALTENTRY(atomic_and_16_nv)
ALTENTRY(atomic_and_ushort)
ALTENTRY(atomic_and_ushort_nv)
and %o0, 0x2, %o4 ! %o4 = byte offset, left-to-right
xor %o4, 0x2, %g1 ! %g1 = byte offset, right-to-left
sll %o4, 3, %o4 ! %o4 = bit offset, left-to-right
sll %g1, 3, %g1 ! %g1 = bit offset, right-to-left
sethi %hi(0xffff0000), %o3 ! %o3 = mask
srl %o3, %o4, %o3 ! %o3 = shifted to bit offset
sll %o1, %g1, %o1 ! %o1 = shifted to bit offset
orn %o1, %o3, %o1 ! all ones in the other half
andn %o0, 0x2, %o0 ! %o0 = word address
! if low-order bit is 1, we will properly get an alignment fault here
ld [%o0], %o2 ! read old value
1:
and %o2, %o1, %o5 ! and in the new value
cas [%o0], %o2, %o5
cmp %o2, %o5
bne,a,pn %icc, 1b
mov %o5, %o2 ! %o2 = old value
and %o2, %o1, %o5
and %o5, %o3, %o5
retl
srl %o5, %g1, %o0 ! %o0 = new value
SET_SIZE(atomic_and_ushort_nv)
SET_SIZE(atomic_and_ushort)
SET_SIZE(atomic_and_16_nv)
SET_SIZE(atomic_and_16)
/*
* NOTE: If atomic_and_32 and atomic_and_32_nv are ever
* separated, you need to also edit the libc sparc platform
* specific mapfile and remove the NODYNSORT attribute
* from atomic_and_32_nv.
*/
ENTRY(atomic_and_32)
ALTENTRY(atomic_and_32_nv)
ALTENTRY(atomic_and_uint)
ALTENTRY(atomic_and_uint_nv)
ALTENTRY(atomic_and_ulong)
ALTENTRY(atomic_and_ulong_nv)
ld [%o0], %o2
1:
and %o2, %o1, %o3
cas [%o0], %o2, %o3
cmp %o2, %o3
bne,a,pn %icc, 1b
mov %o3, %o2
retl
and %o2, %o1, %o0 ! return new value
SET_SIZE(atomic_and_ulong_nv)
SET_SIZE(atomic_and_ulong)
SET_SIZE(atomic_and_uint_nv)
SET_SIZE(atomic_and_uint)
SET_SIZE(atomic_and_32_nv)
SET_SIZE(atomic_and_32)
/*
* NOTE: If atomic_and_64 and atomic_and_64_nv are ever
* separated, you need to also edit the libc sparc platform
* specific mapfile and remove the NODYNSORT attribute
* from atomic_and_64_nv.
*/
ENTRY(atomic_and_64)
ALTENTRY(atomic_and_64_nv)
sllx %o1, 32, %o1 ! upper 32 in %o1, lower in %o2
srl %o2, 0, %o2
add %o1, %o2, %o1 ! convert 2 32-bit args into 1 64-bit
ldx [%o0], %o2
1:
and %o2, %o1, %o3
casx [%o0], %o2, %o3
cmp %o2, %o3
bne,a,pn %xcc, 1b
mov %o3, %o2
and %o2, %o1, %o1 ! return lower 32-bits in %o1
retl
srlx %o1, 32, %o0 ! return upper 32-bits in %o0
SET_SIZE(atomic_and_64_nv)
SET_SIZE(atomic_and_64)
ENTRY(atomic_cas_8)
ALTENTRY(atomic_cas_uchar)
and %o0, 0x3, %o4 ! %o4 = byte offset, left-to-right
xor %o4, 0x3, %g1 ! %g1 = byte offset, right-to-left
sll %g1, 3, %g1 ! %g1 = bit offset, right-to-left
set 0xff, %o3 ! %o3 = mask
sll %o3, %g1, %o3 ! %o3 = shifted to bit offset
sll %o1, %g1, %o1 ! %o1 = shifted to bit offset
and %o1, %o3, %o1 ! %o1 = single byte value
sll %o2, %g1, %o2 ! %o2 = shifted to bit offset
and %o2, %o3, %o2 ! %o2 = single byte value
andn %o0, 0x3, %o0 ! %o0 = word address
ld [%o0], %o4 ! read old value
1:
andn %o4, %o3, %o4 ! clear target bits
or %o4, %o2, %o5 ! insert the new value
or %o4, %o1, %o4 ! insert the comparison value
cas [%o0], %o4, %o5
cmp %o4, %o5 ! did we succeed?
be,pt %icc, 2f
and %o5, %o3, %o4 ! isolate the old value
cmp %o1, %o4 ! should we have succeeded?
be,a,pt %icc, 1b ! yes, try again
mov %o5, %o4 ! %o4 = old value
2:
retl
srl %o4, %g1, %o0 ! %o0 = old value
SET_SIZE(atomic_cas_uchar)
SET_SIZE(atomic_cas_8)
ENTRY(atomic_cas_16)
ALTENTRY(atomic_cas_ushort)
and %o0, 0x2, %o4 ! %o4 = byte offset, left-to-right
xor %o4, 0x2, %g1 ! %g1 = byte offset, right-to-left
sll %o4, 3, %o4 ! %o4 = bit offset, left-to-right
sll %g1, 3, %g1 ! %g1 = bit offset, right-to-left
sethi %hi(0xffff0000), %o3 ! %o3 = mask
srl %o3, %o4, %o3 ! %o3 = shifted to bit offset
sll %o1, %g1, %o1 ! %o1 = shifted to bit offset
and %o1, %o3, %o1 ! %o1 = single short value
sll %o2, %g1, %o2 ! %o2 = shifted to bit offset
and %o2, %o3, %o2 ! %o2 = single short value
andn %o0, 0x2, %o0 ! %o0 = word address
! if low-order bit is 1, we will properly get an alignment fault here
ld [%o0], %o4 ! read old value
1:
andn %o4, %o3, %o4 ! clear target bits
or %o4, %o2, %o5 ! insert the new value
or %o4, %o1, %o4 ! insert the comparison value
cas [%o0], %o4, %o5
cmp %o4, %o5 ! did we succeed?
be,pt %icc, 2f
and %o5, %o3, %o4 ! isolate the old value
cmp %o1, %o4 ! should we have succeeded?
be,a,pt %icc, 1b ! yes, try again
mov %o5, %o4 ! %o4 = old value
2:
retl
srl %o4, %g1, %o0 ! %o0 = old value
SET_SIZE(atomic_cas_ushort)
SET_SIZE(atomic_cas_16)
ENTRY(atomic_cas_32)
ALTENTRY(atomic_cas_uint)
ALTENTRY(atomic_cas_ptr)
ALTENTRY(atomic_cas_ulong)
cas [%o0], %o1, %o2
retl
mov %o2, %o0
SET_SIZE(atomic_cas_ulong)
SET_SIZE(atomic_cas_ptr)
SET_SIZE(atomic_cas_uint)
SET_SIZE(atomic_cas_32)
ENTRY(atomic_cas_64)
sllx %o1, 32, %o1 ! cmp's upper 32 in %o1, lower in %o2
srl %o2, 0, %o2 ! convert 2 32-bit args into 1 64-bit
add %o1, %o2, %o1
sllx %o3, 32, %o2 ! newval upper 32 in %o3, lower in %o4
srl %o4, 0, %o4 ! setup %o2 to have newval
add %o2, %o4, %o2
casx [%o0], %o1, %o2
srl %o2, 0, %o1 ! return lower 32-bits in %o1
retl
srlx %o2, 32, %o0 ! return upper 32-bits in %o0
SET_SIZE(atomic_cas_64)
ENTRY(atomic_swap_8)
ALTENTRY(atomic_swap_uchar)
and %o0, 0x3, %o4 ! %o4 = byte offset, left-to-right
xor %o4, 0x3, %g1 ! %g1 = byte offset, right-to-left
sll %g1, 3, %g1 ! %g1 = bit offset, right-to-left
set 0xff, %o3 ! %o3 = mask
sll %o3, %g1, %o3 ! %o3 = shifted to bit offset
sll %o1, %g1, %o1 ! %o1 = shifted to bit offset
and %o1, %o3, %o1 ! %o1 = single byte value
andn %o0, 0x3, %o0 ! %o0 = word address
ld [%o0], %o2 ! read old value
1:
andn %o2, %o3, %o5 ! clear target bits
or %o5, %o1, %o5 ! insert the new value
cas [%o0], %o2, %o5
cmp %o2, %o5
bne,a,pn %icc, 1b
mov %o5, %o2 ! %o2 = old value
and %o5, %o3, %o5
retl
srl %o5, %g1, %o0 ! %o0 = old value
SET_SIZE(atomic_swap_uchar)
SET_SIZE(atomic_swap_8)
ENTRY(atomic_swap_16)
ALTENTRY(atomic_swap_ushort)
and %o0, 0x2, %o4 ! %o4 = byte offset, left-to-right
xor %o4, 0x2, %g1 ! %g1 = byte offset, right-to-left
sll %o4, 3, %o4 ! %o4 = bit offset, left-to-right
sll %g1, 3, %g1 ! %g1 = bit offset, right-to-left
sethi %hi(0xffff0000), %o3 ! %o3 = mask
srl %o3, %o4, %o3 ! %o3 = shifted to bit offset
sll %o1, %g1, %o1 ! %o1 = shifted to bit offset
and %o1, %o3, %o1 ! %o1 = single short value
andn %o0, 0x2, %o0 ! %o0 = word address
! if low-order bit is 1, we will properly get an alignment fault here
ld [%o0], %o2 ! read old value
1:
andn %o2, %o3, %o5 ! clear target bits
or %o5, %o1, %o5 ! insert the new value
cas [%o0], %o2, %o5
cmp %o2, %o5
bne,a,pn %icc, 1b
mov %o5, %o2 ! %o2 = old value
and %o5, %o3, %o5
retl
srl %o5, %g1, %o0 ! %o0 = old value
SET_SIZE(atomic_swap_ushort)
SET_SIZE(atomic_swap_16)
ENTRY(atomic_swap_32)
ALTENTRY(atomic_swap_uint)
ALTENTRY(atomic_swap_ptr)
ALTENTRY(atomic_swap_ulong)
ld [%o0], %o2
1:
mov %o1, %o3
cas [%o0], %o2, %o3
cmp %o2, %o3
bne,a,pn %icc, 1b
mov %o3, %o2
retl
mov %o3, %o0
SET_SIZE(atomic_swap_ulong)
SET_SIZE(atomic_swap_ptr)
SET_SIZE(atomic_swap_uint)
SET_SIZE(atomic_swap_32)
ENTRY(atomic_swap_64)
sllx %o1, 32, %o1 ! upper 32 in %o1, lower in %o2
srl %o2, 0, %o2
add %o1, %o2, %o1 ! convert 2 32-bit args into 1 64-bit
ldx [%o0], %o2
1:
mov %o1, %o3
casx [%o0], %o2, %o3
cmp %o2, %o3
bne,a,pn %xcc, 1b
mov %o3, %o2
srl %o3, 0, %o1 ! return lower 32-bits in %o1
retl
srlx %o3, 32, %o0 ! return upper 32-bits in %o0
SET_SIZE(atomic_swap_64)
ENTRY(atomic_set_long_excl)
mov 1, %o3
slln %o3, %o1, %o3
ldn [%o0], %o2
1:
andcc %o2, %o3, %g0 ! test if the bit is set
bnz,a,pn %ncc, 2f ! if so, then fail out
mov -1, %o0
or %o2, %o3, %o4 ! set the bit, and try to commit it
casn [%o0], %o2, %o4
cmp %o2, %o4
bne,a,pn %ncc, 1b ! failed to commit, try again
mov %o4, %o2
mov %g0, %o0
2:
retl
nop
SET_SIZE(atomic_set_long_excl)
ENTRY(atomic_clear_long_excl)
mov 1, %o3
slln %o3, %o1, %o3
ldn [%o0], %o2
1:
andncc %o3, %o2, %g0 ! test if the bit is clear
bnz,a,pn %ncc, 2f ! if so, then fail out
mov -1, %o0
andn %o2, %o3, %o4 ! clear the bit, and try to commit it
casn [%o0], %o2, %o4
cmp %o2, %o4
bne,a,pn %ncc, 1b ! failed to commit, try again
mov %o4, %o2
mov %g0, %o0
2:
retl
nop
SET_SIZE(atomic_clear_long_excl)
#if !defined(_KERNEL)
/*
* Spitfires and Blackbirds have a problem with membars in the
* delay slot (SF_ERRATA_51). For safety's sake, we assume
* that the whole world needs the workaround.
*/
ENTRY(membar_enter)
membar #StoreLoad|#StoreStore
retl
nop
SET_SIZE(membar_enter)
ENTRY(membar_exit)
membar #LoadStore|#StoreStore
retl
nop
SET_SIZE(membar_exit)
ENTRY(membar_producer)
membar #StoreStore
retl
nop
SET_SIZE(membar_producer)
ENTRY(membar_consumer)
membar #LoadLoad
retl
nop
SET_SIZE(membar_consumer)
#endif /* !_KERNEL */
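
The sub-word routines above (atomic_or_8, atomic_cas_8, atomic_swap_16, and
friends) never issue a byte or halfword store: they widen the operation to the
containing 32-bit word and retry with cas until that word is observed
unchanged. A minimal C sketch of the same technique follows; it is editorial
(not part of the import), assumes GCC-style __atomic builtins, and hard-codes
big-endian byte numbering as on SPARC.

#include <stdint.h>

/* Hypothetical helper, not in the source tree: atomic OR into one byte. */
static uint8_t
atomic_or_8_sketch(volatile uint8_t *addr, uint8_t bits)
{
	/* andn %o0, 0x3, %o0: address of the word containing the byte */
	volatile uint32_t *word =
	    (volatile uint32_t *)((uintptr_t)addr & ~(uintptr_t)0x3);
	/* big-endian: byte offset 0 is the most significant byte */
	unsigned shift = (3 - (unsigned)((uintptr_t)addr & 0x3)) * 8;
	uint32_t val = (uint32_t)bits << shift;
	uint32_t old = *word;		/* ld [%o0], %o2: read old value */
	uint32_t newval;

	do {
		newval = old | val;	/* or in the new value */
		/* cas: on failure, 'old' is refreshed with current contents */
	} while (!__atomic_compare_exchange_n(word, &old, newval, 0,
	    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST));
	return ((uint8_t)((newval >> shift) & 0xff));	/* new byte value */
}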

1030
common/avl/avl.c Normal file

File diff suppressed because it is too large

251
common/list/list.c Normal file
View File

@ -0,0 +1,251 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2003, 2010, Oracle and/or its affiliates. All rights reserved.
*/
/*
* Generic doubly-linked list implementation
*/
#include <sys/list.h>
#include <sys/list_impl.h>
#include <sys/types.h>
#include <sys/sysmacros.h>
#ifdef _KERNEL
#include <sys/debug.h>
#else
#include <assert.h>
#define ASSERT(a) assert(a)
#endif
#ifdef lint
extern list_node_t *list_d2l(list_t *list, void *obj);
#else
#define list_d2l(a, obj) ((list_node_t *)(((char *)obj) + (a)->list_offset))
#endif
#define list_object(a, node) ((void *)(((char *)node) - (a)->list_offset))
#define list_empty(a) ((a)->list_head.list_next == &(a)->list_head)
#define list_insert_after_node(list, node, object) { \
list_node_t *lnew = list_d2l(list, object); \
lnew->list_prev = (node); \
lnew->list_next = (node)->list_next; \
(node)->list_next->list_prev = lnew; \
(node)->list_next = lnew; \
}
#define list_insert_before_node(list, node, object) { \
list_node_t *lnew = list_d2l(list, object); \
lnew->list_next = (node); \
lnew->list_prev = (node)->list_prev; \
(node)->list_prev->list_next = lnew; \
(node)->list_prev = lnew; \
}
#define list_remove_node(node) \
(node)->list_prev->list_next = (node)->list_next; \
(node)->list_next->list_prev = (node)->list_prev; \
(node)->list_next = (node)->list_prev = NULL
void
list_create(list_t *list, size_t size, size_t offset)
{
ASSERT(list);
ASSERT(size > 0);
ASSERT(size >= offset + sizeof (list_node_t));
list->list_size = size;
list->list_offset = offset;
list->list_head.list_next = list->list_head.list_prev =
&list->list_head;
}
void
list_destroy(list_t *list)
{
list_node_t *node = &list->list_head;
ASSERT(list);
ASSERT(list->list_head.list_next == node);
ASSERT(list->list_head.list_prev == node);
node->list_next = node->list_prev = NULL;
}
void
list_insert_after(list_t *list, void *object, void *nobject)
{
if (object == NULL) {
list_insert_head(list, nobject);
} else {
list_node_t *lold = list_d2l(list, object);
list_insert_after_node(list, lold, nobject);
}
}
void
list_insert_before(list_t *list, void *object, void *nobject)
{
if (object == NULL) {
list_insert_tail(list, nobject);
} else {
list_node_t *lold = list_d2l(list, object);
list_insert_before_node(list, lold, nobject);
}
}
void
list_insert_head(list_t *list, void *object)
{
list_node_t *lold = &list->list_head;
list_insert_after_node(list, lold, object);
}
void
list_insert_tail(list_t *list, void *object)
{
list_node_t *lold = &list->list_head;
list_insert_before_node(list, lold, object);
}
void
list_remove(list_t *list, void *object)
{
list_node_t *lold = list_d2l(list, object);
ASSERT(!list_empty(list));
ASSERT(lold->list_next != NULL);
list_remove_node(lold);
}
void *
list_remove_head(list_t *list)
{
list_node_t *head = list->list_head.list_next;
if (head == &list->list_head)
return (NULL);
list_remove_node(head);
return (list_object(list, head));
}
void *
list_remove_tail(list_t *list)
{
list_node_t *tail = list->list_head.list_prev;
if (tail == &list->list_head)
return (NULL);
list_remove_node(tail);
return (list_object(list, tail));
}
void *
list_head(list_t *list)
{
if (list_empty(list))
return (NULL);
return (list_object(list, list->list_head.list_next));
}
void *
list_tail(list_t *list)
{
if (list_empty(list))
return (NULL);
return (list_object(list, list->list_head.list_prev));
}
void *
list_next(list_t *list, void *object)
{
list_node_t *node = list_d2l(list, object);
if (node->list_next != &list->list_head)
return (list_object(list, node->list_next));
return (NULL);
}
void *
list_prev(list_t *list, void *object)
{
list_node_t *node = list_d2l(list, object);
if (node->list_prev != &list->list_head)
return (list_object(list, node->list_prev));
return (NULL);
}
/*
 * Append the src list to the tail of the dst list, leaving src empty.
 */
void
list_move_tail(list_t *dst, list_t *src)
{
list_node_t *dstnode = &dst->list_head;
list_node_t *srcnode = &src->list_head;
ASSERT(dst->list_size == src->list_size);
ASSERT(dst->list_offset == src->list_offset);
if (list_empty(src))
return;
dstnode->list_prev->list_next = srcnode->list_next;
srcnode->list_next->list_prev = dstnode->list_prev;
dstnode->list_prev = srcnode->list_prev;
srcnode->list_prev->list_next = dstnode;
/* empty src list */
srcnode->list_next = srcnode->list_prev = srcnode;
}
void
list_link_replace(list_node_t *lold, list_node_t *lnew)
{
ASSERT(list_link_active(lold));
ASSERT(!list_link_active(lnew));
lnew->list_next = lold->list_next;
lnew->list_prev = lold->list_prev;
lold->list_prev->list_next = lnew;
lold->list_next->list_prev = lnew;
lold->list_next = lold->list_prev = NULL;
}
void
list_link_init(list_node_t *link)
{
link->list_next = NULL;
link->list_prev = NULL;
}
int
list_link_active(list_node_t *link)
{
return (link->list_next != NULL);
}
int
list_is_empty(list_t *list)
{
return (list_empty(list));
}
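
A usage sketch (editorial, not part of the import): the list is intrusive, so
a consumer embeds a list_node_t in its own structure and hands list_create()
the node's offset; list_d2l()/list_object() above convert between the node
and object views. It assumes <sys/list.h> is reachable, as it is for
consumers in this tree.

#include <stddef.h>
#include <stdio.h>
#include <sys/list.h>

typedef struct foo {
	int foo_val;
	list_node_t foo_node;		/* embedded link */
} foo_t;

static void
list_demo(void)
{
	list_t l;
	foo_t a = { .foo_val = 1 }, b = { .foo_val = 2 };
	foo_t *fp;

	list_create(&l, sizeof (foo_t), offsetof(foo_t, foo_node));
	list_insert_tail(&l, &a);
	list_insert_tail(&l, &b);

	/* walk head to tail: prints 1 then 2 */
	for (fp = list_head(&l); fp != NULL; fp = list_next(&l, fp))
		(void) printf("%d\n", fp->foo_val);

	list_remove(&l, &a);
	list_remove(&l, &b);
	list_destroy(&l);		/* asserts the list is empty */
}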

3297
common/nvpair/nvpair.c Normal file

File diff suppressed because it is too large

120
common/nvpair/nvpair_alloc_fixed.c Normal file
View File

@ -0,0 +1,120 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2006 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#pragma ident "%Z%%M% %I% %E% SMI"
#include <sys/stropts.h>
#include <sys/isa_defs.h>
#include <sys/nvpair.h>
#include <sys/sysmacros.h>
#if defined(_KERNEL) && !defined(_BOOT)
#include <sys/varargs.h>
#else
#include <stdarg.h>
#include <strings.h>
#endif
/*
* This allocator is very simple.
* - it uses a pre-allocated buffer for memory allocations.
* - it does _not_ free memory in the pre-allocated buffer.
*
* The reason for the selected implementation is simplicity.
* This allocator is designed for use in interrupt context, where
* the caller may not wait for free memory.
*/
/* pre-allocated buffer for memory allocations */
typedef struct nvbuf {
uintptr_t nvb_buf; /* address of pre-allocated buffer */
uintptr_t nvb_lim; /* limit address in the buffer */
uintptr_t nvb_cur; /* current address in the buffer */
} nvbuf_t;
/*
* Initialize the pre-allocated buffer allocator. The caller needs to supply
*
* buf address of pre-allocated buffer
* bufsz size of pre-allocated buffer
*
* nv_fixed_init() calculates the remaining members of nvbuf_t.
*/
static int
nv_fixed_init(nv_alloc_t *nva, va_list valist)
{
uintptr_t base = va_arg(valist, uintptr_t);
uintptr_t lim = base + va_arg(valist, size_t);
nvbuf_t *nvb = (nvbuf_t *)P2ROUNDUP(base, sizeof (uintptr_t));
if (base == 0 || (uintptr_t)&nvb[1] > lim)
return (EINVAL);
nvb->nvb_buf = (uintptr_t)&nvb[0];
nvb->nvb_cur = (uintptr_t)&nvb[1];
nvb->nvb_lim = lim;
nva->nva_arg = nvb;
return (0);
}
static void *
nv_fixed_alloc(nv_alloc_t *nva, size_t size)
{
nvbuf_t *nvb = nva->nva_arg;
uintptr_t new = nvb->nvb_cur;
if (size == 0 || new + size > nvb->nvb_lim)
return (NULL);
nvb->nvb_cur = P2ROUNDUP(new + size, sizeof (uintptr_t));
return ((void *)new);
}
/*ARGSUSED*/
static void
nv_fixed_free(nv_alloc_t *nva, void *buf, size_t size)
{
/* don't free memory in the pre-allocated buffer */
}
static void
nv_fixed_reset(nv_alloc_t *nva)
{
nvbuf_t *nvb = nva->nva_arg;
nvb->nvb_cur = (uintptr_t)&nvb[1];
}
const nv_alloc_ops_t nv_fixed_ops_def = {
nv_fixed_init, /* nv_ao_init() */
NULL, /* nv_ao_fini() */
nv_fixed_alloc, /* nv_ao_alloc() */
nv_fixed_free, /* nv_ao_free() */
nv_fixed_reset /* nv_ao_reset() */
};
const nv_alloc_ops_t *nv_fixed_ops = &nv_fixed_ops_def;
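
A usage sketch (editorial, not part of the import): wiring these ops into
libnvpair so that every allocation for an nvlist comes out of a
caller-supplied buffer. nv_alloc_init() and nvlist_xalloc() are the stock
libnvpair entry points; the buffer size here is an arbitrary example value.

#include <sys/nvpair.h>

static char nvbuf_store[8192];		/* hypothetical backing buffer */

static int
build_nvlist_nosleep(nvlist_t **nvlp)
{
	nv_alloc_t nva;
	int err;

	/* varargs: buffer address, then buffer size (see nv_fixed_init) */
	if ((err = nv_alloc_init(&nva, nv_fixed_ops,
	    nvbuf_store, sizeof (nvbuf_store))) != 0)
		return (err);
	if ((err = nvlist_xalloc(nvlp, NV_UNIQUE_NAME, &nva)) != 0)
		return (err);
	return (nvlist_add_uint64(*nvlp, "txg", 1234ULL));
}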

2132
common/unicode/u8_textprep.c Normal file

File diff suppressed because it is too large

202
common/zfs/zfs_comutil.c Normal file
View File

@ -0,0 +1,202 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2008, 2010, Oracle and/or its affiliates. All rights reserved.
*/
/*
* This file is intended for functions that ought to be common between
* userland (libzfs) and the kernel. When many common routines need to
* be shared, a separate file should be created.
*/
#if defined(_KERNEL)
#include <sys/systm.h>
#else
#include <string.h>
#endif
#include <sys/types.h>
#include <sys/fs/zfs.h>
#include <sys/int_limits.h>
#include <sys/nvpair.h>
#include "zfs_comutil.h"
/*
* Are there allocatable vdevs?
*/
boolean_t
zfs_allocatable_devs(nvlist_t *nv)
{
uint64_t is_log;
uint_t c;
nvlist_t **child;
uint_t children;
if (nvlist_lookup_nvlist_array(nv, ZPOOL_CONFIG_CHILDREN,
&child, &children) != 0) {
return (B_FALSE);
}
for (c = 0; c < children; c++) {
is_log = 0;
(void) nvlist_lookup_uint64(child[c], ZPOOL_CONFIG_IS_LOG,
&is_log);
if (!is_log)
return (B_TRUE);
}
return (B_FALSE);
}
void
zpool_get_rewind_policy(nvlist_t *nvl, zpool_rewind_policy_t *zrpp)
{
nvlist_t *policy;
nvpair_t *elem;
char *nm;
/* Defaults */
zrpp->zrp_request = ZPOOL_NO_REWIND;
zrpp->zrp_maxmeta = 0;
zrpp->zrp_maxdata = UINT64_MAX;
zrpp->zrp_txg = UINT64_MAX;
if (nvl == NULL)
return;
elem = NULL;
while ((elem = nvlist_next_nvpair(nvl, elem)) != NULL) {
nm = nvpair_name(elem);
if (strcmp(nm, ZPOOL_REWIND_POLICY) == 0) {
if (nvpair_value_nvlist(elem, &policy) == 0)
zpool_get_rewind_policy(policy, zrpp);
return;
} else if (strcmp(nm, ZPOOL_REWIND_REQUEST) == 0) {
if (nvpair_value_uint32(elem, &zrpp->zrp_request) == 0)
if (zrpp->zrp_request & ~ZPOOL_REWIND_POLICIES)
zrpp->zrp_request = ZPOOL_NO_REWIND;
} else if (strcmp(nm, ZPOOL_REWIND_REQUEST_TXG) == 0) {
(void) nvpair_value_uint64(elem, &zrpp->zrp_txg);
} else if (strcmp(nm, ZPOOL_REWIND_META_THRESH) == 0) {
(void) nvpair_value_uint64(elem, &zrpp->zrp_maxmeta);
} else if (strcmp(nm, ZPOOL_REWIND_DATA_THRESH) == 0) {
(void) nvpair_value_uint64(elem, &zrpp->zrp_maxdata);
}
}
if (zrpp->zrp_request == 0)
zrpp->zrp_request = ZPOOL_NO_REWIND;
}
typedef struct zfs_version_spa_map {
int version_zpl;
int version_spa;
} zfs_version_spa_map_t;
/*
* Keep this table in monotonically increasing version number order.
*/
static zfs_version_spa_map_t zfs_version_table[] = {
{ZPL_VERSION_INITIAL, SPA_VERSION_INITIAL},
{ZPL_VERSION_DIRENT_TYPE, SPA_VERSION_INITIAL},
{ZPL_VERSION_FUID, SPA_VERSION_FUID},
{ZPL_VERSION_USERSPACE, SPA_VERSION_USERSPACE},
{ZPL_VERSION_SA, SPA_VERSION_SA},
{0, 0}
};
/*
* Return the max zpl version for a corresponding spa version
* -1 is returned if no mapping exists.
*/
int
zfs_zpl_version_map(int spa_version)
{
int i;
int version = -1;
for (i = 0; zfs_version_table[i].version_spa; i++) {
if (spa_version >= zfs_version_table[i].version_spa)
version = zfs_version_table[i].version_zpl;
}
return (version);
}
/*
* Return the min spa version for a corresponding zpl version
* -1 is returned if no mapping exists.
*/
int
zfs_spa_version_map(int zpl_version)
{
int i;
int version = -1;
for (i = 0; zfs_version_table[i].version_zpl; i++) {
if (zfs_version_table[i].version_zpl >= zpl_version)
return (zfs_version_table[i].version_spa);
}
return (version);
}
const char *zfs_history_event_names[LOG_END] = {
"invalid event",
"pool create",
"vdev add",
"pool remove",
"pool destroy",
"pool export",
"pool import",
"vdev attach",
"vdev replace",
"vdev detach",
"vdev online",
"vdev offline",
"vdev upgrade",
"pool clear",
"pool scrub",
"pool property set",
"create",
"clone",
"destroy",
"destroy_begin_sync",
"inherit",
"property set",
"quota set",
"permission update",
"permission remove",
"permission who remove",
"promote",
"receive",
"rename",
"reservation set",
"replay_inc_sync",
"replay_full_sync",
"rollback",
"snapshot",
"filesystem version upgrade",
"refquota set",
"refreservation set",
"pool scrub done",
"user hold",
"user release",
"pool split",
};
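
An editorial sketch (not part of the import) of the nvlist shape that
zpool_get_rewind_policy() consumes: the policy sits in a nested nvlist under
ZPOOL_REWIND_POLICY, which the recursive call above unwraps. The constants
are the stock ones from sys/fs/zfs.h.

#include <sys/nvpair.h>

static void
rewind_policy_demo(void)
{
	nvlist_t *outer, *policy;
	zpool_rewind_policy_t zrp;

	(void) nvlist_alloc(&policy, NV_UNIQUE_NAME, 0);
	(void) nvlist_add_uint32(policy, ZPOOL_REWIND_REQUEST,
	    ZPOOL_TRY_REWIND);
	(void) nvlist_add_uint64(policy, ZPOOL_REWIND_REQUEST_TXG, 1234ULL);

	(void) nvlist_alloc(&outer, NV_UNIQUE_NAME, 0);
	(void) nvlist_add_nvlist(outer, ZPOOL_REWIND_POLICY, policy);

	zpool_get_rewind_policy(outer, &zrp);
	/* now zrp.zrp_request == ZPOOL_TRY_REWIND, zrp.zrp_txg == 1234 */

	nvlist_free(policy);
	nvlist_free(outer);
}

For the version maps, note that the table walk makes, e.g.,
zfs_zpl_version_map(SPA_VERSION_FUID) return ZPL_VERSION_FUID: the loop keeps
the zpl version of the last entry whose spa version does not exceed the
argument.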

46
common/zfs/zfs_comutil.h Normal file
View File

@ -0,0 +1,46 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2008, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#ifndef _ZFS_COMUTIL_H
#define _ZFS_COMUTIL_H
#include <sys/fs/zfs.h>
#include <sys/types.h>
#ifdef __cplusplus
extern "C" {
#endif
extern boolean_t zfs_allocatable_devs(nvlist_t *);
extern void zpool_get_rewind_policy(nvlist_t *, zpool_rewind_policy_t *);
extern int zfs_zpl_version_map(int spa_version);
extern int zfs_spa_version_map(int zpl_version);
extern const char *zfs_history_event_names[LOG_END];
#ifdef __cplusplus
}
#endif
#endif /* _ZFS_COMUTIL_H */

237
common/zfs/zfs_deleg.c Normal file
View File

@ -0,0 +1,237 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2007, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#if defined(_KERNEL)
#include <sys/systm.h>
#include <sys/sunddi.h>
#include <sys/ctype.h>
#else
#include <stdio.h>
#include <unistd.h>
#include <strings.h>
#include <libnvpair.h>
#include <ctype.h>
#endif
/* XXX includes zfs_context.h, so why bother with the above? */
#include <sys/dsl_deleg.h>
#include "zfs_prop.h"
#include "zfs_deleg.h"
#include "zfs_namecheck.h"
/*
* permission table
*
* Keep this table in sorted order.
*
* This table is used for displaying all permissions for
* 'zfs allow'.
*/
zfs_deleg_perm_tab_t zfs_deleg_perm_tab[] = {
{ZFS_DELEG_PERM_ALLOW, ZFS_DELEG_NOTE_ALLOW},
{ZFS_DELEG_PERM_CLONE, ZFS_DELEG_NOTE_CLONE },
{ZFS_DELEG_PERM_CREATE, ZFS_DELEG_NOTE_CREATE },
{ZFS_DELEG_PERM_DESTROY, ZFS_DELEG_NOTE_DESTROY },
{ZFS_DELEG_PERM_MOUNT, ZFS_DELEG_NOTE_MOUNT },
{ZFS_DELEG_PERM_PROMOTE, ZFS_DELEG_NOTE_PROMOTE },
{ZFS_DELEG_PERM_RECEIVE, ZFS_DELEG_NOTE_RECEIVE },
{ZFS_DELEG_PERM_RENAME, ZFS_DELEG_NOTE_RENAME },
{ZFS_DELEG_PERM_ROLLBACK, ZFS_DELEG_NOTE_ROLLBACK },
{ZFS_DELEG_PERM_SNAPSHOT, ZFS_DELEG_NOTE_SNAPSHOT },
{ZFS_DELEG_PERM_SHARE, ZFS_DELEG_NOTE_SHARE },
{ZFS_DELEG_PERM_SEND, ZFS_DELEG_NOTE_NONE },
{ZFS_DELEG_PERM_USERPROP, ZFS_DELEG_NOTE_USERPROP },
{ZFS_DELEG_PERM_USERQUOTA, ZFS_DELEG_NOTE_USERQUOTA },
{ZFS_DELEG_PERM_GROUPQUOTA, ZFS_DELEG_NOTE_GROUPQUOTA },
{ZFS_DELEG_PERM_USERUSED, ZFS_DELEG_NOTE_USERUSED },
{ZFS_DELEG_PERM_GROUPUSED, ZFS_DELEG_NOTE_GROUPUSED },
{ZFS_DELEG_PERM_HOLD, ZFS_DELEG_NOTE_HOLD },
{ZFS_DELEG_PERM_RELEASE, ZFS_DELEG_NOTE_RELEASE },
{ZFS_DELEG_PERM_DIFF, ZFS_DELEG_NOTE_DIFF},
{NULL, ZFS_DELEG_NOTE_NONE }
};
static int
zfs_valid_permission_name(const char *perm)
{
if (zfs_deleg_canonicalize_perm(perm))
return (0);
return (permset_namecheck(perm, NULL, NULL));
}
const char *
zfs_deleg_canonicalize_perm(const char *perm)
{
int i;
zfs_prop_t prop;
for (i = 0; zfs_deleg_perm_tab[i].z_perm != NULL; i++) {
if (strcmp(perm, zfs_deleg_perm_tab[i].z_perm) == 0)
return (perm);
}
prop = zfs_name_to_prop(perm);
if (prop != ZPROP_INVAL && zfs_prop_delegatable(prop))
return (zfs_prop_to_name(prop));
return (NULL);
}
static int
zfs_validate_who(char *who)
{
char *p;
if (who[2] != ZFS_DELEG_FIELD_SEP_CHR)
return (-1);
switch (who[0]) {
case ZFS_DELEG_USER:
case ZFS_DELEG_GROUP:
case ZFS_DELEG_USER_SETS:
case ZFS_DELEG_GROUP_SETS:
if (who[1] != ZFS_DELEG_LOCAL && who[1] != ZFS_DELEG_DESCENDENT)
return (-1);
for (p = &who[3]; *p; p++)
if (!isdigit(*p))
return (-1);
break;
case ZFS_DELEG_NAMED_SET:
case ZFS_DELEG_NAMED_SET_SETS:
if (who[1] != ZFS_DELEG_NA)
return (-1);
return (permset_namecheck(&who[3], NULL, NULL));
case ZFS_DELEG_CREATE:
case ZFS_DELEG_CREATE_SETS:
if (who[1] != ZFS_DELEG_NA)
return (-1);
if (who[3] != '\0')
return (-1);
break;
case ZFS_DELEG_EVERYONE:
case ZFS_DELEG_EVERYONE_SETS:
if (who[1] != ZFS_DELEG_LOCAL && who[1] != ZFS_DELEG_DESCENDENT)
return (-1);
if (who[3] != '\0')
return (-1);
break;
default:
return (-1);
}
return (0);
}
int
zfs_deleg_verify_nvlist(nvlist_t *nvp)
{
nvpair_t *who, *perm_name;
nvlist_t *perms;
int error;
if (nvp == NULL)
return (-1);
who = nvlist_next_nvpair(nvp, NULL);
if (who == NULL)
return (-1);
do {
if (zfs_validate_who(nvpair_name(who)))
return (-1);
error = nvlist_lookup_nvlist(nvp, nvpair_name(who), &perms);
if (error && error != ENOENT)
return (-1);
if (error == ENOENT)
continue;
perm_name = nvlist_next_nvpair(perms, NULL);
if (perm_name == NULL) {
return (-1);
}
do {
error = zfs_valid_permission_name(
nvpair_name(perm_name));
if (error)
return (-1);
} while ((perm_name = nvlist_next_nvpair(perms, perm_name)) != NULL);
} while ((who = nvlist_next_nvpair(nvp, who)) != NULL);
return (0);
}
/*
* Construct the base attribute name. The base attribute names
* are the "key" to locate the jump objects which contain the actual
* permissions. The base attribute names are encoded based on
* type of entry and whether it is a local or descendent permission.
*
* Arguments:
* attr - attribute name return string, attribute is assumed to be
* ZFS_MAX_DELEG_NAME long.
* type - type of entry to construct
* inheritchr - inheritance type (local, descendent, or NA for create
*     and permission set definitions)
* data - is either a permission set name or a 64 bit uid/gid.
*/
void
zfs_deleg_whokey(char *attr, zfs_deleg_who_type_t type,
char inheritchr, void *data)
{
int len = ZFS_MAX_DELEG_NAME;
uint64_t *id = data;
switch (type) {
case ZFS_DELEG_USER:
case ZFS_DELEG_GROUP:
case ZFS_DELEG_USER_SETS:
case ZFS_DELEG_GROUP_SETS:
(void) snprintf(attr, len, "%c%c%c%lld", type, inheritchr,
ZFS_DELEG_FIELD_SEP_CHR, (longlong_t)*id);
break;
case ZFS_DELEG_NAMED_SET_SETS:
case ZFS_DELEG_NAMED_SET:
(void) snprintf(attr, len, "%c-%c%s", type,
ZFS_DELEG_FIELD_SEP_CHR, (char *)data);
break;
case ZFS_DELEG_CREATE:
case ZFS_DELEG_CREATE_SETS:
(void) snprintf(attr, len, "%c-%c", type,
ZFS_DELEG_FIELD_SEP_CHR);
break;
case ZFS_DELEG_EVERYONE:
case ZFS_DELEG_EVERYONE_SETS:
(void) snprintf(attr, len, "%c%c%c", type, inheritchr,
ZFS_DELEG_FIELD_SEP_CHR);
break;
default:
ASSERT(!"bad zfs_deleg_who_type_t");
}
}
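
A usage sketch (editorial, not part of the import): encoding a local user
permission key. With type ZFS_DELEG_USER ('u'), inheritchr ZFS_DELEG_LOCAL
('l'), and uid 123, the resulting whokey is "ul$123", '$' being
ZFS_DELEG_FIELD_SEP_CHR; zfs_validate_who() above accepts exactly this shape.

static void
whokey_demo(void)
{
	char attr[ZFS_MAX_DELEG_NAME];
	uint64_t uid = 123;

	zfs_deleg_whokey(attr, ZFS_DELEG_USER, ZFS_DELEG_LOCAL, &uid);
	/* attr now holds "ul$123"; zfs_validate_who(attr) returns 0 */
}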

85
common/zfs/zfs_deleg.h Normal file
View File

@ -0,0 +1,85 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2007, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#ifndef _ZFS_DELEG_H
#define _ZFS_DELEG_H
#include <sys/fs/zfs.h>
#ifdef __cplusplus
extern "C" {
#endif
#define ZFS_DELEG_SET_NAME_CHR '@' /* set name lead char */
#define ZFS_DELEG_FIELD_SEP_CHR '$' /* field separator */
/*
* Max name length for a delegation attribute
*/
#define ZFS_MAX_DELEG_NAME 128
#define ZFS_DELEG_LOCAL 'l'
#define ZFS_DELEG_DESCENDENT 'd'
#define ZFS_DELEG_NA '-'
typedef enum {
ZFS_DELEG_NOTE_CREATE,
ZFS_DELEG_NOTE_DESTROY,
ZFS_DELEG_NOTE_SNAPSHOT,
ZFS_DELEG_NOTE_ROLLBACK,
ZFS_DELEG_NOTE_CLONE,
ZFS_DELEG_NOTE_PROMOTE,
ZFS_DELEG_NOTE_RENAME,
ZFS_DELEG_NOTE_RECEIVE,
ZFS_DELEG_NOTE_ALLOW,
ZFS_DELEG_NOTE_USERPROP,
ZFS_DELEG_NOTE_MOUNT,
ZFS_DELEG_NOTE_SHARE,
ZFS_DELEG_NOTE_USERQUOTA,
ZFS_DELEG_NOTE_GROUPQUOTA,
ZFS_DELEG_NOTE_USERUSED,
ZFS_DELEG_NOTE_GROUPUSED,
ZFS_DELEG_NOTE_HOLD,
ZFS_DELEG_NOTE_RELEASE,
ZFS_DELEG_NOTE_DIFF,
ZFS_DELEG_NOTE_NONE
} zfs_deleg_note_t;
typedef struct zfs_deleg_perm_tab {
char *z_perm;
zfs_deleg_note_t z_note;
} zfs_deleg_perm_tab_t;
extern zfs_deleg_perm_tab_t zfs_deleg_perm_tab[];
int zfs_deleg_verify_nvlist(nvlist_t *nvlist);
void zfs_deleg_whokey(char *attr, zfs_deleg_who_type_t type,
char checkflag, void *data);
const char *zfs_deleg_canonicalize_perm(const char *perm);
#ifdef __cplusplus
}
#endif
#endif /* _ZFS_DELEG_H */

246
common/zfs/zfs_fletcher.c Normal file
View File

@ -0,0 +1,246 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2009 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
/*
* Fletcher Checksums
* ------------------
*
* ZFS's 2nd and 4th order Fletcher checksums are defined by the following
* recurrence relations:
*
* a = a + f
* i i-1 i-1
*
* b = b + a
* i i-1 i
*
* c = c + b (fletcher-4 only)
* i i-1 i
*
* d = d + c (fletcher-4 only)
* i i-1 i
*
* Where
* a_0 = b_0 = c_0 = d_0 = 0
* and
* f_0 .. f_(n-1) are the input data.
*
* Using standard techniques, these translate into the following series:
*
* __n_ __n_
* \ | \ |
* a = > f b = > i * f
* n /___| n - i n /___| n - i
* i = 1 i = 1
*
*
* __n_ __n_
* \ | i*(i+1) \ | i*(i+1)*(i+2)
* c = > ------- f d = > ------------- f
* n /___| 2 n - i n /___| 6 n - i
* i = 1 i = 1
*
* For fletcher-2, the f_is are 64-bit, and [ab]_i are 64-bit accumulators.
* Since the additions are done mod (2^64), errors in the high bits may not
* be noticed. For this reason, fletcher-2 is deprecated.
*
* For fletcher-4, the f_is are 32-bit, and [abcd]_i are 64-bit accumulators.
* A conservative bound on how big the buffer can get before we overflow
* can be computed using f_i = 0xffffffff for all i:
*
* % bc
* f=2^32-1;d=0; for (i = 1; d<2^64; i++) { d += f*i*(i+1)*(i+2)/6 }; (i-1)*4
* 2264
* quit
* %
*
* So blocks of up to 2k will not overflow. Our largest block size is
* 128k, which has 32k 4-byte words, so we can compute the largest possible
* accumulators, then divide by 2^64 to figure the max amount of overflow:
*
* % bc
* a=b=c=d=0; f=2^32-1; for (i=1; i<=32*1024; i++) { a+=f; b+=a; c+=b; d+=c }
* a/2^64;b/2^64;c/2^64;d/2^64
* 0
* 0
* 1365
* 11186858
* quit
* %
*
* So a and b cannot overflow. To make sure each bit of input has some
* effect on the contents of c and d, we can look at what the factors of
* the coefficients in the equations for c_n and d_n are. The number of 2s
* in the factors determines the lowest set bit in the multiplier. Running
* through the cases for n*(n+1)/2 reveals that the highest power of 2 is
* 2^14, and for n*(n+1)*(n+2)/6 it is 2^15. So while some data may overflow
* the 64-bit accumulators, every bit of every f_i affects every accumulator,
* even for 128k blocks.
*
* If we wanted to make a stronger version of fletcher4 (fletcher4c?),
* we could do our calculations mod (2^32 - 1) by adding in the carries
* periodically, and store the number of carries in the top 32-bits.
*
* --------------------
* Checksum Performance
* --------------------
*
* There are two interesting components to checksum performance: cached and
* uncached performance. With cached data, fletcher-2 is about four times
* faster than fletcher-4. With uncached data, the performance difference is
* negligible, since the cost of a cache fill dominates the processing time.
* Even though fletcher-4 is slower than fletcher-2, it is still a pretty
* efficient pass over the data.
*
* In normal operation, the data which is being checksummed is in a buffer
* which has been filled either by:
*
* 1. a compression step, which will be mostly cached, or
* 2. a bcopy() or copyin(), which will be uncached (because the
* copy is cache-bypassing).
*
* For both cached and uncached data, both fletcher checksums are much faster
* than sha-256, and slower than 'off', which doesn't touch the data at all.
*/
#include <sys/types.h>
#include <sys/sysmacros.h>
#include <sys/byteorder.h>
#include <sys/zio.h>
#include <sys/spa.h>
void
fletcher_2_native(const void *buf, uint64_t size, zio_cksum_t *zcp)
{
const uint64_t *ip = buf;
const uint64_t *ipend = ip + (size / sizeof (uint64_t));
uint64_t a0, b0, a1, b1;
for (a0 = b0 = a1 = b1 = 0; ip < ipend; ip += 2) {
a0 += ip[0];
a1 += ip[1];
b0 += a0;
b1 += a1;
}
ZIO_SET_CHECKSUM(zcp, a0, a1, b0, b1);
}
void
fletcher_2_byteswap(const void *buf, uint64_t size, zio_cksum_t *zcp)
{
const uint64_t *ip = buf;
const uint64_t *ipend = ip + (size / sizeof (uint64_t));
uint64_t a0, b0, a1, b1;
for (a0 = b0 = a1 = b1 = 0; ip < ipend; ip += 2) {
a0 += BSWAP_64(ip[0]);
a1 += BSWAP_64(ip[1]);
b0 += a0;
b1 += a1;
}
ZIO_SET_CHECKSUM(zcp, a0, a1, b0, b1);
}
void
fletcher_4_native(const void *buf, uint64_t size, zio_cksum_t *zcp)
{
const uint32_t *ip = buf;
const uint32_t *ipend = ip + (size / sizeof (uint32_t));
uint64_t a, b, c, d;
for (a = b = c = d = 0; ip < ipend; ip++) {
a += ip[0];
b += a;
c += b;
d += c;
}
ZIO_SET_CHECKSUM(zcp, a, b, c, d);
}
void
fletcher_4_byteswap(const void *buf, uint64_t size, zio_cksum_t *zcp)
{
const uint32_t *ip = buf;
const uint32_t *ipend = ip + (size / sizeof (uint32_t));
uint64_t a, b, c, d;
for (a = b = c = d = 0; ip < ipend; ip++) {
a += BSWAP_32(ip[0]);
b += a;
c += b;
d += c;
}
ZIO_SET_CHECKSUM(zcp, a, b, c, d);
}
void
fletcher_4_incremental_native(const void *buf, uint64_t size,
zio_cksum_t *zcp)
{
const uint32_t *ip = buf;
const uint32_t *ipend = ip + (size / sizeof (uint32_t));
uint64_t a, b, c, d;
a = zcp->zc_word[0];
b = zcp->zc_word[1];
c = zcp->zc_word[2];
d = zcp->zc_word[3];
for (; ip < ipend; ip++) {
a += ip[0];
b += a;
c += b;
d += c;
}
ZIO_SET_CHECKSUM(zcp, a, b, c, d);
}
void
fletcher_4_incremental_byteswap(const void *buf, uint64_t size,
zio_cksum_t *zcp)
{
const uint32_t *ip = buf;
const uint32_t *ipend = ip + (size / sizeof (uint32_t));
uint64_t a, b, c, d;
a = zcp->zc_word[0];
b = zcp->zc_word[1];
c = zcp->zc_word[2];
d = zcp->zc_word[3];
for (; ip < ipend; ip++) {
a += BSWAP_32(ip[0]);
b += a;
c += b;
d += c;
}
ZIO_SET_CHECKSUM(zcp, a, b, c, d);
}
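
An editorial sketch (not part of the import): the incremental entry points
compose, because the a/b/c/d accumulators are simply carried across calls in
the zio_cksum_t. Checksumming a buffer in two pieces must therefore match a
single fletcher_4_native() pass. Types and prototypes come from the headers
included above.

#include <string.h>

static int
fletcher4_split_matches(const uint32_t *buf, uint64_t words)
{
	zio_cksum_t whole, split;
	uint64_t size = words * sizeof (uint32_t);
	uint64_t half = (words / 2) * sizeof (uint32_t);

	fletcher_4_native(buf, size, &whole);

	ZIO_SET_CHECKSUM(&split, 0, 0, 0, 0);	/* zero the accumulators */
	fletcher_4_incremental_native(buf, half, &split);
	fletcher_4_incremental_native((const char *)buf + half,
	    size - half, &split);

	return (memcmp(&whole, &split, sizeof (whole)) == 0);
}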

53
common/zfs/zfs_fletcher.h Normal file
View File

@ -0,0 +1,53 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2009 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#ifndef _ZFS_FLETCHER_H
#define _ZFS_FLETCHER_H
#include <sys/types.h>
#include <sys/spa.h>
#ifdef __cplusplus
extern "C" {
#endif
/*
* fletcher checksum functions
*/
void fletcher_2_native(const void *, uint64_t, zio_cksum_t *);
void fletcher_2_byteswap(const void *, uint64_t, zio_cksum_t *);
void fletcher_4_native(const void *, uint64_t, zio_cksum_t *);
void fletcher_4_byteswap(const void *, uint64_t, zio_cksum_t *);
void fletcher_4_incremental_native(const void *, uint64_t,
zio_cksum_t *);
void fletcher_4_incremental_byteswap(const void *, uint64_t,
zio_cksum_t *);
#ifdef __cplusplus
}
#endif
#endif /* _ZFS_FLETCHER_H */

345
common/zfs/zfs_namecheck.c Normal file
View File

@ -0,0 +1,345 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2009 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
/*
* Common name validation routines for ZFS. These routines are shared by the
* userland code as well as the ioctl() layer to ensure that we don't
* inadvertently expose a hole through direct ioctl()s that never gets tested.
* In userland, however, we want significantly more information about _why_ the
* name is invalid. In the kernel, we only care whether it's valid or not.
* Each routine therefore takes a 'namecheck_err_t' which describes exactly why
* the name failed to validate.
*
* Each function returns 0 on success, -1 on error.
*/
#if defined(_KERNEL)
#include <sys/systm.h>
#else
#include <string.h>
#endif
#include <sys/param.h>
#include <sys/nvpair.h>
#include "zfs_namecheck.h"
#include "zfs_deleg.h"
static int
valid_char(char c)
{
return ((c >= 'a' && c <= 'z') ||
(c >= 'A' && c <= 'Z') ||
(c >= '0' && c <= '9') ||
c == '-' || c == '_' || c == '.' || c == ':' || c == ' ');
}
/*
* Snapshot names must be made up of alphanumeric characters plus the following
* characters:
*
* [-_.: ]
*/
int
snapshot_namecheck(const char *path, namecheck_err_t *why, char *what)
{
const char *loc;
if (strlen(path) >= MAXNAMELEN) {
if (why)
*why = NAME_ERR_TOOLONG;
return (-1);
}
if (path[0] == '\0') {
if (why)
*why = NAME_ERR_EMPTY_COMPONENT;
return (-1);
}
for (loc = path; *loc; loc++) {
if (!valid_char(*loc)) {
if (why) {
*why = NAME_ERR_INVALCHAR;
*what = *loc;
}
return (-1);
}
}
return (0);
}
/*
* Permission set names must start with the character '@', followed by
* the same character restrictions as snapshot names, except that the
* name cannot exceed 64 characters.
*/
int
permset_namecheck(const char *path, namecheck_err_t *why, char *what)
{
if (strlen(path) >= ZFS_PERMSET_MAXLEN) {
if (why)
*why = NAME_ERR_TOOLONG;
return (-1);
}
if (path[0] != '@') {
if (why) {
*why = NAME_ERR_NO_AT;
*what = path[0];
}
return (-1);
}
return (snapshot_namecheck(&path[1], why, what));
}
/*
* Dataset names must be of the following form:
*
* [component][/]*[component][@component]
*
* Where each component is made up of alphanumeric characters plus the following
* characters:
*
* [-_.:%]
*
* We allow '%' here as we use that character internally to create unique
* names for temporary clones (for online recv).
*/
int
dataset_namecheck(const char *path, namecheck_err_t *why, char *what)
{
const char *loc, *end;
int found_snapshot;
/*
* Make sure the name is not too long.
*
* ZFS_MAXNAMELEN is the maximum dataset name length used in userland,
* which is the same as MAXNAMELEN used in the kernel.
* If the ZFS_MAXNAMELEN value is changed, make sure to clean up all
* places using MAXNAMELEN.
*/
if (strlen(path) >= MAXNAMELEN) {
if (why)
*why = NAME_ERR_TOOLONG;
return (-1);
}
/* Explicitly check for a leading slash. */
if (path[0] == '/') {
if (why)
*why = NAME_ERR_LEADING_SLASH;
return (-1);
}
if (path[0] == '\0') {
if (why)
*why = NAME_ERR_EMPTY_COMPONENT;
return (-1);
}
loc = path;
found_snapshot = 0;
for (;;) {
/* Find the end of this component */
end = loc;
while (*end != '/' && *end != '@' && *end != '\0')
end++;
if (*end == '\0' && end[-1] == '/') {
/* trailing slashes are not allowed */
if (why)
*why = NAME_ERR_TRAILING_SLASH;
return (-1);
}
/* Zero-length components are not allowed */
if (loc == end) {
if (why) {
/*
* Make sure this is really a zero-length
* component and not a '@@'.
*/
if (*end == '@' && found_snapshot) {
*why = NAME_ERR_MULTIPLE_AT;
} else {
*why = NAME_ERR_EMPTY_COMPONENT;
}
}
return (-1);
}
/* Validate the contents of this component */
while (loc != end) {
if (!valid_char(*loc) && *loc != '%') {
if (why) {
*why = NAME_ERR_INVALCHAR;
*what = *loc;
}
return (-1);
}
loc++;
}
/* If we've reached the end of the string, we're OK */
if (*end == '\0')
return (0);
if (*end == '@') {
/*
* If we've found an @ symbol, indicate that we're in
* the snapshot component, and report a second '@'
* character as an error.
*/
if (found_snapshot) {
if (why)
*why = NAME_ERR_MULTIPLE_AT;
return (-1);
}
found_snapshot = 1;
}
/*
* If there is a '/' in a snapshot name
* then report an error
*/
if (*end == '/' && found_snapshot) {
if (why)
*why = NAME_ERR_TRAILING_SLASH;
return (-1);
}
/* Update to the next component */
loc = end + 1;
}
}
/*
* mountpoint names must be of the following form:
*
* /[component][/]*[component][/]
*/
int
mountpoint_namecheck(const char *path, namecheck_err_t *why)
{
const char *start, *end;
/*
* Make sure none of the mountpoint component names are too long.
* If a component name is too long then the mkdir of the mountpoint
* will fail but then the mountpoint property will be set to a value
* that can never be mounted. Better to fail before setting the prop.
* Extra slashes are OK, they will be tossed by the mountpoint mkdir.
*/
if (path == NULL || *path != '/') {
if (why)
*why = NAME_ERR_LEADING_SLASH;
return (-1);
}
/* Skip leading slash */
start = &path[1];
do {
end = start;
while (*end != '/' && *end != '\0')
end++;
if (end - start >= MAXNAMELEN) {
if (why)
*why = NAME_ERR_TOOLONG;
return (-1);
}
start = end + 1;
} while (*end != '\0');
return (0);
}
/*
* For pool names, we have the same set of valid characters as described in
* dataset names, with the additional restriction that the pool name must begin
* with a letter. The pool names 'raidz' and 'mirror' are also reserved names
* that cannot be used.
*/
int
pool_namecheck(const char *pool, namecheck_err_t *why, char *what)
{
const char *c;
/*
* Make sure the name is not too long.
*
* ZPOOL_MAXNAMELEN is the maximum pool name length used in userland,
* which is the same as MAXNAMELEN used in the kernel.
* If the ZPOOL_MAXNAMELEN value is changed, make sure to clean up all
* places using MAXNAMELEN.
*/
if (strlen(pool) >= MAXNAMELEN) {
if (why)
*why = NAME_ERR_TOOLONG;
return (-1);
}
c = pool;
while (*c != '\0') {
if (!valid_char(*c)) {
if (why) {
*why = NAME_ERR_INVALCHAR;
*what = *c;
}
return (-1);
}
c++;
}
if (!(*pool >= 'a' && *pool <= 'z') &&
!(*pool >= 'A' && *pool <= 'Z')) {
if (why)
*why = NAME_ERR_NOLETTER;
return (-1);
}
if (strcmp(pool, "mirror") == 0 || strcmp(pool, "raidz") == 0) {
if (why)
*why = NAME_ERR_RESERVED;
return (-1);
}
if (pool[0] == 'c' && (pool[1] >= '0' && pool[1] <= '9')) {
if (why)
*why = NAME_ERR_DISKLIKE;
return (-1);
}
return (0);
}
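
A usage sketch (editorial, not part of the import): userland callers pass a
namecheck_err_t (and a char for the offending character) to learn why a name
was rejected; kernel callers typically pass NULL and only test the return
value.

static void
namecheck_demo(void)
{
	namecheck_err_t why;
	char what;

	if (dataset_namecheck("tank/fs@snap", &why, &what) == 0) {
		/* valid: a '/' hierarchy plus one '@' snapshot component */
	}
	if (dataset_namecheck("tank//fs", &why, &what) == -1) {
		/* why == NAME_ERR_EMPTY_COMPONENT for the empty component */
	}
}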

58
common/zfs/zfs_namecheck.h Normal file
View File

@ -0,0 +1,58 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2009 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#ifndef _ZFS_NAMECHECK_H
#define _ZFS_NAMECHECK_H
#ifdef __cplusplus
extern "C" {
#endif
typedef enum {
NAME_ERR_LEADING_SLASH, /* name begins with leading slash */
NAME_ERR_EMPTY_COMPONENT, /* name contains an empty component */
NAME_ERR_TRAILING_SLASH, /* name ends with a slash */
NAME_ERR_INVALCHAR, /* invalid character found */
NAME_ERR_MULTIPLE_AT, /* multiple '@' characters found */
NAME_ERR_NOLETTER, /* pool doesn't begin with a letter */
NAME_ERR_RESERVED, /* entire name is reserved */
NAME_ERR_DISKLIKE, /* reserved disk name (c[0-9].*) */
NAME_ERR_TOOLONG, /* name is too long */
NAME_ERR_NO_AT, /* permission set is missing '@' */
} namecheck_err_t;
#define ZFS_PERMSET_MAXLEN 64
int pool_namecheck(const char *, namecheck_err_t *, char *);
int dataset_namecheck(const char *, namecheck_err_t *, char *);
int mountpoint_namecheck(const char *, namecheck_err_t *);
int snapshot_namecheck(const char *, namecheck_err_t *, char *);
int permset_namecheck(const char *, namecheck_err_t *, char *);
#ifdef __cplusplus
}
#endif
#endif /* _ZFS_NAMECHECK_H */

595
common/zfs/zfs_prop.c Normal file
View File

@ -0,0 +1,595 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
*/
/* Portions Copyright 2010 Robert Milkowski */
#include <sys/zio.h>
#include <sys/spa.h>
#include <sys/u8_textprep.h>
#include <sys/zfs_acl.h>
#include <sys/zfs_ioctl.h>
#include <sys/zfs_znode.h>
#include "zfs_prop.h"
#include "zfs_deleg.h"
#if defined(_KERNEL)
#include <sys/systm.h>
#else
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#endif
static zprop_desc_t zfs_prop_table[ZFS_NUM_PROPS];
/* Note this is indexed by zfs_userquota_prop_t, keep the order the same */
const char *zfs_userquota_prop_prefixes[] = {
"userused@",
"userquota@",
"groupused@",
"groupquota@"
};
zprop_desc_t *
zfs_prop_get_table(void)
{
return (zfs_prop_table);
}
void
zfs_prop_init(void)
{
static zprop_index_t checksum_table[] = {
{ "on", ZIO_CHECKSUM_ON },
{ "off", ZIO_CHECKSUM_OFF },
{ "fletcher2", ZIO_CHECKSUM_FLETCHER_2 },
{ "fletcher4", ZIO_CHECKSUM_FLETCHER_4 },
{ "sha256", ZIO_CHECKSUM_SHA256 },
{ NULL }
};
static zprop_index_t dedup_table[] = {
{ "on", ZIO_CHECKSUM_ON },
{ "off", ZIO_CHECKSUM_OFF },
{ "verify", ZIO_CHECKSUM_ON | ZIO_CHECKSUM_VERIFY },
{ "sha256", ZIO_CHECKSUM_SHA256 },
{ "sha256,verify",
ZIO_CHECKSUM_SHA256 | ZIO_CHECKSUM_VERIFY },
{ NULL }
};
static zprop_index_t compress_table[] = {
{ "on", ZIO_COMPRESS_ON },
{ "off", ZIO_COMPRESS_OFF },
{ "lzjb", ZIO_COMPRESS_LZJB },
{ "gzip", ZIO_COMPRESS_GZIP_6 }, /* gzip default */
{ "gzip-1", ZIO_COMPRESS_GZIP_1 },
{ "gzip-2", ZIO_COMPRESS_GZIP_2 },
{ "gzip-3", ZIO_COMPRESS_GZIP_3 },
{ "gzip-4", ZIO_COMPRESS_GZIP_4 },
{ "gzip-5", ZIO_COMPRESS_GZIP_5 },
{ "gzip-6", ZIO_COMPRESS_GZIP_6 },
{ "gzip-7", ZIO_COMPRESS_GZIP_7 },
{ "gzip-8", ZIO_COMPRESS_GZIP_8 },
{ "gzip-9", ZIO_COMPRESS_GZIP_9 },
{ "zle", ZIO_COMPRESS_ZLE },
{ NULL }
};
static zprop_index_t snapdir_table[] = {
{ "hidden", ZFS_SNAPDIR_HIDDEN },
{ "visible", ZFS_SNAPDIR_VISIBLE },
{ NULL }
};
static zprop_index_t acl_inherit_table[] = {
{ "discard", ZFS_ACL_DISCARD },
{ "noallow", ZFS_ACL_NOALLOW },
{ "restricted", ZFS_ACL_RESTRICTED },
{ "passthrough", ZFS_ACL_PASSTHROUGH },
{ "secure", ZFS_ACL_RESTRICTED }, /* bkwrd compatability */
{ "passthrough-x", ZFS_ACL_PASSTHROUGH_X },
{ NULL }
};
static zprop_index_t case_table[] = {
{ "sensitive", ZFS_CASE_SENSITIVE },
{ "insensitive", ZFS_CASE_INSENSITIVE },
{ "mixed", ZFS_CASE_MIXED },
{ NULL }
};
static zprop_index_t copies_table[] = {
{ "1", 1 },
{ "2", 2 },
{ "3", 3 },
{ NULL }
};
/*
* Use the unique flags we have to send to u8_strcmp() and/or
* u8_textprep() to represent the various normalization property
* values.
*/
static zprop_index_t normalize_table[] = {
{ "none", 0 },
{ "formD", U8_TEXTPREP_NFD },
{ "formKC", U8_TEXTPREP_NFKC },
{ "formC", U8_TEXTPREP_NFC },
{ "formKD", U8_TEXTPREP_NFKD },
{ NULL }
};
static zprop_index_t version_table[] = {
{ "1", 1 },
{ "2", 2 },
{ "3", 3 },
{ "4", 4 },
{ "5", 5 },
{ "current", ZPL_VERSION },
{ NULL }
};
static zprop_index_t boolean_table[] = {
{ "off", 0 },
{ "on", 1 },
{ NULL }
};
static zprop_index_t logbias_table[] = {
{ "latency", ZFS_LOGBIAS_LATENCY },
{ "throughput", ZFS_LOGBIAS_THROUGHPUT },
{ NULL }
};
static zprop_index_t canmount_table[] = {
{ "off", ZFS_CANMOUNT_OFF },
{ "on", ZFS_CANMOUNT_ON },
{ "noauto", ZFS_CANMOUNT_NOAUTO },
{ NULL }
};
static zprop_index_t cache_table[] = {
{ "none", ZFS_CACHE_NONE },
{ "metadata", ZFS_CACHE_METADATA },
{ "all", ZFS_CACHE_ALL },
{ NULL }
};
static zprop_index_t sync_table[] = {
{ "standard", ZFS_SYNC_STANDARD },
{ "always", ZFS_SYNC_ALWAYS },
{ "disabled", ZFS_SYNC_DISABLED },
{ NULL }
};
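/*
 * Editorial note, not in the original source: each table above maps the
 * user-visible strings to on-disk index values. For example, once
 * zfs_prop_init() has run, a lookup such as
 *
 *	uint64_t idx;
 *	(void) zfs_prop_string_to_index(ZFS_PROP_CHECKSUM,
 *	    "fletcher4", &idx);
 *
 * is expected to yield idx == ZIO_CHECKSUM_FLETCHER_4, assuming the
 * zfs_prop_string_to_index() helper defined later in this file.
 */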
/* inherit index properties */
zprop_register_index(ZFS_PROP_SYNC, "sync", ZFS_SYNC_STANDARD,
PROP_INHERIT, ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME,
"standard | always | disabled", "SYNC",
sync_table);
zprop_register_index(ZFS_PROP_CHECKSUM, "checksum",
ZIO_CHECKSUM_DEFAULT, PROP_INHERIT, ZFS_TYPE_FILESYSTEM |
ZFS_TYPE_VOLUME,
"on | off | fletcher2 | fletcher4 | sha256", "CHECKSUM",
checksum_table);
zprop_register_index(ZFS_PROP_DEDUP, "dedup", ZIO_CHECKSUM_OFF,
PROP_INHERIT, ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME,
"on | off | verify | sha256[,verify]", "DEDUP",
dedup_table);
zprop_register_index(ZFS_PROP_COMPRESSION, "compression",
ZIO_COMPRESS_DEFAULT, PROP_INHERIT,
ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME,
"on | off | lzjb | gzip | gzip-[1-9] | zle", "COMPRESS",
compress_table);
zprop_register_index(ZFS_PROP_SNAPDIR, "snapdir", ZFS_SNAPDIR_HIDDEN,
PROP_INHERIT, ZFS_TYPE_FILESYSTEM,
"hidden | visible", "SNAPDIR", snapdir_table);
zprop_register_index(ZFS_PROP_ACLINHERIT, "aclinherit",
ZFS_ACL_RESTRICTED, PROP_INHERIT, ZFS_TYPE_FILESYSTEM,
"discard | noallow | restricted | passthrough | passthrough-x",
"ACLINHERIT", acl_inherit_table);
zprop_register_index(ZFS_PROP_COPIES, "copies", 1, PROP_INHERIT,
ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME,
"1 | 2 | 3", "COPIES", copies_table);
zprop_register_index(ZFS_PROP_PRIMARYCACHE, "primarycache",
ZFS_CACHE_ALL, PROP_INHERIT,
ZFS_TYPE_FILESYSTEM | ZFS_TYPE_SNAPSHOT | ZFS_TYPE_VOLUME,
"all | none | metadata", "PRIMARYCACHE", cache_table);
zprop_register_index(ZFS_PROP_SECONDARYCACHE, "secondarycache",
ZFS_CACHE_ALL, PROP_INHERIT,
ZFS_TYPE_FILESYSTEM | ZFS_TYPE_SNAPSHOT | ZFS_TYPE_VOLUME,
"all | none | metadata", "SECONDARYCACHE", cache_table);
zprop_register_index(ZFS_PROP_LOGBIAS, "logbias", ZFS_LOGBIAS_LATENCY,
PROP_INHERIT, ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME,
"latency | throughput", "LOGBIAS", logbias_table);
/* inherit index (boolean) properties */
zprop_register_index(ZFS_PROP_ATIME, "atime", 1, PROP_INHERIT,
ZFS_TYPE_FILESYSTEM, "on | off", "ATIME", boolean_table);
zprop_register_index(ZFS_PROP_DEVICES, "devices", 1, PROP_INHERIT,
ZFS_TYPE_FILESYSTEM | ZFS_TYPE_SNAPSHOT, "on | off", "DEVICES",
boolean_table);
zprop_register_index(ZFS_PROP_EXEC, "exec", 1, PROP_INHERIT,
ZFS_TYPE_FILESYSTEM | ZFS_TYPE_SNAPSHOT, "on | off", "EXEC",
boolean_table);
zprop_register_index(ZFS_PROP_SETUID, "setuid", 1, PROP_INHERIT,
ZFS_TYPE_FILESYSTEM | ZFS_TYPE_SNAPSHOT, "on | off", "SETUID",
boolean_table);
zprop_register_index(ZFS_PROP_READONLY, "readonly", 0, PROP_INHERIT,
ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME, "on | off", "RDONLY",
boolean_table);
zprop_register_index(ZFS_PROP_ZONED, "zoned", 0, PROP_INHERIT,
ZFS_TYPE_FILESYSTEM, "on | off", "ZONED", boolean_table);
zprop_register_index(ZFS_PROP_XATTR, "xattr", 1, PROP_INHERIT,
ZFS_TYPE_FILESYSTEM | ZFS_TYPE_SNAPSHOT, "on | off", "XATTR",
boolean_table);
zprop_register_index(ZFS_PROP_VSCAN, "vscan", 0, PROP_INHERIT,
ZFS_TYPE_FILESYSTEM, "on | off", "VSCAN",
boolean_table);
zprop_register_index(ZFS_PROP_NBMAND, "nbmand", 0, PROP_INHERIT,
ZFS_TYPE_FILESYSTEM | ZFS_TYPE_SNAPSHOT, "on | off", "NBMAND",
boolean_table);
/* default index properties */
zprop_register_index(ZFS_PROP_VERSION, "version", 0, PROP_DEFAULT,
ZFS_TYPE_FILESYSTEM | ZFS_TYPE_SNAPSHOT,
"1 | 2 | 3 | 4 | current", "VERSION", version_table);
zprop_register_index(ZFS_PROP_CANMOUNT, "canmount", ZFS_CANMOUNT_ON,
PROP_DEFAULT, ZFS_TYPE_FILESYSTEM, "on | off | noauto",
"CANMOUNT", canmount_table);
/* readonly index (boolean) properties */
zprop_register_index(ZFS_PROP_MOUNTED, "mounted", 0, PROP_READONLY,
ZFS_TYPE_FILESYSTEM, "yes | no", "MOUNTED", boolean_table);
zprop_register_index(ZFS_PROP_DEFER_DESTROY, "defer_destroy", 0,
PROP_READONLY, ZFS_TYPE_SNAPSHOT, "yes | no", "DEFER_DESTROY",
boolean_table);
/* set once index properties */
zprop_register_index(ZFS_PROP_NORMALIZE, "normalization", 0,
PROP_ONETIME, ZFS_TYPE_FILESYSTEM | ZFS_TYPE_SNAPSHOT,
"none | formC | formD | formKC | formKD", "NORMALIZATION",
normalize_table);
zprop_register_index(ZFS_PROP_CASE, "casesensitivity",
ZFS_CASE_SENSITIVE, PROP_ONETIME, ZFS_TYPE_FILESYSTEM |
ZFS_TYPE_SNAPSHOT,
"sensitive | insensitive | mixed", "CASE", case_table);
/* set once index (boolean) properties */
zprop_register_index(ZFS_PROP_UTF8ONLY, "utf8only", 0, PROP_ONETIME,
ZFS_TYPE_FILESYSTEM | ZFS_TYPE_SNAPSHOT,
"on | off", "UTF8ONLY", boolean_table);
/* string properties */
zprop_register_string(ZFS_PROP_ORIGIN, "origin", NULL, PROP_READONLY,
ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME, "<snapshot>", "ORIGIN");
zprop_register_string(ZFS_PROP_MOUNTPOINT, "mountpoint", "/",
PROP_INHERIT, ZFS_TYPE_FILESYSTEM, "<path> | legacy | none",
"MOUNTPOINT");
zprop_register_string(ZFS_PROP_SHARENFS, "sharenfs", "off",
PROP_INHERIT, ZFS_TYPE_FILESYSTEM, "on | off | share(1M) options",
"SHARENFS");
zprop_register_string(ZFS_PROP_TYPE, "type", NULL, PROP_READONLY,
ZFS_TYPE_DATASET, "filesystem | volume | snapshot", "TYPE");
zprop_register_string(ZFS_PROP_SHARESMB, "sharesmb", "off",
PROP_INHERIT, ZFS_TYPE_FILESYSTEM,
"on | off | sharemgr(1M) options", "SHARESMB");
zprop_register_string(ZFS_PROP_MLSLABEL, "mlslabel",
ZFS_MLSLABEL_DEFAULT, PROP_INHERIT, ZFS_TYPE_DATASET,
"<sensitivity label>", "MLSLABEL");
/* readonly number properties */
zprop_register_number(ZFS_PROP_USED, "used", 0, PROP_READONLY,
ZFS_TYPE_DATASET, "<size>", "USED");
zprop_register_number(ZFS_PROP_AVAILABLE, "available", 0, PROP_READONLY,
ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME, "<size>", "AVAIL");
zprop_register_number(ZFS_PROP_REFERENCED, "referenced", 0,
PROP_READONLY, ZFS_TYPE_DATASET, "<size>", "REFER");
zprop_register_number(ZFS_PROP_COMPRESSRATIO, "compressratio", 0,
PROP_READONLY, ZFS_TYPE_DATASET,
"<1.00x or higher if compressed>", "RATIO");
zprop_register_number(ZFS_PROP_VOLBLOCKSIZE, "volblocksize",
ZVOL_DEFAULT_BLOCKSIZE, PROP_ONETIME,
ZFS_TYPE_VOLUME, "512 to 128k, power of 2", "VOLBLOCK");
zprop_register_number(ZFS_PROP_USEDSNAP, "usedbysnapshots", 0,
PROP_READONLY, ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME, "<size>",
"USEDSNAP");
zprop_register_number(ZFS_PROP_USEDDS, "usedbydataset", 0,
PROP_READONLY, ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME, "<size>",
"USEDDS");
zprop_register_number(ZFS_PROP_USEDCHILD, "usedbychildren", 0,
PROP_READONLY, ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME, "<size>",
"USEDCHILD");
zprop_register_number(ZFS_PROP_USEDREFRESERV, "usedbyrefreservation", 0,
PROP_READONLY,
ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME, "<size>", "USEDREFRESERV");
zprop_register_number(ZFS_PROP_USERREFS, "userrefs", 0, PROP_READONLY,
ZFS_TYPE_SNAPSHOT, "<count>", "USERREFS");
/* default number properties */
zprop_register_number(ZFS_PROP_QUOTA, "quota", 0, PROP_DEFAULT,
ZFS_TYPE_FILESYSTEM, "<size> | none", "QUOTA");
zprop_register_number(ZFS_PROP_RESERVATION, "reservation", 0,
PROP_DEFAULT, ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME,
"<size> | none", "RESERV");
zprop_register_number(ZFS_PROP_VOLSIZE, "volsize", 0, PROP_DEFAULT,
ZFS_TYPE_VOLUME, "<size>", "VOLSIZE");
zprop_register_number(ZFS_PROP_REFQUOTA, "refquota", 0, PROP_DEFAULT,
ZFS_TYPE_FILESYSTEM, "<size> | none", "REFQUOTA");
zprop_register_number(ZFS_PROP_REFRESERVATION, "refreservation", 0,
PROP_DEFAULT, ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME,
"<size> | none", "REFRESERV");
/* inherit number properties */
zprop_register_number(ZFS_PROP_RECORDSIZE, "recordsize",
SPA_MAXBLOCKSIZE, PROP_INHERIT,
ZFS_TYPE_FILESYSTEM, "512 to 128k, power of 2", "RECSIZE");
/* hidden properties */
zprop_register_hidden(ZFS_PROP_CREATETXG, "createtxg", PROP_TYPE_NUMBER,
PROP_READONLY, ZFS_TYPE_DATASET, "CREATETXG");
zprop_register_hidden(ZFS_PROP_NUMCLONES, "numclones", PROP_TYPE_NUMBER,
PROP_READONLY, ZFS_TYPE_SNAPSHOT, "NUMCLONES");
zprop_register_hidden(ZFS_PROP_NAME, "name", PROP_TYPE_STRING,
PROP_READONLY, ZFS_TYPE_DATASET, "NAME");
zprop_register_hidden(ZFS_PROP_ISCSIOPTIONS, "iscsioptions",
PROP_TYPE_STRING, PROP_INHERIT, ZFS_TYPE_VOLUME, "ISCSIOPTIONS");
zprop_register_hidden(ZFS_PROP_STMF_SHAREINFO, "stmf_sbd_lu",
PROP_TYPE_STRING, PROP_INHERIT, ZFS_TYPE_VOLUME,
"STMF_SBD_LU");
zprop_register_hidden(ZFS_PROP_GUID, "guid", PROP_TYPE_NUMBER,
PROP_READONLY, ZFS_TYPE_DATASET, "GUID");
zprop_register_hidden(ZFS_PROP_USERACCOUNTING, "useraccounting",
PROP_TYPE_NUMBER, PROP_READONLY, ZFS_TYPE_DATASET,
"USERACCOUNTING");
zprop_register_hidden(ZFS_PROP_UNIQUE, "unique", PROP_TYPE_NUMBER,
PROP_READONLY, ZFS_TYPE_DATASET, "UNIQUE");
zprop_register_hidden(ZFS_PROP_OBJSETID, "objsetid", PROP_TYPE_NUMBER,
PROP_READONLY, ZFS_TYPE_DATASET, "OBJSETID");
/*
* Property to be removed once libbe is integrated
*/
zprop_register_hidden(ZFS_PROP_PRIVATE, "priv_prop",
PROP_TYPE_NUMBER, PROP_READONLY, ZFS_TYPE_FILESYSTEM,
"PRIV_PROP");
/* oddball properties */
zprop_register_impl(ZFS_PROP_CREATION, "creation", PROP_TYPE_NUMBER, 0,
NULL, PROP_READONLY, ZFS_TYPE_DATASET,
"<date>", "CREATION", B_FALSE, B_TRUE, NULL);
}
boolean_t
zfs_prop_delegatable(zfs_prop_t prop)
{
zprop_desc_t *pd = &zfs_prop_table[prop];
/* The mlslabel property is never delegatable. */
if (prop == ZFS_PROP_MLSLABEL)
return (B_FALSE);
return (pd->pd_attr != PROP_READONLY);
}
/*
* Given a zfs dataset property name, returns the corresponding property ID.
*/
zfs_prop_t
zfs_name_to_prop(const char *propname)
{
return (zprop_name_to_prop(propname, ZFS_TYPE_DATASET));
}
/*
* For user property names, we allow all lowercase alphanumeric characters, plus
* a few useful punctuation characters.
*/
static int
valid_char(char c)
{
return ((c >= 'a' && c <= 'z') ||
(c >= '0' && c <= '9') ||
c == '-' || c == '_' || c == '.' || c == ':');
}
/*
* Returns true if this is a valid user-defined property (one with a ':').
*/
boolean_t
zfs_prop_user(const char *name)
{
int i;
char c;
boolean_t foundsep = B_FALSE;
for (i = 0; i < strlen(name); i++) {
c = name[i];
if (!valid_char(c))
return (B_FALSE);
if (c == ':')
foundsep = B_TRUE;
}
if (!foundsep)
return (B_FALSE);
return (B_TRUE);
}
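/*
 * Illustrative sketch, not part of the imported source: how the rules
 * above classify a few candidate names.  Guarded out so it is never built.
 */
#if 0
static void
zfs_prop_user_example(void)
{
	boolean_t b;

	b = zfs_prop_user("com.example:backup");	/* B_TRUE: valid chars, has ':' */
	b = zfs_prop_user("com.example.backup");	/* B_FALSE: no ':' separator */
	b = zfs_prop_user("Com.example:backup");	/* B_FALSE: uppercase rejected */
	(void) b;
}
#endif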
/*
* Returns true if this is a valid userspace-type property (one with a '@').
 * Note that after the '@', any character is valid (e.g., another '@', for
 * SID names of the form user@domain).
*/
boolean_t
zfs_prop_userquota(const char *name)
{
zfs_userquota_prop_t prop;
for (prop = 0; prop < ZFS_NUM_USERQUOTA_PROPS; prop++) {
if (strncmp(name, zfs_userquota_prop_prefixes[prop],
strlen(zfs_userquota_prop_prefixes[prop])) == 0) {
return (B_TRUE);
}
}
return (B_FALSE);
}
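/*
 * Illustrative sketch, not part of the imported source: the prefix match
 * above accepts any suffix after the '@', including another '@'.
 */
#if 0
static void
zfs_prop_userquota_example(void)
{
	boolean_t b;

	b = zfs_prop_userquota("userquota@alice");	/* B_TRUE */
	b = zfs_prop_userquota("groupused@staff@sid");	/* B_TRUE */
	b = zfs_prop_userquota("refquota");		/* B_FALSE: no prefix */
	(void) b;
}
#endif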
/*
* Tables of index types, plus functions to convert between the user view
* (strings) and internal representation (uint64_t).
*/
int
zfs_prop_string_to_index(zfs_prop_t prop, const char *string, uint64_t *index)
{
return (zprop_string_to_index(prop, string, index, ZFS_TYPE_DATASET));
}
int
zfs_prop_index_to_string(zfs_prop_t prop, uint64_t index, const char **string)
{
return (zprop_index_to_string(prop, index, string, ZFS_TYPE_DATASET));
}
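/*
 * Together these two functions round-trip between the user-visible strings
 * and the numeric on-disk values.  A minimal sketch (illustrative only)
 * using the compression table registered in zfs_prop_init() above:
 */
#if 0
static void
zfs_prop_index_example(void)
{
	uint64_t ival;
	const char *sval;

	if (zfs_prop_string_to_index(ZFS_PROP_COMPRESSION, "gzip-4",
	    &ival) == 0 &&
	    zfs_prop_index_to_string(ZFS_PROP_COMPRESSION, ival, &sval) == 0) {
		/* ival == ZIO_COMPRESS_GZIP_4 && strcmp(sval, "gzip-4") == 0 */
	}
}
#endif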
uint64_t
zfs_prop_random_value(zfs_prop_t prop, uint64_t seed)
{
return (zprop_random_value(prop, seed, ZFS_TYPE_DATASET));
}
/*
* Returns TRUE if the property applies to any of the given dataset types.
*/
boolean_t
zfs_prop_valid_for_type(int prop, zfs_type_t types)
{
return (zprop_valid_for_type(prop, types));
}
zprop_type_t
zfs_prop_get_type(zfs_prop_t prop)
{
return (zfs_prop_table[prop].pd_proptype);
}
/*
* Returns TRUE if the property is readonly.
*/
boolean_t
zfs_prop_readonly(zfs_prop_t prop)
{
return (zfs_prop_table[prop].pd_attr == PROP_READONLY ||
zfs_prop_table[prop].pd_attr == PROP_ONETIME);
}
/*
* Returns TRUE if the property is only allowed to be set once.
*/
boolean_t
zfs_prop_setonce(zfs_prop_t prop)
{
return (zfs_prop_table[prop].pd_attr == PROP_ONETIME);
}
const char *
zfs_prop_default_string(zfs_prop_t prop)
{
return (zfs_prop_table[prop].pd_strdefault);
}
uint64_t
zfs_prop_default_numeric(zfs_prop_t prop)
{
return (zfs_prop_table[prop].pd_numdefault);
}
/*
* Given a dataset property ID, returns the corresponding name.
 * Assumes the zfs dataset property ID is valid.
*/
const char *
zfs_prop_to_name(zfs_prop_t prop)
{
return (zfs_prop_table[prop].pd_name);
}
/*
* Returns TRUE if the property is inheritable.
*/
boolean_t
zfs_prop_inheritable(zfs_prop_t prop)
{
return (zfs_prop_table[prop].pd_attr == PROP_INHERIT ||
zfs_prop_table[prop].pd_attr == PROP_ONETIME);
}
#ifndef _KERNEL
/*
* Returns a string describing the set of acceptable values for the given
* zfs property, or NULL if it cannot be set.
*/
const char *
zfs_prop_values(zfs_prop_t prop)
{
return (zfs_prop_table[prop].pd_values);
}
/*
* Returns TRUE if this property is a string type. Note that index types
* (compression, checksum) are treated as strings in userland, even though they
* are stored numerically on disk.
*/
int
zfs_prop_is_string(zfs_prop_t prop)
{
return (zfs_prop_table[prop].pd_proptype == PROP_TYPE_STRING ||
zfs_prop_table[prop].pd_proptype == PROP_TYPE_INDEX);
}
/*
* Returns the column header for the given property. Used only in
* 'zfs list -o', but centralized here with the other property information.
*/
const char *
zfs_prop_column_name(zfs_prop_t prop)
{
return (zfs_prop_table[prop].pd_colname);
}
/*
* Returns whether the given property should be displayed right-justified for
* 'zfs list'.
*/
boolean_t
zfs_prop_align_right(zfs_prop_t prop)
{
return (zfs_prop_table[prop].pd_rightalign);
}
#endif

common/zfs/zfs_prop.h Normal file

@ -0,0 +1,129 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2010 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#ifndef _ZFS_PROP_H
#define _ZFS_PROP_H
#include <sys/fs/zfs.h>
#include <sys/types.h>
#ifdef __cplusplus
extern "C" {
#endif
/*
* For index types (e.g. compression and checksum), we want the numeric value
* in the kernel, but the string value in userland.
*/
typedef enum {
PROP_TYPE_NUMBER, /* numeric value */
PROP_TYPE_STRING, /* string value */
PROP_TYPE_INDEX /* numeric value indexed by string */
} zprop_type_t;
typedef enum {
PROP_DEFAULT,
PROP_READONLY,
PROP_INHERIT,
/*
* ONETIME properties are a sort of conglomeration of READONLY
 * and INHERIT.  They can be set only during object creation;
 * after that they are READONLY.  If not explicitly set during
* creation, they can be inherited.
*/
PROP_ONETIME
} zprop_attr_t;
typedef struct zfs_index {
const char *pi_name;
uint64_t pi_value;
} zprop_index_t;
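/*
 * An index table is a NULL-name-terminated array mapping each user-visible
 * string to its numeric value; the simplest case is the boolean table used
 * throughout zfs_prop.c and zpool_prop.c:
 *
 *	static zprop_index_t boolean_table[] = {
 *		{ "off", 0 },
 *		{ "on", 1 },
 *		{ NULL }
 *	};
 */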
typedef struct {
const char *pd_name; /* human-readable property name */
int pd_propnum; /* property number */
zprop_type_t pd_proptype; /* string, boolean, index, number */
const char *pd_strdefault; /* default for strings */
uint64_t pd_numdefault; /* for boolean / index / number */
zprop_attr_t pd_attr; /* default, readonly, inherit */
int pd_types; /* bitfield of valid dataset types */
/* fs | vol | snap; or pool */
const char *pd_values; /* string telling acceptable values */
const char *pd_colname; /* column header for "zfs list" */
boolean_t pd_rightalign; /* column alignment for "zfs list" */
boolean_t pd_visible; /* do we list this property with the */
/* "zfs get" help message */
const zprop_index_t *pd_table; /* for index properties, a table */
/* defining the possible values */
size_t pd_table_size; /* number of entries in pd_table[] */
} zprop_desc_t;
/*
* zfs dataset property functions
*/
void zfs_prop_init(void);
zprop_type_t zfs_prop_get_type(zfs_prop_t);
boolean_t zfs_prop_delegatable(zfs_prop_t prop);
zprop_desc_t *zfs_prop_get_table(void);
/*
* zpool property functions
*/
void zpool_prop_init(void);
zprop_type_t zpool_prop_get_type(zpool_prop_t);
zprop_desc_t *zpool_prop_get_table(void);
/*
* Common routines to initialize property tables
*/
void zprop_register_impl(int, const char *, zprop_type_t, uint64_t,
const char *, zprop_attr_t, int, const char *, const char *,
boolean_t, boolean_t, const zprop_index_t *);
void zprop_register_string(int, const char *, const char *,
zprop_attr_t attr, int, const char *, const char *);
void zprop_register_number(int, const char *, uint64_t, zprop_attr_t, int,
const char *, const char *);
void zprop_register_index(int, const char *, uint64_t, zprop_attr_t, int,
const char *, const char *, const zprop_index_t *);
void zprop_register_hidden(int, const char *, zprop_type_t, zprop_attr_t,
int, const char *);
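/*
 * Typical use of the registration routines above, as in zfs_prop_init()
 * (zfs_prop.c): register an inheritable boolean index property.
 *
 *	zprop_register_index(ZFS_PROP_ATIME, "atime", 1, PROP_INHERIT,
 *	    ZFS_TYPE_FILESYSTEM, "on | off", "ATIME", boolean_table);
 */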
/*
* Common routines for zfs and zpool property management
*/
int zprop_iter_common(zprop_func, void *, boolean_t, boolean_t, zfs_type_t);
int zprop_name_to_prop(const char *, zfs_type_t);
int zprop_string_to_index(int, const char *, uint64_t *, zfs_type_t);
int zprop_index_to_string(int, uint64_t, const char **, zfs_type_t);
uint64_t zprop_random_value(int, uint64_t, zfs_type_t);
const char *zprop_values(int, zfs_type_t);
size_t zprop_width(int, boolean_t *, zfs_type_t);
boolean_t zprop_valid_for_type(int, zfs_type_t);
#ifdef __cplusplus
}
#endif
#endif /* _ZFS_PROP_H */

common/zfs/zpool_prop.c Normal file

@ -0,0 +1,202 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2007, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#include <sys/zio.h>
#include <sys/spa.h>
#include <sys/zfs_acl.h>
#include <sys/zfs_ioctl.h>
#include <sys/fs/zfs.h>
#include "zfs_prop.h"
#if defined(_KERNEL)
#include <sys/systm.h>
#else
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#endif
static zprop_desc_t zpool_prop_table[ZPOOL_NUM_PROPS];
zprop_desc_t *
zpool_prop_get_table(void)
{
return (zpool_prop_table);
}
void
zpool_prop_init(void)
{
static zprop_index_t boolean_table[] = {
{ "off", 0},
{ "on", 1},
{ NULL }
};
static zprop_index_t failuremode_table[] = {
{ "wait", ZIO_FAILURE_MODE_WAIT },
{ "continue", ZIO_FAILURE_MODE_CONTINUE },
{ "panic", ZIO_FAILURE_MODE_PANIC },
{ NULL }
};
/* string properties */
zprop_register_string(ZPOOL_PROP_ALTROOT, "altroot", NULL, PROP_DEFAULT,
ZFS_TYPE_POOL, "<path>", "ALTROOT");
zprop_register_string(ZPOOL_PROP_BOOTFS, "bootfs", NULL, PROP_DEFAULT,
ZFS_TYPE_POOL, "<filesystem>", "BOOTFS");
zprop_register_string(ZPOOL_PROP_CACHEFILE, "cachefile", NULL,
PROP_DEFAULT, ZFS_TYPE_POOL, "<file> | none", "CACHEFILE");
/* readonly number properties */
zprop_register_number(ZPOOL_PROP_SIZE, "size", 0, PROP_READONLY,
ZFS_TYPE_POOL, "<size>", "SIZE");
zprop_register_number(ZPOOL_PROP_FREE, "free", 0, PROP_READONLY,
ZFS_TYPE_POOL, "<size>", "FREE");
zprop_register_number(ZPOOL_PROP_ALLOCATED, "allocated", 0,
PROP_READONLY, ZFS_TYPE_POOL, "<size>", "ALLOC");
zprop_register_number(ZPOOL_PROP_CAPACITY, "capacity", 0, PROP_READONLY,
ZFS_TYPE_POOL, "<size>", "CAP");
zprop_register_number(ZPOOL_PROP_GUID, "guid", 0, PROP_READONLY,
ZFS_TYPE_POOL, "<guid>", "GUID");
zprop_register_number(ZPOOL_PROP_HEALTH, "health", 0, PROP_READONLY,
ZFS_TYPE_POOL, "<state>", "HEALTH");
zprop_register_number(ZPOOL_PROP_DEDUPRATIO, "dedupratio", 0,
PROP_READONLY, ZFS_TYPE_POOL, "<1.00x or higher if deduped>",
"DEDUP");
/* default number properties */
zprop_register_number(ZPOOL_PROP_VERSION, "version", SPA_VERSION,
PROP_DEFAULT, ZFS_TYPE_POOL, "<version>", "VERSION");
zprop_register_number(ZPOOL_PROP_DEDUPDITTO, "dedupditto", 0,
PROP_DEFAULT, ZFS_TYPE_POOL, "<threshold (min 100)>", "DEDUPDITTO");
/* default index (boolean) properties */
zprop_register_index(ZPOOL_PROP_DELEGATION, "delegation", 1,
PROP_DEFAULT, ZFS_TYPE_POOL, "on | off", "DELEGATION",
boolean_table);
zprop_register_index(ZPOOL_PROP_AUTOREPLACE, "autoreplace", 0,
PROP_DEFAULT, ZFS_TYPE_POOL, "on | off", "REPLACE", boolean_table);
zprop_register_index(ZPOOL_PROP_LISTSNAPS, "listsnapshots", 0,
PROP_DEFAULT, ZFS_TYPE_POOL, "on | off", "LISTSNAPS",
boolean_table);
zprop_register_index(ZPOOL_PROP_AUTOEXPAND, "autoexpand", 0,
PROP_DEFAULT, ZFS_TYPE_POOL, "on | off", "EXPAND", boolean_table);
zprop_register_index(ZPOOL_PROP_READONLY, "readonly", 0,
PROP_DEFAULT, ZFS_TYPE_POOL, "on | off", "RDONLY", boolean_table);
/* default index properties */
zprop_register_index(ZPOOL_PROP_FAILUREMODE, "failmode",
ZIO_FAILURE_MODE_WAIT, PROP_DEFAULT, ZFS_TYPE_POOL,
"wait | continue | panic", "FAILMODE", failuremode_table);
/* hidden properties */
zprop_register_hidden(ZPOOL_PROP_NAME, "name", PROP_TYPE_STRING,
PROP_READONLY, ZFS_TYPE_POOL, "NAME");
}
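/*
 * Illustrative sketch, not part of the imported source: resolving a
 * failmode string through the table registered above.
 */
#if 0
static void
zpool_prop_example(void)
{
	uint64_t fm;

	if (zpool_prop_string_to_index(ZPOOL_PROP_FAILUREMODE, "panic",
	    &fm) == 0) {
		/* fm == ZIO_FAILURE_MODE_PANIC */
	}
}
#endif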
/*
* Given a property name and its type, returns the corresponding property ID.
*/
zpool_prop_t
zpool_name_to_prop(const char *propname)
{
return (zprop_name_to_prop(propname, ZFS_TYPE_POOL));
}
/*
* Given a pool property ID, returns the corresponding name.
 * Assumes the pool property ID is valid.
*/
const char *
zpool_prop_to_name(zpool_prop_t prop)
{
return (zpool_prop_table[prop].pd_name);
}
zprop_type_t
zpool_prop_get_type(zpool_prop_t prop)
{
return (zpool_prop_table[prop].pd_proptype);
}
boolean_t
zpool_prop_readonly(zpool_prop_t prop)
{
return (zpool_prop_table[prop].pd_attr == PROP_READONLY);
}
const char *
zpool_prop_default_string(zpool_prop_t prop)
{
return (zpool_prop_table[prop].pd_strdefault);
}
uint64_t
zpool_prop_default_numeric(zpool_prop_t prop)
{
return (zpool_prop_table[prop].pd_numdefault);
}
int
zpool_prop_string_to_index(zpool_prop_t prop, const char *string,
uint64_t *index)
{
return (zprop_string_to_index(prop, string, index, ZFS_TYPE_POOL));
}
int
zpool_prop_index_to_string(zpool_prop_t prop, uint64_t index,
const char **string)
{
return (zprop_index_to_string(prop, index, string, ZFS_TYPE_POOL));
}
uint64_t
zpool_prop_random_value(zpool_prop_t prop, uint64_t seed)
{
return (zprop_random_value(prop, seed, ZFS_TYPE_POOL));
}
#ifndef _KERNEL
const char *
zpool_prop_values(zpool_prop_t prop)
{
return (zpool_prop_table[prop].pd_values);
}
const char *
zpool_prop_column_name(zpool_prop_t prop)
{
return (zpool_prop_table[prop].pd_colname);
}
boolean_t
zpool_prop_align_right(zpool_prop_t prop)
{
return (zpool_prop_table[prop].pd_rightalign);
}
#endif

common/zfs/zprop_common.c Normal file

@ -0,0 +1,426 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2010 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
/*
* Common routines used by zfs and zpool property management.
*/
#include <sys/zio.h>
#include <sys/spa.h>
#include <sys/zfs_acl.h>
#include <sys/zfs_ioctl.h>
#include <sys/zfs_znode.h>
#include <sys/fs/zfs.h>
#include "zfs_prop.h"
#include "zfs_deleg.h"
#if defined(_KERNEL)
#include <sys/systm.h>
#include <util/qsort.h>
#else
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#endif
static zprop_desc_t *
zprop_get_proptable(zfs_type_t type)
{
if (type == ZFS_TYPE_POOL)
return (zpool_prop_get_table());
else
return (zfs_prop_get_table());
}
static int
zprop_get_numprops(zfs_type_t type)
{
if (type == ZFS_TYPE_POOL)
return (ZPOOL_NUM_PROPS);
else
return (ZFS_NUM_PROPS);
}
void
zprop_register_impl(int prop, const char *name, zprop_type_t type,
uint64_t numdefault, const char *strdefault, zprop_attr_t attr,
int objset_types, const char *values, const char *colname,
boolean_t rightalign, boolean_t visible, const zprop_index_t *idx_tbl)
{
zprop_desc_t *prop_tbl = zprop_get_proptable(objset_types);
zprop_desc_t *pd;
pd = &prop_tbl[prop];
ASSERT(pd->pd_name == NULL || pd->pd_name == name);
ASSERT(name != NULL);
ASSERT(colname != NULL);
pd->pd_name = name;
pd->pd_propnum = prop;
pd->pd_proptype = type;
pd->pd_numdefault = numdefault;
pd->pd_strdefault = strdefault;
pd->pd_attr = attr;
pd->pd_types = objset_types;
pd->pd_values = values;
pd->pd_colname = colname;
pd->pd_rightalign = rightalign;
pd->pd_visible = visible;
pd->pd_table = idx_tbl;
pd->pd_table_size = 0;
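	/* Size the index table by walking to its NULL-name terminator. */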
while (idx_tbl && (idx_tbl++)->pi_name != NULL)
pd->pd_table_size++;
}
void
zprop_register_string(int prop, const char *name, const char *def,
zprop_attr_t attr, int objset_types, const char *values,
const char *colname)
{
zprop_register_impl(prop, name, PROP_TYPE_STRING, 0, def, attr,
objset_types, values, colname, B_FALSE, B_TRUE, NULL);
}
void
zprop_register_number(int prop, const char *name, uint64_t def,
zprop_attr_t attr, int objset_types, const char *values,
const char *colname)
{
zprop_register_impl(prop, name, PROP_TYPE_NUMBER, def, NULL, attr,
objset_types, values, colname, B_TRUE, B_TRUE, NULL);
}
void
zprop_register_index(int prop, const char *name, uint64_t def,
zprop_attr_t attr, int objset_types, const char *values,
const char *colname, const zprop_index_t *idx_tbl)
{
zprop_register_impl(prop, name, PROP_TYPE_INDEX, def, NULL, attr,
objset_types, values, colname, B_TRUE, B_TRUE, idx_tbl);
}
void
zprop_register_hidden(int prop, const char *name, zprop_type_t type,
zprop_attr_t attr, int objset_types, const char *colname)
{
zprop_register_impl(prop, name, type, 0, NULL, attr,
objset_types, NULL, colname, B_FALSE, B_FALSE, NULL);
}
/*
* A comparison function we can use to order indexes into property tables.
*/
static int
zprop_compare(const void *arg1, const void *arg2)
{
const zprop_desc_t *p1 = *((zprop_desc_t **)arg1);
const zprop_desc_t *p2 = *((zprop_desc_t **)arg2);
boolean_t p1ro, p2ro;
p1ro = (p1->pd_attr == PROP_READONLY);
p2ro = (p2->pd_attr == PROP_READONLY);
if (p1ro == p2ro)
return (strcmp(p1->pd_name, p2->pd_name));
return (p1ro ? -1 : 1);
}
/*
* Iterate over all properties in the given property table, calling back
* into the specified function for each property. We will continue to
* iterate until we either reach the end or the callback function returns
* something other than ZPROP_CONT.
*/
int
zprop_iter_common(zprop_func func, void *cb, boolean_t show_all,
boolean_t ordered, zfs_type_t type)
{
int i, num_props, size, prop;
zprop_desc_t *prop_tbl;
zprop_desc_t **order;
prop_tbl = zprop_get_proptable(type);
num_props = zprop_get_numprops(type);
size = num_props * sizeof (zprop_desc_t *);
#if defined(_KERNEL)
order = kmem_alloc(size, KM_SLEEP);
#else
if ((order = malloc(size)) == NULL)
return (ZPROP_CONT);
#endif
for (int j = 0; j < num_props; j++)
order[j] = &prop_tbl[j];
if (ordered) {
qsort((void *)order, num_props, sizeof (zprop_desc_t *),
zprop_compare);
}
prop = ZPROP_CONT;
for (i = 0; i < num_props; i++) {
if ((order[i]->pd_visible || show_all) &&
(func(order[i]->pd_propnum, cb) != ZPROP_CONT)) {
prop = order[i]->pd_propnum;
break;
}
}
#if defined(_KERNEL)
kmem_free(order, size);
#else
free(order);
#endif
return (prop);
}
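/*
 * Hedged userland sketch, not part of this commit: drive the iterator to
 * print every visible dataset property in sorted order.  The callback
 * name is hypothetical; printf() requires <stdio.h>.
 */
#if 0
static int
print_prop_cb(int prop, void *cb_data)
{
	(void) printf("%s\n", zfs_prop_to_name(prop));
	return (ZPROP_CONT);
}

static void
zprop_iter_example(void)
{
	/* show_all = B_FALSE (visible only), ordered = B_TRUE */
	(void) zprop_iter_common(print_prop_cb, NULL, B_FALSE, B_TRUE,
	    ZFS_TYPE_DATASET);
}
#endif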
static boolean_t
propname_match(const char *p, size_t len, zprop_desc_t *prop_entry)
{
const char *propname = prop_entry->pd_name;
#ifndef _KERNEL
const char *colname = prop_entry->pd_colname;
int c;
#endif
if (len == strlen(propname) &&
strncmp(p, propname, len) == 0)
return (B_TRUE);
#ifndef _KERNEL
if (colname == NULL || len != strlen(colname))
return (B_FALSE);
for (c = 0; c < len; c++)
if (p[c] != tolower(colname[c]))
break;
return (colname[c] == '\0');
#else
return (B_FALSE);
#endif
}
typedef struct name_to_prop_cb {
const char *propname;
zprop_desc_t *prop_tbl;
} name_to_prop_cb_t;
static int
zprop_name_to_prop_cb(int prop, void *cb_data)
{
name_to_prop_cb_t *data = cb_data;
if (propname_match(data->propname, strlen(data->propname),
&data->prop_tbl[prop]))
return (prop);
return (ZPROP_CONT);
}
int
zprop_name_to_prop(const char *propname, zfs_type_t type)
{
int prop;
name_to_prop_cb_t cb_data;
cb_data.propname = propname;
cb_data.prop_tbl = zprop_get_proptable(type);
prop = zprop_iter_common(zprop_name_to_prop_cb, &cb_data,
B_TRUE, B_FALSE, type);
return (prop == ZPROP_CONT ? ZPROP_INVAL : prop);
}
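/*
 * Because propname_match() also accepts the lowercased column header in
 * userland, both lookups below resolve to the same property (illustrative
 * only):
 */
#if 0
static void
zprop_name_example(void)
{
	int p1 = zprop_name_to_prop("compression", ZFS_TYPE_DATASET);
	int p2 = zprop_name_to_prop("compress", ZFS_TYPE_DATASET);

	/* p1 == p2 == ZFS_PROP_COMPRESSION; unknown names give ZPROP_INVAL */
	(void) p1;
	(void) p2;
}
#endif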
int
zprop_string_to_index(int prop, const char *string, uint64_t *index,
zfs_type_t type)
{
zprop_desc_t *prop_tbl;
const zprop_index_t *idx_tbl;
int i;
if (prop == ZPROP_INVAL || prop == ZPROP_CONT)
return (-1);
ASSERT(prop < zprop_get_numprops(type));
prop_tbl = zprop_get_proptable(type);
if ((idx_tbl = prop_tbl[prop].pd_table) == NULL)
return (-1);
for (i = 0; idx_tbl[i].pi_name != NULL; i++) {
if (strcmp(string, idx_tbl[i].pi_name) == 0) {
*index = idx_tbl[i].pi_value;
return (0);
}
}
return (-1);
}
int
zprop_index_to_string(int prop, uint64_t index, const char **string,
zfs_type_t type)
{
zprop_desc_t *prop_tbl;
const zprop_index_t *idx_tbl;
int i;
if (prop == ZPROP_INVAL || prop == ZPROP_CONT)
return (-1);
ASSERT(prop < zprop_get_numprops(type));
prop_tbl = zprop_get_proptable(type);
if ((idx_tbl = prop_tbl[prop].pd_table) == NULL)
return (-1);
for (i = 0; idx_tbl[i].pi_name != NULL; i++) {
if (idx_tbl[i].pi_value == index) {
*string = idx_tbl[i].pi_name;
return (0);
}
}
return (-1);
}
/*
* Return a random valid property value. Used by ztest.
*/
uint64_t
zprop_random_value(int prop, uint64_t seed, zfs_type_t type)
{
zprop_desc_t *prop_tbl;
const zprop_index_t *idx_tbl;
ASSERT((uint_t)prop < zprop_get_numprops(type));
prop_tbl = zprop_get_proptable(type);
idx_tbl = prop_tbl[prop].pd_table;
if (idx_tbl == NULL)
return (seed);
return (idx_tbl[seed % prop_tbl[prop].pd_table_size].pi_value);
}
const char *
zprop_values(int prop, zfs_type_t type)
{
zprop_desc_t *prop_tbl;
ASSERT(prop != ZPROP_INVAL && prop != ZPROP_CONT);
ASSERT(prop < zprop_get_numprops(type));
prop_tbl = zprop_get_proptable(type);
return (prop_tbl[prop].pd_values);
}
/*
* Returns TRUE if the property applies to any of the given dataset types.
*/
boolean_t
zprop_valid_for_type(int prop, zfs_type_t type)
{
zprop_desc_t *prop_tbl;
if (prop == ZPROP_INVAL || prop == ZPROP_CONT)
return (B_FALSE);
ASSERT(prop < zprop_get_numprops(type));
prop_tbl = zprop_get_proptable(type);
return ((prop_tbl[prop].pd_types & type) != 0);
}
#ifndef _KERNEL
/*
* Determines the minimum width for the column, and indicates whether it's fixed
* or not. Only string columns are non-fixed.
*/
size_t
zprop_width(int prop, boolean_t *fixed, zfs_type_t type)
{
zprop_desc_t *prop_tbl, *pd;
const zprop_index_t *idx;
size_t ret;
int i;
ASSERT(prop != ZPROP_INVAL && prop != ZPROP_CONT);
ASSERT(prop < zprop_get_numprops(type));
prop_tbl = zprop_get_proptable(type);
pd = &prop_tbl[prop];
*fixed = B_TRUE;
/*
* Start with the width of the column name.
*/
ret = strlen(pd->pd_colname);
/*
* For fixed-width values, make sure the width is large enough to hold
* any possible value.
*/
switch (pd->pd_proptype) {
case PROP_TYPE_NUMBER:
/*
* The maximum length of a human-readable number is 5 characters
* ("20.4M", for example).
*/
if (ret < 5)
ret = 5;
/*
* 'creation' is handled specially because it's a number
* internally, but displayed as a date string.
*/
if (prop == ZFS_PROP_CREATION)
*fixed = B_FALSE;
break;
case PROP_TYPE_INDEX:
idx = prop_tbl[prop].pd_table;
for (i = 0; idx[i].pi_name != NULL; i++) {
if (strlen(idx[i].pi_name) > ret)
ret = strlen(idx[i].pi_name);
}
break;
case PROP_TYPE_STRING:
*fixed = B_FALSE;
break;
}
return (ret);
}
#endif

uts/common/Makefile.files Normal file

File diff suppressed because it is too large

uts/common/dtrace/dtrace.c

@ -20,12 +20,9 @@
*/
/*
* Copyright 2008 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
* Copyright (c) 2003, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#pragma ident "%Z%%M% %I% %E% SMI"
/*
* DTrace - Dynamic Tracing for Solaris
*
@ -186,7 +183,9 @@ static dtrace_ecb_t *dtrace_ecb_create_cache; /* cached created ECB */
static dtrace_genid_t dtrace_probegen; /* current probe generation */
static dtrace_helpers_t *dtrace_deferred_pid; /* deferred helper list */
static dtrace_enabling_t *dtrace_retained; /* list of retained enablings */
static dtrace_genid_t dtrace_retained_gen; /* current retained enab gen */
static dtrace_dynvar_t dtrace_dynhash_sink; /* end of dynamic hash chains */
static int dtrace_dynvar_failclean; /* dynvars failed to clean */
/*
* DTrace Locking
@ -240,10 +239,16 @@ static void
dtrace_nullop(void)
{}
static int
dtrace_enable_nullop(void)
{
return (0);
}
static dtrace_pops_t dtrace_provider_ops = {
(void (*)(void *, const dtrace_probedesc_t *))dtrace_nullop,
(void (*)(void *, struct modctl *))dtrace_nullop,
(void (*)(void *, dtrace_id_t, void *))dtrace_nullop,
(int (*)(void *, dtrace_id_t, void *))dtrace_enable_nullop,
(void (*)(void *, dtrace_id_t, void *))dtrace_nullop,
(void (*)(void *, dtrace_id_t, void *))dtrace_nullop,
(void (*)(void *, dtrace_id_t, void *))dtrace_nullop,
@ -427,6 +432,7 @@ dtrace_load##bits(uintptr_t addr) \
#define DTRACE_DYNHASH_SINK 1
#define DTRACE_DYNHASH_VALID 2
#define DTRACE_MATCH_FAIL -1
#define DTRACE_MATCH_NEXT 0
#define DTRACE_MATCH_DONE 1
#define DTRACE_ANCHORED(probe) ((probe)->dtpr_func[0] != '\0')
@ -1182,12 +1188,12 @@ dtrace_dynvar_clean(dtrace_dstate_t *dstate)
{
dtrace_dynvar_t *dirty;
dtrace_dstate_percpu_t *dcpu;
int i, work = 0;
dtrace_dynvar_t **rinsep;
int i, j, work = 0;
for (i = 0; i < NCPU; i++) {
dcpu = &dstate->dtds_percpu[i];
ASSERT(dcpu->dtdsc_rinsing == NULL);
rinsep = &dcpu->dtdsc_rinsing;
/*
* If the dirty list is NULL, there is no dirty work to do.
@ -1195,14 +1201,62 @@ dtrace_dynvar_clean(dtrace_dstate_t *dstate)
if (dcpu->dtdsc_dirty == NULL)
continue;
/*
* If the clean list is non-NULL, then we're not going to do
* any work for this CPU -- it means that there has not been
* a dtrace_dynvar() allocation on this CPU (or from this CPU)
* since the last time we cleaned house.
*/
if (dcpu->dtdsc_clean != NULL)
if (dcpu->dtdsc_rinsing != NULL) {
/*
* If the rinsing list is non-NULL, then it is because
* this CPU was selected to accept another CPU's
* dirty list -- and since that time, dirty buffers
* have accumulated. This is a highly unlikely
* condition, but we choose to ignore the dirty
 * buffers -- they'll be picked up in a future cleanse.
*/
continue;
}
if (dcpu->dtdsc_clean != NULL) {
/*
* If the clean list is non-NULL, then we're in a
* situation where a CPU has done deallocations (we
* have a non-NULL dirty list) but no allocations (we
* also have a non-NULL clean list). We can't simply
* move the dirty list into the clean list on this
* CPU, yet we also don't want to allow this condition
* to persist, lest a short clean list prevent a
* massive dirty list from being cleaned (which in
* turn could lead to otherwise avoidable dynamic
* drops). To deal with this, we look for some CPU
* with a NULL clean list, NULL dirty list, and NULL
* rinsing list -- and then we borrow this CPU to
* rinse our dirty list.
*/
for (j = 0; j < NCPU; j++) {
dtrace_dstate_percpu_t *rinser;
rinser = &dstate->dtds_percpu[j];
if (rinser->dtdsc_rinsing != NULL)
continue;
if (rinser->dtdsc_dirty != NULL)
continue;
if (rinser->dtdsc_clean != NULL)
continue;
rinsep = &rinser->dtdsc_rinsing;
break;
}
if (j == NCPU) {
/*
* We were unable to find another CPU that
* could accept this dirty list -- we are
* therefore unable to clean it now.
*/
dtrace_dynvar_failclean++;
continue;
}
}
work = 1;
@ -1219,7 +1273,7 @@ dtrace_dynvar_clean(dtrace_dstate_t *dstate)
* on a hash chain, either the dirty list or the
* rinsing list for some CPU must be non-NULL.)
*/
dcpu->dtdsc_rinsing = dirty;
*rinsep = dirty;
dtrace_membar_producer();
} while (dtrace_casptr(&dcpu->dtdsc_dirty,
dirty, NULL) != dirty);
@ -1650,7 +1704,7 @@ retry:
ASSERT(clean->dtdv_hashval == DTRACE_DYNHASH_FREE);
/*
* Now we'll move the clean list to the free list.
* Now we'll move the clean list to our free list.
* It's impossible for this to fail: the only way
* the free list can be updated is through this
* code path, and only one CPU can own the clean list.
@ -1663,6 +1717,7 @@ retry:
* owners of the clean lists out before resetting
* the clean lists.
*/
dcpu = &dstate->dtds_percpu[me];
rval = dtrace_casptr(&dcpu->dtdsc_free, NULL, clean);
ASSERT(rval == NULL);
goto retry;
@ -3600,7 +3655,7 @@ dtrace_dif_subr(uint_t subr, uint_t rd, uint64_t *regs,
int64_t index = (int64_t)tupregs[1].dttk_value;
int64_t remaining = (int64_t)tupregs[2].dttk_value;
size_t len = dtrace_strlen((char *)s, size);
int64_t i = 0;
int64_t i;
if (!dtrace_canload(s, len + 1, mstate, vstate)) {
regs[rd] = NULL;
@ -6655,7 +6710,7 @@ dtrace_match(const dtrace_probekey_t *pkp, uint32_t priv, uid_t uid,
{
dtrace_probe_t template, *probe;
dtrace_hash_t *hash = NULL;
int len, best = INT_MAX, nmatched = 0;
int len, rc, best = INT_MAX, nmatched = 0;
dtrace_id_t i;
ASSERT(MUTEX_HELD(&dtrace_lock));
@ -6667,7 +6722,8 @@ dtrace_match(const dtrace_probekey_t *pkp, uint32_t priv, uid_t uid,
if (pkp->dtpk_id != DTRACE_IDNONE) {
if ((probe = dtrace_probe_lookup_id(pkp->dtpk_id)) != NULL &&
dtrace_match_probe(probe, pkp, priv, uid, zoneid) > 0) {
(void) (*matched)(probe, arg);
if ((*matched)(probe, arg) == DTRACE_MATCH_FAIL)
return (DTRACE_MATCH_FAIL);
nmatched++;
}
return (nmatched);
@ -6714,8 +6770,12 @@ dtrace_match(const dtrace_probekey_t *pkp, uint32_t priv, uid_t uid,
nmatched++;
if ((*matched)(probe, arg) != DTRACE_MATCH_NEXT)
if ((rc = (*matched)(probe, arg)) !=
DTRACE_MATCH_NEXT) {
if (rc == DTRACE_MATCH_FAIL)
return (DTRACE_MATCH_FAIL);
break;
}
}
return (nmatched);
@ -6734,8 +6794,11 @@ dtrace_match(const dtrace_probekey_t *pkp, uint32_t priv, uid_t uid,
nmatched++;
if ((*matched)(probe, arg) != DTRACE_MATCH_NEXT)
if ((rc = (*matched)(probe, arg)) != DTRACE_MATCH_NEXT) {
if (rc == DTRACE_MATCH_FAIL)
return (DTRACE_MATCH_FAIL);
break;
}
}
return (nmatched);
@ -6955,7 +7018,7 @@ dtrace_unregister(dtrace_provider_id_t id)
dtrace_probe_t *probe, *first = NULL;
if (old->dtpv_pops.dtps_enable ==
(void (*)(void *, dtrace_id_t, void *))dtrace_nullop) {
(int (*)(void *, dtrace_id_t, void *))dtrace_enable_nullop) {
/*
* If DTrace itself is the provider, we're called with locks
* already held.
@ -7101,7 +7164,7 @@ dtrace_invalidate(dtrace_provider_id_t id)
dtrace_provider_t *pvp = (dtrace_provider_t *)id;
ASSERT(pvp->dtpv_pops.dtps_enable !=
(void (*)(void *, dtrace_id_t, void *))dtrace_nullop);
(int (*)(void *, dtrace_id_t, void *))dtrace_enable_nullop);
mutex_enter(&dtrace_provider_lock);
mutex_enter(&dtrace_lock);
@ -7142,7 +7205,7 @@ dtrace_condense(dtrace_provider_id_t id)
* Make sure this isn't the dtrace provider itself.
*/
ASSERT(prov->dtpv_pops.dtps_enable !=
(void (*)(void *, dtrace_id_t, void *))dtrace_nullop);
(int (*)(void *, dtrace_id_t, void *))dtrace_enable_nullop);
mutex_enter(&dtrace_provider_lock);
mutex_enter(&dtrace_lock);
@ -8103,7 +8166,7 @@ dtrace_difo_validate(dtrace_difo_t *dp, dtrace_vstate_t *vstate, uint_t nregs,
break;
default:
err += efunc(dp->dtdo_len - 1, "bad return size");
err += efunc(dp->dtdo_len - 1, "bad return size\n");
}
}
@ -9096,7 +9159,7 @@ dtrace_ecb_add(dtrace_state_t *state, dtrace_probe_t *probe)
return (ecb);
}
static void
static int
dtrace_ecb_enable(dtrace_ecb_t *ecb)
{
dtrace_probe_t *probe = ecb->dte_probe;
@ -9109,7 +9172,7 @@ dtrace_ecb_enable(dtrace_ecb_t *ecb)
/*
* This is the NULL probe -- there's nothing to do.
*/
return;
return (0);
}
if (probe->dtpr_ecb == NULL) {
@ -9123,8 +9186,8 @@ dtrace_ecb_enable(dtrace_ecb_t *ecb)
if (ecb->dte_predicate != NULL)
probe->dtpr_predcache = ecb->dte_predicate->dtp_cacheid;
prov->dtpv_pops.dtps_enable(prov->dtpv_arg,
probe->dtpr_id, probe->dtpr_arg);
return (prov->dtpv_pops.dtps_enable(prov->dtpv_arg,
probe->dtpr_id, probe->dtpr_arg));
} else {
/*
* This probe is already active. Swing the last pointer to
@ -9137,6 +9200,7 @@ dtrace_ecb_enable(dtrace_ecb_t *ecb)
probe->dtpr_predcache = 0;
dtrace_sync();
return (0);
}
}
@ -9920,7 +9984,9 @@ dtrace_ecb_create_enable(dtrace_probe_t *probe, void *arg)
if ((ecb = dtrace_ecb_create(state, probe, enab)) == NULL)
return (DTRACE_MATCH_DONE);
dtrace_ecb_enable(ecb);
if (dtrace_ecb_enable(ecb) < 0)
return (DTRACE_MATCH_FAIL);
return (DTRACE_MATCH_NEXT);
}
@ -10557,6 +10623,7 @@ dtrace_enabling_destroy(dtrace_enabling_t *enab)
ASSERT(enab->dten_vstate->dtvs_state != NULL);
ASSERT(enab->dten_vstate->dtvs_state->dts_nretained > 0);
enab->dten_vstate->dtvs_state->dts_nretained--;
dtrace_retained_gen++;
}
if (enab->dten_prev == NULL) {
@ -10599,6 +10666,7 @@ dtrace_enabling_retain(dtrace_enabling_t *enab)
return (ENOSPC);
state->dts_nretained++;
dtrace_retained_gen++;
if (dtrace_retained == NULL) {
dtrace_retained = enab;
@ -10713,7 +10781,7 @@ static int
dtrace_enabling_match(dtrace_enabling_t *enab, int *nmatched)
{
int i = 0;
int matched = 0;
int total_matched = 0, matched = 0;
ASSERT(MUTEX_HELD(&cpu_lock));
ASSERT(MUTEX_HELD(&dtrace_lock));
@ -10724,7 +10792,14 @@ dtrace_enabling_match(dtrace_enabling_t *enab, int *nmatched)
enab->dten_current = ep;
enab->dten_error = 0;
matched += dtrace_probe_enable(&ep->dted_probe, enab);
/*
* If a provider failed to enable a probe then get out and
* let the consumer know we failed.
*/
if ((matched = dtrace_probe_enable(&ep->dted_probe, enab)) < 0)
return (EBUSY);
total_matched += matched;
if (enab->dten_error != 0) {
/*
@ -10752,7 +10827,7 @@ dtrace_enabling_match(dtrace_enabling_t *enab, int *nmatched)
enab->dten_probegen = dtrace_probegen;
if (nmatched != NULL)
*nmatched = matched;
*nmatched = total_matched;
return (0);
}
@ -10766,13 +10841,22 @@ dtrace_enabling_matchall(void)
mutex_enter(&dtrace_lock);
/*
* Because we can be called after dtrace_detach() has been called, we
* cannot assert that there are retained enablings. We can safely
* load from dtrace_retained, however: the taskq_destroy() at the
* end of dtrace_detach() will block pending our completion.
* Iterate over all retained enablings to see if any probes match
* against them. We only perform this operation on enablings for which
* we have sufficient permissions by virtue of being in the global zone
* or in the same zone as the DTrace client. Because we can be called
* after dtrace_detach() has been called, we cannot assert that there
* are retained enablings. We can safely load from dtrace_retained,
* however: the taskq_destroy() at the end of dtrace_detach() will
* block pending our completion.
*/
for (enab = dtrace_retained; enab != NULL; enab = enab->dten_next)
(void) dtrace_enabling_match(enab, NULL);
for (enab = dtrace_retained; enab != NULL; enab = enab->dten_next) {
cred_t *cr = enab->dten_vstate->dtvs_state->dts_cred.dcr_cred;
if (INGLOBALZONE(curproc) ||
cr != NULL && getzoneid() == crgetzoneid(cr))
(void) dtrace_enabling_match(enab, NULL);
}
mutex_exit(&dtrace_lock);
mutex_exit(&cpu_lock);
@ -10830,6 +10914,7 @@ dtrace_enabling_provide(dtrace_provider_t *prv)
{
int i, all = 0;
dtrace_probedesc_t desc;
dtrace_genid_t gen;
ASSERT(MUTEX_HELD(&dtrace_lock));
ASSERT(MUTEX_HELD(&dtrace_provider_lock));
@ -10840,15 +10925,25 @@ dtrace_enabling_provide(dtrace_provider_t *prv)
}
do {
dtrace_enabling_t *enab = dtrace_retained;
dtrace_enabling_t *enab;
void *parg = prv->dtpv_arg;
for (; enab != NULL; enab = enab->dten_next) {
retry:
gen = dtrace_retained_gen;
for (enab = dtrace_retained; enab != NULL;
enab = enab->dten_next) {
for (i = 0; i < enab->dten_ndesc; i++) {
desc = enab->dten_desc[i]->dted_probe;
mutex_exit(&dtrace_lock);
prv->dtpv_pops.dtps_provide(parg, &desc);
mutex_enter(&dtrace_lock);
/*
* Process the retained enablings again if
* they have changed while we weren't holding
* dtrace_lock.
*/
if (gen != dtrace_retained_gen)
goto retry;
}
}
} while (all && (prv = prv->dtpv_next) != NULL);
@ -10970,7 +11065,8 @@ dtrace_dof_copyin(uintptr_t uarg, int *errp)
dof = kmem_alloc(hdr.dofh_loadsz, KM_SLEEP);
if (copyin((void *)uarg, dof, hdr.dofh_loadsz) != 0) {
if (copyin((void *)uarg, dof, hdr.dofh_loadsz) != 0 ||
dof->dofh_loadsz != hdr.dofh_loadsz) {
kmem_free(dof, hdr.dofh_loadsz);
*errp = EFAULT;
return (NULL);
@ -11698,6 +11794,13 @@ dtrace_dof_slurp(dof_hdr_t *dof, dtrace_vstate_t *vstate, cred_t *cr,
}
}
if (DOF_SEC_ISLOADABLE(sec->dofs_type) &&
!(sec->dofs_flags & DOF_SECF_LOAD)) {
dtrace_dof_error(dof, "loadable section with load "
"flag unset");
return (-1);
}
if (!(sec->dofs_flags & DOF_SECF_LOAD))
continue; /* just ignore non-loadable sections */
@ -14390,7 +14493,8 @@ dtrace_open(dev_t *devp, int flag, int otyp, cred_t *cred_p)
* If this wasn't an open with the "helper" minor, then it must be
* the "dtrace" minor.
*/
ASSERT(getminor(*devp) == DTRACEMNRN_DTRACE);
if (getminor(*devp) != DTRACEMNRN_DTRACE)
return (ENXIO);
/*
* If no DTRACE_PRIV_* bits are set in the credential, then the
@ -14427,7 +14531,7 @@ dtrace_open(dev_t *devp, int flag, int otyp, cred_t *cred_p)
mutex_exit(&cpu_lock);
if (state == NULL) {
if (--dtrace_opens == 0)
if (--dtrace_opens == 0 && dtrace_anon.dta_enabling == NULL)
(void) kdi_dtrace_set(KDI_DTSET_DTRACE_DEACTIVATE);
mutex_exit(&dtrace_lock);
return (EAGAIN);
@ -14463,7 +14567,12 @@ dtrace_close(dev_t dev, int flag, int otyp, cred_t *cred_p)
dtrace_state_destroy(state);
ASSERT(dtrace_opens > 0);
if (--dtrace_opens == 0)
/*
* Only relinquish control of the kernel debugger interface when there
* are no consumers and no anonymous enablings.
*/
if (--dtrace_opens == 0 && dtrace_anon.dta_enabling == NULL)
(void) kdi_dtrace_set(KDI_DTSET_DTRACE_DEACTIVATE);
mutex_exit(&dtrace_lock);
@ -15458,7 +15567,8 @@ static struct dev_ops dtrace_ops = {
nodev, /* reset */
&dtrace_cb_ops, /* driver operations */
NULL, /* bus operations */
nodev /* dev power */
nodev, /* dev power */
ddi_quiesce_not_needed, /* quiesce */
};
static struct modldrv modldrv = {

uts/common/dtrace/fasttrap.c

@ -20,11 +20,10 @@
*/
/*
* Copyright 2008 Sun Microsystems, Inc. All rights reserved.
* Copyright 2009 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#pragma ident "%Z%%M% %I% %E% SMI"
#include <sys/atomic.h>
#include <sys/errno.h>
@ -876,7 +875,7 @@ fasttrap_disable_callbacks(void)
}
/*ARGSUSED*/
static void
static int
fasttrap_pid_enable(void *arg, dtrace_id_t id, void *parg)
{
fasttrap_probe_t *probe = parg;
@ -904,7 +903,7 @@ fasttrap_pid_enable(void *arg, dtrace_id_t id, void *parg)
* provider can't go away while we're in this code path.
*/
if (probe->ftp_prov->ftp_retired)
return;
return (0);
/*
* If we can't find the process, it may be that we're in the context of
@ -913,7 +912,7 @@ fasttrap_pid_enable(void *arg, dtrace_id_t id, void *parg)
*/
if ((p = sprlock(probe->ftp_pid)) == NULL) {
if ((curproc->p_flag & SFORKING) == 0)
return;
return (0);
mutex_enter(&pidlock);
p = prfind(probe->ftp_pid);
@ -975,7 +974,7 @@ fasttrap_pid_enable(void *arg, dtrace_id_t id, void *parg)
* drop our reference on the trap table entry.
*/
fasttrap_disable_callbacks();
return;
return (0);
}
}
@ -983,6 +982,7 @@ fasttrap_pid_enable(void *arg, dtrace_id_t id, void *parg)
sprunlock(p);
probe->ftp_enabled = 1;
return (0);
}
/*ARGSUSED*/
@ -1946,7 +1946,8 @@ fasttrap_ioctl(dev_t dev, int cmd, intptr_t arg, int md, cred_t *cr, int *rv)
probe = kmem_alloc(size, KM_SLEEP);
if (copyin(uprobe, probe, size) != 0) {
if (copyin(uprobe, probe, size) != 0 ||
probe->ftps_noffs != noffs) {
kmem_free(probe, size);
return (EFAULT);
}
@ -2044,13 +2045,6 @@ err:
tp->ftt_proc->ftpc_acount != 0)
break;
/*
* The count of active providers can only be
* decremented (i.e. to zero) during exec, exit, and
* removal of a meta provider so it should be
* impossible to drop the count during this operation().
*/
ASSERT(tp->ftt_proc->ftpc_acount != 0);
tp = tp->ftt_next;
}
@ -2346,7 +2340,8 @@ static struct dev_ops fasttrap_ops = {
nodev, /* reset */
&fasttrap_cb_ops, /* driver operations */
NULL, /* bus operations */
nodev /* dev power */
nodev, /* dev power */
ddi_quiesce_not_needed, /* quiesce */
};
/*

uts/common/dtrace/lockstat.c

@ -19,11 +19,10 @@
* CDDL HEADER END
*/
/*
* Copyright 2008 Sun Microsystems, Inc. All rights reserved.
* Copyright 2009 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#pragma ident "%Z%%M% %I% %E% SMI"
#include <sys/types.h>
#include <sys/param.h>
@ -84,7 +83,7 @@ static kmutex_t lockstat_test; /* for testing purposes only */
static dtrace_provider_id_t lockstat_id;
/*ARGSUSED*/
static void
static int
lockstat_enable(void *arg, dtrace_id_t id, void *parg)
{
lockstat_probe_t *probe = parg;
@ -103,6 +102,7 @@ lockstat_enable(void *arg, dtrace_id_t id, void *parg)
*/
mutex_enter(&lockstat_test);
mutex_exit(&lockstat_test);
return (0);
}
/*ARGSUSED*/
@ -310,11 +310,13 @@ static struct dev_ops lockstat_ops = {
nulldev, /* reset */
&lockstat_cb_ops, /* cb_ops */
NULL, /* bus_ops */
NULL, /* power */
ddi_quiesce_not_needed, /* quiesce */
};
static struct modldrv modldrv = {
&mod_driverops, /* Type of module. This one is a driver */
"Lock Statistics %I%", /* name of module */
"Lock Statistics", /* name of module */
&lockstat_ops, /* driver ops */
};

uts/common/dtrace/profile.c

@ -19,11 +19,10 @@
* CDDL HEADER END
*/
/*
* Copyright 2007 Sun Microsystems, Inc. All rights reserved.
* Copyright 2009 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#pragma ident "%Z%%M% %I% %E% SMI"
#include <sys/errno.h>
#include <sys/stat.h>
@ -361,7 +360,7 @@ profile_offline(void *arg, cpu_t *cpu, void *oarg)
}
/*ARGSUSED*/
static void
static int
profile_enable(void *arg, dtrace_id_t id, void *parg)
{
profile_probe_t *prof = parg;
@ -391,6 +390,7 @@ profile_enable(void *arg, dtrace_id_t id, void *parg)
} else {
prof->prof_cyclic = cyclic_add_omni(&omni);
}
return (0);
}
/*ARGSUSED*/
@ -539,7 +539,8 @@ static struct dev_ops profile_ops = {
nodev, /* reset */
&profile_cb_ops, /* driver operations */
NULL, /* bus operations */
nodev /* dev power */
nodev, /* dev power */
ddi_quiesce_not_needed, /* quiesce */
};
/*

uts/common/dtrace/sdt_subr.c

@ -19,12 +19,9 @@
* CDDL HEADER END
*/
/*
* Copyright 2008 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
* Copyright (c) 2004, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#pragma ident "%Z%%M% %I% %E% SMI"
#include <sys/sdt_impl.h>
static dtrace_pattr_t vtrace_attr = {
@ -43,6 +40,14 @@ static dtrace_pattr_t info_attr = {
{ DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_ISA },
};
static dtrace_pattr_t fc_attr = {
{ DTRACE_STABILITY_EVOLVING, DTRACE_STABILITY_EVOLVING, DTRACE_CLASS_ISA },
{ DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_UNKNOWN },
{ DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_UNKNOWN },
{ DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_ISA },
{ DTRACE_STABILITY_EVOLVING, DTRACE_STABILITY_EVOLVING, DTRACE_CLASS_ISA },
};
static dtrace_pattr_t fpu_attr = {
{ DTRACE_STABILITY_EVOLVING, DTRACE_STABILITY_EVOLVING, DTRACE_CLASS_ISA },
{ DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_UNKNOWN },
@ -83,6 +88,14 @@ static dtrace_pattr_t xpv_attr = {
{ DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_PLATFORM },
};
static dtrace_pattr_t iscsi_attr = {
{ DTRACE_STABILITY_EVOLVING, DTRACE_STABILITY_EVOLVING, DTRACE_CLASS_ISA },
{ DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_UNKNOWN },
{ DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_UNKNOWN },
{ DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_ISA },
{ DTRACE_STABILITY_EVOLVING, DTRACE_STABILITY_EVOLVING, DTRACE_CLASS_ISA },
};
sdt_provider_t sdt_providers[] = {
{ "vtrace", "__vtrace_", &vtrace_attr, 0 },
{ "sysinfo", "__cpu_sysinfo_", &info_attr, 0 },
@ -91,11 +104,17 @@ sdt_provider_t sdt_providers[] = {
{ "sched", "__sched_", &stab_attr, 0 },
{ "proc", "__proc_", &stab_attr, 0 },
{ "io", "__io_", &stab_attr, 0 },
{ "ip", "__ip_", &stab_attr, 0 },
{ "tcp", "__tcp_", &stab_attr, 0 },
{ "udp", "__udp_", &stab_attr, 0 },
{ "mib", "__mib_", &stab_attr, 0 },
{ "fsinfo", "__fsinfo_", &fsinfo_attr, 0 },
{ "iscsi", "__iscsi_", &iscsi_attr, 0 },
{ "nfsv3", "__nfsv3_", &stab_attr, 0 },
{ "nfsv4", "__nfsv4_", &stab_attr, 0 },
{ "xpv", "__xpv_", &xpv_attr, 0 },
{ "fc", "__fc_", &fc_attr, 0 },
{ "srp", "__srp_", &fc_attr, 0 },
{ "sysevent", "__sysevent_", &stab_attr, 0 },
{ "sdt", NULL, &sdt_attr, 0 },
{ NULL }
@ -169,6 +188,73 @@ sdt_argdesc_t sdt_args[] = {
{ "fsinfo", NULL, 0, 0, "vnode_t *", "fileinfo_t *" },
{ "fsinfo", NULL, 1, 1, "int", "int" },
{ "iscsi", "async-send", 0, 0, "idm_conn_t *", "conninfo_t *" },
{ "iscsi", "async-send", 1, 1, "iscsi_async_evt_hdr_t *",
"iscsiinfo_t *" },
{ "iscsi", "login-command", 0, 0, "idm_conn_t *", "conninfo_t *" },
{ "iscsi", "login-command", 1, 1, "iscsi_login_hdr_t *",
"iscsiinfo_t *" },
{ "iscsi", "login-response", 0, 0, "idm_conn_t *", "conninfo_t *" },
{ "iscsi", "login-response", 1, 1, "iscsi_login_rsp_hdr_t *",
"iscsiinfo_t *" },
{ "iscsi", "logout-command", 0, 0, "idm_conn_t *", "conninfo_t *" },
{ "iscsi", "logout-command", 1, 1, "iscsi_logout_hdr_t *",
"iscsiinfo_t *" },
{ "iscsi", "logout-response", 0, 0, "idm_conn_t *", "conninfo_t *" },
{ "iscsi", "logout-response", 1, 1, "iscsi_logout_rsp_hdr_t *",
"iscsiinfo_t *" },
{ "iscsi", "data-request", 0, 0, "idm_conn_t *", "conninfo_t *" },
{ "iscsi", "data-request", 1, 1, "iscsi_rtt_hdr_t *",
"iscsiinfo_t *" },
{ "iscsi", "data-send", 0, 0, "idm_conn_t *", "conninfo_t *" },
{ "iscsi", "data-send", 1, 1, "iscsi_data_rsp_hdr_t *",
"iscsiinfo_t *" },
{ "iscsi", "data-receive", 0, 0, "idm_conn_t *", "conninfo_t *" },
{ "iscsi", "data-receive", 1, 1, "iscsi_data_hdr_t *",
"iscsiinfo_t *" },
{ "iscsi", "nop-send", 0, 0, "idm_conn_t *", "conninfo_t *" },
{ "iscsi", "nop-send", 1, 1, "iscsi_nop_in_hdr_t *", "iscsiinfo_t *" },
{ "iscsi", "nop-receive", 0, 0, "idm_conn_t *", "conninfo_t *" },
{ "iscsi", "nop-receive", 1, 1, "iscsi_nop_out_hdr_t *",
"iscsiinfo_t *" },
{ "iscsi", "scsi-command", 0, 0, "idm_conn_t *", "conninfo_t *" },
{ "iscsi", "scsi-command", 1, 1, "iscsi_scsi_cmd_hdr_t *",
"iscsiinfo_t *" },
{ "iscsi", "scsi-command", 2, 2, "scsi_task_t *", "scsicmd_t *" },
{ "iscsi", "scsi-response", 0, 0, "idm_conn_t *", "conninfo_t *" },
{ "iscsi", "scsi-response", 1, 1, "iscsi_scsi_rsp_hdr_t *",
"iscsiinfo_t *" },
{ "iscsi", "task-command", 0, 0, "idm_conn_t *", "conninfo_t *" },
{ "iscsi", "task-command", 1, 1, "iscsi_scsi_task_mgt_hdr_t *",
"iscsiinfo_t *" },
{ "iscsi", "task-response", 0, 0, "idm_conn_t *", "conninfo_t *" },
{ "iscsi", "task-response", 1, 1, "iscsi_scsi_task_mgt_rsp_hdr_t *",
"iscsiinfo_t *" },
{ "iscsi", "text-command", 0, 0, "idm_conn_t *", "conninfo_t *" },
{ "iscsi", "text-command", 1, 1, "iscsi_text_hdr_t *",
"iscsiinfo_t *" },
{ "iscsi", "text-response", 0, 0, "idm_conn_t *", "conninfo_t *" },
{ "iscsi", "text-response", 1, 1, "iscsi_text_rsp_hdr_t *",
"iscsiinfo_t *" },
{ "iscsi", "xfer-start", 0, 0, "idm_conn_t *", "conninfo_t *" },
{ "iscsi", "xfer-start", 1, 0, "idm_conn_t *", "iscsiinfo_t *" },
{ "iscsi", "xfer-start", 2, 1, "uintptr_t", "xferinfo_t *" },
{ "iscsi", "xfer-start", 3, 2, "uint32_t"},
{ "iscsi", "xfer-start", 4, 3, "uintptr_t"},
{ "iscsi", "xfer-start", 5, 4, "uint32_t"},
{ "iscsi", "xfer-start", 6, 5, "uint32_t"},
{ "iscsi", "xfer-start", 7, 6, "uint32_t"},
{ "iscsi", "xfer-start", 8, 7, "int"},
{ "iscsi", "xfer-done", 0, 0, "idm_conn_t *", "conninfo_t *" },
{ "iscsi", "xfer-done", 1, 0, "idm_conn_t *", "iscsiinfo_t *" },
{ "iscsi", "xfer-done", 2, 1, "uintptr_t", "xferinfo_t *" },
{ "iscsi", "xfer-done", 3, 2, "uint32_t"},
{ "iscsi", "xfer-done", 4, 3, "uintptr_t"},
{ "iscsi", "xfer-done", 5, 4, "uint32_t"},
{ "iscsi", "xfer-done", 6, 5, "uint32_t"},
{ "iscsi", "xfer-done", 7, 6, "uint32_t"},
{ "iscsi", "xfer-done", 8, 7, "int"},
{ "nfsv3", "op-getattr-start", 0, 0, "struct svc_req *",
"conninfo_t *" },
{ "nfsv3", "op-getattr-start", 1, 1, "nfsv3oparg_t *",
@ -788,6 +874,75 @@ sdt_argdesc_t sdt_args[] = {
"nfsv4cbinfo_t *" },
{ "nfsv4", "cb-recall-done", 2, 2, "CB_RECALL4res *" },
{ "ip", "send", 0, 0, "mblk_t *", "pktinfo_t *" },
{ "ip", "send", 1, 1, "conn_t *", "csinfo_t *" },
{ "ip", "send", 2, 2, "void_ip_t *", "ipinfo_t *" },
{ "ip", "send", 3, 3, "__dtrace_ipsr_ill_t *", "ifinfo_t *" },
{ "ip", "send", 4, 4, "ipha_t *", "ipv4info_t *" },
{ "ip", "send", 5, 5, "ip6_t *", "ipv6info_t *" },
{ "ip", "send", 6, 6, "int" }, /* used by __dtrace_ipsr_ill_t */
{ "ip", "receive", 0, 0, "mblk_t *", "pktinfo_t *" },
{ "ip", "receive", 1, 1, "conn_t *", "csinfo_t *" },
{ "ip", "receive", 2, 2, "void_ip_t *", "ipinfo_t *" },
{ "ip", "receive", 3, 3, "__dtrace_ipsr_ill_t *", "ifinfo_t *" },
{ "ip", "receive", 4, 4, "ipha_t *", "ipv4info_t *" },
{ "ip", "receive", 5, 5, "ip6_t *", "ipv6info_t *" },
{ "ip", "receive", 6, 6, "int" }, /* used by __dtrace_ipsr_ill_t */
{ "tcp", "connect-established", 0, 0, "mblk_t *", "pktinfo_t *" },
{ "tcp", "connect-established", 1, 1, "ip_xmit_attr_t *",
"csinfo_t *" },
{ "tcp", "connect-established", 2, 2, "void_ip_t *", "ipinfo_t *" },
{ "tcp", "connect-established", 3, 3, "tcp_t *", "tcpsinfo_t *" },
{ "tcp", "connect-established", 4, 4, "tcph_t *", "tcpinfo_t *" },
{ "tcp", "connect-refused", 0, 0, "mblk_t *", "pktinfo_t *" },
{ "tcp", "connect-refused", 1, 1, "ip_xmit_attr_t *", "csinfo_t *" },
{ "tcp", "connect-refused", 2, 2, "void_ip_t *", "ipinfo_t *" },
{ "tcp", "connect-refused", 3, 3, "tcp_t *", "tcpsinfo_t *" },
{ "tcp", "connect-refused", 4, 4, "tcph_t *", "tcpinfo_t *" },
{ "tcp", "connect-request", 0, 0, "mblk_t *", "pktinfo_t *" },
{ "tcp", "connect-request", 1, 1, "ip_xmit_attr_t *", "csinfo_t *" },
{ "tcp", "connect-request", 2, 2, "void_ip_t *", "ipinfo_t *" },
{ "tcp", "connect-request", 3, 3, "tcp_t *", "tcpsinfo_t *" },
{ "tcp", "connect-request", 4, 4, "tcph_t *", "tcpinfo_t *" },
{ "tcp", "accept-established", 0, 0, "mblk_t *", "pktinfo_t *" },
{ "tcp", "accept-established", 1, 1, "ip_xmit_attr_t *", "csinfo_t *" },
{ "tcp", "accept-established", 2, 2, "void_ip_t *", "ipinfo_t *" },
{ "tcp", "accept-established", 3, 3, "tcp_t *", "tcpsinfo_t *" },
{ "tcp", "accept-established", 4, 4, "tcph_t *", "tcpinfo_t *" },
{ "tcp", "accept-refused", 0, 0, "mblk_t *", "pktinfo_t *" },
{ "tcp", "accept-refused", 1, 1, "ip_xmit_attr_t *", "csinfo_t *" },
{ "tcp", "accept-refused", 2, 2, "void_ip_t *", "ipinfo_t *" },
{ "tcp", "accept-refused", 3, 3, "tcp_t *", "tcpsinfo_t *" },
{ "tcp", "accept-refused", 4, 4, "tcph_t *", "tcpinfo_t *" },
{ "tcp", "state-change", 0, 0, "void", "void" },
{ "tcp", "state-change", 1, 1, "ip_xmit_attr_t *", "csinfo_t *" },
{ "tcp", "state-change", 2, 2, "void", "void" },
{ "tcp", "state-change", 3, 3, "tcp_t *", "tcpsinfo_t *" },
{ "tcp", "state-change", 4, 4, "void", "void" },
{ "tcp", "state-change", 5, 5, "int32_t", "tcplsinfo_t *" },
{ "tcp", "send", 0, 0, "mblk_t *", "pktinfo_t *" },
{ "tcp", "send", 1, 1, "ip_xmit_attr_t *", "csinfo_t *" },
{ "tcp", "send", 2, 2, "__dtrace_tcp_void_ip_t *", "ipinfo_t *" },
{ "tcp", "send", 3, 3, "tcp_t *", "tcpsinfo_t *" },
{ "tcp", "send", 4, 4, "__dtrace_tcp_tcph_t *", "tcpinfo_t *" },
{ "tcp", "receive", 0, 0, "mblk_t *", "pktinfo_t *" },
{ "tcp", "receive", 1, 1, "ip_xmit_attr_t *", "csinfo_t *" },
{ "tcp", "receive", 2, 2, "__dtrace_tcp_void_ip_t *", "ipinfo_t *" },
{ "tcp", "receive", 3, 3, "tcp_t *", "tcpsinfo_t *" },
{ "tcp", "receive", 4, 4, "__dtrace_tcp_tcph_t *", "tcpinfo_t *" },
{ "udp", "send", 0, 0, "mblk_t *", "pktinfo_t *" },
{ "udp", "send", 1, 1, "ip_xmit_attr_t *", "csinfo_t *" },
{ "udp", "send", 2, 2, "void_ip_t *", "ipinfo_t *" },
{ "udp", "send", 3, 3, "udp_t *", "udpsinfo_t *" },
{ "udp", "send", 4, 4, "udpha_t *", "udpinfo_t *" },
{ "udp", "receive", 0, 0, "mblk_t *", "pktinfo_t *" },
{ "udp", "receive", 1, 1, "ip_xmit_attr_t *", "csinfo_t *" },
{ "udp", "receive", 2, 2, "void_ip_t *", "ipinfo_t *" },
{ "udp", "receive", 3, 3, "udp_t *", "udpsinfo_t *" },
{ "udp", "receive", 4, 4, "udpha_t *", "udpinfo_t *" },
{ "sysevent", "post", 0, 0, "evch_bind_t *", "syseventchaninfo_t *" },
{ "sysevent", "post", 1, 1, "sysevent_impl_t *", "syseventinfo_t *" },
@@ -848,6 +1003,154 @@ sdt_argdesc_t sdt_args[] = {
{ "xpv", "setvcpucontext-end", 0, 0, "int" },
{ "xpv", "setvcpucontext-start", 0, 0, "domid_t" },
{ "xpv", "setvcpucontext-start", 1, 1, "vcpu_guest_context_t *" },
{ "srp", "service-up", 0, 0, "srpt_session_t *", "conninfo_t *" },
{ "srp", "service-up", 1, 0, "srpt_session_t *", "srp_portinfo_t *" },
{ "srp", "service-down", 0, 0, "srpt_session_t *", "conninfo_t *" },
{ "srp", "service-down", 1, 0, "srpt_session_t *",
"srp_portinfo_t *" },
{ "srp", "login-command", 0, 0, "srpt_session_t *", "conninfo_t *" },
{ "srp", "login-command", 1, 0, "srpt_session_t *",
"srp_portinfo_t *" },
{ "srp", "login-command", 2, 1, "srp_login_req_t *",
"srp_logininfo_t *" },
{ "srp", "login-response", 0, 0, "srpt_session_t *", "conninfo_t *" },
{ "srp", "login-response", 1, 0, "srpt_session_t *",
"srp_portinfo_t *" },
{ "srp", "login-response", 2, 1, "srp_login_rsp_t *",
"srp_logininfo_t *" },
{ "srp", "login-response", 3, 2, "srp_login_rej_t *" },
{ "srp", "logout-command", 0, 0, "srpt_channel_t *", "conninfo_t *" },
{ "srp", "logout-command", 1, 0, "srpt_channel_t *",
"srp_portinfo_t *" },
{ "srp", "task-command", 0, 0, "srpt_channel_t *", "conninfo_t *" },
{ "srp", "task-command", 1, 0, "srpt_channel_t *",
"srp_portinfo_t *" },
{ "srp", "task-command", 2, 1, "srp_cmd_req_t *", "srp_taskinfo_t *" },
{ "srp", "task-response", 0, 0, "srpt_channel_t *", "conninfo_t *" },
{ "srp", "task-response", 1, 0, "srpt_channel_t *",
"srp_portinfo_t *" },
{ "srp", "task-response", 2, 1, "srp_rsp_t *", "srp_taskinfo_t *" },
{ "srp", "task-response", 3, 2, "scsi_task_t *" },
{ "srp", "task-response", 4, 3, "int8_t" },
{ "srp", "scsi-command", 0, 0, "srpt_channel_t *", "conninfo_t *" },
{ "srp", "scsi-command", 1, 0, "srpt_channel_t *",
"srp_portinfo_t *" },
{ "srp", "scsi-command", 2, 1, "scsi_task_t *", "scsicmd_t *" },
{ "srp", "scsi-command", 3, 2, "srp_cmd_req_t *", "srp_taskinfo_t *" },
{ "srp", "scsi-response", 0, 0, "srpt_channel_t *", "conninfo_t *" },
{ "srp", "scsi-response", 1, 0, "srpt_channel_t *",
"srp_portinfo_t *" },
{ "srp", "scsi-response", 2, 1, "srp_rsp_t *", "srp_taskinfo_t *" },
{ "srp", "scsi-response", 3, 2, "scsi_task_t *" },
{ "srp", "scsi-response", 4, 3, "int8_t" },
{ "srp", "xfer-start", 0, 0, "srpt_channel_t *", "conninfo_t *" },
{ "srp", "xfer-start", 1, 0, "srpt_channel_t *",
"srp_portinfo_t *" },
{ "srp", "xfer-start", 2, 1, "ibt_wr_ds_t *", "xferinfo_t *" },
{ "srp", "xfer-start", 3, 2, "srpt_iu_t *", "srp_taskinfo_t *" },
{ "srp", "xfer-start", 4, 3, "ibt_send_wr_t *"},
{ "srp", "xfer-start", 5, 4, "uint32_t" },
{ "srp", "xfer-start", 6, 5, "uint32_t" },
{ "srp", "xfer-start", 7, 6, "uint32_t" },
{ "srp", "xfer-start", 8, 7, "uint32_t" },
{ "srp", "xfer-done", 0, 0, "srpt_channel_t *", "conninfo_t *" },
{ "srp", "xfer-done", 1, 0, "srpt_channel_t *",
"srp_portinfo_t *" },
{ "srp", "xfer-done", 2, 1, "ibt_wr_ds_t *", "xferinfo_t *" },
{ "srp", "xfer-done", 3, 2, "srpt_iu_t *", "srp_taskinfo_t *" },
{ "srp", "xfer-done", 4, 3, "ibt_send_wr_t *"},
{ "srp", "xfer-done", 5, 4, "uint32_t" },
{ "srp", "xfer-done", 6, 5, "uint32_t" },
{ "srp", "xfer-done", 7, 6, "uint32_t" },
{ "srp", "xfer-done", 8, 7, "uint32_t" },
{ "fc", "link-up", 0, 0, "fct_i_local_port_t *", "conninfo_t *" },
{ "fc", "link-down", 0, 0, "fct_i_local_port_t *", "conninfo_t *" },
{ "fc", "fabric-login-start", 0, 0, "fct_i_local_port_t *",
"conninfo_t *" },
{ "fc", "fabric-login-start", 1, 0, "fct_i_local_port_t *",
"fc_port_info_t *" },
{ "fc", "fabric-login-end", 0, 0, "fct_i_local_port_t *",
"conninfo_t *" },
{ "fc", "fabric-login-end", 1, 0, "fct_i_local_port_t *",
"fc_port_info_t *" },
{ "fc", "rport-login-start", 0, 0, "fct_cmd_t *",
"conninfo_t *" },
{ "fc", "rport-login-start", 1, 1, "fct_local_port_t *",
"fc_port_info_t *" },
{ "fc", "rport-login-start", 2, 2, "fct_i_remote_port_t *",
"fc_port_info_t *" },
{ "fc", "rport-login-start", 3, 3, "int", "int" },
{ "fc", "rport-login-end", 0, 0, "fct_cmd_t *",
"conninfo_t *" },
{ "fc", "rport-login-end", 1, 1, "fct_local_port_t *",
"fc_port_info_t *" },
{ "fc", "rport-login-end", 2, 2, "fct_i_remote_port_t *",
"fc_port_info_t *" },
{ "fc", "rport-login-end", 3, 3, "int", "int" },
{ "fc", "rport-login-end", 4, 4, "int", "int" },
{ "fc", "rport-logout-start", 0, 0, "fct_cmd_t *",
"conninfo_t *" },
{ "fc", "rport-logout-start", 1, 1, "fct_local_port_t *",
"fc_port_info_t *" },
{ "fc", "rport-logout-start", 2, 2, "fct_i_remote_port_t *",
"fc_port_info_t *" },
{ "fc", "rport-logout-start", 3, 3, "int", "int" },
{ "fc", "rport-logout-end", 0, 0, "fct_cmd_t *",
"conninfo_t *" },
{ "fc", "rport-logout-end", 1, 1, "fct_local_port_t *",
"fc_port_info_t *" },
{ "fc", "rport-logout-end", 2, 2, "fct_i_remote_port_t *",
"fc_port_info_t *" },
{ "fc", "rport-logout-end", 3, 3, "int", "int" },
{ "fc", "scsi-command", 0, 0, "fct_cmd_t *",
"conninfo_t *" },
{ "fc", "scsi-command", 1, 1, "fct_i_local_port_t *",
"fc_port_info_t *" },
{ "fc", "scsi-command", 2, 2, "scsi_task_t *",
"scsicmd_t *" },
{ "fc", "scsi-command", 3, 3, "fct_i_remote_port_t *",
"fc_port_info_t *" },
{ "fc", "scsi-response", 0, 0, "fct_cmd_t *",
"conninfo_t *" },
{ "fc", "scsi-response", 1, 1, "fct_i_local_port_t *",
"fc_port_info_t *" },
{ "fc", "scsi-response", 2, 2, "scsi_task_t *",
"scsicmd_t *" },
{ "fc", "scsi-response", 3, 3, "fct_i_remote_port_t *",
"fc_port_info_t *" },
{ "fc", "xfer-start", 0, 0, "fct_cmd_t *",
"conninfo_t *" },
{ "fc", "xfer-start", 1, 1, "fct_i_local_port_t *",
"fc_port_info_t *" },
{ "fc", "xfer-start", 2, 2, "scsi_task_t *",
"scsicmd_t *" },
{ "fc", "xfer-start", 3, 3, "fct_i_remote_port_t *",
"fc_port_info_t *" },
{ "fc", "xfer-start", 4, 4, "stmf_data_buf_t *",
"fc_xferinfo_t *" },
{ "fc", "xfer-done", 0, 0, "fct_cmd_t *",
"conninfo_t *" },
{ "fc", "xfer-done", 1, 1, "fct_i_local_port_t *",
"fc_port_info_t *" },
{ "fc", "xfer-done", 2, 2, "scsi_task_t *",
"scsicmd_t *" },
{ "fc", "xfer-done", 3, 3, "fct_i_remote_port_t *",
"fc_port_info_t *" },
{ "fc", "xfer-done", 4, 4, "stmf_data_buf_t *",
"fc_xferinfo_t *" },
{ "fc", "rscn-receive", 0, 0, "fct_i_local_port_t *",
"conninfo_t *" },
{ "fc", "rscn-receive", 1, 1, "int", "int"},
{ "fc", "abts-receive", 0, 0, "fct_cmd_t *",
"conninfo_t *" },
{ "fc", "abts-receive", 1, 1, "fct_i_local_port_t *",
"fc_port_info_t *" },
{ "fc", "abts-receive", 2, 2, "fct_i_remote_port_t *",
"fc_port_info_t *" },
{ NULL }
};

uts/common/dtrace/systrace.c:

@@ -19,11 +19,10 @@
* CDDL HEADER END
*/
/*
- * Copyright 2006 Sun Microsystems, Inc. All rights reserved.
+ * Copyright 2009 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
-#pragma ident "%Z%%M% %I% %E% SMI"
#include <sys/dtrace.h>
#include <sys/systrace.h>
@@ -141,7 +140,7 @@ systrace_destroy(void *arg, dtrace_id_t id, void *parg)
}
/*ARGSUSED*/
-static void
+static int
systrace_enable(void *arg, dtrace_id_t id, void *parg)
{
int sysnum = SYSTRACE_SYSNUM((uintptr_t)parg);
@@ -162,7 +161,7 @@ systrace_enable(void *arg, dtrace_id_t id, void *parg)
if (enabled) {
ASSERT(sysent[sysnum].sy_callc == dtrace_systrace_syscall);
-return;
+return (0);
}
(void) casptr(&sysent[sysnum].sy_callc,
@@ -173,6 +172,7 @@ systrace_enable(void *arg, dtrace_id_t id, void *parg)
(void *)systrace_sysent32[sysnum].stsy_underlying,
(void *)dtrace_systrace_syscall32);
#endif
+return (0);
}
/*ARGSUSED*/
@@ -336,7 +336,8 @@ static struct dev_ops systrace_ops = {
nodev, /* reset */
&systrace_cb_ops, /* driver operations */
NULL, /* bus operations */
-nodev /* dev power */
+nodev, /* dev power */
+ddi_quiesce_not_needed, /* quiesce */
};
/*

uts/common/fs/gfs.c (new file, 1178 lines): diff suppressed because it is too large

uts/common/fs/vnode.c (new file, 4536 lines): diff suppressed because it is too large

uts/common/fs/zfs/arc.c (new file, 4658 lines): diff suppressed because it is too large

uts/common/fs/zfs/bplist.c (new file, 69 lines):

@@ -0,0 +1,69 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#include <sys/bplist.h>
#include <sys/zfs_context.h>
void
bplist_create(bplist_t *bpl)
{
mutex_init(&bpl->bpl_lock, NULL, MUTEX_DEFAULT, NULL);
list_create(&bpl->bpl_list, sizeof (bplist_entry_t),
offsetof(bplist_entry_t, bpe_node));
}
void
bplist_destroy(bplist_t *bpl)
{
list_destroy(&bpl->bpl_list);
mutex_destroy(&bpl->bpl_lock);
}
void
bplist_append(bplist_t *bpl, const blkptr_t *bp)
{
bplist_entry_t *bpe = kmem_alloc(sizeof (*bpe), KM_SLEEP);
mutex_enter(&bpl->bpl_lock);
bpe->bpe_blk = *bp;
list_insert_tail(&bpl->bpl_list, bpe);
mutex_exit(&bpl->bpl_lock);
}
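/*
 * Drain the list, invoking func on each entry.  The list lock is
 * dropped around each callback, so func may itself append new
 * entries to the same bplist.
 */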
void
bplist_iterate(bplist_t *bpl, bplist_itor_t *func, void *arg, dmu_tx_t *tx)
{
bplist_entry_t *bpe;
mutex_enter(&bpl->bpl_lock);
while ((bpe = list_head(&bpl->bpl_list)) != NULL) {
list_remove(&bpl->bpl_list, bpe);
mutex_exit(&bpl->bpl_lock);
func(arg, &bpe->bpe_blk, tx);
kmem_free(bpe, sizeof (*bpe));
mutex_enter(&bpl->bpl_lock);
}
mutex_exit(&bpl->bpl_lock);
}

uts/common/fs/zfs/bpobj.c (new file, 495 lines):

@@ -0,0 +1,495 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#include <sys/bpobj.h>
#include <sys/zfs_context.h>
#include <sys/refcount.h>
uint64_t
bpobj_alloc(objset_t *os, int blocksize, dmu_tx_t *tx)
{
int size;
if (spa_version(dmu_objset_spa(os)) < SPA_VERSION_BPOBJ_ACCOUNT)
size = BPOBJ_SIZE_V0;
else if (spa_version(dmu_objset_spa(os)) < SPA_VERSION_DEADLISTS)
size = BPOBJ_SIZE_V1;
else
size = sizeof (bpobj_phys_t);
return (dmu_object_alloc(os, DMU_OT_BPOBJ, blocksize,
DMU_OT_BPOBJ_HDR, size, tx));
}
void
bpobj_free(objset_t *os, uint64_t obj, dmu_tx_t *tx)
{
int64_t i;
bpobj_t bpo;
dmu_object_info_t doi;
int epb;
dmu_buf_t *dbuf = NULL;
VERIFY3U(0, ==, bpobj_open(&bpo, os, obj));
mutex_enter(&bpo.bpo_lock);
if (!bpo.bpo_havesubobj || bpo.bpo_phys->bpo_subobjs == 0)
goto out;
VERIFY3U(0, ==, dmu_object_info(os, bpo.bpo_phys->bpo_subobjs, &doi));
epb = doi.doi_data_block_size / sizeof (uint64_t);
for (i = bpo.bpo_phys->bpo_num_subobjs - 1; i >= 0; i--) {
uint64_t *objarray;
uint64_t offset, blkoff;
offset = i * sizeof (uint64_t);
blkoff = P2PHASE(i, epb);
if (dbuf == NULL || dbuf->db_offset > offset) {
if (dbuf)
dmu_buf_rele(dbuf, FTAG);
VERIFY3U(0, ==, dmu_buf_hold(os,
bpo.bpo_phys->bpo_subobjs, offset, FTAG, &dbuf, 0));
}
ASSERT3U(offset, >=, dbuf->db_offset);
ASSERT3U(offset, <, dbuf->db_offset + dbuf->db_size);
objarray = dbuf->db_data;
bpobj_free(os, objarray[blkoff], tx);
}
if (dbuf) {
dmu_buf_rele(dbuf, FTAG);
dbuf = NULL;
}
VERIFY3U(0, ==, dmu_object_free(os, bpo.bpo_phys->bpo_subobjs, tx));
out:
mutex_exit(&bpo.bpo_lock);
bpobj_close(&bpo);
VERIFY3U(0, ==, dmu_object_free(os, obj, tx));
}
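/*
 * Open a bpobj and infer its feature level from the bonus buffer size:
 * pre-SPA_VERSION_BPOBJ_ACCOUNT objects lack comp/uncomp accounting
 * (bpo_havecomp), and pre-SPA_VERSION_DEADLISTS objects lack the
 * subobj list (bpo_havesubobj); see bpobj_alloc() above.
 */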
int
bpobj_open(bpobj_t *bpo, objset_t *os, uint64_t object)
{
dmu_object_info_t doi;
int err;
err = dmu_object_info(os, object, &doi);
if (err)
return (err);
bzero(bpo, sizeof (*bpo));
mutex_init(&bpo->bpo_lock, NULL, MUTEX_DEFAULT, NULL);
ASSERT(bpo->bpo_dbuf == NULL);
ASSERT(bpo->bpo_phys == NULL);
ASSERT(object != 0);
ASSERT3U(doi.doi_type, ==, DMU_OT_BPOBJ);
ASSERT3U(doi.doi_bonus_type, ==, DMU_OT_BPOBJ_HDR);
err = dmu_bonus_hold(os, object, bpo, &bpo->bpo_dbuf);
if (err)
return (err);
bpo->bpo_os = os;
bpo->bpo_object = object;
bpo->bpo_epb = doi.doi_data_block_size >> SPA_BLKPTRSHIFT;
bpo->bpo_havecomp = (doi.doi_bonus_size > BPOBJ_SIZE_V0);
bpo->bpo_havesubobj = (doi.doi_bonus_size > BPOBJ_SIZE_V1);
bpo->bpo_phys = bpo->bpo_dbuf->db_data;
return (0);
}
void
bpobj_close(bpobj_t *bpo)
{
/* Lame workaround for closing a bpobj that was never opened. */
if (bpo->bpo_object == 0)
return;
dmu_buf_rele(bpo->bpo_dbuf, bpo);
if (bpo->bpo_cached_dbuf != NULL)
dmu_buf_rele(bpo->bpo_cached_dbuf, bpo);
bpo->bpo_dbuf = NULL;
bpo->bpo_phys = NULL;
bpo->bpo_cached_dbuf = NULL;
bpo->bpo_object = 0;
mutex_destroy(&bpo->bpo_lock);
}
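/*
 * Common worker for bpobj_iterate() and bpobj_iterate_nofree(): visit
 * the directly stored blkptrs in reverse order, then recurse into each
 * sub-bpobj.  When "free" is set, visited entries are also deducted
 * from the space accounting and freed on disk.
 */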
static int
bpobj_iterate_impl(bpobj_t *bpo, bpobj_itor_t func, void *arg, dmu_tx_t *tx,
boolean_t free)
{
dmu_object_info_t doi;
int epb;
int64_t i;
int err = 0;
dmu_buf_t *dbuf = NULL;
mutex_enter(&bpo->bpo_lock);
if (free)
dmu_buf_will_dirty(bpo->bpo_dbuf, tx);
for (i = bpo->bpo_phys->bpo_num_blkptrs - 1; i >= 0; i--) {
blkptr_t *bparray;
blkptr_t *bp;
uint64_t offset, blkoff;
offset = i * sizeof (blkptr_t);
blkoff = P2PHASE(i, bpo->bpo_epb);
if (dbuf == NULL || dbuf->db_offset > offset) {
if (dbuf)
dmu_buf_rele(dbuf, FTAG);
err = dmu_buf_hold(bpo->bpo_os, bpo->bpo_object, offset,
FTAG, &dbuf, 0);
if (err)
break;
}
ASSERT3U(offset, >=, dbuf->db_offset);
ASSERT3U(offset, <, dbuf->db_offset + dbuf->db_size);
bparray = dbuf->db_data;
bp = &bparray[blkoff];
err = func(arg, bp, tx);
if (err)
break;
if (free) {
bpo->bpo_phys->bpo_bytes -=
bp_get_dsize_sync(dmu_objset_spa(bpo->bpo_os), bp);
ASSERT3S(bpo->bpo_phys->bpo_bytes, >=, 0);
if (bpo->bpo_havecomp) {
bpo->bpo_phys->bpo_comp -= BP_GET_PSIZE(bp);
bpo->bpo_phys->bpo_uncomp -= BP_GET_UCSIZE(bp);
}
bpo->bpo_phys->bpo_num_blkptrs--;
ASSERT3S(bpo->bpo_phys->bpo_num_blkptrs, >=, 0);
}
}
if (dbuf) {
dmu_buf_rele(dbuf, FTAG);
dbuf = NULL;
}
if (free) {
i++;
VERIFY3U(0, ==, dmu_free_range(bpo->bpo_os, bpo->bpo_object,
i * sizeof (blkptr_t), -1ULL, tx));
}
if (err || !bpo->bpo_havesubobj || bpo->bpo_phys->bpo_subobjs == 0)
goto out;
ASSERT(bpo->bpo_havecomp);
err = dmu_object_info(bpo->bpo_os, bpo->bpo_phys->bpo_subobjs, &doi);
if (err) {
mutex_exit(&bpo->bpo_lock);
return (err);
}
epb = doi.doi_data_block_size / sizeof (uint64_t);
for (i = bpo->bpo_phys->bpo_num_subobjs - 1; i >= 0; i--) {
uint64_t *objarray;
uint64_t offset, blkoff;
bpobj_t sublist;
uint64_t used_before, comp_before, uncomp_before;
uint64_t used_after, comp_after, uncomp_after;
offset = i * sizeof (uint64_t);
blkoff = P2PHASE(i, epb);
if (dbuf == NULL || dbuf->db_offset > offset) {
if (dbuf)
dmu_buf_rele(dbuf, FTAG);
err = dmu_buf_hold(bpo->bpo_os,
bpo->bpo_phys->bpo_subobjs, offset, FTAG, &dbuf, 0);
if (err)
break;
}
ASSERT3U(offset, >=, dbuf->db_offset);
ASSERT3U(offset, <, dbuf->db_offset + dbuf->db_size);
objarray = dbuf->db_data;
err = bpobj_open(&sublist, bpo->bpo_os, objarray[blkoff]);
if (err)
break;
if (free) {
err = bpobj_space(&sublist,
&used_before, &comp_before, &uncomp_before);
if (err)
break;
}
err = bpobj_iterate_impl(&sublist, func, arg, tx, free);
if (free) {
VERIFY3U(0, ==, bpobj_space(&sublist,
&used_after, &comp_after, &uncomp_after));
bpo->bpo_phys->bpo_bytes -= used_before - used_after;
ASSERT3S(bpo->bpo_phys->bpo_bytes, >=, 0);
bpo->bpo_phys->bpo_comp -= comp_before - comp_after;
bpo->bpo_phys->bpo_uncomp -=
uncomp_before - uncomp_after;
}
bpobj_close(&sublist);
if (err)
break;
if (free) {
err = dmu_object_free(bpo->bpo_os,
objarray[blkoff], tx);
if (err)
break;
bpo->bpo_phys->bpo_num_subobjs--;
ASSERT3S(bpo->bpo_phys->bpo_num_subobjs, >=, 0);
}
}
if (dbuf) {
dmu_buf_rele(dbuf, FTAG);
dbuf = NULL;
}
if (free) {
VERIFY3U(0, ==, dmu_free_range(bpo->bpo_os,
bpo->bpo_phys->bpo_subobjs,
(i + 1) * sizeof (uint64_t), -1ULL, tx));
}
out:
/* If there are no entries, there should be no bytes. */
ASSERT(bpo->bpo_phys->bpo_num_blkptrs > 0 ||
(bpo->bpo_havesubobj && bpo->bpo_phys->bpo_num_subobjs > 0) ||
bpo->bpo_phys->bpo_bytes == 0);
mutex_exit(&bpo->bpo_lock);
return (err);
}
/*
* Iterate and remove the entries. If func returns nonzero, iteration
* will stop and that entry will not be removed.
*/
int
bpobj_iterate(bpobj_t *bpo, bpobj_itor_t func, void *arg, dmu_tx_t *tx)
{
return (bpobj_iterate_impl(bpo, func, arg, tx, B_TRUE));
}
/*
* Iterate the entries. If func returns nonzero, iteration will stop.
*/
int
bpobj_iterate_nofree(bpobj_t *bpo, bpobj_itor_t func, void *arg, dmu_tx_t *tx)
{
return (bpobj_iterate_impl(bpo, func, arg, tx, B_FALSE));
}
void
bpobj_enqueue_subobj(bpobj_t *bpo, uint64_t subobj, dmu_tx_t *tx)
{
bpobj_t subbpo;
uint64_t used, comp, uncomp, subsubobjs;
ASSERT(bpo->bpo_havesubobj);
ASSERT(bpo->bpo_havecomp);
VERIFY3U(0, ==, bpobj_open(&subbpo, bpo->bpo_os, subobj));
VERIFY3U(0, ==, bpobj_space(&subbpo, &used, &comp, &uncomp));
if (used == 0) {
/* No point in having an empty subobj. */
bpobj_close(&subbpo);
bpobj_free(bpo->bpo_os, subobj, tx);
return;
}
dmu_buf_will_dirty(bpo->bpo_dbuf, tx);
if (bpo->bpo_phys->bpo_subobjs == 0) {
bpo->bpo_phys->bpo_subobjs = dmu_object_alloc(bpo->bpo_os,
DMU_OT_BPOBJ_SUBOBJ, SPA_MAXBLOCKSIZE, DMU_OT_NONE, 0, tx);
}
mutex_enter(&bpo->bpo_lock);
dmu_write(bpo->bpo_os, bpo->bpo_phys->bpo_subobjs,
bpo->bpo_phys->bpo_num_subobjs * sizeof (subobj),
sizeof (subobj), &subobj, tx);
bpo->bpo_phys->bpo_num_subobjs++;
/*
* If subobj has only one block of subobjs, then move subobj's
* subobjs to bpo's subobj list directly. This reduces
* recursion in bpobj_iterate due to nested subobjs.
*/
subsubobjs = subbpo.bpo_phys->bpo_subobjs;
if (subsubobjs != 0) {
dmu_object_info_t doi;
VERIFY3U(0, ==, dmu_object_info(bpo->bpo_os, subsubobjs, &doi));
if (doi.doi_max_offset == doi.doi_data_block_size) {
dmu_buf_t *subdb;
uint64_t numsubsub = subbpo.bpo_phys->bpo_num_subobjs;
VERIFY3U(0, ==, dmu_buf_hold(bpo->bpo_os, subsubobjs,
0, FTAG, &subdb, 0));
dmu_write(bpo->bpo_os, bpo->bpo_phys->bpo_subobjs,
bpo->bpo_phys->bpo_num_subobjs * sizeof (subobj),
numsubsub * sizeof (subobj), subdb->db_data, tx);
dmu_buf_rele(subdb, FTAG);
bpo->bpo_phys->bpo_num_subobjs += numsubsub;
dmu_buf_will_dirty(subbpo.bpo_dbuf, tx);
subbpo.bpo_phys->bpo_subobjs = 0;
VERIFY3U(0, ==, dmu_object_free(bpo->bpo_os,
subsubobjs, tx));
}
}
bpo->bpo_phys->bpo_bytes += used;
bpo->bpo_phys->bpo_comp += comp;
bpo->bpo_phys->bpo_uncomp += uncomp;
mutex_exit(&bpo->bpo_lock);
bpobj_close(&subbpo);
}
void
bpobj_enqueue(bpobj_t *bpo, const blkptr_t *bp, dmu_tx_t *tx)
{
blkptr_t stored_bp = *bp;
uint64_t offset;
int blkoff;
blkptr_t *bparray;
ASSERT(!BP_IS_HOLE(bp));
/* We never need the fill count. */
stored_bp.blk_fill = 0;
/* The bpobj will compress better if we can leave off the checksum */
if (!BP_GET_DEDUP(bp))
bzero(&stored_bp.blk_cksum, sizeof (stored_bp.blk_cksum));
mutex_enter(&bpo->bpo_lock);
offset = bpo->bpo_phys->bpo_num_blkptrs * sizeof (stored_bp);
blkoff = P2PHASE(bpo->bpo_phys->bpo_num_blkptrs, bpo->bpo_epb);
if (bpo->bpo_cached_dbuf == NULL ||
offset < bpo->bpo_cached_dbuf->db_offset ||
offset >= bpo->bpo_cached_dbuf->db_offset +
bpo->bpo_cached_dbuf->db_size) {
if (bpo->bpo_cached_dbuf)
dmu_buf_rele(bpo->bpo_cached_dbuf, bpo);
VERIFY3U(0, ==, dmu_buf_hold(bpo->bpo_os, bpo->bpo_object,
offset, bpo, &bpo->bpo_cached_dbuf, 0));
}
dmu_buf_will_dirty(bpo->bpo_cached_dbuf, tx);
bparray = bpo->bpo_cached_dbuf->db_data;
bparray[blkoff] = stored_bp;
dmu_buf_will_dirty(bpo->bpo_dbuf, tx);
bpo->bpo_phys->bpo_num_blkptrs++;
bpo->bpo_phys->bpo_bytes +=
bp_get_dsize_sync(dmu_objset_spa(bpo->bpo_os), bp);
if (bpo->bpo_havecomp) {
bpo->bpo_phys->bpo_comp += BP_GET_PSIZE(bp);
bpo->bpo_phys->bpo_uncomp += BP_GET_UCSIZE(bp);
}
mutex_exit(&bpo->bpo_lock);
}
struct space_range_arg {
spa_t *spa;
uint64_t mintxg;
uint64_t maxtxg;
uint64_t used;
uint64_t comp;
uint64_t uncomp;
};
/* ARGSUSED */
static int
space_range_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
{
struct space_range_arg *sra = arg;
if (bp->blk_birth > sra->mintxg && bp->blk_birth <= sra->maxtxg) {
sra->used += bp_get_dsize_sync(sra->spa, bp);
sra->comp += BP_GET_PSIZE(bp);
sra->uncomp += BP_GET_UCSIZE(bp);
}
return (0);
}
int
bpobj_space(bpobj_t *bpo, uint64_t *usedp, uint64_t *compp, uint64_t *uncompp)
{
mutex_enter(&bpo->bpo_lock);
*usedp = bpo->bpo_phys->bpo_bytes;
if (bpo->bpo_havecomp) {
*compp = bpo->bpo_phys->bpo_comp;
*uncompp = bpo->bpo_phys->bpo_uncomp;
mutex_exit(&bpo->bpo_lock);
return (0);
} else {
mutex_exit(&bpo->bpo_lock);
return (bpobj_space_range(bpo, 0, UINT64_MAX,
usedp, compp, uncompp));
}
}
/*
* Return the amount of space in the bpobj which is:
* mintxg < blk_birth <= maxtxg
*/
int
bpobj_space_range(bpobj_t *bpo, uint64_t mintxg, uint64_t maxtxg,
uint64_t *usedp, uint64_t *compp, uint64_t *uncompp)
{
struct space_range_arg sra = { 0 };
int err;
/*
* As an optimization, if they want the whole txg range, just
* get bpo_bytes rather than iterating over the bps.
*/
if (mintxg < TXG_INITIAL && maxtxg == UINT64_MAX && bpo->bpo_havecomp)
return (bpobj_space(bpo, usedp, compp, uncompp));
sra.spa = dmu_objset_spa(bpo->bpo_os);
sra.mintxg = mintxg;
sra.maxtxg = maxtxg;
err = bpobj_iterate_nofree(bpo, space_range_cb, &sra, NULL);
*usedp = sra.used;
*compp = sra.comp;
*uncompp = sra.uncomp;
return (err);
}

uts/common/fs/zfs/dbuf.c (new file, 2707 lines): diff suppressed because it is too large

uts/common/fs/zfs/ddt.c (new file, 1146 lines): diff suppressed because it is too large

uts/common/fs/zfs/ddt_zap.c (new file, 157 lines):

@@ -0,0 +1,157 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2009, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#include <sys/zfs_context.h>
#include <sys/spa.h>
#include <sys/zio.h>
#include <sys/ddt.h>
#include <sys/zap.h>
#include <sys/dmu_tx.h>
#include <util/sscanf.h>
int ddt_zap_leaf_blockshift = 12;
int ddt_zap_indirect_blockshift = 12;
static int
ddt_zap_create(objset_t *os, uint64_t *objectp, dmu_tx_t *tx, boolean_t prehash)
{
zap_flags_t flags = ZAP_FLAG_HASH64 | ZAP_FLAG_UINT64_KEY;
if (prehash)
flags |= ZAP_FLAG_PRE_HASHED_KEY;
*objectp = zap_create_flags(os, 0, flags, DMU_OT_DDT_ZAP,
ddt_zap_leaf_blockshift, ddt_zap_indirect_blockshift,
DMU_OT_NONE, 0, tx);
return (*objectp == 0 ? ENOTSUP : 0);
}
static int
ddt_zap_destroy(objset_t *os, uint64_t object, dmu_tx_t *tx)
{
return (zap_destroy(os, object, tx));
}
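/*
 * Entries are keyed by the raw dde_key words (DDT_KEY_WORDS uint64s);
 * the ZAP value is the dde_phys array compressed to csize bytes, so a
 * lookup reads the stored length first and then decompresses.
 */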
static int
ddt_zap_lookup(objset_t *os, uint64_t object, ddt_entry_t *dde)
{
uchar_t cbuf[sizeof (dde->dde_phys) + 1];
uint64_t one, csize;
int error;
error = zap_length_uint64(os, object, (uint64_t *)&dde->dde_key,
DDT_KEY_WORDS, &one, &csize);
if (error)
return (error);
ASSERT(one == 1);
ASSERT(csize <= sizeof (cbuf));
error = zap_lookup_uint64(os, object, (uint64_t *)&dde->dde_key,
DDT_KEY_WORDS, 1, csize, cbuf);
if (error)
return (error);
ddt_decompress(cbuf, dde->dde_phys, csize, sizeof (dde->dde_phys));
return (0);
}
static void
ddt_zap_prefetch(objset_t *os, uint64_t object, ddt_entry_t *dde)
{
(void) zap_prefetch_uint64(os, object, (uint64_t *)&dde->dde_key,
DDT_KEY_WORDS);
}
static int
ddt_zap_update(objset_t *os, uint64_t object, ddt_entry_t *dde, dmu_tx_t *tx)
{
uchar_t cbuf[sizeof (dde->dde_phys) + 1];
uint64_t csize;
csize = ddt_compress(dde->dde_phys, cbuf,
sizeof (dde->dde_phys), sizeof (cbuf));
return (zap_update_uint64(os, object, (uint64_t *)&dde->dde_key,
DDT_KEY_WORDS, 1, csize, cbuf, tx));
}
static int
ddt_zap_remove(objset_t *os, uint64_t object, ddt_entry_t *dde, dmu_tx_t *tx)
{
return (zap_remove_uint64(os, object, (uint64_t *)&dde->dde_key,
DDT_KEY_WORDS, tx));
}
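/*
 * Resume iteration from a serialized cursor position: fetch and
 * decompress the entry under the cursor, then advance the cursor and
 * hand the new position back through *walk.
 */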
static int
ddt_zap_walk(objset_t *os, uint64_t object, ddt_entry_t *dde, uint64_t *walk)
{
zap_cursor_t zc;
zap_attribute_t za;
int error;
zap_cursor_init_serialized(&zc, os, object, *walk);
if ((error = zap_cursor_retrieve(&zc, &za)) == 0) {
uchar_t cbuf[sizeof (dde->dde_phys) + 1];
uint64_t csize = za.za_num_integers;
ASSERT(za.za_integer_length == 1);
error = zap_lookup_uint64(os, object, (uint64_t *)za.za_name,
DDT_KEY_WORDS, 1, csize, cbuf);
ASSERT(error == 0);
if (error == 0) {
ddt_decompress(cbuf, dde->dde_phys, csize,
sizeof (dde->dde_phys));
dde->dde_key = *(ddt_key_t *)za.za_name;
}
zap_cursor_advance(&zc);
*walk = zap_cursor_serialize(&zc);
}
zap_cursor_fini(&zc);
return (error);
}
static uint64_t
ddt_zap_count(objset_t *os, uint64_t object)
{
uint64_t count = 0;
VERIFY(zap_count(os, object, &count) == 0);
return (count);
}
const ddt_ops_t ddt_zap_ops = {
"zap",
ddt_zap_create,
ddt_zap_destroy,
ddt_zap_lookup,
ddt_zap_prefetch,
ddt_zap_update,
ddt_zap_remove,
ddt_zap_walk,
ddt_zap_count,
};

uts/common/fs/zfs/dmu.c (new file, 1764 lines): diff suppressed because it is too large

uts/common/fs/zfs/dmu_diff.c (new file, 221 lines):

@@ -0,0 +1,221 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2010, Oracle and/or its affiliates. All rights reserved.
*/
#include <sys/dmu.h>
#include <sys/dmu_impl.h>
#include <sys/dmu_tx.h>
#include <sys/dbuf.h>
#include <sys/dnode.h>
#include <sys/zfs_context.h>
#include <sys/dmu_objset.h>
#include <sys/dmu_traverse.h>
#include <sys/dsl_dataset.h>
#include <sys/dsl_dir.h>
#include <sys/dsl_pool.h>
#include <sys/dsl_synctask.h>
#include <sys/zfs_ioctl.h>
#include <sys/zap.h>
#include <sys/zio_checksum.h>
#include <sys/zfs_znode.h>
struct diffarg {
struct vnode *da_vp; /* file to which we are reporting */
offset_t *da_offp;
int da_err; /* error that stopped diff search */
dmu_diff_record_t da_ddr;
};
static int
write_record(struct diffarg *da)
{
ssize_t resid; /* have to get resid to get detailed errno */
if (da->da_ddr.ddr_type == DDR_NONE) {
da->da_err = 0;
return (0);
}
da->da_err = vn_rdwr(UIO_WRITE, da->da_vp, (caddr_t)&da->da_ddr,
sizeof (da->da_ddr), 0, UIO_SYSSPACE, FAPPEND,
RLIM64_INFINITY, CRED(), &resid);
*da->da_offp += sizeof (da->da_ddr);
return (da->da_err);
}
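/*
 * Coalesce contiguous free ranges: extend the pending DDR_FREE record
 * when "first" abuts its tail, otherwise flush the pending record and
 * start a new one.
 */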
static int
report_free_dnode_range(struct diffarg *da, uint64_t first, uint64_t last)
{
ASSERT(first <= last);
if (da->da_ddr.ddr_type != DDR_FREE ||
first != da->da_ddr.ddr_last + 1) {
if (write_record(da) != 0)
return (da->da_err);
da->da_ddr.ddr_type = DDR_FREE;
da->da_ddr.ddr_first = first;
da->da_ddr.ddr_last = last;
return (0);
}
da->da_ddr.ddr_last = last;
return (0);
}
static int
report_dnode(struct diffarg *da, uint64_t object, dnode_phys_t *dnp)
{
ASSERT(dnp != NULL);
if (dnp->dn_type == DMU_OT_NONE)
return (report_free_dnode_range(da, object, object));
if (da->da_ddr.ddr_type != DDR_INUSE ||
object != da->da_ddr.ddr_last + 1) {
if (write_record(da) != 0)
return (da->da_err);
da->da_ddr.ddr_type = DDR_INUSE;
da->da_ddr.ddr_first = da->da_ddr.ddr_last = object;
return (0);
}
da->da_ddr.ddr_last = object;
return (0);
}
#define DBP_SPAN(dnp, level) \
(((uint64_t)dnp->dn_datablkszsec) << (SPA_MINBLOCKSHIFT + \
(level) * (dnp->dn_indblkshift - SPA_BLKPTRSHIFT)))
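/*
 * DBP_SPAN(dnp, level) is the number of bytes of object data covered
 * by one block pointer at the given indirection level.  For example,
 * with 16K dnode-file blocks (dn_datablkszsec == 32) and 16K indirect
 * blocks (dn_indblkshift == 14, i.e. 128 blkptrs each), the span is
 * 32 << 9 == 16K at level 0 and 32 << (9 + 7) == 2M at level 1.
 */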
/* ARGSUSED */
static int
diff_cb(spa_t *spa, zilog_t *zilog, const blkptr_t *bp, arc_buf_t *pbuf,
const zbookmark_t *zb, const dnode_phys_t *dnp, void *arg)
{
struct diffarg *da = arg;
int err = 0;
if (issig(JUSTLOOKING) && issig(FORREAL))
return (EINTR);
if (zb->zb_object != DMU_META_DNODE_OBJECT)
return (0);
if (bp == NULL) {
uint64_t span = DBP_SPAN(dnp, zb->zb_level);
uint64_t dnobj = (zb->zb_blkid * span) >> DNODE_SHIFT;
err = report_free_dnode_range(da, dnobj,
dnobj + (span >> DNODE_SHIFT) - 1);
if (err)
return (err);
} else if (zb->zb_level == 0) {
dnode_phys_t *blk;
arc_buf_t *abuf;
uint32_t aflags = ARC_WAIT;
int blksz = BP_GET_LSIZE(bp);
int i;
if (dsl_read(NULL, spa, bp, pbuf,
arc_getbuf_func, &abuf, ZIO_PRIORITY_ASYNC_READ,
ZIO_FLAG_CANFAIL, &aflags, zb) != 0)
return (EIO);
blk = abuf->b_data;
for (i = 0; i < blksz >> DNODE_SHIFT; i++) {
uint64_t dnobj = (zb->zb_blkid <<
(DNODE_BLOCK_SHIFT - DNODE_SHIFT)) + i;
err = report_dnode(da, dnobj, blk+i);
if (err)
break;
}
(void) arc_buf_remove_ref(abuf, &abuf);
if (err)
return (err);
/* Don't care about the data blocks */
return (TRAVERSE_VISIT_NO_CHILDREN);
}
return (0);
}
int
dmu_diff(objset_t *tosnap, objset_t *fromsnap, struct vnode *vp, offset_t *offp)
{
struct diffarg da;
dsl_dataset_t *ds = tosnap->os_dsl_dataset;
dsl_dataset_t *fromds = fromsnap->os_dsl_dataset;
dsl_dataset_t *findds;
dsl_dataset_t *relds;
int err = 0;
/* make certain we are looking at snapshots */
if (!dsl_dataset_is_snapshot(ds) || !dsl_dataset_is_snapshot(fromds))
return (EINVAL);
/* fromsnap must be earlier and from the same lineage as tosnap */
if (fromds->ds_phys->ds_creation_txg >= ds->ds_phys->ds_creation_txg)
return (EXDEV);
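/*
 * Walk tosnap's chain of clone origins until we reach fromsnap's
 * dsl_dir; running out of origins first means the two snapshots do not
 * share a lineage.
 */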
relds = NULL;
findds = ds;
while (fromds->ds_dir != findds->ds_dir) {
dsl_pool_t *dp = ds->ds_dir->dd_pool;
if (!dsl_dir_is_clone(findds->ds_dir)) {
if (relds)
dsl_dataset_rele(relds, FTAG);
return (EXDEV);
}
rw_enter(&dp->dp_config_rwlock, RW_READER);
err = dsl_dataset_hold_obj(dp,
findds->ds_dir->dd_phys->dd_origin_obj, FTAG, &findds);
rw_exit(&dp->dp_config_rwlock);
if (relds)
dsl_dataset_rele(relds, FTAG);
if (err)
return (EXDEV);
relds = findds;
}
if (relds)
dsl_dataset_rele(relds, FTAG);
da.da_vp = vp;
da.da_offp = offp;
da.da_ddr.ddr_type = DDR_NONE;
da.da_ddr.ddr_first = da.da_ddr.ddr_last = 0;
da.da_err = 0;
err = traverse_dataset(ds, fromds->ds_phys->ds_creation_txg,
TRAVERSE_PRE | TRAVERSE_PREFETCH_METADATA, diff_cb, &da);
if (err) {
da.da_err = err;
} else {
/* we set the da.da_err we return as a side effect */
(void) write_record(&da);
}
return (da.da_err);
}

uts/common/fs/zfs/dmu_object.c (new file, 196 lines):

@@ -0,0 +1,196 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#include <sys/dmu.h>
#include <sys/dmu_objset.h>
#include <sys/dmu_tx.h>
#include <sys/dnode.h>
uint64_t
dmu_object_alloc(objset_t *os, dmu_object_type_t ot, int blocksize,
dmu_object_type_t bonustype, int bonuslen, dmu_tx_t *tx)
{
uint64_t object;
uint64_t L2_dnode_count = DNODES_PER_BLOCK <<
(DMU_META_DNODE(os)->dn_indblkshift - SPA_BLKPTRSHIFT);
dnode_t *dn = NULL;
int restarted = B_FALSE;
mutex_enter(&os->os_obj_lock);
for (;;) {
object = os->os_obj_next;
/*
* Each time we polish off an L2 bp worth of dnodes
* (2^13 objects), move to another L2 bp that's still
* reasonably sparse (at most 1/4 full). Look from the
* beginning once, but after that keep looking from here.
* If we can't find one, just keep going from here.
*/
if (P2PHASE(object, L2_dnode_count) == 0) {
uint64_t offset = restarted ? object << DNODE_SHIFT : 0;
int error = dnode_next_offset(DMU_META_DNODE(os),
DNODE_FIND_HOLE,
&offset, 2, DNODES_PER_BLOCK >> 2, 0);
restarted = B_TRUE;
if (error == 0)
object = offset >> DNODE_SHIFT;
}
os->os_obj_next = ++object;
/*
* XXX We should check for an i/o error here and return
* up to our caller. Actually we should pre-read it in
* dmu_tx_assign(), but there is currently no mechanism
* to do so.
*/
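/*
 * dnode_hold_impl() fails and leaves dn NULL if this object number is
 * already allocated; in that case skip ahead to the next hole in the
 * meta-dnode and retry.
 */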
(void) dnode_hold_impl(os, object, DNODE_MUST_BE_FREE,
FTAG, &dn);
if (dn)
break;
if (dmu_object_next(os, &object, B_TRUE, 0) == 0)
os->os_obj_next = object - 1;
}
dnode_allocate(dn, ot, blocksize, 0, bonustype, bonuslen, tx);
dnode_rele(dn, FTAG);
mutex_exit(&os->os_obj_lock);
dmu_tx_add_new_object(tx, os, object);
return (object);
}
int
dmu_object_claim(objset_t *os, uint64_t object, dmu_object_type_t ot,
int blocksize, dmu_object_type_t bonustype, int bonuslen, dmu_tx_t *tx)
{
dnode_t *dn;
int err;
if (object == DMU_META_DNODE_OBJECT && !dmu_tx_private_ok(tx))
return (EBADF);
err = dnode_hold_impl(os, object, DNODE_MUST_BE_FREE, FTAG, &dn);
if (err)
return (err);
dnode_allocate(dn, ot, blocksize, 0, bonustype, bonuslen, tx);
dnode_rele(dn, FTAG);
dmu_tx_add_new_object(tx, os, object);
return (0);
}
int
dmu_object_reclaim(objset_t *os, uint64_t object, dmu_object_type_t ot,
int blocksize, dmu_object_type_t bonustype, int bonuslen)
{
dnode_t *dn;
dmu_tx_t *tx;
int nblkptr;
int err;
if (object == DMU_META_DNODE_OBJECT)
return (EBADF);
err = dnode_hold_impl(os, object, DNODE_MUST_BE_ALLOCATED,
FTAG, &dn);
if (err)
return (err);
if (dn->dn_type == ot && dn->dn_datablksz == blocksize &&
dn->dn_bonustype == bonustype && dn->dn_bonuslen == bonuslen) {
/* nothing is changing, this is a noop */
dnode_rele(dn, FTAG);
return (0);
}
if (bonustype == DMU_OT_SA) {
nblkptr = 1;
} else {
nblkptr = 1 + ((DN_MAX_BONUSLEN - bonuslen) >> SPA_BLKPTRSHIFT);
}
/*
* If we are losing blkptrs or changing the block size this must
* be a new file instance. We must clear out the previous file
* contents before we can change this type of metadata in the dnode.
*/
if (dn->dn_nblkptr > nblkptr || dn->dn_datablksz != blocksize) {
err = dmu_free_long_range(os, object, 0, DMU_OBJECT_END);
if (err)
goto out;
}
tx = dmu_tx_create(os);
dmu_tx_hold_bonus(tx, object);
err = dmu_tx_assign(tx, TXG_WAIT);
if (err) {
dmu_tx_abort(tx);
goto out;
}
dnode_reallocate(dn, ot, blocksize, bonustype, bonuslen, tx);
dmu_tx_commit(tx);
out:
dnode_rele(dn, FTAG);
return (err);
}
int
dmu_object_free(objset_t *os, uint64_t object, dmu_tx_t *tx)
{
dnode_t *dn;
int err;
ASSERT(object != DMU_META_DNODE_OBJECT || dmu_tx_private_ok(tx));
err = dnode_hold_impl(os, object, DNODE_MUST_BE_ALLOCATED,
FTAG, &dn);
if (err)
return (err);
ASSERT(dn->dn_type != DMU_OT_NONE);
dnode_free_range(dn, 0, DMU_OBJECT_END, tx);
dnode_free(dn, tx);
dnode_rele(dn, FTAG);
return (0);
}
int
dmu_object_next(objset_t *os, uint64_t *objectp, boolean_t hole, uint64_t txg)
{
uint64_t offset = (*objectp + 1) << DNODE_SHIFT;
int error;
error = dnode_next_offset(DMU_META_DNODE(os),
(hole ? DNODE_FIND_HOLE : 0), &offset, 0, DNODES_PER_BLOCK, txg);
*objectp = offset >> DNODE_SHIFT;
return (error);
}

File diff suppressed because it is too large

uts/common/fs/zfs/dmu_send.c (new file, 1606 lines): diff suppressed because it is too large

uts/common/fs/zfs/dmu_traverse.c (new file, 482 lines):

@@ -0,0 +1,482 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#include <sys/zfs_context.h>
#include <sys/dmu_objset.h>
#include <sys/dmu_traverse.h>
#include <sys/dsl_dataset.h>
#include <sys/dsl_dir.h>
#include <sys/dsl_pool.h>
#include <sys/dnode.h>
#include <sys/spa.h>
#include <sys/zio.h>
#include <sys/dmu_impl.h>
#include <sys/sa.h>
#include <sys/sa_impl.h>
#include <sys/callb.h>
int zfs_pd_blks_max = 100;
typedef struct prefetch_data {
kmutex_t pd_mtx;
kcondvar_t pd_cv;
int pd_blks_max;
int pd_blks_fetched;
int pd_flags;
boolean_t pd_cancel;
boolean_t pd_exited;
} prefetch_data_t;
typedef struct traverse_data {
spa_t *td_spa;
uint64_t td_objset;
blkptr_t *td_rootbp;
uint64_t td_min_txg;
int td_flags;
prefetch_data_t *td_pfd;
blkptr_cb_t *td_func;
void *td_arg;
} traverse_data_t;
static int traverse_dnode(traverse_data_t *td, const dnode_phys_t *dnp,
arc_buf_t *buf, uint64_t objset, uint64_t object);
static int
traverse_zil_block(zilog_t *zilog, blkptr_t *bp, void *arg, uint64_t claim_txg)
{
traverse_data_t *td = arg;
zbookmark_t zb;
if (bp->blk_birth == 0)
return (0);
if (claim_txg == 0 && bp->blk_birth >= spa_first_txg(td->td_spa))
return (0);
SET_BOOKMARK(&zb, td->td_objset, ZB_ZIL_OBJECT, ZB_ZIL_LEVEL,
bp->blk_cksum.zc_word[ZIL_ZC_SEQ]);
(void) td->td_func(td->td_spa, zilog, bp, NULL, &zb, NULL, td->td_arg);
return (0);
}
static int
traverse_zil_record(zilog_t *zilog, lr_t *lrc, void *arg, uint64_t claim_txg)
{
traverse_data_t *td = arg;
if (lrc->lrc_txtype == TX_WRITE) {
lr_write_t *lr = (lr_write_t *)lrc;
blkptr_t *bp = &lr->lr_blkptr;
zbookmark_t zb;
if (bp->blk_birth == 0)
return (0);
if (claim_txg == 0 || bp->blk_birth < claim_txg)
return (0);
SET_BOOKMARK(&zb, td->td_objset, lr->lr_foid,
ZB_ZIL_LEVEL, lr->lr_offset / BP_GET_LSIZE(bp));
(void) td->td_func(td->td_spa, zilog, bp, NULL, &zb, NULL,
td->td_arg);
}
return (0);
}
static void
traverse_zil(traverse_data_t *td, zil_header_t *zh)
{
uint64_t claim_txg = zh->zh_claim_txg;
zilog_t *zilog;
/*
* We only want to visit blocks that have been claimed but not yet
* replayed; plus, in read-only mode, blocks that are already stable.
*/
if (claim_txg == 0 && spa_writeable(td->td_spa))
return;
zilog = zil_alloc(spa_get_dsl(td->td_spa)->dp_meta_objset, zh);
(void) zil_parse(zilog, traverse_zil_block, traverse_zil_record, td,
claim_txg);
zil_free(zilog);
}
static int
traverse_visitbp(traverse_data_t *td, const dnode_phys_t *dnp,
arc_buf_t *pbuf, blkptr_t *bp, const zbookmark_t *zb)
{
zbookmark_t czb;
int err = 0, lasterr = 0;
arc_buf_t *buf = NULL;
prefetch_data_t *pd = td->td_pfd;
boolean_t hard = td->td_flags & TRAVERSE_HARD;
if (bp->blk_birth == 0) {
err = td->td_func(td->td_spa, NULL, NULL, pbuf, zb, dnp,
td->td_arg);
return (err);
}
if (bp->blk_birth <= td->td_min_txg)
return (0);
if (pd && !pd->pd_exited &&
((pd->pd_flags & TRAVERSE_PREFETCH_DATA) ||
BP_GET_TYPE(bp) == DMU_OT_DNODE || BP_GET_LEVEL(bp) > 0)) {
mutex_enter(&pd->pd_mtx);
ASSERT(pd->pd_blks_fetched >= 0);
while (pd->pd_blks_fetched == 0 && !pd->pd_exited)
cv_wait(&pd->pd_cv, &pd->pd_mtx);
pd->pd_blks_fetched--;
cv_broadcast(&pd->pd_cv);
mutex_exit(&pd->pd_mtx);
}
if (td->td_flags & TRAVERSE_PRE) {
err = td->td_func(td->td_spa, NULL, bp, pbuf, zb, dnp,
td->td_arg);
if (err == TRAVERSE_VISIT_NO_CHILDREN)
return (0);
if (err)
return (err);
}
if (BP_GET_LEVEL(bp) > 0) {
uint32_t flags = ARC_WAIT;
int i;
blkptr_t *cbp;
int epb = BP_GET_LSIZE(bp) >> SPA_BLKPTRSHIFT;
err = dsl_read(NULL, td->td_spa, bp, pbuf,
arc_getbuf_func, &buf,
ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_CANFAIL, &flags, zb);
if (err)
return (err);
/* recursively visitbp() blocks below this */
cbp = buf->b_data;
for (i = 0; i < epb; i++, cbp++) {
SET_BOOKMARK(&czb, zb->zb_objset, zb->zb_object,
zb->zb_level - 1,
zb->zb_blkid * epb + i);
err = traverse_visitbp(td, dnp, buf, cbp, &czb);
if (err) {
if (!hard)
break;
lasterr = err;
}
}
} else if (BP_GET_TYPE(bp) == DMU_OT_DNODE) {
uint32_t flags = ARC_WAIT;
int i;
int epb = BP_GET_LSIZE(bp) >> DNODE_SHIFT;
err = dsl_read(NULL, td->td_spa, bp, pbuf,
arc_getbuf_func, &buf,
ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_CANFAIL, &flags, zb);
if (err)
return (err);
/* recursively visitbp() blocks below this */
dnp = buf->b_data;
for (i = 0; i < epb; i++, dnp++) {
err = traverse_dnode(td, dnp, buf, zb->zb_objset,
zb->zb_blkid * epb + i);
if (err) {
if (!hard)
break;
lasterr = err;
}
}
} else if (BP_GET_TYPE(bp) == DMU_OT_OBJSET) {
uint32_t flags = ARC_WAIT;
objset_phys_t *osp;
dnode_phys_t *dnp;
err = dsl_read_nolock(NULL, td->td_spa, bp,
arc_getbuf_func, &buf,
ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_CANFAIL, &flags, zb);
if (err)
return (err);
osp = buf->b_data;
dnp = &osp->os_meta_dnode;
err = traverse_dnode(td, dnp, buf, zb->zb_objset,
DMU_META_DNODE_OBJECT);
if (err && hard) {
lasterr = err;
err = 0;
}
if (err == 0 && arc_buf_size(buf) >= sizeof (objset_phys_t)) {
dnp = &osp->os_userused_dnode;
err = traverse_dnode(td, dnp, buf, zb->zb_objset,
DMU_USERUSED_OBJECT);
}
if (err && hard) {
lasterr = err;
err = 0;
}
if (err == 0 && arc_buf_size(buf) >= sizeof (objset_phys_t)) {
dnp = &osp->os_groupused_dnode;
err = traverse_dnode(td, dnp, buf, zb->zb_objset,
DMU_GROUPUSED_OBJECT);
}
}
if (buf)
(void) arc_buf_remove_ref(buf, &buf);
if (err == 0 && lasterr == 0 && (td->td_flags & TRAVERSE_POST)) {
err = td->td_func(td->td_spa, NULL, bp, pbuf, zb, dnp,
td->td_arg);
}
return (err != 0 ? err : lasterr);
}
static int
traverse_dnode(traverse_data_t *td, const dnode_phys_t *dnp,
arc_buf_t *buf, uint64_t objset, uint64_t object)
{
int j, err = 0, lasterr = 0;
zbookmark_t czb;
boolean_t hard = (td->td_flags & TRAVERSE_HARD);
for (j = 0; j < dnp->dn_nblkptr; j++) {
SET_BOOKMARK(&czb, objset, object, dnp->dn_nlevels - 1, j);
err = traverse_visitbp(td, dnp, buf,
(blkptr_t *)&dnp->dn_blkptr[j], &czb);
if (err) {
if (!hard)
break;
lasterr = err;
}
}
if (dnp->dn_flags & DNODE_FLAG_SPILL_BLKPTR) {
SET_BOOKMARK(&czb, objset,
object, 0, DMU_SPILL_BLKID);
err = traverse_visitbp(td, dnp, buf,
(blkptr_t *)&dnp->dn_spill, &czb);
if (err) {
if (!hard)
return (err);
lasterr = err;
}
}
return (err != 0 ? err : lasterr);
}
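/*
 * Prefetch callback: issue speculative, non-blocking reads for
 * interesting bps, throttled so that at most pd_blks_max reads run
 * ahead of the consuming traversal thread.
 */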
/* ARGSUSED */
static int
traverse_prefetcher(spa_t *spa, zilog_t *zilog, const blkptr_t *bp,
arc_buf_t *pbuf, const zbookmark_t *zb, const dnode_phys_t *dnp,
void *arg)
{
prefetch_data_t *pfd = arg;
uint32_t aflags = ARC_NOWAIT | ARC_PREFETCH;
ASSERT(pfd->pd_blks_fetched >= 0);
if (pfd->pd_cancel)
return (EINTR);
if (bp == NULL || !((pfd->pd_flags & TRAVERSE_PREFETCH_DATA) ||
BP_GET_TYPE(bp) == DMU_OT_DNODE || BP_GET_LEVEL(bp) > 0) ||
BP_GET_TYPE(bp) == DMU_OT_INTENT_LOG)
return (0);
mutex_enter(&pfd->pd_mtx);
while (!pfd->pd_cancel && pfd->pd_blks_fetched >= pfd->pd_blks_max)
cv_wait(&pfd->pd_cv, &pfd->pd_mtx);
pfd->pd_blks_fetched++;
cv_broadcast(&pfd->pd_cv);
mutex_exit(&pfd->pd_mtx);
(void) dsl_read(NULL, spa, bp, pbuf, NULL, NULL,
ZIO_PRIORITY_ASYNC_READ,
ZIO_FLAG_CANFAIL | ZIO_FLAG_SPECULATIVE,
&aflags, zb);
return (0);
}
static void
traverse_prefetch_thread(void *arg)
{
traverse_data_t *td_main = arg;
traverse_data_t td = *td_main;
zbookmark_t czb;
td.td_func = traverse_prefetcher;
td.td_arg = td_main->td_pfd;
td.td_pfd = NULL;
SET_BOOKMARK(&czb, td.td_objset,
ZB_ROOT_OBJECT, ZB_ROOT_LEVEL, ZB_ROOT_BLKID);
(void) traverse_visitbp(&td, NULL, NULL, td.td_rootbp, &czb);
mutex_enter(&td_main->td_pfd->pd_mtx);
td_main->td_pfd->pd_exited = B_TRUE;
cv_broadcast(&td_main->td_pfd->pd_cv);
mutex_exit(&td_main->td_pfd->pd_mtx);
}
/*
* NB: dataset must not be changing on-disk (e.g., it is a snapshot or we are
* in syncing context).
*/
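/*
 * The main traversal runs in the caller's thread, while a helper
 * dispatched on system_taskq re-walks the same tree with
 * traverse_prefetcher to issue reads ahead of it.  If TRAVERSE_PREFETCH
 * is not set, or the dispatch fails, pd_exited is preset so the
 * teardown handshake below still terminates.
 */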
static int
traverse_impl(spa_t *spa, dsl_dataset_t *ds, blkptr_t *rootbp,
uint64_t txg_start, int flags, blkptr_cb_t func, void *arg)
{
traverse_data_t td;
prefetch_data_t pd = { 0 };
zbookmark_t czb;
int err;
td.td_spa = spa;
td.td_objset = ds ? ds->ds_object : 0;
td.td_rootbp = rootbp;
td.td_min_txg = txg_start;
td.td_func = func;
td.td_arg = arg;
td.td_pfd = &pd;
td.td_flags = flags;
pd.pd_blks_max = zfs_pd_blks_max;
pd.pd_flags = flags;
mutex_init(&pd.pd_mtx, NULL, MUTEX_DEFAULT, NULL);
cv_init(&pd.pd_cv, NULL, CV_DEFAULT, NULL);
/* See comment on ZIL traversal in dsl_scan_visitds. */
if (ds != NULL && !dsl_dataset_is_snapshot(ds)) {
objset_t *os;
err = dmu_objset_from_ds(ds, &os);
if (err)
return (err);
traverse_zil(&td, &os->os_zil_header);
}
if (!(flags & TRAVERSE_PREFETCH) ||
0 == taskq_dispatch(system_taskq, traverse_prefetch_thread,
&td, TQ_NOQUEUE))
pd.pd_exited = B_TRUE;
SET_BOOKMARK(&czb, td.td_objset,
ZB_ROOT_OBJECT, ZB_ROOT_LEVEL, ZB_ROOT_BLKID);
err = traverse_visitbp(&td, NULL, NULL, rootbp, &czb);
mutex_enter(&pd.pd_mtx);
pd.pd_cancel = B_TRUE;
cv_broadcast(&pd.pd_cv);
while (!pd.pd_exited)
cv_wait(&pd.pd_cv, &pd.pd_mtx);
mutex_exit(&pd.pd_mtx);
mutex_destroy(&pd.pd_mtx);
cv_destroy(&pd.pd_cv);
return (err);
}
/*
* NB: dataset must not be changing on-disk (e.g., it is a snapshot or we are
* in syncing context).
*/
int
traverse_dataset(dsl_dataset_t *ds, uint64_t txg_start, int flags,
blkptr_cb_t func, void *arg)
{
return (traverse_impl(ds->ds_dir->dd_pool->dp_spa, ds,
&ds->ds_phys->ds_bp, txg_start, flags, func, arg));
}
/*
* NB: pool must not be changing on-disk (e.g., from zdb or sync context).
*/
int
traverse_pool(spa_t *spa, uint64_t txg_start, int flags,
blkptr_cb_t func, void *arg)
{
int err, lasterr = 0;
uint64_t obj;
dsl_pool_t *dp = spa_get_dsl(spa);
objset_t *mos = dp->dp_meta_objset;
boolean_t hard = (flags & TRAVERSE_HARD);
/* visit the MOS */
err = traverse_impl(spa, NULL, spa_get_rootblkptr(spa),
txg_start, flags, func, arg);
if (err)
return (err);
/* visit each dataset */
for (obj = 1; err == 0 || (err != ESRCH && hard);
err = dmu_object_next(mos, &obj, FALSE, txg_start)) {
dmu_object_info_t doi;
err = dmu_object_info(mos, obj, &doi);
if (err) {
if (!hard)
return (err);
lasterr = err;
continue;
}
if (doi.doi_type == DMU_OT_DSL_DATASET) {
dsl_dataset_t *ds;
uint64_t txg = txg_start;
rw_enter(&dp->dp_config_rwlock, RW_READER);
err = dsl_dataset_hold_obj(dp, obj, FTAG, &ds);
rw_exit(&dp->dp_config_rwlock);
if (err) {
if (!hard)
return (err);
lasterr = err;
continue;
}
if (ds->ds_phys->ds_prev_snap_txg > txg)
txg = ds->ds_phys->ds_prev_snap_txg;
err = traverse_dataset(ds, txg, flags, func, arg);
dsl_dataset_rele(ds, FTAG);
if (err) {
if (!hard)
return (err);
lasterr = err;
}
}
}
if (err == ESRCH)
err = 0;
return (err != 0 ? err : lasterr);
}

uts/common/fs/zfs/dmu_tx.c (new file, 1382 lines): diff suppressed because it is too large

uts/common/fs/zfs/dmu_zfetch.c (new file, 724 lines):

@@ -0,0 +1,724 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2009 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#include <sys/zfs_context.h>
#include <sys/dnode.h>
#include <sys/dmu_objset.h>
#include <sys/dmu_zfetch.h>
#include <sys/dmu.h>
#include <sys/dbuf.h>
#include <sys/kstat.h>
/*
* I'm against tunables, but these should probably exist as tweakable globals
* until we can get this working the way we want it to.
*/
int zfs_prefetch_disable = 0;
/* max # of streams per zfetch */
uint32_t zfetch_max_streams = 8;
/* min time before stream reclaim */
uint32_t zfetch_min_sec_reap = 2;
/* max number of blocks to fetch at a time */
uint32_t zfetch_block_cap = 256;
/* number of bytes in an array_read at which we stop prefetching (1MB) */
uint64_t zfetch_array_rd_sz = 1024 * 1024;
/* forward decls for static routines */
static int dmu_zfetch_colinear(zfetch_t *, zstream_t *);
static void dmu_zfetch_dofetch(zfetch_t *, zstream_t *);
static uint64_t dmu_zfetch_fetch(dnode_t *, uint64_t, uint64_t);
static uint64_t dmu_zfetch_fetchsz(dnode_t *, uint64_t, uint64_t);
static int dmu_zfetch_find(zfetch_t *, zstream_t *, int);
static int dmu_zfetch_stream_insert(zfetch_t *, zstream_t *);
static zstream_t *dmu_zfetch_stream_reclaim(zfetch_t *);
static void dmu_zfetch_stream_remove(zfetch_t *, zstream_t *);
static int dmu_zfetch_streams_equal(zstream_t *, zstream_t *);
typedef struct zfetch_stats {
kstat_named_t zfetchstat_hits;
kstat_named_t zfetchstat_misses;
kstat_named_t zfetchstat_colinear_hits;
kstat_named_t zfetchstat_colinear_misses;
kstat_named_t zfetchstat_stride_hits;
kstat_named_t zfetchstat_stride_misses;
kstat_named_t zfetchstat_reclaim_successes;
kstat_named_t zfetchstat_reclaim_failures;
kstat_named_t zfetchstat_stream_resets;
kstat_named_t zfetchstat_stream_noresets;
kstat_named_t zfetchstat_bogus_streams;
} zfetch_stats_t;
static zfetch_stats_t zfetch_stats = {
{ "hits", KSTAT_DATA_UINT64 },
{ "misses", KSTAT_DATA_UINT64 },
{ "colinear_hits", KSTAT_DATA_UINT64 },
{ "colinear_misses", KSTAT_DATA_UINT64 },
{ "stride_hits", KSTAT_DATA_UINT64 },
{ "stride_misses", KSTAT_DATA_UINT64 },
{ "reclaim_successes", KSTAT_DATA_UINT64 },
{ "reclaim_failures", KSTAT_DATA_UINT64 },
{ "streams_resets", KSTAT_DATA_UINT64 },
{ "streams_noresets", KSTAT_DATA_UINT64 },
{ "bogus_streams", KSTAT_DATA_UINT64 },
};
#define ZFETCHSTAT_INCR(stat, val) \
atomic_add_64(&zfetch_stats.stat.value.ui64, (val));
#define ZFETCHSTAT_BUMP(stat) ZFETCHSTAT_INCR(stat, 1);
kstat_t *zfetch_ksp;
/*
* Given a zfetch structure and a zstream structure, determine whether the
* blocks to be read are part of a co-linear pair of existing prefetch
* streams. If a set is found, coalesce the streams, removing one, and
* configure the prefetch so it looks for a strided access pattern.
*
* In other words: if we find two sequential access streams that are
* the same length and distance N apart, and this read is N from the
* last stream, then we are probably in a strided access pattern. So
* combine the two sequential streams into a single strided stream.
*
* If no co-linear streams are found, return NULL.
*/
static int
dmu_zfetch_colinear(zfetch_t *zf, zstream_t *zh)
{
zstream_t *z_walk;
zstream_t *z_comp;
if (! rw_tryenter(&zf->zf_rwlock, RW_WRITER))
return (0);
if (zh == NULL) {
rw_exit(&zf->zf_rwlock);
return (0);
}
for (z_walk = list_head(&zf->zf_stream); z_walk;
z_walk = list_next(&zf->zf_stream, z_walk)) {
for (z_comp = list_next(&zf->zf_stream, z_walk); z_comp;
z_comp = list_next(&zf->zf_stream, z_comp)) {
int64_t diff;
if (z_walk->zst_len != z_walk->zst_stride ||
z_comp->zst_len != z_comp->zst_stride) {
continue;
}
diff = z_comp->zst_offset - z_walk->zst_offset;
if (z_comp->zst_offset + diff == zh->zst_offset) {
z_walk->zst_offset = zh->zst_offset;
z_walk->zst_direction = diff < 0 ? -1 : 1;
z_walk->zst_stride =
diff * z_walk->zst_direction;
z_walk->zst_ph_offset =
zh->zst_offset + z_walk->zst_stride;
dmu_zfetch_stream_remove(zf, z_comp);
mutex_destroy(&z_comp->zst_lock);
kmem_free(z_comp, sizeof (zstream_t));
dmu_zfetch_dofetch(zf, z_walk);
rw_exit(&zf->zf_rwlock);
return (1);
}
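/*
 * Symmetric case: the new read may instead extend z_walk's stream in
 * the opposite direction, with z_comp as the colinear partner.
 */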
diff = z_walk->zst_offset - z_comp->zst_offset;
if (z_walk->zst_offset + diff == zh->zst_offset) {
z_walk->zst_offset = zh->zst_offset;
z_walk->zst_direction = diff < 0 ? -1 : 1;
z_walk->zst_stride =
diff * z_walk->zst_direction;
z_walk->zst_ph_offset =
zh->zst_offset + z_walk->zst_stride;
dmu_zfetch_stream_remove(zf, z_comp);
mutex_destroy(&z_comp->zst_lock);
kmem_free(z_comp, sizeof (zstream_t));
dmu_zfetch_dofetch(zf, z_walk);
rw_exit(&zf->zf_rwlock);
return (1);
}
}
}
rw_exit(&zf->zf_rwlock);
return (0);
}
/*
* Given a zstream_t, determine the bounds of the prefetch. Then call the
* routine that actually prefetches the individual blocks.
*/
static void
dmu_zfetch_dofetch(zfetch_t *zf, zstream_t *zs)
{
uint64_t prefetch_tail;
uint64_t prefetch_limit;
uint64_t prefetch_ofst;
uint64_t prefetch_len;
uint64_t blocks_fetched;
zs->zst_stride = MAX((int64_t)zs->zst_stride, zs->zst_len);
zs->zst_cap = MIN(zfetch_block_cap, 2 * zs->zst_cap);
prefetch_tail = MAX((int64_t)zs->zst_ph_offset,
(int64_t)(zs->zst_offset + zs->zst_stride));
/*
* XXX: use a faster division method?
*/
prefetch_limit = zs->zst_offset + zs->zst_len +
(zs->zst_cap * zs->zst_stride) / zs->zst_len;
while (prefetch_tail < prefetch_limit) {
prefetch_ofst = zs->zst_offset + zs->zst_direction *
(prefetch_tail - zs->zst_offset);
prefetch_len = zs->zst_len;
/*
* Don't prefetch past the beginning of the file, if working
* backwards.
*/
if ((zs->zst_direction == ZFETCH_BACKWARD) &&
(prefetch_ofst > prefetch_tail)) {
prefetch_len += prefetch_ofst;
prefetch_ofst = 0;
}
/* don't prefetch more than we're supposed to */
if (prefetch_len > zs->zst_len)
break;
blocks_fetched = dmu_zfetch_fetch(zf->zf_dnode,
prefetch_ofst, zs->zst_len);
prefetch_tail += zs->zst_stride;
/* stop if we've run out of stuff to prefetch */
if (blocks_fetched < zs->zst_len)
break;
}
zs->zst_ph_offset = prefetch_tail;
zs->zst_last = ddi_get_lbolt();
}
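/*
* Worked example of the bounds computed above: a plain sequential stream
* with zst_offset == 0 and zst_len == 4 has its stride raised to 4, and
* with zst_cap == 16 the limit is 0 + 4 + (16 * 4) / 4 = 20.  Starting
* from prefetch_tail == 4, the loop issues 4-block prefetches at offsets
* 4, 8, 12 and 16, then records zst_ph_offset = 20.
*/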
void
zfetch_init(void)
{
zfetch_ksp = kstat_create("zfs", 0, "zfetchstats", "misc",
KSTAT_TYPE_NAMED, sizeof (zfetch_stats) / sizeof (kstat_named_t),
KSTAT_FLAG_VIRTUAL);
if (zfetch_ksp != NULL) {
zfetch_ksp->ks_data = &zfetch_stats;
kstat_install(zfetch_ksp);
}
}
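/*
* The counters registered above are visible from userland; on Solaris,
* for example, "kstat -m zfs -n zfetchstats" should display the
* hit/miss/reclaim statistics declared in zfetch_stats.
*/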
void
zfetch_fini(void)
{
if (zfetch_ksp != NULL) {
kstat_delete(zfetch_ksp);
zfetch_ksp = NULL;
}
}
/*
* This takes a pointer to a zfetch structure and a dnode. It performs the
* necessary setup for the zfetch structure, grokking data from the
* associated dnode.
*/
void
dmu_zfetch_init(zfetch_t *zf, dnode_t *dno)
{
if (zf == NULL) {
return;
}
zf->zf_dnode = dno;
zf->zf_stream_cnt = 0;
zf->zf_alloc_fail = 0;
list_create(&zf->zf_stream, sizeof (zstream_t),
offsetof(zstream_t, zst_node));
rw_init(&zf->zf_rwlock, NULL, RW_DEFAULT, NULL);
}
/*
* This function computes the actual size, in blocks, that can be prefetched,
* and fetches it.
*/
static uint64_t
dmu_zfetch_fetch(dnode_t *dn, uint64_t blkid, uint64_t nblks)
{
uint64_t fetchsz;
uint64_t i;
fetchsz = dmu_zfetch_fetchsz(dn, blkid, nblks);
for (i = 0; i < fetchsz; i++) {
dbuf_prefetch(dn, blkid + i);
}
return (fetchsz);
}
/*
* This function returns the number of blocks that would be prefetched, based
* upon the supplied dnode, blockid, and nblks. This is used so that we can
* update streams in place, and then prefetch with their old value after the
* fact. This way, we can delay the prefetch, but subsequent accesses to the
* stream won't result in the same data being prefetched multiple times.
*/
static uint64_t
dmu_zfetch_fetchsz(dnode_t *dn, uint64_t blkid, uint64_t nblks)
{
uint64_t fetchsz;
if (blkid > dn->dn_maxblkid) {
return (0);
}
/* compute fetch size */
if (blkid + nblks + 1 > dn->dn_maxblkid) {
fetchsz = (dn->dn_maxblkid - blkid) + 1;
ASSERT(blkid + fetchsz - 1 <= dn->dn_maxblkid);
} else {
fetchsz = nblks;
}
return (fetchsz);
}
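/*
* Worked example of the clamping above: with dn_maxblkid == 100, a
* request for blkid == 98 and nblks == 8 has 98 + 8 + 1 > 100, so
* fetchsz = (100 - 98) + 1 = 3; only blocks 98..100 exist to fetch.
*/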
/*
* Given a zfetch and a zstream structure, see if there is an associated zstream
* for this block read. If so, start a prefetch for the stream it
* locates and return true; otherwise return false.
*/
static int
dmu_zfetch_find(zfetch_t *zf, zstream_t *zh, int prefetched)
{
zstream_t *zs;
int64_t diff;
int reset = !prefetched;
int rc = 0;
if (zh == NULL)
return (0);
/*
* XXX: This locking strategy is a bit coarse; however, its impact has
* yet to be tested. If this turns out to be an issue, it can be
* modified in a number of different ways.
*/
rw_enter(&zf->zf_rwlock, RW_READER);
top:
for (zs = list_head(&zf->zf_stream); zs;
zs = list_next(&zf->zf_stream, zs)) {
/*
* XXX - should this be an assert?
*/
if (zs->zst_len == 0) {
/* bogus stream */
ZFETCHSTAT_BUMP(zfetchstat_bogus_streams);
continue;
}
/*
* We hit this case when we are in a strided prefetch stream:
* we will read "len" blocks before "striding".
*/
if (zh->zst_offset >= zs->zst_offset &&
zh->zst_offset < zs->zst_offset + zs->zst_len) {
if (prefetched) {
/* already fetched */
ZFETCHSTAT_BUMP(zfetchstat_stride_hits);
rc = 1;
goto out;
} else {
ZFETCHSTAT_BUMP(zfetchstat_stride_misses);
}
}
/*
* This is the forward sequential read case: we increment
* len by one each time we hit here, so we will enter this
* case on every read.
*/
if (zh->zst_offset == zs->zst_offset + zs->zst_len) {
reset = !prefetched && zs->zst_len > 1;
mutex_enter(&zs->zst_lock);
if (zh->zst_offset != zs->zst_offset + zs->zst_len) {
mutex_exit(&zs->zst_lock);
goto top;
}
zs->zst_len += zh->zst_len;
diff = zs->zst_len - zfetch_block_cap;
if (diff > 0) {
zs->zst_offset += diff;
zs->zst_len = zs->zst_len > diff ?
zs->zst_len - diff : 0;
}
zs->zst_direction = ZFETCH_FORWARD;
break;
/*
* Same as above, but reading backwards through the file.
*/
} else if (zh->zst_offset == zs->zst_offset - zh->zst_len) {
/* backwards sequential access */
reset = !prefetched && zs->zst_len > 1;
mutex_enter(&zs->zst_lock);
if (zh->zst_offset != zs->zst_offset - zh->zst_len) {
mutex_exit(&zs->zst_lock);
goto top;
}
zs->zst_offset = zs->zst_offset > zh->zst_len ?
zs->zst_offset - zh->zst_len : 0;
zs->zst_ph_offset = zs->zst_ph_offset > zh->zst_len ?
zs->zst_ph_offset - zh->zst_len : 0;
zs->zst_len += zh->zst_len;
diff = zs->zst_len - zfetch_block_cap;
if (diff > 0) {
zs->zst_ph_offset = zs->zst_ph_offset > diff ?
zs->zst_ph_offset - diff : 0;
zs->zst_len = zs->zst_len > diff ?
zs->zst_len - diff : zs->zst_len;
}
zs->zst_direction = ZFETCH_BACKWARD;
break;
} else if ((zh->zst_offset - zs->zst_offset - zs->zst_stride <
zs->zst_len) && (zs->zst_len != zs->zst_stride)) {
/* strided forward access */
mutex_enter(&zs->zst_lock);
if ((zh->zst_offset - zs->zst_offset - zs->zst_stride >=
zs->zst_len) || (zs->zst_len == zs->zst_stride)) {
mutex_exit(&zs->zst_lock);
goto top;
}
zs->zst_offset += zs->zst_stride;
zs->zst_direction = ZFETCH_FORWARD;
break;
} else if ((zh->zst_offset - zs->zst_offset + zs->zst_stride <
zs->zst_len) && (zs->zst_len != zs->zst_stride)) {
/* strided reverse access */
mutex_enter(&zs->zst_lock);
if ((zh->zst_offset - zs->zst_offset + zs->zst_stride >=
zs->zst_len) || (zs->zst_len == zs->zst_stride)) {
mutex_exit(&zs->zst_lock);
goto top;
}
zs->zst_offset = zs->zst_offset > zs->zst_stride ?
zs->zst_offset - zs->zst_stride : 0;
zs->zst_ph_offset = (zs->zst_ph_offset >
(2 * zs->zst_stride)) ?
(zs->zst_ph_offset - (2 * zs->zst_stride)) : 0;
zs->zst_direction = ZFETCH_BACKWARD;
break;
}
}
if (zs) {
if (reset) {
zstream_t *remove = zs;
ZFETCHSTAT_BUMP(zfetchstat_stream_resets);
rc = 0;
mutex_exit(&zs->zst_lock);
rw_exit(&zf->zf_rwlock);
rw_enter(&zf->zf_rwlock, RW_WRITER);
/*
* Relocate the stream, in case someone removes
* it while we were acquiring the WRITER lock.
*/
for (zs = list_head(&zf->zf_stream); zs;
zs = list_next(&zf->zf_stream, zs)) {
if (zs == remove) {
dmu_zfetch_stream_remove(zf, zs);
mutex_destroy(&zs->zst_lock);
kmem_free(zs, sizeof (zstream_t));
break;
}
}
} else {
ZFETCHSTAT_BUMP(zfetchstat_stream_noresets);
rc = 1;
dmu_zfetch_dofetch(zf, zs);
mutex_exit(&zs->zst_lock);
}
}
out:
rw_exit(&zf->zf_rwlock);
return (rc);
}
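/*
* Worked example of the forward sequential case above: a stream with
* zst_offset == 0 and zst_len == 4 matches a one-block read at offset 4
* (zh->zst_offset == zst_offset + zst_len) and grows to len 5.  Note
* that each case re-checks its condition after taking zst_lock: the
* stream can change between the unlocked test and the mutex acquisition
* (zf_rwlock is only held as READER here), in which case the scan
* restarts at "top".
*/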
/*
* Clean up state associated with a zfetch structure. This frees allocated
* structure members, empties the zf_stream list, and generally makes things
* nice. This doesn't free the zfetch_t itself; that's left to the caller.
*/
void
dmu_zfetch_rele(zfetch_t *zf)
{
zstream_t *zs;
zstream_t *zs_next;
ASSERT(!RW_LOCK_HELD(&zf->zf_rwlock));
for (zs = list_head(&zf->zf_stream); zs; zs = zs_next) {
zs_next = list_next(&zf->zf_stream, zs);
list_remove(&zf->zf_stream, zs);
mutex_destroy(&zs->zst_lock);
kmem_free(zs, sizeof (zstream_t));
}
list_destroy(&zf->zf_stream);
rw_destroy(&zf->zf_rwlock);
zf->zf_dnode = NULL;
}
/*
* Given a zfetch and zstream structure, insert the zstream structure into the
* stream list contained within the zfetch structure. Perform the appropriate
* book-keeping. It is possible that another thread has inserted a stream which
* matches one that we are about to insert, so we must be sure to check for this
* case. If one is found, return failure, and let the caller cleanup the
* duplicates.
*/
static int
dmu_zfetch_stream_insert(zfetch_t *zf, zstream_t *zs)
{
zstream_t *zs_walk;
zstream_t *zs_next;
ASSERT(RW_WRITE_HELD(&zf->zf_rwlock));
for (zs_walk = list_head(&zf->zf_stream); zs_walk; zs_walk = zs_next) {
zs_next = list_next(&zf->zf_stream, zs_walk);
if (dmu_zfetch_streams_equal(zs_walk, zs)) {
return (0);
}
}
list_insert_head(&zf->zf_stream, zs);
zf->zf_stream_cnt++;
return (1);
}
/*
* Walk the list of zstreams in the given zfetch, find an old one (by time), and
* reclaim it for use by the caller.
*/
static zstream_t *
dmu_zfetch_stream_reclaim(zfetch_t *zf)
{
zstream_t *zs;
if (! rw_tryenter(&zf->zf_rwlock, RW_WRITER))
return (NULL);
for (zs = list_head(&zf->zf_stream); zs;
zs = list_next(&zf->zf_stream, zs)) {
if (((ddi_get_lbolt() - zs->zst_last)/hz) > zfetch_min_sec_reap)
break;
}
if (zs) {
dmu_zfetch_stream_remove(zf, zs);
mutex_destroy(&zs->zst_lock);
bzero(zs, sizeof (zstream_t));
} else {
zf->zf_alloc_fail++;
}
rw_exit(&zf->zf_rwlock);
return (zs);
}
/*
* Given a zfetch and zstream structure, remove the zstream structure from its
* container in the zfetch structure. Perform the appropriate book-keeping.
*/
static void
dmu_zfetch_stream_remove(zfetch_t *zf, zstream_t *zs)
{
ASSERT(RW_WRITE_HELD(&zf->zf_rwlock));
list_remove(&zf->zf_stream, zs);
zf->zf_stream_cnt--;
}
static int
dmu_zfetch_streams_equal(zstream_t *zs1, zstream_t *zs2)
{
if (zs1->zst_offset != zs2->zst_offset)
return (0);
if (zs1->zst_len != zs2->zst_len)
return (0);
if (zs1->zst_stride != zs2->zst_stride)
return (0);
if (zs1->zst_ph_offset != zs2->zst_ph_offset)
return (0);
if (zs1->zst_cap != zs2->zst_cap)
return (0);
if (zs1->zst_direction != zs2->zst_direction)
return (0);
return (1);
}
/*
* This is the prefetch entry point. It calls all of the other dmu_zfetch
* routines to create, delete, find, or operate upon prefetch streams.
*/
void
dmu_zfetch(zfetch_t *zf, uint64_t offset, uint64_t size, int prefetched)
{
zstream_t zst;
zstream_t *newstream;
int fetched;
int inserted;
unsigned int blkshft;
uint64_t blksz;
if (zfs_prefetch_disable)
return;
/* files that aren't ln2 blocksz are only one block -- nothing to do */
if (!zf->zf_dnode->dn_datablkshift)
return;
/* convert offset and size into blockid and nblocks */
blkshft = zf->zf_dnode->dn_datablkshift;
blksz = (1 << blkshft);
bzero(&zst, sizeof (zstream_t));
zst.zst_offset = offset >> blkshft;
zst.zst_len = (P2ROUNDUP(offset + size, blksz) -
P2ALIGN(offset, blksz)) >> blkshft;
fetched = dmu_zfetch_find(zf, &zst, prefetched);
if (fetched) {
ZFETCHSTAT_BUMP(zfetchstat_hits);
} else {
ZFETCHSTAT_BUMP(zfetchstat_misses);
if ((fetched = dmu_zfetch_colinear(zf, &zst)) != 0) {
ZFETCHSTAT_BUMP(zfetchstat_colinear_hits);
} else {
ZFETCHSTAT_BUMP(zfetchstat_colinear_misses);
}
}
if (!fetched) {
newstream = dmu_zfetch_stream_reclaim(zf);
/*
* we still couldn't find a stream, drop the lock, and allocate
* one if possible. Otherwise, give up and go home.
*/
if (newstream) {
ZFETCHSTAT_BUMP(zfetchstat_reclaim_successes);
} else {
uint64_t maxblocks;
uint32_t max_streams;
uint32_t cur_streams;
ZFETCHSTAT_BUMP(zfetchstat_reclaim_failures);
cur_streams = zf->zf_stream_cnt;
maxblocks = zf->zf_dnode->dn_maxblkid;
max_streams = MIN(zfetch_max_streams,
(maxblocks / zfetch_block_cap));
if (max_streams == 0) {
max_streams++;
}
if (cur_streams >= max_streams) {
return;
}
newstream = kmem_zalloc(sizeof (zstream_t), KM_SLEEP);
}
newstream->zst_offset = zst.zst_offset;
newstream->zst_len = zst.zst_len;
newstream->zst_stride = zst.zst_len;
newstream->zst_ph_offset = zst.zst_len + zst.zst_offset;
newstream->zst_cap = zst.zst_len;
newstream->zst_direction = ZFETCH_FORWARD;
newstream->zst_last = ddi_get_lbolt();
mutex_init(&newstream->zst_lock, NULL, MUTEX_DEFAULT, NULL);
rw_enter(&zf->zf_rwlock, RW_WRITER);
inserted = dmu_zfetch_stream_insert(zf, newstream);
rw_exit(&zf->zf_rwlock);
if (!inserted) {
mutex_destroy(&newstream->zst_lock);
kmem_free(newstream, sizeof (zstream_t));
}
}
}
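/*
* Worked example of the conversion above: with 128K blocks
* (dn_datablkshift == 17), a 256K read at offset 1M gives
* zst_offset = 1048576 >> 17 = 8 and
* zst_len = (P2ROUNDUP(1310720, 131072) - P2ALIGN(1048576, 131072)) >> 17
* = 2, i.e. the access touches blocks 8 and 9.
*/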

1993
uts/common/fs/zfs/dnode.c Normal file

File diff suppressed because it is too large

@ -0,0 +1,693 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#include <sys/zfs_context.h>
#include <sys/dbuf.h>
#include <sys/dnode.h>
#include <sys/dmu.h>
#include <sys/dmu_tx.h>
#include <sys/dmu_objset.h>
#include <sys/dsl_dataset.h>
#include <sys/spa.h>
static void
dnode_increase_indirection(dnode_t *dn, dmu_tx_t *tx)
{
dmu_buf_impl_t *db;
int txgoff = tx->tx_txg & TXG_MASK;
int nblkptr = dn->dn_phys->dn_nblkptr;
int old_toplvl = dn->dn_phys->dn_nlevels - 1;
int new_level = dn->dn_next_nlevels[txgoff];
int i;
rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
/* this dnode can't be paged out because it's dirty */
ASSERT(dn->dn_phys->dn_type != DMU_OT_NONE);
ASSERT(RW_WRITE_HELD(&dn->dn_struct_rwlock));
ASSERT(new_level > 1 && dn->dn_phys->dn_nlevels > 0);
db = dbuf_hold_level(dn, dn->dn_phys->dn_nlevels, 0, FTAG);
ASSERT(db != NULL);
dn->dn_phys->dn_nlevels = new_level;
dprintf("os=%p obj=%llu, increase to %d\n", dn->dn_objset,
dn->dn_object, dn->dn_phys->dn_nlevels);
/* check for existing blkptrs in the dnode */
for (i = 0; i < nblkptr; i++)
if (!BP_IS_HOLE(&dn->dn_phys->dn_blkptr[i]))
break;
if (i != nblkptr) {
/* transfer dnode's block pointers to new indirect block */
(void) dbuf_read(db, NULL, DB_RF_MUST_SUCCEED|DB_RF_HAVESTRUCT);
ASSERT(db->db.db_data);
ASSERT(arc_released(db->db_buf));
ASSERT3U(sizeof (blkptr_t) * nblkptr, <=, db->db.db_size);
bcopy(dn->dn_phys->dn_blkptr, db->db.db_data,
sizeof (blkptr_t) * nblkptr);
arc_buf_freeze(db->db_buf);
}
/* set dbuf's parent pointers to new indirect buf */
for (i = 0; i < nblkptr; i++) {
dmu_buf_impl_t *child = dbuf_find(dn, old_toplvl, i);
if (child == NULL)
continue;
#ifdef DEBUG
DB_DNODE_ENTER(child);
ASSERT3P(DB_DNODE(child), ==, dn);
DB_DNODE_EXIT(child);
#endif /* DEBUG */
if (child->db_parent && child->db_parent != dn->dn_dbuf) {
ASSERT(child->db_parent->db_level == db->db_level);
ASSERT(child->db_blkptr !=
&dn->dn_phys->dn_blkptr[child->db_blkid]);
mutex_exit(&child->db_mtx);
continue;
}
ASSERT(child->db_parent == NULL ||
child->db_parent == dn->dn_dbuf);
child->db_parent = db;
dbuf_add_ref(db, child);
if (db->db.db_data)
child->db_blkptr = (blkptr_t *)db->db.db_data + i;
else
child->db_blkptr = NULL;
dprintf_dbuf_bp(child, child->db_blkptr,
"changed db_blkptr to new indirect %s", "");
mutex_exit(&child->db_mtx);
}
bzero(dn->dn_phys->dn_blkptr, sizeof (blkptr_t) * nblkptr);
dbuf_rele(db, FTAG);
rw_exit(&dn->dn_struct_rwlock);
}
static int
free_blocks(dnode_t *dn, blkptr_t *bp, int num, dmu_tx_t *tx)
{
dsl_dataset_t *ds = dn->dn_objset->os_dsl_dataset;
uint64_t bytesfreed = 0;
int i, blocks_freed = 0;
dprintf("ds=%p obj=%llx num=%d\n", ds, dn->dn_object, num);
for (i = 0; i < num; i++, bp++) {
if (BP_IS_HOLE(bp))
continue;
bytesfreed += dsl_dataset_block_kill(ds, bp, tx, B_FALSE);
ASSERT3U(bytesfreed, <=, DN_USED_BYTES(dn->dn_phys));
bzero(bp, sizeof (blkptr_t));
blocks_freed += 1;
}
dnode_diduse_space(dn, -bytesfreed);
return (blocks_freed);
}
#ifdef ZFS_DEBUG
static void
free_verify(dmu_buf_impl_t *db, uint64_t start, uint64_t end, dmu_tx_t *tx)
{
int off, num;
int i, err, epbs;
uint64_t txg = tx->tx_txg;
dnode_t *dn;
DB_DNODE_ENTER(db);
dn = DB_DNODE(db);
epbs = dn->dn_phys->dn_indblkshift - SPA_BLKPTRSHIFT;
off = start - (db->db_blkid << epbs);
num = end - start + 1;
ASSERT3U(off, >=, 0);
ASSERT3U(num, >=, 0);
ASSERT3U(db->db_level, >, 0);
ASSERT3U(db->db.db_size, ==, 1 << dn->dn_phys->dn_indblkshift);
ASSERT3U(off+num, <=, db->db.db_size >> SPA_BLKPTRSHIFT);
ASSERT(db->db_blkptr != NULL);
for (i = off; i < off+num; i++) {
uint64_t *buf;
dmu_buf_impl_t *child;
dbuf_dirty_record_t *dr;
int j;
ASSERT(db->db_level == 1);
rw_enter(&dn->dn_struct_rwlock, RW_READER);
err = dbuf_hold_impl(dn, db->db_level-1,
(db->db_blkid << epbs) + i, TRUE, FTAG, &child);
rw_exit(&dn->dn_struct_rwlock);
if (err == ENOENT)
continue;
ASSERT(err == 0);
ASSERT(child->db_level == 0);
dr = child->db_last_dirty;
while (dr && dr->dr_txg > txg)
dr = dr->dr_next;
ASSERT(dr == NULL || dr->dr_txg == txg);
/* data_old better be zeroed */
if (dr) {
buf = dr->dt.dl.dr_data->b_data;
for (j = 0; j < child->db.db_size >> 3; j++) {
if (buf[j] != 0) {
panic("freed data not zero: "
"child=%p i=%d off=%d num=%d\n",
(void *)child, i, off, num);
}
}
}
/*
* db_data better be zeroed unless it's dirty in a
* future txg.
*/
mutex_enter(&child->db_mtx);
buf = child->db.db_data;
if (buf != NULL && child->db_state != DB_FILL &&
child->db_last_dirty == NULL) {
for (j = 0; j < child->db.db_size >> 3; j++) {
if (buf[j] != 0) {
panic("freed data not zero: "
"child=%p i=%d off=%d num=%d\n",
(void *)child, i, off, num);
}
}
}
mutex_exit(&child->db_mtx);
dbuf_rele(child, FTAG);
}
DB_DNODE_EXIT(db);
}
#endif
#define ALL -1
static int
free_children(dmu_buf_impl_t *db, uint64_t blkid, uint64_t nblks, int trunc,
dmu_tx_t *tx)
{
dnode_t *dn;
blkptr_t *bp;
dmu_buf_impl_t *subdb;
uint64_t start, end, dbstart, dbend, i;
int epbs, shift, err;
int all = TRUE;
int blocks_freed = 0;
/*
* There is a small possibility that this block will not be cached:
* 1 - if level > 1 and there are no children with level <= 1
* 2 - if we didn't get a dirty hold (because this block had just
* finished being written -- and so had no holds), and then this
* block got evicted before we got here.
*/
if (db->db_state != DB_CACHED)
(void) dbuf_read(db, NULL, DB_RF_MUST_SUCCEED);
dbuf_release_bp(db);
bp = (blkptr_t *)db->db.db_data;
DB_DNODE_ENTER(db);
dn = DB_DNODE(db);
epbs = dn->dn_phys->dn_indblkshift - SPA_BLKPTRSHIFT;
shift = (db->db_level - 1) * epbs;
dbstart = db->db_blkid << epbs;
start = blkid >> shift;
if (dbstart < start) {
bp += start - dbstart;
all = FALSE;
} else {
start = dbstart;
}
dbend = ((db->db_blkid + 1) << epbs) - 1;
end = (blkid + nblks - 1) >> shift;
if (dbend <= end)
end = dbend;
else if (all)
all = trunc;
ASSERT3U(start, <=, end);
if (db->db_level == 1) {
FREE_VERIFY(db, start, end, tx);
blocks_freed = free_blocks(dn, bp, end-start+1, tx);
arc_buf_freeze(db->db_buf);
ASSERT(all || blocks_freed == 0 || db->db_last_dirty);
DB_DNODE_EXIT(db);
return (all ? ALL : blocks_freed);
}
for (i = start; i <= end; i++, bp++) {
if (BP_IS_HOLE(bp))
continue;
rw_enter(&dn->dn_struct_rwlock, RW_READER);
err = dbuf_hold_impl(dn, db->db_level-1, i, TRUE, FTAG, &subdb);
ASSERT3U(err, ==, 0);
rw_exit(&dn->dn_struct_rwlock);
if (free_children(subdb, blkid, nblks, trunc, tx) == ALL) {
ASSERT3P(subdb->db_blkptr, ==, bp);
blocks_freed += free_blocks(dn, bp, 1, tx);
} else {
all = FALSE;
}
dbuf_rele(subdb, FTAG);
}
DB_DNODE_EXIT(db);
arc_buf_freeze(db->db_buf);
#ifdef ZFS_DEBUG
bp -= (end-start)+1;
for (i = start; i <= end; i++, bp++) {
if (i == start && blkid != 0)
continue;
else if (i == end && !trunc)
continue;
ASSERT3U(bp->blk_birth, ==, 0);
}
#endif
ASSERT(all || blocks_freed == 0 || db->db_last_dirty);
return (all ? ALL : blocks_freed);
}
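/*
* Worked example of the epbs arithmetic above: with
* dn_indblkshift == 14 and SPA_BLKPTRSHIFT == 7, epbs == 7, so each 16K
* indirect block holds 128 block pointers.  For a level-2 indirect
* block, shift == (2 - 1) * 7 == 7, and a range starting at data block
* 1000 maps to slot start = 1000 >> 7 = 7 within that block.
*/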
/*
* free_range: Traverse the indicated range of the provided file
* and "free" all the blocks contained there.
*/
static void
dnode_sync_free_range(dnode_t *dn, uint64_t blkid, uint64_t nblks, dmu_tx_t *tx)
{
blkptr_t *bp = dn->dn_phys->dn_blkptr;
dmu_buf_impl_t *db;
int trunc, start, end, shift, i, err;
int dnlevel = dn->dn_phys->dn_nlevels;
if (blkid > dn->dn_phys->dn_maxblkid)
return;
ASSERT(dn->dn_phys->dn_maxblkid < UINT64_MAX);
trunc = blkid + nblks > dn->dn_phys->dn_maxblkid;
if (trunc)
nblks = dn->dn_phys->dn_maxblkid - blkid + 1;
/* There are no indirect blocks in the object */
if (dnlevel == 1) {
if (blkid >= dn->dn_phys->dn_nblkptr) {
/* this range was never made persistent */
return;
}
ASSERT3U(blkid + nblks, <=, dn->dn_phys->dn_nblkptr);
(void) free_blocks(dn, bp + blkid, nblks, tx);
if (trunc) {
uint64_t off = (dn->dn_phys->dn_maxblkid + 1) *
(dn->dn_phys->dn_datablkszsec << SPA_MINBLOCKSHIFT);
dn->dn_phys->dn_maxblkid = (blkid ? blkid - 1 : 0);
ASSERT(off < dn->dn_phys->dn_maxblkid ||
dn->dn_phys->dn_maxblkid == 0 ||
dnode_next_offset(dn, 0, &off, 1, 1, 0) != 0);
}
return;
}
shift = (dnlevel - 1) * (dn->dn_phys->dn_indblkshift - SPA_BLKPTRSHIFT);
start = blkid >> shift;
ASSERT(start < dn->dn_phys->dn_nblkptr);
end = (blkid + nblks - 1) >> shift;
bp += start;
for (i = start; i <= end; i++, bp++) {
if (BP_IS_HOLE(bp))
continue;
rw_enter(&dn->dn_struct_rwlock, RW_READER);
err = dbuf_hold_impl(dn, dnlevel-1, i, TRUE, FTAG, &db);
ASSERT3U(err, ==, 0);
rw_exit(&dn->dn_struct_rwlock);
if (free_children(db, blkid, nblks, trunc, tx) == ALL) {
ASSERT3P(db->db_blkptr, ==, bp);
(void) free_blocks(dn, bp, 1, tx);
}
dbuf_rele(db, FTAG);
}
if (trunc) {
uint64_t off = (dn->dn_phys->dn_maxblkid + 1) *
(dn->dn_phys->dn_datablkszsec << SPA_MINBLOCKSHIFT);
dn->dn_phys->dn_maxblkid = (blkid ? blkid - 1 : 0);
ASSERT(off < dn->dn_phys->dn_maxblkid ||
dn->dn_phys->dn_maxblkid == 0 ||
dnode_next_offset(dn, 0, &off, 1, 1, 0) != 0);
}
}
/*
* Try to kick all the dnode's dbufs out of the cache...
*/
void
dnode_evict_dbufs(dnode_t *dn)
{
int progress;
int pass = 0;
do {
dmu_buf_impl_t *db, marker;
int evicting = FALSE;
progress = FALSE;
mutex_enter(&dn->dn_dbufs_mtx);
list_insert_tail(&dn->dn_dbufs, &marker);
db = list_head(&dn->dn_dbufs);
for (; db != &marker; db = list_head(&dn->dn_dbufs)) {
list_remove(&dn->dn_dbufs, db);
list_insert_tail(&dn->dn_dbufs, db);
#ifdef DEBUG
DB_DNODE_ENTER(db);
ASSERT3P(DB_DNODE(db), ==, dn);
DB_DNODE_EXIT(db);
#endif /* DEBUG */
mutex_enter(&db->db_mtx);
if (db->db_state == DB_EVICTING) {
progress = TRUE;
evicting = TRUE;
mutex_exit(&db->db_mtx);
} else if (refcount_is_zero(&db->db_holds)) {
progress = TRUE;
dbuf_clear(db); /* exits db_mtx for us */
} else {
mutex_exit(&db->db_mtx);
}
}
list_remove(&dn->dn_dbufs, &marker);
/*
* NB: we need to drop dn_dbufs_mtx between passes so
* that any DB_EVICTING dbufs can make progress.
* Ideally, we would have some cv we could wait on, but
* since we don't, just wait a bit to give the other
* thread a chance to run.
*/
mutex_exit(&dn->dn_dbufs_mtx);
if (evicting)
delay(1);
pass++;
ASSERT(pass < 100); /* sanity check */
} while (progress);
rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
if (dn->dn_bonus && refcount_is_zero(&dn->dn_bonus->db_holds)) {
mutex_enter(&dn->dn_bonus->db_mtx);
dbuf_evict(dn->dn_bonus);
dn->dn_bonus = NULL;
}
rw_exit(&dn->dn_struct_rwlock);
}
static void
dnode_undirty_dbufs(list_t *list)
{
dbuf_dirty_record_t *dr;
while ((dr = list_head(list)) != NULL) {
dmu_buf_impl_t *db = dr->dr_dbuf;
uint64_t txg = dr->dr_txg;
if (db->db_level != 0)
dnode_undirty_dbufs(&dr->dt.di.dr_children);
mutex_enter(&db->db_mtx);
/* XXX - use dbuf_undirty()? */
list_remove(list, dr);
ASSERT(db->db_last_dirty == dr);
db->db_last_dirty = NULL;
db->db_dirtycnt -= 1;
if (db->db_level == 0) {
ASSERT(db->db_blkid == DMU_BONUS_BLKID ||
dr->dt.dl.dr_data == db->db_buf);
dbuf_unoverride(dr);
}
kmem_free(dr, sizeof (dbuf_dirty_record_t));
dbuf_rele_and_unlock(db, (void *)(uintptr_t)txg);
}
}
static void
dnode_sync_free(dnode_t *dn, dmu_tx_t *tx)
{
int txgoff = tx->tx_txg & TXG_MASK;
ASSERT(dmu_tx_is_syncing(tx));
/*
* Our contents should have been freed in dnode_sync() by the
* free range record inserted by the caller of dnode_free().
*/
ASSERT3U(DN_USED_BYTES(dn->dn_phys), ==, 0);
ASSERT(BP_IS_HOLE(dn->dn_phys->dn_blkptr));
dnode_undirty_dbufs(&dn->dn_dirty_records[txgoff]);
dnode_evict_dbufs(dn);
ASSERT3P(list_head(&dn->dn_dbufs), ==, NULL);
/*
* XXX - It would be nice to assert this, but we may still
* have residual holds from async evictions from the arc...
*
* zfs_obj_to_path() also depends on this being
* commented out.
*
* ASSERT3U(refcount_count(&dn->dn_holds), ==, 1);
*/
/* Undirty next bits */
dn->dn_next_nlevels[txgoff] = 0;
dn->dn_next_indblkshift[txgoff] = 0;
dn->dn_next_blksz[txgoff] = 0;
/* ASSERT(blkptrs are zero); */
ASSERT(dn->dn_phys->dn_type != DMU_OT_NONE);
ASSERT(dn->dn_type != DMU_OT_NONE);
ASSERT(dn->dn_free_txg > 0);
if (dn->dn_allocated_txg != dn->dn_free_txg)
dbuf_will_dirty(dn->dn_dbuf, tx);
bzero(dn->dn_phys, sizeof (dnode_phys_t));
mutex_enter(&dn->dn_mtx);
dn->dn_type = DMU_OT_NONE;
dn->dn_maxblkid = 0;
dn->dn_allocated_txg = 0;
dn->dn_free_txg = 0;
dn->dn_have_spill = B_FALSE;
mutex_exit(&dn->dn_mtx);
ASSERT(dn->dn_object != DMU_META_DNODE_OBJECT);
dnode_rele(dn, (void *)(uintptr_t)tx->tx_txg);
/*
* Now that we've released our hold, the dnode may
* be evicted, so we mustn't access it.
*/
}
/*
* Write out the dnode's dirty buffers.
*/
void
dnode_sync(dnode_t *dn, dmu_tx_t *tx)
{
free_range_t *rp;
dnode_phys_t *dnp = dn->dn_phys;
int txgoff = tx->tx_txg & TXG_MASK;
list_t *list = &dn->dn_dirty_records[txgoff];
static const dnode_phys_t zerodn = { 0 };
boolean_t kill_spill = B_FALSE;
ASSERT(dmu_tx_is_syncing(tx));
ASSERT(dnp->dn_type != DMU_OT_NONE || dn->dn_allocated_txg);
ASSERT(dnp->dn_type != DMU_OT_NONE ||
bcmp(dnp, &zerodn, DNODE_SIZE) == 0);
DNODE_VERIFY(dn);
ASSERT(dn->dn_dbuf == NULL || arc_released(dn->dn_dbuf->db_buf));
if (dmu_objset_userused_enabled(dn->dn_objset) &&
!DMU_OBJECT_IS_SPECIAL(dn->dn_object)) {
mutex_enter(&dn->dn_mtx);
dn->dn_oldused = DN_USED_BYTES(dn->dn_phys);
dn->dn_oldflags = dn->dn_phys->dn_flags;
dn->dn_phys->dn_flags |= DNODE_FLAG_USERUSED_ACCOUNTED;
mutex_exit(&dn->dn_mtx);
dmu_objset_userquota_get_ids(dn, B_FALSE, tx);
} else {
/* Once we account for it, we should always account for it. */
ASSERT(!(dn->dn_phys->dn_flags &
DNODE_FLAG_USERUSED_ACCOUNTED));
}
mutex_enter(&dn->dn_mtx);
if (dn->dn_allocated_txg == tx->tx_txg) {
/* The dnode is newly allocated or reallocated */
if (dnp->dn_type == DMU_OT_NONE) {
/* this is a first alloc, not a realloc */
dnp->dn_nlevels = 1;
dnp->dn_nblkptr = dn->dn_nblkptr;
}
dnp->dn_type = dn->dn_type;
dnp->dn_bonustype = dn->dn_bonustype;
dnp->dn_bonuslen = dn->dn_bonuslen;
}
ASSERT(dnp->dn_nlevels > 1 ||
BP_IS_HOLE(&dnp->dn_blkptr[0]) ||
BP_GET_LSIZE(&dnp->dn_blkptr[0]) ==
dnp->dn_datablkszsec << SPA_MINBLOCKSHIFT);
if (dn->dn_next_blksz[txgoff]) {
ASSERT(P2PHASE(dn->dn_next_blksz[txgoff],
SPA_MINBLOCKSIZE) == 0);
ASSERT(BP_IS_HOLE(&dnp->dn_blkptr[0]) ||
dn->dn_maxblkid == 0 || list_head(list) != NULL ||
avl_last(&dn->dn_ranges[txgoff]) ||
dn->dn_next_blksz[txgoff] >> SPA_MINBLOCKSHIFT ==
dnp->dn_datablkszsec);
dnp->dn_datablkszsec =
dn->dn_next_blksz[txgoff] >> SPA_MINBLOCKSHIFT;
dn->dn_next_blksz[txgoff] = 0;
}
if (dn->dn_next_bonuslen[txgoff]) {
if (dn->dn_next_bonuslen[txgoff] == DN_ZERO_BONUSLEN)
dnp->dn_bonuslen = 0;
else
dnp->dn_bonuslen = dn->dn_next_bonuslen[txgoff];
ASSERT(dnp->dn_bonuslen <= DN_MAX_BONUSLEN);
dn->dn_next_bonuslen[txgoff] = 0;
}
if (dn->dn_next_bonustype[txgoff]) {
ASSERT(dn->dn_next_bonustype[txgoff] < DMU_OT_NUMTYPES);
dnp->dn_bonustype = dn->dn_next_bonustype[txgoff];
dn->dn_next_bonustype[txgoff] = 0;
}
/*
* We will either remove a spill block when a file is being removed
* or we have been asked to remove it.
*/
if (dn->dn_rm_spillblk[txgoff] ||
((dnp->dn_flags & DNODE_FLAG_SPILL_BLKPTR) &&
dn->dn_free_txg > 0 && dn->dn_free_txg <= tx->tx_txg)) {
if ((dnp->dn_flags & DNODE_FLAG_SPILL_BLKPTR))
kill_spill = B_TRUE;
dn->dn_rm_spillblk[txgoff] = 0;
}
if (dn->dn_next_indblkshift[txgoff]) {
ASSERT(dnp->dn_nlevels == 1);
dnp->dn_indblkshift = dn->dn_next_indblkshift[txgoff];
dn->dn_next_indblkshift[txgoff] = 0;
}
/*
* Just take the live (open-context) values for checksum and compress.
* Strictly speaking it's a future leak, but nothing bad happens if we
* start using the new checksum or compress algorithm a little early.
*/
dnp->dn_checksum = dn->dn_checksum;
dnp->dn_compress = dn->dn_compress;
mutex_exit(&dn->dn_mtx);
if (kill_spill) {
(void) free_blocks(dn, &dn->dn_phys->dn_spill, 1, tx);
mutex_enter(&dn->dn_mtx);
dnp->dn_flags &= ~DNODE_FLAG_SPILL_BLKPTR;
mutex_exit(&dn->dn_mtx);
}
/* process all the "freed" ranges in the file */
while ((rp = avl_last(&dn->dn_ranges[txgoff])) != NULL) {
dnode_sync_free_range(dn, rp->fr_blkid, rp->fr_nblks, tx);
/* grab the mutex so we don't race with dnode_block_freed() */
mutex_enter(&dn->dn_mtx);
avl_remove(&dn->dn_ranges[txgoff], rp);
mutex_exit(&dn->dn_mtx);
kmem_free(rp, sizeof (free_range_t));
}
if (dn->dn_free_txg > 0 && dn->dn_free_txg <= tx->tx_txg) {
dnode_sync_free(dn, tx);
return;
}
if (dn->dn_next_nblkptr[txgoff]) {
/* this should only happen on a realloc */
ASSERT(dn->dn_allocated_txg == tx->tx_txg);
if (dn->dn_next_nblkptr[txgoff] > dnp->dn_nblkptr) {
/* zero the new blkptrs we are gaining */
bzero(dnp->dn_blkptr + dnp->dn_nblkptr,
sizeof (blkptr_t) *
(dn->dn_next_nblkptr[txgoff] - dnp->dn_nblkptr));
#ifdef ZFS_DEBUG
} else {
int i;
ASSERT(dn->dn_next_nblkptr[txgoff] < dnp->dn_nblkptr);
/* the blkptrs we are losing better be unallocated */
for (i = dn->dn_next_nblkptr[txgoff];
i < dnp->dn_nblkptr; i++)
ASSERT(BP_IS_HOLE(&dnp->dn_blkptr[i]));
#endif
}
mutex_enter(&dn->dn_mtx);
dnp->dn_nblkptr = dn->dn_next_nblkptr[txgoff];
dn->dn_next_nblkptr[txgoff] = 0;
mutex_exit(&dn->dn_mtx);
}
if (dn->dn_next_nlevels[txgoff]) {
dnode_increase_indirection(dn, tx);
dn->dn_next_nlevels[txgoff] = 0;
}
dbuf_sync_list(list, tx);
if (!DMU_OBJECT_IS_SPECIAL(dn->dn_object)) {
ASSERT3P(list_head(list), ==, NULL);
dnode_rele(dn, (void *)(uintptr_t)tx->tx_txg);
}
/*
* Although we have dropped our reference to the dnode, it
* can't be evicted until it's written, and we haven't yet
* initiated the IO for the dnode's dbuf.
*/
}
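/*
* Worked example of the txg slot arithmetic used throughout this file:
* with TXG_SIZE == 4 (so TXG_MASK == 3), a dnode dirtied in txg 7 keeps
* its pending state in slot txgoff = 7 & 3 = 3 of the dn_next_* arrays,
* allowing up to four transaction groups in flight without collision.
*/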

File diff suppressed because it is too large

@ -0,0 +1,474 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2010, Oracle and/or its affiliates. All rights reserved.
*/
#include <sys/dsl_dataset.h>
#include <sys/dmu.h>
#include <sys/refcount.h>
#include <sys/zap.h>
#include <sys/zfs_context.h>
#include <sys/dsl_pool.h>
static int
dsl_deadlist_compare(const void *arg1, const void *arg2)
{
const dsl_deadlist_entry_t *dle1 = arg1;
const dsl_deadlist_entry_t *dle2 = arg2;
if (dle1->dle_mintxg < dle2->dle_mintxg)
return (-1);
else if (dle1->dle_mintxg > dle2->dle_mintxg)
return (+1);
else
return (0);
}
static void
dsl_deadlist_load_tree(dsl_deadlist_t *dl)
{
zap_cursor_t zc;
zap_attribute_t za;
ASSERT(!dl->dl_oldfmt);
if (dl->dl_havetree)
return;
avl_create(&dl->dl_tree, dsl_deadlist_compare,
sizeof (dsl_deadlist_entry_t),
offsetof(dsl_deadlist_entry_t, dle_node));
for (zap_cursor_init(&zc, dl->dl_os, dl->dl_object);
zap_cursor_retrieve(&zc, &za) == 0;
zap_cursor_advance(&zc)) {
dsl_deadlist_entry_t *dle = kmem_alloc(sizeof (*dle), KM_SLEEP);
dle->dle_mintxg = strtonum(za.za_name, NULL);
VERIFY3U(0, ==, bpobj_open(&dle->dle_bpobj, dl->dl_os,
za.za_first_integer));
avl_add(&dl->dl_tree, dle);
}
zap_cursor_fini(&zc);
dl->dl_havetree = B_TRUE;
}
void
dsl_deadlist_open(dsl_deadlist_t *dl, objset_t *os, uint64_t object)
{
dmu_object_info_t doi;
mutex_init(&dl->dl_lock, NULL, MUTEX_DEFAULT, NULL);
dl->dl_os = os;
dl->dl_object = object;
VERIFY3U(0, ==, dmu_bonus_hold(os, object, dl, &dl->dl_dbuf));
dmu_object_info_from_db(dl->dl_dbuf, &doi);
if (doi.doi_type == DMU_OT_BPOBJ) {
dmu_buf_rele(dl->dl_dbuf, dl);
dl->dl_dbuf = NULL;
dl->dl_oldfmt = B_TRUE;
VERIFY3U(0, ==, bpobj_open(&dl->dl_bpobj, os, object));
return;
}
dl->dl_oldfmt = B_FALSE;
dl->dl_phys = dl->dl_dbuf->db_data;
dl->dl_havetree = B_FALSE;
}
void
dsl_deadlist_close(dsl_deadlist_t *dl)
{
void *cookie = NULL;
dsl_deadlist_entry_t *dle;
if (dl->dl_oldfmt) {
dl->dl_oldfmt = B_FALSE;
bpobj_close(&dl->dl_bpobj);
return;
}
if (dl->dl_havetree) {
while ((dle = avl_destroy_nodes(&dl->dl_tree, &cookie))
!= NULL) {
bpobj_close(&dle->dle_bpobj);
kmem_free(dle, sizeof (*dle));
}
avl_destroy(&dl->dl_tree);
}
dmu_buf_rele(dl->dl_dbuf, dl);
mutex_destroy(&dl->dl_lock);
dl->dl_dbuf = NULL;
dl->dl_phys = NULL;
}
uint64_t
dsl_deadlist_alloc(objset_t *os, dmu_tx_t *tx)
{
if (spa_version(dmu_objset_spa(os)) < SPA_VERSION_DEADLISTS)
return (bpobj_alloc(os, SPA_MAXBLOCKSIZE, tx));
return (zap_create(os, DMU_OT_DEADLIST, DMU_OT_DEADLIST_HDR,
sizeof (dsl_deadlist_phys_t), tx));
}
void
dsl_deadlist_free(objset_t *os, uint64_t dlobj, dmu_tx_t *tx)
{
dmu_object_info_t doi;
zap_cursor_t zc;
zap_attribute_t za;
VERIFY3U(0, ==, dmu_object_info(os, dlobj, &doi));
if (doi.doi_type == DMU_OT_BPOBJ) {
bpobj_free(os, dlobj, tx);
return;
}
for (zap_cursor_init(&zc, os, dlobj);
zap_cursor_retrieve(&zc, &za) == 0;
zap_cursor_advance(&zc))
bpobj_free(os, za.za_first_integer, tx);
zap_cursor_fini(&zc);
VERIFY3U(0, ==, dmu_object_free(os, dlobj, tx));
}
void
dsl_deadlist_insert(dsl_deadlist_t *dl, const blkptr_t *bp, dmu_tx_t *tx)
{
dsl_deadlist_entry_t dle_tofind;
dsl_deadlist_entry_t *dle;
avl_index_t where;
if (dl->dl_oldfmt) {
bpobj_enqueue(&dl->dl_bpobj, bp, tx);
return;
}
dsl_deadlist_load_tree(dl);
dmu_buf_will_dirty(dl->dl_dbuf, tx);
mutex_enter(&dl->dl_lock);
dl->dl_phys->dl_used +=
bp_get_dsize_sync(dmu_objset_spa(dl->dl_os), bp);
dl->dl_phys->dl_comp += BP_GET_PSIZE(bp);
dl->dl_phys->dl_uncomp += BP_GET_UCSIZE(bp);
mutex_exit(&dl->dl_lock);
dle_tofind.dle_mintxg = bp->blk_birth;
dle = avl_find(&dl->dl_tree, &dle_tofind, &where);
if (dle == NULL)
dle = avl_nearest(&dl->dl_tree, where, AVL_BEFORE);
else
dle = AVL_PREV(&dl->dl_tree, dle);
bpobj_enqueue(&dle->dle_bpobj, bp, tx);
}
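/*
* Worked example of the placement above: if the deadlist has keys
* {10, 20, 30} and a block with blk_birth == 25 is inserted, avl_find()
* misses and avl_nearest(..., AVL_BEFORE) selects the key-20 entry, so
* the block lands on the bpobj covering births in (20, 30].  A block
* born exactly at key 20 hits in avl_find() and is redirected to
* AVL_PREV (the key-10 entry), since mintxg is not inclusive.
*/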
/*
* Insert new key in deadlist, which must be > all current entries.
* mintxg is not inclusive.
*/
void
dsl_deadlist_add_key(dsl_deadlist_t *dl, uint64_t mintxg, dmu_tx_t *tx)
{
uint64_t obj;
dsl_deadlist_entry_t *dle;
if (dl->dl_oldfmt)
return;
dsl_deadlist_load_tree(dl);
dle = kmem_alloc(sizeof (*dle), KM_SLEEP);
dle->dle_mintxg = mintxg;
obj = bpobj_alloc(dl->dl_os, SPA_MAXBLOCKSIZE, tx);
VERIFY3U(0, ==, bpobj_open(&dle->dle_bpobj, dl->dl_os, obj));
avl_add(&dl->dl_tree, dle);
VERIFY3U(0, ==, zap_add_int_key(dl->dl_os, dl->dl_object,
mintxg, obj, tx));
}
/*
* Remove this key, merging its entries into the previous key.
*/
void
dsl_deadlist_remove_key(dsl_deadlist_t *dl, uint64_t mintxg, dmu_tx_t *tx)
{
dsl_deadlist_entry_t dle_tofind;
dsl_deadlist_entry_t *dle, *dle_prev;
if (dl->dl_oldfmt)
return;
dsl_deadlist_load_tree(dl);
dle_tofind.dle_mintxg = mintxg;
dle = avl_find(&dl->dl_tree, &dle_tofind, NULL);
dle_prev = AVL_PREV(&dl->dl_tree, dle);
bpobj_enqueue_subobj(&dle_prev->dle_bpobj,
dle->dle_bpobj.bpo_object, tx);
avl_remove(&dl->dl_tree, dle);
bpobj_close(&dle->dle_bpobj);
kmem_free(dle, sizeof (*dle));
VERIFY3U(0, ==, zap_remove_int(dl->dl_os, dl->dl_object, mintxg, tx));
}
/*
* Walk ds's snapshots to regenerate the deadlist ZAP & AVL.
*/
static void
dsl_deadlist_regenerate(objset_t *os, uint64_t dlobj,
uint64_t mrs_obj, dmu_tx_t *tx)
{
dsl_deadlist_t dl;
dsl_pool_t *dp = dmu_objset_pool(os);
dsl_deadlist_open(&dl, os, dlobj);
if (dl.dl_oldfmt) {
dsl_deadlist_close(&dl);
return;
}
while (mrs_obj != 0) {
dsl_dataset_t *ds;
VERIFY3U(0, ==, dsl_dataset_hold_obj(dp, mrs_obj, FTAG, &ds));
dsl_deadlist_add_key(&dl, ds->ds_phys->ds_prev_snap_txg, tx);
mrs_obj = ds->ds_phys->ds_prev_snap_obj;
dsl_dataset_rele(ds, FTAG);
}
dsl_deadlist_close(&dl);
}
uint64_t
dsl_deadlist_clone(dsl_deadlist_t *dl, uint64_t maxtxg,
uint64_t mrs_obj, dmu_tx_t *tx)
{
dsl_deadlist_entry_t *dle;
uint64_t newobj;
newobj = dsl_deadlist_alloc(dl->dl_os, tx);
if (dl->dl_oldfmt) {
dsl_deadlist_regenerate(dl->dl_os, newobj, mrs_obj, tx);
return (newobj);
}
dsl_deadlist_load_tree(dl);
for (dle = avl_first(&dl->dl_tree); dle;
dle = AVL_NEXT(&dl->dl_tree, dle)) {
uint64_t obj;
if (dle->dle_mintxg >= maxtxg)
break;
obj = bpobj_alloc(dl->dl_os, SPA_MAXBLOCKSIZE, tx);
VERIFY3U(0, ==, zap_add_int_key(dl->dl_os, newobj,
dle->dle_mintxg, obj, tx));
}
return (newobj);
}
void
dsl_deadlist_space(dsl_deadlist_t *dl,
uint64_t *usedp, uint64_t *compp, uint64_t *uncompp)
{
if (dl->dl_oldfmt) {
VERIFY3U(0, ==, bpobj_space(&dl->dl_bpobj,
usedp, compp, uncompp));
return;
}
mutex_enter(&dl->dl_lock);
*usedp = dl->dl_phys->dl_used;
*compp = dl->dl_phys->dl_comp;
*uncompp = dl->dl_phys->dl_uncomp;
mutex_exit(&dl->dl_lock);
}
/*
* return space used in the range (mintxg, maxtxg].
* Includes maxtxg, does not include mintxg.
* mintxg and maxtxg must both be keys in the deadlist (unless maxtxg is
* UINT64_MAX).
*/
void
dsl_deadlist_space_range(dsl_deadlist_t *dl, uint64_t mintxg, uint64_t maxtxg,
uint64_t *usedp, uint64_t *compp, uint64_t *uncompp)
{
dsl_deadlist_entry_t dle_tofind;
dsl_deadlist_entry_t *dle;
avl_index_t where;
if (dl->dl_oldfmt) {
VERIFY3U(0, ==, bpobj_space_range(&dl->dl_bpobj,
mintxg, maxtxg, usedp, compp, uncompp));
return;
}
dsl_deadlist_load_tree(dl);
*usedp = *compp = *uncompp = 0;
dle_tofind.dle_mintxg = mintxg;
dle = avl_find(&dl->dl_tree, &dle_tofind, &where);
/*
* If we don't find this mintxg, there shouldn't be anything
* after it either.
*/
ASSERT(dle != NULL ||
avl_nearest(&dl->dl_tree, where, AVL_AFTER) == NULL);
for (; dle && dle->dle_mintxg < maxtxg;
dle = AVL_NEXT(&dl->dl_tree, dle)) {
uint64_t used, comp, uncomp;
VERIFY3U(0, ==, bpobj_space(&dle->dle_bpobj,
&used, &comp, &uncomp));
*usedp += used;
*compp += comp;
*uncompp += uncomp;
}
}
static void
dsl_deadlist_insert_bpobj(dsl_deadlist_t *dl, uint64_t obj, uint64_t birth,
dmu_tx_t *tx)
{
dsl_deadlist_entry_t dle_tofind;
dsl_deadlist_entry_t *dle;
avl_index_t where;
uint64_t used, comp, uncomp;
bpobj_t bpo;
VERIFY3U(0, ==, bpobj_open(&bpo, dl->dl_os, obj));
VERIFY3U(0, ==, bpobj_space(&bpo, &used, &comp, &uncomp));
bpobj_close(&bpo);
dsl_deadlist_load_tree(dl);
dmu_buf_will_dirty(dl->dl_dbuf, tx);
mutex_enter(&dl->dl_lock);
dl->dl_phys->dl_used += used;
dl->dl_phys->dl_comp += comp;
dl->dl_phys->dl_uncomp += uncomp;
mutex_exit(&dl->dl_lock);
dle_tofind.dle_mintxg = birth;
dle = avl_find(&dl->dl_tree, &dle_tofind, &where);
if (dle == NULL)
dle = avl_nearest(&dl->dl_tree, where, AVL_BEFORE);
bpobj_enqueue_subobj(&dle->dle_bpobj, obj, tx);
}
static int
dsl_deadlist_insert_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
{
dsl_deadlist_t *dl = arg;
dsl_deadlist_insert(dl, bp, tx);
return (0);
}
/*
* Merge the deadlist pointed to by 'obj' into dl. obj will be left as
* an empty deadlist.
*/
void
dsl_deadlist_merge(dsl_deadlist_t *dl, uint64_t obj, dmu_tx_t *tx)
{
zap_cursor_t zc;
zap_attribute_t za;
dmu_buf_t *bonus;
dsl_deadlist_phys_t *dlp;
dmu_object_info_t doi;
VERIFY3U(0, ==, dmu_object_info(dl->dl_os, obj, &doi));
if (doi.doi_type == DMU_OT_BPOBJ) {
bpobj_t bpo;
VERIFY3U(0, ==, bpobj_open(&bpo, dl->dl_os, obj));
VERIFY3U(0, ==, bpobj_iterate(&bpo,
dsl_deadlist_insert_cb, dl, tx));
bpobj_close(&bpo);
return;
}
for (zap_cursor_init(&zc, dl->dl_os, obj);
zap_cursor_retrieve(&zc, &za) == 0;
zap_cursor_advance(&zc)) {
uint64_t mintxg = strtonum(za.za_name, NULL);
dsl_deadlist_insert_bpobj(dl, za.za_first_integer, mintxg, tx);
VERIFY3U(0, ==, zap_remove_int(dl->dl_os, obj, mintxg, tx));
}
zap_cursor_fini(&zc);
VERIFY3U(0, ==, dmu_bonus_hold(dl->dl_os, obj, FTAG, &bonus));
dlp = bonus->db_data;
dmu_buf_will_dirty(bonus, tx);
bzero(dlp, sizeof (*dlp));
dmu_buf_rele(bonus, FTAG);
}
/*
* Remove entries on dl that are >= mintxg, and put them on the bpobj.
*/
void
dsl_deadlist_move_bpobj(dsl_deadlist_t *dl, bpobj_t *bpo, uint64_t mintxg,
dmu_tx_t *tx)
{
dsl_deadlist_entry_t dle_tofind;
dsl_deadlist_entry_t *dle;
avl_index_t where;
ASSERT(!dl->dl_oldfmt);
dmu_buf_will_dirty(dl->dl_dbuf, tx);
dsl_deadlist_load_tree(dl);
dle_tofind.dle_mintxg = mintxg;
dle = avl_find(&dl->dl_tree, &dle_tofind, &where);
if (dle == NULL)
dle = avl_nearest(&dl->dl_tree, where, AVL_AFTER);
while (dle) {
uint64_t used, comp, uncomp;
dsl_deadlist_entry_t *dle_next;
bpobj_enqueue_subobj(bpo, dle->dle_bpobj.bpo_object, tx);
VERIFY3U(0, ==, bpobj_space(&dle->dle_bpobj,
&used, &comp, &uncomp));
mutex_enter(&dl->dl_lock);
ASSERT3U(dl->dl_phys->dl_used, >=, used);
ASSERT3U(dl->dl_phys->dl_comp, >=, comp);
ASSERT3U(dl->dl_phys->dl_uncomp, >=, uncomp);
dl->dl_phys->dl_used -= used;
dl->dl_phys->dl_comp -= comp;
dl->dl_phys->dl_uncomp -= uncomp;
mutex_exit(&dl->dl_lock);
VERIFY3U(0, ==, zap_remove_int(dl->dl_os, dl->dl_object,
dle->dle_mintxg, tx));
dle_next = AVL_NEXT(&dl->dl_tree, dle);
avl_remove(&dl->dl_tree, dle);
bpobj_close(&dle->dle_bpobj);
kmem_free(dle, sizeof (*dle));
dle = dle_next;
}
}

@ -0,0 +1,746 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2007, 2010, Oracle and/or its affiliates. All rights reserved.
*/
/*
* DSL permissions are stored in a two level zap attribute
* mechanism. The first level identifies the "class" of
* entry. The class is identified by the first 2 letters of
* the attribute. The second letter "l" or "d" identifies whether
* it is a local or descendent permission. The first letter
* identifies the type of entry.
*
* ul$<id> identifies permissions granted locally for this userid.
* ud$<id> identifies permissions granted on descendent datasets for
* this userid.
* Ul$<id> identifies permission sets granted locally for this userid.
* Ud$<id> identifies permission sets granted on descendent datasets for
* this userid.
* gl$<id> identifies permissions granted locally for this groupid.
* gd$<id> identifies permissions granted on descendent datasets for
* this groupid.
* Gl$<id> identifies permission sets granted locally for this groupid.
* Gd$<id> identifies permission sets granted on descendent datasets for
* this groupid.
* el$ identifies permissions granted locally for everyone.
* ed$ identifies permissions granted on descendent datasets
* for everyone.
* El$ identifies permission sets granted locally for everyone.
* Ed$ identifies permission sets granted to descendent datasets for
* everyone.
* c-$ identifies permission to create at dataset creation time.
* C-$ identifies permission sets to grant locally at dataset creation
* time.
* s-$@<name> permissions defined in specified set @<name>
* S-$@<name> Sets defined in named set @<name>
*
* Each of the above entities points to another zap attribute that contains one
* attribute for each allowed permission, such as create, destroy,...
* All of the "upper" case class types will specify permission set names
* rather than permissions.
*
* Basically it looks something like this:
* ul$12 -> ZAP OBJ -> permissions...
*
* The ZAP OBJ is referred to as the jump object.
*/
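/*
* Concrete (illustrative) example of the layout above: granting a user
* with uid 12 the "create" permission locally -- e.g. via
* "zfs allow -l -u joe create tank"; see zfs(1M) for the exact
* interface -- yields a zap entry whose whokey is "ul$12", and whose
* value is the object number of a jump object containing a single
* attribute named "create".
*/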
#include <sys/dmu.h>
#include <sys/dmu_objset.h>
#include <sys/dmu_tx.h>
#include <sys/dsl_dataset.h>
#include <sys/dsl_dir.h>
#include <sys/dsl_prop.h>
#include <sys/dsl_synctask.h>
#include <sys/dsl_deleg.h>
#include <sys/spa.h>
#include <sys/zap.h>
#include <sys/fs/zfs.h>
#include <sys/cred.h>
#include <sys/sunddi.h>
#include "zfs_deleg.h"
/*
* Validate that user is allowed to delegate specified permissions.
*
* In order to delegate "create" you must have "create"
* and "allow".
*/
int
dsl_deleg_can_allow(char *ddname, nvlist_t *nvp, cred_t *cr)
{
nvpair_t *whopair = NULL;
int error;
if ((error = dsl_deleg_access(ddname, ZFS_DELEG_PERM_ALLOW, cr)) != 0)
return (error);
while ((whopair = nvlist_next_nvpair(nvp, whopair)) != NULL) {
nvlist_t *perms;
nvpair_t *permpair = NULL;
VERIFY(nvpair_value_nvlist(whopair, &perms) == 0);
while ((permpair = nvlist_next_nvpair(perms, permpair)) != NULL) {
const char *perm = nvpair_name(permpair);
if (strcmp(perm, ZFS_DELEG_PERM_ALLOW) == 0)
return (EPERM);
if ((error = dsl_deleg_access(ddname, perm, cr)) != 0)
return (error);
}
}
return (0);
}
/*
* Validate that user is allowed to unallow specified permissions. They
* must have the 'allow' permission, and even then can only unallow
* perms for their uid.
*/
int
dsl_deleg_can_unallow(char *ddname, nvlist_t *nvp, cred_t *cr)
{
nvpair_t *whopair = NULL;
int error;
char idstr[32];
if ((error = dsl_deleg_access(ddname, ZFS_DELEG_PERM_ALLOW, cr)) != 0)
return (error);
(void) snprintf(idstr, sizeof (idstr), "%lld",
(longlong_t)crgetuid(cr));
while ((whopair = nvlist_next_nvpair(nvp, whopair)) != NULL) {
zfs_deleg_who_type_t type = nvpair_name(whopair)[0];
if (type != ZFS_DELEG_USER &&
type != ZFS_DELEG_USER_SETS)
return (EPERM);
if (strcmp(idstr, &nvpair_name(whopair)[3]) != 0)
return (EPERM);
}
return (0);
}
static void
dsl_deleg_set_sync(void *arg1, void *arg2, dmu_tx_t *tx)
{
dsl_dir_t *dd = arg1;
nvlist_t *nvp = arg2;
objset_t *mos = dd->dd_pool->dp_meta_objset;
nvpair_t *whopair = NULL;
uint64_t zapobj = dd->dd_phys->dd_deleg_zapobj;
if (zapobj == 0) {
dmu_buf_will_dirty(dd->dd_dbuf, tx);
zapobj = dd->dd_phys->dd_deleg_zapobj = zap_create(mos,
DMU_OT_DSL_PERMS, DMU_OT_NONE, 0, tx);
}
while ((whopair = nvlist_next_nvpair(nvp, whopair)) != NULL) {
const char *whokey = nvpair_name(whopair);
nvlist_t *perms;
nvpair_t *permpair = NULL;
uint64_t jumpobj;
VERIFY(nvpair_value_nvlist(whopair, &perms) == 0);
if (zap_lookup(mos, zapobj, whokey, 8, 1, &jumpobj) != 0) {
jumpobj = zap_create(mos, DMU_OT_DSL_PERMS,
DMU_OT_NONE, 0, tx);
VERIFY(zap_update(mos, zapobj,
whokey, 8, 1, &jumpobj, tx) == 0);
}
while ((permpair = nvlist_next_nvpair(perms, permpair)) != NULL) {
const char *perm = nvpair_name(permpair);
uint64_t n = 0;
VERIFY(zap_update(mos, jumpobj,
perm, 8, 1, &n, tx) == 0);
spa_history_log_internal(LOG_DS_PERM_UPDATE,
dd->dd_pool->dp_spa, tx,
"%s %s dataset = %llu", whokey, perm,
dd->dd_phys->dd_head_dataset_obj);
}
}
}
static void
dsl_deleg_unset_sync(void *arg1, void *arg2, dmu_tx_t *tx)
{
dsl_dir_t *dd = arg1;
nvlist_t *nvp = arg2;
objset_t *mos = dd->dd_pool->dp_meta_objset;
nvpair_t *whopair = NULL;
uint64_t zapobj = dd->dd_phys->dd_deleg_zapobj;
if (zapobj == 0)
return;
while ((whopair = nvlist_next_nvpair(nvp, whopair)) != NULL) {
const char *whokey = nvpair_name(whopair);
nvlist_t *perms;
nvpair_t *permpair = NULL;
uint64_t jumpobj;
if (nvpair_value_nvlist(whopair, &perms) != 0) {
if (zap_lookup(mos, zapobj, whokey, 8,
1, &jumpobj) == 0) {
(void) zap_remove(mos, zapobj, whokey, tx);
VERIFY(0 == zap_destroy(mos, jumpobj, tx));
}
spa_history_log_internal(LOG_DS_PERM_WHO_REMOVE,
dd->dd_pool->dp_spa, tx,
"%s dataset = %llu", whokey,
dd->dd_phys->dd_head_dataset_obj);
continue;
}
if (zap_lookup(mos, zapobj, whokey, 8, 1, &jumpobj) != 0)
continue;
while ((permpair = nvlist_next_nvpair(perms, permpair)) != NULL) {
const char *perm = nvpair_name(permpair);
uint64_t n = 0;
(void) zap_remove(mos, jumpobj, perm, tx);
if (zap_count(mos, jumpobj, &n) == 0 && n == 0) {
(void) zap_remove(mos, zapobj,
whokey, tx);
VERIFY(0 == zap_destroy(mos,
jumpobj, tx));
}
spa_history_log_internal(LOG_DS_PERM_REMOVE,
dd->dd_pool->dp_spa, tx,
"%s %s dataset = %llu", whokey, perm,
dd->dd_phys->dd_head_dataset_obj);
}
}
}
int
dsl_deleg_set(const char *ddname, nvlist_t *nvp, boolean_t unset)
{
dsl_dir_t *dd;
int error;
nvpair_t *whopair = NULL;
int blocks_modified = 0;
error = dsl_dir_open(ddname, FTAG, &dd, NULL);
if (error)
return (error);
if (spa_version(dmu_objset_spa(dd->dd_pool->dp_meta_objset)) <
SPA_VERSION_DELEGATED_PERMS) {
dsl_dir_close(dd, FTAG);
return (ENOTSUP);
}
while ((whopair = nvlist_next_nvpair(nvp, whopair)) != NULL)
blocks_modified++;
error = dsl_sync_task_do(dd->dd_pool, NULL,
unset ? dsl_deleg_unset_sync : dsl_deleg_set_sync,
dd, nvp, blocks_modified);
dsl_dir_close(dd, FTAG);
return (error);
}
/*
* Find all 'allow' permissions from a given point and then continue
* traversing up to the root.
*
* This function constructs an nvlist of nvlists.
* Each setpoint is an nvlist composed of nvlists of the individual
* users/groups/everyone/create permissions.
*
* The nvlist will look like this.
*
* { source fsname -> { whokeys { permissions,...}, ...}}
*
* The fsname nvpairs will be arranged in a bottom up order. For example,
* if we have the following structure a/b/c then the nvpairs for the fsnames
* will be ordered a/b/c, a/b, a.
*/
int
dsl_deleg_get(const char *ddname, nvlist_t **nvp)
{
dsl_dir_t *dd, *startdd;
dsl_pool_t *dp;
int error;
objset_t *mos;
error = dsl_dir_open(ddname, FTAG, &startdd, NULL);
if (error)
return (error);
dp = startdd->dd_pool;
mos = dp->dp_meta_objset;
VERIFY(nvlist_alloc(nvp, NV_UNIQUE_NAME, KM_SLEEP) == 0);
rw_enter(&dp->dp_config_rwlock, RW_READER);
for (dd = startdd; dd != NULL; dd = dd->dd_parent) {
zap_cursor_t basezc;
zap_attribute_t baseza;
nvlist_t *sp_nvp;
uint64_t n;
char source[MAXNAMELEN];
if (dd->dd_phys->dd_deleg_zapobj &&
(zap_count(mos, dd->dd_phys->dd_deleg_zapobj,
&n) == 0) && n) {
VERIFY(nvlist_alloc(&sp_nvp,
NV_UNIQUE_NAME, KM_SLEEP) == 0);
} else {
continue;
}
for (zap_cursor_init(&basezc, mos,
dd->dd_phys->dd_deleg_zapobj);
zap_cursor_retrieve(&basezc, &baseza) == 0;
zap_cursor_advance(&basezc)) {
zap_cursor_t zc;
zap_attribute_t za;
nvlist_t *perms_nvp;
ASSERT(baseza.za_integer_length == 8);
ASSERT(baseza.za_num_integers == 1);
VERIFY(nvlist_alloc(&perms_nvp,
NV_UNIQUE_NAME, KM_SLEEP) == 0);
for (zap_cursor_init(&zc, mos, baseza.za_first_integer);
zap_cursor_retrieve(&zc, &za) == 0;
zap_cursor_advance(&zc)) {
VERIFY(nvlist_add_boolean(perms_nvp,
za.za_name) == 0);
}
zap_cursor_fini(&zc);
VERIFY(nvlist_add_nvlist(sp_nvp, baseza.za_name,
perms_nvp) == 0);
nvlist_free(perms_nvp);
}
zap_cursor_fini(&basezc);
dsl_dir_name(dd, source);
VERIFY(nvlist_add_nvlist(*nvp, source, sp_nvp) == 0);
nvlist_free(sp_nvp);
}
rw_exit(&dp->dp_config_rwlock);
dsl_dir_close(startdd, FTAG);
return (0);
}
/*
* Routines for dsl_deleg_access() -- access checking.
*/
typedef struct perm_set {
avl_node_t p_node;
boolean_t p_matched;
char p_setname[ZFS_MAX_DELEG_NAME];
} perm_set_t;
static int
perm_set_compare(const void *arg1, const void *arg2)
{
const perm_set_t *node1 = arg1;
const perm_set_t *node2 = arg2;
int val;
val = strcmp(node1->p_setname, node2->p_setname);
if (val == 0)
return (0);
return (val > 0 ? 1 : -1);
}
/*
* Determine whether a specified permission exists.
*
* First the base attribute has to be retrieved, e.g. ul$12.
* Once the base object has been retrieved, the actual permission
* is looked up in the zap object the base object points to.
*
* Return 0 if permission exists, ENOENT if there is no whokey, EPERM if
* there is no perm in that jumpobj.
*/
static int
dsl_check_access(objset_t *mos, uint64_t zapobj,
char type, char checkflag, void *valp, const char *perm)
{
int error;
uint64_t jumpobj, zero;
char whokey[ZFS_MAX_DELEG_NAME];
zfs_deleg_whokey(whokey, type, checkflag, valp);
error = zap_lookup(mos, zapobj, whokey, 8, 1, &jumpobj);
if (error == 0) {
error = zap_lookup(mos, jumpobj, perm, 8, 1, &zero);
if (error == ENOENT)
error = EPERM;
}
return (error);
}
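/*
* Example of the two-level lookup above: checking whether uid 12 holds
* "destroy" locally builds whokey "ul$12" via zfs_deleg_whokey(), looks
* that key up in the dataset's delegation zap to find the jump object,
* then looks "destroy" up in the jump object; a missing whokey yields
* ENOENT, a present whokey without the perm yields EPERM.
*/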
/*
* check a specified user/group for a requested permission
*/
static int
dsl_check_user_access(objset_t *mos, uint64_t zapobj, const char *perm,
int checkflag, cred_t *cr)
{
const gid_t *gids;
int ngids;
int i;
uint64_t id;
/* check for user */
id = crgetuid(cr);
if (dsl_check_access(mos, zapobj,
ZFS_DELEG_USER, checkflag, &id, perm) == 0)
return (0);
/* check for users primary group */
id = crgetgid(cr);
if (dsl_check_access(mos, zapobj,
ZFS_DELEG_GROUP, checkflag, &id, perm) == 0)
return (0);
/* check for everyone entry */
id = -1;
if (dsl_check_access(mos, zapobj,
ZFS_DELEG_EVERYONE, checkflag, &id, perm) == 0)
return (0);
/* check each supplemental group user is a member of */
ngids = crgetngroups(cr);
gids = crgetgroups(cr);
for (i = 0; i != ngids; i++) {
id = gids[i];
if (dsl_check_access(mos, zapobj,
ZFS_DELEG_GROUP, checkflag, &id, perm) == 0)
return (0);
}
return (EPERM);
}
/*
* Iterate over the sets specified in the given zapobj
* and load them into the permsets avl tree.
*/
static int
dsl_load_sets(objset_t *mos, uint64_t zapobj,
char type, char checkflag, void *valp, avl_tree_t *avl)
{
zap_cursor_t zc;
zap_attribute_t za;
perm_set_t *permnode;
avl_index_t idx;
uint64_t jumpobj;
int error;
char whokey[ZFS_MAX_DELEG_NAME];
zfs_deleg_whokey(whokey, type, checkflag, valp);
error = zap_lookup(mos, zapobj, whokey, 8, 1, &jumpobj);
if (error != 0)
return (error);
for (zap_cursor_init(&zc, mos, jumpobj);
zap_cursor_retrieve(&zc, &za) == 0;
zap_cursor_advance(&zc)) {
permnode = kmem_alloc(sizeof (perm_set_t), KM_SLEEP);
(void) strlcpy(permnode->p_setname, za.za_name,
sizeof (permnode->p_setname));
permnode->p_matched = B_FALSE;
if (avl_find(avl, permnode, &idx) == NULL) {
avl_insert(avl, permnode, idx);
} else {
kmem_free(permnode, sizeof (perm_set_t));
}
}
zap_cursor_fini(&zc);
return (0);
}
/*
* Load all permission sets that the user (based on cred) belongs to.
*/
static void
dsl_load_user_sets(objset_t *mos, uint64_t zapobj, avl_tree_t *avl,
char checkflag, cred_t *cr)
{
const gid_t *gids;
int ngids, i;
uint64_t id;
id = crgetuid(cr);
(void) dsl_load_sets(mos, zapobj,
ZFS_DELEG_USER_SETS, checkflag, &id, avl);
id = crgetgid(cr);
(void) dsl_load_sets(mos, zapobj,
ZFS_DELEG_GROUP_SETS, checkflag, &id, avl);
(void) dsl_load_sets(mos, zapobj,
ZFS_DELEG_EVERYONE_SETS, checkflag, NULL, avl);
ngids = crgetngroups(cr);
gids = crgetgroups(cr);
for (i = 0; i != ngids; i++) {
id = gids[i];
(void) dsl_load_sets(mos, zapobj,
ZFS_DELEG_GROUP_SETS, checkflag, &id, avl);
}
}
/*
* Check if user has requested permission.
*/
int
dsl_deleg_access_impl(dsl_dataset_t *ds, const char *perm, cred_t *cr)
{
dsl_dir_t *dd;
dsl_pool_t *dp;
void *cookie;
int error;
char checkflag;
objset_t *mos;
avl_tree_t permsets;
perm_set_t *setnode;
dp = ds->ds_dir->dd_pool;
mos = dp->dp_meta_objset;
if (dsl_delegation_on(mos) == B_FALSE)
return (ECANCELED);
if (spa_version(dmu_objset_spa(dp->dp_meta_objset)) <
SPA_VERSION_DELEGATED_PERMS)
return (EPERM);
if (dsl_dataset_is_snapshot(ds)) {
/*
* Snapshots are treated as descendents only,
* local permissions do not apply.
*/
checkflag = ZFS_DELEG_DESCENDENT;
} else {
checkflag = ZFS_DELEG_LOCAL;
}
avl_create(&permsets, perm_set_compare, sizeof (perm_set_t),
offsetof(perm_set_t, p_node));
rw_enter(&dp->dp_config_rwlock, RW_READER);
for (dd = ds->ds_dir; dd != NULL; dd = dd->dd_parent,
checkflag = ZFS_DELEG_DESCENDENT) {
uint64_t zapobj;
boolean_t expanded;
/*
* If not in global zone then make sure
* the zoned property is set
*/
if (!INGLOBALZONE(curproc)) {
uint64_t zoned;
if (dsl_prop_get_dd(dd,
zfs_prop_to_name(ZFS_PROP_ZONED),
8, 1, &zoned, NULL, B_FALSE) != 0)
break;
if (!zoned)
break;
}
zapobj = dd->dd_phys->dd_deleg_zapobj;
if (zapobj == 0)
continue;
dsl_load_user_sets(mos, zapobj, &permsets, checkflag, cr);
again:
expanded = B_FALSE;
for (setnode = avl_first(&permsets); setnode;
setnode = AVL_NEXT(&permsets, setnode)) {
if (setnode->p_matched == B_TRUE)
continue;
/* See if this set directly grants this permission */
error = dsl_check_access(mos, zapobj,
ZFS_DELEG_NAMED_SET, 0, setnode->p_setname, perm);
if (error == 0)
goto success;
if (error == EPERM)
setnode->p_matched = B_TRUE;
/* See if this set includes other sets */
error = dsl_load_sets(mos, zapobj,
ZFS_DELEG_NAMED_SET_SETS, 0,
setnode->p_setname, &permsets);
if (error == 0)
setnode->p_matched = expanded = B_TRUE;
}
/*
* If we expanded any sets, that will define more sets,
* which we need to check.
*/
if (expanded)
goto again;
error = dsl_check_user_access(mos, zapobj, perm, checkflag, cr);
if (error == 0)
goto success;
}
error = EPERM;
success:
rw_exit(&dp->dp_config_rwlock);
cookie = NULL;
while ((setnode = avl_destroy_nodes(&permsets, &cookie)) != NULL)
kmem_free(setnode, sizeof (perm_set_t));
return (error);
}
int
dsl_deleg_access(const char *dsname, const char *perm, cred_t *cr)
{
dsl_dataset_t *ds;
int error;
error = dsl_dataset_hold(dsname, FTAG, &ds);
if (error)
return (error);
error = dsl_deleg_access_impl(ds, perm, cr);
dsl_dataset_rele(ds, FTAG);
return (error);
}
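/*
 * Illustrative sketch, not part of the imported source: a consumer such as
 * the zfs ioctl security-policy layer might gate an operation on a delegated
 * permission roughly as follows.  ZFS_DELEG_PERM_SNAPSHOT (the "snapshot"
 * permission string from zfs_deleg.h) is the assumed permission here.
 */
#if 0
static int
example_can_snapshot(const char *dsname, cred_t *cr)
{
	/*
	 * 0 means "snapshot" was delegated to this cred on dsname or an
	 * ancestor; ECANCELED means delegation is disabled pool-wide.
	 */
	return (dsl_deleg_access(dsname, ZFS_DELEG_PERM_SNAPSHOT, cr));
}
#endif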
/*
* Other routines.
*/
static void
copy_create_perms(dsl_dir_t *dd, uint64_t pzapobj,
boolean_t dosets, uint64_t uid, dmu_tx_t *tx)
{
objset_t *mos = dd->dd_pool->dp_meta_objset;
uint64_t jumpobj, pjumpobj;
uint64_t zapobj = dd->dd_phys->dd_deleg_zapobj;
zap_cursor_t zc;
zap_attribute_t za;
char whokey[ZFS_MAX_DELEG_NAME];
zfs_deleg_whokey(whokey,
dosets ? ZFS_DELEG_CREATE_SETS : ZFS_DELEG_CREATE,
ZFS_DELEG_LOCAL, NULL);
if (zap_lookup(mos, pzapobj, whokey, 8, 1, &pjumpobj) != 0)
return;
if (zapobj == 0) {
dmu_buf_will_dirty(dd->dd_dbuf, tx);
zapobj = dd->dd_phys->dd_deleg_zapobj = zap_create(mos,
DMU_OT_DSL_PERMS, DMU_OT_NONE, 0, tx);
}
zfs_deleg_whokey(whokey,
dosets ? ZFS_DELEG_USER_SETS : ZFS_DELEG_USER,
ZFS_DELEG_LOCAL, &uid);
if (zap_lookup(mos, zapobj, whokey, 8, 1, &jumpobj) == ENOENT) {
jumpobj = zap_create(mos, DMU_OT_DSL_PERMS, DMU_OT_NONE, 0, tx);
VERIFY(zap_add(mos, zapobj, whokey, 8, 1, &jumpobj, tx) == 0);
}
for (zap_cursor_init(&zc, mos, pjumpobj);
zap_cursor_retrieve(&zc, &za) == 0;
zap_cursor_advance(&zc)) {
uint64_t zero = 0;
ASSERT(za.za_integer_length == 8 && za.za_num_integers == 1);
VERIFY(zap_update(mos, jumpobj, za.za_name,
8, 1, &zero, tx) == 0);
}
zap_cursor_fini(&zc);
}
/*
 * Set all create-time permissions on the new dataset.
*/
void
dsl_deleg_set_create_perms(dsl_dir_t *sdd, dmu_tx_t *tx, cred_t *cr)
{
dsl_dir_t *dd;
uint64_t uid = crgetuid(cr);
if (spa_version(dmu_objset_spa(sdd->dd_pool->dp_meta_objset)) <
SPA_VERSION_DELEGATED_PERMS)
return;
for (dd = sdd->dd_parent; dd != NULL; dd = dd->dd_parent) {
uint64_t pzapobj = dd->dd_phys->dd_deleg_zapobj;
if (pzapobj == 0)
continue;
copy_create_perms(sdd, pzapobj, B_FALSE, uid, tx);
copy_create_perms(sdd, pzapobj, B_TRUE, uid, tx);
}
}
int
dsl_deleg_destroy(objset_t *mos, uint64_t zapobj, dmu_tx_t *tx)
{
zap_cursor_t zc;
zap_attribute_t za;
if (zapobj == 0)
return (0);
for (zap_cursor_init(&zc, mos, zapobj);
zap_cursor_retrieve(&zc, &za) == 0;
zap_cursor_advance(&zc)) {
ASSERT(za.za_integer_length == 8 && za.za_num_integers == 1);
VERIFY(0 == zap_destroy(mos, za.za_first_integer, tx));
}
zap_cursor_fini(&zc);
VERIFY(0 == zap_destroy(mos, zapobj, tx));
return (0);
}
boolean_t
dsl_delegation_on(objset_t *os)
{
return (!!spa_delegation(os->os_spa));
}

1416
uts/common/fs/zfs/dsl_dir.c Normal file

File diff suppressed because it is too large

848
uts/common/fs/zfs/dsl_pool.c Normal file

@ -0,0 +1,848 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#include <sys/dsl_pool.h>
#include <sys/dsl_dataset.h>
#include <sys/dsl_prop.h>
#include <sys/dsl_dir.h>
#include <sys/dsl_synctask.h>
#include <sys/dsl_scan.h>
#include <sys/dnode.h>
#include <sys/dmu_tx.h>
#include <sys/dmu_objset.h>
#include <sys/arc.h>
#include <sys/zap.h>
#include <sys/zio.h>
#include <sys/zfs_context.h>
#include <sys/fs/zfs.h>
#include <sys/zfs_znode.h>
#include <sys/spa_impl.h>
#include <sys/dsl_deadlist.h>
int zfs_no_write_throttle = 0;
int zfs_write_limit_shift = 3; /* 1/8th of physical memory */
int zfs_txg_synctime_ms = 1000; /* target millisecs to sync a txg */
uint64_t zfs_write_limit_min = 32 << 20; /* min write limit is 32MB */
uint64_t zfs_write_limit_max = 0; /* max data payload per txg */
uint64_t zfs_write_limit_inflated = 0;
uint64_t zfs_write_limit_override = 0;
kmutex_t zfs_write_limit_lock;
static pgcnt_t old_physmem = 0;
int
dsl_pool_open_special_dir(dsl_pool_t *dp, const char *name, dsl_dir_t **ddp)
{
uint64_t obj;
int err;
err = zap_lookup(dp->dp_meta_objset,
dp->dp_root_dir->dd_phys->dd_child_dir_zapobj,
name, sizeof (obj), 1, &obj);
if (err)
return (err);
return (dsl_dir_open_obj(dp, obj, name, dp, ddp));
}
static dsl_pool_t *
dsl_pool_open_impl(spa_t *spa, uint64_t txg)
{
dsl_pool_t *dp;
blkptr_t *bp = spa_get_rootblkptr(spa);
dp = kmem_zalloc(sizeof (dsl_pool_t), KM_SLEEP);
dp->dp_spa = spa;
dp->dp_meta_rootbp = *bp;
rw_init(&dp->dp_config_rwlock, NULL, RW_DEFAULT, NULL);
dp->dp_write_limit = zfs_write_limit_min;
txg_init(dp, txg);
txg_list_create(&dp->dp_dirty_datasets,
offsetof(dsl_dataset_t, ds_dirty_link));
txg_list_create(&dp->dp_dirty_dirs,
offsetof(dsl_dir_t, dd_dirty_link));
txg_list_create(&dp->dp_sync_tasks,
offsetof(dsl_sync_task_group_t, dstg_node));
list_create(&dp->dp_synced_datasets, sizeof (dsl_dataset_t),
offsetof(dsl_dataset_t, ds_synced_link));
mutex_init(&dp->dp_lock, NULL, MUTEX_DEFAULT, NULL);
dp->dp_vnrele_taskq = taskq_create("zfs_vn_rele_taskq", 1, minclsyspri,
1, 4, 0);
return (dp);
}
int
dsl_pool_open(spa_t *spa, uint64_t txg, dsl_pool_t **dpp)
{
int err;
dsl_pool_t *dp = dsl_pool_open_impl(spa, txg);
dsl_dir_t *dd;
dsl_dataset_t *ds;
uint64_t obj;
rw_enter(&dp->dp_config_rwlock, RW_WRITER);
err = dmu_objset_open_impl(spa, NULL, &dp->dp_meta_rootbp,
&dp->dp_meta_objset);
if (err)
goto out;
err = zap_lookup(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_ROOT_DATASET, sizeof (uint64_t), 1,
&dp->dp_root_dir_obj);
if (err)
goto out;
err = dsl_dir_open_obj(dp, dp->dp_root_dir_obj,
NULL, dp, &dp->dp_root_dir);
if (err)
goto out;
err = dsl_pool_open_special_dir(dp, MOS_DIR_NAME, &dp->dp_mos_dir);
if (err)
goto out;
if (spa_version(spa) >= SPA_VERSION_ORIGIN) {
err = dsl_pool_open_special_dir(dp, ORIGIN_DIR_NAME, &dd);
if (err)
goto out;
err = dsl_dataset_hold_obj(dp, dd->dd_phys->dd_head_dataset_obj,
FTAG, &ds);
if (err == 0) {
err = dsl_dataset_hold_obj(dp,
ds->ds_phys->ds_prev_snap_obj, dp,
&dp->dp_origin_snap);
dsl_dataset_rele(ds, FTAG);
}
dsl_dir_close(dd, dp);
if (err)
goto out;
}
if (spa_version(spa) >= SPA_VERSION_DEADLISTS) {
err = dsl_pool_open_special_dir(dp, FREE_DIR_NAME,
&dp->dp_free_dir);
if (err)
goto out;
err = zap_lookup(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_FREE_BPOBJ, sizeof (uint64_t), 1, &obj);
if (err)
goto out;
VERIFY3U(0, ==, bpobj_open(&dp->dp_free_bpobj,
dp->dp_meta_objset, obj));
}
err = zap_lookup(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_TMP_USERREFS, sizeof (uint64_t), 1,
&dp->dp_tmp_userrefs_obj);
if (err == ENOENT)
err = 0;
if (err)
goto out;
err = dsl_scan_init(dp, txg);
out:
rw_exit(&dp->dp_config_rwlock);
if (err)
dsl_pool_close(dp);
else
*dpp = dp;
return (err);
}
void
dsl_pool_close(dsl_pool_t *dp)
{
/* drop our references from dsl_pool_open() */
/*
* Since we held the origin_snap from "syncing" context (which
* includes pool-opening context), it actually only got a "ref"
* and not a hold, so just drop that here.
*/
if (dp->dp_origin_snap)
dsl_dataset_drop_ref(dp->dp_origin_snap, dp);
if (dp->dp_mos_dir)
dsl_dir_close(dp->dp_mos_dir, dp);
if (dp->dp_free_dir)
dsl_dir_close(dp->dp_free_dir, dp);
if (dp->dp_root_dir)
dsl_dir_close(dp->dp_root_dir, dp);
bpobj_close(&dp->dp_free_bpobj);
/* undo the dmu_objset_open_impl(mos) from dsl_pool_open() */
if (dp->dp_meta_objset)
dmu_objset_evict(dp->dp_meta_objset);
txg_list_destroy(&dp->dp_dirty_datasets);
txg_list_destroy(&dp->dp_sync_tasks);
txg_list_destroy(&dp->dp_dirty_dirs);
list_destroy(&dp->dp_synced_datasets);
arc_flush(dp->dp_spa);
txg_fini(dp);
dsl_scan_fini(dp);
rw_destroy(&dp->dp_config_rwlock);
mutex_destroy(&dp->dp_lock);
taskq_destroy(dp->dp_vnrele_taskq);
if (dp->dp_blkstats)
kmem_free(dp->dp_blkstats, sizeof (zfs_all_blkstats_t));
kmem_free(dp, sizeof (dsl_pool_t));
}
dsl_pool_t *
dsl_pool_create(spa_t *spa, nvlist_t *zplprops, uint64_t txg)
{
int err;
dsl_pool_t *dp = dsl_pool_open_impl(spa, txg);
dmu_tx_t *tx = dmu_tx_create_assigned(dp, txg);
objset_t *os;
dsl_dataset_t *ds;
uint64_t obj;
/* create and open the MOS (meta-objset) */
dp->dp_meta_objset = dmu_objset_create_impl(spa,
NULL, &dp->dp_meta_rootbp, DMU_OST_META, tx);
/* create the pool directory */
err = zap_create_claim(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_OT_OBJECT_DIRECTORY, DMU_OT_NONE, 0, tx);
ASSERT3U(err, ==, 0);
/* Initialize scan structures */
VERIFY3U(0, ==, dsl_scan_init(dp, txg));
/* create and open the root dir */
dp->dp_root_dir_obj = dsl_dir_create_sync(dp, NULL, NULL, tx);
VERIFY(0 == dsl_dir_open_obj(dp, dp->dp_root_dir_obj,
NULL, dp, &dp->dp_root_dir));
/* create and open the meta-objset dir */
(void) dsl_dir_create_sync(dp, dp->dp_root_dir, MOS_DIR_NAME, tx);
VERIFY(0 == dsl_pool_open_special_dir(dp,
MOS_DIR_NAME, &dp->dp_mos_dir));
if (spa_version(spa) >= SPA_VERSION_DEADLISTS) {
/* create and open the free dir */
(void) dsl_dir_create_sync(dp, dp->dp_root_dir,
FREE_DIR_NAME, tx);
VERIFY(0 == dsl_pool_open_special_dir(dp,
FREE_DIR_NAME, &dp->dp_free_dir));
/* create and open the free_bplist */
obj = bpobj_alloc(dp->dp_meta_objset, SPA_MAXBLOCKSIZE, tx);
VERIFY(zap_add(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_FREE_BPOBJ, sizeof (uint64_t), 1, &obj, tx) == 0);
VERIFY3U(0, ==, bpobj_open(&dp->dp_free_bpobj,
dp->dp_meta_objset, obj));
}
if (spa_version(spa) >= SPA_VERSION_DSL_SCRUB)
dsl_pool_create_origin(dp, tx);
/* create the root dataset */
obj = dsl_dataset_create_sync_dd(dp->dp_root_dir, NULL, 0, tx);
/* create the root objset */
VERIFY(0 == dsl_dataset_hold_obj(dp, obj, FTAG, &ds));
os = dmu_objset_create_impl(dp->dp_spa, ds,
dsl_dataset_get_blkptr(ds), DMU_OST_ZFS, tx);
#ifdef _KERNEL
zfs_create_fs(os, kcred, zplprops, tx);
#endif
dsl_dataset_rele(ds, FTAG);
dmu_tx_commit(tx);
return (dp);
}
static int
deadlist_enqueue_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
{
dsl_deadlist_t *dl = arg;
dsl_deadlist_insert(dl, bp, tx);
return (0);
}
void
dsl_pool_sync(dsl_pool_t *dp, uint64_t txg)
{
zio_t *zio;
dmu_tx_t *tx;
dsl_dir_t *dd;
dsl_dataset_t *ds;
dsl_sync_task_group_t *dstg;
objset_t *mos = dp->dp_meta_objset;
hrtime_t start, write_time;
uint64_t data_written;
int err;
/*
* We need to copy dp_space_towrite() before doing
* dsl_sync_task_group_sync(), because
* dsl_dataset_snapshot_reserve_space() will increase
* dp_space_towrite but not actually write anything.
*/
data_written = dp->dp_space_towrite[txg & TXG_MASK];
tx = dmu_tx_create_assigned(dp, txg);
dp->dp_read_overhead = 0;
start = gethrtime();
zio = zio_root(dp->dp_spa, NULL, NULL, ZIO_FLAG_MUSTSUCCEED);
while (ds = txg_list_remove(&dp->dp_dirty_datasets, txg)) {
/*
* We must not sync any non-MOS datasets twice, because
* we may have taken a snapshot of them. However, we
* may sync newly-created datasets on pass 2.
*/
ASSERT(!list_link_active(&ds->ds_synced_link));
list_insert_tail(&dp->dp_synced_datasets, ds);
dsl_dataset_sync(ds, zio, tx);
}
DTRACE_PROBE(pool_sync__1setup);
err = zio_wait(zio);
write_time = gethrtime() - start;
ASSERT(err == 0);
DTRACE_PROBE(pool_sync__2rootzio);
for (ds = list_head(&dp->dp_synced_datasets); ds;
ds = list_next(&dp->dp_synced_datasets, ds))
dmu_objset_do_userquota_updates(ds->ds_objset, tx);
/*
* Sync the datasets again to push out the changes due to
* userspace updates. This must be done before we process the
* sync tasks, because that could cause a snapshot of a dataset
* whose ds_bp will be rewritten when we do this 2nd sync.
*/
zio = zio_root(dp->dp_spa, NULL, NULL, ZIO_FLAG_MUSTSUCCEED);
while (ds = txg_list_remove(&dp->dp_dirty_datasets, txg)) {
ASSERT(list_link_active(&ds->ds_synced_link));
dmu_buf_rele(ds->ds_dbuf, ds);
dsl_dataset_sync(ds, zio, tx);
}
err = zio_wait(zio);
/*
* Move dead blocks from the pending deadlist to the on-disk
* deadlist.
*/
for (ds = list_head(&dp->dp_synced_datasets); ds;
ds = list_next(&dp->dp_synced_datasets, ds)) {
bplist_iterate(&ds->ds_pending_deadlist,
deadlist_enqueue_cb, &ds->ds_deadlist, tx);
}
while (dstg = txg_list_remove(&dp->dp_sync_tasks, txg)) {
/*
* No more sync tasks should have been added while we
* were syncing.
*/
ASSERT(spa_sync_pass(dp->dp_spa) == 1);
dsl_sync_task_group_sync(dstg, tx);
}
DTRACE_PROBE(pool_sync__3task);
start = gethrtime();
while (dd = txg_list_remove(&dp->dp_dirty_dirs, txg))
dsl_dir_sync(dd, tx);
write_time += gethrtime() - start;
start = gethrtime();
if (list_head(&mos->os_dirty_dnodes[txg & TXG_MASK]) != NULL ||
list_head(&mos->os_free_dnodes[txg & TXG_MASK]) != NULL) {
zio = zio_root(dp->dp_spa, NULL, NULL, ZIO_FLAG_MUSTSUCCEED);
dmu_objset_sync(mos, zio, tx);
err = zio_wait(zio);
ASSERT(err == 0);
dprintf_bp(&dp->dp_meta_rootbp, "meta objset rootbp is %s", "");
spa_set_rootblkptr(dp->dp_spa, &dp->dp_meta_rootbp);
}
write_time += gethrtime() - start;
DTRACE_PROBE2(pool_sync__4io, hrtime_t, write_time,
hrtime_t, dp->dp_read_overhead);
write_time -= dp->dp_read_overhead;
dmu_tx_commit(tx);
dp->dp_space_towrite[txg & TXG_MASK] = 0;
ASSERT(dp->dp_tempreserved[txg & TXG_MASK] == 0);
/*
* If the write limit max has not been explicitly set, set it
* to a fraction of available physical memory (default 1/8th).
* Note that we must inflate the limit because the spa
* inflates write sizes to account for data replication.
* Check this each sync phase to catch changing memory size.
*/
if (physmem != old_physmem && zfs_write_limit_shift) {
mutex_enter(&zfs_write_limit_lock);
old_physmem = physmem;
zfs_write_limit_max = ptob(physmem) >> zfs_write_limit_shift;
zfs_write_limit_inflated = MAX(zfs_write_limit_min,
spa_get_asize(dp->dp_spa, zfs_write_limit_max));
mutex_exit(&zfs_write_limit_lock);
}
/*
* Attempt to keep the sync time consistent by adjusting the
* amount of write traffic allowed into each transaction group.
* Weight the throughput calculation towards the current value:
* thru = 3/4 old_thru + 1/4 new_thru
*
* Note: write_time is in nanosecs, so write_time/MICROSEC
* yields millisecs
*/
ASSERT(zfs_write_limit_min > 0);
if (data_written > zfs_write_limit_min / 8 && write_time > MICROSEC) {
uint64_t throughput = data_written / (write_time / MICROSEC);
if (dp->dp_throughput)
dp->dp_throughput = throughput / 4 +
3 * dp->dp_throughput / 4;
else
dp->dp_throughput = throughput;
dp->dp_write_limit = MIN(zfs_write_limit_inflated,
MAX(zfs_write_limit_min,
dp->dp_throughput * zfs_txg_synctime_ms));
}
}
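/*
 * Worked example, not part of the imported source: with the default
 * zfs_write_limit_shift of 3 on a hypothetical 8GB machine,
 * zfs_write_limit_max becomes ptob(physmem) >> 3 = 1GB, and the throughput
 * estimate decays as thru = 3/4 old + 1/4 new (all values below assumed).
 */
#if 0
uint64_t old_thru = 400 << 10;		/* ~400MB/s, in bytes per ms */
uint64_t new_thru = 200 << 10;		/* ~200MB/s new sample */
uint64_t thru = new_thru / 4 + 3 * old_thru / 4;	/* ~350MB/s */
uint64_t next_limit = thru * 1000;	/* bytes per 1000ms target txg */
#endif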
void
dsl_pool_sync_done(dsl_pool_t *dp, uint64_t txg)
{
dsl_dataset_t *ds;
objset_t *os;
while (ds = list_head(&dp->dp_synced_datasets)) {
list_remove(&dp->dp_synced_datasets, ds);
os = ds->ds_objset;
zil_clean(os->os_zil, txg);
ASSERT(!dmu_objset_is_dirty(os, txg));
dmu_buf_rele(ds->ds_dbuf, ds);
}
ASSERT(!dmu_objset_is_dirty(dp->dp_meta_objset, txg));
}
/*
* TRUE if the current thread is the tx_sync_thread or if we
* are being called from SPA context during pool initialization.
*/
int
dsl_pool_sync_context(dsl_pool_t *dp)
{
return (curthread == dp->dp_tx.tx_sync_thread ||
spa_get_dsl(dp->dp_spa) == NULL);
}
uint64_t
dsl_pool_adjustedsize(dsl_pool_t *dp, boolean_t netfree)
{
uint64_t space, resv;
/*
* Reserve about 1.6% (1/64), or at least 32MB, for allocation
* efficiency.
* XXX The intent log is not accounted for, so it must fit
* within this slop.
*
* If we're trying to assess whether it's OK to do a free,
* cut the reservation in half to allow forward progress
* (e.g. make it possible to rm(1) files from a full pool).
*/
space = spa_get_dspace(dp->dp_spa);
resv = MAX(space >> 6, SPA_MINDEVSIZE >> 1);
if (netfree)
resv >>= 1;
return (space - resv);
}
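/*
 * Worked example, not part of the imported source (values assumed):
 * for a pool with 64GB of dspace the reservation is
 * MAX(64GB >> 6, SPA_MINDEVSIZE >> 1) = MAX(1GB, 32MB) = 1GB.
 */
#if 0
uint64_t space = 64ULL << 30;				/* 64GB pool */
uint64_t resv = MAX(space >> 6, SPA_MINDEVSIZE >> 1);	/* 1GB */
uint64_t avail = space - resv;				/* 63GB */
uint64_t avail_netfree = space - (resv >> 1);		/* 63.5GB for frees */
#endif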
int
dsl_pool_tempreserve_space(dsl_pool_t *dp, uint64_t space, dmu_tx_t *tx)
{
uint64_t reserved = 0;
uint64_t write_limit = (zfs_write_limit_override ?
zfs_write_limit_override : dp->dp_write_limit);
if (zfs_no_write_throttle) {
atomic_add_64(&dp->dp_tempreserved[tx->tx_txg & TXG_MASK],
space);
return (0);
}
/*
* Check to see if we have exceeded the maximum allowed IO for
* this transaction group. We can do this without locks since
* a little slop here is ok. Note that we do the reserved check
* with only half the requested reserve: this is because the
* reserve requests are worst-case, and we really don't want to
* throttle based off of worst-case estimates.
*/
if (write_limit > 0) {
reserved = dp->dp_space_towrite[tx->tx_txg & TXG_MASK]
+ dp->dp_tempreserved[tx->tx_txg & TXG_MASK] / 2;
if (reserved && reserved > write_limit)
return (ERESTART);
}
atomic_add_64(&dp->dp_tempreserved[tx->tx_txg & TXG_MASK], space);
/*
* If this transaction group is over 7/8ths capacity, delay
* the caller 1 clock tick. This will slow down the "fill"
* rate until the sync process can catch up with us.
*/
if (reserved && reserved > (write_limit - (write_limit >> 3)))
txg_delay(dp, tx->tx_txg, 1);
return (0);
}
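/*
 * Worked example, not part of the imported source (values assumed):
 * the 7/8ths threshold above for a 1GB write limit.
 */
#if 0
uint64_t write_limit = 1ULL << 30;			/* 1GB */
uint64_t threshold = write_limit - (write_limit >> 3);	/* 896MB = 7/8 */
/* once reserved crosses threshold, each caller is delayed one tick */
#endif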
void
dsl_pool_tempreserve_clear(dsl_pool_t *dp, int64_t space, dmu_tx_t *tx)
{
ASSERT(dp->dp_tempreserved[tx->tx_txg & TXG_MASK] >= space);
atomic_add_64(&dp->dp_tempreserved[tx->tx_txg & TXG_MASK], -space);
}
void
dsl_pool_memory_pressure(dsl_pool_t *dp)
{
uint64_t space_inuse = 0;
int i;
if (dp->dp_write_limit == zfs_write_limit_min)
return;
for (i = 0; i < TXG_SIZE; i++) {
space_inuse += dp->dp_space_towrite[i];
space_inuse += dp->dp_tempreserved[i];
}
dp->dp_write_limit = MAX(zfs_write_limit_min,
MIN(dp->dp_write_limit, space_inuse / 4));
}
void
dsl_pool_willuse_space(dsl_pool_t *dp, int64_t space, dmu_tx_t *tx)
{
if (space > 0) {
mutex_enter(&dp->dp_lock);
dp->dp_space_towrite[tx->tx_txg & TXG_MASK] += space;
mutex_exit(&dp->dp_lock);
}
}
/* ARGSUSED */
static int
upgrade_clones_cb(spa_t *spa, uint64_t dsobj, const char *dsname, void *arg)
{
dmu_tx_t *tx = arg;
dsl_dataset_t *ds, *prev = NULL;
int err;
dsl_pool_t *dp = spa_get_dsl(spa);
err = dsl_dataset_hold_obj(dp, dsobj, FTAG, &ds);
if (err)
return (err);
while (ds->ds_phys->ds_prev_snap_obj != 0) {
err = dsl_dataset_hold_obj(dp, ds->ds_phys->ds_prev_snap_obj,
FTAG, &prev);
if (err) {
dsl_dataset_rele(ds, FTAG);
return (err);
}
if (prev->ds_phys->ds_next_snap_obj != ds->ds_object)
break;
dsl_dataset_rele(ds, FTAG);
ds = prev;
prev = NULL;
}
if (prev == NULL) {
prev = dp->dp_origin_snap;
/*
* The $ORIGIN can't have any data, or the accounting
* will be wrong.
*/
ASSERT(prev->ds_phys->ds_bp.blk_birth == 0);
/* The origin doesn't get attached to itself */
if (ds->ds_object == prev->ds_object) {
dsl_dataset_rele(ds, FTAG);
return (0);
}
dmu_buf_will_dirty(ds->ds_dbuf, tx);
ds->ds_phys->ds_prev_snap_obj = prev->ds_object;
ds->ds_phys->ds_prev_snap_txg = prev->ds_phys->ds_creation_txg;
dmu_buf_will_dirty(ds->ds_dir->dd_dbuf, tx);
ds->ds_dir->dd_phys->dd_origin_obj = prev->ds_object;
dmu_buf_will_dirty(prev->ds_dbuf, tx);
prev->ds_phys->ds_num_children++;
if (ds->ds_phys->ds_next_snap_obj == 0) {
ASSERT(ds->ds_prev == NULL);
VERIFY(0 == dsl_dataset_hold_obj(dp,
ds->ds_phys->ds_prev_snap_obj, ds, &ds->ds_prev));
}
}
ASSERT(ds->ds_dir->dd_phys->dd_origin_obj == prev->ds_object);
ASSERT(ds->ds_phys->ds_prev_snap_obj == prev->ds_object);
if (prev->ds_phys->ds_next_clones_obj == 0) {
dmu_buf_will_dirty(prev->ds_dbuf, tx);
prev->ds_phys->ds_next_clones_obj =
zap_create(dp->dp_meta_objset,
DMU_OT_NEXT_CLONES, DMU_OT_NONE, 0, tx);
}
VERIFY(0 == zap_add_int(dp->dp_meta_objset,
prev->ds_phys->ds_next_clones_obj, ds->ds_object, tx));
dsl_dataset_rele(ds, FTAG);
if (prev != dp->dp_origin_snap)
dsl_dataset_rele(prev, FTAG);
return (0);
}
void
dsl_pool_upgrade_clones(dsl_pool_t *dp, dmu_tx_t *tx)
{
ASSERT(dmu_tx_is_syncing(tx));
ASSERT(dp->dp_origin_snap != NULL);
VERIFY3U(0, ==, dmu_objset_find_spa(dp->dp_spa, NULL, upgrade_clones_cb,
tx, DS_FIND_CHILDREN));
}
/* ARGSUSED */
static int
upgrade_dir_clones_cb(spa_t *spa, uint64_t dsobj, const char *dsname, void *arg)
{
dmu_tx_t *tx = arg;
dsl_dataset_t *ds;
dsl_pool_t *dp = spa_get_dsl(spa);
objset_t *mos = dp->dp_meta_objset;
VERIFY3U(0, ==, dsl_dataset_hold_obj(dp, dsobj, FTAG, &ds));
if (ds->ds_dir->dd_phys->dd_origin_obj) {
dsl_dataset_t *origin;
VERIFY3U(0, ==, dsl_dataset_hold_obj(dp,
ds->ds_dir->dd_phys->dd_origin_obj, FTAG, &origin));
if (origin->ds_dir->dd_phys->dd_clones == 0) {
dmu_buf_will_dirty(origin->ds_dir->dd_dbuf, tx);
origin->ds_dir->dd_phys->dd_clones = zap_create(mos,
DMU_OT_DSL_CLONES, DMU_OT_NONE, 0, tx);
}
VERIFY3U(0, ==, zap_add_int(dp->dp_meta_objset,
origin->ds_dir->dd_phys->dd_clones, dsobj, tx));
dsl_dataset_rele(origin, FTAG);
}
dsl_dataset_rele(ds, FTAG);
return (0);
}
void
dsl_pool_upgrade_dir_clones(dsl_pool_t *dp, dmu_tx_t *tx)
{
ASSERT(dmu_tx_is_syncing(tx));
uint64_t obj;
(void) dsl_dir_create_sync(dp, dp->dp_root_dir, FREE_DIR_NAME, tx);
VERIFY(0 == dsl_pool_open_special_dir(dp,
FREE_DIR_NAME, &dp->dp_free_dir));
/*
* We can't use bpobj_alloc(), because spa_version() still
* returns the old version, and we need a new-version bpobj with
* subobj support. So call dmu_object_alloc() directly.
*/
obj = dmu_object_alloc(dp->dp_meta_objset, DMU_OT_BPOBJ,
SPA_MAXBLOCKSIZE, DMU_OT_BPOBJ_HDR, sizeof (bpobj_phys_t), tx);
VERIFY3U(0, ==, zap_add(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_FREE_BPOBJ, sizeof (uint64_t), 1, &obj, tx));
VERIFY3U(0, ==, bpobj_open(&dp->dp_free_bpobj,
dp->dp_meta_objset, obj));
VERIFY3U(0, ==, dmu_objset_find_spa(dp->dp_spa, NULL,
upgrade_dir_clones_cb, tx, DS_FIND_CHILDREN));
}
void
dsl_pool_create_origin(dsl_pool_t *dp, dmu_tx_t *tx)
{
uint64_t dsobj;
dsl_dataset_t *ds;
ASSERT(dmu_tx_is_syncing(tx));
ASSERT(dp->dp_origin_snap == NULL);
/* create the origin dir, ds, & snap-ds */
rw_enter(&dp->dp_config_rwlock, RW_WRITER);
dsobj = dsl_dataset_create_sync(dp->dp_root_dir, ORIGIN_DIR_NAME,
NULL, 0, kcred, tx);
VERIFY(0 == dsl_dataset_hold_obj(dp, dsobj, FTAG, &ds));
dsl_dataset_snapshot_sync(ds, ORIGIN_DIR_NAME, tx);
VERIFY(0 == dsl_dataset_hold_obj(dp, ds->ds_phys->ds_prev_snap_obj,
dp, &dp->dp_origin_snap));
dsl_dataset_rele(ds, FTAG);
rw_exit(&dp->dp_config_rwlock);
}
taskq_t *
dsl_pool_vnrele_taskq(dsl_pool_t *dp)
{
return (dp->dp_vnrele_taskq);
}
/*
* Walk through the pool-wide zap object of temporary snapshot user holds
* and release them.
*/
void
dsl_pool_clean_tmp_userrefs(dsl_pool_t *dp)
{
zap_attribute_t za;
zap_cursor_t zc;
objset_t *mos = dp->dp_meta_objset;
uint64_t zapobj = dp->dp_tmp_userrefs_obj;
if (zapobj == 0)
return;
ASSERT(spa_version(dp->dp_spa) >= SPA_VERSION_USERREFS);
for (zap_cursor_init(&zc, mos, zapobj);
zap_cursor_retrieve(&zc, &za) == 0;
zap_cursor_advance(&zc)) {
char *htag;
uint64_t dsobj;
htag = strchr(za.za_name, '-');
*htag = '\0';
++htag;
dsobj = strtonum(za.za_name, NULL);
(void) dsl_dataset_user_release_tmp(dp, dsobj, htag, B_FALSE);
}
zap_cursor_fini(&zc);
}
/*
* Create the pool-wide zap object for storing temporary snapshot holds.
*/
void
dsl_pool_user_hold_create_obj(dsl_pool_t *dp, dmu_tx_t *tx)
{
objset_t *mos = dp->dp_meta_objset;
ASSERT(dp->dp_tmp_userrefs_obj == 0);
ASSERT(dmu_tx_is_syncing(tx));
dp->dp_tmp_userrefs_obj = zap_create(mos, DMU_OT_USERREFS,
DMU_OT_NONE, 0, tx);
VERIFY(zap_add(mos, DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_TMP_USERREFS,
sizeof (uint64_t), 1, &dp->dp_tmp_userrefs_obj, tx) == 0);
}
static int
dsl_pool_user_hold_rele_impl(dsl_pool_t *dp, uint64_t dsobj,
const char *tag, uint64_t *now, dmu_tx_t *tx, boolean_t holding)
{
objset_t *mos = dp->dp_meta_objset;
uint64_t zapobj = dp->dp_tmp_userrefs_obj;
char *name;
int error;
ASSERT(spa_version(dp->dp_spa) >= SPA_VERSION_USERREFS);
ASSERT(dmu_tx_is_syncing(tx));
/*
* If the pool was created prior to SPA_VERSION_USERREFS, the
* zap object for temporary holds might not exist yet.
*/
if (zapobj == 0) {
if (holding) {
dsl_pool_user_hold_create_obj(dp, tx);
zapobj = dp->dp_tmp_userrefs_obj;
} else {
return (ENOENT);
}
}
name = kmem_asprintf("%llx-%s", (u_longlong_t)dsobj, tag);
if (holding)
error = zap_add(mos, zapobj, name, 8, 1, now, tx);
else
error = zap_remove(mos, zapobj, name, tx);
strfree(name);
return (error);
}
/*
* Add a temporary hold for the given dataset object and tag.
*/
int
dsl_pool_user_hold(dsl_pool_t *dp, uint64_t dsobj, const char *tag,
uint64_t *now, dmu_tx_t *tx)
{
return (dsl_pool_user_hold_rele_impl(dp, dsobj, tag, now, tx, B_TRUE));
}
/*
* Release a temporary hold for the given dataset object and tag.
*/
int
dsl_pool_user_release(dsl_pool_t *dp, uint64_t dsobj, const char *tag,
dmu_tx_t *tx)
{
return (dsl_pool_user_hold_rele_impl(dp, dsobj, tag, NULL,
tx, B_FALSE));
}
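/*
 * Illustrative sketch, not part of the imported source: the hold names
 * stored in the zap combine the dataset object number and the tag, which
 * is what dsl_pool_clean_tmp_userrefs() splits apart again (dsobj and
 * tag here are assumed to be in scope).
 */
#if 0
char *name = kmem_asprintf("%llx-%s", (u_longlong_t)dsobj, tag);
/*
 * E.g. dsobj 0x1a5 with tag "recv-123" yields "1a5-recv-123"; cleanup
 * splits at the first '-', so the tag itself may safely contain '-'.
 */
strfree(name);
#endif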

1153
uts/common/fs/zfs/dsl_prop.c Normal file

File diff suppressed because it is too large

1766
uts/common/fs/zfs/dsl_scan.c Normal file

File diff suppressed because it is too large

240
uts/common/fs/zfs/dsl_synctask.c Normal file

@ -0,0 +1,240 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#include <sys/dmu.h>
#include <sys/dmu_tx.h>
#include <sys/dsl_pool.h>
#include <sys/dsl_dir.h>
#include <sys/dsl_synctask.h>
#include <sys/metaslab.h>
#define DST_AVG_BLKSHIFT 14
/* ARGSUSED */
static int
dsl_null_checkfunc(void *arg1, void *arg2, dmu_tx_t *tx)
{
return (0);
}
dsl_sync_task_group_t *
dsl_sync_task_group_create(dsl_pool_t *dp)
{
dsl_sync_task_group_t *dstg;
dstg = kmem_zalloc(sizeof (dsl_sync_task_group_t), KM_SLEEP);
list_create(&dstg->dstg_tasks, sizeof (dsl_sync_task_t),
offsetof(dsl_sync_task_t, dst_node));
dstg->dstg_pool = dp;
return (dstg);
}
void
dsl_sync_task_create(dsl_sync_task_group_t *dstg,
dsl_checkfunc_t *checkfunc, dsl_syncfunc_t *syncfunc,
void *arg1, void *arg2, int blocks_modified)
{
dsl_sync_task_t *dst;
if (checkfunc == NULL)
checkfunc = dsl_null_checkfunc;
dst = kmem_zalloc(sizeof (dsl_sync_task_t), KM_SLEEP);
dst->dst_checkfunc = checkfunc;
dst->dst_syncfunc = syncfunc;
dst->dst_arg1 = arg1;
dst->dst_arg2 = arg2;
list_insert_tail(&dstg->dstg_tasks, dst);
dstg->dstg_space += blocks_modified << DST_AVG_BLKSHIFT;
}
int
dsl_sync_task_group_wait(dsl_sync_task_group_t *dstg)
{
dmu_tx_t *tx;
uint64_t txg;
dsl_sync_task_t *dst;
top:
tx = dmu_tx_create_dd(dstg->dstg_pool->dp_mos_dir);
VERIFY(0 == dmu_tx_assign(tx, TXG_WAIT));
txg = dmu_tx_get_txg(tx);
/* Do a preliminary error check. */
dstg->dstg_err = 0;
rw_enter(&dstg->dstg_pool->dp_config_rwlock, RW_READER);
for (dst = list_head(&dstg->dstg_tasks); dst;
dst = list_next(&dstg->dstg_tasks, dst)) {
#ifdef ZFS_DEBUG
/*
		 * Only check half the time; otherwise, the sync-context
* check will almost never fail.
*/
if (spa_get_random(2) == 0)
continue;
#endif
dst->dst_err =
dst->dst_checkfunc(dst->dst_arg1, dst->dst_arg2, tx);
if (dst->dst_err)
dstg->dstg_err = dst->dst_err;
}
rw_exit(&dstg->dstg_pool->dp_config_rwlock);
if (dstg->dstg_err) {
dmu_tx_commit(tx);
return (dstg->dstg_err);
}
/*
* We don't generally have many sync tasks, so pay the price of
* add_tail to get the tasks executed in the right order.
*/
VERIFY(0 == txg_list_add_tail(&dstg->dstg_pool->dp_sync_tasks,
dstg, txg));
dmu_tx_commit(tx);
txg_wait_synced(dstg->dstg_pool, txg);
if (dstg->dstg_err == EAGAIN) {
txg_wait_synced(dstg->dstg_pool, txg + TXG_DEFER_SIZE);
goto top;
}
return (dstg->dstg_err);
}
void
dsl_sync_task_group_nowait(dsl_sync_task_group_t *dstg, dmu_tx_t *tx)
{
uint64_t txg;
dstg->dstg_nowaiter = B_TRUE;
txg = dmu_tx_get_txg(tx);
/*
* We don't generally have many sync tasks, so pay the price of
* add_tail to get the tasks executed in the right order.
*/
VERIFY(0 == txg_list_add_tail(&dstg->dstg_pool->dp_sync_tasks,
dstg, txg));
}
void
dsl_sync_task_group_destroy(dsl_sync_task_group_t *dstg)
{
dsl_sync_task_t *dst;
while (dst = list_head(&dstg->dstg_tasks)) {
list_remove(&dstg->dstg_tasks, dst);
kmem_free(dst, sizeof (dsl_sync_task_t));
}
kmem_free(dstg, sizeof (dsl_sync_task_group_t));
}
void
dsl_sync_task_group_sync(dsl_sync_task_group_t *dstg, dmu_tx_t *tx)
{
dsl_sync_task_t *dst;
dsl_pool_t *dp = dstg->dstg_pool;
uint64_t quota, used;
ASSERT3U(dstg->dstg_err, ==, 0);
/*
* Check for sufficient space. We just check against what's
* on-disk; we don't want any in-flight accounting to get in our
* way, because open context may have already used up various
* in-core limits (arc_tempreserve, dsl_pool_tempreserve).
*/
quota = dsl_pool_adjustedsize(dp, B_FALSE) -
metaslab_class_get_deferred(spa_normal_class(dp->dp_spa));
used = dp->dp_root_dir->dd_phys->dd_used_bytes;
/* MOS space is triple-dittoed, so we multiply by 3. */
if (dstg->dstg_space > 0 && used + dstg->dstg_space * 3 > quota) {
dstg->dstg_err = ENOSPC;
return;
}
/*
* Check for errors by calling checkfuncs.
*/
rw_enter(&dp->dp_config_rwlock, RW_WRITER);
for (dst = list_head(&dstg->dstg_tasks); dst;
dst = list_next(&dstg->dstg_tasks, dst)) {
dst->dst_err =
dst->dst_checkfunc(dst->dst_arg1, dst->dst_arg2, tx);
if (dst->dst_err)
dstg->dstg_err = dst->dst_err;
}
if (dstg->dstg_err == 0) {
/*
* Execute sync tasks.
*/
for (dst = list_head(&dstg->dstg_tasks); dst;
dst = list_next(&dstg->dstg_tasks, dst)) {
dst->dst_syncfunc(dst->dst_arg1, dst->dst_arg2, tx);
}
}
rw_exit(&dp->dp_config_rwlock);
if (dstg->dstg_nowaiter)
dsl_sync_task_group_destroy(dstg);
}
int
dsl_sync_task_do(dsl_pool_t *dp,
dsl_checkfunc_t *checkfunc, dsl_syncfunc_t *syncfunc,
void *arg1, void *arg2, int blocks_modified)
{
dsl_sync_task_group_t *dstg;
int err;
ASSERT(spa_writeable(dp->dp_spa));
dstg = dsl_sync_task_group_create(dp);
dsl_sync_task_create(dstg, checkfunc, syncfunc,
arg1, arg2, blocks_modified);
err = dsl_sync_task_group_wait(dstg);
dsl_sync_task_group_destroy(dstg);
return (err);
}
void
dsl_sync_task_do_nowait(dsl_pool_t *dp,
dsl_checkfunc_t *checkfunc, dsl_syncfunc_t *syncfunc,
void *arg1, void *arg2, int blocks_modified, dmu_tx_t *tx)
{
dsl_sync_task_group_t *dstg;
if (!spa_writeable(dp->dp_spa))
return;
dstg = dsl_sync_task_group_create(dp);
dsl_sync_task_create(dstg, checkfunc, syncfunc,
arg1, arg2, blocks_modified);
dsl_sync_task_group_nowait(dstg, tx);
}
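/*
 * Illustrative sketch, not part of the imported source: a typical consumer
 * supplies a check function and a sync function and lets dsl_sync_task_do()
 * run them in syncing context.  The example_* names are hypothetical.
 */
#if 0
static int
example_check(void *arg1, void *arg2, dmu_tx_t *tx)
{
	/* validate preconditions; a nonzero return aborts the task */
	return (0);
}

static void
example_sync(void *arg1, void *arg2, dmu_tx_t *tx)
{
	/* make the on-disk changes; guaranteed to run in syncing context */
}

static int
example_do_task(dsl_pool_t *dp, dsl_dataset_t *ds)
{
	return (dsl_sync_task_do(dp, example_check, example_sync,
	    ds, NULL, 3 /* blocks modified */));
}
#endif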

69
uts/common/fs/zfs/gzip.c Normal file

@ -0,0 +1,69 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2007 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#pragma ident "%Z%%M% %I% %E% SMI"
#include <sys/debug.h>
#include <sys/types.h>
#include <sys/zmod.h>
#ifdef _KERNEL
#include <sys/systm.h>
#else
#include <strings.h>
#endif
size_t
gzip_compress(void *s_start, void *d_start, size_t s_len, size_t d_len, int n)
{
size_t dstlen = d_len;
ASSERT(d_len <= s_len);
if (z_compress_level(d_start, &dstlen, s_start, s_len, n) != Z_OK) {
if (d_len != s_len)
return (s_len);
bcopy(s_start, d_start, s_len);
return (s_len);
}
return (dstlen);
}
/*ARGSUSED*/
int
gzip_decompress(void *s_start, void *d_start, size_t s_len, size_t d_len, int n)
{
size_t dstlen = d_len;
ASSERT(d_len >= s_len);
if (z_uncompress(d_start, &dstlen, s_start, s_len) != Z_OK)
return (-1);
return (0);
}
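/*
 * Illustrative sketch, not part of the imported source: gzip_compress()
 * signals "did not fit" by returning s_len, so a caller stores the block
 * uncompressed in that case (d_len must be <= s_len per the ASSERT above;
 * the compression level of 6 is assumed).
 */
#if 0
size_t c_len = gzip_compress(src, dst, s_len, d_len, 6);
if (c_len == s_len) {
	/* incompressible or overflow: keep the source block as-is */
}
#endif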

123
uts/common/fs/zfs/lzjb.c Normal file

@ -0,0 +1,123 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
*/
/*
* We keep our own copy of this algorithm for 3 main reasons:
* 1. If we didn't, anyone modifying common/os/compress.c would
 * directly break our on-disk format
* 2. Our version of lzjb does not have a number of checks that the
* common/os version needs and uses
* 3. We initialize the lempel to ensure deterministic results,
* so that identical blocks can always be deduplicated.
* In particular, we are adding the "feature" that compress() can
 * take a destination buffer size and return the compressed length, or the
* source length if compression would overflow the destination buffer.
*/
#include <sys/types.h>
#define MATCH_BITS 6
#define MATCH_MIN 3
#define MATCH_MAX ((1 << MATCH_BITS) + (MATCH_MIN - 1))
#define OFFSET_MASK ((1 << (16 - MATCH_BITS)) - 1)
#define LEMPEL_SIZE 1024
/*ARGSUSED*/
size_t
lzjb_compress(void *s_start, void *d_start, size_t s_len, size_t d_len, int n)
{
uchar_t *src = s_start;
uchar_t *dst = d_start;
uchar_t *cpy, *copymap;
int copymask = 1 << (NBBY - 1);
int mlen, offset, hash;
uint16_t *hp;
uint16_t lempel[LEMPEL_SIZE] = { 0 };
while (src < (uchar_t *)s_start + s_len) {
if ((copymask <<= 1) == (1 << NBBY)) {
if (dst >= (uchar_t *)d_start + d_len - 1 - 2 * NBBY)
return (s_len);
copymask = 1;
copymap = dst;
*dst++ = 0;
}
if (src > (uchar_t *)s_start + s_len - MATCH_MAX) {
*dst++ = *src++;
continue;
}
hash = (src[0] << 16) + (src[1] << 8) + src[2];
hash += hash >> 9;
hash += hash >> 5;
hp = &lempel[hash & (LEMPEL_SIZE - 1)];
offset = (intptr_t)(src - *hp) & OFFSET_MASK;
*hp = (uint16_t)(uintptr_t)src;
cpy = src - offset;
if (cpy >= (uchar_t *)s_start && cpy != src &&
src[0] == cpy[0] && src[1] == cpy[1] && src[2] == cpy[2]) {
*copymap |= copymask;
for (mlen = MATCH_MIN; mlen < MATCH_MAX; mlen++)
if (src[mlen] != cpy[mlen])
break;
*dst++ = ((mlen - MATCH_MIN) << (NBBY - MATCH_BITS)) |
(offset >> NBBY);
*dst++ = (uchar_t)offset;
src += mlen;
} else {
*dst++ = *src++;
}
}
return (dst - (uchar_t *)d_start);
}
/*ARGSUSED*/
int
lzjb_decompress(void *s_start, void *d_start, size_t s_len, size_t d_len, int n)
{
uchar_t *src = s_start;
uchar_t *dst = d_start;
uchar_t *d_end = (uchar_t *)d_start + d_len;
uchar_t *cpy, copymap;
int copymask = 1 << (NBBY - 1);
while (dst < d_end) {
if ((copymask <<= 1) == (1 << NBBY)) {
copymask = 1;
copymap = *src++;
}
if (copymap & copymask) {
int mlen = (src[0] >> (NBBY - MATCH_BITS)) + MATCH_MIN;
int offset = ((src[0] << NBBY) | src[1]) & OFFSET_MASK;
src += 2;
if ((cpy = dst - offset) < (uchar_t *)d_start)
return (-1);
while (--mlen >= 0 && dst < d_end)
*dst++ = *cpy++;
} else {
*dst++ = *src++;
}
}
return (0);
}
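/*
 * Illustrative round trip, not part of the imported source: lzjb uses the
 * same overflow convention as gzip_compress() above, returning s_len when
 * the destination would overflow; the last argument is unused.
 */
#if 0
size_t c_len = lzjb_compress(src, dst, s_len, d_len, 0);
if (c_len < s_len)
	VERIFY(lzjb_decompress(dst, out, c_len, s_len, 0) == 0);
#endif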

1604
uts/common/fs/zfs/metaslab.c Normal file

File diff suppressed because it is too large

223
uts/common/fs/zfs/refcount.c Normal file

@ -0,0 +1,223 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#include <sys/zfs_context.h>
#include <sys/refcount.h>
#ifdef ZFS_DEBUG
#ifdef _KERNEL
int reference_tracking_enable = FALSE; /* runs out of memory too easily */
#else
int reference_tracking_enable = TRUE;
#endif
int reference_history = 4; /* tunable */
static kmem_cache_t *reference_cache;
static kmem_cache_t *reference_history_cache;
void
refcount_init(void)
{
reference_cache = kmem_cache_create("reference_cache",
sizeof (reference_t), 0, NULL, NULL, NULL, NULL, NULL, 0);
reference_history_cache = kmem_cache_create("reference_history_cache",
sizeof (uint64_t), 0, NULL, NULL, NULL, NULL, NULL, 0);
}
void
refcount_fini(void)
{
kmem_cache_destroy(reference_cache);
kmem_cache_destroy(reference_history_cache);
}
void
refcount_create(refcount_t *rc)
{
mutex_init(&rc->rc_mtx, NULL, MUTEX_DEFAULT, NULL);
list_create(&rc->rc_list, sizeof (reference_t),
offsetof(reference_t, ref_link));
list_create(&rc->rc_removed, sizeof (reference_t),
offsetof(reference_t, ref_link));
rc->rc_count = 0;
rc->rc_removed_count = 0;
}
void
refcount_destroy_many(refcount_t *rc, uint64_t number)
{
reference_t *ref;
ASSERT(rc->rc_count == number);
while (ref = list_head(&rc->rc_list)) {
list_remove(&rc->rc_list, ref);
kmem_cache_free(reference_cache, ref);
}
list_destroy(&rc->rc_list);
while (ref = list_head(&rc->rc_removed)) {
list_remove(&rc->rc_removed, ref);
kmem_cache_free(reference_history_cache, ref->ref_removed);
kmem_cache_free(reference_cache, ref);
}
list_destroy(&rc->rc_removed);
mutex_destroy(&rc->rc_mtx);
}
void
refcount_destroy(refcount_t *rc)
{
refcount_destroy_many(rc, 0);
}
int
refcount_is_zero(refcount_t *rc)
{
ASSERT(rc->rc_count >= 0);
return (rc->rc_count == 0);
}
int64_t
refcount_count(refcount_t *rc)
{
ASSERT(rc->rc_count >= 0);
return (rc->rc_count);
}
int64_t
refcount_add_many(refcount_t *rc, uint64_t number, void *holder)
{
reference_t *ref;
int64_t count;
if (reference_tracking_enable) {
ref = kmem_cache_alloc(reference_cache, KM_SLEEP);
ref->ref_holder = holder;
ref->ref_number = number;
}
mutex_enter(&rc->rc_mtx);
ASSERT(rc->rc_count >= 0);
if (reference_tracking_enable)
list_insert_head(&rc->rc_list, ref);
rc->rc_count += number;
count = rc->rc_count;
mutex_exit(&rc->rc_mtx);
return (count);
}
int64_t
refcount_add(refcount_t *rc, void *holder)
{
return (refcount_add_many(rc, 1, holder));
}
int64_t
refcount_remove_many(refcount_t *rc, uint64_t number, void *holder)
{
reference_t *ref;
int64_t count;
mutex_enter(&rc->rc_mtx);
ASSERT(rc->rc_count >= number);
if (!reference_tracking_enable) {
rc->rc_count -= number;
count = rc->rc_count;
mutex_exit(&rc->rc_mtx);
return (count);
}
for (ref = list_head(&rc->rc_list); ref;
ref = list_next(&rc->rc_list, ref)) {
if (ref->ref_holder == holder && ref->ref_number == number) {
list_remove(&rc->rc_list, ref);
if (reference_history > 0) {
ref->ref_removed =
kmem_cache_alloc(reference_history_cache,
KM_SLEEP);
list_insert_head(&rc->rc_removed, ref);
rc->rc_removed_count++;
if (rc->rc_removed_count >= reference_history) {
ref = list_tail(&rc->rc_removed);
list_remove(&rc->rc_removed, ref);
kmem_cache_free(reference_history_cache,
ref->ref_removed);
kmem_cache_free(reference_cache, ref);
rc->rc_removed_count--;
}
} else {
kmem_cache_free(reference_cache, ref);
}
rc->rc_count -= number;
count = rc->rc_count;
mutex_exit(&rc->rc_mtx);
return (count);
}
}
panic("No such hold %p on refcount %llx", holder,
(u_longlong_t)(uintptr_t)rc);
return (-1);
}
int64_t
refcount_remove(refcount_t *rc, void *holder)
{
return (refcount_remove_many(rc, 1, holder));
}
void
refcount_transfer(refcount_t *dst, refcount_t *src)
{
int64_t count, removed_count;
list_t list, removed;
list_create(&list, sizeof (reference_t),
offsetof(reference_t, ref_link));
list_create(&removed, sizeof (reference_t),
offsetof(reference_t, ref_link));
mutex_enter(&src->rc_mtx);
count = src->rc_count;
removed_count = src->rc_removed_count;
src->rc_count = 0;
src->rc_removed_count = 0;
list_move_tail(&list, &src->rc_list);
list_move_tail(&removed, &src->rc_removed);
mutex_exit(&src->rc_mtx);
mutex_enter(&dst->rc_mtx);
dst->rc_count += count;
dst->rc_removed_count += removed_count;
list_move_tail(&dst->rc_list, &list);
list_move_tail(&dst->rc_removed, &removed);
mutex_exit(&dst->rc_mtx);
list_destroy(&list);
list_destroy(&removed);
}
#endif /* ZFS_DEBUG */
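/*
 * Illustrative sketch, not part of the imported source: in ZFS_DEBUG
 * kernels every add must be balanced by a remove with the same holder
 * tag or refcount_remove_many() panics; non-debug builds compile these
 * calls down to plain atomics via refcount.h.
 */
#if 0
refcount_t rc;

refcount_create(&rc);
(void) refcount_add(&rc, FTAG);		/* tagged hold */
/* ... use the protected object ... */
(void) refcount_remove(&rc, FTAG);	/* must match the holder tag */
refcount_destroy(&rc);
#endif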

264
uts/common/fs/zfs/rrwlock.c Normal file

@ -0,0 +1,264 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2009 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#include <sys/refcount.h>
#include <sys/rrwlock.h>
/*
* This file contains the implementation of a re-entrant read
* reader/writer lock (aka "rrwlock").
*
* This is a normal reader/writer lock with the additional feature
* of allowing threads who have already obtained a read lock to
* re-enter another read lock (re-entrant read) - even if there are
* waiting writers.
*
* Callers who have not obtained a read lock give waiting writers priority.
*
* The rrwlock_t lock does not allow re-entrant writers, nor does it
* allow a re-entrant mix of reads and writes (that is, it does not
* allow a caller who has already obtained a read lock to be able to
* then grab a write lock without first dropping all read locks, and
* vice versa).
*
* The rrwlock_t uses tsd (thread specific data) to keep a list of
* nodes (rrw_node_t), where each node keeps track of which specific
* lock (rrw_node_t::rn_rrl) the thread has grabbed. Since re-entering
* should be rare, a thread that grabs multiple reads on the same rrwlock_t
* will store multiple rrw_node_ts of the same 'rrn_rrl'. Nodes on the
* tsd list can represent a different rrwlock_t. This allows a thread
* to enter multiple and unique rrwlock_ts for read locks at the same time.
*
* Since using tsd exposes some overhead, the rrwlock_t only needs to
* keep tsd data when writers are waiting. If no writers are waiting, then
* a reader just bumps the anonymous read count (rr_anon_rcount) - no tsd
* is needed. Once a writer attempts to grab the lock, readers then
* keep tsd data and bump the linked readers count (rr_linked_rcount).
*
* If there are waiting writers and there are anonymous readers, then a
* reader doesn't know if it is a re-entrant lock. But since it may be one,
* we allow the read to proceed (otherwise it could deadlock). Since once
* waiting writers are active, readers no longer bump the anonymous count,
* the anonymous readers will eventually flush themselves out. At this point,
* readers will be able to tell if they are a re-entrant lock (have a
* rrw_node_t entry for the lock) or not. If they are a re-entrant lock, then
 * we must let them proceed. If they are not, then the reader blocks for the
* waiting writers. Hence, we do not starve writers.
*/
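/*
 * Illustrative sketch, not part of the imported source: the re-entrant
 * read case described above, from a caller's perspective.
 */
#if 0
rrw_enter(&rrl, RW_READER, FTAG);	/* first read hold */
rrw_enter(&rrl, RW_READER, FTAG);	/* re-entrant: succeeds even if a
					   writer is now waiting */
rrw_exit(&rrl, FTAG);
rrw_exit(&rrl, FTAG);			/* last exit wakes any waiter */
#endif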
/* global key for TSD */
uint_t rrw_tsd_key;
typedef struct rrw_node {
struct rrw_node *rn_next;
rrwlock_t *rn_rrl;
} rrw_node_t;
static rrw_node_t *
rrn_find(rrwlock_t *rrl)
{
rrw_node_t *rn;
if (refcount_count(&rrl->rr_linked_rcount) == 0)
return (NULL);
for (rn = tsd_get(rrw_tsd_key); rn != NULL; rn = rn->rn_next) {
if (rn->rn_rrl == rrl)
return (rn);
}
return (NULL);
}
/*
* Add a node to the head of the singly linked list.
*/
static void
rrn_add(rrwlock_t *rrl)
{
rrw_node_t *rn;
rn = kmem_alloc(sizeof (*rn), KM_SLEEP);
rn->rn_rrl = rrl;
rn->rn_next = tsd_get(rrw_tsd_key);
VERIFY(tsd_set(rrw_tsd_key, rn) == 0);
}
/*
* If a node is found for 'rrl', then remove the node from this
* thread's list and return TRUE; otherwise return FALSE.
*/
static boolean_t
rrn_find_and_remove(rrwlock_t *rrl)
{
rrw_node_t *rn;
rrw_node_t *prev = NULL;
if (refcount_count(&rrl->rr_linked_rcount) == 0)
return (B_FALSE);
for (rn = tsd_get(rrw_tsd_key); rn != NULL; rn = rn->rn_next) {
if (rn->rn_rrl == rrl) {
if (prev)
prev->rn_next = rn->rn_next;
else
VERIFY(tsd_set(rrw_tsd_key, rn->rn_next) == 0);
kmem_free(rn, sizeof (*rn));
return (B_TRUE);
}
prev = rn;
}
return (B_FALSE);
}
void
rrw_init(rrwlock_t *rrl)
{
mutex_init(&rrl->rr_lock, NULL, MUTEX_DEFAULT, NULL);
cv_init(&rrl->rr_cv, NULL, CV_DEFAULT, NULL);
rrl->rr_writer = NULL;
refcount_create(&rrl->rr_anon_rcount);
refcount_create(&rrl->rr_linked_rcount);
rrl->rr_writer_wanted = B_FALSE;
}
void
rrw_destroy(rrwlock_t *rrl)
{
mutex_destroy(&rrl->rr_lock);
cv_destroy(&rrl->rr_cv);
ASSERT(rrl->rr_writer == NULL);
refcount_destroy(&rrl->rr_anon_rcount);
refcount_destroy(&rrl->rr_linked_rcount);
}
static void
rrw_enter_read(rrwlock_t *rrl, void *tag)
{
mutex_enter(&rrl->rr_lock);
#if !defined(DEBUG) && defined(_KERNEL)
if (!rrl->rr_writer && !rrl->rr_writer_wanted) {
rrl->rr_anon_rcount.rc_count++;
mutex_exit(&rrl->rr_lock);
return;
}
DTRACE_PROBE(zfs__rrwfastpath__rdmiss);
#endif
ASSERT(rrl->rr_writer != curthread);
ASSERT(refcount_count(&rrl->rr_anon_rcount) >= 0);
while (rrl->rr_writer || (rrl->rr_writer_wanted &&
refcount_is_zero(&rrl->rr_anon_rcount) &&
rrn_find(rrl) == NULL))
cv_wait(&rrl->rr_cv, &rrl->rr_lock);
if (rrl->rr_writer_wanted) {
/* may or may not be a re-entrant enter */
rrn_add(rrl);
(void) refcount_add(&rrl->rr_linked_rcount, tag);
} else {
(void) refcount_add(&rrl->rr_anon_rcount, tag);
}
ASSERT(rrl->rr_writer == NULL);
mutex_exit(&rrl->rr_lock);
}
static void
rrw_enter_write(rrwlock_t *rrl)
{
mutex_enter(&rrl->rr_lock);
ASSERT(rrl->rr_writer != curthread);
while (refcount_count(&rrl->rr_anon_rcount) > 0 ||
refcount_count(&rrl->rr_linked_rcount) > 0 ||
rrl->rr_writer != NULL) {
rrl->rr_writer_wanted = B_TRUE;
cv_wait(&rrl->rr_cv, &rrl->rr_lock);
}
rrl->rr_writer_wanted = B_FALSE;
rrl->rr_writer = curthread;
mutex_exit(&rrl->rr_lock);
}
void
rrw_enter(rrwlock_t *rrl, krw_t rw, void *tag)
{
if (rw == RW_READER)
rrw_enter_read(rrl, tag);
else
rrw_enter_write(rrl);
}
void
rrw_exit(rrwlock_t *rrl, void *tag)
{
mutex_enter(&rrl->rr_lock);
#if !defined(DEBUG) && defined(_KERNEL)
if (!rrl->rr_writer && rrl->rr_linked_rcount.rc_count == 0) {
rrl->rr_anon_rcount.rc_count--;
if (rrl->rr_anon_rcount.rc_count == 0)
cv_broadcast(&rrl->rr_cv);
mutex_exit(&rrl->rr_lock);
return;
}
DTRACE_PROBE(zfs__rrwfastpath__exitmiss);
#endif
ASSERT(!refcount_is_zero(&rrl->rr_anon_rcount) ||
!refcount_is_zero(&rrl->rr_linked_rcount) ||
rrl->rr_writer != NULL);
if (rrl->rr_writer == NULL) {
int64_t count;
if (rrn_find_and_remove(rrl))
count = refcount_remove(&rrl->rr_linked_rcount, tag);
else
count = refcount_remove(&rrl->rr_anon_rcount, tag);
if (count == 0)
cv_broadcast(&rrl->rr_cv);
} else {
ASSERT(rrl->rr_writer == curthread);
ASSERT(refcount_is_zero(&rrl->rr_anon_rcount) &&
refcount_is_zero(&rrl->rr_linked_rcount));
rrl->rr_writer = NULL;
cv_broadcast(&rrl->rr_cv);
}
mutex_exit(&rrl->rr_lock);
}
boolean_t
rrw_held(rrwlock_t *rrl, krw_t rw)
{
boolean_t held;
mutex_enter(&rrl->rr_lock);
if (rw == RW_WRITER) {
held = (rrl->rr_writer == curthread);
} else {
held = (!refcount_is_zero(&rrl->rr_anon_rcount) ||
!refcount_is_zero(&rrl->rr_linked_rcount));
}
mutex_exit(&rrl->rr_lock);
return (held);
}

1970
uts/common/fs/zfs/sa.c Normal file

File diff suppressed because it is too large

50
uts/common/fs/zfs/sha256.c Normal file

@ -0,0 +1,50 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2009 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#include <sys/zfs_context.h>
#include <sys/zio.h>
#include <sys/sha2.h>
void
zio_checksum_SHA256(const void *buf, uint64_t size, zio_cksum_t *zcp)
{
SHA2_CTX ctx;
zio_cksum_t tmp;
SHA2Init(SHA256, &ctx);
SHA2Update(&ctx, buf, size);
SHA2Final(&tmp, &ctx);
/*
	 * A prior implementation of this function used a private
	 * SHA256 implementation that always wrote things out in
	 * big-endian byte order, and there was no byteswap variant of it.
	 * To preserve on-disk compatibility we need to force that
	 * behaviour.
*/
zcp->zc_word[0] = BE_64(tmp.zc_word[0]);
zcp->zc_word[1] = BE_64(tmp.zc_word[1]);
zcp->zc_word[2] = BE_64(tmp.zc_word[2]);
zcp->zc_word[3] = BE_64(tmp.zc_word[3]);
}
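/*
 * Illustrative sketch, not part of the imported source: callers hand in any
 * buffer and get the 256-bit digest back as four big-endian 64-bit words.
 */
#if 0
zio_cksum_t zc;

zio_checksum_SHA256(buf, size, &zc);	/* zc.zc_word[0..3] hold the digest */
#endif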

5882
uts/common/fs/zfs/spa.c Normal file

File diff suppressed because it is too large

487
uts/common/fs/zfs/spa_config.c Normal file

@ -0,0 +1,487 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#include <sys/spa.h>
#include <sys/spa_impl.h>
#include <sys/nvpair.h>
#include <sys/uio.h>
#include <sys/fs/zfs.h>
#include <sys/vdev_impl.h>
#include <sys/zfs_ioctl.h>
#include <sys/utsname.h>
#include <sys/systeminfo.h>
#include <sys/sunddi.h>
#ifdef _KERNEL
#include <sys/kobj.h>
#include <sys/zone.h>
#endif
/*
* Pool configuration repository.
*
* Pool configuration is stored as a packed nvlist on the filesystem. By
* default, all pools are stored in /etc/zfs/zpool.cache and loaded on boot
* (when the ZFS module is loaded). Pools can also have the 'cachefile'
 * property set that allows them to be stored in an alternate location under
* the control of external software.
*
* For each cache file, we have a single nvlist which holds all the
* configuration information. When the module loads, we read this information
* from /etc/zfs/zpool.cache and populate the SPA namespace. This namespace is
* maintained independently in spa.c. Whenever the namespace is modified, or
* the configuration of a pool is changed, we call spa_config_sync(), which
* walks through all the active pools and writes the configuration to disk.
*/
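/*
 * Illustrative sketch, not part of the imported source: the packed cachefile
 * decodes to one nvlist per pool, keyed by pool name, which is exactly what
 * spa_config_load() below walks with nvlist_next_nvpair().
 */
#if 0
nvlist_t *nvlist, *child;
nvpair_t *nvpair = NULL;

VERIFY(nvlist_unpack(buf, fsize, &nvlist, KM_SLEEP) == 0);
while ((nvpair = nvlist_next_nvpair(nvlist, nvpair)) != NULL) {
	if (nvpair_type(nvpair) != DATA_TYPE_NVLIST)
		continue;
	VERIFY(nvpair_value_nvlist(nvpair, &child) == 0);
	/* child is the full pool config for pool nvpair_name(nvpair) */
}
nvlist_free(nvlist);
#endif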
static uint64_t spa_config_generation = 1;
/*
* This can be overridden in userland to preserve an alternate namespace for
* userland pools when doing testing.
*/
const char *spa_config_path = ZPOOL_CACHE;
/*
* Called when the module is first loaded, this routine loads the configuration
* file into the SPA namespace. It does not actually open or load the pools; it
* only populates the namespace.
*/
void
spa_config_load(void)
{
void *buf = NULL;
nvlist_t *nvlist, *child;
nvpair_t *nvpair;
char *pathname;
struct _buf *file;
uint64_t fsize;
/*
* Open the configuration file.
*/
pathname = kmem_alloc(MAXPATHLEN, KM_SLEEP);
(void) snprintf(pathname, MAXPATHLEN, "%s%s",
(rootdir != NULL) ? "./" : "", spa_config_path);
file = kobj_open_file(pathname);
kmem_free(pathname, MAXPATHLEN);
if (file == (struct _buf *)-1)
return;
if (kobj_get_filesize(file, &fsize) != 0)
goto out;
buf = kmem_alloc(fsize, KM_SLEEP);
/*
* Read the nvlist from the file.
*/
if (kobj_read_file(file, buf, fsize, 0) < 0)
goto out;
/*
* Unpack the nvlist.
*/
if (nvlist_unpack(buf, fsize, &nvlist, KM_SLEEP) != 0)
goto out;
/*
* Iterate over all elements in the nvlist, creating a new spa_t for
* each one with the specified configuration.
*/
mutex_enter(&spa_namespace_lock);
nvpair = NULL;
while ((nvpair = nvlist_next_nvpair(nvlist, nvpair)) != NULL) {
if (nvpair_type(nvpair) != DATA_TYPE_NVLIST)
continue;
VERIFY(nvpair_value_nvlist(nvpair, &child) == 0);
if (spa_lookup(nvpair_name(nvpair)) != NULL)
continue;
(void) spa_add(nvpair_name(nvpair), child, NULL);
}
mutex_exit(&spa_namespace_lock);
nvlist_free(nvlist);
out:
if (buf != NULL)
kmem_free(buf, fsize);
kobj_close_file(file);
}
static void
spa_config_write(spa_config_dirent_t *dp, nvlist_t *nvl)
{
size_t buflen;
char *buf;
vnode_t *vp;
int oflags = FWRITE | FTRUNC | FCREAT | FOFFMAX;
char *temp;
/*
* If the nvlist is empty (NULL), then remove the old cachefile.
*/
if (nvl == NULL) {
(void) vn_remove(dp->scd_path, UIO_SYSSPACE, RMFILE);
return;
}
/*
* Pack the configuration into a buffer.
*/
VERIFY(nvlist_size(nvl, &buflen, NV_ENCODE_XDR) == 0);
buf = kmem_alloc(buflen, KM_SLEEP);
temp = kmem_zalloc(MAXPATHLEN, KM_SLEEP);
VERIFY(nvlist_pack(nvl, &buf, &buflen, NV_ENCODE_XDR,
KM_SLEEP) == 0);
/*
* Write the configuration to disk. We need to do the traditional
* 'write to temporary file, sync, move over original' to make sure we
* always have a consistent view of the data.
*/
(void) snprintf(temp, MAXPATHLEN, "%s.tmp", dp->scd_path);
if (vn_open(temp, UIO_SYSSPACE, oflags, 0644, &vp, CRCREAT, 0) == 0) {
if (vn_rdwr(UIO_WRITE, vp, buf, buflen, 0, UIO_SYSSPACE,
0, RLIM64_INFINITY, kcred, NULL) == 0 &&
VOP_FSYNC(vp, FSYNC, kcred, NULL) == 0) {
(void) vn_rename(temp, dp->scd_path, UIO_SYSSPACE);
}
(void) VOP_CLOSE(vp, oflags, 1, 0, kcred, NULL);
VN_RELE(vp);
}
(void) vn_remove(temp, UIO_SYSSPACE, RMFILE);
kmem_free(buf, buflen);
kmem_free(temp, MAXPATHLEN);
}
/*
* Synchronize pool configuration to disk. This must be called with the
* namespace lock held.
*/
void
spa_config_sync(spa_t *target, boolean_t removing, boolean_t postsysevent)
{
spa_config_dirent_t *dp, *tdp;
nvlist_t *nvl;
ASSERT(MUTEX_HELD(&spa_namespace_lock));
if (rootdir == NULL || !(spa_mode_global & FWRITE))
return;
/*
* Iterate over all cachefiles for the pool, past or present. When the
* cachefile is changed, the new one is pushed onto this list, allowing
* us to update previous cachefiles that no longer contain this pool.
*/
for (dp = list_head(&target->spa_config_list); dp != NULL;
dp = list_next(&target->spa_config_list, dp)) {
spa_t *spa = NULL;
if (dp->scd_path == NULL)
continue;
/*
* Iterate over all pools, adding any matching pools to 'nvl'.
*/
nvl = NULL;
while ((spa = spa_next(spa)) != NULL) {
if (spa == target && removing)
continue;
mutex_enter(&spa->spa_props_lock);
tdp = list_head(&spa->spa_config_list);
if (spa->spa_config == NULL ||
tdp->scd_path == NULL ||
strcmp(tdp->scd_path, dp->scd_path) != 0) {
mutex_exit(&spa->spa_props_lock);
continue;
}
if (nvl == NULL)
VERIFY(nvlist_alloc(&nvl, NV_UNIQUE_NAME,
KM_SLEEP) == 0);
VERIFY(nvlist_add_nvlist(nvl, spa->spa_name,
spa->spa_config) == 0);
mutex_exit(&spa->spa_props_lock);
}
spa_config_write(dp, nvl);
nvlist_free(nvl);
}
/*
* Remove any config entries older than the current one.
*/
dp = list_head(&target->spa_config_list);
while ((tdp = list_next(&target->spa_config_list, dp)) != NULL) {
list_remove(&target->spa_config_list, tdp);
if (tdp->scd_path != NULL)
spa_strfree(tdp->scd_path);
kmem_free(tdp, sizeof (spa_config_dirent_t));
}
spa_config_generation++;
if (postsysevent)
spa_event_notify(target, NULL, ESC_ZFS_CONFIG_SYNC);
}
/*
* Sigh. Inside a local zone, we don't have access to /etc/zfs/zpool.cache,
* and we don't want to allow the local zone to see all the pools anyway.
* So we have to invent the ZFS_IOC_CONFIG ioctl to grab the configuration
 * information for all pools visible within the zone.
*/
nvlist_t *
spa_all_configs(uint64_t *generation)
{
nvlist_t *pools;
spa_t *spa = NULL;
if (*generation == spa_config_generation)
return (NULL);
VERIFY(nvlist_alloc(&pools, NV_UNIQUE_NAME, KM_SLEEP) == 0);
mutex_enter(&spa_namespace_lock);
while ((spa = spa_next(spa)) != NULL) {
if (INGLOBALZONE(curproc) ||
zone_dataset_visible(spa_name(spa), NULL)) {
mutex_enter(&spa->spa_props_lock);
VERIFY(nvlist_add_nvlist(pools, spa_name(spa),
spa->spa_config) == 0);
mutex_exit(&spa->spa_props_lock);
}
}
*generation = spa_config_generation;
mutex_exit(&spa_namespace_lock);
return (pools);
}
void
spa_config_set(spa_t *spa, nvlist_t *config)
{
mutex_enter(&spa->spa_props_lock);
if (spa->spa_config != NULL)
nvlist_free(spa->spa_config);
spa->spa_config = config;
mutex_exit(&spa->spa_props_lock);
}
/*
* Generate the pool's configuration based on the current in-core state.
* We infer whether to generate a complete config or just one top-level config
* based on whether vd is the root vdev.
*/
nvlist_t *
spa_config_generate(spa_t *spa, vdev_t *vd, uint64_t txg, int getstats)
{
nvlist_t *config, *nvroot;
vdev_t *rvd = spa->spa_root_vdev;
unsigned long hostid = 0;
boolean_t locked = B_FALSE;
uint64_t split_guid;
if (vd == NULL) {
vd = rvd;
locked = B_TRUE;
spa_config_enter(spa, SCL_CONFIG | SCL_STATE, FTAG, RW_READER);
}
ASSERT(spa_config_held(spa, SCL_CONFIG | SCL_STATE, RW_READER) ==
(SCL_CONFIG | SCL_STATE));
/*
* If txg is -1, report the current value of spa->spa_config_txg.
*/
if (txg == -1ULL)
txg = spa->spa_config_txg;
VERIFY(nvlist_alloc(&config, NV_UNIQUE_NAME, KM_SLEEP) == 0);
VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_VERSION,
spa_version(spa)) == 0);
VERIFY(nvlist_add_string(config, ZPOOL_CONFIG_POOL_NAME,
spa_name(spa)) == 0);
VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_POOL_STATE,
spa_state(spa)) == 0);
VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_POOL_TXG,
txg) == 0);
VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_POOL_GUID,
spa_guid(spa)) == 0);
#ifdef _KERNEL
hostid = zone_get_hostid(NULL);
#else /* _KERNEL */
/*
* We're emulating the system's hostid in userland, so we can't use
* zone_get_hostid().
*/
(void) ddi_strtoul(hw_serial, NULL, 10, &hostid);
#endif /* _KERNEL */
if (hostid != 0) {
VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_HOSTID,
hostid) == 0);
}
VERIFY(nvlist_add_string(config, ZPOOL_CONFIG_HOSTNAME,
utsname.nodename) == 0);
if (vd != rvd) {
VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_TOP_GUID,
vd->vdev_top->vdev_guid) == 0);
VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_GUID,
vd->vdev_guid) == 0);
if (vd->vdev_isspare)
VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_IS_SPARE,
1ULL) == 0);
if (vd->vdev_islog)
VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_IS_LOG,
1ULL) == 0);
vd = vd->vdev_top; /* label contains top config */
} else {
/*
* Only add the (potentially large) split information
* in the mos config, and not in the vdev labels
*/
if (spa->spa_config_splitting != NULL)
VERIFY(nvlist_add_nvlist(config, ZPOOL_CONFIG_SPLIT,
spa->spa_config_splitting) == 0);
}
/*
* Add the top-level config. We even add this on pools which
* don't support holes in the namespace.
*/
vdev_top_config_generate(spa, config);
/*
* If we're splitting, record the original pool's guid.
*/
if (spa->spa_config_splitting != NULL &&
nvlist_lookup_uint64(spa->spa_config_splitting,
ZPOOL_CONFIG_SPLIT_GUID, &split_guid) == 0) {
VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_SPLIT_GUID,
split_guid) == 0);
}
nvroot = vdev_config_generate(spa, vd, getstats, 0);
VERIFY(nvlist_add_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, nvroot) == 0);
nvlist_free(nvroot);
if (getstats && spa_load_state(spa) == SPA_LOAD_NONE) {
ddt_histogram_t *ddh;
ddt_stat_t *dds;
ddt_object_t *ddo;
ddh = kmem_zalloc(sizeof (ddt_histogram_t), KM_SLEEP);
ddt_get_dedup_histogram(spa, ddh);
VERIFY(nvlist_add_uint64_array(config,
ZPOOL_CONFIG_DDT_HISTOGRAM,
(uint64_t *)ddh, sizeof (*ddh) / sizeof (uint64_t)) == 0);
kmem_free(ddh, sizeof (ddt_histogram_t));
ddo = kmem_zalloc(sizeof (ddt_object_t), KM_SLEEP);
ddt_get_dedup_object_stats(spa, ddo);
VERIFY(nvlist_add_uint64_array(config,
ZPOOL_CONFIG_DDT_OBJ_STATS,
(uint64_t *)ddo, sizeof (*ddo) / sizeof (uint64_t)) == 0);
kmem_free(ddo, sizeof (ddt_object_t));
dds = kmem_zalloc(sizeof (ddt_stat_t), KM_SLEEP);
ddt_get_dedup_stats(spa, dds);
VERIFY(nvlist_add_uint64_array(config,
ZPOOL_CONFIG_DDT_STATS,
(uint64_t *)dds, sizeof (*dds) / sizeof (uint64_t)) == 0);
kmem_free(dds, sizeof (ddt_stat_t));
}
if (locked)
spa_config_exit(spa, SCL_CONFIG | SCL_STATE, FTAG);
return (config);
}
/*
* Update all disk labels, generate a fresh config based on the current
* in-core state, and sync the global config cache (do not sync the config
* cache if this is a booting rootpool).
*/
void
spa_config_update(spa_t *spa, int what)
{
vdev_t *rvd = spa->spa_root_vdev;
uint64_t txg;
int c;
ASSERT(MUTEX_HELD(&spa_namespace_lock));
spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
txg = spa_last_synced_txg(spa) + 1;
if (what == SPA_CONFIG_UPDATE_POOL) {
vdev_config_dirty(rvd);
} else {
/*
* If we have top-level vdevs that were added but have
* not yet been prepared for allocation, do that now.
* (It's safe now because the config cache is up to date,
* so it will be able to translate the new DVAs.)
* See comments in spa_vdev_add() for full details.
*/
for (c = 0; c < rvd->vdev_children; c++) {
vdev_t *tvd = rvd->vdev_child[c];
if (tvd->vdev_ms_array == 0)
vdev_metaslab_set_size(tvd);
vdev_expand(tvd, txg);
}
}
spa_config_exit(spa, SCL_ALL, FTAG);
/*
* Wait for the mosconfig to be regenerated and synced.
*/
txg_wait_synced(spa->spa_dsl_pool, txg);
/*
* Update the global config cache to reflect the new mosconfig.
*/
if (!spa->spa_is_root)
spa_config_sync(spa, B_FALSE, what != SPA_CONFIG_UPDATE_POOL);
if (what == SPA_CONFIG_UPDATE_POOL)
spa_config_update(spa, SPA_CONFIG_UPDATE_VDEVS);
}

View File

@ -0,0 +1,403 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2006, 2010, Oracle and/or its affiliates. All rights reserved.
*/
/*
* Routines to manage the on-disk persistent error log.
*
* Each pool stores a log of all logical data errors seen during normal
* operation. This is actually the union of two distinct logs: the last log,
* and the current log. All errors seen are logged to the current log. When a
 * scrub completes, the current log becomes the last log, the previous last log
 * is thrown out, and the current log is reinitialized. This way, if an error
 * is somehow corrected, a new scrub will show that it no longer exists, and it
 * will be deleted from the log when the scrub completes.
*
* The log is stored using a ZAP object whose key is a string form of the
 * zbookmark tuple (objset, object, level, blkid), and whose value is an
* optional 'objset:object' human-readable string describing the data. When an
* error is first logged, this string will be empty, indicating that no name is
* known. This prevents us from having to issue a potentially large amount of
* I/O to discover the object name during an error path. Instead, we do the
* calculation when the data is requested, storing the result so future queries
* will be faster.
*
* This log is then shipped into an nvlist where the key is the dataset name and
* the value is the object name. Userland is then responsible for uniquifying
* this list and displaying it to the user.
*/
#include <sys/dmu_tx.h>
#include <sys/spa.h>
#include <sys/spa_impl.h>
#include <sys/zap.h>
#include <sys/zio.h>
/*
* Convert a bookmark to a string.
*/
static void
bookmark_to_name(zbookmark_t *zb, char *buf, size_t len)
{
(void) snprintf(buf, len, "%llx:%llx:%llx:%llx",
(u_longlong_t)zb->zb_objset, (u_longlong_t)zb->zb_object,
(u_longlong_t)zb->zb_level, (u_longlong_t)zb->zb_blkid);
}
/*
* Convert a string to a bookmark
*/
#ifdef _KERNEL
static void
name_to_bookmark(char *buf, zbookmark_t *zb)
{
zb->zb_objset = strtonum(buf, &buf);
ASSERT(*buf == ':');
zb->zb_object = strtonum(buf + 1, &buf);
ASSERT(*buf == ':');
zb->zb_level = (int)strtonum(buf + 1, &buf);
ASSERT(*buf == ':');
zb->zb_blkid = strtonum(buf + 1, &buf);
ASSERT(*buf == '\0');
}
#endif
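/*
 * A quick userland round trip of the bookmark encoding above; strtonum()
 * is a kernel helper, so strtoull() stands in for it here, and the sample
 * tuple values are arbitrary:
 */
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	char buf[64], *p;
	unsigned long long objset, object, level, blkid;

	/* Encode the tuple the same way bookmark_to_name() does. */
	(void) snprintf(buf, sizeof (buf), "%llx:%llx:%llx:%llx",
	    21ULL, 5ULL, 0ULL, 0x1fULL);

	/* Decode it again, mirroring name_to_bookmark(). */
	p = buf;
	objset = strtoull(p, &p, 16);
	object = strtoull(p + 1, &p, 16);	/* p points at ':' */
	level = strtoull(p + 1, &p, 16);
	blkid = strtoull(p + 1, &p, 16);

	(void) printf("%s -> %llu:%llu:%llu:%llu\n",
	    buf, objset, object, level, blkid);
	return (0);
}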
/*
* Log an uncorrectable error to the persistent error log. We add it to the
* spa's list of pending errors. The changes are actually synced out to disk
* during spa_errlog_sync().
*/
void
spa_log_error(spa_t *spa, zio_t *zio)
{
zbookmark_t *zb = &zio->io_logical->io_bookmark;
spa_error_entry_t search;
spa_error_entry_t *new;
avl_tree_t *tree;
avl_index_t where;
/*
* If we are trying to import a pool, ignore any errors, as we won't be
* writing to the pool any time soon.
*/
if (spa_load_state(spa) == SPA_LOAD_TRYIMPORT)
return;
mutex_enter(&spa->spa_errlist_lock);
/*
* If we have had a request to rotate the log, log it to the next list
* instead of the current one.
*/
if (spa->spa_scrub_active || spa->spa_scrub_finished)
tree = &spa->spa_errlist_scrub;
else
tree = &spa->spa_errlist_last;
search.se_bookmark = *zb;
if (avl_find(tree, &search, &where) != NULL) {
mutex_exit(&spa->spa_errlist_lock);
return;
}
new = kmem_zalloc(sizeof (spa_error_entry_t), KM_SLEEP);
new->se_bookmark = *zb;
avl_insert(tree, new, where);
mutex_exit(&spa->spa_errlist_lock);
}
/*
* Return the number of errors currently in the error log. This is actually the
* sum of both the last log and the current log, since we don't know the union
* of these logs until we reach userland.
*/
uint64_t
spa_get_errlog_size(spa_t *spa)
{
uint64_t total = 0, count;
mutex_enter(&spa->spa_errlog_lock);
if (spa->spa_errlog_scrub != 0 &&
zap_count(spa->spa_meta_objset, spa->spa_errlog_scrub,
&count) == 0)
total += count;
if (spa->spa_errlog_last != 0 && !spa->spa_scrub_finished &&
zap_count(spa->spa_meta_objset, spa->spa_errlog_last,
&count) == 0)
total += count;
mutex_exit(&spa->spa_errlog_lock);
mutex_enter(&spa->spa_errlist_lock);
total += avl_numnodes(&spa->spa_errlist_last);
total += avl_numnodes(&spa->spa_errlist_scrub);
mutex_exit(&spa->spa_errlist_lock);
return (total);
}
#ifdef _KERNEL
static int
process_error_log(spa_t *spa, uint64_t obj, void *addr, size_t *count)
{
zap_cursor_t zc;
zap_attribute_t za;
zbookmark_t zb;
if (obj == 0)
return (0);
for (zap_cursor_init(&zc, spa->spa_meta_objset, obj);
zap_cursor_retrieve(&zc, &za) == 0;
zap_cursor_advance(&zc)) {
if (*count == 0) {
zap_cursor_fini(&zc);
return (ENOMEM);
}
name_to_bookmark(za.za_name, &zb);
		if (copyout(&zb, (char *)addr +
		    (*count - 1) * sizeof (zbookmark_t),
		    sizeof (zbookmark_t)) != 0) {
			zap_cursor_fini(&zc);
			return (EFAULT);
		}
*count -= 1;
}
zap_cursor_fini(&zc);
return (0);
}
static int
process_error_list(avl_tree_t *list, void *addr, size_t *count)
{
spa_error_entry_t *se;
for (se = avl_first(list); se != NULL; se = AVL_NEXT(list, se)) {
if (*count == 0)
return (ENOMEM);
if (copyout(&se->se_bookmark, (char *)addr +
(*count - 1) * sizeof (zbookmark_t),
sizeof (zbookmark_t)) != 0)
return (EFAULT);
*count -= 1;
}
return (0);
}
#endif
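/*
 * Both helpers above fill the user buffer from the tail backwards, using
 * *count as a remaining-slots cursor, and return ENOMEM once the caller's
 * buffer is exhausted. A small sketch of that cursor convention, with
 * memcpy() standing in for copyout() and hypothetical int payloads:
 */
#include <stdio.h>
#include <string.h>

static int
fill_backwards(const int *src, int n, int *dst, size_t *count)
{
	int i;

	for (i = 0; i < n; i++) {
		if (*count == 0)
			return (-1);	/* caller's buffer too small */
		(void) memcpy(&dst[*count - 1], &src[i], sizeof (int));
		*count -= 1;
	}
	return (0);
}

int
main(void)
{
	int src[] = { 1, 2, 3 }, dst[5] = { 0 };
	size_t count = 5;

	(void) fill_backwards(src, 3, dst, &count);
	/* dst is now { 0, 0, 3, 2, 1 } and count is 2. */
	(void) printf("%zu slots left\n", count);
	return (0);
}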
/*
* Copy all known errors to userland as an array of bookmarks. This is
* actually a union of the on-disk last log and current log, as well as any
* pending error requests.
*
* Because the act of reading the on-disk log could cause errors to be
* generated, we have two separate locks: one for the error log and one for the
 * in-core error lists. We only need the error list lock to log an error, so
* we grab the error log lock while we read the on-disk logs, and only pick up
* the error list lock when we are finished.
*/
int
spa_get_errlog(spa_t *spa, void *uaddr, size_t *count)
{
int ret = 0;
#ifdef _KERNEL
mutex_enter(&spa->spa_errlog_lock);
ret = process_error_log(spa, spa->spa_errlog_scrub, uaddr, count);
if (!ret && !spa->spa_scrub_finished)
ret = process_error_log(spa, spa->spa_errlog_last, uaddr,
count);
mutex_enter(&spa->spa_errlist_lock);
if (!ret)
ret = process_error_list(&spa->spa_errlist_scrub, uaddr,
count);
if (!ret)
ret = process_error_list(&spa->spa_errlist_last, uaddr,
count);
mutex_exit(&spa->spa_errlist_lock);
mutex_exit(&spa->spa_errlog_lock);
#endif
return (ret);
}
/*
 * Called when a scrub completes. This simply sets a bit that tells us which
 * AVL tree to add new errors to. spa_errlog_sync() is responsible for actually
* syncing the changes to the underlying objects.
*/
void
spa_errlog_rotate(spa_t *spa)
{
mutex_enter(&spa->spa_errlist_lock);
spa->spa_scrub_finished = B_TRUE;
mutex_exit(&spa->spa_errlist_lock);
}
/*
* Discard any pending errors from the spa_t. Called when unloading a faulted
* pool, as the errors encountered during the open cannot be synced to disk.
*/
void
spa_errlog_drain(spa_t *spa)
{
spa_error_entry_t *se;
void *cookie;
mutex_enter(&spa->spa_errlist_lock);
cookie = NULL;
while ((se = avl_destroy_nodes(&spa->spa_errlist_last,
&cookie)) != NULL)
kmem_free(se, sizeof (spa_error_entry_t));
cookie = NULL;
while ((se = avl_destroy_nodes(&spa->spa_errlist_scrub,
&cookie)) != NULL)
kmem_free(se, sizeof (spa_error_entry_t));
mutex_exit(&spa->spa_errlist_lock);
}
/*
* Process a list of errors into the current on-disk log.
*/
static void
sync_error_list(spa_t *spa, avl_tree_t *t, uint64_t *obj, dmu_tx_t *tx)
{
spa_error_entry_t *se;
char buf[64];
void *cookie;
if (avl_numnodes(t) != 0) {
/* create log if necessary */
if (*obj == 0)
*obj = zap_create(spa->spa_meta_objset,
DMU_OT_ERROR_LOG, DMU_OT_NONE,
0, tx);
/* add errors to the current log */
for (se = avl_first(t); se != NULL; se = AVL_NEXT(t, se)) {
char *name = se->se_name ? se->se_name : "";
bookmark_to_name(&se->se_bookmark, buf, sizeof (buf));
(void) zap_update(spa->spa_meta_objset,
*obj, buf, 1, strlen(name) + 1, name, tx);
}
/* purge the error list */
cookie = NULL;
while ((se = avl_destroy_nodes(t, &cookie)) != NULL)
kmem_free(se, sizeof (spa_error_entry_t));
}
}
/*
* Sync the error log out to disk. This is a little tricky because the act of
* writing the error log requires the spa_errlist_lock. So, we need to lock the
* error lists, take a copy of the lists, and then reinitialize them. Then, we
* drop the error list lock and take the error log lock, at which point we
* do the errlog processing. Then, if we encounter an I/O error during this
* process, we can successfully add the error to the list. Note that this will
* result in the perpetual recycling of errors, but it is an unlikely situation
* and not a performance critical operation.
*/
void
spa_errlog_sync(spa_t *spa, uint64_t txg)
{
dmu_tx_t *tx;
avl_tree_t scrub, last;
int scrub_finished;
mutex_enter(&spa->spa_errlist_lock);
/*
* Bail out early under normal circumstances.
*/
if (avl_numnodes(&spa->spa_errlist_scrub) == 0 &&
avl_numnodes(&spa->spa_errlist_last) == 0 &&
!spa->spa_scrub_finished) {
mutex_exit(&spa->spa_errlist_lock);
return;
}
spa_get_errlists(spa, &last, &scrub);
scrub_finished = spa->spa_scrub_finished;
spa->spa_scrub_finished = B_FALSE;
mutex_exit(&spa->spa_errlist_lock);
mutex_enter(&spa->spa_errlog_lock);
tx = dmu_tx_create_assigned(spa->spa_dsl_pool, txg);
/*
* Sync out the current list of errors.
*/
sync_error_list(spa, &last, &spa->spa_errlog_last, tx);
/*
* Rotate the log if necessary.
*/
if (scrub_finished) {
if (spa->spa_errlog_last != 0)
VERIFY(dmu_object_free(spa->spa_meta_objset,
spa->spa_errlog_last, tx) == 0);
spa->spa_errlog_last = spa->spa_errlog_scrub;
spa->spa_errlog_scrub = 0;
sync_error_list(spa, &scrub, &spa->spa_errlog_last, tx);
}
/*
* Sync out any pending scrub errors.
*/
sync_error_list(spa, &scrub, &spa->spa_errlog_scrub, tx);
/*
* Update the MOS to reflect the new values.
*/
(void) zap_update(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_ERRLOG_LAST, sizeof (uint64_t), 1,
&spa->spa_errlog_last, tx);
(void) zap_update(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_ERRLOG_SCRUB, sizeof (uint64_t), 1,
&spa->spa_errlog_scrub, tx);
dmu_tx_commit(tx);
mutex_exit(&spa->spa_errlog_lock);
}

View File

@ -0,0 +1,502 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2006, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#include <sys/spa.h>
#include <sys/spa_impl.h>
#include <sys/zap.h>
#include <sys/dsl_synctask.h>
#include <sys/dmu_tx.h>
#include <sys/dmu_objset.h>
#include <sys/utsname.h>
#include <sys/cmn_err.h>
#include <sys/sunddi.h>
#include "zfs_comutil.h"
#ifdef _KERNEL
#include <sys/zone.h>
#endif
/*
* Routines to manage the on-disk history log.
*
* The history log is stored as a dmu object containing
* <packed record length, record nvlist> tuples.
*
* Where "record nvlist" is a nvlist containing uint64_ts and strings, and
* "packed record length" is the packed length of the "record nvlist" stored
* as a little endian uint64_t.
*
* The log is implemented as a ring buffer, though the original creation
* of the pool ('zpool create') is never overwritten.
*
* The history log is tracked as object 'spa_t::spa_history'. The bonus buffer
* of 'spa_history' stores the offsets for logging/retrieving history as
* 'spa_history_phys_t'. 'sh_pool_create_len' is the ending offset in bytes of
* where the 'zpool create' record is stored. This allows us to never
* overwrite the original creation of the pool. 'sh_phys_max_off' is the
* physical ending offset in bytes of the log. This tells you the length of
* the buffer. 'sh_eof' is the logical EOF (in bytes). Whenever a record
 * is added, 'sh_eof' is incremented by the size of the record.
* 'sh_eof' is never decremented. 'sh_bof' is the logical BOF (in bytes).
* This is where the consumer should start reading from after reading in
* the 'zpool create' portion of the log.
*
* 'sh_records_lost' keeps track of how many records have been overwritten
* and permanently lost.
*/
/* convert a logical offset to physical */
static uint64_t
spa_history_log_to_phys(uint64_t log_off, spa_history_phys_t *shpp)
{
uint64_t phys_len;
phys_len = shpp->sh_phys_max_off - shpp->sh_pool_create_len;
return ((log_off - shpp->sh_pool_create_len) % phys_len
+ shpp->sh_pool_create_len);
}
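/*
 * Worked example: with sh_pool_create_len = 1024 and sh_phys_max_off =
 * 8192, the ring holds 7168 bytes, so logical offset 9000 maps to
 * ((9000 - 1024) % 7168) + 1024 = 1832. The same arithmetic, standalone
 * and over hypothetical offsets:
 */
#include <stdio.h>
#include <stdint.h>

static uint64_t
log_to_phys(uint64_t log_off, uint64_t pool_create_len, uint64_t phys_max_off)
{
	uint64_t phys_len = phys_max_off - pool_create_len;

	return ((log_off - pool_create_len) % phys_len + pool_create_len);
}

int
main(void)
{
	/* Logical offsets grow forever; physical offsets wrap. */
	(void) printf("%llu\n",
	    (unsigned long long)log_to_phys(9000, 1024, 8192));	/* 1832 */
	return (0);
}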
void
spa_history_create_obj(spa_t *spa, dmu_tx_t *tx)
{
dmu_buf_t *dbp;
spa_history_phys_t *shpp;
objset_t *mos = spa->spa_meta_objset;
ASSERT(spa->spa_history == 0);
spa->spa_history = dmu_object_alloc(mos, DMU_OT_SPA_HISTORY,
SPA_MAXBLOCKSIZE, DMU_OT_SPA_HISTORY_OFFSETS,
sizeof (spa_history_phys_t), tx);
VERIFY(zap_add(mos, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_HISTORY, sizeof (uint64_t), 1,
&spa->spa_history, tx) == 0);
VERIFY(0 == dmu_bonus_hold(mos, spa->spa_history, FTAG, &dbp));
ASSERT(dbp->db_size >= sizeof (spa_history_phys_t));
shpp = dbp->db_data;
dmu_buf_will_dirty(dbp, tx);
/*
* Figure out maximum size of history log. We set it at
* 1% of pool size, with a max of 32MB and min of 128KB.
*/
shpp->sh_phys_max_off =
metaslab_class_get_dspace(spa_normal_class(spa)) / 100;
shpp->sh_phys_max_off = MIN(shpp->sh_phys_max_off, 32<<20);
shpp->sh_phys_max_off = MAX(shpp->sh_phys_max_off, 128<<10);
dmu_buf_rele(dbp, FTAG);
}
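/*
 * The sizing rule above clamps in both directions: a 1GB pool gets about
 * 10MB of history, anything much past ~3.3GB is capped at 32MB, and a
 * tiny pool still gets the 128KB floor. A standalone sketch of the clamp:
 */
#include <stdio.h>
#include <stdint.h>

#define	MIN(a, b)	((a) < (b) ? (a) : (b))
#define	MAX(a, b)	((a) > (b) ? (a) : (b))

static uint64_t
history_log_size(uint64_t dspace)
{
	uint64_t sz = dspace / 100;	/* 1% of pool space */

	sz = MIN(sz, (uint64_t)32 << 20);	/* 32MB ceiling */
	return (MAX(sz, (uint64_t)128 << 10));	/* 128KB floor */
}

int
main(void)
{
	(void) printf("%llu\n", (unsigned long long)
	    history_log_size(1ULL << 30));	/* ~10MB for a 1GB pool */
	(void) printf("%llu\n", (unsigned long long)
	    history_log_size(10ULL << 40));	/* capped at 33554432 */
	return (0);
}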
/*
* Change 'sh_bof' to the beginning of the next record.
*/
static int
spa_history_advance_bof(spa_t *spa, spa_history_phys_t *shpp)
{
objset_t *mos = spa->spa_meta_objset;
uint64_t firstread, reclen, phys_bof;
char buf[sizeof (reclen)];
int err;
phys_bof = spa_history_log_to_phys(shpp->sh_bof, shpp);
firstread = MIN(sizeof (reclen), shpp->sh_phys_max_off - phys_bof);
if ((err = dmu_read(mos, spa->spa_history, phys_bof, firstread,
buf, DMU_READ_PREFETCH)) != 0)
return (err);
if (firstread != sizeof (reclen)) {
if ((err = dmu_read(mos, spa->spa_history,
shpp->sh_pool_create_len, sizeof (reclen) - firstread,
buf + firstread, DMU_READ_PREFETCH)) != 0)
return (err);
}
reclen = LE_64(*((uint64_t *)buf));
shpp->sh_bof += reclen + sizeof (reclen);
shpp->sh_records_lost++;
return (0);
}
static int
spa_history_write(spa_t *spa, void *buf, uint64_t len, spa_history_phys_t *shpp,
dmu_tx_t *tx)
{
uint64_t firstwrite, phys_eof;
objset_t *mos = spa->spa_meta_objset;
int err;
ASSERT(MUTEX_HELD(&spa->spa_history_lock));
/* see if we need to reset logical BOF */
while (shpp->sh_phys_max_off - shpp->sh_pool_create_len -
(shpp->sh_eof - shpp->sh_bof) <= len) {
if ((err = spa_history_advance_bof(spa, shpp)) != 0) {
return (err);
}
}
phys_eof = spa_history_log_to_phys(shpp->sh_eof, shpp);
firstwrite = MIN(len, shpp->sh_phys_max_off - phys_eof);
shpp->sh_eof += len;
dmu_write(mos, spa->spa_history, phys_eof, firstwrite, buf, tx);
len -= firstwrite;
if (len > 0) {
/* write out the rest at the beginning of physical file */
dmu_write(mos, spa->spa_history, shpp->sh_pool_create_len,
len, (char *)buf + firstwrite, tx);
}
return (0);
}
static char *
spa_history_zone(void)
{
#ifdef _KERNEL
return (curproc->p_zone->zone_name);
#else
return ("global");
#endif
}
/*
* Write out a history event.
*/
/*ARGSUSED*/
static void
spa_history_log_sync(void *arg1, void *arg2, dmu_tx_t *tx)
{
spa_t *spa = arg1;
history_arg_t *hap = arg2;
const char *history_str = hap->ha_history_str;
objset_t *mos = spa->spa_meta_objset;
dmu_buf_t *dbp;
spa_history_phys_t *shpp;
size_t reclen;
uint64_t le_len;
nvlist_t *nvrecord;
char *record_packed = NULL;
int ret;
/*
* If we have an older pool that doesn't have a command
* history object, create it now.
*/
mutex_enter(&spa->spa_history_lock);
if (!spa->spa_history)
spa_history_create_obj(spa, tx);
mutex_exit(&spa->spa_history_lock);
/*
* Get the offset of where we need to write via the bonus buffer.
* Update the offset when the write completes.
*/
VERIFY(0 == dmu_bonus_hold(mos, spa->spa_history, FTAG, &dbp));
shpp = dbp->db_data;
dmu_buf_will_dirty(dbp, tx);
#ifdef ZFS_DEBUG
{
dmu_object_info_t doi;
dmu_object_info_from_db(dbp, &doi);
ASSERT3U(doi.doi_bonus_type, ==, DMU_OT_SPA_HISTORY_OFFSETS);
}
#endif
VERIFY(nvlist_alloc(&nvrecord, NV_UNIQUE_NAME, KM_SLEEP) == 0);
VERIFY(nvlist_add_uint64(nvrecord, ZPOOL_HIST_TIME,
gethrestime_sec()) == 0);
VERIFY(nvlist_add_uint64(nvrecord, ZPOOL_HIST_WHO, hap->ha_uid) == 0);
if (hap->ha_zone != NULL)
VERIFY(nvlist_add_string(nvrecord, ZPOOL_HIST_ZONE,
hap->ha_zone) == 0);
#ifdef _KERNEL
VERIFY(nvlist_add_string(nvrecord, ZPOOL_HIST_HOST,
utsname.nodename) == 0);
#endif
if (hap->ha_log_type == LOG_CMD_POOL_CREATE ||
hap->ha_log_type == LOG_CMD_NORMAL) {
VERIFY(nvlist_add_string(nvrecord, ZPOOL_HIST_CMD,
history_str) == 0);
zfs_dbgmsg("command: %s", history_str);
} else {
VERIFY(nvlist_add_uint64(nvrecord, ZPOOL_HIST_INT_EVENT,
hap->ha_event) == 0);
VERIFY(nvlist_add_uint64(nvrecord, ZPOOL_HIST_TXG,
tx->tx_txg) == 0);
VERIFY(nvlist_add_string(nvrecord, ZPOOL_HIST_INT_STR,
history_str) == 0);
zfs_dbgmsg("internal %s pool:%s txg:%llu %s",
zfs_history_event_names[hap->ha_event], spa_name(spa),
(longlong_t)tx->tx_txg, history_str);
}
VERIFY(nvlist_size(nvrecord, &reclen, NV_ENCODE_XDR) == 0);
record_packed = kmem_alloc(reclen, KM_SLEEP);
VERIFY(nvlist_pack(nvrecord, &record_packed, &reclen,
NV_ENCODE_XDR, KM_SLEEP) == 0);
mutex_enter(&spa->spa_history_lock);
if (hap->ha_log_type == LOG_CMD_POOL_CREATE)
VERIFY(shpp->sh_eof == shpp->sh_pool_create_len);
/* write out the packed length as little endian */
le_len = LE_64((uint64_t)reclen);
ret = spa_history_write(spa, &le_len, sizeof (le_len), shpp, tx);
if (!ret)
ret = spa_history_write(spa, record_packed, reclen, shpp, tx);
if (!ret && hap->ha_log_type == LOG_CMD_POOL_CREATE) {
shpp->sh_pool_create_len += sizeof (le_len) + reclen;
shpp->sh_bof = shpp->sh_pool_create_len;
}
mutex_exit(&spa->spa_history_lock);
nvlist_free(nvrecord);
kmem_free(record_packed, reclen);
dmu_buf_rele(dbp, FTAG);
strfree(hap->ha_history_str);
if (hap->ha_zone != NULL)
strfree(hap->ha_zone);
kmem_free(hap, sizeof (history_arg_t));
}
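/*
 * On disk each history record is framed as a little-endian uint64_t
 * length followed by the XDR-packed nvlist. A sketch of a consumer
 * walking such tuples over an already-read, non-wrapping byte range
 * (decode_le64() and the one-record sample buffer are illustrative):
 */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

static uint64_t
decode_le64(const unsigned char *p)
{
	uint64_t v = 0;
	int i;

	for (i = 7; i >= 0; i--)	/* least significant byte first */
		v = (v << 8) | p[i];
	return (v);
}

static void
walk_records(const unsigned char *buf, size_t len)
{
	size_t off = 0;

	while (off + sizeof (uint64_t) <= len) {
		uint64_t reclen = decode_le64(buf + off);

		off += sizeof (uint64_t);
		if (reclen == 0 || off + reclen > len)
			break;	/* empty or truncated record */
		(void) printf("record of %llu packed bytes\n",
		    (unsigned long long)reclen);
		/* nvlist_unpack() would decode buf + off here */
		off += reclen;
	}
}

int
main(void)
{
	unsigned char buf[16] = { 0 };

	buf[0] = 8;	/* one 8-byte record, length in little endian */
	walk_records(buf, sizeof (buf));
	return (0);
}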
/*
* Write out a history event.
*/
int
spa_history_log(spa_t *spa, const char *history_str, history_log_type_t what)
{
history_arg_t *ha;
int err = 0;
dmu_tx_t *tx;
ASSERT(what != LOG_INTERNAL);
tx = dmu_tx_create_dd(spa_get_dsl(spa)->dp_mos_dir);
err = dmu_tx_assign(tx, TXG_WAIT);
if (err) {
dmu_tx_abort(tx);
return (err);
}
ha = kmem_alloc(sizeof (history_arg_t), KM_SLEEP);
ha->ha_history_str = strdup(history_str);
ha->ha_zone = strdup(spa_history_zone());
ha->ha_log_type = what;
ha->ha_uid = crgetuid(CRED());
/* Kick this off asynchronously; errors are ignored. */
dsl_sync_task_do_nowait(spa_get_dsl(spa), NULL,
spa_history_log_sync, spa, ha, 0, tx);
dmu_tx_commit(tx);
/* spa_history_log_sync will free ha and strings */
return (err);
}
/*
* Read out the command history.
*/
int
spa_history_get(spa_t *spa, uint64_t *offp, uint64_t *len, char *buf)
{
objset_t *mos = spa->spa_meta_objset;
dmu_buf_t *dbp;
uint64_t read_len, phys_read_off, phys_eof;
uint64_t leftover = 0;
spa_history_phys_t *shpp;
int err;
/*
* If the command history doesn't exist (older pool),
* that's ok, just return ENOENT.
*/
if (!spa->spa_history)
return (ENOENT);
/*
	 * The history is logged asynchronously, so when the consumer requests
* the first chunk of history, make sure everything has been
* synced to disk so that we get it.
*/
if (*offp == 0 && spa_writeable(spa))
txg_wait_synced(spa_get_dsl(spa), 0);
if ((err = dmu_bonus_hold(mos, spa->spa_history, FTAG, &dbp)) != 0)
return (err);
shpp = dbp->db_data;
#ifdef ZFS_DEBUG
{
dmu_object_info_t doi;
dmu_object_info_from_db(dbp, &doi);
ASSERT3U(doi.doi_bonus_type, ==, DMU_OT_SPA_HISTORY_OFFSETS);
}
#endif
mutex_enter(&spa->spa_history_lock);
phys_eof = spa_history_log_to_phys(shpp->sh_eof, shpp);
if (*offp < shpp->sh_pool_create_len) {
/* read in just the zpool create history */
phys_read_off = *offp;
read_len = MIN(*len, shpp->sh_pool_create_len -
phys_read_off);
} else {
/*
		 * Need to reset the passed-in offset to the BOF if it has
		 * since been overwritten.
*/
*offp = MAX(*offp, shpp->sh_bof);
phys_read_off = spa_history_log_to_phys(*offp, shpp);
/*
* Read up to the minimum of what the user passed down or
* the EOF (physical or logical). If we hit physical EOF,
* use 'leftover' to read from the physical BOF.
*/
if (phys_read_off <= phys_eof) {
read_len = MIN(*len, phys_eof - phys_read_off);
} else {
read_len = MIN(*len,
shpp->sh_phys_max_off - phys_read_off);
if (phys_read_off + *len > shpp->sh_phys_max_off) {
leftover = MIN(*len - read_len,
phys_eof - shpp->sh_pool_create_len);
}
}
}
/* offset for consumer to use next */
*offp += read_len + leftover;
/* tell the consumer how much you actually read */
*len = read_len + leftover;
if (read_len == 0) {
mutex_exit(&spa->spa_history_lock);
dmu_buf_rele(dbp, FTAG);
return (0);
}
err = dmu_read(mos, spa->spa_history, phys_read_off, read_len, buf,
DMU_READ_PREFETCH);
if (leftover && err == 0) {
err = dmu_read(mos, spa->spa_history, shpp->sh_pool_create_len,
leftover, buf + read_len, DMU_READ_PREFETCH);
}
mutex_exit(&spa->spa_history_lock);
dmu_buf_rele(dbp, FTAG);
return (err);
}
static void
log_internal(history_internal_events_t event, spa_t *spa,
dmu_tx_t *tx, const char *fmt, va_list adx)
{
history_arg_t *ha;
/*
* If this is part of creating a pool, not everything is
* initialized yet, so don't bother logging the internal events.
*/
if (tx->tx_txg == TXG_INITIAL)
return;
ha = kmem_alloc(sizeof (history_arg_t), KM_SLEEP);
ha->ha_history_str = kmem_alloc(vsnprintf(NULL, 0, fmt, adx) + 1,
KM_SLEEP);
(void) vsprintf(ha->ha_history_str, fmt, adx);
ha->ha_log_type = LOG_INTERNAL;
ha->ha_event = event;
ha->ha_zone = NULL;
ha->ha_uid = 0;
if (dmu_tx_is_syncing(tx)) {
spa_history_log_sync(spa, ha, tx);
} else {
dsl_sync_task_do_nowait(spa_get_dsl(spa), NULL,
spa_history_log_sync, spa, ha, 0, tx);
}
/* spa_history_log_sync() will free ha and strings */
}
void
spa_history_log_internal(history_internal_events_t event, spa_t *spa,
dmu_tx_t *tx, const char *fmt, ...)
{
dmu_tx_t *htx = tx;
va_list adx;
/* create a tx if we didn't get one */
if (tx == NULL) {
htx = dmu_tx_create_dd(spa_get_dsl(spa)->dp_mos_dir);
if (dmu_tx_assign(htx, TXG_WAIT) != 0) {
dmu_tx_abort(htx);
return;
}
}
va_start(adx, fmt);
log_internal(event, spa, htx, fmt, adx);
va_end(adx);
/* if we didn't get a tx from the caller, commit the one we made */
if (tx == NULL)
dmu_tx_commit(htx);
}
void
spa_history_log_version(spa_t *spa, history_internal_events_t event)
{
#ifdef _KERNEL
uint64_t current_vers = spa_version(spa);
if (current_vers >= SPA_VERSION_ZPOOL_HISTORY) {
spa_history_log_internal(event, spa, NULL,
"pool spa %llu; zfs spa %llu; zpl %d; uts %s %s %s %s",
(u_longlong_t)current_vers, SPA_VERSION, ZPL_VERSION,
utsname.nodename, utsname.release, utsname.version,
utsname.machine);
}
cmn_err(CE_CONT, "!%s version %llu pool %s using %llu",
event == LOG_POOL_IMPORT ? "imported" :
event == LOG_POOL_CREATE ? "created" : "accessed",
(u_longlong_t)current_vers, spa_name(spa), SPA_VERSION);
#endif
}

1672
uts/common/fs/zfs/spa_misc.c Normal file

File diff suppressed because it is too large

View File

@ -0,0 +1,616 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2009 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#include <sys/zfs_context.h>
#include <sys/spa.h>
#include <sys/dmu.h>
#include <sys/zio.h>
#include <sys/space_map.h>
/*
* Space map routines.
* NOTE: caller is responsible for all locking.
*/
static int
space_map_seg_compare(const void *x1, const void *x2)
{
const space_seg_t *s1 = x1;
const space_seg_t *s2 = x2;
if (s1->ss_start < s2->ss_start) {
if (s1->ss_end > s2->ss_start)
return (0);
return (-1);
}
if (s1->ss_start > s2->ss_start) {
if (s1->ss_start < s2->ss_end)
return (0);
return (1);
}
return (0);
}
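/*
 * Note that the comparator treats any two overlapping segments as equal,
 * so avl_find() returns a segment exactly when it overlaps the search
 * range. A quick standalone check of that property with half-open
 * [start, end) segments (values are arbitrary):
 */
#include <stdio.h>
#include <stdint.h>

typedef struct {
	uint64_t ss_start;
	uint64_t ss_end;
} seg_t;

static int
seg_compare(const seg_t *s1, const seg_t *s2)
{
	if (s1->ss_start < s2->ss_start)
		return (s1->ss_end > s2->ss_start ? 0 : -1);
	if (s1->ss_start > s2->ss_start)
		return (s1->ss_start < s2->ss_end ? 0 : 1);
	return (0);
}

int
main(void)
{
	seg_t a = { 10, 20 }, b = { 15, 25 }, c = { 20, 30 };

	(void) printf("%d %d\n",
	    seg_compare(&a, &b),	/* 0: the segments overlap */
	    seg_compare(&a, &c));	/* -1: adjacent, not overlapping */
	return (0);
}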
void
space_map_create(space_map_t *sm, uint64_t start, uint64_t size, uint8_t shift,
kmutex_t *lp)
{
bzero(sm, sizeof (*sm));
cv_init(&sm->sm_load_cv, NULL, CV_DEFAULT, NULL);
avl_create(&sm->sm_root, space_map_seg_compare,
sizeof (space_seg_t), offsetof(struct space_seg, ss_node));
sm->sm_start = start;
sm->sm_size = size;
sm->sm_shift = shift;
sm->sm_lock = lp;
}
void
space_map_destroy(space_map_t *sm)
{
ASSERT(!sm->sm_loaded && !sm->sm_loading);
VERIFY3U(sm->sm_space, ==, 0);
avl_destroy(&sm->sm_root);
cv_destroy(&sm->sm_load_cv);
}
void
space_map_add(space_map_t *sm, uint64_t start, uint64_t size)
{
avl_index_t where;
space_seg_t ssearch, *ss_before, *ss_after, *ss;
uint64_t end = start + size;
int merge_before, merge_after;
ASSERT(MUTEX_HELD(sm->sm_lock));
VERIFY(size != 0);
VERIFY3U(start, >=, sm->sm_start);
VERIFY3U(end, <=, sm->sm_start + sm->sm_size);
VERIFY(sm->sm_space + size <= sm->sm_size);
VERIFY(P2PHASE(start, 1ULL << sm->sm_shift) == 0);
VERIFY(P2PHASE(size, 1ULL << sm->sm_shift) == 0);
ssearch.ss_start = start;
ssearch.ss_end = end;
ss = avl_find(&sm->sm_root, &ssearch, &where);
if (ss != NULL && ss->ss_start <= start && ss->ss_end >= end) {
zfs_panic_recover("zfs: allocating allocated segment"
"(offset=%llu size=%llu)\n",
(longlong_t)start, (longlong_t)size);
return;
}
/* Make sure we don't overlap with either of our neighbors */
VERIFY(ss == NULL);
ss_before = avl_nearest(&sm->sm_root, where, AVL_BEFORE);
ss_after = avl_nearest(&sm->sm_root, where, AVL_AFTER);
merge_before = (ss_before != NULL && ss_before->ss_end == start);
merge_after = (ss_after != NULL && ss_after->ss_start == end);
if (merge_before && merge_after) {
avl_remove(&sm->sm_root, ss_before);
if (sm->sm_pp_root) {
avl_remove(sm->sm_pp_root, ss_before);
avl_remove(sm->sm_pp_root, ss_after);
}
ss_after->ss_start = ss_before->ss_start;
kmem_free(ss_before, sizeof (*ss_before));
ss = ss_after;
} else if (merge_before) {
ss_before->ss_end = end;
if (sm->sm_pp_root)
avl_remove(sm->sm_pp_root, ss_before);
ss = ss_before;
} else if (merge_after) {
ss_after->ss_start = start;
if (sm->sm_pp_root)
avl_remove(sm->sm_pp_root, ss_after);
ss = ss_after;
} else {
ss = kmem_alloc(sizeof (*ss), KM_SLEEP);
ss->ss_start = start;
ss->ss_end = end;
avl_insert(&sm->sm_root, ss, where);
}
if (sm->sm_pp_root)
avl_add(sm->sm_pp_root, ss);
sm->sm_space += size;
}
void
space_map_remove(space_map_t *sm, uint64_t start, uint64_t size)
{
avl_index_t where;
space_seg_t ssearch, *ss, *newseg;
uint64_t end = start + size;
int left_over, right_over;
ASSERT(MUTEX_HELD(sm->sm_lock));
VERIFY(size != 0);
VERIFY(P2PHASE(start, 1ULL << sm->sm_shift) == 0);
VERIFY(P2PHASE(size, 1ULL << sm->sm_shift) == 0);
ssearch.ss_start = start;
ssearch.ss_end = end;
ss = avl_find(&sm->sm_root, &ssearch, &where);
/* Make sure we completely overlap with someone */
if (ss == NULL) {
zfs_panic_recover("zfs: freeing free segment "
"(offset=%llu size=%llu)",
(longlong_t)start, (longlong_t)size);
return;
}
VERIFY3U(ss->ss_start, <=, start);
VERIFY3U(ss->ss_end, >=, end);
VERIFY(sm->sm_space - size <= sm->sm_size);
left_over = (ss->ss_start != start);
right_over = (ss->ss_end != end);
if (sm->sm_pp_root)
avl_remove(sm->sm_pp_root, ss);
if (left_over && right_over) {
newseg = kmem_alloc(sizeof (*newseg), KM_SLEEP);
newseg->ss_start = end;
newseg->ss_end = ss->ss_end;
ss->ss_end = start;
avl_insert_here(&sm->sm_root, newseg, ss, AVL_AFTER);
if (sm->sm_pp_root)
avl_add(sm->sm_pp_root, newseg);
} else if (left_over) {
ss->ss_end = start;
} else if (right_over) {
ss->ss_start = end;
} else {
avl_remove(&sm->sm_root, ss);
kmem_free(ss, sizeof (*ss));
ss = NULL;
}
if (sm->sm_pp_root && ss != NULL)
avl_add(sm->sm_pp_root, ss);
sm->sm_space -= size;
}
boolean_t
space_map_contains(space_map_t *sm, uint64_t start, uint64_t size)
{
avl_index_t where;
space_seg_t ssearch, *ss;
uint64_t end = start + size;
ASSERT(MUTEX_HELD(sm->sm_lock));
VERIFY(size != 0);
VERIFY(P2PHASE(start, 1ULL << sm->sm_shift) == 0);
VERIFY(P2PHASE(size, 1ULL << sm->sm_shift) == 0);
ssearch.ss_start = start;
ssearch.ss_end = end;
ss = avl_find(&sm->sm_root, &ssearch, &where);
return (ss != NULL && ss->ss_start <= start && ss->ss_end >= end);
}
void
space_map_vacate(space_map_t *sm, space_map_func_t *func, space_map_t *mdest)
{
space_seg_t *ss;
void *cookie = NULL;
ASSERT(MUTEX_HELD(sm->sm_lock));
while ((ss = avl_destroy_nodes(&sm->sm_root, &cookie)) != NULL) {
if (func != NULL)
func(mdest, ss->ss_start, ss->ss_end - ss->ss_start);
kmem_free(ss, sizeof (*ss));
}
sm->sm_space = 0;
}
void
space_map_walk(space_map_t *sm, space_map_func_t *func, space_map_t *mdest)
{
space_seg_t *ss;
ASSERT(MUTEX_HELD(sm->sm_lock));
for (ss = avl_first(&sm->sm_root); ss; ss = AVL_NEXT(&sm->sm_root, ss))
func(mdest, ss->ss_start, ss->ss_end - ss->ss_start);
}
/*
* Wait for any in-progress space_map_load() to complete.
*/
void
space_map_load_wait(space_map_t *sm)
{
ASSERT(MUTEX_HELD(sm->sm_lock));
while (sm->sm_loading) {
ASSERT(!sm->sm_loaded);
cv_wait(&sm->sm_load_cv, sm->sm_lock);
}
}
/*
* Note: space_map_load() will drop sm_lock across dmu_read() calls.
* The caller must be OK with this.
*/
int
space_map_load(space_map_t *sm, space_map_ops_t *ops, uint8_t maptype,
space_map_obj_t *smo, objset_t *os)
{
uint64_t *entry, *entry_map, *entry_map_end;
uint64_t bufsize, size, offset, end, space;
uint64_t mapstart = sm->sm_start;
int error = 0;
ASSERT(MUTEX_HELD(sm->sm_lock));
ASSERT(!sm->sm_loaded);
ASSERT(!sm->sm_loading);
sm->sm_loading = B_TRUE;
end = smo->smo_objsize;
space = smo->smo_alloc;
ASSERT(sm->sm_ops == NULL);
VERIFY3U(sm->sm_space, ==, 0);
if (maptype == SM_FREE) {
space_map_add(sm, sm->sm_start, sm->sm_size);
space = sm->sm_size - space;
}
bufsize = 1ULL << SPACE_MAP_BLOCKSHIFT;
entry_map = zio_buf_alloc(bufsize);
mutex_exit(sm->sm_lock);
if (end > bufsize)
dmu_prefetch(os, smo->smo_object, bufsize, end - bufsize);
mutex_enter(sm->sm_lock);
for (offset = 0; offset < end; offset += bufsize) {
size = MIN(end - offset, bufsize);
VERIFY(P2PHASE(size, sizeof (uint64_t)) == 0);
VERIFY(size != 0);
dprintf("object=%llu offset=%llx size=%llx\n",
smo->smo_object, offset, size);
mutex_exit(sm->sm_lock);
error = dmu_read(os, smo->smo_object, offset, size, entry_map,
DMU_READ_PREFETCH);
mutex_enter(sm->sm_lock);
if (error != 0)
break;
entry_map_end = entry_map + (size / sizeof (uint64_t));
for (entry = entry_map; entry < entry_map_end; entry++) {
uint64_t e = *entry;
if (SM_DEBUG_DECODE(e)) /* Skip debug entries */
continue;
(SM_TYPE_DECODE(e) == maptype ?
space_map_add : space_map_remove)(sm,
(SM_OFFSET_DECODE(e) << sm->sm_shift) + mapstart,
SM_RUN_DECODE(e) << sm->sm_shift);
}
}
if (error == 0) {
VERIFY3U(sm->sm_space, ==, space);
sm->sm_loaded = B_TRUE;
sm->sm_ops = ops;
if (ops != NULL)
ops->smop_load(sm);
} else {
space_map_vacate(sm, NULL, NULL);
}
zio_buf_free(entry_map, bufsize);
sm->sm_loading = B_FALSE;
cv_broadcast(&sm->sm_load_cv);
return (error);
}
void
space_map_unload(space_map_t *sm)
{
ASSERT(MUTEX_HELD(sm->sm_lock));
if (sm->sm_loaded && sm->sm_ops != NULL)
sm->sm_ops->smop_unload(sm);
sm->sm_loaded = B_FALSE;
sm->sm_ops = NULL;
space_map_vacate(sm, NULL, NULL);
}
uint64_t
space_map_maxsize(space_map_t *sm)
{
ASSERT(sm->sm_ops != NULL);
return (sm->sm_ops->smop_max(sm));
}
uint64_t
space_map_alloc(space_map_t *sm, uint64_t size)
{
uint64_t start;
start = sm->sm_ops->smop_alloc(sm, size);
if (start != -1ULL)
space_map_remove(sm, start, size);
return (start);
}
void
space_map_claim(space_map_t *sm, uint64_t start, uint64_t size)
{
sm->sm_ops->smop_claim(sm, start, size);
space_map_remove(sm, start, size);
}
void
space_map_free(space_map_t *sm, uint64_t start, uint64_t size)
{
space_map_add(sm, start, size);
sm->sm_ops->smop_free(sm, start, size);
}
/*
* Note: space_map_sync() will drop sm_lock across dmu_write() calls.
*/
void
space_map_sync(space_map_t *sm, uint8_t maptype,
space_map_obj_t *smo, objset_t *os, dmu_tx_t *tx)
{
spa_t *spa = dmu_objset_spa(os);
void *cookie = NULL;
space_seg_t *ss;
uint64_t bufsize, start, size, run_len;
uint64_t *entry, *entry_map, *entry_map_end;
ASSERT(MUTEX_HELD(sm->sm_lock));
if (sm->sm_space == 0)
return;
dprintf("object %4llu, txg %llu, pass %d, %c, count %lu, space %llx\n",
smo->smo_object, dmu_tx_get_txg(tx), spa_sync_pass(spa),
maptype == SM_ALLOC ? 'A' : 'F', avl_numnodes(&sm->sm_root),
sm->sm_space);
if (maptype == SM_ALLOC)
smo->smo_alloc += sm->sm_space;
else
smo->smo_alloc -= sm->sm_space;
bufsize = (8 + avl_numnodes(&sm->sm_root)) * sizeof (uint64_t);
bufsize = MIN(bufsize, 1ULL << SPACE_MAP_BLOCKSHIFT);
entry_map = zio_buf_alloc(bufsize);
entry_map_end = entry_map + (bufsize / sizeof (uint64_t));
entry = entry_map;
*entry++ = SM_DEBUG_ENCODE(1) |
SM_DEBUG_ACTION_ENCODE(maptype) |
SM_DEBUG_SYNCPASS_ENCODE(spa_sync_pass(spa)) |
SM_DEBUG_TXG_ENCODE(dmu_tx_get_txg(tx));
while ((ss = avl_destroy_nodes(&sm->sm_root, &cookie)) != NULL) {
size = ss->ss_end - ss->ss_start;
start = (ss->ss_start - sm->sm_start) >> sm->sm_shift;
sm->sm_space -= size;
size >>= sm->sm_shift;
while (size) {
run_len = MIN(size, SM_RUN_MAX);
if (entry == entry_map_end) {
mutex_exit(sm->sm_lock);
dmu_write(os, smo->smo_object, smo->smo_objsize,
bufsize, entry_map, tx);
mutex_enter(sm->sm_lock);
smo->smo_objsize += bufsize;
entry = entry_map;
}
*entry++ = SM_OFFSET_ENCODE(start) |
SM_TYPE_ENCODE(maptype) |
SM_RUN_ENCODE(run_len);
start += run_len;
size -= run_len;
}
kmem_free(ss, sizeof (*ss));
}
if (entry != entry_map) {
size = (entry - entry_map) * sizeof (uint64_t);
mutex_exit(sm->sm_lock);
dmu_write(os, smo->smo_object, smo->smo_objsize,
size, entry_map, tx);
mutex_enter(sm->sm_lock);
smo->smo_objsize += size;
}
zio_buf_free(entry_map, bufsize);
VERIFY3U(sm->sm_space, ==, 0);
}
void
space_map_truncate(space_map_obj_t *smo, objset_t *os, dmu_tx_t *tx)
{
VERIFY(dmu_free_range(os, smo->smo_object, 0, -1ULL, tx) == 0);
smo->smo_objsize = 0;
smo->smo_alloc = 0;
}
/*
* Space map reference trees.
*
* A space map is a collection of integers. Every integer is either
* in the map, or it's not. A space map reference tree generalizes
* the idea: it allows its members to have arbitrary reference counts,
* as opposed to the implicit reference count of 0 or 1 in a space map.
* This representation comes in handy when computing the union or
* intersection of multiple space maps. For example, the union of
* N space maps is the subset of the reference tree with refcnt >= 1.
* The intersection of N space maps is the subset with refcnt >= N.
*
* [It's very much like a Fourier transform. Unions and intersections
* are hard to perform in the 'space map domain', so we convert the maps
* into the 'reference count domain', where it's trivial, then invert.]
*
* vdev_dtl_reassess() uses computations of this form to determine
* DTL_MISSING and DTL_OUTAGE for interior vdevs -- e.g. a RAID-Z vdev
* has an outage wherever refcnt >= vdev_nparity + 1, and a mirror vdev
* has an outage wherever refcnt >= vdev_children.
*/
static int
space_map_ref_compare(const void *x1, const void *x2)
{
const space_ref_t *sr1 = x1;
const space_ref_t *sr2 = x2;
if (sr1->sr_offset < sr2->sr_offset)
return (-1);
if (sr1->sr_offset > sr2->sr_offset)
return (1);
if (sr1 < sr2)
return (-1);
if (sr1 > sr2)
return (1);
return (0);
}
void
space_map_ref_create(avl_tree_t *t)
{
avl_create(t, space_map_ref_compare,
sizeof (space_ref_t), offsetof(space_ref_t, sr_node));
}
void
space_map_ref_destroy(avl_tree_t *t)
{
space_ref_t *sr;
void *cookie = NULL;
while ((sr = avl_destroy_nodes(t, &cookie)) != NULL)
kmem_free(sr, sizeof (*sr));
avl_destroy(t);
}
static void
space_map_ref_add_node(avl_tree_t *t, uint64_t offset, int64_t refcnt)
{
space_ref_t *sr;
sr = kmem_alloc(sizeof (*sr), KM_SLEEP);
sr->sr_offset = offset;
sr->sr_refcnt = refcnt;
avl_add(t, sr);
}
void
space_map_ref_add_seg(avl_tree_t *t, uint64_t start, uint64_t end,
int64_t refcnt)
{
space_map_ref_add_node(t, start, refcnt);
space_map_ref_add_node(t, end, -refcnt);
}
/*
* Convert (or add) a space map into a reference tree.
*/
void
space_map_ref_add_map(avl_tree_t *t, space_map_t *sm, int64_t refcnt)
{
space_seg_t *ss;
ASSERT(MUTEX_HELD(sm->sm_lock));
for (ss = avl_first(&sm->sm_root); ss; ss = AVL_NEXT(&sm->sm_root, ss))
space_map_ref_add_seg(t, ss->ss_start, ss->ss_end, refcnt);
}
/*
* Convert a reference tree into a space map. The space map will contain
* all members of the reference tree for which refcnt >= minref.
*/
void
space_map_ref_generate_map(avl_tree_t *t, space_map_t *sm, int64_t minref)
{
uint64_t start = -1ULL;
int64_t refcnt = 0;
space_ref_t *sr;
ASSERT(MUTEX_HELD(sm->sm_lock));
space_map_vacate(sm, NULL, NULL);
for (sr = avl_first(t); sr != NULL; sr = AVL_NEXT(t, sr)) {
refcnt += sr->sr_refcnt;
if (refcnt >= minref) {
if (start == -1ULL) {
start = sr->sr_offset;
}
} else {
if (start != -1ULL) {
uint64_t end = sr->sr_offset;
ASSERT(start <= end);
if (end > start)
space_map_add(sm, start, end - start);
start = -1ULL;
}
}
}
ASSERT(refcnt == 0);
ASSERT(start == -1ULL);
}
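/*
 * To make the refcount-domain trick concrete: adding segments [10, 20)
 * and [15, 25), each with refcnt 1, leaves boundary deltas at 10 (+1),
 * 15 (+1), 20 (-1) and 25 (-1); a running sum reaches 2 exactly on
 * [15, 20), which is the intersection. A tiny sketch with a pre-sorted
 * array standing in for the AVL tree:
 */
#include <stdio.h>
#include <stdint.h>

typedef struct {
	uint64_t	offset;
	int64_t		delta;	/* +refcnt at segment start, - at end */
} ref_t;

int
main(void)
{
	ref_t refs[] = { { 10, 1 }, { 15, 1 }, { 20, -1 }, { 25, -1 } };
	int64_t refcnt = 0, minref = 2;	/* minref = 2: intersection */
	uint64_t start = UINT64_MAX;
	int i;

	for (i = 0; i < 4; i++) {
		refcnt += refs[i].delta;
		if (refcnt >= minref && start == UINT64_MAX) {
			start = refs[i].offset;
		} else if (refcnt < minref && start != UINT64_MAX) {
			(void) printf("[%llu, %llu)\n",
			    (unsigned long long)start,
			    (unsigned long long)refs[i].offset);
			start = UINT64_MAX;
		}
	}
	return (0);	/* prints [15, 20) */
}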

142
uts/common/fs/zfs/sys/arc.h Normal file
View File

@ -0,0 +1,142 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#ifndef _SYS_ARC_H
#define _SYS_ARC_H
#include <sys/zfs_context.h>
#ifdef __cplusplus
extern "C" {
#endif
#include <sys/zio.h>
#include <sys/dmu.h>
#include <sys/spa.h>
typedef struct arc_buf_hdr arc_buf_hdr_t;
typedef struct arc_buf arc_buf_t;
typedef void arc_done_func_t(zio_t *zio, arc_buf_t *buf, void *private);
typedef int arc_evict_func_t(void *private);
/* generic arc_done_func_t's which you can use */
arc_done_func_t arc_bcopy_func;
arc_done_func_t arc_getbuf_func;
struct arc_buf {
arc_buf_hdr_t *b_hdr;
arc_buf_t *b_next;
kmutex_t b_evict_lock;
krwlock_t b_data_lock;
void *b_data;
arc_evict_func_t *b_efunc;
void *b_private;
};
typedef enum arc_buf_contents {
ARC_BUFC_DATA, /* buffer contains data */
ARC_BUFC_METADATA, /* buffer contains metadata */
ARC_BUFC_NUMTYPES
} arc_buf_contents_t;
/*
* These are the flags we pass into calls to the arc
*/
#define ARC_WAIT (1 << 1) /* perform I/O synchronously */
#define ARC_NOWAIT (1 << 2) /* perform I/O asynchronously */
#define ARC_PREFETCH (1 << 3) /* I/O is a prefetch */
#define ARC_CACHED (1 << 4) /* I/O was already in cache */
#define ARC_L2CACHE (1 << 5) /* cache in L2ARC */
/*
 * The following breakdowns of arc_size exist for kstat only.
*/
typedef enum arc_space_type {
ARC_SPACE_DATA,
ARC_SPACE_HDRS,
ARC_SPACE_L2HDRS,
ARC_SPACE_OTHER,
ARC_SPACE_NUMTYPES
} arc_space_type_t;
void arc_space_consume(uint64_t space, arc_space_type_t type);
void arc_space_return(uint64_t space, arc_space_type_t type);
void *arc_data_buf_alloc(uint64_t space);
void arc_data_buf_free(void *buf, uint64_t space);
arc_buf_t *arc_buf_alloc(spa_t *spa, int size, void *tag,
arc_buf_contents_t type);
arc_buf_t *arc_loan_buf(spa_t *spa, int size);
void arc_return_buf(arc_buf_t *buf, void *tag);
void arc_loan_inuse_buf(arc_buf_t *buf, void *tag);
void arc_buf_add_ref(arc_buf_t *buf, void *tag);
int arc_buf_remove_ref(arc_buf_t *buf, void *tag);
int arc_buf_size(arc_buf_t *buf);
void arc_release(arc_buf_t *buf, void *tag);
int arc_release_bp(arc_buf_t *buf, void *tag, blkptr_t *bp, spa_t *spa,
zbookmark_t *zb);
int arc_released(arc_buf_t *buf);
int arc_has_callback(arc_buf_t *buf);
void arc_buf_freeze(arc_buf_t *buf);
void arc_buf_thaw(arc_buf_t *buf);
#ifdef ZFS_DEBUG
int arc_referenced(arc_buf_t *buf);
#endif
int arc_read(zio_t *pio, spa_t *spa, const blkptr_t *bp, arc_buf_t *pbuf,
arc_done_func_t *done, void *private, int priority, int zio_flags,
uint32_t *arc_flags, const zbookmark_t *zb);
int arc_read_nolock(zio_t *pio, spa_t *spa, const blkptr_t *bp,
arc_done_func_t *done, void *private, int priority, int flags,
uint32_t *arc_flags, const zbookmark_t *zb);
zio_t *arc_write(zio_t *pio, spa_t *spa, uint64_t txg,
blkptr_t *bp, arc_buf_t *buf, boolean_t l2arc, const zio_prop_t *zp,
arc_done_func_t *ready, arc_done_func_t *done, void *private,
int priority, int zio_flags, const zbookmark_t *zb);
void arc_set_callback(arc_buf_t *buf, arc_evict_func_t *func, void *private);
int arc_buf_evict(arc_buf_t *buf);
void arc_flush(spa_t *spa);
void arc_tempreserve_clear(uint64_t reserve);
int arc_tempreserve_space(uint64_t reserve, uint64_t txg);
void arc_init(void);
void arc_fini(void);
/*
* Level 2 ARC
*/
void l2arc_add_vdev(spa_t *spa, vdev_t *vd);
void l2arc_remove_vdev(vdev_t *vd);
boolean_t l2arc_vdev_present(vdev_t *vd);
void l2arc_init(void);
void l2arc_fini(void);
void l2arc_start(void);
void l2arc_stop(void);
#ifdef __cplusplus
}
#endif
#endif /* _SYS_ARC_H */

View File

@ -0,0 +1,57 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#ifndef _SYS_BPLIST_H
#define _SYS_BPLIST_H
#include <sys/zfs_context.h>
#include <sys/spa.h>
#ifdef __cplusplus
extern "C" {
#endif
typedef struct bplist_entry {
blkptr_t bpe_blk;
list_node_t bpe_node;
} bplist_entry_t;
typedef struct bplist {
kmutex_t bpl_lock;
list_t bpl_list;
} bplist_t;
typedef int bplist_itor_t(void *arg, const blkptr_t *bp, dmu_tx_t *tx);
void bplist_create(bplist_t *bpl);
void bplist_destroy(bplist_t *bpl);
void bplist_append(bplist_t *bpl, const blkptr_t *bp);
void bplist_iterate(bplist_t *bpl, bplist_itor_t *func,
void *arg, dmu_tx_t *tx);
#ifdef __cplusplus
}
#endif
#endif /* _SYS_BPLIST_H */

View File

@ -0,0 +1,91 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2010, Oracle and/or its affiliates. All rights reserved.
*/
#ifndef _SYS_BPOBJ_H
#define _SYS_BPOBJ_H
#include <sys/dmu.h>
#include <sys/spa.h>
#include <sys/txg.h>
#include <sys/zio.h>
#include <sys/zfs_context.h>
#ifdef __cplusplus
extern "C" {
#endif
typedef struct bpobj_phys {
/*
* This is the bonus buffer for the dead lists. The object's
	 * contents are an array of bpo_num_blkptrs blkptr_t's, representing
	 * a total of bpo_bytes of physical space.
*/
uint64_t bpo_num_blkptrs;
uint64_t bpo_bytes;
uint64_t bpo_comp;
uint64_t bpo_uncomp;
uint64_t bpo_subobjs;
uint64_t bpo_num_subobjs;
} bpobj_phys_t;
#define BPOBJ_SIZE_V0 (2 * sizeof (uint64_t))
#define BPOBJ_SIZE_V1 (4 * sizeof (uint64_t))
typedef struct bpobj {
kmutex_t bpo_lock;
objset_t *bpo_os;
uint64_t bpo_object;
int bpo_epb;
uint8_t bpo_havecomp;
uint8_t bpo_havesubobj;
bpobj_phys_t *bpo_phys;
dmu_buf_t *bpo_dbuf;
dmu_buf_t *bpo_cached_dbuf;
} bpobj_t;
typedef int bpobj_itor_t(void *arg, const blkptr_t *bp, dmu_tx_t *tx);
uint64_t bpobj_alloc(objset_t *mos, int blocksize, dmu_tx_t *tx);
void bpobj_free(objset_t *os, uint64_t obj, dmu_tx_t *tx);
int bpobj_open(bpobj_t *bpo, objset_t *mos, uint64_t object);
void bpobj_close(bpobj_t *bpo);
int bpobj_iterate(bpobj_t *bpo, bpobj_itor_t func, void *arg, dmu_tx_t *tx);
int bpobj_iterate_nofree(bpobj_t *bpo, bpobj_itor_t func, void *arg,
    dmu_tx_t *tx);
int bpobj_iterate_dbg(bpobj_t *bpo, uint64_t *itorp, blkptr_t *bp);
void bpobj_enqueue_subobj(bpobj_t *bpo, uint64_t subobj, dmu_tx_t *tx);
void bpobj_enqueue(bpobj_t *bpo, const blkptr_t *bp, dmu_tx_t *tx);
int bpobj_space(bpobj_t *bpo,
uint64_t *usedp, uint64_t *compp, uint64_t *uncompp);
int bpobj_space_range(bpobj_t *bpo, uint64_t mintxg, uint64_t maxtxg,
uint64_t *usedp, uint64_t *compp, uint64_t *uncompp);
#ifdef __cplusplus
}
#endif
#endif /* _SYS_BPOBJ_H */

View File

@ -0,0 +1,375 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#ifndef _SYS_DBUF_H
#define _SYS_DBUF_H
#include <sys/dmu.h>
#include <sys/spa.h>
#include <sys/txg.h>
#include <sys/zio.h>
#include <sys/arc.h>
#include <sys/zfs_context.h>
#include <sys/refcount.h>
#include <sys/zrlock.h>
#ifdef __cplusplus
extern "C" {
#endif
#define IN_DMU_SYNC 2
/*
* define flags for dbuf_read
*/
#define DB_RF_MUST_SUCCEED (1 << 0)
#define DB_RF_CANFAIL (1 << 1)
#define DB_RF_HAVESTRUCT (1 << 2)
#define DB_RF_NOPREFETCH (1 << 3)
#define DB_RF_NEVERWAIT (1 << 4)
#define DB_RF_CACHED (1 << 5)
/*
* The simplified state transition diagram for dbufs looks like:
*
* +----> READ ----+
* | |
* | V
* (alloc)-->UNCACHED CACHED-->EVICTING-->(free)
* | ^ ^
* | | |
* +----> FILL ----+ |
* | |
* | |
* +--------> NOFILL -------+
*/
typedef enum dbuf_states {
DB_UNCACHED,
DB_FILL,
DB_NOFILL,
DB_READ,
DB_CACHED,
DB_EVICTING
} dbuf_states_t;
struct dnode;
struct dmu_tx;
/*
* level = 0 means the user data
* level = 1 means the single indirect block
* etc.
*/
struct dmu_buf_impl;
typedef enum override_states {
DR_NOT_OVERRIDDEN,
DR_IN_DMU_SYNC,
DR_OVERRIDDEN
} override_states_t;
typedef struct dbuf_dirty_record {
/* link on our parent's dirty list */
list_node_t dr_dirty_node;
/* transaction group this data will sync in */
uint64_t dr_txg;
/* zio of outstanding write IO */
zio_t *dr_zio;
/* pointer back to our dbuf */
struct dmu_buf_impl *dr_dbuf;
/* pointer to next dirty record */
struct dbuf_dirty_record *dr_next;
/* pointer to parent dirty record */
struct dbuf_dirty_record *dr_parent;
union dirty_types {
struct dirty_indirect {
/* protect access to list */
kmutex_t dr_mtx;
/* Our list of dirty children */
list_t dr_children;
} di;
struct dirty_leaf {
/*
* dr_data is set when we dirty the buffer
* so that we can retain the pointer even if it
* gets COW'd in a subsequent transaction group.
*/
arc_buf_t *dr_data;
blkptr_t dr_overridden_by;
override_states_t dr_override_state;
uint8_t dr_copies;
} dl;
} dt;
} dbuf_dirty_record_t;
typedef struct dmu_buf_impl {
/*
* The following members are immutable, with the exception of
* db.db_data, which is protected by db_mtx.
*/
/* the publicly visible structure */
dmu_buf_t db;
/* the objset we belong to */
struct objset *db_objset;
/*
* handle to safely access the dnode we belong to (NULL when evicted)
*/
struct dnode_handle *db_dnode_handle;
/*
* our parent buffer; if the dnode points to us directly,
* db_parent == db_dnode_handle->dnh_dnode->dn_dbuf
* only accessed by sync thread ???
* (NULL when evicted)
* May change from NULL to non-NULL under the protection of db_mtx
* (see dbuf_check_blkptr())
*/
struct dmu_buf_impl *db_parent;
/*
* link for hash table of all dmu_buf_impl_t's
*/
struct dmu_buf_impl *db_hash_next;
/* our block number */
uint64_t db_blkid;
/*
* Pointer to the blkptr_t which points to us. May be NULL if we
* don't have one yet. (NULL when evicted)
*/
blkptr_t *db_blkptr;
/*
* Our indirection level. Data buffers have db_level==0.
* Indirect buffers which point to data buffers have
* db_level==1. etc. Buffers which contain dnodes have
* db_level==0, since the dnodes are stored in a file.
*/
uint8_t db_level;
/* db_mtx protects the members below */
kmutex_t db_mtx;
/*
* Current state of the buffer
*/
dbuf_states_t db_state;
/*
* Refcount accessed by dmu_buf_{hold,rele}.
* If nonzero, the buffer can't be destroyed.
* Protected by db_mtx.
*/
refcount_t db_holds;
/* buffer holding our data */
arc_buf_t *db_buf;
kcondvar_t db_changed;
dbuf_dirty_record_t *db_data_pending;
/* pointer to most recent dirty record for this buffer */
dbuf_dirty_record_t *db_last_dirty;
/*
 * Our link on the owner dnode's dn_dbufs list.
* Protected by its dn_dbufs_mtx.
*/
list_node_t db_link;
/* Data which is unique to data (leaf) blocks: */
/* stuff we store for the user (see dmu_buf_set_user) */
void *db_user_ptr;
void **db_user_data_ptr_ptr;
dmu_buf_evict_func_t *db_evict_func;
uint8_t db_immediate_evict;
uint8_t db_freed_in_flight;
uint8_t db_dirtycnt;
} dmu_buf_impl_t;
/* Note: the dbuf hash table is exposed only for the mdb module */
#define DBUF_MUTEXES 256
#define DBUF_HASH_MUTEX(h, idx) (&(h)->hash_mutexes[(idx) & (DBUF_MUTEXES-1)])
typedef struct dbuf_hash_table {
uint64_t hash_table_mask;
dmu_buf_impl_t **hash_table;
kmutex_t hash_mutexes[DBUF_MUTEXES];
} dbuf_hash_table_t;
uint64_t dbuf_whichblock(struct dnode *di, uint64_t offset);
dmu_buf_impl_t *dbuf_create_tlib(struct dnode *dn, char *data);
void dbuf_create_bonus(struct dnode *dn);
int dbuf_spill_set_blksz(dmu_buf_t *db, uint64_t blksz, dmu_tx_t *tx);
void dbuf_spill_hold(struct dnode *dn, dmu_buf_impl_t **dbp, void *tag);
void dbuf_rm_spill(struct dnode *dn, dmu_tx_t *tx);
dmu_buf_impl_t *dbuf_hold(struct dnode *dn, uint64_t blkid, void *tag);
dmu_buf_impl_t *dbuf_hold_level(struct dnode *dn, int level, uint64_t blkid,
void *tag);
int dbuf_hold_impl(struct dnode *dn, uint8_t level, uint64_t blkid, int create,
void *tag, dmu_buf_impl_t **dbp);
void dbuf_prefetch(struct dnode *dn, uint64_t blkid);
void dbuf_add_ref(dmu_buf_impl_t *db, void *tag);
uint64_t dbuf_refcount(dmu_buf_impl_t *db);
void dbuf_rele(dmu_buf_impl_t *db, void *tag);
void dbuf_rele_and_unlock(dmu_buf_impl_t *db, void *tag);
dmu_buf_impl_t *dbuf_find(struct dnode *dn, uint8_t level, uint64_t blkid);
int dbuf_read(dmu_buf_impl_t *db, zio_t *zio, uint32_t flags);
void dbuf_will_dirty(dmu_buf_impl_t *db, dmu_tx_t *tx);
void dbuf_fill_done(dmu_buf_impl_t *db, dmu_tx_t *tx);
void dmu_buf_will_not_fill(dmu_buf_t *db, dmu_tx_t *tx);
void dmu_buf_will_fill(dmu_buf_t *db, dmu_tx_t *tx);
void dmu_buf_fill_done(dmu_buf_t *db, dmu_tx_t *tx);
void dbuf_assign_arcbuf(dmu_buf_impl_t *db, arc_buf_t *buf, dmu_tx_t *tx);
dbuf_dirty_record_t *dbuf_dirty(dmu_buf_impl_t *db, dmu_tx_t *tx);
arc_buf_t *dbuf_loan_arcbuf(dmu_buf_impl_t *db);
void dbuf_clear(dmu_buf_impl_t *db);
void dbuf_evict(dmu_buf_impl_t *db);
void dbuf_setdirty(dmu_buf_impl_t *db, dmu_tx_t *tx);
void dbuf_unoverride(dbuf_dirty_record_t *dr);
void dbuf_sync_list(list_t *list, dmu_tx_t *tx);
void dbuf_release_bp(dmu_buf_impl_t *db);
void dbuf_free_range(struct dnode *dn, uint64_t start, uint64_t end,
struct dmu_tx *);
void dbuf_new_size(dmu_buf_impl_t *db, int size, dmu_tx_t *tx);
#define DB_DNODE(_db) ((_db)->db_dnode_handle->dnh_dnode)
#define DB_DNODE_LOCK(_db) ((_db)->db_dnode_handle->dnh_zrlock)
#define DB_DNODE_ENTER(_db) (zrl_add(&DB_DNODE_LOCK(_db)))
#define DB_DNODE_EXIT(_db) (zrl_remove(&DB_DNODE_LOCK(_db)))
#define DB_DNODE_HELD(_db) (!zrl_is_zero(&DB_DNODE_LOCK(_db)))
#define DB_GET_SPA(_spa_p, _db) { \
dnode_t *__dn; \
DB_DNODE_ENTER(_db); \
__dn = DB_DNODE(_db); \
*(_spa_p) = __dn->dn_objset->os_spa; \
DB_DNODE_EXIT(_db); \
}
#define DB_GET_OBJSET(_os_p, _db) { \
dnode_t *__dn; \
DB_DNODE_ENTER(_db); \
__dn = DB_DNODE(_db); \
*(_os_p) = __dn->dn_objset; \
DB_DNODE_EXIT(_db); \
}
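/*
 * A minimal sketch of the accessor macros above ("db" is assumed to be
 * a held dmu_buf_impl_t pointer); each use briefly enters the dnode
 * zrlock, copies out the pointer, and exits:
 *
 *    spa_t *spa;
 *    objset_t *os;
 *    DB_GET_SPA(&spa, db);
 *    DB_GET_OBJSET(&os, db);
 */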
void dbuf_init(void);
void dbuf_fini(void);
boolean_t dbuf_is_metadata(dmu_buf_impl_t *db);
#define DBUF_IS_METADATA(_db) \
(dbuf_is_metadata(_db))
#define DBUF_GET_BUFC_TYPE(_db) \
(DBUF_IS_METADATA(_db) ? ARC_BUFC_METADATA : ARC_BUFC_DATA)
#define DBUF_IS_CACHEABLE(_db) \
((_db)->db_objset->os_primary_cache == ZFS_CACHE_ALL || \
(DBUF_IS_METADATA(_db) && \
((_db)->db_objset->os_primary_cache == ZFS_CACHE_METADATA)))
#define DBUF_IS_L2CACHEABLE(_db) \
((_db)->db_objset->os_secondary_cache == ZFS_CACHE_ALL || \
(DBUF_IS_METADATA(_db) && \
((_db)->db_objset->os_secondary_cache == ZFS_CACHE_METADATA)))
#ifdef ZFS_DEBUG
/*
* There should be a ## between the string literal and fmt, to make it
* clear that we're joining two strings together, but gcc does not
* support that preprocessor token.
*/
#define dprintf_dbuf(dbuf, fmt, ...) do { \
if (zfs_flags & ZFS_DEBUG_DPRINTF) { \
char __db_buf[32]; \
uint64_t __db_obj = (dbuf)->db.db_object; \
if (__db_obj == DMU_META_DNODE_OBJECT) \
(void) strcpy(__db_buf, "mdn"); \
else \
(void) snprintf(__db_buf, sizeof (__db_buf), "%lld", \
(u_longlong_t)__db_obj); \
dprintf_ds((dbuf)->db_objset->os_dsl_dataset, \
"obj=%s lvl=%u blkid=%lld " fmt, \
__db_buf, (dbuf)->db_level, \
(u_longlong_t)(dbuf)->db_blkid, __VA_ARGS__); \
} \
_NOTE(CONSTCOND) } while (0)
#define dprintf_dbuf_bp(db, bp, fmt, ...) do { \
if (zfs_flags & ZFS_DEBUG_DPRINTF) { \
char *__blkbuf = kmem_alloc(BP_SPRINTF_LEN, KM_SLEEP); \
sprintf_blkptr(__blkbuf, bp); \
dprintf_dbuf(db, fmt " %s\n", __VA_ARGS__, __blkbuf); \
kmem_free(__blkbuf, BP_SPRINTF_LEN); \
} \
_NOTE(CONSTCOND) } while (0)
#define DBUF_VERIFY(db) dbuf_verify(db)
#else
#define dprintf_dbuf(db, fmt, ...)
#define dprintf_dbuf_bp(db, bp, fmt, ...)
#define DBUF_VERIFY(db)
#endif
#ifdef __cplusplus
}
#endif
#endif /* _SYS_DBUF_H */

uts/common/fs/zfs/sys/ddt.h Normal file

@ -0,0 +1,246 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2009, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#ifndef _SYS_DDT_H
#define _SYS_DDT_H
#include <sys/sysmacros.h>
#include <sys/types.h>
#include <sys/fs/zfs.h>
#include <sys/zio.h>
#include <sys/dmu.h>
#ifdef __cplusplus
extern "C" {
#endif
/*
* On-disk DDT formats, in the desired search order (newest version first).
*/
enum ddt_type {
DDT_TYPE_ZAP = 0,
DDT_TYPES
};
/*
* DDT classes, in the desired search order (highest replication level first).
*/
enum ddt_class {
DDT_CLASS_DITTO = 0,
DDT_CLASS_DUPLICATE,
DDT_CLASS_UNIQUE,
DDT_CLASSES
};
#define DDT_TYPE_CURRENT 0
#define DDT_COMPRESS_BYTEORDER_MASK 0x80
#define DDT_COMPRESS_FUNCTION_MASK 0x7f
/*
* On-disk ddt entry: key (name) and physical storage (value).
*/
typedef struct ddt_key {
zio_cksum_t ddk_cksum; /* 256-bit block checksum */
uint64_t ddk_prop; /* LSIZE, PSIZE, compression */
} ddt_key_t;
/*
* ddk_prop layout:
*
 * +-------+-------+-------+-------+-------+-------+-------+-------+
 * |   0   |   0   |   0   | comp  |     PSIZE     |     LSIZE     |
 * +-------+-------+-------+-------+-------+-------+-------+-------+
*/
#define DDK_GET_LSIZE(ddk) \
BF64_GET_SB((ddk)->ddk_prop, 0, 16, SPA_MINBLOCKSHIFT, 1)
#define DDK_SET_LSIZE(ddk, x) \
BF64_SET_SB((ddk)->ddk_prop, 0, 16, SPA_MINBLOCKSHIFT, 1, x)
#define DDK_GET_PSIZE(ddk) \
BF64_GET_SB((ddk)->ddk_prop, 16, 16, SPA_MINBLOCKSHIFT, 1)
#define DDK_SET_PSIZE(ddk, x) \
BF64_SET_SB((ddk)->ddk_prop, 16, 16, SPA_MINBLOCKSHIFT, 1, x)
#define DDK_GET_COMPRESS(ddk) BF64_GET((ddk)->ddk_prop, 32, 8)
#define DDK_SET_COMPRESS(ddk, x) BF64_SET((ddk)->ddk_prop, 32, 8, x)
#define DDT_KEY_WORDS (sizeof (ddt_key_t) / sizeof (uint64_t))
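/*
 * A minimal sketch of the accessors above: sizes are stored biased by
 * one in units of 1 << SPA_MINBLOCKSHIFT, so round-trips are exact only
 * for multiples of the minimum block size (128K and 4K shown here):
 *
 *    ddt_key_t ddk = { 0 };
 *    DDK_SET_LSIZE(&ddk, 128 << 10);
 *    DDK_SET_PSIZE(&ddk, 4 << 10);
 *    ASSERT(DDK_GET_LSIZE(&ddk) == 128 << 10);
 *    ASSERT(DDK_GET_PSIZE(&ddk) == 4 << 10);
 */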
typedef struct ddt_phys {
dva_t ddp_dva[SPA_DVAS_PER_BP];
uint64_t ddp_refcnt;
uint64_t ddp_phys_birth;
} ddt_phys_t;
enum ddt_phys_type {
DDT_PHYS_DITTO = 0,
DDT_PHYS_SINGLE = 1,
DDT_PHYS_DOUBLE = 2,
DDT_PHYS_TRIPLE = 3,
DDT_PHYS_TYPES
};
/*
* In-core ddt entry
*/
struct ddt_entry {
ddt_key_t dde_key;
ddt_phys_t dde_phys[DDT_PHYS_TYPES];
zio_t *dde_lead_zio[DDT_PHYS_TYPES];
void *dde_repair_data;
enum ddt_type dde_type;
enum ddt_class dde_class;
uint8_t dde_loading;
uint8_t dde_loaded;
kcondvar_t dde_cv;
avl_node_t dde_node;
};
/*
* In-core ddt
*/
struct ddt {
kmutex_t ddt_lock;
avl_tree_t ddt_tree;
avl_tree_t ddt_repair_tree;
enum zio_checksum ddt_checksum;
spa_t *ddt_spa;
objset_t *ddt_os;
uint64_t ddt_stat_object;
uint64_t ddt_object[DDT_TYPES][DDT_CLASSES];
ddt_histogram_t ddt_histogram[DDT_TYPES][DDT_CLASSES];
ddt_histogram_t ddt_histogram_cache[DDT_TYPES][DDT_CLASSES];
ddt_object_t ddt_object_stats[DDT_TYPES][DDT_CLASSES];
avl_node_t ddt_node;
};
/*
* In-core and on-disk bookmark for DDT walks
*/
typedef struct ddt_bookmark {
uint64_t ddb_class;
uint64_t ddb_type;
uint64_t ddb_checksum;
uint64_t ddb_cursor;
} ddt_bookmark_t;
/*
* Ops vector to access a specific DDT object type.
*/
typedef struct ddt_ops {
char ddt_op_name[32];
int (*ddt_op_create)(objset_t *os, uint64_t *object, dmu_tx_t *tx,
boolean_t prehash);
int (*ddt_op_destroy)(objset_t *os, uint64_t object, dmu_tx_t *tx);
int (*ddt_op_lookup)(objset_t *os, uint64_t object, ddt_entry_t *dde);
void (*ddt_op_prefetch)(objset_t *os, uint64_t object,
ddt_entry_t *dde);
int (*ddt_op_update)(objset_t *os, uint64_t object, ddt_entry_t *dde,
dmu_tx_t *tx);
int (*ddt_op_remove)(objset_t *os, uint64_t object, ddt_entry_t *dde,
dmu_tx_t *tx);
int (*ddt_op_walk)(objset_t *os, uint64_t object, ddt_entry_t *dde,
uint64_t *walk);
uint64_t (*ddt_op_count)(objset_t *os, uint64_t object);
} ddt_ops_t;
#define DDT_NAMELEN 80
extern void ddt_object_name(ddt_t *ddt, enum ddt_type type,
enum ddt_class class, char *name);
extern int ddt_object_walk(ddt_t *ddt, enum ddt_type type,
enum ddt_class class, uint64_t *walk, ddt_entry_t *dde);
extern uint64_t ddt_object_count(ddt_t *ddt, enum ddt_type type,
enum ddt_class class);
extern int ddt_object_info(ddt_t *ddt, enum ddt_type type,
enum ddt_class class, dmu_object_info_t *);
extern boolean_t ddt_object_exists(ddt_t *ddt, enum ddt_type type,
enum ddt_class class);
extern void ddt_bp_fill(const ddt_phys_t *ddp, blkptr_t *bp,
uint64_t txg);
extern void ddt_bp_create(enum zio_checksum checksum, const ddt_key_t *ddk,
const ddt_phys_t *ddp, blkptr_t *bp);
extern void ddt_key_fill(ddt_key_t *ddk, const blkptr_t *bp);
extern void ddt_phys_fill(ddt_phys_t *ddp, const blkptr_t *bp);
extern void ddt_phys_clear(ddt_phys_t *ddp);
extern void ddt_phys_addref(ddt_phys_t *ddp);
extern void ddt_phys_decref(ddt_phys_t *ddp);
extern void ddt_phys_free(ddt_t *ddt, ddt_key_t *ddk, ddt_phys_t *ddp,
uint64_t txg);
extern ddt_phys_t *ddt_phys_select(const ddt_entry_t *dde, const blkptr_t *bp);
extern uint64_t ddt_phys_total_refcnt(const ddt_entry_t *dde);
extern void ddt_stat_add(ddt_stat_t *dst, const ddt_stat_t *src, uint64_t neg);
extern void ddt_histogram_add(ddt_histogram_t *dst, const ddt_histogram_t *src);
extern void ddt_histogram_stat(ddt_stat_t *dds, const ddt_histogram_t *ddh);
extern boolean_t ddt_histogram_empty(const ddt_histogram_t *ddh);
extern void ddt_get_dedup_object_stats(spa_t *spa, ddt_object_t *ddo);
extern void ddt_get_dedup_histogram(spa_t *spa, ddt_histogram_t *ddh);
extern void ddt_get_dedup_stats(spa_t *spa, ddt_stat_t *dds_total);
extern uint64_t ddt_get_dedup_dspace(spa_t *spa);
extern uint64_t ddt_get_pool_dedup_ratio(spa_t *spa);
extern int ddt_ditto_copies_needed(ddt_t *ddt, ddt_entry_t *dde,
ddt_phys_t *ddp_willref);
extern int ddt_ditto_copies_present(ddt_entry_t *dde);
extern size_t ddt_compress(void *src, uchar_t *dst, size_t s_len, size_t d_len);
extern void ddt_decompress(uchar_t *src, void *dst, size_t s_len, size_t d_len);
extern ddt_t *ddt_select(spa_t *spa, const blkptr_t *bp);
extern void ddt_enter(ddt_t *ddt);
extern void ddt_exit(ddt_t *ddt);
extern ddt_entry_t *ddt_lookup(ddt_t *ddt, const blkptr_t *bp, boolean_t add);
extern void ddt_prefetch(spa_t *spa, const blkptr_t *bp);
extern void ddt_remove(ddt_t *ddt, ddt_entry_t *dde);
extern boolean_t ddt_class_contains(spa_t *spa, enum ddt_class max_class,
const blkptr_t *bp);
extern ddt_entry_t *ddt_repair_start(ddt_t *ddt, const blkptr_t *bp);
extern void ddt_repair_done(ddt_t *ddt, ddt_entry_t *dde);
extern int ddt_entry_compare(const void *x1, const void *x2);
extern void ddt_create(spa_t *spa);
extern int ddt_load(spa_t *spa);
extern void ddt_unload(spa_t *spa);
extern void ddt_sync(spa_t *spa, uint64_t txg);
extern int ddt_walk(spa_t *spa, ddt_bookmark_t *ddb, ddt_entry_t *dde);
extern int ddt_object_update(ddt_t *ddt, enum ddt_type type,
enum ddt_class class, ddt_entry_t *dde, dmu_tx_t *tx);
extern const ddt_ops_t ddt_zap_ops;
#ifdef __cplusplus
}
#endif
#endif /* _SYS_DDT_H */

uts/common/fs/zfs/sys/dmu.h Normal file

@ -0,0 +1,740 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
*/
/* Portions Copyright 2010 Robert Milkowski */
#ifndef _SYS_DMU_H
#define _SYS_DMU_H
/*
* This file describes the interface that the DMU provides for its
* consumers.
*
* The DMU also interacts with the SPA. That interface is described in
* dmu_spa.h.
*/
#include <sys/inttypes.h>
#include <sys/types.h>
#include <sys/param.h>
#include <sys/cred.h>
#include <sys/time.h>
#ifdef __cplusplus
extern "C" {
#endif
struct uio;
struct xuio;
struct page;
struct vnode;
struct spa;
struct zilog;
struct zio;
struct blkptr;
struct zap_cursor;
struct dsl_dataset;
struct dsl_pool;
struct dnode;
struct drr_begin;
struct drr_end;
struct zbookmark;
struct spa;
struct nvlist;
struct arc_buf;
struct zio_prop;
struct sa_handle;
typedef struct objset objset_t;
typedef struct dmu_tx dmu_tx_t;
typedef struct dsl_dir dsl_dir_t;
typedef enum dmu_object_type {
DMU_OT_NONE,
/* general: */
DMU_OT_OBJECT_DIRECTORY, /* ZAP */
DMU_OT_OBJECT_ARRAY, /* UINT64 */
DMU_OT_PACKED_NVLIST, /* UINT8 (XDR by nvlist_pack/unpack) */
DMU_OT_PACKED_NVLIST_SIZE, /* UINT64 */
DMU_OT_BPOBJ, /* UINT64 */
DMU_OT_BPOBJ_HDR, /* UINT64 */
/* spa: */
DMU_OT_SPACE_MAP_HEADER, /* UINT64 */
DMU_OT_SPACE_MAP, /* UINT64 */
/* zil: */
DMU_OT_INTENT_LOG, /* UINT64 */
/* dmu: */
DMU_OT_DNODE, /* DNODE */
DMU_OT_OBJSET, /* OBJSET */
/* dsl: */
DMU_OT_DSL_DIR, /* UINT64 */
DMU_OT_DSL_DIR_CHILD_MAP, /* ZAP */
DMU_OT_DSL_DS_SNAP_MAP, /* ZAP */
DMU_OT_DSL_PROPS, /* ZAP */
DMU_OT_DSL_DATASET, /* UINT64 */
/* zpl: */
DMU_OT_ZNODE, /* ZNODE */
DMU_OT_OLDACL, /* Old ACL */
DMU_OT_PLAIN_FILE_CONTENTS, /* UINT8 */
DMU_OT_DIRECTORY_CONTENTS, /* ZAP */
DMU_OT_MASTER_NODE, /* ZAP */
DMU_OT_UNLINKED_SET, /* ZAP */
/* zvol: */
DMU_OT_ZVOL, /* UINT8 */
DMU_OT_ZVOL_PROP, /* ZAP */
/* other; for testing only! */
DMU_OT_PLAIN_OTHER, /* UINT8 */
DMU_OT_UINT64_OTHER, /* UINT64 */
DMU_OT_ZAP_OTHER, /* ZAP */
/* new object types: */
DMU_OT_ERROR_LOG, /* ZAP */
DMU_OT_SPA_HISTORY, /* UINT8 */
DMU_OT_SPA_HISTORY_OFFSETS, /* spa_his_phys_t */
DMU_OT_POOL_PROPS, /* ZAP */
DMU_OT_DSL_PERMS, /* ZAP */
DMU_OT_ACL, /* ACL */
DMU_OT_SYSACL, /* SYSACL */
DMU_OT_FUID, /* FUID table (Packed NVLIST UINT8) */
DMU_OT_FUID_SIZE, /* FUID table size UINT64 */
DMU_OT_NEXT_CLONES, /* ZAP */
DMU_OT_SCAN_QUEUE, /* ZAP */
DMU_OT_USERGROUP_USED, /* ZAP */
DMU_OT_USERGROUP_QUOTA, /* ZAP */
DMU_OT_USERREFS, /* ZAP */
DMU_OT_DDT_ZAP, /* ZAP */
DMU_OT_DDT_STATS, /* ZAP */
DMU_OT_SA, /* System attr */
DMU_OT_SA_MASTER_NODE, /* ZAP */
DMU_OT_SA_ATTR_REGISTRATION, /* ZAP */
DMU_OT_SA_ATTR_LAYOUTS, /* ZAP */
DMU_OT_SCAN_XLATE, /* ZAP */
DMU_OT_DEDUP, /* fake dedup BP from ddt_bp_create() */
DMU_OT_DEADLIST, /* ZAP */
DMU_OT_DEADLIST_HDR, /* UINT64 */
DMU_OT_DSL_CLONES, /* ZAP */
DMU_OT_BPOBJ_SUBOBJ, /* UINT64 */
DMU_OT_NUMTYPES
} dmu_object_type_t;
typedef enum dmu_objset_type {
DMU_OST_NONE,
DMU_OST_META,
DMU_OST_ZFS,
DMU_OST_ZVOL,
DMU_OST_OTHER, /* For testing only! */
DMU_OST_ANY, /* Be careful! */
DMU_OST_NUMTYPES
} dmu_objset_type_t;
void byteswap_uint64_array(void *buf, size_t size);
void byteswap_uint32_array(void *buf, size_t size);
void byteswap_uint16_array(void *buf, size_t size);
void byteswap_uint8_array(void *buf, size_t size);
void zap_byteswap(void *buf, size_t size);
void zfs_oldacl_byteswap(void *buf, size_t size);
void zfs_acl_byteswap(void *buf, size_t size);
void zfs_znode_byteswap(void *buf, size_t size);
#define DS_FIND_SNAPSHOTS (1<<0)
#define DS_FIND_CHILDREN (1<<1)
/*
* The maximum number of bytes that can be accessed as part of one
* operation, including metadata.
*/
#define DMU_MAX_ACCESS (10<<20) /* 10MB */
#define DMU_MAX_DELETEBLKCNT (20480) /* ~5MB of indirect blocks */
#define DMU_USERUSED_OBJECT (-1ULL)
#define DMU_GROUPUSED_OBJECT (-2ULL)
#define DMU_DEADLIST_OBJECT (-3ULL)
/*
* artificial blkids for bonus buffer and spill blocks
*/
#define DMU_BONUS_BLKID (-1ULL)
#define DMU_SPILL_BLKID (-2ULL)
/*
* Public routines to create, destroy, open, and close objsets.
*/
int dmu_objset_hold(const char *name, void *tag, objset_t **osp);
int dmu_objset_own(const char *name, dmu_objset_type_t type,
boolean_t readonly, void *tag, objset_t **osp);
void dmu_objset_rele(objset_t *os, void *tag);
void dmu_objset_disown(objset_t *os, void *tag);
int dmu_objset_open_ds(struct dsl_dataset *ds, objset_t **osp);
int dmu_objset_evict_dbufs(objset_t *os);
int dmu_objset_create(const char *name, dmu_objset_type_t type, uint64_t flags,
void (*func)(objset_t *os, void *arg, cred_t *cr, dmu_tx_t *tx), void *arg);
int dmu_objset_clone(const char *name, struct dsl_dataset *clone_origin,
uint64_t flags);
int dmu_objset_destroy(const char *name, boolean_t defer);
int dmu_snapshots_destroy(char *fsname, char *snapname, boolean_t defer);
int dmu_objset_snapshot(char *fsname, char *snapname, char *tag,
struct nvlist *props, boolean_t recursive, boolean_t temporary, int fd);
int dmu_objset_rename(const char *name, const char *newname,
boolean_t recursive);
int dmu_objset_find(char *name, int func(const char *, void *), void *arg,
int flags);
void dmu_objset_byteswap(void *buf, size_t size);
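/*
 * A minimal hold/rele sketch (the dataset name is a placeholder; FTAG
 * is the conventional "current function" tag, but any stable pointer
 * may serve as a tag):
 *
 *    objset_t *os;
 *    int err = dmu_objset_hold("tank/fs", FTAG, &os);
 *    if (err == 0) {
 *            ... read-only use of os ...
 *            dmu_objset_rele(os, FTAG);
 *    }
 */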
typedef struct dmu_buf {
uint64_t db_object; /* object that this buffer is part of */
uint64_t db_offset; /* byte offset in this object */
uint64_t db_size; /* size of buffer in bytes */
void *db_data; /* data in buffer */
} dmu_buf_t;
typedef void dmu_buf_evict_func_t(struct dmu_buf *db, void *user_ptr);
/*
* The names of zap entries in the DIRECTORY_OBJECT of the MOS.
*/
#define DMU_POOL_DIRECTORY_OBJECT 1
#define DMU_POOL_CONFIG "config"
#define DMU_POOL_ROOT_DATASET "root_dataset"
#define DMU_POOL_SYNC_BPOBJ "sync_bplist"
#define DMU_POOL_ERRLOG_SCRUB "errlog_scrub"
#define DMU_POOL_ERRLOG_LAST "errlog_last"
#define DMU_POOL_SPARES "spares"
#define DMU_POOL_DEFLATE "deflate"
#define DMU_POOL_HISTORY "history"
#define DMU_POOL_PROPS "pool_props"
#define DMU_POOL_L2CACHE "l2cache"
#define DMU_POOL_TMP_USERREFS "tmp_userrefs"
#define DMU_POOL_DDT "DDT-%s-%s-%s"
#define DMU_POOL_DDT_STATS "DDT-statistics"
#define DMU_POOL_CREATION_VERSION "creation_version"
#define DMU_POOL_SCAN "scan"
#define DMU_POOL_FREE_BPOBJ "free_bpobj"
/*
* Allocate an object from this objset. The range of object numbers
* available is (0, DN_MAX_OBJECT). Object 0 is the meta-dnode.
*
* The transaction must be assigned to a txg. The newly allocated
* object will be "held" in the transaction (ie. you can modify the
* newly allocated object in this transaction).
*
* dmu_object_alloc() chooses an object and returns it in *objectp.
*
* dmu_object_claim() allocates a specific object number. If that
* number is already allocated, it fails and returns EEXIST.
*
* Return 0 on success, or ENOSPC or EEXIST as specified above.
*/
uint64_t dmu_object_alloc(objset_t *os, dmu_object_type_t ot,
int blocksize, dmu_object_type_t bonus_type, int bonus_len, dmu_tx_t *tx);
int dmu_object_claim(objset_t *os, uint64_t object, dmu_object_type_t ot,
int blocksize, dmu_object_type_t bonus_type, int bonus_len, dmu_tx_t *tx);
int dmu_object_reclaim(objset_t *os, uint64_t object, dmu_object_type_t ot,
int blocksize, dmu_object_type_t bonustype, int bonuslen);
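/*
 * A minimal allocation sketch (assumes an assigned "tx"; blocksize 0
 * lets the DMU choose, and no bonus buffer is requested):
 *
 *    uint64_t object = dmu_object_alloc(os, DMU_OT_PLAIN_OTHER, 0,
 *        DMU_OT_NONE, 0, tx);
 */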
/*
* Free an object from this objset.
*
* The object's data will be freed as well (ie. you don't need to call
* dmu_free(object, 0, -1, tx)).
*
* The object need not be held in the transaction.
*
* If there are any holds on this object's buffers (via dmu_buf_hold()),
 * or tx holds on the object (via dmu_tx_hold_object()), you cannot
* free it; it fails and returns EBUSY.
*
* If the object is not allocated, it fails and returns ENOENT.
*
* Return 0 on success, or EBUSY or ENOENT as specified above.
*/
int dmu_object_free(objset_t *os, uint64_t object, dmu_tx_t *tx);
/*
* Find the next allocated or free object.
*
* The objectp parameter is in-out. It will be updated to be the next
* object which is allocated. Ignore objects which have not been
* modified since txg.
*
 * XXX Can only be called on an objset with no dirty data.
*
* Returns 0 on success, or ENOENT if there are no more objects.
*/
int dmu_object_next(objset_t *os, uint64_t *objectp,
boolean_t hole, uint64_t txg);
/*
* Set the data blocksize for an object.
*
 * The object cannot have any blocks allocated beyond the first.  If
* the first block is allocated already, the new size must be greater
* than the current block size. If these conditions are not met,
* ENOTSUP will be returned.
*
* Returns 0 on success, or EBUSY if there are any holds on the object
* contents, or ENOTSUP as described above.
*/
int dmu_object_set_blocksize(objset_t *os, uint64_t object, uint64_t size,
int ibs, dmu_tx_t *tx);
/*
* Set the checksum property on a dnode. The new checksum algorithm will
* apply to all newly written blocks; existing blocks will not be affected.
*/
void dmu_object_set_checksum(objset_t *os, uint64_t object, uint8_t checksum,
dmu_tx_t *tx);
/*
* Set the compress property on a dnode. The new compression algorithm will
* apply to all newly written blocks; existing blocks will not be affected.
*/
void dmu_object_set_compress(objset_t *os, uint64_t object, uint8_t compress,
dmu_tx_t *tx);
/*
* Decide how to write a block: checksum, compression, number of copies, etc.
*/
#define WP_NOFILL 0x1
#define WP_DMU_SYNC 0x2
#define WP_SPILL 0x4
void dmu_write_policy(objset_t *os, struct dnode *dn, int level, int wp,
struct zio_prop *zp);
/*
* The bonus data is accessed more or less like a regular buffer.
* You must dmu_bonus_hold() to get the buffer, which will give you a
* dmu_buf_t with db_offset==-1ULL, and db_size = the size of the bonus
* data. As with any normal buffer, you must call dmu_buf_read() to
* read db_data, dmu_buf_will_dirty() before modifying it, and the
* object must be held in an assigned transaction before calling
* dmu_buf_will_dirty. You may use dmu_buf_set_user() on the bonus
* buffer as well. You must release your hold with dmu_buf_rele().
*/
int dmu_bonus_hold(objset_t *os, uint64_t object, void *tag, dmu_buf_t **);
int dmu_bonus_max(void);
int dmu_set_bonus(dmu_buf_t *, int, dmu_tx_t *);
int dmu_set_bonustype(dmu_buf_t *, dmu_object_type_t, dmu_tx_t *);
dmu_object_type_t dmu_get_bonustype(dmu_buf_t *);
int dmu_rm_spill(objset_t *, uint64_t, dmu_tx_t *);
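/*
 * The bonus-buffer sequence described above, as a minimal sketch
 * (assumes "os", "object", and an assigned "tx"; error handling
 * elided):
 *
 *    dmu_buf_t *db;
 *    if (dmu_bonus_hold(os, object, FTAG, &db) == 0) {
 *            dmu_buf_will_dirty(db, tx);
 *            ... modify db->db_data, up to db->db_size bytes ...
 *            dmu_buf_rele(db, FTAG);
 *    }
 */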
/*
* Special spill buffer support used by "SA" framework
*/
int dmu_spill_hold_by_bonus(dmu_buf_t *bonus, void *tag, dmu_buf_t **dbp);
int dmu_spill_hold_by_dnode(struct dnode *dn, uint32_t flags,
void *tag, dmu_buf_t **dbp);
int dmu_spill_hold_existing(dmu_buf_t *bonus, void *tag, dmu_buf_t **dbp);
/*
* Obtain the DMU buffer from the specified object which contains the
* specified offset. dmu_buf_hold() puts a "hold" on the buffer, so
* that it will remain in memory. You must release the hold with
 * dmu_buf_rele(). You mustn't access the dmu_buf_t after releasing your
* hold. You must have a hold on any dmu_buf_t* you pass to the DMU.
*
* You must call dmu_buf_read, dmu_buf_will_dirty, or dmu_buf_will_fill
* on the returned buffer before reading or writing the buffer's
* db_data. The comments for those routines describe what particular
* operations are valid after calling them.
*
* The object number must be a valid, allocated object number.
*/
int dmu_buf_hold(objset_t *os, uint64_t object, uint64_t offset,
void *tag, dmu_buf_t **, int flags);
void dmu_buf_add_ref(dmu_buf_t *db, void* tag);
void dmu_buf_rele(dmu_buf_t *db, void *tag);
uint64_t dmu_buf_refcount(dmu_buf_t *db);
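/*
 * A minimal hold/release sketch for a regular buffer (assumes "os" and
 * "object"; DMU_READ_PREFETCH is defined further below):
 *
 *    dmu_buf_t *db;
 *    int err = dmu_buf_hold(os, object, 0, FTAG, &db,
 *        DMU_READ_PREFETCH);
 *    if (err == 0) {
 *            ... access db->db_data per the rules above ...
 *            dmu_buf_rele(db, FTAG);
 *    }
 */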
/*
* dmu_buf_hold_array holds the DMU buffers which contain all bytes in a
* range of an object. A pointer to an array of dmu_buf_t*'s is
* returned (in *dbpp).
*
* dmu_buf_rele_array releases the hold on an array of dmu_buf_t*'s, and
* frees the array. The hold on the array of buffers MUST be released
* with dmu_buf_rele_array. You can NOT release the hold on each buffer
* individually with dmu_buf_rele.
*/
int dmu_buf_hold_array_by_bonus(dmu_buf_t *db, uint64_t offset,
uint64_t length, int read, void *tag, int *numbufsp, dmu_buf_t ***dbpp);
void dmu_buf_rele_array(dmu_buf_t **, int numbufs, void *tag);
/*
* Returns NULL on success, or the existing user ptr if it's already
* been set.
*
* user_ptr is for use by the user and can be obtained via dmu_buf_get_user().
*
* user_data_ptr_ptr should be NULL, or a pointer to a pointer which
* will be set to db->db_data when you are allowed to access it. Note
* that db->db_data (the pointer) can change when you do dmu_buf_read(),
* dmu_buf_tryupgrade(), dmu_buf_will_dirty(), or dmu_buf_will_fill().
* *user_data_ptr_ptr will be set to the new value when it changes.
*
* If non-NULL, pageout func will be called when this buffer is being
* excised from the cache, so that you can clean up the data structure
* pointed to by user_ptr.
*
 * dmu_evict_user() will call the pageout func for all buffers in an
 * objset with a given pageout func.
*/
void *dmu_buf_set_user(dmu_buf_t *db, void *user_ptr, void *user_data_ptr_ptr,
dmu_buf_evict_func_t *pageout_func);
/*
* set_user_ie is the same as set_user, but request immediate eviction
* when hold count goes to zero.
*/
void *dmu_buf_set_user_ie(dmu_buf_t *db, void *user_ptr,
void *user_data_ptr_ptr, dmu_buf_evict_func_t *pageout_func);
void *dmu_buf_update_user(dmu_buf_t *db_fake, void *old_user_ptr,
void *user_ptr, void *user_data_ptr_ptr,
dmu_buf_evict_func_t *pageout_func);
void dmu_evict_user(objset_t *os, dmu_buf_evict_func_t *func);
/*
* Returns the user_ptr set with dmu_buf_set_user(), or NULL if not set.
*/
void *dmu_buf_get_user(dmu_buf_t *db);
/*
* Indicate that you are going to modify the buffer's data (db_data).
*
* The transaction (tx) must be assigned to a txg (ie. you've called
* dmu_tx_assign()). The buffer's object must be held in the tx
* (ie. you've called dmu_tx_hold_object(tx, db->db_object)).
*/
void dmu_buf_will_dirty(dmu_buf_t *db, dmu_tx_t *tx);
/*
* Tells if the given dbuf is freeable.
*/
boolean_t dmu_buf_freeable(dmu_buf_t *);
/*
* You must create a transaction, then hold the objects which you will
* (or might) modify as part of this transaction. Then you must assign
* the transaction to a transaction group. Once the transaction has
* been assigned, you can modify buffers which belong to held objects as
* part of this transaction. You can't modify buffers before the
* transaction has been assigned; you can't modify buffers which don't
* belong to objects which this transaction holds; you can't hold
* objects once the transaction has been assigned. You may hold an
* object which you are going to free (with dmu_object_free()), but you
* don't have to.
*
* You can abort the transaction before it has been assigned.
*
* Note that you may hold buffers (with dmu_buf_hold) at any time,
* regardless of transaction state.
*/
#define DMU_NEW_OBJECT (-1ULL)
#define DMU_OBJECT_END (-1ULL)
dmu_tx_t *dmu_tx_create(objset_t *os);
void dmu_tx_hold_write(dmu_tx_t *tx, uint64_t object, uint64_t off, int len);
void dmu_tx_hold_free(dmu_tx_t *tx, uint64_t object, uint64_t off,
uint64_t len);
void dmu_tx_hold_zap(dmu_tx_t *tx, uint64_t object, int add, const char *name);
void dmu_tx_hold_bonus(dmu_tx_t *tx, uint64_t object);
void dmu_tx_hold_spill(dmu_tx_t *tx, uint64_t object);
void dmu_tx_hold_sa(dmu_tx_t *tx, struct sa_handle *hdl, boolean_t may_grow);
void dmu_tx_hold_sa_create(dmu_tx_t *tx, int total_size);
void dmu_tx_abort(dmu_tx_t *tx);
int dmu_tx_assign(dmu_tx_t *tx, uint64_t txg_how);
void dmu_tx_wait(dmu_tx_t *tx);
void dmu_tx_commit(dmu_tx_t *tx);
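/*
 * The transaction lifecycle above, as a minimal sketch (assumes "os",
 * "object", and "buf"; TXG_WAIT comes from sys/txg.h):
 *
 *    dmu_tx_t *tx = dmu_tx_create(os);
 *    dmu_tx_hold_write(tx, object, 0, 512);
 *    int err = dmu_tx_assign(tx, TXG_WAIT);
 *    if (err != 0) {
 *            dmu_tx_abort(tx);
 *    } else {
 *            dmu_write(os, object, 0, 512, buf, tx);
 *            dmu_tx_commit(tx);
 *    }
 */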
/*
* To register a commit callback, dmu_tx_callback_register() must be called.
*
* dcb_data is a pointer to caller private data that is passed on as a
* callback parameter. The caller is responsible for properly allocating and
* freeing it.
*
* When registering a callback, the transaction must be already created, but
* it cannot be committed or aborted. It can be assigned to a txg or not.
*
* The callback will be called after the transaction has been safely written
* to stable storage and will also be called if the dmu_tx is aborted.
* If there is any error which prevents the transaction from being committed to
* disk, the callback will be called with a value of error != 0.
*/
typedef void dmu_tx_callback_func_t(void *dcb_data, int error);
void dmu_tx_callback_register(dmu_tx_t *tx, dmu_tx_callback_func_t *dcb_func,
void *dcb_data);
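/*
 * A minimal registration sketch (the callback name and argument are
 * hypothetical; the callback fires once the txg reaches stable storage,
 * or with error != 0 if the tx is aborted):
 *
 *    static void
 *    my_done(void *dcb_data, int error)
 *    {
 *            if (error == 0)
 *                    ... the write is durable ...
 *    }
 *
 *    dmu_tx_callback_register(tx, my_done, my_arg);
 */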
/*
* Free up the data blocks for a defined range of a file. If size is
* zero, the range from offset to end-of-file is freed.
*/
int dmu_free_range(objset_t *os, uint64_t object, uint64_t offset,
uint64_t size, dmu_tx_t *tx);
int dmu_free_long_range(objset_t *os, uint64_t object, uint64_t offset,
uint64_t size);
int dmu_free_object(objset_t *os, uint64_t object);
/*
* Convenience functions.
*
* Canfail routines will return 0 on success, or an errno if there is a
* nonrecoverable I/O error.
*/
#define DMU_READ_PREFETCH 0 /* prefetch */
#define DMU_READ_NO_PREFETCH 1 /* don't prefetch */
int dmu_read(objset_t *os, uint64_t object, uint64_t offset, uint64_t size,
void *buf, uint32_t flags);
void dmu_write(objset_t *os, uint64_t object, uint64_t offset, uint64_t size,
const void *buf, dmu_tx_t *tx);
void dmu_prealloc(objset_t *os, uint64_t object, uint64_t offset, uint64_t size,
dmu_tx_t *tx);
int dmu_read_uio(objset_t *os, uint64_t object, struct uio *uio, uint64_t size);
int dmu_write_uio(objset_t *os, uint64_t object, struct uio *uio, uint64_t size,
dmu_tx_t *tx);
int dmu_write_uio_dbuf(dmu_buf_t *zdb, struct uio *uio, uint64_t size,
dmu_tx_t *tx);
int dmu_write_pages(objset_t *os, uint64_t object, uint64_t offset,
uint64_t size, struct page *pp, dmu_tx_t *tx);
struct arc_buf *dmu_request_arcbuf(dmu_buf_t *handle, int size);
void dmu_return_arcbuf(struct arc_buf *buf);
void dmu_assign_arcbuf(dmu_buf_t *handle, uint64_t offset, struct arc_buf *buf,
dmu_tx_t *tx);
int dmu_xuio_init(struct xuio *uio, int niov);
void dmu_xuio_fini(struct xuio *uio);
int dmu_xuio_add(struct xuio *uio, struct arc_buf *abuf, offset_t off,
size_t n);
int dmu_xuio_cnt(struct xuio *uio);
struct arc_buf *dmu_xuio_arcbuf(struct xuio *uio, int i);
void dmu_xuio_clear(struct xuio *uio, int i);
void xuio_stat_wbuf_copied();
void xuio_stat_wbuf_nocopy();
extern int zfs_prefetch_disable;
/*
* Asynchronously try to read in the data.
*/
void dmu_prefetch(objset_t *os, uint64_t object, uint64_t offset,
uint64_t len);
typedef struct dmu_object_info {
/* All sizes are in bytes unless otherwise indicated. */
uint32_t doi_data_block_size;
uint32_t doi_metadata_block_size;
dmu_object_type_t doi_type;
dmu_object_type_t doi_bonus_type;
uint64_t doi_bonus_size;
uint8_t doi_indirection; /* 2 = dnode->indirect->data */
uint8_t doi_checksum;
uint8_t doi_compress;
uint8_t doi_pad[5];
uint64_t doi_physical_blocks_512; /* data + metadata, 512b blks */
uint64_t doi_max_offset;
uint64_t doi_fill_count; /* number of non-empty blocks */
} dmu_object_info_t;
typedef void arc_byteswap_func_t(void *buf, size_t size);
typedef struct dmu_object_type_info {
arc_byteswap_func_t *ot_byteswap;
boolean_t ot_metadata;
char *ot_name;
} dmu_object_type_info_t;
extern const dmu_object_type_info_t dmu_ot[DMU_OT_NUMTYPES];
/*
* Get information on a DMU object.
*
* Return 0 on success or ENOENT if object is not allocated.
*
* If doi is NULL, just indicates whether the object exists.
*/
int dmu_object_info(objset_t *os, uint64_t object, dmu_object_info_t *doi);
void dmu_object_info_from_dnode(struct dnode *dn, dmu_object_info_t *doi);
void dmu_object_info_from_db(dmu_buf_t *db, dmu_object_info_t *doi);
void dmu_object_size_from_db(dmu_buf_t *db, uint32_t *blksize,
u_longlong_t *nblk512);
typedef struct dmu_objset_stats {
uint64_t dds_num_clones; /* number of clones of this */
uint64_t dds_creation_txg;
uint64_t dds_guid;
dmu_objset_type_t dds_type;
uint8_t dds_is_snapshot;
uint8_t dds_inconsistent;
char dds_origin[MAXNAMELEN];
} dmu_objset_stats_t;
/*
* Get stats on a dataset.
*/
void dmu_objset_fast_stat(objset_t *os, dmu_objset_stats_t *stat);
/*
* Add entries to the nvlist for all the objset's properties. See
* zfs_prop_table[] and zfs(1m) for details on the properties.
*/
void dmu_objset_stats(objset_t *os, struct nvlist *nv);
/*
* Get the space usage statistics for statvfs().
*
* refdbytes is the amount of space "referenced" by this objset.
* availbytes is the amount of space available to this objset, taking
* into account quotas & reservations, assuming that no other objsets
* use the space first. These values correspond to the 'referenced' and
* 'available' properties, described in the zfs(1m) manpage.
*
* usedobjs and availobjs are the number of objects currently allocated,
* and available.
*/
void dmu_objset_space(objset_t *os, uint64_t *refdbytesp, uint64_t *availbytesp,
uint64_t *usedobjsp, uint64_t *availobjsp);
/*
* The fsid_guid is a 56-bit ID that can change to avoid collisions.
* (Contrast with the ds_guid which is a 64-bit ID that will never
* change, so there is a small probability that it will collide.)
*/
uint64_t dmu_objset_fsid_guid(objset_t *os);
/*
* Get the [cm]time for an objset's snapshot dir
*/
timestruc_t dmu_objset_snap_cmtime(objset_t *os);
int dmu_objset_is_snapshot(objset_t *os);
extern struct spa *dmu_objset_spa(objset_t *os);
extern struct zilog *dmu_objset_zil(objset_t *os);
extern struct dsl_pool *dmu_objset_pool(objset_t *os);
extern struct dsl_dataset *dmu_objset_ds(objset_t *os);
extern void dmu_objset_name(objset_t *os, char *buf);
extern dmu_objset_type_t dmu_objset_type(objset_t *os);
extern uint64_t dmu_objset_id(objset_t *os);
extern uint64_t dmu_objset_syncprop(objset_t *os);
extern uint64_t dmu_objset_logbias(objset_t *os);
extern int dmu_snapshot_list_next(objset_t *os, int namelen, char *name,
uint64_t *id, uint64_t *offp, boolean_t *case_conflict);
extern int dmu_snapshot_realname(objset_t *os, char *name, char *real,
int maxlen, boolean_t *conflict);
extern int dmu_dir_list_next(objset_t *os, int namelen, char *name,
uint64_t *idp, uint64_t *offp);
typedef int objset_used_cb_t(dmu_object_type_t bonustype,
void *bonus, uint64_t *userp, uint64_t *groupp);
extern void dmu_objset_register_type(dmu_objset_type_t ost,
objset_used_cb_t *cb);
extern void dmu_objset_set_user(objset_t *os, void *user_ptr);
extern void *dmu_objset_get_user(objset_t *os);
/*
* Return the txg number for the given assigned transaction.
*/
uint64_t dmu_tx_get_txg(dmu_tx_t *tx);
/*
* Synchronous write.
* If a parent zio is provided this function initiates a write on the
* provided buffer as a child of the parent zio.
* In the absence of a parent zio, the write is completed synchronously.
* At write completion, blk is filled with the bp of the written block.
* Note that while the data covered by this function will be on stable
* storage when the write completes this new data does not become a
* permanent part of the file until the associated transaction commits.
*/
/*
* {zfs,zvol,ztest}_get_done() args
*/
typedef struct zgd {
struct zilog *zgd_zilog;
struct blkptr *zgd_bp;
dmu_buf_t *zgd_db;
struct rl *zgd_rl;
void *zgd_private;
} zgd_t;
typedef void dmu_sync_cb_t(zgd_t *arg, int error);
int dmu_sync(struct zio *zio, uint64_t txg, dmu_sync_cb_t *done, zgd_t *zgd);
/*
* Find the next hole or data block in file starting at *off
* Return found offset in *off. Return ESRCH for end of file.
*/
int dmu_offset_next(objset_t *os, uint64_t object, boolean_t hole,
uint64_t *off);
/*
* Initial setup and final teardown.
*/
extern void dmu_init(void);
extern void dmu_fini(void);
typedef void (*dmu_traverse_cb_t)(objset_t *os, void *arg, struct blkptr *bp,
uint64_t object, uint64_t offset, int len);
void dmu_traverse_objset(objset_t *os, uint64_t txg_start,
dmu_traverse_cb_t cb, void *arg);
int dmu_sendbackup(objset_t *tosnap, objset_t *fromsnap, boolean_t fromorigin,
struct vnode *vp, offset_t *off);
typedef struct dmu_recv_cookie {
/*
* This structure is opaque!
*
 * If logical and real are different, we are receiving the stream
* into the "real" temporary clone, and then switching it with
* the "logical" target.
*/
struct dsl_dataset *drc_logical_ds;
struct dsl_dataset *drc_real_ds;
struct drr_begin *drc_drrb;
char *drc_tosnap;
char *drc_top_ds;
boolean_t drc_newfs;
boolean_t drc_force;
} dmu_recv_cookie_t;
int dmu_recv_begin(char *tofs, char *tosnap, char *topds, struct drr_begin *,
boolean_t force, objset_t *origin, dmu_recv_cookie_t *);
int dmu_recv_stream(dmu_recv_cookie_t *drc, struct vnode *vp, offset_t *voffp,
int cleanup_fd, uint64_t *action_handlep);
int dmu_recv_end(dmu_recv_cookie_t *drc);
int dmu_diff(objset_t *tosnap, objset_t *fromsnap, struct vnode *vp,
offset_t *off);
/* CRC64 table */
#define ZFS_CRC64_POLY 0xC96C5795D7870F42ULL /* ECMA-182, reflected form */
extern uint64_t zfs_crc64_table[256];
#ifdef __cplusplus
}
#endif
#endif /* _SYS_DMU_H */

uts/common/fs/zfs/sys/dmu_impl.h Normal file

@ -0,0 +1,272 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2010 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#ifndef _SYS_DMU_IMPL_H
#define _SYS_DMU_IMPL_H
#include <sys/txg_impl.h>
#include <sys/zio.h>
#include <sys/dnode.h>
#include <sys/zfs_context.h>
#ifdef __cplusplus
extern "C" {
#endif
/*
 * This is the locking strategy for the DMU.  Numbers in parentheses
 * are cases that use that lock order, referenced below:
*
* ARC is self-contained
* bplist is self-contained
* refcount is self-contained
* txg is self-contained (hopefully!)
* zst_lock
* zf_rwlock
*
* XXX try to improve evicting path?
*
* dp_config_rwlock > os_obj_lock > dn_struct_rwlock >
* dn_dbufs_mtx > hash_mutexes > db_mtx > dd_lock > leafs
*
* dp_config_rwlock
* must be held before: everything
* protects dd namespace changes
* protects property changes globally
* held from:
* dsl_dir_open/r:
* dsl_dir_create_sync/w:
* dsl_dir_sync_destroy/w:
* dsl_dir_rename_sync/w:
* dsl_prop_changed_notify/r:
*
* os_obj_lock
* must be held before:
* everything except dp_config_rwlock
* protects os_obj_next
* held from:
* dmu_object_alloc: dn_dbufs_mtx, db_mtx, hash_mutexes, dn_struct_rwlock
*
* dn_struct_rwlock
* must be held before:
* everything except dp_config_rwlock and os_obj_lock
* protects structure of dnode (eg. nlevels)
* db_blkptr can change when syncing out change to nlevels
* dn_maxblkid
* dn_nlevels
* dn_*blksz*
* phys nlevels, maxblkid, physical blkptr_t's (?)
* held from:
* callers of dbuf_read_impl, dbuf_hold[_impl], dbuf_prefetch
* dmu_object_info_from_dnode: dn_dirty_mtx (dn_datablksz)
* dmu_tx_count_free:
* dbuf_read_impl: db_mtx, dmu_zfetch()
* dmu_zfetch: zf_rwlock/r, zst_lock, dbuf_prefetch()
* dbuf_new_size: db_mtx
* dbuf_dirty: db_mtx
* dbuf_findbp: (callers, phys? - the real need)
* dbuf_create: dn_dbufs_mtx, hash_mutexes, db_mtx (phys?)
* dbuf_prefetch: dn_dirty_mtx, hash_mutexes, db_mtx, dn_dbufs_mtx
* dbuf_hold_impl: hash_mutexes, db_mtx, dn_dbufs_mtx, dbuf_findbp()
* dnode_sync/w (increase_indirection): db_mtx (phys)
* dnode_set_blksz/w: dn_dbufs_mtx (dn_*blksz*)
* dnode_new_blkid/w: (dn_maxblkid)
* dnode_free_range/w: dn_dirty_mtx (dn_maxblkid)
* dnode_next_offset: (phys)
*
* dn_dbufs_mtx
* must be held before:
* db_mtx, hash_mutexes
* protects:
* dn_dbufs
* dn_evicted
* held from:
* dmu_evict_user: db_mtx (dn_dbufs)
* dbuf_free_range: db_mtx (dn_dbufs)
* dbuf_remove_ref: db_mtx, callees:
* dbuf_hash_remove: hash_mutexes, db_mtx
* dbuf_create: hash_mutexes, db_mtx (dn_dbufs)
* dnode_set_blksz: (dn_dbufs)
*
* hash_mutexes (global)
* must be held before:
* db_mtx
* protects dbuf_hash_table (global) and db_hash_next
* held from:
* dbuf_find: db_mtx
* dbuf_hash_insert: db_mtx
* dbuf_hash_remove: db_mtx
*
* db_mtx (meta-leaf)
* must be held before:
* dn_mtx, dn_dirty_mtx, dd_lock (leaf mutexes)
* protects:
* db_state
* db_holds
* db_buf
* db_changed
* db_data_pending
* db_dirtied
* db_link
* db_dirty_node (??)
* db_dirtycnt
* db_d.*
* db.*
* held from:
* dbuf_dirty: dn_mtx, dn_dirty_mtx
* dbuf_dirty->dsl_dir_willuse_space: dd_lock
* dbuf_dirty->dbuf_new_block->dsl_dataset_block_freeable: dd_lock
* dbuf_undirty: dn_dirty_mtx (db_d)
* dbuf_write_done: dn_dirty_mtx (db_state)
* dbuf_*
* dmu_buf_update_user: none (db_d)
* dmu_evict_user: none (db_d) (maybe can eliminate)
* dbuf_find: none (db_holds)
* dbuf_hash_insert: none (db_holds)
* dmu_buf_read_array_impl: none (db_state, db_changed)
* dmu_sync: none (db_dirty_node, db_d)
* dnode_reallocate: none (db)
*
* dn_mtx (leaf)
* protects:
* dn_dirty_dbufs
* dn_ranges
* phys accounting
* dn_allocated_txg
* dn_free_txg
* dn_assigned_txg
* dd_assigned_tx
* dn_notxholds
* dn_dirtyctx
* dn_dirtyctx_firstset
* (dn_phys copy fields?)
* (dn_phys contents?)
* held from:
* dnode_*
* dbuf_dirty: none
* dbuf_sync: none (phys accounting)
* dbuf_undirty: none (dn_ranges, dn_dirty_dbufs)
* dbuf_write_done: none (phys accounting)
* dmu_object_info_from_dnode: none (accounting)
* dmu_tx_commit: none
* dmu_tx_hold_object_impl: none
* dmu_tx_try_assign: dn_notxholds(cv)
* dmu_tx_unassign: none
*
* dd_lock
* must be held before:
* ds_lock
* ancestors' dd_lock
* protects:
* dd_prop_cbs
* dd_sync_*
* dd_used_bytes
* dd_tempreserved
* dd_space_towrite
* dd_myname
* dd_phys accounting?
* held from:
* dsl_dir_*
* dsl_prop_changed_notify: none (dd_prop_cbs)
* dsl_prop_register: none (dd_prop_cbs)
* dsl_prop_unregister: none (dd_prop_cbs)
* dsl_dataset_block_freeable: none (dd_sync_*)
*
* os_lock (leaf)
* protects:
* os_dirty_dnodes
* os_free_dnodes
* os_dnodes
* os_downgraded_dbufs
* dn_dirtyblksz
* dn_dirty_link
* held from:
* dnode_create: none (os_dnodes)
* dnode_destroy: none (os_dnodes)
* dnode_setdirty: none (dn_dirtyblksz, os_*_dnodes)
* dnode_free: none (dn_dirtyblksz, os_*_dnodes)
*
* ds_lock
* protects:
* ds_objset
* ds_open_refcount
* ds_snapname
* ds_phys accounting
* ds_phys userrefs zapobj
* ds_reserved
* held from:
* dsl_dataset_*
*
* dr_mtx (leaf)
* protects:
* dr_children
* held from:
* dbuf_dirty
* dbuf_undirty
* dbuf_sync_indirect
* dnode_new_blkid
*/
struct objset;
struct dmu_pool;
typedef struct dmu_xuio {
int next;
int cnt;
struct arc_buf **bufs;
iovec_t *iovp;
} dmu_xuio_t;
typedef struct xuio_stats {
/* loaned yet not returned arc_buf */
kstat_named_t xuiostat_onloan_rbuf;
kstat_named_t xuiostat_onloan_wbuf;
/* whether a copy is made when loaning out a read buffer */
kstat_named_t xuiostat_rbuf_copied;
kstat_named_t xuiostat_rbuf_nocopy;
/* whether a copy is made when assigning a write buffer */
kstat_named_t xuiostat_wbuf_copied;
kstat_named_t xuiostat_wbuf_nocopy;
} xuio_stats_t;
static xuio_stats_t xuio_stats = {
{ "onloan_read_buf", KSTAT_DATA_UINT64 },
{ "onloan_write_buf", KSTAT_DATA_UINT64 },
{ "read_buf_copied", KSTAT_DATA_UINT64 },
{ "read_buf_nocopy", KSTAT_DATA_UINT64 },
{ "write_buf_copied", KSTAT_DATA_UINT64 },
{ "write_buf_nocopy", KSTAT_DATA_UINT64 }
};
#define XUIOSTAT_INCR(stat, val) \
atomic_add_64(&xuio_stats.stat.value.ui64, (val))
#define XUIOSTAT_BUMP(stat) XUIOSTAT_INCR(stat, 1)
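/*
 * Usage sketch: each bump is a single atomic 64-bit add, e.g.
 *
 *    XUIOSTAT_BUMP(xuiostat_wbuf_copied);
 */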
#ifdef __cplusplus
}
#endif
#endif /* _SYS_DMU_IMPL_H */

uts/common/fs/zfs/sys/dmu_objset.h Normal file

@ -0,0 +1,183 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
*/
/* Portions Copyright 2010 Robert Milkowski */
#ifndef _SYS_DMU_OBJSET_H
#define _SYS_DMU_OBJSET_H
#include <sys/spa.h>
#include <sys/arc.h>
#include <sys/txg.h>
#include <sys/zfs_context.h>
#include <sys/dnode.h>
#include <sys/zio.h>
#include <sys/zil.h>
#include <sys/sa.h>
#ifdef __cplusplus
extern "C" {
#endif
extern krwlock_t os_lock;
struct dsl_dataset;
struct dmu_tx;
#define OBJSET_PHYS_SIZE 2048
#define OBJSET_OLD_PHYS_SIZE 1024
#define OBJSET_BUF_HAS_USERUSED(buf) \
(arc_buf_size(buf) > OBJSET_OLD_PHYS_SIZE)
#define OBJSET_FLAG_USERACCOUNTING_COMPLETE (1ULL<<0)
typedef struct objset_phys {
dnode_phys_t os_meta_dnode;
zil_header_t os_zil_header;
uint64_t os_type;
uint64_t os_flags;
char os_pad[OBJSET_PHYS_SIZE - sizeof (dnode_phys_t)*3 -
sizeof (zil_header_t) - sizeof (uint64_t)*2];
dnode_phys_t os_userused_dnode;
dnode_phys_t os_groupused_dnode;
} objset_phys_t;
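/*
 * A sanity-check sketch for the layout above: os_pad is sized so the
 * two trailing dnodes end exactly at the 2K boundary (CTASSERT is the
 * usual compile-time assertion in this code base):
 *
 *    CTASSERT(sizeof (objset_phys_t) == OBJSET_PHYS_SIZE);
 */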
struct objset {
/* Immutable: */
struct dsl_dataset *os_dsl_dataset;
spa_t *os_spa;
arc_buf_t *os_phys_buf;
objset_phys_t *os_phys;
/*
* The following "special" dnodes have no parent and are exempt from
 * dnode_move(), but they root their descendants in this objset using
* handles anyway, so that all access to dnodes from dbufs consistently
* uses handles.
*/
dnode_handle_t os_meta_dnode;
dnode_handle_t os_userused_dnode;
dnode_handle_t os_groupused_dnode;
zilog_t *os_zil;
/* can change, under dsl_dir's locks: */
uint8_t os_checksum;
uint8_t os_compress;
uint8_t os_copies;
uint8_t os_dedup_checksum;
uint8_t os_dedup_verify;
uint8_t os_logbias;
uint8_t os_primary_cache;
uint8_t os_secondary_cache;
uint8_t os_sync;
/* no lock needed: */
struct dmu_tx *os_synctx; /* XXX sketchy */
blkptr_t *os_rootbp;
zil_header_t os_zil_header;
list_t os_synced_dnodes;
uint64_t os_flags;
/* Protected by os_obj_lock */
kmutex_t os_obj_lock;
uint64_t os_obj_next;
/* Protected by os_lock */
kmutex_t os_lock;
list_t os_dirty_dnodes[TXG_SIZE];
list_t os_free_dnodes[TXG_SIZE];
list_t os_dnodes;
list_t os_downgraded_dbufs;
/* stuff we store for the user */
kmutex_t os_user_ptr_lock;
void *os_user_ptr;
/* SA layout/attribute registration */
sa_os_t *os_sa;
};
#define DMU_META_OBJSET 0
#define DMU_META_DNODE_OBJECT 0
#define DMU_OBJECT_IS_SPECIAL(obj) ((int64_t)(obj) <= 0)
#define DMU_META_DNODE(os) ((os)->os_meta_dnode.dnh_dnode)
#define DMU_USERUSED_DNODE(os) ((os)->os_userused_dnode.dnh_dnode)
#define DMU_GROUPUSED_DNODE(os) ((os)->os_groupused_dnode.dnh_dnode)
#define DMU_OS_IS_L2CACHEABLE(os) \
((os)->os_secondary_cache == ZFS_CACHE_ALL || \
(os)->os_secondary_cache == ZFS_CACHE_METADATA)
/* called from zpl */
int dmu_objset_hold(const char *name, void *tag, objset_t **osp);
int dmu_objset_own(const char *name, dmu_objset_type_t type,
boolean_t readonly, void *tag, objset_t **osp);
void dmu_objset_rele(objset_t *os, void *tag);
void dmu_objset_disown(objset_t *os, void *tag);
int dmu_objset_from_ds(struct dsl_dataset *ds, objset_t **osp);
int dmu_objset_create(const char *name, dmu_objset_type_t type, uint64_t flags,
void (*func)(objset_t *os, void *arg, cred_t *cr, dmu_tx_t *tx), void *arg);
int dmu_objset_clone(const char *name, struct dsl_dataset *clone_origin,
uint64_t flags);
int dmu_objset_destroy(const char *name, boolean_t defer);
int dmu_objset_snapshot(char *fsname, char *snapname, char *tag,
struct nvlist *props, boolean_t recursive, boolean_t temporary, int fd);
void dmu_objset_stats(objset_t *os, nvlist_t *nv);
void dmu_objset_fast_stat(objset_t *os, dmu_objset_stats_t *stat);
void dmu_objset_space(objset_t *os, uint64_t *refdbytesp, uint64_t *availbytesp,
uint64_t *usedobjsp, uint64_t *availobjsp);
uint64_t dmu_objset_fsid_guid(objset_t *os);
int dmu_objset_find(char *name, int func(const char *, void *), void *arg,
int flags);
int dmu_objset_find_spa(spa_t *spa, const char *name,
int func(spa_t *, uint64_t, const char *, void *), void *arg, int flags);
int dmu_objset_prefetch(const char *name, void *arg);
void dmu_objset_byteswap(void *buf, size_t size);
int dmu_objset_evict_dbufs(objset_t *os);
timestruc_t dmu_objset_snap_cmtime(objset_t *os);
/* called from dsl */
void dmu_objset_sync(objset_t *os, zio_t *zio, dmu_tx_t *tx);
boolean_t dmu_objset_is_dirty(objset_t *os, uint64_t txg);
boolean_t dmu_objset_is_dirty_anywhere(objset_t *os);
objset_t *dmu_objset_create_impl(spa_t *spa, struct dsl_dataset *ds,
blkptr_t *bp, dmu_objset_type_t type, dmu_tx_t *tx);
int dmu_objset_open_impl(spa_t *spa, struct dsl_dataset *ds, blkptr_t *bp,
objset_t **osp);
void dmu_objset_evict(objset_t *os);
void dmu_objset_do_userquota_updates(objset_t *os, dmu_tx_t *tx);
void dmu_objset_userquota_get_ids(dnode_t *dn, boolean_t before, dmu_tx_t *tx);
boolean_t dmu_objset_userused_enabled(objset_t *os);
int dmu_objset_userspace_upgrade(objset_t *os);
boolean_t dmu_objset_userspace_present(objset_t *os);
void dmu_objset_init(void);
void dmu_objset_fini(void);
#ifdef __cplusplus
}
#endif
#endif /* _SYS_DMU_OBJSET_H */

uts/common/fs/zfs/sys/dmu_traverse.h Normal file

@ -0,0 +1,64 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#ifndef _SYS_DMU_TRAVERSE_H
#define _SYS_DMU_TRAVERSE_H
#include <sys/zfs_context.h>
#include <sys/spa.h>
#include <sys/zio.h>
#ifdef __cplusplus
extern "C" {
#endif
struct dnode_phys;
struct dsl_dataset;
struct zilog;
struct arc_buf;
typedef int (blkptr_cb_t)(spa_t *spa, zilog_t *zilog, const blkptr_t *bp,
struct arc_buf *pbuf, const zbookmark_t *zb, const struct dnode_phys *dnp,
void *arg);
#define TRAVERSE_PRE (1<<0)
#define TRAVERSE_POST (1<<1)
#define TRAVERSE_PREFETCH_METADATA (1<<2)
#define TRAVERSE_PREFETCH_DATA (1<<3)
#define TRAVERSE_PREFETCH (TRAVERSE_PREFETCH_METADATA | TRAVERSE_PREFETCH_DATA)
#define TRAVERSE_HARD (1<<4)
/* Special traverse error return value to indicate skipping of children */
#define TRAVERSE_VISIT_NO_CHILDREN -1
int traverse_dataset(struct dsl_dataset *ds,
uint64_t txg_start, int flags, blkptr_cb_t func, void *arg);
int traverse_pool(spa_t *spa,
uint64_t txg_start, int flags, blkptr_cb_t func, void *arg);
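/*
 * A minimal traversal sketch (the callback name is hypothetical;
 * returning TRAVERSE_VISIT_NO_CHILDREN from a TRAVERSE_PRE visit is the
 * documented way to prune a subtree):
 *
 *    static int
 *    visit_bp(spa_t *spa, zilog_t *zilog, const blkptr_t *bp,
 *        struct arc_buf *pbuf, const zbookmark_t *zb,
 *        const struct dnode_phys *dnp, void *arg)
 *    {
 *            return (0);
 *    }
 *
 *    (void) traverse_dataset(ds, 0, TRAVERSE_PRE | TRAVERSE_PREFETCH,
 *        visit_bp, NULL);
 */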
#ifdef __cplusplus
}
#endif
#endif /* _SYS_DMU_TRAVERSE_H */

uts/common/fs/zfs/sys/dmu_tx.h Normal file

@ -0,0 +1,148 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2010 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#ifndef _SYS_DMU_TX_H
#define _SYS_DMU_TX_H
#include <sys/inttypes.h>
#include <sys/dmu.h>
#include <sys/txg.h>
#include <sys/refcount.h>
#ifdef __cplusplus
extern "C" {
#endif
struct dmu_buf_impl;
struct dmu_tx_hold;
struct dnode_link;
struct dsl_pool;
struct dnode;
struct dsl_dir;
struct dmu_tx {
/*
* No synchronization is needed because a tx can only be handled
* by one thread.
*/
list_t tx_holds; /* list of dmu_tx_hold_t */
objset_t *tx_objset;
struct dsl_dir *tx_dir;
struct dsl_pool *tx_pool;
uint64_t tx_txg;
uint64_t tx_lastsnap_txg;
uint64_t tx_lasttried_txg;
txg_handle_t tx_txgh;
void *tx_tempreserve_cookie;
struct dmu_tx_hold *tx_needassign_txh;
list_t tx_callbacks; /* list of dmu_tx_callback_t on this dmu_tx */
uint8_t tx_anyobj;
int tx_err;
#ifdef ZFS_DEBUG
uint64_t tx_space_towrite;
uint64_t tx_space_tofree;
uint64_t tx_space_tooverwrite;
uint64_t tx_space_tounref;
refcount_t tx_space_written;
refcount_t tx_space_freed;
#endif
};
enum dmu_tx_hold_type {
THT_NEWOBJECT,
THT_WRITE,
THT_BONUS,
THT_FREE,
THT_ZAP,
THT_SPACE,
THT_SPILL,
THT_NUMTYPES
};
typedef struct dmu_tx_hold {
dmu_tx_t *txh_tx;
list_node_t txh_node;
struct dnode *txh_dnode;
uint64_t txh_space_towrite;
uint64_t txh_space_tofree;
uint64_t txh_space_tooverwrite;
uint64_t txh_space_tounref;
uint64_t txh_memory_tohold;
uint64_t txh_fudge;
#ifdef ZFS_DEBUG
enum dmu_tx_hold_type txh_type;
uint64_t txh_arg1;
uint64_t txh_arg2;
#endif
} dmu_tx_hold_t;
typedef struct dmu_tx_callback {
list_node_t dcb_node; /* linked to tx_callbacks list */
dmu_tx_callback_func_t *dcb_func; /* caller function pointer */
void *dcb_data; /* caller private data */
} dmu_tx_callback_t;
/*
* These routines are defined in dmu.h, and are called by the user.
*/
dmu_tx_t *dmu_tx_create(objset_t *dd);
int dmu_tx_assign(dmu_tx_t *tx, uint64_t txg_how);
void dmu_tx_commit(dmu_tx_t *tx);
void dmu_tx_abort(dmu_tx_t *tx);
uint64_t dmu_tx_get_txg(dmu_tx_t *tx);
void dmu_tx_wait(dmu_tx_t *tx);
void dmu_tx_callback_register(dmu_tx_t *tx, dmu_tx_callback_func_t *dcb_func,
void *dcb_data);
void dmu_tx_do_callbacks(list_t *cb_list, int error);
/*
* These routines are defined in dmu_spa.h, and are called by the SPA.
*/
extern dmu_tx_t *dmu_tx_create_assigned(struct dsl_pool *dp, uint64_t txg);
/*
* These routines are only called by the DMU.
*/
dmu_tx_t *dmu_tx_create_dd(dsl_dir_t *dd);
int dmu_tx_is_syncing(dmu_tx_t *tx);
int dmu_tx_private_ok(dmu_tx_t *tx);
void dmu_tx_add_new_object(dmu_tx_t *tx, objset_t *os, uint64_t object);
void dmu_tx_willuse_space(dmu_tx_t *tx, int64_t delta);
void dmu_tx_dirty_buf(dmu_tx_t *tx, struct dmu_buf_impl *db);
int dmu_tx_holds(dmu_tx_t *tx, uint64_t object);
void dmu_tx_hold_space(dmu_tx_t *tx, uint64_t space);
#ifdef ZFS_DEBUG
#define DMU_TX_DIRTY_BUF(tx, db) dmu_tx_dirty_buf(tx, db)
#else
#define DMU_TX_DIRTY_BUF(tx, db)
#endif
#ifdef __cplusplus
}
#endif
#endif /* _SYS_DMU_TX_H */
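
A hedged sketch of the transaction lifecycle these declarations imply: create, declare holds, assign, then commit, retrying on ERESTART. It assumes dmu_tx_hold_write() from dmu.h and TXG_NOWAIT from txg.h; the wrapper itself is illustrative.

#include <sys/dmu.h>
#include <sys/dmu_tx.h>

/* hypothetical wrapper showing the assign/commit idiom */
static int
write_in_tx(objset_t *os, uint64_t object, uint64_t off, int len)
{
	dmu_tx_t *tx;
	int err;

top:
	tx = dmu_tx_create(os);
	dmu_tx_hold_write(tx, object, off, len);	/* declared in dmu.h */
	err = dmu_tx_assign(tx, TXG_NOWAIT);
	if (err != 0) {
		if (err == ERESTART) {
			/* the txg is full or throttled: wait and retry */
			dmu_tx_wait(tx);
			dmu_tx_abort(tx);
			goto top;
		}
		dmu_tx_abort(tx);
		return (err);
	}
	/* ... perform the writes covered by the hold, passing tx ... */
	dmu_tx_commit(tx);
	return (0);
}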

View File

@ -0,0 +1,76 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2009 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#ifndef _DFETCH_H
#define _DFETCH_H
#include <sys/zfs_context.h>
#ifdef __cplusplus
extern "C" {
#endif
extern uint64_t zfetch_array_rd_sz;
struct dnode; /* so we can reference dnode */
typedef enum zfetch_dirn {
ZFETCH_FORWARD = 1, /* prefetch increasing block numbers */
ZFETCH_BACKWARD = -1 /* prefetch decreasing block numbers */
} zfetch_dirn_t;
typedef struct zstream {
uint64_t zst_offset; /* offset of starting block in range */
uint64_t zst_len; /* length of range, in blocks */
zfetch_dirn_t zst_direction; /* direction of prefetch */
uint64_t zst_stride; /* length of stride, in blocks */
uint64_t zst_ph_offset; /* prefetch offset, in blocks */
uint64_t zst_cap; /* prefetch limit (cap), in blocks */
kmutex_t zst_lock; /* protects stream */
clock_t zst_last; /* lbolt of last prefetch */
avl_node_t zst_node; /* embed avl node here */
} zstream_t;
typedef struct zfetch {
krwlock_t zf_rwlock; /* protects zfetch structure */
	list_t		zf_stream;	/* list of zstream_t's */
struct dnode *zf_dnode; /* dnode that owns this zfetch */
uint32_t zf_stream_cnt; /* # of active streams */
uint64_t zf_alloc_fail; /* # of failed attempts to alloc strm */
} zfetch_t;
void zfetch_init(void);
void zfetch_fini(void);
void dmu_zfetch_init(zfetch_t *, struct dnode *);
void dmu_zfetch_rele(zfetch_t *);
void dmu_zfetch(zfetch_t *, uint64_t, uint64_t, int);
#ifdef __cplusplus
}
#endif
#endif /* _DFETCH_H */
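
A rough lifecycle sketch for these prefetch hooks, assuming the dn_zfetch embedding shown in dnode.h later in this change and that the trailing int to dmu_zfetch() flags whether the range was already fetched; both readings are assumptions here.

#include <sys/dnode.h>	/* for the embedded dn_zfetch (see dnode.h below) */

static void
zfetch_lifecycle_sketch(dnode_t *dn, uint64_t offset, uint64_t size)
{
	/* typically paired with dnode setup */
	dmu_zfetch_init(&dn->dn_zfetch, dn);

	/*
	 * On each read: feed the stream detector (final arg assumed to
	 * mean "this range was already fetched").
	 */
	dmu_zfetch(&dn->dn_zfetch, offset, size, 1);

	/* typically paired with dnode teardown */
	dmu_zfetch_rele(&dn->dn_zfetch);
}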

View File

@ -0,0 +1,329 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#ifndef _SYS_DNODE_H
#define _SYS_DNODE_H
#include <sys/zfs_context.h>
#include <sys/avl.h>
#include <sys/spa.h>
#include <sys/txg.h>
#include <sys/zio.h>
#include <sys/refcount.h>
#include <sys/dmu_zfetch.h>
#include <sys/zrlock.h>
#ifdef __cplusplus
extern "C" {
#endif
/*
* dnode_hold() flags.
*/
#define DNODE_MUST_BE_ALLOCATED 1
#define DNODE_MUST_BE_FREE 2
/*
* dnode_next_offset() flags.
*/
#define DNODE_FIND_HOLE 1
#define DNODE_FIND_BACKWARDS 2
#define DNODE_FIND_HAVELOCK 4
/*
* Fixed constants.
*/
#define DNODE_SHIFT 9 /* 512 bytes */
#define DN_MIN_INDBLKSHIFT 10 /* 1k */
#define DN_MAX_INDBLKSHIFT 14 /* 16k */
#define DNODE_BLOCK_SHIFT 14 /* 16k */
#define DNODE_CORE_SIZE 64 /* 64 bytes for dnode sans blkptrs */
#define DN_MAX_OBJECT_SHIFT 48 /* 256 trillion (zfs_fid_t limit) */
#define DN_MAX_OFFSET_SHIFT 64 /* 2^64 bytes in a dnode */
/*
* dnode id flags
*
 * Note: a file's ids will never be moved from the bonus buffer to
 * the spill block; only in a crypto environment would they be kept
 * in the spill block at all.
*/
#define DN_ID_CHKED_BONUS 0x1
#define DN_ID_CHKED_SPILL 0x2
#define DN_ID_OLD_EXIST 0x4
#define DN_ID_NEW_EXIST 0x8
/*
* Derived constants.
*/
#define DNODE_SIZE (1 << DNODE_SHIFT)
#define DN_MAX_NBLKPTR ((DNODE_SIZE - DNODE_CORE_SIZE) >> SPA_BLKPTRSHIFT)
#define DN_MAX_BONUSLEN (DNODE_SIZE - DNODE_CORE_SIZE - (1 << SPA_BLKPTRSHIFT))
#define DN_MAX_OBJECT (1ULL << DN_MAX_OBJECT_SHIFT)
#define DN_ZERO_BONUSLEN (DN_MAX_BONUSLEN + 1)
#define DN_KILL_SPILLBLK (1)
#define DNODES_PER_BLOCK_SHIFT (DNODE_BLOCK_SHIFT - DNODE_SHIFT)
#define DNODES_PER_BLOCK (1ULL << DNODES_PER_BLOCK_SHIFT)
#define DNODES_PER_LEVEL_SHIFT (DN_MAX_INDBLKSHIFT - SPA_BLKPTRSHIFT)
#define DNODES_PER_LEVEL (1ULL << DNODES_PER_LEVEL_SHIFT)
/* The +2 here is a cheesy way to round up */
#define DN_MAX_LEVELS (2 + ((DN_MAX_OFFSET_SHIFT - SPA_MINBLOCKSHIFT) / \
(DN_MIN_INDBLKSHIFT - SPA_BLKPTRSHIFT)))
#define DN_BONUS(dnp) ((void*)((dnp)->dn_bonus + \
(((dnp)->dn_nblkptr - 1) * sizeof (blkptr_t))))
#define DN_USED_BYTES(dnp) (((dnp)->dn_flags & DNODE_FLAG_USED_BYTES) ? \
(dnp)->dn_used : (dnp)->dn_used << SPA_MINBLOCKSHIFT)
#define EPB(blkshift, typeshift) (1 << (blkshift - typeshift))
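/*
 * Editorial note (not in the original header): with SPA_BLKPTRSHIFT == 7,
 * i.e. 128-byte block pointers, the derived constants above work out to:
 *
 *	DNODE_SIZE	= 1 << 9		= 512 bytes
 *	DN_MAX_NBLKPTR	= (512 - 64) >> 7	= 3 block pointers
 *	DN_MAX_BONUSLEN	= 512 - 64 - 128	= 320 bytes
 *
 * So a dnode with one block pointer carries the maximum 320 bytes of
 * bonus space, while one with all three leaves 512 - 64 - 384 = 64 bytes.
 */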
struct dmu_buf_impl;
struct objset;
struct zio;
enum dnode_dirtycontext {
DN_UNDIRTIED,
DN_DIRTY_OPEN,
DN_DIRTY_SYNC
};
/* Is dn_used in bytes? If not, it's in multiples of SPA_MINBLOCKSIZE */
#define DNODE_FLAG_USED_BYTES (1<<0)
#define DNODE_FLAG_USERUSED_ACCOUNTED (1<<1)
/* Does dnode have a SA spill blkptr in bonus? */
#define DNODE_FLAG_SPILL_BLKPTR (1<<2)
typedef struct dnode_phys {
uint8_t dn_type; /* dmu_object_type_t */
uint8_t dn_indblkshift; /* ln2(indirect block size) */
uint8_t dn_nlevels; /* 1=dn_blkptr->data blocks */
uint8_t dn_nblkptr; /* length of dn_blkptr */
uint8_t dn_bonustype; /* type of data in bonus buffer */
uint8_t dn_checksum; /* ZIO_CHECKSUM type */
uint8_t dn_compress; /* ZIO_COMPRESS type */
uint8_t dn_flags; /* DNODE_FLAG_* */
uint16_t dn_datablkszsec; /* data block size in 512b sectors */
uint16_t dn_bonuslen; /* length of dn_bonus */
uint8_t dn_pad2[4];
/* accounting is protected by dn_dirty_mtx */
uint64_t dn_maxblkid; /* largest allocated block ID */
uint64_t dn_used; /* bytes (or sectors) of disk space */
uint64_t dn_pad3[4];
blkptr_t dn_blkptr[1];
uint8_t dn_bonus[DN_MAX_BONUSLEN - sizeof (blkptr_t)];
blkptr_t dn_spill;
} dnode_phys_t;
typedef struct dnode {
/*
* dn_struct_rwlock protects the structure of the dnode,
* including the number of levels of indirection (dn_nlevels),
* dn_maxblkid, and dn_next_*
*/
krwlock_t dn_struct_rwlock;
/* Our link on dn_objset->os_dnodes list; protected by os_lock. */
list_node_t dn_link;
/* immutable: */
struct objset *dn_objset;
uint64_t dn_object;
struct dmu_buf_impl *dn_dbuf;
struct dnode_handle *dn_handle;
dnode_phys_t *dn_phys; /* pointer into dn->dn_dbuf->db.db_data */
/*
* Copies of stuff in dn_phys. They're valid in the open
* context (eg. even before the dnode is first synced).
* Where necessary, these are protected by dn_struct_rwlock.
*/
dmu_object_type_t dn_type; /* object type */
uint16_t dn_bonuslen; /* bonus length */
uint8_t dn_bonustype; /* bonus type */
uint8_t dn_nblkptr; /* number of blkptrs (immutable) */
uint8_t dn_checksum; /* ZIO_CHECKSUM type */
uint8_t dn_compress; /* ZIO_COMPRESS type */
uint8_t dn_nlevels;
uint8_t dn_indblkshift;
uint8_t dn_datablkshift; /* zero if blksz not power of 2! */
uint8_t dn_moved; /* Has this dnode been moved? */
uint16_t dn_datablkszsec; /* in 512b sectors */
uint32_t dn_datablksz; /* in bytes */
uint64_t dn_maxblkid;
uint8_t dn_next_nblkptr[TXG_SIZE];
uint8_t dn_next_nlevels[TXG_SIZE];
uint8_t dn_next_indblkshift[TXG_SIZE];
uint8_t dn_next_bonustype[TXG_SIZE];
uint8_t dn_rm_spillblk[TXG_SIZE]; /* for removing spill blk */
uint16_t dn_next_bonuslen[TXG_SIZE];
uint32_t dn_next_blksz[TXG_SIZE]; /* next block size in bytes */
/* protected by dn_dbufs_mtx; declared here to fill 32-bit hole */
uint32_t dn_dbufs_count; /* count of dn_dbufs */
/* protected by os_lock: */
list_node_t dn_dirty_link[TXG_SIZE]; /* next on dataset's dirty */
/* protected by dn_mtx: */
kmutex_t dn_mtx;
list_t dn_dirty_records[TXG_SIZE];
avl_tree_t dn_ranges[TXG_SIZE];
uint64_t dn_allocated_txg;
uint64_t dn_free_txg;
uint64_t dn_assigned_txg;
kcondvar_t dn_notxholds;
enum dnode_dirtycontext dn_dirtyctx;
uint8_t *dn_dirtyctx_firstset; /* dbg: contents meaningless */
/* protected by own devices */
refcount_t dn_tx_holds;
refcount_t dn_holds;
kmutex_t dn_dbufs_mtx;
list_t dn_dbufs; /* descendent dbufs */
/* protected by dn_struct_rwlock */
struct dmu_buf_impl *dn_bonus; /* bonus buffer dbuf */
boolean_t dn_have_spill; /* have spill or are spilling */
/* parent IO for current sync write */
zio_t *dn_zio;
/* used in syncing context */
uint64_t dn_oldused; /* old phys used bytes */
uint64_t dn_oldflags; /* old phys dn_flags */
uint64_t dn_olduid, dn_oldgid;
uint64_t dn_newuid, dn_newgid;
int dn_id_flags;
/* holds prefetch structure */
struct zfetch dn_zfetch;
} dnode_t;
/*
* Adds a level of indirection between the dbuf and the dnode to avoid
* iterating descendent dbufs in dnode_move(). Handles are not allocated
* individually, but as an array of child dnodes in dnode_hold_impl().
*/
typedef struct dnode_handle {
/* Protects dnh_dnode from modification by dnode_move(). */
zrlock_t dnh_zrlock;
dnode_t *dnh_dnode;
} dnode_handle_t;
typedef struct dnode_children {
size_t dnc_count; /* number of children */
dnode_handle_t dnc_children[1]; /* sized dynamically */
} dnode_children_t;
typedef struct free_range {
avl_node_t fr_node;
uint64_t fr_blkid;
uint64_t fr_nblks;
} free_range_t;
dnode_t *dnode_special_open(struct objset *dd, dnode_phys_t *dnp,
uint64_t object, dnode_handle_t *dnh);
void dnode_special_close(dnode_handle_t *dnh);
void dnode_setbonuslen(dnode_t *dn, int newsize, dmu_tx_t *tx);
void dnode_setbonus_type(dnode_t *dn, dmu_object_type_t, dmu_tx_t *tx);
void dnode_rm_spill(dnode_t *dn, dmu_tx_t *tx);
int dnode_hold(struct objset *dd, uint64_t object,
void *ref, dnode_t **dnp);
int dnode_hold_impl(struct objset *dd, uint64_t object, int flag,
void *ref, dnode_t **dnp);
boolean_t dnode_add_ref(dnode_t *dn, void *ref);
void dnode_rele(dnode_t *dn, void *ref);
void dnode_setdirty(dnode_t *dn, dmu_tx_t *tx);
void dnode_sync(dnode_t *dn, dmu_tx_t *tx);
void dnode_allocate(dnode_t *dn, dmu_object_type_t ot, int blocksize, int ibs,
dmu_object_type_t bonustype, int bonuslen, dmu_tx_t *tx);
void dnode_reallocate(dnode_t *dn, dmu_object_type_t ot, int blocksize,
dmu_object_type_t bonustype, int bonuslen, dmu_tx_t *tx);
void dnode_free(dnode_t *dn, dmu_tx_t *tx);
void dnode_byteswap(dnode_phys_t *dnp);
void dnode_buf_byteswap(void *buf, size_t size);
void dnode_verify(dnode_t *dn);
int dnode_set_blksz(dnode_t *dn, uint64_t size, int ibs, dmu_tx_t *tx);
uint64_t dnode_current_max_length(dnode_t *dn);
void dnode_free_range(dnode_t *dn, uint64_t off, uint64_t len, dmu_tx_t *tx);
void dnode_clear_range(dnode_t *dn, uint64_t blkid,
uint64_t nblks, dmu_tx_t *tx);
void dnode_diduse_space(dnode_t *dn, int64_t space);
void dnode_willuse_space(dnode_t *dn, int64_t space, dmu_tx_t *tx);
void dnode_new_blkid(dnode_t *dn, uint64_t blkid, dmu_tx_t *tx, boolean_t);
uint64_t dnode_block_freed(dnode_t *dn, uint64_t blkid);
void dnode_init(void);
void dnode_fini(void);
int dnode_next_offset(dnode_t *dn, int flags, uint64_t *off,
int minlvl, uint64_t blkfill, uint64_t txg);
void dnode_evict_dbufs(dnode_t *dn);
#ifdef ZFS_DEBUG
/*
* There should be a ## between the string literal and fmt, to make it
 * clear that we're joining two strings together, but gcc does not
 * accept that preprocessor token in this position.
*/
#define dprintf_dnode(dn, fmt, ...) do { \
if (zfs_flags & ZFS_DEBUG_DPRINTF) { \
char __db_buf[32]; \
uint64_t __db_obj = (dn)->dn_object; \
if (__db_obj == DMU_META_DNODE_OBJECT) \
(void) strcpy(__db_buf, "mdn"); \
else \
(void) snprintf(__db_buf, sizeof (__db_buf), "%lld", \
(u_longlong_t)__db_obj);\
dprintf_ds((dn)->dn_objset->os_dsl_dataset, "obj=%s " fmt, \
__db_buf, __VA_ARGS__); \
} \
_NOTE(CONSTCOND) } while (0)
#define DNODE_VERIFY(dn) dnode_verify(dn)
#define FREE_VERIFY(db, start, end, tx) free_verify(db, start, end, tx)
#else
#define dprintf_dnode(db, fmt, ...)
#define DNODE_VERIFY(dn)
#define FREE_VERIFY(db, start, end, tx)
#endif
#ifdef __cplusplus
}
#endif
#endif /* _SYS_DNODE_H */
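
A minimal hold/release sketch against the functions above. The wrapper is hypothetical; FTAG comes from sys/refcount.h, and dn_datablksz is one of the open-context copies documented in the struct.

#include <sys/dnode.h>
#include <sys/refcount.h>	/* FTAG */

/* hypothetical: look up a dnode, read one cached field, drop the hold */
static int
get_dnode_blksz(struct objset *os, uint64_t object, uint32_t *blkszp)
{
	dnode_t *dn;
	int err;

	err = dnode_hold(os, object, FTAG, &dn);
	if (err != 0)
		return (err);
	*blkszp = dn->dn_datablksz;
	dnode_rele(dn, FTAG);
	return (0);
}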

View File

@ -0,0 +1,283 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#ifndef _SYS_DSL_DATASET_H
#define _SYS_DSL_DATASET_H
#include <sys/dmu.h>
#include <sys/spa.h>
#include <sys/txg.h>
#include <sys/zio.h>
#include <sys/bplist.h>
#include <sys/dsl_synctask.h>
#include <sys/zfs_context.h>
#include <sys/dsl_deadlist.h>
#ifdef __cplusplus
extern "C" {
#endif
struct dsl_dataset;
struct dsl_dir;
struct dsl_pool;
#define DS_FLAG_INCONSISTENT (1ULL<<0)
#define DS_IS_INCONSISTENT(ds) \
((ds)->ds_phys->ds_flags & DS_FLAG_INCONSISTENT)
/*
 * NB: nopromote cannot yet be set, but we want support for it in this
* on-disk version, so that we don't need to upgrade for it later. It
* will be needed when we implement 'zfs split' (where the split off
* clone should not be promoted).
*/
#define DS_FLAG_NOPROMOTE (1ULL<<1)
/*
* DS_FLAG_UNIQUE_ACCURATE is set if ds_unique_bytes has been correctly
* calculated for head datasets (starting with SPA_VERSION_UNIQUE_ACCURATE,
* refquota/refreservations).
*/
#define DS_FLAG_UNIQUE_ACCURATE (1ULL<<2)
/*
* DS_FLAG_DEFER_DESTROY is set after 'zfs destroy -d' has been called
* on a dataset. This allows the dataset to be destroyed using 'zfs release'.
*/
#define DS_FLAG_DEFER_DESTROY (1ULL<<3)
#define DS_IS_DEFER_DESTROY(ds) \
((ds)->ds_phys->ds_flags & DS_FLAG_DEFER_DESTROY)
/*
* DS_FLAG_CI_DATASET is set if the dataset contains a file system whose
* name lookups should be performed case-insensitively.
*/
#define DS_FLAG_CI_DATASET (1ULL<<16)
typedef struct dsl_dataset_phys {
uint64_t ds_dir_obj; /* DMU_OT_DSL_DIR */
uint64_t ds_prev_snap_obj; /* DMU_OT_DSL_DATASET */
uint64_t ds_prev_snap_txg;
uint64_t ds_next_snap_obj; /* DMU_OT_DSL_DATASET */
	uint64_t ds_snapnames_zapobj;	/* DMU_OT_DSL_DS_SNAP_MAP; 0 for snaps */
uint64_t ds_num_children; /* clone/snap children; ==0 for head */
uint64_t ds_creation_time; /* seconds since 1970 */
uint64_t ds_creation_txg;
uint64_t ds_deadlist_obj; /* DMU_OT_DEADLIST */
uint64_t ds_used_bytes;
uint64_t ds_compressed_bytes;
uint64_t ds_uncompressed_bytes;
uint64_t ds_unique_bytes; /* only relevant to snapshots */
/*
* The ds_fsid_guid is a 56-bit ID that can change to avoid
* collisions. The ds_guid is a 64-bit ID that will never
* change, so there is a small probability that it will collide.
*/
uint64_t ds_fsid_guid;
uint64_t ds_guid;
uint64_t ds_flags; /* DS_FLAG_* */
blkptr_t ds_bp;
uint64_t ds_next_clones_obj; /* DMU_OT_DSL_CLONES */
uint64_t ds_props_obj; /* DMU_OT_DSL_PROPS for snaps */
uint64_t ds_userrefs_obj; /* DMU_OT_USERREFS */
uint64_t ds_pad[5]; /* pad out to 320 bytes for good measure */
} dsl_dataset_phys_t;
typedef struct dsl_dataset {
/* Immutable: */
struct dsl_dir *ds_dir;
dsl_dataset_phys_t *ds_phys;
dmu_buf_t *ds_dbuf;
uint64_t ds_object;
uint64_t ds_fsid_guid;
/* only used in syncing context, only valid for non-snapshots: */
struct dsl_dataset *ds_prev;
/* has internal locking: */
dsl_deadlist_t ds_deadlist;
bplist_t ds_pending_deadlist;
/* to protect against multiple concurrent incremental recv */
kmutex_t ds_recvlock;
/* protected by lock on pool's dp_dirty_datasets list */
txg_node_t ds_dirty_link;
list_node_t ds_synced_link;
/*
* ds_phys->ds_<accounting> is also protected by ds_lock.
* Protected by ds_lock:
*/
kmutex_t ds_lock;
objset_t *ds_objset;
uint64_t ds_userrefs;
/*
* ds_owner is protected by the ds_rwlock and the ds_lock
*/
krwlock_t ds_rwlock;
kcondvar_t ds_exclusive_cv;
void *ds_owner;
/* no locking; only for making guesses */
uint64_t ds_trysnap_txg;
/* for objset_open() */
kmutex_t ds_opening_lock;
uint64_t ds_reserved; /* cached refreservation */
uint64_t ds_quota; /* cached refquota */
/* Protected by ds_lock; keep at end of struct for better locality */
char ds_snapname[MAXNAMELEN];
} dsl_dataset_t;
struct dsl_ds_destroyarg {
dsl_dataset_t *ds; /* ds to destroy */
dsl_dataset_t *rm_origin; /* also remove our origin? */
boolean_t is_origin_rm; /* set if removing origin snap */
boolean_t defer; /* destroy -d requested? */
boolean_t releasing; /* destroying due to release? */
boolean_t need_prep; /* do we need to retry due to EBUSY? */
};
/*
* The max length of a temporary tag prefix is the number of hex digits
 * required to express UINT64_MAX (16 digits) plus one for the hyphen.
*/
#define MAX_TAG_PREFIX_LEN 17
struct dsl_ds_holdarg {
dsl_sync_task_group_t *dstg;
char *htag;
char *snapname;
boolean_t recursive;
boolean_t gotone;
boolean_t temphold;
char failed[MAXPATHLEN];
};
#define dsl_dataset_is_snapshot(ds) \
((ds)->ds_phys->ds_num_children != 0)
#define DS_UNIQUE_IS_ACCURATE(ds) \
(((ds)->ds_phys->ds_flags & DS_FLAG_UNIQUE_ACCURATE) != 0)
int dsl_dataset_hold(const char *name, void *tag, dsl_dataset_t **dsp);
int dsl_dataset_hold_obj(struct dsl_pool *dp, uint64_t dsobj,
void *tag, dsl_dataset_t **);
int dsl_dataset_own(const char *name, boolean_t inconsistentok,
void *tag, dsl_dataset_t **dsp);
int dsl_dataset_own_obj(struct dsl_pool *dp, uint64_t dsobj,
boolean_t inconsistentok, void *tag, dsl_dataset_t **dsp);
void dsl_dataset_name(dsl_dataset_t *ds, char *name);
void dsl_dataset_rele(dsl_dataset_t *ds, void *tag);
void dsl_dataset_disown(dsl_dataset_t *ds, void *tag);
void dsl_dataset_drop_ref(dsl_dataset_t *ds, void *tag);
boolean_t dsl_dataset_tryown(dsl_dataset_t *ds, boolean_t inconsistentok,
void *tag);
void dsl_dataset_make_exclusive(dsl_dataset_t *ds, void *tag);
void dsl_register_onexit_hold_cleanup(dsl_dataset_t *ds, const char *htag,
minor_t minor);
uint64_t dsl_dataset_create_sync(dsl_dir_t *pds, const char *lastname,
dsl_dataset_t *origin, uint64_t flags, cred_t *, dmu_tx_t *);
uint64_t dsl_dataset_create_sync_dd(dsl_dir_t *dd, dsl_dataset_t *origin,
uint64_t flags, dmu_tx_t *tx);
int dsl_dataset_destroy(dsl_dataset_t *ds, void *tag, boolean_t defer);
int dsl_snapshots_destroy(char *fsname, char *snapname, boolean_t defer);
dsl_checkfunc_t dsl_dataset_destroy_check;
dsl_syncfunc_t dsl_dataset_destroy_sync;
dsl_checkfunc_t dsl_dataset_snapshot_check;
dsl_syncfunc_t dsl_dataset_snapshot_sync;
dsl_syncfunc_t dsl_dataset_user_hold_sync;
int dsl_dataset_rename(char *name, const char *newname, boolean_t recursive);
int dsl_dataset_promote(const char *name, char *conflsnap);
int dsl_dataset_clone_swap(dsl_dataset_t *clone, dsl_dataset_t *origin_head,
boolean_t force);
int dsl_dataset_user_hold(char *dsname, char *snapname, char *htag,
boolean_t recursive, boolean_t temphold, int cleanup_fd);
int dsl_dataset_user_hold_for_send(dsl_dataset_t *ds, char *htag,
boolean_t temphold);
int dsl_dataset_user_release(char *dsname, char *snapname, char *htag,
boolean_t recursive);
int dsl_dataset_user_release_tmp(struct dsl_pool *dp, uint64_t dsobj,
char *htag, boolean_t retry);
int dsl_dataset_get_holds(const char *dsname, nvlist_t **nvp);
blkptr_t *dsl_dataset_get_blkptr(dsl_dataset_t *ds);
void dsl_dataset_set_blkptr(dsl_dataset_t *ds, blkptr_t *bp, dmu_tx_t *tx);
spa_t *dsl_dataset_get_spa(dsl_dataset_t *ds);
boolean_t dsl_dataset_modified_since_lastsnap(dsl_dataset_t *ds);
void dsl_dataset_sync(dsl_dataset_t *os, zio_t *zio, dmu_tx_t *tx);
void dsl_dataset_block_born(dsl_dataset_t *ds, const blkptr_t *bp,
dmu_tx_t *tx);
int dsl_dataset_block_kill(dsl_dataset_t *ds, const blkptr_t *bp,
dmu_tx_t *tx, boolean_t async);
boolean_t dsl_dataset_block_freeable(dsl_dataset_t *ds, const blkptr_t *bp,
uint64_t blk_birth);
uint64_t dsl_dataset_prev_snap_txg(dsl_dataset_t *ds);
void dsl_dataset_dirty(dsl_dataset_t *ds, dmu_tx_t *tx);
void dsl_dataset_stats(dsl_dataset_t *os, nvlist_t *nv);
void dsl_dataset_fast_stat(dsl_dataset_t *ds, dmu_objset_stats_t *stat);
void dsl_dataset_space(dsl_dataset_t *ds,
uint64_t *refdbytesp, uint64_t *availbytesp,
uint64_t *usedobjsp, uint64_t *availobjsp);
uint64_t dsl_dataset_fsid_guid(dsl_dataset_t *ds);
int dsl_dsobj_to_dsname(char *pname, uint64_t obj, char *buf);
int dsl_dataset_check_quota(dsl_dataset_t *ds, boolean_t check_quota,
uint64_t asize, uint64_t inflight, uint64_t *used,
uint64_t *ref_rsrv);
int dsl_dataset_set_quota(const char *dsname, zprop_source_t source,
uint64_t quota);
dsl_syncfunc_t dsl_dataset_set_quota_sync;
int dsl_dataset_set_reservation(const char *dsname, zprop_source_t source,
uint64_t reservation);
int dsl_destroy_inconsistent(const char *dsname, void *arg);
#ifdef ZFS_DEBUG
#define dprintf_ds(ds, fmt, ...) do { \
if (zfs_flags & ZFS_DEBUG_DPRINTF) { \
char *__ds_name = kmem_alloc(MAXNAMELEN, KM_SLEEP); \
dsl_dataset_name(ds, __ds_name); \
dprintf("ds=%s " fmt, __ds_name, __VA_ARGS__); \
kmem_free(__ds_name, MAXNAMELEN); \
} \
_NOTE(CONSTCOND) } while (0)
#else
#define dprintf_ds(dd, fmt, ...)
#endif
#ifdef __cplusplus
}
#endif
#endif /* _SYS_DSL_DATASET_H */
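
A minimal sketch of the hold/rele discipline for datasets; the wrapper is hypothetical and the dataset name would be caller-supplied (e.g. "tank/home").

#include <sys/dsl_dataset.h>

/* hypothetical: report space usage for a named dataset */
static int
get_dataset_space(const char *name, uint64_t *refdp, uint64_t *availp)
{
	dsl_dataset_t *ds;
	uint64_t usedobjs, availobjs;
	int err;

	err = dsl_dataset_hold(name, FTAG, &ds);	/* FTAG: refcount.h */
	if (err != 0)
		return (err);
	dsl_dataset_space(ds, refdp, availp, &usedobjs, &availobjs);
	dsl_dataset_rele(ds, FTAG);
	return (0);
}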

View File

@ -0,0 +1,87 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2010, Oracle and/or its affiliates. All rights reserved.
*/
#ifndef _SYS_DSL_DEADLIST_H
#define _SYS_DSL_DEADLIST_H
#include <sys/bpobj.h>
#include <sys/zfs_context.h>
#ifdef __cplusplus
extern "C" {
#endif
struct dmu_buf;
struct dsl_dataset;
typedef struct dsl_deadlist_phys {
uint64_t dl_used;
uint64_t dl_comp;
uint64_t dl_uncomp;
uint64_t dl_pad[37]; /* pad out to 320b for future expansion */
} dsl_deadlist_phys_t;
typedef struct dsl_deadlist {
objset_t *dl_os;
uint64_t dl_object;
avl_tree_t dl_tree;
boolean_t dl_havetree;
struct dmu_buf *dl_dbuf;
dsl_deadlist_phys_t *dl_phys;
kmutex_t dl_lock;
/* if it's the old on-disk format: */
bpobj_t dl_bpobj;
boolean_t dl_oldfmt;
} dsl_deadlist_t;
typedef struct dsl_deadlist_entry {
avl_node_t dle_node;
uint64_t dle_mintxg;
bpobj_t dle_bpobj;
} dsl_deadlist_entry_t;
void dsl_deadlist_open(dsl_deadlist_t *dl, objset_t *os, uint64_t object);
void dsl_deadlist_close(dsl_deadlist_t *dl);
uint64_t dsl_deadlist_alloc(objset_t *os, dmu_tx_t *tx);
void dsl_deadlist_free(objset_t *os, uint64_t dlobj, dmu_tx_t *tx);
void dsl_deadlist_insert(dsl_deadlist_t *dl, const blkptr_t *bp, dmu_tx_t *tx);
void dsl_deadlist_add_key(dsl_deadlist_t *dl, uint64_t mintxg, dmu_tx_t *tx);
void dsl_deadlist_remove_key(dsl_deadlist_t *dl, uint64_t mintxg, dmu_tx_t *tx);
uint64_t dsl_deadlist_clone(dsl_deadlist_t *dl, uint64_t maxtxg,
uint64_t mrs_obj, dmu_tx_t *tx);
void dsl_deadlist_space(dsl_deadlist_t *dl,
uint64_t *usedp, uint64_t *compp, uint64_t *uncompp);
void dsl_deadlist_space_range(dsl_deadlist_t *dl,
uint64_t mintxg, uint64_t maxtxg,
uint64_t *usedp, uint64_t *compp, uint64_t *uncompp);
void dsl_deadlist_merge(dsl_deadlist_t *dl, uint64_t obj, dmu_tx_t *tx);
void dsl_deadlist_move_bpobj(dsl_deadlist_t *dl, bpobj_t *bpo, uint64_t mintxg,
dmu_tx_t *tx);
#ifdef __cplusplus
}
#endif
#endif /* _SYS_DSL_DEADLIST_H */
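
A small open/query/close sketch for the deadlist interface; the wrapper and the caller-supplied object number are hypothetical.

#include <sys/dsl_deadlist.h>

/* hypothetical: total up the space accounted to one deadlist */
static uint64_t
deadlist_used(objset_t *os, uint64_t dlobj)
{
	dsl_deadlist_t dl;
	uint64_t used, comp, uncomp;

	dsl_deadlist_open(&dl, os, dlobj);
	dsl_deadlist_space(&dl, &used, &comp, &uncomp);
	dsl_deadlist_close(&dl);
	return (used);
}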

View File

@ -0,0 +1,78 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2007, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#ifndef _SYS_DSL_DELEG_H
#define _SYS_DSL_DELEG_H
#include <sys/dmu.h>
#include <sys/dsl_pool.h>
#include <sys/zfs_context.h>
#ifdef __cplusplus
extern "C" {
#endif
#define ZFS_DELEG_PERM_NONE ""
#define ZFS_DELEG_PERM_CREATE "create"
#define ZFS_DELEG_PERM_DESTROY "destroy"
#define ZFS_DELEG_PERM_SNAPSHOT "snapshot"
#define ZFS_DELEG_PERM_ROLLBACK "rollback"
#define ZFS_DELEG_PERM_CLONE "clone"
#define ZFS_DELEG_PERM_PROMOTE "promote"
#define ZFS_DELEG_PERM_RENAME "rename"
#define ZFS_DELEG_PERM_MOUNT "mount"
#define ZFS_DELEG_PERM_SHARE "share"
#define ZFS_DELEG_PERM_SEND "send"
#define ZFS_DELEG_PERM_RECEIVE "receive"
#define ZFS_DELEG_PERM_ALLOW "allow"
#define ZFS_DELEG_PERM_USERPROP "userprop"
#define ZFS_DELEG_PERM_VSCAN "vscan"
#define ZFS_DELEG_PERM_USERQUOTA "userquota"
#define ZFS_DELEG_PERM_GROUPQUOTA "groupquota"
#define ZFS_DELEG_PERM_USERUSED "userused"
#define ZFS_DELEG_PERM_GROUPUSED "groupused"
#define ZFS_DELEG_PERM_HOLD "hold"
#define ZFS_DELEG_PERM_RELEASE "release"
#define ZFS_DELEG_PERM_DIFF "diff"
/*
* Note: the names of properties that are marked delegatable are also
* valid delegated permissions
*/
int dsl_deleg_get(const char *ddname, nvlist_t **nvp);
int dsl_deleg_set(const char *ddname, nvlist_t *nvp, boolean_t unset);
int dsl_deleg_access(const char *ddname, const char *perm, cred_t *cr);
int dsl_deleg_access_impl(struct dsl_dataset *ds, const char *perm, cred_t *cr);
void dsl_deleg_set_create_perms(dsl_dir_t *dd, dmu_tx_t *tx, cred_t *cr);
int dsl_deleg_can_allow(char *ddname, nvlist_t *nvp, cred_t *cr);
int dsl_deleg_can_unallow(char *ddname, nvlist_t *nvp, cred_t *cr);
int dsl_deleg_destroy(objset_t *os, uint64_t zapobj, dmu_tx_t *tx);
boolean_t dsl_delegation_on(objset_t *os);
#ifdef __cplusplus
}
#endif
#endif /* _SYS_DSL_DELEG_H */

View File

@ -0,0 +1,167 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#ifndef _SYS_DSL_DIR_H
#define _SYS_DSL_DIR_H
#include <sys/dmu.h>
#include <sys/dsl_pool.h>
#include <sys/dsl_synctask.h>
#include <sys/refcount.h>
#include <sys/zfs_context.h>
#ifdef __cplusplus
extern "C" {
#endif
struct dsl_dataset;
typedef enum dd_used {
DD_USED_HEAD,
DD_USED_SNAP,
DD_USED_CHILD,
DD_USED_CHILD_RSRV,
DD_USED_REFRSRV,
DD_USED_NUM
} dd_used_t;
#define DD_FLAG_USED_BREAKDOWN (1<<0)
typedef struct dsl_dir_phys {
uint64_t dd_creation_time; /* not actually used */
uint64_t dd_head_dataset_obj;
uint64_t dd_parent_obj;
uint64_t dd_origin_obj;
uint64_t dd_child_dir_zapobj;
/*
* how much space our children are accounting for; for leaf
* datasets, == physical space used by fs + snaps
*/
uint64_t dd_used_bytes;
uint64_t dd_compressed_bytes;
uint64_t dd_uncompressed_bytes;
/* Administrative quota setting */
uint64_t dd_quota;
/* Administrative reservation setting */
uint64_t dd_reserved;
uint64_t dd_props_zapobj;
uint64_t dd_deleg_zapobj; /* dataset delegation permissions */
uint64_t dd_flags;
uint64_t dd_used_breakdown[DD_USED_NUM];
uint64_t dd_clones; /* dsl_dir objects */
uint64_t dd_pad[13]; /* pad out to 256 bytes for good measure */
} dsl_dir_phys_t;
struct dsl_dir {
/* These are immutable; no lock needed: */
uint64_t dd_object;
dsl_dir_phys_t *dd_phys;
dmu_buf_t *dd_dbuf;
dsl_pool_t *dd_pool;
/* protected by lock on pool's dp_dirty_dirs list */
txg_node_t dd_dirty_link;
/* protected by dp_config_rwlock */
dsl_dir_t *dd_parent;
/* Protected by dd_lock */
kmutex_t dd_lock;
list_t dd_prop_cbs; /* list of dsl_prop_cb_record_t's */
timestruc_t dd_snap_cmtime; /* last time snapshot namespace changed */
uint64_t dd_origin_txg;
/* gross estimate of space used by in-flight tx's */
uint64_t dd_tempreserved[TXG_SIZE];
/* amount of space we expect to write; == amount of dirty data */
int64_t dd_space_towrite[TXG_SIZE];
/* protected by dd_lock; keep at end of struct for better locality */
char dd_myname[MAXNAMELEN];
};
void dsl_dir_close(dsl_dir_t *dd, void *tag);
int dsl_dir_open(const char *name, void *tag, dsl_dir_t **, const char **tail);
int dsl_dir_open_spa(spa_t *spa, const char *name, void *tag, dsl_dir_t **,
const char **tailp);
int dsl_dir_open_obj(dsl_pool_t *dp, uint64_t ddobj,
const char *tail, void *tag, dsl_dir_t **);
void dsl_dir_name(dsl_dir_t *dd, char *buf);
int dsl_dir_namelen(dsl_dir_t *dd);
uint64_t dsl_dir_create_sync(dsl_pool_t *dp, dsl_dir_t *pds,
const char *name, dmu_tx_t *tx);
dsl_checkfunc_t dsl_dir_destroy_check;
dsl_syncfunc_t dsl_dir_destroy_sync;
void dsl_dir_stats(dsl_dir_t *dd, nvlist_t *nv);
uint64_t dsl_dir_space_available(dsl_dir_t *dd,
dsl_dir_t *ancestor, int64_t delta, int ondiskonly);
void dsl_dir_dirty(dsl_dir_t *dd, dmu_tx_t *tx);
void dsl_dir_sync(dsl_dir_t *dd, dmu_tx_t *tx);
int dsl_dir_tempreserve_space(dsl_dir_t *dd, uint64_t mem,
uint64_t asize, uint64_t fsize, uint64_t usize, void **tr_cookiep,
dmu_tx_t *tx);
void dsl_dir_tempreserve_clear(void *tr_cookie, dmu_tx_t *tx);
void dsl_dir_willuse_space(dsl_dir_t *dd, int64_t space, dmu_tx_t *tx);
void dsl_dir_diduse_space(dsl_dir_t *dd, dd_used_t type,
int64_t used, int64_t compressed, int64_t uncompressed, dmu_tx_t *tx);
void dsl_dir_transfer_space(dsl_dir_t *dd, int64_t delta,
dd_used_t oldtype, dd_used_t newtype, dmu_tx_t *tx);
int dsl_dir_set_quota(const char *ddname, zprop_source_t source,
uint64_t quota);
int dsl_dir_set_reservation(const char *ddname, zprop_source_t source,
uint64_t reservation);
int dsl_dir_rename(dsl_dir_t *dd, const char *newname);
int dsl_dir_transfer_possible(dsl_dir_t *sdd, dsl_dir_t *tdd, uint64_t space);
int dsl_dir_set_reservation_check(void *arg1, void *arg2, dmu_tx_t *tx);
boolean_t dsl_dir_is_clone(dsl_dir_t *dd);
void dsl_dir_new_refreservation(dsl_dir_t *dd, struct dsl_dataset *ds,
uint64_t reservation, cred_t *cr, dmu_tx_t *tx);
void dsl_dir_snap_cmtime_update(dsl_dir_t *dd);
timestruc_t dsl_dir_snap_cmtime(dsl_dir_t *dd);
/* internal reserved dir name */
#define MOS_DIR_NAME "$MOS"
#define ORIGIN_DIR_NAME "$ORIGIN"
#define XLATION_DIR_NAME "$XLATION"
#define FREE_DIR_NAME "$FREE"
#ifdef ZFS_DEBUG
#define dprintf_dd(dd, fmt, ...) do { \
if (zfs_flags & ZFS_DEBUG_DPRINTF) { \
char *__ds_name = kmem_alloc(MAXNAMELEN + strlen(MOS_DIR_NAME) + 1, \
KM_SLEEP); \
dsl_dir_name(dd, __ds_name); \
dprintf("dd=%s " fmt, __ds_name, __VA_ARGS__); \
kmem_free(__ds_name, MAXNAMELEN + strlen(MOS_DIR_NAME) + 1); \
} \
_NOTE(CONSTCOND) } while (0)
#else
#define dprintf_dd(dd, fmt, ...)
#endif
#ifdef __cplusplus
}
#endif
#endif /* _SYS_DSL_DIR_H */

View File

@ -0,0 +1,151 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#ifndef _SYS_DSL_POOL_H
#define _SYS_DSL_POOL_H
#include <sys/spa.h>
#include <sys/txg.h>
#include <sys/txg_impl.h>
#include <sys/zfs_context.h>
#include <sys/zio.h>
#include <sys/dnode.h>
#include <sys/ddt.h>
#include <sys/arc.h>
#include <sys/bpobj.h>
#ifdef __cplusplus
extern "C" {
#endif
struct objset;
struct dsl_dir;
struct dsl_dataset;
struct dsl_pool;
struct dmu_tx;
struct dsl_scan;
/* These macros are for indexing into the zfs_all_blkstats_t. */
#define DMU_OT_DEFERRED DMU_OT_NONE
#define DMU_OT_TOTAL DMU_OT_NUMTYPES
typedef struct zfs_blkstat {
uint64_t zb_count;
uint64_t zb_asize;
uint64_t zb_lsize;
uint64_t zb_psize;
uint64_t zb_gangs;
uint64_t zb_ditto_2_of_2_samevdev;
uint64_t zb_ditto_2_of_3_samevdev;
uint64_t zb_ditto_3_of_3_samevdev;
} zfs_blkstat_t;
typedef struct zfs_all_blkstats {
zfs_blkstat_t zab_type[DN_MAX_LEVELS + 1][DMU_OT_TOTAL + 1];
} zfs_all_blkstats_t;
typedef struct dsl_pool {
/* Immutable */
spa_t *dp_spa;
struct objset *dp_meta_objset;
struct dsl_dir *dp_root_dir;
struct dsl_dir *dp_mos_dir;
struct dsl_dir *dp_free_dir;
struct dsl_dataset *dp_origin_snap;
uint64_t dp_root_dir_obj;
struct taskq *dp_vnrele_taskq;
/* No lock needed - sync context only */
blkptr_t dp_meta_rootbp;
list_t dp_synced_datasets;
hrtime_t dp_read_overhead;
uint64_t dp_throughput; /* bytes per millisec */
uint64_t dp_write_limit;
uint64_t dp_tmp_userrefs_obj;
bpobj_t dp_free_bpobj;
struct dsl_scan *dp_scan;
/* Uses dp_lock */
kmutex_t dp_lock;
uint64_t dp_space_towrite[TXG_SIZE];
uint64_t dp_tempreserved[TXG_SIZE];
/* Has its own locking */
tx_state_t dp_tx;
txg_list_t dp_dirty_datasets;
txg_list_t dp_dirty_dirs;
txg_list_t dp_sync_tasks;
/*
	 * Protects administrative changes (properties, namespace).
* It is only held for write in syncing context. Therefore
* syncing context does not need to ever have it for read, since
* nobody else could possibly have it for write.
*/
krwlock_t dp_config_rwlock;
zfs_all_blkstats_t *dp_blkstats;
} dsl_pool_t;
int dsl_pool_open(spa_t *spa, uint64_t txg, dsl_pool_t **dpp);
void dsl_pool_close(dsl_pool_t *dp);
dsl_pool_t *dsl_pool_create(spa_t *spa, nvlist_t *zplprops, uint64_t txg);
void dsl_pool_sync(dsl_pool_t *dp, uint64_t txg);
void dsl_pool_sync_done(dsl_pool_t *dp, uint64_t txg);
int dsl_pool_sync_context(dsl_pool_t *dp);
uint64_t dsl_pool_adjustedsize(dsl_pool_t *dp, boolean_t netfree);
uint64_t dsl_pool_adjustedfree(dsl_pool_t *dp, boolean_t netfree);
int dsl_pool_tempreserve_space(dsl_pool_t *dp, uint64_t space, dmu_tx_t *tx);
void dsl_pool_tempreserve_clear(dsl_pool_t *dp, int64_t space, dmu_tx_t *tx);
void dsl_pool_memory_pressure(dsl_pool_t *dp);
void dsl_pool_willuse_space(dsl_pool_t *dp, int64_t space, dmu_tx_t *tx);
void dsl_free(dsl_pool_t *dp, uint64_t txg, const blkptr_t *bpp);
void dsl_free_sync(zio_t *pio, dsl_pool_t *dp, uint64_t txg,
const blkptr_t *bpp);
int dsl_read(zio_t *pio, spa_t *spa, const blkptr_t *bpp, arc_buf_t *pbuf,
arc_done_func_t *done, void *private, int priority, int zio_flags,
uint32_t *arc_flags, const zbookmark_t *zb);
int dsl_read_nolock(zio_t *pio, spa_t *spa, const blkptr_t *bpp,
arc_done_func_t *done, void *private, int priority, int zio_flags,
uint32_t *arc_flags, const zbookmark_t *zb);
void dsl_pool_create_origin(dsl_pool_t *dp, dmu_tx_t *tx);
void dsl_pool_upgrade_clones(dsl_pool_t *dp, dmu_tx_t *tx);
void dsl_pool_upgrade_dir_clones(dsl_pool_t *dp, dmu_tx_t *tx);
taskq_t *dsl_pool_vnrele_taskq(dsl_pool_t *dp);
extern int dsl_pool_user_hold(dsl_pool_t *dp, uint64_t dsobj,
const char *tag, uint64_t *now, dmu_tx_t *tx);
extern int dsl_pool_user_release(dsl_pool_t *dp, uint64_t dsobj,
const char *tag, dmu_tx_t *tx);
extern void dsl_pool_clean_tmp_userrefs(dsl_pool_t *dp);
int dsl_pool_open_special_dir(dsl_pool_t *dp, const char *name, dsl_dir_t **);
#ifdef __cplusplus
}
#endif
#endif /* _SYS_DSL_POOL_H */

View File

@ -0,0 +1,119 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#ifndef _SYS_DSL_PROP_H
#define _SYS_DSL_PROP_H
#include <sys/dmu.h>
#include <sys/dsl_pool.h>
#include <sys/zfs_context.h>
#include <sys/dsl_synctask.h>
#ifdef __cplusplus
extern "C" {
#endif
struct dsl_dataset;
struct dsl_dir;
/* The callback func may not call into the DMU or DSL! */
typedef void (dsl_prop_changed_cb_t)(void *arg, uint64_t newval);
typedef struct dsl_prop_cb_record {
list_node_t cbr_node; /* link on dd_prop_cbs */
struct dsl_dataset *cbr_ds;
const char *cbr_propname;
dsl_prop_changed_cb_t *cbr_func;
void *cbr_arg;
} dsl_prop_cb_record_t;
typedef struct dsl_props_arg {
nvlist_t *pa_props;
zprop_source_t pa_source;
} dsl_props_arg_t;
typedef struct dsl_prop_set_arg {
const char *psa_name;
zprop_source_t psa_source;
int psa_intsz;
int psa_numints;
const void *psa_value;
/*
* Used to handle the special requirements of the quota and reservation
* properties.
*/
uint64_t psa_effective_value;
} dsl_prop_setarg_t;
int dsl_prop_register(struct dsl_dataset *ds, const char *propname,
dsl_prop_changed_cb_t *callback, void *cbarg);
int dsl_prop_unregister(struct dsl_dataset *ds, const char *propname,
dsl_prop_changed_cb_t *callback, void *cbarg);
int dsl_prop_numcb(struct dsl_dataset *ds);
int dsl_prop_get(const char *ddname, const char *propname,
int intsz, int numints, void *buf, char *setpoint);
int dsl_prop_get_integer(const char *ddname, const char *propname,
uint64_t *valuep, char *setpoint);
int dsl_prop_get_all(objset_t *os, nvlist_t **nvp);
int dsl_prop_get_received(objset_t *os, nvlist_t **nvp);
int dsl_prop_get_ds(struct dsl_dataset *ds, const char *propname,
int intsz, int numints, void *buf, char *setpoint);
int dsl_prop_get_dd(struct dsl_dir *dd, const char *propname,
int intsz, int numints, void *buf, char *setpoint,
boolean_t snapshot);
dsl_syncfunc_t dsl_props_set_sync;
int dsl_prop_set(const char *ddname, const char *propname,
zprop_source_t source, int intsz, int numints, const void *buf);
int dsl_props_set(const char *dsname, zprop_source_t source, nvlist_t *nvl);
void dsl_dir_prop_set_uint64_sync(dsl_dir_t *dd, const char *name, uint64_t val,
dmu_tx_t *tx);
void dsl_prop_setarg_init_uint64(dsl_prop_setarg_t *psa, const char *propname,
zprop_source_t source, uint64_t *value);
int dsl_prop_predict_sync(dsl_dir_t *dd, dsl_prop_setarg_t *psa);
#ifdef ZFS_DEBUG
void dsl_prop_check_prediction(dsl_dir_t *dd, dsl_prop_setarg_t *psa);
#define DSL_PROP_CHECK_PREDICTION(dd, psa) \
dsl_prop_check_prediction((dd), (psa))
#else
#define DSL_PROP_CHECK_PREDICTION(dd, psa) /* nothing */
#endif
/* flag first receive on or after SPA_VERSION_RECVD_PROPS */
boolean_t dsl_prop_get_hasrecvd(objset_t *os);
void dsl_prop_set_hasrecvd(objset_t *os);
void dsl_prop_unset_hasrecvd(objset_t *os);
void dsl_prop_nvlist_add_uint64(nvlist_t *nv, zfs_prop_t prop, uint64_t value);
void dsl_prop_nvlist_add_string(nvlist_t *nv,
zfs_prop_t prop, const char *value);
#ifdef __cplusplus
}
#endif
#endif /* _SYS_DSL_PROP_H */
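
A sketch of callback registration under the constraint stated above (the callback may not call into the DMU or DSL), so it only copies the new value into caller state. The state type and wrapper are hypothetical; "compression" is used as a representative property name.

#include <sys/dsl_prop.h>

typedef struct my_prop_state {	/* hypothetical per-caller state */
	uint64_t mps_compress;
} my_prop_state_t;

static void
compress_changed_cb(void *arg, uint64_t newval)
{
	/* no DMU/DSL calls allowed here; just record the value */
	((my_prop_state_t *)arg)->mps_compress = newval;
}

static int
watch_compression(struct dsl_dataset *ds, my_prop_state_t *state)
{
	/*
	 * The matching teardown would be:
	 * (void) dsl_prop_unregister(ds, "compression",
	 *     compress_changed_cb, state);
	 */
	return (dsl_prop_register(ds, "compression",
	    compress_changed_cb, state));
}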

View File

@ -0,0 +1,108 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2010, Oracle and/or its affiliates. All rights reserved.
*/
#ifndef _SYS_DSL_SCAN_H
#define _SYS_DSL_SCAN_H
#include <sys/zfs_context.h>
#include <sys/zio.h>
#include <sys/ddt.h>
#include <sys/bplist.h>
#ifdef __cplusplus
extern "C" {
#endif
struct objset;
struct dsl_dir;
struct dsl_dataset;
struct dsl_pool;
struct dmu_tx;
/*
* All members of this structure must be uint64_t, for byteswap
* purposes.
*/
typedef struct dsl_scan_phys {
uint64_t scn_func; /* pool_scan_func_t */
uint64_t scn_state; /* dsl_scan_state_t */
uint64_t scn_queue_obj;
uint64_t scn_min_txg;
uint64_t scn_max_txg;
uint64_t scn_cur_min_txg;
uint64_t scn_cur_max_txg;
uint64_t scn_start_time;
uint64_t scn_end_time;
uint64_t scn_to_examine; /* total bytes to be scanned */
uint64_t scn_examined; /* bytes scanned so far */
uint64_t scn_to_process;
uint64_t scn_processed;
uint64_t scn_errors; /* scan I/O error count */
uint64_t scn_ddt_class_max;
ddt_bookmark_t scn_ddt_bookmark;
zbookmark_t scn_bookmark;
uint64_t scn_flags; /* dsl_scan_flags_t */
} dsl_scan_phys_t;
#define SCAN_PHYS_NUMINTS (sizeof (dsl_scan_phys_t) / sizeof (uint64_t))
typedef enum dsl_scan_flags {
DSF_VISIT_DS_AGAIN = 1<<0,
} dsl_scan_flags_t;
typedef struct dsl_scan {
struct dsl_pool *scn_dp;
boolean_t scn_pausing;
uint64_t scn_restart_txg;
uint64_t scn_sync_start_time;
zio_t *scn_zio_root;
/* for debugging / information */
uint64_t scn_visited_this_txg;
dsl_scan_phys_t scn_phys;
} dsl_scan_t;
int dsl_scan_init(struct dsl_pool *dp, uint64_t txg);
void dsl_scan_fini(struct dsl_pool *dp);
void dsl_scan_sync(struct dsl_pool *, dmu_tx_t *);
int dsl_scan_cancel(struct dsl_pool *);
int dsl_scan(struct dsl_pool *, pool_scan_func_t);
void dsl_resilver_restart(struct dsl_pool *, uint64_t txg);
boolean_t dsl_scan_resilvering(struct dsl_pool *dp);
boolean_t dsl_dataset_unstable(struct dsl_dataset *ds);
void dsl_scan_ddt_entry(dsl_scan_t *scn, enum zio_checksum checksum,
ddt_entry_t *dde, dmu_tx_t *tx);
void dsl_scan_ds_destroyed(struct dsl_dataset *ds, struct dmu_tx *tx);
void dsl_scan_ds_snapshotted(struct dsl_dataset *ds, struct dmu_tx *tx);
void dsl_scan_ds_clone_swapped(struct dsl_dataset *ds1, struct dsl_dataset *ds2,
struct dmu_tx *tx);
boolean_t dsl_scan_active(dsl_scan_t *scn);
#ifdef __cplusplus
}
#endif
#endif /* _SYS_DSL_SCAN_H */

View File

@ -0,0 +1,79 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#ifndef _SYS_DSL_SYNCTASK_H
#define _SYS_DSL_SYNCTASK_H
#include <sys/txg.h>
#include <sys/zfs_context.h>
#ifdef __cplusplus
extern "C" {
#endif
struct dsl_pool;
typedef int (dsl_checkfunc_t)(void *, void *, dmu_tx_t *);
typedef void (dsl_syncfunc_t)(void *, void *, dmu_tx_t *);
typedef struct dsl_sync_task {
list_node_t dst_node;
dsl_checkfunc_t *dst_checkfunc;
dsl_syncfunc_t *dst_syncfunc;
void *dst_arg1;
void *dst_arg2;
int dst_err;
} dsl_sync_task_t;
typedef struct dsl_sync_task_group {
txg_node_t dstg_node;
list_t dstg_tasks;
struct dsl_pool *dstg_pool;
uint64_t dstg_txg;
int dstg_err;
int dstg_space;
boolean_t dstg_nowaiter;
} dsl_sync_task_group_t;
dsl_sync_task_group_t *dsl_sync_task_group_create(struct dsl_pool *dp);
void dsl_sync_task_create(dsl_sync_task_group_t *dstg,
dsl_checkfunc_t *, dsl_syncfunc_t *,
void *arg1, void *arg2, int blocks_modified);
int dsl_sync_task_group_wait(dsl_sync_task_group_t *dstg);
void dsl_sync_task_group_nowait(dsl_sync_task_group_t *dstg, dmu_tx_t *tx);
void dsl_sync_task_group_destroy(dsl_sync_task_group_t *dstg);
void dsl_sync_task_group_sync(dsl_sync_task_group_t *dstg, dmu_tx_t *tx);
int dsl_sync_task_do(struct dsl_pool *dp,
dsl_checkfunc_t *checkfunc, dsl_syncfunc_t *syncfunc,
void *arg1, void *arg2, int blocks_modified);
void dsl_sync_task_do_nowait(struct dsl_pool *dp,
dsl_checkfunc_t *checkfunc, dsl_syncfunc_t *syncfunc,
void *arg1, void *arg2, int blocks_modified, dmu_tx_t *tx);
#ifdef __cplusplus
}
#endif
#endif /* _SYS_DSL_SYNCTASK_H */
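
A sketch of the check/sync pairing dsl_sync_task_do() expects: the checkfunc runs first and can veto with a nonzero error, and the syncfunc then applies the change in syncing context. Both placeholder functions are hypothetical no-ops.

#include <sys/dsl_synctask.h>

static int
example_check(void *arg1, void *arg2, dmu_tx_t *tx)
{
	/* validate preconditions; a nonzero return aborts the task */
	return (0);
}

static void
example_sync(void *arg1, void *arg2, dmu_tx_t *tx)
{
	/* apply the change; runs in syncing context */
}

/* hypothetical driver */
static int
run_example_task(struct dsl_pool *dp, void *arg1, void *arg2)
{
	return (dsl_sync_task_do(dp, example_check, example_sync,
	    arg1, arg2, 0 /* blocks_modified */));
}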

View File

@ -0,0 +1,80 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#ifndef _SYS_METASLAB_H
#define _SYS_METASLAB_H
#include <sys/spa.h>
#include <sys/space_map.h>
#include <sys/txg.h>
#include <sys/zio.h>
#include <sys/avl.h>
#ifdef __cplusplus
extern "C" {
#endif
extern space_map_ops_t *zfs_metaslab_ops;
extern metaslab_t *metaslab_init(metaslab_group_t *mg, space_map_obj_t *smo,
uint64_t start, uint64_t size, uint64_t txg);
extern void metaslab_fini(metaslab_t *msp);
extern void metaslab_sync(metaslab_t *msp, uint64_t txg);
extern void metaslab_sync_done(metaslab_t *msp, uint64_t txg);
extern void metaslab_sync_reassess(metaslab_group_t *mg);
#define METASLAB_HINTBP_FAVOR 0x0
#define METASLAB_HINTBP_AVOID 0x1
#define METASLAB_GANG_HEADER 0x2
extern int metaslab_alloc(spa_t *spa, metaslab_class_t *mc, uint64_t psize,
blkptr_t *bp, int ncopies, uint64_t txg, blkptr_t *hintbp, int flags);
extern void metaslab_free(spa_t *spa, const blkptr_t *bp, uint64_t txg,
boolean_t now);
extern int metaslab_claim(spa_t *spa, const blkptr_t *bp, uint64_t txg);
extern metaslab_class_t *metaslab_class_create(spa_t *spa,
space_map_ops_t *ops);
extern void metaslab_class_destroy(metaslab_class_t *mc);
extern int metaslab_class_validate(metaslab_class_t *mc);
extern void metaslab_class_space_update(metaslab_class_t *mc,
int64_t alloc_delta, int64_t defer_delta,
int64_t space_delta, int64_t dspace_delta);
extern uint64_t metaslab_class_get_alloc(metaslab_class_t *mc);
extern uint64_t metaslab_class_get_space(metaslab_class_t *mc);
extern uint64_t metaslab_class_get_dspace(metaslab_class_t *mc);
extern uint64_t metaslab_class_get_deferred(metaslab_class_t *mc);
extern metaslab_group_t *metaslab_group_create(metaslab_class_t *mc,
vdev_t *vd);
extern void metaslab_group_destroy(metaslab_group_t *mg);
extern void metaslab_group_activate(metaslab_group_t *mg);
extern void metaslab_group_passivate(metaslab_group_t *mg);
#ifdef __cplusplus
}
#endif
#endif /* _SYS_METASLAB_H */
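
A hedged sketch of one allocation through metaslab_alloc(); the single-copy request, NULL hint, and immediate-free undo path are all assumptions about typical usage, not code from this change.

#include <sys/metaslab.h>

/* hypothetical: allocate one block of psize bytes for txg */
static int
alloc_one_block(spa_t *spa, metaslab_class_t *mc, uint64_t psize,
    uint64_t txg, blkptr_t *bp)
{
	int err;

	err = metaslab_alloc(spa, mc, psize, bp, 1 /* ncopies */, txg,
	    NULL /* hintbp */, METASLAB_HINTBP_FAVOR);
	if (err != 0)
		return (err);
	/*
	 * If the caller later fails, the allocation could be undone
	 * immediately: metaslab_free(spa, bp, txg, B_TRUE);
	 */
	return (0);
}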

View File

@ -0,0 +1,89 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2009 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#ifndef _SYS_METASLAB_IMPL_H
#define _SYS_METASLAB_IMPL_H
#include <sys/metaslab.h>
#include <sys/space_map.h>
#include <sys/vdev.h>
#include <sys/txg.h>
#include <sys/avl.h>
#ifdef __cplusplus
extern "C" {
#endif
struct metaslab_class {
spa_t *mc_spa;
metaslab_group_t *mc_rotor;
space_map_ops_t *mc_ops;
uint64_t mc_aliquot;
uint64_t mc_alloc; /* total allocated space */
uint64_t mc_deferred; /* total deferred frees */
uint64_t mc_space; /* total space (alloc + free) */
uint64_t mc_dspace; /* total deflated space */
};
struct metaslab_group {
kmutex_t mg_lock;
avl_tree_t mg_metaslab_tree;
uint64_t mg_aliquot;
uint64_t mg_bonus_area;
int64_t mg_bias;
int64_t mg_activation_count;
metaslab_class_t *mg_class;
vdev_t *mg_vd;
metaslab_group_t *mg_prev;
metaslab_group_t *mg_next;
};
/*
 * Each metaslab's free space is tracked in a space map object in the MOS,
* which is only updated in syncing context. Each time we sync a txg,
* we append the allocs and frees from that txg to the space map object.
* When the txg is done syncing, metaslab_sync_done() updates ms_smo
* to ms_smo_syncing. Everything in ms_smo is always safe to allocate.
*/
struct metaslab {
kmutex_t ms_lock; /* metaslab lock */
space_map_obj_t ms_smo; /* synced space map object */
space_map_obj_t ms_smo_syncing; /* syncing space map object */
space_map_t ms_allocmap[TXG_SIZE]; /* allocated this txg */
space_map_t ms_freemap[TXG_SIZE]; /* freed this txg */
space_map_t ms_defermap[TXG_DEFER_SIZE]; /* deferred frees */
space_map_t ms_map; /* in-core free space map */
int64_t ms_deferspace; /* sum of ms_defermap[] space */
uint64_t ms_weight; /* weight vs. others in group */
metaslab_group_t *ms_group; /* metaslab group */
avl_node_t ms_group_node; /* node in metaslab group tree */
txg_node_t ms_txg_node; /* per-txg dirty metaslab links */
};
#ifdef __cplusplus
}
#endif
#endif /* _SYS_METASLAB_IMPL_H */

View File

@ -0,0 +1,107 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#ifndef _SYS_REFCOUNT_H
#define _SYS_REFCOUNT_H
#include <sys/inttypes.h>
#include <sys/list.h>
#include <sys/zfs_context.h>
#ifdef __cplusplus
extern "C" {
#endif
/*
* If the reference is held only by the calling function and not any
* particular object, use FTAG (which is a string) for the holder_tag.
* Otherwise, use the object that holds the reference.
*/
#define FTAG ((char *)__func__)
#ifdef ZFS_DEBUG
typedef struct reference {
list_node_t ref_link;
void *ref_holder;
uint64_t ref_number;
uint8_t *ref_removed;
} reference_t;
typedef struct refcount {
kmutex_t rc_mtx;
list_t rc_list;
list_t rc_removed;
int64_t rc_count;
int64_t rc_removed_count;
} refcount_t;
/* Note: refcount_t must be initialized with refcount_create() */
void refcount_create(refcount_t *rc);
void refcount_destroy(refcount_t *rc);
void refcount_destroy_many(refcount_t *rc, uint64_t number);
int refcount_is_zero(refcount_t *rc);
int64_t refcount_count(refcount_t *rc);
int64_t refcount_add(refcount_t *rc, void *holder_tag);
int64_t refcount_remove(refcount_t *rc, void *holder_tag);
int64_t refcount_add_many(refcount_t *rc, uint64_t number, void *holder_tag);
int64_t refcount_remove_many(refcount_t *rc, uint64_t number, void *holder_tag);
void refcount_transfer(refcount_t *dst, refcount_t *src);
void refcount_init(void);
void refcount_fini(void);
#else /* ZFS_DEBUG */
typedef struct refcount {
uint64_t rc_count;
} refcount_t;
#define refcount_create(rc) ((rc)->rc_count = 0)
#define refcount_destroy(rc) ((rc)->rc_count = 0)
#define refcount_destroy_many(rc, number) ((rc)->rc_count = 0)
#define refcount_is_zero(rc) ((rc)->rc_count == 0)
#define refcount_count(rc) ((rc)->rc_count)
#define refcount_add(rc, holder) atomic_add_64_nv(&(rc)->rc_count, 1)
#define refcount_remove(rc, holder) atomic_add_64_nv(&(rc)->rc_count, -1)
#define refcount_add_many(rc, number, holder) \
atomic_add_64_nv(&(rc)->rc_count, number)
#define refcount_remove_many(rc, number, holder) \
atomic_add_64_nv(&(rc)->rc_count, -number)
#define refcount_transfer(dst, src) { \
uint64_t __tmp = (src)->rc_count; \
atomic_add_64(&(src)->rc_count, -__tmp); \
atomic_add_64(&(dst)->rc_count, __tmp); \
}
#define refcount_init()
#define refcount_fini()
#endif /* ZFS_DEBUG */
#ifdef __cplusplus
}
#endif
#endif /* _SYS_REFCOUNT_H */


@ -0,0 +1,80 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2007 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#ifndef _SYS_RR_RW_LOCK_H
#define _SYS_RR_RW_LOCK_H
#pragma ident "%Z%%M% %I% %E% SMI"
#ifdef __cplusplus
extern "C" {
#endif
#include <sys/inttypes.h>
#include <sys/zfs_context.h>
#include <sys/refcount.h>
/*
* A reader-writer lock implementation that allows re-entrant reads, but
* still gives writers priority on "new" reads.
*
* See rrwlock.c for more details about the implementation.
*
* Fields of the rrwlock_t structure:
* - rr_lock: protects modification and reading of rrwlock_t fields
* - rr_cv: cv for waking up readers or waiting writers
* - rr_writer: thread id of the current writer
 * - rr_anon_rcount: number of active anonymous readers
* - rr_linked_rcount: total number of non-anonymous active readers
* - rr_writer_wanted: a writer wants the lock
*/
typedef struct rrwlock {
kmutex_t rr_lock;
kcondvar_t rr_cv;
kthread_t *rr_writer;
refcount_t rr_anon_rcount;
refcount_t rr_linked_rcount;
boolean_t rr_writer_wanted;
} rrwlock_t;
/*
 * 'tag' is used for reference count tracking.  The
 * 'tag' must be the same in a rrw_enter() as in its
 * corresponding rrw_exit().
*/
void rrw_init(rrwlock_t *rrl);
void rrw_destroy(rrwlock_t *rrl);
void rrw_enter(rrwlock_t *rrl, krw_t rw, void *tag);
void rrw_exit(rrwlock_t *rrl, void *tag);
boolean_t rrw_held(rrwlock_t *rrl, krw_t rw);
#define RRW_READ_HELD(x) rrw_held(x, RW_READER)
#define RRW_WRITE_HELD(x) rrw_held(x, RW_WRITER)
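/*
 * Usage sketch (illustrative; 'lock' is a hypothetical rrwlock_t):
 * enter and exit must use the same tag, per the comment above:
 *
 *	rrw_enter(&lock, RW_READER, FTAG);
 *	ASSERT(RRW_READ_HELD(&lock));
 *	...
 *	rrw_exit(&lock, FTAG);
 */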
#ifdef __cplusplus
}
#endif
#endif /* _SYS_RR_RW_LOCK_H */

170
uts/common/fs/zfs/sys/sa.h Normal file

@ -0,0 +1,170 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2010, Oracle and/or its affiliates. All rights reserved.
*/
#ifndef _SYS_SA_H
#define _SYS_SA_H
#include <sys/dmu.h>

#ifdef __cplusplus
extern "C" {
#endif
/*
* Currently available byteswap functions.
 * If at all possible, new attributes should use
 * one of the already defined byteswap functions.
* If a new byteswap function is added then the
* ZPL/Pool version will need to be bumped.
*/
typedef enum sa_bswap_type {
SA_UINT64_ARRAY,
SA_UINT32_ARRAY,
SA_UINT16_ARRAY,
SA_UINT8_ARRAY,
SA_ACL,
} sa_bswap_type_t;
typedef uint16_t sa_attr_type_t;
/*
* Attribute to register support for.
*/
typedef struct sa_attr_reg {
char *sa_name; /* attribute name */
uint16_t sa_length;
sa_bswap_type_t sa_byteswap; /* bswap function enum */
sa_attr_type_t sa_attr; /* filled in during registration */
} sa_attr_reg_t;
typedef void (sa_data_locator_t)(void **, uint32_t *, uint32_t,
boolean_t, void *userptr);
/*
 * Array of attributes to store.
 *
 * This array should be treated as opaque/private data.
 * The SA_ADD_BULK_ATTR() macro should be used for manipulating
 * the array.
 *
 * When sa_replace_all_by_template() is used, the attributes
 * will be stored in the order defined in the array, except that
 * the attributes may be split between the bonus and the spill buffer.
 */
typedef struct sa_bulk_attr {
void *sa_data;
sa_data_locator_t *sa_data_func;
uint16_t sa_length;
sa_attr_type_t sa_attr;
/* the following are private to the sa framework */
void *sa_addr;
uint16_t sa_buftype;
uint16_t sa_size;
} sa_bulk_attr_t;
/*
* special macro for adding entries for bulk attr support
* bulk - sa_bulk_attr_t
* count - integer that will be incremented during each add
* attr - attribute to manipulate
* func - function for accessing data.
* data - pointer to data.
* len - length of data
*/
#define SA_ADD_BULK_ATTR(b, idx, attr, func, data, len) \
{ \
b[idx].sa_attr = attr;\
b[idx].sa_data_func = func; \
b[idx].sa_data = data; \
b[idx++].sa_length = len; \
}
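/*
 * Example (illustrative; 'hdl', 'attr_size' and 'attr_mtime' are
 * hypothetical -- real attribute numbers come from sa_setup()):
 *
 *	sa_bulk_attr_t bulk[2];
 *	int count = 0;
 *	uint64_t size, mtime[2];
 *
 *	SA_ADD_BULK_ATTR(bulk, count, attr_size, NULL, &size, 8);
 *	SA_ADD_BULK_ATTR(bulk, count, attr_mtime, NULL, mtime, 16);
 *	error = sa_bulk_lookup(hdl, bulk, count);
 *
 * 'count' ends up as 2 because the macro post-increments it.
 */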
typedef struct sa_os sa_os_t;
typedef enum sa_handle_type {
SA_HDL_SHARED,
SA_HDL_PRIVATE
} sa_handle_type_t;
struct sa_handle;
typedef void *sa_lookup_tab_t;
typedef struct sa_handle sa_handle_t;
typedef void (sa_update_cb_t)(sa_handle_t *, dmu_tx_t *tx);
int sa_handle_get(objset_t *, uint64_t, void *userp,
sa_handle_type_t, sa_handle_t **);
int sa_handle_get_from_db(objset_t *, dmu_buf_t *, void *userp,
sa_handle_type_t, sa_handle_t **);
void sa_handle_destroy(sa_handle_t *);
int sa_buf_hold(objset_t *, uint64_t, void *, dmu_buf_t **);
void sa_buf_rele(dmu_buf_t *, void *);
int sa_lookup(sa_handle_t *, sa_attr_type_t, void *buf, uint32_t buflen);
int sa_update(sa_handle_t *, sa_attr_type_t, void *buf,
uint32_t buflen, dmu_tx_t *);
int sa_remove(sa_handle_t *, sa_attr_type_t, dmu_tx_t *);
int sa_bulk_lookup(sa_handle_t *, sa_bulk_attr_t *, int count);
int sa_bulk_lookup_locked(sa_handle_t *, sa_bulk_attr_t *, int count);
int sa_bulk_update(sa_handle_t *, sa_bulk_attr_t *, int count, dmu_tx_t *);
int sa_size(sa_handle_t *, sa_attr_type_t, int *);
int sa_update_from_cb(sa_handle_t *, sa_attr_type_t,
uint32_t buflen, sa_data_locator_t *, void *userdata, dmu_tx_t *);
void sa_object_info(sa_handle_t *, dmu_object_info_t *);
void sa_object_size(sa_handle_t *, uint32_t *, u_longlong_t *);
void sa_update_user(sa_handle_t *, sa_handle_t *);
void *sa_get_userdata(sa_handle_t *);
void sa_set_userp(sa_handle_t *, void *);
dmu_buf_t *sa_get_db(sa_handle_t *);
uint64_t sa_handle_object(sa_handle_t *);
boolean_t sa_attr_would_spill(sa_handle_t *, sa_attr_type_t, int size);
void sa_register_update_callback(objset_t *, sa_update_cb_t *);
int sa_setup(objset_t *, uint64_t, sa_attr_reg_t *, int, sa_attr_type_t **);
void sa_tear_down(objset_t *);
int sa_replace_all_by_template(sa_handle_t *, sa_bulk_attr_t *,
int, dmu_tx_t *);
int sa_replace_all_by_template_locked(sa_handle_t *, sa_bulk_attr_t *,
int, dmu_tx_t *);
boolean_t sa_enabled(objset_t *);
void sa_cache_init(void);
void sa_cache_fini(void);
int sa_set_sa_object(objset_t *, uint64_t);
int sa_hdrsize(void *);
void sa_handle_lock(sa_handle_t *);
void sa_handle_unlock(sa_handle_t *);
#ifdef _KERNEL
int sa_lookup_uio(sa_handle_t *, sa_attr_type_t, uio_t *);
#endif
#ifdef __cplusplus
}
#endif
#endif /* _SYS_SA_H */


@ -0,0 +1,287 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2010, Oracle and/or its affiliates. All rights reserved.
*/
#ifndef _SYS_SA_IMPL_H
#define _SYS_SA_IMPL_H
#include <sys/dmu.h>
#include <sys/refcount.h>
#include <sys/list.h>

#ifdef __cplusplus
extern "C" {
#endif
/*
* Array of known attributes and their
* various characteristics.
*/
typedef struct sa_attr_table {
sa_attr_type_t sa_attr;
uint8_t sa_registered;
uint16_t sa_length;
sa_bswap_type_t sa_byteswap;
char *sa_name;
} sa_attr_table_t;
/*
* Zap attribute format for attribute registration
*
* 64 56 48 40 32 24 16 8 0
* +-------+-------+-------+-------+-------+-------+-------+-------+
* | unused | len | bswap | attr num |
* +-------+-------+-------+-------+-------+-------+-------+-------+
*
* Zap attribute format for layout information.
*
 * Layout information is stored as an array of attribute numbers.
 * The name of the attribute is the layout number (0, 1, 2, ...)
*
* 16 0
* +---- ---+
* | attr # |
* +--------+
* | attr # |
* +--- ----+
* ......
*
*/
#define ATTR_BSWAP(x) BF32_GET(x, 16, 8)
#define ATTR_LENGTH(x) BF32_GET(x, 24, 16)
#define ATTR_NUM(x) BF32_GET(x, 0, 16)
#define ATTR_ENCODE(x, attr, length, bswap) \
{ \
BF64_SET(x, 24, 16, length); \
BF64_SET(x, 16, 8, bswap); \
BF64_SET(x, 0, 16, attr); \
}
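/*
 * Worked example (illustrative): registering attribute number 5 with
 * an 8-byte fixed length and SA_UINT64_ARRAY byteswap:
 *
 *	uint64_t x = 0;
 *	ATTR_ENCODE(x, 5, 8, SA_UINT64_ARRAY);
 *
 * afterwards ATTR_NUM(x) == 5, ATTR_LENGTH(x) == 8 and
 * ATTR_BSWAP(x) == SA_UINT64_ARRAY.
 */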
#define TOC_OFF(x) BF32_GET(x, 0, 23)
#define TOC_ATTR_PRESENT(x) BF32_GET(x, 31, 1)
#define TOC_LEN_IDX(x) BF32_GET(x, 24, 4)
#define TOC_ATTR_ENCODE(x, len_idx, offset) \
{ \
BF32_SET(x, 31, 1, 1); \
BF32_SET(x, 24, 7, len_idx); \
BF32_SET(x, 0, 24, offset); \
}
#define SA_LAYOUTS "LAYOUTS"
#define SA_REGISTRY "REGISTRY"
/*
 * Each unique layout has its own table,
 * sa_lot (layout_table)
 */
typedef struct sa_lot {
avl_node_t lot_num_node;
avl_node_t lot_hash_node;
uint64_t lot_num;
uint64_t lot_hash;
sa_attr_type_t *lot_attrs; /* array of attr #'s */
uint32_t lot_var_sizes; /* how many aren't fixed size */
uint32_t lot_attr_count; /* total attr count */
list_t lot_idx_tab; /* should be only a couple of entries */
int lot_instance; /* used with lot_hash to identify entry */
} sa_lot_t;
/* index table of offsets */
typedef struct sa_idx_tab {
list_node_t sa_next;
sa_lot_t *sa_layout;
uint16_t *sa_variable_lengths;
refcount_t sa_refcount;
uint32_t *sa_idx_tab; /* array of offsets */
} sa_idx_tab_t;
/*
 * Since the offset/index information into the actual data
 * will usually be identical, we can share that information with
 * all handles that have the exact same offsets.
 *
 * You would typically only have a large number of different tables of
 * contents if you had several variable-sized attributes.
 *
 * Two AVL trees are used to track the attribute layout numbers.
 * One is keyed by number and will be consulted when a DMU_OT_SA
 * object is first read. The second tree is keyed by the hash signature
 * of the attributes and will be consulted when an attribute is added
 * to determine if we already have an instance of that layout. Both
 * of these trees are interconnected. The only difference is that
 * when an entry is found in the "hash" tree the list of attributes will
 * need to be compared against the list of attributes you have in hand.
 * The assumption is that typically attributes will just be updated and
 * adding a completely new attribute is a very rare operation.
 */
struct sa_os {
kmutex_t sa_lock;
boolean_t sa_need_attr_registration;
boolean_t sa_force_spill;
uint64_t sa_master_obj;
uint64_t sa_reg_attr_obj;
uint64_t sa_layout_attr_obj;
int sa_num_attrs;
sa_attr_table_t *sa_attr_table; /* private attr table */
sa_update_cb_t *sa_update_cb;
avl_tree_t sa_layout_num_tree; /* keyed by layout number */
avl_tree_t sa_layout_hash_tree; /* keyed by layout hash value */
int sa_user_table_sz;
sa_attr_type_t *sa_user_table; /* user name->attr mapping table */
};
/*
 * Header for all bonus and spill buffers.
 * The header has a fixed portion with a variable number
 * of "lengths" depending on the number of variable-sized
 * attributes, which are determined by the "layout number".
 */
#define SA_MAGIC 0x2F505A /* ZFS SA */
typedef struct sa_hdr_phys {
uint32_t sa_magic;
uint16_t sa_layout_info; /* Encoded with hdrsize and layout number */
uint16_t sa_lengths[1]; /* optional sizes for variable length attrs */
/* ... Data follows the lengths. */
} sa_hdr_phys_t;
/*
* sa_hdr_phys -> sa_layout_info
*
* 16 10 0
* +--------+-------+
* | hdrsz |layout |
* +--------+-------+
*
 * Bits 0-9 are the layout number.
 * Bits 10-15 are the size of the header.
 * The hdrsize is the number * 8
 *
 * For example:
 * hdrsz of 1 ==> 8 byte header
 *          2 ==> 16 byte header
*
*/
#define SA_HDR_LAYOUT_NUM(hdr) BF32_GET(hdr->sa_layout_info, 0, 10)
#define SA_HDR_SIZE(hdr) BF32_GET_SB(hdr->sa_layout_info, 10, 16, 3, 0)
#define SA_HDR_LAYOUT_INFO_ENCODE(x, num, size) \
{ \
BF32_SET_SB(x, 10, 6, 3, 0, size); \
BF32_SET(x, 0, 10, num); \
}
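/*
 * Worked example (illustrative): encoding layout number 2 with a
 * 16-byte header,
 *
 *	uint16_t info = 0;
 *	SA_HDR_LAYOUT_INFO_ENCODE(info, 2, 16);
 *
 * stores 16 >> 3 == 2 in bits 10-15 and 2 in bits 0-9, so
 * info == 0x0802; SA_HDR_SIZE() on such a header returns
 * 2 << 3 == 16.
 */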
typedef enum sa_buf_type {
SA_BONUS = 1,
SA_SPILL = 2
} sa_buf_type_t;
typedef enum sa_data_op {
SA_LOOKUP,
SA_UPDATE,
SA_ADD,
SA_REPLACE,
SA_REMOVE
} sa_data_op_t;
/*
* Opaque handle used for most sa functions
*
* This needs to be kept as small as possible.
*/
struct sa_handle {
kmutex_t sa_lock;
dmu_buf_t *sa_bonus;
dmu_buf_t *sa_spill;
objset_t *sa_os;
void *sa_userp;
sa_idx_tab_t *sa_bonus_tab; /* idx of bonus */
sa_idx_tab_t *sa_spill_tab; /* only present if spill activated */
};
#define SA_GET_DB(hdl, type) \
(dmu_buf_impl_t *)((type == SA_BONUS) ? hdl->sa_bonus : hdl->sa_spill)
#define SA_GET_HDR(hdl, type) \
((sa_hdr_phys_t *)((dmu_buf_impl_t *)(SA_GET_DB(hdl, \
type))->db.db_data))
#define SA_IDX_TAB_GET(hdl, type) \
(type == SA_BONUS ? hdl->sa_bonus_tab : hdl->sa_spill_tab)
#define IS_SA_BONUSTYPE(a) \
((a == DMU_OT_SA) ? B_TRUE : B_FALSE)
#define SA_BONUSTYPE_FROM_DB(db) \
(dmu_get_bonustype((dmu_buf_t *)db))
#define SA_BLKPTR_SPACE (DN_MAX_BONUSLEN - sizeof (blkptr_t))
#define SA_LAYOUT_NUM(x, type) \
((!IS_SA_BONUSTYPE(type) ? 0 : (((IS_SA_BONUSTYPE(type)) && \
((SA_HDR_LAYOUT_NUM(x)) == 0)) ? 1 : SA_HDR_LAYOUT_NUM(x))))
#define SA_REGISTERED_LEN(sa, attr) sa->sa_attr_table[attr].sa_length
#define SA_ATTR_LEN(sa, idx, attr, hdr) ((SA_REGISTERED_LEN(sa, attr) == 0) ?\
hdr->sa_lengths[TOC_LEN_IDX(idx->sa_idx_tab[attr])] : \
SA_REGISTERED_LEN(sa, attr))
#define SA_SET_HDR(hdr, num, size) \
{ \
hdr->sa_magic = SA_MAGIC; \
SA_HDR_LAYOUT_INFO_ENCODE(hdr->sa_layout_info, num, size); \
}
#define SA_ATTR_INFO(sa, idx, hdr, attr, bulk, type, hdl) \
{ \
bulk.sa_size = SA_ATTR_LEN(sa, idx, attr, hdr); \
bulk.sa_buftype = type; \
bulk.sa_addr = \
(void *)((uintptr_t)TOC_OFF(idx->sa_idx_tab[attr]) + \
(uintptr_t)hdr); \
}
#define SA_HDR_SIZE_MATCH_LAYOUT(hdr, tb) \
(SA_HDR_SIZE(hdr) == (sizeof (sa_hdr_phys_t) + \
(tb->lot_var_sizes > 1 ? P2ROUNDUP((tb->lot_var_sizes - 1) * \
sizeof (uint16_t), 8) : 0)))
int sa_add_impl(sa_handle_t *, sa_attr_type_t,
uint32_t, sa_data_locator_t, void *, dmu_tx_t *);
void sa_register_update_callback_locked(objset_t *, sa_update_cb_t *);
int sa_size_locked(sa_handle_t *, sa_attr_type_t, int *);
void sa_default_locator(void **, uint32_t *, uint32_t, boolean_t, void *);
int sa_attr_size(sa_os_t *, sa_idx_tab_t *, sa_attr_type_t,
uint16_t *, sa_hdr_phys_t *);
#ifdef __cplusplus
}
#endif
#endif /* _SYS_SA_IMPL_H */

706
uts/common/fs/zfs/sys/spa.h Normal file

@ -0,0 +1,706 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#ifndef _SYS_SPA_H
#define _SYS_SPA_H
#include <sys/avl.h>
#include <sys/zfs_context.h>
#include <sys/nvpair.h>
#include <sys/sysmacros.h>
#include <sys/types.h>
#include <sys/fs/zfs.h>
#ifdef __cplusplus
extern "C" {
#endif
/*
* Forward references that lots of things need.
*/
typedef struct spa spa_t;
typedef struct vdev vdev_t;
typedef struct metaslab metaslab_t;
typedef struct metaslab_group metaslab_group_t;
typedef struct metaslab_class metaslab_class_t;
typedef struct zio zio_t;
typedef struct zilog zilog_t;
typedef struct spa_aux_vdev spa_aux_vdev_t;
typedef struct ddt ddt_t;
typedef struct ddt_entry ddt_entry_t;
struct dsl_pool;
/*
* General-purpose 32-bit and 64-bit bitfield encodings.
*/
#define BF32_DECODE(x, low, len) P2PHASE((x) >> (low), 1U << (len))
#define BF64_DECODE(x, low, len) P2PHASE((x) >> (low), 1ULL << (len))
#define BF32_ENCODE(x, low, len) (P2PHASE((x), 1U << (len)) << (low))
#define BF64_ENCODE(x, low, len) (P2PHASE((x), 1ULL << (len)) << (low))
#define BF32_GET(x, low, len) BF32_DECODE(x, low, len)
#define BF64_GET(x, low, len) BF64_DECODE(x, low, len)
#define BF32_SET(x, low, len, val) \
((x) ^= BF32_ENCODE((x >> low) ^ (val), low, len))
#define BF64_SET(x, low, len, val) \
((x) ^= BF64_ENCODE((x >> low) ^ (val), low, len))
#define BF32_GET_SB(x, low, len, shift, bias) \
((BF32_GET(x, low, len) + (bias)) << (shift))
#define BF64_GET_SB(x, low, len, shift, bias) \
((BF64_GET(x, low, len) + (bias)) << (shift))
#define BF32_SET_SB(x, low, len, shift, bias, val) \
BF32_SET(x, low, len, ((val) >> (shift)) - (bias))
#define BF64_SET_SB(x, low, len, shift, bias, val) \
BF64_SET(x, low, len, ((val) >> (shift)) - (bias))
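/*
 * Worked example (illustrative): writing the value 3 into an 8-bit
 * field at bit 32 of a 64-bit word and reading it back:
 *
 *	uint64_t w = 0;
 *	BF64_SET(w, 32, 8, 3);
 *	ASSERT(BF64_GET(w, 32, 8) == 3);
 *
 * The XOR form lets BF64_SET() overwrite a previously set field
 * without clearing it first.
 */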
/*
* We currently support nine block sizes, from 512 bytes to 128K.
* We could go higher, but the benefits are near-zero and the cost
* of COWing a giant block to modify one byte would become excessive.
*/
#define SPA_MINBLOCKSHIFT 9
#define SPA_MAXBLOCKSHIFT 17
#define SPA_MINBLOCKSIZE (1ULL << SPA_MINBLOCKSHIFT)
#define SPA_MAXBLOCKSIZE (1ULL << SPA_MAXBLOCKSHIFT)
#define SPA_BLOCKSIZES (SPA_MAXBLOCKSHIFT - SPA_MINBLOCKSHIFT + 1)
/*
* Size of block to hold the configuration data (a packed nvlist)
*/
#define SPA_CONFIG_BLOCKSIZE (1 << 14)
/*
* The DVA size encodings for LSIZE and PSIZE support blocks up to 32MB.
* The ASIZE encoding should be at least 64 times larger (6 more bits)
* to support up to 4-way RAID-Z mirror mode with worst-case gang block
* overhead, three DVAs per bp, plus one more bit in case we do anything
* else that expands the ASIZE.
*/
#define SPA_LSIZEBITS 16 /* LSIZE up to 32M (2^16 * 512) */
#define SPA_PSIZEBITS 16 /* PSIZE up to 32M (2^16 * 512) */
#define SPA_ASIZEBITS 24 /* ASIZE up to 64 times larger */
/*
* All SPA data is represented by 128-bit data virtual addresses (DVAs).
* The members of the dva_t should be considered opaque outside the SPA.
*/
typedef struct dva {
uint64_t dva_word[2];
} dva_t;
/*
* Each block has a 256-bit checksum -- strong enough for cryptographic hashes.
*/
typedef struct zio_cksum {
uint64_t zc_word[4];
} zio_cksum_t;
/*
* Each block is described by its DVAs, time of birth, checksum, etc.
* The word-by-word, bit-by-bit layout of the blkptr is as follows:
*
* 64 56 48 40 32 24 16 8 0
* +-------+-------+-------+-------+-------+-------+-------+-------+
* 0 | vdev1 | GRID | ASIZE |
* +-------+-------+-------+-------+-------+-------+-------+-------+
* 1 |G| offset1 |
* +-------+-------+-------+-------+-------+-------+-------+-------+
* 2 | vdev2 | GRID | ASIZE |
* +-------+-------+-------+-------+-------+-------+-------+-------+
* 3 |G| offset2 |
* +-------+-------+-------+-------+-------+-------+-------+-------+
* 4 | vdev3 | GRID | ASIZE |
* +-------+-------+-------+-------+-------+-------+-------+-------+
* 5 |G| offset3 |
* +-------+-------+-------+-------+-------+-------+-------+-------+
* 6 |BDX|lvl| type | cksum | comp | PSIZE | LSIZE |
* +-------+-------+-------+-------+-------+-------+-------+-------+
* 7 | padding |
* +-------+-------+-------+-------+-------+-------+-------+-------+
* 8 | padding |
* +-------+-------+-------+-------+-------+-------+-------+-------+
* 9 | physical birth txg |
* +-------+-------+-------+-------+-------+-------+-------+-------+
* a | logical birth txg |
* +-------+-------+-------+-------+-------+-------+-------+-------+
* b | fill count |
* +-------+-------+-------+-------+-------+-------+-------+-------+
* c | checksum[0] |
* +-------+-------+-------+-------+-------+-------+-------+-------+
* d | checksum[1] |
* +-------+-------+-------+-------+-------+-------+-------+-------+
* e | checksum[2] |
* +-------+-------+-------+-------+-------+-------+-------+-------+
* f | checksum[3] |
* +-------+-------+-------+-------+-------+-------+-------+-------+
*
* Legend:
*
* vdev virtual device ID
* offset offset into virtual device
* LSIZE logical size
* PSIZE physical size (after compression)
* ASIZE allocated size (including RAID-Z parity and gang block headers)
* GRID RAID-Z layout information (reserved for future use)
* cksum checksum function
* comp compression function
* G gang block indicator
* B byteorder (endianness)
* D dedup
* X unused
* lvl level of indirection
* type DMU object type
* phys birth txg of block allocation; zero if same as logical birth txg
* log. birth transaction group in which the block was logically born
* fill count number of non-zero blocks under this bp
* checksum[4] 256-bit checksum of the data this bp describes
*/
#define SPA_BLKPTRSHIFT 7 /* blkptr_t is 128 bytes */
#define SPA_DVAS_PER_BP 3 /* Number of DVAs in a bp */
typedef struct blkptr {
dva_t blk_dva[SPA_DVAS_PER_BP]; /* Data Virtual Addresses */
uint64_t blk_prop; /* size, compression, type, etc */
uint64_t blk_pad[2]; /* Extra space for the future */
uint64_t blk_phys_birth; /* txg when block was allocated */
uint64_t blk_birth; /* transaction group at birth */
uint64_t blk_fill; /* fill count */
zio_cksum_t blk_cksum; /* 256-bit checksum */
} blkptr_t;
/*
* Macros to get and set fields in a bp or DVA.
*/
#define DVA_GET_ASIZE(dva) \
BF64_GET_SB((dva)->dva_word[0], 0, 24, SPA_MINBLOCKSHIFT, 0)
#define DVA_SET_ASIZE(dva, x) \
BF64_SET_SB((dva)->dva_word[0], 0, 24, SPA_MINBLOCKSHIFT, 0, x)
#define DVA_GET_GRID(dva) BF64_GET((dva)->dva_word[0], 24, 8)
#define DVA_SET_GRID(dva, x) BF64_SET((dva)->dva_word[0], 24, 8, x)
#define DVA_GET_VDEV(dva) BF64_GET((dva)->dva_word[0], 32, 32)
#define DVA_SET_VDEV(dva, x) BF64_SET((dva)->dva_word[0], 32, 32, x)
#define DVA_GET_OFFSET(dva) \
BF64_GET_SB((dva)->dva_word[1], 0, 63, SPA_MINBLOCKSHIFT, 0)
#define DVA_SET_OFFSET(dva, x) \
BF64_SET_SB((dva)->dva_word[1], 0, 63, SPA_MINBLOCKSHIFT, 0, x)
#define DVA_GET_GANG(dva) BF64_GET((dva)->dva_word[1], 63, 1)
#define DVA_SET_GANG(dva, x) BF64_SET((dva)->dva_word[1], 63, 1, x)
#define BP_GET_LSIZE(bp) \
BF64_GET_SB((bp)->blk_prop, 0, 16, SPA_MINBLOCKSHIFT, 1)
#define BP_SET_LSIZE(bp, x) \
BF64_SET_SB((bp)->blk_prop, 0, 16, SPA_MINBLOCKSHIFT, 1, x)
#define BP_GET_PSIZE(bp) \
BF64_GET_SB((bp)->blk_prop, 16, 16, SPA_MINBLOCKSHIFT, 1)
#define BP_SET_PSIZE(bp, x) \
BF64_SET_SB((bp)->blk_prop, 16, 16, SPA_MINBLOCKSHIFT, 1, x)
#define BP_GET_COMPRESS(bp) BF64_GET((bp)->blk_prop, 32, 8)
#define BP_SET_COMPRESS(bp, x) BF64_SET((bp)->blk_prop, 32, 8, x)
#define BP_GET_CHECKSUM(bp) BF64_GET((bp)->blk_prop, 40, 8)
#define BP_SET_CHECKSUM(bp, x) BF64_SET((bp)->blk_prop, 40, 8, x)
#define BP_GET_TYPE(bp) BF64_GET((bp)->blk_prop, 48, 8)
#define BP_SET_TYPE(bp, x) BF64_SET((bp)->blk_prop, 48, 8, x)
#define BP_GET_LEVEL(bp) BF64_GET((bp)->blk_prop, 56, 5)
#define BP_SET_LEVEL(bp, x) BF64_SET((bp)->blk_prop, 56, 5, x)
#define BP_GET_PROP_BIT_61(bp) BF64_GET((bp)->blk_prop, 61, 1)
#define BP_SET_PROP_BIT_61(bp, x) BF64_SET((bp)->blk_prop, 61, 1, x)
#define BP_GET_DEDUP(bp) BF64_GET((bp)->blk_prop, 62, 1)
#define BP_SET_DEDUP(bp, x) BF64_SET((bp)->blk_prop, 62, 1, x)
#define BP_GET_BYTEORDER(bp) (0 - BF64_GET((bp)->blk_prop, 63, 1))
#define BP_SET_BYTEORDER(bp, x) BF64_SET((bp)->blk_prop, 63, 1, x)
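/*
 * Illustrative note: LSIZE and PSIZE are stored biased, in units of
 * SPA_MINBLOCKSIZE. BP_SET_LSIZE(bp, 4096) stores
 * (4096 >> 9) - 1 == 7 in bits 0-15, and BP_GET_LSIZE() returns
 * (7 + 1) << 9 == 4096, so a stored value of 0 means 512 bytes.
 */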
#define BP_PHYSICAL_BIRTH(bp) \
((bp)->blk_phys_birth ? (bp)->blk_phys_birth : (bp)->blk_birth)
#define BP_SET_BIRTH(bp, logical, physical) \
{ \
(bp)->blk_birth = (logical); \
(bp)->blk_phys_birth = ((logical) == (physical) ? 0 : (physical)); \
}
#define BP_GET_ASIZE(bp) \
(DVA_GET_ASIZE(&(bp)->blk_dva[0]) + DVA_GET_ASIZE(&(bp)->blk_dva[1]) + \
DVA_GET_ASIZE(&(bp)->blk_dva[2]))
#define BP_GET_UCSIZE(bp) \
((BP_GET_LEVEL(bp) > 0 || dmu_ot[BP_GET_TYPE(bp)].ot_metadata) ? \
BP_GET_PSIZE(bp) : BP_GET_LSIZE(bp))
#define BP_GET_NDVAS(bp) \
(!!DVA_GET_ASIZE(&(bp)->blk_dva[0]) + \
!!DVA_GET_ASIZE(&(bp)->blk_dva[1]) + \
!!DVA_GET_ASIZE(&(bp)->blk_dva[2]))
#define BP_COUNT_GANG(bp) \
(DVA_GET_GANG(&(bp)->blk_dva[0]) + \
DVA_GET_GANG(&(bp)->blk_dva[1]) + \
DVA_GET_GANG(&(bp)->blk_dva[2]))
#define DVA_EQUAL(dva1, dva2) \
((dva1)->dva_word[1] == (dva2)->dva_word[1] && \
(dva1)->dva_word[0] == (dva2)->dva_word[0])
#define BP_EQUAL(bp1, bp2) \
(BP_PHYSICAL_BIRTH(bp1) == BP_PHYSICAL_BIRTH(bp2) && \
DVA_EQUAL(&(bp1)->blk_dva[0], &(bp2)->blk_dva[0]) && \
DVA_EQUAL(&(bp1)->blk_dva[1], &(bp2)->blk_dva[1]) && \
DVA_EQUAL(&(bp1)->blk_dva[2], &(bp2)->blk_dva[2]))
#define ZIO_CHECKSUM_EQUAL(zc1, zc2) \
(0 == (((zc1).zc_word[0] - (zc2).zc_word[0]) | \
((zc1).zc_word[1] - (zc2).zc_word[1]) | \
((zc1).zc_word[2] - (zc2).zc_word[2]) | \
((zc1).zc_word[3] - (zc2).zc_word[3])))
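/*
 * Illustrative note: ZIO_CHECKSUM_EQUAL() ORs together the word-wise
 * differences, so the expression is 0 exactly when all four 64-bit
 * words match, without any conditional branches.
 */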
#define DVA_IS_VALID(dva) (DVA_GET_ASIZE(dva) != 0)
#define ZIO_SET_CHECKSUM(zcp, w0, w1, w2, w3) \
{ \
(zcp)->zc_word[0] = w0; \
(zcp)->zc_word[1] = w1; \
(zcp)->zc_word[2] = w2; \
(zcp)->zc_word[3] = w3; \
}
#define BP_IDENTITY(bp) (&(bp)->blk_dva[0])
#define BP_IS_GANG(bp) DVA_GET_GANG(BP_IDENTITY(bp))
#define BP_IS_HOLE(bp) ((bp)->blk_birth == 0)
/* BP_IS_RAIDZ(bp) assumes no block compression */
#define BP_IS_RAIDZ(bp) (DVA_GET_ASIZE(&(bp)->blk_dva[0]) > \
BP_GET_PSIZE(bp))
#define BP_ZERO(bp) \
{ \
(bp)->blk_dva[0].dva_word[0] = 0; \
(bp)->blk_dva[0].dva_word[1] = 0; \
(bp)->blk_dva[1].dva_word[0] = 0; \
(bp)->blk_dva[1].dva_word[1] = 0; \
(bp)->blk_dva[2].dva_word[0] = 0; \
(bp)->blk_dva[2].dva_word[1] = 0; \
(bp)->blk_prop = 0; \
(bp)->blk_pad[0] = 0; \
(bp)->blk_pad[1] = 0; \
(bp)->blk_phys_birth = 0; \
(bp)->blk_birth = 0; \
(bp)->blk_fill = 0; \
ZIO_SET_CHECKSUM(&(bp)->blk_cksum, 0, 0, 0, 0); \
}
/*
* Note: the byteorder is either 0 or -1, both of which are palindromes.
* This simplifies the endianness handling a bit.
*/
#ifdef _BIG_ENDIAN
#define ZFS_HOST_BYTEORDER (0ULL)
#else
#define ZFS_HOST_BYTEORDER (-1ULL)
#endif
#define BP_SHOULD_BYTESWAP(bp) (BP_GET_BYTEORDER(bp) != ZFS_HOST_BYTEORDER)
#define BP_SPRINTF_LEN 320
/*
* This macro allows code sharing between zfs, libzpool, and mdb.
* 'func' is either snprintf() or mdb_snprintf().
* 'ws' (whitespace) can be ' ' for single-line format, '\n' for multi-line.
*/
#define SPRINTF_BLKPTR(func, ws, buf, bp, type, checksum, compress) \
{ \
static const char *copyname[] = \
{ "zero", "single", "double", "triple" }; \
int size = BP_SPRINTF_LEN; \
int len = 0; \
int copies = 0; \
\
if (bp == NULL) { \
len = func(buf + len, size - len, "<NULL>"); \
} else if (BP_IS_HOLE(bp)) { \
len = func(buf + len, size - len, "<hole>"); \
} else { \
for (int d = 0; d < BP_GET_NDVAS(bp); d++) { \
const dva_t *dva = &bp->blk_dva[d]; \
if (DVA_IS_VALID(dva)) \
copies++; \
len += func(buf + len, size - len, \
"DVA[%d]=<%llu:%llx:%llx>%c", d, \
(u_longlong_t)DVA_GET_VDEV(dva), \
(u_longlong_t)DVA_GET_OFFSET(dva), \
(u_longlong_t)DVA_GET_ASIZE(dva), \
ws); \
} \
if (BP_IS_GANG(bp) && \
DVA_GET_ASIZE(&bp->blk_dva[2]) <= \
DVA_GET_ASIZE(&bp->blk_dva[1]) / 2) \
copies--; \
len += func(buf + len, size - len, \
"[L%llu %s] %s %s %s %s %s %s%c" \
"size=%llxL/%llxP birth=%lluL/%lluP fill=%llu%c" \
"cksum=%llx:%llx:%llx:%llx", \
(u_longlong_t)BP_GET_LEVEL(bp), \
type, \
checksum, \
compress, \
BP_GET_BYTEORDER(bp) == 0 ? "BE" : "LE", \
BP_IS_GANG(bp) ? "gang" : "contiguous", \
BP_GET_DEDUP(bp) ? "dedup" : "unique", \
copyname[copies], \
ws, \
(u_longlong_t)BP_GET_LSIZE(bp), \
(u_longlong_t)BP_GET_PSIZE(bp), \
(u_longlong_t)bp->blk_birth, \
(u_longlong_t)BP_PHYSICAL_BIRTH(bp), \
(u_longlong_t)bp->blk_fill, \
ws, \
(u_longlong_t)bp->blk_cksum.zc_word[0], \
(u_longlong_t)bp->blk_cksum.zc_word[1], \
(u_longlong_t)bp->blk_cksum.zc_word[2], \
(u_longlong_t)bp->blk_cksum.zc_word[3]); \
} \
ASSERT(len < size); \
}
#include <sys/dmu.h>
#define BP_GET_BUFC_TYPE(bp) \
(((BP_GET_LEVEL(bp) > 0) || (dmu_ot[BP_GET_TYPE(bp)].ot_metadata)) ? \
ARC_BUFC_METADATA : ARC_BUFC_DATA)
typedef enum spa_import_type {
SPA_IMPORT_EXISTING,
SPA_IMPORT_ASSEMBLE
} spa_import_type_t;
/* state manipulation functions */
extern int spa_open(const char *pool, spa_t **, void *tag);
extern int spa_open_rewind(const char *pool, spa_t **, void *tag,
nvlist_t *policy, nvlist_t **config);
extern int spa_get_stats(const char *pool, nvlist_t **config,
char *altroot, size_t buflen);
extern int spa_create(const char *pool, nvlist_t *config, nvlist_t *props,
const char *history_str, nvlist_t *zplprops);
extern int spa_import_rootpool(char *devpath, char *devid);
extern int spa_import(const char *pool, nvlist_t *config, nvlist_t *props,
uint64_t flags);
extern nvlist_t *spa_tryimport(nvlist_t *tryconfig);
extern int spa_destroy(char *pool);
extern int spa_export(char *pool, nvlist_t **oldconfig, boolean_t force,
boolean_t hardforce);
extern int spa_reset(char *pool);
extern void spa_async_request(spa_t *spa, int flag);
extern void spa_async_unrequest(spa_t *spa, int flag);
extern void spa_async_suspend(spa_t *spa);
extern void spa_async_resume(spa_t *spa);
extern spa_t *spa_inject_addref(char *pool);
extern void spa_inject_delref(spa_t *spa);
extern void spa_scan_stat_init(spa_t *spa);
extern int spa_scan_get_stats(spa_t *spa, pool_scan_stat_t *ps);
#define SPA_ASYNC_CONFIG_UPDATE 0x01
#define SPA_ASYNC_REMOVE 0x02
#define SPA_ASYNC_PROBE 0x04
#define SPA_ASYNC_RESILVER_DONE 0x08
#define SPA_ASYNC_RESILVER 0x10
#define SPA_ASYNC_AUTOEXPAND 0x20
#define SPA_ASYNC_REMOVE_DONE 0x40
#define SPA_ASYNC_REMOVE_STOP 0x80
/*
* Controls the behavior of spa_vdev_remove().
*/
#define SPA_REMOVE_UNSPARE 0x01
#define SPA_REMOVE_DONE 0x02
/* device manipulation */
extern int spa_vdev_add(spa_t *spa, nvlist_t *nvroot);
extern int spa_vdev_attach(spa_t *spa, uint64_t guid, nvlist_t *nvroot,
int replacing);
extern int spa_vdev_detach(spa_t *spa, uint64_t guid, uint64_t pguid,
int replace_done);
extern int spa_vdev_remove(spa_t *spa, uint64_t guid, boolean_t unspare);
extern boolean_t spa_vdev_remove_active(spa_t *spa);
extern int spa_vdev_setpath(spa_t *spa, uint64_t guid, const char *newpath);
extern int spa_vdev_setfru(spa_t *spa, uint64_t guid, const char *newfru);
extern int spa_vdev_split_mirror(spa_t *spa, char *newname, nvlist_t *config,
nvlist_t *props, boolean_t exp);
/* spare state (which is global across all pools) */
extern void spa_spare_add(vdev_t *vd);
extern void spa_spare_remove(vdev_t *vd);
extern boolean_t spa_spare_exists(uint64_t guid, uint64_t *pool, int *refcnt);
extern void spa_spare_activate(vdev_t *vd);
/* L2ARC state (which is global across all pools) */
extern void spa_l2cache_add(vdev_t *vd);
extern void spa_l2cache_remove(vdev_t *vd);
extern boolean_t spa_l2cache_exists(uint64_t guid, uint64_t *pool);
extern void spa_l2cache_activate(vdev_t *vd);
extern void spa_l2cache_drop(spa_t *spa);
/* scanning */
extern int spa_scan(spa_t *spa, pool_scan_func_t func);
extern int spa_scan_stop(spa_t *spa);
/* spa syncing */
extern void spa_sync(spa_t *spa, uint64_t txg); /* only for DMU use */
extern void spa_sync_allpools(void);
/*
 * SYNC_PASS_DEFERRED_FREE must be large enough that regular blocks
 * are not deferred.  XXX so can't we change it back to 1?
*/
#define SYNC_PASS_DEFERRED_FREE 2 /* defer frees after this pass */
#define SYNC_PASS_DONT_COMPRESS 4 /* don't compress after this pass */
#define SYNC_PASS_REWRITE 1 /* rewrite new bps after this pass */
/* spa namespace global mutex */
extern kmutex_t spa_namespace_lock;
/*
* SPA configuration functions in spa_config.c
*/
#define SPA_CONFIG_UPDATE_POOL 0
#define SPA_CONFIG_UPDATE_VDEVS 1
extern void spa_config_sync(spa_t *, boolean_t, boolean_t);
extern void spa_config_load(void);
extern nvlist_t *spa_all_configs(uint64_t *);
extern void spa_config_set(spa_t *spa, nvlist_t *config);
extern nvlist_t *spa_config_generate(spa_t *spa, vdev_t *vd, uint64_t txg,
int getstats);
extern void spa_config_update(spa_t *spa, int what);
/*
* Miscellaneous SPA routines in spa_misc.c
*/
/* Namespace manipulation */
extern spa_t *spa_lookup(const char *name);
extern spa_t *spa_add(const char *name, nvlist_t *config, const char *altroot);
extern void spa_remove(spa_t *spa);
extern spa_t *spa_next(spa_t *prev);
/* Refcount functions */
extern void spa_open_ref(spa_t *spa, void *tag);
extern void spa_close(spa_t *spa, void *tag);
extern boolean_t spa_refcount_zero(spa_t *spa);
#define SCL_NONE 0x00
#define SCL_CONFIG 0x01
#define SCL_STATE 0x02
#define SCL_L2ARC 0x04 /* hack until L2ARC 2.0 */
#define SCL_ALLOC 0x08
#define SCL_ZIO 0x10
#define SCL_FREE 0x20
#define SCL_VDEV 0x40
#define SCL_LOCKS 7
#define SCL_ALL ((1 << SCL_LOCKS) - 1)
#define SCL_STATE_ALL (SCL_STATE | SCL_L2ARC | SCL_ZIO)
/* Pool configuration locks */
extern int spa_config_tryenter(spa_t *spa, int locks, void *tag, krw_t rw);
extern void spa_config_enter(spa_t *spa, int locks, void *tag, krw_t rw);
extern void spa_config_exit(spa_t *spa, int locks, void *tag);
extern int spa_config_held(spa_t *spa, int locks, krw_t rw);
/* Pool vdev add/remove lock */
extern uint64_t spa_vdev_enter(spa_t *spa);
extern uint64_t spa_vdev_config_enter(spa_t *spa);
extern void spa_vdev_config_exit(spa_t *spa, vdev_t *vd, uint64_t txg,
int error, char *tag);
extern int spa_vdev_exit(spa_t *spa, vdev_t *vd, uint64_t txg, int error);
/* Pool vdev state change lock */
extern void spa_vdev_state_enter(spa_t *spa, int oplock);
extern int spa_vdev_state_exit(spa_t *spa, vdev_t *vd, int error);
/* Log state */
typedef enum spa_log_state {
SPA_LOG_UNKNOWN = 0, /* unknown log state */
SPA_LOG_MISSING, /* missing log(s) */
SPA_LOG_CLEAR, /* clear the log(s) */
SPA_LOG_GOOD, /* log(s) are good */
} spa_log_state_t;
extern spa_log_state_t spa_get_log_state(spa_t *spa);
extern void spa_set_log_state(spa_t *spa, spa_log_state_t state);
extern int spa_offline_log(spa_t *spa);
/* Log claim callback */
extern void spa_claim_notify(zio_t *zio);
/* Accessor functions */
extern boolean_t spa_shutting_down(spa_t *spa);
extern struct dsl_pool *spa_get_dsl(spa_t *spa);
extern blkptr_t *spa_get_rootblkptr(spa_t *spa);
extern void spa_set_rootblkptr(spa_t *spa, const blkptr_t *bp);
extern void spa_altroot(spa_t *, char *, size_t);
extern int spa_sync_pass(spa_t *spa);
extern char *spa_name(spa_t *spa);
extern uint64_t spa_guid(spa_t *spa);
extern uint64_t spa_last_synced_txg(spa_t *spa);
extern uint64_t spa_first_txg(spa_t *spa);
extern uint64_t spa_syncing_txg(spa_t *spa);
extern uint64_t spa_version(spa_t *spa);
extern pool_state_t spa_state(spa_t *spa);
extern spa_load_state_t spa_load_state(spa_t *spa);
extern uint64_t spa_freeze_txg(spa_t *spa);
extern uint64_t spa_get_asize(spa_t *spa, uint64_t lsize);
extern uint64_t spa_get_dspace(spa_t *spa);
extern void spa_update_dspace(spa_t *spa);
extern boolean_t spa_deflate(spa_t *spa);
extern metaslab_class_t *spa_normal_class(spa_t *spa);
extern metaslab_class_t *spa_log_class(spa_t *spa);
extern int spa_max_replication(spa_t *spa);
extern int spa_prev_software_version(spa_t *spa);
extern int spa_busy(void);
extern uint8_t spa_get_failmode(spa_t *spa);
extern boolean_t spa_suspended(spa_t *spa);
extern uint64_t spa_bootfs(spa_t *spa);
extern uint64_t spa_delegation(spa_t *spa);
extern objset_t *spa_meta_objset(spa_t *spa);
/* Miscellaneous support routines */
extern int spa_rename(const char *oldname, const char *newname);
extern spa_t *spa_by_guid(uint64_t pool_guid, uint64_t device_guid);
extern boolean_t spa_guid_exists(uint64_t pool_guid, uint64_t device_guid);
extern char *spa_strdup(const char *);
extern void spa_strfree(char *);
extern uint64_t spa_get_random(uint64_t range);
extern uint64_t spa_generate_guid(spa_t *spa);
extern void sprintf_blkptr(char *buf, const blkptr_t *bp);
extern void spa_freeze(spa_t *spa);
extern void spa_upgrade(spa_t *spa, uint64_t version);
extern void spa_evict_all(void);
extern vdev_t *spa_lookup_by_guid(spa_t *spa, uint64_t guid,
boolean_t l2cache);
extern boolean_t spa_has_spare(spa_t *, uint64_t guid);
extern uint64_t dva_get_dsize_sync(spa_t *spa, const dva_t *dva);
extern uint64_t bp_get_dsize_sync(spa_t *spa, const blkptr_t *bp);
extern uint64_t bp_get_dsize(spa_t *spa, const blkptr_t *bp);
extern boolean_t spa_has_slogs(spa_t *spa);
extern boolean_t spa_is_root(spa_t *spa);
extern boolean_t spa_writeable(spa_t *spa);
extern int spa_mode(spa_t *spa);
extern uint64_t strtonum(const char *str, char **nptr);
/* history logging */
typedef enum history_log_type {
LOG_CMD_POOL_CREATE,
LOG_CMD_NORMAL,
LOG_INTERNAL
} history_log_type_t;
typedef struct history_arg {
char *ha_history_str;
history_log_type_t ha_log_type;
history_internal_events_t ha_event;
char *ha_zone;
uid_t ha_uid;
} history_arg_t;
extern char *spa_his_ievent_table[];
extern void spa_history_create_obj(spa_t *spa, dmu_tx_t *tx);
extern int spa_history_get(spa_t *spa, uint64_t *offset, uint64_t *len_read,
char *his_buf);
extern int spa_history_log(spa_t *spa, const char *his_buf,
history_log_type_t what);
extern void spa_history_log_internal(history_internal_events_t event,
spa_t *spa, dmu_tx_t *tx, const char *fmt, ...);
extern void spa_history_log_version(spa_t *spa, history_internal_events_t evt);
/* error handling */
struct zbookmark;
extern void spa_log_error(spa_t *spa, zio_t *zio);
extern void zfs_ereport_post(const char *class, spa_t *spa, vdev_t *vd,
zio_t *zio, uint64_t stateoroffset, uint64_t length);
extern void zfs_post_remove(spa_t *spa, vdev_t *vd);
extern void zfs_post_state_change(spa_t *spa, vdev_t *vd);
extern void zfs_post_autoreplace(spa_t *spa, vdev_t *vd);
extern uint64_t spa_get_errlog_size(spa_t *spa);
extern int spa_get_errlog(spa_t *spa, void *uaddr, size_t *count);
extern void spa_errlog_rotate(spa_t *spa);
extern void spa_errlog_drain(spa_t *spa);
extern void spa_errlog_sync(spa_t *spa, uint64_t txg);
extern void spa_get_errlists(spa_t *spa, avl_tree_t *last, avl_tree_t *scrub);
/* vdev cache */
extern void vdev_cache_stat_init(void);
extern void vdev_cache_stat_fini(void);
/* Initialization and termination */
extern void spa_init(int flags);
extern void spa_fini(void);
extern void spa_boot_init(void);
/* properties */
extern int spa_prop_set(spa_t *spa, nvlist_t *nvp);
extern int spa_prop_get(spa_t *spa, nvlist_t **nvp);
extern void spa_prop_clear_bootfs(spa_t *spa, uint64_t obj, dmu_tx_t *tx);
extern void spa_configfile_set(spa_t *, nvlist_t *, boolean_t);
/* asynchronous event notification */
extern void spa_event_notify(spa_t *spa, vdev_t *vdev, const char *name);
#ifdef ZFS_DEBUG
#define dprintf_bp(bp, fmt, ...) do { \
if (zfs_flags & ZFS_DEBUG_DPRINTF) { \
char *__blkbuf = kmem_alloc(BP_SPRINTF_LEN, KM_SLEEP); \
sprintf_blkptr(__blkbuf, (bp)); \
dprintf(fmt " %s\n", __VA_ARGS__, __blkbuf); \
kmem_free(__blkbuf, BP_SPRINTF_LEN); \
} \
_NOTE(CONSTCOND) } while (0)
#else
#define dprintf_bp(bp, fmt, ...)
#endif
extern int spa_mode_global; /* mode, e.g. FREAD | FWRITE */
#ifdef __cplusplus
}
#endif
#endif /* _SYS_SPA_H */


@ -0,0 +1,42 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2009 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#ifndef _SYS_SPA_BOOT_H
#define _SYS_SPA_BOOT_H
#include <sys/nvpair.h>
#ifdef __cplusplus
extern "C" {
#endif
extern char *spa_get_bootprop(char *prop);
extern void spa_free_bootprop(char *prop);
#ifdef __cplusplus
}
#endif
#endif /* _SYS_SPA_BOOT_H */


@ -0,0 +1,235 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
*/
#ifndef _SYS_SPA_IMPL_H
#define _SYS_SPA_IMPL_H
#include <sys/spa.h>
#include <sys/vdev.h>
#include <sys/metaslab.h>
#include <sys/dmu.h>
#include <sys/dsl_pool.h>
#include <sys/uberblock_impl.h>
#include <sys/zfs_context.h>
#include <sys/avl.h>
#include <sys/refcount.h>
#include <sys/bplist.h>
#include <sys/bpobj.h>
#ifdef __cplusplus
extern "C" {
#endif
typedef struct spa_error_entry {
zbookmark_t se_bookmark;
char *se_name;
avl_node_t se_avl;
} spa_error_entry_t;
typedef struct spa_history_phys {
uint64_t sh_pool_create_len; /* ending offset of zpool create */
uint64_t sh_phys_max_off; /* physical EOF */
uint64_t sh_bof; /* logical BOF */
uint64_t sh_eof; /* logical EOF */
uint64_t sh_records_lost; /* num of records overwritten */
} spa_history_phys_t;
struct spa_aux_vdev {
uint64_t sav_object; /* MOS object for device list */
nvlist_t *sav_config; /* cached device config */
vdev_t **sav_vdevs; /* devices */
int sav_count; /* number of devices */
boolean_t sav_sync; /* sync the device list */
nvlist_t **sav_pending; /* pending device additions */
uint_t sav_npending; /* # pending devices */
};
typedef struct spa_config_lock {
kmutex_t scl_lock;
kthread_t *scl_writer;
int scl_write_wanted;
kcondvar_t scl_cv;
refcount_t scl_count;
} spa_config_lock_t;
typedef struct spa_config_dirent {
list_node_t scd_link;
char *scd_path;
} spa_config_dirent_t;
enum zio_taskq_type {
ZIO_TASKQ_ISSUE = 0,
ZIO_TASKQ_ISSUE_HIGH,
ZIO_TASKQ_INTERRUPT,
ZIO_TASKQ_INTERRUPT_HIGH,
ZIO_TASKQ_TYPES
};
/*
 * State machine for the zpool-poolname process.  The state
 * transitions are done as follows:
*
* From To Routine
* PROC_NONE -> PROC_CREATED spa_activate()
* PROC_CREATED -> PROC_ACTIVE spa_thread()
* PROC_ACTIVE -> PROC_DEACTIVATE spa_deactivate()
* PROC_DEACTIVATE -> PROC_GONE spa_thread()
* PROC_GONE -> PROC_NONE spa_deactivate()
*/
typedef enum spa_proc_state {
SPA_PROC_NONE, /* spa_proc = &p0, no process created */
SPA_PROC_CREATED, /* spa_activate() has proc, is waiting */
SPA_PROC_ACTIVE, /* taskqs created, spa_proc set */
SPA_PROC_DEACTIVATE, /* spa_deactivate() requests process exit */
SPA_PROC_GONE /* spa_thread() is exiting, spa_proc = &p0 */
} spa_proc_state_t;
struct spa {
/*
* Fields protected by spa_namespace_lock.
*/
char spa_name[MAXNAMELEN]; /* pool name */
avl_node_t spa_avl; /* node in spa_namespace_avl */
nvlist_t *spa_config; /* last synced config */
nvlist_t *spa_config_syncing; /* currently syncing config */
nvlist_t *spa_config_splitting; /* config for splitting */
nvlist_t *spa_load_info; /* info and errors from load */
uint64_t spa_config_txg; /* txg of last config change */
int spa_sync_pass; /* iterate-to-convergence */
pool_state_t spa_state; /* pool state */
int spa_inject_ref; /* injection references */
uint8_t spa_sync_on; /* sync threads are running */
spa_load_state_t spa_load_state; /* current load operation */
uint64_t spa_import_flags; /* import specific flags */
taskq_t *spa_zio_taskq[ZIO_TYPES][ZIO_TASKQ_TYPES];
dsl_pool_t *spa_dsl_pool;
metaslab_class_t *spa_normal_class; /* normal data class */
metaslab_class_t *spa_log_class; /* intent log data class */
uint64_t spa_first_txg; /* first txg after spa_open() */
uint64_t spa_final_txg; /* txg of export/destroy */
uint64_t spa_freeze_txg; /* freeze pool at this txg */
uint64_t spa_load_max_txg; /* best initial ub_txg */
uint64_t spa_claim_max_txg; /* highest claimed birth txg */
timespec_t spa_loaded_ts; /* 1st successful open time */
objset_t *spa_meta_objset; /* copy of dp->dp_meta_objset */
txg_list_t spa_vdev_txg_list; /* per-txg dirty vdev list */
vdev_t *spa_root_vdev; /* top-level vdev container */
uint64_t spa_load_guid; /* initial guid for spa_load */
list_t spa_config_dirty_list; /* vdevs with dirty config */
list_t spa_state_dirty_list; /* vdevs with dirty state */
spa_aux_vdev_t spa_spares; /* hot spares */
spa_aux_vdev_t spa_l2cache; /* L2ARC cache devices */
uint64_t spa_config_object; /* MOS object for pool config */
uint64_t spa_config_generation; /* config generation number */
uint64_t spa_syncing_txg; /* txg currently syncing */
bpobj_t spa_deferred_bpobj; /* deferred-free bplist */
bplist_t spa_free_bplist[TXG_SIZE]; /* bplist of stuff to free */
uberblock_t spa_ubsync; /* last synced uberblock */
uberblock_t spa_uberblock; /* current uberblock */
boolean_t spa_extreme_rewind; /* rewind past deferred frees */
uint64_t spa_last_io; /* lbolt of last non-scan I/O */
kmutex_t spa_scrub_lock; /* resilver/scrub lock */
uint64_t spa_scrub_inflight; /* in-flight scrub I/Os */
kcondvar_t spa_scrub_io_cv; /* scrub I/O completion */
uint8_t spa_scrub_active; /* active or suspended? */
uint8_t spa_scrub_type; /* type of scrub we're doing */
uint8_t spa_scrub_finished; /* indicator to rotate logs */
uint8_t spa_scrub_started; /* started since last boot */
uint8_t spa_scrub_reopen; /* scrub doing vdev_reopen */
uint64_t spa_scan_pass_start; /* start time per pass/reboot */
uint64_t spa_scan_pass_exam; /* examined bytes per pass */
kmutex_t spa_async_lock; /* protect async state */
kthread_t *spa_async_thread; /* thread doing async task */
int spa_async_suspended; /* async tasks suspended */
kcondvar_t spa_async_cv; /* wait for thread_exit() */
uint16_t spa_async_tasks; /* async task mask */
char *spa_root; /* alternate root directory */
uint64_t spa_ena; /* spa-wide ereport ENA */
int spa_last_open_failed; /* error if last open failed */
uint64_t spa_last_ubsync_txg; /* "best" uberblock txg */
uint64_t spa_last_ubsync_txg_ts; /* timestamp from that ub */
uint64_t spa_load_txg; /* ub txg that loaded */
uint64_t spa_load_txg_ts; /* timestamp from that ub */
uint64_t spa_load_meta_errors; /* verify metadata err count */
uint64_t spa_load_data_errors; /* verify data err count */
uint64_t spa_verify_min_txg; /* start txg of verify scrub */
kmutex_t spa_errlog_lock; /* error log lock */
uint64_t spa_errlog_last; /* last error log object */
uint64_t spa_errlog_scrub; /* scrub error log object */
kmutex_t spa_errlist_lock; /* error list/ereport lock */
avl_tree_t spa_errlist_last; /* last error list */
avl_tree_t spa_errlist_scrub; /* scrub error list */
uint64_t spa_deflate; /* should we deflate? */
uint64_t spa_history; /* history object */
kmutex_t spa_history_lock; /* history lock */
vdev_t *spa_pending_vdev; /* pending vdev additions */
kmutex_t spa_props_lock; /* property lock */
uint64_t spa_pool_props_object; /* object for properties */
uint64_t spa_bootfs; /* default boot filesystem */
uint64_t spa_failmode; /* failure mode for the pool */
uint64_t spa_delegation; /* delegation on/off */
list_t spa_config_list; /* previous cache file(s) */
zio_t *spa_async_zio_root; /* root of all async I/O */
zio_t *spa_suspend_zio_root; /* root of all suspended I/O */
kmutex_t spa_suspend_lock; /* protects suspend_zio_root */
kcondvar_t spa_suspend_cv; /* notification of resume */
uint8_t spa_suspended; /* pool is suspended */
uint8_t spa_claiming; /* pool is doing zil_claim() */
boolean_t spa_is_root; /* pool is root */
int spa_minref; /* num refs when first opened */
int spa_mode; /* FREAD | FWRITE */
spa_log_state_t spa_log_state; /* log state */
uint64_t spa_autoexpand; /* lun expansion on/off */
ddt_t *spa_ddt[ZIO_CHECKSUM_FUNCTIONS]; /* in-core DDTs */
uint64_t spa_ddt_stat_object; /* DDT statistics */
uint64_t spa_dedup_ditto; /* dedup ditto threshold */
uint64_t spa_dedup_checksum; /* default dedup checksum */
uint64_t spa_dspace; /* dspace in normal class */
kmutex_t spa_vdev_top_lock; /* dueling offline/remove */
kmutex_t spa_proc_lock; /* protects spa_proc* */
kcondvar_t spa_proc_cv; /* spa_proc_state transitions */
spa_proc_state_t spa_proc_state; /* see definition */
struct proc *spa_proc; /* "zpool-poolname" process */
uint64_t spa_did; /* if procp != p0, did of t1 */
boolean_t spa_autoreplace; /* autoreplace set in open */
int spa_vdev_locks; /* locks grabbed */
uint64_t spa_creation_version; /* version at pool creation */
uint64_t spa_prev_software_version;
/*
* spa_refcnt & spa_config_lock must be the last elements
* because refcount_t changes size based on compilation options.
* In order for the MDB module to function correctly, the other
* fields must remain in the same location.
*/
spa_config_lock_t spa_config_lock[SCL_LOCKS]; /* config changes */
refcount_t spa_refcount; /* number of opens */
};
extern const char *spa_config_path;
#ifdef __cplusplus
}
#endif
#endif /* _SYS_SPA_IMPL_H */


@ -0,0 +1,179 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2009 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#ifndef _SYS_SPACE_MAP_H
#define _SYS_SPACE_MAP_H
#include <sys/avl.h>
#include <sys/dmu.h>
#ifdef __cplusplus
extern "C" {
#endif
typedef struct space_map_ops space_map_ops_t;
typedef struct space_map {
avl_tree_t sm_root; /* AVL tree of map segments */
uint64_t sm_space; /* sum of all segments in the map */
uint64_t sm_start; /* start of map */
uint64_t sm_size; /* size of map */
uint8_t sm_shift; /* unit shift */
uint8_t sm_pad[3]; /* unused */
uint8_t sm_loaded; /* map loaded? */
uint8_t sm_loading; /* map loading? */
kcondvar_t sm_load_cv; /* map load completion */
space_map_ops_t *sm_ops; /* space map block picker ops vector */
avl_tree_t *sm_pp_root; /* picker-private AVL tree */
void *sm_ppd; /* picker-private data */
kmutex_t *sm_lock; /* pointer to lock that protects map */
} space_map_t;
typedef struct space_seg {
avl_node_t ss_node; /* AVL node */
avl_node_t ss_pp_node; /* AVL picker-private node */
uint64_t ss_start; /* starting offset of this segment */
uint64_t ss_end; /* ending offset (non-inclusive) */
} space_seg_t;
typedef struct space_ref {
avl_node_t sr_node; /* AVL node */
uint64_t sr_offset; /* offset (start or end) */
int64_t sr_refcnt; /* associated reference count */
} space_ref_t;
typedef struct space_map_obj {
uint64_t smo_object; /* on-disk space map object */
uint64_t smo_objsize; /* size of the object */
uint64_t smo_alloc; /* space allocated from the map */
} space_map_obj_t;
struct space_map_ops {
void (*smop_load)(space_map_t *sm);
void (*smop_unload)(space_map_t *sm);
uint64_t (*smop_alloc)(space_map_t *sm, uint64_t size);
void (*smop_claim)(space_map_t *sm, uint64_t start, uint64_t size);
void (*smop_free)(space_map_t *sm, uint64_t start, uint64_t size);
uint64_t (*smop_max)(space_map_t *sm);
boolean_t (*smop_fragmented)(space_map_t *sm);
};
/*
* debug entry
*
* 1 3 10 50
* ,---+--------+------------+---------------------------------.
* | 1 | action | syncpass | txg (lower bits) |
* `---+--------+------------+---------------------------------'
* 63 62 60 59 50 49 0
*
*
*
* non-debug entry
*
* 1 47 1 15
* ,-----------------------------------------------------------.
* | 0 | offset (sm_shift units) | type | run |
* `-----------------------------------------------------------'
 * 63 62 16 15 14 0
*/
/* All this stuff takes and returns bytes */
#define SM_RUN_DECODE(x) (BF64_DECODE(x, 0, 15) + 1)
#define SM_RUN_ENCODE(x) BF64_ENCODE((x) - 1, 0, 15)
#define SM_TYPE_DECODE(x) BF64_DECODE(x, 15, 1)
#define SM_TYPE_ENCODE(x) BF64_ENCODE(x, 15, 1)
#define SM_OFFSET_DECODE(x) BF64_DECODE(x, 16, 47)
#define SM_OFFSET_ENCODE(x) BF64_ENCODE(x, 16, 47)
#define SM_DEBUG_DECODE(x) BF64_DECODE(x, 63, 1)
#define SM_DEBUG_ENCODE(x) BF64_ENCODE(x, 63, 1)
#define SM_DEBUG_ACTION_DECODE(x) BF64_DECODE(x, 60, 3)
#define SM_DEBUG_ACTION_ENCODE(x) BF64_ENCODE(x, 60, 3)
#define SM_DEBUG_SYNCPASS_DECODE(x) BF64_DECODE(x, 50, 10)
#define SM_DEBUG_SYNCPASS_ENCODE(x) BF64_ENCODE(x, 50, 10)
#define SM_DEBUG_TXG_DECODE(x) BF64_DECODE(x, 0, 50)
#define SM_DEBUG_TXG_ENCODE(x) BF64_ENCODE(x, 0, 50)
#define SM_RUN_MAX SM_RUN_DECODE(~0ULL)
#define SM_ALLOC 0x0
#define SM_FREE 0x1
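/*
 * Worked example (illustrative): an allocation covering 24
 * sm_shift-sized units starting at unit offset 100 is written as the
 * single entry
 *
 *	SM_OFFSET_ENCODE(100) | SM_TYPE_ENCODE(SM_ALLOC) |
 *	    SM_RUN_ENCODE(24)
 *
 * Because SM_RUN_ENCODE() stores run - 1, runs of 1..SM_RUN_MAX
 * (32768) fit in the 15-bit run field.
 */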
/*
* The data for a given space map can be kept on blocks of any size.
* Larger blocks entail fewer i/o operations, but they also cause the
* DMU to keep more data in-core, and also to waste more i/o bandwidth
* when only a few blocks have changed since the last transaction group.
* This could use a lot more research, but for now, set the freelist
* block size to 4k (2^12).
*/
#define SPACE_MAP_BLOCKSHIFT 12
typedef void space_map_func_t(space_map_t *sm, uint64_t start, uint64_t size);
extern void space_map_create(space_map_t *sm, uint64_t start, uint64_t size,
uint8_t shift, kmutex_t *lp);
extern void space_map_destroy(space_map_t *sm);
extern void space_map_add(space_map_t *sm, uint64_t start, uint64_t size);
extern void space_map_remove(space_map_t *sm, uint64_t start, uint64_t size);
extern boolean_t space_map_contains(space_map_t *sm,
uint64_t start, uint64_t size);
extern void space_map_vacate(space_map_t *sm,
space_map_func_t *func, space_map_t *mdest);
extern void space_map_walk(space_map_t *sm,
space_map_func_t *func, space_map_t *mdest);
extern void space_map_load_wait(space_map_t *sm);
extern int space_map_load(space_map_t *sm, space_map_ops_t *ops,
uint8_t maptype, space_map_obj_t *smo, objset_t *os);
extern void space_map_unload(space_map_t *sm);
extern uint64_t space_map_alloc(space_map_t *sm, uint64_t size);
extern void space_map_claim(space_map_t *sm, uint64_t start, uint64_t size);
extern void space_map_free(space_map_t *sm, uint64_t start, uint64_t size);
extern uint64_t space_map_maxsize(space_map_t *sm);
extern void space_map_sync(space_map_t *sm, uint8_t maptype,
space_map_obj_t *smo, objset_t *os, dmu_tx_t *tx);
extern void space_map_truncate(space_map_obj_t *smo,
objset_t *os, dmu_tx_t *tx);
extern void space_map_ref_create(avl_tree_t *t);
extern void space_map_ref_destroy(avl_tree_t *t);
extern void space_map_ref_add_seg(avl_tree_t *t,
uint64_t start, uint64_t end, int64_t refcnt);
extern void space_map_ref_add_map(avl_tree_t *t,
space_map_t *sm, int64_t refcnt);
extern void space_map_ref_generate_map(avl_tree_t *t,
space_map_t *sm, int64_t minref);
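/*
 * Usage sketch (editor's illustration, not part of the original header;
 * assumes the caller holds the map lock across operations, as the map
 * functions expect). This adds segment [0, 8192), then trims it down
 * to [4096, 8192):
 *
 *	kmutex_t lock;
 *	space_map_t sm;
 *
 *	mutex_init(&lock, NULL, MUTEX_DEFAULT, NULL);
 *	mutex_enter(&lock);
 *	space_map_create(&sm, 0, 1ULL << 30, 12, &lock);
 *	space_map_add(&sm, 0, 8192);
 *	ASSERT(space_map_contains(&sm, 4096, 4096));
 *	space_map_remove(&sm, 0, 4096);
 *	space_map_destroy(&sm);
 *	mutex_exit(&lock);
 */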
#ifdef __cplusplus
}
#endif
#endif /* _SYS_SPACE_MAP_H */

131
uts/common/fs/zfs/sys/txg.h Normal file
View File

@ -0,0 +1,131 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2010 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#ifndef _SYS_TXG_H
#define _SYS_TXG_H
#include <sys/spa.h>
#include <sys/zfs_context.h>
#ifdef __cplusplus
extern "C" {
#endif
#define TXG_CONCURRENT_STATES 3 /* open, quiescing, syncing */
#define TXG_SIZE 4 /* next power of 2 */
#define TXG_MASK (TXG_SIZE - 1) /* mask for size */
#define TXG_INITIAL TXG_SIZE /* initial txg */
#define TXG_IDX (txg & TXG_MASK)
/* Number of txgs worth of frees we defer adding to in-core spacemaps */
#define TXG_DEFER_SIZE 2
#define TXG_WAIT 1ULL
#define TXG_NOWAIT 2ULL
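/*
 * Editor's note: TXG_SIZE is a power of two so that "txg & TXG_MASK"
 * (as in TXG_IDX above) indexes a four-slot ring, e.g. txg 5 uses slot
 * 1 (5 & 3). With TXG_CONCURRENT_STATES == 3, at most three slots are
 * active at once (open, quiescing, syncing), leaving one slot of slack.
 */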
typedef struct tx_cpu tx_cpu_t;
typedef struct txg_handle {
tx_cpu_t *th_cpu;
uint64_t th_txg;
} txg_handle_t;
typedef struct txg_node {
struct txg_node *tn_next[TXG_SIZE];
uint8_t tn_member[TXG_SIZE];
} txg_node_t;
typedef struct txg_list {
kmutex_t tl_lock;
size_t tl_offset;
txg_node_t *tl_head[TXG_SIZE];
} txg_list_t;
struct dsl_pool;
extern void txg_init(struct dsl_pool *dp, uint64_t txg);
extern void txg_fini(struct dsl_pool *dp);
extern void txg_sync_start(struct dsl_pool *dp);
extern void txg_sync_stop(struct dsl_pool *dp);
extern uint64_t txg_hold_open(struct dsl_pool *dp, txg_handle_t *txghp);
extern void txg_rele_to_quiesce(txg_handle_t *txghp);
extern void txg_rele_to_sync(txg_handle_t *txghp);
extern void txg_register_callbacks(txg_handle_t *txghp, list_t *tx_callbacks);
/*
* Delay the caller by the specified number of ticks or until
* the txg closes (whichever comes first). This is intended
* to be used to throttle writers when the system nears its
* capacity.
*/
extern void txg_delay(struct dsl_pool *dp, uint64_t txg, int ticks);
/*
* Wait until the given transaction group has finished syncing.
 * Try to make this happen as soon as possible (e.g. kick off any
* necessary syncs immediately). If txg==0, wait for the currently open
* txg to finish syncing.
*/
extern void txg_wait_synced(struct dsl_pool *dp, uint64_t txg);
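/*
 * Typical open/release/wait sequence (editor's sketch; 'dp' is a
 * hypothetical pool pointer, and the work step is elided):
 *
 *	txg_handle_t th;
 *	uint64_t txg = txg_hold_open(dp, &th);
 *	(assign work to this txg)
 *	txg_rele_to_sync(&th);
 *	txg_wait_synced(dp, txg);
 */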
/*
* Wait until the given transaction group, or one after it, is
* the open transaction group. Try to make this happen as soon
 * as possible (e.g. kick off any necessary syncs immediately).
* If txg == 0, wait for the next open txg.
*/
extern void txg_wait_open(struct dsl_pool *dp, uint64_t txg);
/*
* Returns TRUE if we are "backed up" waiting for the syncing
* transaction to complete; otherwise returns FALSE.
*/
extern boolean_t txg_stalled(struct dsl_pool *dp);
/* returns TRUE if someone is waiting for the next txg to sync */
extern boolean_t txg_sync_waiting(struct dsl_pool *dp);
/*
* Per-txg object lists.
*/
#define TXG_CLEAN(txg) ((txg) - 1)
extern void txg_list_create(txg_list_t *tl, size_t offset);
extern void txg_list_destroy(txg_list_t *tl);
extern int txg_list_empty(txg_list_t *tl, uint64_t txg);
extern int txg_list_add(txg_list_t *tl, void *p, uint64_t txg);
extern int txg_list_add_tail(txg_list_t *tl, void *p, uint64_t txg);
extern void *txg_list_remove(txg_list_t *tl, uint64_t txg);
extern void *txg_list_remove_this(txg_list_t *tl, void *p, uint64_t txg);
extern int txg_list_member(txg_list_t *tl, void *p, uint64_t txg);
extern void *txg_list_head(txg_list_t *tl, uint64_t txg);
extern void *txg_list_next(txg_list_t *tl, void *p, uint64_t txg);
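/*
 * Per-txg list sketch (editor's illustration; my_obj_t and obj are
 * hypothetical, with a txg_node_t embedded at a fixed offset):
 *
 *	typedef struct my_obj { txg_node_t mo_node; } my_obj_t;
 *	txg_list_t tl;
 *
 *	txg_list_create(&tl, offsetof(my_obj_t, mo_node));
 *	(void) txg_list_add(&tl, obj, txg);
 *	for (my_obj_t *o = txg_list_head(&tl, txg); o != NULL;
 *	    o = txg_list_next(&tl, o, txg))
 *		(visit o)
 *	txg_list_destroy(&tl);
 */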
#ifdef __cplusplus
}
#endif
#endif /* _SYS_TXG_H */

Some files were not shown because too many files have changed in this diff.