Commit Graph

1139 Commits

Author SHA1 Message Date
Matt Macy
d12e91d584 Make dnode definition uniform on !x86
gcc4 requires -fms-extensions to accept anonymous union members
2018-08-21 03:45:09 +00:00
Kyle Evans
7920ad944b libbe(3): Move build goop back out of cddl/
Some background: in the GSoC project, libbe/Makefile lived in lib/libbe. I
created projects/bectl branch, maintained the above for all of five
minutes before I misread Makefile.inc1 and decided that it couldn't possibly
build outside of cddl/, so I kicked the Makefile out into the cddl/ build
and all was good. The misreading was of the bit where .WAIT is added to
SUBDIR after lib, libexec but prior to building bin and cddl *only during
the install targets*, which is the critical part.

Fast forward- buildworld was still broken in my branch unbeknownst to me
because I didn't nuke my OBJDIR. Combing through Makefile.inc1 eventually
revealed the necessary magic to make sure that libbe's dependencies are
specified well enough, and it becomes clear what needs to be done to make a
non-cddl/ build work. This is an interesting prospect, because the build
split is kind of annoying to work with.

IGNORE_PRAGMA is added to avoid dropping WARNS by one more. This was
previously pulled in via cddl/Makefile.inc.
2018-08-18 03:20:59 +00:00
Kyle Evans
f25a4e58ec libbe(3): Remove -v from LDFLAGS
-v is clearly not needed for linking, and it adds extra verbose information
that is not necessary.
2018-08-18 03:08:54 +00:00
Mark Johnston
f0af0b312f Add partial documentation for dtrace(1)'s -x configuration options.
Some options are still missing descriptions, but they can be filled in
over time.

Submitted by:	raichoo <raichoo@googlemail.com>
Reviewed by:	0mp (previous version)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D16671
2018-08-16 19:28:44 +00:00
Will Andrews
450e5a4378 zfs: add ztest to the kyua test suite.
This program is currently failing, and has been for >6 months on HEAD.
Ideally, this should be run 24x7 in CI, to discover hard-to-find bugs that
only manifest with concurrent i/o.

Requested by:	lwhsu, mmacy
2018-08-15 13:05:04 +00:00
Kyle Evans
f2fdf2a1dc libbe(3)/bectl(8): Remove now-redundant include paths
These were previously necessary because the libnvpair and libzfs_core
includes were not installed into the SYSROOT, being a part of the copies
target in include/Makefile rather than being installed with the library.

This was fixed in r337696 and the headers are now installed properly, so we
may let go of the cruft.
2018-08-13 05:01:19 +00:00
Kyle Evans
ce33c57d6c Use INCS for non-sys/ libnvpair and libzfs_core includes
While nothing was wrong with libnvpair.h, libzfs_core.h was only guarded by
MK_CDDL rather than MK_CDDL && MK_ZFS. Rather than ugl'if'ying
include/Makefile to impose the extra restriction, just move the non-sys/
includes into INCS with the respective lib builds.

This has the added bonus of allowing third party packagers to try and split
these libs out of the FreeBSD-runtime package, if they are so inclined.

The sys/ include was left alone- generally userland libraries shouldn't
install kernel headers.

MFC after:	1 week
2018-08-13 03:38:32 +00:00
Matt Macy
cc0fbbb92e MFV/ZoL: Implement large_dnode pool feature
commit 50c957f702
Author: Ned Bass <bass6@llnl.gov>
Date:   Wed Mar 16 18:25:34 2016 -0700

    Implement large_dnode pool feature

    Justification
    -------------

    This feature adds support for variable length dnodes. Our motivation is
    to eliminate the overhead associated with using spill blocks.  Spill
    blocks are used to store system attribute data (i.e. file metadata) that
    does not fit in the dnode's bonus buffer. By allowing a larger bonus
    buffer area the use of a spill block can be avoided.  Spill blocks
    potentially incur an additional read I/O for every dnode in a dnode
    block. As a worst case example, reading 32 dnodes from a 16k dnode block
    and all of the spill blocks could issue 33 separate reads. Now suppose
    those dnodes have size 1024 and therefore don't need spill blocks.  Then
    the worst case number of blocks read is reduced from 33 to two--one
    per dnode block. In practice spill blocks may tend to be co-located on
    disk with the dnode blocks so the reduction in I/O would not be this
    drastic. In a badly fragmented pool, however, the improvement could be
    significant.
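
A minimal sketch of the worst-case arithmetic above, in Python. The function and constant names are hypothetical and not part of ZFS; the model simply charges one read per dnode block plus one read per spill block.

```python
DNODE_BLOCK_SIZE = 16 * 1024  # bytes per dnode block, as stated above


def worst_case_reads(num_dnodes, dnode_size, needs_spill):
    """Count block reads for num_dnodes, with one extra read per spill block."""
    dnodes_per_block = DNODE_BLOCK_SIZE // dnode_size
    # Ceiling division: how many dnode blocks are needed to hold num_dnodes.
    dnode_blocks = -(-num_dnodes // dnodes_per_block)
    spill_reads = num_dnodes if needs_spill else 0
    return dnode_blocks + spill_reads


legacy = worst_case_reads(32, 512, needs_spill=True)
large = worst_case_reads(32, 1024, needs_spill=False)
print(legacy, large)
```

With 512-byte dnodes that all spill, the model charges one dnode block read plus 32 spill reads; with 1024-byte dnodes it charges only the two dnode block reads.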

    ZFS-on-Linux systems that make heavy use of extended attributes would
    benefit from this feature. In particular, ZFS-on-Linux supports the
    xattr=sa dataset property which allows file extended attribute data
    to be stored in the dnode bonus buffer as an alternative to the
    traditional directory-based format. Workloads such as SELinux and the
    Lustre distributed filesystem often store enough xattr data to force
    spill blocks when xattr=sa is in effect. Large dnodes may therefore
    provide a performance benefit to such systems.

    Other use cases that may benefit from this feature include files with
    large ACLs and symbolic links with long target names. Furthermore,
    this feature may be desirable on other platforms in case future
    applications or features are developed that could make use of a
    larger bonus buffer area.

    Implementation
    --------------

    The size of a dnode may be a multiple of 512 bytes up to the size of
    a dnode block (currently 16384 bytes). A dn_extra_slots field was
    added to the current on-disk dnode_phys_t structure to describe the
    size of the physical dnode on disk. The 8 bits for this field were
    taken from the zero filled dn_pad2 field. The field represents how
    many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
    This convention results in a value of 0 for 512 byte dnodes which
    preserves on-disk format compatibility with older software.

    Similarly, the in-memory dnode_t structure has a new dn_num_slots field
    to represent the total number of dnode_phys_t slots consumed on disk.
    Thus dn->dn_num_slots is 1 greater than the corresponding
    dnp->dn_extra_slots. This difference in convention was adopted
    because, unlike on-disk structures, backward compatibility is not a
    concern for in-memory objects, so we used a more natural way to
    represent size for a dnode_t.
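
The relationship between the two slot conventions can be sketched as follows. This is an illustration in Python with hypothetical helper names, not the kernel code; only the field names dn_extra_slots and dn_num_slots come from the text above.

```python
SLOT_SIZE = 512  # bytes in one dnode_phys_t slot


def extra_slots_for_size(dnode_size):
    """On-disk convention: slots consumed beyond the first (0 for 512 bytes)."""
    return dnode_size // SLOT_SIZE - 1


def num_slots_from_extra(dn_extra_slots):
    """In-memory convention: total slots, always one more than the on-disk value."""
    return dn_extra_slots + 1


def size_from_extra(dn_extra_slots):
    """Physical dnode size in bytes implied by the on-disk field."""
    return num_slots_from_extra(dn_extra_slots) * SLOT_SIZE


print(extra_slots_for_size(512), num_slots_from_extra(0), size_from_extra(1))
```

A legacy 512-byte dnode therefore keeps the on-disk field at zero, which is what older software already expects to find in the former padding.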

    The default size for newly created dnodes is determined by the value of
    a new "dnodesize" dataset property. By default the property is set to
    "legacy" which is compatible with older software. Setting the property
    to "auto" will allow the filesystem to choose the most suitable dnode
    size. Currently this just sets the default dnode size to 1k, but future
    code improvements could dynamically choose a size based on observed
    workload patterns. Dnodes of varying sizes can coexist within the same
    dataset and even within the same dnode block. For example, to enable
    automatically-sized dnodes, run

     # zfs set dnodesize=auto tank/fish

    The user can also specify literal values for the dnodesize property.
    These are currently limited to powers of two from 1k to 16k. The
    power-of-2 limitation is only for simplicity of the user interface.
    Internally the implementation can handle any multiple of 512 up to 16k,
    and consumers of the DMU API can specify any legal dnode value.
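
As a rough illustration of the two sets of rules above (user-facing property values versus sizes the implementation accepts internally), here is a Python sketch. The function names are hypothetical, and mapping "legacy" to 512 bytes is an assumption made for the example.

```python
SLOT_SIZE = 512
MAX_DNODE_SIZE = 16 * 1024

# Literal values exposed to the user: powers of two from 1k to 16k.
LITERAL_SIZES = {"1k": 1024, "2k": 2048, "4k": 4096, "8k": 8192, "16k": 16384}


def dnodesize_property_to_bytes(value):
    """Map a dnodesize property string to a byte count (sketch only)."""
    if value == "legacy":
        return SLOT_SIZE  # assumption: legacy means the original 512-byte dnode
    if value == "auto":
        return 1024  # the text says auto currently selects 1k
    if value not in LITERAL_SIZES:
        raise ValueError(f"unsupported dnodesize: {value!r}")
    return LITERAL_SIZES[value]


def is_legal_internal_size(nbytes):
    """Internally, any multiple of 512 up to 16k is described as acceptable."""
    return SLOT_SIZE <= nbytes <= MAX_DNODE_SIZE and nbytes % SLOT_SIZE == 0


print(dnodesize_property_to_bytes("auto"), is_legal_internal_size(1536))
```

Note that 1536 bytes passes the internal check even though no property value selects it, which matches the statement that the power-of-two limit exists only in the user interface.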

    The size of a new dnode is determined at object allocation time and
    stored as a new field in the znode in-memory structure. New DMU
    interfaces are added to allow the consumer to specify the dnode size
    that a newly allocated object should use. Existing interfaces are
    unchanged to avoid having to update every call site and to preserve
    compatibility with external consumers such as Lustre. The new
    interface names are given below. The versions of these functions that
    don't take a dnodesize parameter now just call the _dnsize() versions
    with a dnodesize of 0, which means use the legacy dnode size.

    New DMU interfaces:
      dmu_object_alloc_dnsize()
      dmu_object_claim_dnsize()
      dmu_object_reclaim_dnsize()

    New ZAP interfaces:
      zap_create_dnsize()
      zap_create_norm_dnsize()
      zap_create_flags_dnsize()
      zap_create_claim_norm_dnsize()
      zap_create_link_dnsize()

    The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
    spa_maxdnodesize() function should be used to determine the maximum
    bonus length for a pool.

    These are a few noteworthy changes to key functions:

    * The prototype for dnode_hold_impl() now takes a "slots" parameter.
      When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
      ensure the hole at the specified object offset is large enough to
      hold the dnode being created. The slots parameter is also used
      to ensure a dnode does not span multiple dnode blocks. In both of
      these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
      these failure cases are only possible when using DNODE_MUST_BE_FREE.

      If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
      dnode_hold_impl() will check if the requested dnode is already
      consumed as an extra dnode slot by a large dnode, in which case
      it returns ENOENT.

    * The function dmu_object_alloc() advances to the next dnode block
      if dnode_hold_impl() returns an error for a requested object.
      This is because the beginning of the next dnode block is the only
      location it can safely assume to either be a hole or a valid
      starting point for a dnode.

    * dnode_next_offset_level() and other functions that iterate
      through dnode blocks may no longer use a simple array indexing
      scheme. These now use the current dnode's dn_num_slots field to
      advance to the next dnode in the block. This is to ensure we
      properly skip the current dnode's bonus area and don't interpret it
      as a valid dnode.
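
The slot-skipping traversal described in the last bullet can be sketched like this. It is a toy model in Python, not the dnode_next_offset_level() code: a block is a list of slots, a head slot holds the dnode's slot count, None marks a hole, and the string "bonus" stands in for interior slots of a large dnode.

```python
def walk_dnode_block(slots):
    """Yield (index, num_slots) for every dnode head in a block."""
    i = 0
    while i < len(slots):
        head = slots[i]
        if head is None:  # free slot (hole): move on by one
            i += 1
            continue
        yield i, head
        # Advance by the dnode's own slot count so its interior
        # (bonus) slots are never interpreted as dnode heads.
        i += head


# One 1-slot dnode, one 2-slot dnode, a hole, then a 4-slot dnode.
block = [1, 2, "bonus", None, 4, "bonus", "bonus", "bonus"]
print(list(walk_dnode_block(block)))
```

Plain array indexing would have visited indices 2, 5, 6 and 7 as if they were dnodes; advancing by the slot count avoids that.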

    zdb
    ---
    The zdb command was updated to display a dnode's size under the
    "dnsize" column when the object is dumped.

    For ZIL create log records, zdb will now display the slot count for
    the object.

    ztest
    -----
    Ztest chooses a random dnodesize for every newly created object. The
    random distribution is more heavily weighted toward small dnodes to
    better simulate real-world datasets.

    Unused bonus buffer space is filled with non-zero values computed from
    the object number, dataset id, offset, and generation number.  This
    helps ensure that the dnode traversal code properly skips the interior
    regions of large dnodes, and that these interior regions are not
    overwritten by data belonging to other dnodes. A new test visits each
    object in a dataset. It verifies that the actual dnode size matches what
    was stored in the ztest block tag when it was created. It also verifies
    that the unused bonus buffer space is filled with the expected data
    patterns.

    ZFS Test Suite
    --------------
    Added six new large dnode-specific tests, and integrated the dnodesize
    property into existing tests for zfs allow and send/recv.

    Send/Receive
    ------------
    ZFS send streams for datasets containing large dnodes cannot be received
    on pools that don't support the large_dnode feature. A send stream with
    large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
    unrecognized by an incompatible receiving pool so that the zfs receive
    will fail gracefully.

    While not implemented here, it may be possible to generate a
    backward-compatible send stream from a dataset containing large
    dnodes. The implementation may be tricky, however, because the send
    object record for a large dnode would need to be resized to a 512
    byte dnode, possibly kicking in a spill block in the process. This
    means we would need to construct a new SA layout and possibly
    register it in the SA layout object. The SA layout is normally just
    sent as an ordinary object record. But if we are constructing new
    layouts while generating the send stream we'd have to build the SA
    layout object dynamically and send it at the end of the stream.

    For sending and receiving between pools that do support large dnodes,
    the drr_object send record type is extended with a new field to store
    the dnode slot count. This field was repurposed from unused padding
    in the structure.

    ZIL Replay
    ----------
    The dnode slot count is stored in the uppermost 8 bits of the lr_foid
    field. The bits were unused as the object id is currently capped at
    48 bits.
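
A sketch of packing a slot count into the top 8 bits of a 64-bit lr_foid-style value, in Python. The helper names are hypothetical, and whether the real encoding stores the count itself or some offset of it is not stated above; this sketch stores the raw value.

```python
OBJ_BITS = 48    # object ids are capped at 48 bits
SLOT_SHIFT = 56  # uppermost 8 bits of a 64-bit field
OBJ_MASK = (1 << OBJ_BITS) - 1


def pack_foid(obj, slots):
    """Combine an object id and a slot value into one 64-bit integer."""
    if not 0 <= obj <= OBJ_MASK:
        raise ValueError("object id does not fit in 48 bits")
    if not 0 <= slots <= 0xFF:
        raise ValueError("slot value does not fit in 8 bits")
    return (slots << SLOT_SHIFT) | obj


def unpack_foid(foid):
    """Return (object id, slot value) from a packed 64-bit integer."""
    return foid & OBJ_MASK, (foid >> SLOT_SHIFT) & 0xFF


packed = pack_foid(12345, 2)
print(hex(packed), unpack_foid(packed))
```

Because the object id never reaches bit 48 or above, the two fields cannot collide.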

    Resizing Dnodes
    ---------------
    It should be possible to resize a dnode when it is dirtied if the
    current dnodesize dataset property differs from the dnode's size, but
    this functionality is not currently implemented. Clearly a dnode can
    only grow if there are sufficient contiguous unused slots in the
    dnode block, but it should always be possible to shrink a dnode.
    Growing dnodes may be useful to reduce fragmentation in a pool with
    many spill blocks in use. Shrinking dnodes may be useful to allow
    sending a dataset to a pool that doesn't support the large_dnode
    feature.

    Feature Reference Counting
    --------------------------
    The reference count for the large_dnode pool feature tracks the
    number of datasets that have ever contained a dnode of size larger
    than 512 bytes. The first time a large dnode is created in a dataset
    the dataset is converted to an extensible dataset. This is a one-way
    operation and the only way to decrement the feature count is to
    destroy the dataset, even if the dataset no longer contains any large
    dnodes. The complexity of reference counting on a per-dnode basis was
    too high, so we chose to track it on a per-dataset basis similarly to
    the large_block feature.

    Signed-off-by: Ned Bass <bass6@llnl.gov>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #3542
2018-08-12 00:45:53 +00:00
Kyle Evans
3f48dbd1cc Merge libbe(3)/bectl(8) from projects/bectl into head
bectl(8) is an administrative interface for working with ZFS boot
environments, intended to provide a superset of the functionality provided
by sysutils/beadm.

libbe(3) is the back-end library that the required functionality has been
pulled out into for later reuse.

These were originally written for GSoC 2017 under the mentorship of
allanjude@.

bectl(8) has proven pretty stable in my testing, with the known bug
documented in the man page.

Relnotes:	yes
2018-08-11 23:50:09 +00:00
Kyle Evans
35d2028fb8 libbe(3)/bectl(8): More SYSROOT/GCC build fixes
- Missing include path
- Fully specify libzfs's dependencies (except for deps pulled in by other
  deps) in Makefile.inc1
- Drop WARNS back down to 2 for libbe(3). I do this with much hesitation,
  but the libzfs headers are apparently a hot warning-filled mess as far as
  GCC 4.2 is concerned.
2018-08-11 22:45:39 +00:00
Alexander Leidinger
a079a34fd5 Extend the info about the limitations of datasets in jails.
Reviewed by:	allanjude
Sponsored by:	Essen Hackathon
2018-08-11 20:49:19 +00:00
Brad Davis
edb1df35b0 Fix the build by just installing systop since testing shows it works with:
dwatch -X systop

Reviewed by:	kp
Approved by:	allanjude (mentor)
2018-08-11 16:06:32 +00:00
Devin Teske
37b0d996dc dwatch(1): Add systop profile
Provides a top-like view of syscall consumers.

MFC after:	3 days
X-MFC-to:	stable/11
Sponsored by:	Smule, Inc.
2018-08-11 06:32:31 +00:00
Devin Teske
2282756519 dwatch(1): Fix syntax error in vop_readdir profile
Reported by:	Arne Ehrlich <ehrlich@consider-it.de>
MFC after:	3 days
X-MFC-to:	stable/11
Sponsored by:	Smule, Inc.
2018-08-11 06:13:11 +00:00
Kyle Evans
14b841d4a8 MFH @ r337607, in preparation for boarding
2018-08-11 04:26:29 +00:00
Mark Johnston
0b56e7a8e9 Disable the D subroutines msgsize() and msgdsize().
They are specific to illumos and the corresponding DIF subroutines are
already disabled on FreeBSD.

Reported by:	gnn
2018-08-10 19:23:20 +00:00
Matt Macy
648cfe57fd Performance optimization of AVL tree comparator functions
MFV:
commit ee36c709c3
Author: Gvozden Neskovic <neskovic@gmail.com>
Date:   Sat Aug 27 20:12:53 2016 +0200

    perf: 2.75x faster ddt_entry_compare()
        The first 256 bits of ddt_key_t are a block checksum, which is expected
    to be close to random data. Hence, on average, comparison only needs to
    look at first few bytes of the keys. To reduce number of conditional
    jump instructions, the result is computed as: sign(memcmp(k1, k2)).

    Sign of an integer 'a' can be obtained as: `(0 < a) - (a < 0)` := {-1, 0, 1} ,
    which is computed efficiently.  Synthetic performance evaluation of
    original and new algorithm over 1G random keys on 2.6GHz Intel(R) Xeon(R)
    CPU E5-2660 v3:

    old     6.85789 s
    new     2.49089 s
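
The sign expression above can be tried out in Python, where comparisons also evaluate to 0 or 1. This is only an illustration of the technique; the function names are hypothetical, and the bytes comparison stands in for memcmp() on equal-length keys.

```python
from functools import cmp_to_key


def sign(a):
    """Branch-free sign: -1, 0 or 1."""
    return (0 < a) - (a < 0)


def compare_keys(k1, k2):
    """Three-way comparator built from two comparisons, no if/else chain."""
    return (k1 > k2) - (k1 < k2)


keys = [b"\x02\x00", b"\x01\xff", b"\x01\x03"]
print(sign(-7), sign(0), sign(42))
print(sorted(keys, key=cmp_to_key(compare_keys)))
```

An AVL comparator has to return exactly -1, 0 or 1, which is why the raw memcmp() result is passed through the sign step rather than returned directly.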

    perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare()
        Compute the result directly instead of using conditionals

    perf: zfs_range_compare()
        Speedup between 1.1x - 2.5x, depending on compiler version and
    optimization level.

    perf: spa_error_entry_compare()
        `bcmp()` is not suitable for comparator use. Use `memcmp()` instead.

    perf: 2.8x faster metaslab_compare() and metaslab_rangesize_compare()
    perf: 2.8x faster zil_bp_compare()
    perf: 2.8x faster mze_compare()
    perf: faster dbuf_compare()
    perf: faster compares in spa_misc
    perf: 2.8x faster layout_hash_compare()
    perf: 2.8x faster space_reftree_compare()
    perf: libzfs: faster avl tree comparators
    perf: guid_compare()
    perf: dsl_deadlist_compare()
    perf: perm_set_compare()
    perf: 2x faster range_tree_seg_compare()
    perf: faster unique_compare()
    perf: faster vdev_cache _compare()
    perf: faster vdev_uberblock_compare()
    perf: faster fuid _compare()
    perf: faster zfs_znode_hold_compare()

    Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
    Signed-off-by: Richard Elling <richard.elling@gmail.com>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #5033
2018-08-10 06:42:08 +00:00
Alexander Motin
07ddc55096 MFV r337223:
9580 Add a hash-table on top of nvlist to speed-up operations

illumos/illumos-gate@2ec7644aab

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
2018-08-03 01:52:25 +00:00
Alexander Motin
0285589b38 MFV 337214:
9621 Make createtxg and guid properties public

illumos/illumos-gate@e8d4a73c86

Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     Josh Paetzel <josh@tcbug.org>
2018-08-03 00:24:27 +00:00
Alexander Motin
d49e9be14f MFV r337184: 9457 libzfs_import.c:add_config() has a memory leak
A memory leak occurs on lines 209 and 213 because the config is not freed
in the error case.  The interface to add_config() seems less than ideal -
it would be better if it copied any data necessary from the config and the
caller freed it.

illumos/illumos-gate@ddfe901b12

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     sara hartse <sara.hartse@delphix.com>
2018-08-02 21:25:32 +00:00
Alexander Motin
2bce9a5316 MFV r337182: 9330 stack overflow when creating a deeply nested dataset
Datasets that are deeply nested (~100 levels) are impractical. We just put
a limit of 50 levels to newly created datasets. Existing datasets should
work without a problem.
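
A minimal sketch of such a nesting check, in Python. How depth is counted and whether the limit is inclusive are assumptions made for illustration; the helper names are hypothetical and this is not the libzfs or kernel code.

```python
MAX_NESTING = 50  # limit for newly created datasets, as stated above


def dataset_depth(name):
    """Count nesting levels below the pool, ignoring any snapshot suffix."""
    base = name.split("@", 1)[0]
    return base.count("/")


def may_create(name):
    """Assumption: creation is refused once the depth reaches the limit."""
    return dataset_depth(name) < MAX_NESTING


print(dataset_depth("tank/a/b"), may_create("tank" + "/d" * 60))
```

Only creation is checked here, in line with the note that existing deeply nested datasets keep working.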

illumos/illumos-gate@5ac95da7d6

Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author:     Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
2018-08-02 21:19:35 +00:00
Alexander Motin
ac879e61ad 9523 Large alloc in zdb can cause trouble
16MB alloc in zdb_embedded_block() can cause cores in certain situations
(clang, gcc55).

OsX commit: ced236a5da
FreeBSD commit: https://svnweb.freebsd.org/base?view=revision&revision=326150
illumos/illumos-gate@03a4c2f4bf

Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author:     Jorgen Lundman <lundman@lundman.net>

This is an update for r326150 (by avg), where this change comes from.
2018-08-02 20:44:07 +00:00
Alexander Motin
c423a5e6b8 MFV r337161: 9512 zfs remap poolname@snapname coredumps
Only filesystems and volumes are valid "zfs remap" parameters: when passed
a snapshot name zfs_remap_indirects() does not handle the EINVAL returned
from libzfs_core, which results in failing an assertion and consequently
crashing.

illumos/illumos-gate@0b2e825398

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author:     loli10K <ezomori.nozomu@gmail.com>
2018-08-02 19:13:45 +00:00
Alexander Motin
7fca1b93c4 Do not blindly include illumos kernel headers instead of user-space.
It is not needed now, and I doubt it helped much at all, creating more
confusion than good.
2018-08-02 18:55:55 +00:00
Alexander Motin
b59e9cd1c0 MFV r316926:
7955 libshare needs to initialize only those datasets being modified by the consumer

illumos/illumos-gate@8a981c3356

https://www.illumos.org/issues/7955
  Libshare currently initializes all available filesystems when doing any
  libshare operation. This requires iterating through all the filesystems
  multiple times, which is a huge performance problem for sharing and
  unsharing operations.

Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Daniel Hoffman <dj.hoffman@delphix.com>

For FreeBSD this is practically a NOP, just a diff reduction.
2018-08-01 21:51:49 +00:00
Michael Tuexen
7bda966394 Add a dtrace provider for UDP-Lite.
The dtrace provider for UDP-Lite is modeled after the UDP provider.
This fixes the bug that UDP-Lite packets were triggering the UDP
provider.
Thanks to dteske@ for providing the dwatch module.

Reviewed by:		dteske@, markj@, rrs@
Relnotes:		yes
Differential Revision:	https://reviews.freebsd.org/D16377
2018-07-31 22:56:03 +00:00
Alexander Motin
200c27a75d MFV r337014:
9421 zdb should detect and print out the number of "leaked" objects
9422 zfs diff and zdb should explicitly mark objects that are on the deleted queue

illumos/illumos-gate@20b5dafb42

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author:     Paul Dagnelie <pcd@delphix.com>
2018-07-31 22:50:50 +00:00
Alexander Motin
0021e1c10c MFV r336991, r337001:
9102 zfs should be able to initialize storage devices

The first access to a disk block can incur a performance penalty on some
platforms (e.g. AWS's EBS, VMware VMDKs). Therefore it is recommended that
volumes be "thick provisioned", where supported by the platform (VMware).
Thick provisioning is time consuming and often is ignored. If the thick
provision step is omitted, customers will see suboptimal performance until
we have written to all parts of the LUN. ZFS should be able to initialize
any unused storage to remove any first-write penalty that exists.

illumos/illumos-gate@094e47e980

Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author:     George Wilson <george.wilson@delphix.com>
2018-07-31 21:06:04 +00:00
Alexander Motin
d1cf4052d0 MFV r336955: 9236 nuke spa_dbgmsg
We should use zfs_dbgmsg instead of spa_dbgmsg.  Or at least,
metaslab_condense() should call zfs_dbgmsg because it's important and rare
enough to always log. It's possible that the message in zio_dva_allocate()
would be too high-frequency for zfs_dbgmsg.

illumos/illumos-gate@21f7c81cc1

Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
2018-07-31 00:47:27 +00:00
Alexander Motin
194000fa21 MFV r336950: 9290 device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk failure
and silent damage (e.g. some data on disk is not right, but ZFS hasn't
detected the whole device as being broken). However, the current device
removal implementation bypasses some of the mirror's redundancy.

illumos/illumos-gate@3a4b1be953

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
2018-07-31 00:25:39 +00:00
Alexander Motin
6413a6d31f MFV r336946: 9238 ZFS Spacemap Encoding V2
The current space map encoding has the following disadvantages:
[1] Assuming a 512-byte sector size, each entry can represent at most 16MB for a segment.
This makes the encoding very inefficient for large regions of space.
[2] As vdev-wide space maps have started to be used by new features (i.e.
device removal, zpool checkpoint) we've started imposing limits in the
vdevs that can be used with them based on the maximum addressable offset
(currently 64PB for a top-level vdev).

The new encoding remains backwards compatible with the old one. The introduced
two-word entry format, besides extending the limits imposed by the single-entry
layout, also includes a vdev field and some extra padding after its prefix.

The extra padding after the prefix is reserved for future usage (e.g.
new prefixes for future encodings or new fields for flags). The new vdev field
not only makes the space maps more self-descriptive, but also opens the doors
for pool-wide space maps.

One final important note is that the number of bits used for vdevs is reduced
to 24 bits for blkptrs. That was decided as we don't know of any setups that
use more than 16M vdevs for the time being and we wanted to fit the vdev
field in the space map. In addition, that gives us
some extra bits in dva_t.
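
The limits quoted in this message imply certain field widths. The Python sketch below derives them from the stated numbers alone (512-byte sectors, 16MB per entry, 64PB maximum offset, 24-bit vdev ids); the variable names are hypothetical and nothing here is read from the space map source.

```python
SECTOR = 512

max_run_bytes = 16 * 2**20     # 16MB per single-word entry
max_offset_bytes = 64 * 2**50  # 64PB addressable offset
vdev_bits = 24                 # vdev id width chosen for blkptrs

# Number of bits needed to express each limit in units of sectors.
run_bits = (max_run_bytes // SECTOR).bit_length() - 1
offset_bits = (max_offset_bytes // SECTOR).bit_length() - 1
max_vdevs = 2**vdev_bits

print(run_bits, offset_bits, max_vdevs)
```

So a 16MB ceiling corresponds to a 15-bit run length in sectors, a 64PB ceiling to a 47-bit sector offset, and 24 bits of vdev id to the "16M vdevs" figure.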

illumos/illumos-gate@17f11284b4

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
2018-07-30 23:47:38 +00:00
Alexander Motin
1960706625 MFV r336944: 9286 want refreservation=auto
When a ZFS volume is created with zfs create -V (but without -s), the
refreservation property is set to a value that is volsize plus the maximum
size of metadata. If refreservation is ever set to another value, it is
impossible to set it back to the automatically determined value. There are
other cases where refreservation may be wrong. These include receiving a
volume that was sent without properties and zfs clone.

We need:

zfs set refreservation=auto <volume>
zfs clone -o refreservation=auto <volume>

Each one would use the same function used by zfs create -V to determine the
proper value for refreservation.

illumos/illumos-gate@1c10ae76c0

Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Mike Gerdts <mike.gerdts@joyent.com>
2018-07-30 22:39:30 +00:00
Kyle Evans
b29bf2f84e libbe(3)/be(8): Drop WARNS overrides, fix all fallout
Based on the idea that we shouldn't have all-new library and utility going
into base that need WARNS=1...

- Decent amount of constification
- Lots of parentheses
- Minor other nits
2018-07-25 15:14:35 +00:00
Kyle Evans
268af06d3e Normalize bectl(8)/libbe(3) Makefiles, remove Makefile copyright/license
Approved by:	hselaskey
2018-07-24 19:55:02 +00:00
Kyle Evans
70a11a8eea libbe(3): Add to cddl build, adjust src.libnames.mk as needed
2018-07-24 15:42:23 +00:00
Michael Tuexen
53e0911116 Improve TCP related tests for dtrace.
Ensure that the TCP connections are terminated gracefully as expected
by the test. Use appropriate numbers for sent/received packets.
In addition, enable tst.localtcpstate.ksh, which should pass, but
doesn't until https://reviews.freebsd.org/D16369 is committed.

Reviewed by:		markj@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D16288
2018-07-22 10:50:59 +00:00
Michael Tuexen
be029a4979 Test that the dtrace UDP receive probe fires.
This test ensures that the fix committed in
https://svnweb.freebsd.org/changeset/base/336551
actually works.

Reviewed by:		dteske@, markj@, rrs@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D16046
2018-07-20 15:37:29 +00:00
Michael Tuexen
10b803a40d Adjust comment to reality since r286171.
Sponsored by:		Netflix, Inc.
2018-07-15 20:42:47 +00:00
Michael Tuexen
e0f9b8233f Don't require a local sshd for the local TCP state dtrace test
This change is similar to the one done in r286171 for
tst.ipv4localtcp.ksh. This not only reduces the requirements on the
system used for testing but results also in a graceful teardown of
the TCP connection.

Reviewed by:		gnn@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D16276
2018-07-15 20:41:16 +00:00
Michael Tuexen
dc9f20b3f3 Fix the UDP tests for dtrace.
The code imported from opensolaris was depending on ping supporting
UDP for sending probes. Since this is not supported by ping on FreeBSD
use a perl script instead.
The remote test requires the usage of ksh93, so state that in the
shebang.
Enable the local test, but keep the remote test disabled, since it
requires a remote machine on the LAN.

Reviewed by:		markj@, gnn@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D16268
2018-07-15 20:34:22 +00:00
Michael Tuexen
60e4fb3a3f Return the intended return code.
This bug was spotted by markj@ in D16268 because I copied this code part
and used it there. So fix it.

Sponsored by:		Netflix, Inc.
2018-07-14 19:53:41 +00:00
Michael Tuexen
49d7124b18 Fix shebangs and execute bit of test scripts.
Since we don't have /usr/bin/ksh, use a generic way of specifying
ksh. Some of the tests only run with ksh93, so use this shell
for these tests. Two of the tests don't have the execute bit set,
so fix this, too.

Reviewed by:		markj@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D16270
2018-07-14 19:49:14 +00:00
Michael Tuexen
d073fd606c Add support for TCP state names used by Solaris.
For compatibility, add the TCP state names used by Solaris
and given in the Dtrace Guide available at
https://docs.oracle.com/cd/E37838_01/html/E61035/glhgu.html#OSDTGglhmv

Reviewed by:		markj@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D16269
2018-07-14 17:12:04 +00:00
Michael Tuexen
7575d3df5b The IP, TCP, and UDP provider report IP addresses as strings.
In some cases, the required information is not available and the
UDP provider reported an empty string in this case and the IP
and TCP provider reported a NULL pointer.

This patch changes the value provided in this case to the string
"<unknown>". This makes the behaviour consistent and in line with
the behaviour of Solaris.

Reviewed by:		markj@, dteske@, gnn@
Differential Revision:	https://reviews.freebsd.org/D15855
2018-06-18 18:35:29 +00:00
Eric van Gyzen
79d4ee83de Fix markup in zfs(8); no content change
Sponsored by:	Dell EMC
2018-06-15 15:28:31 +00:00
Mark Johnston
ecbde90073 Process CUs with a language attribute of DW_LANG_Mips_Assembler.
At the moment ctfconvert(1) does not do much with such CUs, but
that may not be true in the future, and we run ctfconvert on several
assembly files during the build.

X-MFC with:	r334883
2018-06-11 16:33:36 +00:00
Mark Johnston
c5fda9bac0 Don't process DWARF generated from non-C/C++ code.
ctfconvert(1) is not designed to handle DWARF generated from such code,
and will generally fail in non-obvious ways.  Use an explicit check to
help catch such potential failures.

Reported by:	Johannes Lundberg <johalun0@gmail.com>
MFC after:	2 weeks
2018-06-09 15:10:49 +00:00
Sean Eric Fagan
69724399c4 This originated from ZFS On Linux, as
d4a72f2386

During scans (scrubs or resilvers), it sorts the blocks in each transaction
group by block offset; the result can be a significant improvement. (On my
test system just now, where I have put some effort into fragmenting the pool
since I set it up yesterday, a scrub went from 1h2m to 33.5m with the
changes.) I've seen similar ratios on production systems.
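
To illustrate why issuing I/O in offset order helps, here is a toy Python model that sums the distance between consecutive block offsets. The offsets and the cost model are made up for the example and say nothing about the actual scan code or its tunables.

```python
def seek_distance(offsets):
    """Total distance travelled when blocks are read in the given order."""
    return sum(abs(b - a) for a, b in zip(offsets, offsets[1:]))


# Hypothetical block offsets in the order they appear in a transaction group.
blocks = [900, 100, 700, 300, 500]

unsorted_cost = seek_distance(blocks)
sorted_cost = seek_distance(sorted(blocks))
print(unsorted_cost, sorted_cost)
```

On rotating media the gap between the two totals is roughly what turns into saved seek time.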

Approved by:	Alexander Motin
Obtained from:	ZFS On Linux
Relnotes:	Yes (improved scrub performance, with tunables)
Differential Revision:	https://reviews.freebsd.org/D15562
2018-06-08 17:38:28 +00:00
Sean Bruno
8d7181d1e0 Unbreak dtrace runtime for udp after svn r334719 SO_REUSEPORT commit.
Submitted by:	Johannes Lundberg <johalun0@gmail.com>
Sponsored by:	Limelight Networks
2018-06-07 15:27:07 +00:00
Sean Bruno
1a43cff92a Load balance sockets with new SO_REUSEPORT_LB option.
This patch adds a new socket option, SO_REUSEPORT_LB, which allows multiple
programs or threads to bind to the same port; incoming connections are then
load balanced using a hash function.

Most of the code was copied from a similar patch for DragonflyBSD.

However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.

Required changes to structures:
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.

Limitations:
As in DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or
threads sharing the same socket).
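
The idea of a hash-selected load balance group with a fixed size limit can be sketched in Python as below. The class, its methods and the use of CRC32 as the hash are all assumptions for illustration; the kernel's actual hash function and data structures are not described in this message.

```python
import zlib

MAX_GROUP_SIZE = 256  # per-group pcb limit stated above


class LbGroup:
    """Toy model of the sockets sharing one port via SO_REUSEPORT_LB."""

    def __init__(self):
        self.members = []

    def add(self, sock):
        if len(self.members) >= MAX_GROUP_SIZE:
            raise OverflowError("load balance group is full")
        self.members.append(sock)

    def pick(self, src_addr, src_port):
        # Hash the connection's source so that the same peer is
        # consistently steered to the same member of the group.
        key = f"{src_addr}:{src_port}".encode()
        return self.members[zlib.crc32(key) % len(self.members)]


group = LbGroup()
for i in range(4):
    group.add(f"listener-{i}")
print(group.pick("192.0.2.1", 40000))
```

Selection depends only on the hash of the incoming connection, so no per-connection state is needed to spread load across the group.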

This is a substantially different contribution as compared to its original
incarnation at svn r332894 and reverted at svn r332967.  Thanks to rwatson@
for the substantive feedback that is included in this commit.

Submitted by:	Johannes Lundberg <johalun0@gmail.com>
Obtained from:	DragonflyBSD
Relnotes:	Yes
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D11003
2018-06-06 15:45:57 +00:00