Commit Graph

3378 Commits

Author SHA1 Message Date
timor
c66e54e9dc Add support for nvme disk detection
This treats /dev/nvme.. devices the same way as /dev/sd... devices.  The
motivation behind this is that whole disk detection did not work on nvme
SSDs without that, because it DKC_UNKNOWN was returned for such devices.

Perhaps there should be a separate DKC_ type for this, but I don't know
enough about the code to know the implications of that.

Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: timor <timor.dd@googlemail.com>
Closes #7304
2018-03-21 08:35:20 -07:00
Tom Caputi
089fbf313c Add comments for portable dnode / objset flags
This patch adds some comments describing the purpose of "portable"
dnode and objset flags so that it is clear when new flags should
be added to the repective flag masks. This patch includes no
functional changes.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7313
2018-03-20 11:55:21 -07:00
Alek P
272b5d730f Add JSON output support to channel programs
The changes piggyback JSON output support on top of channel programs 
(#6558).  This way the JSON output support is targeted to scripting 
use cases and is easily maintainable since it really only touches 
one function (zfs_do_channel_program()).

This patch ports Joyent's JSON nvlist library from illumos to enable 
easy JSON printing of channel program output nvlist.  To keep the 
delta small I also took advantage of the fact that printing in
zfs_do_channel_program() was almost always done before exiting 
the program.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #7281
2018-03-19 12:40:58 -07:00
Brian Behlendorf
a76f3d0437
Fix deadlock in IO pipeline
In vdev_queue_aggregate() the zio_execute() bypass should not be
called under the vdev queue lock.  This can result in a deadlock
as shown in the stack traces below.

Drop the vdev queue lock then walk the parents of the aggregate IO
to determine the list of component IOs to be bypassed.  This can
be done safely without holding the io_lock since the new aggregate
IO has not yet been returned and its parents cannot change.

---  THREAD 1 ---
arc_read()
  zio_nowait()
    zio_vdev_io_start()
      vdev_queue_io() <--- mutex_enter(vq->vq_lock)
        vdev_queue_io_to_issue()
          vdev_queue_aggregate()
            zio_execute()
              zio_vdev_io_assess()
                zio_wait_for_children() <- mutex_enter(zio->io_lock)

--- THREAD 2 --- (inverse order)
arc_read()
  zio_change_priority() <- mutex_enter(zio->zio_lock)
    vdev_queue_change_io_priority() <- mutex_enter(vq->vq_lock)

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7307
2018-03-16 16:46:06 -07:00
Tom Caputi
17dd88352e Allow QAT to handle non page-aligned buffers
This patch adds some handling to the QAT acceleration functions
that allows them to handle buffers that are not aligned with the
page cache. At the moment this never happens since callers only
happen to work with page-aligned buffers, but this code should
prevent headaches if this isn't always true in the future. This
patch also adds some cleanups to align the QAT compression code
with the encryption and checksumming code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Weigang Li <weigang.li@intel.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7305
2018-03-16 16:02:49 -07:00
Olaf Faaland
cec3a0a1bb Report pool suspended due to MMP
When the pool is suspended, record whether it was due to an I/O error or
due to MMP writes failing to succeed within the required time.

Change spa_suspended from uint8_t to zio_suspend_reason_t to store the
reason.

When userspace queries pool status via spa_tryimport(), report the
reason the pool was suspended in a new key,
ZPOOL_CONFIG_SUSPENDED_REASON.

In libzfs, when interpreting the returned config nvlist, report
suspension due to MMP with a new pool status enum value,
ZPOOL_STATUS_IO_FAILURE_MMP.

In status_callback(), which generates and emits the message when 'zpool
status' is executed, add a case to print an appropriate message for the
new pool status enum value.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7296
2018-03-15 10:56:55 -07:00
Tom Caputi
3874220932 SHA256 QAT acceleration
This patch enables acceleration of SHA256 checksums using Intel
Quick Assist Technology. This patch also fixes up and refactors
some of the code from QAT encryption to make the behavior
consistent.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chengfeix Zhu <chengfeix.zhu@intel.com>
Signed-off-by: Weigang Li <weigang.li@intel.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7295
2018-03-15 10:53:58 -07:00
Stephen Blinick
8a2a9db8df OpenZFS 9076 - Adjust perf test concurrency settings
ZFS Performance test concurrency should be lowered for better latency

Work by Stephen Blinick.

Nightly performance runs typically consist of two levels of concurrency;
and both are fairly high.

Since the IO runs are to a ZFS filesystem, within a zpool, which is
based on some variable number of vdev's, the amount of IO driven to each
device is variable. Additionally, different device types (HDD vs SSD,
etc) can generally handle a different amount of concurrent IO before
saturating.

Nevertheless, in practice, it appears that most tests are well past the
concurrency saturation point and therefore both perform with the same
throughput, the maximum of the device. Because the queuedepth to the
device(s) is so high however, the latency is much higher than the best
possible at that throughput, and increases linearly with the increase in
concurrency.

This means that changes in code that impact latency during normal
operation (before saturation) may not be apparent when a large component
of the measured latency is from the IO sitting in a queue to be
serviced. Therefore, changing the concurrency settings is recommended

Authored by: Stephen Blinick <stephen.blinick@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: John Wren Kennedy <jwk404@gmail.com>
Ported-by: John Wren Kennedy <jwk404@gmail.com>

OpenZFS-issue: https://www.illumos.org/issues/9076
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/562
Upstream bug: DLPX-45477
Closes #7302
2018-03-15 10:51:00 -07:00
Paul Zuchowski
1a2342784a receive_spill does not byte swap spill contents
In zfs receive, the function receive_spill should account
for spill block endian conversion as a defensive measure.
 
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #7300
2018-03-15 10:29:51 -07:00
Brian Behlendorf
de4f8d5d26
OpenZFS 9188 - increase size of dbuf cache to reduce indirect block decompression
With compressed ARC (bug 6950) we use up to 25% of our CPU to decompress
indirect blocks, under a workload of random cached reads. To reduce this
decompression cost, we would like to increase the size of the dbuf cache so
that more indirect blocks can be stored uncompressed.

If we are caching entire large files of recordsize=8K, the indirect blocks
use 1/64th as much memory as the data blocks (assuming they have the same
compression ratio). We suggest making the dbuf cache be 1/32nd of all memory,
so that in this scenario we should be able to keep all the indirect blocks
decompressed in the dbuf cache. (We want it to be more than the 1/64th that
the indirect blocks would use because we need to cache other stuff in the dbuf
cache as well.)

In real world workloads, this won't help as dramatically as the example above,
but we think it's still worth it because the risk of decreasing performance is
low. The potential negative performance impact is that we will be slightly
reducing the size of the ARC (by ~3%).

Porting Notes:
* Added modules options to zfs-module-parameters.5 man page.
* Preserved scaling based on target ARC size rather than max ARC size.

Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9188
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/564
Upstream bug: DLPX-46942
Closes #7273
2018-03-13 10:52:48 -07:00
Brian Behlendorf
a6cc97566c
Add kernel module auto-loading
Historically a dynamic misc minor number was registered for the
/dev/zfs device in order to prevent minor number collisions.  This
was fine but it prevented us from being able to use the kernel
module auto-loaded which requires a known reserved value.

Resolve this issue by adding a configure test to find an available
misc minor number which can then be used in MODULE_ALIAS_MISCDEV at
build time.  By adding this alias the zfs kmod is added to the list
of known static-nodes and the systemd-tmpfiles-setup-dev service
will create a /dev/zfs character device at boot time.

This in turn allows us to update the 90-zfs.rules file to make it
aware this is a static node.  The upshot of this is that whenever
a process (zpool, zfs, zed) opens the /dev/zfs the kmods will be
automatic loaded.  This even works for unprivileged users so there
is no longer a need to manually load the modules at boot time.

As an additional bonus the zed now no longer needs to start after
the zfs-import.service since it will trigger the module load.

In the unlikely event the minor number we selected conflicts with
another out of tree unregistered minor number the code falls back
to dynamically allocating it.  In this case the modules again
must be manually loaded.

Note that due to the change in the method of registering the minor
number the zimport.sh test case may incorrectly fail when the
static node for the installed packages is created instead of the
dynamic one.  This issue will only transiently impact zimport.sh
for this single commit when we transition and are mixing and
matching methods.

Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
TEST_ZIMPORT_SKIP="yes"
Closes #7287
2018-03-13 10:45:55 -07:00
Tim Chase
02638a30ef Add zfs_scan_ignore_errors tunable
When it's set, a DTL range will be cleared even if its scan/scrub had
errors.  This allows to work around resilver/scrub upon import when the
pool has errors.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #7293
2018-03-13 10:43:14 -07:00
Paul Zuchowski
83362e8e67 Destroy makes full snap list before destroying
Change zfs destroy logic so destroying begins before
the entire list of snapshots is built.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #7271
2018-03-12 15:24:08 -07:00
Chunwei Chen
9470cbd4f9 Fix race in trace point in zrl_add_impl
We hit an illegal memory access in the zrlock trace point. The problem
is that zrl->zr_owner and zrl->zr_caller are assigned locklessly. And if
zrl->zr_owner got assigned a longer string between when __string()
calculate the strlen, and when __assign_str() does strcpy. The copy will
overflow the buffer.

==
For example:

Initial condition:
zrl->zr_owner = A
zrl->zr_caller = "abc"

Thread A                                 Thread B
-------------------------------------------------
if (zrl->zr_owner == A) {
  DTRACE_PROBE2() {
    __string() {
      strlen(zrl->zr_caller) -> 3
      allocate buf[4]
    }

                                        zrl->zr_owner = B
				        zrl->zr_caller = "abcd"

    __assign_str() {
      strcpy(buf, zrl->zr_caller) <- buffer overflow
==

Dereferencing zrl->zr_owner->pid may also be problematic, in that the
zrl->zr_owner got changed to other task, and that task exits, freeing
the task_struct. This should be very unlikely, as the other task need to
zrl_remove and exit between the dereferencing zr->zr_owner and
zr->zr_owner->pid. Nevertheless, we'll deal with it as well.

To fix the zrl->zr_caller issue, instead of copy the string content, we
just copy the pointer, this is safe because it always points to
__func__, which is static. As for the zrl->zr_owner issue, we pass in
curthread instead of using zrl->zr_owner.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #7291
2018-03-12 11:27:02 -07:00
Brian Behlendorf
b7eec00f9f
Fix MMP write frequency for large pools
When a single pool contains more vdevs than the CONFIG_HZ for
for the kernel the mmp thread will not delay properly.  Switch
to using cv_timedwait_sig_hires() to handle higher resolution
delays.

This issue was reported on Arch Linux where HZ defaults to only
100 and this could be fairly easily reproduced with a reasonably
large pool.  Most distribution kernels set CONFIG_HZ=250 or
CONFIG_HZ=1000 and thus are unlikely to be impacted.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7205 
Closes #7289
2018-03-12 11:26:05 -07:00
Olaf Faaland
743253df70 Hold SCL_VDEV when counting leaves
A config lock should be held while vdev_count_leaves() walks the tree,
otherwise the pointers reference may become invalid during the walk.

SCL_VDEV is a minimal lock provided for such uses cases.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7286
2018-03-09 15:42:54 -08:00
Olaf Faaland
ebed90a598 Handle zio_resume and mmp => off
When multihost is disabled on a pool, and the pool is resumed via zpool
clear, within a single cycle of the mmp thread's loop (e.g.  while it's
in the cv_timedwait call), both mmp_last_write and mmp_delay should be
updated.

The original code mistakenly treated the two cases as if they could not
occur at the same time.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7286
2018-03-09 15:42:11 -08:00
Tomohiro Kusumi
5ee220ba5c Document allowed pool names
PR #7208 was a patch to allow non-reserved pool names which begin with
mirror, raidz, spare (but do not equal), however we'd rather document
it in the man page for compatibility with other OpenZFS implementations,
to avoid pool names that may not work on non-Linux platforms.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #7216
2018-03-09 14:04:15 -08:00
LOLi
c45c6d9212 Fix zfs-kmod builds when using rpm >= 4.14
With rpm-software-management/rpm@5e94633 a package version containing
invalid characters (most commonly a double '-') causes the kmod package
generation to terminate with an error.  This change takes advantage of
the newly introduced rpm macro "_wrong_version_format_terminate_build"
to allow kmod packages to be built.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:  loli10K <ezomori.nozomu@gmail.com>
Closes #7284
2018-03-09 13:52:37 -08:00
Tomohiro Kusumi
6b8655ad3f Change functions which return literals to return const char*
get_format_prompt_string() and zpool_state_to_name() return
a string literal which is read-only, thus they should return
`const char*`.

zpool_get_prop_string() returns a non-const string after
successful nv-lookup, and returns a string literal otherwise.
Since this function is designed to be used for read-only purpose,
the return type should also be `const char*`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #7285
2018-03-09 13:47:32 -08:00
Tom Caputi
cf63739191 QAT support for AES-GCM
This patch adds support for acceleration of AES-GCM encryption
with Intel Quick Assist Technology.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chengfeix Zhu <chengfeix.zhu@intel.com>
Signed-off-by: Weigang Li <weigang.li@intel.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7282
2018-03-09 13:37:15 -08:00
Paul Zuchowski
8e5d14844d zdb and inuse tests don't pass with real disks
Due to zpool create auto-partioning in Linux (i.e. sdb1),
certain utilities need to use the parition (sdb1) while
others use the whole disk name (sdb).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #6939 
Closes #7261
2018-03-07 17:03:33 -08:00
Wolfgang Bumiller
0e85048f53 Take user namespaces into account in policy checks
Change file related checks to use user namespaces and make
sure involved uids/gids are mappable in the current
namespace.

Note that checks without file ownership information will
still not take user namespaces into account, as some of
these should be handled via 'zfs allow' (otherwise root in a
user namespace could issue commands such as `zpool export`).

This also adds an initial user namespace regression test
for the setgid bit loss, with a user_ns_exec helper usable
in further tests.

Additionally, configure checks for the required user
namespace related features are added for:
  * ns_capable
  * kuid/kgid_has_mapping()
  * user_ns in cred_t

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Closes #6800 
Closes #7270
2018-03-07 15:40:42 -08:00
Brian Behlendorf
434a3375ce
ZTS: fix send-c_stream_size_estimate
The test could fail when attempting to write to a newly created
volume which was missing its device node.  Resolve the issue by
calling block_device_wait() which blocks until udev creates the
needed  entry.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7276 
Closes #7277
2018-03-07 09:55:54 -08:00
Giuseppe Di Natale
a07ad58847 Fix dbufstats_001_pos
Implement a new helper within_tolerance to test if a value
is within range of a target.

Because the dbufstats and dbufs kstat file are being read
at slightly different times, it is possible for stats to be
slightly off. Use within_tolerance to determine if the value
is "close enough" to the target.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #7239 
Closes #7266
2018-03-07 09:53:04 -08:00
Tony Hutter
639b18944a Allow to limit zed's syslog chattiness
Some usage patterns like send/recv of replication streams can
produce a large number of events. In such a case, the current
all-syslog.sh zedlet will hold up to its name, and flood the
logs with mostly redundant information. Two mitigate this
situation, this changeset introduces to new variables
ZED_SYSLOG_SUBCLASS_INCLUDE and ZED_SYSLOG_SUBCLASS_EXCLUDE
to zed.rc that give more control over which event classes end
up in the syslog.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Daniel Kobras <d.kobras@science-computing.de>
Closes #6886 
Closes #7260
2018-03-06 15:41:52 -08:00
Olaf Faaland
d2160d0538 Record skipped MMP writes in multihost_history
Once per pass through the MMP thread's loop, the vdev tree is walked to
find a suitable leaf to write the next MMP block to.  If no such leaf is
found, the thread sleeps for a while and resumes at the top of the loop.

Add an entry to multihost_history when no leaf can be found, and record
the reason in the error column.  The error code for such entries is a
bitfield, displayed in hex:

0x1  At least one vdev (interior or leaf) was not writeable.
0x2  At least one writeable leaf vdev was found, but it had a pending
MMP write.

timestamp = the time in seconds since the epoch when no leaf could be
found originally.

duration = the time (in ns) during which no MMP block was written for
this reason.  This does not include the preceeding inter-write period
nor the following inter-write period.

vdev_guid = the number of sequential cycles of the MMP thread looop when
this occurred.

Sample output, truncated to fit:

For records of skipped MMP writes the right-most column, vdev_path, is
reported as "-".

id   txg  timestamp   error  duration    mmp_delay  vdev_guid     ...
936  11   1520036441  0      146264      891422313  1740883117838 ...
937  11   1520036441  0      163956      888356657  7320395061548 ...
938  11   1520036442  0      130690      885314969  7320395061548 ...
939  11   1520036442  0      2001068577  882296582  1740883117838 ...
940  11   1520036443  0      161806      882296582  7320395061548 ...
941  11   1520036443  0x2    0           998020546  1             ...
942  11   1520036444  0      136585      998020546  7320395061548 ...
943  11   1520036444  0x2    0           998020257  1             ...
944  11   1520036445  5      2002662964  994160219  1740883117838 ...
945  11   1520036445  0x2    998073118   994160219  3             ...
946  11   1520036447  0      247136      994160219  7320395061548 ...

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7212
2018-03-06 15:15:15 -08:00
Olaf Faaland
14c240cede Detect long config lock acquisition in mmp
If something holds the config lock as a writer for too long, MMP will
fail to issue MMP writes in a timely manner.  This will result either in
the pool being suspended, or in an extreme case, in the pool not being
protected.

If the time to acquire the config lock exceeds 1/10 of the minimum
zfs_multihost_interval, report it in the zfs debug log.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7212
2018-03-06 15:14:39 -08:00
Giuseppe Di Natale
c7b55e71b0 Introduce a destroy_dataset helper
Datasets can be busy when calling zfs destroy. Introduce
a helper function to destroy datasets and use it to destroy
datasets in zfs_allow_004_pos, zfs_promote_008_pos, and
zfs_destroy_002_pos.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #7224 
Closes #7246 
Closes #7249 
Closes #7267
2018-03-06 14:54:57 -08:00
Nasf-Fan
2705ebf0a7 Misc fixes and cleanup for project quota
1) The Coverity Scan reports some issues for the project
   quota patch, including:

1.1) zfs_prop_get_userquota() directly uses the const quota
   type value as the condition check by wrong.

1.2) dmu_objset_userquota_get_ids() may cause dnode::dn_newgid
   to be overwritten by dnode::dn->dn_oldprojid.

2) This patch fixes related issues. It also enhances the logic
   for zfs_project_item_alloc() to avoid buffer overflow.

3) Skip project quota ability check if does not change project
   quota related things (id or flag). Otherwise, it will cause
   chattr (for other non project quota flags) operation failed
   if project quota disabled.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Fan Yong <fan.yong@intel.com>
Closes #7251 
Closes #7265
2018-03-05 12:56:27 -08:00
Giuseppe Di Natale
dd3e1e3083 Linux 4.16 compat: get_disk_and_module()
As of https://github.com/torvalds/linux/commit/fb6d47a, get_disk()
is now get_disk_and_module(). Add a configure check to determine
if we need to use get_disk_and_module().

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #7264
2018-03-05 12:44:35 -08:00
Tony Hutter
80d52c3919 Change checksum & IO delay ratelimit values
Change checksum & IO delay ratelimit thresholds from 5/sec to 20/sec.
This allows zed to actually trigger if a bunch of these events arrive in
a short period of time (zed has a threshold of 10 events in 10 sec).
Previously, if you had, say, 100 checksum errors in 1 sec, it would get
ratelimited to 5/sec which wouldn't trigger zed to fault the drive.

Also, convert the checksum and IO delay thresholds to module params for
easy testing.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7252
2018-03-04 17:34:51 -08:00
chrisrd
5666a994f2 Increment zil_itx_needcopy_bytes properly
In zil_lwb_commit() with TX_WRITE, we copy the log write record (lrw)
into the log write block (lwb) and send it off using zil_lwb_add_txg().
If we also have WR_NEED_COPY, we additionally copy the lwr's data into
the lwb to be sent off.  If the lwr + data doesn't fit into the lwb, we
send the lrw and as much data as will fit (dnow bytes), then go back
and do the same with the remaining data.

Each time through this loop we're sending dnow data bytes. I.e.
zil_itx_needcopy_bytes should be incremented by dnow.

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes #6988 
Closes #7176
2018-03-02 10:01:53 -08:00
chrisrd
d0f6fbaff3 ZTS: fix spurious failures in mv_files
The test could fail because of a race condition between the files being
generated in the background and attempting to move the files. Wait for
all file generation to complete before trying to move the files around.

Also, clean up the waiting: the 'wait' command without arguments waits
for all child pids.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes #7220 
Closes #7242 
Closes #7258
2018-03-02 09:57:29 -08:00
John Wren Kennedy
e086e717c3 Add ZFS perf test for dbuf cache
This change adds a test for sequential reads out of the dbuf cache.
It's essentially a copy of sequential_reads_cached, using a smaller
data set. The sequential read tests are renamed to differentiate them.

Authored by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Wren Kennedy <john.kennedy@delphix.com>
Closes #7225
2018-02-28 10:38:37 -08:00
John Eismeier
d699aaef09 Fix some typos
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: George Melikov <mail@gmelikov.ru>
Signed-off-by: John Eismeier <john.eismeier@gmail.com>
Closes #7237
2018-02-28 08:57:10 -08:00
Tomohiro Kusumi
d72cd017dd Fix zpool(8) list example to match actual format
a05dfd00 (Illumos 5147) has swapped FRAG and EXPANDSZ,
so it's natural to modify these examples.

 # zpool list | head -1
 NAME     SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
                              ^^^^^^^^^^^^^^^

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #7244
2018-02-28 08:54:53 -08:00
Scot W. Stevenson
19528cf949 Add Python 3 rewrite of arc_summary.py
Add new script arc_summary3.py as a complete rewrite of the
arc_summary.py tool (see issue #6873)

Add new options:

        -g/--graph    - Display crude graphic representation
                        of ARC status and quit
        -r/--raw      - Print all available information as
                        minimally formatted list (for grep)
        -s/--section  - Print a single section. This
                        replaces -p/--page, which is kept for
                        backwards use but marked as
                        depreciated

Add new sections with information on ZIL and SPL. Notify user
if sections L2ARC and VDEV are skipped instead of failing
silently. Add warning that -p/--page option is depreciated.

Developed for Python 3.5.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Scot W. Stevenson <scot.stevenson@gmail.com>
Closes #6873 
Closes #6892
2018-02-28 08:52:34 -08:00
Tony Hutter
3e9c9d8a89 Add SMART self-test results to zpool status -c
Add in SMART self-test results to zpool status|iostat -c.  This
works for both SAS and SATA drives.

Also, add plumbing to allow the 'smart' script to take smartctl
output from a directory of output text files instead of running
it against the vdevs.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7178
2018-02-27 09:31:27 -08:00
Tom Caputi
095495e008 Raw DRR_OBJECT records must write raw data
b1d21733 made it possible for empty metadnode blocks to be
compressed to a hole, fixing a bug that would cause invalid
metadnode MACs when a send stream attempted to free objects
and allowing the blocks to be reclaimed when they were no
longer needed. However, this patch also introduced a race
condition; if a txg sync occurred after a DRR_OBJECT_RANGE
record was received but before any objects were added, the
metadnode block would be compressed to a hole and lose all
of its encryption parameters. This would cause subsequent
DRR_OBJECT records to fail when they attempted to write
their data into an unencrypted block. This patch defers the
DRR_OBJECT_RANGE handling to receive_object() so that the
encryption parameters are set with each object that is
written into that block.

Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7215 
Closes #7236
2018-02-27 09:04:05 -08:00
Tim Chase
8b5814393f Incorrect maximum DVA value in DDE_GET_NDVAS()
The conditional was reversed which caused garbage values to be used when
calculating dds_ref_dsize.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #7234
2018-02-26 14:20:12 -08:00
LOLi
4af6873af6 Fix segfault in zfs_do_bookmark()
When invoked with wrong parameters 'zfs bookmark' fails to gracefully
validate user input and crashes.

This is a regression accidentally introduced in 587e228; this commit
adds additional tests to the ZFS Test Suite to exercise this codepath.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: KireinaHoro <i@jsteward.moe>
Signed-off-by:  loli10K <ezomori.nozomu@gmail.com>
Closes #7228 
Closes #7229
2018-02-26 09:55:18 -08:00
Brian Behlendorf
2a0428f16b
ZTS: Fix zfs_share_* test case failures
Prevent false positives when running the zfs_share_* test
cases due to leftover stale /var/lib/nfs/etab entries.  When
starting the test group re-synchronize the /var/lib/nfs/etab
file with /etc/exports.  At this point in the testing there
will be no additional `zfs share` entries to add.

Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7226
2018-02-24 10:07:12 -08:00
Kash Pande
41532e5a29 Shellcheck cleanup for initrd scripts
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Kash Pande <kash@tripleback.net>
Co-authored-by: Matthew Thode <mthode@mthode.org>
Signed-off-by: Kash Pande <kash@tripleback.net>
Signed-off-by: Matthew Thode <mthode@mthode.org>
Closes #7214
2018-02-23 12:57:41 -08:00
Kash Pande
7280d58197 Enable booting from nested encrypted datasets
- enable booting from nested encrypted datasets
- fix plymouth boot splash passphrase entry
- optimize unlock process

Co-authored-by: Kash Pande <kash@tripleback.net>
Co-authored-by: Matthew Thode <mthode@mthode.org>
Signed-off-by: Kash Pande <kash@tripleback.net>
Signed-off-by: Matthew Thode <mthode@mthode.org>
Closes #7214
2018-02-23 12:57:28 -08:00
Tony Hutter
bf95a000c4 Add scrub after resilver zed script
* Add a zed script to kick off a scrub after a resilver.  The script is
disabled by default.

* Add a optional $PATH (-P) option to zed to allow it to use a custom
$PATH for its zedlets.  This is needed when you're running zed under
the ZTS in a local workspace.

* Update test scripts to not copy in all-debug.sh and all-syslog.sh by
default.  They can be optionally copied in as part of zed_setup().
These scripts slow down zed considerably under heavy events loads and
can cause events to be dropped or their delivery delayed. This was
causing some sporadic failures in the 'fault' tests.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #4662 
Closes #7086
2018-02-23 11:38:05 -08:00
chrisrd
e9a7729008 Fix free memory calculation on v3.14+
Provide infrastructure to auto-configure to enum and API changes in the
global page stats used for our free memory calculations.

arc_free_memory has been broken since an API change in Linux v3.14:

2016-07-28 v4.8 599d0c95 mm, vmscan: move LRU lists to node
2016-07-28 v4.8 75ef7184 mm, vmstat: add infrastructure for per-node
  vmstats

These commits moved some of global_page_state() into
global_node_page_state(). The API change was particularly egregious as,
instead of breaking the old code, it silently did the wrong thing and we
continued using global_page_state() where we should have been using
global_node_page_state(), thus indexing into the wrong array via
NR_SLAB_RECLAIMABLE et al.

There have been further API changes along the way:

2017-07-06 v4.13 385386cf mm: vmstat: move slab statistics from zone to
  node counters
2017-09-06 v4.14 c41f012a mm: rename global_page_state to
  global_zone_page_state

...and various (incomplete, as it turns out) attempts to accomodate
these changes in ZoL:

2017-08-24 2209e409 Linux 4.8+ compatibility fix for vm stats
2017-09-16 787acae0 Linux 3.14 compat: IO acct, global_page_state, etc
2017-09-19 661907e6 Linux 4.14 compat: IO acct, global_page_state, etc

The config infrastructure provided here resolves these issues going back
to the original API change in v3.14 and is robust against further Linux
changes in this area.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes #7170
2018-02-23 08:50:06 -08:00
Olaf Faaland
7088545d01 Report duration and error in mmp_history entries
After an MMP write completes, update the relevant mmp_history entry
with the time between submission and completion, and the error
status of the write.

[faaland1@toss3a zfs]$ cat /proc/spl/kstat/zfs/pool/multihost
39 0 0x01 100 8800 69147946270893 72723903122926
id       txg     timestamp  error  duration   mmp_delay    vdev_guid
10607    1166    1518985089 0      138301     637785455    4882...
10608    1166    1518985089 0      136154     635407747    1151...
10609    1166    1518985089 0      803618560  633048078    9740...
10610    1166    1518985090 0      144826     633048078    4882...
10611    1166    1518985090 0      164527     666187671    1151...

Where duration = gethrtime_in_done_fn - gethrtime_at_submission, and
error = zio->io_error.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7190
2018-02-22 15:34:34 -08:00
Olaf Faaland
0d398b2564 Do not initiate MMP writes while pool is suspended
While the pool is suspended on host A, it may be imported on host B.
If host A continued to write MMP blocks, it would be blindly
overwriting MMP blocks written by host B, and the blocks written by
host A would have outdated txg information.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7182
2018-02-22 09:14:46 -08:00
Tony Hutter
a5369b61a2 Linux 4.16 compat: use correct *_dec_and_test()
Use refcount_dec_and_test() on 4.16+ kernels, atomic_dec_and_test()
on older kernels.  https://lwn.net/Articles/714974/

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes: #7179 
Closes: #7211
2018-02-22 09:02:06 -08:00