Commit Graph

1710 Commits

Author SHA1 Message Date
jpaetzel
ed5fb51b5b MFV 316891
7386 zfs get does not work properly with bookmarks

illumos/illumos-gate@edb901aab9
edb901aab9

https://www.illumos.org/issues/7386
  The zfs get command does not work with the bookmark parameter while it works
  properly with both filesystem and snapshot:
  # zfs get -t all -r creation rpool/test
  NAME               PROPERTY  VALUE                  SOURCE
  rpool/test         creation  Fri Sep 16 15:00 2016  -
  rpool/test@snap    creation  Fri Sep 16 15:00 2016  -
  rpool/test#bkmark  creation  Fri Sep 16 15:00 2016  -
  # zfs get -t all -r creation rpool/test@snap
  NAME             PROPERTY  VALUE                  SOURCE
  rpool/test@snap  creation  Fri Sep 16 15:00 2016  -
  # zfs get -t all -r creation rpool/test#bkmark
  cannot open 'rpool/test#bkmark': invalid dataset name
  #
  The zfs get command should be modified to work properly with bookmarks too.

Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Marcel Telka <marcel@telka.sk>
2017-04-21 19:53:52 +00:00
jpaetzel
4097541960 MFV 316871
7490 real checksum errors are silenced when zinject is on

illumos/illumos-gate@6cedfc397d
6cedfc397d

https://www.illumos.org/issues/7490
  When zinject is on, error codes from zfs_checksum_error() can be overwritten
  due to an incorrect and overly-complex if condition.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
2017-04-21 00:24:59 +00:00
jpaetzel
b9024ac3c1 MFV 316870
7448 ZFS doesn't notice when disk vdevs have no write cache

illumos/illumos-gate@295438ba32
295438ba32

https://www.illumos.org/issues/7448
       I built a SmartOS image with all the NVMe commits including 7372
       (support NVMe volatile write cache) and repeated my dd testing:
       > #!/bin/bash
       > for i in `seq 1 1000`; do
       > dd if=/dev/zero of=file00 bs=1M count=102400 oflag=sync &
       > dd if=/dev/zero of=file01 bs=1M count=102400 oflag=sync &
       > wait
       > rm file00 file01
       > done
       >
       Previously each dd command took ~145 seconds to finish, now it takes
       ~400 seconds.
       Eventually I figured out it is 7372 that causes unnecessary
       nvme_bd_sync() executions which wasted CPU cycles.
  If a NVMe device doesn't support a write cache, the nvme_bd_sync function will
  return ENOTSUP to indicate this to upper layers.
  It seems this returned value is ignored by ZFS, and as such this bug is not
  really specific to NVMe. In vdev_disk_io_start() ZFS sends the flush to the
  disk driver (blkdev) with a callback to vdev_disk_ioctl_done(). As nvme filled
  in the bd_sync_cache function pointer, blkdev will not return ENOTSUP, as the
  nvme driver in general does support cache flush. Instead it will issue an
  asynchronous flush to nvme and immediately return 0, and hence ZFS will not set
  vdev_nowritecache here. The nvme driver will at some point process the cache
  flush command, and if there is no write cache on the device it will return
  ENOTSUP, which will be delivered to the vdev_disk_ioctl_done() callback. This
  function will not check the error code and not set nowritecache.
  The right place to check the error code from the cache flush is in
  zio_vdev_io_assess(). This would catch both cases, synchronous and asynchronous
  cache flushes. This would also be independent of the implementation detail that
  some drivers can return ENOTSUP immediately.

Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
Obtained from:	Illumos
2017-04-21 00:17:54 +00:00
jpaetzel
d971a82b6c MFV 316868
7430 Backfill metadnode more intelligently

illumos/illumos-gate@af346df588
af346df588

https://www.illumos.org/issues/7430
  Description and patch from brought over from the following ZoL commit: https://
  github.com/zfsonlinux/zfs/commit/68cbd56e182ab949f58d004778d463aeb3f595c6
  Only attempt to backfill lower metadnode object numbers if at least
  4096 objects have been freed since the last rescan, and at most once
  per transaction group. This avoids a pathology in dmu_object_alloc()
  that caused O(N^2) behavior for create-heavy workloads and
  substantially improves object creation rates. As summarized by
  @mahrens in #4636:
  "Normally, the object allocator simply checks to see if the next
  object is available. The slow calls happened when dmu_object_alloc()
  checks to see if it can backfill lower object numbers. This happens
  every time we move on to a new L1 indirect block (i.e. every 32 *
  128 = 4096 objects). When re-checking lower object numbers, we use
  the on-disk fill count (blkptr_t:blk_fill) to quickly skip over
  indirect blocks that don?t have enough free dnodes (defined as an L2
  with at least 393,216 of 524,288 dnodes free). Therefore, we may
  find that a block of dnodes has a low (or zero) fill count, and yet
  we can?t allocate any of its dnodes, because they've been allocated
  in memory but not yet written to disk. In this case we have to hold
  each of the dnodes and then notice that it has been allocated in
  memory.
  The end result is that allocating N objects in the same TXG can
  require CPU usage proportional to N^2."
  Add a tunable dmu_rescan_dnode_threshold to define the number of
  objects that must be freed before a rescan is performed. Don't bother
  to export this as a module option because testing doesn't show a
  compelling reason to change it. The vast majority of the performance
  gain comes from limit the rescan to at most once per TXG.

Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Ned Bass <bass6@llnl.gov>

Obtained from:	Illumos
2017-04-21 00:12:47 +00:00
glebius
21ead51d79 - Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter
in place.  To do per-cpu stats, convert all fields that previously were
  maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
  before we have set up UMA and we can do counter_u64_alloc(), provide an
  early counter mechanism:
  o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
  o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
    so that at early stages of boot, before counters are allocated we already
    point to a counter that can be safely written to.
  o For sparc64 that required a whole dummy pcpu[MAXCPU] array.

Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
  to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.

This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html

Reviewed by:	kib, gallatin, marius, lidl
Differential Revision:	https://reviews.freebsd.org/D10156
2017-04-17 17:34:47 +00:00
glebius
5763443023 All these files need sys/vmmeter.h, but now they got it implicitly
included via sys/pcpu.h.
2017-04-17 17:07:00 +00:00
avg
6b8477e352 rename vfs.zfs.debug_flags to vfs.zfs.debugflags
While the former name is easier to read, the "_flags" suffix has a special
meaning for loader(8) and, thus, it was impossible to set the knob via
loader.conf(5).  The loader interpreted the setting as flags that should
be passed to a kernel module named "vfs.zfs.debug".

Discussed with:	smh
MFC after:	2 weeks
2017-04-14 15:35:07 +00:00
asomers
ee2be578bc Fix vdev_geom_attach_by_guids for partitioned disks
When opening a vdev whose path is unknown, vdev_geom must find a geom
provider with a label whose guids match the desired vdev. However, due to
partitioning, it is possible that two non-synonomous providers will share
some labels. For example, if the first partition starts at the beginning of
the drive, then ada0 and ada0p1 will share the first label. More troubling,
if the last partition runs to the end of the drive, then ada0p3 and ada0
will share the last label. If vdev_geom opens ada0 when it should've opened
ada0p3, then the pool won't be readable. If it opens ada0 when it should've
opened ada0p1, then it will corrupt some other partition when it writes the
3rd and 4th labels.

The easiest way to reproduce this problem is to install a mirrored root pool
with the default partition layout, then swap the positions of the two boot
drives and reboot.  Whether the bug manifests depends on the order in which
geom lists its providers, which is arbitrary.

Fix this situation by modifying the search algorithm to prefer geom
providers that have all four labels intact. If no such provider exists, then
open whichever provider has the most.

Reviewed by:	mav
MFC after:	3 weeks
Sponsored by:	Spectra Logic Corp
Differential Revision:	https://reviews.freebsd.org/D10365
2017-04-13 14:51:34 +00:00
pkelsey
33064e92a2 Corrected misspelled versions of rendezvous.
The MFC will include a compat definition of smp_no_rendevous_barrier()
that calls smp_no_rendezvous_barrier().

Reviewed by:	gnn, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D10313
2017-04-09 02:00:03 +00:00
smh
c256aa0082 Fix expandsz 16.0E vals and vdev_min_asize of RAIDZ children
When a member of a RAIDZ has been replaced with a device smaller than the
original, then the top level vdev can report its expand size as 16.0E.

The reduced child asize causes the RAIDZ to have a vdev_asize lower than its
vdev_max_asize which then results in an underflow during the calculation of
the parents expand size.

Fix this by updating the vdev_asize if it shrinks, which is already
protected by a check against vdev_min_asize so should always be safe.

Also for RAIDZ vdevs, ensure that the sum of their child vdev_min_asize is
always greater than the parents vdev_min_size.

Fixes: https://www.illumos.org/issues/7885

MFC after:	2 weeks
Sponsored by:	Multiplay
2017-04-03 13:11:28 +00:00
jpaetzel
e34b741452 MFV: 315989
7603 xuio_stat_wbuf_* should be declared (void)

illumos/illumos-gate@99aa8b5505
99aa8b5505

https://www.illumos.org/issues/7603

  The funcs are declared k&r style, where the args are not specified:

  void xuio_stat_wbuf_copied();
  They should be declared to take no arguments:

  void xuio_stat_wbuf_copied(void);
  Need to change both .c and .h.

Author: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
2017-03-27 17:27:46 +00:00
mav
ee44db682f MFV r315290, r315291: 7303 dynamic metaslab selection
illumos/illumos-gate@8363e80ae7
https://github.com/illumos/illumos-gate/commit/8363e80ae72609660f6090766ca8c2c18

https://www.illumos.org/issues/7303

  This change introduces a new weighting algorithm to improve metaslab selection.
  The new weighting algorithm relies on the SPACEMAP_HISTOGRAM feature. As a result,
  the metaslab weight now encodes the type of weighting algorithm used
  (size-based vs segment-based).

  This also introduce a new allocation tracing facility and two new dcmds to help
  debug allocation problems. Each zio now contains a zio_alloc_list_t structure
  that is populated as the zio goes through the allocations stage. Here's an
  example of how to use the tracing facility:

> c5ec000::print zio_t io_alloc_list | ::walk list | ::metaslab_trace
  MSID    DVA    ASIZE      WEIGHT             RESULT               VDEV
     -      0      400           0    NOT_ALLOCATABLE           ztest.0a
     -      0      400           0    NOT_ALLOCATABLE           ztest.0a
     -      0      400           0             ENOSPC           ztest.0a
     -      0      200           0    NOT_ALLOCATABLE           ztest.0a
     -      0      200           0    NOT_ALLOCATABLE           ztest.0a
     -      0      200           0             ENOSPC           ztest.0a
     1      0      400      1 x 8M            17b1a00           ztest.0a

> 1ff2400::print zio_t io_alloc_list | ::walk list | ::metaslab_trace
  MSID    DVA    ASIZE      WEIGHT             RESULT               VDEV
     -      0      200           0    NOT_ALLOCATABLE           mirror-2
     -      0      200           0    NOT_ALLOCATABLE           mirror-0
     1      0      200      1 x 4M            112ae00           mirror-1
     -      1      200           0    NOT_ALLOCATABLE           mirror-2
     -      1      200           0    NOT_ALLOCATABLE           mirror-0
     1      1      200      1 x 4M            112b000           mirror-1
     -      2      200           0    NOT_ALLOCATABLE           mirror-2

  If the metaslab is using segment-based weighting then the WEIGHT column will
  display the number of segments available in the bucket where the allocation
  attempt was made.

Author: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Chris Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
2017-03-24 09:37:00 +00:00
avg
799c768618 zfs_putpages: use TXG_WAIT
Explicit looping using TXG_NOWAIT is more verbose and may harm performance
under heavy load because of multiple waits.

MFC after:	1 week
2017-03-23 09:13:21 +00:00
avg
5f4f6dbc28 zfs: add zio_buf_alloc_nowait and use it in vdev_queue_aggregate
This way we can avoid blocking the whole queue in the low memory
situations.  It's better to sacrifice some I/O performance by not doing
the aggregation than to add an indefinite wait for more memory.

Reviewed by:	smh
MFC after:	2 weeks
Sponsored by:	Panzura
Differential Revision: https://reviews.freebsd.org/D9999
2017-03-23 08:59:17 +00:00
smh
6c4e57586b Reduce ARC fragmentation threshold
As ZFS can request up to SPA_MAXBLOCKSIZE memory block e.g. during zfs recv,
update the threshold at which we start agressive reclamation to use
SPA_MAXBLOCKSIZE (16M) instead of the lower zfs_max_recordsize which
defaults to 1M.

PR:		194513
Reviewed by:	avg, mav
MFC after:	1 month
Sponsored by:	Multiplay
Differential Revision:	https://reviews.freebsd.org/D10012
2017-03-17 12:34:57 +00:00
markj
ad3b99aa30 Fix a backwards comparison in the code to dump a DTrace debug buffer.
PR:		217739
MFC after:	1 week
2017-03-13 18:43:00 +00:00
avg
2372db40a9 zfs: provide a special vptocnp method for the .zfs vnode
vop_stdvptocnp() doesn't work properly if .zfs directory is hidden.

Reported by:	swills, des
Tested by:	des
MFC after:	1 week
MFC with:	r314048
2017-03-11 16:00:49 +00:00
avg
79968be53e MFV r314911: 7867 ARC space accounting leak
illumos/illumos-gate@6de76ce2a9
6de76ce2a9

https://www.illumos.org/issues/7867
  It seems that in the case where arc_hdr_free_pdata() sees HDR_L2_WRITING() we
  would fail to update the ARC space statistics.
  In the normal case those statistics are updated in arc_free_data_buf(). But in
  the arc_hdr_free_on_write() path we don't do that.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andriy Gapon <avg@FreeBSD.org>

MFC after:	10 days
2017-03-08 13:52:45 +00:00
avg
c74c0ede5a MFV r314910: 7843 get_clones_stat() is suboptimal for lots of clones
illumos/illumos-gate@c5bde7273e
c5bde7273e

https://www.illumos.org/issues/7843
  get_clones_stat() could be very slow if a snapshot has many (thousands) clones.
  Clone names are added to an nvlist that's created with NV_UNIQUE_NAME.
  So, each time a new name is appended to the list, the whole list is searched
  linearly to see if that name is not already in the list. That results in the
  quadratic complexity.
  That should be easy to fix as we know in advance that we should not get any
  duplicate names, so we can drop NV_UNIQUE_NAME when creating the list.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andriy Gapon <avg@FreeBSD.org>

MFC after:	1 week
Sponsored by:	ClusterHQ
2017-03-08 13:48:26 +00:00
mm
4e7dd14720 Fix null pointer dereference in zfs_freebsd_setacl().
Prevents unprivileged users from panicking the kernel by calling
__acl_delete_*() on files or directories inside a ZFS mount.

MFC after:	3 days
2017-03-02 23:23:28 +00:00
mav
aa7ab7b7d4 Execute last ZIO of log commit synchronously.
For short transactions overhead of context switch can be too large.
Skipping it gives significant latency reduction.  For large ones,
including multiple ZIOs, latency is less critical, while throughput
there may become limited by checksumming speed of single CPU core.
To get best of both cases, execute last ZIO directly from calling
thread context to save latency, while all others (if there are any)
enqueue to taskqueues in traditional way.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2017-03-02 07:55:47 +00:00
mav
3ef0c861a5 Completely skip cache flushing for not supporting log devices.
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2017-03-02 07:50:06 +00:00
ae
dcf8479255 Do not invoke the resize event when previous provider's size was zero.
This is similar to r303637 fix for geom_disk.

Reported by:	avg
Tested by:	avg
MFC after:	1 week
2017-03-01 18:03:32 +00:00
jpaetzel
78d6756cdb MFV 314276
7570 tunable to allow zvol SCSI unmap to return on commit of txn to ZIL

illumos/illumos-gate@1c9272b861
1c9272b861

https://www.illumos.org/issues/7570

  Based on the discovery that every unmap waits for the commit of the txn to the ZIL,
  introducing a very high latency to unmap commands, this behavior was made into a
  tunable zvol_unmap_sync_enabled and set to false. The net impact of this change is
  that by default SCSI unmap commands will result in space being freed within the zvol
  (today they are ignored and returned with good status). However, unlike the code
  today, instead of 18+ms per unmap, they take about 30us.

  With the testing done on NTFS against a Win2k12 target, the new behavior should work
  seamlessly. Files on the zvol that have already been set with the zfree application
  will continue to write 0's when deleted, and any new files created since zvol
  creation will send unmap commands when deleted. This behavior exists today, but with
  this change the unmap commands will be processed and result in reclaim of space.

Author: Stephen Blinick <stephen.blinick@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Approved by: Robert Mustacchi <rm@joyent.com>
2017-02-25 20:01:17 +00:00
avg
ad87994751 l2arc: try to fix write size calculation broken by Compressed ARC commit
While there, make a change to not evict a first buffer outside the
requested eviciton range.

To do:
- give more consistent names to the size variables
- upstream to OpenZFS

PR:		216178
Reported by:	lev
Tested by:	lev
MFC after:	2 weeks
2017-02-25 17:03:48 +00:00
avg
d895ce4e14 zfs: call spa_deadman on a taskqueue thread
callout(9) prohibits callout functions from sleeping.
illumos mutexes are emulated using sx(9).
spa_deadman() calls vdev_deadman() and the latter acquires vq_lock.

As a result we can get a more confusing panic instead of a specific
panic or no panic:
sleepq_add: td 0xfffff80019669960 to sleep on wchan 0xfffff8001cff4d88 with sleeping prohibited

This change adds another level of indirection where the deadman
callout schedules spa_deadman() to be executed on taskqueue_thread.

While there, use callout_schedule(0 instead of callout_reset()
in spa_sync().

Discussed with:	mav
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D9762
2017-02-25 16:45:53 +00:00
jpaetzel
2140000a07 MFV 314243
6676 Race between unique_insert() and unique_remove() causes ZFS fsid change

illumos/illumos-gate@40510e8eba
40510e8eba

https://www.illumos.org/issues/6676

  The fsid of zfs filesystems might change after reboot or remount. The problem seems to
  be caused by a race between unique_insert() and unique_remove(). The unique_remove()
  is called from dsl_dataset_evict() which is now an asynchronous thread. In a case the
  dsl_dataset_evict() thread is very slow and calls unique_remove() too late we will end
  up with changed fsid on zfs mount.

  This problem is very likely caused by #5056.

  Steps to Reproduce
  Note: I'm able to reproduce this always on a single core (virtual) machine. On multicore
  machines it is not so easy to reproduce.

# uname -a
SunOS openindiana 5.11 illumos-633aa80 i86pc i386 i86pc Solaris
# zfs create rpool/TEST
# FS=$(echo ::fsinfo | mdb -k | grep TEST | awk '{print $1}')
# echo $FS::print vfs_t vfs_fsid | mdb -k
vfs_fsid = {
    vfs_fsid.val = [ 0x54d7028a, 0x70311508 ]
}
# zfs umount rpool/TEST
# zfs mount rpool/TEST
# FS=$(echo ::fsinfo | mdb -k | grep TEST | awk '{print $1}')
# echo $FS::print vfs_t vfs_fsid | mdb -k
vfs_fsid = {
    vfs_fsid.val = [ 0xd9454e49, 0x6b36d08 ]
}
#

  Impact
  The persistent fsid (filesystem id) is essential for proper NFS functionality.
  If the fsid of a filesystem changes on remount (or after reboot) the NFS
  clients might not be able to automatically recover from such event and the
  manual remount of the NFS filesystems on every NFS client might be needed.

Author: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Dan Vatca <dan.vatca@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
2017-02-25 14:45:54 +00:00
avg
b50f2868e5 zfs: clean up unused files and definitions
MFC after:	1 month
X-MFC after:	r314048
2017-02-24 07:53:56 +00:00
tsoome
b2b5986a97 loader: update symlink support in zfs reader
As the current zfs file system is providing symlink via system attributes, need
to update the code accordingly.

Note, as the zfsboot code does not free the memory at this time, the
object list will put some stress on the boot2 heap, eventually we should
address the issue.

Reviewed by:	allanjude, smh
Approved by:	allanjude (mentor)
Differential Revision:	https://reviews.freebsd.org/D9706
2017-02-22 22:00:50 +00:00
avg
70167f709d zfs: move zio_taskq_basedc under SYSDC
That knob is useless without SDC (or alike) scheduling class support.
That is, it's unused on FreeBSD.

MFC after:	4 days
2017-02-21 21:11:58 +00:00
avg
47967f0879 zfs: lower priority of zio_write_issue threads by four
The difference of one was insignificant because zio_write_issue threads
ended up on the same run queues as other zio threads.
See sys/priority.h and sys/runq.h for more details.

Add a comment describing FreeBSD priority considerations and restore
the illumos variant of the code for comparison.

Obtained from:	Panzura
MFC after:	2 weeks
Sponsored by:	Panzura
2017-02-21 21:09:21 +00:00
avg
ef19eda2c7 reimplement zfsctl (.zfs) support
The current code is written on top of GFS, a library with the generic
support for writing filesystems, which was ported from illumos.
Because of significant differences between illumos VFS and FreeBSD
VFS models, both the GFS and zfsctl code were heavily modified to
work on FreeBSD.  Nonetheless, they still contain quite a few ugly
hacks and bugs.

This is a reimplementation of the zfsctl code where the VFS-specific
bits are written from scratch and only the code that interacts with
the rest of ZFS is reused.

Some highlights.

We use two types of nodes, static and on-demand. The static nodes
are used for permanent directories like .zfs, .zfs/snapshot, etc. The
on-demand nodes are used for ephemeral directories that act as snapshot
mount points.
Initially only static nodes are created. Their vnodes are instantiated
when they are looked up. The on-demand nodes and vnodes are instantiated
as needed and the nodes are destroyed as soon as the corresponding
vnodes are reclaimed.
We also try very hard to ensure that uncovered snapshot vnodes do not
linger.  They are supposed to become inactive as soon as they are
uncovered and we try to recycle them immediately.
When a filesystem is unmounted all snapshots under .zfs are unmounted
first, then all vnodes are flushed and finally the static .zfs nodes
are destroyed.

There are some changes outside of zfsctl code too.
z_ctldir is never used directly (as it is an opaque pointer),
zfsctl_root() has to be used instead.  The function returns a locked
vnode now, so it accepts a lock flags parameter.  The function can
also fail now, e.g. during force unmounting, whereas previously it
was infallible.
zfsctl_root_lookup() is retired, instead of it VOP_LOOKUP() on the .zfs
vnode (obtained with zfsctl_root) is used.

Some ideas are picked from an independent work by will.

Reviewed by:	asomers, smh
MFC after:	1 month
Relnotes:	maybe
Differential Revision: https://reviews.freebsd.org/D7421
2017-02-21 17:47:08 +00:00
jpaetzel
641c2d24c5 MVF: 313876
7504 kmem_reap hangs spa_sync and administrative tasks

illumos/illumos-gate@405a5a0f5c
https://github.com/illumos/illumos-gate/commit/405a5a0f5c3ab36cb76559467d1a62ba648bd80

https://www.illumos.org/issues/7504

  We see long spa_sync(). We are waiting to hold dp_config_rwlock for writer. Some
  other thread holds dp_config_rwlock for reader, then calls arc_get_data_buf(),
  which finds that arc_is_overflowing()==B_TRUE. So it waits (while holding
  dp_config_rwlock for reader) for arc_reclaim_thread to signal arc_reclaim_waiters_cv.
  Before signaling, arc_reclaim_thread does arc_kmem_reap_now(), which takes ~seconds.

Author: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
2017-02-17 17:52:12 +00:00
markj
b12a88ae0f Directly include needed headers rather than relying on pollution.
We get machine/cpu.h via kmem.h -> proc.h -> _vm_domain.h -> seq.h.

Reported by:	Ryan Libby
Sponsored by:	Dell EMC Isilon
X-MFC with:	r313841
2017-02-17 03:27:20 +00:00
markj
2a3b2874a6 Prevent CPU migration when checking the DTrace nofault flag on x86.
dtrace_trap() consumes page and protection faults triggered by code running
in DTrace probe context. Such faults occur with interrupts disabled and are
detected using a per-CPU flag. Regular faults cause dtrace_trap() to be
called with interrupts enabled, and nothing was ensuring that the flag was
read from the correct CPU. This may result in dtrace_trap() consuming
unrelated page and protection faults when DTrace is enabled, causing the
fault handler to return without actually having handled the fault.

Diagnosed by:	Ryan Libby <rlibby@gmail.com>
MFC after:	3 days
Sponsored by:	Dell EMC Isilon
2017-02-16 23:05:20 +00:00
jpaetzel
32882dfcac MFV 313786
7500 Simplify dbuf_free_range by removing dn_unlisted_l0_blkid

illumos/illumos-gate@653af1b809
653af1b809

https://www.illumos.org/issues/7500
  With the integration of:

    commit 0f6d88aded0d165f5954688a9b13bac76c38da84
    Author: Alex Reece <alex@delphix.com>
    Date:   Sat Jul 26 13:40:04 2014 -0800
    4873 zvol unmap calls can take a very long time for larger datasets

  the dnode's dn_bufs field was changed from a list to a tree. As a result,
  the dn_unlisted_l0_blkid field is no longer necessary.

Author: Stephen Blinick <stephen.blinick@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
2017-02-16 19:00:09 +00:00
markj
9836e32a79 Use pget() instead of pfind() in fasttrap_pid_{enable,disable}().
Suggested by:	mjg
MFC after:	1 week
2017-02-15 06:07:01 +00:00
markj
dfae3be6c7 Check for an exiting process when enabling PID provider probes.
MFC after:	1 week
2017-02-15 01:35:26 +00:00
avg
2e9e9620d9 remove l2_padding_needed statistic from zfs arc
It became obsolete when the Compressed ARC support was committed.

MFC after:	1 week
2017-02-12 19:45:30 +00:00
avg
764d943c05 check remaining space in zfs implementations of vptocnp
PR:		216939
Submitted by:	Iouri V. Ivliev <fbsd@any.com.ru>
MFC after:	1 week
2017-02-12 19:40:59 +00:00
asomers
ab48acbf00 Fix setting birthtime in ZFS
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
	* In zfs_freebsd_setattr, if the caller wants to set the birthtime,
	  set the bits that zfs_settattr expects

	* In zfs_setattr, if XAT_CREATETIME is set, set xoa_createtime,
	  expected by zfs_xvattr_set.  The two levels of indirection seem
	  excessive, but it minimizes diffs vs OpenZFS.

	* In zfs_setattr, check for overflow of va_birthtime (from delphij)

	* Remove red herring in zfs_getattr

sys/cddl/contrib/opensolaris/uts/common/sys/vnode.h
	* Un-booby-trap some macros

New tests are under review at https://github.com/pjd/pjdfstest/pull/6

Reviewed by:	avg
MFC after:	3 weeks
Sponsored by:	Spectra Logic Corp
Differential Revision:	https://reviews.freebsd.org/D9353
2017-02-09 21:30:53 +00:00
gnn
761aa7c014 Fix the ifdef protection and remove superfluous extern statements
Reported by:	Konstantin Belousov
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
2017-02-07 01:21:18 +00:00
markj
a327df1219 Ensure that the DOF string length is divisible by 2.
It is an ASCII encoding of a hexadecimal representation of the DOF file
used to enable anonymous tracing, so its length should always be even.

MFC after:	1 week
2017-02-05 02:47:34 +00:00
markj
d879975195 Use PC-relative relocations for USDT probe sites on i386 and amd64.
When recording probe site addresses in the output DOF file, dtrace -G
needs to emit relocations for the .SUNW_dof section in order to obtain
the addresses of functions containing probe sites. DTrace expects the
addresses to be relative to the base address of the final ELF file,
and the amd64 USDT implementation was relying on some unspecified and
incorrect behaviour in the base system GNU ld to achieve this.

This change reimplements the probe site relocation handling to allow
USDT to be used with lld and newer GNU binutils. Specifically, it
makes use of R_X86_64_PC64/R_386_PC32 relocations to obtain the
probe site address relative to the DOF file address, and adds and uses a
new DOF relocation type which computes the final probe site address using
these relative offsets.

Reported by and discussed with:	Rafael Espíndola
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D9374
2017-02-05 02:39:12 +00:00
gnn
da0598537e Files which implement the new random number system code for DTrace
Submitted by:	Graeme Jenkinson
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
2017-02-03 22:40:13 +00:00
gnn
45b8e2daa4 Replace the implementation of DTrace's RAND subroutine for generating
low-quality random numbers with a modern implementation (xoroshiro128+)
that is capable of generating better quality randomness without compromising performance.

Submitted by:	Graeme Jenkinson
Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9051
2017-02-03 22:26:19 +00:00
markj
3a9f9a6da6 Sync the x86 dis_tables.c with upstream.
This corresponds to the following illumos issues:

  5755 want support for Intel FMA instrs
  5756 want support for Intel BMI1 instrs
  5757 want support for Intel BMI2 instrs
  5758 want support for Intel AVX2 instrs
  7204 Want broadwell rdseed and adx support
  7208 Want stac/clac disasm support
  7733 Need SHA Instruction dis support
  7756 dis can't handle x86 SSE 3 instructions
  7757 want avx2 disasm tests
  7758 want SSE 4.1 disasm tests

MFC after:	2 weeks
2017-02-03 03:22:47 +00:00
bapt
bd0b52fc1f Revert crap accidentally committed 2017-01-28 16:31:23 +00:00
bapt
02ac05d572 Revert r312923 a better approach will be taken later 2017-01-28 16:30:14 +00:00
markj
f41402441d Fix an off-by-one in an assertion on fasttrap tracepoint sizes.
FASTTRAP_MAX_INSTR_SIZE is the largest valid value of a tracepoint, so
correct the assertion accordingly. This limit was hit with a 15-byte NOP.

Reported by:	bdrewery
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
2017-01-27 17:58:41 +00:00