Commit Graph

1762 Commits

Author SHA1 Message Date
Andriy Gapon
cfe94339f2 a stop gap fix for a race between dnode_hold and dnode_sync_free
The race was introduced in r337669, the large dnode feature import from
ZoL.  The problem was debugged by ZoL developers and then,
independently, on FreeBSD.

The fix is an early proposal by Brian Behlendorf:
50f32ed74e
This fix never went into ZoL.  A larger change that was committed later
included a different solution because of the re-worked code.

Ideally, we want to revert this fix and re-synchronize FreeBSD large
dnode code with that in illumos (or newer ZoL).  illumos has a later
import of the feature from ZoL that does not have the bug.

PR:		236480
Obtained from:	Brian Behlendorf <behlendorf1@llnl.gov>
Submitted by:	ncrogers@gmail.com (patch adaptation)
Reported by:	ncrogers@gmail.com
Tested by:	ncrogers@gmail.com,
		Dennis Noordsij <dennis.noordsij@alumni.helsinki.fi>,
		Julien Cigar <julien@perdition.city>
MFC after:	10 days
2019-08-12 10:30:00 +00:00
Toomas Soome
b1b9326846 loader: support com.delphix:removing
We should support removing vdev from boot pool. Update loader zfs reader
to support com.delphix:removing.

Reviewed by:	allanjude
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D18901
2019-08-08 18:08:13 +00:00
Xin LI
0ed1d6fb00 Allow Kernel to link in both legacy libkern/zlib and new sys/contrib/zlib,
with an eventual goal to convert all legacl zlib callers to the new zlib
version:

 * Move generic zlib shims that are not specific to zlib 1.0.4 to
   sys/dev/zlib.
 * Connect new zlib (1.2.11) to the zlib kernel module, currently built
   with Z_SOLO.
 * Prefix the legacy zlib (1.0.4) with 'zlib104_' namespace.
 * Convert sys/opencrypto/cryptodeflate.c to use new zlib.
 * Remove bundled zlib 1.2.3 from ZFS and adapt it to new zlib and make
   it depend on the zlib module.
 * Fix Z_SOLO build of new zlib.

PR:		229763
Submitted by:	Yoshihiro Ota <ota j email ne jp>
Reviewed by:	markm (sys/dev/zlib/zlib_kmod.c)
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D19706
2019-08-01 06:35:33 +00:00
Mark Johnston
61f2f0bae6 Fix FASTTRAPIOC_GETINSTR.
This ioctl is used when a breakpoint is encountered while disassembling
a symbol in the target process.  Since only one DTrace consumer can
toggle or enumerate fasttrap probes from a given process at time, this
ioctl does not appear to be used in practice.
2019-07-17 16:38:29 +00:00
Mark Johnston
eeacb3b02f Merge the vm_page hold and wire mechanisms.
The hold_count and wire_count fields of struct vm_page are separate
reference counters with similar semantics.  The remaining essential
differences are that holds are not counted as a reference with respect
to LRU, and holds have an implicit free-on-last unhold semantic whereas
vm_page_unwire() callers must explicitly determine whether to free the
page once the last reference to the page is released.

This change removes the KPIs which directly manipulate hold_count.
Functions such as vm_fault_quick_hold_pages() now return wired pages
instead.  Since r328977 the overhead of maintaining LRU for wired pages
is lower, and in many cases vm_fault_quick_hold_pages() callers would
swap holds for wirings on the returned pages anyway, so with this change
we remove a number of page lock acquisitions.

No functional change is intended.  __FreeBSD_version is bumped.

Reviewed by:	alc, kib
Discussed with:	jeff
Discussed with:	jhb, np (cxgbe)
Tested by:	pho (previous version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19247
2019-07-08 19:46:20 +00:00
Alexander Motin
419110374a Avoid extra taskq_dispatch() calls by DMU.
DMU sync code calls taskq_dispatch() for each sublist of os_dirty_dnodes
and os_synced_dnodes.  Since the number of sublists by default is equal
to number of CPUs, it will dispatch equal, potentially large, number of
tasks, waking up many CPUs to handle them, even if only one or few of
sublists actually have any work to do.

This change adds check for empty sublists to avoid this.
2019-06-25 18:35:23 +00:00
Alexander Motin
3bae917061 Minimize aggsum_compare(&arc_size, arc_c) calls.
For busy ARC situation when arc_size close to arc_c is desired.  But
then it is quite likely that aggsum_compare(&arc_size, arc_c) will need
to flush per-CPU buckets to find exact comparison result.  Doing that
often in a hot path penalizes whole idea of aggsum usage there, since it
replaces few simple atomic additions with dozens of lock acquisitions.

Replacing aggsum_compare() with aggsum_upper_bound() in code increasing
arc_p when ARC is growing (arc_size < arc_c) according to PMC profiles
allows to save ~5% of CPU time in aggsum code during sequential write
to 12 ZVOLs with 16KB block size on large dual-socket system.

I suppose there some minor arc_p behavior change due to lower precision
of the new code, but I don't think it is a big deal, since it should
affect only very small window in time (aggsum buckets are flushed every
second) and in ARC size (buckets are limited to 10 average ARC blocks
per CPU).

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-06-14 20:04:28 +00:00
Alexander Motin
b3b3aa2e29 Alike to ZoL disable metaslab allocation tracing code.
It is too generous to collect in production debug traces that can only
be read with kernel debugger.  Illumos includes special code in their
mdb debugger to read it, we don't.

MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2019-06-14 19:57:32 +00:00
Alexander Motin
284e53a401 Properly align struct multilist_sublist to cache line.
Manual Illumos alignment does not fit us due to different kmutex_t size.

MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2019-06-14 17:09:39 +00:00
Alexander Motin
913095dc56 Move write aggregation memory copy out of vq_lock.
Memory copy is too heavy operation to do under the congested lock.
Moving it out reduces congestion by many times to almost invisible.
Since the original zio removed from the queue, and the child zio is
not executed yet, I don't see why would the copy need protection.
My guess it just remained like this from the time when lock was not
dropped here, which was added later to fix lock ordering issue.

Multi-threaded sequential write tests with both HDD and SSD pools
with ZVOL block sizes of 4KB, 16KB, 64KB and 128KB all show major
reduction of lock congestion, saving from 15% to 35% of CPU time
and increasing throughput from 10% to 40%.

Reviewed by:	ahrens, behlendorf, ryao
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-06-13 01:21:32 +00:00
Alexander Motin
35251e9c28 Fix comparison signedness in arc_is_overflowing().
When ARC size is very small, aggsum_lower_bound(&arc_size) may return
negative values, that due to unsigned comparison caused delays, waiting
for arc_adjust() to "fix" it by calling aggsum_value(&arc_size).  Use
of signed comparison there fixes the problem.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-06-07 20:59:24 +00:00
Alexander Motin
61586dd647 Explicitly start ARC adjustment on limits change.
While formally it is not necessary, but the sooner it start, the sooner it
finish, and supposedly less disturbing for workload it will be.

MFC after:	2 weeks
2019-06-07 19:03:17 +00:00
Andriy Gapon
d8b12a2162 Restore ARC MFU/MRU pressure
Before r305323 (MFV r302991: 6950 ARC should cache compressed data)
arc_read() code did this for access to a ghost buffer:
 arc_adapt() (from arc_get_data_buf())
 arc_access(hdr, hash_lock)
I.e., we first checked access to the MFU ghost/MRU ghost buffer and
adapt MFU/MRU sizes (in arc_adapt()) and next move buffer from the ghost
state to regular.

After r305323 the sequence is different:
 arc_access(hdr, hash_lock);
 arc_hdr_alloc_pabd(hdr);
I.e., we first move the buffer from the ghost state in arc_access() and
then we check access to buffer in ghost state (in arc_hdr_alloc_pabd()
-> arc_get_data_abd() -> arc_get_data_impl() -> arc_adapt()).  This is
incorrect: arc_adapt() never see access to the ghost buffer because
arc_access() already migrated the buffer from the ghost state to
regular.

So, the fix is to restore a call to arc_adapt() before arc_access() and
to suppress the call to arc_adapt() after arc_access().

Submitted by:	Slawa Olhovchenkov <slw@zxy.spb.ru>
MFC after:	2 weeks
Sponsored by:	Integros [integros.com]
Differential Revision: https://reviews.freebsd.org/D19094
2019-06-07 06:35:42 +00:00
Mark Johnston
c080655467 Fix a race between fasttrap and the user breakpoint handler.
When disabling the last enabled userspace probe, fasttrap clears the
function pointers which hook in to the breakpoint handler.  If a traced
thread hit a fasttrap breakpoint before it was removed, we must ensure
that it is able to call the hook; otherwise fasttrap will not consume
the trap and SIGTRAP will be delievered to the thread.  Synchronize
with such threads by ensuring that they load the hook pointer with
interrupts disabled, and by completing an SMP rendezvous after removing
breakpoints and before clearing the pointers.

Reported by:	Alexander Alexeev <Alexander.Alexeev@dell.com>
Tested by:	Alexander Alexeev (earlier version)
Reviewed by:	cem, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20526
2019-06-06 16:03:25 +00:00
Alexander Motin
0b5319dda0 MFV r348585: 9683 Allow bypassing devid in vdev_disk_open()
illumos/illumos-gate@6fe4f3002c

Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author:     Pavel Zakharov <pavel.zakharov@delphix.com>

This is irrelevant to FreeBSD, just to reduce divergence.
2019-06-03 20:55:52 +00:00
Alexander Motin
9b048dd219 MFV r348583: 9847 leaking dd_clones (DMU_OT_DSL_CLONES) objects
illumos/illumos-gate@17fb938fd6

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author:     Matthew Ahrens <mahrens@delphix.com>
2019-06-03 20:49:20 +00:00
Alexander Motin
07a5c938c9 MFV r348578: 9962 zil_commit should omit cache thrash
illumos/illumos-gate@cab3a55e15

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
Author:     Prakash Surya <prakash.surya@delphix.com>
2019-06-03 20:24:40 +00:00
Alexander Motin
a66a7143d4 MFV r348576: 9963 Seperate tunable for disabling ZIL vdev flush
illumos/illumos-gate@f8fdf68125

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author:     Prakash Surya <prakash.surya@delphix.com>
2019-06-03 20:05:43 +00:00
Alexander Motin
c9719c9a6d MFV r348573: 9993 zil writes can get delayed in zio pipeline
illumos/illumos-gate@2258ad0b75

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author:     George Wilson <george.wilson@delphix.com>
2019-06-03 19:25:53 +00:00
Alexander Motin
4d6afba5e0 MFV r348555: 9690 metaslab of vdev with no space maps was flushed during removal
illumos/illumos-gate@4e75ba6826

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     Serapheim Dimitropoulos <serapheim@delphix.com>
2019-06-03 19:03:24 +00:00
Alexander Motin
1b61262505 MFC r348554: 9688 aggsum_fini leaks memory
illumos/illumos-gate@29bf2d68be

Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     Paul Dagnelie <pcd@delphix.com>
2019-06-03 19:00:24 +00:00
Alexander Motin
677ef2563d MFV r348553: 9681 ztest failure in spa_history_log_internal due to spa_rename()
illumos/illumos-gate@6aee0ad769

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     Matthew Ahrens <mahrens@delphix.com>
2019-06-03 18:32:56 +00:00
Alexander Motin
74f7070445 MFV r348552: 9682 page fault in dsl_async_clone_destroy() while opening pool
illumos/illumos-gate@ade2c82828

Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     Serapheim Dimitropoulos <serapheim@delphix.com>
2019-06-03 17:56:44 +00:00
Alexander Motin
2eff60e998 MFV r348551: 9862 fix typo in comment in vdev_impl.h
illumos/illumos-gate@84927f52bd

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     Allan Jude <allanjude@freebsd.org>
2019-06-03 17:44:47 +00:00
Alexander Motin
bd2ae688a4 MFV r348550: 1700 Add SCSI UNMAP support
illumos/illumos-gate@047c81d31d

Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Author:     Saso Kiselkov <saso.kiselkov@nexenta.com>

This is irrelevant to FreeBSD, just a diff reduction.
2019-06-03 17:43:32 +00:00
Alexander Motin
d40f6a585a MFV r348548: 9617 too-frequent TXG sync causes excessive write inflation
illumos/illumos-gate@7928f4baf4

Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     Matthew Ahrens <mahrens@delphix.com>
2019-06-03 17:40:11 +00:00
Alexander Motin
b3ed2d08e4 MFV r348537: 8601 memory leak in get_special_prop()
illumos/illumos-gate@e19b450bec

Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author:     John Gallagher <john.gallagher@delphix.com>
2019-06-03 17:29:57 +00:00
Alexander Motin
c066dcc074 MFV r348535: 9677 panic from zio_write_gang_block() when creating dump device on fragmented rpool
illumos/illumos-gate@7341a7de4f

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     Brad Lewis <brad.lewis@delphix.com>
2019-06-03 17:27:25 +00:00
Allan Jude
ad579d984f Fix assertion in ZFS TRIM code
Due to an attempt to check two conditions at once in a macro not designed
as such, the assertion would always evaluate to true.

#define VERIFY3_IMPL(LEFT, OP, RIGHT, TYPE) do { \
        const TYPE __left = (TYPE)(LEFT); \
        const TYPE __right = (TYPE)(RIGHT); \
        if (!(__left OP __right)) \
                assfail3(#LEFT " " #OP " " #RIGHT, \
                        (uintmax_t)__left, #OP, (uintmax_t)__right, \
                        __FILE__, __LINE__); \
_NOTE(CONSTCOND) } while (0)
#define ASSERT3U(x, y, z)       VERIFY3_IMPL(x, y, z, uint64_t)

Mean that we compared:
left = (type == ZIO_TYPE_FREE || psize)
OP = "<="
right = (SPA_MAXBLOCKSIZE)

If the type was not FREE, 0 is less than SPA_MAXBLOCKSIZE (16MB)
If the type is ZIO_TYPE_FREE, 1 is less than SPA_MAXBLOCKSIZE
The constraint on psize (physical size of the FREE operation) is never
checked against SPA_MAXBLOCKSIZE

Reported by:	Ka Ho Ng <khng300@gmail.com>
Reviewed by:	kevans
MFC after:	2 weeks
Sponsored by:	Klara Systems
2019-05-29 20:34:35 +00:00
Alexander Motin
83d0c3846d Allocate buffers smaller then ABD chunk size as linear.
This allows to reduce memory waste by letting UMA to put multiple small
buffers into one memory page slab.  The page sharing means that UMA
may not be able to free memory page when some of buffers are freed, but
alternatively memory used by that buffer would just be wasted from the
beginning.

This change follows alike change in ZoL, but unlike Linux (according to
my understanding of it from comments) FreeBSD never shares slabs bigger
then one memory page, so this should be even less invasive then there.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-05-22 18:43:48 +00:00
Allan Jude
57e9b361ba Fix typo in r348068 2019-05-21 21:39:03 +00:00
Allan Jude
4e7d8292ec ZFS: Make deadman tunables no longer read-only
This allows the user to enable, disable, and adjust the I/O deadman at
runtime. This can be especially useful when a pool is backed by remote
storage (such as iscsi, ggated, etc).

PR:		221906
Submitted by:	Fabian Keil <fk@fabiankeil.de>
Obtained from:	ElectroBSD
MFC after:	1 week
Sponsored by:	Klara Systems
Event:		Waterloo Hackathon 2019
2019-05-21 21:26:18 +00:00
Alexander Motin
7763842174 Add mutex_destroy() missed in r334844.
MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2019-04-26 19:02:21 +00:00
Alexander Motin
32d8034f77 Fix minor mismerges.
No functional change.

MFC after:	1 week
2019-04-26 18:25:59 +00:00
Alexander Motin
48ecceba1e Change the way FreeBSD GID inheritance is hacked.
I believe previous ifdef caused NULL dereference in later zfs_log_create()
on attempt to create file inside directory belonging to ephemeral group
created on illumos, trying to write to log information about GID domain
of the newly created file, inheriting the ephemeral GID.

This patch reuses original illumos SGID code with exception that due to
lack of ID mapping code on FreeBSD ephemeral GID will turn into GID_NOBODY
by another ifdef inside zfs_fuid_map_id().

MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2019-04-19 15:44:45 +00:00
Conrad Meyer
a8a16c7128 Replace read_random(9) with more appropriate arc4rand(9) KPIs
Reviewed by:	ae, delphij
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D19760
2019-04-04 01:02:50 +00:00
Pawel Jakub Dawidek
af4f9e5f00 If the autoexpand pool property is turned on and vdev is healthy try to
expand the pool automatically when we detect underlying GEOM provider
size change.

Obtained from:	Fudo Security
Tested in:	AWS
2019-03-30 07:29:20 +00:00
Andriy Gapon
76a63ee510 Revert r345410, VOP_FSYNC change in ZFS vdev_file
I overlooked the fact that that VOP_FSYNC() call is not a FreeBSD VFS
call, but a macro that provides an illumos-compatible wrapper for the
FreeBSD operation.

PR:		236475
Reported by:	lwhsu
Pointyhat to:	avg
2019-03-22 17:44:47 +00:00
Andriy Gapon
c1581f4f4d ZFS vdev_file: use correct value for waitfor parameter of VOP_FSYNC
PR:		236475
Reported by:	asomers
MFC after:	2 weeks
2019-03-22 09:11:45 +00:00
Alexander Motin
6bb46107d8 MFV r336930: 9284 arc_reclaim_thread has 2 jobs
`arc_reclaim_thread()` calls `arc_adjust()` after calling
`arc_kmem_reap_now()`; `arc_adjust()` signals `arc_get_data_buf()` to
indicate that we may no longer be `arc_is_overflowing()`.

The problem is, `arc_kmem_reap_now()` can take several seconds to
complete, has no impact on `arc_is_overflowing()`, but due to how the
code is structured, can impact how long the ARC will remain in the
`arc_is_overflowing()` state.

The fix is to use seperate threads to:

1. keep `arc_size` under `arc_c`, by calling `arc_adjust()`, which
    improves `arc_is_overflowing()`

2. keep enough free memory in the system, by calling
 `arc_kmem_reap_now()` plus `arc_shrink()`, which improves
 `arc_available_memory()`.

illumos/illumos-gate@de753e34f9

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@joyent.com>
Reviewed by: Tim Kordas <tim.kordas@joyent.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Brad Lewis <brad.lewis@delphix.com>
2019-03-15 18:59:04 +00:00
Simon J. Gerraty
f5fdf82d82 Add _PC_ACL_* to vop_stdpathconf
This avoid EINVAL from tmpfs etc.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D19512
2019-03-11 20:40:56 +00:00
Alexander Motin
aa8676f25d Revert minor part of r344934.
I tried to save some CPU time on hopeless aggregation attempts, but it seems
the condition I added is overly strict, blocking also aggregation of optional
I/Os in cases which previously were possible.  Revert just to be safe.

MFC after:	1 month
2019-03-11 17:39:09 +00:00
Alexander Motin
5ca679e3c4 MFV/ZoL: Disable LBA weighting on files and SSDs
The LBA weighting makes sense on rotational media where the outer tracks
have twice the bandwidth of the inner tracks. However, it is detrimental
on nonrotational media such as solid state disks, where the only effect
is to ensure that metaslabs enter the best-fit allocation behavior
sooner, which is detrimental to performance. It also makes no sense on
files where the underlying filesystem can arrange things however it
wants.

Author: Richard Yao <ryao@gentoo.org>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3712
zfsonlinux/zfs@fb40095f5f

To reduce code divergence this merge replaces equivalent but different
FreeBSD code detecting non-rotating medium vdevs.

MFC after:	1 month
2019-03-08 21:13:45 +00:00
Alexander Motin
673544c3dd Add separate aggregation limit for non-rotating media.
Before sequential scrub patches ZFS never aggregated I/Os above 128KB.
Sequential scrub bumped that to 1MB, which motivation I understand for
spinning disks, since it should reduce number of head seeks.  But for
SSDs it makes much less sense to me, especially on FreeBSD, where due
to MAXPHYS limitation device will likely still see bunch of 128KB I/Os
instead of one large.  Having more strict aggregation limit allows to
avoid allocation of large memory buffer and memcpy to/from it, that is
a serious problem when bandwidth reaches few GB/s.

MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2019-03-08 19:38:52 +00:00
Alexander Motin
3a3ba532e7 MFV/ZoL: Fix zfs_vdev_aggregation_limit bounds checking
Update the bounds checking for zfs_vdev_aggregation_limit so that
it has a floor of zero and a maximum value of the supported block
size for the pool.

Additionally add an early return when zfs_vdev_aggregation_limit
equals zero to disable aggregation.  For very fast solid state or
memory devices it may be more expensive to perform the aggregation
than to issue the IO immediately.

Author: Brian Behlendorf <behlendorf1@llnl.gov>
zfsonlinux/zfs@a58df6f536

MFV/ZoL: Cap maximum aggregate IO size

Commit 8542ef8 allowed optional IOs to be aggregated beyond
the specified aggregation limit.  Since the aggregation limit
was also used to enforce the maximum block size, setting
`zfs_vdev_aggregation_limit=16777216` could result in an
attempt to allocate an ABD larger than 16M.

Author: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6259
Closes #6270
zfsonlinux/zfs@2d678f779a
2019-03-08 18:49:27 +00:00
Alexander Motin
ede8782611 Improve entropy for ZFS taskqueue selection.
I just found that at least on Skylake CPUs cpu_ticks() never returns odd
values, only even, and possibly has even bigger step (176/2?), that makes
its lower bits very bad entropy source, leaving half of taskqueues unused.
Switch to sbinuptime(), closer to upstreams, mitigates the problem by the
rate conversion working as kind of hash function.  In case that is somehow
not enough (timer rate is too low or too divisible) mix in curcpu.

MFC after:	1 week
2019-03-07 22:56:39 +00:00
Alexander Motin
551b7d3a29 Add respective tunables to few ZFS sysctls.
MFC after:	1 week
2019-03-07 01:24:08 +00:00
Pawel Jakub Dawidek
b8da50d526 Improve readability of the code by making it explicit where the 'c' variable
starts. It is also more consistent with similar code in this file.
2019-03-01 05:54:13 +00:00
Mark Johnston
8e7127fd91 Fix fasttrap_sig{trap,segv}().
- Don't leak the ksiginfo structure.
- Hold the proc lock when sending a signal in fasttrap_sigsegv().

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2019-02-26 18:20:41 +00:00
Mark Johnston
5563c675b3 Revert r344587.
The fasttrap_isa.h header is needed by libdtrace, not just the kernel.
2019-02-26 17:33:56 +00:00