Commit Graph

2251 Commits

Author SHA1 Message Date
Pawel Jakub Dawidek
cb761bb2fb Avoid the GEOM topology lock recursion when we automatically expand a pool.
The steps to reproduce the problem:

	mdconfig -a -t swap -s 3g -u 0
	gpart create -s GPT md0
	gpart add -t freebsd-zfs -s 1g md0
	zpool create -o autoexpand=on foo md0p1
	gpart resize -i 1 -s 2g md0
2020-04-25 21:45:31 +00:00
John Baldwin
5c4309b474 Handle non-dtrace-triggered kernel breakpoint traps in mips.
If DTRACE is enabled at compile time, all kernel breakpoint traps are
first given to dtrace to see if they are triggered by a FBT probe.
Previously if dtrace didn't recognize the trap, it was silently
ignored breaking the handling of other kernel breakpoint traps such as
the debug.kdb.enter sysctl.  This only returns early from the trap
handler if dtrace recognizes the trap and handles it.

Submitted by:	Nicolò Mazzucato <nicomazz97@gmail.com>
Reviewed by:	markj
Obtained from:	CheriBSD
Differential Revision:	https://reviews.freebsd.org/D24478
2020-04-21 17:38:07 +00:00
Gleb Smirnoff
9edef911e8 Make ZFS depend on xdr.ko only. It doesn't need kernel RPC.
Differential Revision:	https://reviews.freebsd.org/D24408
2020-04-17 06:05:08 +00:00
Ryan Moeller
69534635ff MFOpenZFS: ZVOLs should not be allowed to have children
zfs create, receive and rename can bypass this hierarchy rule. Update
both userland and kernel module to prevent this issue and use pyzfs
unit tests to exercise the ioctls directly.

Note: this commit slightly changes zfs_ioc_create() ABI. This allow to
differentiate a generic error (EINVAL) from the specific case where we
tried to create a dataset below a ZVOL (ZFS_ERR_WRONG_PARENT).

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>

Approved by:	mav (mentor)
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
openzfs/zfs@d8d418ff0c
2020-03-25 15:56:18 +00:00
Alexander Motin
d3c6ba3214 MFOpenZFS: make zil max block size tunable
We've observed that on some highly fragmented pools, most metaslab
allocations are small (~2-8KB), but there are some large, 128K
allocations.  The large allocations are for ZIL blocks.  If there is a
lot of fragmentation, the large allocations can be hard to satisfy.

The most common impact of this is that we need to check (and thus load)
lots of metaslabs from the ZIL allocation code path, causing sync writes
to wait for metaslabs to load, which can take a second or more.  In the
worst case, we may not be able to satisfy the allocation, in which case
the ZIL will resort to txg_wait_synced() to ensure the change is on
disk.

To provide a workaround for this, this change adds a tunable that can
reduce the size of ZIL blocks.

External-issue: DLPX-61719
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8865
openzfs/zfs@b8738257c2

MFC after:	2 weeks
2020-03-19 01:05:54 +00:00
Alexander Motin
cf2f2eb568 Fix infinite scan on a pool with only special allocations
Attempt to run scrub or resilver on a new pool containing only special
allocations (special vdev added on creation) caused infinite loop
because of dsl_scan_should_clear() limiting memory usage to 5% of pool
size, which it calculated accounting only normal allocation class.

Addition of special and just in case dedup classes fixes the issue.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #10106
Closes #8694
openzfs/zfs@fa130e010c
2020-03-16 19:03:10 +00:00
Ryan Moeller
9f24784038 TODO DONE: Use sx_xholder in SPL rwlock.h
Approved by:	mav (mentor)
MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2020-03-14 00:16:15 +00:00
Konstantin Belousov
d5b7401f64 zfs dmu_read: loosen the assertion.
Since switch to the lockless grab, shared busy for ahead/behind pages
allows other threads to validate and map the pages readonly.

Reviewed by:	avg, jeff, markj
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D23986
2020-03-06 21:15:25 +00:00
Alexander Motin
5c940cf1ff Remove vfs.zfs.top_maxinflight tunable/sysctl.
It is dead since sorted scrub import at r334844.

MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2020-03-05 19:43:43 +00:00
Alexander Motin
e37d5c12e9 Increase number of write completion threads, matching ZoL.
Our iSCSI benchmarks on a large 80-core system show that previous limit
of 8 threads can be a bottleneck.  At some points this change increases
write IOPS by as much as 50%.  I am still not sure that so many threads
is really required, but we tested lower amounts and got no significant
benefits, while latencies were a bit worse, so decided to not diverge.

MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2020-03-03 15:05:13 +00:00
Jeff Roberson
9defe1c076 Eliminate object locking in zfs where possible with the new lockless grab
APIs.

Reviewed by:	kib, markj, mmacy
Differential Revision:	https://reviews.freebsd.org/D23848
2020-02-28 20:29:53 +00:00
Mark Johnston
a7261520ba Clear systrace_args_func when systrace probes are disabled.
This function pointer is invalidated when systrace.ko is unloaded.

Reported by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2020-02-28 17:04:36 +00:00
Andriy Gapon
40b1e0dc0e remove stray space symbol in r358380
MFC after:	1 week
X-MFC with:	r358380
2020-02-27 14:27:42 +00:00
Andriy Gapon
6d11243ae2 use ZFS_MAX_DATASET_NAME_LEN instead of MAXPATHLEN for dataset names
The change affects only FreeBSD specific code as the common code already
mostly uses the more idiomatic and correct ZFS_MAX_DATASET_NAME_LEN.

MFC after:	1 week
2020-02-27 14:21:01 +00:00
Andriy Gapon
6b47663df5 dsl_dataset_promote_sync: populate 'oldname' before using it
It's very unlikely that zfsvfs_update_fromname() and
zvol_rename_minors() ever did anything during the promote operation as
the old name was not initialized.

MFC after:	1 week
2020-02-27 14:12:43 +00:00
Alexander Motin
a33a65ce22 MFZoL: Relax restriction on zfs_ioc_next_obj() iteration
Per the documentation for dnode_next_offset in dnode.c, the "txg"
parameter specifies a lower bound on which transaction the dnode can
be found in. We are interested in all dnodes that are removed between
the first and last transaction in the snapshot. It doesn't need to be
created in that snapshot to correspond to a removed file.

In fact, the behavior of zfs diff in the test case exactly matches
this: the transaction that created the data that was deleted in snapshot
"2" was produced before, in snapshot "1", definitely predating the first
transaction in snapshot "2".

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <Tim Chase <tim@onlight.com>
Closes #2081
zfsonlinux/zfs@7290cd3c4e

MFC after:	1 week
2020-02-26 20:38:48 +00:00
Toomas Soome
c1c4c81fd7 loader: replace zfs_alloc/zfs_free with malloc/free
Use common memory management.
2020-02-26 18:12:12 +00:00
Alexander Motin
0f58760b82 MFZoL: Fix resilver writes in vdev_indirect_io_start
This patch addresses an issue found in ztest where resilver
write zios that were passed to an indirect vdev would end up
being handled as though they were resilver read zios. This
caused issues where the zio->io_abd would be both read to
and written from at the same time, causing asserts to fail.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8193
zfsonlinux/zfs@5aa95ba0d3

MFC after:	1 week
2020-02-26 16:51:45 +00:00
Alexander Motin
51c04e6cc2 Fix patch mismerge in r358336.
MFC after:	1 week
2020-02-26 16:04:24 +00:00
Alexander Motin
f8a7a04b79 MFZoL: Fix issue with scanning dedup blocks as scan ends
This patch fixes an issue discovered by ztest where
dsl_scan_ddt_entry() could add I/Os to the dsl scan queues
between when the scan had finished all required work and
when the scan was marked as complete. This caused the scan
to spin indefinitely without ending.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
zfsonlinux/zfs@5e0bd0ae05

MFC after:	1 week
2020-02-26 15:59:46 +00:00
Alexander Motin
308acfcc62 MFZoL: Fix 2 small bugs with cached dsl_scan_phys_t
This patch corrects 2 small bugs where scn->scn_phys_cached was
not properly updated to match the primary copy when it needed to
be. The first resulted in the pause state not being properly
updated and the second resulted in the cached version being
completely zeroed even if the primary was not.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
zfsonlinux/zfs@8cb119e3dc

MFC after:	1 week
2020-02-26 15:47:40 +00:00
Alexander Motin
4b7f090f8d MFZoL: Fix txg_sync_thread hang in scan_exec_io()
When scn->scn_maxinflight_bytes has not been initialized it's
possible to hang on the condition variable in scan_exec_io().
This issue was uncovered by ztest and is only possible when
deduplication is enabled through the following call path.

  txg_sync_thread()
    spa_sync()
      ddt_sync_table()
        ddt_sync_entry()
          dsl_scan_ddt_entry()
            dsl_scan_scrub_cb()
              dsl_scan_enqueuei()
                scan_exec_io()
                  cv_wait()

Resolve the issue by always initializing scn_maxinflight_bytes
to a reasonable minimum value.  This value will be recalculated
in dsl_scan_sync() to pick up changes to zfs_scan_vdev_limit
and the addition/removal of vdevs.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7098
zfsonlinux/zfs@f90a30ad1b

MFC after:	1 week
2020-02-26 15:45:04 +00:00
Pawel Biernacki
7029da5c36 Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE.  All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by:	kib (mentor, blanket)
Commented by:	kib, gallatin, melifaro
Differential Revision:	https://reviews.freebsd.org/D23718
2020-02-26 14:26:36 +00:00
Alexander Motin
8d8e484d9c Remove duplicate dbufs accounting.
Since AVL already has embedded element counter, use dn_dbufs_count
only for dbufs not counted there (bonus buffers) and just add them.
This removes two atomics per dbuf life cycle.

According to profiler it reduces time spent by dbuf_destroy() inside
bottlenecked dbuf_evict_thread() from 13.36% to 9.20% of the core.

This counter is used only on illumos, so for FreeBSD it was just a
waste of time.

MFC after:	2 weeks
2020-02-07 15:50:47 +00:00
Alexander Motin
c10aea724f Reduce number of atomic_add() calls in aggsum.
Previous code used 4 atomics to do aggsum_flush_bucket() and 2 more to
re-borrow after the flush.  But since asc_borrowed and asc_delta are
accessed only while holding asc_lock, it makes no any sense to modify
as_lower_bound and as_upper_bound in multiple steps.  Instead of that
the new code uses only 2 atomics in all the cases, one per as_*_bound
variable.  I think even that is overkill, simple atomic store and
load could be used here, since all modifications are done under the
as_lock, but there are no such primitives in ZFS code now.

While there, make borrow code consider previous borrow value, so that
on mixed request patterns reduce chance of needing to borrow again if
much larger request follows tiny one that needed borrow.

Also reduce as_numbuckets from uint64_t to u_int.  It makes no sense
to use so large division operation on every aggsum_add().

Reviewed by:	Brian Behlendorf, Paul Dagnelie
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2020-02-06 20:32:53 +00:00
Konstantin Belousov
a421e8786b Add sys/systm.h to several places that use vm headers.
It is needed (but not enough) to use e.g. KASSERT() in inline functions.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2020-02-04 18:56:26 +00:00
Alexander Motin
ea642c5c38 Few microoptimizations to dbuf layer.
Move db_link into the same cache line as db_blkid and db_level.
It allows significantly reduce avl_add() time in dbuf_create() on
systems with large RAM and huge number of dbufs per dnode.

Avoid few accesses to dbuf_caches[].size, which is highly congested
under high IOPS and never stays in cache for a long time.  Use local
value we are receiving from zfs_refcount_add_many() any way.

Remove cache_size_bytes_max bump from dbuf_evict_one().  I don't see
a point to do it on dbuf eviction after we done it on insertion in
dbuf_rele_and_unlock().

Reviewed by:	mahrens, Brian Behlendorf
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2020-02-04 15:53:51 +00:00
Toomas Soome
4d297e7035 loader: rewrite zfs reader zap code to use malloc
First step on removing zfs_alloc.

Reviewed by:	delphij
Differential Revision:	https://reviews.freebsd.org/D23433
2020-02-04 07:37:55 +00:00
Warner Losh
58aa35d429 Remove sparc64 kernel support
Remove all sparc64 specific files
Remove all sparc64 ifdefs
Removee indireeect sparc64 ifdefs
2020-02-03 17:35:11 +00:00
Alexander Motin
c68c82324f Unblock kstat.zfs.misc.dbufstats sysctls.
It is not so much broken to hide it after we wasted time to collect it.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2020-02-03 17:10:40 +00:00
Kyle Evans
6a5abb1ee5 Provide O_SEARCH
O_SEARCH is defined by POSIX [0] to open a directory for searching, skipping
permissions checks on the directory itself after the initial open(). This is
close to the semantics we've historically applied for O_EXEC on a directory,
which is UB according to POSIX. Conveniently, O_SEARCH on a file is also
explicitly undefined behavior according to POSIX, so O_EXEC would be a fine
choice. The spec goes on to state that O_SEARCH and O_EXEC need not be
distinct values, but they're not defined to be the same value.

This was pointed out as an incompatibility with other systems that had made
its way into libarchive, which had assumed that O_EXEC was an alias for
O_SEARCH.

This defines compatibility O_SEARCH/FSEARCH (equivalent to O_EXEC and FEXEC
respectively) and expands our UB for O_EXEC on a directory. O_EXEC on a
directory is checked in vn_open_vnode already, so for completeness we add a
NOEXECCHECK when O_SEARCH has been specified on the top-level fd and do not
re-check that when descending in namei.

[0] https://pubs.opengroup.org/onlinepubs/9699919799/

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23247
2020-02-02 16:34:57 +00:00
Kyle Evans
c887ac8324 zfs: light refactor to indicate cachedlookup in zfs_lookup
If we come from VOP_CACHEDLOOKUP, we must skip the VEXEC check as it will
have been done in the caller (vfs_cache_lookup). This is a part of D23247,
which may skip the earlier VEXEC check as well if the root fd was opened
with O_SEARCH.

This one required slightly more work as zfs_lookup may also be called
indirectly as VOP_LOOKUP or a couple of other places where we must do the
check.
2020-02-02 16:10:33 +00:00
Mateusz Guzik
f0c402e425 zfs: ZFS_WLOCK_TEARDOWN_INACTIVE_WLOCKED -> ZFS_TEARDOWN_INACTIVE_WLOCKED
Fix up the argument used in one case as well.
2020-02-01 06:39:10 +00:00
Mateusz Guzik
8c3658c450 zfs: convert z_teardown_inactive_lock to sleepable read-mostly lock
This eliminates a global serialisation point. It only gets write locked
on unmount.

Sample result doing an incremental -j 40 build:
before: 173.30s user 458.97s system 2595% cpu 24.358 total
after:  168.58s user 254.92s system 2211% cpu 19.147 total
2020-01-31 08:38:38 +00:00
Mateusz Guzik
b076e113af zfs: provide macros to handle z_teardown_inactive_lock 2020-01-31 08:37:35 +00:00
Mateusz Guzik
42a9f8f21d zfs: fix spurious lock contention during path lookup
ZFS tracks if anything denies VEXEC to allow for a quick check for the
common case of path traversal. Use it.

Differential Revision:	https://reviews.freebsd.org/D22224
2020-01-30 02:16:17 +00:00
Mateusz Guzik
e196f7825e zfs: use VOP_NEED_INACTIVE
Big thanks to Greg V for testing previous verions of the patch.

Differential Revision:	https://reviews.freebsd.org/D22130
2020-01-30 02:14:10 +00:00
Alexander Motin
da19f62dfa Map ECKSUM and EFRAGS from ZFS onto real errnos.
Make it less confusing when, for example, stat sets errno to 122 because a
checksum failed in ZFS:

Before: getfacl: /foo/bar: stat() failed: Unknown error: 122
After: getfacl: /foo/bar: stat() failed: Integrity check failed

Submitted by:	Ryan Moeller <ryan@ixsystems.com>
Reviewed by:	mckusick, mav
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D22973
2020-01-13 22:06:16 +00:00
Mateusz Guzik
879e0604ee Add KERNEL_PANICKED macro for use in place of direct panicstr tests 2020-01-12 06:07:54 +00:00
Mateusz Guzik
638af813d9 dtrace: add missing CLTFLAG_MPSAFE annotations 2020-01-12 04:53:22 +00:00
Mateusz Guzik
20fa645666 zfs: add missing CLTFLAG_MPSAFE annotations 2020-01-12 04:53:01 +00:00
Mateusz Guzik
b52d50cf69 vfs: prealloc vnodes in getnewvnode_reserve
Having a reserved vnode count does not guarantee that getnewvnodes wont
block later. Said blocking partially defeats the purpose of reserving in
the first place.

Preallocate instaed. The only consumer was always passing "1" as count
and never nesting reservations.
2020-01-11 22:58:14 +00:00
Ian Lepore
8bfc473c0e Remove scary-looking printf output that happens when you kldload dtrace on
arm.  Replace it with a comment block explaining why the function is empty
on 32-bit arm.
2020-01-09 22:51:37 +00:00
Mateusz Guzik
75ad73a8b9 zfs: plug a vnode reserve leak in zfs_make_xattrdir 2020-01-07 04:34:29 +00:00
Mateusz Guzik
b249ce48ea vfs: drop the mostly unused flags argument from VOP_UNLOCK
Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D21427
2020-01-03 22:29:58 +00:00
Brandon Bergren
9aafc7c052 [PowerPC] [MIPS] Implement 32-bit kernel emulation of atomic64 operations
This is a lock-based emulation of 64-bit atomics for kernel use, split off
from an earlier patch by jhibbits.

This is needed to unblock future improvements that reduce the need for
locking on 64-bit platforms by using atomic updates.

The implementation allows for future integration with userland atomic64,
but as that implies going through sysarch for every use, the current
status quo of userland doing its own locking may be for the best.

Submitted by:	jhibbits (original patch), kevans (mips bits)
Reviewed by:	jhibbits, jeff, kevans
Differential Revision:	https://reviews.freebsd.org/D22976
2020-01-02 23:20:37 +00:00
Mark Johnston
9f5632e6c8 Remove page locking for queue operations.
With the previous reviews, the page lock is no longer required in order
to perform queue operations on a page.  It is also no longer needed in
the page queue scans.  This change effectively eliminates remaining uses
of the page lock and also the false sharing caused by multiple pages
sharing a page lock.

Reviewed by:	jeff
Tested by:	pho
Sponsored by:	Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D22885
2019-12-28 19:04:00 +00:00
Mateusz Guzik
6fa079fc3f vfs: flatten vop vectors
This eliminates the following loop from all VOP calls:

while(vop != NULL && \
    vop->vop_spare2 == NULL && vop->vop_bypass == NULL)
        vop = vop->vop_default;

Reviewed by:	jeff
Tesetd by:	pho
Differential Revision:	https://reviews.freebsd.org/D22738
2019-12-16 00:06:22 +00:00
Toomas Soome
3c2db0ef43 loader: rewrite zfs vdev initialization
In some cases the pool discovery will get stuck in infinite loop while setting
up the vdev children.

To fix, we split the vdev setup into two parts, first we create vdevs based on
configuration we do get from pool label, then, we process pool config from MOS
and update the pool config if needed.

Testing done: confirm previously hung loader is not hung any more.

MFC after:	1 week
2019-12-15 21:52:40 +00:00
Jeff Roberson
61a74c5ccd schedlock 1/4
Eliminate recursion from most thread_lock consumers.  Return from
sched_add() without the thread_lock held.  This eliminates unnecessary
atomics and lock word loads as well as reducing the hold time for
scheduler locks.  This will eventually allow for lockless remote adds.

Discussed with:	kib
Reviewed by:	jhb
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22626
2019-12-15 21:11:15 +00:00