Commit Graph

2087 Commits

Author SHA1 Message Date
Alexander Motin
32d8034f77 Fix minor mismerges.
No functional change.

MFC after:	1 week
2019-04-26 18:25:59 +00:00
Alexander Motin
48ecceba1e Change the way FreeBSD GID inheritance is hacked.
I believe previous ifdef caused NULL dereference in later zfs_log_create()
on attempt to create file inside directory belonging to ephemeral group
created on illumos, trying to write to log information about GID domain
of the newly created file, inheriting the ephemeral GID.

This patch reuses original illumos SGID code with exception that due to
lack of ID mapping code on FreeBSD ephemeral GID will turn into GID_NOBODY
by another ifdef inside zfs_fuid_map_id().

MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2019-04-19 15:44:45 +00:00
Justin Hibbits
e9aae3496e powerpc/dtrace: Fix dtrace powerpc asm, and simplify stack walking
Fix some execution bugs in the dtrace powerpc asm.  addme pulls in the carry
flag which we don't want, and the result wasn't recorded anyways, so the
following beq to check for exit condition wasn't checking the right
condition.

Simplify the stack walking in dtrace_isa.c, so there's only a single walker
that handles both pc and sp.  This should make it easier to follow, and any
bugfix that may be needed for walking only needs to be made in one place
instead of two now.

MFC after:	2 weeks
2019-04-13 03:32:21 +00:00
Mariusz Zaborski
a1304030b8 Introduce funlinkat syscall that always us to check if we are removing
the file associated with the given file descriptor.

Reviewed by:	kib, asomers
Reviewed by:	cem, jilles, brooks (they reviewed previous version)
Discussed with:	pjd, and many others
Differential Revision:	https://reviews.freebsd.org/D14567
2019-04-06 09:34:26 +00:00
Conrad Meyer
a8a16c7128 Replace read_random(9) with more appropriate arc4rand(9) KPIs
Reviewed by:	ae, delphij
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D19760
2019-04-04 01:02:50 +00:00
Pawel Jakub Dawidek
af4f9e5f00 If the autoexpand pool property is turned on and vdev is healthy try to
expand the pool automatically when we detect underlying GEOM provider
size change.

Obtained from:	Fudo Security
Tested in:	AWS
2019-03-30 07:29:20 +00:00
Andriy Gapon
76a63ee510 Revert r345410, VOP_FSYNC change in ZFS vdev_file
I overlooked the fact that that VOP_FSYNC() call is not a FreeBSD VFS
call, but a macro that provides an illumos-compatible wrapper for the
FreeBSD operation.

PR:		236475
Reported by:	lwhsu
Pointyhat to:	avg
2019-03-22 17:44:47 +00:00
Andriy Gapon
c1581f4f4d ZFS vdev_file: use correct value for waitfor parameter of VOP_FSYNC
PR:		236475
Reported by:	asomers
MFC after:	2 weeks
2019-03-22 09:11:45 +00:00
Mark Johnston
a4b59d3db6 Use an explicit comparison with VM_GUEST_NO.
Reported by:	jhb
MFC with:	r345359
Sponsored by:	The FreeBSD Foundation
2019-03-21 20:07:50 +00:00
Mark Johnston
e362e590f9 Don't attempt to measure TSC skew when running as a VM guest.
It simply doesn't work in general since VCPUs may migrate between
physical cores.  The approach used to measure skew also doesn't
make much sense in a VM.

PR:		218452
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2019-03-21 02:52:22 +00:00
Alexander Motin
6bb46107d8 MFV r336930: 9284 arc_reclaim_thread has 2 jobs
`arc_reclaim_thread()` calls `arc_adjust()` after calling
`arc_kmem_reap_now()`; `arc_adjust()` signals `arc_get_data_buf()` to
indicate that we may no longer be `arc_is_overflowing()`.

The problem is, `arc_kmem_reap_now()` can take several seconds to
complete, has no impact on `arc_is_overflowing()`, but due to how the
code is structured, can impact how long the ARC will remain in the
`arc_is_overflowing()` state.

The fix is to use seperate threads to:

1. keep `arc_size` under `arc_c`, by calling `arc_adjust()`, which
    improves `arc_is_overflowing()`

2. keep enough free memory in the system, by calling
 `arc_kmem_reap_now()` plus `arc_shrink()`, which improves
 `arc_available_memory()`.

illumos/illumos-gate@de753e34f9

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@joyent.com>
Reviewed by: Tim Kordas <tim.kordas@joyent.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Brad Lewis <brad.lewis@delphix.com>
2019-03-15 18:59:04 +00:00
Simon J. Gerraty
f5fdf82d82 Add _PC_ACL_* to vop_stdpathconf
This avoid EINVAL from tmpfs etc.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D19512
2019-03-11 20:40:56 +00:00
Alexander Motin
aa8676f25d Revert minor part of r344934.
I tried to save some CPU time on hopeless aggregation attempts, but it seems
the condition I added is overly strict, blocking also aggregation of optional
I/Os in cases which previously were possible.  Revert just to be safe.

MFC after:	1 month
2019-03-11 17:39:09 +00:00
Alexander Motin
5ca679e3c4 MFV/ZoL: Disable LBA weighting on files and SSDs
The LBA weighting makes sense on rotational media where the outer tracks
have twice the bandwidth of the inner tracks. However, it is detrimental
on nonrotational media such as solid state disks, where the only effect
is to ensure that metaslabs enter the best-fit allocation behavior
sooner, which is detrimental to performance. It also makes no sense on
files where the underlying filesystem can arrange things however it
wants.

Author: Richard Yao <ryao@gentoo.org>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3712
zfsonlinux/zfs@fb40095f5f

To reduce code divergence this merge replaces equivalent but different
FreeBSD code detecting non-rotating medium vdevs.

MFC after:	1 month
2019-03-08 21:13:45 +00:00
Alexander Motin
673544c3dd Add separate aggregation limit for non-rotating media.
Before sequential scrub patches ZFS never aggregated I/Os above 128KB.
Sequential scrub bumped that to 1MB, which motivation I understand for
spinning disks, since it should reduce number of head seeks.  But for
SSDs it makes much less sense to me, especially on FreeBSD, where due
to MAXPHYS limitation device will likely still see bunch of 128KB I/Os
instead of one large.  Having more strict aggregation limit allows to
avoid allocation of large memory buffer and memcpy to/from it, that is
a serious problem when bandwidth reaches few GB/s.

MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2019-03-08 19:38:52 +00:00
Alexander Motin
3a3ba532e7 MFV/ZoL: Fix zfs_vdev_aggregation_limit bounds checking
Update the bounds checking for zfs_vdev_aggregation_limit so that
it has a floor of zero and a maximum value of the supported block
size for the pool.

Additionally add an early return when zfs_vdev_aggregation_limit
equals zero to disable aggregation.  For very fast solid state or
memory devices it may be more expensive to perform the aggregation
than to issue the IO immediately.

Author: Brian Behlendorf <behlendorf1@llnl.gov>
zfsonlinux/zfs@a58df6f536

MFV/ZoL: Cap maximum aggregate IO size

Commit 8542ef8 allowed optional IOs to be aggregated beyond
the specified aggregation limit.  Since the aggregation limit
was also used to enforce the maximum block size, setting
`zfs_vdev_aggregation_limit=16777216` could result in an
attempt to allocate an ABD larger than 16M.

Author: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6259
Closes #6270
zfsonlinux/zfs@2d678f779a
2019-03-08 18:49:27 +00:00
Alexander Motin
ede8782611 Improve entropy for ZFS taskqueue selection.
I just found that at least on Skylake CPUs cpu_ticks() never returns odd
values, only even, and possibly has even bigger step (176/2?), that makes
its lower bits very bad entropy source, leaving half of taskqueues unused.
Switch to sbinuptime(), closer to upstreams, mitigates the problem by the
rate conversion working as kind of hash function.  In case that is somehow
not enough (timer rate is too low or too divisible) mix in curcpu.

MFC after:	1 week
2019-03-07 22:56:39 +00:00
Alexander Motin
551b7d3a29 Add respective tunables to few ZFS sysctls.
MFC after:	1 week
2019-03-07 01:24:08 +00:00
Pawel Jakub Dawidek
b8da50d526 Improve readability of the code by making it explicit where the 'c' variable
starts. It is also more consistent with similar code in this file.
2019-03-01 05:54:13 +00:00
Mark Johnston
8e7127fd91 Fix fasttrap_sig{trap,segv}().
- Don't leak the ksiginfo structure.
- Hold the proc lock when sending a signal in fasttrap_sigsegv().

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2019-02-26 18:20:41 +00:00
Mark Johnston
5563c675b3 Revert r344587.
The fasttrap_isa.h header is needed by libdtrace, not just the kernel.
2019-02-26 17:33:56 +00:00
Mark Johnston
df59ed0787 Remove illumos-specific code from the x86 fasttrap_isa.c.
The file has not been touched upstream in over a decade, and the nature
of the code means that a lot of FreeBSD-specific bits are required.  Remove
the dead code to improve readability.  No functional change intended.

Discussed with:	cem
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2019-02-26 16:34:43 +00:00
Mark Johnston
6829dae12b Remove stub fasttrap implementations.
No platforms except i386, amd64 and powerpc implement fasttrap; the
fasttrap files for other arches do not contain any code and bloat
the output from cscope, so just remove them.

MFC after:	1 week
2019-02-26 16:31:47 +00:00
Mark Johnston
f23e684bbf Commit a missing piece of r344452.
MFC with:	r344452
2019-02-21 22:56:54 +00:00
Mark Johnston
4f1b715c84 Fix a tracepoint lookup race in fasttrap_pid_probe().
fasttrap hooks the userspace breakpoint handler; the hook looks up the
breakpoint address in a hash table of tracepoints.  It is possible for
the tracepoint to be removed by a different thread in between the
breakpoint trap and the hash table lookup, in which case SIGTRAP gets
delivered to the target process.  Fix the problem by adding a
per-process generation counter that gets incremented when a tracepoint
belonging to that process is removed.  Then, when a lookup fails, the
trapping instruction is restarted if the thread's counter doesn't match
that of the process.

Reviewed by:	cem
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D19273
2019-02-21 22:54:17 +00:00
Pawel Jakub Dawidek
2691ae3230 Simplify the code. No functional changes.
Reviewed by:	rpokala
2019-02-20 00:25:45 +00:00
Pawel Jakub Dawidek
91853b8546 Simplify the code. 2019-02-19 23:53:33 +00:00
Pawel Jakub Dawidek
01e21ead90 Correct typo in the comment. 2019-02-19 23:44:00 +00:00
Pawel Jakub Dawidek
99ab63b69d Change assertion to log the incorrect io_type we've got. 2019-02-19 23:43:15 +00:00
Pawel Jakub Dawidek
36d43b5dfe Grabage-collect no longer used variable. 2019-02-19 23:41:23 +00:00
Pawel Jakub Dawidek
11c8759337 The way ZFS searches for its vdevs is the following: first it looks for
a vdev that has the same name as the one stored in metadata and that has
all VDEV labels in place. If it cannot find a GEOM provider with the given
name and all VDEV labels it will scan all GEOM providers for the best match
(the most VDEV labels available), but here the name is ignored.

In case the ZFS pool is created, eg. using GPT partition label:

	# zpool create tank /dev/gpt/tank

everything works, and on every import ZFS will pick /dev/gpt/tank and
not /dev/da0p4.

The problem occurs when da0p4 is extended and ZFS is unable to find all
VDEV labels in /dev/gpt/tank anymore (the VDEV labels stored at the end
of the partition are now somewhere else). In this case it will scan all
GEOM providers and will pick the first one with the best match, ie. da0p4.

Fix this problem by checking the VDEV/provider name even if we get the same
match. If the name is the same as the one we have in pool's metadata, prefer
this GEOM provider.

Reported by:	oshogbo, Michal Mroz <m.mroz@fudosecurity.com>
Tested by:	Michal Mroz <m.mroz@fudosecurity.com>
Obtained from:	Fudo Security
2019-02-19 23:35:55 +00:00
Pawel Jakub Dawidek
d793cf7019 In the vdev_geom_open_by_path() function we assume that vdev path starts
with "/dev/". Make sure this is the case.
2019-02-19 23:22:39 +00:00
Alexander Motin
ed0a3e8637 s/Maximal/Maximum/ in sysctl description.
Submitted by:	smh
MFC after:	1 week
2019-02-04 20:09:22 +00:00
Alexander Motin
ef08154150 Add missed tunables/sysctls for some new vdev variables.
While there, make few existing sysctls writeable, since there is no reason
not to.

MFC after:	1 week
2019-02-04 16:13:41 +00:00
Alexander Motin
54cde30f92 Remove BIO_ORDERED flag from BIO_FLUSH sent by ZFS.
In all cases where ZFS sends BIO_FLUSH, it first waits for all related
writes to complete, so its BIO_FLUSH does not care about strict ordering.
Removal of one makes life much easier at least for NVMe driver, which
hardware has no concept of request ordering, relying completely on software.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-01-30 17:39:44 +00:00
Mariusz Zaborski
db009dddfd zfs: allow to change cache flush sysctl
There is no reason for this variable to be tunable.
This variable is used as a barrier in few places.

Discussed with:	pjd
MFC after:	2 weeks
Sponsored by:	Fudo Security
2019-01-26 13:53:00 +00:00
Sean Eric Fagan
82e20c0a72 Change ZFS quotas to return EINVAL when not present (matches man page).
UFS will return EINVAL when quotas are not enabled on a filesystem; ZFS'
equivalent involves not having quotas (there is not way to enable or disable
quotas as such).  My initial implementation had it return ENOENT, but
quotactl(2) indicates EINVAL is more appropriate.

MFC after:	2 weeks
Approved by:	mav
Reviewed by:	markj
Reported by:	Emrion <kmachine@free.fr>
Sponsored by: iXsystems Inc
PR: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=234413
2019-01-11 02:53:46 +00:00
Matt Macy
27e05a1902 zfsboot: support newer ZFS versions
declare v3 objset size/layout to fix userboot and possibly other loader issues

- fix for userboot assertion failure in zfs_dev_close in free due to out of bounds write
- fix for zfs_alloc / zfs_free mismatch assertion failure when booting GPT on BIOS
2019-01-03 22:49:11 +00:00
Andriy Gapon
4c325393f3 MFV r342532: 5882 Temporary pool names
Note that this commit brings only formatting changes that were done
during the final review of the illumos change, because FreeBSD got the
main changes before illumos.

illumos/illumos-gate@04e5635652
04e5635652

https://www.illumos.org/issues/5882
  This is an import of the temporary pool names functionality from ZoL:
  e2282ef57e
  26b42f3f9d
  2f3ec90061
  00d2a8c92f
  83e9986f6e
  023bbe6f01
  It is intended to assist the creation and management of virtual machines
  that have their rootfs on ZFS on hosts that also have their rootfs on
  ZFS. These situations cause SPA namespace collisions when the standard
  name rpool is used in both cases. The solution is either to give each
  guest pool a name unique to the host, which is not always desireable, or
  boot a VM environment containing an ISO image to install it, which is
  cumbersome.

MFC after:	1 week
Sponsored by:	Panzura
2018-12-26 11:03:14 +00:00
Andriy Gapon
f050611e7f MFV r342469: 9630 add lzc_rename and lzc_destroy to libzfs_core
illumos/illumos-gate@049ba636fa
049ba636fa

https://www.illumos.org/issues/9630
  Rename and destroy are very useful operations that deserve to be in
  libzfs_core.  And they are not hard to implement too.

MFC after:	2 weeks
Relnotes:	maybe
2018-12-26 10:37:41 +00:00
Mateusz Guzik
8ca79fbd4a dtrace: fix userspace access on boxes with SMAP
dtrace has its own routines which were not updated after SMAP support got
implemented. Use ifunc just like for other routines.

This in particular fixes ustack().

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D18542
2018-12-13 20:09:38 +00:00
Mateusz Guzik
cc426dd319 Remove unused argument to priv_check_cred.
Patch mostly generated with cocinnelle:

@@
expression E1,E2;
@@

- priv_check_cred(E1,E2,0)
+ priv_check_cred(E1,E2)

Sponsored by:	The FreeBSD Foundation
2018-12-11 19:32:16 +00:00
Toomas Soome
7aaf685ba7 zfs: we can boot from dataset with large_dnode enabled
loader has been supporting large_dnode for some time, no need to block the
feature for boot dataset.

Reviewed by:	avg
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D18391
2018-12-03 19:35:21 +00:00
Mark Johnston
6d2e2df764 Ensure that directory entry padding bytes are zeroed.
Directory entries must be padded to maintain alignment; in many
filesystems the padding was not initialized, resulting in stack
memory being copied out to userspace.  With the ino64 work there
are also some explicit pad fields in struct dirent.  Add a subroutine
to clear these bytes and use it in the in-tree filesystems.  The
NFS client is omitted for now as it was fixed separately in r340787.

Reported by:	Thomas Barabosch, Fraunhofer FKIE
Reviewed by:	kib
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2018-11-23 22:24:59 +00:00
Alexander Motin
eecd0a1856 Revert r340096: 9952 Block size change during zfs receive drops spill block
It was reported, and I easily reproduced it, that this change triggers panic
when receiving replication stream with enabled embedded blocks, when short
file compressing into one embedded block changes its block size.  I am not
sure that the problem is in this particuler patch, not just triggered by it,
but since investigation and fix will take some time, I've decided to revert
this for now.

PR:		198457, 233277
2018-11-21 18:18:57 +00:00
Mark Johnston
544e0a4f69 Use taskqueue_quiesce(9) to implement taskq_wait().
PR:		227784
Reviewed by:	cem
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17975
2018-11-21 17:19:08 +00:00
Justin Hibbits
cfebc0faa7 DTrace/powerpc: Fix FBT return probes
The FBT fuction boundary prober was setting one return probe marker value,
but the dtrace handler was expecting another.  This causes a hang when
tracing return probes.
2018-11-21 16:47:11 +00:00
Konstantin Belousov
1c4ca77890 Add d_off support for multiple filesystems.
The d_off field has been added to the dirent structure recently.
Currently filesystems don't support this feature.  Support has been
added and tested for zfs, ufs, ext2fs, fdescfs, msdosfs and unionfs.
A stub implementation is available for cd9660, nandfs, udf and
pseudofs but hasn't been tested.

Motivation for this feature: our usecase is for a userspace nfs server
(nfs-ganesha) with zfs.  At the moment we cache direntry offsets by
calling lseek once per entry, with this patch we can get the offset
directly from getdirentries(2) calls which provides a significant
speedup.

Submitted by:	Jack Halford <jack@gandi.net>
Reviewed by:	mckusick, pfg, rmacklem (previous versions)
Sponsored by:	Gandi.net
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17917
2018-11-14 14:18:35 +00:00
Alexander Motin
1fcdb58634 Do not ignore arc_adjust() return value.
This covers scenario when ARC may not shrink as fast as it could:
1. arc_size < arc_c and arc_adjust() does not evict anything, returning
   zero to arc_reclaim_thread();
2. arc_available_memory() reports memory pressure, which can not be
   satisfied by arc_kmem_reap_now();
3. arc_shrink() reduces arc_c and calls arc_adjust(), return of which is
   ignored;
4. even if the last arc_adjust() could not satisfy arc_size < arc_c,
   arc_reclaim_thread() will still go to sleep, since the first one
   returned zero.

Reviewed by:	allanjude, markj, sef
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D17927
2018-11-10 01:58:37 +00:00
Alexander Motin
b4d66a1739 9952 Block size change during zfs receive drops spill block
Replication code in receive_object() falsely assumes that if received
object block size is different from local, then it must be a new object
and calls dmu_object_reclaim() to wipe it out. In most cases it is not a
problem, since all dnode, bonus buffer and data block(s) are immediately
rewritten any way, but the problem is that spill block (if used) is not.
This means loss of ACLs, extended attributes, etc.

This issue can be triggered in very simple way:
1. create 4KB file with 10+ ACL entries;
2. take snapshot and send it to different dataset;
3. append another 4KB to the file;
4. take another snapshot and send incrementally;
5. witness ACL loss on receive side.

PR:		198457
Discussed with:	mahrens
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2018-11-03 03:10:06 +00:00