illumos/illumos-gate@36a64e6284
To prevent kmem_cache reaping from blocking other system resources, turn
kmem_cache_reap_now() (which blocks) into kmem_cache_reap_soon(). Callers
to kmem_cache_reap_soon() should use kmem_cache_reap_active(), which
exploits #9017's new taskq_empty().
Reviewed by: Bryan Cantrill <bryan@joyent.com>
Reviewed by: Dan McDonald <danmcd@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Author: Tim Kordas <tim.kordas@joyent.com>
FreeBSD does not use taskqueue for kmem caches reaping, so this change
is less dramatic then it is on Illumos, just limiting reaping to 1 time
per second. It may possibly be improved later, if needed.
illumos/illumos-gate@f06dce2c1f
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Gordon Ross <gordon.w.ross@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Andrew Stormont <astormont@racktopsystems.com>
illumos/illumos-gate@0fb055e81f
At present it is possible to boot from a root pool that is on RAIDZ but not
one that is on RAIDZ2 or RAIDZ3. This is because, at the time the pool
version is checked to ensure support for dual/triple parity, the uberblock
has not yet been loaded into the SPA and therefore the code determines that
the pool version is too old and returns ENOTSUP.
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Andy Fiddaman <omnios@citrus-it.co.uk>
FreeBSD already had this fixed, so this is just a diff reduction.
illumos/illumos-gate@5cabbc6b49https://www.illumos.org/issues/7614:
This project allows top-level vdevs to be removed from the storage pool with
“zpool remove”, reducing the total amount of storage in the pool. This
operation copies all allocated regions of the device to be removed onto other
devices, recording the mapping from old to new location. After the removal is
complete, read and free operations to the removed (now “indirect”) vdev must
be remapped and performed at the new location on disk. The indirect mapping
table is kept in memory whenever the pool is loaded, so there is minimal
performance overhead when doing operations on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become “obsolete” because they are no longer used by any block pointers in
the pool. An entry becomes obsolete when all the blocks that use it are
freed. An entry can also become obsolete when all the snapshots that
reference it are deleted, and the block pointers that reference it have been
“remapped” in all filesystems/zvols (and clones). Whenever an indirect block
is written, all the block pointers in it will be “remapped” to their new
(concrete) locations if possible. This process can be accelerated by using
the “zfs remap” command to proactively rewrite all indirect blocks that
reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of the data
that is copied. This makes the process much faster, but if it were used on
redundant vdevs (i.e. mirror or raidz vdevs), it would be possible to copy
the wrong data, when we have the correct data on e.g. the other side of the
mirror. Therefore, mirror and raidz devices can not be removed.
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Prashanth Sreenivasa <pks@delphix.com>
illumos/illumos-gate@95643f75d295643f75d2https://www.illumos.org/issues/8520
lzc_rollback_to() should support rolling back to a clone's origin.
The current checks in zfs_ioc_rollback() would not allow that because the
origin snapshot belongs to a different filesystem.
The overly restrictive check was introduced in 7600, but it was not a
regression as none of the existing tools provided a way to rollback to the
origin.
https://www.illumos.org/issues/7198
EINVAL is returned when a dataset does not have any snapshots, so there is
nothing to roll back to.
Although the code in zfs_do_rollback checks for that condition in advance, it's
still possible that the snapshot(s) gets removed after the check and before the
rollback sync task is executed.
At the moment zfs command would crash when that happens.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>
MFC after: 2 weeks
illumos/illumos-gate@f864f99efef864f99efehttps://www.illumos.org/issues/8997
When dmu_tx_assign is called from zil_lwb_write_issue, it's possible
for either ERESTART or EIO to be returned.
If ERESTART is returned, this will cause an assertion to fail directly
in zil_lwb_write_issue, where the code assumes the return value is
EIO if dmu_tx_assign returns a non-zero value. This can occur if the
SPA is suspended when dmu_tx_assign is called, and most often occurs
when running zloop.
If EIO is returned, this can cause assertions to fail elsewhere in the
ZIL code. For example, zil_commit_waiter_timeout contains the
following logic:
lwb_t *nlwb = zil_lwb_write_issue(zilog, lwb);
ASSERT3S(lwb->lwb_state, !=, LWB_STATE_OPENED);
In this case, if dmu_tx_assign returned EIO from within
zil_lwb_write_issue, the lwb variable passed in will not be issued
to disk. Thus, it's lwb_state field will remain LWB_STATE_OPENED and
this assertion will fail. zil_commit_waiter_timeout assumes that after
it calls zil_lwb_write_issue, the lwb will be issued to disk, and
doesn't handle the case where this is not true; i.e. it doesn't handle
the case where dmu_tx_assign returns EIO.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
MFC after: 3 weeks
illumos/illumos-gate@a6c1eb3c08a6c1eb3c08https://www.illumos.org/issues/8731
annotate_ecksum() asserts that nui64s, calculated as nui64s = size / sizeof
(uint64_t), is not greater than UINT16_MAX.
This restriction is needed because histograms of incorrectly set and cleared
bits have 16 bit counters and if the buffer consists of too many 64-bit words,
then a counter can potentially overflow producing an incorrect result.
When the largest buffer size was 128KB the greatest value of nui64s was 16K,
well within the limit.
But now we have support for large buffers and for buffer sizes of 512KB and
above the restriction is violated.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>
MFC after: 2 weeks
illumos/illumos-gate@3f7978d02b3f7978d02bhttps://www.illumos.org/issues/8081
zdb(8) is full of minor problems that generate compiler warnings. On FreeBSD,
which uses -WError, the only way to build it is to disable all compiler
warnings. This makes it much harder to detect newly introduced bugs. We should
cleanup all the warnings.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Alan Somers <asomers@gmail.com>
illumos/illumos-gate@5f10ef697f5f10ef697fhttps://www.illumos.org/issues/6396
LVM = SVM = Solaris Volume Manager
dead code and not using with ZFS based platform.
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Yuri Pankov <yuri.pankov@nexenta.com>
illumos/illumos-gate@7855d95b307855d95b30https://www.illumos.org/issues/7446
Since we support whole-disk configuration for boot pool, we also will need
whole disk support with UEFI boot and for this, zpool create should create efi-
system partition.
I have borrowed the idea from oracle solaris, and introducing zpool create -
B switch to provide an way to specify that boot partition should be created.
However, there is still an question, how big should the system partition be.
For time being, I have set default size 256MB (thats minimum size for FAT32
with 4k blocks). To support custom size, the set on creation "bootsize"
property is created and so the custom size can be set as: zpool create B -
o bootsize=34MB rpool c0t0d0
After pool is created, the "bootsize" property is read only. When -B switch is
not used, the bootsize defaults to 0 and is shown in zpool get output with
value ''. Older zfs/zpool implementations are ignoring this property.
https://www.illumos.org/rb/r/219/
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Approved by: Dan McDonald <danmcd@kebe.com>
Author: Toomas Soome <tsoome@me.com>
This commit makes no sense for FreeBSD, that is why I blocked the option,
but it should be good to stay closer to upstream.
illumos/illumos-gate@48bbca816848bbca8168https://www.illumos.org/issues/7812
This change removes all gendered language that did not refer specifically
to an individual person or pet. The convention taken was to use
variations on "they" when referring to users and/or human beings, while
using "it" when referring to code, functions, and/or libraries.
Additionally, we took the liberty to fix up any whitespace issues that
were found in any files that were already being modified.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Daniel Hoffman <dj.hoffman@delphix.com>
Since r323578 we may remove the last reference to a covered vnode with
vrele() instead of vput(). So, v_usecount may be decremented before
the vnode is locked and zfsctl_snapdir_lookup may "catch" the vnode
with v_usecount of zero and v_holdcnt of one.
PR: 225795
Reported by: asomers
MFC after: 1 week
ZFS caches blocks it reads in its ARC, so in general the optional
pages are not as useful as with filesystems that read the data
directly into the target pages. But still the optional pages
are useful to reduce the number of page faults and associated
VM / VFS / ZFS calls.
Another case that gets optimized (as a side effect) is paging in
from a hole. ZFS DMU does not currently provide a convenient
API to check for a hole. Instead it creates a temporary zero-filled
block and allows accessing it as if it were a normal data block.
Getting multiple pages one by one from a hole results in repeated
creation and destruction of the temporary block (and an associated
ARC header).
Tested with fsx using various supported blocks sizes from 512 bytes
to 128 KB and additionally 1 MB.
Please note that in illumos and ZoL they do not do the range-locking in
the page-in path. This is because ZFS has a double-caching problem
between ARC and page cache and that requires zfs_read() and zfs_write()
to consult pages in the page cache. So, in those functions they first
lock a range and then lock pages corresponding to the range. While in
the page-in (and maybe page-out) path they first lock the pages and then
would lock the range. So, they would have a deadlock.
I believe that FreeBSD does not have that problem, because the page-in
deals only with invalid pages while zfs_read() and zfs_write() need to
access only valid pages. They do not wait on a busy page unless it's
already valid.
Reviewed by: kib
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D14263
illumos/illumos-gate@d6e1c446d7d6e1c446d7https://www.illumos.org/issues/8857
I had an OS panic on one of our servers:
ffffff01809128c0 vpanic()
ffffff01809128e0 mutex_panic+0x58(fffffffffb94c904, ffffff597dde7f80)
ffffff0180912950 mutex_vector_enter+0x347(ffffff597dde7f80)
ffffff01809129b0 zio_remove_child+0x50(ffffff597dde7c58, ffffff32bd901ac0,
ffffff3373370908)
ffffff0180912a40 zio_done+0x390(ffffff32bd901ac0)
ffffff0180912a70 zio_execute+0x78(ffffff32bd901ac0)
ffffff0180912b30 taskq_thread+0x2d0(ffffff33bae44140)
ffffff0180912b40 thread_start+8()
It panicked here:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/
zio.c#430
pio->io_lock is DEAD, thus a panic. Further analysis shows the "pio"
(parent zio of "cio") has already been destroyed.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Youzhong Yang <youzhong@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: George Wilson <george.wilson@delphix.com>
PR: 223803
Tested by: shiva.bhanujan@quorum.com
MFC after: 2 weeks
zfsctl_common_pathconf will report all the same variables that regular ZFS
volumes report. zfsctl_common_getacl will report an ACL equivalent to 555,
except that you can't read xattrs or edit attributes.
Fixes a bug where "ls .zfs" will occasionally print something like:
ls: .zfs/.: Operation not supported
PR: 225793
Reviewed by: avg
MFC after: 3 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D14365
global to per-domain state. Protect reservations with the free lock
from the domain that they belong to. Refactor to make vm domains more
of a first class object.
Reviewed by: markj, kib, gallatin
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14000
We did that in the case of success to prevent the use of stale cached
data, but it makes even less sense to keep the cached data when we fail.
Ideally, we should call vgone() on the vnode in the case of zfs_rezget
failure, but the current lock order prevents us from doing that.
The change also rearranges the order of unlinked check and the size
change check.
While there, add missing SET_ERROR in one of the error paths.
MFC after: 2 weeks
illumos/illumos-gate@5cb8d943bchttps://www.illumos.org/issues/8835:
Sequential reads not aligned to block size are not detected by ZFS
prefetcher as sequential, killing prefetch and severely hurting
performance. It is caused by dmu_zfetch() in case of misaligned
sequential accesses being called with overlap of one block.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Allan Jude <allanjude@freebsd.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Alexander Motin <mav@FreeBSD.org>
illumos/illumos-gate@4ae5f5f06chttps://www.illumos.org/issues/8652:
Clang and GCC prefer to use unsigned ints to store enums. With Clang, that
causes tautological comparison warnings when comparing a zfs_prop_t or
zpool_prop_t variable to the macro ZPROP_INVAL. It's likely that error
handling code is being silently removed as a result.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Alan Somers <asomers@gmail.com>
illumos/illumos-gate@301fd1d6f2
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Sean Eric Fagan <sef@ixsystems.com>
illumos/illumos-gate@01a059ee0chttps://www.illumos.org/issues/8856:
arc_cksum_is_equal() calls zio_push_transform() that requires abd_t*
(second arg), but a void* is passed.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Roman Strashkin <roman.strashkin@nexenta.com>
8930 zfs_zinactive: do not remove the node if the filesystem is readonly
illumos/illumos-gate@93c618e0f4https://www.illumos.org/issues/8930:
We normally remove an unlinked node when its last user goes away and the
node becomes inactive. However, we should not do that if the filesystem
is mounted read-only including the case where it has its readonly
property set. The node will remain on the unlinked queue, so it will
not be leaked.
One particular scenario is when we receive an incremental stream into a
mounted read-only filesystem and that stream contains an unlinked file
(still on the unlinked queue). If that file is opened before the
receive and some time later after the receive it becomes inactive we
would remove it and, thus, modify the read-only filesystem. As a
result, the filesystem would diverge from its source and further
incremental receives would not be possible (without forcing a rollback).
Another related scenario, that may or may not be possible depending on an
OS / VFS policy, is when an open file is unlinked, then the filesystem is
remounted read-only, and then the file is closed.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Andriy Gapon <avg@FreeBSD.org>
illumos/illumos-gate@94ddd0900ahttps://www.illumos.org/issues/8909:
There's a race condition that exists if `zil_free_lwb` races with either
`zil_commit_waiter_timeout` and/or `zil_lwb_flush_vdevs_done`.
Here's an example panic due to this bug:
> ::status
debugging crash dump vmcore.0 (64-bit) from ip-10-110-205-40
operating system: 5.11 dlpx-5.2.2.0_2017-12-04-17-28-32b6ba51fb (i86pc)
image uuid: 4af0edfb-e58e-6ed8-cafc-d3e9167c7513
panic message:
BAD TRAP: type=e (#pf Page fault) rp=ffffff0010555970 addr=60 occurred in mo
dule "zfs" due to a NULL pointer dereference
dump content: kernel pages only
> $c
zio_shrink+0x12()
zil_lwb_write_issue+0x30d(ffffff03dcd15cc0, ffffff03e0730e20)
zil_commit_waiter_timeout+0xa2(ffffff03dcd15cc0, ffffff03d97ffcf8)
zil_commit_waiter+0xf3(ffffff03dcd15cc0, ffffff03d97ffcf8)
zil_commit+0x80(ffffff03dcd15cc0, 9a9)
zfs_write+0xc34(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
fop_write+0x5b(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
write+0x250(42, fffffd7ff4832000, 2000)
sys_syscall+0x177()
If there's an outstanding lwb that's in `zil_commit_waiter_timeout`
waiting to timeout, waiting on it's waiter's CV, we must be sure not to
call `zil_free_lwb`. If we end up calling `zil_free_lwb`, then that LWB
may be freed and can result in a use-after-free situation where the
stale lwb pointer stored in the `zil_commit_waiter_t` structure of the
thread waiting on the waiter's CV is used.
A similar situation can occur if an lwb is issued to disk, and thus in
the `LWB_STATE_ISSUED` state, and `zil_free_lwb` is called while the
disk is servicing that lwb. In this situation, the lwb will be freed by
`zil_free_lwb`, which will result in a use-after-free situation when the
lwb's zio completes, and `zil_lwb_flush_vdevs_done` is called.
This race condition is prevented in `zil_close` by calling `zil_commit`
before `zil_free_lwb` is called, which will ensure all outstanding (i.e.
all lwb's in the `LWB_STATE_OPEN` and/or `LWB_STATE_ISSUED` states)
reach the `LWB_STATE_DONE` state before the lwb's are freed
(`zil_commit` will not return untill all the lwb's are
`LWB_STATE_DONE`).
Further, this race condition is prevented in `zil_sync` by only calling
`zil_free_lwb` for lwb's that do not have their `lwb_buf` pointer set.
All lwb's not in the `LWB_STATE_DONE` state will have a non-null value
for this pointer; the pointer is only cleared in
`zil_lwb_flush_vdevs_done`, at which point the lwb's state will be
changed to `LWB_STATE_DONE`.
This race is present in `zil_suspend`, leading to this bug.
At first glance, it would appear as though this would not be true
because `zil_suspend` will call `zil_commit`, just like `zil_close`, but
the problem is that `zil_suspend` will set the zilog's `zl_suspend`
field prior to calling `zil_commit`. Further, in `zil_commit`, if
`zl_suspend` is set, `zil_commit` will take a special branch of logic
and use `txg_wait_synced` instead of performing the normal `zil_commit`
logic.
This call to `txg_wait_synced` might be good enough for the data to
reach disk safely before it returns, but it does not ensure that all
outstanding lwb's reach the `LWB_STATE_DONE` state before it returns.
This is because, if there's an lwb "stuck" in
`zil_commit_waiter_timeout`, waiting for it's lwb to timeout, it will
maintain a non-null value for it's `lwb_buf` field and thus `zil_sync`
will not free that lwb. Thus, even though the lwb's data is already on
disk, the lwb will be left lingering, waiting on the CV, and will
eventually timeout and be issued to disk even though the write is
unnesseary.
So, after `zil_commit` is called from `zil_suspend`, we incorrectly
assume that there are not outstanding lwb's, and proceed to free all
lwb's found on the zilog's lwb list. As a result, we free the lwb that
will later be used `zil_commit_waiter_timeout`.
Reviewed by: John Kennedy <jwk404@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
illumos/illumos-gate@cf07d3da99https://www.illumos.org/issues/8603:
To help make the ZIL's code more understandable, it was suggested that
the zilog_t's "zl_writer_lock" field should be renamed to "zl_issuer_lock".
Reviewed by: C Fraire <cfraire@me.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
illumos/illumos-gate@a3b2868063https://www.illumos.org/issues/8677
We want to be able to run channel programs outside of synching context.
This would greatly improve performance of channel program that just gather
information, as we won't have to wait for synching context anymore.
This feature should introduce the following:
- A new command line flag in "zfs program" to specify our intention to
run in open context.
- A new flag/option within the channel program ioctl which selects the
context.
- Appropriate error handling whenever we try a channel program in
open-context that contains zfs.sync* expressions.
- Documentation for the new feature in the manual pages.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
Nowadays we do not pass zfs_cmd_t directly through the ioctl interface.
Instead a small zfs_iocparm_t object is passed and the command is
explicitly copied in and out. So, the check has become irrelevant.
MFC after: 3 weeks
Sponsored by: Panzura
These return the jail ID and jail name for the traced process,
respectively, and are analogous to "zonename" on Solaris/illumos.
"zonename" is now aliased to "jailname".
Also add some stress tests for the new variables.
Submitted by: Domagoj Stolfa <domagoj.stolfa@gmail.com>
Reviewed by: dteske (previous version)
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D13877
rather than kmem arena size to determine available memory.
Initialize the UMA limit to LONG_MAX to avoid spurious wakeups on boot before
the real limit is set.
PR: 224330 (partial), 224080
Reviewed by: markj, avg
Sponsored by: Netflix / Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13494
r326987 enabled two #if 0'd-out EMLINK checks in zfs_link_create() for
link overflow. However, one of the checks (when the vnode adding a link
is a directory such as for mkdir) always returned even if the link did not
overflow. Change this to only return early if it needs to report an
EMLINK error.
Reported by: db, shurd
Sponsored by: Chelsio Communications
On the one hand, FIFOs should respect other variables not supported by
the fifofs vnode operation (such as _PC_NAME_MAX, _PC_LINK_MAX, etc.).
These values are fs-specific and must come from a fs-specific method.
On the other hand, filesystems that support FIFOs are required to
support _PC_PIPE_BUF on directory vnodes that can contain FIFOs.
Given this latter requirement, once the fs-specific VOP_PATHCONF
method supports _PC_PIPE_BUF for directories, it is also suitable for
FIFOs permitting a single VOP_PATHCONF method to be used for both
FIFOs and non-FIFOs.
To that end, retire all of the FIFO-specific pathconf methods from
filesystems and change FIFO-specific vnode operation switches to use
the existing fs-specific VOP_PATHCONF method. For fifofs, set it's
VOP_PATHCONF to VOP_PANIC since it should no longer be used.
While here, move _PC_PIPE_BUF handling out of vop_stdpathconf() so that
only filesystems supporting FIFOs will report a value. In addition,
only report a valid _PC_PIPE_BUF for directories and FIFOs.
Discussed with: bde
Reviewed by: kib (part of a larger patch)
MFC after: 1 month
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D12572
Having all filesystems fall through to default values isn't always correct
and these values can vary for different filesystem implementations. Most
of these changes just use the existing default values with a few exceptions:
- Don't report CHOWN_RESTRICTED for ZFS since it doesn't do the exact
permissions check this claims for chown().
- Use NANDFS_NAME_LEN for NAME_MAX for nandfs.
- Don't report a LINK_MAX of 0 on smbfs. Now fail with EINVAL to
indicate hard links aren't supported.
Requested by: bde (though perhaps not this exact implementation)
Reviewed by: kib (earlier version)
MFC after: 1 month
Sponsored by: Chelsio Communications
- Define a ZFS_LINK_MAX as the ZFS version of LINK_MAX which is set to
UINT64_MAX to match the on-disk format.
- Enable the currently #if 0'd code to check for link overflows and
return EMLINK.
- Don't clamp the link count reported in stat() to LINK_MAX as that is
still the 16-bit limit, but report the full link counts. Also,
avoid possibly overflowing the reported link count to 0 when adjusting
the link count to account for ".snapshot".
- Update the LINK_MAX reported by pathconf() to report ZFS_LINK_MAX
rather than LINK_MAX (but clamped to LONG_MAX for 32-bit systems).
Reviewed by: avg (earlier version)
Sponsored by: Chelsio Communications
dtrace_gethrtime() may be called outside of probe context, and in
particular, from the DTRACEIOC_BUFSNAP handler.
Disable interrupts rather than using sched_pin() to help ensure that
we don't call any external functions when in probe context.
PR: 218452
MFC after: 1 week
Otherwise a poorly timed lowmem event may attempt to acquire a destroyed
lock. Unregister the handler before destroying the ARC reclaim thread.
Reported by: gjb
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D13480
The DTrace fasttrap entry points expect a struct reg containing the
register values of the calling thread. Perform the conversion in
fasttrap rather than in the trap handler: this reduces the number of
ifdefs and avoids wasting stack space for traps that don't involve
DTrace.
MFC after: 2 weeks