On FreeBSD VFS_HOLD/VN_RELE were mapped to MNT_REF/MNT_REL that
manipulate mnt_ref. But the job of properly maintaining the reference
count is already automatically performed by insmntque(9) and
delmntque(9). So, in effect all ZFS vnodes referenced the corresponding
mountpoint twice.
That was completely harmless, but we want to be very explicit about what
FreeBSD VFS APIs are used, because illumos VFS_HOLD and FreeBSD MNT_REF
provide quite different guarantees with respect to the held vfs_t /
mountpoint. On illumos VFS_HOLD is sufficient to guarantee that
vfs_t.vfs_data stays valid. On the other hand, on FreeBSD MNT_REF does
*not* provide the same guarantee about mnt_data. We have to use
vfs_busy() to get that guarantee.
Thus, the calls to VFS_HOLD/VFS_RELE on vnode init and fini are removed.
VFS_HOLD calls are replaced with vfs_busy in the ioctl handlers.
And because vfs_busy has a richer interface that can not be dumbed down
in all cases it's better to explicitly use it rather than trying to mask
it behind VFS_HOLD.
This change fixes a panic that could result from a race between
zfs_umount() and zfs_ioc_rollback(). We observed a case where
zfsvfs_free() tried to destroy data that zfsvfs_teardown() was still
using. That happened because there was nothing to prevent unmounting of
a ZFS filesystem that was in between zfs_suspend_fs() and
zfs_resume_fs().
Reviewed by: kib, smh
MFC after: 3 weeks
Sponsored by: ClusterHQ
Differential Revision: https://reviews.freebsd.org/D2794
6672 arc_reclaim_thread() should use gethrtime() instead of ddi_get_lbolt()
6673 want a macro to convert seconds to nanoseconds and vice-versa
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Eli Rosenthal <eli.rosenthal@delphix.com>
illumos/illumos-gate@a8f6344fa0
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Prakash Surya <prakash.surya@delphix.com>
illumos/illumos-gate@97e8130957
This is the final step required allowing to compile and to run RISC-V
kernel and userland from HEAD.
RISC-V is a completely open ISA that is freely available to academia
and industry.
Thanks to all the people involved! Special thanks to Andrew Turner,
David Chisnall, Ed Maste, Konstantin Belousov, John Baldwin and
Arun Thomas for their help.
Thanks to Robert Watson for organizing this project.
This project sponsored by UK Higher Education Innovation Fund (HEIF5) and
DARPA CTSRD project at the University of Cambridge Computer Laboratory.
FreeBSD/RISC-V project home: https://wiki.freebsd.org/riscv
Reviewed by: andrew, emaste, kib
Relnotes: Yes
Sponsored by: DARPA, AFRL
Sponsored by: HEIF5
Differential Revision: https://reviews.freebsd.org/D4982
- Add
nvlist_{add,get,take,move,exists,free}_{number,bool,string,nvlist,
descriptor} functions.
- Add support for (un)packing arrays.
- Add the nvl_array_next field to the nvlist structure.
If an array is added by the nvlist_{move,add}_nvlist_array function
this field will contains next element in the array.
- Add the nitems field to the nvpair and nvpair_header structure.
This field contains number of elements in the array.
- Add special flag (NV_FLAG_IN_ARRAY) which is set if nvlist is a part of
an array.
- Add special type (NV_TYPE_NVLIST_ARRAY_NEXT).This type is used only
on packing/unpacking.
- Add new API for traversing arrays (nvlist_get_array_next).
- Add the nvlist_get_pararr function which combines the
nvlist_get_array_next and nvlist_get_parent functions. If nvlist is in
the array it will return next element from array. If nvlist is last
element in array or it isn't in array it will return his
container (parent). This function should simplify traveling over nvlist.
- Add tests for new features.
- Add documentation for new functions.
- Add my copyright.
- Regenerate the sys/cddl/compat/opensolaris/sys/nvpair.h file.
PR: 191083
Reviewed by: allanjude (doc)
Approved by: pjd (mentor)
arc_state_t stats and differentiate between "data" and "metadata"
Reviewed by: Basil Crow <basil.crow@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Bayard Bell <bayard.bell@nexenta.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
illumos/illumos-gate@4076b1bf41
This change refactors the existing create_thread() function to be more
generic. It replaces almost all of its arguments by a callback that can
be used to extract the thread ID and copy it out to the right place, but
also to perform additional initialization steps, such as setting the
trapframe. This also makes the difference between thr_new() and
thr_create() more clear in my opinion.
This function is going to be used by the CloudABI compatibility layer.
It looks like the OpenSolaris compatibility framework already provides a
function called thread_create(). Rename this function to
do_thread_create() and use a macro to deal with the namespacing
conflict. A similar approach is already used for thread_exit().
MFC after: 1 month
Previously several places were doing it on its own, partially
incorrectly (e.g. without the filedesc locked) or even actively harmful
by populating jdir or assigning rootvnode without vrefing it.
Reviewed by: kib
Currently there is no good reason to mangle the userland API.
The change was introduced in eac1d566b46edef765754203bef22c75c1699966,
r279437. Also see https://reviews.freebsd.org/D1881.
I am still convinced that nv should not have introduced intentionally
conflicting API.
Discussed with: rstone
X-MFC with: r279437
Sponsored by: ClusterHQ
In the old days callout(9) had 1 tick precision and that was inadequate
for some uses, e.g. DTrace profile module, so we had to emulate cyclic
API and behavior. Now we can directly use callout(9) in the very few
places where cyclic was used.
Differential Revision: https://reviews.freebsd.org/D1161
Reviewed by: gnn, jhb, markj
MFC after: 2 weeks
have both kern_open() and kern_openat(); change the callers to use
kern_openat().
This removes one (sometimes two) levels of indirection and
consolidates arguments checks.
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
crash.sh script attached to FreeNAS bug 4109:
https://bugs.freenas.org/issues/4109
Three are in the snapshot layer:
a) AVG explains in his notes: https://wiki.freebsd.org/AvgVfsSolarisVsFreeBSD
"VOP_INACTIVE must not do any destructive actions to a vnode
and its filesystem node, nor invalidate them in any way."
gfs_vop_inactive and zfsctl_snapshot_inactive did just that. In
OpenSolaris VOP_INACTIVE is much closer to FreeBSD's VOP_RECLAIM.
Rename & move them to gfs_vop_reclaim and zfsctl_snapshot_reclaim
and merge in the requisite vnode_destroy from zfsctl_common_reclaim.
b) gfs_lookup_dot and various zfsctl functions do not honor the
FreeBSD VFS convention of only locking from the root downward. When
looking up ".." the convention is to drop the current leaf vnode lock before
acquiring the directory vnode and then subsequently re-acquiring the lock on the
leaf vnode. This fixes that in all the places that our exercised by crash.sh.
c) The snapshot may already be unmounted when the directory vnode is reclaimed.
Check for this case and return.
One in the common layer:
d) Callers of traverse expect the reference to the vnode passed in to be
maintained. Don't release it.
This last one may be an unclear contract. There may in fact be some callers that
do expect the reference to be dropped on success in addition to callers that
expect it to be released. In this case a further audit of the callers is needed
and a consensus on the correct behavior.
PR: 184677
Submitted by: kmacy
Reviewed by: delphij, will, avg
MFC after: 2 weeks
Sponsored by: iXsystems
Remove previously added kmem methods in favour of defines which
allow diff minimisation between upstream code base.
Rebalance ARC free target to be vm_pageout_wakeup_thresh by default
which eliminates issue where ARC gets minimised instead of balancing
with VM pageout. The restores the target point prior to r270759.
Bring in missing upstream only changes which move unused code to
further eliminate code differences.
Add additional DTRACE probe to aid monitoring of ARC behaviour.
Enable upstream i386 code paths on platforms which don't define
UMA_MD_SMALL_ALLOC.
Fix mixture of byte an page values in arc_memory_throttle i386 code
path value assignment of available_memory.
PR: 187594
Review: D702
Reviewed by: avg
MFC after: 1 week
X-MFC-With: r270759 & r270861
Sponsored by: Multiplay
Also restore kmem_used() check for i386 as it has KVA limits that the raw
page counts above don't consider
PR: 187594
Reviewed by: peter
X-MFC-With: r270759
Review: D700
Sponsored by: Multiplay
Prior to this change we triggered ARC reclaim when kmem usage passed 3/4
of the total available, as indicated by vmem_size(kmem_arena, VMEM_ALLOC).
This could lead large amounts of unused RAM e.g. on a 192GB machine with
ARC the only major RAM consumer, 40GB of RAM would remain unused.
The old method has also been seen to result in extreme RAM usage under
certain loads, causing poor performance and stalls.
We now trigger ARC reclaim when the number of free pages drops below the
value defined by the new sysctl vfs.zfs.arc_free_target, which defaults
to the value of vm.v_free_target.
Credit to Karl Denninger for the original patch on which this update was
based.
PR: 191510 and 187594
Tested by: dteske
MFC after: 1 week
Relnotes: yes
Sponsored by: Multiplay
Encode CPU's number by XOR'ing the CPU ID against the 64-bit cpu_ticks().
Reviewed by: mav, gibbs
Differential Revision: https://phabric.freebsd.org/D521
MFC after: 2 weeks
nanouptime() instead of getnanouptime(). nanouptime(9) provides more
precise result at expense of being slower.
In r269223, gethrtime() is used as creation time of dbuf, which in turn
acts as portion of lookup key to maintain AVL invariant where there can
not be duplicate items. Before this change, gethrtime() have preferred
better execution time by sacrificing precision, which may lead to panic
on busy systems with:
panic: avl_find() succeeded inside avl_add()
Reported by: allanjude, mav
PR: kern/192284
MFC after: 11 days
X-MFC-with: r269223
These changes prevent sysctl(8) from returning proper output,
such as:
1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.
Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
The changes how TRIM requests are generated to use ZIO_TYPE_FREE + a priority
instead of ZIO_TYPE_IOCTL, until processed by vdev_geom; only then is it
translated the required geom values. This reduces the amount of changes
required for FREE requests to be supported by the new IO scheduler. This
also eliminates the need for a specific DKIOCTRIM.
Also fixed FREE vdev child IO's from running ZIO_STAGE_VDEV_IO_DONE as part
of their schedule.
As the new IO scheduler can result in a request to execute one type of IO to
actually run a different type of IO it requires that zio_trim requests are
processed without holding the trim map lock (tm->tm_lock), as the free request
execute call may result in write request running hence triggering a
trim_map_write_start call, which takes the trim map lock and hence would result
in recused on no-recursive sx lock.
This is based off avg's original work, so credit to him.
MFC after: 1 month
To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.
Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.
Exp-run revealed no ports using it directly.
No objection from: arch@
Sponsored by: EMC / Isilon Storage Division
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.
MFC after: 3 weeks
upstream DTrace code. It indicates that the kernel memory allocator need not
attempt to satisfy non-blocking allocations in low-memory conditions. This
has no direct equivalent in the malloc(9) flags, so it is just defined to 0
for now.
illumos/illumos-gate@69962b5647
Please note the following changes:
- zio_ioctl has lost its priority parameter and now TRIM is executed
with 'now' priority
- some knobs are gone and some new knobs are added; not all of them are
exposed as tunables / sysctls yet
MFC after: 10 days
Sponsored by: HybridCluster [merge]
943 zio_interrupt ends up calling taskq_dispatch with TQ_SLEEP
illumos/illumos-gate@5aeb94743e
Essentially FreeBSD taskqueues already operate in a mode that
was added to Illumos with taskq_dispatch_ent change.
We even exposed the superior FreeBSD interface as taskq_dispatch_safe.
Now we just rename taskq_dispatch_safe to taskq_dispatch_ent and
struct struct ostask to taskq_ent_t, so that code differences will be
minimal.
After this change sys/cddl/compat/opensolaris/sys/taskq.h header is no
longer needed.
Note that this commit is not an MFV because the upstream change was not
individually committed to the vendor area.
MFC after: 8 days
The new macros are implemented in terms of SDT_PROBE_DEFINE and SDT_PROBE.
Probes defined in this way will appear under SDT provider named "sdt".
Parameter types are exposed via SDT_PROBE_ARGTYPE.
This is something that illumos does not have by default.
This kind of SDT probes is already present in ZFS code, so those probes
will now be available if KDTRACE_HOOKS options is enabled.
A potential future illumos compatibility enhancement is to encode a provider
name as a prefix in a probe name.
Reviewed by: markj
MFC after: 3 weeks
X-MFC after: r258622
Illumos ZFS issues:
3582 zfs_delay() should support a variable resolution
3584 DTrace sdt probes for ZFS txg states
Provide a compatibility shim for Solaris's cv_timedwait_hires
to help aid future porting.
Approved by: re (ZFS blanket)