Commit Graph

200 Commits

Author SHA1 Message Date
delphij
db0a2c953d Expose an interface to determine if an ACE is inherited.
Submitted by:	sef
Reviewed by:	trasz
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D3540
2015-09-04 00:14:20 +00:00
oshogbo
fe707ca6b5 Add support for the arrays in nvlist library.
- Add
  nvlist_{add,get,take,move,exists,free}_{number,bool,string,nvlist,
  descriptor} functions.
- Add support for (un)packing arrays.
- Add the nvl_array_next field to the nvlist structure.
  If an array is added by the nvlist_{move,add}_nvlist_array function
  this field will contains next element in the array.
- Add the nitems field to the nvpair and nvpair_header structure.
  This field contains number of elements in the array.
- Add special flag (NV_FLAG_IN_ARRAY) which is set if nvlist is a part of
  an array.
- Add special type (NV_TYPE_NVLIST_ARRAY_NEXT).This type is used only
  on packing/unpacking.
- Add new API for traversing arrays (nvlist_get_array_next).
- Add the nvlist_get_pararr function which combines the
  nvlist_get_array_next and nvlist_get_parent functions. If nvlist is in
  the array it will return next element from array. If nvlist is last
  element in array or it isn't in array it will return his
  container (parent). This function should simplify traveling over nvlist.
- Add tests for new features.
- Add documentation for new functions.
- Add my copyright.
- Regenerate the sys/cddl/compat/opensolaris/sys/nvpair.h file.

PR:		191083
Reviewed by:	allanjude (doc)
Approved by:	pjd (mentor)
2015-08-15 06:34:49 +00:00
mav
7dfb1b4821 Fix r286574 build in user-space. 2015-08-10 12:25:26 +00:00
mav
ed63bd7467 MFV r277427: 5445 Add more visibility via arcstats; specifically
arc_state_t stats and differentiate between "data" and "metadata"

Reviewed by: Basil Crow <basil.crow@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Bayard Bell <bayard.bell@nexenta.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>

illumos/illumos-gate@4076b1bf41
2015-08-10 10:59:58 +00:00
ed
7f556435bf Add an API for easily creating userspace threads in kernelspace.
This change refactors the existing create_thread() function to be more
generic. It replaces almost all of its arguments by a callback that can
be used to extract the thread ID and copy it out to the right place, but
also to perform additional initialization steps, such as setting the
trapframe. This also makes the difference between thr_new() and
thr_create() more clear in my opinion.

This function is going to be used by the CloudABI compatibility layer.

It looks like the OpenSolaris compatibility framework already provides a
function called thread_create(). Rename this function to
do_thread_create() and use a macro to deal with the namespacing
conflict. A similar approach is already used for thread_exit().

MFC after:	1 month
2015-07-20 10:20:04 +00:00
mjg
a85ed5531d Create a dedicated function for ensuring that cdir and rdir are populated.
Previously several places were doing it on its own, partially
incorrectly (e.g. without the filedesc locked) or even actively harmful
by populating jdir or assigning rootvnode without vrefing it.

Reviewed by:	kib
2015-07-11 16:22:48 +00:00
mjg
feeee4c707 fd: make 'rights' a manadatory argument to fget* functions 2015-07-05 19:05:16 +00:00
avg
630db52ab7 compat nvpair.h: make sure that the names are mangled only for kernel
Currently there is no good reason to mangle the userland API.
The change was introduced in eac1d566b4,
r279437.  Also see https://reviews.freebsd.org/D1881.

I am still convinced that nv should not have introduced intentionally
conflicting API.

Discussed with:	rstone
X-MFC with:	r279437
Sponsored by:	ClusterHQ
2015-06-07 08:54:25 +00:00
andrew
3046fbea8a Add the arm64 defines for cddl code.
Differential Revision:	https://reviews.freebsd.org/D2186
Reviewed by:	emaste
Sponsored by:	The FreeBSD Foundation
2015-04-01 08:31:56 +00:00
rstone
eac1d566b4 Allow Illumos code to co-exist with nv(9)
Differential Revision:		https://reviews.freebsd.org/D1881
Reviewed by:			jfv, will
Suggested by:			pjd
MFC after:			1 month
Sponsored by:			Sandvine Inc
2015-03-01 00:22:45 +00:00
will
3c6834bfdd NSEC_TO_TICK(usec) -> NSEC_TO_TICK(nsec) 2015-01-20 22:29:27 +00:00
will
4bd09b7583 Remove unused strdup() #define. 2015-01-20 22:27:45 +00:00
avg
15bfe2d262 remove opensolaris cyclic code, replace with high-precision callouts
In the old days callout(9) had 1 tick precision and that was inadequate
for some uses, e.g. DTrace profile module, so we had to emulate cyclic
API and behavior.  Now we can directly use callout(9) in the very few
places where cyclic was used.

Differential Revision:	https://reviews.freebsd.org/D1161
Reviewed by:	gnn, jhb, markj
MFC after:	2 weeks
2014-12-07 11:21:41 +00:00
kib
b4ef709604 Remove the no-at variants of the kern_xx() syscall helpers. E.g., we
have both kern_open() and kern_openat(); change the callers to use
kern_openat().

This removes one (sometimes two) levels of indirection and
consolidates arguments checks.

Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-11-13 18:01:51 +00:00
jpaetzel
0584337fe8 This change addresses 4 bugs in ZFS exposed by Richard Kojedzinszky's
crash.sh script attached to FreeNAS bug 4109:
https://bugs.freenas.org/issues/4109

Three are in the snapshot layer:
a) AVG explains in his notes: https://wiki.freebsd.org/AvgVfsSolarisVsFreeBSD

"VOP_INACTIVE must not do any destructive actions to a vnode
and its filesystem node, nor invalidate them in any way."
gfs_vop_inactive and zfsctl_snapshot_inactive did just that. In
OpenSolaris VOP_INACTIVE is much closer to FreeBSD's VOP_RECLAIM.
Rename & move them to gfs_vop_reclaim and zfsctl_snapshot_reclaim
and merge in the requisite vnode_destroy from zfsctl_common_reclaim.

b) gfs_lookup_dot and various zfsctl functions do not honor the
FreeBSD VFS convention of only locking from the root downward. When
looking up ".." the convention is to drop the current leaf vnode lock before
acquiring the directory vnode and then subsequently re-acquiring the lock on the
leaf vnode. This fixes that in all the places that our exercised by crash.sh.

c) The snapshot may already be unmounted when the directory vnode is reclaimed.
Check for this case and return.

One in the common layer:
d) Callers of traverse expect the reference to the vnode passed in to be
maintained. Don't release it.

This last one may be an unclear contract. There may in fact be some callers that
do expect the reference to be dropped on success in addition to callers that
expect it to be released. In this case a further audit of the callers is needed
and a consensus on the correct behavior.

PR:	184677
Submitted by:	kmacy
Reviewed by:	delphij, will, avg
MFC after:	2 weeks
Sponsored by:	iXsystems
2014-10-25 17:42:44 +00:00
avg
073bd9bd13 make userland __assfail from opensolaris compat honor 'aok' variable
This should allow zdb -A option to actually make difference.

MFC after:	2 weeks
2014-10-07 14:15:50 +00:00
smh
f2543cb01c Refactor ZFS ARC reclaim checks and limits
Remove previously added kmem methods in favour of defines which
allow diff minimisation between upstream code base.

Rebalance ARC free target to be vm_pageout_wakeup_thresh by default
which eliminates issue where ARC gets minimised instead of balancing
with VM pageout. The restores the target point prior to r270759.

Bring in missing upstream only changes which move unused code to
further eliminate code differences.

Add additional DTRACE probe to aid monitoring of ARC behaviour.

Enable upstream i386 code paths on platforms which don't define
UMA_MD_SMALL_ALLOC.

Fix mixture of byte an page values in arc_memory_throttle i386 code
path value assignment of available_memory.

PR:		187594
Review:		D702
Reviewed by:	avg
MFC after:	1 week
X-MFC-With:	r270759 & r270861
Sponsored by:	Multiplay
2014-10-03 20:34:55 +00:00
smh
c466dfba5f Remove sys/types.h include as per style (9)
SDT requries sys/param.h due to use of NULL

Reported by:	Garrett
Sponsored by:	Multiplay
2014-09-18 20:38:18 +00:00
smh
7ce047b163 Add dtrace probe support for zfs SET_ERROR(..)
MFC after:	1 week
Sponsored by:	Multiplay
2014-09-18 20:00:36 +00:00
smh
8d9d31d786 Ensure that ZFS ARC free memory checks include cached pages
Also restore kmem_used() check for i386 as it has KVA limits that the raw
page counts above don't consider

PR:		187594
Reviewed by:	peter
X-MFC-With: r270759
Review:	D700
Sponsored by:	Multiplay
2014-08-30 21:44:32 +00:00
smh
502601a540 Refactor ZFS ARC reclaim logic to be more VM cooperative
Prior to this change we triggered ARC reclaim when kmem usage passed 3/4
of the total available, as indicated by vmem_size(kmem_arena, VMEM_ALLOC).

This could lead large amounts of unused RAM e.g. on a 192GB machine with
ARC the only major RAM consumer, 40GB of RAM would remain unused.

The old method has also been seen to result in extreme RAM usage under
certain loads, causing poor performance and stalls.

We now trigger ARC reclaim when the number of free pages drops below the
value defined by the new sysctl vfs.zfs.arc_free_target, which defaults
to the value of vm.v_free_target.

Credit to Karl Denninger for the original patch on which this update was
based.

PR:		191510 and 187594
Tested by:	dteske
MFC after:	1 week
Relnotes:	yes
Sponsored by:	Multiplay
2014-08-28 19:50:08 +00:00
delphij
6922e3fedf Provide compatibility shim for atomic_dec_64_nv.
X-MFC-with:	r270247
MFC after:	13 days
2014-08-21 08:25:46 +00:00
delphij
6ba22f8d1a Revert r269404 and use cpu_ticks() for dbuf allocation.
Encode CPU's number by XOR'ing the CPU ID against the 64-bit cpu_ticks().

Reviewed by:	mav, gibbs
Differential Revision: https://phabric.freebsd.org/D521
MFC after:	2 weeks
2014-08-03 09:47:51 +00:00
ian
e5aca7d143 When arm 64-bit atomic ops are available, define ARM_HAVE_ATOMIC64. Use
that symbol (which will be correct in both kernel and userland contexts)
rather than just __arm__ to decide whether to use a local implementation.
2014-08-02 03:44:27 +00:00
ian
105b4c5f48 Use the 64-bit atomics now provided by arm machine/atomic.h instead of
(conflicting) local versions.
2014-08-01 23:45:50 +00:00
delphij
3582d4b4c2 Split gethrtime() and gethrtime_waitfree() and make the former use
nanouptime() instead of getnanouptime().  nanouptime(9) provides more
precise result at expense of being slower.

In r269223, gethrtime() is used as creation time of dbuf, which in turn
acts as portion of lookup key to maintain AVL invariant where there can
not be duplicate items.  Before this change, gethrtime() have preferred
better execution time by sacrificing precision, which may lead to panic
on busy systems with:

	panic: avl_find() succeeded inside avl_add()

Reported by:	allanjude, mav
PR:		kern/192284
MFC after:	11 days
X-MFC-with:	r269223
2014-08-01 22:33:23 +00:00
hselasky
35b126e324 Pull in r267961 and r267973 again. Fix for issues reported will follow. 2014-06-28 03:56:17 +00:00
gjb
fc21f40567 Revert r267961, r267973:
These changes prevent sysctl(8) from returning proper output,
such as:

 1) no output from sysctl(8)
 2) erroneously returning ENOMEM with tools like truss(1)
    or uname(1)
 truss: can not get etype: Cannot allocate memory
2014-06-27 22:05:21 +00:00
hselasky
bd1ed65f0f Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
smh
14906a620b Reintroduce priority for the TRIM ZIOs instead of using the "NOW" priority
The changes how TRIM requests are generated to use ZIO_TYPE_FREE + a priority
instead of ZIO_TYPE_IOCTL, until processed by vdev_geom; only then is it
translated the required geom values. This reduces the amount of changes
required for FREE requests to be supported by the new IO scheduler. This
also eliminates the need for a specific DKIOCTRIM.

Also fixed FREE vdev child IO's from running ZIO_STAGE_VDEV_IO_DONE as part
of their schedule.

As the new IO scheduler can result in a request to execute one type of IO to
actually run a different type of IO it requires that zio_trim requests are
processed without holding the trim map lock (tm->tm_lock), as the free request
execute call may result in write request running hence triggering a
trim_map_write_start call, which takes the trim map lock and hence would result
in recused on no-recursive sx lock.

This is based off avg's original work, so credit to him.

MFC after:	1 month
2014-04-30 17:46:29 +00:00
bdrewery
6fcf6199a4 Rename global cnt to vm_cnt to avoid shadowing.
To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.

Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.

Exp-run revealed no ports using it directly.

No objection from:	arch@
Sponsored by:	EMC / Isilon Storage Division
2014-03-22 10:26:09 +00:00
rwatson
33fdc14c0c Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after:	3 weeks
2014-03-16 10:55:57 +00:00
markj
95ac3c80e1 Define the KM_NORMALPRI flag for kmem_alloc(), as it is used in some
upstream DTrace code. It indicates that the kernel memory allocator need not
attempt to satisfy non-blocking allocations in low-memory conditions. This
has no direct equivalent in the malloc(9) flags, so it is just defined to 0
for now.
2014-02-22 05:13:35 +00:00
mav
0b0d3d9762 Fix build after r260234 by converting ddi_get_lbolt64() from inline into
a macro.  Otherwise compiler complains that hz variable used there either
undefined or defined twice, thanks to header mess caused by compat shims.
2014-01-05 19:07:42 +00:00
mav
7402ef4897 Remove extra conversion to nanoseconds from ddi_get_lbolt64().
As result this uses one multiplication and shifts instead of one division
and two multiplications.
2014-01-03 18:08:31 +00:00
avg
430acb4217 zfs: add zfs_freebsd_putpages
this should be more optimal than writing pages one-by-one via zfs_write ->
update_pages in the case of multi-page putpages call

MFC after:	16 days
2013-11-29 15:39:39 +00:00
avg
7a0711c338 zfs: make zfs_map_page / zfs_unmap_page public
MFC after:	15 days
2013-11-29 15:33:40 +00:00
avg
8c62dc6efa opensolaris compat: add taskq_wait emulation
MFC after:	10 days
2013-11-28 19:17:11 +00:00
avg
51e896cc78 MFV r255255: 4045 zfs write throttle & i/o scheduler performance work
illumos/illumos-gate@69962b5647

Please note the following changes:
- zio_ioctl has lost its priority parameter and now TRIM is executed
  with 'now' priority
- some knobs are gone and some new knobs are added; not all of them are
  exposed as tunables / sysctls yet

MFC after:	10 days
Sponsored by:	HybridCluster [merge]
2013-11-26 09:57:14 +00:00
avg
d1f2a86401 734 taskq_dispatch_prealloc() desired
943 zio_interrupt ends up calling taskq_dispatch with TQ_SLEEP
illumos/illumos-gate@5aeb94743e

Essentially FreeBSD taskqueues already operate in a mode that
was added to Illumos with taskq_dispatch_ent change.
We even exposed the superior FreeBSD interface as taskq_dispatch_safe.
Now we just rename taskq_dispatch_safe to taskq_dispatch_ent and
struct struct ostask to taskq_ent_t, so that code differences will be
minimal.

After this change sys/cddl/compat/opensolaris/sys/taskq.h header is no
longer needed.

Note that this commit is not an MFV because the upstream change was not
individually committed to the vendor area.

MFC after:	8 days
2013-11-26 09:26:18 +00:00
avg
a1e8ebde3f opensolaris taskq: some cosmetic changes
- drop trailing whitespace
- remove redundant "extern" from function declarations
- remove unused macro

MFC after:	1 week
2013-11-26 09:10:01 +00:00
avg
37cef93b68 sdt: add support for solaris/illumos style DTRACE_PROBE macros
The new macros are implemented in terms of SDT_PROBE_DEFINE and SDT_PROBE.
Probes defined in this way will appear under SDT provider named "sdt".
Parameter types are exposed via SDT_PROBE_ARGTYPE.
This is something that illumos does not have by default.

This kind of SDT probes is already present in ZFS code, so those probes
will now be available if KDTRACE_HOOKS options is enabled.

A potential future illumos compatibility enhancement is to encode a provider
name as a prefix in a probe name.

Reviewed by:	markj
MFC after:	3 weeks
X-MFC after:	r258622
2013-11-26 08:49:53 +00:00
delphij
a23043347c MFV r247844 (illumos-gate 13975:ef6409bc370f)
Illumos ZFS issues:
  3582 zfs_delay() should support a variable resolution
  3584 DTrace sdt probes for ZFS txg states

Provide a compatibility shim for Solaris's cv_timedwait_hires
to help aid future porting.

Approved by:	re (ZFS blanket)
2013-09-10 01:46:47 +00:00
pjd
1c7defb76e Handle cases where capability rights are not provided.
Reported by:	kib
2013-09-05 11:58:12 +00:00
pjd
029a6f5d92 Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

	struct cap_rights {
		uint64_t	cr_rights[CAP_RIGHTS_VERSION + 2];
	};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

	#define	CAP_PDKILL	CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:

	#define	CAP_LOOKUP	CAPRIGHT(0, 0x0000000000000400ULL)
	#define	CAP_FCHMOD	CAPRIGHT(0, 0x0000000000002000ULL)

	#define	CAP_FCHMODAT	(CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

	cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
	void cap_rights_set(cap_rights_t *rights, ...);
	void cap_rights_clear(cap_rights_t *rights, ...);
	bool cap_rights_is_set(const cap_rights_t *rights, ...);

	bool cap_rights_is_valid(const cap_rights_t *rights);
	void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
	void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
	bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

	cap_rights_t rights;

	cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

	#define	cap_rights_set(rights, ...)				\
		__cap_rights_set((rights), __VA_ARGS__, 0ULL)
	void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

	cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by:	The FreeBSD Foundation
2013-09-05 00:09:56 +00:00
jeff
de4ecca213 Replace kernel virtual address space allocation with vmem. This provides
transparent layering and better fragmentation.

 - Normalize functions that allocate memory to use kmem_*
 - Those that allocate address space are named kva_*
 - Those that operate on maps are named kmap_*
 - Implement recursive allocation handling for kmem_arena in vmem.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	EMC / Isilon Storage Division
2013-08-07 06:21:20 +00:00
avg
a07c9d34c3 opensolaris code: translate INVARIANTS to DEBUG and ZFS_DEBUG
Do this by forcing inclusion of
sys/cddl/compat/opensolaris/sys/debug_compat.h
via -include option into all source files from OpenSolaris.
Note that this -include option must always be after -include opt_global.h.

Additionally, remove forced definition of DEBUG for some modules and fix
their build without DEBUG.

Also, meaning of DEBUG was overloaded to enable WITNESS support for some
OpenSolaris (primarily ZFS) locks.  Now this overloading is removed and
that use of DEBUG is replaced with a new option OPENSOLARIS_WITNESS.

MFC after:	17 days
2013-08-06 15:51:56 +00:00
smh
a50c0c6690 style(9) fixes
MFC after:	2 days
2013-06-29 23:39:38 +00:00
smh
2687a6b31b Switch ZFS mutex_owner macro to use sx_xholder as its now exported via sx.h
MFC after:	1 week
2013-06-21 15:55:03 +00:00
mm
696d961cbd MFV r248217:
Merge change from vendor to reduce diff only.
ZFS dtrace probes are not supported on FreeBSD yet.

Illumos ZFS issues:
  3598 want to dtrace when errors are generated in zfs

MFC after:	3 weeks
2013-04-06 10:39:38 +00:00