662 Commits

Author SHA1 Message Date
Andriy Gapon
00126789e6 Revert r260705: wrong patch committed by accident
An earlier, less efficient version was committed by accident.
2014-01-16 13:20:20 +00:00
Andriy Gapon
19f5e9076b zfs_deleteextattr: name buffer from namei is needed by zfs_rename
If we prematurely free the name buffer and it gets quickly recycled,
then zfs_rename may see data from another lookup or even unmapped memory
via cn_nameptr.

MFC after:	6 days
Sponsored by:	HybridCluster
2014-01-16 12:31:27 +00:00
Andriy Gapon
2f9a31944f fix a bug in ZFS mirror code for handling multiple DVAs
The bug was introduced in r256956 "Improve ZFS N-way mirror read
performance".
The code in vdev_mirror_dva_select erroneously considers already
tried DVAs for the next attempt.  Thus, it is possible that a failing DVA
would be retried forever.
As a secondary effect, if the attempts fail with checksum error, then
checksum error reports are accumulated until the original request
ultimately fails or succeeds.  But because retrying goes on indefinitely,
the accumulation of checksum reports is effectively a memory leak.

Reviewed by:	gibbs
MFC after:	13 days
Sponsored by:	HybridCluster
2014-01-16 12:26:54 +00:00
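The selection rule that commit 2f9a31944f restores can be illustrated with a minimal sketch. This is not the code of vdev_mirror_dva_select; the structure and function names below are hypothetical and only show the idea that a copy which was already tried must never be chosen again, so the retry loop terminates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one copy (DVA) of a block. */
struct sketch_copy {
	bool tried;	/* a read of this copy was already attempted */
	int  load;	/* lower is better */
};

/*
 * Return the index of the least-loaded copy that has not been tried yet,
 * or -1 when every copy has been tried and the request has to fail.
 */
static int
sketch_select_untried(const struct sketch_copy *copies, size_t n)
{
	int best = -1;

	assert(copies != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (copies[i].tried)
			continue;	/* never retry a copy that was tried */
		if (best == -1 || copies[i].load < copies[best].load)
			best = (int)i;
	}
	return (best);
}
```

Because each attempt marks one more copy as tried, the function returns -1 after at most n attempts, which bounds both the retries and the number of accumulated checksum reports.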
Andriy Gapon
b8ca4667ed zfs: getnewvnode_reserve must be called outside of a zfs transaction
Otherwise we could run into the following deadlock.
A thread has a transaction open and assigned to a transaction group.
That would prevent the transaction group from being quiesced and synced.
The thread is blocked in getnewvnode_reserve waiting for a vnode to
be reclaimed.  The vnlru thread is blocked trying to enter a ZFS VOP because
a filesystem is suspended by an ongoing rollback or receive operation.
In its turn the operation is waiting for the current transaction group
to be synced.

zfs_zget is always used outside of active transactions, but zfs_mknode
is always used in a transaction context.  Thus, we hoist
getnewvnode_reserve from zfs_mknode to its callers.

While there, assert that ZFS always calls getnewvnode while having
a vnode reserved.

Reported by:	adrian
Tested by:	adrian
MFC after:	17 days
Sponsored by:	HybridCluster
2014-01-16 12:22:46 +00:00
Alexander Motin
ce05e707c4 In dmu_zfetch_stream_reclaim() replace division with multiplication and
move it out of the loop and lock.
2014-01-03 18:44:37 +00:00
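The kind of transformation described in ce05e707c4 can be sketched as follows. This is an illustration under assumptions, not the code of dmu_zfetch_stream_reclaim(): the function names, the tick-based ages and the exact comparison are hypothetical. The point is that a per-element division can be replaced by a threshold computed once with a multiplication, before the loop (and, in the kernel, before taking the lock).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-element division: one divide for every entry examined. */
static size_t
sketch_count_expired_div(const int64_t *last, size_t n, int64_t now,
    int64_t hz, int64_t min_sec)
{
	size_t expired = 0;

	assert(hz > 0 && min_sec >= 0);
	for (size_t i = 0; i < n; i++)
		if ((now - last[i]) / hz > min_sec)
			expired++;
	return (expired);
}

/*
 * Same result with the arithmetic hoisted out of the loop:
 * (age / hz) > min_sec holds exactly when age >= (min_sec + 1) * hz,
 * so the threshold is computed once with a single multiplication.
 */
static size_t
sketch_count_expired_mul(const int64_t *last, size_t n, int64_t now,
    int64_t hz, int64_t min_sec)
{
	size_t expired = 0;
	int64_t threshold;

	assert(hz > 0 && min_sec >= 0);
	threshold = (min_sec + 1) * hz;
	for (size_t i = 0; i < n; i++)
		if (now - last[i] >= threshold)
			expired++;
	return (expired);
}
```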
Xin LI
7c88e58f46 MFV r260155:
When we encounter an I/O error on a piece of metadata while deleting
a file system or zvol, we don't update the bptree_entry_phys_t's
bookmark.  This would lead to double free of bp's which will lead to
space map corruption.

Instead of tolerating and allowing the corruption, panic immediately.

See Illumos #4390 for more details.

4391 panic system rather than corrupting pool if we hit bug 4390

Illumos/illumos-gate@8b36997aa2

MFC after:	2 weeks
2014-01-02 08:10:35 +00:00
Xin LI
ab0b9f6b30 MFV r260154 + 260182:
4369 implement zfs bookmarks
4368 zfs send filesystems from readonly pools

Illumos/illumos-gate@78f1710053

MFC after:	2 weeks
2014-01-02 07:34:36 +00:00
Xin LI
6f2791f53a Fix build on platforms where atomic_swap_64 is not available.
2014-01-02 03:24:44 +00:00
Xin LI
647795d181 MFV r260153:
4121 vdev_label_init should treat request as succeeded when pool
     is read only

Illumos/illumos-gate@973c78e94b

MFC after:	2 weeks
2014-01-01 01:26:39 +00:00
Xin LI
f4c8ba8370 MFV r259170:
4370 avoid transmitting holes during zfs send

4371 DMU code clean up

illumos/illumos-gate@43466aae47

NOTE: Make sure the boot code is updated if a zpool upgrade is
done on boot zpool.

MFC after:	2 weeks
2014-01-01 00:45:28 +00:00
Xin LI
cca1e7c623 MFV r258385:
(Note: this change is not applicable to FreeBSD and the file
is not included in build.  It's integrated for completeness).

4128 disks in zpools never go away when pulled

illumos/illumos-gate@39cddb10a3

MFC after:	2 weeks
2013-12-31 21:24:00 +00:00
Xin LI
db2aff5f8b MFV r242733:
3306 zdb should be able to issue reads in parallel
3321 'zpool reopen' command should be documented in the man page
and help message

illumos/illumos-gate@31d7e8fa33

FreeBSD porting notes: the kernel part of this changeset depends
on Solaris buf(9S) interfaces and is not really applicable for
our use.  vdev_disk.c is patched as-is to reduce divergence from
upstream, but vdev_file.c is left intact.

MFC after:	2 weeks
2013-12-31 19:39:15 +00:00
Xin LI
1aaa945f67 MFV r258374:
4171 clean up spa_feature_*() interfaces

4172 implement extensible_dataset feature for use by other zpool
features

illumos/illumos-gate@2acef22db7

MFC after:	2 weeks
2013-12-24 07:14:25 +00:00
Xin LI
ec097c1634 MFV r258373:
4168 ztest assertion failure in dbuf_undirty

4169 verbatim import causes zdb to segfa
4170 zhack leaves pool in ACTIVE state

illumos/illumos-gate@7fdd916c47

MFC after:	2 weeks
2013-12-24 06:56:17 +00:00
Pawel Jakub Dawidek
4106732882 MFV r258923: 4188 assertion failed in dmu_tx_hold_free(): dn_datablkshift != 0
illumos/illumos-gate@bb411a08b0

MFC after:	3 days
2013-12-18 21:45:46 +00:00
Alan Somers
cd730bd6b2 sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
When a da or ada device disappears, outstanding IOs fail with
	ENXIO, not EIO.  The check for EIO was probably copied from Illumos,
	where that is indeed the correct errno.

	Without this change, pulling a busy drive from a zpool would usually
	turn it into UNAVAIL, even though pulling an idle drive would turn
	it into REMOVED.  With this change, it is REMOVED every time.

	Also, vdev_geom_io_intr shouldn't do zfs_post_remove, because that
	results in devd getting two resource.fs.zfs.removed events.  The
	comment said that the event had to be sent directly instead of
	through the async removal thread because "the DE engine is using
	this information to discard previous I/O errors".  However, the fact
	that vdev_geom_io_intr was never actually sending the events until
	now, and that vdev_geom_orphan never sent them at all, and that
	vdev_geom_orphan usually gets called about 2 seconds after the
	actual removal, means that FreeBSD's userland can cope with a late
	event just fine.

Approved by:	ken (mentor)
Sponsored by:	Spectra Logic Corporation
MFC after:	4 weeks
2013-12-12 00:27:22 +00:00
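The errno distinction that cd730bd6b2 describes can be shown with a very small sketch. The helper name is hypothetical and this is not the code in vdev_geom.c; it only encodes the rule stated in the message that, on FreeBSD, a vanished da or ada device fails outstanding I/O with ENXIO rather than EIO.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: decide whether an I/O error indicates that the
 * underlying device is gone (and the vdev should end up REMOVED) rather
 * than merely returning bad data.
 */
static bool
sketch_error_means_device_gone(int error)
{
	assert(error >= 0);
	return (error == ENXIO);	/* EIO is the Illumos convention */
}
```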
Alexander Motin
f192c4873d Don't even try to read vdev labels from devices smaller than SPA_MINDEVSIZE
(64MB).  Even if we somehow found one, the ZFS kernel code rejects such
devices.  It is funny to watch attempts to read four 256K vdev labels from
a 1.44MB floppy, though it is not very practical and quite slow.
2013-12-10 12:36:44 +00:00
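A minimal sketch of the size check described in f192c4873d, assuming the 64MB value quoted in the message. The macro and function names are hypothetical stand-ins and are not taken from the tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed from the commit message: SPA_MINDEVSIZE is 64MB. */
#define	SKETCH_MINDEVSIZE	(64ULL << 20)

/* Skip label probing entirely for devices ZFS would reject anyway. */
static bool
sketch_worth_probing(uint64_t mediasize)
{
	return (mediasize >= SKETCH_MINDEVSIZE);
}
```

With this check a 1.44MB floppy (1474560 bytes) is rejected before any of the four 256K label reads is issued.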
Xin LI
9b11826d3d Expose spa_asize_inflation.
X-MFC-With:	r258632
2013-12-06 23:49:16 +00:00
Andriy Gapon
f77ffe1b22 zfs: add zfs_freebsd_putpages
This should be more efficient than writing pages one-by-one via zfs_write ->
update_pages in the case of a multi-page putpages call.

MFC after:	16 days
2013-11-29 15:39:39 +00:00
Andriy Gapon
6c5b7fffce zfs: add dmu_write_pages variant for freebsd
The freebsd variant of dmu_write_pages is hidden under _KERNEL
to avoid needlessly pulling in vm_page_t declaration.
Besides, this function seems to be useless for the ZFS userland counterpart.

MFC after:	15 days
2013-11-29 15:34:43 +00:00
Andriy Gapon
fdbcc95a47 zfs: make zfs_map_page / zfs_unmap_page public
MFC after:	15 days
2013-11-29 15:33:40 +00:00
Andriy Gapon
ac79eedf85 zfs mappedread_sf: assert that a page is never partially valid
ZFS never partially validates or invalidates a page.
The higher level VM should not do that either.
Correct operation of mappedread_sf depends on a page being either fully
valid or fully invalid.

MFC after:	7 days
2013-11-29 12:19:52 +00:00
Andriy Gapon
be3d0087dc MFV r258665: 4347 ZPL can use dmu_tx_assign(TXG_WAIT)
illumos/illumos-gate@e722410c49

MFC after:	9 days
X-MFC after:	r258632
2013-11-28 19:44:36 +00:00
Andriy Gapon
456a87bb3b MFV r258371,r258372: 4101 metaslab_debug should allow for fine-grained control
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4104 ::spa_space no longer works
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab

illumos/illumos-gate@0713e232b7

Note that some tunables have been removed and some new tunables have
been added.  Of particular note, FreeBSD-only knob
vfs.zfs.space_map_last_hope is removed as it was a nop for some time now
(after one of the previous merges from upstream).

MFC after:	11 days
Sponsored by:	HybridCluster [merge]
2013-11-28 19:37:22 +00:00
Andriy Gapon
7bc07f0575 fix a serious bug in r258632: offset parameter must be set in zio
In illumos all ioctl zio-s are "global" at the moment.  That is they act
on a whole disk, e.g. a cache flush command, and thus do not need either
offset or size parameters.
FreeBSD, on the other hand, has support for TRIM command and that
command requires proper offset and size parameters.
Without this fix all TRIM commands act on the start of any disk or
partition used by ZFS destroying any data there.

Pointyhat to:	avg
Tested by:	sbruno
MFC after:	3 days
X-MFC with:	r258632
Sponsored by:	HybridCluster
2013-11-28 08:48:49 +00:00
Andriy Gapon
2ac1eeec44 fix debug.zfs_flags sysctl description in r258638
Pointyhat to:	avg
MFC after:	3 days
2013-11-26 10:57:09 +00:00
Andriy Gapon
78affb8591 expose zfs_flags as debug.zfs_flags r/w tunable and sysctl
This knob is purposefully hidden under debug.

MFC after:	5 days
Sponsored by:	HybridCluster
2013-11-26 10:46:43 +00:00
Andriy Gapon
3761ac95f7 MFV r258376: 3964 L2ARC should always compress metadata buffers
illumos/illumos-gate@e4be62a2b7

MFC after:	10 days
Sponsored by:	HybridCluster [merge]
2013-11-26 10:14:23 +00:00
Andriy Gapon
fd51e905e2 MFV r255256: 3954 metaslabs continue to load even after hitting zfs_mg_alloc_failure limit
4080 zpool clear fails to clear pool
4081 need zfs_mg_noalloc_threshold

illumos/illumos-gate@22e30981d8

MFC after:	10 days
Sponsored by:	HybridCluster [merge]
2013-11-26 10:02:02 +00:00
Andriy Gapon
2a4704ab01 MFV r255255: 4045 zfs write throttle & i/o scheduler performance work
illumos/illumos-gate@69962b5647

Please note the following changes:
- zio_ioctl has lost its priority parameter and now TRIM is executed
  with 'now' priority
- some knobs are gone and some new knobs are added; not all of them are
  exposed as tunables / sysctls yet

MFC after:	10 days
Sponsored by:	HybridCluster [merge]
2013-11-26 09:57:14 +00:00
Andriy Gapon
fb8171c240 MFV r247578: 3581 spa_zio_taskq[ZIO_TYPE_FREE][ZIO_TASKQ_ISSUE]->tq_lock is piping hot
illumos/illumos-gate@ec94d32216

MFC after:	9 days
Sponsored by:	HybridCluster [merge]
2013-11-26 09:45:48 +00:00
Andriy Gapon
34140e78ab 734 taskq_dispatch_prealloc() desired
943 zio_interrupt ends up calling taskq_dispatch with TQ_SLEEP
illumos/illumos-gate@5aeb94743e

Essentially FreeBSD taskqueues already operate in a mode that
was added to Illumos with taskq_dispatch_ent change.
We even exposed the superior FreeBSD interface as taskq_dispatch_safe.
Now we just rename taskq_dispatch_safe to taskq_dispatch_ent and
struct ostask to taskq_ent_t, so that code differences will be
minimal.

After this change sys/cddl/compat/opensolaris/sys/taskq.h header is no
longer needed.

Note that this commit is not an MFV because the upstream change was not
individually committed to the vendor area.

MFC after:	8 days
2013-11-26 09:26:18 +00:00
Pawel Jakub Dawidek
1cef014007 When append-only, immutable or read-only flag is set don't allow for
hard links creation. This matches UFS behaviour.

Reported by:	Oleg Ginzburg <olevole@olevole.ru>
MFC after:	1 month
2013-11-25 21:17:14 +00:00
Andriy Gapon
a7236350c3 MFV r258378: 4089 NULL pointer dereference in arc_read()
illumos/illumos-gate@57815f6b95

Tested by:	adrian
MFC after:	4 days
2013-11-20 11:52:32 +00:00
Andriy Gapon
c5f4a0a2eb MFV r258377: 4088 use after free in arc_release()
illumos/illumos-gate@ccc22e1304

MFC after:	5 days
2013-11-20 11:47:50 +00:00
Andriy Gapon
3fd7f7bef7 zfs page_busy: fix the boundaries of the cleared range
This is a fix for a regression introduced in r246293.

vm_page_clear_dirty expects the range to have DEV_BSIZE aligned boundaries,
otherwise it extends them.  Thus it can happen that the whole page is
marked clean while actually having some small dirty region(s).
This commit makes the range properly aligned and ensures that only
the clean data is marked as such.

It would be interesting to evaluate how much benefit clearing with DEV_BSIZE
granularity produces.  Perhaps instead we should clear the whole page
when it is completely overwritten and not bother clearing any bits
if only a portion of a page is written.

Reported by:	George Hartzell <hartzell@alerce.com>,
		Richard Todd <rmtodd@servalan.servalan.com>
Tested by:	George Hartzell <hartzell@alerce.com>
Reviewed by:	kib
MFC after:	5 days
2013-11-19 18:43:47 +00:00
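The alignment rule described in 3fd7f7bef7 can be sketched as follows. This is an illustration, not the code of page_busy: the names are hypothetical and DEV_BSIZE is assumed to be 512. The written range is shrunk inward to block boundaries (start rounded up, end rounded down) so that a later clear never covers bytes that were not actually written.

```c
#include <assert.h>

#define	SKETCH_DEV_BSIZE	512	/* assumed value of DEV_BSIZE */

/* Half-open byte range [start, end) within a page. */
struct sketch_range {
	int start;
	int end;
};

/*
 * Shrink [off, off + len) to block-aligned boundaries.  The result may
 * be empty (start == end) when the write does not cover a whole block.
 */
static struct sketch_range
sketch_align_inward(int off, int len)
{
	struct sketch_range r;

	assert(off >= 0 && len >= 0);
	r.start = (off + SKETCH_DEV_BSIZE - 1) / SKETCH_DEV_BSIZE *
	    SKETCH_DEV_BSIZE;
	r.end = (off + len) / SKETCH_DEV_BSIZE * SKETCH_DEV_BSIZE;
	if (r.end < r.start)
		r.end = r.start;	/* nothing is fully covered */
	return (r);
}
```

For example, a write of 1000 bytes at offset 100 fully covers only the block at [512, 1024), so only that block may be marked clean.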
Alexander Motin
c5068af559 Reenable vfs.zfs.zio.use_uma for amd64, disabled at r209261.
On machines with several CPUs and enough RAM this can easily double
ZFS performance or halve CPU usage.  It was disabled three years
ago due to memory and KVA exhaustion reports, but our VM subsystem has
improved a lot since then, hopefully enough to justify another try.
2013-11-19 11:19:07 +00:00
Steven Hartland
8dfd07b976 Fix ZFS deadlock when sending a snapshot which is mounted.
MFC after:	1 week
Sponsored by:	Multiplay
2013-11-18 11:28:19 +00:00
Alexander Motin
e5056f9882 Introduce allocation cache to store LZ4 compression contexts without kicking
VM subsystem twice for every written record.

Tests on 24-core system show double reduction of CPU time spent on copying
single large well-compressed file.

This patch is not really needed on illumos (though it does no harm either) since
their memory allocator by default uses caching for all requests up to 128K.

Reviewed by:	Saso Kiselkov <skiselkov.ml@gmail.com>
2013-11-14 15:54:54 +00:00
Steven Hartland
c28078e903 Improve ZFS N-way mirror read performance by using load and locality
information.

The existing algorithm selects a preferred leaf vdev based on offset of the zio
request modulo the number of members in the mirror. It assumes the devices are
of equal performance and that spreading the requests randomly over both drives
will be sufficient to saturate them. In practice this results in the leaf vdevs
being under utilized.

The new algorithm takes into account the following additional factors:
* Load of the vdevs (number of outstanding I/O requests)
* The locality of last queued I/O vs the new I/O request.

Within the locality calculation additional knowledge about the underlying vdev
is considered, such as whether the device backing the vdev is rotating media.

This results in performance increases across the board as well as significant
increases for predominantly streaming loads and for configurations which don't
have evenly performing devices.

The following are results from a 3-way mirror of 2 x HDs and
1 x SSD, from a basic test running multiple parallel dd's.

With pre-fetch disabled (vfs.zfs.prefetch_disable=1):

== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 161 seconds @ 95 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 297 seconds @ 51 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 54 seconds @ 284 MB/s

With pre-fetch enabled (vfs.zfs.prefetch_disable=0):

== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 91 seconds @ 168 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 108 seconds @ 142 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 48 seconds @ 320 MB/s

In addition to the performance changes the code was also restructured, with
the help of Justin Gibbs, to provide a more logical flow which also ensures
vdev loads are only calculated from the set of valid candidates.

The following additional sysctls were added to allow the administrator
to tune the behaviour of the load algorithm:
* vfs.zfs.vdev.mirror.rotating_inc
* vfs.zfs.vdev.mirror.rotating_seek_inc
* vfs.zfs.vdev.mirror.rotating_seek_offset
* vfs.zfs.vdev.mirror.non_rotating_inc
* vfs.zfs.vdev.mirror.non_rotating_seek_inc

These changes were based on work started by the zfsonlinux developers:
https://github.com/zfsonlinux/zfs/pull/1487

Reviewed by:	gibbs, mav, will
MFC after:	2 weeks
Sponsored by:	Multiplay
2013-10-23 09:54:58 +00:00
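The load and locality idea from c28078e903 can be sketched roughly as below. This is a simplified illustration under assumptions, not the committed algorithm: the structure layout, the way each tunable enters the sum, and the tie-breaking rule are all hypothetical. Only the tunable names are borrowed from the sysctls listed in the message, and no default values are implied.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Field names follow the sysctls in the message; values are caller-supplied. */
struct sketch_tunables {
	int      rotating_inc;
	int      rotating_seek_inc;
	uint64_t rotating_seek_offset;
	int      non_rotating_inc;
	int      non_rotating_seek_inc;
};

/* Hypothetical per-child state. */
struct sketch_child {
	int      pending;	/* outstanding I/O requests */
	uint64_t last_offset;	/* offset of the last queued I/O */
	bool     rotating;	/* backed by rotating media */
};

/* Load = pending I/O count plus a penalty that grows with seek distance. */
static int
sketch_child_load(const struct sketch_child *c, uint64_t offset,
    const struct sketch_tunables *t)
{
	uint64_t dist = offset > c->last_offset ?
	    offset - c->last_offset : c->last_offset - offset;

	if (!c->rotating)
		return (c->pending + (dist == 0 ?
		    t->non_rotating_inc : t->non_rotating_seek_inc));
	if (dist == 0)
		return (c->pending + t->rotating_inc);
	if (dist < t->rotating_seek_offset)
		return (c->pending + t->rotating_seek_inc / 2);
	return (c->pending + t->rotating_seek_inc);
}

/* Pick the child with the lowest load; ties go to the lowest index. */
static int
sketch_pick_child(const struct sketch_child *c, size_t n, uint64_t offset,
    const struct sketch_tunables *t)
{
	int best = -1, best_load = 0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++) {
		int load = sketch_child_load(&c[i], offset, t);

		if (best == -1 || load < best_load) {
			best = (int)i;
			best_load = load;
		}
	}
	return (best);
}
```

Under this model a sequential reader keeps hitting the disk whose head is already in place, while random reads drift toward the non-rotating device, which is the behaviour the benchmark numbers in the message suggest.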
Steven Hartland
70c3432663 Use the vdev's ashift to calculate the supported min block size passed to
zio_compress_data(..) when compressing l2arc buffers.

This eliminates l2arc I/O errors, which resulted in very poor performance on
vdev's configured with block size greater than 512b due to compression
assuming a smaller min block size than the vdev supports.

MFC after:	2 days
2013-10-22 13:31:36 +00:00
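The arithmetic behind 70c3432663 can be sketched as follows. The function names are hypothetical, and how zio_compress_data() uses its minimum block size argument is an assumption here. The sketch only shows that the smallest writable unit is 1 << ashift, so a compressed buffer has to be rounded up to that size before deciding whether compression saved anything.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round size up to a multiple of the device's minimum block (1 << ashift). */
static uint64_t
sketch_round_to_ashift(uint64_t size, unsigned ashift)
{
	uint64_t align;

	assert(ashift < 63);
	align = 1ULL << ashift;
	return ((size + align - 1) & ~(align - 1));
}

/* Compression only helps if the rounded size is smaller than the original. */
static bool
sketch_compression_pays_off(uint64_t csize, uint64_t lsize, unsigned ashift)
{
	return (sketch_round_to_ashift(csize, ashift) < lsize);
}
```

A 4096-byte buffer that compresses to 700 bytes saves space on a 512-byte-sector device (ashift 9) but not on a 4K-sector device (ashift 12), where it still occupies one full 4096-byte block.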
Alexander Motin
40ea77a036 Merge GEOM direct dispatch changes from the projects/camlock branch.
When safety requirements are met, I/O requests bypass the GEOM
g_up/g_down threads and execute directly in the caller's context.
This avoids CPU bottlenecks in the g_up/g_down threads and saves
several context switches per I/O.

The currently defined safety requirements are:
 - caller should not hold any locks and should be reenterable;
 - callee should not depend on GEOM dual-threaded concurrency semantics;
 - on the way down, if request is unmapped while callee doesn't support it,
   the context should be sleepable;
 - kernel thread stack usage should be below 50%.

To keep compatibility with GEOM classes not meeting the above requirements,
new provider and consumer flags were added:
 - G_CF_DIRECT_SEND -- consumer code meets caller requirements (request);
 - G_CF_DIRECT_RECEIVE -- consumer code meets callee requirements (done);
 - G_PF_DIRECT_SEND -- provider code meets caller requirements (done);
 - G_PF_DIRECT_RECEIVE -- provider code meets callee requirements (request).
A capable GEOM class can set them, allowing direct dispatch in cases where
it is safe.  If any of the requirements are not met, the request is queued to
the g_up or g_down thread, same as before.

The following GEOM classes were reviewed and updated to support direct dispatch:
CONCAT, DEV, DISK, GATE, MD, MIRROR, MULTIPATH, NOP, PART, RAID, STRIPE,
VFS, ZERO, ZFS::VDEV, ZFS::ZVOL, all classes based on g_slice KPI (LABEL,
MAP, FLASHMAP, etc).

To declare direct completion capability, the disk(9) KPI got a new flag equivalent
to G_PF_DIRECT_SEND -- DISKFLAG_DIRECT_COMPLETION.  The da(4) and ada(4) disk
drivers now set it, thanks to earlier CAM locking work.

This change more than doubles peak block storage performance on
systems with many CPUs, together with earlier CAM locking changes reaching
more than 1 million IOPS (512 byte raw reads from 16 SATA SSDs on 4 HBAs to
256 user-level threads).

Sponsored by:	iXsystems, Inc.
MFC after:	2 months
2013-10-22 08:22:19 +00:00
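The decision described in 40ea77a036 can be summarised in a small sketch. It is not the GEOM code: the structure and its fields are hypothetical stand-ins for the G_CF_*/G_PF_* flags and run-time conditions listed in the message, reduced to the view of a single request travelling in one direction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the conditions checked for one request. */
struct sketch_dispatch {
	bool   sender_direct_send;	/* sending side set its DIRECT_SEND flag */
	bool   receiver_direct_receive;	/* receiving side set DIRECT_RECEIVE */
	bool   needs_mapping;		/* unmapped request, callee cannot take it */
	bool   can_sleep;		/* current context may sleep */
	size_t stack_used;		/* bytes of kernel stack in use */
	size_t stack_size;		/* total kernel stack size */
};

/* True: run in the caller's context.  False: queue to g_up/g_down. */
static bool
sketch_can_dispatch_directly(const struct sketch_dispatch *d)
{
	assert(d != NULL && d->stack_size > 0);
	if (!d->sender_direct_send || !d->receiver_direct_receive)
		return (false);
	if (d->needs_mapping && !d->can_sleep)
		return (false);
	if (d->stack_used * 2 >= d->stack_size)
		return (false);		/* stack usage must stay below 50% */
	return (true);
}
```

Any failed check falls back to queueing, which matches the message's statement that classes not meeting the requirements keep working as before.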
Andriy Gapon
5d8fac897e MFV r255257: 4082 zfs receive gets EFBIG from dmu_tx_hold_free()
illumos change 14172:be36a38bac3d:
  illumos ZFS issues:
    4082 zfs receive gets EFBIG from dmu_tx_hold_free()

Please note that this change is slightly different from r255257, because
it is merged out of order with other (larger) upstream changes.

PR:		kern/182570
Reported by:	Keith White <kwhite@site.uottawa.ca>
Tested by:	Keith White <kwhite@site.uottawa.ca>
Approved by:	re (glebius)
MFC after:	1 week
X-MFC after:	r254753
2013-10-10 09:53:46 +00:00
Xin LI
6eb151f212 Improve lzjb decompress performance by reorganizing the code
to tighten the copy loop.

Submitted by:	Denis Ahrens <denis h3q com>
MFC after:	2 weeks
Approved by:	re (gjb)
2013-10-08 01:38:24 +00:00
Justin T. Gibbs
69d1b777e8 Optimize the block size used on ZFS cache devices as is already done
for data and log devices.

Reported by:	Dmitryy Makarov
Submitted by:	smh
Reviewed by:	gibbs
Approved by:	re (delphij)
MFC after:	2 weeks
2013-09-21 03:52:08 +00:00
Xin LI
253aa02fc3 MFV r254750:
Add support of Illumos dumps on zvol over RAID-Z.

Note that this only adds the features.  FreeBSD would
still need more work to support dumping on zvols.

Illumos ZFS issues:
  2932 support crash dumps to raidz, etc. pools

MFC after:	1 month
Approved by:	re (ZFS blanket)
2013-09-21 00:17:26 +00:00
Davide Italiano
a25a7e386a Fixup cross-device rename checks in ZFS. Add a check for the case
where 'fdvp' is a directory, 'tvp' is an already existing directory
and they have different mount points.

Reported by:	avg, pjd
Reviewed by:	pjd
Approved by:	re (rodrigc)
2013-09-20 23:22:00 +00:00
Xin LI
e8de677c74 MFV r247844 (illumos-gate 13975:ef6409bc370f)
Illumos ZFS issues:
  3582 zfs_delay() should support a variable resolution
  3584 DTrace sdt probes for ZFS txg states

Provide a compatibility shim for Solaris's cv_timedwait_hires
to help aid future porting.

Approved by:	re (ZFS blanket)
2013-09-10 01:46:47 +00:00
Pawel Jakub Dawidek
7e473ea146 Add sysctl/tunables for various metaslab variables.
2013-09-05 00:53:01 +00:00
Pawel Jakub Dawidek
7008be5bd7 Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

	struct cap_rights {
		uint64_t	cr_rights[CAP_RIGHTS_VERSION + 2];
	};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define a new right, the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, e.g.

	#define	CAP_PDKILL	CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine a few rights, but the rights have to belong
to the same array element, e.g.:

	#define	CAP_LOOKUP	CAPRIGHT(0, 0x0000000000000400ULL)
	#define	CAP_FCHMOD	CAPRIGHT(0, 0x0000000000002000ULL)

	#define	CAP_FCHMODAT	(CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

	cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
	void cap_rights_set(cap_rights_t *rights, ...);
	void cap_rights_clear(cap_rights_t *rights, ...);
	bool cap_rights_is_set(const cap_rights_t *rights, ...);

	bool cap_rights_is_valid(const cap_rights_t *rights);
	void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
	void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
	bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

	cap_rights_t rights;

	cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

	#define	cap_rights_set(rights, ...)				\
		__cap_rights_set((rights), __VA_ARGS__, 0ULL)
	void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

	cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belong to the same array element this way is
correct, but is not advised. It should only be used for alias definitions.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by:	The FreeBSD Foundation
2013-09-05 00:09:56 +00:00
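The encoding described in 7008be5bd7 can be sketched in a few lines. This is an illustration under assumptions, not the contents of the real headers: all SKETCH_/sketch_ names are hypothetical, the index bit for element 0 is assumed to be the lowest of the five index bits (bit 57), and the version bits in the first element are left out. The right values reuse the constants quoted in the message.

```c
#include <assert.h>
#include <stdarg.h>

/* Assumed layout: bits 57..61 hold a one-hot array index. */
#define	SKETCH_CAPRIGHT(idx, bit)	((1ULL << (57 + (idx))) | (bit))

#define	SKETCH_CAP_LOOKUP	SKETCH_CAPRIGHT(0, 0x0000000000000400ULL)
#define	SKETCH_CAP_FCHMOD	SKETCH_CAPRIGHT(0, 0x0000000000002000ULL)
#define	SKETCH_CAP_PDKILL	SKETCH_CAPRIGHT(1, 0x0000000000000800ULL)

struct sketch_cap_rights {
	unsigned long long cr_rights[2];
};

/* Return the array index encoded in a right, or -1 if it is not one-hot. */
static int
sketch_right_index(unsigned long long right)
{
	unsigned long long field = (right >> 57) & 0x1FULL;

	for (int i = 0; i < 5; i++)
		if (field == (1ULL << i))
			return (i);
	return (-1);	/* rights from different elements were OR-ed together */
}

/* The macro appends the 0ULL terminator so callers never have to. */
#define	sketch_rights_set(r, ...)					\
	sketch_rights_set_impl((r), __VA_ARGS__, 0ULL)

static void
sketch_rights_set_impl(struct sketch_cap_rights *r, ...)
{
	va_list ap;
	unsigned long long right;

	va_start(ap, r);
	while ((right = va_arg(ap, unsigned long long)) != 0) {
		int idx = sketch_right_index(right);

		assert(idx >= 0 && idx < 2);	/* catches mixed elements */
		r->cr_rights[idx] |= right;
	}
	va_end(ap);
}
```

Because the index is one-hot, OR-ing rights from two different elements sets two index bits, which is what lets the assertion reject a call such as the CAP_LOOKUP | CAP_PDKILL example in the message, while comma-separated rights are routed to their own elements.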