ZFS is trying to open and taste ZVOL as its VDEV. This is not supported,
so return an error instead of panicing on spa_namespace_lock recursion.
Reported by: Robert Millan <rmh@debian.org>
PR: kern/162008
MFC after: 3 days
of the CDDL licence explicitly requires every Contributor to add
a copyright notice.
This also reflects the copyright notices for the changes recently
added by Illumos.
MFC after: 3 days
It is possible for file systems with 'mountpoint' preperty set to 'legacy'
or 'none' - we don't have to change mount directory for them.
Currently such file systems are unmounted on rename and not even mounted back.
This introduces layering violation, as we need to update 'f_mntfromname'
field in statfs structure related to mountpoint (for the dataset we are
renaming and all its children).
In my opinion it is worth it, as it allow to update FreeBSD in even cleaner
way - in ZFS-only configuration root file system is ZFS file system with
'mountpoint' property set to 'legacy'. If root dataset is named system/rootfs,
we can snapshot it (system/rootfs@upgrade), clone it (system/oldrootfs),
update FreeBSD and if it doesn't boot we can boot back from system/oldrootfs
and rename it back to system/rootfs while it is mounted as /. Before it was
not possible, because unmounting / was not possible.
MFC after: 2 weeks
This allows to see processes I/O activity in 'top -m io' output.
PR kern/156218
Reported by: Marcus Reid <marcus@blazingdot.com>
Patch by: avg
MFC after: 3 days
- Decompress assembled gang block data if compressed.
- Verify checksum of a gang header.
- Verify checksum of assembled gang block data.
- Verify checksum of uber block.
Submitted by: avg
MFC after: 3 days
physical block size declared in bp may not always be what we want.
For example in case of gang block header physical block size declared
in bp is much larger than SPA_GANGBLOCKSIZE (512 bytes) and checksum
calculation failed. This bug could lead to accessing unallocated
memory and resets/failures during boot.
MFC after: 3 days
When calculating space needed for SA_BONUS buffers,
hdrsize is always rounded up to next 8-aligned boundary.
However, in two places the round up was done against
sum of 'total' plus hdrsize. On the other hand,
hdrsize increments by 4 each time, which means in
certain conditions, we would end up returning with
will_spill == 0 and (total + hdrsize) larger than
full_space, leading to a failed assertion because
it's invalid for dmu_set_bonus.
Sponsored by: iXsystems, Inc.
Reviewed by: mm
MFC after: 3 days
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.
Reviewed by: rwatson
Approved by: re (bz)
flags field. Updates to the atomic flags are performed using the atomic
ops on the containing word, do not require any vm lock to be held, and
are non-blocking. The vm_page_aflag_set(9) and vm_page_aflag_clear(9)
functions are provided to modify afalgs.
Document the changes to flags field to only require the page lock.
Introduce vm_page_reference(9) function to provide a stable KPI and
KBI for filesystems like tmpfs and zfs which need to mark a page as
referenced.
Reviewed by: alc, attilio
Tested by: marius, flo (sparc64); andreast (powerpc, powerpc64)
Approved by: re (bz)
Remove mapped pages for all dataset vnodes in zfs_rezget() using
new vn_pages_remove() to fix mmapped files changed by
zfs rollback or zfs receive -F.
PR: kern/160035, kern/156933
Reviewed by: kib, pjd
Approved by: re (kib)
MFC after: 1 week
zvol.c: fix calling of dmu_objset_prefetch() in zvol_create_minors()
by passing full instead of relative dataset name and prefetching all
visible datasets to be processed later instead of just the pool name
Reviewed by: pjd
Approved by: re (kib)
MFC after: 1 week
> Reviewed by: If someone else reviewed your modification.
> Approved by: If you needed approval for this commit.
> Obtained from: If the change is from a third party.
> MFC after: N [day[s]|week[s]|month[s]]. Request a reminder email.
> Security: Vulnerability reference (one per line) or description.
> Empty fields above will be automatically removed.
M opensolaris/uts/common/fs/zfs/zfs_ioctl.c
M opensolaris/uts/common/fs/zfs/zvol.c
zfs_ioc_dataset_list_next() and dsl_dir_destroy_check() indirectly
invoked from dmu_recv_existing_end() via dsl_dataset_destroy() by not
prefetching temporary clones, as these count as always inconsistent.
In addition, do not prefetch hidden datasets at all as we are not
going to process these later.
Filed as Illumos Bug #1346
PR: kern/157728
Tested by: Borja Marcos <borjam@sarenet.es>, mm
Reviewed by: pjd
Approved by: re (kib)
MFC after: 1 week
spa_namespace_lock. This fixes LOR between the spa_namespace_lock and
spa_config lock. LOR can cause deadlock on vdevs removal/insertion.
Reported by: gibbs, delphij
Tested by: delphij
Approved by: re (kib)
MFC after: 1 week
kernel for FreeBSD 9.0:
Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *. With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.
Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.
In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.
Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.
Approved by: re (bz)
Submitted by: jonathan
Sponsored by: Google Inc
zfsvfs->z_log before calling zil_commit(). [1]
Do not call zfs_read() from zfs_getextattr() with the IO_SYNC flag.
Submitted by: Alexander Zagrebin <alex@zagrebin.ru> [1]
Reviewed by: pjd@
Approved by: re (kib)
MFC after: 3 days
in the case of a held dataset during remount.
Detailed description is available at:
https://www.illumos.org/issues/883
illumos-gate revision: 13380:161b964a0e10
Reviewed by: pjd
Approved by: re (kib)
Obtained from: Illumos (Bug #883)
MFC after: 3 days
which does not require change in the znode structure.
Specifically, it queries rdev from the znode in the
same sa_bulk_lookup already done in zfs_getattr().
Submitted by: pjd (with some revisions)
Reviewed by: pjd, mm
Approved by: re (kib)
devices are imbalanced zfs will lots of CPU searching for space on devices
which tend to be pretty full. It should instead fail quickly on the full
devices and move onto devices which have more availability.
New loader tunable: vfs.zfs.mg_alloc_failures (min = 8)
Illumos-gate changeset: 13379:4df42cc92254
Obtained from: Illumos (Bug #1051)
MFC after: 2 weeks
For snapshots, this is the same as COMPRESSRATIO, but for
filesystems/volumes, the COMPRESSRATIO is based on the data "USED" (ie,
includes blocks in children, but not blocks shared with the origin).
This is needed to figure out how much space a filesystem would use if it
were not compressed (ignoring snapshots).
Illumos-gate revision: 13387
Obtained from: Illumos (Feature #1092)
MFC after: 2 weeks
The vdev cache is very underutilized (hit ratio 30%-70%) and may consume
excessive memory on systems with many vdevs.
Illumos-gate revision: 13346
Obtained from: Illumos (Bug #175)
MFC after: 1 week
OpenSolaris and ZFS header files. These changes are sufficient
to allow a C++ program to use the libzfs library.
Note: The majority of these files already included 'extern "C"'
declarations, so the intention of providing C++ compatibility
already existed even if it wasn't provided.
cddl/compat/opensolaris/include/assert.h:
Wrap our compatibility assert implementation in
'extern "C"'. Since this is a compatibility header
I matched the Solaris style of doing this explicitly
rather than rely on FreeBSD's __BEGIN/END_DECLS macro.
sys/cddl/compat/opensolaris/sys/kstat.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/arc.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_pool.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/ddt.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio.h:
Rename parameters in function declarations that conflict
with C++ keywords. This was the solution preferred by
members of the Illumos community.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ioctl.h:
In C, nested structures are visible in the global namespace,
but in C++, they take on the namespace of the structure in
which they are contained. Flatten nested structure
definitions within struct zfs_cmd so these structures are
visible in the global namespace when compiled in both
languages.
Sponsored by: Spectra Logic Corporation
User upgrades his system to fix the problem, but if he has any ZFS snapshots
for the file system which contains problematic binary, any user can mount the
snapshot and execute vulnerable binary.
Prevent this from happening by always mounting snapshots with setuid turned off.
MFC after: 2 weeks
The task structure might be no longer available.
This also allows to eliminates the need for two tasks in the zio structure.
Submitted by: anonymous
MFC after: 2 weeks
method, so that callers can indicate the minimum vnode
locking requirement. This will allow some file systems to choose
to return a LK_SHARED locked vnode when LK_SHARED is specified
for the flags argument. This patch only adds the flag. It
does not change any file system to use it and all callers
specify LK_EXCLUSIVE, so file system semantics are not changed.
Reviewed by: kib
(zfs destroy -r pool/dataset@snapshot)
To destroy all descendent snapshots with the same name the top level
snapshot was not required to exist. So if the top level snapshot does
not exist, check permissions of the parent dataset instead.
Filed as Illumos Bug #1043
Reviewed by: delphij
Approved by: pjd
MFC after: together with v28
Now in the case when one-shot timers are used cyclic events should fire
closer to theier scheduled times. As the cyclic is currently used only
to drive DTrace profile provider, this is the area where the change
makes a difference.
Reviewed by: mav (earlier version, a while ago)
X-MFC after: clocksource/eventtimer subsystem
cpuset_t objects.
That is going to offer the underlying support for a simple bump of
MAXCPU and then support for number of cpus > 32 (as it is today).
Right now, cpumask_t is an int, 32 bits on all our supported architecture.
cpumask_t on the other side is implemented as an array of longs, and
easilly extendible by definition.
The architectures touched by this commit are the following:
- amd64
- i386
- pc98
- arm
- ia64
- XEN
while the others are still missing.
Userland is believed to be fully converted with the changes contained
here.
Some technical notes:
- This commit may be considered an ABI nop for all the architectures
different from amd64 and ia64 (and sparc64 in the future)
- per-cpu members, which are now converted to cpuset_t, needs to be
accessed avoiding migration, because the size of cpuset_t should be
considered unknown
- size of cpuset_t objects is different from kernel and userland (this is
primirally done in order to leave some more space in userland to cope
with KBI extensions). If you need to access kernel cpuset_t from the
userland please refer to example in this patch on how to do that
correctly (kgdb may be a good source, for example).
- Support for other architectures is going to be added soon
- Only MAXCPU for amd64 is bumped now
The patch has been tested by sbruno and Nicholas Esborn on opteron
4 x 12 pack CPUs. More testing on big SMP is expected to came soon.
pluknet tested the patch with his 8-ways on both amd64 and i386.
Tested by: pluknet, sbruno, gianni, Nicholas Esborn
Reviewed by: jeff, jhb, sbruno
safer for i386 because it can be easily over 4 GHz now. More worse, it can
be easily changed by user with 'machdep.tsc_freq' tunable (directly) or
cpufreq(4) (indirectly). Note it is intentionally not used in performance
critical paths to avoid performance regression (but we should, in theory).
Alternatively, we may add "virtual TSC" with lower frequency if maximum
frequency overflows 32 bits (and ignore possible incoherency as we do now).
VFS where we know if this is truncate(2) or ftruncate(2). If this is the
latter we should depend on the mode the file was opened and not on the current
permission.
PR: standards/154873
Reported by: Mark Martinec <Mark.Martinec@ijs.si>
Discussed with: Eric Schrock <eric.schrock@delphix.com>
Discussed with: Mark Maybee <Mark.Maybee@Oracle.COM>
MFC after: 1 month
Add systrace_linux32 and systrace_freebsd32 modules which provide
support for tracing compat system calls in addition to native system
call tracing provided by systrace module.
Provided that all the systrace modules are loaded now you can select
what syscalls to trace in the following manner:
syscall::xxx:yyy - work on all system calls that match the specification
syscall:freebsd:xxx:yyy - only native system calls
syscall:linux32:xxx:yyy - linux32 compat system calls
syscall:freebsd32:xxx:yyy - freebsd32 compat system calls on amd64
PR: kern/152822
Submitted by: Artem Belevich <fbsdlist@src.cx>
Reviewed by: jhb (earlier version)
MFC after: 3 weeks
Few new things available from now on:
- Data deduplication.
- Triple parity RAIDZ (RAIDZ3).
- zfs diff.
- zpool split.
- Snapshot holds.
- zpool import -F. Allows to rewind corrupted pool to earlier
transaction group.
- Possibility to import pool in read-only mode.
MFC after: 1 month
attached, activate the page after the successful read, and free the page
if read was unsuccessfull.
Freshly allocated page is not on any queue yet, and not activating (or
deactivating) the page leaves it on no queue, excluding the page from
pagedaemon scans and making the memory disappeared until the vnode
reclaimed.
Reviewed by: avg
MFC after: 1 week
The clock_t type in OpenSolaris is long (int64_t on amd64).
On FreeBSD clock_t is int32_t. The clock_t type is used in several places
in the ZFS code to store system uptime in milliseconds ("seconds * hz").
With hz=1000 we have a 32-bit integer overflow in 24 days, 20 hours,
31 minutes and 23.648 seconds. This has a user reported negative impact
on l2arc_feed_thread() and may cause unexpected results from other functions
using clock_t.
Reported by: Artem Belevich <fbsdlist@src.cx> on freebsd-fs@
MFC after: 1 week
is no way to disable NFSv4 ACLs in ZFS. This should make it easier
for the NFS server to figure out whether the exported filesystem supports
ACLs or not.
Reviewed by: pjd
MFC after: 2 weeks
Fix a race by defining two tasks in the zio structure
as we can still be returning from issue task when interrupt task is used.
Tested by: pjd
Approved by: pjd, delphij (mentor)
MFC after: 3 days
In this case we call target function only on a single CPU and do not
need any synchronization at the setup stage.
It's a bit non-obvious but setup function of NULL means that
smp_rendezvous_cpus waits for all CPUs to arrive at the rendezvous
point, but without doing any actual setup. While using
smp_no_rendevous_barrier means that each CPU proceeds on its own
schedule without any synchronization whatsoever.
MFC after: 3 weeks
detected ashift does not support this. With this change, pools
created while stripesize=512 could not be imported when stripesize
becomes larger (on the same drive).
Noticed by: pjd
The dealock was caused in the following way:
- thread T1 on CPU C1 holds a spin mutex, IPIs CPU C2 and waits for the
IPI to be handled
- C2 executes timer interrupt filter, thus has interrupts disabled, and
gets blocked on the spin mutex held by T1
The problem seems to have been introduced by simplifications made to
OpenSolaris code during porting.
The problem is fixed by reorganizing the code to more closely resemble
the upstream version. Interrupt filter (cyclic_fire) now doesn't
acquire any locks, all per-CPU data accesses are performed on a
target CPU with preemption and interrupts disabled thus precluding
concurrent access to the data.
cyp_mtx spin mutex is used to disable preemtion and interrupts; it's not
used for classical mutual exclusion, because xcall already serializes
calls to a CPU. It's an emulation of OpenSolaris
cyb_set_level(CY_HIGH_LEVEL) call, the spin mutexes could probably be
reduced to just a spinlock_enter()/_exit() pair.
Diff with upstream version is now reduced by ~500 lines, however it still
remains quite large - many things that are not needed (at the moment) or
are irrelevant on FreeBSD were simply ripped out during porting.
Examples of such things:
- support for CPU onlining/offlining
- support for suspend/resume
- support for running callouts at soft interrupt levels
- support for callout rebinding from CPU to CPU
- support for CPU partitions
Tested by: Artem Belevich <fbsdlist@src.cx>
MFC after: 3 weeks
X-MFC with: r216252