allow.mount.zfs:
allow mounting the zfs filesystem inside a jail
This way the permssions for mounting all current VFCF_JAIL filesystems
inside a jail are controlled wia allow.mount.* jail parameters.
Update sysctl descriptions.
Update jail(8) and zfs(8) manpages.
TODO: document the connection of allow.mount.* and VFCF_JAIL for kernel
developers
MFC after: 10 days
Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the
sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from
the usermode.
Discussed with: bde, das (previous versions)
MFC after: 1 month
excluding other allocations including UMA now entails the addition of
a single flag to kmem_alloc or uma zone create
Reviewed by: alc, avg
MFC after: 2 weeks
names and wants to sort them by name, ie. when executes:
# zfs list -t snapshot -o name -s name
Because only name is needed we don't have to read all snapshot properties.
Below you can find how long does it take to list 34509 snapshots from a single
disk pool before and after this change with cold and warm cache:
before:
# time zfs list -t snapshot -o name -s name > /dev/null
cold cache: 525s
warm cache: 218s
after:
# time zfs list -t snapshot -o name -s name > /dev/null
cold cache: 1.7s
warm cache: 1.1s
MFC after: 1 week
the allocated memory before calling mtx_init(9) on mtx pointing to it.
Otherwize, random contents of uninitialized memory might occasionally
trigger the assertion.
Reported by: Pavel Polyakov <bsd kobyla org>
Reviewed by: pjd
MFC after: 1 week
the number of links against LINK_MAX (which is INT16_MAX), not against
UINT32_MAX. Otherwise, the constant would implicitly be converted to
-1.
Reviewed by: pjd
MFC after: 1 week
(ss == NULL) on pool import. I had such a panic recently. With current version
of ZFS it is still possible to import the pool in readonly mode and backup
all the data, but in case it is impossible for some reason add tunable
vfs.zfs.space_map_last_hope, which when set to '1' will tell ZFS to remove
colliding range and retry. This seems to have worked for me, but I consider
it highly risky to use.
MFC after: 1 week
lock_success/lock_failure, introduced in r228424, by directly skipping
in dtrace_probe.
This mainly helps in avoiding namespace pollution and thus lockstat.h
dependency by systm.h.
As an added bonus, this also helps in MFC case.
Reviewed by: avg
MFC after: 3 months (or never)
X-MFC: r228424
nullfs. The problem is that resulting vnode is only required to be
held on return from the successfull call to vop, instead of being
referenced.
Nullfs VOP_INACTIVE() method reclaims the vnode, which in combination
with the VOP_VPTOCNP() interface means that the directory vnode
returned from VOP_VPTOCNP() is reclaimed in advance, causing
vn_fullpath() to error with EBADF or like.
Change the interface for VOP_VPTOCNP(), now the dvp must be
referenced. Convert all in-tree implementations of VOP_VPTOCNP(),
which is trivial, because vhold(9) and vref(9) are similar in the
locking prerequisites. Out-of-tree fs implementation of VOP_VPTOCNP(),
if any, should have no trouble with the fix.
Tested by: pho
Reviewed by: mckusick
MFC after: 3 weeks (subject of re approval)
certain instructions in a function prologue or epilogue. DTrace has a
hook into the invalid opcode fault handler that checks whether the fault
was due to an probe and if so, runs the DTrace magic.
Upon returning from an invalid opcode fault caused by a probe, DTrace must
emulate the instruction that was replaced with the invalid opcode and then
return control to the instruction following the invalid opcode.
There were a pair of related bugs in the emulation for the leave
instruction. The leave instruction is used to pop off a stack frame prior
to returning from a function. The emulation for this instruction must
move the trap frame for the invalid opcode fault down the stack to the
bottom of the stack frame that is being removed, and then execute an iret.
At two points in this process, the emulation code was storing values above
the current value of the stack pointer. This opened up a window in which
if we were two take an interrupt, the trap frame for the interrupt would
overwrite the values stored on the stack, causing the system to panic
later.
The first bug was that at one point the emulation code saves the new value
for $esp above the current stack pointer value. The fix is to save this
value instead inside of the original trap frame. At this point we do
not need the original trap frame so this is safe.
The second bug is that when the emulate code loads $esp from the stack, it
points part-way through the new trap frame instead of at its beginning.
The emulation code adjusts the stack pointer to the correct value
immediately afterwards, but this still leaves a one instruction window in
which an interrupt would corrupt this trap frame. Fix this by adjusting
the stack frame value before loading it into $esp.
This fixes panics in invop_leave on i386 when using fbt return probes.
Reviewed by: rpaulo, attilio
MFC after: 1 week
ZFS is trying to open and taste ZVOL as its VDEV. This is not supported,
so return an error instead of panicing on spa_namespace_lock recursion.
Reported by: Robert Millan <rmh@debian.org>
PR: kern/162008
MFC after: 3 days
of the CDDL licence explicitly requires every Contributor to add
a copyright notice.
This also reflects the copyright notices for the changes recently
added by Illumos.
MFC after: 3 days
It is possible for file systems with 'mountpoint' preperty set to 'legacy'
or 'none' - we don't have to change mount directory for them.
Currently such file systems are unmounted on rename and not even mounted back.
This introduces layering violation, as we need to update 'f_mntfromname'
field in statfs structure related to mountpoint (for the dataset we are
renaming and all its children).
In my opinion it is worth it, as it allow to update FreeBSD in even cleaner
way - in ZFS-only configuration root file system is ZFS file system with
'mountpoint' property set to 'legacy'. If root dataset is named system/rootfs,
we can snapshot it (system/rootfs@upgrade), clone it (system/oldrootfs),
update FreeBSD and if it doesn't boot we can boot back from system/oldrootfs
and rename it back to system/rootfs while it is mounted as /. Before it was
not possible, because unmounting / was not possible.
MFC after: 2 weeks
This allows to see processes I/O activity in 'top -m io' output.
PR kern/156218
Reported by: Marcus Reid <marcus@blazingdot.com>
Patch by: avg
MFC after: 3 days
- Decompress assembled gang block data if compressed.
- Verify checksum of a gang header.
- Verify checksum of assembled gang block data.
- Verify checksum of uber block.
Submitted by: avg
MFC after: 3 days
physical block size declared in bp may not always be what we want.
For example in case of gang block header physical block size declared
in bp is much larger than SPA_GANGBLOCKSIZE (512 bytes) and checksum
calculation failed. This bug could lead to accessing unallocated
memory and resets/failures during boot.
MFC after: 3 days
When calculating space needed for SA_BONUS buffers,
hdrsize is always rounded up to next 8-aligned boundary.
However, in two places the round up was done against
sum of 'total' plus hdrsize. On the other hand,
hdrsize increments by 4 each time, which means in
certain conditions, we would end up returning with
will_spill == 0 and (total + hdrsize) larger than
full_space, leading to a failed assertion because
it's invalid for dmu_set_bonus.
Sponsored by: iXsystems, Inc.
Reviewed by: mm
MFC after: 3 days
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.
Reviewed by: rwatson
Approved by: re (bz)
flags field. Updates to the atomic flags are performed using the atomic
ops on the containing word, do not require any vm lock to be held, and
are non-blocking. The vm_page_aflag_set(9) and vm_page_aflag_clear(9)
functions are provided to modify afalgs.
Document the changes to flags field to only require the page lock.
Introduce vm_page_reference(9) function to provide a stable KPI and
KBI for filesystems like tmpfs and zfs which need to mark a page as
referenced.
Reviewed by: alc, attilio
Tested by: marius, flo (sparc64); andreast (powerpc, powerpc64)
Approved by: re (bz)
Remove mapped pages for all dataset vnodes in zfs_rezget() using
new vn_pages_remove() to fix mmapped files changed by
zfs rollback or zfs receive -F.
PR: kern/160035, kern/156933
Reviewed by: kib, pjd
Approved by: re (kib)
MFC after: 1 week
zvol.c: fix calling of dmu_objset_prefetch() in zvol_create_minors()
by passing full instead of relative dataset name and prefetching all
visible datasets to be processed later instead of just the pool name
Reviewed by: pjd
Approved by: re (kib)
MFC after: 1 week
> Reviewed by: If someone else reviewed your modification.
> Approved by: If you needed approval for this commit.
> Obtained from: If the change is from a third party.
> MFC after: N [day[s]|week[s]|month[s]]. Request a reminder email.
> Security: Vulnerability reference (one per line) or description.
> Empty fields above will be automatically removed.
M opensolaris/uts/common/fs/zfs/zfs_ioctl.c
M opensolaris/uts/common/fs/zfs/zvol.c
zfs_ioc_dataset_list_next() and dsl_dir_destroy_check() indirectly
invoked from dmu_recv_existing_end() via dsl_dataset_destroy() by not
prefetching temporary clones, as these count as always inconsistent.
In addition, do not prefetch hidden datasets at all as we are not
going to process these later.
Filed as Illumos Bug #1346
PR: kern/157728
Tested by: Borja Marcos <borjam@sarenet.es>, mm
Reviewed by: pjd
Approved by: re (kib)
MFC after: 1 week
spa_namespace_lock. This fixes LOR between the spa_namespace_lock and
spa_config lock. LOR can cause deadlock on vdevs removal/insertion.
Reported by: gibbs, delphij
Tested by: delphij
Approved by: re (kib)
MFC after: 1 week
kernel for FreeBSD 9.0:
Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *. With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.
Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.
In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.
Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.
Approved by: re (bz)
Submitted by: jonathan
Sponsored by: Google Inc
zfsvfs->z_log before calling zil_commit(). [1]
Do not call zfs_read() from zfs_getextattr() with the IO_SYNC flag.
Submitted by: Alexander Zagrebin <alex@zagrebin.ru> [1]
Reviewed by: pjd@
Approved by: re (kib)
MFC after: 3 days
in the case of a held dataset during remount.
Detailed description is available at:
https://www.illumos.org/issues/883
illumos-gate revision: 13380:161b964a0e10
Reviewed by: pjd
Approved by: re (kib)
Obtained from: Illumos (Bug #883)
MFC after: 3 days
which does not require change in the znode structure.
Specifically, it queries rdev from the znode in the
same sa_bulk_lookup already done in zfs_getattr().
Submitted by: pjd (with some revisions)
Reviewed by: pjd, mm
Approved by: re (kib)
devices are imbalanced zfs will lots of CPU searching for space on devices
which tend to be pretty full. It should instead fail quickly on the full
devices and move onto devices which have more availability.
New loader tunable: vfs.zfs.mg_alloc_failures (min = 8)
Illumos-gate changeset: 13379:4df42cc92254
Obtained from: Illumos (Bug #1051)
MFC after: 2 weeks
For snapshots, this is the same as COMPRESSRATIO, but for
filesystems/volumes, the COMPRESSRATIO is based on the data "USED" (ie,
includes blocks in children, but not blocks shared with the origin).
This is needed to figure out how much space a filesystem would use if it
were not compressed (ignoring snapshots).
Illumos-gate revision: 13387
Obtained from: Illumos (Feature #1092)
MFC after: 2 weeks
The vdev cache is very underutilized (hit ratio 30%-70%) and may consume
excessive memory on systems with many vdevs.
Illumos-gate revision: 13346
Obtained from: Illumos (Bug #175)
MFC after: 1 week
OpenSolaris and ZFS header files. These changes are sufficient
to allow a C++ program to use the libzfs library.
Note: The majority of these files already included 'extern "C"'
declarations, so the intention of providing C++ compatibility
already existed even if it wasn't provided.
cddl/compat/opensolaris/include/assert.h:
Wrap our compatibility assert implementation in
'extern "C"'. Since this is a compatibility header
I matched the Solaris style of doing this explicitly
rather than rely on FreeBSD's __BEGIN/END_DECLS macro.
sys/cddl/compat/opensolaris/sys/kstat.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/arc.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_pool.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/ddt.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio.h:
Rename parameters in function declarations that conflict
with C++ keywords. This was the solution preferred by
members of the Illumos community.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ioctl.h:
In C, nested structures are visible in the global namespace,
but in C++, they take on the namespace of the structure in
which they are contained. Flatten nested structure
definitions within struct zfs_cmd so these structures are
visible in the global namespace when compiled in both
languages.
Sponsored by: Spectra Logic Corporation
User upgrades his system to fix the problem, but if he has any ZFS snapshots
for the file system which contains problematic binary, any user can mount the
snapshot and execute vulnerable binary.
Prevent this from happening by always mounting snapshots with setuid turned off.
MFC after: 2 weeks
The task structure might be no longer available.
This also allows to eliminates the need for two tasks in the zio structure.
Submitted by: anonymous
MFC after: 2 weeks
method, so that callers can indicate the minimum vnode
locking requirement. This will allow some file systems to choose
to return a LK_SHARED locked vnode when LK_SHARED is specified
for the flags argument. This patch only adds the flag. It
does not change any file system to use it and all callers
specify LK_EXCLUSIVE, so file system semantics are not changed.
Reviewed by: kib
(zfs destroy -r pool/dataset@snapshot)
To destroy all descendent snapshots with the same name the top level
snapshot was not required to exist. So if the top level snapshot does
not exist, check permissions of the parent dataset instead.
Filed as Illumos Bug #1043
Reviewed by: delphij
Approved by: pjd
MFC after: together with v28
Now in the case when one-shot timers are used cyclic events should fire
closer to theier scheduled times. As the cyclic is currently used only
to drive DTrace profile provider, this is the area where the change
makes a difference.
Reviewed by: mav (earlier version, a while ago)
X-MFC after: clocksource/eventtimer subsystem
cpuset_t objects.
That is going to offer the underlying support for a simple bump of
MAXCPU and then support for number of cpus > 32 (as it is today).
Right now, cpumask_t is an int, 32 bits on all our supported architecture.
cpumask_t on the other side is implemented as an array of longs, and
easilly extendible by definition.
The architectures touched by this commit are the following:
- amd64
- i386
- pc98
- arm
- ia64
- XEN
while the others are still missing.
Userland is believed to be fully converted with the changes contained
here.
Some technical notes:
- This commit may be considered an ABI nop for all the architectures
different from amd64 and ia64 (and sparc64 in the future)
- per-cpu members, which are now converted to cpuset_t, needs to be
accessed avoiding migration, because the size of cpuset_t should be
considered unknown
- size of cpuset_t objects is different from kernel and userland (this is
primirally done in order to leave some more space in userland to cope
with KBI extensions). If you need to access kernel cpuset_t from the
userland please refer to example in this patch on how to do that
correctly (kgdb may be a good source, for example).
- Support for other architectures is going to be added soon
- Only MAXCPU for amd64 is bumped now
The patch has been tested by sbruno and Nicholas Esborn on opteron
4 x 12 pack CPUs. More testing on big SMP is expected to came soon.
pluknet tested the patch with his 8-ways on both amd64 and i386.
Tested by: pluknet, sbruno, gianni, Nicholas Esborn
Reviewed by: jeff, jhb, sbruno
safer for i386 because it can be easily over 4 GHz now. More worse, it can
be easily changed by user with 'machdep.tsc_freq' tunable (directly) or
cpufreq(4) (indirectly). Note it is intentionally not used in performance
critical paths to avoid performance regression (but we should, in theory).
Alternatively, we may add "virtual TSC" with lower frequency if maximum
frequency overflows 32 bits (and ignore possible incoherency as we do now).
VFS where we know if this is truncate(2) or ftruncate(2). If this is the
latter we should depend on the mode the file was opened and not on the current
permission.
PR: standards/154873
Reported by: Mark Martinec <Mark.Martinec@ijs.si>
Discussed with: Eric Schrock <eric.schrock@delphix.com>
Discussed with: Mark Maybee <Mark.Maybee@Oracle.COM>
MFC after: 1 month
Add systrace_linux32 and systrace_freebsd32 modules which provide
support for tracing compat system calls in addition to native system
call tracing provided by systrace module.
Provided that all the systrace modules are loaded now you can select
what syscalls to trace in the following manner:
syscall::xxx:yyy - work on all system calls that match the specification
syscall:freebsd:xxx:yyy - only native system calls
syscall:linux32:xxx:yyy - linux32 compat system calls
syscall:freebsd32:xxx:yyy - freebsd32 compat system calls on amd64
PR: kern/152822
Submitted by: Artem Belevich <fbsdlist@src.cx>
Reviewed by: jhb (earlier version)
MFC after: 3 weeks
Few new things available from now on:
- Data deduplication.
- Triple parity RAIDZ (RAIDZ3).
- zfs diff.
- zpool split.
- Snapshot holds.
- zpool import -F. Allows to rewind corrupted pool to earlier
transaction group.
- Possibility to import pool in read-only mode.
MFC after: 1 month
attached, activate the page after the successful read, and free the page
if read was unsuccessfull.
Freshly allocated page is not on any queue yet, and not activating (or
deactivating) the page leaves it on no queue, excluding the page from
pagedaemon scans and making the memory disappeared until the vnode
reclaimed.
Reviewed by: avg
MFC after: 1 week
The clock_t type in OpenSolaris is long (int64_t on amd64).
On FreeBSD clock_t is int32_t. The clock_t type is used in several places
in the ZFS code to store system uptime in milliseconds ("seconds * hz").
With hz=1000 we have a 32-bit integer overflow in 24 days, 20 hours,
31 minutes and 23.648 seconds. This has a user reported negative impact
on l2arc_feed_thread() and may cause unexpected results from other functions
using clock_t.
Reported by: Artem Belevich <fbsdlist@src.cx> on freebsd-fs@
MFC after: 1 week
is no way to disable NFSv4 ACLs in ZFS. This should make it easier
for the NFS server to figure out whether the exported filesystem supports
ACLs or not.
Reviewed by: pjd
MFC after: 2 weeks
Fix a race by defining two tasks in the zio structure
as we can still be returning from issue task when interrupt task is used.
Tested by: pjd
Approved by: pjd, delphij (mentor)
MFC after: 3 days
In this case we call target function only on a single CPU and do not
need any synchronization at the setup stage.
It's a bit non-obvious but setup function of NULL means that
smp_rendezvous_cpus waits for all CPUs to arrive at the rendezvous
point, but without doing any actual setup. While using
smp_no_rendevous_barrier means that each CPU proceeds on its own
schedule without any synchronization whatsoever.
MFC after: 3 weeks
detected ashift does not support this. With this change, pools
created while stripesize=512 could not be imported when stripesize
becomes larger (on the same drive).
Noticed by: pjd
The dealock was caused in the following way:
- thread T1 on CPU C1 holds a spin mutex, IPIs CPU C2 and waits for the
IPI to be handled
- C2 executes timer interrupt filter, thus has interrupts disabled, and
gets blocked on the spin mutex held by T1
The problem seems to have been introduced by simplifications made to
OpenSolaris code during porting.
The problem is fixed by reorganizing the code to more closely resemble
the upstream version. Interrupt filter (cyclic_fire) now doesn't
acquire any locks, all per-CPU data accesses are performed on a
target CPU with preemption and interrupts disabled thus precluding
concurrent access to the data.
cyp_mtx spin mutex is used to disable preemtion and interrupts; it's not
used for classical mutual exclusion, because xcall already serializes
calls to a CPU. It's an emulation of OpenSolaris
cyb_set_level(CY_HIGH_LEVEL) call, the spin mutexes could probably be
reduced to just a spinlock_enter()/_exit() pair.
Diff with upstream version is now reduced by ~500 lines, however it still
remains quite large - many things that are not needed (at the moment) or
are irrelevant on FreeBSD were simply ripped out during porting.
Examples of such things:
- support for CPU onlining/offlining
- support for suspend/resume
- support for running callouts at soft interrupt levels
- support for callout rebinding from CPU to CPU
- support for CPU partitions
Tested by: Artem Belevich <fbsdlist@src.cx>
MFC after: 3 weeks
X-MFC with: r216252
alignment on drives with large sector sizes (e.g. 4 KiB) but the
implementation might need to be revisited if devices with large stripesizes
appear (e.g. if RAID controllers or flash drives start using the field),
probably by introducing a physsectorsize field in GEOM providers.
Discussed with: mav, mostly silence on freebsd-geom@ and freebsd-fs@
with filesystems created under MacOS X ZFS port. This is kind of filesystem
corruption (we don't allow for setting empty ACLs), so make acl_get_file(3)
and related syscalls fail with EINVAL in that case. In theory, we could
return empty ACL to userland, but I'm afraid this would break some code.
MFC after: 3 days
kern_sendfile() uses vm_rdwr() to read-ahead blocks of data to populate
page cache. When sendfile stumbles upon a page that is not populated
yet, it sends out all the mbufs that it collected so far. This
resulted in very poor performance with ZFS when file data is not in the
page cache, because ZFS vop_read for UIO_NOCOPY case populated only
those pages that are already in cache, but not valid. Which means that
most of the time it populated only the first requested page in the
described above scenario.
Reported by: Alexander Zagrebin <alexz@visp.ru>
Tested by: Alexander Zagrebin <alexz@visp.ru>,
Artemiev Igor <ai@kliksys.ru>
MFC after: 12 days
and VFS_RELE on a non-existing hold on snapshot parent's z_vfs.
This disables the changes from OpenSolaris onnv-revision 9234:bffdc4fc05c4
(bug IDs: 6792139, 6794830) - not applicable to FreeBSD.
This fixes the process hang if umounting a manually mounted snapshot.
Reported by: Alexander Zagrebin <alexz@visp.ru>
Approved by: delphij (mentor)
MFC after: 1 week
what we have. Without the check the kernel could accessing memory that
does not belong to the request struct.
Note that we do not test if the struct equals in size at this time, which
may faciliate forward compatibility with newer binaries.
Reviewed by: pjd at MeetBSD CA '2010
MFC after: 1 week
OpenSolaris onnv-revision: 10209:91f47f0e7728
6830541 zfs_get_data_trips on a verify
6696242 multiple zfs_fillpage() zfs: accessing past end of object panics
6785914 zfs fails to drop dn_struct_rwlock in recovery code path
Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6830541, 6696242, 6785914)
MFC after: 2 weeks
This should make vnode_pager_getpages path a bit shorter and clearer.
Also this should eliminate problems with partially valid pages.
Having this method opens room for future optimizations.
To do: try to satisfy other pages besides the required one taking into
account tradeofs between number of page faults, read throughput and read
latency. Also, eventually vop_putpages should be added too.
Reviewed by: kib, mm, pjd
MFC after: 3 weeks
Since r212650 and before this change sendfile(2) could produce
a partially valid page for a trailing portion of a ZFS vnode.
vm_fault() always wants to see a fully valid page even if it's the last
page that partially extends beyond vnode's end. Otherwise it calls
vop_getpages() to bring in the page. In the case of ZFS this means
that the data is read from the page into the same page and this breaks
checks in ZFS mappedread() - a thread that set VPO_BUSY on the page in
vm_fault() will get blocked forever waiting for it to be cleared.
Many thanks to Kai and Jeremy for reproducing the issue and providing
important debugging information and help.
Reported by: Kai Gallasch <gallasch@free.de>,
Jeremy Chadwick <freebsd@jdc.parodius.com>
Tested by: Kai Gallasch <gallasch@free.de>,
Jeremy Chadwick <freebsd@jdc.parodius.com>
Reviewed by: kib
MFC after: 3 days
To-Do: apply the same treatment to tmpfs + sendfile
physical memory
This is needed to correctly autotune ZFS ARC size when vm_kmem_size is
set to value larger than available physical memory.
MFC after: 2 weeks
Retry IO once with ZIO_FLAG_TRYHARD before declaring a pool faulted
OpenSolaris revision and Bug IDs:
9725:0bf7402e8022
6843014 ZFS B_FAILFAST handling is broken
Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6843014)
MFC after: 3 weeks
OpenSolaris revision and Bug IDs:
9701:cc5b64682e64
6803605 should be able to offline log devices
6726045 vdev_deflate_ratio is not set when offlining a log device
6599442 zpool import has faults in the display
Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6803605, 6726045, 6599442)
MFC after: 3 weeks
zfs_map_page/zfs_unmap_page are mostly called around potential I/O paths
and it seems to be a not very good idea to do cpu pinning there.
Suggested by: kib
MFC after: 2 weeks
Those checks are not present in upstream code and they are enforced in
actual calculations of delta by which ARC size can be grown or should be
reduced.
MFC after: 3 weeks
vm_paging_target() is not a trigger of any kind for pageademon, but
rather a "soft" target for it when it's already triggered.
Thus, trying to keep 2048 pages above that level at the expense of ARC
was simply driving ARC size into the ground even with normal memory
loads.
Instead, use a threshold at which a pagedaemon scan is triggered, so
that ARC reclaiming helps with pagedaemon's task, but the latter still
recycles active and inactive pages.
PR: kern/146410, kern/138790
MFC after: 3 weeks
Fix possible loss of correct error return code in ZFS mount
OpenSolaris revisions and Bug IDs:
11824:53128e5db7cf
6863610 ZFS mount can lose correct error return
12079:13822b941977
6939941 problem with moving files in zfs (142901-12)
Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6863610, 6939941)
MFC after: 3 days
This mirrors code in tmpfs.
This changge shouldn't affect much read path, it may cause unnecessary
vm_page_lookup calls in the case where v_object has no active or inactive
pages but has some cache pages. I believe this situation to be non-essential.
In write path this change should allow us to properly detect the above
case and free a cache page when we write to a range that corresponds to it.
If this situation is undetected then we could have a discrepancy between
data in page cache and in ARC or on disk.
This change allows us to re-enable vn_has_cached_data() check in zfs_write.
NOTE: strictly speaking resident_page_count and cache fields of v_object
should be exmined under VM_OBJECT_LOCK, but for this particular usage
we may get away with it.
Discussed with: alc, kib
Approved by: pjd
Tested with: tools/regression/fsx
MFC after: 3 weeks
Otherwise, adding insult to injury, in addition to double-caching of data
we would always copy the data into a vnode's vm object page from backend.
This is specific to sendfile case only (VOP_READ with UIO_NOCOPY).
PR: kern/141305
Reported by: Wiktor Niesiobedzki <bsd@vink.pl>
Reviewed by: alc
Tested by: tools/regression/sockets/sendfile
MFC after: 2 weeks
* processes now can't go away while we are inserting probes (fixes a panic)
* if a trap happens, we won't be holding the process lock (fixes a hang)
* fix a LOR between the process lock and the fasttrap bucket list lock
Thanks to kib for pointing some problems.
Sponsored by: The FreeBSD Foundation
code associated with overflow or with the drain function. While this
function is not expected to be used often, it produces more information
in the form of an errno that sbuf_overflowed() did.
Add the BIO_ORDERED flag for struct bio and update bio clients to use it.
The barrier semantics of bioq_insert_tail() were broken in two ways:
o In bioq_disksort(), an added bio could be inserted at the head of
the queue, even when a barrier was present, if the sort key for
the new entry was less than that of the last queued barrier bio.
o The last_offset used to generate the sort key for newly queued bios
did not stay at the position of the barrier until either the
barrier was de-queued, or a new barrier (which updates last_offset)
was queued. When a barrier is in effect, we know that the disk
will pass through the barrier position just before the
"blocked bios" are released, so using the barrier's offset for
last_offset is the optimal choice.
sys/geom/sched/subr_disk.c:
sys/kern/subr_disk.c:
o Update last_offset in bioq_insert_tail().
o Only update last_offset in bioq_remove() if the removed bio is
at the head of the queue (typically due to a call via
bioq_takefirst()) and no barrier is active.
o In bioq_disksort(), if we have a barrier (insert_point is non-NULL),
set prev to the barrier and cur to it's next element. Now that
last_offset is kept at the barrier position, this change isn't
strictly necessary, but since we have to take a decision branch
anyway, it does avoid one, no-op, loop iteration in the while
loop that immediately follows.
o In bioq_disksort(), bypass the normal sort for bios with the
BIO_ORDERED attribute and instead insert them into the queue
with bioq_insert_tail(). bioq_insert_tail() not only gives
the desired command order during insertion, but also provides
barrier semantics so that commands disksorted in the future
cannot pass the just enqueued transaction.
sys/sys/bio.h:
Add BIO_ORDERED as bit 4 of the bio_flags field in struct bio.
sys/cam/ata/ata_da.c:
sys/cam/scsi/scsi_da.c
Use an ordered command for SCSI/ATA-NCQ commands issued in
response to bios with the BIO_ORDERED flag set.
sys/cam/scsi/scsi_da.c
Use an ordered tag when issuing a synchronize cache command.
Wrap some lines to 80 columns.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
sys/geom/geom_io.c
Mark bios with the BIO_FLUSH command as BIO_ORDERED.
Sponsored by: Spectra Logic Corporation
MFC after: 1 month
least one execute bit set, otherwise execve(2) will return EACCES even
for an user with PRIV_VFS_EXEC privilege.
Add the check also to vaccess(9), vaccess_acl_nfs4(9) and
vaccess_acl_posix1e(9). This makes access(2) to better agree with
execve(2). Because ZFS doesn't use vaccess(9) for VEXEC, add the check
to zfs_freebsd_access() too. There may be other file systems which are
not using vaccess*() functions and need to be handled separately.
PR: kern/125009
Reviewed by: bde, trasz
Approved by: pjd (ZFS part)
- better ACL caching and speedup of ACL permission checks
- faster handling of stat()
- lowered mutex contention in the read/writer lock (rrwlock)
- several related bugfixes
Detailed information (OpenSolaris onnv changesets and Bug IDs):
9749:105f407a2680
6802734 Support for Access Based Enumeration (not used on FreeBSD)
6844861 inconsistent xattr readdir behavior with too-small buffer
9866:ddc5f1d8eb4e
6848431 zfs with rstchown=0 or file_chown_self privilege allows user to "take" ownership
9981:b4907297e740
6775100 stat() performance on files on zfs should be improved
6827779 rrwlock is overly protective of its counters
10143:d2d432dfe597
6857433 memory leaks found at: zfs_acl_alloc/zfs_acl_node_alloc
6860318 truncate() on zfsroot succeeds when file has a component of its path set without access permission
10232:f37b85f7e03e
6865875 zfs sometimes incorrectly giving search access to a dir
10250:b179ceb34b62
6867395 zpool_upgrade_007_pos testcase panic'd with BAD TRAP: type=e (#pf Page fault)
10269:2788675568fd
6868276 zfs_rezget() can be hazardous when znode has a cached ACL
10295:f7a18a1e9610
6870564 panic in zfs_getsecattr
Approved by: delphij (mentor)
Obtained from: OpenSolaris (multiple Bug IDs)
MFC after: 2 weeks
This provides a noticeable write speedup, especially on pools with
less than 30% of free space.
Detailed information (OpenSolaris onnv changesets and Bug IDs):
11146:7e58f40bcb1c
6826241 Sync write IOPS drops dramatically during TXG sync
6869229 zfs should switch to shiny new metaslabs more frequently
11728:59fdb3b856f6
6918420 zdb -m has issues printing metaslab statistics
12047:7c1fcc8419ca
6917066 zfs block picking can be improved
Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6826241, 6869229, 6918420, 6917066)
MFC after: 2 weeks
resetting needfree
needfree is checked at the very start of arc_reclaim_needed.
This change makes code easier to follow and maintain in face of
potential changed in arc_reclaim_needed.
Also, put the whole sub-block under _KERNEL because needfree can be set
only in kernel code.
To do: rename needfree to something else to aovid confusion with
OpenSolaris global variable of the same name which is used in the same
code, but has different meaning (page deficit).
Note: I have an impression that locking around accesses to this variable
as well as mutual notifications between arc_reclaim_thread and
arc_lowmem are not proper.
MFC after: 1 week
The changes do not affect FreeBSD code because
zfs_znode_move(), cleanlocks() and cleanshares() are not used.
OpenSolaris onnv changeset: 9788:f660bc44f2e8, 9909:aa280f585a3e
Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6843700, 6790232)
MFC after: 7 weeks
OpenSolaris freemem has the same meaning as our v_free_count +
v_cache_count.
Obtained from: Artem Belevich <fbsdlist@src.cx>,
Peter Jeremy <peterjeremy@acm.org>
Discussed with: pjd
MFC after: 2 weeks
- fixes panics when Solaris/OpenSolaris pools that contain files
uploaded with the SMB protocol are accessed
Enable seting/unsetting the sharesmb property (dummy action)
- allows users who import pools from Solaris/Opensolaris to unset
the sharesmb property and get rid of annoying messages
PR: kern/145778, kern/148709
Approved by: pjd, delphij (mentor)
MFC after: 7 weeks
implementation in 8.0 and later as its flags field does not hold dynamic
state such as waiters flags, but is only modified in lockinit() aside
from VN_LOCK_*().
Discussed with: attilio