nullfs. The problem is that resulting vnode is only required to be
held on return from the successful call to vop, instead of being
referenced.
Nullfs VOP_INACTIVE() method reclaims the vnode, which in combination
with the VOP_VPTOCNP() interface means that the directory vnode
returned from VOP_VPTOCNP() is reclaimed in advance, causing
vn_fullpath() to error with EBADF or the like.
Change the interface for VOP_VPTOCNP(): the dvp must now be
referenced. Convert all in-tree implementations of VOP_VPTOCNP(),
which is trivial, because vhold(9) and vref(9) are similar in the
locking prerequisites. Out-of-tree fs implementations of VOP_VPTOCNP(),
if any, should have no trouble with the fix.
Tested by: pho
Reviewed by: mckusick
MFC after: 3 weeks (subject to re approval)
ZFS is trying to open and taste ZVOL as its VDEV. This is not supported,
so return an error instead of panicking on spa_namespace_lock recursion.
Reported by: Robert Millan <rmh@debian.org>
PR: kern/162008
MFC after: 3 days
of the CDDL licence explicitly requires every Contributor to add
a copyright notice.
This also reflects the copyright notices for the changes recently
added by Illumos.
MFC after: 3 days
It is possible for file systems with 'mountpoint' property set to 'legacy'
or 'none' - we don't have to change mount directory for them.
Currently such file systems are unmounted on rename and not even mounted back.
This introduces layering violation, as we need to update 'f_mntfromname'
field in statfs structure related to mountpoint (for the dataset we are
renaming and all its children).
In my opinion it is worth it, as it allows updating FreeBSD in an even cleaner
way - in ZFS-only configuration root file system is ZFS file system with
'mountpoint' property set to 'legacy'. If root dataset is named system/rootfs,
we can snapshot it (system/rootfs@upgrade), clone it (system/oldrootfs),
update FreeBSD and if it doesn't boot we can boot back from system/oldrootfs
and rename it back to system/rootfs while it is mounted as /. Before it was
not possible, because unmounting / was not possible.
MFC after: 2 weeks
This allows seeing per-process I/O activity in 'top -m io' output.
PR: kern/156218
Reported by: Marcus Reid <marcus@blazingdot.com>
Patch by: avg
MFC after: 3 days
When calculating space needed for SA_BONUS buffers,
hdrsize is always rounded up to next 8-aligned boundary.
However, in two places the round up was done against
sum of 'total' plus hdrsize. On the other hand,
hdrsize increments by 4 each time, which means in
certain conditions, we would end up returning with
will_spill == 0 and (total + hdrsize) larger than
full_space, leading to a failed assertion because
it's invalid for dmu_set_bonus.
Sponsored by: iXsystems, Inc.
Reviewed by: mm
MFC after: 3 days
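The rounding mismatch described in the SA_BONUS entry above can be
illustrated with a small sketch. The helper names and the sample numbers
are hypothetical and are not taken from the ZFS sources; only the idea
(rounding hdrsize alone versus rounding the sum of 'total' and hdrsize)
comes from the text.

```c
#include <stddef.h>

/* Round x up to the next multiple of 8 (hypothetical helper). */
static size_t
roundup8(size_t x)
{
	return ((x + 7) & ~(size_t)7);
}

/* Estimate that rounds the sum; the inconsistent form per the text. */
static size_t
space_round_sum(size_t total, size_t hdrsize)
{
	return (roundup8(total + hdrsize));
}

/* Estimate that rounds hdrsize alone, matching the other call sites. */
static size_t
space_round_hdr(size_t total, size_t hdrsize)
{
	return (total + roundup8(hdrsize));
}
```

With total = 12 and hdrsize = 12, the first estimate yields 24 while the
second yields 28, so a buffer of 24 bytes passes one check and fails the
other. That is the kind of disagreement that could leave will_spill at 0
while the real requirement exceeds full_space.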
flags field. Updates to the atomic flags are performed using the atomic
ops on the containing word, do not require any vm lock to be held, and
are non-blocking. The vm_page_aflag_set(9) and vm_page_aflag_clear(9)
functions are provided to modify aflags.
Document the changes to flags field to only require the page lock.
Introduce vm_page_reference(9) function to provide a stable KPI and
KBI for filesystems like tmpfs and zfs which need to mark a page as
referenced.
Reviewed by: alc, attilio
Tested by: marius, flo (sparc64); andreast (powerpc, powerpc64)
Approved by: re (bz)
Remove mapped pages for all dataset vnodes in zfs_rezget() using
new vn_pages_remove() to fix mmapped files changed by
zfs rollback or zfs receive -F.
PR: kern/160035, kern/156933
Reviewed by: kib, pjd
Approved by: re (kib)
MFC after: 1 week
zvol.c: fix calling of dmu_objset_prefetch() in zvol_create_minors()
by passing full instead of relative dataset name and prefetching all
visible datasets to be processed later instead of just the pool name
Reviewed by: pjd
Approved by: re (kib)
MFC after: 1 week
zfs_ioc_dataset_list_next() and dsl_dir_destroy_check() indirectly
invoked from dmu_recv_existing_end() via dsl_dataset_destroy() by not
prefetching temporary clones, as these count as always inconsistent.
In addition, do not prefetch hidden datasets at all as we are not
going to process these later.
Filed as Illumos Bug #1346
PR: kern/157728
Tested by: Borja Marcos <borjam@sarenet.es>, mm
Reviewed by: pjd
Approved by: re (kib)
MFC after: 1 week
spa_namespace_lock. This fixes LOR between the spa_namespace_lock and
spa_config lock. LOR can cause deadlock on vdevs removal/insertion.
Reported by: gibbs, delphij
Tested by: delphij
Approved by: re (kib)
MFC after: 1 week
zfsvfs->z_log before calling zil_commit(). [1]
Do not call zfs_read() from zfs_getextattr() with the IO_SYNC flag.
Submitted by: Alexander Zagrebin <alex@zagrebin.ru> [1]
Reviewed by: pjd@
Approved by: re (kib)
MFC after: 3 days
in the case of a held dataset during remount.
Detailed description is available at:
https://www.illumos.org/issues/883
illumos-gate revision: 13380:161b964a0e10
Reviewed by: pjd
Approved by: re (kib)
Obtained from: Illumos (Bug #883)
MFC after: 3 days
which does not require change in the znode structure.
Specifically, it queries rdev from the znode in the
same sa_bulk_lookup already done in zfs_getattr().
Submitted by: pjd (with some revisions)
Reviewed by: pjd, mm
Approved by: re (kib)
devices are imbalanced zfs will spend lots of CPU searching for space on devices
which tend to be pretty full. It should instead fail quickly on the full
devices and move onto devices which have more availability.
New loader tunable: vfs.zfs.mg_alloc_failures (min = 8)
Illumos-gate changeset: 13379:4df42cc92254
Obtained from: Illumos (Bug #1051)
MFC after: 2 weeks
For snapshots, this is the same as COMPRESSRATIO, but for
filesystems/volumes, the COMPRESSRATIO is based on the data "USED" (ie,
includes blocks in children, but not blocks shared with the origin).
This is needed to figure out how much space a filesystem would use if it
were not compressed (ignoring snapshots).
Illumos-gate revision: 13387
Obtained from: Illumos (Feature #1092)
MFC after: 2 weeks
The vdev cache is very underutilized (hit ratio 30%-70%) and may consume
excessive memory on systems with many vdevs.
Illumos-gate revision: 13346
Obtained from: Illumos (Bug #175)
MFC after: 1 week
OpenSolaris and ZFS header files. These changes are sufficient
to allow a C++ program to use the libzfs library.
Note: The majority of these files already included 'extern "C"'
declarations, so the intention of providing C++ compatibility
already existed even if it wasn't provided.
cddl/compat/opensolaris/include/assert.h:
Wrap our compatibility assert implementation in
'extern "C"'. Since this is a compatibility header
I matched the Solaris style of doing this explicitly
rather than rely on FreeBSD's __BEGIN/END_DECLS macro.
sys/cddl/compat/opensolaris/sys/kstat.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/arc.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_pool.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/ddt.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio.h:
Rename parameters in function declarations that conflict
with C++ keywords. This was the solution preferred by
members of the Illumos community.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ioctl.h:
In C, nested structures are visible in the global namespace,
but in C++, they take on the namespace of the structure in
which they are contained. Flatten nested structure
definitions within struct zfs_cmd so these structures are
visible in the global namespace when compiled in both
languages.
Sponsored by: Spectra Logic Corporation
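The two header changes described above (flattening nested structure
definitions and renaming parameters that collide with C++ keywords) can
be sketched as follows. All type, field and function names here are
hypothetical and do not correspond to the real zfs_ioctl.h contents.

```c
#include <stdint.h>

#ifdef __cplusplus
extern "C" {
#endif

/*
 * Defined at file scope rather than inside struct example_cmd, so the
 * type name is visible globally in both C and C++.
 */
struct example_share {
	uint64_t	es_flags;
};

struct example_cmd {
	struct example_share	ec_share;
	uint64_t		ec_cookie;
};

/*
 * A parameter named "new" or "private" would be rejected by a C++
 * compiler; "new_cookie" is accepted by both languages.
 */
static inline uint64_t
example_set_cookie(struct example_cmd *cmd, uint64_t new_cookie)
{
	cmd->ec_cookie = new_cookie;
	return (cmd->ec_cookie);
}

#ifdef __cplusplus
}
#endif
```

A C++ consumer can then declare a struct example_share directly, without
qualifying it as example_cmd::example_share.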
The task structure might be no longer available.
This also eliminates the need for two tasks in the zio structure.
Submitted by: anonymous
MFC after: 2 weeks
method, so that callers can indicate the minimum vnode
locking requirement. This will allow some file systems to choose
to return a LK_SHARED locked vnode when LK_SHARED is specified
for the flags argument. This patch only adds the flag. It
does not change any file system to use it and all callers
specify LK_EXCLUSIVE, so file system semantics are not changed.
Reviewed by: kib
(zfs destroy -r pool/dataset@snapshot)
To destroy all descendent snapshots with the same name the top level
snapshot was not required to exist. So if the top level snapshot does
not exist, check permissions of the parent dataset instead.
Filed as Illumos Bug #1043
Reviewed by: delphij
Approved by: pjd
MFC after: together with v28
VFS where we know if this is truncate(2) or ftruncate(2). If this is the
latter we should depend on the mode the file was opened and not on the current
permission.
PR: standards/154873
Reported by: Mark Martinec <Mark.Martinec@ijs.si>
Discussed with: Eric Schrock <eric.schrock@delphix.com>
Discussed with: Mark Maybee <Mark.Maybee@Oracle.COM>
MFC after: 1 month
A few new things are available from now on:
- Data deduplication.
- Triple parity RAIDZ (RAIDZ3).
- zfs diff.
- zpool split.
- Snapshot holds.
- zpool import -F. Allows rewinding a corrupted pool to an earlier
transaction group.
- Possibility to import pool in read-only mode.
MFC after: 1 month
attached, activate the page after the successful read, and free the page
if read was unsuccessful.
A freshly allocated page is not on any queue yet, and not activating (or
deactivating) the page leaves it on no queue, excluding the page from
pagedaemon scans and making the memory disappear until the vnode is
reclaimed.
Reviewed by: avg
MFC after: 1 week
is no way to disable NFSv4 ACLs in ZFS. This should make it easier
for the NFS server to figure out whether the exported filesystem supports
ACLs or not.
Reviewed by: pjd
MFC after: 2 weeks
Fix a race by defining two tasks in the zio structure
as we can still be returning from issue task when interrupt task is used.
Tested by: pjd
Approved by: pjd, delphij (mentor)
MFC after: 3 days
detected ashift does not support this. With this change, pools
created while stripesize=512 could not be imported when stripesize
becomes larger (on the same drive).
Noticed by: pjd
alignment on drives with large sector sizes (e.g. 4 KiB) but the
implementation might need to be revisited if devices with large stripesizes
appear (e.g. if RAID controllers or flash drives start using the field),
probably by introducing a physsectorsize field in GEOM providers.
Discussed with: mav, mostly silence on freebsd-geom@ and freebsd-fs@
kern_sendfile() uses vm_rdwr() to read-ahead blocks of data to populate
page cache. When sendfile stumbles upon a page that is not populated
yet, it sends out all the mbufs that it collected so far. This
resulted in very poor performance with ZFS when file data is not in the
page cache, because ZFS vop_read for UIO_NOCOPY case populated only
those pages that are already in cache, but not valid. This means that
most of the time it populated only the first requested page in the
scenario described above.
Reported by: Alexander Zagrebin <alexz@visp.ru>
Tested by: Alexander Zagrebin <alexz@visp.ru>,
Artemiev Igor <ai@kliksys.ru>
MFC after: 12 days
and VFS_RELE on a non-existing hold on snapshot parent's z_vfs.
This disables the changes from OpenSolaris onnv-revision 9234:bffdc4fc05c4
(bug IDs: 6792139, 6794830) - not applicable to FreeBSD.
This fixes the process hang if umounting a manually mounted snapshot.
Reported by: Alexander Zagrebin <alexz@visp.ru>
Approved by: delphij (mentor)
MFC after: 1 week
what we have. Without the check the kernel could access memory that
does not belong to the request struct.
Note that we do not test if the struct equals in size at this time, which
may facilitate forward compatibility with newer binaries.
Reviewed by: pjd at MeetBSD CA '2010
MFC after: 1 week
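A minimal sketch of the kind of length check described above, assuming a
hypothetical request structure and assuming EINVAL as the error code
(neither is taken from the actual change). The supplied length must
cover the structure the kernel will read, but a larger length is
accepted so that newer binaries with a grown structure keep working.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical request layout, for illustration only. */
struct example_req {
	unsigned long	er_a;
	unsigned long	er_b;
};

/*
 * Reject requests shorter than the structure we are about to access.
 * Deliberately not an equality test: larger requests are allowed.
 */
static int
example_check_len(size_t len)
{
	if (len < sizeof(struct example_req))
		return (EINVAL);
	return (0);
}
```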
OpenSolaris onnv-revision: 10209:91f47f0e7728
6830541 zfs_get_data_trips on a verify
6696242 multiple zfs_fillpage() zfs: accessing past end of object panics
6785914 zfs fails to drop dn_struct_rwlock in recovery code path
Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6830541, 6696242, 6785914)
MFC after: 2 weeks
This should make vnode_pager_getpages path a bit shorter and clearer.
Also this should eliminate problems with partially valid pages.
Having this method opens room for future optimizations.
To do: try to satisfy other pages besides the required one taking into
account tradeoffs between number of page faults, read throughput and read
latency. Also, eventually vop_putpages should be added too.
Reviewed by: kib, mm, pjd
MFC after: 3 weeks
Since r212650 and before this change sendfile(2) could produce
a partially valid page for a trailing portion of a ZFS vnode.
vm_fault() always wants to see a fully valid page even if it's the last
page that partially extends beyond vnode's end. Otherwise it calls
vop_getpages() to bring in the page. In the case of ZFS this means
that the data is read from the page into the same page and this breaks
checks in ZFS mappedread() - a thread that set VPO_BUSY on the page in
vm_fault() will get blocked forever waiting for it to be cleared.
Many thanks to Kai and Jeremy for reproducing the issue and providing
important debugging information and help.
Reported by: Kai Gallasch <gallasch@free.de>,
Jeremy Chadwick <freebsd@jdc.parodius.com>
Tested by: Kai Gallasch <gallasch@free.de>,
Jeremy Chadwick <freebsd@jdc.parodius.com>
Reviewed by: kib
MFC after: 3 days
To-Do: apply the same treatment to tmpfs + sendfile
Retry IO once with ZIO_FLAG_TRYHARD before declaring a pool faulted
OpenSolaris revision and Bug IDs:
9725:0bf7402e8022
6843014 ZFS B_FAILFAST handling is broken
Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6843014)
MFC after: 3 weeks
OpenSolaris revision and Bug IDs:
9701:cc5b64682e64
6803605 should be able to offline log devices
6726045 vdev_deflate_ratio is not set when offlining a log device
6599442 zpool import has faults in the display
Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6803605, 6726045, 6599442)
MFC after: 3 weeks
zfs_map_page/zfs_unmap_page are mostly called around potential I/O paths
and it does not seem to be a good idea to do CPU pinning there.
Suggested by: kib
MFC after: 2 weeks
Those checks are not present in upstream code and they are enforced in
actual calculations of delta by which ARC size can be grown or should be
reduced.
MFC after: 3 weeks
vm_paging_target() is not a trigger of any kind for pagedaemon, but
rather a "soft" target for it when it's already triggered.
Thus, trying to keep 2048 pages above that level at the expense of ARC
was simply driving ARC size into the ground even with normal memory
loads.
Instead, use a threshold at which a pagedaemon scan is triggered, so
that ARC reclaiming helps with pagedaemon's task, but the latter still
recycles active and inactive pages.
PR: kern/146410, kern/138790
MFC after: 3 weeks
Fix possible loss of correct error return code in ZFS mount
OpenSolaris revisions and Bug IDs:
11824:53128e5db7cf
6863610 ZFS mount can lose correct error return
12079:13822b941977
6939941 problem with moving files in zfs (142901-12)
Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6863610, 6939941)
MFC after: 3 days
This mirrors code in tmpfs.
This change shouldn't affect the read path much; it may cause unnecessary
vm_page_lookup calls in the case where v_object has no active or inactive
pages but has some cache pages. I believe this situation to be non-essential.
In write path this change should allow us to properly detect the above
case and free a cache page when we write to a range that corresponds to it.
If this situation is undetected then we could have a discrepancy between
data in page cache and in ARC or on disk.
This change allows us to re-enable vn_has_cached_data() check in zfs_write.
NOTE: strictly speaking resident_page_count and cache fields of v_object
should be examined under VM_OBJECT_LOCK, but for this particular usage
we may get away with it.
Discussed with: alc, kib
Approved by: pjd
Tested with: tools/regression/fsx
MFC after: 3 weeks
Otherwise, adding insult to injury, in addition to double-caching of data
we would always copy the data into a vnode's vm object page from backend.
This is specific to sendfile case only (VOP_READ with UIO_NOCOPY).
PR: kern/141305
Reported by: Wiktor Niesiobedzki <bsd@vink.pl>
Reviewed by: alc
Tested by: tools/regression/sockets/sendfile
MFC after: 2 weeks
code associated with overflow or with the drain function. While this
function is not expected to be used often, it produces more information
in the form of an errno than sbuf_overflowed() did.
Add the BIO_ORDERED flag for struct bio and update bio clients to use it.
The barrier semantics of bioq_insert_tail() were broken in two ways:
o In bioq_disksort(), an added bio could be inserted at the head of
the queue, even when a barrier was present, if the sort key for
the new entry was less than that of the last queued barrier bio.
o The last_offset used to generate the sort key for newly queued bios
did not stay at the position of the barrier until either the
barrier was de-queued, or a new barrier (which updates last_offset)
was queued. When a barrier is in effect, we know that the disk
will pass through the barrier position just before the
"blocked bios" are released, so using the barrier's offset for
last_offset is the optimal choice.
sys/geom/sched/subr_disk.c:
sys/kern/subr_disk.c:
o Update last_offset in bioq_insert_tail().
o Only update last_offset in bioq_remove() if the removed bio is
at the head of the queue (typically due to a call via
bioq_takefirst()) and no barrier is active.
o In bioq_disksort(), if we have a barrier (insert_point is non-NULL),
set prev to the barrier and cur to its next element. Now that
last_offset is kept at the barrier position, this change isn't
strictly necessary, but since we have to take a decision branch
anyway, it does avoid one, no-op, loop iteration in the while
loop that immediately follows.
o In bioq_disksort(), bypass the normal sort for bios with the
BIO_ORDERED attribute and instead insert them into the queue
with bioq_insert_tail(). bioq_insert_tail() not only gives
the desired command order during insertion, but also provides
barrier semantics so that commands disksorted in the future
cannot pass the just enqueued transaction.
sys/sys/bio.h:
Add BIO_ORDERED as bit 4 of the bio_flags field in struct bio.
sys/cam/ata/ata_da.c:
sys/cam/scsi/scsi_da.c:
Use an ordered command for SCSI/ATA-NCQ commands issued in
response to bios with the BIO_ORDERED flag set.
sys/cam/scsi/scsi_da.c:
Use an ordered tag when issuing a synchronize cache command.
Wrap some lines to 80 columns.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c:
sys/geom/geom_io.c:
Mark bios with the BIO_FLUSH command as BIO_ORDERED.
Sponsored by: Spectra Logic Corporation
MFC after: 1 month
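The barrier behaviour described in the BIO_ORDERED entry above can be
sketched with a simplified, array-backed queue. This is not the bioq
implementation: the types and function names are hypothetical, the sort
is a plain ascending sort rather than the one-way elevator keyed on
last_offset, and removal is not modelled. It only shows that sorted
insertions never pass the most recent ordered entry.

```c
#include <stddef.h>

#define	EXQ_MAX	16

struct ex_bio {
	long	offset;
	int	ordered;	/* stands in for BIO_ORDERED */
};

struct ex_queue {
	struct ex_bio	q[EXQ_MAX];
	size_t		len;
	size_t		barrier; /* first slot after the last barrier */
};

/* Append at the tail and make the new entry a barrier. */
static int
ex_insert_tail(struct ex_queue *eq, struct ex_bio b)
{
	if (eq->len == EXQ_MAX)
		return (-1);
	eq->q[eq->len++] = b;
	eq->barrier = eq->len;
	return (0);
}

/* Sorted insert that only considers entries queued after the barrier. */
static int
ex_disksort(struct ex_queue *eq, struct ex_bio b)
{
	size_t i, j;

	if (b.ordered)
		return (ex_insert_tail(eq, b));
	if (eq->len == EXQ_MAX)
		return (-1);
	for (i = eq->barrier; i < eq->len; i++)
		if (eq->q[i].offset > b.offset)
			break;
	for (j = eq->len; j > i; j--)
		eq->q[j] = eq->q[j - 1];
	eq->q[i] = b;
	eq->len++;
	return (0);
}
```

Queueing offsets 100 and 50, then an ordered entry at 70, then 10 and 5
leaves the queue as 50, 100, 70, 5, 10: the two late entries are sorted
among themselves but stay behind the ordered one.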
least one execute bit set, otherwise execve(2) will return EACCES even
for a user with PRIV_VFS_EXEC privilege.
Add the check also to vaccess(9), vaccess_acl_nfs4(9) and
vaccess_acl_posix1e(9). This makes access(2) agree better with
execve(2). Because ZFS doesn't use vaccess(9) for VEXEC, add the check
to zfs_freebsd_access() too. There may be other file systems which are
not using vaccess*() functions and need to be handled separately.
PR: kern/125009
Reviewed by: bde, trasz
Approved by: pjd (ZFS part)
- better ACL caching and speedup of ACL permission checks
- faster handling of stat()
- lowered mutex contention in the read/writer lock (rrwlock)
- several related bugfixes
Detailed information (OpenSolaris onnv changesets and Bug IDs):
9749:105f407a2680
6802734 Support for Access Based Enumeration (not used on FreeBSD)
6844861 inconsistent xattr readdir behavior with too-small buffer
9866:ddc5f1d8eb4e
6848431 zfs with rstchown=0 or file_chown_self privilege allows user to "take" ownership
9981:b4907297e740
6775100 stat() performance on files on zfs should be improved
6827779 rrwlock is overly protective of its counters
10143:d2d432dfe597
6857433 memory leaks found at: zfs_acl_alloc/zfs_acl_node_alloc
6860318 truncate() on zfsroot succeeds when file has a component of its path set without access permission
10232:f37b85f7e03e
6865875 zfs sometimes incorrectly giving search access to a dir
10250:b179ceb34b62
6867395 zpool_upgrade_007_pos testcase panic'd with BAD TRAP: type=e (#pf Page fault)
10269:2788675568fd
6868276 zfs_rezget() can be hazardous when znode has a cached ACL
10295:f7a18a1e9610
6870564 panic in zfs_getsecattr
Approved by: delphij (mentor)
Obtained from: OpenSolaris (multiple Bug IDs)
MFC after: 2 weeks
This provides a noticeable write speedup, especially on pools with
less than 30% of free space.
Detailed information (OpenSolaris onnv changesets and Bug IDs):
11146:7e58f40bcb1c
6826241 Sync write IOPS drops dramatically during TXG sync
6869229 zfs should switch to shiny new metaslabs more frequently
11728:59fdb3b856f6
6918420 zdb -m has issues printing metaslab statistics
12047:7c1fcc8419ca
6917066 zfs block picking can be improved
Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6826241, 6869229, 6918420, 6917066)
MFC after: 2 weeks
resetting needfree
needfree is checked at the very start of arc_reclaim_needed.
This change makes code easier to follow and maintain in face of
potential changes in arc_reclaim_needed.
Also, put the whole sub-block under _KERNEL because needfree can be set
only in kernel code.
To do: rename needfree to something else to avoid confusion with
OpenSolaris global variable of the same name which is used in the same
code, but has different meaning (page deficit).
Note: I have an impression that locking around accesses to this variable
as well as mutual notifications between arc_reclaim_thread and
arc_lowmem are not proper.
MFC after: 1 week
The changes do not affect FreeBSD code because
zfs_znode_move(), cleanlocks() and cleanshares() are not used.
OpenSolaris onnv changeset: 9788:f660bc44f2e8, 9909:aa280f585a3e
Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6843700, 6790232)
MFC after: 7 weeks
OpenSolaris freemem has the same meaning as our v_free_count +
v_cache_count.
Obtained from: Artem Belevich <fbsdlist@src.cx>,
Peter Jeremy <peterjeremy@acm.org>
Discussed with: pjd
MFC after: 2 weeks
- fixes panics when Solaris/OpenSolaris pools that contain files
uploaded with the SMB protocol are accessed
Enable setting/unsetting the sharesmb property (dummy action)
- allows users who import pools from Solaris/Opensolaris to unset
the sharesmb property and get rid of annoying messages
PR: kern/145778, kern/148709
Approved by: pjd, delphij (mentor)
MFC after: 7 weeks
implementation in 8.0 and later as its flags field does not hold dynamic
state such as waiters flags, but is only modified in lockinit() aside
from VN_LOCK_*().
Discussed with: attilio
changed to defer the setting of VN_LOCK_ASHARE() (which clears LK_NOSHARE
in the vnode lock's flags) until after they had determined if the vnode was
a FIFO. This occurs after the vnode has been inserted into a VFS hash or some
similar table, so it is possible for another thread to find this vnode via
vget() on an i-node number and block on the vnode lock. If the lockmgr
interlock (vnode interlock for vnode locks) is not held when clearing the
LK_NOSHARE flag, then the lk_flags field can be clobbered. As a result
the thread blocked on the vnode lock may never get woken up. Fix this by
holding the vnode interlock while modifying the lock flags in this case.
MFC after: 3 days
in Solaris 10 updates 141445-09 and 142901-14.
Detailed information:
(OpenSolaris revisions and Bug IDs, Solaris 10 patch numbers)
7844:effed23820ae
6755435 zfs_open() and zfs_close() needs to use ZFS_ENTER/ZFS_VERIFY_ZP (141445-01)
7897:e520d8258820
6748436 inconsistent zpool.cache in boot_archive could panic a zfs root filesystem upon boot-up (141445-01)
7965:b795da521357
6740164 zpool attach can create an illegal root pool (141909-02)
8084:b811cc60d650
6769612 zpool_import() will continue to write to cachefile even if altroot is set (N/A)
8121:7fd09d4ebd9c
6757430 want an option for zdb to disable space map loading and leak tracking (141445-01)
8129:e4f45a0bfbb0
6542860 ASSERT: reason != VDEV_LABEL_REMOVE||vdev_inuse(vd, crtxg, reason, 0) (141445-01)
8188:fd00c0a81e80
6761100 want zdb option to select older uberblocks (141445-01)
8190:6eeea43ced42
6774886 zfs_setattr() won't allow ndmp to restore SUNWattr_rw (141445-01)
8225:59a9961c2aeb
6737463 panic while trying to write out config file if root pool import fails (141445-01)
8227:f7d7be9b1f56
6765294 Refactor replay (141445-01)
8228:51e9ca9ee3a5
6572357 libzfs should do more to avoid mnttab lookups (141909-01)
6572376 zfs_iter_filesystems and zfs_iter_snapshots get objset stats twice (141909-01)
8241:5a60f16123ba
6328632 zpool offline is a bit too conservative (141445-01)
6739487 ASSERT: txg <= spa_final_txg due to scrub/export race (141445-01)
6767129 ASSERT: cvd->vdev_isspare, in spa_vdev_detach() (141445-01)
6747698 checksum failures after offline -t / export / import / scrub (141445-01)
6745863 ZFS writes to disk after it has been offlined (141445-01)
6722540 50% slowdown on scrub/resilver with certain vdev configurations (141445-01)
6759999 resilver logic rewrites ditto blocks on both source and destination (141445-01)
6758107 I/O should never suspend during spa_load() (141445-01)
6776548 codereview(1) runs off the page when faced with multi-line comments (N/A)
6761406 AMD errata 91 workaround doesn't work on 64-bit systems (141445-01)
8242:e46e4b2f0a03
6770866 GRUB/ZFS should require physical path or devid, but not both (141445-01)
8269:03a7e9050cfd
6674216 "zfs share" doesn't work, but "zfs set sharenfs=on" does (141445-01)
6621164 $SRC/cmd/zfs/zfs_main.c seems to have a syntax error in the translation note (141445-01)
6635482 i18n problems in libzfs_dataset.c and zfs_main.c (141445-01)
6595194 "zfs get" VALUE column is as wide as NAME (141445-01)
6722991 vdev_disk.c: error checking for ddi_pathname_to_dev_t() must test for NODEV (141445-01)
6396518 ASSERT strings shouldn't be pre-processed (141445-01)
8274:846b39508aff
6713916 scrub/resilver needlessly decompress data (141445-01)
8343:655db2375fed
6739553 libzfs_status msgid table is out of sync (141445-01)
6784104 libzfs unfairly rejects numerical values greater than 2^63 (141445-01)
6784108 zfs_realloc() should not free original memory on failure (141445-01)
8525:e0e0e525d0f8
6788830 set large value to reservation cause core dump (141445-01)
6791064 want sysevents for ZFS scrub (141445-01)
6791066 need to be able to set cachefile on faulted pools (141445-01)
6791071 zpool_do_import() should not enable datasets on faulted pools (141445-01)
6792134 getting multiple properties on a faulted pool leads to confusion (141445-01)
8547:bcc7b46e5ff7
6792884 Vista clients cannot access .zfs (141445-01)
8632:36ef517870a3
6798384 It can take a village to raise a zio (141445-01)
8636:7e4ce9158df3
6551866 deadlock between zfs_write(), zfs_freesp(), and zfs_putapage() (141909-01)
6504953 zfs_getpage() misunderstands VOP_GETPAGE() interface (141909-01)
6702206 ZFS read/writer lock contention throttles sendfile() benchmark (141445-01)
6780491 Zone on a ZFS filesystem has poor fork/exec performance (141445-01)
6747596 assertion failed: DVA_EQUAL(BP_IDENTITY(&zio->io_bp_orig), BP_IDENTITY(zio->io_bp))); (141445-01)
8692:692d4668b40d
6801507 ZFS read aggregation should not mind the gap (141445-01)
8697:e62d2612c14d
6633095 creating a filesystem with many properties set is slow (141445-01)
8768:dfecfdbb27ed
6775697 oracle crashes when overwriting after hitting quota on zfs (141909-01)
8811:f8deccf701cf
6790687 libzfs mnttab caching ignores external changes (141445-01)
6791101 memory leak from libzfs_mnttab_init (141445-01)
8845:91af0d9c0790
6800942 smb_session_create() incorrectly stores IP addresses (N/A)
6582163 Access Control List (ACL) for shares (141445-01)
6804954 smb_search - shortname field should be space padded following the NULL terminator (N/A)
6800184 Panic at smb_oplock_conflict+0x35() (N/A)
8876:59d2e67b4b65
6803822 Reboot after replacement of system disk in a ZFS mirror drops to grub> prompt (141445-01)
8924:5af812f84759
6789318 coredump when issue zdb -uuuu poolname/ (141445-01)
6790345 zdb -dddd -e poolname coredump (141445-01)
6797109 zdb: 'zdb -dddddd pool_name/fs_name inode' coredump if the file with inode was deleted (141445-01)
6797118 zdb: 'zdb -dddddd poolname inum' coredump if I miss the fs name (141445-01)
6803343 shareiscsi=on failed, iscsitgtd failed request to share (141445-01)
9030:243fd360d81f
6815893 hang mounting a dataset after booting into a new boot environment (141445-01)
9056:826e1858a846
6809691 'zpool create -f' no longer overwrites ufs infomation (141445-01)
9179:d8fbd96b79b3
6790064 zfs needs to determine uid and gid earlier in create process (141445-01)
9214:8d350e5d04aa
6604992 forced unmount + being in .zfs/snapshot/<snap1> = not happy (141909-01)
6810367 assertion failed: dvp->v_flag & VROOT, file: ../../common/fs/gfs.c, line: 426 (141909-01)
9229:e3f8b41e5db4
6807765 ztest_dsl_dataset_promote_busy needs to clean up after ENOSPC (141445-01)
9230:e4561e3eb1ef
6821169 offlining a device results in checksum errors (141445-01)
6821170 ZFS should not increment error stats for unavailable devices (141445-01)
6824006 need to increase issue and interrupt taskqs threads in zfs (141445-01)
9234:bffdc4fc05c4
6792139 recovering from a suspended pool needs some work (141445-01)
6794830 reboot command hangs on a failed zfs pool (141445-01)
9246:67c03c93c071
6824062 System panicked in zfs_mount due to NULL pointer dereference when running btts and svvs tests (141909-01)
9276:a8a7fc849933
6816124 System crash running zpool destroy on broken zpool (141445-03)
9355:09928982c591
6818183 zfs snapshot -r is slow due to set_snap_props() doing txg_wait_synced() for each new snapshot (141445-03)
9391:413d0661ef33
6710376 log device can show incorrect status when other parts of pool are degraded (141445-03)
9396:f41cf682d0d3 (part already merged)
6501037 want user/group quotas on ZFS (141445-03)
6827260 assertion failed in arc_read(): hdr == pbuf->b_hdr (141445-03)
6815592 panic: No such hold X on refcount Y from zfs_znode_move (141445-03)
6759986 zfs list shows temporary %clone when doing online zfs recv (141445-03)
9404:319573cd93f8
6774713 zfs ignores canmount=noauto when sharenfs property != off (141445-03)
9412:4aefd8704ce0
6717022 ZFS DMU needs zero-copy support (141445-03)
9425:e7ffacaec3a8
6799895 spa_add_spares() needs to be protected by config lock (141445-03)
6826466 want to post sysevents on hot spare activation (141445-03)
6826468 spa 'allowfaulted' needs some work (141445-03)
6826469 kernel support for storing vdev FRU information (141445-03)
6826470 skip posting checksum errors from DTL regions of leaf vdevs (141445-03)
6826471 I/O errors after device remove probe can confuse FMA (141445-03)
6826472 spares should enjoy some of the benefits of cache devices (141445-03)
9443:2a96d8478e95
6833711 gang leaders shouldn't have to be logical (141445-03)
9463:d0bd231c7518
6764124 want zdb to be able to checksum metadata blocks only (141445-03)
9465:8372081b8019
6830237 zfs panic in zfs_groupmember() (141445-03)
9466:1fdfd1fed9c4
6833162 phantom log device in zpool status (141445-03)
9469:4f68f041ddcd
6824968 add ZFS userquota support to rquotad (141445-03)
9470:6d827468d7b5
6834217 godfather I/O should reexecute (141445-03)
9480:fcff33da767f
6596237 Stop looking and start ganging (141909-02)
9493:9933d599bc93
6623978 lwb->lwb_buf != NULL, file ../../../uts/common/fs/zfs/zil.c, line 787, function zil_lwb_commit (141445-06)
9512:64cafcbcc337
6801810 Commit of aligned streaming rewrites to ZIL device causes unwanted disk reads (N/A)
9515:d3b739d9d043
6586537 async zio taskqs can block out userland commands (142901-09)
9554:787363635b6a
6836768 zfs_userspace() callback has no way to indicate failure (N/A)
9574:1eb6a6ab2c57
6838062 zfs panics when an error is encountered in space_map_load() (141909-02)
9583:b0696cd037cc
6794136 Panic BAD TRAP: type=e when importing degraded zraid pool. (141909-03)
9630:e25a03f552e0
6776104 "zfs import" deadlock between spa_unload() and spa_async_thread() (141445-06)
9653:a70048a304d1
6664765 Unable to remove files when using fat-zap and quota exceeded on ZFS filesystem (141445-06)
9688:127be1845343
6841321 zfs userspace / zfs get userused@ doesn't work on mounted snapshot (N/A)
6843069 zfs get userused@S-1-... doesn't work (N/A)
9873:8ddc892eca6e
6847229 assertion failed: refcount_count(&tx->tx_space_written) + delta <= tx->tx_space_towrite in dmu_tx.c (141445-06)
9904:d260bd3fd47c
6838344 kernel heap corruption detected on zil while stress testing (141445-06)
9951:a4895b3dd543
6844900 zfs_ioc_userspace_upgrade leaks (N/A)
10040:38b25aeeaf7a
6857012 zfs panics on zpool import (141445-06)
10000:241a51d8720c
6848242 zdb -e no longer works as expected (N/A)
10100:4a6965f6bef8
6856634 snv_117 not booting: zfs_parse_bootfs: error2 (141445-07)
10160:a45b03783d44
6861983 zfs should use new name <-> SID interfaces (N/A)
6862984 userquota commands can hang (141445-06)
10299:80845694147f
6696858 zfs receive of incremental replication stream can dereference NULL pointer and crash (N/A)
10302:a9e3d1987706
6696858 zfs receive of incremental replication stream can dereference NULL pointer and crash (fix lint) (N/A)
10575:2a8816c5173b (partial merge)
6882227 spa_async_remove() shouldn't do a full clear (142901-14)
10800:469478b180d9
6880764 fsync on zfs is broken if writes are greater than 32kb on a hard crash and no log attached (142901-09)
6793430 zdb -ivvvv assertion failure: bp->blk_cksum.zc_word[2] == dmu_objset_id(zilog->zl_os) (N/A)
10801:e0bf032e8673 (partial merge)
6822816 assertion failed: zap_remove_int(ds_next_clones_obj) returns ENOENT (142901-09)
10810:b6b161a6ae4a
6892298 buf->b_hdr->b_state != arc_anon, file: ../../common/fs/zfs/arc.c, line: 2849 (142901-09)
10890:499786962772
6807339 spurious checksum errors when replacing a vdev (142901-13)
11249:6c30f7dfc97b
6906110 bad trap panic in zil_replay_log_record (142901-13)
6906946 zfs replay isn't handling uid/gid correctly (142901-13)
11454:6e69bacc1a5a
6898245 suspended zpool should not cause rest of the zfs/zpool commands to hang (142901-10)
11546:42ea6be8961b (partial merge)
6833999 3-way deadlock in dsl_dataset_hold_ref() and dsl_sync_task_group_sync() (142901-09)
Discussed with: pjd
Approved by: delphij (mentor)
Obtained from: OpenSolaris (multiple Bug IDs)
MFC after: 2 months
The fix is a partial import and merge of OpenSolaris onnv revisions
8227:f7d7be9b1f56 and 9292:e112194b5b73.
Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6798298)
MFC after: 3 days
If a disk was missing on pool load or import and was present on the next pool
load or import, a resilver wasn't started automatically and ZFS reported all disks
as ONLINE and healthy. Then, when another disk died, the pool became inaccessible,
because in a 2-way mirror or RAIDZ1 two vdevs were then out of sync.
To fix the problem, start resilver automatically on pool load or import.
Obtained from: OpenSolaris
MFC after: 3 days
The arcstats.l2_write_bytes_written kstat counter introduced
in r205231 duplicated the vendor's arcstats.l2_write_bytes counter
imported in r208373 (OpenSolaris revision 8582:df9361868dbe).
Approved by: pjd, delphij (mentor)
MFC after: 3 days
This eliminates the only place where we can sleep when calling zio_interrupt().
As a side-effect this can actually improve performance a little as we
allocate one less thing for every I/O.
Prodded by: kib
MFC after: 1 week
avoid calling zio_interrupt() from geom_up thread context. It turns out that
when provider is forcibly removed from the system and we kill worker thread
there can still be some ZIOs pending. To complete pending ZIOs when there is
no worker thread anymore we still have to call zio_interrupt() from geom_up
context. To avoid this race just remove use of worker threads altogether.
This should be more or less fine: I had thought that zio_interrupt()
does more work, but it only makes a small UMA allocation with M_WAITOK.
It also saves one context switch per I/O request.
PR: kern/145339
Reported by: Alex Bakhtin <Alex.Bakhtin@gmail.com>
MFC after: 1 week
It includes the following changes:
- parallel reads in traversal code (Bug ID 6333409)
- faster traversal for zfs send (Bug ID 6418042)
- traversal code cleanup (Bug ID 6725675)
- fix for two scrub related bugs (Bug ID 6729696, 6730101)
- fix assertion in dbuf_verify (Bug ID 6752226)
- fix panic during zfs send with i/o errors (Bug ID 6577985)
- replace P2CROSS with P2BOUNDARY (Bug ID 6725680)
List of OpenSolaris Bug IDs:
6333409, 6418042, 6757112, 6725668, 6725675, 6725680,
6725698, 6729696, 6730101, 6752226, 6577985, 6755042
Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (multiple Bug IDs)
MFC after: 1 week
ZFS still likes to open all vdevs, close them and open them again,
which in turn provokes taste traffic anyway.
I don't know of any clean way to fix it, so do it the hard way - if we can't
open a provider for writing, just retry 5 times with 0.5-second pauses. This should
eliminate accidental races caused by other classes tasting providers created on
top of our vdevs.
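The retry described above can be sketched as follows. This is only an illustration of the idea, not the kernel code: try_open is a hypothetical callable standing in for the attempt to open the provider for writing, and the pause is assumed to be in seconds.

```python
import time


def open_for_writing(try_open, attempts=5, pause=0.5, sleep=time.sleep):
    """Retry a write-open a few times before giving up.

    try_open is a hypothetical callable that returns 0 on success or an
    errno-style integer on failure. The last error is returned.
    """
    error = 0
    for attempt in range(attempts):
        error = try_open()
        if error == 0:
            break
        # Give a concurrent taster time to close the provider,
        # but do not sleep after the final attempt.
        if attempt < attempts - 1:
            sleep(pause)
    return error
```

Passing the sleep function as a parameter is a choice made for this sketch so that the loop can be exercised without real delays.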
MFC after: 3 days
Reported by: James R. Van Artsdalen <james-freebsd-fs2@jrv.org>
Reported by: Yuri Pankov <yuri.pankov@gmail.com>
- Enable zfs_ace_byteswap() on FreeBSD as it works just fine (tested between
amd64 and sparc64 in both directions by Michael Moll).
PR: 146272
Approved by: mm, pjd
Obtained from: OpenSolaris (onnv rev. 8283:1ca59f393041; Bug ID 6764193) [1]
MFC after: 3 days
The order of operations is the following:
1. Try to open vdev by remembered path and guid.
2. If 1 failed, try to find vdev which guid matches and ignore the path.
3. If 2 failed this means either that the vdev we're looking for is gone
or that pool is being created and vdev doesn't contain proper guid yet.
To be able to handle pool creation we open vdev by path anyway.
Because of 3 it is possible that we open the wrong vdev on import, which can lead
to confusion.
The solution for this is to check spa_load_state. On pool creation it will be
equal to SPA_LOAD_NONE and we can open vdev only by path immediately and if it
is not equal to SPA_LOAD_NONE we first open by path+guid and when that fails,
we open by guid. We no longer open wrong vdev on import.
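A minimal sketch of the selection logic described above, under stated assumptions: providers are modelled as plain dictionaries with 'path' and 'guid' keys, LoadState.NONE stands in for SPA_LOAD_NONE, and pick_vdev is a hypothetical name, not a function from the ZFS sources.

```python
from enum import Enum


class LoadState(Enum):
    NONE = 0   # stands in for SPA_LOAD_NONE: the pool is being created
    OTHER = 1  # any other load state, such as an import


def pick_vdev(providers, path, guid, load_state):
    """Return the provider to open, or None if nothing matches."""
    if load_state is LoadState.NONE:
        # Pool creation: the on-disk guid is not valid yet, trust the path.
        for p in providers:
            if p["path"] == path:
                return p
        return None
    # Load or import: prefer an exact path+guid match...
    for p in providers:
        if p["path"] == path and p["guid"] == guid:
            return p
    # ...then fall back to the guid alone, never to the path alone.
    for p in providers:
        if p["guid"] == guid:
            return p
    return None
```

With two disks whose device names were swapped since the pool was last seen, the import case follows the guid to the new path instead of opening whatever now sits at the remembered path.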
MFC after: 2 weeks
and 16 for metadata
- export L2ARC tunables as sysctls
- add several kstats to track L2ARC state more precisely
- avoid holding a contended lock when atomically incrementing a
contended counter (no lock protection needed for atomics)
providers for writing provokes huge traffic related to taste events sent
by GEOM on close. This can lead to various problems with opening GEOM
providers that are created on top of other GEOM providers.
Reported by: Kurt Touet <ktouet@gmail.com>, mr
Tested by: mr, Baginski Darren <kickbsd@ya.ru>
MFC after: 2 weeks
revision 200726 and 200727). It looks like the two revisions
were not applied in the right sequence; I found this when comparing
with the OpenSolaris code.
MFC after: 3 days
Reviewed by: mm@
making it possible for zpools created on OpenSolaris 2009.06 to be used
on FreeBSD.
PR: kern/141800
Submitted by: mm
Reviewed by: pjd, trasz
Obtained from: OpenSolaris
MFC after: 2 weeks
both to not panic when fsync(2) is called for fifo on zfs
filedescriptor, and to actually fsync fifo inode to permanent storage.
PR: kern/141177
Reviewed by: pjd
MFC after: 1 week
for attaching when there is no metadata yet.
Before r200125 the order of looking for providers was wrong. It was:
1. Find provider by name.
2. Find provider by guid.
3. Find provider by name and guid.
Where it should have been:
1. Find provider by name and guid.
2. Find provider by guid.
3. Find provider by name.
MFC after: 1 week
calling scrub when pool is in a degraded state. It will try to taste ZVOLs,
which will lead to deadlock, as ZVOL will try to acquire the same locks as
replace/scrub is holding already.
We can't simply skip providers based on their GEOM class, because a ZVOL can have
providers built on top of it and we need to skip those as well.
We do it by asking for the ZFS::iszvol attribute. Any ZVOL-based provider will give
us a positive answer and we have to skip those providers.
This removes the possibility of creating ZFS pools on top of ZVOLs, but that is
not very useful anyway.
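The attribute-based skip can be illustrated with a toy model. Everything here except the ZFS::iszvol attribute name is an assumption made for the sketch: the Provider class, its below link to the provider it is stacked on, and the tasteable helper are hypothetical.

```python
class Provider:
    """Toy GEOM provider that may be stacked on another provider."""

    def __init__(self, name, geom_class, below=None):
        self.name = name
        self.geom_class = geom_class
        self.below = below  # provider this one is built on, if any

    def get_attr(self, attr):
        # A ZVOL answers the query itself; other classes pass it down.
        if attr == "ZFS::iszvol" and self.geom_class == "ZVOL":
            return 1
        if self.below is not None:
            return self.below.get_attr(attr)
        return None  # nobody in the stack answered


def tasteable(providers):
    """Names of the providers that are safe to taste."""
    return [p.name for p in providers if p.get_attr("ZFS::iszvol") != 1]
```

A provider whose link to the ZVOL is not visible in the stack, such as a file-backed MD device, has no below link in this model and is therefore not skipped.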
I believe a deadlock is still possible in some very complex situations, such as
an MD provider on top of a UFS file on top of a ZVOL. When we try to replace a
dead component in the pool that the ZVOL is based on, there might be a
deadlock when ZFS tries to taste the MD provider. There is no easy way to detect
that, but it isn't very common.
MFC after: 1 week
zfs_access() instead of vaccess() in this case as well.
- If VADMIN is specified with another V* flag (unlikely) call both
zfs_access() and vaccess() after splitting V* flags.
This fixes "dirtying snapshot!" panic.
PR: kern/139806
Reported by: Carl Chave <carl@chave.us>
In co-operation with: jh
MFC after: 3 days
received, we don't have to do it on every ENXIO error in I/O path.
Solaris has no GEOM so they have to handle it in a less clean way.
MFC after: 3 days
to set ownership and mode in the same setattr operation, the mode was
overwritten by secpolicy_vnode_setattr().
PR: kern/118320
Submitted by: Mark Thompson <info-gentoo@mark.thompson.bz>
MFC after: 3 days
unmount. In that case we cannot depend on the proper order of invalidating
vnodes, so we have to free resources when we have a chance.
PR: kern/139062
Reported by: trasz
MFC after: 3 days
well with forced unmounts when GFS vnodes are referenced.
- Make other preparations to GFS for forced unmounts.
PR: kern/139062
Reported by: trasz
MFC after: 3 days
following vnops will fail. This is very important, because without this change
vnode could be reclaimed at any point, even if we increased usecount. The only
way to ensure that vnode won't be reclaimed was to lock it, which would be very
hard to do in ZFS without changing a lot of code. With this change simply
increasing usecount is enough to be sure vnode won't be reclaimed from under
us. To be precise it can still be reclaimed but we won't be able to see it,
because every try to enter ZFS through VFS will result in EIO.
The only function that cannot return EIO, because it is needed for vflush() is
zfs_root(). Introduce ZFS_ENTER_NOERROR() macro that only locks
z_teardown_lock and never returns EIO.
MFC after: 3 days
gid to set group ownership and not process gid.
This was overlooked during v6 -> v13 switch.
PR: kern/139076
Reported by: Sean Winn <sean@gothic.net.au>
MFC after: 3 days
df(1) and mount(8) output. This is a bit similar to OpenSolaris and follows
ZFS route of not listing snapshots by default with 'zfs list' command.
- Add UPDATING entry to note that ZFS snapshots are no longer visible in
mount(8) and df(1) output by default.
Reviewed by: kib
MFC after: 3 days
by just returning EOPNOTSUPP. This will allow NFS server to fall back to
regular READDIR.
Note that converting an inode number to a snapshot's vnode is an expensive operation.
Snapshots are stored in an AVL tree, but based on their names, not inode numbers,
so to convert an inode to a snapshot vnode we have to iterate over all snapshots.
This is not a problem in OpenSolaris, because in their READDIRPLUS
implementation they use VOP_LOOKUP() on d_name, instead of VFS_VGET() on
d_fileno as we do.
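The cost difference can be shown with a small sketch. A Python dict keyed by snapshot name stands in for the name-ordered AVL tree here, and both helper names are hypothetical.

```python
def snapshot_by_name(snapshots, name):
    """Keyed lookup: this is what the name-indexed structure is built for."""
    return snapshots.get(name)


def snapshot_by_inode(snapshots, ino):
    """Reverse lookup: no index on inode numbers, so scan every entry."""
    for name, snap_ino in snapshots.items():
        if snap_ino == ino:
            return name
    return None
```

The first function is what a lookup on d_name amounts to; the second is what a lookup on d_fileno requires, and its cost grows with the number of snapshots.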
PR: kern/125149
Reported by: Weldon Godfrey <wgodfrey@ena.com>
Analysis by: Jaakko Heinonen <jh@saunalahti.fi>
MFC after: 3 days
the same entry twice. This bug is not fixed yet, and it leads to a situation
where the kernel panics when we try to access the corrupted directory.
Until the bug is properly fixed, try to recover from it and log that it
happened.
Reported by: marck
OpenSolaris bug: 6709336
MFC after: 3 days
- Be prepared for dbuf to disappear in zfs_reclaim_complete() and check if
z_dbuf field is NULL - this might happen in case of rollback or forced
unmount between zfs_freebsd_reclaim() and zfs_reclaim_complete().
- On forced unmount wait for all znodes to be destroyed - destruction can be
done asynchronously via zfs_reclaim_complete().
MFC after: 1 week
NULL, but also can point to dead vnode, take that into account.
PR: kern/132068
Reported by: Edward Fisk <7ogcg7g02@sneakemail.com>, kris
Fix based on patch from: Jaakko Heinonen <jh@saunalahti.fi>
MFC after: 1 week
commands executed by unprivileged users. Action is not really taken, but it is
logged to pool history, which might be confusing.
Reported by: Denis Ahrens <denis@h3q.com>
MFC after: 3 days
Such situation is not supported.
This problem was triggered by something like this:
# zpool create tank da0
# zfs snapshot tank@snap
# cd /tank/.zfs/snapshot/snap (this will mount the snapshot)
# cd
# mount -u nosuid /tank/.zfs/snapshot/snap (refcount leak)
# zpool export tank
cannot export 'tank': pool is busy
MFC after: 1 week
we will fail to unmount it, but it won't be removed from the tree,
so in that case there is no need to reinsert it.
This fixes a panic reproducible in the following steps:
# zfs create tank/foo
# zfs snapshot tank/foo@snap
# cd /tank/foo/.zfs/snapshot/snap
# umount /tank/foo
panic: avl_find() succeeded inside avl_add()
Reported by: trasz
MFC after: 3 days
provider is closed should be ok.
When administrator requests to change ZVOL size do it immediately if ZVOL
is closed or do it on last ZVOL close.
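A minimal sketch of the deferred resize described above. The Zvol class and its fields are hypothetical and only model the open count and the pending size.

```python
class Zvol:
    """Toy ZVOL that applies a size change only while it is closed."""

    def __init__(self, size):
        self.size = size
        self.open_count = 0
        self.pending_size = None

    def open(self):
        self.open_count += 1

    def close(self):
        self.open_count -= 1
        # On last close, apply a resize that was requested while open.
        if self.open_count == 0 and self.pending_size is not None:
            self.size = self.pending_size
            self.pending_size = None

    def resize(self, new_size):
        if self.open_count == 0:
            self.size = new_size          # closed: change it right away
        else:
            self.pending_size = new_size  # open: defer until last close
```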
PR: kern/136942
Requested by: Bernard Buri <bsd@ask-us.at>
MFC after: 1 week
(..), calling readdir and looking for previous directory inode. In case of
.zfs/ directory this doesn't work, because .zfs/ is hidden by default, so it
won't be visible in readdir output.
Fix this by implementing VPTOCNP for snapshot directories, so __getcwd()
doesn't fail and getcwd() doesn't have to use readdir method.
This fixes /bin/pwd from within .zfs/snapshot/<name>/.
Suggested by: kib
Approved by: re (rwatson)
replace it with wrappers around our taskqueue(9).
To make this possible, implement the taskqueue_member() function, which returns 1
if the given thread was created by the given taskqueue.
Approved by: re (kib)
initialized. Also destroy /dev/zfs before doing other deinitializations.
- Initialization through taskq is no longer needed and there is a race
where one of the zpool/zfs commands loads zfs.ko and tries to do some work
immediately, but /dev/zfs is not there yet.
Reported by: pav
Approved by: re (kib)
panic in zfs_fuid_create_cred() when userid is negative. It is
converted to an unsigned value, which makes the IS_EPHEMERAL() macro
incorrectly report that this is an ephemeral ID. The most reasonable
solution for now is to always report that the given ID is not ephemeral.
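The arithmetic behind the misreport can be worked through in a short sketch. It assumes a 32-bit ID and that the original macro compares the ID against a MAXUID of 2^31 - 1; both are assumptions of this example, and the function names are hypothetical.

```python
MAXUID = 2**31 - 1  # assumed threshold above which an ID counts as ephemeral


def to_uint32(value):
    """Reinterpret a possibly negative integer as an unsigned 32-bit value."""
    return value & 0xFFFFFFFF


def is_ephemeral_old(uid):
    """Assumed original check: a negative uid wraps to a huge unsigned value."""
    return to_uint32(uid) > MAXUID


def is_ephemeral_fixed(uid):
    """Behaviour described above: never report an ID as ephemeral."""
    return False
```

For a userid of -1 the unsigned conversion gives 4294967295, which is above the assumed threshold, so the old check wrongly reports an ephemeral ID.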
PR: kern/132337
Submitted by: Matthew West <freebsd@r.zeeb.org>
Tested by: Thomas Backman <serenity@exscape.org>, Michael Reifenberger <mike@reifenberger.com>
Approved by: re (kib)
MFC after: 2 weeks
doesn't exist and user doesn't have write access to the file.
Without this fix, it returns bogus value instead of 0. For some
reason this didn't manifest on my kernel compiled with -O0.
PR: kern/136601
Submitted by: Jaakko Heinonen <jh at saunalahti dot fi>
Approved by: re (kib)
this change, ZFS uses SunOS Alternate Data Streams semantics - each
EA has its own permissions, which are set at EA creation time
and - unlike SunOS - invisible to the user and impossible to change.
From the user point of view, it's just broken: sometimes access
is granted when it shouldn't be, sometimes it's denied when
it shouldn't be.
This patch makes it behave just like UFS, i.e. depend on current
file permissions. Also, it fixes returned error codes (ENOATTR
instead of ENOENT) and makes listextattr(2) return 0 instead
of EPERM where there is no EA directory (i.e. the file never had
any EA).
Reviewed by: pjd (idea, not actual code)
Approved by: re (kib)
vn_open_cred invocations shall not audit namei path.
In particular, specify VN_OPEN_NOAUDIT for dotdot lookup performed by
default implementation of vop_vptocnp, and for the open done for core
file. vn_fullpath is called from the audit code, and vn_open there needs
to disable audit to avoid infinite recursion. The core file is created on
return to user mode, that, in particular, happens during syscall return.
The creation of the core file is audited by direct calls, and we do not
want to overwrite audit information for syscall.
Reported, reviewed and tested by: rwatson
by prefetch than helped. On i386 systems and systems with less than 4GB,
prefetch is now disabled by default. I've added a prefetch enable tunable, to
enable prefetching for those systems. The prefetch disable tunable will continue
to unconditionally disable prefetching.
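The resulting default can be sketched as a small decision function. The parameter names are hypothetical stand-ins for the two tunables, and treating "4GB" as 4 * 2^30 bytes of physical memory is an assumption of this sketch.

```python
FOUR_GB = 4 * 2**30


def prefetch_enabled(arch, physmem_bytes, enable_tunable=False,
                     disable_tunable=False):
    """Decide whether prefetch is on, following the rules described above."""
    if disable_tunable:
        return False  # the disable tunable always wins
    if enable_tunable:
        return True   # explicit opt-in for small or i386 systems
    # Default: off on i386 and on systems with less than 4GB.
    return not (arch == "i386" or physmem_bytes < FOUR_GB)
```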
about removing a few #ifdefs and providing compatibility wrappers and
VOP implementations to get and set an ACL; ZFS does ACL enforcement all
by itself.
Note that the VOPs are ifdefed out for now, so this change should be
a no-op.
Reviewed by: pjd
the VFS. Now all the VFS_* functions and related parts don't want the
context as long as it always refers to curthread.
In some points, in particular when dealing with VOPs and functions living
in the same namespace (eg. vflush) which still need to be converted,
pass curthread explicitly in order to retain the old behaviour.
Such loose ends will be fixed ASAP.
While here fix a bug: now, UFS_EXTATTR can be compiled alone without the
UFS_EXTATTR_AUTOSTART option.
The VFS KPI is heavily changed by this commit, so third-party modules need
to be recompiled. Bump __FreeBSD_version in order to signal such
situation.
This is based on a fix that went in to OpenSolaris on March 9th. However, it uses a dedicated
thread instead of a Solaris' taskq to avoid doing a blocking memory allocation with the vnode
interlock held.
This fixes a long-time deadlock in ZFS. This is not, strictly speaking, an LOR. The spa_zio
thread releases a vnode, this calls in to vn_reclaim which in turn needs to acquire range locks
to sync dirty data out to disk. The range locks are already held by a user-level process that is
waiting on a condition variable which a spa_zio thread is supposed to signal. The
process could not be signalled because the spa_zio thread could not proceed.
The nature of this problem was not apparent due to ZFS locks opting out of witness which meant
that DDB did not know about the locks that were held by ZFS.
Reviewed by: pjd
MFC after: 7 days
the removal of NQNFS, but was left in in case it was required for NFSv4.
Since our new NFSv4 client and server can't use it for their
requirements, GC the old mechanism, as well as other unused lease-
related code and interfaces.
Due to its impact on kernel programming and binary interfaces, this
change should not be MFC'd.
Proposed by: jeff
Reviewed by: jeff
Discussed with: rmacklem, zach loafman @ isilon
directory for a znode. When the directory already exists, it returns a
referenced but unlocked vnode. When a directory does not yet exist, it
calls zfs_make_xattrdir() to create a new one. zfs_make_xattrdir() returns
the vnode both referenced and locked and zfs_get_xattrdir() was leaking
this vnode lock to its callers. Fix this by dropping the vnode lock if
zfs_make_xattrdir() successfully creates a new extended attribute
directory.
Reviewed by: pjd
filesystem supports additional operations using shared vnode locks.
Currently this is used to enable shared locks for open() and close() of
read-only file descriptors.
- When an ISOPEN namei() request is performed with LOCKSHARED, use a
shared vnode lock for the leaf vnode only if the mount point has the
extended shared flag set.
- Set LOCKSHARED in vn_open_cred() for requests that specify O_RDONLY but
not O_CREAT.
- Use a shared vnode lock around VOP_CLOSE() if the file was opened with
O_RDONLY and the mountpoint has the extended shared flag set.
- Adjust md(4) to upgrade the vnode lock on the vnode it gets back from
vn_open() since it now may only have a shared vnode lock.
- Don't enable shared vnode locks on FIFO vnodes in ZFS and UFS since
FIFO's require exclusive vnode locks for their open() and close()
routines. (My recent MPSAFE patches for UDF and cd9660 already included
this change.)
- Enable extended shared operations on UFS, cd9660, and UDF.
Submitted by: ups
Reviewed by: pjd (ZFS bits)
MFC after: 1 month
Inside the kernel, the minor() function was responsible for obtaining
the device minor number of a character device. Because we made device
numbers dynamically allocated and independent of the unit number passed
to make_dev() a long time ago, it was actually a misnomer. If you really
want to obtain the device number, you should use dev2udev().
We already converted all the drivers to use dev2unit() to obtain the
device unit number, which is still used by a lot of drivers. I've
noticed not a single driver passes NULL to dev2unit(). Even if they
would, its behaviour would make little sense. This is why I've removed
the NULL check.
This commit removes minor(), minor2unit() and unit2minor() from the
kernel. Because there was a naming collision with uminor(), we can
rename umajor() and uminor() back to major() and minor(). This means
that the makedev(3) manual page also applies to kernel space code now.
I suspect umajor() and uminor() aren't used that often in external code,
but to make it easier for other parties to port their code, I've
increased __FreeBSD_version to 800062.
This brings a huge amount of changes; I'll enumerate only the user-visible ones:
- Delegated Administration
Allows regular users to perform ZFS operations, like file system
creation, snapshot creation, etc.
- L2ARC
Level 2 cache for ZFS - allows using additional disks for cache.
Huge performance improvements mostly for random read of mostly
static content.
- slog
Allows using additional disks for the ZFS Intent Log to speed up
operations like fsync(2).
- vfs.zfs.super_owner
Allows regular users to perform privileged operations on files stored
on ZFS file systems owned by them. Be very careful with this one.
- chflags(2)
Not all the flags are supported. This still needs work.
- ZFSBoot
Support to boot off of ZFS pool. Not finished, AFAIK.
Submitted by: dfr
- Snapshot properties
- New failure modes
Previously, if a write request failed, the system panicked. Now one
can select from one of three failure modes:
- panic - panic on write error
- wait - wait for disk to reappear
- continue - serve read requests if possible, block write requests
- Refquota, refreservation properties
Like the quota and reservation properties, but they don't count space consumed
by child file systems, clones and snapshots.
- Sparse volumes
ZVOLs that don't reserve space in the pool.
- External attributes
Compatible with extattr(2).
- NFSv4-ACLs
Not sure about the status, might not be complete yet.
Submitted by: trasz
- Creation-time properties
- Regression tests for zpool(8) command.
Obtained from: OpenSolaris
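As an illustration of the three failure modes listed above, here is a minimal sketch of how a request could be handled once a write error has occurred. The function name and the returned action strings are hypothetical; the behaviour follows the descriptions in the list, not the actual implementation.

```python
def handle_request_after_write_error(mode, request):
    """Return the action taken for a 'read' or 'write' request.

    mode is one of the three failure modes: 'panic', 'wait', 'continue'.
    """
    if mode == "panic":
        return "panic"   # old behaviour: panic on write error
    if mode == "wait":
        return "block"   # wait for the disk to reappear
    if mode == "continue":
        # Serve reads if possible, block writes.
        return "serve" if request == "read" else "block"
    raise ValueError("unknown failure mode: %r" % (mode,))
```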
from one parent directory to another, in addition to the usual access checks
one also needs write access to the subdirectory being moved.
Approved by: rwatson (mentor), pjd
to add more V* constants, and the variables changed by this patch were often
being assigned to mode_t variables, which is 16 bit.
Approved by: rwatson (mentor)
In particular, the KPI of the following functions is modified:
- bufobj_invalbuf()
- bufsync()
and BO_SYNC() "virtual method" of the buffer objects set.
Main consumers of bufobj functions are affected by this change too and,
in particular, functions which changed their KPI are:
- vinvalbuf()
- g_vfs_close()
Due to the KPI breakage, __FreeBSD_version will be bumped in a later
commit.
As a side note, please consider the 'curthread' argument passed to
VOP_SYNC() (in bufsync()) as temporary, as it will be axed ASAP.
Reviewed by: kib
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
- The vnode has to be locked exclusively before calling insmntque().
- Until I find a way to handle insmntque() failures use VV_FORCEINSMQ flag
to force insmntque() to always succeed.
Reported by: kris, trasz, des, others
Suggested by: kib
Tested by: trasz
Now that we got rid of the minor-to-unit conversion and the constraints
on device minor numbers, we can convert the functions that operate on
minor and unit numbers to simple macros. The unit2minor() and
minor2unit() macros are now no-ops.
The ZFS code also defined a macro named `minor'. Change the ZFS code to
use umajor() and uminor() here, as it is the correct approach to do
this. Also add $FreeBSD$ to keep SVN happy.
Approved by: philip (mentor), pjd
filesystem-specific vnode data to the struct vnode. Provide the
default implementation for the vop_advlock and vop_advlockasync.
Purge the locks on the vnode reclaim by using the lf_purgelocks().
The default implementation is augmented for the nfs and smbfs.
In the nfs_advlock, push the Giant inside the nfs_dolock.
Before the change, the vop_advlock and vop_advlockasync have taken the
unlocked vnode and dereferenced the fs-private inode data, racing
with the vnode reclamation due to forced unmount. Now, the vop_getattr
under the shared vnode lock is used to obtain the inode size, and
later, in the lf_advlockasync, after locking the vnode interlock, the
VI_DOOMED flag is checked to prevent an operation on the doomed vnode.
The implementation of the lf_purgelocks() is submitted by dfr.
Reported by: kris
Tested by: kris, pho
Discussed with: jeff, dfr
MFC after: 2 weeks
user-mode lock manager, build a kernel with the NFSLOCKD option and
add '-k' to 'rpc_lockd_flags' in rc.conf.
Highlights include:
* Thread-safe kernel RPC client - many threads can use the same RPC
client handle safely with replies being de-multiplexed at the socket
upcall (typically driven directly by the NIC interrupt) and handed
off to whichever thread matches the reply. For UDP sockets, many RPC
clients can share the same socket. This allows the use of a single
privileged UDP port number to talk to an arbitrary number of remote
hosts.
* Single-threaded kernel RPC server. Adding support for multi-threaded
server would be relatively straightforward and would follow
approximately the Solaris KPI. A single thread should be sufficient
for the NLM since it should rarely block in normal operation.
* Kernel mode NLM server supporting cancel requests and granted
callbacks. I've tested the NLM server reasonably extensively - it
passes both my own tests and the NFS Connectathon locking tests
running on Solaris, Mac OS X and Ubuntu Linux.
* Userland NLM client supported. While the NLM server doesn't have
support for the local NFS client's locking needs, it does have to
field async replies and granted callbacks from remote NLMs that the
local client has contacted. We relay these replies to the userland
rpc.lockd over a local domain RPC socket.
* Robust deadlock detection for the local lock manager. In particular
it will detect deadlocks caused by a lock request that covers more
than one blocking request. As required by the NLM protocol, all
deadlock detection happens synchronously - a user is guaranteed that
if a lock request isn't rejected immediately, the lock will
eventually be granted. The old system allowed for a 'deferred
deadlock' condition where a blocked lock request could wake up and
find that some other deadlock-causing lock owner had beaten them to
the lock.
* Since both local and remote locks are managed by the same kernel
locking code, local and remote processes can safely use file locks
for mutual exclusion. Local processes have no fairness advantage
compared to remote processes when contending to lock a region that
has just been unlocked - the local lock manager enforces a strict
first-come first-served model for both local and remote lockers.
Sponsored by: Isilon Systems
PR: 95247 107555 115524 116679
MFC after: 2 weeks