Generally the first argument in calloc is supposed to stand for a count
and the second for a size. Try to make that consistent. While here,
attempt to make some use of the overflow detection capability in
calloc(3).
illumos/illumos-gate@99189164df99189164dfhttps://www.illumos.org/issues/6940
Similar to #6334, but this time with empty directories:
$ zfs create tank/quota
$ zfs set quota=10M tank/quota
$ zfs snapshot tank/quota@snap1
$ zfs set mountpoint=/mnt/tank/quota tank/quota
$ mkdir /mnt/tank/quota/dir # create an empty directory
$ mkfile 11M /mnt/tank/quota/11M
/mnt/tank/quota/11M: initialized 9830400 of 11534336 bytes: Disc quota exceeded
$ rmdir /mnt/tank/quota/dir # now unlink the empty directory
rmdir: directory "/mnt/tank/quota/dir": Disc quota exceeded
From user perspective, I would expect that ZFS is always able to remove files
and directories even when the quota is exceeded.
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Simon Klinkert <simon.klinkert@gmail.com>
7020 sdev_cleandir can loop forever
Note that the bulk of the upstream change is not applicable to FreeBSD
and the affected files are not even in the vendor area.
illumos/illumos-gate@45b174751545b1747515https://www.illumos.org/issues/7019
Currently zfsdev_ioctl, when confronted by a request with the FKIOCTL flag set,
skips all processing of secpolicy functions. This means that ZFS is not doing
any kind of verification of the credentials or access rights of the caller and
assuming that (as it is an in-kernel client) all such checks have already been
done.
This turns out to be quite a dangerous assumption, especially with respect to
sdev. In general I don't think it's particularly reasonable to offload this
enforcement of access rights onto other kernel subsystems when ZFS has some
particular local semantics in this area (delegated datasets etc) and does not
provide any kind of API to allow other subsystems to avoid code duplication
when doing it. ZFS should apply its normal access policy to requests from
within the kernel, and callers should take care to give it the correct
credentials and call it from the correct context in order to get the results
they need.
You can observe the currently unfortunate consequences of this bug in any non-
global zone that has access to /dev/zvol or any subset of it via sdev profiles.
In particular, a zone used to contain a KVM or similar which has a single zvol
passed through to it using a <device match= block in its zone XML.
Even though sdev makes something of an attempt to control for whether the
caller should have access to nodes in /dev/zvol, it doesn't do this correctly,
or really at all in the lookup call path. So, if we have a zone that's been
given access to any part of /dev/zvol, it can simply look up the full path to
any other zvol on the entire system, and the node will appear and be able to be
used.
https://www.illumos.org/issues/7020
sdev_cleandir can currently hang forever when it encounters a child node that
is busy, or when it is given a matching expr and the first entry on the list
does not match.
The previous code (circa 2013) iterated over the children of the node using a
for loop with SDEV_NEXT_ENTRY, which was then changed to a while ((dv =
SDEV_FIRST_ENTRY(ddv)) { loop. Unfortunately the continue statements that
previously made it skip over an entry were left as they were, which now result
in an infinite busy-loop in the kernel.
You can trigger this pretty easily by setting up an sdev exclude rule in
zonecfg.
Diagnosis: look for a runaway process consuming 100% CPU in kernel -- they have
a distinctive stack:
# mdb -k
> 0t1234::pid2proc | ::walk thread | ::findstack -v
[ ffffd001efcd3310 _resume_from_idle+0x112() ]
ffffd001efcd3360 apix_hilevel_intr_epilog+0xc1(ffffd001efcd33d0, 0)
ffffd001efcd33c0 apix_do_interrupt+0x34a(ffffd001efcd33d0, 0)
ffffd001efcd33d0 _sys_rtt_ints_disabled+8()
ffffd001efcd3550 rw_enter+0x58()
ffffd001efcd35e0 sdev_cleandir+0x60(ffffd0631b6d75d8, 0, 0)
ffffd001efcd3630 devzvol_prunedir+0xec(ffffd0631b6d76e8)
ffffd001efcd36d0 devzvol_readdir+0x150(ffffd06333250e00, ffffd001efcd3790,
ffffd062dc990e18, ffffd001efcd37dc, 0, 0)
ffffd001efcd3760 fop_readdir+0x6b(ffffd06333250e00, ffffd001efcd3790,
ffffd062dc990e18, ffffd001efcd37dc, 0, 0)
ffffd001efcd3830 walk_dir+0xee(ffffd06333250e00, ffffd0669e4483c8,
fffffffffbbdf410)
ffffd001efcd3850 prof_make_names_walk+0x2e(ffffd0669e4483c8,
fffffffffbbdf410)
ffffd001efcd38b0 prof_make_names+0xfc(ffffd0669e4483c8)
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Alex Wilson <alex.wilson@joyent.com>
illumos/illumos-gate@63364b0ee263364b0ee2https://www.illumos.org/issues/6922
ZFS does not do a config_sync after removing an aux (spare, log, or cache)
device. AFAICT this isn't being done because it is slow and was deemed
unnecessary. However, it should be such a rare operation that speed doesn't
matter, and not doing it results in two problems:
1) It is theoretically possible to remove an aux device from one pool and
attach it to another, then lose power. When power is restored, both pools would
think that they own the aux device.
2) Removal of the aux device doesn't send any useful sysevents to userland.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Alan Somers <asomers@gmail.com>
illumos/illumos-gate@1825bc56e51825bc56e5https://www.illumos.org/issues/6878
Summary of changes:
* Replace generic "scan done" message with "scan aborted, restarting",
"scan cancelled", or "scan done"
* Log number of errors using spa_get_errlog_size
* Refactor scan restarting check into static function
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Nav Ravindranath <nav@delphix.com>
illumos/illumos-gate@8df0bcf0df8df0bcf0dfhttps://www.illumos.org/issues/6513
If a ZFS object contains a hole at level one, and then a data block is created
at level 0 underneath that l1 block, l0 holes will be created. However, these
l0 holes do not have the birth time property set; as a result, incremental
sends will not send those holes.
Fix is to modify the dbuf_read code to fill in birth time data.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Paul Dagnelie <pcd@delphix.com>
illumos/illumos-gate@0d8fa8f8eb0d8fa8f8ebhttps://www.illumos.org/issues/6902
pjd has authored and commited a patch in Jan 21, 2012 that substanially speeds
up zfs snapshot listing if requesting only the name property and sorting by
name.
In this special case, the snapshot properties do not need to be loaded. This
code has been adopted by zfsonlinux on May 29, 2012.
Commit message from pjd:
Dramatically optimize listing snapshots when user requests only
snapshot
names and wants to sort them by name, ie. when executes:
1. zfs list -t snapshot -o name -s name
Because only name is needed we don't have to read all snapshot
properties.
Below you can find how long does it take to list 34509 snapshots from
a single
disk pool before and after this change with cold and warm cache:
before:
1. time zfs list -t snapshot -o name -s name > /dev/null
cold cache: 525s
warm cache: 218s
after:
1. time zfs list -t snapshot -o name -s name > /dev/null
cold cache: 1.7s
warm cache: 1.1s
References:
http://svnweb.freebsd.org/base?view=revision&revision=230438https://github.com/freebsd/freebsd/commit/8e3e9863https://github.com/zfsonlinux/zfs/commit/0cee2406
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pawel Dawidek <pjd@freebsd.org>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Martin Matuska <martin@matuska.org>
illumos/illumos-gate@c971037baac971037baahttps://www.illumos.org/issues/6876
Calling dsl_dataset_name on a dataset with a 256 byte buffer is asking for
trouble. We should check every dataset on import, using a 1024 byte buffer and
checking each time to see if the dataset's new name is longer than 256 bytes.
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Paul Dagnelie <pcd@delphix.com>
illumos/illumos-gate@11ceac77ea11ceac77eahttps://www.illumos.org/issues/6844
dnode_next_offset is used in a variety of places to iterate over the holes or
allocated blocks in a dnode. It operates under the premise that it can iterate
over the blockpointers of a dnode in open context while holding only the
dn_struct_rwlock as reader. Unfortunately, this premise does not hold.
When we create the zio for a dbuf, we pass in the actual block pointer in the
indirect block above that dbuf. When we later zero the bp in
zio_write_compress, we are directly modifying the bp. The state of the bp is
now inconsistent from the perspective of dnode_next_offset: the bp will appear
to be a hole until zio_dva_allocate finally finishes filling it in. In the
meantime, dnode_next_offset can detect a hole in the dnode when none exists.
I was able to experimentally demonstrate this behavior with the following
setup:
1. Create a file with 1 million dbufs.
2. Create a thread that randomly dirties L2 blocks by writing to the first L0
block under them.
3. Observe dnode_next_offset, waiting for it to skip over a hole in the middle
of a file.
4. Do dnode_next_offset in a loop until we skip over such a non-existent hole.
The fix is to ensure that it is valid to iterate over the indirect blocks in a
dnode while holding the dn_struct_rwlock by passing the zio a copy of the BP
and updating the actual BP in dbuf_write_ready while holding the lock.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Alex Reece <alex@delphix.com>
illumos/illumos-gate@1fdcbd00c91fdcbd00c9https://www.illumos.org/issues/6874
When we do a clone swap (caused by "zfs rollback" or "zfs receive"), the ZPL
doesn't completely reload the state from the DMU; some values remain cached in
the zfsvfs_t.
steps to reproduce:
```
#!/bin/bash -x
zfs destroy -R test/fs
zfs destroy -R test/recvd
zfs create test/fs
zfs snapshot test/fs@a
zfs set userquota@$USER=1m test/fs
zfs snapshot test/fs@b
zfs send test/fs@a | zfs recv test/recvd
zfs send -i @a test/fs@b | zfs recv test/recvd
zfs userspace test/recvd
1. should show 1m quota
dd if=/dev/urandom of=/test/recvd/file bs=1k count=1024
sync
dd if=/dev/urandom of=/test/recvd/file2 bs=1k count=1024
2. should fail with ENOSPC
sync
zfs unmount test/recvd
zfs mount test/recvd
zfs userspace test/recvd
3. if bug above, now shows 1m quota
dd if=/dev/urandom of=/test/recvd/file3 bs=1k count=1024
4. if bug above, now fails with ENOSPC
```
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Matthew Ahrens <mahrens@delphix.com>
If the hypervisor version is smaller than 4.6.0. Xen commits 74fd00 and
70a3cb are required on the hypervisor side for this to be fixed, and those
are only included in 4.6.0, so stay on the safe side and disable MSI-X
interrupt migration on anything older than 4.6.0.
It should not cause major performance degradation unless a lot of MSI-X
interrupts are allocated.
Sponsored by: Citrix Systems R&D
MFC after: 3 days
Reviewed by: jhb
Differential revision: https://reviews.freebsd.org/D7148
For multi-channel devices, once the primary channel is closed,
a set of 'rescind' messages for sub-channels will be delivered
by Hypervisor. Sub-channel MUST be freed according to these
'rescind' messages; directly re-openning sub-channels in the
same fashion as the primary channel's re-opening does NOT work
at all.
After the primary channel is re-opened, requested # of sub-
channels will be delivered though 'channel offer' messages, and
this set of newly offered channels can be opened along side with
the primary channel.
This unbreaks the MTU setting for hn(4), which requires re-
openning all existsing channels upon MTU change.
MFC after: 1 week
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D6978
Instead of global variable, vmbus version is accessed through
a vmbus DEVMETHOD now.
MFC after: 1 week
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D6953
Pin the channel to cpu0 by default. Drivers having special channel-cpu
mapping requirement should call vmbus_channel_cpu_{set,rr}() themselves.
MFC after: 1 week
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D6918
Despite the implication (process has pending signals -> the current
thread marked for AST and has TDF_NEEDSIGCHK set) is not true due to
other thread might manipulate its signal blocking mask, it should still
hold for the single-threaded processes. Enable check for the condition
for single-threaded case, and replicate it from userret() to ast() as
well, where we check that ast indeed has no signal to deliver.
Note that the check is under DIAGNOSTIC, it is not enabled for INVARIANTS
but !DIAGNOSTIC since it imposes too heavy-weight locking for day-to-day
used debugging kernel.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
AST must not execute with TDF_SBDRY or TDF_SEINTR/TDF_SERESTART thread
flags set, which is asserted in userret(). As the consequence, -1 return
from cursig() must not be possible.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This also fixes memory leakge if sub-connect messages are needed.
MFC after: 1 week
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D6878
The current command response handling discards status and xfer
length unconditionally, so that all of the commands would be
considered successful, even if errors happened. When errors
really happens, this causes all kinds of wiredness, since the
buffer will not be filled on the host side and sense data will
be ignored.
Most of the time, errors do not happen, however, error does
happen for the request sent immediately after the disk resizing.
Discarding the SCSI status (SCSI_STATUS_CHECK_COND) and sense
data (capacity changes) prevents the disk resizing from working
properly.
This commit saves the response status and xfer length properly
for later use.
Submitted by: Dexuan Cui <decui microsoft com>
Noticed by: sephe
MFC after: 3 days
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D7181
NVRAM, ChipCommon, etc).
This extends the existing handling of NVRAM core discovery to support
locating additional devices that may be attached either directly as real
cores, or indirectly via ChipCommon (e.g. bhnd_pmu).
When attached as a SoC root bus (as opposed to a bridged WiFi device),
the platform devices may not be attached until later bus passes,
necessitating delayed discovery/initialization.
Approved by: adrian (mentor)
Differential Revision: https://reviews.freebsd.org/D6962
- Cast 32-bit register values to uintmax_t for use with %jx.
- Add special-case return address handling for MipsKernGenException to
avoid early termination of stack walking in the exception handler
stack frame.
Submitted by: Michael Zhilin <mizhka@gmail.com>
Reviewed by: ray
Approved by: adrian (mentor)
Differential Revision: https://reviews.freebsd.org/D6907