Commit Graph

584 Commits

Author SHA1 Message Date
Attilio Rao
5b6ea0b538 MFC 2011-05-31 14:18:10 +00:00
Pawel Jakub Dawidek
12b9f8e47d Imagine situation where a security problem is found in setuid binary.
User upgrades his system to fix the problem, but if he has any ZFS snapshots
for the file system which contains problematic binary, any user can mount the
snapshot and execute vulnerable binary.

Prevent this from happening by always mounting snapshots with setuid turned off.

MFC after:	2 weeks
2011-05-31 07:02:49 +00:00
Attilio Rao
9cb46334ee MFC 2011-05-27 16:09:10 +00:00
Pawel Jakub Dawidek
43cadeaa27 Silence warnings about unsupoorted value types.
MFC after:	2 weeks
2011-05-27 08:34:31 +00:00
Attilio Rao
7fcdc9a26f MFC 2011-05-26 17:38:00 +00:00
Pawel Jakub Dawidek
b5a060dd8b Don't pass pointer to name buffer which is on the stack to another thread,
because the stack might be paged out once the other thread tries to use the
data. Instead, just allocate memory.

MFC after:	2 weeks
2011-05-24 20:10:12 +00:00
Pawel Jakub Dawidek
541c60d988 Don't access task structure once we call task function.
The task structure might be no longer available.
This also allows to eliminates the need for two tasks in the zio structure.

Submitted by:	anonymous
MFC after:	2 weeks
2011-05-24 20:07:15 +00:00
Attilio Rao
b97e49c0e1 MFC 2011-05-22 21:46:55 +00:00
Rick Macklem
965e561750 Fix the zfs file system so that it uses the lock
flags argument added to VFS_FHTOVP() by r222167.

Reviewed by:	pjd
2011-05-22 21:04:32 +00:00
Attilio Rao
8c4431d022 MFC 2011-05-22 20:41:10 +00:00
Rick Macklem
694a586a43 Add a lock flags argument to the VFS_FHTOVP() file system
method, so that callers can indicate the minimum vnode
locking requirement. This will allow some file systems to choose
to return a LK_SHARED locked vnode when LK_SHARED is specified
for the flags argument. This patch only adds the flag. It
does not change any file system to use it and all callers
specify LK_EXCLUSIVE, so file system semantics are not changed.

Reviewed by:	kib
2011-05-22 01:07:54 +00:00
Attilio Rao
5f6b159db7 MFC 2011-05-18 16:01:29 +00:00
Martin Matuska
a5c44f92bf Restore old (v15) behaviour for a recursive snapshot destroy.
(zfs destroy -r pool/dataset@snapshot)

To destroy all descendent snapshots with the same name the top level
snapshot was not required to exist. So if the top level snapshot does
not exist, check permissions of the parent dataset instead.

Filed as Illumos Bug #1043

Reviewed by:	delphij
Approved by:	pjd
MFC after:	together with v28
2011-05-18 07:37:02 +00:00
Attilio Rao
7e7a34e520 MFC 2011-05-16 16:34:03 +00:00
Andriy Gapon
20208c3bf0 Revert accidentally committed local change in r221990
Pointyhat to:	avg
2011-05-16 15:36:11 +00:00
Andriy Gapon
dd7498ae03 better integrate cyclic module with clocksource/eventtimer subsystem
Now in the case when one-shot timers are used cyclic events should fire
closer to theier scheduled times.  As the cyclic is currently used only
to drive DTrace profile provider, this is the area where the change
makes a difference.

Reviewed by:	mav (earlier version, a while ago)
X-MFC after:	clocksource/eventtimer subsystem
2011-05-16 15:29:59 +00:00
Attilio Rao
b68eda3b54 MFC 2011-05-10 15:54:37 +00:00
Andriy Gapon
d9b8935fb9 dtrace: remove unused code
Which is also useless, IMO.

MFC after:	5 days
2011-05-10 15:05:27 +00:00
Attilio Rao
71a19bdc64 Commit the support for removing cpumask_t and replacing it directly with
cpuset_t objects.
That is going to offer the underlying support for a simple bump of
MAXCPU and then support for number of cpus > 32 (as it is today).

Right now, cpumask_t is an int, 32 bits on all our supported architecture.
cpumask_t on the other side is implemented as an array of longs, and
easilly extendible by definition.

The architectures touched by this commit are the following:
- amd64
- i386
- pc98
- arm
- ia64
- XEN

while the others are still missing.
Userland is believed to be fully converted with the changes contained
here.

Some technical notes:
- This commit may be considered an ABI nop for all the architectures
  different from amd64 and ia64 (and sparc64 in the future)
- per-cpu members, which are now converted to cpuset_t, needs to be
  accessed avoiding migration, because the size of cpuset_t should be
  considered unknown
- size of cpuset_t objects is different from kernel and userland (this is
  primirally done in order to leave some more space in userland to cope
  with KBI extensions). If you need to access kernel cpuset_t from the
  userland please refer to example in this patch on how to do that
  correctly (kgdb may be a good source, for example).
- Support for other architectures is going to be added soon
- Only MAXCPU for amd64 is bumped now

The patch has been tested by sbruno and Nicholas Esborn on opteron
4 x 12 pack CPUs. More testing on big SMP is expected to came soon.
pluknet tested the patch with his 8-ways on both amd64 and i386.

Tested by:	pluknet, sbruno, gianni, Nicholas Esborn
Reviewed by:	jeff, jhb, sbruno
2011-05-05 14:39:14 +00:00
Marius Strobl
edd870e447 Convert the last use of xcopyout() to ddi_copyout() and remove the now
unused xcopyin() as well as xcopyout().
MFC together with r219089.

Approved by:	mm
2011-05-03 20:13:27 +00:00
Martin Matuska
29bf94b8d8 Fix deduplicated zfs receive
(dmu_recv_stream builds incomplete guid_to_ds_map)

Illumos-gate changeset:	13329:c48b8bf84ab7
MFC together with v28

Approved by:	pjd
Obtained from:	Illumos (Bug #755)
2011-04-30 14:52:49 +00:00
Marcel Moolenaar
8d098dc0c4 Fix copy-paste bug. 2011-04-27 04:03:04 +00:00
Martin Matuska
8b2aa22d8f Partially fix ZFS compat code for sparc64.
Some endianess bugs still need to be resolved.

Submitted by:	marius (parts of the fix)
MFC after:	1 month
2011-04-08 11:08:26 +00:00
Artem Belevich
7a3f3cabb1 Stripped '32' suffix from linux systrace module name on i386.
Approved by: avg
2011-04-08 06:27:43 +00:00
Jung-uk Kim
3453537fa5 Use atomic load & store for TSC frequency. It may be overkill for amd64 but
safer for i386 because it can be easily over 4 GHz now.  More worse, it can
be easily changed by user with 'machdep.tsc_freq' tunable (directly) or
cpufreq(4) (indirectly).  Note it is intentionally not used in performance
critical paths to avoid performance regression (but we should, in theory).
Alternatively, we may add "virtual TSC" with lower frequency if maximum
frequency overflows 32 bits (and ignore possible incoherency as we do now).
2011-04-07 23:28:28 +00:00
Pawel Jakub Dawidek
65612637e8 Checking file access on size change is bogus. The checks are done earlier by
VFS where we know if this is truncate(2) or ftruncate(2). If this is the
latter we should depend on the mode the file was opened and not on the current
permission.

PR:		standards/154873
Reported by:	Mark Martinec <Mark.Martinec@ijs.si>
Discussed with:	Eric Schrock <eric.schrock@delphix.com>
Discussed with:	Mark Maybee <Mark.Maybee@Oracle.COM>
MFC after:	1 month
2011-03-24 20:28:09 +00:00
Pawel Jakub Dawidek
d7d23301ae Fix potential panic in dbuf_sync_list() relate to spill blocks handling.
Obtained from:	IllumOS
MFC after:	1 month
2011-03-14 11:07:12 +00:00
Andriy Gapon
308bce2a0e add DTrace systrace support for linux32 and freebsd32 on amd64 syscalls
Add systrace_linux32 and systrace_freebsd32 modules which provide
support for tracing compat system calls in addition to native system
call tracing provided by systrace module.

Provided that all the systrace modules are loaded now you can select
what syscalls to trace in the following manner:

syscall::xxx:yyy - work on all system calls that match the specification
syscall:freebsd:xxx:yyy - only native system calls
syscall:linux32:xxx:yyy - linux32 compat system calls
syscall:freebsd32:xxx:yyy - freebsd32 compat system calls on amd64

PR:		kern/152822
Submitted by:	Artem Belevich <fbsdlist@src.cx>
Reviewed by:	jhb (earlier version)
MFC after:	3 weeks
2011-03-12 09:09:25 +00:00
Pawel Jakub Dawidek
cae905e5d0 Correct readdir over ZFS handling.
Reported by:	Pierre Beyssac <pb@fasterix.frmug.org>
MFC after:	1 month
2011-03-08 18:39:41 +00:00
Pawel Jakub Dawidek
a96e8e86f0 Fix libzpool build.
MFC after:	1 month
2011-03-06 01:22:14 +00:00
Pawel Jakub Dawidek
2348f1110e Make renaming of a ZVOL, ZVOL's parent directory and ZVOL snapshot work.
Reported by:	avg
MFC after:	1 month
2011-03-05 22:31:03 +00:00
Pawel Jakub Dawidek
5bf0660559 Simplify zvol_remove_minors() a bit.
MFC after:	1 month
2011-03-05 22:24:31 +00:00
Pawel Jakub Dawidek
2fbdb9c0a0 Use proper lock in assertion.
MFC after:	1 month
2011-02-28 05:45:31 +00:00
Pawel Jakub Dawidek
10b9d77bf1 Finally... Import the latest open-source ZFS version - (SPA) 28.
Few new things available from now on:

- Data deduplication.
- Triple parity RAIDZ (RAIDZ3).
- zfs diff.
- zpool split.
- Snapshot holds.
- zpool import -F. Allows to rewind corrupted pool to earlier
  transaction group.
- Possibility to import pool in read-only mode.

MFC after:	1 month
2011-02-27 19:41:40 +00:00
Rebecca Cran
6bccea7c2b Fix typos - remove duplicate "the".
PR:	bin/154928
Submitted by:	Eitan Adler <lists at eitanadler.com>
MFC after: 	3 days
2011-02-21 09:01:34 +00:00
Marcel Moolenaar
6e23016fd7 Use the preload_fetch_addr() and preload_fetch_size() convenience
functions to obtain the address and size of the preloaded pool
configuration file/repository.

Sponsored by: Juniper Networks.
2011-02-13 19:46:55 +00:00
Konstantin Belousov
ca67168159 For UIO_NOCOPY case of reading request on zfs vnode, which has vm object
attached, activate the page after the successful read, and free the page
if read was unsuccessfull.

Freshly allocated page is not on any queue yet, and not activating (or
deactivating) the page leaves it on no queue, excluding the page from
pagedaemon scans and making the memory disappeared until the vnode
reclaimed.

Reviewed by:	avg
MFC after:	1 week
2011-02-11 10:46:15 +00:00
Edward Tomasz Napierala
dc7a965673 Make it impossible to clear the MNT_NFS4ACLS flag on ZFS filesystem
by using "mount -uw".

Reviewed by:	pjd
MFC after:	2 weeks
2011-02-06 23:34:09 +00:00
Andrey V. Elsukov
459d0e830d vdev's sectorsize should not be greater than 8 Kbytes and also
it should be power of 2. This prevents non-aligned access while
probing vdev's labels.

PR:		kern/147852
Reviewed by:	pjd
MFC after:	1 week
2011-02-04 15:22:56 +00:00
Martin Matuska
5c92680fa9 Recommit r218169, enclosing with #ifdef _KERNEL
This change is sufficient for the ZFS kernel module.

Discussed with:	pjd
MFC after:	1 week
2011-02-01 23:12:13 +00:00
Alexander Kabaev
a9c28a203d Revert r218169 until it can be tested and fixed properly. 2011-02-01 21:15:35 +00:00
Martin Matuska
4530e5f790 For ZFS, change the type of clock_t to int64_t.
The clock_t type in OpenSolaris is long (int64_t on amd64).
On FreeBSD clock_t is int32_t. The clock_t type is used in several places
in the ZFS code to store system uptime in milliseconds ("seconds * hz").

With hz=1000 we have a 32-bit integer overflow in 24 days, 20 hours,
31 minutes and 23.648 seconds. This has a user reported negative impact
on l2arc_feed_thread() and may cause unexpected results from other functions
using clock_t.

Reported by:	Artem Belevich <fbsdlist@src.cx> on freebsd-fs@
MFC after:	1 week
2011-02-01 14:28:50 +00:00
Jayachandran C.
baa8c35cb4 CDDL fixes for MIPS n32.
Provide 64 bit atomic ops, and use 32 bit pointer.
2011-01-28 06:12:59 +00:00
Matthew D Fleming
cbc134ad03 Introduce signed and unsigned version of CTLTYPE_QUAD, renaming
existing uses.  Rename sysctl_handle_quad() to sysctl_handle_64().
2011-01-19 23:00:25 +00:00
Edward Tomasz Napierala
7a93bf9a69 Add MNT_NFS4ACLS to ZFS mount flags. It's not conditional, since there
is no way to disable NFSv4 ACLs in ZFS.  This should make it easier
for the NFS server to figure out whether the exported filesystem supports
ACLs or not.

Reviewed by:	pjd
MFC after:	2 weeks
2011-01-19 17:11:52 +00:00
Matthew D Fleming
e704482d43 Re-commit the zfs sysctl(9) type-safety changes.
Thanks to dim and pjd for the pointer to zfs_context.h for building
userland.
2011-01-13 18:20:19 +00:00
Matthew D Fleming
374a993a88 Revert cddl changes for sysctl(9) until I understand why this isn't
building on universe.
2011-01-12 23:06:38 +00:00
Matthew D Fleming
4a2ce5903f sysctl(9) cleanup checkpoint: amd64 GENERIC builds cleanly.
Commit the zfs piece.
2011-01-12 19:53:30 +00:00
Martin Matuska
df06a59a77 MFp4 r186485, r186859:
Fix a race by defining two tasks in the zio structure
as we can still be returning from issue task when interrupt task is used.

Tested by:	pjd
Approved by:	pjd, delphij (mentor)
MFC after:	3 days
2011-01-03 12:57:07 +00:00
Andriy Gapon
dfe3a1b374 cyclic xcall: use smp_no_rendevous_barrier as setup function parameter
In this case we call target function only on a single CPU and do not
need any synchronization at the setup stage.

It's a bit non-obvious but setup function of NULL means that
smp_rendezvous_cpus waits for all CPUs to arrive at the rendezvous
point, but without doing any actual setup.  While using
smp_no_rendevous_barrier means that each CPU proceeds on its own
schedule without any synchronization whatsoever.

MFC after:	3 weeks
2010-12-17 18:22:50 +00:00
Pawel Jakub Dawidek
8735863465 Remove redundant semicolon and empty like. 2010-12-11 13:35:25 +00:00
Ivan Voras
d7ccd95be8 Undo r216230: the interaction between saved ashift in metadata and
detected ashift does not support this. With this change, pools
created while stripesize=512 could not be imported when stripesize
becomes larger (on the same drive).

Noticed by:	pjd
2010-12-07 15:24:08 +00:00
Andriy Gapon
58f61ce4eb opensolaris cyclic: fix deadlock and make a little bit closer to upstream
The dealock was caused in the following way:
- thread T1 on CPU C1 holds a spin mutex, IPIs CPU C2 and waits for the
  IPI to be handled
- C2 executes timer interrupt filter, thus has interrupts disabled, and
  gets blocked on the spin mutex held by T1
The problem seems to have been introduced by simplifications made to
OpenSolaris code during porting.
The problem is fixed by reorganizing the code to more closely resemble
the upstream version.  Interrupt filter (cyclic_fire) now doesn't
acquire any locks, all per-CPU data accesses are performed on a
target CPU with preemption and interrupts disabled thus precluding
concurrent access to the data.
cyp_mtx spin mutex is used to disable preemtion and interrupts; it's not
used for classical mutual exclusion, because xcall already serializes
calls to a CPU.  It's an emulation of OpenSolaris
cyb_set_level(CY_HIGH_LEVEL) call, the spin mutexes could probably be
reduced to just a spinlock_enter()/_exit() pair.

Diff with upstream version is now reduced by ~500 lines, however it still
remains quite large - many things that are not needed (at the moment) or
are irrelevant on FreeBSD were simply ripped out during porting.
Examples of such things:
- support for CPU onlining/offlining
- support for suspend/resume
- support for running callouts at soft interrupt levels
- support for callout rebinding from CPU to CPU
- support for CPU partitions

Tested by:	Artem Belevich <fbsdlist@src.cx>
MFC after:	3 weeks
X-MFC with:	r216252
2010-12-07 12:25:26 +00:00
Andriy Gapon
a10b0e67d9 opensolaris cyclic xcall: no need for special handling of curcpu
smp_rendezvous_cpus already properly handles current CPU case
and non-SMP case.

MFC after:	3 weeks
2010-12-07 12:04:06 +00:00
Andriy Gapon
fe8c7b3d77 dtrace_xcall: no need for special handling of curcpu
smp_rendezvous_cpus alreadt does the right thing in a very similar
fashion, so the code was kind of duplicating that.

MFC after:	3 weeks
2010-12-07 09:19:47 +00:00
Andriy Gapon
7becfa95b9 dtrace_gethrtime_init: pin to master while examining other CPUs
Also use pc_cpumask to be future-friendly.

Reviewed by:	jhb
MFC after:	2 weeks
2010-12-07 09:03:17 +00:00
Ivan Voras
8b08562112 Use GEOM stripesize field when calculating ashift. This will enable correct
alignment on drives with large sector sizes (e.g. 4 KiB) but the
implementation might need to be revisited if devices with large stripesizes
appear (e.g. if RAID controllers or flash drives start using the field),
probably by introducing a physsectorsize field in GEOM providers.

Discussed with: mav, mostly silence on freebsd-geom@ and freebsd-fs@
2010-12-06 12:18:02 +00:00
Edward Tomasz Napierala
de2a57325d Don't panic when we read an empty ACL from ZFS. Apparently this may happen
with filesystems created under MacOS X ZFS port.  This is kind of filesystem
corruption (we don't allow for setting empty ACLs), so make acl_get_file(3)
and related syscalls fail with EINVAL in that case.  In theory, we could
return empty ACL to userland, but I'm afraid this would break some code.

MFC after:	3 days
2010-11-30 21:04:05 +00:00
Andriy Gapon
c59690f249 zfs+sendfile: populate all requested pages, not just those already cached
kern_sendfile() uses vm_rdwr() to read-ahead blocks of data to populate
page cache.  When sendfile stumbles upon a page that is not populated
yet, it sends out all the mbufs that it collected so far.  This
resulted in very poor performance with ZFS when file data is not in the
page cache, because ZFS vop_read for UIO_NOCOPY case populated only
those pages that are already in cache, but not valid.  Which means that
most of the time it populated only the first requested page in the
described above scenario.

Reported by:	Alexander Zagrebin <alexz@visp.ru>
Tested by:	Alexander Zagrebin <alexz@visp.ru>,
		Artemiev Igor <ai@kliksys.ru>
MFC after:	12 days
2010-11-16 15:53:44 +00:00
Andriy Gapon
f9e2e99d5d fix misspelling in a comment
Reported by:	Daniel Braniss <danny@cs.huji.ac.il>
MFC after:	3 days
2010-11-16 12:30:47 +00:00
Martin Matuska
8db47aa15e Disable VFS_HOLD placed on mnt_vnodecovered during the mount of a snapshot
and VFS_RELE on a non-existing hold on snapshot parent's z_vfs.

This disables the changes from OpenSolaris onnv-revision 9234:bffdc4fc05c4
(bug IDs: 6792139, 6794830) - not applicable to FreeBSD.

This fixes the process hang if umounting a manually mounted snapshot.

Reported by:	Alexander Zagrebin <alexz@visp.ru>
Approved by:	delphij (mentor)
MFC after:	1 week
2010-11-13 21:09:18 +00:00
Xin LI
b97a9057c2 Validate whether the zfs_cmd_t submitted from userland is not smaller than
what we have.  Without the check the kernel could accessing memory that
does not belong to the request struct.

Note that we do not test if the struct equals in size at this time, which
may faciliate forward compatibility with newer binaries.

Reviewed by:	pjd at MeetBSD CA '2010
MFC after:	1 week
2010-11-05 22:18:09 +00:00
Martin Matuska
e25376bdd0 Bugfix merge from OpenSolaris:
OpenSolaris onnv-revision:	10209:91f47f0e7728
6830541	zfs_get_data_trips on a verify
6696242	multiple zfs_fillpage() zfs: accessing past end of object panics
6785914	zfs fails to drop dn_struct_rwlock in recovery code path

Approved by:	delphij (mentor)
Obtained from:	OpenSolaris (Bug ID 6830541, 6696242, 6785914)
MFC after:	2 weeks
2010-10-26 15:48:03 +00:00
Andriy Gapon
23a1bcf8c6 zfs: add vop_getpages method implementation
This should make vnode_pager_getpages path a bit shorter and clearer.
Also this should eliminate problems with partially valid pages.
Having this method opens room for future optimizations.

To do: try to satisfy other pages besides the required one taking into
account tradeofs between number of page faults, read throughput and read
latency.  Also, eventually vop_putpages should be added too.

Reviewed by:	kib, mm, pjd
MFC after:	3 weeks
2010-10-16 20:43:05 +00:00
Rui Paulo
910a5e18ba Pass a format string to panic() and to taskqueue_start_threads().
Found with:	clang
2010-10-13 17:13:43 +00:00
Rui Paulo
6e634bb80f In zfs_post_common(), use %d instead of %hhu.
Found with:	clang
2010-10-13 17:12:23 +00:00
Andriy Gapon
f6bb41924c zfs + sendfile: do not produce partially valid pages for vnode's tail
Since r212650 and before this change sendfile(2) could produce
a partially valid page for a trailing portion of a ZFS vnode.
vm_fault() always wants to see a fully valid page even if it's the last
page that partially extends beyond vnode's end.  Otherwise it calls
vop_getpages() to bring in the page.  In the case of ZFS this means
that the data is read from the page into the same page and this breaks
checks in ZFS mappedread() - a thread that set VPO_BUSY on the page in
vm_fault() will get blocked forever waiting for it to be cleared.

Many thanks to Kai and Jeremy for reproducing the issue and providing
important debugging information and help.

Reported by:	Kai Gallasch <gallasch@free.de>,
		Jeremy Chadwick <freebsd@jdc.parodius.com>
Tested by:	Kai Gallasch <gallasch@free.de>,
		Jeremy Chadwick <freebsd@jdc.parodius.com>
Reviewed by:	kib
MFC after:	3 days
To-Do:		apply the same treatment to tmpfs + sendfile
2010-10-12 17:04:21 +00:00
Pawel Jakub Dawidek
19ebc67beb Provide internal ioflags() function that converts ioflag provided by FreeBSD's
VFS to OpenSolaris-specific ioflag expected by ZFS. Use it for read and write
operations.

Reviewed by:	mm
MFC after:	1 week
2010-10-10 20:49:33 +00:00
Martin Matuska
a362d75576 Change FAPPEND to IO_APPEND as this is a ioflag and not a fflag.
This corrects writing to append-only files on ZFS.

PR:		kern/149495 [1], kern/151082 [2]
Submitted by:	Daniel Zhelev <daniel@zhelev.biz> [1], Michael Naef <cal@linu.gs> [2]
Approved by:	delphij (mentor)
MFC after:	1 week
2010-10-08 23:01:38 +00:00
Andriy Gapon
6c6aca1203 opensolaris_kmem kmem_size(): report lesser of vm_kmem_size and available
physical memory

This is needed to correctly autotune ZFS ARC size when vm_kmem_size is
set to value larger than available physical memory.

MFC after:	2 weeks
2010-10-07 18:16:14 +00:00
Martin Matuska
aa007a9f0e Properly handle IO with B_FAILFAST
Retry IO once with ZIO_FLAG_TRYHARD before declaring a pool faulted

OpenSolaris revision and Bug IDs:

9725:0bf7402e8022
6843014 ZFS B_FAILFAST handling is broken

Approved by:	delphij (mentor)
Obtained from:	OpenSolaris (Bug ID 6843014)
MFC after:	3 weeks
2010-09-27 09:42:31 +00:00
Martin Matuska
96a1a6a568 Enable offlining of log devices.
OpenSolaris revision and Bug IDs:

9701:cc5b64682e64
6803605	should be able to offline log devices
6726045	vdev_deflate_ratio is not set when offlining a log device
6599442	zpool import has faults in the display

Approved by:	delphij (mentor)
Obtained from:	OpenSolaris (Bug ID 6803605, 6726045, 6599442)
MFC after:	3 weeks
2010-09-27 09:05:51 +00:00
Andriy Gapon
68653c3bd6 zfs_map_page/zfs_unmap_page: do not use sched_pin() and SFB_CPUPRIVATE
zfs_map_page/zfs_unmap_page are mostly called around potential I/O paths
and it seems to be a not very good idea to do cpu pinning there.

Suggested by:	kib
MFC after:	2 weeks
2010-09-21 05:58:45 +00:00
Andriy Gapon
ff5e15a487 zfs_vnops: use zfs_map_page/zfs_unmap_page helper functions in another place
MFC after:	2 weeks
2010-09-21 05:54:36 +00:00
Andriy Gapon
9d5eb9aa5d zfs arc_reclaim_needed: fix typo in mismerge in r212780
PR:		kern/146410, kern/138790
MFC after:	3 weeks
X-MFC with:	r212780
2010-09-17 07:34:50 +00:00
Andriy Gapon
921d3fd122 zfs+sendfile: advance uio_offset upon reading as well
Picked from analogous code in tmpfs.

MFC after:	1 week
2010-09-17 07:20:20 +00:00
Andriy Gapon
44532bc5cd zfs arc_reclaim_needed: remove redundant checks for arc_c_max and arc_c_max
Those checks are not present in upstream code and they are enforced in
actual calculations of delta by which ARC size can be grown or should be
reduced.

MFC after:	3 weeks
2010-09-17 07:17:38 +00:00
Andriy Gapon
7c1353491f zfs arc_reclaim_needed: more reasonable threshold for available pages
vm_paging_target() is not a trigger of any kind for pageademon, but
rather a "soft" target for it when it's already triggered.
Thus, trying to keep 2048 pages above that level at the expense of ARC
was simply driving ARC size into the ground even with normal memory
loads.
Instead, use a threshold at which a pagedaemon scan is triggered, so
that ARC reclaiming helps with pagedaemon's task, but the latter still
recycles active and inactive pages.

PR:		kern/146410, kern/138790
MFC after:	3 weeks
2010-09-17 07:14:07 +00:00
Martin Matuska
d1ee63f836 Fix kernel panic when moving a file to .zfs/shares
Fix possible loss of correct error return code in ZFS mount

OpenSolaris revisions and Bug IDs:

11824:53128e5db7cf
6863610	ZFS mount can lose correct error return

12079:13822b941977
6939941	problem with moving files in zfs (142901-12)

Approved by:	delphij (mentor)
Obtained from:	OpenSolaris (Bug ID 6863610, 6939941)
MFC after:	3 days
2010-09-15 19:55:26 +00:00
Andriy Gapon
8a3883cfb7 zfs vn_has_cached_data: take into account v_object->cache != NULL
This mirrors code in tmpfs.
This changge shouldn't affect much read path, it may cause unnecessary
vm_page_lookup calls in the case where v_object has no active or inactive
pages but has some cache pages.  I believe this situation to be non-essential.

In write path this change should allow us to properly detect the above
case and free a cache page when we write to a range that corresponds to it.
If this situation is undetected then we could have a discrepancy between
data in page cache and in ARC or on disk.

This change allows us to re-enable vn_has_cached_data() check in zfs_write.

NOTE: strictly speaking resident_page_count and cache fields of v_object
should be exmined under VM_OBJECT_LOCK, but for this particular usage
we may get away with it.

Discussed with:	alc, kib
Approved by:	pjd
Tested with:	tools/regression/fsx
MFC after:	3 weeks
2010-09-15 11:05:41 +00:00
Andriy Gapon
0b1ca38a69 zfs mappedread, update_pages: use int for offset and length within a page
uint64_t, int64_t were redundant there

Approved by:	pjd
Tested by:	tools/regression/fsx
MFC after:	2 weeks
2010-09-15 10:48:16 +00:00
Andriy Gapon
c002c3e8c2 zfs mappedread: use uiomove_fromphys where possible
Reviewed by:	alc
Approved by:	pjd
Tested by:	tools/regression/fsx
MFC after:	2 weeks
2010-09-15 10:44:20 +00:00
Andriy Gapon
fbbdb19dcd zfs: catch up with vm_page_sleep_if_busy changes
Reviewed by:	alc
Approved by:	pjd
Tested by:	tools/regression/fsx
MFC after:	2 weeks
2010-09-15 10:39:21 +00:00
Andriy Gapon
21bd3e2576 tmpfs, zfs + sendfile: mark page bits as valid after populating it with data
Otherwise, adding insult to injury, in addition to double-caching of data
we would always copy the data into a vnode's vm object page from backend.
This is specific to sendfile case only (VOP_READ with UIO_NOCOPY).

PR:		kern/141305
Reported by:	Wiktor Niesiobedzki <bsd@vink.pl>
Reviewed by:	alc
Tested by:	tools/regression/sockets/sendfile
MFC after:	2 weeks
2010-09-15 10:31:27 +00:00
Martin Matuska
9a13d2e1b3 Remove duplicated VFS_HOLD due to a mismerge.
PR:		kern/150544
Approved by:	delphij (mentor)
MFC after:	1 day
2010-09-14 12:12:18 +00:00
Martin Matuska
4eeef2e44a Add missing vop_vector zfsctl_ops_shares
Add missing locks around VOP_READDIR and VOP_GETATTR with z_shares_dir

PR:		kern/150544
Approved by:	delphij (mentor)
Obtained from:	perforce (pjd)
MFC after:	1 day
2010-09-14 10:27:32 +00:00
Pawel Jakub Dawidek
3c907063e9 Remove the page queues lock around vm_page_undirty() - it is no longer needed.
Reviewed by:	alc
2010-09-13 19:47:09 +00:00
Rui Paulo
47047e3418 Revamp locking a bit. This fixes three problems:
* processes now can't go away while we are inserting probes (fixes a panic)
* if a trap happens, we won't be holding the process lock (fixes a hang)
* fix a LOR between the process lock and the fasttrap bucket list lock

Thanks to kib for pointing some problems.
Sponsored by:	The FreeBSD Foundation
2010-09-12 14:12:16 +00:00
Rui Paulo
eae81e9501 Avoid a LOR (sleepable after non-sleepable) in
fasttrap_tracepoint_enable().

Sponsored by:	The FreeBSD Foundation
2010-09-11 12:58:31 +00:00
Matthew D Fleming
4d369413e1 Replace sbuf_overflowed() with sbuf_error(), which returns any error
code associated with overflow or with the drain function.  While this
function is not expected to be used often, it produces more information
in the form of an errno that sbuf_overflowed() did.
2010-09-10 16:42:16 +00:00
Pawel Jakub Dawidek
6a85b5e08a Forgot to commit this file. Add ZPOOL_CONFIG_IS_LOG.
Reported by:	keramida
MFC after:	2 weeks
2010-09-10 04:44:13 +00:00
Pawel Jakub Dawidek
86b19d1861 On FreeBSD we can log from pool that have multiple top-level vdevs or log
vdevs, so don't deny adding new vdevs if bootfs property is set.

MFC after:	2 weeks
2010-09-09 21:20:18 +00:00
Rui Paulo
d3555b6fc2 Fix two bugs in DTrace:
* when the process exits, remove the associated USDT probes
* when the process forks, duplicate the USDT probes.

Sponsored by:	The FreeBSD Foundation
2010-09-09 09:58:05 +00:00
Justin T. Gibbs
f03f7a0ca3 Correct bioq_disksort so that bioq_insert_tail() offers barrier semantic.
Add the BIO_ORDERED flag for struct bio and update bio clients to use it.

The barrier semantics of bioq_insert_tail() were broken in two ways:

 o In bioq_disksort(), an added bio could be inserted at the head of
   the queue, even when a barrier was present, if the sort key for
   the new entry was less than that of the last queued barrier bio.

 o The last_offset used to generate the sort key for newly queued bios
   did not stay at the position of the barrier until either the
   barrier was de-queued, or a new barrier (which updates last_offset)
   was queued.  When a barrier is in effect, we know that the disk
   will pass through the barrier position just before the
   "blocked bios" are released, so using the barrier's offset for
   last_offset is the optimal choice.

sys/geom/sched/subr_disk.c:
sys/kern/subr_disk.c:
	o Update last_offset in bioq_insert_tail().

	o Only update last_offset in bioq_remove() if the removed bio is
	  at the head of the queue (typically due to a call via
	  bioq_takefirst()) and no barrier is active.

	o In bioq_disksort(), if we have a barrier (insert_point is non-NULL),
	  set prev to the barrier and cur to it's next element.  Now that
	  last_offset is kept at the barrier position, this change isn't
	  strictly necessary, but since we have to take a decision branch
	  anyway, it does avoid one, no-op, loop iteration in the while
	  loop that immediately follows.

	o In bioq_disksort(), bypass the normal sort for bios with the
	  BIO_ORDERED attribute and instead insert them into the queue
	  with bioq_insert_tail().  bioq_insert_tail() not only gives
	  the desired command order during insertion, but also provides
	  barrier semantics so that commands disksorted in the future
	  cannot pass the just enqueued transaction.

sys/sys/bio.h:
	Add BIO_ORDERED as bit 4 of the bio_flags field in struct bio.

sys/cam/ata/ata_da.c:
sys/cam/scsi/scsi_da.c
	Use an ordered command for SCSI/ATA-NCQ commands issued in
	response to bios with the BIO_ORDERED flag set.

sys/cam/scsi/scsi_da.c
	Use an ordered tag when issuing a synchronize cache command.

	Wrap some lines to 80 columns.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
sys/geom/geom_io.c
	Mark bios with the BIO_FLUSH command as BIO_ORDERED.

Sponsored by:	Spectra Logic Corporation
MFC after:	1 month
2010-09-02 19:40:28 +00:00
Rui Paulo
ea950d20f6 Make the /dev/dtrace/helper node have the mode 0660. This allows
programs that refuse to run as root (pgsql) to install probes when their
user is part of the wheel group.

Sponsored by:	The FreeBSD Foundation
2010-09-01 12:08:32 +00:00
Jaakko Heinonen
de478dd4b4 execve(2) has a special check for file permissions: a file must have at
least one execute bit set, otherwise execve(2) will return EACCES even
for an user with PRIV_VFS_EXEC privilege.

Add the check also to vaccess(9), vaccess_acl_nfs4(9) and
vaccess_acl_posix1e(9). This makes access(2) to better agree with
execve(2). Because ZFS doesn't use vaccess(9) for VEXEC, add the check
to zfs_freebsd_access() too. There may be other file systems which are
not using vaccess*() functions and need to be handled separately.

PR:		kern/125009
Reviewed by:	bde, trasz
Approved by:	pjd (ZFS part)
2010-08-30 16:30:18 +00:00
Pawel Jakub Dawidek
b8a4becc2d Return NULL pointer instead of B_FALSE as it is done in the vendor code.
Obtained from:	//depot/user/pjd/zfs/...
2010-08-28 19:29:06 +00:00
Pawel Jakub Dawidek
3e9e888541 Move ZUT_OBJS in the same place that is used in vendor code.
Obtained from:	//depot/user/pjd/zfs/...
2010-08-28 19:28:12 +00:00
Martin Matuska
8d87b396f8 Import changes from OpenSolaris that provide
- better ACL caching and speedup of ACL permission checks
- faster handling of stat()
- lowered mutex contention in the read/writer lock (rrwlock)
- several related bugfixes

Detailed information (OpenSolaris onnv changesets and Bug IDs):

9749:105f407a2680
6802734	Support for Access Based Enumeration (not used on FreeBSD)
6844861	inconsistent xattr readdir behavior with too-small buffer

9866:ddc5f1d8eb4e
6848431	zfs with rstchown=0 or file_chown_self privilege allows user to "take" ownership

9981:b4907297e740
6775100	stat() performance on files on zfs should be improved
6827779	rrwlock is overly protective of its counters

10143:d2d432dfe597
6857433	memory leaks found at: zfs_acl_alloc/zfs_acl_node_alloc
6860318	truncate() on zfsroot succeeds when file has a component of its path set without access permission

10232:f37b85f7e03e
6865875	zfs sometimes incorrectly giving search access to a dir

10250:b179ceb34b62
6867395	zpool_upgrade_007_pos testcase panic'd with BAD TRAP: type=e (#pf Page fault)

10269:2788675568fd
6868276	zfs_rezget() can be hazardous when znode has a cached ACL

10295:f7a18a1e9610
6870564	panic in zfs_getsecattr

Approved by:	delphij (mentor)
Obtained from:	OpenSolaris (multiple Bug IDs)
MFC after:	2 weeks
2010-08-28 09:24:11 +00:00
Martin Matuska
abe5837f7c Update ZFS metaslab code from OpenSolaris.
This provides a noticeable write speedup, especially on pools with
less than 30% of free space.

Detailed information (OpenSolaris onnv changesets and Bug IDs):

11146:7e58f40bcb1c
6826241	Sync write IOPS drops dramatically during TXG sync
6869229	zfs should switch to shiny new metaslabs more frequently

11728:59fdb3b856f6
6918420	zdb -m has issues printing metaslab statistics

12047:7c1fcc8419ca
6917066	zfs block picking can be improved

Approved by:	delphij (mentor)
Obtained from:	OpenSolaris (Bug ID 6826241, 6869229, 6918420, 6917066)
MFC after:	2 weeks
2010-08-28 08:59:55 +00:00