Commit Graph

432 Commits

Author SHA1 Message Date
Pawel Jakub Dawidek
7746b6461d Work-around READDIRPLUS problem with .zfs/ and .zfs/snapshot/ directories
by just returning EOPNOTSUPP. This will allow NFS server to fall back to
regular READDIR.

Note that converting inode number to snapshot's vnode is expensive operation.
Snapshots are stored in AVL tree, but based on their names, not inode numbers,
so to convert inode to snapshot vnode we have to interate over all snalshots.

This is not a problem in OpenSolaris, because in their READDIRPLUS
implementation they use VOP_LOOKUP() on d_name, instead of VFS_VGET() on
d_fileno as we do.

PR:		kern/125149
Reported by:	Weldon Godfrey <wgodfrey@ena.com>
Analysis by:	Jaakko Heinonen <jh@saunalahti.fi>
MFC after:	3 days
2009-09-13 16:05:20 +00:00
Pawel Jakub Dawidek
7b4a12379b When zfs.ko is compiled with debug, make sure that znode and vnode point at
each other.

MFC after:	3 days
2009-09-13 10:33:51 +00:00
Pawel Jakub Dawidek
33a0ef82f2 Extend scope of the z_teardown_lock lock for consistency and "just in case".
MFC after:	3 days
2009-09-13 10:29:51 +00:00
Pawel Jakub Dawidek
7dae3c4faf Be sure not to overflow struct fid.
MFC after:	3 days
2009-09-13 10:25:33 +00:00
Pawel Jakub Dawidek
f53901193d There is a bug where mze_insert() can trigger an assert() of inserting
the same entry twice. This bug is not fixed yet, but leads to situation
where when try to access corrupted directory the kernel will panic.
Until the bug is properly fixed, try to recover from it and log that it
happened.

Reported by:	marck
OpenSolaris bug:	6709336
MFC after:	3 days
2009-09-13 10:12:29 +00:00
Pawel Jakub Dawidek
f5516e3d1d - Protect reclaim with z_teardown_inactive_lock.
- Be prepared for dbuf to disappear in zfs_reclaim_complete() and check if
  z_dbuf field is NULL - this might happen in case of rollback or forced
  unmount between zfs_freebsd_reclaim() and zfs_reclaim_complete().
- On forced unmount wait for all znodes to be destroyed - destruction can be
  done asynchronously via zfs_reclaim_complete().

MFC after:	1 week
2009-09-12 19:53:31 +00:00
Pawel Jakub Dawidek
2a8e7dad33 Tighten up the check for race in zfs_zget() - ZTOV(zp) can not only contain
NULL, but also can point to dead vnode, take that into account.

PR:		kern/132068
Reported by:	Edward Fisk" <7ogcg7g02@sneakemail.com>, kris
Fix based on patch from:	Jaakko Heinonen <jh@saunalahti.fi>
MFC after:	1 week
2009-09-12 19:27:54 +00:00
Pawel Jakub Dawidek
3770996142 Only log successful commands! Without this fix we log even unsuccessful
commands executed by unprivileged users. Action is not really taken, but it is
logged to pool history, which might be confusing.

Reported by:	Denis Ahrens <denis@h3q.com>
MFC after:	3 days
2009-09-08 16:40:08 +00:00
Pawel Jakub Dawidek
d6b8039292 We don't export individual snapshots, so mnt_export field in snapshot's
mount point is NULL. That's why when we try to access snapshots over NFS
use mnt_export field from the parent file system.

MFC after:	1 week
2009-09-08 15:57:03 +00:00
Pawel Jakub Dawidek
f148fd9a4a When we automatically mount snapshot we want to return vnode of the mount point
from the lookup and not covered vnode. This is one of the fixes for using .zfs/
over NFS.

MFC after:	1 week
2009-09-08 15:51:40 +00:00
Pawel Jakub Dawidek
2391003912 On FreeBSD we don't have to look for snapshot's mount point,
because fhtovp method is already called with proper mount point.

MFC after:	1 week
2009-09-08 15:42:55 +00:00
Pawel Jakub Dawidek
6f8e88e1da Call ZFS_EXIT() after locking the vnode.
MFC after:	1 week
2009-09-08 15:37:01 +00:00
Konstantin Belousov
211ddddce7 Lock Giant around vn_open_cred().
Remove innocent unnecessary call to NDFREE().

Reported by:	marcel
Reviewed and tested by:	pjd
MFC after:	3 days
2009-09-08 09:17:34 +00:00
Pawel Jakub Dawidek
1ea3566294 Fix reference count leak for a case where snapshot's mount point is updated.
Such situation is not supported.

This problem was triggered by something like this:

	# zpool create tank da0
	# zfs snapshot tank@snap
	# cd /tank/.zfs/snapshot/snap  (this will mount the snapshot)
	# cd
	# mount -u nosuid /tank/.zfs/snapshot/snap  (refcount leak)
	# zpool export tank
	cannot export 'tank': pool is busy

MFC after:	1 week
2009-09-08 08:54:15 +00:00
Pawel Jakub Dawidek
28e449adf2 If we have to use avl_find(), optimize a bit and use avl_insert() instead of
avl_add() (the latter is actually a wrapper around avl_find() + avl_insert()).

Fix similar case in the code that is currently commented out.
2009-09-07 21:58:54 +00:00
Pawel Jakub Dawidek
3f6043a57d When snapshot mount point is busy (for example we are still in it)
we will fail to unmount it, but it won't be removed from the tree,
so in that case there is no need to reinsert it.

This fixes a panic reproducable in the following steps:

	# zfs create tank/foo
	# zfs snapshot tank/foo@snap
	# cd /tank/foo/.zfs/snapshot/snap
	# umount /tank/foo
	panic: avl_find() succeeded inside avl_add()

Reported by:	trasz
MFC after:	3 days
2009-09-07 21:46:51 +00:00
Edward Tomasz Napierala
343775c0b4 Enable NFSv4 ACL support in ZFS.
Reviewed by:	pjd
2009-09-07 19:43:13 +00:00
Pawel Jakub Dawidek
08780916dd Defer thread start until we set priority.
Reviewed by:	kib
MFC after:	3 days
2009-09-07 19:22:44 +00:00
Pawel Jakub Dawidek
c739b7b22b Don't recheck ownership on update mount. This will eliminate LOR between
vfs_busy() and mount mutex. We check ownership in vfs_domount() anyway.

Noticed by:	kib
Reviewed by:	kib
MFC after:	1 week
2009-09-07 18:54:55 +00:00
Pawel Jakub Dawidek
2ff6f0f89a - Avoid holding mutex around M_WAITOK allocations.
- Add locking for mnt_opt field.

MFC after:	1 week
2009-09-07 18:23:26 +00:00
Edward Tomasz Napierala
900b1670c4 Prevent the line from wrapping. 2009-09-07 16:56:41 +00:00
Pawel Jakub Dawidek
841bcfea21 Changing provider size is not really supported by GEOM, but doing so when
provider is closed should be ok.

When administrator requests to change ZVOL size do it immediately if ZVOL
is closed or do it on last ZVOL close.

PR:		kern/136942
Requested by:	Bernard Buri <bsd@ask-us.at>
MFC after:	1 week
2009-09-07 14:16:50 +00:00
Pawel Jakub Dawidek
5e65224daf bzero() on-stack argument, so mutex_init() won't misinterpret that the
lock is already initialized if we have some garbage on the stack.

PR:		kern/135480
Reported by:	Emil Mikulic <emikulic@gmail.com>
MFC after:	3 days
2009-09-07 11:38:43 +00:00
Edward Tomasz Napierala
a41422a93e Improve wording.
Discussed with:	pjd, cperciva, rink, wkoszek and des, in order of appearance.
2009-09-05 15:08:58 +00:00
Pawel Jakub Dawidek
26d0605727 Backport the 'dirtying dbuf' panic fix from newer ZFS version.
Reported by:	Thomas Backman <serenity@exscape.org>
MFC after:	1 week
2009-08-31 16:27:00 +00:00
Pawel Jakub Dawidek
575c1d371c Add missing mountpoint vnode locking.
This fixes panic on assertion with DEBUG_VFS_LOCKS and vfs.usermount=1 when
regular user tries to mount dataset owned by him.

MFC after:	1 week
2009-08-30 21:03:40 +00:00
Pawel Jakub Dawidek
5d5535163a - Hide ZFS kernel threads under zfskern process.
- Use better (shorter) threads names:
	'zvol:worker zvol/tank/vol00' -> 'zvol tank/vol00'
	'vdev:worker da0' -> 'vdev da0'
2009-08-23 11:33:46 +00:00
Pawel Jakub Dawidek
4ec2b0e7ce Set priority of vdev_geom threads and zvol threads to PRIBIO. 2009-08-23 11:27:08 +00:00
Pawel Jakub Dawidek
1869987e42 - Give minclsyspri and maxclsyspri real values (consulted with kmacy).
- Honour 'pri' argument for thread_create().
2009-08-23 11:22:46 +00:00
Pawel Jakub Dawidek
35ae9291c2 Our libc doesn't implement control method for XDR (only kernel does) and it
will always return failure. Fix this by bringing userland implementation of
xdrmem_control() back. This allow 'zpool import' to work again.

Reported by:	Thomas Backman <serenity@exscape.org>
Reviewed by:	kmacy
Approved by:	re (kib)
2009-08-20 00:05:29 +00:00
Pawel Jakub Dawidek
8e9fd65fbf getcwd() (when __getcwd() fails) works by stating current directory, going up
(..), calling readdir and looking for previous directory inode.  In case of
.zfs/ directory this doesn't work, because .zfs/ is hidden by default, so it
won't be visible in readdir output.

Fix this by implementing VPTOCNP for snapshot directories, so __getcwd()
doesn't fail and getcwd() doesn't have to use readdir method.

This fixes /bin/pwd from within .zfs/snapshot/<name>/.

Suggested by:	kib
Approved by:	re (rwatson)
2009-08-17 10:00:18 +00:00
Pawel Jakub Dawidek
8461b0f043 Manage asynchronous vnode release just like Solaris.
Discussed with:	kmacy
Approved by:	re (kib)
2009-08-17 09:48:34 +00:00
Pawel Jakub Dawidek
e35eb914f4 - Reduce z_teardown_lock lock scope a bit.
- The error variable is int, not bool.
- Convert spaces to tabs where needed.

Approved by:	re (kib)
2009-08-17 09:28:15 +00:00
Pawel Jakub Dawidek
0330a5dc10 If z_buf is NULL, we should free znode immediately.
Noticed by:	avg
Approved by:	re (kib)
2009-08-17 09:25:37 +00:00
Pawel Jakub Dawidek
d83cfc37a4 - We need to recycle vnode instead of freeing znode.
Submitted by:	avg

- Add missing vnode interlock unlock.
- Remove redundant znode locking.

Approved by:	re (kib)
2009-08-17 09:21:39 +00:00
Pawel Jakub Dawidek
f820bc079f Fix panic in zfs recv code. The last vnode (mountpoint's vnode) can have
0 usecount.

Reported by:	Thomas Backman <serenity@exscape.org>
Approved by:	re (kib)
2009-08-17 09:13:22 +00:00
Pawel Jakub Dawidek
159ef108e1 Remove OpenSolaris taskq port (it performs very poorly in our kernel) and
replace it with wrappers around our taskqueue(9).
To make it possible implement taskqueue_member() function which returns 1
if the given thread was created by the given taskqueue.

Approved by:	re (kib)
2009-08-17 09:01:20 +00:00
Pawel Jakub Dawidek
fddc954016 - Fix a race where /dev/zfs control device is created before ZFS is fully
initialized. Also destroy /dev/zfs before doing other deinitializations.
- Initialization through taskq is no longer needed and there is a race
  where one of the zpool/zfs command loads zfs.ko and tries to do some work
  immediately, but /dev/zfs is not there yet.

Reported by:	pav
Approved by:	re (kib)
2009-08-17 08:36:41 +00:00
Pawel Jakub Dawidek
830940567b Remove files that are no longer used.
Discussed with:	kmacy
Approved by:	re (kib)
2009-08-17 08:03:02 +00:00
Marcel Moolenaar
538e86713d Fix misalignment in nvpair_native_embedded() caused by the compiler
replacing the bzero(). See also revision 195627, which fixed the
misalignment in nvpair_native_embedded_array().

Approved by:	re (kensmith)
2009-08-16 01:48:46 +00:00
Edward Tomasz Napierala
abd370a36b Remove CDDL warning.
Approved by:	re (kib), core
2009-08-13 12:28:30 +00:00
Pawel Jakub Dawidek
abd8353f5d We don't support ephemeral IDs in FreeBSD and without this fix ZFS can
panic when in zfs_fuid_create_cred() when userid is negative. It is
converted to unsigned value which makes IS_EPHEMERAL() macro to
incorrectly report that this is ephemeral ID. The most reasonable
solution for now is to always report that the given ID is not ephemeral.

PR:		kern/132337
Submitted by:	Matthew West <freebsd@r.zeeb.org>
Tested by:	Thomas Backman <serenity@exscape.org>, Michael Reifenberger <mike@reifenberger.com>
Approved by:	re (kib)
MFC after:	2 weeks
2009-07-27 14:52:34 +00:00
Edward Tomasz Napierala
d2ceff236a Fix extattr_list_file(2) on ZFS in case the attribute directory
doesn't exist and user doesn't have write access to the file.
Without this fix, it returns bogus value instead of 0.  For some
reason this didn't manifest on my kernel compiled with -O0.

PR:		kern/136601
Submitted by:	Jaakko Heinonen <jh at saunalahti dot fi>
Approved by:	re (kib)
2009-07-22 15:15:58 +00:00
Edward Tomasz Napierala
65588fd503 Fix permission handling for extended attributes in ZFS. Without
this change, ZFS uses SunOS Alternate Data Streams semantics - each
EA has its own permissions, which are set at EA creation time
and - unlike SunOS - invisible to the user and impossible to change.
From the user point of view, it's just broken: sometimes access
is granted when it shouldn't be, sometimes it's denied when
it shouldn't be.

This patch makes it behave just like UFS, i.e. depend on current
file permissions.  Also, it fixes returned error codes (ENOATTR
instead of ENOENT) and makes listextattr(2) return 0 instead
of EPERM where there is no EA directory (i.e. the file never had
any EA).

Reviewed by:	pjd (idea, not actual code)
Approved by:	re (kib)
2009-07-20 19:16:42 +00:00
Andriy Gapon
b064b6d1cd dtrace_gethrtime: improve scaling of TSC ticks to nanoseconds
Currently dtrace_gethrtime uses formula similar to the following for
converting TSC ticks to nanoseconds:
rdtsc() * 10^9 / tsc_freq
The dividend overflows 64-bit type and wraps-around every 2^64/10^9 =
18446744073 ticks which is just a few seconds on modern machines.

Now we instead use precalculated scaling factor of
10^9*2^N/tsc_freq < 2^32 and perform TSC value multiplication separately
for each 32-bit half.  This allows to avoid overflow of the dividend
described above.
The idea is taken from OpenSolaris.
This has an added feature of always scaling TSC with invariant value
regardless of TSC frequency changes. Thus the timestamps will not be
accurate if TSC actually changes, but they are always proportional to
TSC ticks and thus monotonic. This should be much better than current
formula which produces wildly different non-monotonic results on when
tsc_freq changes.

Also drop write-only 'cp' variable from amd64 dtrace_gethrtime_init()
to make it identical to the i386 twin.

PR:		kern/127441
Tested by:	Thomas Backman <serenity@exscape.org>
Reviewed by:	jhb
Discussed with:	current@, bde, gnn
Silence from:	jb
Approved by:	re (gnn)
MFC after:	1 week
2009-07-15 17:07:39 +00:00
Konstantin Belousov
f33a947b56 Add new msleep(9) flag PBDY that shall be specified together with
PCATCH, to indicate that thread shall not be stopped upon receipt of
SIGSTOP until it reaches the kernel->usermode boundary.

Also change thread_single(SINGLE_NO_EXIT) to only stop threads at
the user boundary unconditionally.

Tested by:	pho
Reviewed by:	jhb
Approved by:	re (kensmith)
2009-07-14 22:52:46 +00:00
Marcel Moolenaar
dd8a461e83 In nvpair_native_embedded_array(), meaningless pointers are zeroed.
The programmer was aware that alignment was not guaranteed in the
packed structure and used bzero() to NULL out the pointers.
However, on ia64, the compiler is quite agressive in finding ILP
and calls to bzero() are often replaced by simple assignments (i.e.
stores). Especially when the width or size in question corresponds
with a store instruction (i.e. st1, st2, st4 or st8).

The problem here is not a compiler bug. The address of the memory
to zero-out was given by '&packed->nvl_priv' and given the type of
the 'packed' pointer the compiler could assume proper alignment for
the replacement of bzero() with an 8-byte wide store to be valid.
The problem is with the programmer. The programmer knew that the
address did not have the alignment guarantees needed for a regular
assignment, but failed to inform the compiler of that fact. In
fact, the programmer told the compiler the opposite: alignment is
guaranteed.

The fix is to avoid using a pointer of type "nvlist_t *" and
instead use a "char *" pointer as the basis for calculating the
address. This tells the compiler that only 1-byte alignment can
be assumed and the compiler will either keep the bzero() call
or instead replace it with a sequence of byte-wise stores. Both
are valid.

Approved by:	re (kib)
2009-07-11 22:43:20 +00:00
Andriy Gapon
f340e9fe71 dtrace/amd64: fix virtual address checks
On amd64 KERNBASE/kernbase does not mean start of kernel memory.
This should fix a KASSERT panic in dtrace_copycheck when copyin*()
is used in D program.
Also make checks for user memory a bit stricter.

Reported by:	Thomas Backman <serenity@exscape.org>
Submitted by:	wxs (kaddr part)
Tested by:	Thomas Backman (prototype), wxs
Reviewed by:	alc (concept), jhb, current@
Aprroved by:	jb (concept)
MFC after:	2 weeks
PR:		kern/134408
2009-06-24 16:03:57 +00:00
Konstantin Belousov
a18a95db4a O_NOFOLLOW shall be in flags, not in cmode.
Noted by:	bde
2009-06-22 10:08:48 +00:00
Konstantin Belousov
e0c161b89c Add another flags argument to vn_open_cred. Use it to specify that some
vn_open_cred invocations shall not audit namei path.

In particular, specify VN_OPEN_NOAUDIT for dotdot lookup performed by
default implementation of vop_vptocnp, and for the open done for core
file. vn_fullpath is called from the audit code, and vn_open there need
to disable audit to avoid infinite recursion. Core file is created on
return to user mode, that, in particular, happens during syscall return.
The creation of the core file is audited by direct calls, and we do not
want to overwrite audit information for syscall.

Reported, reviewed and tested by: rwatson
2009-06-21 13:41:32 +00:00
Jamie Gritton
c1f192193d Rename the host-related prison fields to be the same as the host.*
parameters they represent, and the variables they replaced, instead of
abbreviated versions of them.

Approved by:	bz (mentor)
2009-06-13 15:39:12 +00:00
Kip Macy
f0c6b798a3 pjd has requested that I keep the tunable as zfs_prefetch_disable to minimize gratuitous
differences with Opensolaris' ZFS

Sorry for the churn
2009-06-11 22:24:08 +00:00
Kip Macy
e4e5e663e0 check against prefetch_enable 2009-06-11 09:51:21 +00:00
Kip Macy
3fa5485637 use default policy for enabling prefetching unless the TUNABLE is set 2009-06-10 21:05:37 +00:00
Kip Macy
107b659450 As far as I can tell systems that have less than 4GB are more often hurt
by prefetched than helped.  On i386 systems and systems with less than 4GB,
prefetch is now disabled by default. I've added a prefetch enable tunable, to
enable prefetching for those systems. The prefetch disable tunable will continue
to unconditionally disable prefetching.
2009-06-10 01:21:32 +00:00
Paul Saab
a6d545d8ed Support shared vnode locks for write operations when the offset is
provided on filesystems that support it.  This really improves mysql
+ innodb performance on ZFS.

Reviewed by:	jhb, kmacy, jeffr
2009-06-04 16:18:07 +00:00
Doug Rabson
8be608b58c Allow the bootfs property to be set for raidz pools on FreeBSD.
Reviewed by:	pjd
2009-05-31 11:59:32 +00:00
Kip Macy
762169b50a fix xdrmem_control to be safe in an if statement
fix zfs to depend on krpc
remove xdr from zfs makefile

Submitted by:	dchagin@freebsd.org
2009-05-30 22:23:58 +00:00
Kip Macy
139ccddec0 work around snapshot shutdown race reported by Henri Hennebert 2009-05-30 19:26:35 +00:00
Jamie Gritton
76ca6f88da Place hostnames and similar information fully under the prison system.
The system hostname is now stored in prison0, and the global variable
"hostname" has been removed, as has the hostname_mtx mutex.  Jails may
have their own host information, or they may inherit it from the
parent/system.  The proper way to read the hostname is via
getcredhostname(), which will copy either the hostname associated with
the passed cred, or the system hostname if you pass NULL.  The system
hostname can still be accessed directly (and without locking) at
prison0.pr_host, but that should be avoided where possible.

The "similar information" referred to is domainname, hostid, and
hostuuid, which have also become prison parameters and had their
associated global variables removed.

Approved by:	bz (mentor)
2009-05-29 21:27:12 +00:00
Attilio Rao
1ae1c2a3bd Reverse the logic for ADAPTIVE_SX option and enable it by default.
Introduce for this operation the reverse NO_ADAPTIVE_SX option.
The flag SX_ADAPTIVESPIN to be passed to sx_init_flags(9) gets suppressed
and the new flag, offering the reversed logic, SX_NOADAPTIVE is added.

Additively implements adaptive spininning for sx held in shared mode.
The spinning limit can be handled through sysctls in order to be tuned
while the code doesn't reach the release, after which time they should
be dropped probabilly.

This change has made been necessary by recent benchmarks where it does
improve concurrency of workloads in presence of high contention
(ie. ZFS).

KPI breakage is documented by __FreeBSD_version bumping, manpage and
UPDATING updates.

Requested by:	jeff, kmacy
Reviewed by:	jeff
Tested by:	pho
2009-05-29 01:49:27 +00:00
Kip Macy
c334d2d544 MFdevbranch 192944
- add FreeBSD implementation of xdrmem_control needed by zfs
 - have zfs define xdr_ops using FreeBSD's definition
 - remove solaris xdr files from zfs compile
2009-05-28 08:18:12 +00:00
Stacey Son
a5aedd68b4 Add the OpenSolaris dtrace lockstat provider. The lockstat provider
adds probes for mutexes, reader/writer and shared/exclusive locks to
gather contention statistics and other locking information for
dtrace scripts, the lockstat(1M) command and other potential
consumers.

Reviewed by:	attilio jhb jb
Approved by:	gnn (mentor)
2009-05-26 20:28:22 +00:00
Edward Tomasz Napierala
b7014134a7 Change license to more bori^Wadul^Wcanonical.
Submitted by:	rwatson@
2009-05-26 11:42:06 +00:00
Edward Tomasz Napierala
0970b4bae0 MFp4 changes neccessary for NFSv4 ACLs support in ZFS. This is mostly
about removing a few #ifdefs and providing compatibility wrappers and
VOP implementations to get and set an ACL; ZFS does ACL enforcement all
by itself.

Note that the VOPs are ifdefed out for now, so this change should be
a no-op.

Reviewed by:	pjd
2009-05-26 08:21:59 +00:00
Edward Tomasz Napierala
4076aa37dc Don't allow non-owner to set SUID bit on a file. It doesn't make
any difference now, but in NFSv4 ACLs, there is write_acl permission,
which also affects mode changes.

Reviewed by:	pjd
2009-05-24 19:21:49 +00:00
Edward Tomasz Napierala
194f4d42de Fix comment. 2009-05-24 15:48:48 +00:00
Dag-Erling Smørgrav
bba5cfd28b Unexpand $FreeBSD$. 2009-05-23 16:01:58 +00:00
Dag-Erling Smørgrav
6feca53bed Remove svn:keywords on a file that had fbsd:nokeywords (though I don't
understand the reason for the latter)
2009-05-23 16:00:16 +00:00
Kip Macy
e95d34711b - back out direct map hack
- it is no longer needed
2009-05-19 01:14:37 +00:00
Kip Macy
0fe5460dbd set createtxg prop name
PR: bin/130105
2009-05-17 04:04:25 +00:00
Kip Macy
ea41c77517 SAVESTART implies SAVENAME 2009-05-17 01:31:28 +00:00
Kip Macy
2e9c90d55b enable adaptive spinning on zfs locks 2009-05-16 23:56:45 +00:00
Kip Macy
be08aa8b59 - allow forced unmounts
- don't assume snapshot was auto-mounted
2009-05-16 20:33:13 +00:00
Kip Macy
71bc1ce36e only use direct map if system has more than 2GB 2009-05-16 20:09:07 +00:00
Kip Macy
32237d8492 apply band-aid to x86_64 systems with more physical memory than kmem by allocating from the direct map 2009-05-16 19:17:15 +00:00
Doug Rabson
e1899ef6c8 Add support for booting from raidz1 and raidz2 pools. 2009-05-16 10:48:20 +00:00
Attilio Rao
dfd233edd5 Remove the thread argument from the FSD (File-System Dependent) parts of
the VFS.  Now all the VFS_* functions and relating parts don't want the
context as long as it always refers to curthread.

In some points, in particular when dealing with VOPs and functions living
in the same namespace (eg. vflush) which still need to be converted,
pass curthread explicitly in order to retain the old behaviour.
Such loose ends will be fixed ASAP.

While here fix a bug: now, UFS_EXTATTR can be compiled alone without the
UFS_EXTATTR_AUTOSTART option.

VFS KPI is heavilly changed by this commit so thirdy parts modules needs
to be recompiled.  Bump __FreeBSD_version in order to signal such
situation.
2009-05-11 15:33:26 +00:00
Kip Macy
469ef3e563 rename xdr support files to avoid conflicts when linking in to the kernel 2009-05-11 04:18:58 +00:00
Kip Macy
8569258bf8 - rename atomic.S and crc32.c to avoid collisions when linking zfs in to the kernel
- update Makefile
- ifdef out acl_{alloc, free}, they aren't used by zfs and conflict with existing in-kernel routines
2009-05-09 01:45:55 +00:00
Marko Zec
29b02909eb Introduce a new virtualization container, provisionally named vprocg, to hold
virtualized instances of hostname and domainname, as well as a new top-level
virtualization struct vimage, which holds pointers to struct vnet and struct
vprocg.  Struct vprocg is likely to become replaced in the near future with
a new jail management API import.

As a consequence of this change, change struct ucred to point to a struct
vimage, instead of directly pointing to a vnet.

Merge vnet / vimage / ucred refcounting infrastructure from p4 / vimage
branch.

Permit kldload / kldunload operations to be executed only from the default
vimage context.

This change should have no functional impact on nooptions VIMAGE kernel
builds.

Reviewed by:	bz
Approved by:	julian (mentor)
2009-05-08 14:11:06 +00:00
Kip Macy
a6827463ad don't call vn_rele_async_fini in the !_KERNEL case 2009-05-07 23:34:41 +00:00
Kip Macy
c20fd07777 move VN_RELE_ASYNC to the compatibility layer with the rest of the VN_* defines 2009-05-07 23:02:15 +00:00
Kip Macy
6ef1a81d6e avoid LOR and gratuitous extra lock acquisitions by moving user_evict list buffers to
a temporary list
2009-05-07 21:51:13 +00:00
Kip Macy
77d0162c70 Allow the VM to provide backpressure on the ARC cache as it does
on Solaris.
2009-05-07 20:57:06 +00:00
Kip Macy
62fa227ccd Asynchronously release vnodes to avoid blocking on range locks when calling back in to zfs.
This is based on a fix that went in to opensolaris on March 9th. However, it uses a dedicated
thread instead of a Solaris' taskq to avoid doing a blocking memory allocation with the vnode
interlock held.

This fixes a long-time deadlock in ZFS. This is not, strictly speaking, an LOR. The spa_zio
thread releases a vnode, this calls in to vn_reclaim which in turn needs to acquire range locks
to sync dirty data out to disk. The range locks are already held by a user-level process waiting
on a condition variable that it the process is waiting on a spa_zio thread to signal it on. The
process could not be signalled because the spa_zio thread could not proceed.

The nature of this problem was not apparent due to ZFS locks opting out of witness which meant
that DDB did not know about the locks that were held by ZFS.

Reviewed by:	pjd
MFC after:	7 days
2009-05-07 20:28:06 +00:00
Jamie Gritton
b38ff370e4 Introduce the extensible jail framework, using the same "name=value"
interface as nmount(2).  Three new system calls are added:
* jail_set, to create jails and change the parameters of existing jails.
  This replaces jail(2).
* jail_get, to read the parameters of existing jails.  This replaces the
  security.jail.list sysctl.
* jail_remove to kill off a jail's processes and remove the jail.
Most jail parameters may now be changed after creation, and jails may be
set to exist without any attached processes.  The current jail(2) system
call still exists, though it is now a stub to jail_set(2).

Approved by:	bz (mentor)
2009-04-29 21:14:15 +00:00
Robert Watson
885868cd8f Remove VOP_LEASE and supporting functions. This hasn't been used since
the removal of NQNFS, but was left in in case it was required for NFSv4.
Since our new NFSv4 client and server can't use it for their
requirements, GC the old mechanism, as well as other unused lease-
related code and interfaces.

Due to its impact on kernel programming and binary interfaces, this
change should not be MFC'd.

Proposed by:    jeff
Reviewed by:    jeff
Discussed with: rmacklem, zach loafman @ isilon
2009-04-10 10:52:19 +00:00
Andrew Thompson
853a10a581 Revert r190676,190677
The geom and CAM changes for root_hold are the wrong solution for USB design
quirks.

Requested by:	scottl
2009-04-10 04:08:34 +00:00
Andrew Thompson
626fc9fe3d Add a how argument to root_mount_hold() so it can be passed NOWAIT and be called
in situations where sleeping isnt allowed.
2009-04-03 19:46:12 +00:00
Robert Watson
455f3aa24f Move dtnfsclient.c in the cddl tree to nfs_kdtrace.c in the nfsclient
directory, since it's under a BSD license, and this keeps NFS internals-
aware tracing parts close to NFS.

MFC after:	1 month
Suggested by:	jhb
2009-03-25 17:47:22 +00:00
Robert Watson
10263f0832 Add DTrace probes to the NFS access and attribute caches. Access cache
events are:

  nfsclient:accesscache:flush:done
  nfsclient:accesscache:get:hit
  nfsclient:accesscache:get:miss
  nfsclient:accesscache:load:done

They pass the vnode, uid, and requested or loaded access mode (if any);
the load event may also report a load error if the RPC fails.

The attribute cache events are:

  nfsclient:attrcache:flush:done
  nfsclient:attrcache:get:hit
  nfsclient:attrcache:get:miss
  nfsclient:attrcache:load:done

They pass the vnode, optionally the vattr if one is present (hit or load),
and in the case of a load event, also a possible RPC error.

MFC after:	1 month
Sponsored by:	Google, Inc.
2009-03-24 17:14:34 +00:00
Robert Watson
47294818f9 Add dtnfsclient, a first cut at an NFSv2/v3 client reuest DTrace
provider.  The NFS client exposes 'start' and 'done' probes for NFSv2
and NFSv3 RPCs when using the new RPC implementation, passing in the
vnode, mbuf chain, credential, and NFSv2 or NFSv3 procedure number.
For 'done' probes, the error number is also available.

Probes are named in the following way:

  ...
  nfsclient:nfs2:write:start
  nfsclient:nfs2:write:done
  ...
  nfsclient:nfs3:access:start
  nfsclient:nfs3:access:done
  ...

Access to the unmarshalled arguments is not easily available at this
point in the stack, but the passed probe arguments are sufficient to
to a lot of interesting things in practice.  Technically, these probes
may cover multiple RPC retransmits, and even transactions if the
transaction ID change as a result of authentication failure or a
jukebox error from the server, but usefully capture the intent of a
single NFS request, such as access, getattr, write, etc.

Typical use might involve profiling RPC latency by system call, number
of RPCs, how often a getattr leads to a call to access, when failed
access control checks occur, etc.  More detailed RPC information might
best be provided by adding a krpc provider.  It would also be useful
to add NFS client probes for events such as the access cache or
attribute cache satisfying requests without an RPC.

Sponsored by:	Google, Inc.
MFC after:	1 month
2009-03-22 22:07:52 +00:00
John Baldwin
9fca7a854c The zfs_get_xattrdir() function is used to find the extended attribute
directory for a znode.  When the directory already exists, it returns a
referenced but unlocked vnode.  When a directory does not yet exist, it
calls zfs_make_xattrdir() to create a new one.  zfs_make_xattrdir() returns
the vnode both referenced and and locked and zfs_get_xattrdir() was leaking
this vnode lock to its callers.  Fix this by dropping the vnode lock if
zfs_make_xattrdir() successfully creates a new extended attribute
directory.

Reviewed by:	pjd
2009-03-18 16:19:44 +00:00
John Baldwin
33fc362512 Add a new internal mount flag (MNTK_EXTENDED_SHARED) to indicate that a
filesystem supports additional operations using shared vnode locks.
Currently this is used to enable shared locks for open() and close() of
read-only file descriptors.
- When an ISOPEN namei() request is performed with LOCKSHARED, use a
  shared vnode lock for the leaf vnode only if the mount point has the
  extended shared flag set.
- Set LOCKSHARED in vn_open_cred() for requests that specify O_RDONLY but
  not O_CREAT.
- Use a shared vnode lock around VOP_CLOSE() if the file was opened with
  O_RDONLY and the mountpoint has the extended shared flag set.
- Adjust md(4) to upgrade the vnode lock on the vnode it gets back from
  vn_open() since it now may only have a shared vnode lock.
- Don't enable shared vnode locks on FIFO vnodes in ZFS and UFS since
  FIFO's require exclusive vnode locks for their open() and close()
  routines.  (My recent MPSAFE patches for UDF and cd9660 already included
  this change.)
- Enable extended shared operations on UFS, cd9660, and UDF.

Submitted by:	ups
Reviewed by:	pjd (ZFS bits)
MFC after:	1 month
2009-03-11 14:13:47 +00:00
Jamie Gritton
f86bce5ed0 Extend the "vfsopt" mount options for more general use. Make struct
vfsopt and the vfs_buildopts function public, and add some new fields
to struct vfsopt (pos and seen), and new functions vfs_getopt_pos and
vfs_opterror.

Further extend the interface to allow reading options from the kernel
in addition to sending them to the kernel, with vfs_setopt and related
functions.

While this allows the "name=value" option interface to be used for more
than just FS mounts (planned use is for jails), it retains the current
"vfsopt" name and <sys/mount.h> requirement.

Approved by:	bz (mentor)
2009-03-02 23:26:30 +00:00
Ed Schouten
802cb57e34 Add memmove() to the kernel, making the kernel compile with Clang.
When copying big structures, LLVM generates calls to memmove(), because
it may not be able to figure out whether structures overlap. This caused
linker errors to occur. memmove() is now implemented using bcopy().
Ideally it would be the other way around, but that can be solved in the
future. On ARM we don't do add anything, because it already has
memmove().

Discussed on:	arch@
Reviewed by:	rdivacky
2009-02-28 16:21:25 +00:00
John Baldwin
ea77ff0a15 Use shared vnode locks when invoking VOP_READDIR().
MFC after:	1 month
2009-02-13 18:18:14 +00:00
Ed Schouten
a4611ab612 Last step of splitting up minor and unit numbers: remove minor().
Inside the kernel, the minor() function was responsible for obtaining
the device minor number of a character device. Because we made device
numbers dynamically allocated and independent of the unit number passed
to make_dev() a long time ago, it was actually a misnomer. If you really
want to obtain the device number, you should use dev2udev().

We already converted all the drivers to use dev2unit() to obtain the
device unit number, which is still used by a lot of drivers. I've
noticed not a single driver passes NULL to dev2unit(). Even if they
would, its behaviour would make little sense. This is why I've removed
the NULL check.

Ths commit removes minor(), minor2unit() and unit2minor() from the
kernel. Because there was a naming collision with uminor(), we can
rename umajor() and uminor() back to major() and minor(). This means
that the makedev(3) manual page also applies to kernel space code now.

I suspect umajor() and uminor() isn't used that often in external code,
but to make it easier for other parties to port their code, I've
increased __FreeBSD_version to 800062.
2009-01-28 17:57:16 +00:00
Warner Losh
78bc7eec0d Put the MIPS support back in after it was removed in r185029. 2008-12-04 16:31:08 +00:00
Pawel Jakub Dawidek
35a15332f3 MFp4: Remove assertion that is no longer valid - we now use VOP_CLOSE() in
more places (ie vdev_file.c).
2008-11-29 12:32:42 +00:00
Edward Tomasz Napierala
38cc5da78e MFp4: We don't support TX_CREATE_ACL_ATTR nor TX_MKDIR_ACL_ATTR; code found
in zfs_replay.c will panic if it encounters transactions of this type.
Make sure we don't put these into the ZIL.

Approved by:	rwatson (mentor), pjd
2008-11-25 23:05:46 +00:00
Pawel Jakub Dawidek
ad35ee04f4 Fix locking (file descriptor table and Giant around VFS).
Most submitted by:	kib
Reviewed by:		kib
2008-11-25 21:14:00 +00:00
Ganbold Tsagaankhuu
79dae0aa0b Remove unused variable.
Found with:     Coverity Prevent(tm)
CID: 3669,3671

Approved by: jb
2008-11-25 19:25:54 +00:00
Pawel Jakub Dawidek
83080c1ece Don't use PRIV_ROOT. Here we check if user can share ZFS file system, so
PRIV_NFS_DAEMON seems best choice.

Discussed with:	rwatson
2008-11-23 20:14:19 +00:00
Pawel Jakub Dawidek
bcfbcdca9c IFp4: Don't rely on disk IDs and always use vdev guids, which means always look
up for components by reading metadata. This might be slower when there are big
number of disks in the system, but is definiately more reliable.
2008-11-22 13:33:06 +00:00
Pawel Jakub Dawidek
74303ba55c IFp4: Finish implemnetation of chflags(2) for ZFS. While doing this I found
that zfs_access() can only handle VREAD, VWRITE and VEXEC, for the rest we need
to use vaccess(9).
2008-11-22 13:24:44 +00:00
Pawel Jakub Dawidek
5189bf22c0 IFp4: Don't free pathname too soon, debugging code is still using it. 2008-11-22 13:22:24 +00:00
Doug Rabson
786895f6ba Add definitions for ZFS pool version 13. 2008-11-21 09:10:35 +00:00
Doug Rabson
0d16312b46 Some zfsboot fixes from Norikatsu Shigemura:
1. zfsboot2 (boot2) doesn't %d (printf), so change %d to %u.
2. chase new zpool versioning as SPA_VERSION.
   Obtained from: sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h

Submitted by:	nork
2008-11-19 16:59:19 +00:00
Pawel Jakub Dawidek
1ba4a712dd Update ZFS from version 6 to 13 and bring some FreeBSD-specific changes.
This bring huge amount of changes, I'll enumerate only user-visible changes:

- Delegated Administration

	Allows regular users to perform ZFS operations, like file system
	creation, snapshot creation, etc.

- L2ARC

	Level 2 cache for ZFS - allows to use additional disks for cache.
	Huge performance improvements mostly for random read of mostly
	static content.

- slog

	Allow to use additional disks for ZFS Intent Log to speed up
	operations like fsync(2).

- vfs.zfs.super_owner

	Allows regular users to perform privileged operations on files stored
	on ZFS file systems owned by him. Very careful with this one.

- chflags(2)

	Not all the flags are supported. This still needs work.

- ZFSBoot

	Support to boot off of ZFS pool. Not finished, AFAIK.

	Submitted by:	dfr

- Snapshot properties

- New failure modes

	Before if write requested failed, system paniced. Now one
	can select from one of three failure modes:
	- panic - panic on write error
	- wait - wait for disk to reappear
	- continue - serve read requests if possible, block write requests

- Refquota, refreservation properties

	Just quota and reservation properties, but don't count space consumed
	by children file systems, clones and snapshots.

- Sparse volumes

	ZVOLs that don't reserve space in the pool.

- External attributes

	Compatible with extattr(2).

- NFSv4-ACLs

	Not sure about the status, might not be complete yet.

	Submitted by:	trasz

- Creation-time properties

- Regression tests for zpool(8) command.

Obtained from:	OpenSolaris
2008-11-17 20:49:29 +00:00
Edward Tomasz Napierala
4bdaada206 Require write access on a directory being moved from one parent
directory to another in ZFS.

Approved by:	rwatson (mentor), pjd
2008-11-08 19:56:32 +00:00
Edward Tomasz Napierala
36d227d9ed Backoff the last patch. It was overly restrictive - we want to check
for write permission on target only when moving the target between two
directories.

Approved by:	rwatson (mentor)
2008-11-06 22:28:04 +00:00
Edward Tomasz Napierala
b92eda309d Change ZFS behaviour to match UFS: when moving (rename(2)) a subdirectory
from one parent directory to another, in addition to the usual access checks
one also needs write access to the subdirectory being moved.

Approved by:    rwatson (mentor), pjd
2008-11-06 19:17:58 +00:00
Craig Rodrigues
6a73ed4f46 Remove definition of KMEM_DEBUG accidentally brought in by latest DTrace
import.

Noticed by:	thompsa
2008-11-05 20:32:13 +00:00
Craig Rodrigues
f5a97d1bcb Merge latest DTrace changes from Perforce. 2008-11-05 19:39:11 +00:00
Edward Tomasz Napierala
15bc6b2bd8 Introduce accmode_t. This is required for NFSv4 ACLs - it will be neccessary
to add more V* constants, and the variables changed by this patch were often
being assigned to mode_t variables, which is 16 bit.

Approved by:	rwatson (mentor)
2008-10-28 13:44:11 +00:00
Attilio Rao
0d7935fd01 Remove the struct thread unuseful argument from bufobj interface.
In particular following functions KPI results modified:
- bufobj_invalbuf()
- bufsync()

and BO_SYNC() "virtual method" of the buffer objects set.
Main consumers of bufobj functions are affected by this change too and,
in particular, functions which changed their KPI are:
- vinvalbuf()
- g_vfs_close()

Due to the KPI breakage, __FreeBSD_version will be bumped in a later
commit.

As a side note, please consider just temporary the 'curthread' argument
passing to VOP_SYNC() (in bufsync()) as it will be axed out ASAP

Reviewed by:	kib
Tested by:	Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
2008-10-10 21:23:50 +00:00
John Birrell
fd4cdfbf46 Disable use of the user credentials until there is code to set the levels
that DTrace uses.

This fixes a bug that would have affected kernels built with MAC and all
kernels built after the mpsafetty integration.

The bug will be apparent in RELENG7 on MAC kernels.

Reported by: kan
2008-09-27 17:52:48 +00:00
Ed Schouten
6bfa9a2d66 Replace all calls to minor() with dev2unit().
After I removed all the unit2minor()/minor2unit() calls from the kernel
yesterday, I realised calling minor() everywhere is quite confusing.
Character devices now only have the ability to store a unit number, not
a minor number. Remove the confusion by using dev2unit() everywhere.

This commit could also be considered as a bug fix. A lot of drivers call
minor(), while they should actually be calling dev2unit(). In -CURRENT
this isn't a problem, but it turns out we never had any problem reports
related to that issue in the past. I suspect not many people connect
more than 256 pieces of the same hardware.

Reviewed by:	kib
2008-09-27 08:51:18 +00:00
Ed Schouten
d3ce832719 Remove unit2minor() use from kernel code.
When I changed kern_conf.c three months ago I made device unit numbers
equal to (unneeded) device minor numbers. We used to require
bitshifting, because there were eight bits in the middle that were
reserved for a device major number. Not very long after I turned
dev2unit(), minor(), unit2minor() and minor2unit() into macro's.
The unit2minor() and minor2unit() macro's were no-ops.

We'd better not remove these four macro's from the kernel, because there
is a lot of (external) code that may still depend on them. For now it's
harmless to remove all invocations of unit2minor() and minor2unit().

Reviewed by:	kib
2008-09-26 14:19:52 +00:00
Warner Losh
6e1a9d1739 Mips needs the same treatment for atomic_or_8 as the other RISCy
architectures.
2008-09-18 19:57:06 +00:00
Pawel Jakub Dawidek
062ea27ee4 Add missing ZFS_EXIT().
PR:		kern/124899
Submitted by:	Masakazu Asama <m-asama@ginzado.ne.jp>
2008-09-15 11:27:25 +00:00
Edward Tomasz Napierala
dfa7fd1d70 Remove VSVTX, VSGID and VSUID. This should be a no-op,
as VSVTX == S_ISVTX, VSGID == S_ISGID and VSUID == S_ISUID.

Approved by:	rwatson (mentor)
2008-09-10 13:16:41 +00:00
Pawel Jakub Dawidek
1b856fa491 Initialize vp, so we don't call VOP_UNLOCK() with NULL vnode pointer.
Confirmed by:	marcus
2008-09-07 07:55:12 +00:00
Pawel Jakub Dawidek
433751bb50 Lock vnode exclusively around insmntque(). 2008-09-06 17:24:07 +00:00
Pawel Jakub Dawidek
7fa1f32a7e Catch up after last insmntque() changes:
- The vnode has to be locked exclusively before calling insmntque().
- Until I find a way to handle insmntque() failures use VV_FORCEINSMQ flag
  to force insmntque() to always succeed.

Reported by:	kris, trasz, des, others
Suggested by:	kib
Tested by:	trasz
2008-09-05 07:00:40 +00:00
Attilio Rao
59d4932531 Decontextualize vfs_busy(), vfs_unbusy() and vfs_mount_alloc() functions.
Manpages are updated accordingly.

Tested by:	Diego Sardina <siarodx at gmail dot com>
2008-08-31 14:26:08 +00:00
Scott Long
a25cb00747 Ensure that the padding calcualtion doesn't return a negative value.
Submitted by:	kib
Approved by:	jb
2008-08-29 15:55:49 +00:00
Attilio Rao
0359a12ead Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread
was always curthread and totally unuseful.

Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
2008-08-28 15:23:18 +00:00
Warner Losh
e6b3a7a9c1 Add MIPS support.
Reviewed by:	jb@
2008-08-23 04:58:11 +00:00
John Birrell
ac80559536 Add calls to callout_drain() to ensure the callouts are flushed before
we free memory from underneath them.

This fixes an occasional panic I've been seeing in softclock() where a bad
pointer would be encountered when pushing DTrace hard.
2008-08-19 21:28:58 +00:00
Pawel Jakub Dawidek
37876323b1 We want to use LBOLT instead of lbolt on FreeBSD.
I've this already fixed in p4, but the fix was never integrated into HEAD.

Reported by:	ed
2008-07-21 14:35:48 +00:00
Pawel Jakub Dawidek
28814ddbe8 We want to check new options given, not the current ones.
This fixes 'zpool import -o <mntopt> <name>' not working properly.
2008-07-21 09:45:44 +00:00
Ed Schouten
3f7eea97fd Remove the $FreeBSD$ tag again, now I know fbsd:nokeywords exists.
Requested by:	pjd
Approved by:	philip (mentor)
2008-06-12 08:53:54 +00:00
Ed Schouten
0f03ce1bb8 Turn dev2unit(), minor(), unit2minor() and minor2unit() into macro's.
Now that we got rid of the minor-to-unit conversion and the constraints
on device minor numbers, we can convert the functions that operate on
minor and unit numbers to simple macro's. The unit2minor() and
minor2unit() macro's are now no-ops.

The ZFS code als defined a macro named `minor'. Change the ZFS code to
use umajor() and uminor() here, as it is the correct approach to do
this. Also add $FreeBSD$ to keep SVN happy.

Approved by:	philip (mentor), pjd
2008-06-12 08:30:54 +00:00
Ed Schouten
29d4cb241b Don't enforce unique device minor number policy anymore.
Except for the case where we use the cloner library (clone_create() and
friends), there is no reason to enforce a unique device minor number
policy. There are various drivers in the source tree that allocate unr
pools and such to provide minor numbers, without using them themselves.

Because we still need to support unique device minor numbers for the
cloner library, introduce a new flag called D_NEEDMINOR. All cdevsw's
that are used in combination with the cloner library should be marked
with this flag to make the cloning work.

This means drivers can now freely use si_drv0 to store their own flags
and state, making it effectively the same as si_drv1 and si_drv2. We
still keep the minor() and dev2unit() routines around to make drivers
happy.

The NTFS code also used the minor number in its hash table. We should
not do this anymore. If the si_drv0 field would be changed, it would no
longer end up in the same list.

Approved by:	philip (mentor)
2008-06-11 18:55:19 +00:00
John Birrell
4ca07625aa Merge a recent change from the OpenSolaris source tree.
(Don't ask for a vendor import of this yet, we're in the early days of svn)

Instead of using cyclic timers to call the state clean and deadman callbacks,
use a callout on FreeBSD to avoid the deadlock on FreeBSD due to trying to
send interprocessor interrupts with interrupts disabled.

Reported by: ps, jhb, peter, thompsa
2008-06-01 01:46:37 +00:00
Pawel Jakub Dawidek
ed5a2ac45c Fix namespace collision after src/sys/sys/file.h:1.78. 2008-05-25 22:34:17 +00:00
John Birrell
727acbb41b Comment out the code that breaks with invariants. This is stuff that is
still WIP along with the lockstat provider, so there is no harm leaving
it out for now.
2008-05-25 20:24:07 +00:00
Bjoern A. Zeeb
079d3bfcfb Remove redundant redeclaration of 'zone_drain'. 2008-05-24 19:30:38 +00:00
John Birrell
8fc6245976 Make the zfs module depend on the opensolaris module in preparation for it
to shared stuff with the DTrace modules.
2008-05-24 06:43:55 +00:00
John Birrell
25f292128c Messing with the endian defines breaks the use of other FreeBSD headers. 2008-05-23 23:03:17 +00:00
John Birrell
fd930d81d8 Delete a couple of OpenSolaris headers which get in the way of our
implementation.
2008-05-23 22:40:58 +00:00
John Birrell
8599306711 OpenSolaris kernel module compatibility sources. 2008-05-23 22:39:28 +00:00
John Birrell
5a3c3bfaa3 The cyclic timer device. This is a cut down version of the one in
OpenSolaris. We don't have the lock levels that they do, so this is just
hooked into clock interrupts.
2008-05-23 22:21:58 +00:00
John Birrell
91eaf3e183 Custom DTrace kernel module files plus FreeBSD-specific DTrace providers. 2008-05-23 05:59:42 +00:00
John Birrell
32a109c1d8 A 'special' compatibility header to plug OpenSolaris code. 2008-05-22 09:08:41 +00:00
John Birrell
4706efa4f6 Additional compatibility headers. 2008-05-22 08:35:03 +00:00
John Birrell
1583a68737 Compatibility stuff for DTrace. 2008-05-22 08:33:24 +00:00
John Birrell
5a1b490d50 FreeBSD changes to vendor source. 2008-05-22 07:33:39 +00:00
John Birrell
cd844e7a7d This commit was generated by cvs2svn to compensate for changes in r179193,
which included commits to RCS files with non-trunk default branches.
2008-05-22 07:04:10 +00:00
Attilio Rao
295624f56a LO_ENROLLPEND is no more existing so just axe it (it was left out by the
original commit axing it).
2008-05-16 02:09:13 +00:00
John Birrell
db612abe8d Add FreeBSD IDs to files that originate in FreeBSD. 2008-04-22 07:43:00 +00:00
Konstantin Belousov
eab626f110 Move the head of byte-level advisory lock list from the
filesystem-specific vnode data to the struct vnode. Provide the
default implementation for the vop_advlock and vop_advlockasync.
Purge the locks on the vnode reclaim by using the lf_purgelocks().
The default implementation is augmented for the nfs and smbfs.
In the nfs_advlock, push the Giant inside the nfs_dolock.

Before the change, the vop_advlock and vop_advlockasync have taken the
unlocked vnode and dereferenced the fs-private inode data, racing with
with the vnode reclamation due to forced unmount. Now, the vop_getattr
under the shared vnode lock is used to obtain the inode size, and
later, in the lf_advlockasync, after locking the vnode interlock, the
VI_DOOMED flag is checked to prevent an operation on the doomed vnode.

The implementation of the lf_purgelocks() is submitted by dfr.

Reported by:	kris
Tested by:	kris, pho
Discussed with:	jeff, dfr
MFC after:	2 weeks
2008-04-16 11:33:32 +00:00
Marius Strobl
5b20de10b9 Add atomic operations for ZFS/sparc64.
Approved by:	core, pjd
Obtained from:	OpenSolaris (w/ adaptations)
MFC after:	2 weeks
2008-04-11 22:59:33 +00:00
Marius Strobl
20a8e8d594 - Fix the path encoded in the multiple inclusion protection.
- GCC uses 32-byte function alignment for UltraSPARC CPUs.
- Remove code duplication.

Approved by:	core, pjd
MFC after:	2 weeks
2008-04-11 22:53:06 +00:00
Doug Rabson
dfdcada31e Add the new kernel-mode NFS Lock Manager. To use it instead of the
user-mode lock manager, build a kernel with the NFSLOCKD option and
add '-k' to 'rpc_lockd_flags' in rc.conf.

Highlights include:

* Thread-safe kernel RPC client - many threads can use the same RPC
  client handle safely with replies being de-multiplexed at the socket
  upcall (typically driven directly by the NIC interrupt) and handed
  off to whichever thread matches the reply. For UDP sockets, many RPC
  clients can share the same socket. This allows the use of a single
  privileged UDP port number to talk to an arbitrary number of remote
  hosts.

* Single-threaded kernel RPC server. Adding support for multi-threaded
  server would be relatively straightforward and would follow
  approximately the Solaris KPI. A single thread should be sufficient
  for the NLM since it should rarely block in normal operation.

* Kernel mode NLM server supporting cancel requests and granted
  callbacks. I've tested the NLM server reasonably extensively - it
  passes both my own tests and the NFS Connectathon locking tests
  running on Solaris, Mac OS X and Ubuntu Linux.

* Userland NLM client supported. While the NLM server doesn't have
  support for the local NFS client's locking needs, it does have to
  field async replies and granted callbacks from remote NLMs that the
  local client has contacted. We relay these replies to the userland
  rpc.lockd over a local domain RPC socket.

* Robust deadlock detection for the local lock manager. In particular
  it will detect deadlocks caused by a lock request that covers more
  than one blocking request. As required by the NLM protocol, all
  deadlock detection happens synchronously - a user is guaranteed that
  if a lock request isn't rejected immediately, the lock will
  eventually be granted. The old system allowed for a 'deferred
  deadlock' condition where a blocked lock request could wake up and
  find that some other deadlock-causing lock owner had beaten them to
  the lock.

* Since both local and remote locks are managed by the same kernel
  locking code, local and remote processes can safely use file locks
  for mutual exclusion. Local processes have no fairness advantage
  compared to remote processes when contending to lock a region that
  has just been unlocked - the local lock manager enforces a strict
  first-come first-served model for both local and remote lockers.

Sponsored by:	Isilon Systems
PR:		95247 107555 115524 116679
MFC after:	2 weeks
2008-03-26 15:23:12 +00:00
Robert Watson
237fdd787b In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation.  This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after:	1 month
Discussed with:	imp, rink
2008-03-16 10:58:09 +00:00
Pawel Jakub Dawidek
2b1c6615bc Fix mmap(2) on ZFS after some changes in VM subsystem.
Submitted by:	alc
Reported by:	kris (originally) and many others
Tested with:	fsx
MFC after:	1 week
2008-03-15 23:23:04 +00:00
Attilio Rao
81c794f998 Axe the 'thread' argument from VOP_ISLOCKED() and lockstatus() as it is
always curthread.

As KPI gets broken by this patch, manpages and __FreeBSD_version will be
updated by further commits.

Tested by:	Andrea Barberio <insomniac at slackware dot it>
2008-02-25 18:45:57 +00:00
Attilio Rao
628f51d275 Introduce some functions in the vnode locks namespace and in the ffs
namespace in order to handle lockmgr fields in a controlled way instead
than spreading all around bogus stubs:
- VN_LOCK_AREC() allows lock recursion for a specified vnode
- VN_LOCK_ASHARE() allows lock sharing for a specified vnode

In FFS land:
- BUF_AREC() allows lock recursion for a specified buffer lock
- BUF_NOREC() disallows recursion for a specified buffer lock

Side note: union_subr.c::unionfs_node_update() is the only other function
directly handling lockmgr fields. As this is not simple to fix, it has
been left behind as "sole" exception.
2008-02-24 16:38:58 +00:00
Pawel Jakub Dawidek
79bc018dd7 - Reduce how much ZFS caches by default. This is another change to mitigate
'kmem_map too small panics'.
- Print two warnings if there is not enough memory and not enough address
  space.
- Improve comment.
2008-01-24 11:24:16 +00:00
Pawel Jakub Dawidek
44ce1efd91 Change type of kmem_used() and kmem_size() functions to uint64_t, so it
doesn't overflow in arc.c in this check:

	if (kmem_used() > (kmem_size() * 4) / 5)
		return (1);

With this bug ZFS almost doesn't cache.

Only 32bit machines are affected that have vm.kmem_size set to values >=1GB.

Reported by:	David Taylor <davidt@yadt.co.uk>
2008-01-24 11:21:54 +00:00
Attilio Rao
22db15c06f VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
2008-01-13 14:44:15 +00:00
Attilio Rao
cb05b60a89 vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by:	Diego Sardina <siarodx at gmail dot com>,
		Andrea Di Pasquale <whyx dot it at gmail dot com>
2008-01-10 01:10:58 +00:00
John Birrell
35a04710d7 Remove some compatibility stuff that we now get from the Solaris header. 2007-11-29 00:15:08 +00:00
John Birrell
b468fe2bce * Check endianness the FreeBSD way.
* Use LBOLT rather than lbolt to avoid a clash with a FreeBSD global
  variable.
2007-11-28 22:16:00 +00:00
John Birrell
9587fed572 Fix a prototype definition. 2007-11-28 22:13:28 +00:00
John Birrell
da9085a1c0 Check endianness the FreeBSD way. 2007-11-28 22:12:21 +00:00
John Birrell
47b288c152 Include an extra header to get this to compile cleanly. 2007-11-28 22:11:39 +00:00
John Birrell
57438287ab Add more OpenSolaris compatibility headers. 2007-11-28 21:50:40 +00:00
John Birrell
eca148b637 Remove an extern that is defined elsewhere. 2007-11-28 21:50:05 +00:00
John Birrell
edadde229a Add compatibility cruft moved from under _SOLARIS_C_SOURCE in sys/types.h 2007-11-28 21:49:16 +00:00
John Birrell
35ba7f225f Remove a typedef which was just a hack to avoid including vmem.h.
That typedef breaks other Solaris code.
2007-11-28 21:48:25 +00:00
John Birrell
773f4e3849 Add a missing volatile so that the code compiles cleanly. 2007-11-28 21:47:09 +00:00
John Birrell
4fc8feafc7 Rename the definition of lbolt to LBOLT to avoid a clash with a global
variable in FreeBSD. Until now lbolt in sys/proc.h has been #ifdef'ed
out based on _SOLARIS_C_SOURCE, but that is going away now.
2007-11-28 21:44:17 +00:00
Pawel Jakub Dawidek
4d4daf5901 Warn if kmem_map size is set to less than 512MB. Previous warning was a bit
pointless, because default is set to something around 300MB and also
insufficient.

MFC after:	3 days
2007-11-07 14:44:31 +00:00
Pawel Jakub Dawidek
232a80f675 Remove unused header.
MFC after:	3 days
2007-11-05 22:18:34 +00:00
Pawel Jakub Dawidek
a33b7a8f5f If setting a state to anything but open state, close access to vdev.
This fixes replacing drive in place, eg. zpool replace tank da1 da1.
Before it complained that device is already open.

MFC after:	1 week
2007-11-05 21:30:48 +00:00
Pawel Jakub Dawidek
171eb887e9 Remove "zfs:" prefix from lock and condvar names and also skip non-letter
characters (mostly "&"). Because top(1) shows only first six characters of
wait channel, without this change we saw only one meaningful character.

Requested by:	kris & others
MFC after:	1 week
2007-11-05 18:40:55 +00:00
Ulf Lilleengen
6509baf851 - Add sysctl for sizeof(znode_t), which will be used by fstat(1).
Approved by:	pjd (mentor)
2007-11-02 00:35:05 +00:00
Pawel Jakub Dawidek
ef2d58b58f Call zil_commit() (if ZIL is not disabled) after every non-read request
(BIO_WRITE and BIO_FLUSH) as it is done is Solaris. The difference is
that Solaris calls it only for sync requests, but we can't say in GEOM
is the request is sync or async, so we do it for every request.

MFC after:	1 week
2007-11-01 11:04:21 +00:00
Pawel Jakub Dawidek
4f2398ea17 - Move crfree() outside MNT_ILOCK()/MNT_IUNLOCK() to eliminate a LOR:
1st 0xc4cea568 struct mount mtx (struct mount mtx) @ /usr/src/sys/modules/zfs/../../compat/opensolaris/kern/opensolaris_vfs.c:209
  2nd 0xc3ee9010 sleep mtxpool (sleep mtxpool) @ /usr/src/sys/kern/kern_resource.c:1266
- Move crdup() outside MNT_ILOCK()/MNT_IUNLOCK(), as it can sleep.

Reported by:	Olli Hauer <ohauer@gmx.de>
MFC after:	3 days
2007-11-01 08:58:29 +00:00
Julian Elischer
3745c395ec Rename the kthread_xxx (e.g. kthread_create()) calls
to kproc_xxx as they actually make whole processes.
Thos makes way for us to add REAL kthread_create() and friends
that actually make theads. it turns out that most of these
calls actually end up being moved back to the thread version
when it's added. but we need to make this cosmetic change first.

I'd LOVE to do this rename in 7.0  so that we can eventually MFC the
new kthread_xxx() calls.
2007-10-20 23:23:23 +00:00
Andrew Thompson
1fe1be1535 ZFS_LOG adds a newline by itself.
Pointed out by:	pjd
2007-10-14 16:14:32 +00:00
Andrew Thompson
9528621759 Print the ZFS ereport to the console if vfs.zfs.debug is set to help diagnose
problems with zfs-on-root since devd isnt running yet.

Reviewed by:	pjd
2007-10-14 07:58:50 +00:00
Pawel Jakub Dawidek
e8bd23b460 Fix lock leak leading to the 'System call <name> returning with 1 locks held'
panic.

Reported by:	kris
Approved by:	re (kensmith)
2007-10-04 17:51:59 +00:00
Pawel Jakub Dawidek
a95a61fc19 Now that we have CDDLed code in the tree, add CDDL license.
Discussed with:	core
Approved by:	re (kensmith)
2007-09-23 07:04:50 +00:00
Pawel Jakub Dawidek
a3c8c2e60f Reduce the limit of vnodes on i386 when ZFS is loaded to 3/4 of the original
value, so we don't run out of KVA. The default vnodes limit fits better for
UFS, but ZFS allocated more file system specific memory for a vnode than UFS.

Don't touch vnodes limit if we detect it was tuned by system administrator
and restore original value when ZFS is unloaded.

This isn't final fix, but before we implement something better, this will
help to stabilize ZFS under heavy load on i386.

Approved by:	re (bmah)
2007-09-10 19:58:14 +00:00
Pawel Jakub Dawidek
ef0ffc1c6f After dfr@ vnode leak fix, we can allow ARC to consume more memory.
Tested by:	kris
Approved by:	re (bmah)
2007-09-10 18:12:27 +00:00
Pawel Jakub Dawidek
6bc581fcf0 Use CTLFLAG_RDTUN for tunable sysctls.
Approved by:	re (bmah)
2007-09-01 06:23:42 +00:00
Pawel Jakub Dawidek
70eaa4219c Some ZFS threads needs stack larger than the default 8kB, so use 16kB of
alternate stack if the default is smaller than 16kB.

Approved by:	re (rwatson)
2007-08-16 20:33:20 +00:00
Pawel Jakub Dawidek
aa222db26f Update assertion after revision 1.23.
Reviewed by:	dfr
Approved by:	re (rwatson)
2007-07-24 15:00:43 +00:00
Doug Rabson
2dc26b36c8 Correct a reference-counting mistake in the ZFS code which led to abnormal
memory usage and pessimal cache performance.

Reviewed by: pjd
Approved by: re (rwatson)
2007-07-09 09:03:49 +00:00
Doug Rabson
7761242694 In zfs_vget, if we fail to translate an inode number to the corresponding
vnode, make sure we return an error code to the caller.

Reviewed by: pjd
Approved by: re
2007-06-27 12:00:24 +00:00
Robert Watson
32f9753cfb Eliminate now-unused SUSER_ALLOWJAIL arguments to priv_check_cred(); in
some cases, move to priv_check() if it was an operation on a thread and
no other flags were present.

Eliminate caller-side jail exception checking (also now-unused); jail
privilege exception code now goes solely in kern_jail.c.

We can't yet eliminate suser() due to some cases in the KAME code where
a privilege check is performed and then used in many different deferred
paths.  Do, however, move those prototypes to priv.h.

Reviewed by:	csjp
Obtained from:	TrustedBSD Project
2007-06-12 00:12:01 +00:00
Marcel Moolenaar
6d63683c41 Add my copyright.
Requested by: pjd@
2007-06-08 16:20:03 +00:00
Pawel Jakub Dawidek
3b7917d766 - Reduce number of atomic operations needed to be implemented in asm by
implementing some of them using existing ones.
- Allow to compile ZFS on all archs and use atomic operations surrounded
  by global mutex on archs we don't have or can't have all atomic
  operations needed by ZFS.
2007-06-08 12:35:47 +00:00
Pawel Jakub Dawidek
083c4dd695 Missing atomic operations for ZFS/ia64.
Submitted by:	marcel
2007-06-08 12:26:30 +00:00