Commit Graph

234 Commits

Author SHA1 Message Date
Konstantin Belousov
a18a95db4a O_NOFOLLOW shall be in flags, not in cmode.
Noted by:	bde
2009-06-22 10:08:48 +00:00
Konstantin Belousov
e0c161b89c Add another flags argument to vn_open_cred. Use it to specify that some
vn_open_cred invocations shall not audit namei path.

In particular, specify VN_OPEN_NOAUDIT for dotdot lookup performed by
default implementation of vop_vptocnp, and for the open done for core
file. vn_fullpath is called from the audit code, and vn_open there need
to disable audit to avoid infinite recursion. Core file is created on
return to user mode, that, in particular, happens during syscall return.
The creation of the core file is audited by direct calls, and we do not
want to overwrite audit information for syscall.

Reported, reviewed and tested by: rwatson
2009-06-21 13:41:32 +00:00
Jamie Gritton
c1f192193d Rename the host-related prison fields to be the same as the host.*
parameters they represent, and the variables they replaced, instead of
abbreviated versions of them.

Approved by:	bz (mentor)
2009-06-13 15:39:12 +00:00
Kip Macy
f0c6b798a3 pjd has requested that I keep the tunable as zfs_prefetch_disable to minimize gratuitous
differences with Opensolaris' ZFS

Sorry for the churn
2009-06-11 22:24:08 +00:00
Kip Macy
e4e5e663e0 check against prefetch_enable 2009-06-11 09:51:21 +00:00
Kip Macy
3fa5485637 use default policy for enabling prefetching unless the TUNABLE is set 2009-06-10 21:05:37 +00:00
Kip Macy
107b659450 As far as I can tell systems that have less than 4GB are more often hurt
by prefetched than helped.  On i386 systems and systems with less than 4GB,
prefetch is now disabled by default. I've added a prefetch enable tunable, to
enable prefetching for those systems. The prefetch disable tunable will continue
to unconditionally disable prefetching.
2009-06-10 01:21:32 +00:00
Paul Saab
a6d545d8ed Support shared vnode locks for write operations when the offset is
provided on filesystems that support it.  This really improves mysql
+ innodb performance on ZFS.

Reviewed by:	jhb, kmacy, jeffr
2009-06-04 16:18:07 +00:00
Doug Rabson
8be608b58c Allow the bootfs property to be set for raidz pools on FreeBSD.
Reviewed by:	pjd
2009-05-31 11:59:32 +00:00
Kip Macy
762169b50a fix xdrmem_control to be safe in an if statement
fix zfs to depend on krpc
remove xdr from zfs makefile

Submitted by:	dchagin@freebsd.org
2009-05-30 22:23:58 +00:00
Kip Macy
139ccddec0 work around snapshot shutdown race reported by Henri Hennebert 2009-05-30 19:26:35 +00:00
Jamie Gritton
76ca6f88da Place hostnames and similar information fully under the prison system.
The system hostname is now stored in prison0, and the global variable
"hostname" has been removed, as has the hostname_mtx mutex.  Jails may
have their own host information, or they may inherit it from the
parent/system.  The proper way to read the hostname is via
getcredhostname(), which will copy either the hostname associated with
the passed cred, or the system hostname if you pass NULL.  The system
hostname can still be accessed directly (and without locking) at
prison0.pr_host, but that should be avoided where possible.

The "similar information" referred to is domainname, hostid, and
hostuuid, which have also become prison parameters and had their
associated global variables removed.

Approved by:	bz (mentor)
2009-05-29 21:27:12 +00:00
Attilio Rao
1ae1c2a3bd Reverse the logic for ADAPTIVE_SX option and enable it by default.
Introduce for this operation the reverse NO_ADAPTIVE_SX option.
The flag SX_ADAPTIVESPIN to be passed to sx_init_flags(9) gets suppressed
and the new flag, offering the reversed logic, SX_NOADAPTIVE is added.

Additively implements adaptive spininning for sx held in shared mode.
The spinning limit can be handled through sysctls in order to be tuned
while the code doesn't reach the release, after which time they should
be dropped probabilly.

This change has made been necessary by recent benchmarks where it does
improve concurrency of workloads in presence of high contention
(ie. ZFS).

KPI breakage is documented by __FreeBSD_version bumping, manpage and
UPDATING updates.

Requested by:	jeff, kmacy
Reviewed by:	jeff
Tested by:	pho
2009-05-29 01:49:27 +00:00
Kip Macy
c334d2d544 MFdevbranch 192944
- add FreeBSD implementation of xdrmem_control needed by zfs
 - have zfs define xdr_ops using FreeBSD's definition
 - remove solaris xdr files from zfs compile
2009-05-28 08:18:12 +00:00
Stacey Son
a5aedd68b4 Add the OpenSolaris dtrace lockstat provider. The lockstat provider
adds probes for mutexes, reader/writer and shared/exclusive locks to
gather contention statistics and other locking information for
dtrace scripts, the lockstat(1M) command and other potential
consumers.

Reviewed by:	attilio jhb jb
Approved by:	gnn (mentor)
2009-05-26 20:28:22 +00:00
Edward Tomasz Napierala
b7014134a7 Change license to more bori^Wadul^Wcanonical.
Submitted by:	rwatson@
2009-05-26 11:42:06 +00:00
Edward Tomasz Napierala
0970b4bae0 MFp4 changes neccessary for NFSv4 ACLs support in ZFS. This is mostly
about removing a few #ifdefs and providing compatibility wrappers and
VOP implementations to get and set an ACL; ZFS does ACL enforcement all
by itself.

Note that the VOPs are ifdefed out for now, so this change should be
a no-op.

Reviewed by:	pjd
2009-05-26 08:21:59 +00:00
Edward Tomasz Napierala
4076aa37dc Don't allow non-owner to set SUID bit on a file. It doesn't make
any difference now, but in NFSv4 ACLs, there is write_acl permission,
which also affects mode changes.

Reviewed by:	pjd
2009-05-24 19:21:49 +00:00
Edward Tomasz Napierala
194f4d42de Fix comment. 2009-05-24 15:48:48 +00:00
Dag-Erling Smørgrav
bba5cfd28b Unexpand $FreeBSD$. 2009-05-23 16:01:58 +00:00
Dag-Erling Smørgrav
6feca53bed Remove svn:keywords on a file that had fbsd:nokeywords (though I don't
understand the reason for the latter)
2009-05-23 16:00:16 +00:00
Kip Macy
e95d34711b - back out direct map hack
- it is no longer needed
2009-05-19 01:14:37 +00:00
Kip Macy
0fe5460dbd set createtxg prop name
PR: bin/130105
2009-05-17 04:04:25 +00:00
Kip Macy
ea41c77517 SAVESTART implies SAVENAME 2009-05-17 01:31:28 +00:00
Kip Macy
2e9c90d55b enable adaptive spinning on zfs locks 2009-05-16 23:56:45 +00:00
Kip Macy
be08aa8b59 - allow forced unmounts
- don't assume snapshot was auto-mounted
2009-05-16 20:33:13 +00:00
Kip Macy
71bc1ce36e only use direct map if system has more than 2GB 2009-05-16 20:09:07 +00:00
Kip Macy
32237d8492 apply band-aid to x86_64 systems with more physical memory than kmem by allocating from the direct map 2009-05-16 19:17:15 +00:00
Doug Rabson
e1899ef6c8 Add support for booting from raidz1 and raidz2 pools. 2009-05-16 10:48:20 +00:00
Attilio Rao
dfd233edd5 Remove the thread argument from the FSD (File-System Dependent) parts of
the VFS.  Now all the VFS_* functions and relating parts don't want the
context as long as it always refers to curthread.

In some points, in particular when dealing with VOPs and functions living
in the same namespace (eg. vflush) which still need to be converted,
pass curthread explicitly in order to retain the old behaviour.
Such loose ends will be fixed ASAP.

While here fix a bug: now, UFS_EXTATTR can be compiled alone without the
UFS_EXTATTR_AUTOSTART option.

VFS KPI is heavilly changed by this commit so thirdy parts modules needs
to be recompiled.  Bump __FreeBSD_version in order to signal such
situation.
2009-05-11 15:33:26 +00:00
Kip Macy
469ef3e563 rename xdr support files to avoid conflicts when linking in to the kernel 2009-05-11 04:18:58 +00:00
Kip Macy
8569258bf8 - rename atomic.S and crc32.c to avoid collisions when linking zfs in to the kernel
- update Makefile
- ifdef out acl_{alloc, free}, they aren't used by zfs and conflict with existing in-kernel routines
2009-05-09 01:45:55 +00:00
Marko Zec
29b02909eb Introduce a new virtualization container, provisionally named vprocg, to hold
virtualized instances of hostname and domainname, as well as a new top-level
virtualization struct vimage, which holds pointers to struct vnet and struct
vprocg.  Struct vprocg is likely to become replaced in the near future with
a new jail management API import.

As a consequence of this change, change struct ucred to point to a struct
vimage, instead of directly pointing to a vnet.

Merge vnet / vimage / ucred refcounting infrastructure from p4 / vimage
branch.

Permit kldload / kldunload operations to be executed only from the default
vimage context.

This change should have no functional impact on nooptions VIMAGE kernel
builds.

Reviewed by:	bz
Approved by:	julian (mentor)
2009-05-08 14:11:06 +00:00
Kip Macy
a6827463ad don't call vn_rele_async_fini in the !_KERNEL case 2009-05-07 23:34:41 +00:00
Kip Macy
c20fd07777 move VN_RELE_ASYNC to the compatibility layer with the rest of the VN_* defines 2009-05-07 23:02:15 +00:00
Kip Macy
6ef1a81d6e avoid LOR and gratuitous extra lock acquisitions by moving user_evict list buffers to
a temporary list
2009-05-07 21:51:13 +00:00
Kip Macy
77d0162c70 Allow the VM to provide backpressure on the ARC cache as it does
on Solaris.
2009-05-07 20:57:06 +00:00
Kip Macy
62fa227ccd Asynchronously release vnodes to avoid blocking on range locks when calling back in to zfs.
This is based on a fix that went in to opensolaris on March 9th. However, it uses a dedicated
thread instead of a Solaris' taskq to avoid doing a blocking memory allocation with the vnode
interlock held.

This fixes a long-time deadlock in ZFS. This is not, strictly speaking, an LOR. The spa_zio
thread releases a vnode, this calls in to vn_reclaim which in turn needs to acquire range locks
to sync dirty data out to disk. The range locks are already held by a user-level process waiting
on a condition variable that it the process is waiting on a spa_zio thread to signal it on. The
process could not be signalled because the spa_zio thread could not proceed.

The nature of this problem was not apparent due to ZFS locks opting out of witness which meant
that DDB did not know about the locks that were held by ZFS.

Reviewed by:	pjd
MFC after:	7 days
2009-05-07 20:28:06 +00:00
Jamie Gritton
b38ff370e4 Introduce the extensible jail framework, using the same "name=value"
interface as nmount(2).  Three new system calls are added:
* jail_set, to create jails and change the parameters of existing jails.
  This replaces jail(2).
* jail_get, to read the parameters of existing jails.  This replaces the
  security.jail.list sysctl.
* jail_remove to kill off a jail's processes and remove the jail.
Most jail parameters may now be changed after creation, and jails may be
set to exist without any attached processes.  The current jail(2) system
call still exists, though it is now a stub to jail_set(2).

Approved by:	bz (mentor)
2009-04-29 21:14:15 +00:00
Robert Watson
885868cd8f Remove VOP_LEASE and supporting functions. This hasn't been used since
the removal of NQNFS, but was left in in case it was required for NFSv4.
Since our new NFSv4 client and server can't use it for their
requirements, GC the old mechanism, as well as other unused lease-
related code and interfaces.

Due to its impact on kernel programming and binary interfaces, this
change should not be MFC'd.

Proposed by:    jeff
Reviewed by:    jeff
Discussed with: rmacklem, zach loafman @ isilon
2009-04-10 10:52:19 +00:00
Andrew Thompson
853a10a581 Revert r190676,190677
The geom and CAM changes for root_hold are the wrong solution for USB design
quirks.

Requested by:	scottl
2009-04-10 04:08:34 +00:00
Andrew Thompson
626fc9fe3d Add a how argument to root_mount_hold() so it can be passed NOWAIT and be called
in situations where sleeping isnt allowed.
2009-04-03 19:46:12 +00:00
Robert Watson
455f3aa24f Move dtnfsclient.c in the cddl tree to nfs_kdtrace.c in the nfsclient
directory, since it's under a BSD license, and this keeps NFS internals-
aware tracing parts close to NFS.

MFC after:	1 month
Suggested by:	jhb
2009-03-25 17:47:22 +00:00
Robert Watson
10263f0832 Add DTrace probes to the NFS access and attribute caches. Access cache
events are:

  nfsclient:accesscache:flush:done
  nfsclient:accesscache:get:hit
  nfsclient:accesscache:get:miss
  nfsclient:accesscache:load:done

They pass the vnode, uid, and requested or loaded access mode (if any);
the load event may also report a load error if the RPC fails.

The attribute cache events are:

  nfsclient:attrcache:flush:done
  nfsclient:attrcache:get:hit
  nfsclient:attrcache:get:miss
  nfsclient:attrcache:load:done

They pass the vnode, optionally the vattr if one is present (hit or load),
and in the case of a load event, also a possible RPC error.

MFC after:	1 month
Sponsored by:	Google, Inc.
2009-03-24 17:14:34 +00:00
Robert Watson
47294818f9 Add dtnfsclient, a first cut at an NFSv2/v3 client reuest DTrace
provider.  The NFS client exposes 'start' and 'done' probes for NFSv2
and NFSv3 RPCs when using the new RPC implementation, passing in the
vnode, mbuf chain, credential, and NFSv2 or NFSv3 procedure number.
For 'done' probes, the error number is also available.

Probes are named in the following way:

  ...
  nfsclient:nfs2:write:start
  nfsclient:nfs2:write:done
  ...
  nfsclient:nfs3:access:start
  nfsclient:nfs3:access:done
  ...

Access to the unmarshalled arguments is not easily available at this
point in the stack, but the passed probe arguments are sufficient to
to a lot of interesting things in practice.  Technically, these probes
may cover multiple RPC retransmits, and even transactions if the
transaction ID change as a result of authentication failure or a
jukebox error from the server, but usefully capture the intent of a
single NFS request, such as access, getattr, write, etc.

Typical use might involve profiling RPC latency by system call, number
of RPCs, how often a getattr leads to a call to access, when failed
access control checks occur, etc.  More detailed RPC information might
best be provided by adding a krpc provider.  It would also be useful
to add NFS client probes for events such as the access cache or
attribute cache satisfying requests without an RPC.

Sponsored by:	Google, Inc.
MFC after:	1 month
2009-03-22 22:07:52 +00:00
John Baldwin
9fca7a854c The zfs_get_xattrdir() function is used to find the extended attribute
directory for a znode.  When the directory already exists, it returns a
referenced but unlocked vnode.  When a directory does not yet exist, it
calls zfs_make_xattrdir() to create a new one.  zfs_make_xattrdir() returns
the vnode both referenced and and locked and zfs_get_xattrdir() was leaking
this vnode lock to its callers.  Fix this by dropping the vnode lock if
zfs_make_xattrdir() successfully creates a new extended attribute
directory.

Reviewed by:	pjd
2009-03-18 16:19:44 +00:00
John Baldwin
33fc362512 Add a new internal mount flag (MNTK_EXTENDED_SHARED) to indicate that a
filesystem supports additional operations using shared vnode locks.
Currently this is used to enable shared locks for open() and close() of
read-only file descriptors.
- When an ISOPEN namei() request is performed with LOCKSHARED, use a
  shared vnode lock for the leaf vnode only if the mount point has the
  extended shared flag set.
- Set LOCKSHARED in vn_open_cred() for requests that specify O_RDONLY but
  not O_CREAT.
- Use a shared vnode lock around VOP_CLOSE() if the file was opened with
  O_RDONLY and the mountpoint has the extended shared flag set.
- Adjust md(4) to upgrade the vnode lock on the vnode it gets back from
  vn_open() since it now may only have a shared vnode lock.
- Don't enable shared vnode locks on FIFO vnodes in ZFS and UFS since
  FIFO's require exclusive vnode locks for their open() and close()
  routines.  (My recent MPSAFE patches for UDF and cd9660 already included
  this change.)
- Enable extended shared operations on UFS, cd9660, and UDF.

Submitted by:	ups
Reviewed by:	pjd (ZFS bits)
MFC after:	1 month
2009-03-11 14:13:47 +00:00
Jamie Gritton
f86bce5ed0 Extend the "vfsopt" mount options for more general use. Make struct
vfsopt and the vfs_buildopts function public, and add some new fields
to struct vfsopt (pos and seen), and new functions vfs_getopt_pos and
vfs_opterror.

Further extend the interface to allow reading options from the kernel
in addition to sending them to the kernel, with vfs_setopt and related
functions.

While this allows the "name=value" option interface to be used for more
than just FS mounts (planned use is for jails), it retains the current
"vfsopt" name and <sys/mount.h> requirement.

Approved by:	bz (mentor)
2009-03-02 23:26:30 +00:00
Ed Schouten
802cb57e34 Add memmove() to the kernel, making the kernel compile with Clang.
When copying big structures, LLVM generates calls to memmove(), because
it may not be able to figure out whether structures overlap. This caused
linker errors to occur. memmove() is now implemented using bcopy().
Ideally it would be the other way around, but that can be solved in the
future. On ARM we don't do add anything, because it already has
memmove().

Discussed on:	arch@
Reviewed by:	rdivacky
2009-02-28 16:21:25 +00:00
John Baldwin
ea77ff0a15 Use shared vnode locks when invoking VOP_READDIR().
MFC after:	1 month
2009-02-13 18:18:14 +00:00