Commit Graph

3084 Commits

Author SHA1 Message Date
Dag-Erling Smørgrav
c93c82f464 Fix a bug that allows NFS clients to issue READDIR on files.
PR:		kern/178016
Security:	CVE-2013-3266
Security:	FreeBSD-SA-13:05.nfsserver
2013-04-29 20:09:44 +00:00
Konstantin Belousov
6f2af3fcf3 Rework the handling of the tmpfs node backing swap object and tmpfs
vnode v_object to avoid double-buffering.  Use the same object both as
the backing store for tmpfs node and as the v_object.

Besides reducing memory use up to 2x times for situation of mapping
files from tmpfs, it also makes tmpfs read and write operations copy
twice bytes less.

VM subsystem was already slightly adapted to tolerate OBJT_SWAP object
as v_object. Now the vm_object_deallocate() is modified to not
reinstantiate OBJ_ONEMAPPING flag and help the VFS to correctly handle
VV_TEXT flag on the last dereference of the tmpfs backing object.

Reviewed by:	alc
Tested by:	pho, bf
MFC after:	1 month
2013-04-28 19:38:59 +00:00
Rick Macklem
64a0e848ab When an NFS unmount occurs, once vflush() writes the last dirty
buffer for the last vnode on the mount back to the server, it
returns. At that point, the code continues with the unmount,
including freeing up the nfs specific part of the mount structure.
It is possible that an nfsiod thread will try to check for an
empty I/O queue in the nfs specific part of the mount structure
after it has been free'd by the unmount. This patch avoids this problem by
setting the iodmount entries for the mount back to NULL while holding the
mutex in the unmount and checking the appropriate entry is non-NULL after
acquiring the mutex in the nfsiod thread.

Reported and tested by:	pho
Reviewed by:	kib
MFC after:	2 weeks
2013-04-18 23:20:16 +00:00
Rick Macklem
175b3f31d3 Both NFS clients can deadlock when using the "rdirplus" mount
option. This can occur when an nfsiod thread that already holds
a buffer lock attempts to acquire a vnode lock on an entry in
the directory (a LOR) when another thread holding the vnode lock
is waiting on an nfsiod thread. This patch avoids the deadlock by disabling
readahead for this case, so the nfsiod threads never do readdirplus.
Since readaheads for directories need the directory offset cookie
from the previous read, they cannot normally happen in parallel.
As such, testing by jhb@ and myself didn't find any performance
degredation when this patch is applied. If there is a case where
this results in a significant performance degradation, mounting
without the "rdirplus" option can be done to re-enable readahead
for directories.

Reported and tested by:	jhb
Reviewed by:	jhb
MFC after:	2 weeks
2013-04-18 13:09:04 +00:00
Kenneth D. Merry
adb974068b Move the NFS FHA (File Handle Affinity) code from sys/nfsserver to
sys/nfs, since it is now shared by the two NFS servers.

Suggested by:	rmacklem
Sponsored by:	Spectra Logic
MFC after:	2 weeks
2013-04-17 22:42:43 +00:00
Kenneth D. Merry
d96b98a360 Revamp the old NFS server's File Handle Affinity (FHA) code so that
it will work with either the old or new server.

The FHA code keeps a cache of currently active file handles for
NFSv2 and v3 requests, so that read and write requests for the same
file are directed to the same group of threads (reads) or thread
(writes).  It does not currently work for NFSv4 requests.  They are
more complex, and will take more work to support.

This improves read-ahead performance, especially with ZFS, if the
FHA tuning parameters are configured appropriately.  Without the
FHA code, concurrent reads that are part of a sequential read from
a file will be directed to separate NFS threads.  This has the
effect of confusing the ZFS zfetch (prefetch) code and makes
sequential reads significantly slower with clients like Linux that
do a lot of prefetching.

The FHA code has also been updated to direct write requests to nearby
file offsets to the same thread in the same way it batches reads,
and the FHA code will now also send writes to multiple threads when
needed.

This improves sequential write performance in ZFS, because writes
to a file are now more ordered.  Since NFS writes (generally
less than 64K) are smaller than the typical ZFS record size
(usually 128K), out of order NFS writes to the same block can
trigger a read in ZFS.  Sending them down the same thread increases
the odds of their being in order.

In order for multiple write threads per file in the FHA code to be
useful, writes in the NFS server have been changed to use a LK_SHARED
vnode lock, and upgrade that to LK_EXCLUSIVE if the filesystem
doesn't allow multiple writers to a file at once.  ZFS is currently
the only filesystem that allows multiple writers to a file, because
it has internal file range locking.  This change does not affect the
NFSv4 code.

This improves random write performance to a single file in ZFS, since
we can now have multiple writers inside ZFS at one time.

I have changed the default tuning parameters to a 22 bit (4MB)
window size (from 256K) and unlimited commands per thread as a
result of my benchmarking with ZFS.

The FHA code has been updated to allow configuring the tuning
parameters from loader tunable variables in addition to sysctl
variables.  The read offset window calculation has been slightly
modified as well.  Instead of having separate bins, each file
handle has a rolling window of bin_shift size.  This minimizes
glitches in throughput when shifting from one bin to another.

sys/conf/files:
	Add nfs_fha_new.c and nfs_fha_old.c.  Compile nfs_fha.c
	when either the old or the new NFS server is built.

sys/fs/nfs/nfsport.h,
sys/fs/nfs/nfs_commonport.c:
	Bring in changes from Rick Macklem to newnfs_realign that
	allow it to operate in blocking (M_WAITOK) or non-blocking
	(M_NOWAIT) mode.

sys/fs/nfs/nfs_commonsubs.c,
sys/fs/nfs/nfs_var.h:
	Bring in a change from Rick Macklem to allow telling
	nfsm_dissect() whether or not to wait for mallocs.

sys/fs/nfs/nfsm_subs.h:
	Bring in changes from Rick Macklem to create a new
	nfsm_dissect_nonblock() inline function and
	NFSM_DISSECT_NONBLOCK() macro.

sys/fs/nfs/nfs_commonkrpc.c,
sys/fs/nfsclient/nfs_clkrpc.c:
	Add the malloc wait flag to a newnfs_realign() call.

sys/fs/nfsserver/nfs_nfsdkrpc.c:
	Setup the new NFS server's RPC thread pool so that it will
	call the FHA code.

	Add the malloc flag argument to newnfs_realign().

	Unstaticize newnfs_nfsv3_procid[] so that we can use it in
	the FHA code.

sys/fs/nfsserver/nfs_nfsdsocket.c:
	In nfsrvd_dorpc(), add NFSPROC_WRITE to the list of RPC types
	that use the LK_SHARED lock type.

sys/fs/nfsserver/nfs_nfsdport.c:
	In nfsd_fhtovp(), if we're starting a write, check to see
	whether the underlying filesystem supports shared writes.
	If not, upgrade the lock type from LK_SHARED to LK_EXCLUSIVE.

sys/nfsserver/nfs_fha.c:
	Remove all code that is specific to the NFS server
	implementation.  Anything that is server-specific is now
	accessed through a callback supplied by that server's FHA
	shim in the new softc.

	There are now separate sysctls and tunables for the FHA
	implementations for the old and new NFS servers.  The new
	NFS server has its tunables under vfs.nfsd.fha, the old
	NFS server's tunables are under vfs.nfsrv.fha as before.

	In fha_extract_info(), use callouts for all server-specific
	code.  Getting file handles and offsets is now done in the
	individual server's shim module.

	In fha_hash_entry_choose_thread(), change the way we decide
	whether two reads are in proximity to each other.
	Previously, the calculation was a simple shift operation to
	see whether the offsets were in the same power of 2 bucket.
	The issue was that there would be a bucket (and therefore
	thread) transition, even if the reads were in close
	proximity.  When there is a thread transition, reads wind
	up going somewhat out of order, and ZFS gets confused.

	The new calculation simply tries to see whether the offsets
	are within 1 << bin_shift of each other.  If they are, the
	reads will be sent to the same thread.

	The effect of this change is that for sequential reads, if
	the client doesn't exceed the max_reqs_per_nfsd parameter
	and the bin_shift is set to a reasonable value (22, or
	4MB works well in my tests), the reads in any sequential
	stream will largely be confined to a single thread.

	Change fha_assign() so that it takes a softc argument.  It
	is now called from the individual server's shim code, which
	will pass in the softc.

	Change fhe_stats_sysctl() so that it takes a softc
	parameter.  It is now called from the individual server's
	shim code.  Add the current offset to the list of things
	printed out about each active thread.

	Change the num_reads and num_writes counters in the
	fha_hash_entry structure to 32-bit values, and rename them
	num_rw and num_exclusive, respectively, to reflect their
	changed usage.

	Add an enable sysctl and tunable that allows the user to
	disable the FHA code (when vfs.XXX.fha.enable = 0).  This
	is useful for before/after performance comparisons.

nfs_fha.h:
	Move most structure definitions out of nfs_fha.c and into
	the header file, so that the individual server shims can
	see them.

	Change the default bin_shift to 22 (4MB) instead of 18
	(256K).  Allow unlimited commands per thread.

sys/nfsserver/nfs_fha_old.c,
sys/nfsserver/nfs_fha_old.h,
sys/fs/nfsserver/nfs_fha_new.c,
sys/fs/nfsserver/nfs_fha_new.h:
	Add shims for the old and new NFS servers to interface with
	the FHA code, and callbacks for the

	The shims contain all of the code and definitions that are
	specific to the NFS servers.

	They setup the server-specific callbacks and set the server
	name for the sysctl and loader tunable variables.

sys/nfsserver/nfs_srvkrpc.c:
	Configure the RPC code to call fhaold_assign() instead of
	fha_assign().

sys/modules/nfsd/Makefile:
	Add nfs_fha.c and nfs_fha_new.c.

sys/modules/nfsserver/Makefile:
	Add nfs_fha_old.c.

Reviewed by:	rmacklem
Sponsored by:	Spectra Logic
MFC after:	2 weeks
2013-04-17 21:00:22 +00:00
Gabor Kovesdan
a8b5c2a0aa - Correct spelling in comments
Submitted by:	Christoph Mallon <christoph.mallon@gmx.de> (via private mail)
2013-04-17 11:56:11 +00:00
Gabor Kovesdan
a2098fea6d - Correct mispellings of the word necessary
Submitted by:	Christoph Mallon <christoph.mallon@gmx.de> (via private mail)
2013-04-17 11:42:40 +00:00
Jeff Roberson
26089666b6 Prepare to replace the buf splay with a trie:
- Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists.
   No consumers need to find them there and it complicates the tree.
   These flags are all FFS specific and could be moved out of the buf
   cache.
 - Use pbgetvp() and pbrelvp() to associate the background and journal
   bufs with the vp.  Not only is this much cheaper it makes more sense
   for these transient bufs.
 - Fix the assertions in pbget* and pbrel*.  It's not safe to check list
   pointers which were never initialized.  Use the BX flags instead.  We
   also check B_PAGING in reassignbuf() so this should cover all cases.

Discussed with:	kib, mckusick, attilio
Sponsored by:	EMC / Isilon Storage Division
2013-04-06 22:21:23 +00:00
Konstantin Belousov
159a400eb6 Strip the unnneeded spaces, mostly at the end of lines.
MFC after:	3 days
2013-04-01 09:56:48 +00:00
Pawel Jakub Dawidek
051a23d4e8 - Constify local path variable for chflagsat().
- Use correct format characters (%lx) for u_long.

This fixes the build broken in r248599.
2013-03-22 07:40:34 +00:00
Pawel Jakub Dawidek
b4b2596b97 - Make 'flags' argument to chflags(2), fchflags(2) and lchflags(2) of type
u_long. Before this change it was of type int for syscalls, but prototypes
  in sys/stat.h and documentation for chflags(2) and fchflags(2) (but not
  for lchflags(2)) stated that it was u_long. Now some related functions
  use u_long type for flags (strtofflags(3), fflagstostr(3)).
- Make path argument of type 'const char *' for consistency.

Discussed on:	arch
Sponsored by:	The FreeBSD Foundation
2013-03-21 22:44:33 +00:00
Konstantin Belousov
4d569af96c Initialize the variable to avoid (false) compiler warning about
use of an uninitialized local.

Reported by:	Ivan Klymenko <fidaj@ukr.net>
MFC after:	2 weeks
2013-03-21 12:59:24 +00:00
Konstantin Belousov
7157d8f7ab Do not call vnode_pager_setsize() while a NFS node mutex is
locked. vnode_pager_setsize() might sleep waiting for the page after
EOF be unbusied.

Call vnode_pager_setsize() both for the regular and directory vnodes.

Reported by:	mich
Reviewed by:	rmacklem
Discussed with:	avg, jhb
MFC after:	2 weeks
2013-03-21 07:25:08 +00:00
Ed Maste
96ecfd9813 Fix remainder calculation when biosize is not a power of 2
In common configurations biosize is a power of two, but is not required to
be so.  Thanks to markj@ for spotting an additional case beyond my original
patch.

Reviewed by: rmacklem@
2013-03-19 13:06:11 +00:00
Konstantin Belousov
0d3bb4afa8 Remove negative name cache entry pointing to the target name, which
could be instantiated while tdvp was unlocked.

Reported by:	Rick Miller <vmiller at hostileadmin com>
Tested by:	pho
MFC after:	1 week
2013-03-17 15:11:37 +00:00
Konstantin Belousov
c535690b33 Add currently unused flag argument to the cluster_read(),
cluster_write() and cluster_wbuild() functions.  The flags to be
allowed are a subset of the GB_* flags for getblk().

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
2013-03-14 20:28:26 +00:00
John Baldwin
3b14c753ff Revert 195703 and 195821 as this special stop handling in NFS is now
implemented via VFCF_SBDRY rather than passing PBDRY to individual
sleep calls.
2013-03-13 21:06:03 +00:00
Gleb Smirnoff
8634e3199c Finish r243882: mechanically substitute flags from historic mbuf
allocator with malloc(9) flags within sys.

Sponsored by:	Nginx, Inc.
2013-03-12 08:59:51 +00:00
Davide Italiano
ce589ae2a7 smbfs_lookup() in the DOTDOT case operates on dvp->n_parent without
proper locking. This doesn't prevent in any case reclaim of the vnode.
Avoid this not going over-the-wire in this case and relying on subsequent
smbfs_getattr() call to restore consistency.
While I'm here, change a couple of SMBVDEBUG() in MPASS().
sbmfs_smb_lookup() doesn't and shouldn't know about '.' and '..'

Reported by:	pho's stress2 suite
2013-03-09 13:25:45 +00:00
Davide Italiano
e346bd8177 - Initialize variable in smbfs_rename() to silent compiler warning
- Fix smbfs_mkdir() return value (in case of error).

Reported by:	pho
2013-03-09 13:05:21 +00:00
Attilio Rao
a92189b377 Garbage collect NWFS and NCP bits which are now completely disconnected
from the tree since few months.

This patch is not targeted for MFC.
2013-03-09 12:45:36 +00:00
Attilio Rao
89f6b8632c Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
  - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
  - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
  - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
  - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
    (in order to avoid visibility of implementation details)
  - The read-mode operations are added:
    VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
    VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
  sys/mutex.h in consumers directly to cater its inlining functions
  using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
  consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
  the compat layer because the name clash between FreeBSD and solaris
  versions must be avoided.
  At this purpose zfs redefines the vm_object locking functions
  directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit.  Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff
Reviewed by:	pjd (ZFS specific review)
Discussed with:	alc
Tested by:	pho
2013-03-09 02:32:23 +00:00
Attilio Rao
737a61a1ee Garbage collect NTFS bits which are now completely disconnected from
the tree since few months.

This patch is not targeted for MFC.
2013-03-02 18:40:04 +00:00
Attilio Rao
4eb0218ace Garbage collect PORTALFS bits which are now completely disconnected from
the tree since few months.

This patch is not targeted for MFC.
2013-03-02 16:43:28 +00:00
Attilio Rao
f51fb78533 Garbage collect CODAFS bits which are now completely disconnected from
the tree since few months.

This patch is not targeted for MFC.
2013-03-02 16:30:18 +00:00
Attilio Rao
258bee160c Garbage collect HPFS bits which are now already completely disconnected
from the tree since few months (please note that the userland bits
were already disconnected since a long time, thus there is no need
to update the OLD* entries).

This is not targeted for MFC.
2013-03-02 14:54:33 +00:00
Jilles Tjoelker
6d6a91c50f nullfs: Improve f_flags in statfs().
Include some flags of the nullfs mount itself:
MNT_RDONLY, MNT_NOEXEC, MNT_NOSUID, MNT_UNION, MNT_NOSYMFOLLOW.

This allows userland code calling statfs() or fstatfs() to see these flags.
In particular, this allows opendir() to detect that a -t nullfs -o union
mount needs deduplication (otherwise at least . and .. are returned twice)
and allows rtld to detect a -t nullfs -o noexec mount as noexec.

Turn off the MNT_ROOTFS flag from the underlying filesystem because the
nullfs mount is definitely not the root filesystem.

Reviewed by:	kib
MFC after:	1 week
2013-03-02 12:42:23 +00:00
Pawel Jakub Dawidek
2609222ab4 Merge Capsicum overhaul:
- Capability is no longer separate descriptor type. Now every descriptor
  has set of its own capability rights.

- The cap_new(2) system call is left, but it is no longer documented and
  should not be used in new code.

- The new syscall cap_rights_limit(2) should be used instead of
  cap_new(2), which limits capability rights of the given descriptor
  without creating a new one.

- The cap_getrights(2) syscall is renamed to cap_rights_get(2).

- If CAP_IOCTL capability right is present we can further reduce allowed
  ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
  ioctls can be retrived with cap_ioctls_get(2) syscall.

- If CAP_FCNTL capability right is present we can further reduce fcntls
  that can be used with the new cap_fcntls_limit(2) syscall and retrive
  them with cap_fcntls_get(2).

- To support ioctl and fcntl white-listing the filedesc structure was
  heavly modified.

- The audit subsystem, kdump and procstat tools were updated to
  recognize new syscalls.

- Capability rights were revised and eventhough I tried hard to provide
  backward API and ABI compatibility there are some incompatible changes
  that are described in detail below:

	CAP_CREATE old behaviour:
	- Allow for openat(2)+O_CREAT.
	- Allow for linkat(2).
	- Allow for symlinkat(2).
	CAP_CREATE new behaviour:
	- Allow for openat(2)+O_CREAT.

	Added CAP_LINKAT:
	- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
	- Allow to be target for renameat(2).

	Added CAP_SYMLINKAT:
	- Allow for symlinkat(2).

	Removed CAP_DELETE. Old behaviour:
	- Allow for unlinkat(2) when removing non-directory object.
	- Allow to be source for renameat(2).

	Removed CAP_RMDIR. Old behaviour:
	- Allow for unlinkat(2) when removing directory.

	Added CAP_RENAMEAT:
	- Required for source directory for the renameat(2) syscall.

	Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
	- Allow for unlinkat(2) on any object.
	- Required if target of renameat(2) exists and will be removed by this
	  call.

	Removed CAP_MAPEXEC.

	CAP_MMAP old behaviour:
	- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
	  PROT_WRITE.
	CAP_MMAP new behaviour:
	- Allow for mmap(2)+PROT_NONE.

	Added CAP_MMAP_R:
	- Allow for mmap(PROT_READ).
	Added CAP_MMAP_W:
	- Allow for mmap(PROT_WRITE).
	Added CAP_MMAP_X:
	- Allow for mmap(PROT_EXEC).
	Added CAP_MMAP_RW:
	- Allow for mmap(PROT_READ | PROT_WRITE).
	Added CAP_MMAP_RX:
	- Allow for mmap(PROT_READ | PROT_EXEC).
	Added CAP_MMAP_WX:
	- Allow for mmap(PROT_WRITE | PROT_EXEC).
	Added CAP_MMAP_RWX:
	- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).

	Renamed CAP_MKDIR to CAP_MKDIRAT.
	Renamed CAP_MKFIFO to CAP_MKFIFOAT.
	Renamed CAP_MKNODE to CAP_MKNODEAT.

	CAP_READ old behaviour:
	- Allow pread(2).
	- Disallow read(2), readv(2) (if there is no CAP_SEEK).
	CAP_READ new behaviour:
	- Allow read(2), readv(2).
	- Disallow pread(2) (CAP_SEEK was also required).

	CAP_WRITE old behaviour:
	- Allow pwrite(2).
	- Disallow write(2), writev(2) (if there is no CAP_SEEK).
	CAP_WRITE new behaviour:
	- Allow write(2), writev(2).
	- Disallow pwrite(2) (CAP_SEEK was also required).

	Added convinient defines:

	#define	CAP_PREAD		(CAP_SEEK | CAP_READ)
	#define	CAP_PWRITE		(CAP_SEEK | CAP_WRITE)
	#define	CAP_MMAP_R		(CAP_MMAP | CAP_SEEK | CAP_READ)
	#define	CAP_MMAP_W		(CAP_MMAP | CAP_SEEK | CAP_WRITE)
	#define	CAP_MMAP_X		(CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
	#define	CAP_MMAP_RW		(CAP_MMAP_R | CAP_MMAP_W)
	#define	CAP_MMAP_RX		(CAP_MMAP_R | CAP_MMAP_X)
	#define	CAP_MMAP_WX		(CAP_MMAP_W | CAP_MMAP_X)
	#define	CAP_MMAP_RWX		(CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
	#define	CAP_RECV		CAP_READ
	#define	CAP_SEND		CAP_WRITE

	#define	CAP_SOCK_CLIENT \
		(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
		 CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
	#define	CAP_SOCK_SERVER \
		(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
		 CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
		 CAP_SETSOCKOPT | CAP_SHUTDOWN)

	Added defines for backward API compatibility:

	#define	CAP_MAPEXEC		CAP_MMAP_X
	#define	CAP_DELETE		CAP_UNLINKAT
	#define	CAP_MKDIR		CAP_MKDIRAT
	#define	CAP_RMDIR		CAP_UNLINKAT
	#define	CAP_MKFIFO		CAP_MKFIFOAT
	#define	CAP_MKNOD		CAP_MKNODAT
	#define	CAP_SOCK_ALL		(CAP_SOCK_CLIENT | CAP_SOCK_SERVER)

Sponsored by:	The FreeBSD Foundation
Reviewed by:	Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with:	rwatson, benl, jonathan
ABI compatibility discussed with:	kib
2013-03-02 00:53:12 +00:00
Alan Cox
2c8472682c Eliminate a duplicate #include.
Sponsored by:	EMC / Isilon Storage Division
2013-02-26 07:00:24 +00:00
Attilio Rao
590f9303e5 Merge from vmobj-rwlock branch:
Remove unused inclusion of vm/vm_pager.h and vm/vnode_pager.h.

Sponsored by:	EMC / Isilon storage division
Tested by:	pho
Reviewed by:	alc
2013-02-26 01:00:11 +00:00
John Baldwin
593efaf9f7 Further refine the handling of stop signals in the NFS client. The
changes in r246417 were incomplete as they did not add explicit calls to
sigdeferstop() around all the places that previously passed SBDRY to
_sleep().  In addition, nfs_getcacheblk() could trigger a write RPC from
getblk() resulting in sigdeferstop() recursing.  Rather than manually
deferring stop signals in specific places, change the VFS_*() and VOP_*()
methods to defer stop signals for filesystems which request this behavior
via a new VFCF_SBDRY flag.  Note that this has to be a VFC flag rather than
a MNTK flag so that it works properly with VFS_MOUNT() when the mount is
not yet fully constructed.  For now, only the NFS clients are set this new
flag in VFS_SET().

A few other related changes:
- Add an assertion to ensure that TDF_SBDRY doesn't leak to userland.
- When a lookup request uses VOP_READLINK() to follow a symlink, mark
  the request as being on behalf of the thread performing the lookup
  (cnp_thread) rather than using a NULL thread pointer.  This causes
  NFS to properly handle signals during this VOP on an interruptible
  mount.

PR:		kern/176179
Reported by:	Russell Cattelan (sigdeferstop() recursion)
Reviewed by:	kib
MFC after:	1 month
2013-02-21 19:02:50 +00:00
Warner Losh
b96f7e0a60 The request queue is already locked, so we don't need the splsofclock/splx
here to note future work.
2013-02-21 02:43:44 +00:00
Konstantin Belousov
bb7ca8229d Do not update the fsinfo block on each update of any fat block, this
is excessive. Postpone the flush of the fsinfo to VFS_SYNC(),
remembering the need for update with the flag MSDOSFS_FSIMOD, stored
in pm_flags.

FAT32 specification describes both FSI_Free_Count and FSI_Nxt_Free as
the advisory hints, not requiring them to be correct.

Based on the patch from bde, modified by me.

Reviewed by: bde
MFC after:   2 weeks
2013-02-17 20:35:54 +00:00
Baptiste Daroussin
3c4223a65a Revert r246791 as it needs a security review first
Reported by:	gavin, rwatson
2013-02-14 15:17:53 +00:00
Baptiste Daroussin
f4365abd91 Allow fdescfs to be mounted from inside a jail
MFC after:	1 week
2013-02-14 13:03:15 +00:00
Pedro F. Giffuni
a9d1b29995 ext2fs: Use prototype declarations for function definitions
Submitted by:	Christoph Mallon
MFC after:	2 weeks
2013-02-10 19:49:37 +00:00
Attilio Rao
5e60cb948e Remove a racy checks on resident and cached pages for
tmpfs_mapped{read, write}() functions:
- tmpfs_mapped{read, write}() are only called within VOP_{READ, WRITE}(),
  which check before-hand to work only on valid VREG vnodes.  Also the
  vnode is locked for the duration of the work, making vnode reclaiming
  impossible, during the operation. Hence, vobj can never be NULL.
- Currently check on resident pages and cached pages without vm object
  lock held is racy and can do even more harm than good, as a page could
  be transitioning between these 2 pools and then be skipped entirely.
  Skip the checks as lookups on empty splay trees are very cheap.

Discussed with:	alc
Tested by:	flo
MFC after:	2 weeks
2013-02-10 01:04:10 +00:00
Pedro F. Giffuni
4e1e0e2582 ext2fs: Replace redundant EXT2_MIN_BLOCK with EXT2_MIN_BLOCK_SIZE.
Submitted by:	Christoph Mallon
MFC after:	2 weeks
2013-02-08 21:09:44 +00:00
Pedro F. Giffuni
1a125d6d85 ext2fs: make e2fs_maxcontig local and remove tautological check.
e2fs_maxcontig was modelled after UFS when bringing the
"Orlov allocator" to ext2. On UFS fs_maxcontig is kept in the
superblock and is used by userland tools (fsck and growfs),

In ext2 this information is volatile so it is not available
for userland tools, so in this case it doesn't have sense
to carry it in the in-memory superblock.

Also remove a pointless check for MAX(1, x) > 0.

Submitted by:	Christoph Mallon
MFC after:	2 weeks
2013-02-08 20:58:00 +00:00
Pedro F. Giffuni
a940ce65cd Remove unused MAXSYMLINKLEN macro.
Reviewed by:	mckusick
PR:		kern/175794
MFC after:	1 week
2013-02-08 20:30:19 +00:00
Konstantin Belousov
2ca4998342 Stop translating the ERESTART error from the open(2) into EINTR.
Posix requires that open(2) is restartable for SA_RESTART.

For non-posix objects, in particular, devfs nodes, still disable
automatic restart of the opens. The open call to a driver could have
significant side effects for the hardware.

Noted and reviewed by:	jilles
Discussed with:	bde
MFC after:	2 weeks
2013-02-07 14:53:33 +00:00
John Baldwin
a120a7a3cd Rework the handling of stop signals in the NFS client. The changes in
195702, 195703, and 195821 prevented a thread from suspending while holding
locks inside of NFS by forcing the thread to fail sleeps with EINTR or
ERESTART but defer the thread suspension to the user boundary.  However,
this had the effect that stopping a process during an NFS request could
abort the request and trigger EINTR errors that were visible to userland
processes (previously the thread would have suspended and completed the
request once it was resumed).

This change instead effectively masks stop signals while in the NFS client.
It uses the existing TDF_SBDRY flag to effect this since SIGSTOP cannot
be masked directly.  Also, instead of setting PBDRY on individual sleeps,
the NFS client now sets the TDF_SBDRY flag around each NFS request and
stop signals are masked for all sleeps during that region (the previous
change missed sleeps in lockmgr locks).  The end result is that stop
signals sent to threads performing an NFS request are completely
ignored until after the NFS request has finished processing and the
thread prepares to return to userland.  This restores the behavior of
stop signals being transparent to userland processes while still
preventing threads from suspending while holding NFS locks.

Reviewed by:	kib
MFC after:	1 month
2013-02-06 17:06:51 +00:00
Pedro F. Giffuni
d427334435 ext2fs: move assignment where it is not dead.
Submitted by:	Christoph Mallon
MFC after:	2 weeks
2013-02-05 03:26:34 +00:00
Pedro F. Giffuni
80b6a61199 ext2fs: Remove unused em_e2fsb definition..
Submitted by:	Christoph Mallon
MFC after:	2 weeks
2013-02-05 03:23:56 +00:00
Pedro F. Giffuni
555368dcf1 ext2fs: Remove useless rootino local variable.
Submitted by:	Christoph Mallon
MFC after:	2 weeks
2013-02-05 03:17:41 +00:00
Pedro F. Giffuni
ef024b0da9 ext2fs: Correct off-by-one errors in FFTODT() and DDTOFT().
Submitted by:	Christoph Mallon
MFC after:	2 weeks
2013-02-05 03:13:05 +00:00
Pedro F. Giffuni
666116e4a3 ext2fs: Use nitems().
Submitted by:	Christoph Mallon
MFC after:	2 weeks
2013-02-05 03:08:56 +00:00
Pedro F. Giffuni
fdc100e4c2 ext2fs: Use EXT2_LINK_MAX instead of LINK_MAX
Submitted by:	Christoph Mallon
MFC after:	2 weeks
2013-02-05 03:01:04 +00:00
Pedro F. Giffuni
757224cbdb ext2fs: general cleanup.
- Remove unused extern declarations in fs.h
- Correct comments in ext2_dir.h
- Several panic() messages showed wrong function names.
- Remove commented out stray line in ext2_alloc.c.
- Remove the unused macro EXT2_BLOCK_SIZE_BITS() and the then
  write-only member e2fs_blocksize_bits from struct m_ext2fs.
- Remove the unused macro EXT2_FIRST_INO() and the then write-only
  member e2fs_first_inode from struct m_ext2fs.
- Remove EXT2_DESC_PER_BLOCK() and the member e2fs_descpb from
  struct m_ext2fs.
- Remove the unused members e2fs_bmask, e2fs_dbpg and
  e2fs_mount_opt from struct m_ext2fs
- Correct harmless off-by-one error for fspath in ext2_vfsops.c.
- Remove the unused and broken macros EXT2_ADDR_PER_BLOCK_BITS()
  and EXT2_DESC_PER_BLOCK_BITS().
- Remove the !_KERNEL versions of the EXT2_* macros.

Submitted by:	Christoph Mallon
MFC after:	2 weeks
2013-02-02 22:23:45 +00:00
Konstantin Belousov
11fca81ccd The MSDOSFSMNT_WAITONFAT flag is bogus and broken. It does less than
track the MNT_SYNCHRONOUS flag.  It is set to the latter at mount time
but not updated by MNT_UPDATE.

Use MNT_SYNCHRONOUS to decide to write the FAT updates syncrhonously.

Submitted by:	bde
MFC after:	1 week
2013-02-01 18:30:41 +00:00
Konstantin Belousov
79fb7dd167 Backup FATs were sometimes marked dirty by copying their first block
from the primary FAT, and then they were not marked clean on unmount.
Force marking them clean when appropriate.

Submitted by:	bde
MFC after:	1 week
2013-02-01 18:25:53 +00:00
Konstantin Belousov
9ec062dddd The directory entry for dotdot was corrupted in the FAT32 case when moving
a directory to a subdir of the root directory from somewhere else.

For all directory moves that change the parent directory, the dotdot
entry must be fixed up.  For msdosfs, the root directory is magic for
non-FAT32.  It is less magic for FAT32, but needs the same magic for
the dotdot fixup.  It didn't have it.

Both chkdsk and fsck_msdosfs fix the corrupt directory entries with no
problems.

The fix is to use the same magic for dotdot in msdosfs_rename() as in
msdosfs_mkdir().

For msdosfs_mkdir(), document the magic. When writing the dotdot entry
in mkdir, use explicitly set pcl variable instead on relying on the
start cluster of the root directory typically has a value < 65536.

Submitted by:	bde
MFC after:	1 week
2013-02-01 18:06:06 +00:00
Konstantin Belousov
48efa33b49 The mountmsdosfs() function had an insane sanity test, remove it.
Trying FAT32 on a small partition failed to mount because
pmp->pm_Sectors was nonzero.  Normally, FAT32 file systems are so
large that the 16-bit pm_Sectors can't hold the size.  This is
indicated by setting it to 0 and using only pm_HugeSectors.  But at
least old versions of newfs_msdos use the 16-bit field if possible,
and msdosfs supports this except for breaking its own support in the
sanity check.  This is quite different from the handling of pm_FATsecs
-- now the 16-bit value is always ignored for FAT32 except for
checking that it is 0, and newfs_msdos doesn't use the 16-bit value
for FAT32.

Submitted by:	bde
MFC after:	1 week
2013-02-01 18:01:03 +00:00
Konstantin Belousov
a26b949f2d Fix a backwards comment in markvoldirty().
Submitted by:	bde
MFC after:	1 week
2013-02-01 17:58:37 +00:00
Konstantin Belousov
dd6035234a Assert that the mbuf in the chain has sane length. Proper place for
this check is somewhere in the network code, but this assertion
already proven to be useful in catching what seems to be driver bugs
causing NFS scrambling random memory.

Discussed with:	rmacklem
MFC after:	1 week
2013-02-01 16:57:02 +00:00
Konstantin Belousov
6168020f66 Be conservative and do not try to consume more bytes than was
requested from the server for the read operation.  Server shall not
reply with too large size, but client should be resilent too.

Reviewed by:	rmacklem
MFC after:	1 week
2013-01-27 09:34:25 +00:00
Pedro F. Giffuni
646a7fea0c Clean some 'svn:executable' properties in the tree.
Submitted by:	Christoph Mallon
MFC after:	3 days
2013-01-26 22:08:21 +00:00
Pedro F. Giffuni
879aeda7b6 Cosmetical off-by-one
Technically, the case when all the blocks are released
is not a sanity check.
Move further the comment while here.

Suggested by:	bde
MFC after:	3 days
2013-01-26 21:50:52 +00:00
John Baldwin
a89a2c8ba4 Further cleanups to use of timestamps in NFS:
- Use NFSD_MONOSEC (which maps to time_uptime) instead of the seconds
  portion of wall-time stamps to manage timeouts on events.
- Remove unused nd_starttime from the per-request structure in the new
  NFS server.
- Use nanotime() for the modification time on a delegation to get as
  precise a time as possible.
- Use time_second instead of extracting the second from a call to
  getmicrotime().

Submitted by:	bde (3)
Reviewed by:	bde, rmacklem
MFC after:	2 weeks
2013-01-25 15:25:24 +00:00
Pedro F. Giffuni
69017f8d8c ext2fs: fix a check for negative block numbers.
The previous change accidentally left the substraction we
were trying to avoid in case that i_blocks could become
negative.

Reported by:	bde
MFC after:	4 days
2013-01-23 14:29:29 +00:00
Pedro F. Giffuni
1d04725a7a ext2fs: make some inode fields match the ext2 spec.
Ext2fs uses unsigned fields in its dinode struct.
FreeBSD can have negative values in some of those
fields and the inode is meant to interact with the
system so we have never respected the unsigned
nature of most of those fields.

Block numbers and the NFS generation number do
not need to be signed so redefine them as
unsigned to better match the on-disk information.

MFC after:	1 week
2013-01-22 18:54:03 +00:00
Pedro F. Giffuni
4b21c8fda9 ext2fs: temporarily disable the reallocation code.
Testing with fsx has revealed problems and in order to
hunt the bugs properly we need reduce the complexity.

This seems to help but is not a complete solution.

MFC after:	3 days
2013-01-22 18:36:31 +00:00
Xin LI
e4558aacfc Make it possible to force async at server side on new NFS server, similar
to the old one's nfs.nfsrv.async.

Please note that by enabling this option (default is disabled), the system
could potentionally have silent data corruption if the server crashes
before write is committed to non-volatile storage, as the client side have
no way to tell if the data is already written.

Submitted by:	rmacklem
MFC after:	2 weeks
2013-01-18 19:42:08 +00:00
Pedro F. Giffuni
b187108f66 ext2fs: Add some DOINGASYNC check to match ffs.
This is mostly cosmetical.

Reviewed by:	bde
MFC after:	3 days
2013-01-18 19:11:17 +00:00
John Baldwin
d177f14da9 Use vfs_timestamp() to set file timestamps rather than invoking
getmicrotime() or getnanotime() directly in NFS.

Reviewed by:	rmacklem, bde
MFC after:	1 week
2013-01-18 18:43:38 +00:00
John Baldwin
39804bc89d Remove a no-longer-used variable after the previous change to use
VA_UTIMES_NULL.

Submitted by:	bde, rmacklem
MFC after:	1 week
2013-01-17 18:45:20 +00:00
John Baldwin
5055536eec Use the VA_UTIMES_NULL flag to detect when NULL was passed to utimes()
instead of comparing the desired time against the current time as a
heuristic.

Reviewed by:	rmacklem
MFC after:	1 week
2013-01-16 21:52:31 +00:00
Konstantin Belousov
e8f966eeb8 Remove the filtering of the acceptable mount options for nullfs, added
in r245004.  Although the report was for noatime option which is
non-functional for the nullfs, other standard options like nosuid or
noexec are useful with it.

Reported by:	Dewayne Geraghty <dewayne.geraghty@heuristicsystems.com.au>
MFC after:	3 days
2013-01-16 05:32:49 +00:00
John Baldwin
6910d7a0d8 - More properly handle interrupted NFS requests on an interruptible mount
by returning an error of EINTR rather than EACCES.
- While here, bring back some (but not all) of the NFS RPC statistics lost
  when krpc was committed.

Reviewed by:	rmacklem
MFC after:	1 week
2013-01-15 22:08:17 +00:00
Konstantin Belousov
603f963e56 The current default size of the nullfs hash table used to lookup the
existing nullfs vnode by the lower vnode is only 16 slots.  Since the
default mode for the nullfs is to cache the vnodes, hash has extremely
huge chains.

Size the nullfs hashtbl based on the current value of
desiredvnodes. Use vfs_hash_index() to calculate the hash bucket for a
given vnode.

Pointy hat to:	    kib
Diagnosed and reviewed by:	peter
Tested by:    peter, pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	5 days
2013-01-14 05:44:47 +00:00
Konstantin Belousov
6b17595133 When nullfs mount is forcibly unmounted and nullfs vnode is reclaimed,
get back the leased write reference from the lower vnode.  There is no
other path which can correct v_writecount on the lowervp.

Reported by:	flo
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2013-01-10 18:24:48 +00:00
Baptiste Daroussin
3d94054c30 Add support for IO_APPEND flag in fuse
This make open(..., O_APPEND) actually works on fuse filesystem.

Reviewed by:	attilio
2013-01-08 12:21:50 +00:00
Pedro F. Giffuni
98f5c0d41a ext2fs: cleanup de dinode structure.
It was plagued with style errors and the offsets had been lost.
While here took the time to update the fields according to the
latest ext4 documentation.

Reviewed by:	bde
MFC after:	3 days
2013-01-07 03:36:32 +00:00
Gleb Kurtsou
4fd5efe79e tmpfs: Replace directory entry linked list with RB-Tree.
Use file name hash as a tree key, handle duplicate keys.  Both VOP_LOOKUP
and VOP_READDIR operations utilize same tree for search.  Directory
entry offset (cookie) is either file name hash or incremental id in case
of hash collisions (duplicate-cookies).  Keep sorted per directory list
of duplicate-cookie entries to facilitate cookie number allocation.

Don't fail if previous VOP_READDIR() offset is no longer valid, start
with next dirent instead.  Other file system handle it similarly.

Workaround race prone tn_readdir_last[pn] fields update.

Add tmpfs_dir_destroy() to free all dirents.

Set NFS cookies in tmpfs_dir_getdents(). Return EJUSTRETURN from
tmpfs_dir_getdents() instead of hard coded -1.

Mark directory traversal routines static as they are no longer
used outside of tmpfs_subr.c
2013-01-06 22:15:44 +00:00
Konstantin Belousov
268dd286a0 Fix reversed condition in the assertion.
Pointy hat to:	kib
MFC after:	13 days
2013-01-04 07:52:47 +00:00
Konstantin Belousov
9cf4c952ca Add the "nocache" nullfs mount option, which disables the caching of
the free nullfs vnodes, switching nullfs behaviour to pre-r240285.
The option is mostly intended as the last-resort when higher pressure
on the vnode cache due to doubling of the vnode counts is not
desirable.

Note that disabling the cache costs more than 2x wall time in the
metadata-hungry scenarious.  The default is "cache".

Tested and benchmarked by:	pho (previous version)
MFC after:	2 weeks
2013-01-03 19:17:57 +00:00
Konstantin Belousov
6b54784391 Remove the last use of the deprecated MNT_VNODE_FOREACH interface in
the tree.

With the help from:	mjg
Tested by:	Ronald Klop <ronald-freebsd8@klop.yi.org>
MFC after:	2 weeks
2013-01-03 19:01:56 +00:00
Konstantin Belousov
ad9789f6db Do not force a writer to the devfs file to drain the buffer writes.
Requested and tested by:	Ian Lepore <freebsd@damnhippie.dyndns.org>
MFC after:	2 weeks
2012-12-23 22:43:27 +00:00
Pedro F. Giffuni
e28f5d5222 More constant renaming in preparation for newer features.
We also try to make better use of the fs flags instead of
trying adapt the code according to the fs structures. In
the case of subsecond timestamps and birthtime we now
check that the feature is explicitly enabled: previously
we only checked that the reserved space was available and
silently wrote them.

This approach is much safer, especially if the filesystem
happens to use embedded inodes or support EAs.

Discussed with:	Zheng Liu
MFC after:	5 days
2012-12-20 02:22:36 +00:00
Rick Macklem
ef8f1261d2 Add "nfsstat -m" support for the two new NFS mount options
added by r244042.
2012-12-09 22:23:50 +00:00
Rick Macklem
1f60bfd822 Move the NFSv4.1 client patches over from projects/nfsv4.1-client
to head. I don't think the NFS client behaviour will change unless
the new "minorversion=1" mount option is used. It includes basic
NFSv4.1 support plus support for pNFS using the Files Layout only.
All problems detecting during an NFSv4.1 Bakeathon testing event
in June 2012 have been resolved in this code and it has been tested
against the NFSv4.1 server available to me.
Although not reviewed, I believe that kib@ has looked at it.
2012-12-08 22:52:39 +00:00
Gleb Smirnoff
eb1b1807af Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually
2012-12-05 08:04:20 +00:00
Rick Macklem
99d2727d67 Add an nfssvc() option to the kernel for the new NFS client
which dumps out the actual options being used by an NFS mount.
This will be used to implement a "-m" option for nfsstat(1).

Reviewed by:	alfred
MFC after:	2 weeks
2012-12-02 01:16:04 +00:00
Pedro F. Giffuni
371f338bfd Update some definitions or make them match NetBSD's headers.
Bring several definitions required for newer ext4 features.

Rename EXT2F_COMPAT_HTREE to EXT2F_COMPAT_DIRHASHINDEX since it
is not being used yet and the new name is more compatible with
NetBSD and Linux.

This change is purely cosmetic and has no effect on the real
code.

Obtained from:	NetBSD
MFC after:	3 days
2012-11-28 15:48:32 +00:00
Pedro F. Giffuni
7306dea4e8 Partially bring r242520 to ext2fs.
When a file is first being written, the dynamic block reallocation
(implemented by ext2_reallocblks) relocates the file's blocks
so as to cluster them together into a contiguous set of blocks on
the disk.

When the cluster crosses the boundary into the first indirect block,
the first indirect block is initially allocated in a position
immediately following the last direct block.  Block reallocation
would usually destroy locality by moving the indirect block out of
the way to keep the data blocks contiguous.

The issue was diagnosed long ago by Bruce Evans on ffs and surfaced
on ext2fs when block reallocaton was ported. This is only a partial
solution based on the similarities with FFS. We still require more
review of the allocation details that vary in ext2fs.

Reported by:	bde
MFC after:	1 week
2012-11-28 00:36:40 +00:00
Davide Italiano
42039c5bce - smbfs_rename() might return an error value without correctly upgrading
the vnode use count, and this might cause the kernel to panic if compiled
with WITNESS enable.
- Be sure to put the '\0' terminator to the rpath string.

Sponsored by:	iXsystems inc.
2012-11-26 04:29:47 +00:00
Davide Italiano
2c4415419f - Remove reset of vpp pointer in some places as long as it's not really
useful and has the side effect of obfuscating the code a bit.
- Remove spurious references to simple_lock.

Reported by:	attilio [1]
Sponsored by:	iXsystems inc.
2012-11-22 09:13:45 +00:00
Davide Italiano
80704a47af Until now, smbfs_fullpath() computed the full path starting from the
vnode and following back the chain of n_parent pointers up to the root,
without acquiring the locks of the n_parent vnodes analyzed during the
computation. This is immediately wrong because if the vnode lock is not
held there's no guarantee on the validity of the vnode pointer or the data.
In order to fix, store the whole path in the smbnode structure so that
smbfs_fullpath() can use this information.

Discussed with:		kib
Reported and tested by:		pho
Sponsored by:		iXsystems inc.
2012-11-22 08:58:29 +00:00
Konstantin Belousov
6db79c26ce Remove the check and panic for an impossible condition. The NULL
lowervp vnode v_vnlock would cause panic due to NULL pointer
dereference much earlier.

MFC after:	1 week
2012-11-20 15:25:00 +00:00
Attilio Rao
c6e0355cee r16312 is not any longer real since many years (likely since when VFS
received granular locking) but the comment present in UFS has been
copied all over other filesystems code incorrectly for several times.

Removes comments that makes no sense now.

Reviewed by:	kib
MFC after:	3 days
2012-11-19 22:43:45 +00:00
Konstantin Belousov
134eb42e24 In pget(9), if PGET_NOTWEXIT flag is not specified, also search the
zombie list for the pid. This allows several kern.proc sysctls to
report useful information for zombies.

Hold the allproc_lock around all searches instead of relocking it.
Remove private pfind_locked() from the new nfs client code.

Requested and reviewed by:	pjd
Tested by:	pho
MFC after:	3 weeks
2012-11-16 08:25:06 +00:00
Konstantin Belousov
6feceb86ab Remove M_USE_RESERVE from the devfs cdp allocator, which is one of two
uses of M_USE_RESERVE in the kernel. This allocation is not special.

Reviewed by:	alc
Tested by:	pho
MFC after:	2 weeks
2012-11-14 19:50:21 +00:00
Davide Italiano
e631d5ab78 Get rid of some old debug code. It provides checks similar to the one
offered by RedZone so there's no need to keep it.

Sponsored by:	iXsystems inc.
2012-11-14 19:10:50 +00:00
Davide Italiano
9dbe0b121c Fix the lookup in the DOTDOT case in the same way as other filesystems do,
i.e. inlining the vn_vget_ino() algorithm.

Sponsored by:	iXsystems inc.
2012-11-14 18:43:58 +00:00
Attilio Rao
1750b7b9c8 - Protect mnt_data and mnt_flags under the mount interlock
- Move mp->mnt_stat manipulation where all of them happens

Reported by:	davide
Discussed with:	kib
Tested by:	flo
MFC after:	2 months
X-MFC:		241519, 242536,242616, 242727
2012-11-10 19:32:16 +00:00
Attilio Rao
bc2258da88 Complete MPSAFE VFS interface and remove MNTK_MPSAFE flag.
Porters should refer to __FreeBSD_version 1000021 for this change as
it may have happened at the same timeframe.
2012-11-09 18:02:25 +00:00
Attilio Rao
d9454fab30 - Current caching mode is completely broken because it simply relies
on timing of the operations and not real lookup, bringing too many
  false positives. Remove the whole mechanism. If it needs to be
  implemented, next time it should really be done in the proper way.
- Fix VOP_GETATTR() in order to cope with userland bugs that would
  change the type of file and not panic. Instead it gets the entry as
  if it is not existing.

Reported and tested by:	flo
MFC after:	2 months
X-MFC:		241519, 242536,242616
2012-11-08 00:32:49 +00:00
Attilio Rao
2810826df9 fuse_io* must be able to crunch also VDIR vnodes.
Update assert appropriately.

Reported and Tested by:	flo
MFC after:	2 months
X-MFC:		241519,242536
2012-11-05 15:23:54 +00:00
Attilio Rao
6de3b00db6 Fix a bug where operations was carried on even if not implemented,
leading to handling of an invalid fdip object.

Reported and tested by:	flo
MFC after:	2 months
X-MFC:		241519
2012-11-03 23:32:32 +00:00
Konstantin Belousov
140dedb81c The r241025 fixed the case when a binary, executed from nullfs mount,
was still possible to open for write from the lower filesystem.  There
is a symmetric situation where the binary could already has file
descriptors opened for write, but it can be executed from the nullfs
overlay.

Handle the issue by passing one v_writecount reference to the lower
vnode if nullfs vnode has non-zero v_writecount.  Note that only one
write reference can be donated, since nullfs only keeps one use
reference on the lower vnode.  Always use the lower vnode v_writecount
for the checks.

Introduce the VOP_GET_WRITECOUNT to read v_writecount, which is
currently always bypassed to the lower vnode, and VOP_ADD_WRITECOUNT
to manipulate the v_writecount value, which manages a single bypass
reference to the lower vnode.  Caling the VOPs instead of directly
accessing v_writecount provide the fix described in the previous
paragraph.

Tested by:	pho
MFC after:	3 weeks
2012-11-02 13:56:36 +00:00
Davide Italiano
8680dc800f - Do not put in the mntqueue half-constructed vnodes.
- Change the code so that it relies on vfs_hash rather than on a
  home-made hashtable.
- There's no need to inline fnv_32_buf().

Reviewed by:	delphij
Tested by:	pho
Sponsored by:	iXsystems inc.
2012-10-31 03:55:33 +00:00
Davide Italiano
afe097512c Fix panic due to page faults while in kernel mode, under conditions of
VM pressure. The reason is that in some codepaths pointers to stack
variables were passed from one thread to another.

In collaboration with:	pho
Reported by:	pho's stress2 suite
Sponsored by:	iXsystems inc.
2012-10-31 03:34:07 +00:00
Davide Italiano
994f027fbc Change the code to use %jd as printf() placeholder for uio_offset and
cast to intmax_t.

Suggested by:	pjd
Sponsored by:	iXsystems inc.
2012-10-31 02:54:44 +00:00
Davide Italiano
469cb18f88 Fix build in case we have SMBVDEBUG turned on.
Reviewed by:	gnn
Approved by:	gnn
Sponsored by:	iXsystems inc.
2012-10-25 21:08:02 +00:00
Davide Italiano
8d9495bb1d - Remove the references to the deprecated zalloc kernel interface
- Use M_ZERO flag in malloc() rather than bzero()
- malloc() with M_NOWAIT can't return NULL so there's no need to check

Reviewed by:	alc
Approved by:	alc
2012-10-25 20:23:04 +00:00
Konstantin Belousov
5050aa86cf Remove the support for using non-mpsafe filesystem modules.
In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by:	attilio
Tested by:	pho
2012-10-22 17:50:54 +00:00
Eitan Adler
db702c59cf remove duplicate semicolons where possible.
Approved by:	cperciva
MFC after:	1 week
2012-10-22 03:00:37 +00:00
Ed Schouten
9671713425 Remove unneeded D_NEEDMINOR.
This is only needed when using clonelists. This got remove in r238693.
2012-10-18 19:28:31 +00:00
Rick Macklem
6001db296e Add two new options to the nfssvc(2) syscall that allow
processes running as root to suspend/resume execution
of the kernel nfsd threads. An earlier version of this
patch was tested by Vincent Hoffman (vince at unsane.co.uk)
and John Hickey (jh at deterlab.net).

Reviewed by:	kib
MFC after:	2 weeks
2012-10-14 22:33:17 +00:00
Konstantin Belousov
82ed933c6f Grammar fixes.
Submitted by:	bf
MFC after:	1 week
2012-10-14 18:13:33 +00:00
Konstantin Belousov
806efacae0 Replace the XXX comment with the proper description.
MFC after:	1 week
2012-10-14 17:07:34 +00:00
Attilio Rao
4cff153b87 Rename s/DEBUG()/FS_DEBUG() and s/DEBUG2G()/FS_DEBUG2G() in order to
avoid a name clash in sparc64.

MFC after:	2 months
X-MFC:		r241519
2012-10-14 03:51:59 +00:00
Attilio Rao
5fe580195f Import a FreeBSD port of the FUSE Linux module.
This has been developed during 2 summer of code mandates and being revived
by gnn recently.
The functionality in this commit mirrors entirely content of fusefs-kmod
port, which doesn't need to be installed anymore for -CURRENT setups.

In order to get some sparse technical notes, please refer to:
http://lists.freebsd.org/pipermail/freebsd-fs/2012-March/013876.html

or to the project branch:
svn://svn.freebsd.org/base/projects/fuse/

which also contains granular history of changes happened during port
refinements. This commit does not came from the branch reintegration
itself because it seems svn is not behaving properly for this functionaly
at the moment.

Partly Sponsored by:		Google, Summer of Code program 2005, 2011
Originally submitted by:	ilya, Csaba Henk <csaba-ml AT creo DOT hu >
In collabouration with:		pho
Tested by:			flo, gnn, Gustau Perez,
				Kevin Oberman <rkoberman AT gmail DOT com>
MFC after:			2 months
2012-10-13 23:54:26 +00:00
Konstantin Belousov
877d24ac8a Fix the mis-handling of the VV_TEXT on the nullfs vnodes.
If you have a binary on a filesystem which is also mounted over by
nullfs, you could execute the binary from the lower filesystem, or
from the nullfs mount. When executed from lower filesystem, the lower
vnode gets VV_TEXT flag set, and the file cannot be modified while the
binary is active. But, if executed as the nullfs alias, only the
nullfs vnode gets VV_TEXT set, and you still can open the lower vnode
for write.

Add a set of VOPs for the VV_TEXT query, set and clear operations,
which are correctly bypassed to lower vnode.

Tested by:	pho (previous version)
MFC after:	2 weeks
2012-09-28 11:25:02 +00:00
Matthew D Fleming
fc8fdae0df Fix up kernel sources to be ready for a 64-bit ino_t.
Original code by:	Gleb Kurtsou
2012-09-27 23:30:49 +00:00
Rick Macklem
c52005a31d Modify the NFSv4 client so that it can handle owner
and owner_group strings that consist entirely of
digits, interpreting them as the uid/gid number.
This change was needed since new (>= 3.3) Linux
servers reply with these strings by default.
This change is mandated by the rfc3530bis draft.
Reported on freebsd-stable@ under the Subject
heading "Problem with Linux >= 3.3 as NFSv4 server"
by Norbert Aschendorff on Aug. 20, 2012.

Tested by:	norbert.aschendorff at yahoo.de
Reviewed by:	jhb
MFC after:	2 weeks
2012-09-20 02:49:25 +00:00
Ed Schouten
7cbef24e1a Prefer __containerof() above member2struct().
The first does proper checking of the argument types, while the latter
does not.
2012-09-15 19:28:54 +00:00
Konstantin Belousov
df3cbc41fa The deadfs VOPs for vop_ioctl and vop_bmap call itself recursively,
which is an elaborate way to cause kernel panic. Change the VOPs
implementation to return EBADF for a reclaimed vnode.

While the calls to vop_bmap should not reach deadfs, it is indeed
possible for vop_ioctl, because the VOP locking protocol is to pass
the vnode to VOP unlocked. The actual panic was observed when ioctl
was called on procfs filedescriptor which pointed to an exited
process.

Reported by:	zont
Tested by:	pho
MFC after:	1 week
2012-09-13 13:05:45 +00:00
Kevin Lo
95c79b6082 Add VFCF_READONLY flag that indicates ntfs and xfs file systems are
only supported as read-only.
2012-09-12 03:42:52 +00:00
Kevin Lo
6297d5d6f9 Prevent nump NULL pointer dereference in bmap_getlbns() 2012-09-11 09:38:32 +00:00
Kevin Lo
8e46bf68d1 Fix style nit 2012-09-11 08:36:41 +00:00
Rick Macklem
f4e2c07e73 Add a simple printf() based debug facility to the new nfs client.
Use it for a printf() that can be harmlessly generated for mmap()'d
files. It will be used extensively for the NFSv4.1 client.
Debugging printf()s are enabled by setting vfs.nfs.debuglevel to
a non-zero value. The higher the value, the more debugging printf()s.

Reviewed by:	jhb
MFC after:	2 weeks
2012-09-09 21:00:45 +00:00
Konstantin Belousov
d9e9650a36 Allow shared lookups for nullfs mounts, if lower filesystem supports
it.  There are two problems which shall be addressed for shared
lookups use to have measurable effect on nullfs scalability:

1. When vfs_lookup() calls VOP_LOOKUP() for nullfs, which passes lookup
operation to lower fs, resulting vnode is often only shared-locked. Then
null_nodeget() cannot instantiate covering vnode for lower vnode, since
insmntque1() and null_hashins() require exclusive lock on the lower.

Change the assert that lower vnode is exclusively locked to only
require any lock.  If null hash failed to find pre-existing nullfs
vnode for lower vnode and the vnode is shared-locked, the lower vnode
lock is upgraded.

2. Nullfs reclaims its vnodes on deactivation. This is due to nullfs
inability to detect reclamation of the lower vnode.  Reclamation of a
nullfs vnode at deactivation time prevents a reference to the lower
vnode to become stale.

Change nullfs VOP_INACTIVE to not reclaim the vnode, instead use the
VFS_RECLAIM_LOWERVP to get notification and reclaim upper vnode
together with the reclamation of the lower vnode.

Note that nullfs reclamation procedure calls vput() on the lowervp
vnode, temporary unlocking the vnode being reclaimed. This seems to be
fine for MPSAFE filesystems, but not-MPSAFE code often put partially
initialized vnode on some globally visible list, and later can decide
that half-constructed vnode is not needed.  If nullfs mount is created
above such filesystem, then other threads might catch such not
properly initialized vnode. Instead of trying to overcome this case,
e.g. by recursing the lower vnode lock in null_reclaim_lowervp(), I
decided to rely on nearby removal of the support for non-MPSAFE
filesystems.

In collaboration with:	pho
MFC after:	3 weeks
2012-09-09 19:20:23 +00:00
Pedro F. Giffuni
051b0df565 Add some basic definitions for a future htree implementation.
MFC after:	3 days
2012-08-24 01:12:07 +00:00
Kevin Lo
5bb295c408 Fix typo 2012-08-18 16:13:16 +00:00
Mateusz Guzik
1ec9bedabe Remove unused member of struct indir (in_exists) from UFS and EXT2 code.
Reviewed by:	mckusick
Approved by:	trasz (mentor)
MFC after:	1 week
2012-08-17 17:45:27 +00:00
Hans Petter Selasky
07da61a6cc Streamline use of cdevpriv and correct some corner cases.
1) It is not useful to call "devfs_clear_cdevpriv()" from
"d_close" callbacks, hence for example read, write, ioctl and
so on might be sleeping at the time of "d_close" being called
and then then freed private data can still be accessed.
Examples: dtrace, linux_compat, ksyms (all fixed by this patch)

2) In sys/dev/drm* there are some cases in which memory will
be freed twice, if open fails, first by code in the open
routine, secondly by the cdevpriv destructor. Move registration
of the cdevpriv to the end of the drm open routines.

3) devfs_clear_cdevpriv() is not called if the "d_open" callback
registered cdevpriv data and the "d_open" callback function
returned an error. Fix this.

Discussed with:	phk
MFC after:	2 weeks
2012-08-15 16:19:39 +00:00
Konstantin Belousov
b6c00483e9 Do not leave invalid pages in the object after the short read for a
network file systems (not only NFS proper). Short reads cause pages
other then the requested one, which were not filled by read response,
to stay invalid.

Change the vm_page_readahead_finish() interface to not take the error
code, but instead to make a decision to free or to (de)activate the
page only by its validity. As result, not requested invalid pages are
freed even if the read RPC indicated success.

Noted and reviewed by:	alc
MFC after:	1 week
2012-08-14 11:45:47 +00:00
Konstantin Belousov
1c771f9222 After the PHYS_TO_VM_PAGE() function was de-inlined, the main reason
to pull vm_param.h was removed.  Other big dependency of vm_page.h on
vm_param.h are PA_LOCK* definitions, which are only needed for
in-kernel code, because modules use KBI-safe functions to lock the
pages.

Stop including vm_param.h into vm_page.h. Include vm_param.h
explicitely for the kernel code which needs it.

Suggested and reviewed by:	alc
MFC after:    2 weeks
2012-08-05 14:11:42 +00:00
Konstantin Belousov
0055cbd3c5 Reduce code duplication and exposure of direct access to struct
vm_page oflags by providing helper function
vm_page_readahead_finish(), which handles completed reads for pages
with indexes other then the requested one, for VOP_GETPAGES().

Reviewed by:	alc
MFC after:	1 week
2012-08-04 18:16:43 +00:00
Konstantin Belousov
843dcea09e The header uma_int.h is internal uma header, unused by this source
file.  Do not include it needlessly.

Reviewed by:  alc
MFC after:    1 week
2012-08-04 18:12:54 +00:00
David Xu
5ff2bb52cc I am comparing current pipe code with the one in 8.3-STABLE r236165,
I found 8.3 is a history BSD version using socket to implement FIFO
pipe, it uses per-file seqcount to compare with writer generation
stored in per-pipe object. The concept is after all writers are gone,
the pipe enters next generation, all old readers have not closed the
pipe should get the indication that the pipe is disconnected, result
is they should get EPIPE, SIGPIPE or get POLLHUP in poll().
But newcomer should not know that previous writters were gone, it
should treat it as a fresh session.
I am trying to bring back FIFO pipe to history behavior. It is still
unclear that if single EOF flag can represent SBS_CANTSENDMORE and
SBS_CANTRCVMORE which socket-based version is using, but I have run
the poll regression test in tool directory, output is same as the one
on 8.3-STABLE now.
I think the output "not ok 18 FIFO state 6b: poll result 0 expected 1.
expected POLLHUP; got 0" might be bogus, because newcomer should not
know that old writers were gone. I got the same behavior on Linux.
Our implementation always return POLLIN for disconnected pipe even it
should return POLLHUP, but I think it is not wise to remove POLLIN for
compatible reason, this is our history behavior.

Regression test: /usr/src/tools/regression/poll
2012-07-31 05:48:35 +00:00
David Xu
12a480fa41 When a thread is blocked in direct write state, it only sets PIPE_DIRECTW
flag but not PIPE_WANTW, but FIFO pipe code does not understand this internal
state, when a FIFO peer reader closes the pipe, it wants to notify the writer,
it checks PIPE_WANTW, if not set, it skips calling wakeup(), so blocked writer
never noticed the case, but in general, the writer should return from the
syscall with EPIPE error code and may get SIGPIPE signal. Setting the
PIPE_WANTW fixed problem, or you can turn off direct write, it should fix the
problem too. This bug is found by PR/170203.

Another bug in FIFO pipe code is when peer closes the pipe, another end which
is being blocked in select() or poll() is not notified, it missed to call
pipeselwakeup().

Third problem is found in poll regression test, the existing code can not
pass 6b,6c,6d tests, but FreeBSD-4 works. This commit does not fix the
problem, I still need to study more to find the cause.

PR: 170203
Tested by: Garrett Copper &lt; yanegomi at gmail dot com &gt;
2012-07-31 02:00:37 +00:00
Kevin Lo
f7a3729c91 Use NULL instead of 0 for pointers 2012-07-22 15:40:31 +00:00
Christian Brueffer
01cc0b6531 Simply error handling by moving the allocation of np down to where it is
actually used.  While here, improve style a little.

Submitted by:	mjg
MFC after:	2 weeks
2012-07-16 22:07:29 +00:00
Christian Brueffer
9ce63ce2a3 Save a bzero() by using M_ZERO.
Obtained from:	Dragonfly BSD (change 4faaf07c3d7ddd120deed007370aaf4d90b72ebb)
MFC after:	2 weeks
2012-07-15 15:50:12 +00:00
Attilio Rao
5ebe387ff7 Remove a check on MNTK_UPDATE that is not really necessary as it is
handled in a code snippet above.
2012-07-10 00:23:25 +00:00
Attilio Rao
8806edb4ab - Remove the unused and not completed write support for NTFS.
- Fix a bug where vfs_mountedfrom() is called also when the filesystem
  is not mounted successfully.

Tested by:	pho
2012-07-10 00:01:00 +00:00
Kevin Lo
d6cc34a1ad Fix a typo 2012-07-03 08:03:07 +00:00
Konstantin Belousov
c5c1199c83 Extend the KPI to lock and unlock f_offset member of struct file. It
now fully encapsulates all accesses to f_offset, and extends f_offset
locking to other consumers that need it, in particular, to lseek() and
variants of getdirentries().

Ensure that on 32bit architectures f_offset, which is 64bit quantity,
always read and written under the mtxpool protection. This fixes
apparently easy to trigger race when parallel lseek()s or lseek() and
read/write could destroy file offset.

The already broken ABI emulations, including iBCS and SysV, are not
converted (yet).

Tested by:	pho
No objections from:	jhb
MFC after:    3 weeks
2012-07-02 21:01:03 +00:00
Konstantin Belousov
9d232eec30 Do not override an error from uiomove() with (non-)error result from
bwrite().  VFS needs to know about EFAULT from uiomove() and does not
care much that partially filled block writeback after EFAULT was
successfull.  Early return without error causes short write to be
reported to usermode.

Reported and tested by:	andreast
MFC after:	3 weeks
2012-07-02 09:53:08 +00:00
Konstantin Belousov
ddfc47fdc9 Enable deadlock avoidance code for NFS client.
MFC after:	2 weeks
2012-06-21 09:26:06 +00:00
Rick Macklem
53e1b8fba5 Fix the NFSv4 client for the case where mmap'd files are
written, but not msync'd by a process. A VOP_PUTPAGES()
called when VOP_RECLAIM() happens will usually fail, since
the NFSv4 Open has already been closed by VOP_INACTIVE().
Add a vm_object_page_clean() call to the NFSv4 client's
VOP_INACTIVE(), so that the write happens before the NFSv4
Open is closed. kib@ suggested using vgone() instead and
I will explore this, but this patch fixes things in the
meantime. For some reason, the VOP_PUTPAGES() is still
attaempted in VOP_RECLAIM(), but having this fail doesn't
cause any problems except a "stateid0 in write" being logged.

Reviewed by:	kib
MFC after:	1 week
2012-06-18 22:17:28 +00:00
Rick Macklem
79cafccd40 Move the nfsrpc_close() call in ncl_reclaim() for the NFSv4 client
to below the vnode_destroy_vobject() call, since that is where
writes are flushed.

Suggested by:	kib
MFC after:	1 week
2012-06-17 18:34:04 +00:00
Konstantin Belousov
bfb68a9e43 Improve handling of uiomove(9) errors for the NFS client.
Do not brelse() the buffer unconditionally with BIO_ERROR set if
uiomove() failed. The brelse() treats most buffers with BIO_ERROR as
B_INVAL, dropping their content.  Instead, if the write request
covered the whole buffer, remember the cached state and brelse() with
BIO_ERROR set only if the buffer was not cached previously.

Update the buffer dirtyoff/dirtyend based on the progress recorded by
uiomove() in passed struct uio, even in the presence of
error. Otherwise, usermode could see changed data in the backed pages,
but later the buffer is destroyed without write-back.

If uiomove() failed for IO_UNIT request, try to truncate the vnode
back to the pre-write state, and rewind the progress in passed uio
accordingly, following the FFS behaviour.

Reviewed by:	rmacklem (some time ago)
Tested by:	pho
MFC after:	1 month
2012-06-06 16:30:16 +00:00
Konstantin Belousov
6eec26f5ad Capitalize start of sentence.
MFC after:	3 days
2012-05-30 14:00:23 +00:00
Marcel Moolenaar
e9b29d1604 Catch a corner case where ssegs could be 0 and thus i would be 0 and
we index suinfo out of bounds (i.e. -1).

Approved by:	gber
2012-05-28 16:33:58 +00:00
Ed Schouten
0078e2fdeb Fix style and consistency:
- Use tabs, not spaces.
- Add tab after #define.
- Don't mix the use of BSD and ISO C unsigned integer types. Prefer the
  ISO C ones.
2012-05-27 09:34:47 +00:00
Gleb Kurtsou
fb2a3e6ea1 Use C99-style initialization for struct dirent in preparation for
changing the structure.

Sponsored by:	Google Summer of Code 2011
2012-05-25 09:16:59 +00:00