Commit Graph

3018 Commits

Author SHA1 Message Date
Rick Macklem
562395581b When the NFSv4.1 client is writing to a pNFS Data Server (DS), the
file's size attribute does not get updated. As such, it is necessary
to invalidate the attribute cache before clearing NMODIFIED for pNFS.

MFC after:	2 weeks
2013-06-21 22:26:18 +00:00
Rick Macklem
315c38d135 Since some NFSv4 servers enforce the requirement for a reserved port#,
enable use of the (no)resvport mount option for NFSv4. I had thought
that the RFC required that non-reserved port #s be allowed, but I couldn't
find it in the RFC.

MFC after:	2 weeks
2013-06-21 19:41:30 +00:00
Pedro F. Giffuni
3f5747b69d Rename some prefixes in the Block Group Descriptor fields to ext4bgd_
Change prefix to avoid confusion and denote that these fields
are generally only available starting with ext4.

MFC after:	3 days
2013-06-20 00:00:33 +00:00
Pedro F. Giffuni
9e43acf6c0 More ext2fs header cleanups:
- Set MAXMNTLEN nearer to where it is used.
- Move EXT2_LINK_MAX to ext2_dir.h .

MFC after:	3 days
2013-06-18 15:49:30 +00:00
Pedro F. Giffuni
ebf0f88839 Rename remaining DIAGNOSTIC to INVARIANTS.
MFC after:	3 days
2013-06-17 00:39:23 +00:00
Pedro F. Giffuni
b6113fb31a Re-sort ext2fs headers to make things easier to find.
In the ext2fs driver we have a mixture of headers:

- The ext2_ prefixed headers have strong influence from NetBSD
and are carry specific ext2/3/4 information.
- The unprefixed headers are inspired on UFS and carry implementation
specific information.

Do some small adjustments so that the information is easier to
find coming from either UFS or the NetBSD implementation.

MFC after:	3 days
2013-06-16 16:10:45 +00:00
Pedro F. Giffuni
f744956b4a Relax some unnecessary unsigned type changes in ext2fs.
While the changes in r245820 are in line with the ext2 spec,
the code derived from UFS can use negative values so it is
better to relax some types to keep them as they were, and
somewhat more similar to UFS. While here clean some casts.

Some of the original types are still wrong and will require
more work.

Discussed with:	bde
MFC after:	3 days
2013-06-13 03:23:24 +00:00
Pedro F. Giffuni
77b193c249 Turn DIAGNOSTICs to INVARIANTS in ext2fs.
This is done to be consistent with what other filesystems and
particularly ffs already does (see r173464).

MFC after:	5 days
2013-06-12 15:24:48 +00:00
Pedro F. Giffuni
abe38ac774 s/file system/filesystem/g
Based on r96755 from UFS.

MFC after:	3 days
2013-06-11 02:47:07 +00:00
Pedro F. Giffuni
f7d4b4d3d1 e2fs_bpg and e2fs_isize are always unsigned.
The superblock in ext2fs defines all the fields as unsigned but for
some reason the in-memory superblock was carrying e2fs_bpg and
e2fs_isize as signed.

We should preserve the specified types for consistency.

MFC after:	5 days
2013-06-09 01:38:51 +00:00
Alan Cox
f50b6721e1 Add missing VM object unlocks in an error case.
Reviewed by:	kib
2013-06-07 19:42:00 +00:00
Alan Cox
27a18d6a23 Don't busy the page unless we are likely to release the object lock.
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2013-06-06 06:17:20 +00:00
Alan Cox
66c392df53 Relax the vm object locking. Use a read lock.
Sponsored by:	EMC / Isilon Storage Division
2013-06-05 17:00:10 +00:00
Alan Cox
ba887a9b33 Eliminate unnecessary vm object locking from tmpfs_nocacheread(). 2013-06-04 15:40:45 +00:00
Pedro F. Giffuni
532ebe1313 ext2fs: space vs tab.
Obtained from:	Christoph Mallon
MFC after:	3 days
2013-06-03 20:33:05 +00:00
Pedro F. Giffuni
fc3ea958b2 ext2fs: Small cosmetic fixes.
Make a long macro readable and sort a header.

Obtained from:	Christoph Mallon
MFC after:	3 days
2013-06-03 20:02:45 +00:00
Pedro F. Giffuni
4f69a09308 ext2fs: Update Block Group Descriptor struct.
Uncover some, previously reserved, fields that are used by Ext4.
These are currently unused but it is good to have them for future
reference.

Reviewed by:	bde
MFC after:	3 days
2013-06-03 18:52:14 +00:00
Jeff Roberson
22a722605d - Convert the bufobj lock to rwlock.
- Use a shared bufobj lock in getblk() and inmem().
 - Convert softdep's lk to rwlock to match the bufobj lock.
 - Move INFREECNT to b_flags and protect it with the buf lock.
 - Remove unnecessary locking around bremfree() and BKGRDINPROG.

Sponsored by:	EMC / Isilon Storage Division
Discussed with:	mckusick, kib, mdf
2013-05-31 00:43:41 +00:00
Konstantin Belousov
67b4ed4b88 Assert that OBJ_TMPFS flag on the vm object for the tmpfs node is
cleared when the tmpfs node is going away.

Tested by:	bdrewery, pho
2013-05-30 19:51:33 +00:00
Rick Macklem
734b03c38d Post-r248567, there were times when the client would return a
truncated directory for some NFS servers. This turned out to
be because the size of a directory reported by an NFS server
can be smaller that the ufs-like directory created from the
RPC XDR in the client. This patch fixes the problem by changing
r248567 so that vnode_pager_setsize() is only done for regular files.

Reported and tested by:	hartmut.brandt@dlr.de
Reviewed by:	kib
MFC after:	1 week
2013-05-28 22:36:01 +00:00
Konstantin Belousov
74c7ff1a0e Do not leak the NULLV_NOUNLOCK flag from the nullfs_unlink_lowervp(),
for the case when the nullfs vnode is not reclaimed.  Otherwise, later
reclamation would not unlock the lower vnode.

Reported by:	antoine
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-05-21 11:31:56 +00:00
Dag-Erling Smørgrav
72ccd4cc6b Fix typo in comment.
Submitted by:	Alex Weber <alexwebr@gmail.com>
MFC after:	1 week
2013-05-15 08:38:49 +00:00
Rick Macklem
77a03c148c Add support for the eofflag to nfs_readdir() in the new NFS
client so that it works under a unionfs mount.

Submitted by:	Jared Yanovich (slovichon@gmail.com)
Reviewed by:	kib
MFC after:	2 weeks
2013-05-12 21:48:08 +00:00
Eitan Adler
a164074fc4 Fix several typos
PR:		kern/176054
Submitted by:	Christoph Mallon <christoph.mallon@gmx.de>
MFC after:	3 days
2013-05-12 16:43:26 +00:00
Jilles Tjoelker
d3045c081d fdescfs: Supply a real value for d_type in readdir.
All the fdescfs nodes (except . and ..) appear as character devices to
stat(), so DT_CHR is correct.
2013-05-12 15:44:49 +00:00
Konstantin Belousov
0fc6daa72d - Fix nullfs vnode reference leak in nullfs_reclaim_lowervp(). The
null_hashget() obtains the reference on the nullfs vnode, which must
  be dropped.

- Fix a wart which existed from the introduction of the nullfs
  caching, do not unlock lower vnode in the nullfs_reclaim_lowervp().
  It should be innocent, but now it is also formally safe.  Inform the
  nullfs_reclaim() about this using the NULLV_NOUNLOCK flag set on
  nullfs inode.

- Add a callback to the upper filesystems for the lower vnode
  unlinking. When inactivating a nullfs vnode, check if the lower
  vnode was unlinked, indicated by nullfs flag NULLV_DROP or VV_NOSYNC
  on the lower vnode, and reclaim upper vnode if so.  This allows
  nullfs to purge cached vnodes for the unlinked lower vnode, avoiding
  excessive caching.

Reported by:	G??ran L??wkrantz <goran.lowkrantz@ismobile.com>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-05-11 11:17:44 +00:00
Konstantin Belousov
3fa456b35d Avoid deactivating the page if it is already on a queue, only requeue
the page.  This both reduces the number of queues locking and avoids
moving the active page to inactive list just because the page was read
or written.

Based on the suggestion by:	alc
Reviewed by: alc
Tested by:   pho
2013-05-06 21:04:42 +00:00
Davide Italiano
caa8e38fa6 Change VM_OBJECT_LOCK/UNLOCK() -> VM_OBJECT_WLOCK/WUNLOCK() to reflect
the recent switch of the vm object lock to a rwlock.

Reported by:	attilio
2013-05-04 14:27:28 +00:00
Davide Italiano
a4c059845a Overhaul locking in netsmb, getting rid of the obsolete lockmgr() primitive.
This solves a long standing LOR between smb_conn and smb_vc.

Tested by:	martymac, pho (previous version)
2013-05-04 14:18:10 +00:00
Davide Italiano
92a4d9bcc8 Completely rewrite the interface to smbdev switching from dev_clone
to cdevpriv(9). This commit changes the semantic of mount_smbfs
in userland as well, which now passes file descriptor in order to
to mount a specific filesystem istance.

Reviewed by:	attilio, ed
Tested by:	martymac
2013-05-04 14:03:18 +00:00
Konstantin Belousov
293e4eb67d The fsync(2) call should sync the vnode in such way that even after
system crash which happen after successfull fsync() return, the data
is accessible.  For msdosfs, this means that FAT entries for the file
must be written.

Since we do not track the FAT blocks containing entries for the
current file, just do a sloppy sync of the devvp vnode for the mount,
which buffers, among other things, contain FAT blocks.

Simultaneously, for deupdat():
- optimize by clearing the modified flags before short-circuiting a
  return, if the mount is read-only;
- only ignore the rest of the function for denode with DE_MODIFIED
  flag clear when the waitfor argument is false.  The directory buffer
  for the entry might be of delayed write;
- microoptimize by comparing the updated directory entry with the
  current block content;
- try to cluster the write, fall back to bawrite() if low on
  resources.

Based on the submission by:	bde
MFC after:	2 weeks
2013-05-02 20:00:11 +00:00
Konstantin Belousov
df6b240b6f Fix the v_object leak for non-regular tmpfs vnodes.
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
2013-05-02 18:46:31 +00:00
Konstantin Belousov
158cc900bb For the new regular tmpfs vnode, v_object is initialized before
insmntque() is called.  The standard insmntque destructor resets the
vop vector to deadfs one, and calls vgone() on the vnode.  As result,
v_object is kept unchanged, which triggers an assertion in the reclaim
code, on instmntque() failure.  Also, in this case, OBJ_TMPFS flag on
the backed vm object is not cleared.

Provide the tmpfs insmntque() destructor which properly clears
OBJ_TMPFS flag and resets v_object.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
2013-05-02 18:44:31 +00:00
Konstantin Belousov
bdefcb6959 The page read or written could be wired. Do not requeue if the page
is not on a queue.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
2013-05-02 18:36:52 +00:00
Dag-Erling Smørgrav
c93c82f464 Fix a bug that allows NFS clients to issue READDIR on files.
PR:		kern/178016
Security:	CVE-2013-3266
Security:	FreeBSD-SA-13:05.nfsserver
2013-04-29 20:09:44 +00:00
Konstantin Belousov
6f2af3fcf3 Rework the handling of the tmpfs node backing swap object and tmpfs
vnode v_object to avoid double-buffering.  Use the same object both as
the backing store for tmpfs node and as the v_object.

Besides reducing memory use up to 2x times for situation of mapping
files from tmpfs, it also makes tmpfs read and write operations copy
twice bytes less.

VM subsystem was already slightly adapted to tolerate OBJT_SWAP object
as v_object. Now the vm_object_deallocate() is modified to not
reinstantiate OBJ_ONEMAPPING flag and help the VFS to correctly handle
VV_TEXT flag on the last dereference of the tmpfs backing object.

Reviewed by:	alc
Tested by:	pho, bf
MFC after:	1 month
2013-04-28 19:38:59 +00:00
Rick Macklem
64a0e848ab When an NFS unmount occurs, once vflush() writes the last dirty
buffer for the last vnode on the mount back to the server, it
returns. At that point, the code continues with the unmount,
including freeing up the nfs specific part of the mount structure.
It is possible that an nfsiod thread will try to check for an
empty I/O queue in the nfs specific part of the mount structure
after it has been free'd by the unmount. This patch avoids this problem by
setting the iodmount entries for the mount back to NULL while holding the
mutex in the unmount and checking the appropriate entry is non-NULL after
acquiring the mutex in the nfsiod thread.

Reported and tested by:	pho
Reviewed by:	kib
MFC after:	2 weeks
2013-04-18 23:20:16 +00:00
Rick Macklem
175b3f31d3 Both NFS clients can deadlock when using the "rdirplus" mount
option. This can occur when an nfsiod thread that already holds
a buffer lock attempts to acquire a vnode lock on an entry in
the directory (a LOR) when another thread holding the vnode lock
is waiting on an nfsiod thread. This patch avoids the deadlock by disabling
readahead for this case, so the nfsiod threads never do readdirplus.
Since readaheads for directories need the directory offset cookie
from the previous read, they cannot normally happen in parallel.
As such, testing by jhb@ and myself didn't find any performance
degredation when this patch is applied. If there is a case where
this results in a significant performance degradation, mounting
without the "rdirplus" option can be done to re-enable readahead
for directories.

Reported and tested by:	jhb
Reviewed by:	jhb
MFC after:	2 weeks
2013-04-18 13:09:04 +00:00
Kenneth D. Merry
adb974068b Move the NFS FHA (File Handle Affinity) code from sys/nfsserver to
sys/nfs, since it is now shared by the two NFS servers.

Suggested by:	rmacklem
Sponsored by:	Spectra Logic
MFC after:	2 weeks
2013-04-17 22:42:43 +00:00
Kenneth D. Merry
d96b98a360 Revamp the old NFS server's File Handle Affinity (FHA) code so that
it will work with either the old or new server.

The FHA code keeps a cache of currently active file handles for
NFSv2 and v3 requests, so that read and write requests for the same
file are directed to the same group of threads (reads) or thread
(writes).  It does not currently work for NFSv4 requests.  They are
more complex, and will take more work to support.

This improves read-ahead performance, especially with ZFS, if the
FHA tuning parameters are configured appropriately.  Without the
FHA code, concurrent reads that are part of a sequential read from
a file will be directed to separate NFS threads.  This has the
effect of confusing the ZFS zfetch (prefetch) code and makes
sequential reads significantly slower with clients like Linux that
do a lot of prefetching.

The FHA code has also been updated to direct write requests to nearby
file offsets to the same thread in the same way it batches reads,
and the FHA code will now also send writes to multiple threads when
needed.

This improves sequential write performance in ZFS, because writes
to a file are now more ordered.  Since NFS writes (generally
less than 64K) are smaller than the typical ZFS record size
(usually 128K), out of order NFS writes to the same block can
trigger a read in ZFS.  Sending them down the same thread increases
the odds of their being in order.

In order for multiple write threads per file in the FHA code to be
useful, writes in the NFS server have been changed to use a LK_SHARED
vnode lock, and upgrade that to LK_EXCLUSIVE if the filesystem
doesn't allow multiple writers to a file at once.  ZFS is currently
the only filesystem that allows multiple writers to a file, because
it has internal file range locking.  This change does not affect the
NFSv4 code.

This improves random write performance to a single file in ZFS, since
we can now have multiple writers inside ZFS at one time.

I have changed the default tuning parameters to a 22 bit (4MB)
window size (from 256K) and unlimited commands per thread as a
result of my benchmarking with ZFS.

The FHA code has been updated to allow configuring the tuning
parameters from loader tunable variables in addition to sysctl
variables.  The read offset window calculation has been slightly
modified as well.  Instead of having separate bins, each file
handle has a rolling window of bin_shift size.  This minimizes
glitches in throughput when shifting from one bin to another.

sys/conf/files:
	Add nfs_fha_new.c and nfs_fha_old.c.  Compile nfs_fha.c
	when either the old or the new NFS server is built.

sys/fs/nfs/nfsport.h,
sys/fs/nfs/nfs_commonport.c:
	Bring in changes from Rick Macklem to newnfs_realign that
	allow it to operate in blocking (M_WAITOK) or non-blocking
	(M_NOWAIT) mode.

sys/fs/nfs/nfs_commonsubs.c,
sys/fs/nfs/nfs_var.h:
	Bring in a change from Rick Macklem to allow telling
	nfsm_dissect() whether or not to wait for mallocs.

sys/fs/nfs/nfsm_subs.h:
	Bring in changes from Rick Macklem to create a new
	nfsm_dissect_nonblock() inline function and
	NFSM_DISSECT_NONBLOCK() macro.

sys/fs/nfs/nfs_commonkrpc.c,
sys/fs/nfsclient/nfs_clkrpc.c:
	Add the malloc wait flag to a newnfs_realign() call.

sys/fs/nfsserver/nfs_nfsdkrpc.c:
	Setup the new NFS server's RPC thread pool so that it will
	call the FHA code.

	Add the malloc flag argument to newnfs_realign().

	Unstaticize newnfs_nfsv3_procid[] so that we can use it in
	the FHA code.

sys/fs/nfsserver/nfs_nfsdsocket.c:
	In nfsrvd_dorpc(), add NFSPROC_WRITE to the list of RPC types
	that use the LK_SHARED lock type.

sys/fs/nfsserver/nfs_nfsdport.c:
	In nfsd_fhtovp(), if we're starting a write, check to see
	whether the underlying filesystem supports shared writes.
	If not, upgrade the lock type from LK_SHARED to LK_EXCLUSIVE.

sys/nfsserver/nfs_fha.c:
	Remove all code that is specific to the NFS server
	implementation.  Anything that is server-specific is now
	accessed through a callback supplied by that server's FHA
	shim in the new softc.

	There are now separate sysctls and tunables for the FHA
	implementations for the old and new NFS servers.  The new
	NFS server has its tunables under vfs.nfsd.fha, the old
	NFS server's tunables are under vfs.nfsrv.fha as before.

	In fha_extract_info(), use callouts for all server-specific
	code.  Getting file handles and offsets is now done in the
	individual server's shim module.

	In fha_hash_entry_choose_thread(), change the way we decide
	whether two reads are in proximity to each other.
	Previously, the calculation was a simple shift operation to
	see whether the offsets were in the same power of 2 bucket.
	The issue was that there would be a bucket (and therefore
	thread) transition, even if the reads were in close
	proximity.  When there is a thread transition, reads wind
	up going somewhat out of order, and ZFS gets confused.

	The new calculation simply tries to see whether the offsets
	are within 1 << bin_shift of each other.  If they are, the
	reads will be sent to the same thread.

	The effect of this change is that for sequential reads, if
	the client doesn't exceed the max_reqs_per_nfsd parameter
	and the bin_shift is set to a reasonable value (22, or
	4MB works well in my tests), the reads in any sequential
	stream will largely be confined to a single thread.

	Change fha_assign() so that it takes a softc argument.  It
	is now called from the individual server's shim code, which
	will pass in the softc.

	Change fhe_stats_sysctl() so that it takes a softc
	parameter.  It is now called from the individual server's
	shim code.  Add the current offset to the list of things
	printed out about each active thread.

	Change the num_reads and num_writes counters in the
	fha_hash_entry structure to 32-bit values, and rename them
	num_rw and num_exclusive, respectively, to reflect their
	changed usage.

	Add an enable sysctl and tunable that allows the user to
	disable the FHA code (when vfs.XXX.fha.enable = 0).  This
	is useful for before/after performance comparisons.

nfs_fha.h:
	Move most structure definitions out of nfs_fha.c and into
	the header file, so that the individual server shims can
	see them.

	Change the default bin_shift to 22 (4MB) instead of 18
	(256K).  Allow unlimited commands per thread.

sys/nfsserver/nfs_fha_old.c,
sys/nfsserver/nfs_fha_old.h,
sys/fs/nfsserver/nfs_fha_new.c,
sys/fs/nfsserver/nfs_fha_new.h:
	Add shims for the old and new NFS servers to interface with
	the FHA code, and callbacks for the

	The shims contain all of the code and definitions that are
	specific to the NFS servers.

	They setup the server-specific callbacks and set the server
	name for the sysctl and loader tunable variables.

sys/nfsserver/nfs_srvkrpc.c:
	Configure the RPC code to call fhaold_assign() instead of
	fha_assign().

sys/modules/nfsd/Makefile:
	Add nfs_fha.c and nfs_fha_new.c.

sys/modules/nfsserver/Makefile:
	Add nfs_fha_old.c.

Reviewed by:	rmacklem
Sponsored by:	Spectra Logic
MFC after:	2 weeks
2013-04-17 21:00:22 +00:00
Gabor Kovesdan
a8b5c2a0aa - Correct spelling in comments
Submitted by:	Christoph Mallon <christoph.mallon@gmx.de> (via private mail)
2013-04-17 11:56:11 +00:00
Gabor Kovesdan
a2098fea6d - Correct mispellings of the word necessary
Submitted by:	Christoph Mallon <christoph.mallon@gmx.de> (via private mail)
2013-04-17 11:42:40 +00:00
Jeff Roberson
26089666b6 Prepare to replace the buf splay with a trie:
- Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists.
   No consumers need to find them there and it complicates the tree.
   These flags are all FFS specific and could be moved out of the buf
   cache.
 - Use pbgetvp() and pbrelvp() to associate the background and journal
   bufs with the vp.  Not only is this much cheaper it makes more sense
   for these transient bufs.
 - Fix the assertions in pbget* and pbrel*.  It's not safe to check list
   pointers which were never initialized.  Use the BX flags instead.  We
   also check B_PAGING in reassignbuf() so this should cover all cases.

Discussed with:	kib, mckusick, attilio
Sponsored by:	EMC / Isilon Storage Division
2013-04-06 22:21:23 +00:00
Konstantin Belousov
159a400eb6 Strip the unnneeded spaces, mostly at the end of lines.
MFC after:	3 days
2013-04-01 09:56:48 +00:00
Pawel Jakub Dawidek
051a23d4e8 - Constify local path variable for chflagsat().
- Use correct format characters (%lx) for u_long.

This fixes the build broken in r248599.
2013-03-22 07:40:34 +00:00
Pawel Jakub Dawidek
b4b2596b97 - Make 'flags' argument to chflags(2), fchflags(2) and lchflags(2) of type
u_long. Before this change it was of type int for syscalls, but prototypes
  in sys/stat.h and documentation for chflags(2) and fchflags(2) (but not
  for lchflags(2)) stated that it was u_long. Now some related functions
  use u_long type for flags (strtofflags(3), fflagstostr(3)).
- Make path argument of type 'const char *' for consistency.

Discussed on:	arch
Sponsored by:	The FreeBSD Foundation
2013-03-21 22:44:33 +00:00
Konstantin Belousov
4d569af96c Initialize the variable to avoid (false) compiler warning about
use of an uninitialized local.

Reported by:	Ivan Klymenko <fidaj@ukr.net>
MFC after:	2 weeks
2013-03-21 12:59:24 +00:00
Konstantin Belousov
7157d8f7ab Do not call vnode_pager_setsize() while a NFS node mutex is
locked. vnode_pager_setsize() might sleep waiting for the page after
EOF be unbusied.

Call vnode_pager_setsize() both for the regular and directory vnodes.

Reported by:	mich
Reviewed by:	rmacklem
Discussed with:	avg, jhb
MFC after:	2 weeks
2013-03-21 07:25:08 +00:00
Ed Maste
96ecfd9813 Fix remainder calculation when biosize is not a power of 2
In common configurations biosize is a power of two, but is not required to
be so.  Thanks to markj@ for spotting an additional case beyond my original
patch.

Reviewed by: rmacklem@
2013-03-19 13:06:11 +00:00
Konstantin Belousov
0d3bb4afa8 Remove negative name cache entry pointing to the target name, which
could be instantiated while tdvp was unlocked.

Reported by:	Rick Miller <vmiller at hostileadmin com>
Tested by:	pho
MFC after:	1 week
2013-03-17 15:11:37 +00:00