Commit Graph

28 Commits

Author SHA1 Message Date
Rick Macklem
c59e4cc34d Merge the NFSv4.1 server code in projects/nfsv4.1-server over
into head. The code is not believed to have any effect
on the semantics of non-NFSv4.1 server behaviour.
It is a rather large merge, but I am hoping that there will
not be any regressions for the NFS server.

MFC after:	1 month
2014-07-01 20:47:16 +00:00
Kenneth D. Merry
d96b98a360 Revamp the old NFS server's File Handle Affinity (FHA) code so that
it will work with either the old or new server.

The FHA code keeps a cache of currently active file handles for
NFSv2 and v3 requests, so that read and write requests for the same
file are directed to the same group of threads (reads) or thread
(writes).  It does not currently work for NFSv4 requests.  They are
more complex, and will take more work to support.

This improves read-ahead performance, especially with ZFS, if the
FHA tuning parameters are configured appropriately.  Without the
FHA code, concurrent reads that are part of a sequential read from
a file will be directed to separate NFS threads.  This has the
effect of confusing the ZFS zfetch (prefetch) code and makes
sequential reads significantly slower with clients like Linux that
do a lot of prefetching.

The FHA code has also been updated to direct write requests to nearby
file offsets to the same thread in the same way it batches reads,
and the FHA code will now also send writes to multiple threads when
needed.

This improves sequential write performance in ZFS, because writes
to a file are now more ordered.  Since NFS writes (generally
less than 64K) are smaller than the typical ZFS record size
(usually 128K), out of order NFS writes to the same block can
trigger a read in ZFS.  Sending them down the same thread increases
the odds of their being in order.

In order for multiple write threads per file in the FHA code to be
useful, writes in the NFS server have been changed to use a LK_SHARED
vnode lock, and upgrade that to LK_EXCLUSIVE if the filesystem
doesn't allow multiple writers to a file at once.  ZFS is currently
the only filesystem that allows multiple writers to a file, because
it has internal file range locking.  This change does not affect the
NFSv4 code.

This improves random write performance to a single file in ZFS, since
we can now have multiple writers inside ZFS at one time.

I have changed the default tuning parameters to a 22 bit (4MB)
window size (from 256K) and unlimited commands per thread as a
result of my benchmarking with ZFS.

The FHA code has been updated to allow configuring the tuning
parameters from loader tunable variables in addition to sysctl
variables.  The read offset window calculation has been slightly
modified as well.  Instead of having separate bins, each file
handle has a rolling window of bin_shift size.  This minimizes
glitches in throughput when shifting from one bin to another.

sys/conf/files:
	Add nfs_fha_new.c and nfs_fha_old.c.  Compile nfs_fha.c
	when either the old or the new NFS server is built.

sys/fs/nfs/nfsport.h,
sys/fs/nfs/nfs_commonport.c:
	Bring in changes from Rick Macklem to newnfs_realign that
	allow it to operate in blocking (M_WAITOK) or non-blocking
	(M_NOWAIT) mode.

sys/fs/nfs/nfs_commonsubs.c,
sys/fs/nfs/nfs_var.h:
	Bring in a change from Rick Macklem to allow telling
	nfsm_dissect() whether or not to wait for mallocs.

sys/fs/nfs/nfsm_subs.h:
	Bring in changes from Rick Macklem to create a new
	nfsm_dissect_nonblock() inline function and
	NFSM_DISSECT_NONBLOCK() macro.

sys/fs/nfs/nfs_commonkrpc.c,
sys/fs/nfsclient/nfs_clkrpc.c:
	Add the malloc wait flag to a newnfs_realign() call.

sys/fs/nfsserver/nfs_nfsdkrpc.c:
	Setup the new NFS server's RPC thread pool so that it will
	call the FHA code.

	Add the malloc flag argument to newnfs_realign().

	Unstaticize newnfs_nfsv3_procid[] so that we can use it in
	the FHA code.

sys/fs/nfsserver/nfs_nfsdsocket.c:
	In nfsrvd_dorpc(), add NFSPROC_WRITE to the list of RPC types
	that use the LK_SHARED lock type.

sys/fs/nfsserver/nfs_nfsdport.c:
	In nfsd_fhtovp(), if we're starting a write, check to see
	whether the underlying filesystem supports shared writes.
	If not, upgrade the lock type from LK_SHARED to LK_EXCLUSIVE.

sys/nfsserver/nfs_fha.c:
	Remove all code that is specific to the NFS server
	implementation.  Anything that is server-specific is now
	accessed through a callback supplied by that server's FHA
	shim in the new softc.

	There are now separate sysctls and tunables for the FHA
	implementations for the old and new NFS servers.  The new
	NFS server has its tunables under vfs.nfsd.fha, the old
	NFS server's tunables are under vfs.nfsrv.fha as before.

	In fha_extract_info(), use callouts for all server-specific
	code.  Getting file handles and offsets is now done in the
	individual server's shim module.

	In fha_hash_entry_choose_thread(), change the way we decide
	whether two reads are in proximity to each other.
	Previously, the calculation was a simple shift operation to
	see whether the offsets were in the same power of 2 bucket.
	The issue was that there would be a bucket (and therefore
	thread) transition, even if the reads were in close
	proximity.  When there is a thread transition, reads wind
	up going somewhat out of order, and ZFS gets confused.

	The new calculation simply tries to see whether the offsets
	are within 1 << bin_shift of each other.  If they are, the
	reads will be sent to the same thread.

	The effect of this change is that for sequential reads, if
	the client doesn't exceed the max_reqs_per_nfsd parameter
	and the bin_shift is set to a reasonable value (22, or
	4MB works well in my tests), the reads in any sequential
	stream will largely be confined to a single thread.

	Change fha_assign() so that it takes a softc argument.  It
	is now called from the individual server's shim code, which
	will pass in the softc.

	Change fhe_stats_sysctl() so that it takes a softc
	parameter.  It is now called from the individual server's
	shim code.  Add the current offset to the list of things
	printed out about each active thread.

	Change the num_reads and num_writes counters in the
	fha_hash_entry structure to 32-bit values, and rename them
	num_rw and num_exclusive, respectively, to reflect their
	changed usage.

	Add an enable sysctl and tunable that allows the user to
	disable the FHA code (when vfs.XXX.fha.enable = 0).  This
	is useful for before/after performance comparisons.

nfs_fha.h:
	Move most structure definitions out of nfs_fha.c and into
	the header file, so that the individual server shims can
	see them.

	Change the default bin_shift to 22 (4MB) instead of 18
	(256K).  Allow unlimited commands per thread.

sys/nfsserver/nfs_fha_old.c,
sys/nfsserver/nfs_fha_old.h,
sys/fs/nfsserver/nfs_fha_new.c,
sys/fs/nfsserver/nfs_fha_new.h:
	Add shims for the old and new NFS servers to interface with
	the FHA code, and callbacks for the

	The shims contain all of the code and definitions that are
	specific to the NFS servers.

	They setup the server-specific callbacks and set the server
	name for the sysctl and loader tunable variables.

sys/nfsserver/nfs_srvkrpc.c:
	Configure the RPC code to call fhaold_assign() instead of
	fha_assign().

sys/modules/nfsd/Makefile:
	Add nfs_fha.c and nfs_fha_new.c.

sys/modules/nfsserver/Makefile:
	Add nfs_fha_old.c.

Reviewed by:	rmacklem
Sponsored by:	Spectra Logic
MFC after:	2 weeks
2013-04-17 21:00:22 +00:00
Rick Macklem
1f60bfd822 Move the NFSv4.1 client patches over from projects/nfsv4.1-client
to head. I don't think the NFS client behaviour will change unless
the new "minorversion=1" mount option is used. It includes basic
NFSv4.1 support plus support for pNFS using the Files Layout only.
All problems detecting during an NFSv4.1 Bakeathon testing event
in June 2012 have been resolved in this code and it has been tested
against the NFSv4.1 server available to me.
Although not reviewed, I believe that kib@ has looked at it.
2012-12-08 22:52:39 +00:00
Konstantin Belousov
5050aa86cf Remove the support for using non-mpsafe filesystem modules.
In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by:	attilio
Tested by:	pho
2012-10-22 17:50:54 +00:00
Rick Macklem
f4e2c07e73 Add a simple printf() based debug facility to the new nfs client.
Use it for a printf() that can be harmlessly generated for mmap()'d
files. It will be used extensively for the NFSv4.1 client.
Debugging printf()s are enabled by setting vfs.nfs.debuglevel to
a non-zero value. The higher the value, the more debugging printf()s.

Reviewed by:	jhb
MFC after:	2 weeks
2012-09-09 21:00:45 +00:00
Konstantin Belousov
843dcea09e The header uma_int.h is internal uma header, unused by this source
file.  Do not include it needlessly.

Reviewed by:  alc
MFC after:    1 week
2012-08-04 18:12:54 +00:00
Rick Macklem
13b2772f8e Delete a couple of out of date comments that are no longer true in
the new NFS client.

Requested by:	bde
MFC after:	1 week
2012-02-16 02:19:53 +00:00
Rick Macklem
a16cd9c05e jwd@ reported via email that the "CacheSize" field reported by "nfsstat -e -s"
would go negative after using the "-z" option to zero out the stats.
This patch fixes that by not zeroing out the srvcache_size field
for "-z", since it is the size of the cache and not a counter.

MFC after:	2 weeks
2012-01-11 02:46:42 +00:00
Zack Kirsch
061c683cc2 Revert revision 224079 as Rick pointed out that I would be calling VOP_PATHCONF
without the vnode lock held.

Implicitly approved by: zml (mentor)
2011-07-17 03:44:05 +00:00
Zack Kirsch
a9285ae5c4 Add DEXITCODE plumbing to NFS.
Isilon has the concept of an in-memory exit-code ring that saves the last exit
code of a function and allows for stack tracing. This is very helpful when
debugging tough issues.

This patch is essentially a no-op for BSD at this point, until we upstream
the dexitcode logic itself. The patch adds DEXITCODE calls to every NFS
function that returns an errno error code. A number of code paths were also
reorganized to have single exit paths, to reduce code duplication.

Submitted by:   David Kwan <dkwan@isilon.com>
Reviewed by:    rmacklem
Approved by:    zml (mentor)
MFC after:      2 weeks
2011-07-16 08:51:09 +00:00
Zack Kirsch
51c099f522 Change loadattr and fillattr to ask the file system for the pathconf variable.
Small modification where VOP_PATHCONF was being called directly.

Reviewed by:    rmacklem
Approved by:    zml (mentor)
MFC after:      2 weeks
2011-07-16 08:05:21 +00:00
Zack Kirsch
40435b74f4 Move nfsvno_pathconf to be accessible to sys/fs/nfs; no functionality change.
Reviewed by:    rmacklem
Approved by:    zml (mentor)
MFC after:      2 weeks
2011-07-16 08:05:17 +00:00
Rick Macklem
1f3765902c Change the sysctl naming for the old and new NFS clients
to vfs.oldnfs.xxx and vfs.nfs.xxx respectively. This makes
the default nfs client use vfs.nfs.xxx after r221124.
2011-05-15 20:52:43 +00:00
Rick Macklem
2e3b981a4d Add kernel support for NFSSVC_ZEROCLTSTATS and NFSSVC_ZEROSRVSTATS
so that they can be used by nfsstat(1) to implement the "-z" option
for the new NFS subsystem.

MFC after:	2 weeks
2011-05-04 13:36:18 +00:00
Rick Macklem
2b08b570cb Revert r221306, since NFSSVC_ZEROSTATS zero'd both client and
server stats, when separate modifiers for NFSSVC_GETSTATS for
each of client and server stats is what it required by nfsstat(1).
2011-05-04 13:30:38 +00:00
Rick Macklem
b2946fadcd Add the kernel support needed to zero out the nfsstats
structure for the new NFS subsystem. This will be used
by nfsstats.c to implement the "-z" option.

MFC after:	2 weeks
2011-05-01 22:19:52 +00:00
Rick Macklem
0a9f005dff Fix up some of the sysctls for the experimental NFS client so
that they use the same names as the regular client. Also add
string descriptions for them.

MFC after:	2 weeks
2011-04-17 18:56:17 +00:00
Rick Macklem
a09001a82b Fix the experimental NFSv4 server so that it uses VOP_PATHCONF()
to determine if a file system supports NFSv4 ACLs. Since
VOP_PATHCONF() must be called with a locked vnode, the function
is called before nfsvno_fillattr() and the result is passed in
as an extra argument.

MFC after:	2 weeks
2011-04-14 23:46:15 +00:00
Rick Macklem
66c0f45a3d For the experimental NFSv4 server's dumplocks operation, add the
MPSAFE flag to cn_flags so that it doesn't panic. The panics weren't
seen since nfsdumpstate(8) is broken for the "-l" case, so this
was never done. I'll do a separate commit to fix nfsdumpstate(8).

Submitted by:	zack.kirsch at isilon.com
MFC after:	2 weeks
2010-07-19 23:33:42 +00:00
Rick Macklem
23f929dfe8 An NFSv4 server will reply NFSERR_GRACE for non-recovery RPCs
during the grace period after startup. This grace period must
be at least the lease duration, which is typically 1-2 minutes.
It seems prudent for the experimental NFS client to wait a few
seconds before retrying such an RPC, so that the server isn't
flooded with non-recovery RPCs during recovery. This patch adds
an argument to nfs_catnap() to implement a 5 second delay
for this case.

MFC after:	1 week
2010-04-24 22:52:14 +00:00
Rick Macklem
55909abf07 The experimental NFS client was not filling in recovery credentials
for opens done locally in the client when a delegation for the file
was held. This could cause the client to crash in crsetgroups() when
recovering from a server crash/reboot. This patch fills in the
recovery credentials for this case, in order to avoid the client crash.
Also, add KASSERT()s to the credential copy functions, to catch any
other cases where the credentials aren't filled in correctly.

MFC after:	1 week
2010-04-15 22:57:30 +00:00
Rick Macklem
a43fcbe34d This patch should fix handling of byte range locks locally
on the server for the experimental nfs server. When enabled
by setting vfs.newnfs.locallocks_enable to non-zero, the
experimental nfs server will now acquire byte range locks
on the file on behalf of NFSv4 clients, such that lock
conflicts between the NFSv4 clients and processes running
locally on the server, will be recognized and handled correctly.

MFC after:	2 weeks
2010-03-30 23:11:50 +00:00
Rick Macklem
3dfe81c650 Fix the experimental NFS subsystem so that it uses the correct
preprocessor macro name for not requiring strict data alignment.

Suggested by:	marius
MFC after:	2 weeks
2010-03-24 02:02:02 +00:00
Edward Tomasz Napierala
74991298d9 Remove unneeded ifdefs.
Reviewed by:	rmacklem
2009-12-03 18:03:42 +00:00
Brooks Davis
838d985825 Rework the credential code to support larger values of NGROUPS and
NGROUPS_MAX, eliminate ABI dependencies on them, and raise the to 1024
and 1023 respectively.  (Previously they were equal, but under a close
reading of POSIX, NGROUPS_MAX was defined to be too large by 1 since it
is the number of supplemental groups, not total number of groups.)

The bulk of the change consists of converting the struct ucred member
cr_groups from a static array to a pointer.  Do the equivalent in
kinfo_proc.

Introduce new interfaces crcopysafe() and crsetgroups() for duplicating
a process credential before modifying it and for setting group lists
respectively.  Both interfaces take care for the details of allocating
groups array. crsetgroups() takes care of truncating the group list
to the current maximum (NGROUPS) if necessary.  In the future,
crsetgroups() may be responsible for insuring invariants such as sorting
the supplemental groups to allow groupmember() to be implemented as a
binary search.

Because we can not change struct xucred without breaking application
ABIs, we leave it alone and introduce a new XU_NGROUPS value which is
always 16 and is to be used or NGRPS as appropriate for things such as
NFS which need to use no more than 16 groups.  When feasible, truncate
the group list rather than generating an error.

Minor changes:
  - Reduce the number of hand rolled versions of groupmember().
  - Do not assign to both cr_gid and cr_groups[0].
  - Modify ipfw to cache ucreds instead of part of their contents since
    they are immutable once referenced by more than one entity.

Submitted by:	Isilon Systems (initial implementation)
X-MFC after:	never
PR:		bin/113398 kern/133867
2009-06-19 17:10:35 +00:00
Rick Macklem
ee1c1433e3 Add a function to the experimental nfs subsystem that tests to see
if a local file system supports NFSv4 ACLs. This allows the
NFSHASNFS4ACL() macro to be correctly implemented. The NFSv4 ACL
support should now work when the server exports a ZFS volume.

Approved by:	kib (mentor)
2009-05-27 15:16:56 +00:00
Rick Macklem
05c965a221 Crib the realign function out of nfs_krpc.c and add a call
to it for the client side reply. Hopefully this fixes the
problem with using the new krpc for arm for the experimental
nfs client.

Approved by:	kib (mentor)
2009-05-24 19:46:12 +00:00
Rick Macklem
9ec7b004d0 Add the experimental nfs subtree to the kernel, that includes
support for NFSv4 as well as NFSv2 and 3.
	It lives in 3 subdirs under sys/fs:
	nfs - functions that are common to the client and server
	nfsclient - a mutation of sys/nfsclient that call generic functions
	to do RPCs and handle state. As such, it retains the
	buffer cache handling characteristics and vnode semantics that
	are found in sys/nfsclient, for the most part.
	nfsserver - the server. It includes a DRC designed specifically for
	NFSv4, that is used instead of the generic DRC in sys/rpc.
	The build glue will be checked in later, so at this point, it
	consists of 3 new subdirs that should not affect kernel building.

Approved by:	kib (mentor)
2009-05-04 15:23:58 +00:00