Commit Graph

197 Commits

Author SHA1 Message Date
Andriy Gapon
90f90687b3 add svcpool_close to handle killed nfsd threads
This patch adds a new function to the server krpc called
svcpool_close().  It is similar to svcpool_destroy(), but does not free
the data structures, so that the pool can be used again.

This function is then used instead of svcpool_destroy(),
svcpool_create() when the nfsd threads are killed.

PR:		204340
Reported by:	Panzura
Approved by:	rmacklem
Obtained from:	rmacklem
MFC after:	1 week
2017-02-14 17:49:08 +00:00
Baptiste Daroussin
b4b4b5304b Revert crap accidentally committed 2017-01-28 16:31:23 +00:00
Baptiste Daroussin
814aaaa7da Revert r312923 a better approach will be taken later 2017-01-28 16:30:14 +00:00
Konstantin Belousov
2f304845e2 Do not allocate struct statfs on kernel stack.
Right now size of the structure is 472 bytes on amd64, which is
already large and stack allocations are indesirable.  With the ino64
work, MNAMELEN is increased to 1024, which will make it impossible to have
struct statfs on the stack.

Extracted from:	ino64 work by gleb
Discussed with:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-01-05 17:19:26 +00:00
Josh Paetzel
b5a8f340f1 Workaround NFS bug with readdirplus when there are greater than 1 billion files in a filesystem.
Reviewed by	kib
MFC after:	2 weeks
Sponsored by:	iXsystems
Differential Revision:	D9009
2017-01-02 19:18:56 +00:00
Rick Macklem
a5d19b81b4 Fix the NFSv4.1 server for Open reclaim after a reboot.
The NFSv4.1 server failed to update the nfs-stablerestart file for
a client when the client was issued its first Open. As such, recovery
of Opens after a server reboot failed with NFSERR_NOGRACE.
This patch fixes this.
It also changes the code so that it malloc()'s the 1024 byte array
instead of allocating it on the kernel stack for both NFSv4.0 and NFSv4.1.
Note that this bug only affected NFSv4.1 and only when clients attempted
to reclaim Opens after a server reboot.

MFC after:	2 weeks
2016-12-05 22:36:25 +00:00
Konstantin Belousov
7359fdcf5f Allow some dotdot lookups in capability mode.
If dotdot lookup does not escape from the file descriptor passed as
the lookup root, we can allow the component traversal.  Track the
directories traversed, and check the result of dotdot lookup against
the recorded list of the directory vnodes.

Dotdot lookups are enabled by sysctl vfs.lookup_cap_dotdot, currently
disabled by default until more verification of the approach is done.

Disallow non-local filesystems for dotdot, since remote server might
conspire with the local process to allow it to escape the namespace.
This might be too cautious, provide the knob
vfs.lookup_cap_dotdot_nonlocal to override as well.

Idea by:	rwatson
Discussed with:	emaste, jonathan, rwatson
Reviewed by:	mjg (previous version)
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 week
Differential revision:	https://reviews.freebsd.org/D8110
2016-11-02 12:43:15 +00:00
Rick Macklem
dcb19c3886 A problem w.r.t. interoperation between the FreeBSD NFSv4.1 server with
delegations enabled and the Linux NFSv4.1 client was reported in
reviews.freebsd.org/D7891.
I believe that the FreeBSD server behaviour conforms to the RFC and that
the Linux client has a bug. Therefore, I do not think the proposed patch
is appropriate. When nfsrv_writedelegifpos is non-zero, the FreeBSD
server will issue a write delegation for a read open if possible.
The Linux client then erroneously assumes that the credentials used for
the read open can write the file.
This patch reverses the default value for nfsrv_writedelegifpos to 0 so
that the default behaviour is Linux compatible and adds a sysctl that can
be used to set nfsrv_writedelegifpos.

This change should only affect users that are mounting a FreeBSD server
with delegations enabled (they are not enabled by default) with a Linux
NFSv4.1 client mount.

Reported by:	fatih.acar@gandi.net
Tested by:	fatih.acar@gandi.net
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D7891
2016-10-20 23:53:16 +00:00
Rick Macklem
1b819cf265 Update the nfsstats structure to include the changes needed by
the patch in D1626 plus changes so that it includes counts for
NFSv4.1 (and the draft of NFSv4.2).
Also, make all the counts uint64_t and add a vers field at the
beginning, so that future revisions can easily be implemented.
There is code in place to handle the old vesion of the nfsstats
structure for backwards binary compatibility.

Subsequent commits will update nfsstat(8) to use the new fields.

Submitted by:	will (earlier version)
Reviewed by:	ken
MFC after:	1 month
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D1626
2016-08-12 22:44:59 +00:00
Conrad Meyer
5ecc225fc5 nfsd: Fix use-after-free in NFS4 lock test service
Trivial use-after-free where stp was freed too soon in the non-error path.
To fix, simply move its release to the end of the routine.

Reported by:	Coverity
CID:		1006105
Sponsored by:	EMC / Isilon Storage Division
2016-05-12 05:03:12 +00:00
Rick Macklem
de2413b95e Don't increment srvrpccnt[] for the NFSv4.1 operations.
When support for NFSv4.1 was added to the NFS server, it broke
the server rpc count stats, since newnfsstats.srvrpccnt[] doesn't
have entries for the new NFSv4.1 operations.
Without this patch, the code was incrementing bogus entries in
newnfsstats for the new NFSv4.1 operations.
This patch is an interim fix. The nfsstats structure needs to be
updated and that will come in a future commit.

Reported by:	cem
MFC after:	2 weeks
2016-05-07 22:45:08 +00:00
Pedro F. Giffuni
ee58b56452 nfsserver: minor spelling fix in comment.
No functional change.
2016-05-06 23:40:37 +00:00
Rick Macklem
8eabbbe24b Give mountd -S priority over outstanding RPC requests when suspending the nfsd.
It was reported via email that under certain heavy RPC loads
long delays before the exports would be updated was observed
when using "mountd -S". This patch reverses the priority between
the exclusive lock request to suspend the nfsd threads and the
shared lock request for performing RPCs.
As such, when mountd attempts to suspend the nfsd threads, it
gets priority over outstanding RPC requests to do this.
I suspect that the case reported was an artificial test load,
but this patch did fix the problem for the reporter.

Reported and Tested by:	josephlai@qnap.com
MFC after:	2 weeks
2016-05-06 23:26:17 +00:00
Pedro F. Giffuni
a96c9b30e2 NFS: spelling fixes on comments.
No funcional change.
2016-04-29 16:07:25 +00:00
Rick Macklem
ae03cbd7f3 Allow the NFSv4 server to reply NFSERR_WRONGSEC for the SetClientID operation.
It was reported via email that a Linux client couldn't do a Kerberized
NFS mount when only "sec=krb5" was specified for the exports. The Linux
client attempted a mount via krb5i and the server replied NFSERR_SERVERFAULT.
Although NFSERR_WRONGSEC isn't listed as an error for SetClientID, I
think it is the correct reply, so this patch enables that.
I do not know if this fixes the mount attempt, but adding "krb5i" to the
list of allowed security flavours does allow the mount to work.

Reported by:	joef@spectralogic.com
MFC after:	2 weeks
2016-04-23 21:18:45 +00:00
Rick Macklem
0533d72612 Fix a LOR in the NFSv4.1 server.
The ordering of acquisition of the state and session mutexes was
reversed in two cases executed when an NFSv4.1 client created/freed
a session. Since clients will typically do this only when mounting
and dismounting, the likelyhood of causing a deadlock was low but possible.
This can only occur for NFSv4.1 mounts, since the others do not
use sessions.
This was detected while testing the pNFS server/client where the
client crashed during dismounting.
The patch also reorders the unlocks, although that isn't necessary
for correct operation.

MFC after:	2 weeks
2016-04-23 01:22:04 +00:00
Rick Macklem
13c581fc54 If the VOP_SETATTR() call that saves the exclusive create verifier failed,
the NFS server would leave the newly created vnode locked. This could
result in a file system that would not unmount and processes wedged,
waiting for the file to be unlocked.
Since this VOP_SETATTR() never fails for most file systems, this bug
doesn't normally manifest itself. I found it during testing of an
exported GlusterFS file system, which can fail.
This patch adds the vput() and changes the error to the correct NFS one.

MFC after:	2 weeks
2016-04-12 20:23:09 +00:00
Pedro F. Giffuni
74b8d63dcc Cleanup unnecessary semicolons from the kernel.
Found with devel/coccinelle.
2016-04-10 23:07:00 +00:00
Rick Macklem
84be7e0952 Add kernel support to the NFS server for the "-manage-gids"
option that will be added to the nfsuserd daemon in a future
commit. It modifies the cache used by NFSv4 for name<-->id
translation (both username/uid and group/gid) to support this.
When "-manage-gids" is set, the server looks up each uid
for the RPC and uses the list of groups cached in the server
instead of the list of groups provided in the RPC request.
The cached group list is acquired for the cache by the nfsuserd
daemon via getgrouplist(3).
This avoids the 16 groups limit for the list in the RPC request.
Since the cache is now used for every RPC when "-manage-gids"
is enabled, the code also modifies the cache to use a separate
mutex for each hash list instead of a single global mutex.

Suggested by:	jpaetzel
Tested by:	jpaetzel
MFC after:	2 weeks
2015-11-30 21:54:27 +00:00
Rick Macklem
a0962bf8bc When the nfsd threads are terminated, the NFSv4 server state
(opens, locks, etc) is retained, which I believe is correct behaviour.
However, for NFSv4.1, the server also retained a reference to the xprt
(RPC transport socket structure) for the backchannel. This caused
svcpool_destroy() to not call SVC_DESTROY() for the xprt and allowed
a socket upcall to occur after the mutexes in the svcpool were destroyed,
causing a crash.
This patch fixes the code so that the backchannel xprt structure is
dereferenced just before svcpool_destroy() is called, so the code
does do an SVC_DESTROY() on the xprt, which shuts down the socket upcall.

Tested by:	g_amanakis@yahoo.com
PR:		204340
MFC after:	2 weeks
2015-11-21 23:55:46 +00:00
Rick Macklem
29dc40b6be For the case where an NFSv4.1 ExchangeID operation has the client identifier
that already has a confirmed ClientID, the nfsrv_setclient() function would
not fill in the clientidp being returned. As such, the value of ClientID
returned would be whatever garbage was on the stack.
An NFSv4.1 client would not normally do this, but it appears that it can
happen for certain Linux clients. When it happens, the client persistently
retries the ExchangeID and Create_session after Create_session fails when
it uses the bogus clientid. With this patch, the correct clientid is replied.
This problem was identified in a packet trace supplied by
Ahmed Kamal via email.

Reported by:	email.ahmedkamal@googlemail.com
MFC after:	2 weeks
2015-08-14 22:02:14 +00:00
Rick Macklem
25f37276e5 This patch fixes a problem where, if the NFSv4 server has a previous
unconfirmed clientid structure for the same client on the last hash list,
this old entry would not be removed/deleted. I do not think this bug would have
caused serious problems, since the new entry would have been before the old one
on the list. This old entry would have eventually been scavenged/removed.
Detected while reading the code looking for another bug.

MFC after:	3 days
2015-07-29 23:06:30 +00:00
Rick Macklem
0c419e226c Make the NFS server use shared vnode locks for a few cases
that are allowed by the VFS/VOP interface instead of using
exclusive locks.

MFC after:	2 weeks
2015-05-29 20:22:53 +00:00
Rick Macklem
1f54e596ad Make the size of the hash tables used by the NFSv4 server tunable.
No appreciable change in performance was observed after increasing
the sizes of these tables and then testing with a single client.
However, there was an email that indicated high CPU overheads for
a heavily loaded NFSv4 and it is hoped that increasing the sizes
of the hash tables via these tunables might help.
The tables remain the same size by default.

Differential Revision:	https://reviews.freebsd.org/D2596
MFC after:	2 weeks
2015-05-27 22:00:05 +00:00
Konstantin Belousov
1bc93bb7b9 Currently, softupdate code detects overstepping on the workitems
limits in the code which is deep in the call stack, and owns several
critical system resources, like vnode locks.  Attempt to wait while
the per-mount softupdate thread cleans up the backlog may deadlock,
because the thread might need to lock the same vnode which is owned by
the waiting thread.

Instead of synchronously waiting for the worker, perform the worker'
tickle and pause until the backlog is cleaned, at the safe point
during return from kernel to usermode.  A new ast request to call
softdep_ast_cleanup() is created, the SU code now only checks the size
of queue and schedules ast.

There is no ast delivery for the kernel threads, so they are exempted
from the mechanism, except NFS daemon threads.  NFS server loop
explicitely checks for the request, and informs the schedule_cleanup()
that it is capable of handling the requests by the process P2_AST_SU
flag.  This is needed because nfsd may be the sole cause of the SU
workqueue overflow.  But, to not cause nsfd to spawn additional
threads just because we slow down existing workers, only tickle su
threads, without waiting for the backlog cleanup.

Reviewed by:	jhb, mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-05-27 09:20:42 +00:00
Rick Macklem
2fbb9d563b Fix the NFS server's handling of a bogus NFSv2 ROOT RPC.
The ROOT RPC is deprecated in the NFSv2 RFC, RFC-1094
and should never be used by a client.

Tested by:	thmu@freenet.de
MFC after:	1 week
2015-04-25 00:58:24 +00:00
Edward Tomasz Napierala
50a220c699 Replace "new NFS" with just "NFS" in some sysctl description strings.
Sponsored by:	The FreeBSD Foundation
2015-04-19 06:18:41 +00:00
Rick Macklem
66e80f77d2 mav@ has found that NFS servers exporting ZFS file systems
can perform better when using a 128K read/write data size.
This patch changes NFS_MAXDATA from 64K to 128K so that
clients can use 128K for NFS mounts to allow this.
The patch also renames NFS_MAXDATA to NFS_SRVMAXIO so
that it is clear that it applies to the NFS server side
only. It also avoids a name conflict with the NFS_MAXDATA
defined in rpcsvc/nfs_prot.h, that is used for userland RPC.

Tested by:	mav
Reviewed by:	mav
MFC after:	2 weeks
2015-04-16 22:35:15 +00:00
Rick Macklem
dda11d4ab9 File systems that do not use the buffer cache (such as ZFS) must
use VOP_FSYNC() to perform the NFS server's Commit operation.
This patch adds a mnt_kern_flag called MNTK_USES_BCACHE which
is set by file systems that use the buffer cache. If this flag
is not set, the NFS server always does a VOP_FSYNC().
This should be ok for old file system modules that do not set
MNTK_USES_BCACHE, since calling VOP_FSYNC() is correct, although
it might not be optimal for file systems that use the buffer cache.

Reviewed by:	kib
MFC after:	2 weeks
2015-04-15 20:16:31 +00:00
Robert Watson
eae6da3db4 Use M_SIZE() instead of hand-crafted (and mostly correct) NFSMSIZ() macro
in the NFS server; garbage collect now-unused NFSMSIZ() and M_HASCL()
macros.  Also garbage collect now-unused versions in headers for the
removed previous NFS client and server.

Reviewed by:	rmacklem
Sponsored by:	EMC / Isilon Storage Division
2015-01-07 17:22:56 +00:00
Rick Macklem
52f1bb38c2 A deadlock in the NFSv4 server with vfs.nfsd.enable_locallocks=1
was reported via email. This was caused by a LOR between the
sleep lock used to serialize the local locking (nfsrv_locklf())
and locking the vnode. I believe this patch fixes the problem
by delaying relocking of the vnode until the sleep lock is
unlocked (nfsrv_unlocklf()). To avoid nfsvno_advlock() having the side
effect of unlocking the vnode, unlocking the vnode was moved to before
the functions that call nfsvno_advlock().
It shouldn't affect the execution of the default case where
vfs.nfsd.enable_locallocks=0.

Reported by:	loic.blot@unix-experience.fr
Discussed with:	kib
MFC after:	1 week
2014-12-25 01:55:17 +00:00
Konstantin Belousov
6c21f6edb8 The VOP_LOOKUP() implementations for CREATE op do not put the name
into namecache, to avoid cache trashing when doing large operations.
E.g., tar archive extraction is not usually followed by access to many
of the files created.

Right now, each VOP_LOOKUP() implementation explicitely knowns about
this quirk and tests for both MAKEENTRY flag presence and op != CREATE
to make the call to cache_enter().  Centralize the handling of the
quirk into VFS, by deciding to cache only by MAKEENTRY flag in VOP.
VFS now sets NOCACHE flag for CREATE namei() calls.

Note that the change in semantic is backward-compatible and could be
merged to the stable branch, and is compatible with non-changed
third-party filesystems which correctly handle MAKEENTRY.

Suggested by:	Chris Torek <torek@pi-coral.com>
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-12-18 10:01:12 +00:00
Konstantin Belousov
42ecb595f2 Allow the vfs.nfsd knobs to be set from loader.conf (or using
kenv(8)).  This is useful when nfsd is loaded as module.

Reviewed by:	rmacklem
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-10-27 07:47:13 +00:00
Marcelo Araujo
f9246664f5 Make the sysctl(8) for checkutf8 positively defined and improve
the description of it.

Submitted by:	Ronald Klop <ronald-lists@klop.ws>
Reviewed by:	rmacklem
Approved by:	rmacklem
Sponsored by:	QNAP Systems Inc.
2014-10-17 02:11:09 +00:00
Marcelo Araujo
3dd6b7ff3d Add two sysctl(8) to enable/disable NFSv4 server to check when setting
user nobody and/or setting group nogroup as owner of a file or directory.
Usually at the client side, if there is an username that is not in the
client's passwd database, some clients will send 'nobody@<your.dns.domain>'
in the wire and the NFSv4 server will treat it as an ERROR.
However, if you have a valid user nobody in your passwd database,
the NFSv4 server will treat it as a NFSERR_BADOWNER as its believes the
client doesn't has the username mapped.

Submitted by:	Loic Blot <loic.blot@unix-experience.fr>
Reviewed by:	rmacklem
Approved by:	rmacklem
MFC after:	2 weeks
2014-10-16 02:24:19 +00:00
Marcelo Araujo
d8a5961f88 Fix failures and warnings reported by newpynfs20090424 test tool.
This fix addresses only issues with the pynfs reports, none of these
issues are know to create problems for extant real clients.

Submitted by:	Bart Hsiao <bart.hsiao@gmail.com>
Reworked by:	myself
Reviewed by:	rmacklem
Approved by:	rmacklem
Sponsored by:	QNAP Systems Inc.
2014-10-03 02:24:41 +00:00
Rick Macklem
03738f6076 Change the NFS server's printf related to hitting
the DRC cache's flood level so that it suggests
increasing vfs.nfsd.tcphighwater.

Suggested by:	h.schmalzbauer@omnilan.de
2014-08-10 01:13:32 +00:00
Konstantin Belousov
e7375b6fa5 Do not generate 1000 unique lock names for nfsrc hash chain locks.
It overflows witness.

Shorten the names of some nfs mutexes.

Reported and tested by:	pho
No objections from:	rmacklem, mav
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-07-31 19:24:44 +00:00
Rick Macklem
6c7d2293d3 The new NFSv3 server did not generate directory postop attributes for
the reply to ReaddirPlus when the server failed within the loop
that calls VFS_VGET(). This failure is most likely an error
return from VFS_VGET() caused by a bogus d_fileno that was
truncated to 32bits.
This patch fixes the server so that it will return directory postop
attributes for the failure. It does not fix the underlying issue caused
by d_fileno being uint32_t when a file system like ZFS generates
a fileno that is greater than 32bits.

Reported by:	jpaetzel
Reviewed by:	jpaetzel
MFC after:	1 month
2014-07-04 22:47:07 +00:00
Rick Macklem
c59e4cc34d Merge the NFSv4.1 server code in projects/nfsv4.1-server over
into head. The code is not believed to have any effect
on the semantics of non-NFSv4.1 server behaviour.
It is a rather large merge, but I am hoping that there will
not be any regressions for the NFS server.

MFC after:	1 month
2014-07-01 20:47:16 +00:00
Bryan Drewery
4fc0f18c20 Change NFS readdir() to only ignore cookies preceding the given offset for
UFS rather than for all but ZFS.  This code was assuming that offsets were
monotonically increasing for all file systems except ZFS and that the
cookies from a previous call may have been rewound to a block boundary.
According to mckusick@ only UFS is known to do this, so only requests against
UFS file systems should remove cookies smaller than the given offset.  This
fixes serving TMPFS over NFS as it too does not have monotonically increasing
offsets.  The comment around the code also indicated it was specific to UFS.

Some of the code using 'not_zfs' is specific to ZFS snapshot handling, so
add a 'is_zfs' variable for those cases.

It's possible that 'is_zfs' check for VFS_VGET() support may not be
specific to ZFS.  This needs more research and testing.

After this fix TMPFS and other file systems can be served over NFS.

To test I compared the results of syncing a /usr/src tree into a tmpfs and
serving that over NFS.  Before the fix 3589 files were missing on the remote
view.  After the fix all files were successfully found.

Reviewed by:	rmacklem
Discussed with:	mckusick, rmacklem via fs@
Discussed at:	http://lists.freebsd.org/pipermail/freebsd-fs/2014-April/019264.html
MFC after:	2 weeks
Sponsored by:	EMC / Isilon Storage Division
2014-07-01 20:00:35 +00:00
Rick Macklem
ca4defd583 The new NFS server would not allow a hard link to be
created to a symlink. This restriction (which was
inherited from OpenBSD) is not required by the NFS RFCs.
Since this is allowed by the old NFS server, it is a
POLA violation to not allow it. This patch modifies the
new NFS server to allow this.

Reported by:	jhb
Reviewed by:	jhb
MFC after:	3 days
2014-06-06 21:38:49 +00:00
Rick Macklem
ca20bd924f The new draft specification for NFSv4.0 specifies that a server
should either accept owner and owner_group strings that are just
the digits of the uid/gid or return NFS4ERR_BADOWNER.
This patch adds a sysctl vfs.nfsd.enable_stringtouid, which can
be set to enable the server w.r.t. accepting numeric string. It
also ensures that NFS4ERR_BADOWNER is returned if numeric uid/gid
strings are not enabled. This fixes the server for recent Linux
nfs4 clients that use numeric uid/gid strings by default.

Reported and tested by:	craigyk@gmail.com
MFC after:	2 weeks
2014-05-03 00:13:45 +00:00
Rick Macklem
3c53f923dc The PR reported that the old NFS server did not set uio_td == NULL
for the VOP_READ() call. This patch fixes both the old and new
server for this case.

PR:		185232
Submitted by:	PR had patch for old server
Reviewed by:	kib
MFC after:	2 weeks
2014-04-24 20:47:58 +00:00
Rick Macklem
ab7f24103e Remove an unnecessary level of indirection for an argument.
This simplifies the code and should avoid the clang sparc
port from generating an abort() call.

Requested by:	rdivacky
Submitted by:	jhb
MFC after:	2 weeks
2014-04-23 23:13:46 +00:00
Xin LI
25bfde79d6 Fix NFS deadlock vulnerability. [SA-14:05]
Fix "Heartbleed" vulnerability and ECDSA Cache Side-channel
Attack in OpenSSL. [SA-14:06]
2014-04-08 18:27:32 +00:00
Robert Watson
4a14441044 Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after:	3 weeks
2014-03-16 10:55:57 +00:00
Alexander Motin
6103bae6ae Fix lock leak in purely hypothetical case of TCP connection without SVC_ACK
method.  This change should be NOP now, but it is better to be future safe.

Reported by:	rmacklem
2014-01-14 20:18:38 +00:00
Alexander Motin
45e18ea7ea Fix off-by-one error in r260229.
Coverity CID:	1148955
2014-01-07 11:43:51 +00:00
Alexander Motin
d473bac729 Rework NFS Duplicate Request Cache cleanup logic.
- Introduce additional hash to group requests by hash of sockref.  This
allows to process TCP acknowledgements without looping though all the cache,
and as result allows to do it every time.
 - Indroduce additional callbacks to notify application layer about sockets
disconnection.  Without this last few requests processed just before socket
disconnection never processed their ACKs and stuck in cache for many hours.
 - Implement transport-specific method for tracking reply acknowledgements.
New implementation does not cross multiple stack layers to get the data and
does not have race conditions that previously made some requests stuck
in cache.  This could be done more efficiently at sockbuf layer, but that
would broke some KBIs, while I don't know other consumers for it aside NFS.
 - Instead of traversing all DRC twice per request, run cleaning only once
per request, and except in some conditions traverse only single hash slot
at a time.

Together this limits NFS DRC growth only to situations of real connectivity
problems.  If network is working well, and so all replies are acknowledged,
cache remains almost empty even after hours of heavy load.  Without this
change on the same test cache was growing to many thousand requests even
with perfectly working local network.

As another result this reduces CPU time spent on the DRC handling during
SPEC NFS benchmark from about 10% to 0.5%.

Sponsored by:	iXsystems, Inc.
2014-01-03 15:09:59 +00:00