Commit Graph

187 Commits

Author SHA1 Message Date
Rick Macklem
1a59bccc42 Set SO_SNDTIMEO in the client side krpc when CLSET_TIMEOUT is done.
During testing of the pNFS client, it was observed that an RPC could get
stuck in sosend() for a very long time if the network connection to a DS
had failed. This is fixed by setting SO_SNDTIMEO on the TCP socket.
This is only done when CLSET_TIMEOUT is done and this is not done by any
use of the krpc currently in the source tree, so there should be no effect
on extant uses.
A future patch will use CLSET_TIMEOUT for TCP connections to DSs.

Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D16293
2018-07-20 12:03:16 +00:00
Rick Macklem
1b09d9df3d Fix the server side krpc so that the kernel nfsd threads terminate.
Occationally the kernel nfsd threads would not terminate when a SIGKILL
was posted for the kernel process (called nfsd (slave)). When this occurred,
the thread associated with the process (called "ismaster") had returned from
svc_run_internal() and was sleeping waiting for the other threads to terminate.
The other threads (created by kthread_start()) were still in svc_run_internal()
handling NFS RPCs.
The only way this could occur is for the "ismaster" thread to return from
svc_run_internal() without having called svc_exit().
There was only one place in the code where this could happen and this patch
stops that from happening.
Since the problem is intermittent, I cannot be sure if this has fixed the
problem, but I have not seen an occurrence of the problem with this patch
applied.

Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D16087
2018-07-02 17:50:46 +00:00
Alexander Kabaev
151ba7933a Do pass removing some write-only variables from the kernel.
This reduces noise when kernel is compiled by newer GCC versions,
such as one used by external toolchain ports.

Reviewed by: kib, andrew(sys/arm and sys/arm64), emaste(partial), erj(partial)
Reviewed by: jhb (sys/dev/pci/* sys/kern/vfs_aio.c and sys/kern/kern_synch.c)
Differential Revision: https://reviews.freebsd.org/D10385
2017-12-25 04:48:39 +00:00
Pedro F. Giffuni
fe267a5590 sys: general adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

No functional change intended.
2017-11-27 15:23:17 +00:00
Pedro F. Giffuni
51369649b0 sys: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
2017-11-20 19:43:44 +00:00
Gleb Smirnoff
779f106aa1 Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
  fields that belong to normal dataflow, and unionize them.  This
  shrinks the structure a bit.
  - Take out selinfo's from the socket buffers into the socket. The
    first reason is to support braindamaged scenario when a socket is
    added to kevent(2) and then listen(2) is cast on it. The second
    reason is that there is future plan to make socket buffers pluggable,
    so that for a dataflow socket a socket buffer can be changed, and
    in this case we also want to keep same selinfos through the lifetime
    of a socket.
  - Remove struct struct so_accf. Since now listening stuff no longer
    affects struct socket size, just move its fields into listening part
    of the union.
  - Provide sol_upcall field and enforce that so_upcall_set() may be called
    only on a dataflow socket, which has buffers, and for listening sockets
    provide solisten_upcall_set().

o Remove ACCEPT_LOCK() global.
  - Add a mutex to socket, to be used instead of socket buffer lock to lock
    fields of struct socket that don't belong to a socket buffer.
  - Allow to acquire two socket locks, but the first one must belong to a
    listening socket.
  - Make soref()/sorele() to use atomic(9).  This allows in some situations
    to do soref() without owning socket lock.  There is place for improvement
    here, it is possible to make sorele() also to lock optionally.
  - Most protocols aren't touched by this change, except UNIX local sockets.
    See below for more information.

o Reduce copy-and-paste in kernel modules that accept connections from
  listening sockets: provide function solisten_dequeue(), and use it in
  the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
  infiniband, rpc.

o UNIX local sockets.
  - Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
    local sockets.  Most races exist around spawning a new socket, when we
    are connecting to a local listening socket.  To cover them, we need to
    hold locks on both PCBs when spawning a third one.  This means holding
    them across sonewconn().  This creates a LOR between pcb locks and
    unp_list_lock.
  - To fix the new LOR, abandon the global unp_list_lock in favor of global
    unp_link_lock.  Indeed, separating these two locks didn't provide us any
    extra parralelism in the UNIX sockets.
  - Now call into uipc_attach() may happen with unp_link_lock hold if, we
    are accepting, or without unp_link_lock in case if we are just creating
    a socket.
  - Another problem in UNIX sockets is that uipc_close() basicly did nothing
    for a listening socket.  The vnode remained opened for connections.  This
    is fixed by removing vnode in uipc_close().  Maybe the right way would be
    to do it for all sockets (not only listening), simply move the vnode
    teardown from uipc_detach() to uipc_close()?

Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
Xin LI
6448ec89e7 * limit size of buffers to RPC_MAXDATASIZE
* don't leak memory
 * be more picky about bad parameters

From:

https://raw.githubusercontent.com/guidovranken/rpcbomb/master/libtirpc_patch.txt
https://github.com/guidovranken/rpcbomb/blob/master/rpcbind_patch.txt

via NetBSD.

Reviewed by:	emaste, cem (earlier version)
Differential Revision:	https://reviews.freebsd.org/D10922
MFC after:	3 days
2017-06-01 06:12:25 +00:00
Ed Maste
3e85b721d6 Remove register keyword from sys/ and ANSIfy prototypes
A long long time ago the register keyword told the compiler to store
the corresponding variable in a CPU register, but it is not relevant
for any compiler used in the FreeBSD world today.

ANSIfy related prototypes while here.

Reviewed by:	cem, jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D10193
2017-05-17 00:34:34 +00:00
Rick Macklem
dfd174d6e0 Fix the client side krpc from doing TCP reconnects for ERESTART from sosend().
When sosend() replies ERESTART in the client side krpc, it indicates that
the RPC message hasn't yet been sent and that the send queue is full or
locked while a signal is posted for the process.
Without this patch, this would result in a RPC_CANTSEND reply from
clnt_vc_call(), which would cause clnt_reconnect_call() to create a new
TCP transport connection. For most NFS servers, this wasn't a serious problem,
although it did imply retries of outstanding RPCs, which could possibly
have missed the DRC.
For an NFSv4.1 mount to AmazonEFS, this caused a serious problem, since
AmazonEFS often didn't retain the NFSv4.1 session and would reply with
NFS4ERR_BAD_SESSION. This implies to the client a crash/reboot which
requires open/lock state recovery.

Three options were considered to fix this:
- Return the ERESTART all the way up to the system call boundary and then
  have the system call redone. This is fraught with risk, due to convoluted
  code paths, asynchronous I/O RPCs etc. cperciva@ worked on this, but it
  is still a work in prgress and may not be feasible.
- Set SB_NOINTR for the socket buffer. This fixes the problem, but makes
  the sosend() completely non interruptible, which kib@ considered
  inappropriate. It also would break forced dismount when a thread
  was blocked in sosend().
- Modify the retry loop in clnt_vc_call(), so that it loops for this case
  for up to 15sec. Testing showed that the sosend() usually succeeded by
  the 2nd retry. The extreme case observed was 111 loop iterations, or
  about 100msec of delay.
This third alternative is what is implemented in this patch, since the
change is:
- localized
- straightforward
- forced dismount is not broken by it.

This patch has been tested by cperciva@ extensively against AmazonEFS.

Reported by:	cperciva
Tested by:	cperciva
MFC after:	2 weeks
2017-05-07 12:12:45 +00:00
Rick Macklem
34f1fddb1e Fix a crash during unmount of an NFSv4.1 mount.
Larry Rosenman reported a crash on freebsd-current@ which was caused by
a premature release of the krpc backchannel socket structure.
I believe this was caused by a race between the SVC_RELEASE() in clnt_vc.c
and the xprt_unregister() in the higher layer (clnt_rc.c), which tried
to lock the mutex in the xprt structure and crashed.
This patch fixes this by removing the xprt_unregister() in the clnt_vc
layer and allowing this to always be done by the clnt_rc (higher reconnect
layer).

Reported by:	ler@lerctr.org
Tested by:	ler@letctr.org
MFC after:	2 weeks
2017-04-10 22:47:18 +00:00
Warner Losh
fbbd9655e5 Renumber copyright clause 4
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by:	Jan Schaumann <jschauma@stevens.edu>
Pull Request:	https://github.com/freebsd/freebsd/pull/96
2017-02-28 23:42:47 +00:00
Andriy Gapon
90f90687b3 add svcpool_close to handle killed nfsd threads
This patch adds a new function to the server krpc called
svcpool_close().  It is similar to svcpool_destroy(), but does not free
the data structures, so that the pool can be used again.

This function is then used instead of svcpool_destroy(),
svcpool_create() when the nfsd threads are killed.

PR:		204340
Reported by:	Panzura
Approved by:	rmacklem
Obtained from:	rmacklem
MFC after:	1 week
2017-02-14 17:49:08 +00:00
Konstantin Belousov
584b675ed6 Hide the boottime and bootimebin globals, provide the getboottime(9)
and getboottimebin(9) KPI. Change consumers of boottime to use the
KPI.  The variables were renamed to avoid shadowing issues with local
variables of the same name.

Issue is that boottime* should be adjusted from tc_windup(), which
requires them to be members of the timehands structure.  As a
preparation, this commit only introduces the interface.

Some uses of boottime were found doubtful, e.g. NLM uses boottime to
identify the system boot instance.  Arguably the identity should not
change on the leap second adjustment, but the commit is about the
timekeeping code and the consumers were kept bug-to-bug compatible.

Tested by:	pho (as part of the bigger patch)
Reviewed by:	jhb (same)
Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
X-Differential revision:	https://reviews.freebsd.org/D7302
2016-07-27 11:08:59 +00:00
Enji Cooper
789872f2e9 Don't test for xpt not being NULL before calling svc_xprt_free(..)
svc_xprt_alloc(..) will always return initialized memory as it uses
mem_alloc(..) under the covers, which uses malloc(.., M_WAITOK, ..).

MFC after: 1 week
Reported by: Coverity
CID: 1007341
Sponsored by: EMC / Isilon Storage Division
2016-07-11 07:24:56 +00:00
Enji Cooper
462984cb6f Convert svc_xprt_alloc(..) and svc_xprt_free(..)'s prototypes to
ANSI C style prototypes

MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
2016-07-11 07:17:52 +00:00
Enji Cooper
7d3db23544 Deobfuscate cleanup path in clnt_vc_create(..)
Similar to r300836, r301800, and r302550, cl and ct will always
be non-NULL as they're allocated using the mem_alloc routines,
which always use `malloc(..., M_WAITOK)`.

MFC after: 1 week
Reported by: Coverity
CID: 1007342
Sponsored by: EMC / Isilon Storage Division
2016-07-11 07:07:15 +00:00
Enji Cooper
f99529597c Deobfuscate cleanup path in clnt_dg_create(..)
Similar to r300836 and r301800, cl and cu will always be non-NULL as they're
allocated using the mem_alloc routines, which always use
`malloc(..., M_WAITOK)`.

Deobfuscating the cleanup path fixes a leak where if cl was NULL and
cu was not, cu would not be free'd, and also removes a duplicate test for
cl not being NULL.

MFC after: 1 week
Reported by: Coverity
CID: 1007033, 1007344
Sponsored by: EMC / Isilon Storage Division
2016-07-11 06:58:24 +00:00
Enji Cooper
ea85a1540a Deobfuscate cleanup path in clnt_bck_create(..)
Similar to r300836, cl and ct will always be non-NULL as they're allocated
using the mem_alloc routines, which always use `malloc(..., M_WAITOK)`.

Deobfuscating the cleanup path fixes a leak where if cl was NULL and
ct was not, ct would not be free'd, and also removes a duplicate test for
cl not being NULL.

Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D6801
MFC after: 1 week
Reported by: Coverity
CID: 1229999
Reviewed by: cem
Sponsored by: EMC / Isilon Storage Division
2016-06-10 17:53:28 +00:00
Kevin Lo
88609a6a7a Fix the rpcb_getaddr() definition to match its declaration.
Submitted by:	Sebastian Huber <sebastian dot huber at embedded-brains dot de>
2016-06-09 14:33:00 +00:00
Enji Cooper
1b53778113 Quell false positives in svc_vc_create and svc_vc_create_conn with cd and xprt
Both cd and xprt will be non-NULL after their respective malloc(9) wrappers are
called (mem_alloc and svc_xprt_alloc, which calls mem_alloc) as mem_alloc
always gets called with M_WAITOK|M_ZERO today. Thus, testing for them being
non-NULL is incorrect -- it misleads Coverity and it misleads the reader.

Remove some unnecessary NULL initializations as a follow up to help solidify
the fact that these pointers will be initialized properly in sys/rpc/.. with
the interfaces the way they are currently.

Differential Revision: https://reviews.freebsd.org/D6572
MFC after: 2 weeks
Reported by: Coverity
CID: 1007338, 1007339, 1007340
Reviewed by: markj, truckman
Sponsored by: EMC / Isilon Storage Division
2016-05-27 08:48:33 +00:00
Enji Cooper
cb05064e70 Remove unnecessary memset(.., 0, ..)'s
The mem_alloc macro calls calloc (userspace) / malloc(.., M_WAITOK|M_ZERO)
under the covers, so zeroing out memory is already handled by the underlying
calls

MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
2016-05-24 20:06:41 +00:00
Pedro F. Giffuni
6244c6e7db sys/rpc: minor spelling fixes.
No functional change.
2016-05-06 01:49:46 +00:00
Pedro F. Giffuni
4ed3c0e713 sys: Make use of our rounddown() macro when sys/param.h is available.
No functional change.
2016-04-30 14:41:18 +00:00
Conrad Meyer
e3081f7e3e kgssapi(4): Fix string overrun in Kerberos principal construction
'buf.value' was previously treated as a nul-terminated string, but only
allocated with strlen() space.  Rectify this.

Reported by:	Coverity
CID:		1007639
Sponsored by:	EMC / Isilon Storage Division
2016-04-20 04:45:23 +00:00
Pedro F. Giffuni
7f5b12538b RPC: for pointers replace 0 with NULL.
These are mostly cosmetical, no functional change.

Found with devel/coccinelle.
2016-04-14 17:06:37 +00:00
Pedro F. Giffuni
74b8d63dcc Cleanup unnecessary semicolons from the kernel.
Found with devel/coccinelle.
2016-04-10 23:07:00 +00:00
Edward Tomasz Napierala
35030a5dd4 Remove some NULL checks for M_WAITOK allocations.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-03-29 13:56:59 +00:00
Alexander Motin
8576dc0092 Fix incorrect (fortunately bigger) malloc size.
Submitted by:	pfg
MFC after:	1 week
2016-03-19 11:48:06 +00:00
Gleb Smirnoff
8ec07310fa These files were getting sys/malloc.h and vm/uma.h with header pollution
via sys/mbuf.h
2016-02-01 17:41:21 +00:00
Alexander Motin
ece9d8b702 Improve locking of sg_threadcount.
MFC after:	1 week
2015-11-19 08:04:05 +00:00
Josh Paetzel
5eff3ec6e0 Increase group limit for kerberized NFSv4
PR:	202659
Submitted by:	matthew.l.dailey@dartmouth.edu
Reviewed by:	rmacklem dfr
MFC after:	1 week
Sponsored by:	iXsystems
2015-09-26 16:30:16 +00:00
Xin LI
2c98c61dad Set curvnet context inside the RPC code in more places.
Reviewed by:	melifaro
MFC after:	2 weeks
Differential Revision: https://reviews.freebsd.org/D3398
2015-08-18 18:12:46 +00:00
Konstantin Belousov
b4c0214605 Remove useless acquire semantic from the atomic_add operation before
sosend().  The only release on the xp_snt_cnt is done after sosend(),
with an intent to synchronize with load_acq in svc_vc_ack().

Reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-07-28 06:58:10 +00:00
Alexander Motin
80867e61d8 Remove hard limits on number of accepting NFS connections.
Limits of 5 connections set long ago creates problems for SPEC benchmark.
Make the NFS follow system-wide maximum.

MFC after:	1 week
2015-04-07 10:25:27 +00:00
Garrett Wollman
3c42b5bf28 Fix overflow bugs in and remove obsolete limit from kernel RPC
implementation.

The kernel RPC code, which is responsible for the low-level scheduling
of incoming NFS requests, contains a throttling mechanism that
prevents too much kernel memory from being tied up by NFS requests
that are being serviced.  When the throttle is engaged, the RPC layer
stops servicing incoming NFS sockets, resulting ultimately in
backpressure on the clients (if they're using TCP).  However, this is
a very heavy-handed mechanism as it prevents all clients from making
any requests, regardless of how heavy or light they are.  (Thus, when
engaged, the throttle often prevents clients from even mounting the
filesystem.)  The throttle mechanism applies specifically to requests
that have been received by the RPC layer (from a TCP or UDP socket)
and are queued waiting to be serviced by one of the nfsd threads; it
does not limit the amount of backlog in the socket buffers.

The original implementation limited the total bytes of queued requests
to the minimum of a quarter of (nmbclusters * MCLBYTES) and 45 MiB.
The former limit seems reasonable, since requests queued in the socket
buffers and replies being constructed to the requests in progress will
all require some amount of network memory, but the 45 MiB limit is
plainly ridiculous for modern memory sizes: when running 256 service
threads on a busy server, 45 MiB would result in just a single
maximum-sized NFS3PROC_WRITE queued per thread before throttling.

Removing this limit exposed integer-overflow bugs in the original
computation, and related bugs in the routines that actually account
for the amount of traffic enqueued for service threads.  The old
implementation also attempted to reduce accounting overhead by
batching updates until each queue is fully drained, but this is prone
to livelock, resulting in repeated accumulate-throttle-drain cycles on
a busy server.  Various data types are changed to long or unsigned
long; explicit 64-bit types are not used due to the unavailability of
64-bit atomics on many 32-bit platforms, but those platforms also
cannot support nmbclusters large enough to cause overflow.

This code (in a 10.1 kernel) is presently running on production NFS
servers at CSAIL.

Summary of this revision:
* Removes 45 MiB limit on requests queued for nfsd service threads
* Fixes integer-overflow and signedness bugs
* Avoids unnecessary throttling by not deferring accounting for
  completed requests

Differential Revision:	https://reviews.freebsd.org/D2165
Reviewed by:	rmacklem, mav
MFC after:	30 days
Relnotes:	yes
Sponsored by:	MIT Computer Science & Artificial Intelligence Laboratory
2015-04-01 00:45:47 +00:00
Pedro F. Giffuni
84a9ba84bb rpc: Uninitialized pointer read
Initialize *xprt to avoid exposing a random value
in cleanup_svc_vc_create.
This is the kernel counterpart of r278041.

CID:		1007340
2015-02-02 16:07:07 +00:00
Konstantin Belousov
6ddcc23386 Add facility to stop all userspace processes. The supposed use of the
feature is to quisce the system before suspend.

Stop is implemented by reusing the thread_single(9) with the special
mode SINGLE_ALLPROC.  SINGLE_ALLPROC differs from the existing
single-threading modes by allowing (requiring) caller to operate on
other process.  Interruptible sleeps for !TDF_SBDRY threads are
suspended like SIGSTOP does it, instead of aborting the sleep, like
SINGLE_NO_EXIT, to avoid spurious EINTRs on resume.

Provide debugging sysctl debug.stop_all_proc, which causes total stop
and suspends syncer, while waiting for variable reset for resume.  It
is used for debugging; should be removed after the real use of the
interface is added.

In collaboration with:	pho
Discussed with:	avg
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-12-13 16:18:29 +00:00
Konstantin Belousov
f87c8878e6 Current reaction of the nfsd worker threads to any signal is exit.
This is not correct at least for the stop requests.  Check for stop
conditions and suspend threads if requested.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-12-08 16:33:18 +00:00
Gleb Smirnoff
cfa6009e36 In preparation of merging projects/sendfile, transform bare access to
sb_cc member of struct sockbuf to a couple of inline functions:

sbavail() and sbused()

Right now they are equal, but once notion of "not ready socket buffer data",
will be checked in, they are going to be different.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-11-12 09:57:15 +00:00
Rick Macklem
c59e4cc34d Merge the NFSv4.1 server code in projects/nfsv4.1-server over
into head. The code is not believed to have any effect
on the semantics of non-NFSv4.1 server behaviour.
It is a rather large merge, but I am hoping that there will
not be any regressions for the NFS server.

MFC after:	1 month
2014-07-01 20:47:16 +00:00
Alexander Motin
82dcc80db1 Fix race in r267221.
MFC after:	2 weeks
2014-06-09 15:00:43 +00:00
Alexander Motin
b563304c50 Split RPC pool threads into number of smaller semi-isolated groups.
Old design with unified thread pool was good from the point of thread
utilization.  But single pool-wide mutex became huge congestion point
for systems with many CPUs.  To reduce the congestion create several
thread groups within a pool (one group for every 6 CPUs and 12 threads),
each group with own mutex.  Each connection during its registration is
assigned to one of the groups in round-robin fashion.  File affinify
code may still move requests between the groups, but otherwise groups
are self-contained.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2014-06-08 11:19:32 +00:00
Alexander Motin
b5d7fb7398 Remove st_idle variable, duplicating st_xprt.
MFC after:	2 weeks
2014-06-08 10:18:22 +00:00
Alexander Motin
b776fb2d67 Introduce new per-thread lock to protect the list of requests.
This allows to slightly simplify svc_run_internal() code: if we processed
all the requests in a queue, then we know that new one will not appear.

MFC after:	2 weeks
2014-06-08 09:40:26 +00:00
Christian Brueffer
c3e2c655a5 Properly free resources in case of error.
CID:		1007032
Found with:	Coverity Prevent(tm)
MFC after:	2 weeks
2014-05-02 20:45:55 +00:00
Alexander Motin
b4fced900b Fix lock acquisition in case no request space available, missed in r260097.
MFC after:	3 days
2014-02-04 00:00:01 +00:00
Peter Wemm
bcea84bd86 Don't expose svc_loss_reg / _unreg to userland as they're kernel-only
additions from r260229 and the SVCPOOL type doesn't exist in userland.
2014-01-08 22:37:18 +00:00
Alexander Motin
0979970a1d Fix NULL dereference panic on UDP requests introduced in r260229. 2014-01-06 12:40:46 +00:00
Alexander Motin
c809a67a72 Replace locks added in r260229 to protect sequence counters with atomics.
New algorithm does not create additional lock congestion, while some races
it includes should not be a problem.  Those races may keep requests in DRC
cache for some more time by returning ACK position smaller then actual,
but it still should be able to drop thems when proper ACK finally read.

Races of the original algorithm based on TCP seq number were worse because
they happened when reply sequence number were recorded. After that even
correctly read ACKs could not clean DRC sometimes.
2014-01-04 15:51:31 +00:00
Alexander Motin
d473bac729 Rework NFS Duplicate Request Cache cleanup logic.
- Introduce additional hash to group requests by hash of sockref.  This
allows to process TCP acknowledgements without looping though all the cache,
and as result allows to do it every time.
 - Indroduce additional callbacks to notify application layer about sockets
disconnection.  Without this last few requests processed just before socket
disconnection never processed their ACKs and stuck in cache for many hours.
 - Implement transport-specific method for tracking reply acknowledgements.
New implementation does not cross multiple stack layers to get the data and
does not have race conditions that previously made some requests stuck
in cache.  This could be done more efficiently at sockbuf layer, but that
would broke some KBIs, while I don't know other consumers for it aside NFS.
 - Instead of traversing all DRC twice per request, run cleaning only once
per request, and except in some conditions traverse only single hash slot
at a time.

Together this limits NFS DRC growth only to situations of real connectivity
problems.  If network is working well, and so all replies are acknowledged,
cache remains almost empty even after hours of heavy load.  Without this
change on the same test cache was growing to many thousand requests even
with perfectly working local network.

As another result this reduces CPU time spent on the DRC handling during
SPEC NFS benchmark from about 10% to 0.5%.

Sponsored by:	iXsystems, Inc.
2014-01-03 15:09:59 +00:00