Commit Graph

161 Commits

Author SHA1 Message Date
trasz
ca92bb3067 Remove some NULL checks for M_WAITOK allocations.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-03-29 13:56:59 +00:00
mav
55b7402d49 Fix incorrect (fortunately bigger) malloc size.
Submitted by:	pfg
MFC after:	1 week
2016-03-19 11:48:06 +00:00
glebius
306a6faf84 These files were getting sys/malloc.h and vm/uma.h with header pollution
via sys/mbuf.h
2016-02-01 17:41:21 +00:00
mav
62e7403834 Improve locking of sg_threadcount.
MFC after:	1 week
2015-11-19 08:04:05 +00:00
jpaetzel
bd730ad2f4 Increase group limit for kerberized NFSv4
PR:	202659
Submitted by:	matthew.l.dailey@dartmouth.edu
Reviewed by:	rmacklem dfr
MFC after:	1 week
Sponsored by:	iXsystems
2015-09-26 16:30:16 +00:00
delphij
277c92f612 Set curvnet context inside the RPC code in more places.
Reviewed by:	melifaro
MFC after:	2 weeks
Differential Revision: https://reviews.freebsd.org/D3398
2015-08-18 18:12:46 +00:00
kib
2b8c79506d Remove useless acquire semantic from the atomic_add operation before
sosend().  The only release on the xp_snt_cnt is done after sosend(),
with an intent to synchronize with load_acq in svc_vc_ack().

Reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-07-28 06:58:10 +00:00
mav
979d590154 Remove hard limits on number of accepting NFS connections.
Limits of 5 connections set long ago creates problems for SPEC benchmark.
Make the NFS follow system-wide maximum.

MFC after:	1 week
2015-04-07 10:25:27 +00:00
wollman
299467e040 Fix overflow bugs in and remove obsolete limit from kernel RPC
implementation.

The kernel RPC code, which is responsible for the low-level scheduling
of incoming NFS requests, contains a throttling mechanism that
prevents too much kernel memory from being tied up by NFS requests
that are being serviced.  When the throttle is engaged, the RPC layer
stops servicing incoming NFS sockets, resulting ultimately in
backpressure on the clients (if they're using TCP).  However, this is
a very heavy-handed mechanism as it prevents all clients from making
any requests, regardless of how heavy or light they are.  (Thus, when
engaged, the throttle often prevents clients from even mounting the
filesystem.)  The throttle mechanism applies specifically to requests
that have been received by the RPC layer (from a TCP or UDP socket)
and are queued waiting to be serviced by one of the nfsd threads; it
does not limit the amount of backlog in the socket buffers.

The original implementation limited the total bytes of queued requests
to the minimum of a quarter of (nmbclusters * MCLBYTES) and 45 MiB.
The former limit seems reasonable, since requests queued in the socket
buffers and replies being constructed to the requests in progress will
all require some amount of network memory, but the 45 MiB limit is
plainly ridiculous for modern memory sizes: when running 256 service
threads on a busy server, 45 MiB would result in just a single
maximum-sized NFS3PROC_WRITE queued per thread before throttling.

Removing this limit exposed integer-overflow bugs in the original
computation, and related bugs in the routines that actually account
for the amount of traffic enqueued for service threads.  The old
implementation also attempted to reduce accounting overhead by
batching updates until each queue is fully drained, but this is prone
to livelock, resulting in repeated accumulate-throttle-drain cycles on
a busy server.  Various data types are changed to long or unsigned
long; explicit 64-bit types are not used due to the unavailability of
64-bit atomics on many 32-bit platforms, but those platforms also
cannot support nmbclusters large enough to cause overflow.

This code (in a 10.1 kernel) is presently running on production NFS
servers at CSAIL.

Summary of this revision:
* Removes 45 MiB limit on requests queued for nfsd service threads
* Fixes integer-overflow and signedness bugs
* Avoids unnecessary throttling by not deferring accounting for
  completed requests

Differential Revision:	https://reviews.freebsd.org/D2165
Reviewed by:	rmacklem, mav
MFC after:	30 days
Relnotes:	yes
Sponsored by:	MIT Computer Science & Artificial Intelligence Laboratory
2015-04-01 00:45:47 +00:00
pfg
a9fa41d648 rpc: Uninitialized pointer read
Initialize *xprt to avoid exposing a random value
in cleanup_svc_vc_create.
This is the kernel counterpart of r278041.

CID:		1007340
2015-02-02 16:07:07 +00:00
kib
d41bf48327 Add facility to stop all userspace processes. The supposed use of the
feature is to quisce the system before suspend.

Stop is implemented by reusing the thread_single(9) with the special
mode SINGLE_ALLPROC.  SINGLE_ALLPROC differs from the existing
single-threading modes by allowing (requiring) caller to operate on
other process.  Interruptible sleeps for !TDF_SBDRY threads are
suspended like SIGSTOP does it, instead of aborting the sleep, like
SINGLE_NO_EXIT, to avoid spurious EINTRs on resume.

Provide debugging sysctl debug.stop_all_proc, which causes total stop
and suspends syncer, while waiting for variable reset for resume.  It
is used for debugging; should be removed after the real use of the
interface is added.

In collaboration with:	pho
Discussed with:	avg
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-12-13 16:18:29 +00:00
kib
e83a8b4fa1 Current reaction of the nfsd worker threads to any signal is exit.
This is not correct at least for the stop requests.  Check for stop
conditions and suspend threads if requested.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-12-08 16:33:18 +00:00
glebius
c0b38b545a In preparation of merging projects/sendfile, transform bare access to
sb_cc member of struct sockbuf to a couple of inline functions:

sbavail() and sbused()

Right now they are equal, but once notion of "not ready socket buffer data",
will be checked in, they are going to be different.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-11-12 09:57:15 +00:00
rmacklem
8c4948bf02 Merge the NFSv4.1 server code in projects/nfsv4.1-server over
into head. The code is not believed to have any effect
on the semantics of non-NFSv4.1 server behaviour.
It is a rather large merge, but I am hoping that there will
not be any regressions for the NFS server.

MFC after:	1 month
2014-07-01 20:47:16 +00:00
mav
9264ec94fe Fix race in r267221.
MFC after:	2 weeks
2014-06-09 15:00:43 +00:00
mav
18c81a620b Split RPC pool threads into number of smaller semi-isolated groups.
Old design with unified thread pool was good from the point of thread
utilization.  But single pool-wide mutex became huge congestion point
for systems with many CPUs.  To reduce the congestion create several
thread groups within a pool (one group for every 6 CPUs and 12 threads),
each group with own mutex.  Each connection during its registration is
assigned to one of the groups in round-robin fashion.  File affinify
code may still move requests between the groups, but otherwise groups
are self-contained.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2014-06-08 11:19:32 +00:00
mav
c73f36ad0d Remove st_idle variable, duplicating st_xprt.
MFC after:	2 weeks
2014-06-08 10:18:22 +00:00
mav
aa54e738fe Introduce new per-thread lock to protect the list of requests.
This allows to slightly simplify svc_run_internal() code: if we processed
all the requests in a queue, then we know that new one will not appear.

MFC after:	2 weeks
2014-06-08 09:40:26 +00:00
brueffer
c006433c40 Properly free resources in case of error.
CID:		1007032
Found with:	Coverity Prevent(tm)
MFC after:	2 weeks
2014-05-02 20:45:55 +00:00
mav
7cbc3205cc Fix lock acquisition in case no request space available, missed in r260097.
MFC after:	3 days
2014-02-04 00:00:01 +00:00
peter
3244f6064b Don't expose svc_loss_reg / _unreg to userland as they're kernel-only
additions from r260229 and the SVCPOOL type doesn't exist in userland.
2014-01-08 22:37:18 +00:00
mav
b421f931ee Fix NULL dereference panic on UDP requests introduced in r260229. 2014-01-06 12:40:46 +00:00
mav
048eda83eb Replace locks added in r260229 to protect sequence counters with atomics.
New algorithm does not create additional lock congestion, while some races
it includes should not be a problem.  Those races may keep requests in DRC
cache for some more time by returning ACK position smaller then actual,
but it still should be able to drop thems when proper ACK finally read.

Races of the original algorithm based on TCP seq number were worse because
they happened when reply sequence number were recorded. After that even
correctly read ACKs could not clean DRC sometimes.
2014-01-04 15:51:31 +00:00
mav
d75fe782e9 Rework NFS Duplicate Request Cache cleanup logic.
- Introduce additional hash to group requests by hash of sockref.  This
allows to process TCP acknowledgements without looping though all the cache,
and as result allows to do it every time.
 - Indroduce additional callbacks to notify application layer about sockets
disconnection.  Without this last few requests processed just before socket
disconnection never processed their ACKs and stuck in cache for many hours.
 - Implement transport-specific method for tracking reply acknowledgements.
New implementation does not cross multiple stack layers to get the data and
does not have race conditions that previously made some requests stuck
in cache.  This could be done more efficiently at sockbuf layer, but that
would broke some KBIs, while I don't know other consumers for it aside NFS.
 - Instead of traversing all DRC twice per request, run cleaning only once
per request, and except in some conditions traverse only single hash slot
at a time.

Together this limits NFS DRC growth only to situations of real connectivity
problems.  If network is working well, and so all replies are acknowledged,
cache remains almost empty even after hours of heavy load.  Without this
change on the same test cache was growing to many thousand requests even
with perfectly working local network.

As another result this reduces CPU time spent on the DRC handling during
SPEC NFS benchmark from about 10% to 0.5%.

Sponsored by:	iXsystems, Inc.
2014-01-03 15:09:59 +00:00
mav
16b0bb1da4 Move most of NFS file handle affinity code out of the heavily congested
global RPC thread pool lock and protect it with own set of locks.

On synthetic benchmarks this improves peak NFS request rate by 40%.
2013-12-30 20:23:15 +00:00
mav
082b09553b Introduce xprt_inactive_self() -- variant for use when sure that port
is assigned to thread.  For example, withing receive handlers.  In that
case the function reduces to single assignment and can avoid locking.
2013-12-29 11:19:09 +00:00
mav
6ab680bb65 In addition to r259632 completely block receive upcalls if we have more
data than we need.  This reduces lock pressure from xprt_active() side.
2013-12-29 03:43:25 +00:00
dim
b1d5d16e85 Move a static const variable to the #if 0 part where it is only used.
(Note the #if 0 part has been inactive since the initial commit,
r177633, so maybe it should be removed altogether).

MFC after:	3 days
2013-12-24 20:57:26 +00:00
dim
7e48e15976 Remove some unused static const strings under sys/rpc, which have never
been used since the initial commit (r177633).

MFC after:	3 days
2013-12-24 20:55:22 +00:00
mav
dfcd443c18 Fix a bug introduced at r259632, triggering infinite loop in some cases. 2013-12-24 17:28:27 +00:00
glebius
303df467d6 Fix build. 2013-12-20 19:44:29 +00:00
mav
88262ed57f Remove several linear list traversals per request from RPC server code.
Do not insert active ports into pool->sp_active list if they are success-
fully assigned to some thread.  This makes that list include only ports that
really require attention, and so traversal can be reduced to simple taking
the first one.

  Remove idle thread from pool->sp_idlethreads list when assigning some
work (port of requests) to it.  That again makes possible to replace list
traversals with simple taking the first element.
2013-12-20 17:39:07 +00:00
mav
17e956838c Rework flow control for connection-oriented (TCP) RPC server.
When processing receive buffer, write the amount of data, expected
in present request record, into socket's so_rcv.sb_lowat to make stack
aware about our needs.  When processing following upcalls, ignore them
until socket collect enough data to be read and processed in one turn.
  This change reduces number of context switches and other operations
in RPC stack during large NFS writes (especially via non-Jumbo networks)
by order of magnitude.

  After precessing current packet, take another look into the pending
buffer to find out whether the next packet had been already received.
If not, deactivate this port right there without making RPC code to
push this port to another thread just to find that there is nothing.
If the next packet is received partially, also deactivate the port, but
also update socket's so_rcv.sb_lowat to not be woken up prematurely.
  This change additionally reduces number of context switches per NFS
request about in half.
2013-12-19 21:31:28 +00:00
hrs
aee8502f1b Replace Sun Industry Standards Source License for Sun RPC code with a
3-clause BSD license as specified by Oracle America, Inc. in 2010.
This license change was approved by Wim Coekaerts, Senior Vice
President, Linux and Virtualization at Oracle Corporation.
2013-11-25 19:08:38 +00:00
hrs
8003c4b3cf Replace Sun RPC license in TI-RPC library with a 3-clause BSD license,
with the explicit permission of Sun Microsystems in 2009.
2013-11-25 19:07:44 +00:00
hrs
2f61636cf8 Replace Sun RPC license in TI-RPC library with a 3-clause BSD license,
with the explicit permission of Sun Microsystems in 2009.
2013-11-25 19:04:36 +00:00
mav
5cc483db90 Some minor tuning to rpc/svc.c:
- close cosmetic race in svc_exit();
 - do not set wait timeout for idle threads if we have no use for wakeups;
 - create new requested thread sooner, not only after some another thread
wakeup, that may happen later under constant load.
2013-11-14 13:51:53 +00:00
rmacklem
13a117b3c6 It was reported via email that the cu_sent field used by the
krpc client side UDP was observed as way out of range and
caused the rpc.lockd daemon to hang trying to do an RPC.
Inspection of the code found two places where the RPC request
is re-queued, but the value of cu_sent was not incremented.
Since cu_sent is always decremented when the RPC request is
dequeued, I think this could have caused cu_sent to go out of
range. This patch adds lines to increment cu_sent for these
two cases.

Reported by:	dwhite@ixsystems.com
Discussed with:	dwhite@ixsystems.com
MFC after:	2 weeks
2013-09-06 02:34:34 +00:00
rmacklem
4050bd5a8c Add support for host-based (Kerberos 5 service principal) initiator
credentials to the kernel rpc. Modify the NFSv4 client to add
support for the gssname and allgssname mount options to use this
capability. Requires the gssd daemon to be running with the "-h" option.

Reviewed by:	jhb
2013-07-09 01:05:28 +00:00
jhb
4312ec3f0d Fix a potential socket leak in the NFS server. If a client closes its
connection after it was accepted by the userland nfsd process but before
it was handled off to svc_vc_create() in the kernel, then svc_vc_create()
would see it as a new listen socket and try to listen on it leaving a
dangling reference to the socket.  Instead, check for disconnected sockets
and treat them like a connected socket.  The call to pru_getaddr() should
fail and cause svc_vc_create() to fail.  Note that we need to lock the
socket to get a consistent snapshot of so_state since there is a window
in soisdisconnected() where both flags are clear.

Reviewed by:	dfr, rmacklem
MFC after:	1 week
2013-04-08 19:03:01 +00:00
gnn
ea724ef317 Improve error handling when unwrapping received data.
Submitted by:	Rick Macklem
MFC after:	1 week
2013-04-04 15:16:53 +00:00
jhb
b2e811621c Revert 195703 and 195821 as this special stop handling in NFS is now
implemented via VFCF_SBDRY rather than passing PBDRY to individual
sleep calls.
2013-03-13 21:06:03 +00:00
glebius
82c4432765 Use m_get(), m_gethdr() and m_getcl() instead of historic macros.
Sponsored by:	Nginx, Inc.
2013-03-12 12:17:19 +00:00
rmacklem
ded0f38e61 Add support for backchannels to the kernel RPC. Backchannels
are used by NFSv4.1 for callbacks. A backchannel is a connection
established by the client, but used for RPCs done by the server
on the client (callbacks). As a result, this patch mixes some
client side calls in the server side and vice versa. Some
definitions in the .c files were extracted out into a file called
krpc.h, so that they could be included in multiple .c files.
This code has been in projects/nfsv4.1-client for some time.
Although no one has given it a formal review, I believe kib@
has taken a look at it.
2012-12-08 00:29:16 +00:00
glebius
8e20fa5ae9 Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually
2012-12-05 08:04:20 +00:00
rmacklem
534523d975 Modify the comment to take out the names and URL.
Requested by:	kib
MFC after:	3 days
2012-10-25 19:30:58 +00:00
rmacklem
6496473bfe Add a comment describing why r241097 was done.
Suggested by:	rwatson
MFC after:	1 week
2012-10-15 13:38:25 +00:00
pfg
322c7845da rpc: convert all uid and gid variables to u_int.
After further discussion, instead of pretending to use
uid_t and gid_t as upstream Solaris and linux try to, we
are better using u_int, which is in fact what the code
can handle and best approaches the range of values used
by uid and gid.

Discussed with:	bde
Reviewed by:	bde
2012-10-04 04:15:18 +00:00
pfg
4143500b7b libtirpc: be sure to free cl_netid and cl_tp
When creating a client with clnt_tli_create, it uses strdup to copy
strings for these fields if nconf is passed in. clnt_dg_destroy frees
these strings already. Make sure clnt_vc_destroy frees them in the same
way.

This change matches the reference (OpenSolaris) implementation.

Tested by:	David Wolfskill
Obtained from:	Bull GNU/Linux NFSv4 Project (libtirpc)
MFC after:	2 weeks
2012-10-02 19:10:19 +00:00
pfg
801a570273 RPC: Convert all uid and gid variables of the type uid_t and gid_t.
This matches what upstream (OpenSolaris) does.

Tested by:	David Wolfskill
Obtained from:	Bull GNU/Linux NFSv4 project (libtirpc)
MFC after:	3 days
2012-10-02 19:00:56 +00:00