sosend(). The only release on the xp_snt_cnt is done after sosend(),
with an intent to synchronize with load_acq in svc_vc_ack().
Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
implementation.
The kernel RPC code, which is responsible for the low-level scheduling
of incoming NFS requests, contains a throttling mechanism that
prevents too much kernel memory from being tied up by NFS requests
that are being serviced. When the throttle is engaged, the RPC layer
stops servicing incoming NFS sockets, resulting ultimately in
backpressure on the clients (if they're using TCP). However, this is
a very heavy-handed mechanism as it prevents all clients from making
any requests, regardless of how heavy or light they are. (Thus, when
engaged, the throttle often prevents clients from even mounting the
filesystem.) The throttle mechanism applies specifically to requests
that have been received by the RPC layer (from a TCP or UDP socket)
and are queued waiting to be serviced by one of the nfsd threads; it
does not limit the amount of backlog in the socket buffers.
The original implementation limited the total bytes of queued requests
to the minimum of a quarter of (nmbclusters * MCLBYTES) and 45 MiB.
The former limit seems reasonable, since requests queued in the socket
buffers and replies being constructed to the requests in progress will
all require some amount of network memory, but the 45 MiB limit is
plainly ridiculous for modern memory sizes: when running 256 service
threads on a busy server, 45 MiB would result in just a single
maximum-sized NFS3PROC_WRITE queued per thread before throttling.
Removing this limit exposed integer-overflow bugs in the original
computation, and related bugs in the routines that actually account
for the amount of traffic enqueued for service threads. The old
implementation also attempted to reduce accounting overhead by
batching updates until each queue is fully drained, but this is prone
to livelock, resulting in repeated accumulate-throttle-drain cycles on
a busy server. Various data types are changed to long or unsigned
long; explicit 64-bit types are not used due to the unavailability of
64-bit atomics on many 32-bit platforms, but those platforms also
cannot support nmbclusters large enough to cause overflow.
This code (in a 10.1 kernel) is presently running on production NFS
servers at CSAIL.
Summary of this revision:
* Removes 45 MiB limit on requests queued for nfsd service threads
* Fixes integer-overflow and signedness bugs
* Avoids unnecessary throttling by not deferring accounting for
completed requests
Differential Revision: https://reviews.freebsd.org/D2165
Reviewed by: rmacklem, mav
MFC after: 30 days
Relnotes: yes
Sponsored by: MIT Computer Science & Artificial Intelligence Laboratory
feature is to quisce the system before suspend.
Stop is implemented by reusing the thread_single(9) with the special
mode SINGLE_ALLPROC. SINGLE_ALLPROC differs from the existing
single-threading modes by allowing (requiring) caller to operate on
other process. Interruptible sleeps for !TDF_SBDRY threads are
suspended like SIGSTOP does it, instead of aborting the sleep, like
SINGLE_NO_EXIT, to avoid spurious EINTRs on resume.
Provide debugging sysctl debug.stop_all_proc, which causes total stop
and suspends syncer, while waiting for variable reset for resume. It
is used for debugging; should be removed after the real use of the
interface is added.
In collaboration with: pho
Discussed with: avg
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
This is not correct at least for the stop requests. Check for stop
conditions and suspend threads if requested.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
sb_cc member of struct sockbuf to a couple of inline functions:
sbavail() and sbused()
Right now they are equal, but once notion of "not ready socket buffer data",
will be checked in, they are going to be different.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
into head. The code is not believed to have any effect
on the semantics of non-NFSv4.1 server behaviour.
It is a rather large merge, but I am hoping that there will
not be any regressions for the NFS server.
MFC after: 1 month
Old design with unified thread pool was good from the point of thread
utilization. But single pool-wide mutex became huge congestion point
for systems with many CPUs. To reduce the congestion create several
thread groups within a pool (one group for every 6 CPUs and 12 threads),
each group with own mutex. Each connection during its registration is
assigned to one of the groups in round-robin fashion. File affinify
code may still move requests between the groups, but otherwise groups
are self-contained.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
This allows to slightly simplify svc_run_internal() code: if we processed
all the requests in a queue, then we know that new one will not appear.
MFC after: 2 weeks
New algorithm does not create additional lock congestion, while some races
it includes should not be a problem. Those races may keep requests in DRC
cache for some more time by returning ACK position smaller then actual,
but it still should be able to drop thems when proper ACK finally read.
Races of the original algorithm based on TCP seq number were worse because
they happened when reply sequence number were recorded. After that even
correctly read ACKs could not clean DRC sometimes.
- Introduce additional hash to group requests by hash of sockref. This
allows to process TCP acknowledgements without looping though all the cache,
and as result allows to do it every time.
- Indroduce additional callbacks to notify application layer about sockets
disconnection. Without this last few requests processed just before socket
disconnection never processed their ACKs and stuck in cache for many hours.
- Implement transport-specific method for tracking reply acknowledgements.
New implementation does not cross multiple stack layers to get the data and
does not have race conditions that previously made some requests stuck
in cache. This could be done more efficiently at sockbuf layer, but that
would broke some KBIs, while I don't know other consumers for it aside NFS.
- Instead of traversing all DRC twice per request, run cleaning only once
per request, and except in some conditions traverse only single hash slot
at a time.
Together this limits NFS DRC growth only to situations of real connectivity
problems. If network is working well, and so all replies are acknowledged,
cache remains almost empty even after hours of heavy load. Without this
change on the same test cache was growing to many thousand requests even
with perfectly working local network.
As another result this reduces CPU time spent on the DRC handling during
SPEC NFS benchmark from about 10% to 0.5%.
Sponsored by: iXsystems, Inc.
Do not insert active ports into pool->sp_active list if they are success-
fully assigned to some thread. This makes that list include only ports that
really require attention, and so traversal can be reduced to simple taking
the first one.
Remove idle thread from pool->sp_idlethreads list when assigning some
work (port of requests) to it. That again makes possible to replace list
traversals with simple taking the first element.
When processing receive buffer, write the amount of data, expected
in present request record, into socket's so_rcv.sb_lowat to make stack
aware about our needs. When processing following upcalls, ignore them
until socket collect enough data to be read and processed in one turn.
This change reduces number of context switches and other operations
in RPC stack during large NFS writes (especially via non-Jumbo networks)
by order of magnitude.
After precessing current packet, take another look into the pending
buffer to find out whether the next packet had been already received.
If not, deactivate this port right there without making RPC code to
push this port to another thread just to find that there is nothing.
If the next packet is received partially, also deactivate the port, but
also update socket's so_rcv.sb_lowat to not be woken up prematurely.
This change additionally reduces number of context switches per NFS
request about in half.
3-clause BSD license as specified by Oracle America, Inc. in 2010.
This license change was approved by Wim Coekaerts, Senior Vice
President, Linux and Virtualization at Oracle Corporation.
- close cosmetic race in svc_exit();
- do not set wait timeout for idle threads if we have no use for wakeups;
- create new requested thread sooner, not only after some another thread
wakeup, that may happen later under constant load.
krpc client side UDP was observed as way out of range and
caused the rpc.lockd daemon to hang trying to do an RPC.
Inspection of the code found two places where the RPC request
is re-queued, but the value of cu_sent was not incremented.
Since cu_sent is always decremented when the RPC request is
dequeued, I think this could have caused cu_sent to go out of
range. This patch adds lines to increment cu_sent for these
two cases.
Reported by: dwhite@ixsystems.com
Discussed with: dwhite@ixsystems.com
MFC after: 2 weeks
credentials to the kernel rpc. Modify the NFSv4 client to add
support for the gssname and allgssname mount options to use this
capability. Requires the gssd daemon to be running with the "-h" option.
Reviewed by: jhb
connection after it was accepted by the userland nfsd process but before
it was handled off to svc_vc_create() in the kernel, then svc_vc_create()
would see it as a new listen socket and try to listen on it leaving a
dangling reference to the socket. Instead, check for disconnected sockets
and treat them like a connected socket. The call to pru_getaddr() should
fail and cause svc_vc_create() to fail. Note that we need to lock the
socket to get a consistent snapshot of so_state since there is a window
in soisdisconnected() where both flags are clear.
Reviewed by: dfr, rmacklem
MFC after: 1 week
are used by NFSv4.1 for callbacks. A backchannel is a connection
established by the client, but used for RPCs done by the server
on the client (callbacks). As a result, this patch mixes some
client side calls in the server side and vice versa. Some
definitions in the .c files were extracted out into a file called
krpc.h, so that they could be included in multiple .c files.
This code has been in projects/nfsv4.1-client for some time.
Although no one has given it a formal review, I believe kib@
has taken a look at it.
After further discussion, instead of pretending to use
uid_t and gid_t as upstream Solaris and linux try to, we
are better using u_int, which is in fact what the code
can handle and best approaches the range of values used
by uid and gid.
Discussed with: bde
Reviewed by: bde
When creating a client with clnt_tli_create, it uses strdup to copy
strings for these fields if nconf is passed in. clnt_dg_destroy frees
these strings already. Make sure clnt_vc_destroy frees them in the same
way.
This change matches the reference (OpenSolaris) implementation.
Tested by: David Wolfskill
Obtained from: Bull GNU/Linux NFSv4 Project (libtirpc)
MFC after: 2 weeks
w.r.t. a Linux NFS client doing a krb5 NFS mount against the
FreeBSD server. We determined this was a Linux bug:
http://www.spinics.net/lists/linux-nfs/msg32466.html, however
the mount failed to work, because the Destroy operation with a
bogus encrypted checksum destroyed the authenticator handle.
This patch changes the rpcsec_gss code so that it doesn't
Destroy the authenticator handle for this case and, as such,
the Linux mount will work.
Tested by: Attila Bogar and Herbert Poeckl
MFC after: 2 weeks
The attempt to merge changes from the linux libtirpc caused
rpc.lockd to exit after startup under unclear conditions.
After many hours of selective experiments and inconsistent results
the conclusion is that it's better to just revert everything and
restart in a future time with a much smaller subset of the
changes.
____
MFC after: 3 days
Reported by: David Wolfskill
Tested by: David Wolfskill
The following change caused rpc.lockd to exit after startup:
____
libtirpc: be sure to free cl_netid and cl_tp
When creating a client with clnt_tli_create, it uses strdup to copy
strings for these fields if nconf is passed in. clnt_dg_destroy frees
these strings already. Make sure clnt_vc_destroy frees them in the
same way.
____
MFC after: 3 days
Reported by: David Wolfskill
Tested by: David Wolfskill
C++ mangling will cause trouble with variables like __rpc_xdr
in xdr.h so rename this to XDR.
While here add proper C++ guards to RPC headers.
PR: 137443
MFC after: 2 weeks
We especifically ignored the glibc compatibility changes
but this should help interaction with Solaris and Linux.
____
Fixed infinite loop in svc_run()
author Steve Dickson
Tue, 10 Jun 2008 12:35:52 -0500 (13:35 -0400)
Fixed infinite loop in svc_run()
____
__rpc_taddr2uaddr_af() assumes the netbuf to always have a
non-zero data. This is a bad assumption and can lead to a
seg-fault. This patch adds a check for zero length and returns
NULL when found.
author Steve Dickson
Mon, 27 Oct 2008 11:46:54 -0500 (12:46 -0400)
____
Changed clnt_spcreateerror() to return clearer
and more concise error messages.
author Steve Dickson
Thu, 20 Nov 2008 08:55:31 -0500 (08:55 -0500)
____
Converted all uid and gid variables of the type uid_t and gid_t.
author Steve Dickson
Wed, 28 Jan 2009 12:44:46 -0500 (12:44 -0500)
____
libtirpc: set r_netid and r_owner in __rpcb_findaddr_timed
These fields in the rpcbind GETADDR call are being passed uninitialized
to CLNT_CALL. In the case of x86_64 at least, this usually leads to a
segfault. On x86, it sometimes causes segfaults and other times causes
garbage to be sent on the wire.
rpcbind generally ignores the r_owner field for calls that come in over
the wire, so it really doesn't matter what we send in that slot. We just
need to send something. The reference implementation from Sun seems to
send a blank string. Have ours follow suit.
author Jeff Layton
Fri, 13 Mar 2009 11:44:16 -0500 (12:44 -0400)
____
libtirpc: be sure to free cl_netid and cl_tp
When creating a client with clnt_tli_create, it uses strdup to copy
strings for these fields if nconf is passed in. clnt_dg_destroy frees
these strings already. Make sure clnt_vc_destroy frees them in the same
way.
author Jeff Layton
Fri, 13 Mar 2009 11:47:36 -0500 (12:47 -0400)
Obtained from: Bull GNU/Linux NFSv4 Project
MFC after: 3 weeks
subject heading "mtx_lock() of destroyed mutex on NFS" and
PR# 156168 appear to be caused by clnt_dg_destroy() closing
down the socket prematurely. When to close down the socket
is controlled by a reference count (cs_refs), but clnt_dg_create()
checks for sb_upcall being non-NULL to decide if a new socket
is needed. I believe the crashes were caused by the following race:
clnt_dg_destroy() finds cs_refs == 0 and decides to delete socket
clnt_dg_destroy() then loses race with clnt_dg_create() for
acquisition of the SOCKBUF_LOCK()
clnt_dg_create() finds sb_upcall != NULL and increments cs_refs to 1
clnt_dg_destroy() then acquires SOCKBUF_LOCK(), sets sb_upcall to
NULL and destroys socket
This patch fixes the above race by changing clnt_dg_destroy() so
that it acquires SOCKBUF_LOCK() before testing cs_refs.
Tested by: bz
PR: 156168
Reviewed by: dfr
MFC after: 2 weeks
heading "kernel panics with RPCSEC_GSS" appears to be caused by a
corrupted tailq list for the client structure. Looking at the code, calls
to the function svc_rpc_gss_forget_client() were done in an SMP unsafe
manner, with the svc_rpc_gss_lock only being acquired in the function
and not before it. As such, when multiple threads called
svc_rpc_gss_forget_client() concurrently, it could try and remove the
same client structure from the tailq lists multiple times.
The patch fixes this by moving the critical code into a separate
function called svc_rpc_gss_forget_client_locked(), which must be
called with the lock held. For the one case where the caller would
have no interest in the lock, svc_rpc_gss_forget_client() was retained,
but a loop was added to check that the client structure is still in
the tailq lists before removing it, to make it safe for multiple
concurrent calls.
Tested by: clinton.adams at gmail.com (earlier version)
Reviewed by: zkirsch
MFC after: 3 days
been interrupted in a restartable syscall. Otherwise we could end up
in an (almost) endless loop in clnt_reconnect_call().
PR: kern/160198
Reviewed by: rmacklem
Approved by: re (kib), avg (mentor)
MFC after: 1 week
the NFS subsystems use five of the rpcsec_gss/kgssapi entry points,
but since it was not obvious which others might be useful, all
nineteen were included. Basically the nineteen entry points are
set in a structure called rpc_gss_entries and inline functions
defined in sys/rpc/rpcsec_gss.h check for the entry points being
non-NULL and then call them. A default value is returned otherwise.
Requested by rwatson.
Reviewed by: jhb
MFC after: 2 weeks
non-interruptible NFS mounts, where a kernel thread will seem
to be stuck sleeping on "rpccon". The msleep() in clnt_vc_create()
that was waiting to a TCP connect to complete would return ERESTART,
since PCATCH was specified. Then the tsleep() in clnt_reconnect_call()
would sleep for 1 second and then try again and again and...
The patch changes the msleep() in clnt_vc_create() so it only sets
the PCATCH flag for interruptible cases.
Tested by: pho
Reviewed by: jhb
MFC after: 2 weeks
not believe that these leaks had a practical impact,
since the situations in which they would have occurred
would have been extremely rare.
MFC after: 2 weeks
VNET socket push back:
try to minimize the number of places where we have to switch vnets
and narrow down the time we stay switched. Add assertions to the
socket code to catch possibly unset vnets as seen in r204147.
While this reduces the number of vnet recursion in some places like
NFS, POSIX local sockets and some netgraph, .. recursions are
impossible to fix.
The current expectations are documented at the beginning of
uipc_socket.c along with the other information there.
Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
Reviewed by: jhb
Tested by: zec
Tested by: Mikolaj Golub (to.my.trociny gmail.com)
MFC after: 2 weeks
erroneously, assumed that 4 bytes of data were in the first
mbuf of a list by replacing the bcopy() with m_copydata().
Also, replace the uses of m_pullup(), which can fail for
reasons other than not enough data, with m_copydata().
For the cases where it isn't known that there is enough
data in the mbuf list, check first via m_len and m_length().
This is believed to fix a problem reported by dpd at dpdtech.com
and george+freebsd at m5p.com.
Reviewed by: jhb
MFC after: 8 days
data size greater than 8192. Since soreserve(so, 256*1024, 256*1024)
would always fail for the default value of sb_max, modify clnt_dg.c
so that it uses the calculated values and checks for an error return
from soreserve(). Also, add a check for error return from soreserve()
to clnt_vc.c and change __rpc_get_t_size() to use sb_max_adj instead of
the bogus maxsize == 256*1024.
PR: kern/150910
Reviewed by: jhb
MFC after: 2 weeks
in the kernel (just as inet_ntoa() and inet_aton()) are and sync their
prototype accordingly with already mentioned functions.
Sponsored by: Sandvine Incorporated
Reviewed by: emaste, rstone
Approved by: dfr
MFC after: 2 weeks
(replay_alloc()) knows how to handle replay_alloc() failure.
- Eliminate 'freed_one' variable, it is not needed - when no entry is found
rce will be NULL.
- Add locking assertions where we expect a rc_lock to be held.
Reviewed by: rmacklem
MFC after: 2 weeks
succeeded and a subsequent interation failed to find an
entry to prune, it could loop infinitely, since the
"freed" variable wasn't reset to FALSE. This patch moves
setting freed FALSE to inside the loop to fix the problem.
Tested by: alan.bryan at yahoo.com
MFC after: 2 weeks
cache, it did not free the request argument mbuf list, resulting in a leak.
This patch fixes that leak.
Tested by: danny AT cs.huji.ac.il
PR: kern/144330
Submitted by: to.my.trociny AT gmail.com (earlier version)
Reviewed by: dfr
MFC after: 2 weeks
kern.ngroups+1. kern.ngroups can range from NGROUPS_MAX=1023 to
INT_MAX-1. Given that the Windows group limit is 1024, this range
should be sufficient for most applications.
MFC after: 1 month
Fix some wrong usages.
Note: this does not affect generated binaries as this argument is not used.
PR: 137213
Submitted by: Eygene Ryabinkin (initial version)
MFC after: 1 month
client just before queuing a request for the connection. The
code already had a check for the connection being shut down
while the request was queued, but not one for the shut down
having been initiated by the server before the request was
in the queue. This appears to fix the problem of slow reconnects
against an NFS server that drops inactive connections reported
by Olaf Seibert, but does not fix the case
where the FreeBSD client generates RST segments at about the
same time as ACKs. This is still a problem that is being
investigated. This patch does not cause a regression for this
case.
Tested by: Olaf Seibert, Daniel Braniss
Reviewed by: dfr
MFC after: 5 days
Who knew that "svn export" was an actual command, or that I would have
vfs_export.c stuck in my mind deep enough to type "export" instead of
"commit"?
Pointy Hat to: jamie
context inside the RPC code.
Temporarily set td's cred to mount's cred before calling socreate() via
__rpc_nconf2socket().
Submitted by: rmacklem (in part)
Reviewed by: rmacklem, rwatson
Discussed with: dfr, bz
Approved by: re (rwatson), julian (mentor)
MFC after: 3 days
kernel resources that block other threads, like vnode locks. The SIGSTOP
sent to such thread (process, rather) shall not stop it until thread
releases the resources.
Tested by: pho
Reviewed by: jhb
Approved by: re (kensmith)
call could get hung sleeping on "gsssta" if the credentials for a user
that had been accessing the mount point have expired. This happened
because rpc_gss_destroy_context() would end up calling itself when the
"destroy context" RPC was attempted, trying to refresh the credentials.
This patch just checks for this case in rpc_gss_refresh() and returns
without attempting the refresh, which avoids the recursive call to
rpc_gss_destroy_context() and the subsequent hang.
Reviewed by: dfr
Approved by: re (Ken Smith), kib (mentor)
This is normally done by a loop in clnt_dg_close(), but requests that aren't
in the pending queue at the time of closing, don't get set. This avoids a
panic in xdrmbuf_create() when it is called with a NULL cr_mrep if
cr_error doesn't get set to ESHUTDOWN while closing.
Reviewed by: dfr
Approved by: re (Ken Smith), kib (mentor)
during reading of the code. Change the code so that it never accesses
rc_connecting, rc_closed or rc_client when the rc_lock mutex is not held.
Also, it now performs the CLNT_CLOSE(client) and CLNT_RELEASE(client)
calls after the rc_lock mutex has been released, since those calls do
msleep()s with another mutex held. Change clnt_reconnect_call() so that
releasing the reference count is delayed until after the
"if (rc->rc_client == client)" check, so that rc_client cannot have been
recycled.
Tested by: pho
Reviewed by: dfr
Approved by: kib (mentor)
side fails, the entry in the cache is left with no valid context
(gd_ctx == GSS_C_NO_CONTEXT). As such, subsequent hits on the cache
will result in persistent authentication failure, even after the user has
done a kinit or similar and acquired a new valid TGT. This patch adds a test
for that case upon a cache hit and calls rpc_gss_init() to make another
attempt at getting valid credentials. It also moves the setting of gc_proc
to before the import of the principal name to ensure that, if that case
fails, it will be detected as a failure after going to "out:".
Reviewed by: dfr
Approved by: kib (mentor)
NGROUPS_MAX, eliminate ABI dependencies on them, and raise the to 1024
and 1023 respectively. (Previously they were equal, but under a close
reading of POSIX, NGROUPS_MAX was defined to be too large by 1 since it
is the number of supplemental groups, not total number of groups.)
The bulk of the change consists of converting the struct ucred member
cr_groups from a static array to a pointer. Do the equivalent in
kinfo_proc.
Introduce new interfaces crcopysafe() and crsetgroups() for duplicating
a process credential before modifying it and for setting group lists
respectively. Both interfaces take care for the details of allocating
groups array. crsetgroups() takes care of truncating the group list
to the current maximum (NGROUPS) if necessary. In the future,
crsetgroups() may be responsible for insuring invariants such as sorting
the supplemental groups to allow groupmember() to be implemented as a
binary search.
Because we can not change struct xucred without breaking application
ABIs, we leave it alone and introduce a new XU_NGROUPS value which is
always 16 and is to be used or NGRPS as appropriate for things such as
NFS which need to use no more than 16 groups. When feasible, truncate
the group list rather than generating an error.
Minor changes:
- Reduce the number of hand rolled versions of groupmember().
- Do not assign to both cr_gid and cr_groups[0].
- Modify ipfw to cache ucreds instead of part of their contents since
they are immutable once referenced by more than one entity.
Submitted by: Isilon Systems (initial implementation)
X-MFC after: never
PR: bin/113398 kern/133867
SVCXPTR structure returned by them, it was possible for the structure
to be free'd before svc_reg() had been completed using the structure.
This patch acquires a reference count on the newly created structure
that is returned by svc_[dg|vc|tli|tp]_create(). It also
adds the appropriate SVC_RELEASE() calls to the callers, except the
experimental nfs subsystem. The latter will be committed separately.
Submitted by: dfr
Tested by: pho
Approved by: kib (mentor)
variables set via the getcredhostid() function. I also changed the type
of ci_hostid to "unsigned long" so that it matches what is returned by
getcredhostid(). Although "struct svc_rpc_gss_clientid" goes on the wire
during RPCSEC_GSS, it is just a variable # of opaque bytes to the client,
so it doesn't matter how much storage ci_hostid uses.
Approved by: kib (mentor)
server would crash because the Solaris10 client would attempt to use
Sun's NFSACL protocol, which FreeBSD doesn't support. When the server
generated the error reply via svcerr_noprog(), it would cause a crash
because it would try and wrap a NULL reply. According to RFC2203, no
wrapping is required for error cases. This one line change avoids
wrapping of NULL replies.
Reviewed by: dfr
Approved by: kib (mentor)
connect failed, the thread would be left stuck in msleep()
indefinitely, since it would call msleep() again for the case
where rc_client == NULL. Change the loop criteria and the if just
after the loop, so that this case is handled correctly.
Reviewed by: dfr
Approved by: kib (mentor)
where an improperly initialized prison field could lead to a panic. This
is not the correct solution, since it fails to address similar problems
for both AUDIT and MAC, which also rely on properly initialized
credentials, but should reduce panic reports while we work that out.
Reported by: ps, kan, others
thread has already unregistered the structure. Also add a KASSERT()
to xprt_unregister_locked() to check that the structure hasn't already
been unregistered.
Reviewed by: jhb
Tested by: pho
Approved by: kib (mentor)
mtx_destroy() of the pool mutex to after SVC_RELEASE(), because
the pool mutex was still locked when soclose() was called by svc_dg_destroy().
To fix this, an mtx_unlock() was added where mtx_destroy() was before
r193436.
Reviewed by: jhb
Tested by: pho
Approved by: rwatson (mentor)
holding SOCKBUF_LOCK() isn't sufficient to guarantee that there is
no upcall in progress, since SOCKBUF_LOCK() is released/re-acquired
in the upcall. An upcall reference counter was added to the upcall
structure that is incremented at the beginning of the upcall and
decremented at the end of the upcall. As such, a reference count == 0
when holding the SOCKBUF_LOCK() guarantees there is no upcall in
progress. Add a function that is called just after soupcall_clear(),
which waits until the reference count == 0.
Also, move the mtx_destroy() down to after soupcall_clear(), so that
the mutex is not destroyed before upcalls are done.
Reviewed by: dfr, jhb
Tested by: pho
Approved by: kib (mentor)
Add a flag so that soupcall_clear() is only called once to cancel
an upcall.
Move the test for xprt_registered in the upcall down to after the
mtx_lock() of the pool mutex, to catch the case where it is
unregistered while the upcall is waiting for the mutex.
Also, move the mtx_destroy() of the pool mutex to after SVC_RELEASE(),
so that it isn't destroyed before the upcalls are disabled.
Reviewed by: dfr, jhb
Tested by: pho
Approved by: kib (mentor)
count of the number of registered policies.
Rather than unconditionally locking sockets before passing them into MAC,
lock them in the MAC entry points only if mac_policy_count is non-zero.
This avoids locking overhead for a number of socket system calls when no
policies are registered, eliminating measurable overhead for the MAC
Framework for the socket subsystem when there are no active policies.
Possibly socket locks should be acquired by policies if they are required
for socket labels, which would further avoid locking overhead when there
are policies but they don't require labeling of sockets, or possibly
don't even implement socket controls.
Obtained from: TrustedBSD Project
- Each socket upcall is now invoked with the appropriate socket buffer
locked. It is not permissible to call soisconnected() with this lock
held; however, so socket upcalls now return an integer value. The two
possible values are SU_OK and SU_ISCONNECTED. If an upcall returns
SU_ISCONNECTED, then the soisconnected() will be invoked on the
socket after the socket buffer lock is dropped.
- A new API is provided for setting and clearing socket upcalls. The
API consists of soupcall_set() and soupcall_clear().
- To simplify locking, each socket buffer now has a separate upcall.
- When a socket upcall returns SU_ISCONNECTED, the upcall is cleared from
the receive socket buffer automatically. Note that a SO_SND upcall
should never return SU_ISCONNECTED.
- All this means that accept filters should now return SU_ISCONNECTED
instead of calling soisconnected() directly. They also no longer need
to explicitly clear the upcall on the new socket.
- The HTTP accept filter still uses soupcall_set() to manage its internal
state machine, but other accept filters no longer have any explicit
knowlege of socket upcall internals aside from their return value.
- The various RPC client upcalls currently drop the socket buffer lock
while invoking soreceive() as a temporary band-aid. The plan for
the future is to add a new flag to allow soreceive() to be called with
the socket buffer locked.
- The AIO callback for socket I/O is now also invoked with the socket
buffer locked. Previously sowakeup() would drop the socket buffer
lock only to call aio_swake() which immediately re-acquired the socket
buffer lock for the duration of the function call.
Discussed with: rwatson, rmacklem