Commit Graph

204 Commits

Author SHA1 Message Date
rwatson
3cf529ac1e Remove more one more stale comment regarding unpcb type-safety. 2007-05-11 12:28:45 +00:00
rwatson
144e902f6c Clarify and update quite a few comments to reflect locking optimizations,
the addition of unpcb refcounts, and bug fixes.  Some of these fixes are
appropriate for MFC.

MFC after:	3 days
2007-05-11 12:10:45 +00:00
wkoszek
a8246d5e35 Don't acquire Giant unconditionally.
Reviewed by:	rwatson
2007-05-06 12:00:38 +00:00
rwatson
765a83fd79 Replace custom file descriptor array sleep lock constructed using a mutex
and flags with an sxlock.  This leads to a significant and measurable
performance improvement as a result of access to shared locking for
frequent lookup operations, reduced general overhead, and reduced overhead
in the event of contention.  All of these are imported for threaded
applications where simultaneous access to a shared file descriptor array
occurs frequently.  Kris has reported 2x-4x transaction rate improvements
on 8-core MySQL benchmarks; smaller improvements can be expected for many
workloads as a result of reduced overhead.

- Generally eliminate the distinction between "fast" and regular
  acquisisition of the filedesc lock; the plan is that they will now all
  be fast.  Change all locking instances to either shared or exclusive
  locks.

- Correct a bug (pointed out by kib) in fdfree() where previously msleep()
  was called without the mutex held; sx_sleep() is now always called with
  the sxlock held exclusively.

- Universally hold the struct file lock over changes to struct file,
  rather than the filedesc lock or no lock.  Always update the f_ops
  field last. A further memory barrier is required here in the future
  (discussed with jhb).

- Improve locking and reference management in linux_at(), which fails to
  properly acquire vnode references before using vnode pointers.  Annotate
  improper use of vn_fullpath(), which will be replaced at a future date.

In fcntl(), we conservatively acquire an exclusive lock, even though in
some cases a shared lock may be sufficient, which should be revisited.
The dropping of the filedesc lock in fdgrowtable() is no longer required
as the sxlock can be held over the sleep operation; we should consider
removing that (pointed out by attilio).

Tested by:	kris
Discussed with:	jhb, kris, attilio, jeff
2007-04-04 09:11:34 +00:00
rwatson
5ce4db6432 In uipc_close(), we no longer always free the unpcb, as the last reference
may be dropped later.  In this case, always unlock the unpcb so as not to
leak the lock.

Found by:	kris (BugMagnet)
2007-03-12 14:52:00 +00:00
rwatson
c077671dda Remove two simultaneous acquisitions of multiple unpcb locks from
uipc_send in cases where only a global read lock is held by breaking
them out and avoiding the unpcb lock acquire in the common case.  This
avoids deadlocks which manifested with X11, and should also marginally
further improve performance.

Reported by:	sepotvin, brooks
2007-03-01 09:00:42 +00:00
rwatson
9408f478d2 Lock unp2 after checking for a non-NULL unp2 pointer in uipc_send() on
datagram UNIX domain sockets, not before.
2007-02-28 08:08:50 +00:00
rwatson
fdcdf27f80 Revise locking strategy used for UNIX domain sockets in order to improve
concurrency:

- Add per-unpcb mutexes protecting unpcb connection state, fields, etc.

- Replace global UNP mutex with a global UNP rwlock, which will protect the
  UNIX domain socket connection topology, v_socket, and be acquired
  exclusively before acquiring more than per-unpcb at a time in order to
  avoid lock order issues.

In performance measurements involving MySQL, this change has little or no
overhead on UP (+/- 1%), but leads to a significant (5%-30%) improvement in
multi-processor measurements using the sysbench and supersmack benchmarks.

Much testing by:	kris
Approved by:		re (kensmith)
2007-02-26 20:47:52 +00:00
rwatson
61cab71be1 Add an additional MAC check to the UNIX domain socket connect path:
check that the subject has read/write access to the vnode using the
vnode MAC check.

MFC after:	3 weeks
Submitted by:	Spencer Minear <spencer_minear at securecomputing dot com>
Obtained from:	TrustedBSD Project
2007-02-22 09:37:44 +00:00
rwatson
6d90d77c6f Break introductory comment into two paragraphs to separate material on the
garbage collection complications from general discussion of UNIX domain
sockets.

Staticize unp_addsockcred().

Remove XXX comment regarding Giant and v_socket -- v_socket is protected
by the global UNIX domain socket lock.
2007-02-20 10:50:02 +00:00
rwatson
778df3a8f1 Minor rearrangement of global variables, comments, etc, in UNIX domain
sockets.
2007-02-14 15:05:40 +00:00
rwatson
c0d4ce5ca7 Change unp_mtx to supporting recursion, and do not drop the unp_mtx over
sonewconn() in unp_connect().  This avoids a race that occurs due to
v_socket being an uncounted reference, as the lock was being released in
order to call sonewconn(), which otherwise recurses into the UNIX domain
socket code via pru_attach, as well as holding the lock over a sleeping
memory allocation in uipc_attach().  Switch to a non-sleeping memory
allocation during UNIX domain socket attach.

This fix non-ideal in that it requires enabling recursion, but is a much
smaller change than moving to using true references for v_socket.  The
reported panic occurs in unp_connect() following the return of
sonewconn().

Update copyright year.

Panic reported by:      jhb
2007-02-14 12:22:11 +00:00
rwatson
c8c4b22747 Set UNP_CONNECTING when committing to moving ahead in unp_connect().
This logic was lost when merging the remainder of these changes in
1.178.
2007-02-13 21:00:57 +00:00
rwatson
a7eaaf4149 Push UNIX domain socket locking further into uipc_ctloutput() in order to
avoid holding the UNIX domain socket subsystem lock over soooptcopyin()
and sooptcopyout().  This problem was introduced when LOCAL_CREDS, and
LOCAL_CONNWAIT support were added.

Reviewed by:	mdodd
2007-02-06 14:31:37 +00:00
rwatson
fd7dad9d6e Canonicalize copyrights in some files I hold copyrights on:
- Sort by date in license blocks, oldest copyright first.
- All rights reserved after all copyrights, not just the first.
- Use (c) to be consistent with other entries.

MFC after:	3 days
2007-01-08 17:49:59 +00:00
jhb
256d3cdbaf - Close a race between enumerating UNIX domain socket pcb structures via
sysctl and socket teardown by adding a reference count to the UNIX domain
  pcb object and fixing the sysctl that enumerates unpcbs to grab a
  reference on each unpcb while it builds the list to copy out to userland.
- Close a race between UNIX domain pcb garbage collection (unp_gc()) and
  file descriptor teardown (fdrop()) by adding a new garbage collection
  flag FWAIT.  unp_gc() sets FWAIT while it walks the message buffers
  in a UNIX domain socket looking for nested file descriptor references
  and clears the flag when it is finished.  fdrop() checks to see if the
  flag is set on a file descriptor whose refcount just dropped to 0 and
  waits for unp_gc() to clear the flag before completely destroying the
  file descriptor.

MFC after:	1 week
Reviewed by:	rwatson
Submitted by:	ups
Hopefully makes the panics go away:	mx1
2007-01-05 19:59:46 +00:00
rwatson
7beaaf5cd2 Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h.  sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from:	TrustedBSD Project
Sponsored by:	SPARTA
2006-10-22 11:52:19 +00:00
rwatson
ff74569fb4 Minor white space tweaks. 2006-08-13 23:16:59 +00:00
rwatson
3bd21f489d Move definition of UNIX domain socket protosw and domain entries from
uipc_proto.c to uipc_usrreq.c, making localdomain static.  Remove
uipc_proto.c as it's no longer used.  With this change, UNIX domain
sockets are entirely encapsulated in uipc_usrreq.c.
2006-08-07 12:02:43 +00:00
rwatson
295b39f6ec Don't set pru_sosend, pru_soreceive, pru_sopoll to default values, as they
are already set to default values.
2006-08-06 10:39:21 +00:00
rwatson
fb3cc7e20d Remove now unneeded ENOTCONN clause from SOCK_DGRAM side of uipc_send():
we have to check it regardless of the target address, so don't check it
twice.
2006-08-02 14:30:58 +00:00
rwatson
66e61accdf Close a race that occurs when using sendto() to connect and send on a
UNIX domain socket at the same time as the remote host is closing the
new connections as quickly as they open.  Since the connect() and
send() paths are non-atomic with respect to another, it is possible
for the second thread's close() call to disconnect the two sockets
as connect() returns, leading to the consumer (which plans to send())
with a NULL kernel pointer to its proposed peer.  As a result, after
acquiring the UNIX domain socket subsystem lock, we need to revalidate
the connection pointers even though connect() has technically succeed,
and reurn an error to say that there's no connection on which to
perform the send.

We might want to rethink the specific errno number, perhaps ECONNRESET
would be better.

PR:		100940
Reported by:	Young Hyun <youngh at caida dot org>
MFC after:	2 weeks
MFC note:	Some adaptation will be required
2006-07-31 23:00:05 +00:00
rwatson
2268f15916 Remove call to soisdisconnected() in uipc_detach(), since it will already
have been invoked by uipc_close() or uipc_abort(), and the socket is in a
state of being torn down by the time we get to this point, so kqueue
state frobbed by soisdisconnected() is not available, so frobbing it will
result in a panic.

Reported by:	Munehiro Matsuda <haro at h4 dot dion dot ne dot jp>
2006-07-26 19:16:34 +00:00
rwatson
40868fda8a soreceive_generic(), and sopoll_generic(). Add new functions sosend(),
soreceive(), and sopoll(), which are wrappers for pru_sosend,
pru_soreceive, and pru_sopoll, and are now used univerally by socket
consumers rather than either directly invoking the old so*() functions
or directly invoking the protocol switch method (about an even split
prior to this commit).

This completes an architectural change that was begun in 1996 to permit
protocols to provide substitute implementations, as now used by UDP.
Consumers now uniformly invoke sosend(), soreceive(), and sopoll() to
perform these operations on sockets -- in particular, distributed file
systems and socket system calls.

Architectural head nod:	sam, gnn, wollman
2006-07-24 15:20:08 +00:00
rwatson
1aefa1572f Remove duplicate 'or'.
Submitted by:	ru
2006-07-23 21:01:09 +00:00
rwatson
407a79e120 Add additional comments to the top of the UNIX domain socket implementation
providing some high level pointers regarding the implementation.
2006-07-23 20:06:45 +00:00
rwatson
c6cb3a5582 Add two new unpcb flags, UNP_BINDING and UNP_CONNECTING, which will be
used to mark UNIX domain sockets as being in the process of binding or
connecting.  Use these to prevent simultaneous bind or connect
operations by multiple threads or processes on the same socket at the
same time, which closes race conditions present in the UNIX domain
socket implementation since inception.
2006-07-23 12:01:14 +00:00
rwatson
e5969e57e6 Merge unp_bind() into uipc_bind(), as it is called only from uipc_bind(). 2006-07-23 11:02:12 +00:00
rwatson
b383d2c883 Since unp_attach() and unp_detach() are now called only from uipc_attach()
and uipc_detach(), merge them into their calling functions.
2006-07-23 10:25:28 +00:00
rwatson
777bd5286c Move various UNIX socket global variables and sysctls from the middle of
the file to the top.
2006-07-23 10:19:04 +00:00
rwatson
5d4d7953f0 In uipc_send() and uipc_rcvd(), store unp->unp_conn pointer in unp2
while working with the second unpcb to make the code more clear.
2006-07-22 18:41:42 +00:00
rwatson
b176399b59 Re-wrap and other minor formatting and punctuation fixes for UNIX domain
socket comments.
2006-07-22 17:24:55 +00:00
rwatson
720efebbba Change semantics of socket close and detach. Add a new protocol switch
function, pru_close, to notify protocols that the file descriptor or
other consumer of a socket is closing the socket.  pru_abort is now a
notification of close also, and no longer detaches.  pru_detach is no
longer used to notify of close, and will be called during socket
tear-down by sofree() when all references to a socket evaporate after
an earlier call to abort or close the socket.  This means detach is now
an unconditional teardown of a socket, whereas previously sockets could
persist after detach of the protocol retained a reference.

This faciliates sharing mutexes between layers of the network stack as
the mutex is required during the checking and removal of references at
the head of sofree().  With this change, pru_detach can now assume that
the mutex will no longer be required by the socket layer after
completion, whereas before this was not necessarily true.

Reviewed by:	gnn
2006-07-21 17:11:15 +00:00
rwatson
7647a9354e Reduce periods of simultaneous acquisition of various socket buffer
locks and the unplock during uipc_rcvd() and uipc_send() by caching
certain values from one structure while its locks are held, and
applying them to a second structure while its locks are held.  If
done carefully, this should be correct, and will reduce the amount
of work done with the global unp lock held.

Tested by:	kris (earlier version)
2006-07-11 21:49:54 +00:00
rwatson
e4433ceff4 Trim basically unused 'unp' in uipc_connect(). 2006-06-26 16:18:22 +00:00
rwatson
83e5d6bce1 Remove unused (and ifdef'd) unp_abort() and unp_drain().
MFC after:	1 month
2006-06-16 22:11:49 +00:00
maxim
0cf435f016 o There are two methods to get a process credentials over the unix
sockets:

1) A sender sends SCM_CREDS message to a reciever, struct cmsgcred;
2) A reciever sets LOCAL_CREDS socket option and gets sender
credentials in control message, struct sockcred.

Both methods use the same control message type SCM_CREDS with the
same control message level SOL_SOCKET, so they are indistinguishable
for the receiver.  A difference in struct cmsgcred and struct sockcred
layouts may lead to unwanted effects.

Now for sockets with LOCAL_CREDS option remove all previous linked
SCM_CREDS control messages and then add a control message with
struct sockcred so the process specifically asked for the peer
credentials by LOCAL_CREDS option always gets struct sockcred.

PR:		kern/90800
Submitted by:	Andrey Simonenko
Regres. tests:	tools/regression/sockets/unix_cmsg/
MFC after:	1 month
2006-06-13 14:33:35 +00:00
maxim
b583a2a914 Inherit LOCAL_CREDS option from listen socket for sockets returned
by accept(2).

PR:		kern/90644
Submitted by:	Andrey Simonenko
OK'ed by:	mdodd
Tested by:	NetBSD regress/sys/kern/unfdpass/unfdpass.c
MFC after:	1 month
2006-04-24 19:09:33 +00:00
ps
10b2fe8dea Allow for nmbclusters and maxsockets to be increased via sysctl.
An eventhandler is used to update all the various zones that depend
on these values.
2006-04-21 09:25:40 +00:00
rwatson
5479e5d692 Chance protocol switch method pru_detach() so that it returns void
rather than an error.  Detaches do not "fail", they other occur or
the protocol flags SS_PROTOREF to take ownership of the socket.

soclose() no longer looks at so_pcb to see if it's NULL, relying
entirely on the protocol to decide whether it's time to free the
socket or not using SS_PROTOREF.  so_pcb is now entirely owned and
managed by the protocol code.  Likewise, no longer test so_pcb in
other socket functions, such as soreceive(), which have no business
digging into protocol internals.

Protocol detach routines no longer try to free the socket on detach,
this is performed in the socket code if the protocol permits it.

In rts_detach(), no longer test for rp != NULL in detach, and
likewise in other protocols that don't permit a NULL so_pcb, reduce
the incidence of testing for it during detach.

netinet and netinet6 are not fully updated to this change, which
will be in an upcoming commit.  In their current state they may leak
memory or panic.

MFC after:	3 months
2006-04-01 15:42:02 +00:00
rwatson
8622e776f9 Change protocol switch pru_abort() API so that it returns void rather
than an int, as an error here is not meaningful.  Modify soabort() to
unconditionally free the socket on the return of pru_abort(), and
modify most protocols to no longer conditionally free the socket,
since the caller will do this.

This commit likely leaves parts of netinet and netinet6 in a situation
where they may panic or leak memory, as they have not are not fully
updated by this commit.  This will be corrected shortly in followup
commits to these components.

MFC after:      3 months
2006-04-01 15:15:05 +00:00
rwatson
f7a01805a2 Modify UNIX domain sockets to guarantee, and assume, that so_pcb is always
defined for an in-use socket.  This allows us to eliminate countless tests
of whether so_pcb is non-NULL, eliminating dozens of error cases.  For
now, retain the call to sotryfree() in the uipc_abort() path, but this
will eventually move to soabort().

These new assumptions should be largely correct, and will become more so
as the socket/pcb reference model is fixed.  Removing the notion that
so_pcb can be non-NULL is a critical step towards further fine-graining
of the UNIX domain socket locking, as the so_pcb reference no longer
needs to be protected using locks, instead it is a property of the socket
life cycle.
2006-03-17 13:52:57 +00:00
jeff
822d3c8355 - Lock access to vrele() with VFS_LOCK_GIANT() rather than mtx_lock(&Giant).
Sponsored by:	Isilon Systems, Inc.
2006-01-30 08:19:01 +00:00
rwatson
a59d345748 XXX a comment in uipc_usrreq.c that requires updating. 2006-01-13 00:00:32 +00:00
mux
b29e3549b8 Fix a bunch of SYSCTL_INT() that should have been SYSCTL_ULONG() to
match the type of the variable they are exporting.

Spotted by:	Thomas Hurst <tom@hur.st>
MFC after:	3 days
2005-12-14 22:27:48 +00:00
rwatson
9487c057e2 Correct a number of serious and closely related bugs in the UNIX domain
socket file descriptor garbage collection code, which is intended to
detect and clear cycles of orphaned file descriptors that are "in-flight"
in a socket when that socket is closed before they are received.  The
algorithm present was both run at poor times (resulting in recursion and
reentrance), and also buggy in the presence of parallelism.  In order to
fix these problems, make the following changes:

- When there are in-flight sockets and a UNIX domain socket is destroyed,
  asynchronously schedule the garbage collector, rather than running it
  synchronously in the current context.  This avoids lock order issues
  when the garbage collection code reenters the UNIX domain socket code,
  avoiding lock order reversals, deadlocks, etc.  Run the code
  asynchronously in a task queue.

- In the garbage collector, when skipping file descriptors that have
  entered a closing state (i.e., have f_count == 0), re-test the FDEFER
  flag, and decrement unp_defer.  As file descriptors can now transition
  to a closed state, while the garbage collector is running, it is no
  longer the case that unp_defer will remain an accurate count of
  deferred sockets in the mark portion of the GC algorithm.  Otherwise,
  the garbage collector will loop waiting waiting for unp_defer to reach
  zero, which it will never do as it is skipping file descriptors that
  were marked in an earlier pass, but now closed.

- Acquire the UNIX domain socket subsystem lock in unp_discard() when
  modifying the unp_rights counter, or a read/write race is risked with
  other threads also manipulating the counter.

While here:

- Remove #if 0'd code regarding acquiring the socket buffer sleep lock in
  the garbage collector, this is not required as we are able to use the
  socket buffer receive lock to protect scanning the receive buffer for
  in-flight file descriptors on the socket buffer.

- Annotate that the description of the garbage collector implementation
  is increasingly inaccurate and needs to be updated.

- Add counters of the number of deferred garbage collections and recycled
  file descriptors.  This will be removed and is here temporarily for
  debugging purposes.

With these changes in place, the unp_passfd regression test now appears
to be passed consistently on UP and SMP systems for extended runs,
whereas before it hung quickly or panicked, depending on which bug was
triggered.

Reported by:	Philip Kizer <pckizer at nostrum dot com>
MFC after:	2 weeks
2005-11-10 16:06:04 +00:00
rwatson
be4f357149 Normalize a significant number of kernel malloc type names:
- Prefer '_' to ' ', as it results in more easily parsed results in
  memory monitoring tools such as vmstat.

- Remove punctuation that is incompatible with using memory type names
  as file names, such as '/' characters.

- Disambiguate some collisions by adding subsystem prefixes to some
  memory types.

- Generally prefer lower case to upper case.

- If the same type is defined in multiple architecture directories,
  attempt to use the same name in additional cases.

Not all instances were caught in this change, so more work is required to
finish this conversion.  Similar changes are required for UMA zone names.
2005-10-31 15:41:29 +00:00
rwatson
49831ed8da Push the assignment of a new or updated so_qlimit from solisten()
following the protocol pru_listen() call to solisten_proto(), so
that it occurs under the socket lock acquisition that also sets
SO_ACCEPTCONN.  This requires passing the new backlog parameter
to the protocol, which also allows the protocol to be aware of
changes in queue limit should it wish to do something about the
new queue limit.  This continues a move towards the socket layer
acting as a library for the protocol.

Bump __FreeBSD_version due to a change in the in-kernel protocol
interface.  This change has been tested with IPv4 and UNIX domain
sockets, but not other protocols.
2005-10-30 19:44:40 +00:00
rwatson
32e95735e8 Canonicalize the UNIX domain socket copyright layout: original holders
before more recent holders.

MFC after:	3 days
2005-09-23 12:41:06 +00:00
cperciva
a199a4f74b Fix two issues which were missed in FreeBSD-SA-05:08.kmem.
Reported by:	Uwe Doering
2005-05-07 00:41:36 +00:00