299 Commits

Author SHA1 Message Date
Mikolaj Golub
c7e41c8b50 Introduce VOP_UNP_BIND(), VOP_UNP_CONNECT(), and VOP_UNP_DETACH()
operations for setting and accessing vnode's v_socket field.

The operations are necessary to implement proper unix socket handling
on layered file systems like nullfs(5).

This change fixes the long-standing nullfs(5) issue in which unix
sockets did not work between lower and upper layers: if we bound
to a socket on the lower layer we could connect only to the lower
path; if we bound to the upper layer we could connect only to the
upper path. With the new behavior, one can connect to both the lower
and the upper paths regardless of which layer's path one binds to.

PR:		kern/51583, kern/159663
Suggested by:	kib
Reviewed by:	arch
MFC after:	2 weeks
2012-02-29 21:38:31 +00:00
Mikolaj Golub
662c901c54 When detaching a unix domain socket, uipc_detach() checks
unp->unp_vnode pointer to detect if there is a vnode associated with
(bound to) this socket and does necessary cleanup if there is.

The issue is that after forced unmount this check may be too late as
the unp_vnode is reclaimed and the reference is stale.

To fix this provide a helper function that is called on a socket vnode
reclamation to do necessary cleanup.

Pointed by:	kib
Reviewed by:	kib
MFC after:	2 weeks
2012-02-25 10:15:41 +00:00
Mikolaj Golub
a95852edf3 unp_connect() may use a shared lock on the vnode to fetch the socket.
Suggested by:	jhb
Reviewed by:	jhb, kib, rwatson
MFC after:	2 weeks
2012-02-21 19:40:13 +00:00
Ed Schouten
6472ac3d8a Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
2011-11-07 15:43:11 +00:00
Bjoern A. Zeeb
a06534c3c2 Fix handling of corrupt compress(1)ed data. [11:04]
Add missing length checks on unix socket addresses. [11:05]

Approved by:	so (cperciva)
Approved by:	re (kensmith)
Security:	FreeBSD-SA-11:04.compress
Security:	CVE-2011-2895 [11:04]
Security:	FreeBSD-SA-11:05.unix
2011-09-28 08:47:17 +00:00
Konstantin Belousov
aab4f50170 Prevent the hiwatermark for the unix domain socket from becoming
effectively negative. Often seen as upstream fastcgi connection timeouts
in nginx when using sendfile over unix domain sockets for communication.

Sendfile(2) may send more bytes than currently allowed by the
hiwatermark of the socket, e.g. because the so_snd sockbuf lock is
dropped after sbspace() call in the kern_sendfile() loop. In this case,
recalculated hiwatermark will overflow. Since lowatermark is renewed
as half of the hiwatermark by sendfile code, and both are unsigned,
the send buffer never reaches the free space requested by lowatermark,
causing indefinite wait in sendfile.

Reviewed by:	rwatson
Approved by:	re (bz)
MFC after:	2 weeks
2011-08-20 16:12:29 +00:00
Bjoern A. Zeeb
1fb51a12f2 Mfp4 CH=177274,177280,177284-177285,177297,177324-177325
VNET socket push back:
  try to minimize the number of places where we have to switch vnets
  and narrow down the time we stay switched.  Add assertions to the
  socket code to catch possibly unset vnets as seen in r204147.

  While this reduces the number of vnet recursions in some places like
  NFS, POSIX local sockets and some netgraph, .. recursions are
  impossible to fix.

  The current expectations are documented at the beginning of
  uipc_socket.c along with the other information there.

  Sponsored by: The FreeBSD Foundation
  Sponsored by: CK Software GmbH
  Reviewed by:  jhb
  Tested by:    zec

Tested by:	Mikolaj Golub (to.my.trociny gmail.com)
MFC after:	2 weeks
2011-02-16 21:29:13 +00:00
Konstantin Belousov
f7780c61e7 The unp_gc() function drops and reacquires the lock between scan and
collect phases.  The unp_discard() function executes
unp_externalize_fp(), which might make the socket eligible for gc-ing,
and then, later, taskqueue will close the socket.  Since unp_gc()
dropped the list lock to do the malloc, close might happen after the
mark step but before the collection step, causing collection to not
find the socket and miss one array element.

I believe that the race was there before r216158, but the stated
revision made the window much wider by postponing the close to
taskqueue sometimes.

Only process as many array elements as sockets found during the
second phase of gc [1].  Take linkage lock and recheck the eligibility
of the socket for gc, as well as call fhold() under the linkage lock.

Reported and tested by:	jmallett
Submitted by:   jmallett [1]
Reviewed by:	rwatson, jeff (possibly)
MFC after:	1 week
2011-02-01 13:33:49 +00:00
Matthew D Fleming
2fee06f087 Specify a CTLTYPE_FOO so that a future sysctl(8) change does not need
to rely on the format string.
2011-01-18 21:14:18 +00:00
Konstantin Belousov
9f4ba450f2 Trim whitespaces at the end of lines. Use the commit to record
proper log message for r216150.

MFC after:	1 week

If a unix socket has a unix socket attached as the rights that has a
unix socket attached as the rights that has a unix socket attached as
the rights ..., the kernel may overflow the stack on an attempt to
close such a socket.

Only close the rights file in the context of the current close if the
file is not unix domain socket. Otherwise, postpone the work to
taskqueue, preventing unlimited recursion.

Passing unix domain sockets in SCM_RIGHTS control messages is not
widely used, and moreover, closing a socket with rights still
attached is mostly an application failure. The change
should not affect the performance of typical users of SCM_RIGHTS.

Reviewed by:	jeff, rwatson
2010-12-03 20:39:06 +00:00
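The recursion hazard described in 9f4ba450f2 can be sketched in userspace. This is only an illustration under stated assumptions: struct node, its fields, and close_deferred() are hypothetical stand-ins, and a simple in-function queue plays the role that the commit assigns to a taskqueue.

```c
#include <stddef.h>

/* Hypothetical stand-in for a socket that carries another socket as rights. */
struct node {
	struct node *rights;		/* attached in-flight socket, or NULL */
	struct node *next_deferred;	/* linkage on the deferred-close queue */
	int closed;
};

/*
 * Close 'head' and everything reachable through 'rights' without
 * recursing: nested sockets are queued and handled by the loop, so
 * stack usage stays constant however deep the nesting is.
 * Returns the number of nodes closed.
 */
size_t
close_deferred(struct node *head)
{
	struct node *queue = head;
	size_t n = 0;

	if (head != NULL)
		head->next_deferred = NULL;
	while (queue != NULL) {
		struct node *cur = queue;

		queue = cur->next_deferred;
		cur->closed = 1;
		n++;
		/* Defer the nested close instead of calling ourselves. */
		if (cur->rights != NULL && !cur->rights->closed) {
			cur->rights->next_deferred = queue;
			queue = cur->rights;
		}
	}
	return (n);
}
```

A recursive version would use one stack frame per nesting level, which is exactly what an attacker-controlled chain of attached sockets could exploit.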
Konstantin Belousov
0cb64678bc Reviewed by: jeff, rwatson
MFC after:	1 week
2010-12-03 16:15:44 +00:00
Edward Tomasz Napierala
175389cff2 Remove spurious '/*-' marks and fix some other style problems.
Submitted by:	bde@
2010-07-22 05:42:29 +00:00
Edward Tomasz Napierala
1a996ed1d8 Revert r210225 - turns out I was wrong; the "/*-" is not license-only
thing; it's also used to indicate that the comment should not be automatically
rewrapped.

Explained by:	cperciva@
2010-07-18 20:57:53 +00:00
Edward Tomasz Napierala
805cc58ac0 The "/*-" comment marker is supposed to denote copyrights. Remove non-copyright
occurrences from sys/sys/ and sys/kern/.
2010-07-18 20:23:10 +00:00
Robert Watson
604f19c91e Fix build on amd64, where sysctl arg1 is a pointer.
Reported by:	Mr Tinderbox
MFC after:	3 months
2009-10-05 22:23:12 +00:00
Robert Watson
84d61770bc First cut at implementing SOCK_SEQPACKET support for UNIX (local) domain
sockets.  This allows for reliable bi-directional datagram communication
over UNIX domain sockets, in contrast to SOCK_DGRAM (M:N, unreliable) or
SOCK_STREAM (bi-directional bytestream).  Largely, this reuses existing
UNIX domain socket code.  This allows applications requiring record-
oriented semantics to do so reliably via local IPC.

Some implementation notes (also present in XXX comments):

- Currently we lack an sbappend variant able to do datagrams and control
  data without doing addresses, so we mark SOCK_SEQPACKET as PR_ADDR.
  Adding a new variant will solve this problem.

- UNIX domain sockets on FreeBSD provide back-pressure/flow control
  notification for stream sockets by manipulating the send socket
  buffer's size during pru_send and pru_rcvd.  This trick works less well
  for SOCK_SEQPACKET as sosend_generic() uses sb_hiwat not just to
  manage blocking, but also to determine maximum datagram size.  Fixing
  this requires rethinking how back-pressure is done for SOCK_SEQPACKET;
  in the meantime, it's possible to get EMSGSIZE when buffers fill,
  instead of blocking.

Discussed with:	benl
Reviewed by:	bz, rpaulo
MFC after:	3 months
Sponsored by:	Google
2009-10-05 14:49:16 +00:00
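The record-oriented semantics introduced by 84d61770bc can be observed from userspace with a short sketch. It uses only the standard socketpair(2), send(2), recv(2), and close(2) calls; the function name seqpacket_roundtrip() is hypothetical, and the sketch assumes a system whose UNIX domain sockets support SOCK_SEQPACKET.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Send two records over a SOCK_SEQPACKET socketpair and read them back.
 * Returns 0 on success and stores the length of each received record;
 * each recv() is expected to return one whole record, not a byte stream.
 */
int
seqpacket_roundtrip(ssize_t *first, ssize_t *second)
{
	int sv[2];
	char buf[64];

	if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv) == -1)
		return (-1);
	if (send(sv[0], "abc", 3, 0) != 3 ||
	    send(sv[0], "defgh", 5, 0) != 5) {
		close(sv[0]);
		close(sv[1]);
		return (-1);
	}
	/* Record boundaries are preserved even though buf could hold both. */
	*first = recv(sv[1], buf, sizeof(buf), 0);
	*second = recv(sv[1], buf, sizeof(buf), 0);
	close(sv[0]);
	close(sv[1]);
	return (0);
}
```

With SOCK_STREAM in place of SOCK_SEQPACKET, the first recv() would be free to return all eight bytes at once.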
Robert Watson
530c006014 Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks.  Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-08-01 19:26:27 +00:00
Jamie Gritton
399645e1b9 Remove unnecessary/redundant includes.
Approved by:	bz (mentor)
2009-06-23 14:39:21 +00:00
John Baldwin
afd9f91c63 Fix a deadlock in the getpeername() method for UNIX domain sockets.
Instead of locking the local unp followed by the remote unp, use the same
locking model as accept() and read lock the global link lock followed by
the remote unp while fetching the remote sockaddr.

Reported by:	Mel Flynn  mel.flynn of mailing.thruhere.net
Reviewed by:	rwatson
MFC after:	1 week
2009-06-18 20:56:22 +00:00
Robert Watson
bcf11e8d00 Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with:	pjd
2009-06-05 14:55:22 +00:00
Robert Watson
f93bfb23dc Add internal 'mac_policy_count' counter to the MAC Framework, which is a
count of the number of registered policies.

Rather than unconditionally locking sockets before passing them into MAC,
lock them in the MAC entry points only if mac_policy_count is non-zero.

This avoids locking overhead for a number of socket system calls when no
policies are registered, eliminating measurable overhead for the MAC
Framework for the socket subsystem when there are no active policies.

Possibly socket locks should be acquired by policies if they are required
for socket labels, which would further avoid locking overhead when there
are policies but they don't require labeling of sockets, or possibly
don't even implement socket controls.

Obtained from:	TrustedBSD Project
2009-06-02 18:26:17 +00:00
Marko Zec
21ca7b57bd Change the curvnet variable from a global const struct vnet *,
previously always pointing to the default vnet context, to a
dynamically changing thread-local one.  The curvnet context
should be set on entry to networking code via CURVNET_SET() macros,
and reverted to previous state via CURVNET_RESTORE().  Recursions
on curvnet are permitted, though strongly discouraged.

This change should have no functional impact on nooptions VIMAGE
kernel builds, where CURVNET_* macros expand to whitespace.

The curthread->td_vnet (aka curvnet) variable's purpose is to be an
indicator of the vnet context in which the current network-related
operation takes place, in case we cannot deduce the current vnet
context from any other source, such as by looking at mbuf's
m->m_pkthdr.rcvif->if_vnet, socket's so->so_vnet etc.  Moreover, so
far curvnet has turned out to be an invaluable consistency checking
aid: it helps to catch cases when sockets, ifnets or any other
vnet-aware structures may have leaked from one vnet to another.

The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros
was a result of an empirical iterative process, with an aim to
reduce recursions on CURVNET_SET() to a minimum, while still reducing
the scope of CURVNET_SET() to networking only operations - the
alternative would be calling CURVNET_SET() on each system call entry.
In general, curvnet has to be set in three typical cases: when
processing socket-related requests from userspace or from within the
kernel; when processing inbound traffic flowing from device drivers
to upper layers of the networking stack, and when executing
timer-driven networking functions.

This change also introduces a DDB subcommand to show the list of all
vnet instances.

Approved by:	julian (mentor)
2009-05-05 10:56:12 +00:00
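The set/restore discipline described in 21ca7b57bd can be sketched with a thread-local pointer. The macros CTX_SET() and CTX_RESTORE(), the variable cur_ctx, and both functions are hypothetical userspace stand-ins and are not the kernel's CURVNET_* definitions; the sketch only shows why a nested set followed by a restore returns the caller to its own context.

```c
#include <stddef.h>

struct vnet {
	int id;
};

/* Hypothetical stand-in for the per-thread current-context pointer. */
static _Thread_local struct vnet *cur_ctx;

/* Save the previous context in a local, then switch to 'v'. */
#define	CTX_SET(v)	struct vnet *saved_ctx = cur_ctx; cur_ctx = (v)
/* Revert to whatever context was current before CTX_SET(). */
#define	CTX_RESTORE()	cur_ctx = saved_ctx

int
inner_op(struct vnet *v)
{
	CTX_SET(v);
	int seen = cur_ctx->id;

	CTX_RESTORE();
	return (seen);
}

int
outer_op(struct vnet *outer, struct vnet *inner)
{
	CTX_SET(outer);
	/* A recursion on the context: permitted, but best avoided. */
	int nested = inner_op(inner);
	/* The inner restore has put 'outer' back in place. */
	int after = cur_ctx->id;

	CTX_RESTORE();
	return (nested * 100 + after);
}
```

Because each set saves the previous value in a local variable, restores unwind in the reverse order of the sets without any global bookkeeping.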
Robert Watson
885868cd8f Remove VOP_LEASE and supporting functions. This hasn't been used since
the removal of NQNFS, but was left in in case it was required for NFSv4.
Since our new NFSv4 client and server can't use it for their
requirements, GC the old mechanism, as well as other unused lease-
related code and interfaces.

Due to its impact on kernel programming and binary interfaces, this
change should not be MFC'd.

Proposed by:    jeff
Reviewed by:    jeff
Discussed with: rmacklem, zach loafman @ isilon
2009-04-10 10:52:19 +00:00
Robert Watson
3dab55bc86 Decompose the global UNIX domain sockets rwlock into two different
locks: a global list/counter/generation counter protected by a new
mutex unp_list_lock, and a global linkage rwlock, unp_global_rwlock,
which protects the connections between UNIX domain sockets.

This eliminates conditional lock acquisition that was previously a
property of the global lock being held over sonewconn() leading to a
call to uipc_attach(), which also required the global lock, but
couldn't rely on it as other paths existed to uipc_attach() that
didn't hold it: now uipc_attach() uses only the list lock, which
follows the linkage lock in the lock order.  It may also reduce
contention on the global lock for some workloads.

Add global UNIX domain socket locks to hard-coded witness lock
order.

MFC after:	1 week
Discussed with:	kris
2009-03-08 21:48:29 +00:00
Robert Watson
b523ec24b9 White space and comment tweaks.
MFC after:	3 weeks
2009-01-01 20:03:22 +00:00
Robert Watson
a9f3c7d2ff Rename mbcnt to mbcnt_delta in uipc_send() -- unlike other local
variables named mbcnt in uipc_usrreq.c, this instance is a delta
rather than a cache of sb_mbcnt.

MFC after:	3 weeks
2008-12-30 16:09:57 +00:00
Dag-Erling Smørgrav
1ede983cc9 Retire the MALLOC and FREE macros. They are an abomination unto style(9).
MFC after:	3 months
2008-10-23 15:53:51 +00:00
Robert Watson
4b0e2b9add Remove stale comment: while uipc_connect2() was, until recently, not
static so it could be used by fifofs (actually portalfs), it is now
static.

Submitted by:	kensmith
2008-10-11 17:28:22 +00:00
Robert Watson
e298cf5902 Remove stale comment (and XXX saying so) about why we zero the file
descriptor pointer in unp_freerights: we can no longer recurse into
unp_gc due to unp_gc being invoked in a deferred way, but it's still
a good idea.

MFC after:	3 days
2008-10-08 06:26:51 +00:00
Robert Watson
fa9402f28a Differentiate pr_usrreqs for stream and datagram UNIX domain sockets, and
employ soreceive_dgram for the datagram case.

MFC after:	3 months
2008-10-08 06:19:49 +00:00
Robert Watson
2c8995842c Now that portalfs doesn't directly invoke uipc_connect2(), make it a
static symbol.

MFC after:	3 days
2008-10-06 18:43:11 +00:00
Robert Watson
0b36cd25fc Further minor cleanups to UNIX domain sockets:
- Staticize and locally prototype functions uipc_ctloutput(), unp_dispose(),
  unp_init(), and unp_externalize(), none of which have been required
  outside of uipc_usrreq.c since uipc_proto.c was removed.
- Remove stale prototype for uipc_usrreq(), which has not existed in the
  code since 1997
- Forward declare and staticize uipc_usrreqs structure in uipc_usrreq.c and
  not un.h.
- Comment on why uipc_connect2() is still non-static -- it is used directly
  by fifofs.
- Remove stale comments, tidy up whitespace.

MFC after:	3 days (where applicable)
2008-10-03 13:01:56 +00:00
Robert Watson
60a5ef26a1 Remove or update several stale comments.
A bit of whitespace/style cleanup.

Update copyright.

MFC after:	3 days (applicable changes)
2008-10-03 09:01:55 +00:00
Tom Rhodes
be6b130476 Fill in a few sysctl descriptions.
Approved by:	rwatson
2008-07-26 00:55:35 +00:00
Ed Maste
7928893d83 Use bcopy instead of strlcpy in uipc_bind and unp_connect, since
soun->sun_path isn't a null-terminated string.  As UNIX(4) states, "the
terminating NUL is not part of the address."  Since strlcpy has to return
"the total length of the string [it] tried to create," it walks off the end
of soun->sun_path looking for a \0.

This reverts r105332.

Reported by:    Ryan Stone
2008-07-03 23:26:10 +00:00
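The reasoning in 7928893d83 can be shown with a length-bounded copy. This is a minimal sketch, not the kernel code: copy_sun_path() and PATH_BUF are hypothetical, the buffer size is an assumption made for the example, and memcpy() stands in for bcopy().

```c
#include <stddef.h>
#include <string.h>

#define	PATH_BUF	104	/* assumed destination size for this sketch */

/*
 * Copy a path of exactly 'len' bytes that is NOT NUL-terminated into
 * 'dst' and terminate it there.  Returns 0 on success, or -1 if the
 * path plus terminator does not fit.
 */
int
copy_sun_path(char dst[PATH_BUF], const char *src, size_t len)
{

	if (len >= PATH_BUF)
		return (-1);
	/* Bounded by 'len': never reads past the last byte of the source. */
	memcpy(dst, src, len);
	dst[len] = '\0';
	return (0);
}
```

A strlcpy()-style copy has to find the end of the source string to compute its return value, so it would keep reading beyond 'len' bytes when no terminator is present.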
Robert Watson
8c96f9c193 Move unlock of global UNIX domain socket lock slightly lower in
unp_connect(): it is expected to return with the lock held, and two
possible error paths otherwise returned with it unlocked.

The fix committed here is slightly different from the patch in the
PR, but along an alternative line suggested in the PR.

PR:		119778
MFC after:	3 days
Submitted by:	James Juran <james dot juran at baesystems dot com>
2008-01-18 19:16:03 +00:00
Attilio Rao
22db15c06f VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjunction with 'thread' argument passing which is always curthread.
Remove the unneeded extra argument and pass curthread explicitly to lower
layer functions, when necessary.

The KPI is broken by this change, which should affect several ports, so
a version bump and manpage update will be committed separately.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
2008-01-13 14:44:15 +00:00
Robert Watson
8a69e5fa71 Remove "lock pushdown" todo item in comment -- I did that for 7.0.
MFC after:	3 weeks
2008-01-10 12:38:17 +00:00
Robert Watson
a635784569 Correct typos in comments.
MFC after:	3 weeks
2008-01-10 12:29:12 +00:00
Jeff Roberson
41e0f66d41 - Place the fhold() in unp_internalize_fp to be more consistent with refs.
 - Clear all of the gc flags before doing a run.  Stale flags were causing
   us to skip some descriptors.
 - If a unp socket has been marked REF in a gc pass it can't be dead.

Found by:	rwatson's test tool.
2008-01-01 01:46:42 +00:00
Jeff Roberson
6f552cb098 - Check the correct variable against NULL in two places.
 - If the unp_file is NULL that means it has never been internalized and it
   must be reachable.
2007-12-31 03:44:54 +00:00
Jeff Roberson
397c19d175 Remove explicit locking of struct file.
 - Introduce a finit() which is used to initialize the fields of struct file
   in such a way that the ops vector is only valid after the data, type,
   and flags are valid.
 - Protect f_flag and f_count with atomic operations.
 - Remove the global list of all files and associated accounting.
 - Rewrite the unp garbage collection such that it no longer requires
   the global list of all files and instead uses a list of all unp sockets.
 - Mark sockets in the accept queue so we don't incorrectly gc them.

Tested by:	kris, pho
2007-12-30 01:42:15 +00:00
Robert Watson
30d239bc4c Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

  mac_<object>_<method/action>
  mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme.  Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier.  Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods.  Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updated as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by:	SPARTA (original patches against Mac OS X)
Obtained from:	TrustedBSD Project, Apple Computer
2007-10-24 19:04:04 +00:00
Pawel Jakub Dawidek
57fd3d5572 When we do open, we should lock the vnode exclusively. This fixes a few races:
- fifo race, where two threads assign v_fifoinfo,
- v_writecount modifications,
- v_object modifications,
- and probably more...

Discussed with:	kib, ups
Approved by:	re (rwatson)
2007-07-26 16:58:09 +00:00
Robert Watson
03c96c3176 Add DDB "show unpcb" command, allowing DDB to print out many pertinent
details from UNIX domain socket protocol layer state.
2007-05-29 12:36:00 +00:00
Robert Watson
08d73f1370 Remove one more stale comment regarding unpcb type-safety.
2007-05-11 12:28:45 +00:00
Robert Watson
d7924b7086 Clarify and update quite a few comments to reflect locking optimizations,
the addition of unpcb refcounts, and bug fixes.  Some of these fixes are
appropriate for MFC.

MFC after:	3 days
2007-05-11 12:10:45 +00:00
Wojciech A. Koszek
9e2894466a Don't acquire Giant unconditionally.
Reviewed by:	rwatson
2007-05-06 12:00:38 +00:00
Robert Watson
5e3f7694b1 Replace custom file descriptor array sleep lock constructed using a mutex
and flags with an sxlock.  This leads to a significant and measurable
performance improvement as a result of access to shared locking for
frequent lookup operations, reduced general overhead, and reduced overhead
in the event of contention.  All of these are important for threaded
applications where simultaneous access to a shared file descriptor array
occurs frequently.  Kris has reported 2x-4x transaction rate improvements
on 8-core MySQL benchmarks; smaller improvements can be expected for many
workloads as a result of reduced overhead.

- Generally eliminate the distinction between "fast" and regular
  acquisition of the filedesc lock; the plan is that they will now all
  be fast.  Change all locking instances to either shared or exclusive
  locks.

- Correct a bug (pointed out by kib) in fdfree() where previously msleep()
  was called without the mutex held; sx_sleep() is now always called with
  the sxlock held exclusively.

- Universally hold the struct file lock over changes to struct file,
  rather than the filedesc lock or no lock.  Always update the f_ops
  field last. A further memory barrier is required here in the future
  (discussed with jhb).

- Improve locking and reference management in linux_at(), which fails to
  properly acquire vnode references before using vnode pointers.  Annotate
  improper use of vn_fullpath(), which will be replaced at a future date.

In fcntl(), we conservatively acquire an exclusive lock, even though in
some cases a shared lock may be sufficient, which should be revisited.
The dropping of the filedesc lock in fdgrowtable() is no longer required
as the sxlock can be held over the sleep operation; we should consider
removing that (pointed out by attilio).

Tested by:	kris
Discussed with:	jhb, kris, attilio, jeff
2007-04-04 09:11:34 +00:00
Robert Watson
6e2faa2444 In uipc_close(), we no longer always free the unpcb, as the last reference
may be dropped later.  In this case, always unlock the unpcb so as not to
leak the lock.

Found by:	kris (BugMagnet)
2007-03-12 14:52:00 +00:00
Robert Watson
ede6e136f8 Remove two simultaneous acquisitions of multiple unpcb locks from
uipc_send in cases where only a global read lock is held by breaking
them out and avoiding the unpcb lock acquire in the common case.  This
avoids deadlocks which manifested with X11, and should also marginally
further improve performance.

Reported by:	sepotvin, brooks
2007-03-01 09:00:42 +00:00
Robert Watson
3592fd4de5 Lock unp2 after checking for a non-NULL unp2 pointer in uipc_send() on
datagram UNIX domain sockets, not before.
2007-02-28 08:08:50 +00:00
Robert Watson
e7c33e29ed Revise locking strategy used for UNIX domain sockets in order to improve
concurrency:

- Add per-unpcb mutexes protecting unpcb connection state, fields, etc.

- Replace global UNP mutex with a global UNP rwlock, which will protect the
  UNIX domain socket connection topology, v_socket, and be acquired
  exclusively before acquiring more than per-unpcb at a time in order to
  avoid lock order issues.

In performance measurements involving MySQL, this change has little or no
overhead on UP (+/- 1%), but leads to a significant (5%-30%) improvement in
multi-processor measurements using the sysbench and supersmack benchmarks.

Much testing by:	kris
Approved by:		re (kensmith)
2007-02-26 20:47:52 +00:00
Robert Watson
6fac927ccc Add an additional MAC check to the UNIX domain socket connect path:
check that the subject has read/write access to the vnode using the
vnode MAC check.

MFC after:	3 weeks
Submitted by:	Spencer Minear <spencer_minear at securecomputing dot com>
Obtained from:	TrustedBSD Project
2007-02-22 09:37:44 +00:00
Robert Watson
5b950deabc Break introductory comment into two paragraphs to separate material on the
garbage collection complications from general discussion of UNIX domain
sockets.

Staticize unp_addsockcred().

Remove XXX comment regarding Giant and v_socket -- v_socket is protected
by the global UNIX domain socket lock.
2007-02-20 10:50:02 +00:00
Robert Watson
aea52f1bf8 Minor rearrangement of global variables, comments, etc, in UNIX domain
sockets.
2007-02-14 15:05:40 +00:00
Robert Watson
46a1d9bfe8 Change unp_mtx to support recursion, and do not drop the unp_mtx over
sonewconn() in unp_connect().  This avoids a race that occurs due to
v_socket being an uncounted reference, as the lock was being released in
order to call sonewconn(), which otherwise recurses into the UNIX domain
socket code via pru_attach, as well as holding the lock over a sleeping
memory allocation in uipc_attach().  Switch to a non-sleeping memory
allocation during UNIX domain socket attach.

This fix is non-ideal in that it requires enabling recursion, but is a much
smaller change than moving to using true references for v_socket.  The
reported panic occurs in unp_connect() following the return of
sonewconn().

Update copyright year.

Panic reported by:      jhb
2007-02-14 12:22:11 +00:00
Robert Watson
05102f04d5 Set UNP_CONNECTING when committing to moving ahead in unp_connect().
This logic was lost when merging the remainder of these changes in
1.178.
2007-02-13 21:00:57 +00:00
Robert Watson
1f837c4753 Push UNIX domain socket locking further into uipc_ctloutput() in order to
avoid holding the UNIX domain socket subsystem lock over sooptcopyin()
and sooptcopyout().  This problem was introduced when LOCAL_CREDS and
LOCAL_CONNWAIT support were added.

Reviewed by:	mdodd
2007-02-06 14:31:37 +00:00
Robert Watson
abdeb3b01f Canonicalize copyrights in some files I hold copyrights on:
- Sort by date in license blocks, oldest copyright first.
- All rights reserved after all copyrights, not just the first.
- Use (c) to be consistent with other entries.

MFC after:	3 days
2007-01-08 17:49:59 +00:00
John Baldwin
9ae328fc8f - Close a race between enumerating UNIX domain socket pcb structures via
  sysctl and socket teardown by adding a reference count to the UNIX domain
  pcb object and fixing the sysctl that enumerates unpcbs to grab a
  reference on each unpcb while it builds the list to copy out to userland.
- Close a race between UNIX domain pcb garbage collection (unp_gc()) and
  file descriptor teardown (fdrop()) by adding a new garbage collection
  flag FWAIT.  unp_gc() sets FWAIT while it walks the message buffers
  in a UNIX domain socket looking for nested file descriptor references
  and clears the flag when it is finished.  fdrop() checks to see if the
  flag is set on a file descriptor whose refcount just dropped to 0 and
  waits for unp_gc() to clear the flag before completely destroying the
  file descriptor.

MFC after:	1 week
Reviewed by:	rwatson
Submitted by:	ups
Hopefully makes the panics go away:	mx1
2007-01-05 19:59:46 +00:00
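The first fix in 9ae328fc8f relies on a reference count keeping an object alive while another party still uses it. A minimal single-threaded sketch follows; struct obj and the obj_*() functions are hypothetical, and in the kernel the count would be manipulated under the appropriate lock, which this sketch omits.

```c
#include <stdlib.h>

/* Hypothetical reference-counted object standing in for a pcb. */
struct obj {
	unsigned int refcount;
};

/* Allocate an object holding one reference for its creator. */
struct obj *
obj_alloc(void)
{
	struct obj *o = calloc(1, sizeof(*o));

	if (o != NULL)
		o->refcount = 1;
	return (o);
}

/* Take an extra reference, e.g. while building a list to copy out. */
void
obj_hold(struct obj *o)
{

	o->refcount++;
}

/* Drop a reference; free on the last one.  Returns 1 if freed. */
int
obj_rele(struct obj *o)
{

	if (--o->refcount == 0) {
		free(o);
		return (1);
	}
	return (0);
}
```

While the enumerator holds its reference, a concurrent teardown only drops the count and the memory stays valid until the enumerator releases it.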
Robert Watson
aed5570872 Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h.  sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from:	TrustedBSD Project
Sponsored by:	SPARTA
2006-10-22 11:52:19 +00:00
Robert Watson
b7e2f3ec76 Minor white space tweaks.
2006-08-13 23:16:59 +00:00
Robert Watson
e4445a031f Move definition of UNIX domain socket protosw and domain entries from
uipc_proto.c to uipc_usrreq.c, making localdomain static.  Remove
uipc_proto.c as it's no longer used.  With this change, UNIX domain
sockets are entirely encapsulated in uipc_usrreq.c.
2006-08-07 12:02:43 +00:00
Robert Watson
52b384621e Don't set pru_sosend, pru_soreceive, pru_sopoll to default values, as they
are already set to default values.
2006-08-06 10:39:21 +00:00
Robert Watson
f8b20fb6d6 Remove now unneeded ENOTCONN clause from SOCK_DGRAM side of uipc_send():
we have to check it regardless of the target address, so don't check it
twice.
2006-08-02 14:30:58 +00:00
Robert Watson
b5ff091431 Close a race that occurs when using sendto() to connect and send on a
UNIX domain socket at the same time as the remote host is closing the
new connections as quickly as they open.  Since the connect() and
send() paths are non-atomic with respect to another, it is possible
for the second thread's close() call to disconnect the two sockets
as connect() returns, leading to the consumer (which plans to send())
with a NULL kernel pointer to its proposed peer.  As a result, after
acquiring the UNIX domain socket subsystem lock, we need to revalidate
the connection pointers even though connect() has technically succeeded,
and return an error to say that there's no connection on which to
perform the send.

We might want to rethink the specific errno number, perhaps ECONNRESET
would be better.

PR:		100940
Reported by:	Young Hyun <youngh at caida dot org>
MFC after:	2 weeks
MFC note:	Some adaptation will be required
2006-07-31 23:00:05 +00:00
Robert Watson
0075d85869 Remove call to soisdisconnected() in uipc_detach(), since it will already
have been invoked by uipc_close() or uipc_abort(), and the socket is in a
state of being torn down by the time we get to this point, so kqueue
state frobbed by soisdisconnected() is not available, so frobbing it will
result in a panic.

Reported by:	Munehiro Matsuda <haro at h4 dot dion dot ne dot jp>
2006-07-26 19:16:34 +00:00
Robert Watson
b0668f7151 soreceive_generic(), and sopoll_generic(). Add new functions sosend(),
soreceive(), and sopoll(), which are wrappers for pru_sosend,
pru_soreceive, and pru_sopoll, and are now used universally by socket
consumers rather than either directly invoking the old so*() functions
or directly invoking the protocol switch method (about an even split
prior to this commit).

This completes an architectural change that was begun in 1996 to permit
protocols to provide substitute implementations, as now used by UDP.
Consumers now uniformly invoke sosend(), soreceive(), and sopoll() to
perform these operations on sockets -- in particular, distributed file
systems and socket system calls.

Architectural head nod:	sam, gnn, wollman
2006-07-24 15:20:08 +00:00
Robert Watson
ca948c5e93 Remove duplicate 'or'.
Submitted by:	ru
2006-07-23 21:01:09 +00:00
Robert Watson
f23929fbc5 Add additional comments to the top of the UNIX domain socket implementation
providing some high level pointers regarding the implementation.
2006-07-23 20:06:45 +00:00
Robert Watson
4f1f0ef523 Add two new unpcb flags, UNP_BINDING and UNP_CONNECTING, which will be
used to mark UNIX domain sockets as being in the process of binding or
connecting.  Use these to prevent simultaneous bind or connect
operations by multiple threads or processes on the same socket at the
same time, which closes race conditions present in the UNIX domain
socket implementation since inception.
2006-07-23 12:01:14 +00:00
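The in-progress flags introduced by 4f1f0ef523 can be sketched as a simple guard. Everything here is a hypothetical illustration: X_BINDING, struct pcb, bind_begin(), and bind_end() are invented names, the errno values are assumptions, and the check-and-set would need to happen under a lock in real concurrent code.

```c
#include <errno.h>

#define	X_BINDING	0x01	/* hypothetical flag: bind in progress */

struct pcb {
	int flags;
	int bound;
};

/*
 * Begin a bind.  Fails if another bind on the same pcb is still in
 * progress or if the pcb is already bound.  Returns 0 or an errno.
 */
int
bind_begin(struct pcb *p)
{

	if (p->flags & X_BINDING)
		return (EALREADY);
	if (p->bound)
		return (EINVAL);
	p->flags |= X_BINDING;
	return (0);
}

/* Finish a bind: clear the in-progress flag and record the outcome. */
void
bind_end(struct pcb *p, int success)
{

	p->flags &= ~X_BINDING;
	if (success)
		p->bound = 1;
}
```

The flag lets the long-running part of the operation proceed without holding a lock the whole time, while a second caller is turned away immediately.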
Robert Watson
dd47f5ca9c Merge unp_bind() into uipc_bind(), as it is called only from uipc_bind().
2006-07-23 11:02:12 +00:00
Robert Watson
6d32873c29 Since unp_attach() and unp_detach() are now called only from uipc_attach()
and uipc_detach(), merge them into their calling functions.
2006-07-23 10:25:28 +00:00
Robert Watson
7e711c3aae Move various UNIX socket global variables and sysctls from the middle of
the file to the top.
2006-07-23 10:19:04 +00:00
Robert Watson
f3f49bbbe8 In uipc_send() and uipc_rcvd(), store unp->unp_conn pointer in unp2
while working with the second unpcb to make the code more clear.
2006-07-22 18:41:42 +00:00
Robert Watson
1c381b19ff Re-wrap and other minor formatting and punctuation fixes for UNIX domain
socket comments.
2006-07-22 17:24:55 +00:00
Robert Watson
a152f8a361 Change semantics of socket close and detach. Add a new protocol switch
function, pru_close, to notify protocols that the file descriptor or
other consumer of a socket is closing the socket.  pru_abort is now a
notification of close also, and no longer detaches.  pru_detach is no
longer used to notify of close, and will be called during socket
tear-down by sofree() when all references to a socket evaporate after
an earlier call to abort or close the socket.  This means detach is now
an unconditional teardown of a socket, whereas previously sockets could
persist after detach if the protocol retained a reference.

This facilitates sharing mutexes between layers of the network stack as
the mutex is required during the checking and removal of references at
the head of sofree().  With this change, pru_detach can now assume that
the mutex will no longer be required by the socket layer after
completion, whereas before this was not necessarily true.

Reviewed by:	gnn
2006-07-21 17:11:15 +00:00
Robert Watson
337cc6b60e Reduce periods of simultaneous acquisition of various socket buffer
locks and the unplock during uipc_rcvd() and uipc_send() by caching
certain values from one structure while its locks are held, and
applying them to a second structure while its locks are held.  If
done carefully, this should be correct, and will reduce the amount
of work done with the global unp lock held.

Tested by:	kris (earlier version)
2006-07-11 21:49:54 +00:00
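The caching pattern described in 337cc6b60e can be sketched in userspace as follows. This is a simplified illustration, not the kernel's uipc_rcvd(): the demo_* names, the toy spin lock, and the flow-control arithmetic are all assumptions chosen to show the shape of the technique, which is to snapshot a value under the first lock, drop that lock, and only then take the second lock to apply it.

```c
#include <stdatomic.h>

/* Toy spin lock standing in for the kernel's socket buffer mutexes. */
struct demo_lock {
	atomic_flag held;
};

static void
demo_lock_init(struct demo_lock *l)
{
	atomic_flag_clear(&l->held);
}

static void
demo_lock_acquire(struct demo_lock *l)
{
	while (atomic_flag_test_and_set(&l->held))
		;	/* spin until the holder releases */
}

static void
demo_lock_release(struct demo_lock *l)
{
	atomic_flag_clear(&l->held);
}

/* Hypothetical, reduced stand-in for a socket buffer. */
struct demo_sockbuf {
	struct demo_lock lock;
	unsigned int cc;	/* bytes currently queued */
	unsigned int hiwat;	/* high-water mark */
};

/*
 * After the receiver has consumed data, adjust the peer's send limit.
 * The two buffer locks are never held at the same time.
 */
static void
demo_rcvd(struct demo_sockbuf *rcv, struct demo_sockbuf *peer_snd,
    unsigned int base_hiwat)
{
	unsigned int queued;

	demo_lock_acquire(&rcv->lock);
	queued = rcv->cc;		/* cache while rcv is locked */
	demo_lock_release(&rcv->lock);

	demo_lock_acquire(&peer_snd->lock);
	peer_snd->hiwat = base_hiwat > queued ? base_hiwat - queued : 0;
	demo_lock_release(&peer_snd->lock);
}
```

The cached value can be stale by the time it is applied, which is the "if done carefully" caveat in the commit message: the pattern is only correct when a slightly out-of-date snapshot is acceptable.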
Robert Watson
e83b30bdcb Trim basically unused 'unp' in uipc_connect(). 2006-06-26 16:18:22 +00:00
Robert Watson
9a44cbf19c Remove unused (and ifdef'd) unp_abort() and unp_drain().
MFC after:	1 month
2006-06-16 22:11:49 +00:00
Maxim Konovalov
70df31f4de o There are two methods to get a process's credentials over unix
sockets:

1) A sender sends SCM_CREDS message to a receiver, struct cmsgcred;
2) A receiver sets LOCAL_CREDS socket option and gets sender
credentials in control message, struct sockcred.

Both methods use the same control message type SCM_CREDS with the
same control message level SOL_SOCKET, so they are indistinguishable
for the receiver.  A difference in struct cmsgcred and struct sockcred
layouts may lead to unwanted effects.

Now for sockets with LOCAL_CREDS option remove all previous linked
SCM_CREDS control messages and then add a control message with
struct sockcred so the process specifically asked for the peer
credentials by LOCAL_CREDS option always gets struct sockcred.

PR:		kern/90800
Submitted by:	Andrey Simonenko
Regres. tests:	tools/regression/sockets/unix_cmsg/
MFC after:	1 month
2006-06-13 14:33:35 +00:00
Maxim Konovalov
481f8fe85f Inherit LOCAL_CREDS option from listen socket for sockets returned
by accept(2).

PR:		kern/90644
Submitted by:	Andrey Simonenko
OK'ed by:	mdodd
Tested by:	NetBSD regress/sys/kern/unfdpass/unfdpass.c
MFC after:	1 month
2006-04-24 19:09:33 +00:00
Paul Saab
4f590175b7 Allow for nmbclusters and maxsockets to be increased via sysctl.
An eventhandler is used to update all the various zones that depend
on these values.
2006-04-21 09:25:40 +00:00
Robert Watson
bc725eafc7 Change protocol switch method pru_detach() so that it returns void
rather than an error.  Detaches do not "fail", they either occur or
the protocol flags SS_PROTOREF to take ownership of the socket.

soclose() no longer looks at so_pcb to see if it's NULL, relying
entirely on the protocol to decide whether it's time to free the
socket or not using SS_PROTOREF.  so_pcb is now entirely owned and
managed by the protocol code.  Likewise, no longer test so_pcb in
other socket functions, such as soreceive(), which have no business
digging into protocol internals.

Protocol detach routines no longer try to free the socket on detach,
this is performed in the socket code if the protocol permits it.

In rts_detach(), no longer test for rp != NULL in detach, and
likewise in other protocols that don't permit a NULL so_pcb, reduce
the incidence of testing for it during detach.

netinet and netinet6 are not fully updated to this change, which
will be in an upcoming commit.  In their current state they may leak
memory or panic.

MFC after:	3 months
2006-04-01 15:42:02 +00:00
Robert Watson
ac45e92ff2 Change protocol switch pru_abort() API so that it returns void rather
than an int, as an error here is not meaningful.  Modify soabort() to
unconditionally free the socket on the return of pru_abort(), and
modify most protocols to no longer conditionally free the socket,
since the caller will do this.

This commit likely leaves parts of netinet and netinet6 in a situation
where they may panic or leak memory, as they are not fully updated by
this commit.  This will be corrected shortly in followup commits to
these components.

MFC after:      3 months
2006-04-01 15:15:05 +00:00
Robert Watson
4d4b555efa Modify UNIX domain sockets to guarantee, and assume, that so_pcb is always
defined for an in-use socket.  This allows us to eliminate countless tests
of whether so_pcb is non-NULL, eliminating dozens of error cases.  For
now, retain the call to sotryfree() in the uipc_abort() path, but this
will eventually move to soabort().

These new assumptions should be largely correct, and will become more so
as the socket/pcb reference model is fixed.  Removing the notion that
so_pcb can be NULL is a critical step towards further fine-graining
of the UNIX domain socket locking, as the so_pcb reference no longer
needs to be protected using locks, instead it is a property of the socket
life cycle.
2006-03-17 13:52:57 +00:00
Jeff Roberson
033eb86e52 - Lock access to vrele() with VFS_LOCK_GIANT() rather than mtx_lock(&Giant).
Sponsored by:	Isilon Systems, Inc.
2006-01-30 08:19:01 +00:00
Robert Watson
d7dca9034c XXX a comment in uipc_usrreq.c that requires updating. 2006-01-13 00:00:32 +00:00
Maxime Henrion
e59898ff36 Fix a bunch of SYSCTL_INT() that should have been SYSCTL_ULONG() to
match the type of the variable they are exporting.

Spotted by:	Thomas Hurst <tom@hur.st>
MFC after:	3 days
2005-12-14 22:27:48 +00:00
Robert Watson
a0ec558af0 Correct a number of serious and closely related bugs in the UNIX domain
socket file descriptor garbage collection code, which is intended to
detect and clear cycles of orphaned file descriptors that are "in-flight"
in a socket when that socket is closed before they are received.  The
algorithm present was both run at poor times (resulting in recursion and
reentrance), and also buggy in the presence of parallelism.  In order to
fix these problems, make the following changes:

- When there are in-flight sockets and a UNIX domain socket is destroyed,
  asynchronously schedule the garbage collector, rather than running it
  synchronously in the current context.  This avoids lock order issues
  when the garbage collection code reenters the UNIX domain socket code,
  avoiding lock order reversals, deadlocks, etc.  Run the code
  asynchronously in a task queue.

- In the garbage collector, when skipping file descriptors that have
  entered a closing state (i.e., have f_count == 0), re-test the FDEFER
  flag, and decrement unp_defer.  As file descriptors can now transition
  to a closed state, while the garbage collector is running, it is no
  longer the case that unp_defer will remain an accurate count of
  deferred sockets in the mark portion of the GC algorithm.  Otherwise,
  the garbage collector will loop waiting for unp_defer to reach
  zero, which it will never do as it is skipping file descriptors that
  were marked in an earlier pass, but now closed.

- Acquire the UNIX domain socket subsystem lock in unp_discard() when
  modifying the unp_rights counter, or a read/write race is risked with
  other threads also manipulating the counter.

While here:

- Remove #if 0'd code regarding acquiring the socket buffer sleep lock in
  the garbage collector, this is not required as we are able to use the
  socket buffer receive lock to protect scanning the receive buffer for
  in-flight file descriptors on the socket buffer.

- Annotate that the description of the garbage collector implementation
  is increasingly inaccurate and needs to be updated.

- Add counters of the number of deferred garbage collections and recycled
  file descriptors.  This will be removed and is here temporarily for
  debugging purposes.

With these changes in place, the unp_passfd regression test now appears
to be passed consistently on UP and SMP systems for extended runs,
whereas before it hung quickly or panicked, depending on which bug was
triggered.

Reported by:	Philip Kizer <pckizer at nostrum dot com>
MFC after:	2 weeks
2005-11-10 16:06:04 +00:00
Robert Watson
5bb84bc84b Normalize a significant number of kernel malloc type names:
- Prefer '_' to ' ', as it results in more easily parsed results in
  memory monitoring tools such as vmstat.

- Remove punctuation that is incompatible with using memory type names
  as file names, such as '/' characters.

- Disambiguate some collisions by adding subsystem prefixes to some
  memory types.

- Generally prefer lower case to upper case.

- If the same type is defined in multiple architecture directories,
  attempt to use the same name in additional cases.

Not all instances were caught in this change, so more work is required to
finish this conversion.  Similar changes are required for UMA zone names.
2005-10-31 15:41:29 +00:00
Robert Watson
d374e81efd Push the assignment of a new or updated so_qlimit from solisten()
following the protocol pru_listen() call to solisten_proto(), so
that it occurs under the socket lock acquisition that also sets
SO_ACCEPTCONN.  This requires passing the new backlog parameter
to the protocol, which also allows the protocol to be aware of
changes in queue limit should it wish to do something about the
new queue limit.  This continues a move towards the socket layer
acting as a library for the protocol.

Bump __FreeBSD_version due to a change in the in-kernel protocol
interface.  This change has been tested with IPv4 and UNIX domain
sockets, but not other protocols.
2005-10-30 19:44:40 +00:00
Robert Watson
e1ac28e239 Canonicalize the UNIX domain socket copyright layout: original holders
before more recent holders.

MFC after:	3 days
2005-09-23 12:41:06 +00:00
Colin Percival
fe2eee8231 Fix two issues which were missed in FreeBSD-SA-05:08.kmem.
Reported by:	Uwe Doering
2005-05-07 00:41:36 +00:00
Matthew N. Dodd
abb886facb Add missing break.
Found by:	marcus
2005-04-25 00:48:04 +00:00
Matthew N. Dodd
96a041b533 Check sopt_level in uipc_ctloutput() and return early if it is non-zero.
This prevents unintended consequences when an application calls things like
setsockopt(x, SOL_SOCKET, SO_REUSEADDR, ...) on a Unix domain socket.
2005-04-20 02:57:56 +00:00
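The early return added in 96a041b533 can be pictured with the short sketch below. It is a hypothetical userspace model, not the kernel's uipc_ctloutput(): the demo_* types, the DEMO_* constants and their values, and the specific error codes are assumptions for illustration. The point is only that options addressed to a non-zero level are rejected before the option name is interpreted at all.

```c
#include <errno.h>

/* Hypothetical constants; the values are illustrative only. */
#define DEMO_SOL_SOCKET		0xffff	/* a non-zero, socket-layer level */
#define DEMO_LOCAL_CREDS	0x002
#define DEMO_LOCAL_CONNWAIT	0x004

/* Hypothetical, reduced stand-in for struct sockopt. */
struct demo_sockopt {
	int level;	/* which layer the option is addressed to */
	int name;	/* option name within that layer */
};

static int
demo_ctloutput(const struct demo_sockopt *sopt)
{
	/* Options for other levels are not ours: return early. */
	if (sopt->level != 0)
		return (EINVAL);

	switch (sopt->name) {
	case DEMO_LOCAL_CREDS:
	case DEMO_LOCAL_CONNWAIT:
		return (0);		/* would get or set the option here */
	default:
		return (ENOPROTOOPT);	/* unknown option at our level */
	}
}
```

Without the level check, an option number from another level that happens to collide with a local option number would be misinterpreted as the local option.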
Matthew N. Dodd
6a2989fd54 Implement unix(4) socket options LOCAL_CREDS and LOCAL_CONNWAIT.
- Add unp_addsockcred() (for LOCAL_CREDS).
- Add an argument to unp_connect2() to differentiate between
  PRU_CONNECT and PRU_CONNECT2. (for LOCAL_CONNWAIT)

Obtained from:	 NetBSD (with some changes)
2005-04-13 00:01:46 +00:00
Robert Watson
0daccb9c94 In the current world order, solisten() implements the state transition of
a socket from a regular socket to a listening socket able to accept new
connections.  As part of this state transition, solisten() calls into the
protocol to update protocol-layer state.  There were several bugs in this
implementation that could result in a race wherein a TCP SYN received
in the interval between the protocol state transition and the shortly
following socket layer transition would result in a panic in the TCP code,
as the socket would be in the TCPS_LISTEN state, but the socket would not
have the SO_ACCEPTCONN flag set.

This change does the following:

- Pushes the socket state transition from the socket layer solisten() to
  socket "library" routines called from the protocol.  This permits
  the socket routines to be called while holding the protocol mutexes,
  preventing a race exposing the incomplete socket state transition to TCP
  after the TCP state transition has completed.  The check for a socket
  layer state transition is performed by solisten_proto_check(), and the
  actual transition is performed by solisten_proto().

- Holds the socket lock for the duration of the socket state test and set,
  and over the protocol layer state transition, which is now possible as
  the socket lock is acquired by the protocol layer, rather than vice
  versa.  This prevents additional state related races in the socket
  layer.

This permits the dual transition of socket layer and protocol layer state
to occur while holding locks for both layers, making the two changes
atomic with respect to one another.  Similar changes are likely required
elsewhere in the socket/protocol code.

Reported by:		Peter Holm <peter@holm.cc>
Review and fixes from:	emax, Antoine Brodin <antoine.brodin@laposte.net>
Philosophical head nod:	gnn
2005-02-21 21:58:17 +00:00
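The dual transition described in 0daccb9c94 can be sketched in userspace as below. Only the ordering is taken from the commit message: the protocol takes its own lock, then the socket lock, runs the check, and changes both layers before releasing either lock. The demo_* types, the toy spin lock, the "connected" test, and the EINVAL return are assumptions for illustration and do not reproduce the kernel's solisten_proto_check() or solisten_proto().

```c
#include <errno.h>
#include <stdatomic.h>

/* Toy spin lock standing in for the kernel mutexes. */
struct demo_lock {
	atomic_flag held;
};

static void
demo_lock_acquire(struct demo_lock *l)
{
	while (atomic_flag_test_and_set(&l->held))
		;	/* spin until the holder releases */
}

static void
demo_lock_release(struct demo_lock *l)
{
	atomic_flag_clear(&l->held);
}

/* Hypothetical, reduced socket-layer and protocol-layer state. */
struct demo_socket {
	struct demo_lock lock;
	int connected;		/* socket already in use for a connection */
	int acceptconn;		/* models the SO_ACCEPTCONN flag */
};

struct demo_pcb {
	struct demo_lock lock;
	int listening;		/* models the protocol's LISTEN state */
	struct demo_socket *so;
};

static void
demo_init(struct demo_pcb *pcb, struct demo_socket *so)
{
	atomic_flag_clear(&so->lock.held);
	so->connected = 0;
	so->acceptconn = 0;
	atomic_flag_clear(&pcb->lock.held);
	pcb->listening = 0;
	pcb->so = so;
}

/* Socket "library" check; caller holds the socket lock. */
static int
demo_solisten_proto_check(const struct demo_socket *so)
{
	return (so->connected ? EINVAL : 0);
}

/* Socket "library" transition; caller holds the socket lock. */
static void
demo_solisten_proto(struct demo_socket *so)
{
	so->acceptconn = 1;
}

/* Protocol listen method: both layers change under both locks. */
static int
demo_pru_listen(struct demo_pcb *pcb)
{
	int error;

	demo_lock_acquire(&pcb->lock);		/* protocol lock first */
	demo_lock_acquire(&pcb->so->lock);	/* then the socket lock */
	error = demo_solisten_proto_check(pcb->so);
	if (error == 0) {
		pcb->listening = 1;
		demo_solisten_proto(pcb->so);
	}
	demo_lock_release(&pcb->so->lock);
	demo_lock_release(&pcb->lock);
	return (error);
}
```

Because neither lock is dropped between the two assignments, no other thread that takes the same locks can observe a listening protocol state without the accept flag set, which is the window the commit closes.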
Robert Watson
c364c823d0 When aborting a UNIX domain socket bind() because VOP_CREATE() failed,
make sure to call vn_finished_write(mp) before returning.

MFC after:	3 days
2005-02-21 14:21:50 +00:00