Commit Graph

235 Commits

Author SHA1 Message Date
Robert Watson
604f19c91e Fix build on amd64, where sysctl arg1 is a pointer.
Reported by:	Mr Tinderbox
MFC after:	3 months
2009-10-05 22:23:12 +00:00
Robert Watson
84d61770bc First cut at implementing SOCK_SEQPACKET support for UNIX (local) domain
sockets.  This allows for reliable bi-directional datagram communication
over UNIX domain sockets, in contrast to SOCK_DGRAM (M:N, unreliable) or
SOCK_STERAM (bi-directional bytestream).  Largely, this reuses existing
UNIX domain socket code.  This allows applications requiring record-
oriented semantics to do so reliably via local IPC.

Some implementation notes (also present in XXX comments):

- Currently we lack an sbappend variant able to do datagrams and control
  data without doing addresses, so we mark SOCK_SEQPACKET as PR_ADDR.
  Adding a new variant will solve this problem.

- UNIX domain sockets on FreeBSD provide back-pressure/flow control
  notification for stream sockets by manipulating the send socket
  buffer's size during pru_send and pru_rcvd.  This trick works less well
  for SOCK_SEQPACKET as sosend_generic() uses sb_hiwat not just to
  manage blocking, but also to determine maximum datagram size.  Fixing
  this requires rethinking how back-pressure is done for SOCK_SEQPACKET;
  in the mean time, it's possible to get EMSGSIZE when buffers fill,
  instead of blocking.

Discussed with:	benl
Reviewed by:	bz, rpaulo
MFC after:	3 months
Sponsored by:	Google
2009-10-05 14:49:16 +00:00
Robert Watson
530c006014 Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks.  Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-08-01 19:26:27 +00:00
Jamie Gritton
399645e1b9 Remove unnecessary/redundant includes.
Approved by:	bz (mentor)
2009-06-23 14:39:21 +00:00
John Baldwin
afd9f91c63 Fix a deadlock in the getpeername() method for UNIX domain sockets.
Instead of locking the local unp followed by the remote unp, use the same
locking model as accept() and read lock the global link lock followed by
the remote unp while fetching the remote sockaddr.

Reported by:	Mel Flynn  mel.flynn of mailing.thruhere.net
Reviewed by:	rwatson
MFC after:	1 week
2009-06-18 20:56:22 +00:00
Robert Watson
bcf11e8d00 Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with:	pjd
2009-06-05 14:55:22 +00:00
Robert Watson
f93bfb23dc Add internal 'mac_policy_count' counter to the MAC Framework, which is a
count of the number of registered policies.

Rather than unconditionally locking sockets before passing them into MAC,
lock them in the MAC entry points only if mac_policy_count is non-zero.

This avoids locking overhead for a number of socket system calls when no
policies are registered, eliminating measurable overhead for the MAC
Framework for the socket subsystem when there are no active policies.

Possibly socket locks should be acquired by policies if they are required
for socket labels, which would further avoid locking overhead when there
are policies but they don't require labeling of sockets, or possibly
don't even implement socket controls.

Obtained from:	TrustedBSD Project
2009-06-02 18:26:17 +00:00
Marko Zec
21ca7b57bd Change the curvnet variable from a global const struct vnet *,
previously always pointing to the default vnet context, to a
dynamically changing thread-local one.  The currvnet context
should be set on entry to networking code via CURVNET_SET() macros,
and reverted to previous state via CURVNET_RESTORE().  Recursions
on curvnet are permitted, though strongly discuouraged.

This change should have no functional impact on nooptions VIMAGE
kernel builds, where CURVNET_* macros expand to whitespace.

The curthread->td_vnet (aka curvnet) variable's purpose is to be an
indicator of the vnet context in which the current network-related
operation takes place, in case we cannot deduce the current vnet
context from any other source, such as by looking at mbuf's
m->m_pkthdr.rcvif->if_vnet, sockets's so->so_vnet etc.  Moreover, so
far curvnet has turned out to be an invaluable consistency checking
aid: it helps to catch cases when sockets, ifnets or any other
vnet-aware structures may have leaked from one vnet to another.

The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros
was a result of an empirical iterative process, whith an aim to
reduce recursions on CURVNET_SET() to a minimum, while still reducing
the scope of CURVNET_SET() to networking only operations - the
alternative would be calling CURVNET_SET() on each system call entry.
In general, curvnet has to be set in three typicall cases: when
processing socket-related requests from userspace or from within the
kernel; when processing inbound traffic flowing from device drivers
to upper layers of the networking stack, and when executing
timer-driven networking functions.

This change also introduces a DDB subcommand to show the list of all
vnet instances.

Approved by:	julian (mentor)
2009-05-05 10:56:12 +00:00
Robert Watson
885868cd8f Remove VOP_LEASE and supporting functions. This hasn't been used since
the removal of NQNFS, but was left in in case it was required for NFSv4.
Since our new NFSv4 client and server can't use it for their
requirements, GC the old mechanism, as well as other unused lease-
related code and interfaces.

Due to its impact on kernel programming and binary interfaces, this
change should not be MFC'd.

Proposed by:    jeff
Reviewed by:    jeff
Discussed with: rmacklem, zach loafman @ isilon
2009-04-10 10:52:19 +00:00
Robert Watson
3dab55bc86 Decompose the global UNIX domain sockets rwlock into two different
locks: a global list/counter/generation counter protected by a new
mutex unp_list_lock, and a global linkage rwlock, unp_global_rwlock,
which protects the connections between UNIX domain sockets.

This eliminates conditional lock acquisition that was previously a
property of the global lock being held over sonewconn() leading to a
call to uipc_attach(), which also required the global lock, but
couldn't rely on it as other paths existed to uipc_attach() that
didn't hold it: now uipc_attach() uses only the list lock, which
follows the linkage lock in the lock order.  It may also reduce
contention on the global lock for some workloads.

Add global UNIX domain socket locks to hard-coded witness lock
order.

MFC after:	1 week
Discussed with:	kris
2009-03-08 21:48:29 +00:00
Robert Watson
b523ec24b9 White space and comment tweaks.
MFC after:	3 weeks
2009-01-01 20:03:22 +00:00
Robert Watson
a9f3c7d2ff Rename mbcnt to mbcnt_delta in uipc_send() -- unlike other local
variables named mbcnt in uipc_usrreq.c, this instance is a delta
rather than a cache of sb_mbcnt.

MFC after:	3 weeks
2008-12-30 16:09:57 +00:00
Dag-Erling Smørgrav
1ede983cc9 Retire the MALLOC and FREE macros. They are an abomination unto style(9).
MFC after:	3 months
2008-10-23 15:53:51 +00:00
Robert Watson
4b0e2b9add Remove stale comment: while uipc_connect2() was, until recently, not
static so it could be used by fifofs (actually portalfs), it is now
static.

Submitted by:	kensmith
2008-10-11 17:28:22 +00:00
Robert Watson
e298cf5902 Remove stale comment (and XXX saying so) about why we zero the file
descriptor pointer in unp_freerights: we can no longer recurse into
unp_gc due to unp_gc being invoked in a deferred way, but it's still
a good idea.

MFC after:	3 days
2008-10-08 06:26:51 +00:00
Robert Watson
fa9402f28a Differentiate pr_usrreqs for stream and datagram UNIX domain sockets, and
employ soreceive_dgram for the datagram case.

MFC after:	3 months
2008-10-08 06:19:49 +00:00
Robert Watson
2c8995842c Now that portalfs doesn't directly invoke uipc_connect2(), make it a
static symbol.

MFC after:	3 days
2008-10-06 18:43:11 +00:00
Robert Watson
0b36cd25fc Further minor cleanups to UNIX domain sockets:
- Staticize and locally prototype functions uipc_ctloutput(), unp_dispose(),
  unp_init(), and unp_externalize(), none of which have been required
  outside of uipc_usrreq.c since uipc_proto.c was removed.
- Remove stale prototype for uipc_usrreq(), which has not existed in the
  code since 1997
- Forward declare and staticize uipc_usrreqs structure in uipc_usrreq.c and
  not un.h.
- Comment on why uipc_connect2() is still non-static -- it is used directly
  by fifofs.
- Remove stale comments, tidy up whitespace.

MFC after:	3 days (where applicable)
2008-10-03 13:01:56 +00:00
Robert Watson
60a5ef26a1 Remove or update several stale comments.
A bit of whitespace/style cleanup.

Update copyright.

MFC after:	3 days (applicable changes)
2008-10-03 09:01:55 +00:00
Tom Rhodes
be6b130476 Fill in a few sysctl descriptions.
Approved by:	rwatson
2008-07-26 00:55:35 +00:00
Ed Maste
7928893d83 Use bcopy instead of strlcpy in uipc_bind and unp_connect, since
soun->sun_path isn't a null-terminated string.  As UNIX(4) states, "the
terminating NUL is not part of the address."  Since strlcpy has to return
"the total length of the string [it] tried to create," it walks off the end
of soun->sun_path looking for a \0.

This reverts r105332.

Reported by:    Ryan Stone
2008-07-03 23:26:10 +00:00
Robert Watson
8c96f9c193 Move unlock of global UNIX domain socket lock slightly lower in
unp_connect(): it is expected to return with the lock held, and two
possible error paths otherwise returned with it unlocked.

The fix committed here is slightly different from the patch in the
PR, but along an alternative line suggested in the PR.

PR:		119778
MFC after:	3 days
Submitted by:	James Juran <james dot juran at baesystems dot com>
2008-01-18 19:16:03 +00:00
Attilio Rao
22db15c06f VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
2008-01-13 14:44:15 +00:00
Robert Watson
8a69e5fa71 Remove "lock pushdown" todo item in comment -- I did that for 7.0.
MFC after:	3 weeks
2008-01-10 12:38:17 +00:00
Robert Watson
a635784569 Correct typos in comments.
MFC after:	3 weeks
2008-01-10 12:29:12 +00:00
Jeff Roberson
41e0f66d41 - Place the fhold() in unp_internalize_fp to be more consistent with refs.
- Clear all of the gc flags before doing a run.  Stale flags were causing
   us to skip some descriptors.
 - If a unp socket has been marked REF in a gc pass it can't be dead.

Found by:	rwatson's test tool.
2008-01-01 01:46:42 +00:00
Jeff Roberson
6f552cb098 - Check the correct variable against NULL in two places.
- If the unp_file is NULL that means it has never been internalized and it
   must be reachable.
2007-12-31 03:44:54 +00:00
Jeff Roberson
397c19d175 Remove explicit locking of struct file.
- Introduce a finit() which is used to initailize the fields of struct file
   in such a way that the ops vector is only valid after the data, type,
   and flags are valid.
 - Protect f_flag and f_count with atomic operations.
 - Remove the global list of all files and associated accounting.
 - Rewrite the unp garbage collection such that it no longer requires
   the global list of all files and instead uses a list of all unp sockets.
 - Mark sockets in the accept queue so we don't incorrectly gc them.

Tested by:	kris, pho
2007-12-30 01:42:15 +00:00
Robert Watson
30d239bc4c Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

  mac_<object>_<method/action>
  mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme.  Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier.  Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods.  Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by:	SPARTA (original patches against Mac OS X)
Obtained from:	TrustedBSD Project, Apple Computer
2007-10-24 19:04:04 +00:00
Pawel Jakub Dawidek
57fd3d5572 When we do open, we should lock the vnode exclusively. This fixes few races:
- fifo race, where two threads assign v_fifoinfo,
- v_writecount modifications,
- v_object modifications,
- and probably more...

Discussed with:	kib, ups
Approved by:	re (rwatson)
2007-07-26 16:58:09 +00:00
Robert Watson
03c96c3176 Add DDB "show unpcb" command, allowing DDB to print out many pertinent
details from UNIX domain socket protocol layer state.
2007-05-29 12:36:00 +00:00
Robert Watson
08d73f1370 Remove more one more stale comment regarding unpcb type-safety. 2007-05-11 12:28:45 +00:00
Robert Watson
d7924b7086 Clarify and update quite a few comments to reflect locking optimizations,
the addition of unpcb refcounts, and bug fixes.  Some of these fixes are
appropriate for MFC.

MFC after:	3 days
2007-05-11 12:10:45 +00:00
Wojciech A. Koszek
9e2894466a Don't acquire Giant unconditionally.
Reviewed by:	rwatson
2007-05-06 12:00:38 +00:00
Robert Watson
5e3f7694b1 Replace custom file descriptor array sleep lock constructed using a mutex
and flags with an sxlock.  This leads to a significant and measurable
performance improvement as a result of access to shared locking for
frequent lookup operations, reduced general overhead, and reduced overhead
in the event of contention.  All of these are imported for threaded
applications where simultaneous access to a shared file descriptor array
occurs frequently.  Kris has reported 2x-4x transaction rate improvements
on 8-core MySQL benchmarks; smaller improvements can be expected for many
workloads as a result of reduced overhead.

- Generally eliminate the distinction between "fast" and regular
  acquisisition of the filedesc lock; the plan is that they will now all
  be fast.  Change all locking instances to either shared or exclusive
  locks.

- Correct a bug (pointed out by kib) in fdfree() where previously msleep()
  was called without the mutex held; sx_sleep() is now always called with
  the sxlock held exclusively.

- Universally hold the struct file lock over changes to struct file,
  rather than the filedesc lock or no lock.  Always update the f_ops
  field last. A further memory barrier is required here in the future
  (discussed with jhb).

- Improve locking and reference management in linux_at(), which fails to
  properly acquire vnode references before using vnode pointers.  Annotate
  improper use of vn_fullpath(), which will be replaced at a future date.

In fcntl(), we conservatively acquire an exclusive lock, even though in
some cases a shared lock may be sufficient, which should be revisited.
The dropping of the filedesc lock in fdgrowtable() is no longer required
as the sxlock can be held over the sleep operation; we should consider
removing that (pointed out by attilio).

Tested by:	kris
Discussed with:	jhb, kris, attilio, jeff
2007-04-04 09:11:34 +00:00
Robert Watson
6e2faa2444 In uipc_close(), we no longer always free the unpcb, as the last reference
may be dropped later.  In this case, always unlock the unpcb so as not to
leak the lock.

Found by:	kris (BugMagnet)
2007-03-12 14:52:00 +00:00
Robert Watson
ede6e136f8 Remove two simultaneous acquisitions of multiple unpcb locks from
uipc_send in cases where only a global read lock is held by breaking
them out and avoiding the unpcb lock acquire in the common case.  This
avoids deadlocks which manifested with X11, and should also marginally
further improve performance.

Reported by:	sepotvin, brooks
2007-03-01 09:00:42 +00:00
Robert Watson
3592fd4de5 Lock unp2 after checking for a non-NULL unp2 pointer in uipc_send() on
datagram UNIX domain sockets, not before.
2007-02-28 08:08:50 +00:00
Robert Watson
e7c33e29ed Revise locking strategy used for UNIX domain sockets in order to improve
concurrency:

- Add per-unpcb mutexes protecting unpcb connection state, fields, etc.

- Replace global UNP mutex with a global UNP rwlock, which will protect the
  UNIX domain socket connection topology, v_socket, and be acquired
  exclusively before acquiring more than per-unpcb at a time in order to
  avoid lock order issues.

In performance measurements involving MySQL, this change has little or no
overhead on UP (+/- 1%), but leads to a significant (5%-30%) improvement in
multi-processor measurements using the sysbench and supersmack benchmarks.

Much testing by:	kris
Approved by:		re (kensmith)
2007-02-26 20:47:52 +00:00
Robert Watson
6fac927ccc Add an additional MAC check to the UNIX domain socket connect path:
check that the subject has read/write access to the vnode using the
vnode MAC check.

MFC after:	3 weeks
Submitted by:	Spencer Minear <spencer_minear at securecomputing dot com>
Obtained from:	TrustedBSD Project
2007-02-22 09:37:44 +00:00
Robert Watson
5b950deabc Break introductory comment into two paragraphs to separate material on the
garbage collection complications from general discussion of UNIX domain
sockets.

Staticize unp_addsockcred().

Remove XXX comment regarding Giant and v_socket -- v_socket is protected
by the global UNIX domain socket lock.
2007-02-20 10:50:02 +00:00
Robert Watson
aea52f1bf8 Minor rearrangement of global variables, comments, etc, in UNIX domain
sockets.
2007-02-14 15:05:40 +00:00
Robert Watson
46a1d9bfe8 Change unp_mtx to supporting recursion, and do not drop the unp_mtx over
sonewconn() in unp_connect().  This avoids a race that occurs due to
v_socket being an uncounted reference, as the lock was being released in
order to call sonewconn(), which otherwise recurses into the UNIX domain
socket code via pru_attach, as well as holding the lock over a sleeping
memory allocation in uipc_attach().  Switch to a non-sleeping memory
allocation during UNIX domain socket attach.

This fix non-ideal in that it requires enabling recursion, but is a much
smaller change than moving to using true references for v_socket.  The
reported panic occurs in unp_connect() following the return of
sonewconn().

Update copyright year.

Panic reported by:      jhb
2007-02-14 12:22:11 +00:00
Robert Watson
05102f04d5 Set UNP_CONNECTING when committing to moving ahead in unp_connect().
This logic was lost when merging the remainder of these changes in
1.178.
2007-02-13 21:00:57 +00:00
Robert Watson
1f837c4753 Push UNIX domain socket locking further into uipc_ctloutput() in order to
avoid holding the UNIX domain socket subsystem lock over soooptcopyin()
and sooptcopyout().  This problem was introduced when LOCAL_CREDS, and
LOCAL_CONNWAIT support were added.

Reviewed by:	mdodd
2007-02-06 14:31:37 +00:00
Robert Watson
abdeb3b01f Canonicalize copyrights in some files I hold copyrights on:
- Sort by date in license blocks, oldest copyright first.
- All rights reserved after all copyrights, not just the first.
- Use (c) to be consistent with other entries.

MFC after:	3 days
2007-01-08 17:49:59 +00:00
John Baldwin
9ae328fc8f - Close a race between enumerating UNIX domain socket pcb structures via
sysctl and socket teardown by adding a reference count to the UNIX domain
  pcb object and fixing the sysctl that enumerates unpcbs to grab a
  reference on each unpcb while it builds the list to copy out to userland.
- Close a race between UNIX domain pcb garbage collection (unp_gc()) and
  file descriptor teardown (fdrop()) by adding a new garbage collection
  flag FWAIT.  unp_gc() sets FWAIT while it walks the message buffers
  in a UNIX domain socket looking for nested file descriptor references
  and clears the flag when it is finished.  fdrop() checks to see if the
  flag is set on a file descriptor whose refcount just dropped to 0 and
  waits for unp_gc() to clear the flag before completely destroying the
  file descriptor.

MFC after:	1 week
Reviewed by:	rwatson
Submitted by:	ups
Hopefully makes the panics go away:	mx1
2007-01-05 19:59:46 +00:00
Robert Watson
aed5570872 Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h.  sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from:	TrustedBSD Project
Sponsored by:	SPARTA
2006-10-22 11:52:19 +00:00
Robert Watson
b7e2f3ec76 Minor white space tweaks. 2006-08-13 23:16:59 +00:00
Robert Watson
e4445a031f Move definition of UNIX domain socket protosw and domain entries from
uipc_proto.c to uipc_usrreq.c, making localdomain static.  Remove
uipc_proto.c as it's no longer used.  With this change, UNIX domain
sockets are entirely encapsulated in uipc_usrreq.c.
2006-08-07 12:02:43 +00:00