2005-01-06 23:35:40 +00:00
|
|
|
/*-
|
2017-11-20 19:43:44 +00:00
|
|
|
* SPDX-License-Identifier: BSD-3-Clause
|
|
|
|
*
|
1994-05-24 10:09:53 +00:00
|
|
|
* Copyright (c) 1982, 1986, 1989, 1991, 1993
|
2018-05-19 17:29:57 +00:00
|
|
|
* The Regents of the University of California. All Rights Reserved.
|
|
|
|
* Copyright (c) 2004-2009 Robert N. M. Watson All Rights Reserved.
|
2018-05-19 02:15:40 +00:00
|
|
|
* Copyright (c) 2018 Matthew Macy
|
1994-05-24 10:09:53 +00:00
|
|
|
*
|
|
|
|
* Redistribution and use in source and binary forms, with or without
|
|
|
|
* modification, are permitted provided that the following conditions
|
|
|
|
* are met:
|
|
|
|
* 1. Redistributions of source code must retain the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer.
|
|
|
|
* 2. Redistributions in binary form must reproduce the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer in the
|
|
|
|
* documentation and/or other materials provided with the distribution.
|
2016-09-15 13:16:20 +00:00
|
|
|
* 3. Neither the name of the University nor the names of its contributors
|
1994-05-24 10:09:53 +00:00
|
|
|
* may be used to endorse or promote products derived from this software
|
|
|
|
* without specific prior written permission.
|
|
|
|
*
|
|
|
|
* THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
|
|
|
|
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
|
|
|
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
|
|
|
* ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
|
|
|
|
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
|
|
|
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
|
|
|
|
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
|
|
|
|
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
|
|
|
|
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
|
|
|
|
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
|
|
|
|
* SUCH DAMAGE.
|
|
|
|
*
|
1995-05-11 00:13:26 +00:00
|
|
|
* From: @(#)uipc_usrreq.c 8.3 (Berkeley) 1/4/94
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
|
|
|
|
2006-07-23 20:06:45 +00:00
|
|
|
/*
|
|
|
|
* UNIX Domain (Local) Sockets
|
|
|
|
*
|
|
|
|
* This is an implementation of UNIX (local) domain sockets. Each socket has
|
|
|
|
* an associated struct unpcb (UNIX protocol control block). Stream sockets
|
|
|
|
* may be connected to 0 or 1 other socket. Datagram sockets may be
|
|
|
|
* connected to 0, 1, or many other sockets. Sockets may be created and
|
|
|
|
* connected in pairs (socketpair(2)), or bound/connected to using the file
|
|
|
|
* system name space. For most purposes, only the receive socket buffer is
|
|
|
|
* used, as sending on one socket delivers directly to the receive socket
|
2007-02-20 10:50:02 +00:00
|
|
|
* buffer of a second socket.
|
|
|
|
*
|
|
|
|
* The implementation is substantially complicated by the fact that
|
|
|
|
* "ancillary data", such as file descriptors or credentials, may be passed
|
|
|
|
* across UNIX domain sockets. The potential for passing UNIX domain sockets
|
|
|
|
* over other UNIX domain sockets requires the implementation of a simple
|
|
|
|
* garbage collector to find and tear down cycles of disconnected sockets.
|
2007-02-14 15:05:40 +00:00
|
|
|
*
|
|
|
|
* TODO:
|
2009-10-05 14:49:16 +00:00
|
|
|
* RDM
|
2007-02-14 15:05:40 +00:00
|
|
|
* rethink name space problems
|
|
|
|
* need a proper out-of-band
|
2006-07-23 20:06:45 +00:00
|
|
|
*/
|
|
|
|
|
2003-06-11 00:56:59 +00:00
|
|
|
#include <sys/cdefs.h>
|
|
|
|
__FBSDID("$FreeBSD$");
|
|
|
|
|
2007-05-29 12:36:00 +00:00
|
|
|
#include "opt_ddb.h"
|
2002-07-31 03:03:22 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <sys/param.h>
|
2014-03-16 10:55:57 +00:00
|
|
|
#include <sys/capsicum.h>
|
2001-05-01 08:13:21 +00:00
|
|
|
#include <sys/domain.h>
|
2006-04-21 09:25:40 +00:00
|
|
|
#include <sys/eventhandler.h>
|
2020-09-15 19:21:33 +00:00
|
|
|
#include <sys/fcntl.h>
|
1997-02-24 20:30:58 +00:00
|
|
|
#include <sys/file.h>
|
2002-04-30 01:54:54 +00:00
|
|
|
#include <sys/filedesc.h>
|
|
|
|
#include <sys/kernel.h>
|
|
|
|
#include <sys/lock.h>
|
2020-09-15 19:21:33 +00:00
|
|
|
#include <sys/malloc.h>
|
1997-02-24 20:30:58 +00:00
|
|
|
#include <sys/mbuf.h>
|
2006-01-30 08:19:01 +00:00
|
|
|
#include <sys/mount.h>
|
2002-04-30 01:54:54 +00:00
|
|
|
#include <sys/mutex.h>
|
1997-02-24 20:30:58 +00:00
|
|
|
#include <sys/namei.h>
|
|
|
|
#include <sys/proc.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <sys/protosw.h>
|
2010-12-03 16:15:44 +00:00
|
|
|
#include <sys/queue.h>
|
2002-04-30 01:54:54 +00:00
|
|
|
#include <sys/resourcevar.h>
|
2007-02-26 20:47:52 +00:00
|
|
|
#include <sys/rwlock.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <sys/socket.h>
|
|
|
|
#include <sys/socketvar.h>
|
2002-04-30 01:54:54 +00:00
|
|
|
#include <sys/signalvar.h>
|
1997-02-24 20:30:58 +00:00
|
|
|
#include <sys/stat.h>
|
2002-04-30 01:54:54 +00:00
|
|
|
#include <sys/sx.h>
|
1997-02-24 20:30:58 +00:00
|
|
|
#include <sys/sysctl.h>
|
2002-04-30 01:54:54 +00:00
|
|
|
#include <sys/systm.h>
|
Correct a number of serious and closely related bugs in the UNIX domain
socket file descriptor garbage collection code, which is intended to
detect and clear cycles of orphaned file descriptors that are "in-flight"
in a socket when that socket is closed before they are received. The
algorithm present was both run at poor times (resulting in recursion and
reentrance), and also buggy in the presence of parallelism. In order to
fix these problems, make the following changes:
- When there are in-flight sockets and a UNIX domain socket is destroyed,
asynchronously schedule the garbage collector, rather than running it
synchronously in the current context. This avoids lock order issues
when the garbage collection code reenters the UNIX domain socket code,
avoiding lock order reversals, deadlocks, etc. Run the code
asynchronously in a task queue.
- In the garbage collector, when skipping file descriptors that have
entered a closing state (i.e., have f_count == 0), re-test the FDEFER
flag, and decrement unp_defer. As file descriptors can now transition
to a closed state, while the garbage collector is running, it is no
longer the case that unp_defer will remain an accurate count of
deferred sockets in the mark portion of the GC algorithm. Otherwise,
the garbage collector will loop waiting waiting for unp_defer to reach
zero, which it will never do as it is skipping file descriptors that
were marked in an earlier pass, but now closed.
- Acquire the UNIX domain socket subsystem lock in unp_discard() when
modifying the unp_rights counter, or a read/write race is risked with
other threads also manipulating the counter.
While here:
- Remove #if 0'd code regarding acquiring the socket buffer sleep lock in
the garbage collector, this is not required as we are able to use the
socket buffer receive lock to protect scanning the receive buffer for
in-flight file descriptors on the socket buffer.
- Annotate that the description of the garbage collector implementation
is increasingly inaccurate and needs to be updated.
- Add counters of the number of deferred garbage collections and recycled
file descriptors. This will be removed and is here temporarily for
debugging purposes.
With these changes in place, the unp_passfd regression test now appears
to be passed consistently on UP and SMP systems for extended runs,
whereas before it hung quickly or panicked, depending on which bug was
triggered.
Reported by: Philip Kizer <pckizer at nostrum dot com>
MFC after: 2 weeks
2005-11-10 16:06:04 +00:00
|
|
|
#include <sys/taskqueue.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <sys/un.h>
|
1998-05-15 20:11:40 +00:00
|
|
|
#include <sys/unpcb.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <sys/vnode.h>
|
2009-08-01 19:26:27 +00:00
|
|
|
|
|
|
|
#include <net/vnet.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2007-05-29 12:36:00 +00:00
|
|
|
#ifdef DDB
|
|
|
|
#include <ddb/ddb.h>
|
|
|
|
#endif
|
|
|
|
|
2006-10-22 11:52:19 +00:00
|
|
|
#include <security/mac/mac_framework.h>
|
|
|
|
|
2002-03-20 04:11:52 +00:00
|
|
|
#include <vm/uma.h>
|
1998-05-15 20:11:40 +00:00
|
|
|
|
2013-03-03 23:39:30 +00:00
|
|
|
MALLOC_DECLARE(M_FILECAPS);
|
|
|
|
|
2009-03-08 21:48:29 +00:00
|
|
|
/*
|
2020-09-15 19:21:33 +00:00
|
|
|
* See unpcb.h for the locking key.
|
2009-03-08 21:48:29 +00:00
|
|
|
*/
|
|
|
|
|
2007-02-14 15:05:40 +00:00
|
|
|
static uma_zone_t unp_zone;
|
2009-03-08 21:48:29 +00:00
|
|
|
static unp_gen_t unp_gencnt; /* (l) */
|
|
|
|
static u_int unp_count; /* (l) Count of local sockets. */
|
2007-02-14 15:05:40 +00:00
|
|
|
static ino_t unp_ino; /* Prototype for fake inode numbers. */
|
2009-03-08 21:48:29 +00:00
|
|
|
static int unp_rights; /* (g) File descriptors in flight. */
|
|
|
|
static struct unp_head unp_shead; /* (l) List of stream sockets. */
|
|
|
|
static struct unp_head unp_dhead; /* (l) List of datagram sockets. */
|
2009-10-05 14:49:16 +00:00
|
|
|
static struct unp_head unp_sphead; /* (l) List of seqpacket sockets. */
|
1998-05-15 20:11:40 +00:00
|
|
|
|
2010-12-03 16:15:44 +00:00
|
|
|
struct unp_defer {
|
|
|
|
SLIST_ENTRY(unp_defer) ud_link;
|
|
|
|
struct file *ud_fp;
|
|
|
|
};
|
|
|
|
static SLIST_HEAD(, unp_defer) unp_defers;
|
|
|
|
static int unp_defers_count;
|
|
|
|
|
2007-02-14 15:05:40 +00:00
|
|
|
static const struct sockaddr sun_noname = { sizeof(sun_noname), AF_LOCAL };
|
1998-05-15 20:11:40 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
2007-02-14 15:05:40 +00:00
|
|
|
* Garbage collection of cyclic file descriptor/socket references occurs
|
|
|
|
* asynchronously in a taskqueue context in order to avoid recursion and
|
|
|
|
* reentrance in the UNIX domain socket, file descriptor, and socket layer
|
|
|
|
* code. See unp_gc() for a full description.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2012-11-20 15:45:48 +00:00
|
|
|
static struct timeout_task unp_gc_task;
|
1995-12-14 09:55:16 +00:00
|
|
|
|
2010-12-03 16:15:44 +00:00
|
|
|
/*
|
|
|
|
* The close of unix domain sockets attached as SCM_RIGHTS is
|
|
|
|
* postponed to the taskqueue, to avoid arbitrary recursion depth.
|
|
|
|
* The attached sockets might have another sockets attached.
|
|
|
|
*/
|
|
|
|
static struct task unp_defer_task;
|
|
|
|
|
2006-07-23 10:19:04 +00:00
|
|
|
/*
|
|
|
|
* Both send and receive buffers are allocated PIPSIZ bytes of buffering for
|
|
|
|
* stream sockets, although the total for sender and receiver is actually
|
|
|
|
* only PIPSIZ.
|
|
|
|
*
|
|
|
|
* Datagram sockets really use the sendspace as the maximum datagram size,
|
|
|
|
* and don't really want to reserve the sendspace. Their recvspace should be
|
|
|
|
* large enough for at least one max-size datagram plus address.
|
|
|
|
*/
|
|
|
|
#ifndef PIPSIZ
|
|
|
|
#define PIPSIZ 8192
|
|
|
|
#endif
|
|
|
|
static u_long unpst_sendspace = PIPSIZ;
|
|
|
|
static u_long unpst_recvspace = PIPSIZ;
|
|
|
|
static u_long unpdg_sendspace = 2*1024; /* really max datagram size */
|
|
|
|
static u_long unpdg_recvspace = 4*1024;
|
2009-10-05 14:49:16 +00:00
|
|
|
static u_long unpsp_sendspace = PIPSIZ; /* really max datagram size */
|
|
|
|
static u_long unpsp_recvspace = PIPSIZ;
|
2006-07-23 10:19:04 +00:00
|
|
|
|
2020-02-26 14:26:36 +00:00
|
|
|
static SYSCTL_NODE(_net, PF_LOCAL, local, CTLFLAG_RW | CTLFLAG_MPSAFE, 0,
|
|
|
|
"Local domain");
|
|
|
|
static SYSCTL_NODE(_net_local, SOCK_STREAM, stream,
|
|
|
|
CTLFLAG_RW | CTLFLAG_MPSAFE, 0,
|
2011-11-07 15:43:11 +00:00
|
|
|
"SOCK_STREAM");
|
2020-02-26 14:26:36 +00:00
|
|
|
static SYSCTL_NODE(_net_local, SOCK_DGRAM, dgram,
|
|
|
|
CTLFLAG_RW | CTLFLAG_MPSAFE, 0,
|
|
|
|
"SOCK_DGRAM");
|
|
|
|
static SYSCTL_NODE(_net_local, SOCK_SEQPACKET, seqpacket,
|
|
|
|
CTLFLAG_RW | CTLFLAG_MPSAFE, 0,
|
2009-10-05 14:49:16 +00:00
|
|
|
"SOCK_SEQPACKET");
|
2006-08-07 12:02:43 +00:00
|
|
|
|
2006-07-23 10:19:04 +00:00
|
|
|
SYSCTL_ULONG(_net_local_stream, OID_AUTO, sendspace, CTLFLAG_RW,
|
2008-07-26 00:55:35 +00:00
|
|
|
&unpst_sendspace, 0, "Default stream send space.");
|
2006-07-23 10:19:04 +00:00
|
|
|
SYSCTL_ULONG(_net_local_stream, OID_AUTO, recvspace, CTLFLAG_RW,
|
2008-07-26 00:55:35 +00:00
|
|
|
&unpst_recvspace, 0, "Default stream receive space.");
|
2006-07-23 10:19:04 +00:00
|
|
|
SYSCTL_ULONG(_net_local_dgram, OID_AUTO, maxdgram, CTLFLAG_RW,
|
2008-07-26 00:55:35 +00:00
|
|
|
&unpdg_sendspace, 0, "Default datagram send space.");
|
2006-07-23 10:19:04 +00:00
|
|
|
SYSCTL_ULONG(_net_local_dgram, OID_AUTO, recvspace, CTLFLAG_RW,
|
2008-07-26 00:55:35 +00:00
|
|
|
&unpdg_recvspace, 0, "Default datagram receive space.");
|
2009-10-05 14:49:16 +00:00
|
|
|
SYSCTL_ULONG(_net_local_seqpacket, OID_AUTO, maxseqpacket, CTLFLAG_RW,
|
|
|
|
&unpsp_sendspace, 0, "Default seqpacket send space.");
|
|
|
|
SYSCTL_ULONG(_net_local_seqpacket, OID_AUTO, recvspace, CTLFLAG_RW,
|
|
|
|
&unpsp_recvspace, 0, "Default seqpacket receive space.");
|
2010-12-03 20:39:06 +00:00
|
|
|
SYSCTL_INT(_net_local, OID_AUTO, inflight, CTLFLAG_RD, &unp_rights, 0,
|
2008-07-26 00:55:35 +00:00
|
|
|
"File descriptors in flight.");
|
2010-12-03 16:15:44 +00:00
|
|
|
SYSCTL_INT(_net_local, OID_AUTO, deferred, CTLFLAG_RD,
|
2010-12-03 20:39:06 +00:00
|
|
|
&unp_defers_count, 0,
|
2010-12-03 16:15:44 +00:00
|
|
|
"File descriptors deferred to taskqueue for close.");
|
2006-07-23 10:19:04 +00:00
|
|
|
|
2010-07-22 05:42:29 +00:00
|
|
|
/*
|
2007-02-26 20:47:52 +00:00
|
|
|
* Locking and synchronization:
|
|
|
|
*
|
2020-09-15 19:21:33 +00:00
|
|
|
* Several types of locks exist in the local domain socket implementation:
|
|
|
|
* - a global linkage lock
|
|
|
|
* - a global connection list lock
|
|
|
|
* - the mtxpool lock
|
|
|
|
* - per-unpcb mutexes
|
2018-05-17 17:59:35 +00:00
|
|
|
*
|
2020-09-15 19:21:33 +00:00
|
|
|
* The linkage lock protects the global socket lists, the generation number
|
|
|
|
* counter and garbage collector state.
|
2018-05-17 17:59:35 +00:00
|
|
|
*
|
2020-09-15 19:21:33 +00:00
|
|
|
* The connection list lock protects the list of referring sockets in a datagram
|
|
|
|
* socket PCB. This lock is also overloaded to protect a global list of
|
|
|
|
* sockets whose buffers contain socket references in the form of SCM_RIGHTS
|
|
|
|
* messages. To avoid recursion, such references are released by a dedicated
|
|
|
|
* thread.
|
2018-05-17 17:59:35 +00:00
|
|
|
*
|
2020-09-15 19:21:33 +00:00
|
|
|
* The mtxpool lock protects the vnode from being modified while referenced.
|
|
|
|
* Lock ordering rules require that it be acquired before any PCB locks.
|
|
|
|
*
|
|
|
|
* The unpcb lock (unp_mtx) protects the most commonly referenced fields in the
|
|
|
|
* unpcb. This includes the unp_conn field, which either links two connected
|
|
|
|
* PCBs together (for connected socket types) or points at the destination
|
|
|
|
* socket (for connectionless socket types). The operations of creating or
|
|
|
|
* destroying a connection therefore involve locking multiple PCBs. To avoid
|
|
|
|
* lock order reversals, in some cases this involves dropping a PCB lock and
|
|
|
|
* using a reference counter to maintain liveness.
|
2007-02-26 20:47:52 +00:00
|
|
|
*
|
|
|
|
* UNIX domain sockets each have an unpcb hung off of their so_pcb pointer,
|
|
|
|
* allocated in pru_attach() and freed in pru_detach(). The validity of that
|
|
|
|
* pointer is an invariant, so no lock is required to dereference the so_pcb
|
|
|
|
* pointer if a valid socket reference is held by the caller. In practice,
|
|
|
|
* this is always true during operations performed on a socket. Each unpcb
|
|
|
|
* has a back-pointer to its socket, unp_socket, which will be stable under
|
|
|
|
* the same circumstances.
|
|
|
|
*
|
|
|
|
* This pointer may only be safely dereferenced as long as a valid reference
|
|
|
|
* to the unpcb is held. Typically, this reference will be from the socket,
|
|
|
|
* or from another unpcb when the referring unpcb's lock is held (in order
|
|
|
|
* that the reference not be invalidated during use). For example, to follow
|
2018-05-17 17:59:35 +00:00
|
|
|
* unp->unp_conn->unp_socket, you need to hold a lock on unp_conn to guarantee
|
|
|
|
* that detach is not run clearing unp_socket.
|
2004-08-16 01:52:04 +00:00
|
|
|
*
|
2007-02-26 20:47:52 +00:00
|
|
|
* Blocking with UNIX domain sockets is a tricky issue: unlike most network
|
|
|
|
* protocols, bind() is a non-atomic operation, and connect() requires
|
|
|
|
* potential sleeping in the protocol, due to potentially waiting on local or
|
|
|
|
* distributed file systems. We try to separate "lookup" operations, which
|
|
|
|
* may sleep, and the IPC operations themselves, which typically can occur
|
|
|
|
* with relative atomicity as locks can be held over the entire operation.
|
2004-08-16 01:52:04 +00:00
|
|
|
*
|
2007-02-26 20:47:52 +00:00
|
|
|
* Another tricky issue is simultaneous multi-threaded or multi-process
|
|
|
|
* access to a single UNIX domain socket. These are handled by the flags
|
|
|
|
* UNP_CONNECTING and UNP_BINDING, which prevent concurrent connecting or
|
|
|
|
* binding, both of which involve dropping UNIX domain socket locks in order
|
|
|
|
* to perform namei() and other file system operations.
|
2004-08-16 01:52:04 +00:00
|
|
|
*/
|
2009-03-08 21:48:29 +00:00
|
|
|
static struct rwlock unp_link_rwlock;
|
2010-12-03 16:15:44 +00:00
|
|
|
static struct mtx unp_defers_lock;
|
2007-02-26 20:47:52 +00:00
|
|
|
|
2009-03-08 21:48:29 +00:00
|
|
|
#define UNP_LINK_LOCK_INIT() rw_init(&unp_link_rwlock, \
|
|
|
|
"unp_link_rwlock")
|
2007-02-26 20:47:52 +00:00
|
|
|
|
2020-01-25 08:57:26 +00:00
|
|
|
#define UNP_LINK_LOCK_ASSERT() rw_assert(&unp_link_rwlock, \
|
2007-02-26 20:47:52 +00:00
|
|
|
RA_LOCKED)
|
2009-03-08 21:48:29 +00:00
|
|
|
#define UNP_LINK_UNLOCK_ASSERT() rw_assert(&unp_link_rwlock, \
|
2007-02-26 20:47:52 +00:00
|
|
|
RA_UNLOCKED)
|
|
|
|
|
2009-03-08 21:48:29 +00:00
|
|
|
#define UNP_LINK_RLOCK() rw_rlock(&unp_link_rwlock)
|
|
|
|
#define UNP_LINK_RUNLOCK() rw_runlock(&unp_link_rwlock)
|
|
|
|
#define UNP_LINK_WLOCK() rw_wlock(&unp_link_rwlock)
|
|
|
|
#define UNP_LINK_WUNLOCK() rw_wunlock(&unp_link_rwlock)
|
|
|
|
#define UNP_LINK_WLOCK_ASSERT() rw_assert(&unp_link_rwlock, \
|
2007-02-26 20:47:52 +00:00
|
|
|
RA_WLOCKED)
|
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
|
|
|
#define UNP_LINK_WOWNED() rw_wowned(&unp_link_rwlock)
|
2007-02-26 20:47:52 +00:00
|
|
|
|
2010-12-03 16:15:44 +00:00
|
|
|
#define UNP_DEFERRED_LOCK_INIT() mtx_init(&unp_defers_lock, \
|
|
|
|
"unp_defer", NULL, MTX_DEF)
|
|
|
|
#define UNP_DEFERRED_LOCK() mtx_lock(&unp_defers_lock)
|
|
|
|
#define UNP_DEFERRED_UNLOCK() mtx_unlock(&unp_defers_lock)
|
|
|
|
|
2018-05-17 17:59:35 +00:00
|
|
|
#define UNP_REF_LIST_LOCK() UNP_DEFERRED_LOCK();
|
|
|
|
#define UNP_REF_LIST_UNLOCK() UNP_DEFERRED_UNLOCK();
|
|
|
|
|
2007-02-26 20:47:52 +00:00
|
|
|
#define UNP_PCB_LOCK_INIT(unp) mtx_init(&(unp)->unp_mtx, \
|
2018-05-20 04:32:48 +00:00
|
|
|
"unp", "unp", \
|
2018-05-17 17:59:35 +00:00
|
|
|
MTX_DUPOK|MTX_DEF)
|
2007-02-26 20:47:52 +00:00
|
|
|
#define UNP_PCB_LOCK_DESTROY(unp) mtx_destroy(&(unp)->unp_mtx)
|
2020-09-15 19:23:22 +00:00
|
|
|
#define UNP_PCB_LOCKPTR(unp) (&(unp)->unp_mtx)
|
2007-02-26 20:47:52 +00:00
|
|
|
#define UNP_PCB_LOCK(unp) mtx_lock(&(unp)->unp_mtx)
|
2018-05-17 17:59:35 +00:00
|
|
|
#define UNP_PCB_TRYLOCK(unp) mtx_trylock(&(unp)->unp_mtx)
|
2007-02-26 20:47:52 +00:00
|
|
|
#define UNP_PCB_UNLOCK(unp) mtx_unlock(&(unp)->unp_mtx)
|
2018-05-17 17:59:35 +00:00
|
|
|
#define UNP_PCB_OWNED(unp) mtx_owned(&(unp)->unp_mtx)
|
2007-02-26 20:47:52 +00:00
|
|
|
#define UNP_PCB_LOCK_ASSERT(unp) mtx_assert(&(unp)->unp_mtx, MA_OWNED)
|
2018-05-17 17:59:35 +00:00
|
|
|
#define UNP_PCB_UNLOCK_ASSERT(unp) mtx_assert(&(unp)->unp_mtx, MA_NOTOWNED)
|
2007-02-26 20:47:52 +00:00
|
|
|
|
2008-10-06 18:43:11 +00:00
|
|
|
static int uipc_connect2(struct socket *, struct socket *);
|
2008-10-03 13:01:56 +00:00
|
|
|
static int uipc_ctloutput(struct socket *, struct sockopt *);
|
2007-02-26 20:47:52 +00:00
|
|
|
static int unp_connect(struct socket *, struct sockaddr *,
|
2007-02-14 15:05:40 +00:00
|
|
|
struct thread *);
|
2013-03-02 21:11:30 +00:00
|
|
|
static int unp_connectat(int, struct socket *, struct sockaddr *,
|
|
|
|
struct thread *);
|
2007-02-26 20:47:52 +00:00
|
|
|
static int unp_connect2(struct socket *so, struct socket *so2, int);
|
|
|
|
static void unp_disconnect(struct unpcb *unp, struct unpcb *unp2);
|
2016-08-31 21:48:22 +00:00
|
|
|
static void unp_dispose(struct socket *so);
|
|
|
|
static void unp_dispose_mbuf(struct mbuf *);
|
2007-02-26 20:47:52 +00:00
|
|
|
static void unp_shutdown(struct unpcb *);
|
2016-02-26 12:46:34 +00:00
|
|
|
static void unp_drop(struct unpcb *);
|
2007-02-26 20:47:52 +00:00
|
|
|
static void unp_gc(__unused void *, int);
|
2013-03-11 22:59:07 +00:00
|
|
|
static void unp_scan(struct mbuf *, void (*)(struct filedescent **, int));
|
2007-02-26 20:47:52 +00:00
|
|
|
static void unp_discard(struct file *);
|
2013-03-03 23:39:30 +00:00
|
|
|
static void unp_freerights(struct filedescent **, int);
|
2008-10-03 13:01:56 +00:00
|
|
|
static void unp_init(void);
|
2007-02-26 20:47:52 +00:00
|
|
|
static int unp_internalize(struct mbuf **, struct thread *);
|
2007-12-30 01:42:15 +00:00
|
|
|
static void unp_internalize_fp(struct file *);
|
2013-03-19 20:58:17 +00:00
|
|
|
static int unp_externalize(struct mbuf *, struct mbuf **, int);
|
2010-12-03 16:15:44 +00:00
|
|
|
static int unp_externalize_fp(struct file *);
|
2020-11-17 20:01:21 +00:00
|
|
|
static struct mbuf *unp_addsockcred(struct thread *, struct mbuf *, int);
|
2010-12-03 16:15:44 +00:00
|
|
|
static void unp_process_defers(void * __unused, int);
|
1995-12-14 09:55:16 +00:00
|
|
|
|
2018-05-17 17:59:35 +00:00
|
|
|
static void
|
|
|
|
unp_pcb_hold(struct unpcb *unp)
|
|
|
|
{
|
2020-09-15 19:21:58 +00:00
|
|
|
u_int old __unused;
|
|
|
|
|
|
|
|
old = refcount_acquire(&unp->unp_refcount);
|
|
|
|
KASSERT(old > 0, ("%s: unpcb %p has no references", __func__, unp));
|
2018-05-17 17:59:35 +00:00
|
|
|
}
|
|
|
|
|
2020-09-15 19:21:58 +00:00
|
|
|
static __result_use_check bool
|
2018-05-17 17:59:35 +00:00
|
|
|
unp_pcb_rele(struct unpcb *unp)
|
|
|
|
{
|
2020-09-15 19:21:58 +00:00
|
|
|
bool ret;
|
2018-05-17 17:59:35 +00:00
|
|
|
|
|
|
|
UNP_PCB_LOCK_ASSERT(unp);
|
2020-09-15 19:21:58 +00:00
|
|
|
|
|
|
|
if ((ret = refcount_release(&unp->unp_refcount))) {
|
2018-05-17 17:59:35 +00:00
|
|
|
UNP_PCB_UNLOCK(unp);
|
|
|
|
UNP_PCB_LOCK_DESTROY(unp);
|
|
|
|
uma_zfree(unp_zone, unp);
|
|
|
|
}
|
2020-09-15 19:21:58 +00:00
|
|
|
return (ret);
|
2018-05-17 17:59:35 +00:00
|
|
|
}
|
|
|
|
|
2020-09-15 19:22:37 +00:00
|
|
|
static void
|
|
|
|
unp_pcb_rele_notlast(struct unpcb *unp)
|
|
|
|
{
|
|
|
|
bool ret __unused;
|
|
|
|
|
|
|
|
ret = refcount_release(&unp->unp_refcount);
|
|
|
|
KASSERT(!ret, ("%s: unpcb %p has no references", __func__, unp));
|
|
|
|
}
|
|
|
|
|
2018-05-17 17:59:35 +00:00
|
|
|
static void
|
2020-09-15 19:22:16 +00:00
|
|
|
unp_pcb_lock_pair(struct unpcb *unp, struct unpcb *unp2)
|
2018-05-17 17:59:35 +00:00
|
|
|
{
|
|
|
|
UNP_PCB_UNLOCK_ASSERT(unp);
|
|
|
|
UNP_PCB_UNLOCK_ASSERT(unp2);
|
2020-09-15 19:22:16 +00:00
|
|
|
|
|
|
|
if (unp == unp2) {
|
|
|
|
UNP_PCB_LOCK(unp);
|
|
|
|
} else if ((uintptr_t)unp2 > (uintptr_t)unp) {
|
2018-05-17 17:59:35 +00:00
|
|
|
UNP_PCB_LOCK(unp);
|
|
|
|
UNP_PCB_LOCK(unp2);
|
|
|
|
} else {
|
|
|
|
UNP_PCB_LOCK(unp2);
|
|
|
|
UNP_PCB_LOCK(unp);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2020-09-15 19:22:16 +00:00
|
|
|
static void
|
|
|
|
unp_pcb_unlock_pair(struct unpcb *unp, struct unpcb *unp2)
|
|
|
|
{
|
|
|
|
UNP_PCB_UNLOCK(unp);
|
|
|
|
if (unp != unp2)
|
|
|
|
UNP_PCB_UNLOCK(unp2);
|
|
|
|
}
|
|
|
|
|
2020-09-15 19:23:22 +00:00
|
|
|
/*
|
|
|
|
* Try to lock the connected peer of an already locked socket. In some cases
|
|
|
|
* this requires that we unlock the current socket. The pairbusy counter is
|
|
|
|
* used to block concurrent connection attempts while the lock is dropped. The
|
|
|
|
* caller must be careful to revalidate PCB state.
|
|
|
|
*/
|
|
|
|
static struct unpcb *
|
|
|
|
unp_pcb_lock_peer(struct unpcb *unp)
|
2018-05-17 17:59:35 +00:00
|
|
|
{
|
|
|
|
struct unpcb *unp2;
|
|
|
|
|
2020-09-15 19:23:22 +00:00
|
|
|
UNP_PCB_LOCK_ASSERT(unp);
|
|
|
|
unp2 = unp->unp_conn;
|
2020-12-13 21:32:19 +00:00
|
|
|
if (unp2 == NULL)
|
2020-09-15 19:23:22 +00:00
|
|
|
return (NULL);
|
|
|
|
if (__predict_false(unp == unp2))
|
|
|
|
return (unp);
|
|
|
|
|
|
|
|
UNP_PCB_UNLOCK_ASSERT(unp2);
|
|
|
|
|
|
|
|
if (__predict_true(UNP_PCB_TRYLOCK(unp2)))
|
|
|
|
return (unp2);
|
|
|
|
if ((uintptr_t)unp2 > (uintptr_t)unp) {
|
|
|
|
UNP_PCB_LOCK(unp2);
|
|
|
|
return (unp2);
|
|
|
|
}
|
|
|
|
unp->unp_pairbusy++;
|
2018-08-04 20:16:36 +00:00
|
|
|
unp_pcb_hold(unp2);
|
|
|
|
UNP_PCB_UNLOCK(unp);
|
2020-09-15 19:23:22 +00:00
|
|
|
|
2018-08-04 20:16:36 +00:00
|
|
|
UNP_PCB_LOCK(unp2);
|
|
|
|
UNP_PCB_LOCK(unp);
|
2020-09-15 19:23:22 +00:00
|
|
|
KASSERT(unp->unp_conn == unp2 || unp->unp_conn == NULL,
|
|
|
|
("%s: socket %p was reconnected", __func__, unp));
|
|
|
|
if (--unp->unp_pairbusy == 0 && (unp->unp_flags & UNP_WAITING) != 0) {
|
|
|
|
unp->unp_flags &= ~UNP_WAITING;
|
|
|
|
wakeup(unp);
|
|
|
|
}
|
|
|
|
if (unp_pcb_rele(unp2)) {
|
|
|
|
/* unp2 is unlocked. */
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
if (unp->unp_conn == NULL) {
|
|
|
|
UNP_PCB_UNLOCK(unp2);
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
return (unp2);
|
2018-05-17 17:59:35 +00:00
|
|
|
}
|
|
|
|
|
2006-08-07 12:02:43 +00:00
|
|
|
/*
|
|
|
|
* Definitions of protocols supported in the LOCAL domain.
|
|
|
|
*/
|
|
|
|
static struct domain localdomain;
|
2008-10-08 06:19:49 +00:00
|
|
|
static struct pr_usrreqs uipc_usrreqs_dgram, uipc_usrreqs_stream;
|
2009-10-05 14:49:16 +00:00
|
|
|
static struct pr_usrreqs uipc_usrreqs_seqpacket;
|
2006-08-07 12:02:43 +00:00
|
|
|
static struct protosw localsw[] = {
|
|
|
|
{
|
|
|
|
.pr_type = SOCK_STREAM,
|
|
|
|
.pr_domain = &localdomain,
|
|
|
|
.pr_flags = PR_CONNREQUIRED|PR_WANTRCVD|PR_RIGHTS,
|
|
|
|
.pr_ctloutput = &uipc_ctloutput,
|
2008-10-08 06:19:49 +00:00
|
|
|
.pr_usrreqs = &uipc_usrreqs_stream
|
2006-08-07 12:02:43 +00:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.pr_type = SOCK_DGRAM,
|
|
|
|
.pr_domain = &localdomain,
|
|
|
|
.pr_flags = PR_ATOMIC|PR_ADDR|PR_RIGHTS,
|
2012-09-07 21:06:54 +00:00
|
|
|
.pr_ctloutput = &uipc_ctloutput,
|
2008-10-08 06:19:49 +00:00
|
|
|
.pr_usrreqs = &uipc_usrreqs_dgram
|
2006-08-07 12:02:43 +00:00
|
|
|
},
|
2009-10-05 14:49:16 +00:00
|
|
|
{
|
|
|
|
.pr_type = SOCK_SEQPACKET,
|
|
|
|
.pr_domain = &localdomain,
|
|
|
|
|
|
|
|
/*
|
|
|
|
* XXXRW: For now, PR_ADDR because soreceive will bump into them
|
|
|
|
* due to our use of sbappendaddr. A new sbappend variants is needed
|
|
|
|
* that supports both atomic record writes and control data.
|
|
|
|
*/
|
|
|
|
.pr_flags = PR_ADDR|PR_ATOMIC|PR_CONNREQUIRED|PR_WANTRCVD|
|
|
|
|
PR_RIGHTS,
|
2013-09-11 18:22:30 +00:00
|
|
|
.pr_ctloutput = &uipc_ctloutput,
|
2009-10-05 14:49:16 +00:00
|
|
|
.pr_usrreqs = &uipc_usrreqs_seqpacket,
|
|
|
|
},
|
2006-08-07 12:02:43 +00:00
|
|
|
};
|
|
|
|
|
|
|
|
static struct domain localdomain = {
|
|
|
|
.dom_family = AF_LOCAL,
|
|
|
|
.dom_name = "local",
|
|
|
|
.dom_init = unp_init,
|
|
|
|
.dom_externalize = unp_externalize,
|
2016-08-31 21:48:22 +00:00
|
|
|
.dom_dispose = unp_dispose,
|
2006-08-07 12:02:43 +00:00
|
|
|
.dom_protosw = localsw,
|
2016-04-19 23:48:27 +00:00
|
|
|
.dom_protoswNPROTOSW = &localsw[nitems(localsw)]
|
2006-08-07 12:02:43 +00:00
|
|
|
};
|
|
|
|
DOMAIN_SET(local);
|
|
|
|
|
2006-04-01 15:15:05 +00:00
|
|
|
static void
|
1997-04-27 20:01:29 +00:00
|
|
|
uipc_abort(struct socket *so)
|
|
|
|
{
|
2007-02-26 20:47:52 +00:00
|
|
|
struct unpcb *unp, *unp2;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2004-08-16 04:41:03 +00:00
|
|
|
unp = sotounpcb(so);
|
2006-03-17 13:52:57 +00:00
|
|
|
KASSERT(unp != NULL, ("uipc_abort: unp == NULL"));
|
2018-05-17 17:59:35 +00:00
|
|
|
UNP_PCB_UNLOCK_ASSERT(unp);
|
2007-02-26 20:47:52 +00:00
|
|
|
|
|
|
|
UNP_PCB_LOCK(unp);
|
|
|
|
unp2 = unp->unp_conn;
|
|
|
|
if (unp2 != NULL) {
|
2018-05-17 17:59:35 +00:00
|
|
|
unp_pcb_hold(unp2);
|
|
|
|
UNP_PCB_UNLOCK(unp);
|
2016-02-26 12:46:34 +00:00
|
|
|
unp_drop(unp2);
|
2018-05-17 17:59:35 +00:00
|
|
|
} else
|
|
|
|
UNP_PCB_UNLOCK(unp);
|
1997-04-27 20:01:29 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
1997-08-16 19:16:27 +00:00
|
|
|
uipc_accept(struct socket *so, struct sockaddr **nam)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2007-02-26 20:47:52 +00:00
|
|
|
struct unpcb *unp, *unp2;
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
const struct sockaddr *sa;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
1997-04-27 20:01:29 +00:00
|
|
|
/*
|
2006-07-22 17:24:55 +00:00
|
|
|
* Pass back name of connected socket, if it was bound and we are
|
|
|
|
* still connected (our peer may have closed already!).
|
1997-04-27 20:01:29 +00:00
|
|
|
*/
|
2006-03-17 13:52:57 +00:00
|
|
|
unp = sotounpcb(so);
|
|
|
|
KASSERT(unp != NULL, ("uipc_accept: unp == NULL"));
|
2007-02-26 20:47:52 +00:00
|
|
|
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
*nam = malloc(sizeof(struct sockaddr_un), M_SONAME, M_WAITOK);
|
2020-09-15 19:23:42 +00:00
|
|
|
UNP_PCB_LOCK(unp);
|
|
|
|
unp2 = unp_pcb_lock_peer(unp);
|
|
|
|
if (unp2 != NULL && unp2->unp_addr != NULL)
|
|
|
|
sa = (struct sockaddr *)unp2->unp_addr;
|
|
|
|
else
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
sa = &sun_noname;
|
2020-09-15 19:23:42 +00:00
|
|
|
bcopy(sa, *nam, sa->sa_len);
|
2020-09-15 23:03:56 +00:00
|
|
|
if (unp2 != NULL)
|
|
|
|
unp_pcb_unlock_pair(unp, unp2);
|
|
|
|
else
|
|
|
|
UNP_PCB_UNLOCK(unp);
|
2004-01-11 19:48:19 +00:00
|
|
|
return (0);
|
1997-04-27 20:01:29 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
|
1997-04-27 20:01:29 +00:00
|
|
|
static int
|
2001-09-12 08:38:13 +00:00
|
|
|
uipc_attach(struct socket *so, int proto, struct thread *td)
|
1997-04-27 20:01:29 +00:00
|
|
|
{
|
2007-02-26 20:47:52 +00:00
|
|
|
u_long sendspace, recvspace;
|
2006-07-23 10:25:28 +00:00
|
|
|
struct unpcb *unp;
|
2009-03-08 21:48:29 +00:00
|
|
|
int error;
|
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
|
|
|
bool locked;
|
2006-07-23 10:25:28 +00:00
|
|
|
|
|
|
|
KASSERT(so->so_pcb == NULL, ("uipc_attach: so_pcb != NULL"));
|
|
|
|
if (so->so_snd.sb_hiwat == 0 || so->so_rcv.sb_hiwat == 0) {
|
|
|
|
switch (so->so_type) {
|
|
|
|
case SOCK_STREAM:
|
2007-02-26 20:47:52 +00:00
|
|
|
sendspace = unpst_sendspace;
|
|
|
|
recvspace = unpst_recvspace;
|
2006-07-23 10:25:28 +00:00
|
|
|
break;
|
|
|
|
|
|
|
|
case SOCK_DGRAM:
|
2007-02-26 20:47:52 +00:00
|
|
|
sendspace = unpdg_sendspace;
|
|
|
|
recvspace = unpdg_recvspace;
|
2006-07-23 10:25:28 +00:00
|
|
|
break;
|
|
|
|
|
2009-10-05 14:49:16 +00:00
|
|
|
case SOCK_SEQPACKET:
|
|
|
|
sendspace = unpsp_sendspace;
|
|
|
|
recvspace = unpsp_recvspace;
|
|
|
|
break;
|
|
|
|
|
2006-07-23 10:25:28 +00:00
|
|
|
default:
|
2007-02-26 20:47:52 +00:00
|
|
|
panic("uipc_attach");
|
2006-07-23 10:25:28 +00:00
|
|
|
}
|
2007-02-26 20:47:52 +00:00
|
|
|
error = soreserve(so, sendspace, recvspace);
|
2006-07-23 10:25:28 +00:00
|
|
|
if (error)
|
|
|
|
return (error);
|
|
|
|
}
|
2007-02-14 12:22:11 +00:00
|
|
|
unp = uma_zalloc(unp_zone, M_NOWAIT | M_ZERO);
|
2006-07-23 10:25:28 +00:00
|
|
|
if (unp == NULL)
|
|
|
|
return (ENOBUFS);
|
|
|
|
LIST_INIT(&unp->unp_refs);
|
2007-02-26 20:47:52 +00:00
|
|
|
UNP_PCB_LOCK_INIT(unp);
|
2006-07-23 10:25:28 +00:00
|
|
|
unp->unp_socket = so;
|
|
|
|
so->so_pcb = unp;
|
2020-09-15 19:21:58 +00:00
|
|
|
refcount_init(&unp->unp_refcount, 1);
|
2007-02-26 20:47:52 +00:00
|
|
|
|
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
|
|
|
if ((locked = UNP_LINK_WOWNED()) == false)
|
|
|
|
UNP_LINK_WLOCK();
|
|
|
|
|
2006-07-23 10:25:28 +00:00
|
|
|
unp->unp_gencnt = ++unp_gencnt;
|
2018-11-21 22:25:05 +00:00
|
|
|
unp->unp_ino = ++unp_ino;
|
2006-07-23 10:25:28 +00:00
|
|
|
unp_count++;
|
2009-10-05 14:49:16 +00:00
|
|
|
switch (so->so_type) {
|
|
|
|
case SOCK_STREAM:
|
|
|
|
LIST_INSERT_HEAD(&unp_shead, unp, unp_link);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case SOCK_DGRAM:
|
|
|
|
LIST_INSERT_HEAD(&unp_dhead, unp, unp_link);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case SOCK_SEQPACKET:
|
|
|
|
LIST_INSERT_HEAD(&unp_sphead, unp, unp_link);
|
|
|
|
break;
|
|
|
|
|
|
|
|
default:
|
|
|
|
panic("uipc_attach");
|
|
|
|
}
|
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
|
|
|
|
|
|
|
if (locked == false)
|
|
|
|
UNP_LINK_WUNLOCK();
|
2006-07-23 10:25:28 +00:00
|
|
|
|
|
|
|
return (0);
|
1997-04-27 20:01:29 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
|
1997-04-27 20:01:29 +00:00
|
|
|
static int
|
2013-03-02 21:11:30 +00:00
|
|
|
uipc_bindat(int fd, struct socket *so, struct sockaddr *nam, struct thread *td)
|
1997-04-27 20:01:29 +00:00
|
|
|
{
|
2006-07-23 11:02:12 +00:00
|
|
|
struct sockaddr_un *soun = (struct sockaddr_un *)nam;
|
|
|
|
struct vattr vattr;
|
2012-10-22 17:50:54 +00:00
|
|
|
int error, namelen;
|
2006-07-23 11:02:12 +00:00
|
|
|
struct nameidata nd;
|
2004-08-16 04:41:03 +00:00
|
|
|
struct unpcb *unp;
|
2006-07-23 11:02:12 +00:00
|
|
|
struct vnode *vp;
|
|
|
|
struct mount *mp;
|
Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.
The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.
The structure definition looks like this:
struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};
The initial CAP_RIGHTS_VERSION is 0.
The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.
The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.
To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.
#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)
We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:
#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)
#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)
There is new API to manage the new cap_rights_t structure:
cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);
bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);
Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:
cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);
There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:
#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);
Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:
cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);
Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.
This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.
Sponsored by: The FreeBSD Foundation
2013-09-05 00:09:56 +00:00
|
|
|
cap_rights_t rights;
|
2006-07-23 11:02:12 +00:00
|
|
|
char *buf;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2014-07-14 06:00:01 +00:00
|
|
|
if (nam->sa_family != AF_UNIX)
|
|
|
|
return (EAFNOSUPPORT);
|
|
|
|
|
2004-08-16 04:41:03 +00:00
|
|
|
unp = sotounpcb(so);
|
2006-03-17 13:52:57 +00:00
|
|
|
KASSERT(unp != NULL, ("uipc_bind: unp == NULL"));
|
2006-07-23 12:01:14 +00:00
|
|
|
|
2011-09-28 08:47:17 +00:00
|
|
|
if (soun->sun_len > sizeof(struct sockaddr_un))
|
|
|
|
return (EINVAL);
|
2006-07-23 12:01:14 +00:00
|
|
|
namelen = soun->sun_len - offsetof(struct sockaddr_un, sun_path);
|
|
|
|
if (namelen <= 0)
|
|
|
|
return (EINVAL);
|
2006-07-23 11:02:12 +00:00
|
|
|
|
|
|
|
/*
|
2006-07-23 12:01:14 +00:00
|
|
|
* We don't allow simultaneous bind() calls on a single UNIX domain
|
|
|
|
* socket, so flag in-progress operations, and return an error if an
|
|
|
|
* operation is already in progress.
|
|
|
|
*
|
|
|
|
* Historically, we have not allowed a socket to be rebound, so this
|
2007-05-11 12:10:45 +00:00
|
|
|
* also returns an error. Not allowing re-binding simplifies the
|
|
|
|
* implementation and avoids a great many possible failure modes.
|
2006-07-23 11:02:12 +00:00
|
|
|
*/
|
2007-02-26 20:47:52 +00:00
|
|
|
UNP_PCB_LOCK(unp);
|
2006-07-23 11:02:12 +00:00
|
|
|
if (unp->unp_vnode != NULL) {
|
2007-02-26 20:47:52 +00:00
|
|
|
UNP_PCB_UNLOCK(unp);
|
2006-07-23 11:02:12 +00:00
|
|
|
return (EINVAL);
|
|
|
|
}
|
2006-07-23 12:01:14 +00:00
|
|
|
if (unp->unp_flags & UNP_BINDING) {
|
2007-02-26 20:47:52 +00:00
|
|
|
UNP_PCB_UNLOCK(unp);
|
2006-07-23 12:01:14 +00:00
|
|
|
return (EALREADY);
|
2006-07-23 11:02:12 +00:00
|
|
|
}
|
2006-07-23 12:01:14 +00:00
|
|
|
unp->unp_flags |= UNP_BINDING;
|
2007-02-26 20:47:52 +00:00
|
|
|
UNP_PCB_UNLOCK(unp);
|
2006-07-23 11:02:12 +00:00
|
|
|
|
|
|
|
buf = malloc(namelen + 1, M_TEMP, M_WAITOK);
|
2008-07-03 23:26:10 +00:00
|
|
|
bcopy(soun->sun_path, buf, namelen);
|
|
|
|
buf[namelen] = 0;
|
2006-07-23 11:02:12 +00:00
|
|
|
|
|
|
|
restart:
|
2014-12-18 10:01:12 +00:00
|
|
|
NDINIT_ATRIGHTS(&nd, CREATE, NOFOLLOW | LOCKPARENT | SAVENAME | NOCACHE,
|
Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.
The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.
The structure definition looks like this:
struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};
The initial CAP_RIGHTS_VERSION is 0.
The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.
The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.
To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.
#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)
We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:
#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)
#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)
There is new API to manage the new cap_rights_t structure:
cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);
bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);
Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:
cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);
There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:
#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);
Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:
cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);
Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.
This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.
Sponsored by: The FreeBSD Foundation
2013-09-05 00:09:56 +00:00
|
|
|
UIO_SYSSPACE, buf, fd, cap_rights_init(&rights, CAP_BINDAT), td);
|
2006-07-23 11:02:12 +00:00
|
|
|
/* SHOULD BE ABLE TO ADOPT EXISTING AND wakeup() ALA FIFO's */
|
|
|
|
error = namei(&nd);
|
|
|
|
if (error)
|
2006-07-23 12:01:14 +00:00
|
|
|
goto error;
|
2006-07-23 11:02:12 +00:00
|
|
|
vp = nd.ni_vp;
|
|
|
|
if (vp != NULL || vn_start_write(nd.ni_dvp, &mp, V_NOWAIT) != 0) {
|
|
|
|
NDFREE(&nd, NDF_ONLY_PNBUF);
|
|
|
|
if (nd.ni_dvp == vp)
|
|
|
|
vrele(nd.ni_dvp);
|
|
|
|
else
|
|
|
|
vput(nd.ni_dvp);
|
|
|
|
if (vp != NULL) {
|
|
|
|
vrele(vp);
|
|
|
|
error = EADDRINUSE;
|
2006-07-23 12:01:14 +00:00
|
|
|
goto error;
|
2006-07-23 11:02:12 +00:00
|
|
|
}
|
|
|
|
error = vn_start_write(NULL, &mp, V_XSLEEP | PCATCH);
|
|
|
|
if (error)
|
2006-07-23 12:01:14 +00:00
|
|
|
goto error;
|
2006-07-23 11:02:12 +00:00
|
|
|
goto restart;
|
|
|
|
}
|
|
|
|
VATTR_NULL(&vattr);
|
|
|
|
vattr.va_type = VSOCK;
|
2020-11-17 21:14:13 +00:00
|
|
|
vattr.va_mode = (ACCESSPERMS & ~td->td_proc->p_pd->pd_cmask);
|
2006-07-23 11:02:12 +00:00
|
|
|
#ifdef MAC
|
2007-10-24 19:04:04 +00:00
|
|
|
error = mac_vnode_check_create(td->td_ucred, nd.ni_dvp, &nd.ni_cnd,
|
2006-07-23 11:02:12 +00:00
|
|
|
&vattr);
|
|
|
|
#endif
|
2009-04-10 10:52:19 +00:00
|
|
|
if (error == 0)
|
2006-07-23 11:02:12 +00:00
|
|
|
error = VOP_CREATE(nd.ni_dvp, &nd.ni_vp, &nd.ni_cnd, &vattr);
|
|
|
|
NDFREE(&nd, NDF_ONLY_PNBUF);
|
|
|
|
vput(nd.ni_dvp);
|
|
|
|
if (error) {
|
|
|
|
vn_finished_write(mp);
|
2020-11-13 09:42:32 +00:00
|
|
|
if (error == ERELOOKUP)
|
|
|
|
goto restart;
|
2006-07-23 12:01:14 +00:00
|
|
|
goto error;
|
2006-07-23 11:02:12 +00:00
|
|
|
}
|
|
|
|
vp = nd.ni_vp;
|
2007-07-26 16:58:09 +00:00
|
|
|
ASSERT_VOP_ELOCKED(vp, "uipc_bind");
|
2006-07-23 11:02:12 +00:00
|
|
|
soun = (struct sockaddr_un *)sodupsockaddr(nam, M_WAITOK);
|
2007-02-26 20:47:52 +00:00
|
|
|
|
|
|
|
UNP_PCB_LOCK(unp);
|
2017-06-02 17:31:25 +00:00
|
|
|
VOP_UNP_BIND(vp, unp);
|
2006-07-23 11:02:12 +00:00
|
|
|
unp->unp_vnode = vp;
|
|
|
|
unp->unp_addr = soun;
|
2006-07-23 12:01:14 +00:00
|
|
|
unp->unp_flags &= ~UNP_BINDING;
|
2007-02-26 20:47:52 +00:00
|
|
|
UNP_PCB_UNLOCK(unp);
|
2020-01-03 22:29:58 +00:00
|
|
|
VOP_UNLOCK(vp);
|
2006-07-23 11:02:12 +00:00
|
|
|
vn_finished_write(mp);
|
2006-07-23 12:01:14 +00:00
|
|
|
free(buf, M_TEMP);
|
|
|
|
return (0);
|
2007-02-26 20:47:52 +00:00
|
|
|
|
2006-07-23 12:01:14 +00:00
|
|
|
error:
|
2007-02-26 20:47:52 +00:00
|
|
|
UNP_PCB_LOCK(unp);
|
2006-07-23 12:01:14 +00:00
|
|
|
unp->unp_flags &= ~UNP_BINDING;
|
2007-02-26 20:47:52 +00:00
|
|
|
UNP_PCB_UNLOCK(unp);
|
2006-07-23 11:02:12 +00:00
|
|
|
free(buf, M_TEMP);
|
2004-08-16 04:41:03 +00:00
|
|
|
return (error);
|
1997-04-27 20:01:29 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2013-03-02 21:11:30 +00:00
|
|
|
static int
|
|
|
|
uipc_bind(struct socket *so, struct sockaddr *nam, struct thread *td)
|
|
|
|
{
|
|
|
|
|
|
|
|
return (uipc_bindat(AT_FDCWD, so, nam, td));
|
|
|
|
}
|
|
|
|
|
1997-04-27 20:01:29 +00:00
|
|
|
static int
|
2001-09-12 08:38:13 +00:00
|
|
|
uipc_connect(struct socket *so, struct sockaddr *nam, struct thread *td)
|
1997-04-27 20:01:29 +00:00
|
|
|
{
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
int error;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2004-07-25 23:30:43 +00:00
|
|
|
KASSERT(td == curthread, ("uipc_connect: td != curthread"));
|
|
|
|
error = unp_connect(so, nam, td);
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
return (error);
|
1997-04-27 20:01:29 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2013-03-02 21:11:30 +00:00
|
|
|
static int
|
|
|
|
uipc_connectat(int fd, struct socket *so, struct sockaddr *nam,
|
|
|
|
struct thread *td)
|
|
|
|
{
|
|
|
|
int error;
|
|
|
|
|
|
|
|
KASSERT(td == curthread, ("uipc_connectat: td != curthread"));
|
|
|
|
error = unp_connectat(fd, so, nam, td);
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2006-07-21 17:11:15 +00:00
|
|
|
static void
|
|
|
|
uipc_close(struct socket *so)
|
|
|
|
{
|
2007-02-26 20:47:52 +00:00
|
|
|
struct unpcb *unp, *unp2;
|
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
|
|
|
struct vnode *vp = NULL;
|
2018-05-17 17:59:35 +00:00
|
|
|
struct mtx *vplock;
|
2020-09-15 19:23:22 +00:00
|
|
|
|
2006-07-21 17:11:15 +00:00
|
|
|
unp = sotounpcb(so);
|
|
|
|
KASSERT(unp != NULL, ("uipc_close: unp == NULL"));
|
2007-02-26 20:47:52 +00:00
|
|
|
|
2018-05-17 17:59:35 +00:00
|
|
|
vplock = NULL;
|
|
|
|
if ((vp = unp->unp_vnode) != NULL) {
|
|
|
|
vplock = mtx_pool_find(mtxpool_sleep, vp);
|
|
|
|
mtx_lock(vplock);
|
|
|
|
}
|
2007-02-26 20:47:52 +00:00
|
|
|
UNP_PCB_LOCK(unp);
|
2018-05-17 17:59:35 +00:00
|
|
|
if (vp && unp->unp_vnode == NULL) {
|
|
|
|
mtx_unlock(vplock);
|
|
|
|
vp = NULL;
|
2007-02-26 20:47:52 +00:00
|
|
|
}
|
2018-05-17 17:59:35 +00:00
|
|
|
if (vp != NULL) {
|
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
|
|
|
VOP_UNP_DETACH(vp);
|
|
|
|
unp->unp_vnode = NULL;
|
|
|
|
}
|
2020-09-15 19:23:22 +00:00
|
|
|
if ((unp2 = unp_pcb_lock_peer(unp)) != NULL)
|
2018-05-24 21:13:46 +00:00
|
|
|
unp_disconnect(unp, unp2);
|
2020-09-15 19:23:22 +00:00
|
|
|
else
|
2018-05-17 17:59:35 +00:00
|
|
|
UNP_PCB_UNLOCK(unp);
|
|
|
|
if (vp) {
|
|
|
|
mtx_unlock(vplock);
|
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
|
|
|
vrele(vp);
|
2018-05-17 17:59:35 +00:00
|
|
|
}
|
2006-07-21 17:11:15 +00:00
|
|
|
}
|
|
|
|
|
2008-10-06 18:43:11 +00:00
|
|
|
static int
|
1997-04-27 20:01:29 +00:00
|
|
|
uipc_connect2(struct socket *so1, struct socket *so2)
|
|
|
|
{
|
2007-02-26 20:47:52 +00:00
|
|
|
struct unpcb *unp, *unp2;
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
int error;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2007-02-26 20:47:52 +00:00
|
|
|
unp = so1->so_pcb;
|
2006-03-17 13:52:57 +00:00
|
|
|
KASSERT(unp != NULL, ("uipc_connect2: unp == NULL"));
|
2007-02-26 20:47:52 +00:00
|
|
|
unp2 = so2->so_pcb;
|
|
|
|
KASSERT(unp2 != NULL, ("uipc_connect2: unp2 == NULL"));
|
2020-09-15 19:22:16 +00:00
|
|
|
unp_pcb_lock_pair(unp, unp2);
|
2005-04-13 00:01:46 +00:00
|
|
|
error = unp_connect2(so1, so2, PRU_CONNECT2);
|
2020-09-15 19:22:16 +00:00
|
|
|
unp_pcb_unlock_pair(unp, unp2);
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
return (error);
|
1997-04-27 20:01:29 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
|
Chance protocol switch method pru_detach() so that it returns void
rather than an error. Detaches do not "fail", they other occur or
the protocol flags SS_PROTOREF to take ownership of the socket.
soclose() no longer looks at so_pcb to see if it's NULL, relying
entirely on the protocol to decide whether it's time to free the
socket or not using SS_PROTOREF. so_pcb is now entirely owned and
managed by the protocol code. Likewise, no longer test so_pcb in
other socket functions, such as soreceive(), which have no business
digging into protocol internals.
Protocol detach routines no longer try to free the socket on detach,
this is performed in the socket code if the protocol permits it.
In rts_detach(), no longer test for rp != NULL in detach, and
likewise in other protocols that don't permit a NULL so_pcb, reduce
the incidence of testing for it during detach.
netinet and netinet6 are not fully updated to this change, which
will be in an upcoming commit. In their current state they may leak
memory or panic.
MFC after: 3 months
2006-04-01 15:42:02 +00:00
|
|
|
static void
|
1997-04-27 20:01:29 +00:00
|
|
|
uipc_detach(struct socket *so)
|
|
|
|
{
|
2007-02-26 20:47:52 +00:00
|
|
|
struct unpcb *unp, *unp2;
|
2018-05-17 17:59:35 +00:00
|
|
|
struct mtx *vplock;
|
2006-07-23 10:25:28 +00:00
|
|
|
struct vnode *vp;
|
2020-09-15 19:23:22 +00:00
|
|
|
int local_unp_rights;
|
1997-04-27 20:01:29 +00:00
|
|
|
|
2004-08-16 04:41:03 +00:00
|
|
|
unp = sotounpcb(so);
|
2006-03-17 13:52:57 +00:00
|
|
|
KASSERT(unp != NULL, ("uipc_detach: unp == NULL"));
|
2007-02-26 20:47:52 +00:00
|
|
|
|
2016-08-08 20:25:04 +00:00
|
|
|
vp = NULL;
|
2018-05-19 02:15:40 +00:00
|
|
|
vplock = NULL;
|
2016-08-08 20:25:04 +00:00
|
|
|
|
2020-04-10 20:41:59 +00:00
|
|
|
SOCK_LOCK(so);
|
|
|
|
if (!SOLISTENING(so)) {
|
|
|
|
/*
|
|
|
|
* Once the socket is removed from the global lists,
|
|
|
|
* uipc_ready() will not be able to locate its socket buffer, so
|
|
|
|
* clear the buffer now. At this point internalized rights have
|
|
|
|
* already been disposed of.
|
|
|
|
*/
|
|
|
|
sbrelease(&so->so_rcv, so);
|
|
|
|
}
|
|
|
|
SOCK_UNLOCK(so);
|
|
|
|
|
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
|
|
|
UNP_LINK_WLOCK();
|
2006-07-23 10:25:28 +00:00
|
|
|
LIST_REMOVE(unp, unp_link);
|
2020-01-25 08:57:26 +00:00
|
|
|
if (unp->unp_gcflag & UNPGC_DEAD)
|
|
|
|
LIST_REMOVE(unp, unp_dead);
|
2006-07-23 10:25:28 +00:00
|
|
|
unp->unp_gencnt = ++unp_gencnt;
|
|
|
|
--unp_count;
|
2018-05-17 17:59:35 +00:00
|
|
|
UNP_LINK_WUNLOCK();
|
|
|
|
|
|
|
|
UNP_PCB_UNLOCK_ASSERT(unp);
|
|
|
|
restart:
|
|
|
|
if ((vp = unp->unp_vnode) != NULL) {
|
|
|
|
vplock = mtx_pool_find(mtxpool_sleep, vp);
|
|
|
|
mtx_lock(vplock);
|
|
|
|
}
|
2016-08-08 20:25:04 +00:00
|
|
|
UNP_PCB_LOCK(unp);
|
2020-03-20 16:18:54 +00:00
|
|
|
if (unp->unp_vnode != vp && unp->unp_vnode != NULL) {
|
2018-05-19 02:15:40 +00:00
|
|
|
if (vplock)
|
|
|
|
mtx_unlock(vplock);
|
2018-05-17 17:59:35 +00:00
|
|
|
UNP_PCB_UNLOCK(unp);
|
|
|
|
goto restart;
|
|
|
|
}
|
2006-07-23 10:25:28 +00:00
|
|
|
if ((vp = unp->unp_vnode) != NULL) {
|
2012-02-29 21:38:31 +00:00
|
|
|
VOP_UNP_DETACH(vp);
|
2006-07-23 10:25:28 +00:00
|
|
|
unp->unp_vnode = NULL;
|
|
|
|
}
|
2020-09-15 19:23:22 +00:00
|
|
|
if ((unp2 = unp_pcb_lock_peer(unp)) != NULL)
|
|
|
|
unp_disconnect(unp, unp2);
|
|
|
|
else
|
|
|
|
UNP_PCB_UNLOCK(unp);
|
|
|
|
|
2018-05-17 17:59:35 +00:00
|
|
|
UNP_REF_LIST_LOCK();
|
2006-07-23 10:25:28 +00:00
|
|
|
while (!LIST_EMPTY(&unp->unp_refs)) {
|
|
|
|
struct unpcb *ref = LIST_FIRST(&unp->unp_refs);
|
2007-02-26 20:47:52 +00:00
|
|
|
|
2018-05-17 17:59:35 +00:00
|
|
|
unp_pcb_hold(ref);
|
|
|
|
UNP_REF_LIST_UNLOCK();
|
|
|
|
|
|
|
|
MPASS(ref != unp);
|
|
|
|
UNP_PCB_UNLOCK_ASSERT(ref);
|
2016-02-26 12:46:34 +00:00
|
|
|
unp_drop(ref);
|
2018-05-17 17:59:35 +00:00
|
|
|
UNP_REF_LIST_LOCK();
|
2006-07-23 10:25:28 +00:00
|
|
|
}
|
2018-05-17 17:59:35 +00:00
|
|
|
UNP_REF_LIST_UNLOCK();
|
2020-09-15 19:23:22 +00:00
|
|
|
|
2018-05-17 17:59:35 +00:00
|
|
|
UNP_PCB_LOCK(unp);
|
2007-12-30 01:42:15 +00:00
|
|
|
local_unp_rights = unp_rights;
|
2006-07-23 10:25:28 +00:00
|
|
|
unp->unp_socket->so_pcb = NULL;
|
2018-05-17 17:59:35 +00:00
|
|
|
unp->unp_socket = NULL;
|
2020-03-20 16:18:54 +00:00
|
|
|
free(unp->unp_addr, M_SONAME);
|
|
|
|
unp->unp_addr = NULL;
|
|
|
|
if (!unp_pcb_rele(unp))
|
2007-03-12 14:52:00 +00:00
|
|
|
UNP_PCB_UNLOCK(unp);
|
2018-05-17 17:59:35 +00:00
|
|
|
if (vp) {
|
|
|
|
mtx_unlock(vplock);
|
2006-07-23 10:25:28 +00:00
|
|
|
vrele(vp);
|
2018-05-17 17:59:35 +00:00
|
|
|
}
|
2006-07-23 10:25:28 +00:00
|
|
|
if (local_unp_rights)
|
2012-11-20 15:45:48 +00:00
|
|
|
taskqueue_enqueue_timeout(taskqueue_thread, &unp_gc_task, -1);
|
1997-04-27 20:01:29 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
uipc_disconnect(struct socket *so)
|
|
|
|
{
|
2007-02-26 20:47:52 +00:00
|
|
|
struct unpcb *unp, *unp2;
|
1997-04-27 20:01:29 +00:00
|
|
|
|
2004-08-16 04:41:03 +00:00
|
|
|
unp = sotounpcb(so);
|
2006-03-17 13:52:57 +00:00
|
|
|
KASSERT(unp != NULL, ("uipc_disconnect: unp == NULL"));
|
2007-02-26 20:47:52 +00:00
|
|
|
|
|
|
|
UNP_PCB_LOCK(unp);
|
2020-09-15 19:23:22 +00:00
|
|
|
if ((unp2 = unp_pcb_lock_peer(unp)) != NULL)
|
|
|
|
unp_disconnect(unp, unp2);
|
|
|
|
else
|
2018-05-17 17:59:35 +00:00
|
|
|
UNP_PCB_UNLOCK(unp);
|
2004-01-11 19:48:19 +00:00
|
|
|
return (0);
|
1997-04-27 20:01:29 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
2005-10-30 19:44:40 +00:00
|
|
|
uipc_listen(struct socket *so, int backlog, struct thread *td)
|
1997-04-27 20:01:29 +00:00
|
|
|
{
|
2004-08-16 04:41:03 +00:00
|
|
|
struct unpcb *unp;
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
int error;
|
1997-04-27 20:01:29 +00:00
|
|
|
|
2017-01-25 22:26:45 +00:00
|
|
|
if (so->so_type != SOCK_STREAM && so->so_type != SOCK_SEQPACKET)
|
|
|
|
return (EOPNOTSUPP);
|
|
|
|
|
2004-08-16 04:41:03 +00:00
|
|
|
unp = sotounpcb(so);
|
2006-03-17 13:52:57 +00:00
|
|
|
KASSERT(unp != NULL, ("uipc_listen: unp == NULL"));
|
2007-02-26 20:47:52 +00:00
|
|
|
|
|
|
|
UNP_PCB_LOCK(unp);
|
2006-03-17 13:52:57 +00:00
|
|
|
if (unp->unp_vnode == NULL) {
|
2015-07-10 06:47:14 +00:00
|
|
|
/* Already connected or not bound to an address. */
|
|
|
|
error = unp->unp_conn != NULL ? EINVAL : EDESTADDRREQ;
|
2007-02-26 20:47:52 +00:00
|
|
|
UNP_PCB_UNLOCK(unp);
|
2015-07-10 06:47:14 +00:00
|
|
|
return (error);
|
2004-08-16 04:41:03 +00:00
|
|
|
}
|
2007-02-26 20:47:52 +00:00
|
|
|
|
|
|
|
SOCK_LOCK(so);
|
|
|
|
error = solisten_proto_check(so);
|
|
|
|
if (error == 0) {
|
2019-05-30 14:24:26 +00:00
|
|
|
cru2xt(td, &unp->unp_peercred);
|
2007-02-26 20:47:52 +00:00
|
|
|
solisten_proto(so, backlog);
|
|
|
|
}
|
|
|
|
SOCK_UNLOCK(so);
|
|
|
|
UNP_PCB_UNLOCK(unp);
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
return (error);
|
1997-04-27 20:01:29 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
1997-08-16 19:16:27 +00:00
|
|
|
uipc_peeraddr(struct socket *so, struct sockaddr **nam)
|
1997-04-27 20:01:29 +00:00
|
|
|
{
|
2007-02-26 20:47:52 +00:00
|
|
|
struct unpcb *unp, *unp2;
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
const struct sockaddr *sa;
|
1997-04-27 20:01:29 +00:00
|
|
|
|
2006-03-17 13:52:57 +00:00
|
|
|
unp = sotounpcb(so);
|
|
|
|
KASSERT(unp != NULL, ("uipc_peeraddr: unp == NULL"));
|
2007-02-26 20:47:52 +00:00
|
|
|
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
*nam = malloc(sizeof(struct sockaddr_un), M_SONAME, M_WAITOK);
|
2009-06-18 20:56:22 +00:00
|
|
|
UNP_LINK_RLOCK();
|
2007-02-26 20:47:52 +00:00
|
|
|
/*
|
|
|
|
* XXX: It seems that this test always fails even when connection is
|
|
|
|
* established. So, this else clause is added as workaround to
|
|
|
|
* return PF_LOCAL sockaddr.
|
|
|
|
*/
|
|
|
|
unp2 = unp->unp_conn;
|
|
|
|
if (unp2 != NULL) {
|
|
|
|
UNP_PCB_LOCK(unp2);
|
|
|
|
if (unp2->unp_addr != NULL)
|
2009-06-18 20:56:22 +00:00
|
|
|
sa = (struct sockaddr *) unp2->unp_addr;
|
2007-02-26 20:47:52 +00:00
|
|
|
else
|
|
|
|
sa = &sun_noname;
|
|
|
|
bcopy(sa, *nam, sa->sa_len);
|
|
|
|
UNP_PCB_UNLOCK(unp2);
|
|
|
|
} else {
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
sa = &sun_noname;
|
2007-02-26 20:47:52 +00:00
|
|
|
bcopy(sa, *nam, sa->sa_len);
|
2003-01-22 18:03:06 +00:00
|
|
|
}
|
2009-06-18 20:56:22 +00:00
|
|
|
UNP_LINK_RUNLOCK();
|
2004-01-11 19:48:19 +00:00
|
|
|
return (0);
|
1997-04-27 20:01:29 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
uipc_rcvd(struct socket *so, int flags)
|
|
|
|
{
|
2007-02-26 20:47:52 +00:00
|
|
|
struct unpcb *unp, *unp2;
|
1997-04-27 20:01:29 +00:00
|
|
|
struct socket *so2;
|
2006-07-11 21:49:54 +00:00
|
|
|
u_int mbcnt, sbcc;
|
1997-04-27 20:01:29 +00:00
|
|
|
|
2004-08-16 04:41:03 +00:00
|
|
|
unp = sotounpcb(so);
|
2014-11-12 10:17:46 +00:00
|
|
|
KASSERT(unp != NULL, ("%s: unp == NULL", __func__));
|
|
|
|
KASSERT(so->so_type == SOCK_STREAM || so->so_type == SOCK_SEQPACKET,
|
|
|
|
("%s: socktype %d", __func__, so->so_type));
|
2007-02-26 20:47:52 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Adjust backpressure on sender and wakeup any waiting to write.
|
|
|
|
*
|
2007-05-11 12:10:45 +00:00
|
|
|
* The unp lock is acquired to maintain the validity of the unp_conn
|
|
|
|
* pointer; no lock on unp2 is required as unp2->unp_socket will be
|
|
|
|
* static as long as we don't permit unp2 to disconnect from unp,
|
|
|
|
* which is prevented by the lock on unp. We cache values from
|
|
|
|
* so_rcv to avoid holding the so_rcv lock over the entire
|
|
|
|
* transaction on the remote so_snd.
|
2007-02-26 20:47:52 +00:00
|
|
|
*/
|
|
|
|
SOCKBUF_LOCK(&so->so_rcv);
|
|
|
|
mbcnt = so->so_rcv.sb_mbcnt;
|
2014-11-12 10:17:46 +00:00
|
|
|
sbcc = sbavail(&so->so_rcv);
|
2007-02-26 20:47:52 +00:00
|
|
|
SOCKBUF_UNLOCK(&so->so_rcv);
|
Replace 4.4BSD Lite's unix domain socket backpressure hack with a cleaner
mechanism, based on the new SB_STOP sockbuf flag. The old hack dynamically
changed the sending sockbuf's high water mark whenever adding or removing
data from the receiving sockbuf. It worked for stream sockets, but it never
worked for SOCK_SEQPACKET sockets because of their atomic nature. If the
sockbuf was partially full, it might return EMSGSIZE instead of blocking.
The new solution is based on DragonFlyBSD's fix from commit
3a6117bbe0ed6a87605c1e43e12a1438d8844380 on 2008-05-27. It adds an SB_STOP
flag to sockbufs. Whenever uipc_send surpasses the socket's size limit, it
sets SB_STOP on the sending sockbuf. sbspace() will then return 0 for that
sockbuf, causing sosend_generic and friends to block. uipc_rcvd will
likewise clear SB_STOP. There are two fringe benefits: uipc_{send,rcvd} no
longer need to call chgsbsize() on every send and receive because they don't
change the sockbuf's high water mark. Also, uipc_sense no longer needs to
acquire the UIPC linkage lock, because it's simpler to compute the
st_blksizes.
There is one drawback: since sbspace() will only ever return 0 or the
maximum, sosend_generic will allow the sockbuf to exceed its nominal maximum
size by at most one packet of size less than the max. I don't think that's
a serious problem. In fact, I'm not even positive that FreeBSD guarantees a
socket will always stay within its nominal size limit.
sys/sys/sockbuf.h
Add the SB_STOP flag and adjust sbspace()
sys/sys/unpcb.h
Delete the obsolete unp_cc and unp_mbcnt fields from struct unpcb.
sys/kern/uipc_usrreq.c
Adjust uipc_rcvd, uipc_send, and uipc_sense to use the SB_STOP
backpressure mechanism. Removing obsolete unpcb fields from
db_show_unpcb.
tests/sys/kern/unix_seqpacket_test.c
Clear expected failures from ATF.
Obtained from: DragonFly BSD
PR: kern/185812
Reviewed by: silence from freebsd-net@ and rwatson@
MFC after: 3 weeks
Sponsored by: Spectra Logic Corporation
2014-03-13 18:42:12 +00:00
|
|
|
/*
|
|
|
|
* There is a benign race condition at this point. If we're planning to
|
|
|
|
* clear SB_STOP, but uipc_send is called on the connected socket at
|
|
|
|
* this instant, it might add data to the sockbuf and set SB_STOP. Then
|
|
|
|
* we would erroneously clear SB_STOP below, even though the sockbuf is
|
|
|
|
* full. The race is benign because the only ill effect is to allow the
|
|
|
|
* sockbuf to exceed its size limit, and the size limits are not
|
|
|
|
* strictly guaranteed anyway.
|
|
|
|
*/
|
2007-02-26 20:47:52 +00:00
|
|
|
UNP_PCB_LOCK(unp);
|
|
|
|
unp2 = unp->unp_conn;
|
|
|
|
if (unp2 == NULL) {
|
|
|
|
UNP_PCB_UNLOCK(unp);
|
|
|
|
return (0);
|
1997-04-27 20:01:29 +00:00
|
|
|
}
|
2007-02-26 20:47:52 +00:00
|
|
|
so2 = unp2->unp_socket;
|
|
|
|
SOCKBUF_LOCK(&so2->so_snd);
|
Replace 4.4BSD Lite's unix domain socket backpressure hack with a cleaner
mechanism, based on the new SB_STOP sockbuf flag. The old hack dynamically
changed the sending sockbuf's high water mark whenever adding or removing
data from the receiving sockbuf. It worked for stream sockets, but it never
worked for SOCK_SEQPACKET sockets because of their atomic nature. If the
sockbuf was partially full, it might return EMSGSIZE instead of blocking.
The new solution is based on DragonFlyBSD's fix from commit
3a6117bbe0ed6a87605c1e43e12a1438d8844380 on 2008-05-27. It adds an SB_STOP
flag to sockbufs. Whenever uipc_send surpasses the socket's size limit, it
sets SB_STOP on the sending sockbuf. sbspace() will then return 0 for that
sockbuf, causing sosend_generic and friends to block. uipc_rcvd will
likewise clear SB_STOP. There are two fringe benefits: uipc_{send,rcvd} no
longer need to call chgsbsize() on every send and receive because they don't
change the sockbuf's high water mark. Also, uipc_sense no longer needs to
acquire the UIPC linkage lock, because it's simpler to compute the
st_blksizes.
There is one drawback: since sbspace() will only ever return 0 or the
maximum, sosend_generic will allow the sockbuf to exceed its nominal maximum
size by at most one packet of size less than the max. I don't think that's
a serious problem. In fact, I'm not even positive that FreeBSD guarantees a
socket will always stay within its nominal size limit.
sys/sys/sockbuf.h
Add the SB_STOP flag and adjust sbspace()
sys/sys/unpcb.h
Delete the obsolete unp_cc and unp_mbcnt fields from struct unpcb.
sys/kern/uipc_usrreq.c
Adjust uipc_rcvd, uipc_send, and uipc_sense to use the SB_STOP
backpressure mechanism. Removing obsolete unpcb fields from
db_show_unpcb.
tests/sys/kern/unix_seqpacket_test.c
Clear expected failures from ATF.
Obtained from: DragonFly BSD
PR: kern/185812
Reviewed by: silence from freebsd-net@ and rwatson@
MFC after: 3 weeks
Sponsored by: Spectra Logic Corporation
2014-03-13 18:42:12 +00:00
|
|
|
if (sbcc < so2->so_snd.sb_hiwat && mbcnt < so2->so_snd.sb_mbmax)
|
|
|
|
so2->so_snd.sb_flags &= ~SB_STOP;
|
2007-02-26 20:47:52 +00:00
|
|
|
sowwakeup_locked(so2);
|
|
|
|
UNP_PCB_UNLOCK(unp);
|
2004-01-11 19:48:19 +00:00
|
|
|
return (0);
|
1997-04-27 20:01:29 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
|
1997-04-27 20:01:29 +00:00
|
|
|
static int
|
1997-08-16 19:16:27 +00:00
|
|
|
uipc_send(struct socket *so, int flags, struct mbuf *m, struct sockaddr *nam,
|
2005-02-20 23:22:13 +00:00
|
|
|
struct mbuf *control, struct thread *td)
|
1997-04-27 20:01:29 +00:00
|
|
|
{
|
2006-07-22 18:41:42 +00:00
|
|
|
struct unpcb *unp, *unp2;
|
1997-04-27 20:01:29 +00:00
|
|
|
struct socket *so2;
|
Replace 4.4BSD Lite's unix domain socket backpressure hack with a cleaner
mechanism, based on the new SB_STOP sockbuf flag. The old hack dynamically
changed the sending sockbuf's high water mark whenever adding or removing
data from the receiving sockbuf. It worked for stream sockets, but it never
worked for SOCK_SEQPACKET sockets because of their atomic nature. If the
sockbuf was partially full, it might return EMSGSIZE instead of blocking.
The new solution is based on DragonFlyBSD's fix from commit
3a6117bbe0ed6a87605c1e43e12a1438d8844380 on 2008-05-27. It adds an SB_STOP
flag to sockbufs. Whenever uipc_send surpasses the socket's size limit, it
sets SB_STOP on the sending sockbuf. sbspace() will then return 0 for that
sockbuf, causing sosend_generic and friends to block. uipc_rcvd will
likewise clear SB_STOP. There are two fringe benefits: uipc_{send,rcvd} no
longer need to call chgsbsize() on every send and receive because they don't
change the sockbuf's high water mark. Also, uipc_sense no longer needs to
acquire the UIPC linkage lock, because it's simpler to compute the
st_blksizes.
There is one drawback: since sbspace() will only ever return 0 or the
maximum, sosend_generic will allow the sockbuf to exceed its nominal maximum
size by at most one packet of size less than the max. I don't think that's
a serious problem. In fact, I'm not even positive that FreeBSD guarantees a
socket will always stay within its nominal size limit.
sys/sys/sockbuf.h
Add the SB_STOP flag and adjust sbspace()
sys/sys/unpcb.h
Delete the obsolete unp_cc and unp_mbcnt fields from struct unpcb.
sys/kern/uipc_usrreq.c
Adjust uipc_rcvd, uipc_send, and uipc_sense to use the SB_STOP
backpressure mechanism. Removing obsolete unpcb fields from
db_show_unpcb.
tests/sys/kern/unix_seqpacket_test.c
Clear expected failures from ATF.
Obtained from: DragonFly BSD
PR: kern/185812
Reviewed by: silence from freebsd-net@ and rwatson@
MFC after: 3 weeks
Sponsored by: Spectra Logic Corporation
2014-03-13 18:42:12 +00:00
|
|
|
u_int mbcnt, sbcc;
|
2018-05-17 17:59:35 +00:00
|
|
|
int freed, error;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2004-08-16 04:41:03 +00:00
|
|
|
unp = sotounpcb(so);
|
2014-11-12 10:17:46 +00:00
|
|
|
KASSERT(unp != NULL, ("%s: unp == NULL", __func__));
|
|
|
|
KASSERT(so->so_type == SOCK_STREAM || so->so_type == SOCK_DGRAM ||
|
|
|
|
so->so_type == SOCK_SEQPACKET,
|
|
|
|
("%s: socktype %d", __func__, so->so_type));
|
2007-02-26 20:47:52 +00:00
|
|
|
|
2018-05-17 17:59:35 +00:00
|
|
|
freed = error = 0;
|
1997-04-27 20:01:29 +00:00
|
|
|
if (flags & PRUS_OOB) {
|
|
|
|
error = EOPNOTSUPP;
|
|
|
|
goto release;
|
|
|
|
}
|
2004-03-30 02:16:25 +00:00
|
|
|
if (control != NULL && (error = unp_internalize(&control, td)))
|
1997-04-27 20:01:29 +00:00
|
|
|
goto release;
|
2018-05-17 17:59:35 +00:00
|
|
|
|
|
|
|
unp2 = NULL;
|
1997-04-27 20:01:29 +00:00
|
|
|
switch (so->so_type) {
|
2004-01-11 19:48:19 +00:00
|
|
|
case SOCK_DGRAM:
|
1997-04-27 20:01:29 +00:00
|
|
|
{
|
2004-06-04 04:07:08 +00:00
|
|
|
const struct sockaddr *from;
|
1995-05-30 08:16:23 +00:00
|
|
|
|
2004-03-30 02:16:25 +00:00
|
|
|
if (nam != NULL) {
|
2020-09-15 19:23:22 +00:00
|
|
|
error = unp_connect(so, nam, td);
|
|
|
|
if (error != 0)
|
2018-05-24 18:22:13 +00:00
|
|
|
break;
|
2018-05-17 17:59:35 +00:00
|
|
|
}
|
2020-09-15 19:23:22 +00:00
|
|
|
UNP_PCB_LOCK(unp);
|
|
|
|
|
2006-07-31 23:00:05 +00:00
|
|
|
/*
|
2020-09-15 19:23:22 +00:00
|
|
|
* Because connect() and send() are non-atomic in a sendto()
|
|
|
|
* with a target address, it's possible that the socket will
|
|
|
|
* have disconnected before the send() can run. In that case
|
|
|
|
* return the slightly counter-intuitive but otherwise
|
|
|
|
* correct error that the socket is not connected.
|
2006-07-31 23:00:05 +00:00
|
|
|
*/
|
2020-09-15 19:23:22 +00:00
|
|
|
unp2 = unp_pcb_lock_peer(unp);
|
|
|
|
if (unp2 == NULL) {
|
2018-05-17 17:59:35 +00:00
|
|
|
UNP_PCB_UNLOCK(unp);
|
2006-07-31 23:00:05 +00:00
|
|
|
error = ENOTCONN;
|
|
|
|
break;
|
|
|
|
}
|
2020-09-15 19:23:22 +00:00
|
|
|
|
2020-11-03 01:17:45 +00:00
|
|
|
if (unp2->unp_flags & UNP_WANTCRED_MASK)
|
2020-11-17 20:01:21 +00:00
|
|
|
control = unp_addsockcred(td, control,
|
|
|
|
unp2->unp_flags);
|
2004-03-30 02:16:25 +00:00
|
|
|
if (unp->unp_addr != NULL)
|
1997-08-16 19:16:27 +00:00
|
|
|
from = (struct sockaddr *)unp->unp_addr;
|
1997-04-27 20:01:29 +00:00
|
|
|
else
|
|
|
|
from = &sun_noname;
|
2007-03-01 09:00:42 +00:00
|
|
|
so2 = unp2->unp_socket;
|
Merge next step in socket buffer locking:
- sowakeup() now asserts the socket buffer lock on entry. Move
the call to KNOTE higher in sowakeup() so that it is made with
the socket buffer lock held for consistency with other calls.
Release the socket buffer lock prior to calling into pgsigio(),
so_upcall(), or aio_swake(). Locking for this event management
will need revisiting in the future, but this model avoids lock
order reversals when upcalls into other subsystems result in
socket/socket buffer operations. Assert that the socket buffer
lock is not held at the end of the function.
- Wrapper macros for sowakeup(), sorwakeup() and sowwakeup(), now
have _locked versions which assert the socket buffer lock on
entry. If a wakeup is required by sb_notify(), invoke
sowakeup(); otherwise, unconditionally release the socket buffer
lock. This results in the socket buffer lock being released
whether a wakeup is required or not.
- Break out socantsendmore() into socantsendmore_locked() that
asserts the socket buffer lock. socantsendmore()
unconditionally locks the socket buffer before calling
socantsendmore_locked(). Note that both functions return with
the socket buffer unlocked as socantsendmore_locked() calls
sowwakeup_locked() which has the same properties. Assert that
the socket buffer is unlocked on return.
- Break out socantrcvmore() into socantrcvmore_locked() that
asserts the socket buffer lock. socantrcvmore() unconditionally
locks the socket buffer before calling socantrcvmore_locked().
Note that both functions return with the socket buffer unlocked
as socantrcvmore_locked() calls sorwakeup_locked() which has
similar properties. Assert that the socket buffer is unlocked
on return.
- Break out sbrelease() into a sbrelease_locked() that asserts the
socket buffer lock. sbrelease() unconditionally locks the
socket buffer before calling sbrelease_locked().
sbrelease_locked() now invokes sbflush_locked() instead of
sbflush().
- Assert the socket buffer lock in socket buffer sanity check
functions sblastrecordchk(), sblastmbufchk().
- Assert the socket buffer lock in SBLINKRECORD().
- Break out various sbappend() functions into sbappend_locked()
(and variations on that name) that assert the socket buffer
lock. The !_locked() variations unconditionally lock the socket
buffer before calling their _locked counterparts. Internally,
make sure to call _locked() support routines, etc, if already
holding the socket buffer lock.
- Break out sbinsertoob() into sbinsertoob_locked() that asserts
the socket buffer lock. sbinsertoob() unconditionally locks the
socket buffer before calling sbinsertoob_locked().
- Break out sbflush() into sbflush_locked() that asserts the
socket buffer lock. sbflush() unconditionally locks the socket
buffer before calling sbflush_locked(). Update panic strings
for new function names.
- Break out sbdrop() into sbdrop_locked() that asserts the socket
buffer lock. sbdrop() unconditionally locks the socket buffer
before calling sbdrop_locked().
- Break out sbdroprecord() into sbdroprecord_locked() that asserts
the socket buffer lock. sbdroprecord() unconditionally locks
the socket buffer before calling sbdroprecord_locked().
- sofree() now calls socantsendmore_locked() and re-acquires the
socket buffer lock on return. It also now calls
sbrelease_locked().
- sorflush() now calls socantrcvmore_locked() and re-acquires the
socket buffer lock on return. Clean up/mess up other behavior
in sorflush() relating to the temporary stack copy of the socket
buffer used with dom_dispose by more properly initializing the
temporary copy, and selectively bzeroing/copying more carefully
to prevent WITNESS from getting confused by improperly
initialized mutexes. Annotate why that's necessary, or at
least, needed.
- soisconnected() now calls sbdrop_locked() before unlocking the
socket buffer to avoid locking overhead.
Some parts of this change were:
Submitted by: sam
Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS
2004-06-21 00:20:43 +00:00
|
|
|
SOCKBUF_LOCK(&so2->so_rcv);
|
2014-08-03 22:37:21 +00:00
|
|
|
if (sbappendaddr_locked(&so2->so_rcv, from, m,
|
2014-03-06 20:24:15 +00:00
|
|
|
control)) {
|
Reduce the number of unnecessary unlock-relocks on socket buffer mutexes
associated with performing a wakeup on the socket buffer:
- When performing an sbappend*() followed by a so[rw]wakeup(), explicitly
acquire the socket buffer lock and use the _locked() variants of both
calls. Note that the _locked() sowakeup() versions unlock the mutex on
return. This is done in uipc_send(), divert_packet(), mroute
socket_send(), raw_append(), tcp_reass(), tcp_input(), and udp_append().
- When the socket buffer lock is dropped before a sowakeup(), remove the
explicit unlock and use the _locked() sowakeup() variant. This is done
in soisdisconnecting(), soisdisconnected() when setting the can't send/
receive flags and dropping data, and in uipc_rcvd() which adjusting
back-pressure on the sockets.
For UNIX domain sockets running mpsafe with a contention-intensive SMP
mysql benchmark, this results in a 1.6% query rate improvement due to
reduce mutex costs.
2004-06-26 19:10:39 +00:00
|
|
|
sorwakeup_locked(so2);
|
2004-03-30 02:16:25 +00:00
|
|
|
m = NULL;
|
|
|
|
control = NULL;
|
2004-01-11 19:48:19 +00:00
|
|
|
} else {
|
Merge next step in socket buffer locking:
- sowakeup() now asserts the socket buffer lock on entry. Move
the call to KNOTE higher in sowakeup() so that it is made with
the socket buffer lock held for consistency with other calls.
Release the socket buffer lock prior to calling into pgsigio(),
so_upcall(), or aio_swake(). Locking for this event management
will need revisiting in the future, but this model avoids lock
order reversals when upcalls into other subsystems result in
socket/socket buffer operations. Assert that the socket buffer
lock is not held at the end of the function.
- Wrapper macros for sowakeup(), sorwakeup() and sowwakeup(), now
have _locked versions which assert the socket buffer lock on
entry. If a wakeup is required by sb_notify(), invoke
sowakeup(); otherwise, unconditionally release the socket buffer
lock. This results in the socket buffer lock being released
whether a wakeup is required or not.
- Break out socantsendmore() into socantsendmore_locked() that
asserts the socket buffer lock. socantsendmore()
unconditionally locks the socket buffer before calling
socantsendmore_locked(). Note that both functions return with
the socket buffer unlocked as socantsendmore_locked() calls
sowwakeup_locked() which has the same properties. Assert that
the socket buffer is unlocked on return.
- Break out socantrcvmore() into socantrcvmore_locked() that
asserts the socket buffer lock. socantrcvmore() unconditionally
locks the socket buffer before calling socantrcvmore_locked().
Note that both functions return with the socket buffer unlocked
as socantrcvmore_locked() calls sorwakeup_locked() which has
similar properties. Assert that the socket buffer is unlocked
on return.
- Break out sbrelease() into a sbrelease_locked() that asserts the
socket buffer lock. sbrelease() unconditionally locks the
socket buffer before calling sbrelease_locked().
sbrelease_locked() now invokes sbflush_locked() instead of
sbflush().
- Assert the socket buffer lock in socket buffer sanity check
functions sblastrecordchk(), sblastmbufchk().
- Assert the socket buffer lock in SBLINKRECORD().
- Break out various sbappend() functions into sbappend_locked()
(and variations on that name) that assert the socket buffer
lock. The !_locked() variations unconditionally lock the socket
buffer before calling their _locked counterparts. Internally,
make sure to call _locked() support routines, etc, if already
holding the socket buffer lock.
- Break out sbinsertoob() into sbinsertoob_locked() that asserts
the socket buffer lock. sbinsertoob() unconditionally locks the
socket buffer before calling sbinsertoob_locked().
- Break out sbflush() into sbflush_locked() that asserts the
socket buffer lock. sbflush() unconditionally locks the socket
buffer before calling sbflush_locked(). Update panic strings
for new function names.
- Break out sbdrop() into sbdrop_locked() that asserts the socket
buffer lock. sbdrop() unconditionally locks the socket buffer
before calling sbdrop_locked().
- Break out sbdroprecord() into sbdroprecord_locked() that asserts
the socket buffer lock. sbdroprecord() unconditionally locks
the socket buffer before calling sbdroprecord_locked().
- sofree() now calls socantsendmore_locked() and re-acquires the
socket buffer lock on return. It also now calls
sbrelease_locked().
- sorflush() now calls socantrcvmore_locked() and re-acquires the
socket buffer lock on return. Clean up/mess up other behavior
in sorflush() relating to the temporary stack copy of the socket
buffer used with dom_dispose by more properly initializing the
temporary copy, and selectively bzeroing/copying more carefully
to prevent WITNESS from getting confused by improperly
initialized mutexes. Annotate why that's necessary, or at
least, needed.
- soisconnected() now calls sbdrop_locked() before unlocking the
socket buffer to avoid locking overhead.
Some parts of this change were:
Submitted by: sam
Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS
2004-06-21 00:20:43 +00:00
|
|
|
SOCKBUF_UNLOCK(&so2->so_rcv);
|
1997-04-27 20:01:29 +00:00
|
|
|
error = ENOBUFS;
|
2004-01-11 19:48:19 +00:00
|
|
|
}
|
2018-05-17 17:59:35 +00:00
|
|
|
if (nam != NULL)
|
2007-02-26 20:47:52 +00:00
|
|
|
unp_disconnect(unp, unp2);
|
2020-09-15 19:22:37 +00:00
|
|
|
else
|
|
|
|
unp_pcb_unlock_pair(unp, unp2);
|
1997-04-27 20:01:29 +00:00
|
|
|
break;
|
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2009-10-05 14:49:16 +00:00
|
|
|
case SOCK_SEQPACKET:
|
1997-04-27 20:01:29 +00:00
|
|
|
case SOCK_STREAM:
|
|
|
|
if ((so->so_state & SS_ISCONNECTED) == 0) {
|
2004-03-30 02:16:25 +00:00
|
|
|
if (nam != NULL) {
|
2020-09-15 19:23:22 +00:00
|
|
|
error = unp_connect(so, nam, td);
|
2020-04-13 19:22:05 +00:00
|
|
|
if (error != 0)
|
2018-05-17 17:59:35 +00:00
|
|
|
break;
|
2020-04-13 19:22:05 +00:00
|
|
|
} else {
|
1997-04-27 20:01:29 +00:00
|
|
|
error = ENOTCONN;
|
|
|
|
break;
|
|
|
|
}
|
2020-04-13 19:22:05 +00:00
|
|
|
}
|
|
|
|
|
2020-09-15 19:23:22 +00:00
|
|
|
UNP_PCB_LOCK(unp);
|
|
|
|
if ((unp2 = unp_pcb_lock_peer(unp)) == NULL) {
|
2020-04-13 19:22:05 +00:00
|
|
|
UNP_PCB_UNLOCK(unp);
|
2018-05-17 17:59:35 +00:00
|
|
|
error = ENOTCONN;
|
|
|
|
break;
|
|
|
|
} else if (so->so_snd.sb_state & SBS_CANTSENDMORE) {
|
2020-09-15 19:23:22 +00:00
|
|
|
unp_pcb_unlock_pair(unp, unp2);
|
1997-04-27 20:01:29 +00:00
|
|
|
error = EPIPE;
|
|
|
|
break;
|
2018-05-17 17:59:35 +00:00
|
|
|
}
|
|
|
|
UNP_PCB_UNLOCK(unp);
|
|
|
|
if ((so2 = unp2->unp_socket) == NULL) {
|
|
|
|
UNP_PCB_UNLOCK(unp2);
|
2006-07-31 23:00:05 +00:00
|
|
|
error = ENOTCONN;
|
|
|
|
break;
|
|
|
|
}
|
Merge next step in socket buffer locking:
- sowakeup() now asserts the socket buffer lock on entry. Move
the call to KNOTE higher in sowakeup() so that it is made with
the socket buffer lock held for consistency with other calls.
Release the socket buffer lock prior to calling into pgsigio(),
so_upcall(), or aio_swake(). Locking for this event management
will need revisiting in the future, but this model avoids lock
order reversals when upcalls into other subsystems result in
socket/socket buffer operations. Assert that the socket buffer
lock is not held at the end of the function.
- Wrapper macros for sowakeup(), sorwakeup() and sowwakeup(), now
have _locked versions which assert the socket buffer lock on
entry. If a wakeup is required by sb_notify(), invoke
sowakeup(); otherwise, unconditionally release the socket buffer
lock. This results in the socket buffer lock being released
whether a wakeup is required or not.
- Break out socantsendmore() into socantsendmore_locked() that
asserts the socket buffer lock. socantsendmore()
unconditionally locks the socket buffer before calling
socantsendmore_locked(). Note that both functions return with
the socket buffer unlocked as socantsendmore_locked() calls
sowwakeup_locked() which has the same properties. Assert that
the socket buffer is unlocked on return.
- Break out socantrcvmore() into socantrcvmore_locked() that
asserts the socket buffer lock. socantrcvmore() unconditionally
locks the socket buffer before calling socantrcvmore_locked().
Note that both functions return with the socket buffer unlocked
as socantrcvmore_locked() calls sorwakeup_locked() which has
similar properties. Assert that the socket buffer is unlocked
on return.
- Break out sbrelease() into a sbrelease_locked() that asserts the
socket buffer lock. sbrelease() unconditionally locks the
socket buffer before calling sbrelease_locked().
sbrelease_locked() now invokes sbflush_locked() instead of
sbflush().
- Assert the socket buffer lock in socket buffer sanity check
functions sblastrecordchk(), sblastmbufchk().
- Assert the socket buffer lock in SBLINKRECORD().
- Break out various sbappend() functions into sbappend_locked()
(and variations on that name) that assert the socket buffer
lock. The !_locked() variations unconditionally lock the socket
buffer before calling their _locked counterparts. Internally,
make sure to call _locked() support routines, etc, if already
holding the socket buffer lock.
- Break out sbinsertoob() into sbinsertoob_locked() that asserts
the socket buffer lock. sbinsertoob() unconditionally locks the
socket buffer before calling sbinsertoob_locked().
- Break out sbflush() into sbflush_locked() that asserts the
socket buffer lock. sbflush() unconditionally locks the socket
buffer before calling sbflush_locked(). Update panic strings
for new function names.
- Break out sbdrop() into sbdrop_locked() that asserts the socket
buffer lock. sbdrop() unconditionally locks the socket buffer
before calling sbdrop_locked().
- Break out sbdroprecord() into sbdroprecord_locked() that asserts
the socket buffer lock. sbdroprecord() unconditionally locks
the socket buffer before calling sbdroprecord_locked().
- sofree() now calls socantsendmore_locked() and re-acquires the
socket buffer lock on return. It also now calls
sbrelease_locked().
- sorflush() now calls socantrcvmore_locked() and re-acquires the
socket buffer lock on return. Clean up/mess up other behavior
in sorflush() relating to the temporary stack copy of the socket
buffer used with dom_dispose by more properly initializing the
temporary copy, and selectively bzeroing/copying more carefully
to prevent WITNESS from getting confused by improperly
initialized mutexes. Annotate why that's necessary, or at
least, needed.
- soisconnected() now calls sbdrop_locked() before unlocking the
socket buffer to avoid locking overhead.
Some parts of this change were:
Submitted by: sam
Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS
2004-06-21 00:20:43 +00:00
|
|
|
SOCKBUF_LOCK(&so2->so_rcv);
|
2020-11-03 01:17:45 +00:00
|
|
|
if (unp2->unp_flags & UNP_WANTCRED_MASK) {
|
2005-04-13 00:01:46 +00:00
|
|
|
/*
|
2020-11-03 01:17:45 +00:00
|
|
|
* Credentials are passed only once on SOCK_STREAM and
|
|
|
|
* SOCK_SEQPACKET (LOCAL_CREDS => WANTCRED_ONESHOT), or
|
|
|
|
* forever (LOCAL_CREDS_PERSISTENT => WANTCRED_ALWAYS).
|
2005-04-13 00:01:46 +00:00
|
|
|
*/
|
2020-11-17 20:01:21 +00:00
|
|
|
control = unp_addsockcred(td, control, unp2->unp_flags);
|
2020-11-03 01:17:45 +00:00
|
|
|
unp2->unp_flags &= ~UNP_WANTCRED_ONESHOT;
|
2005-04-13 00:01:46 +00:00
|
|
|
}
|
2018-08-04 20:26:54 +00:00
|
|
|
|
1997-04-27 20:01:29 +00:00
|
|
|
/*
|
2018-08-04 20:26:54 +00:00
|
|
|
* Send to paired receive port and wake up readers. Don't
|
|
|
|
* check for space available in the receive buffer if we're
|
|
|
|
* attaching ancillary data; Unix domain sockets only check
|
|
|
|
* for space in the sending sockbuf, and that check is
|
|
|
|
* performed one level up the stack. At that level we cannot
|
|
|
|
* precisely account for the amount of buffer space used
|
|
|
|
* (e.g., because control messages are not yet internalized).
|
1997-04-27 20:01:29 +00:00
|
|
|
*/
|
2009-10-05 14:49:16 +00:00
|
|
|
switch (so->so_type) {
|
|
|
|
case SOCK_STREAM:
|
|
|
|
if (control != NULL) {
|
2018-08-04 20:26:54 +00:00
|
|
|
sbappendcontrol_locked(&so2->so_rcv, m,
|
2020-04-10 20:42:11 +00:00
|
|
|
control, flags);
|
2018-08-04 20:26:54 +00:00
|
|
|
control = NULL;
|
2009-10-05 14:49:16 +00:00
|
|
|
} else
|
2016-01-08 19:03:20 +00:00
|
|
|
sbappend_locked(&so2->so_rcv, m, flags);
|
2009-10-05 14:49:16 +00:00
|
|
|
break;
|
|
|
|
|
2020-04-13 19:22:05 +00:00
|
|
|
case SOCK_SEQPACKET:
|
2014-03-06 20:24:15 +00:00
|
|
|
if (sbappendaddr_nospacecheck_locked(&so2->so_rcv,
|
2020-04-13 19:22:05 +00:00
|
|
|
&sun_noname, m, control))
|
2004-03-30 02:16:25 +00:00
|
|
|
control = NULL;
|
2009-10-05 14:49:16 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
Replace 4.4BSD Lite's unix domain socket backpressure hack with a cleaner
mechanism, based on the new SB_STOP sockbuf flag. The old hack dynamically
changed the sending sockbuf's high water mark whenever adding or removing
data from the receiving sockbuf. It worked for stream sockets, but it never
worked for SOCK_SEQPACKET sockets because of their atomic nature. If the
sockbuf was partially full, it might return EMSGSIZE instead of blocking.
The new solution is based on DragonFlyBSD's fix from commit
3a6117bbe0ed6a87605c1e43e12a1438d8844380 on 2008-05-27. It adds an SB_STOP
flag to sockbufs. Whenever uipc_send surpasses the socket's size limit, it
sets SB_STOP on the sending sockbuf. sbspace() will then return 0 for that
sockbuf, causing sosend_generic and friends to block. uipc_rcvd will
likewise clear SB_STOP. There are two fringe benefits: uipc_{send,rcvd} no
longer need to call chgsbsize() on every send and receive because they don't
change the sockbuf's high water mark. Also, uipc_sense no longer needs to
acquire the UIPC linkage lock, because it's simpler to compute the
st_blksizes.
There is one drawback: since sbspace() will only ever return 0 or the
maximum, sosend_generic will allow the sockbuf to exceed its nominal maximum
size by at most one packet of size less than the max. I don't think that's
a serious problem. In fact, I'm not even positive that FreeBSD guarantees a
socket will always stay within its nominal size limit.
sys/sys/sockbuf.h
Add the SB_STOP flag and adjust sbspace()
sys/sys/unpcb.h
Delete the obsolete unp_cc and unp_mbcnt fields from struct unpcb.
sys/kern/uipc_usrreq.c
Adjust uipc_rcvd, uipc_send, and uipc_sense to use the SB_STOP
backpressure mechanism. Removing obsolete unpcb fields from
db_show_unpcb.
tests/sys/kern/unix_seqpacket_test.c
Clear expected failures from ATF.
Obtained from: DragonFly BSD
PR: kern/185812
Reviewed by: silence from freebsd-net@ and rwatson@
MFC after: 3 weeks
Sponsored by: Spectra Logic Corporation
2014-03-13 18:42:12 +00:00
|
|
|
mbcnt = so2->so_rcv.sb_mbcnt;
|
2014-11-12 10:17:46 +00:00
|
|
|
sbcc = sbavail(&so2->so_rcv);
|
|
|
|
if (sbcc)
|
|
|
|
sorwakeup_locked(so2);
|
|
|
|
else
|
|
|
|
SOCKBUF_UNLOCK(&so2->so_rcv);
|
2006-07-11 21:49:54 +00:00
|
|
|
|
Replace 4.4BSD Lite's unix domain socket backpressure hack with a cleaner
mechanism, based on the new SB_STOP sockbuf flag. The old hack dynamically
changed the sending sockbuf's high water mark whenever adding or removing
data from the receiving sockbuf. It worked for stream sockets, but it never
worked for SOCK_SEQPACKET sockets because of their atomic nature. If the
sockbuf was partially full, it might return EMSGSIZE instead of blocking.
The new solution is based on DragonFlyBSD's fix from commit
3a6117bbe0ed6a87605c1e43e12a1438d8844380 on 2008-05-27. It adds an SB_STOP
flag to sockbufs. Whenever uipc_send surpasses the socket's size limit, it
sets SB_STOP on the sending sockbuf. sbspace() will then return 0 for that
sockbuf, causing sosend_generic and friends to block. uipc_rcvd will
likewise clear SB_STOP. There are two fringe benefits: uipc_{send,rcvd} no
longer need to call chgsbsize() on every send and receive because they don't
change the sockbuf's high water mark. Also, uipc_sense no longer needs to
acquire the UIPC linkage lock, because it's simpler to compute the
st_blksizes.
There is one drawback: since sbspace() will only ever return 0 or the
maximum, sosend_generic will allow the sockbuf to exceed its nominal maximum
size by at most one packet of size less than the max. I don't think that's
a serious problem. In fact, I'm not even positive that FreeBSD guarantees a
socket will always stay within its nominal size limit.
sys/sys/sockbuf.h
Add the SB_STOP flag and adjust sbspace()
sys/sys/unpcb.h
Delete the obsolete unp_cc and unp_mbcnt fields from struct unpcb.
sys/kern/uipc_usrreq.c
Adjust uipc_rcvd, uipc_send, and uipc_sense to use the SB_STOP
backpressure mechanism. Removing obsolete unpcb fields from
db_show_unpcb.
tests/sys/kern/unix_seqpacket_test.c
Clear expected failures from ATF.
Obtained from: DragonFly BSD
PR: kern/185812
Reviewed by: silence from freebsd-net@ and rwatson@
MFC after: 3 weeks
Sponsored by: Spectra Logic Corporation
2014-03-13 18:42:12 +00:00
|
|
|
/*
|
|
|
|
* The PCB lock on unp2 protects the SB_STOP flag. Without it,
|
|
|
|
* it would be possible for uipc_rcvd to be called at this
|
|
|
|
* point, drain the receiving sockbuf, clear SB_STOP, and then
|
|
|
|
* we would set SB_STOP below. That could lead to an empty
|
|
|
|
* sockbuf having SB_STOP set
|
|
|
|
*/
|
2006-07-11 21:49:54 +00:00
|
|
|
SOCKBUF_LOCK(&so->so_snd);
|
Replace 4.4BSD Lite's unix domain socket backpressure hack with a cleaner
mechanism, based on the new SB_STOP sockbuf flag. The old hack dynamically
changed the sending sockbuf's high water mark whenever adding or removing
data from the receiving sockbuf. It worked for stream sockets, but it never
worked for SOCK_SEQPACKET sockets because of their atomic nature. If the
sockbuf was partially full, it might return EMSGSIZE instead of blocking.
The new solution is based on DragonFlyBSD's fix from commit
3a6117bbe0ed6a87605c1e43e12a1438d8844380 on 2008-05-27. It adds an SB_STOP
flag to sockbufs. Whenever uipc_send surpasses the socket's size limit, it
sets SB_STOP on the sending sockbuf. sbspace() will then return 0 for that
sockbuf, causing sosend_generic and friends to block. uipc_rcvd will
likewise clear SB_STOP. There are two fringe benefits: uipc_{send,rcvd} no
longer need to call chgsbsize() on every send and receive because they don't
change the sockbuf's high water mark. Also, uipc_sense no longer needs to
acquire the UIPC linkage lock, because it's simpler to compute the
st_blksizes.
There is one drawback: since sbspace() will only ever return 0 or the
maximum, sosend_generic will allow the sockbuf to exceed its nominal maximum
size by at most one packet of size less than the max. I don't think that's
a serious problem. In fact, I'm not even positive that FreeBSD guarantees a
socket will always stay within its nominal size limit.
sys/sys/sockbuf.h
Add the SB_STOP flag and adjust sbspace()
sys/sys/unpcb.h
Delete the obsolete unp_cc and unp_mbcnt fields from struct unpcb.
sys/kern/uipc_usrreq.c
Adjust uipc_rcvd, uipc_send, and uipc_sense to use the SB_STOP
backpressure mechanism. Removing obsolete unpcb fields from
db_show_unpcb.
tests/sys/kern/unix_seqpacket_test.c
Clear expected failures from ATF.
Obtained from: DragonFly BSD
PR: kern/185812
Reviewed by: silence from freebsd-net@ and rwatson@
MFC after: 3 weeks
Sponsored by: Spectra Logic Corporation
2014-03-13 18:42:12 +00:00
|
|
|
if (sbcc >= so->so_snd.sb_hiwat || mbcnt >= so->so_snd.sb_mbmax)
|
|
|
|
so->so_snd.sb_flags |= SB_STOP;
|
2004-12-22 20:28:46 +00:00
|
|
|
SOCKBUF_UNLOCK(&so->so_snd);
|
2007-02-26 20:47:52 +00:00
|
|
|
UNP_PCB_UNLOCK(unp2);
|
2004-03-30 02:16:25 +00:00
|
|
|
m = NULL;
|
1994-05-24 10:09:53 +00:00
|
|
|
break;
|
|
|
|
}
|
1997-04-27 20:01:29 +00:00
|
|
|
|
|
|
|
/*
|
2008-10-03 09:01:55 +00:00
|
|
|
* PRUS_EOF is equivalent to pru_send followed by pru_shutdown.
|
1997-04-27 20:01:29 +00:00
|
|
|
*/
|
|
|
|
if (flags & PRUS_EOF) {
|
2007-03-01 09:00:42 +00:00
|
|
|
UNP_PCB_LOCK(unp);
|
1997-04-27 20:01:29 +00:00
|
|
|
socantsendmore(so);
|
|
|
|
unp_shutdown(unp);
|
2007-03-01 09:00:42 +00:00
|
|
|
UNP_PCB_UNLOCK(unp);
|
1997-04-27 20:01:29 +00:00
|
|
|
}
|
2004-03-30 02:16:25 +00:00
|
|
|
if (control != NULL && error != 0)
|
2016-08-31 21:48:22 +00:00
|
|
|
unp_dispose_mbuf(control);
|
1999-05-10 18:09:39 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
release:
|
2004-03-30 02:16:25 +00:00
|
|
|
if (control != NULL)
|
1994-05-24 10:09:53 +00:00
|
|
|
m_freem(control);
|
2017-09-13 16:47:23 +00:00
|
|
|
/*
|
|
|
|
* In case of PRUS_NOTREADY, uipc_ready() is responsible
|
|
|
|
* for freeing memory.
|
|
|
|
*/
|
|
|
|
if (m != NULL && (flags & PRUS_NOTREADY) == 0)
|
1994-05-24 10:09:53 +00:00
|
|
|
m_freem(m);
|
2004-01-11 19:48:19 +00:00
|
|
|
return (error);
|
1997-04-27 20:01:29 +00:00
|
|
|
}
|
|
|
|
|
2020-04-10 20:41:59 +00:00
|
|
|
static bool
|
|
|
|
uipc_ready_scan(struct socket *so, struct mbuf *m, int count, int *errorp)
|
|
|
|
{
|
|
|
|
struct mbuf *mb, *n;
|
|
|
|
struct sockbuf *sb;
|
|
|
|
|
|
|
|
SOCK_LOCK(so);
|
|
|
|
if (SOLISTENING(so)) {
|
|
|
|
SOCK_UNLOCK(so);
|
|
|
|
return (false);
|
|
|
|
}
|
|
|
|
mb = NULL;
|
|
|
|
sb = &so->so_rcv;
|
|
|
|
SOCKBUF_LOCK(sb);
|
|
|
|
if (sb->sb_fnrdy != NULL) {
|
|
|
|
for (mb = sb->sb_mb, n = mb->m_nextpkt; mb != NULL;) {
|
|
|
|
if (mb == m) {
|
|
|
|
*errorp = sbready(sb, m, count);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
mb = mb->m_next;
|
|
|
|
if (mb == NULL) {
|
|
|
|
mb = n;
|
2020-07-30 00:52:37 +00:00
|
|
|
if (mb != NULL)
|
|
|
|
n = mb->m_nextpkt;
|
2020-04-10 20:41:59 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
SOCKBUF_UNLOCK(sb);
|
|
|
|
SOCK_UNLOCK(so);
|
|
|
|
return (mb != NULL);
|
|
|
|
}
|
|
|
|
|
2014-11-30 13:40:58 +00:00
|
|
|
static int
|
|
|
|
uipc_ready(struct socket *so, struct mbuf *m, int count)
|
|
|
|
{
|
|
|
|
struct unpcb *unp, *unp2;
|
|
|
|
struct socket *so2;
|
2020-04-10 20:41:59 +00:00
|
|
|
int error, i;
|
2014-11-30 13:40:58 +00:00
|
|
|
|
|
|
|
unp = sotounpcb(so);
|
|
|
|
|
2020-04-10 20:41:59 +00:00
|
|
|
KASSERT(so->so_type == SOCK_STREAM,
|
|
|
|
("%s: unexpected socket type for %p", __func__, so));
|
|
|
|
|
2018-06-08 20:31:59 +00:00
|
|
|
UNP_PCB_LOCK(unp);
|
2020-09-15 19:23:22 +00:00
|
|
|
if ((unp2 = unp_pcb_lock_peer(unp)) != NULL) {
|
2018-06-08 20:31:59 +00:00
|
|
|
UNP_PCB_UNLOCK(unp);
|
2020-09-15 19:23:22 +00:00
|
|
|
so2 = unp2->unp_socket;
|
|
|
|
SOCKBUF_LOCK(&so2->so_rcv);
|
|
|
|
if ((error = sbready(&so2->so_rcv, m, count)) == 0)
|
|
|
|
sorwakeup_locked(so2);
|
|
|
|
else
|
|
|
|
SOCKBUF_UNLOCK(&so2->so_rcv);
|
|
|
|
UNP_PCB_UNLOCK(unp2);
|
|
|
|
return (error);
|
2017-09-13 16:47:23 +00:00
|
|
|
}
|
2020-09-15 19:23:22 +00:00
|
|
|
UNP_PCB_UNLOCK(unp);
|
2014-11-30 13:40:58 +00:00
|
|
|
|
2020-04-10 20:41:59 +00:00
|
|
|
/*
|
|
|
|
* The receiving socket has been disconnected, but may still be valid.
|
|
|
|
* In this case, the now-ready mbufs are still present in its socket
|
|
|
|
* buffer, so perform an exhaustive search before giving up and freeing
|
|
|
|
* the mbufs.
|
|
|
|
*/
|
|
|
|
UNP_LINK_RLOCK();
|
|
|
|
LIST_FOREACH(unp, &unp_shead, unp_link) {
|
|
|
|
if (uipc_ready_scan(unp->unp_socket, m, count, &error))
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
UNP_LINK_RUNLOCK();
|
|
|
|
|
|
|
|
if (unp == NULL) {
|
|
|
|
for (i = 0; i < count; i++)
|
|
|
|
m = m_free(m);
|
|
|
|
error = ECONNRESET;
|
|
|
|
}
|
2014-11-30 13:40:58 +00:00
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
1997-04-27 20:01:29 +00:00
|
|
|
static int
|
|
|
|
uipc_sense(struct socket *so, struct stat *sb)
|
|
|
|
{
|
Replace 4.4BSD Lite's unix domain socket backpressure hack with a cleaner
mechanism, based on the new SB_STOP sockbuf flag. The old hack dynamically
changed the sending sockbuf's high water mark whenever adding or removing
data from the receiving sockbuf. It worked for stream sockets, but it never
worked for SOCK_SEQPACKET sockets because of their atomic nature. If the
sockbuf was partially full, it might return EMSGSIZE instead of blocking.
The new solution is based on DragonFlyBSD's fix from commit
3a6117bbe0ed6a87605c1e43e12a1438d8844380 on 2008-05-27. It adds an SB_STOP
flag to sockbufs. Whenever uipc_send surpasses the socket's size limit, it
sets SB_STOP on the sending sockbuf. sbspace() will then return 0 for that
sockbuf, causing sosend_generic and friends to block. uipc_rcvd will
likewise clear SB_STOP. There are two fringe benefits: uipc_{send,rcvd} no
longer need to call chgsbsize() on every send and receive because they don't
change the sockbuf's high water mark. Also, uipc_sense no longer needs to
acquire the UIPC linkage lock, because it's simpler to compute the
st_blksizes.
There is one drawback: since sbspace() will only ever return 0 or the
maximum, sosend_generic will allow the sockbuf to exceed its nominal maximum
size by at most one packet of size less than the max. I don't think that's
a serious problem. In fact, I'm not even positive that FreeBSD guarantees a
socket will always stay within its nominal size limit.
sys/sys/sockbuf.h
Add the SB_STOP flag and adjust sbspace()
sys/sys/unpcb.h
Delete the obsolete unp_cc and unp_mbcnt fields from struct unpcb.
sys/kern/uipc_usrreq.c
Adjust uipc_rcvd, uipc_send, and uipc_sense to use the SB_STOP
backpressure mechanism. Removing obsolete unpcb fields from
db_show_unpcb.
tests/sys/kern/unix_seqpacket_test.c
Clear expected failures from ATF.
Obtained from: DragonFly BSD
PR: kern/185812
Reviewed by: silence from freebsd-net@ and rwatson@
MFC after: 3 weeks
Sponsored by: Spectra Logic Corporation
2014-03-13 18:42:12 +00:00
|
|
|
struct unpcb *unp;
|
1997-04-27 20:01:29 +00:00
|
|
|
|
2004-08-16 04:41:03 +00:00
|
|
|
unp = sotounpcb(so);
|
2006-03-17 13:52:57 +00:00
|
|
|
KASSERT(unp != NULL, ("uipc_sense: unp == NULL"));
|
2007-02-26 20:47:52 +00:00
|
|
|
|
1997-04-27 20:01:29 +00:00
|
|
|
sb->st_blksize = so->so_snd.sb_hiwat;
|
2004-06-17 17:16:53 +00:00
|
|
|
sb->st_dev = NODEV;
|
1997-04-27 20:01:29 +00:00
|
|
|
sb->st_ino = unp->unp_ino;
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
uipc_shutdown(struct socket *so)
|
|
|
|
{
|
2004-08-16 04:41:03 +00:00
|
|
|
struct unpcb *unp;
|
1997-04-27 20:01:29 +00:00
|
|
|
|
2004-08-16 04:41:03 +00:00
|
|
|
unp = sotounpcb(so);
|
2006-03-17 13:52:57 +00:00
|
|
|
KASSERT(unp != NULL, ("uipc_shutdown: unp == NULL"));
|
2007-02-26 20:47:52 +00:00
|
|
|
|
|
|
|
UNP_PCB_LOCK(unp);
|
1997-04-27 20:01:29 +00:00
|
|
|
socantsendmore(so);
|
|
|
|
unp_shutdown(unp);
|
2007-02-26 20:47:52 +00:00
|
|
|
UNP_PCB_UNLOCK(unp);
|
2004-01-11 19:48:19 +00:00
|
|
|
return (0);
|
1997-04-27 20:01:29 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
1997-08-16 19:16:27 +00:00
|
|
|
uipc_sockaddr(struct socket *so, struct sockaddr **nam)
|
1997-04-27 20:01:29 +00:00
|
|
|
{
|
2004-08-16 04:41:03 +00:00
|
|
|
struct unpcb *unp;
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
const struct sockaddr *sa;
|
1997-04-27 20:01:29 +00:00
|
|
|
|
2006-03-17 13:52:57 +00:00
|
|
|
unp = sotounpcb(so);
|
|
|
|
KASSERT(unp != NULL, ("uipc_sockaddr: unp == NULL"));
|
2007-02-26 20:47:52 +00:00
|
|
|
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
*nam = malloc(sizeof(struct sockaddr_un), M_SONAME, M_WAITOK);
|
2007-02-26 20:47:52 +00:00
|
|
|
UNP_PCB_LOCK(unp);
|
2004-03-30 02:16:25 +00:00
|
|
|
if (unp->unp_addr != NULL)
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
sa = (struct sockaddr *) unp->unp_addr;
|
2001-04-24 19:09:23 +00:00
|
|
|
else
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
sa = &sun_noname;
|
|
|
|
bcopy(sa, *nam, sa->sa_len);
|
2007-02-26 20:47:52 +00:00
|
|
|
UNP_PCB_UNLOCK(unp);
|
2004-01-11 19:48:19 +00:00
|
|
|
return (0);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
2008-10-08 06:19:49 +00:00
|
|
|
static struct pr_usrreqs uipc_usrreqs_dgram = {
|
2004-11-08 14:44:54 +00:00
|
|
|
.pru_abort = uipc_abort,
|
|
|
|
.pru_accept = uipc_accept,
|
|
|
|
.pru_attach = uipc_attach,
|
|
|
|
.pru_bind = uipc_bind,
|
2013-03-02 21:11:30 +00:00
|
|
|
.pru_bindat = uipc_bindat,
|
2004-11-08 14:44:54 +00:00
|
|
|
.pru_connect = uipc_connect,
|
2013-03-02 21:11:30 +00:00
|
|
|
.pru_connectat = uipc_connectat,
|
2004-11-08 14:44:54 +00:00
|
|
|
.pru_connect2 = uipc_connect2,
|
|
|
|
.pru_detach = uipc_detach,
|
|
|
|
.pru_disconnect = uipc_disconnect,
|
|
|
|
.pru_listen = uipc_listen,
|
|
|
|
.pru_peeraddr = uipc_peeraddr,
|
|
|
|
.pru_rcvd = uipc_rcvd,
|
|
|
|
.pru_send = uipc_send,
|
|
|
|
.pru_sense = uipc_sense,
|
|
|
|
.pru_shutdown = uipc_shutdown,
|
|
|
|
.pru_sockaddr = uipc_sockaddr,
|
2008-10-08 06:19:49 +00:00
|
|
|
.pru_soreceive = soreceive_dgram,
|
|
|
|
.pru_close = uipc_close,
|
|
|
|
};
|
|
|
|
|
2009-10-05 14:49:16 +00:00
|
|
|
static struct pr_usrreqs uipc_usrreqs_seqpacket = {
|
|
|
|
.pru_abort = uipc_abort,
|
|
|
|
.pru_accept = uipc_accept,
|
|
|
|
.pru_attach = uipc_attach,
|
|
|
|
.pru_bind = uipc_bind,
|
2013-03-02 21:11:30 +00:00
|
|
|
.pru_bindat = uipc_bindat,
|
2009-10-05 14:49:16 +00:00
|
|
|
.pru_connect = uipc_connect,
|
2013-03-02 21:11:30 +00:00
|
|
|
.pru_connectat = uipc_connectat,
|
2009-10-05 14:49:16 +00:00
|
|
|
.pru_connect2 = uipc_connect2,
|
|
|
|
.pru_detach = uipc_detach,
|
|
|
|
.pru_disconnect = uipc_disconnect,
|
|
|
|
.pru_listen = uipc_listen,
|
|
|
|
.pru_peeraddr = uipc_peeraddr,
|
|
|
|
.pru_rcvd = uipc_rcvd,
|
|
|
|
.pru_send = uipc_send,
|
|
|
|
.pru_sense = uipc_sense,
|
|
|
|
.pru_shutdown = uipc_shutdown,
|
|
|
|
.pru_sockaddr = uipc_sockaddr,
|
|
|
|
.pru_soreceive = soreceive_generic, /* XXX: or...? */
|
|
|
|
.pru_close = uipc_close,
|
|
|
|
};
|
|
|
|
|
2008-10-08 06:19:49 +00:00
|
|
|
static struct pr_usrreqs uipc_usrreqs_stream = {
|
|
|
|
.pru_abort = uipc_abort,
|
|
|
|
.pru_accept = uipc_accept,
|
|
|
|
.pru_attach = uipc_attach,
|
|
|
|
.pru_bind = uipc_bind,
|
2013-03-02 21:11:30 +00:00
|
|
|
.pru_bindat = uipc_bindat,
|
2008-10-08 06:19:49 +00:00
|
|
|
.pru_connect = uipc_connect,
|
2013-03-02 21:11:30 +00:00
|
|
|
.pru_connectat = uipc_connectat,
|
2008-10-08 06:19:49 +00:00
|
|
|
.pru_connect2 = uipc_connect2,
|
|
|
|
.pru_detach = uipc_detach,
|
|
|
|
.pru_disconnect = uipc_disconnect,
|
|
|
|
.pru_listen = uipc_listen,
|
|
|
|
.pru_peeraddr = uipc_peeraddr,
|
|
|
|
.pru_rcvd = uipc_rcvd,
|
|
|
|
.pru_send = uipc_send,
|
2014-11-30 13:40:58 +00:00
|
|
|
.pru_ready = uipc_ready,
|
2008-10-08 06:19:49 +00:00
|
|
|
.pru_sense = uipc_sense,
|
|
|
|
.pru_shutdown = uipc_shutdown,
|
|
|
|
.pru_sockaddr = uipc_sockaddr,
|
|
|
|
.pru_soreceive = soreceive_generic,
|
2006-07-21 17:11:15 +00:00
|
|
|
.pru_close = uipc_close,
|
1997-04-27 20:01:29 +00:00
|
|
|
};
|
2001-08-17 22:01:18 +00:00
|
|
|
|
2008-10-03 13:01:56 +00:00
|
|
|
static int
|
2005-02-20 23:22:13 +00:00
|
|
|
uipc_ctloutput(struct socket *so, struct sockopt *sopt)
|
2001-08-17 22:01:18 +00:00
|
|
|
{
|
2004-08-16 04:41:03 +00:00
|
|
|
struct unpcb *unp;
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
struct xucred xu;
|
2005-04-13 00:01:46 +00:00
|
|
|
int error, optval;
|
|
|
|
|
2020-08-03 22:13:02 +00:00
|
|
|
if (sopt->sopt_level != SOL_LOCAL)
|
2005-04-20 02:57:56 +00:00
|
|
|
return (EINVAL);
|
|
|
|
|
2005-04-13 00:01:46 +00:00
|
|
|
unp = sotounpcb(so);
|
2006-03-17 13:52:57 +00:00
|
|
|
KASSERT(unp != NULL, ("uipc_ctloutput: unp == NULL"));
|
2005-04-13 00:01:46 +00:00
|
|
|
error = 0;
|
2001-08-17 22:01:18 +00:00
|
|
|
switch (sopt->sopt_dir) {
|
|
|
|
case SOPT_GET:
|
|
|
|
switch (sopt->sopt_name) {
|
|
|
|
case LOCAL_PEERCRED:
|
2007-02-26 20:47:52 +00:00
|
|
|
UNP_PCB_LOCK(unp);
|
2001-08-17 22:01:18 +00:00
|
|
|
if (unp->unp_flags & UNP_HAVEPC)
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
xu = unp->unp_peercred;
|
2001-08-17 22:01:18 +00:00
|
|
|
else {
|
|
|
|
if (so->so_type == SOCK_STREAM)
|
|
|
|
error = ENOTCONN;
|
|
|
|
else
|
|
|
|
error = EINVAL;
|
|
|
|
}
|
2007-02-26 20:47:52 +00:00
|
|
|
UNP_PCB_UNLOCK(unp);
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
if (error == 0)
|
|
|
|
error = sooptcopyout(sopt, &xu, sizeof(xu));
|
2001-08-17 22:01:18 +00:00
|
|
|
break;
|
2007-02-26 20:47:52 +00:00
|
|
|
|
2005-04-13 00:01:46 +00:00
|
|
|
case LOCAL_CREDS:
|
2008-01-10 12:29:12 +00:00
|
|
|
/* Unlocked read. */
|
2020-11-03 01:17:45 +00:00
|
|
|
optval = unp->unp_flags & UNP_WANTCRED_ONESHOT ? 1 : 0;
|
|
|
|
error = sooptcopyout(sopt, &optval, sizeof(optval));
|
|
|
|
break;
|
|
|
|
|
|
|
|
case LOCAL_CREDS_PERSISTENT:
|
|
|
|
/* Unlocked read. */
|
|
|
|
optval = unp->unp_flags & UNP_WANTCRED_ALWAYS ? 1 : 0;
|
2005-04-13 00:01:46 +00:00
|
|
|
error = sooptcopyout(sopt, &optval, sizeof(optval));
|
|
|
|
break;
|
2007-02-26 20:47:52 +00:00
|
|
|
|
2005-04-13 00:01:46 +00:00
|
|
|
case LOCAL_CONNWAIT:
|
2008-01-10 12:29:12 +00:00
|
|
|
/* Unlocked read. */
|
2005-04-13 00:01:46 +00:00
|
|
|
optval = unp->unp_flags & UNP_CONNWAIT ? 1 : 0;
|
|
|
|
error = sooptcopyout(sopt, &optval, sizeof(optval));
|
|
|
|
break;
|
2007-02-26 20:47:52 +00:00
|
|
|
|
2001-08-17 22:01:18 +00:00
|
|
|
default:
|
|
|
|
error = EOPNOTSUPP;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
break;
|
2007-02-26 20:47:52 +00:00
|
|
|
|
2001-08-17 22:01:18 +00:00
|
|
|
case SOPT_SET:
|
2005-04-13 00:01:46 +00:00
|
|
|
switch (sopt->sopt_name) {
|
|
|
|
case LOCAL_CREDS:
|
2020-11-03 01:17:45 +00:00
|
|
|
case LOCAL_CREDS_PERSISTENT:
|
2005-04-13 00:01:46 +00:00
|
|
|
case LOCAL_CONNWAIT:
|
|
|
|
error = sooptcopyin(sopt, &optval, sizeof(optval),
|
|
|
|
sizeof(optval));
|
|
|
|
if (error)
|
|
|
|
break;
|
|
|
|
|
2020-11-03 01:17:45 +00:00
|
|
|
#define OPTSET(bit, exclusive) do { \
|
2007-02-26 20:47:52 +00:00
|
|
|
UNP_PCB_LOCK(unp); \
|
2020-11-03 01:17:45 +00:00
|
|
|
if (optval) { \
|
|
|
|
if ((unp->unp_flags & (exclusive)) != 0) { \
|
|
|
|
UNP_PCB_UNLOCK(unp); \
|
|
|
|
error = EINVAL; \
|
|
|
|
break; \
|
|
|
|
} \
|
|
|
|
unp->unp_flags |= (bit); \
|
|
|
|
} else \
|
|
|
|
unp->unp_flags &= ~(bit); \
|
2007-02-26 20:47:52 +00:00
|
|
|
UNP_PCB_UNLOCK(unp); \
|
|
|
|
} while (0)
|
2005-04-13 00:01:46 +00:00
|
|
|
|
|
|
|
switch (sopt->sopt_name) {
|
|
|
|
case LOCAL_CREDS:
|
2020-11-03 01:17:45 +00:00
|
|
|
OPTSET(UNP_WANTCRED_ONESHOT, UNP_WANTCRED_ALWAYS);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case LOCAL_CREDS_PERSISTENT:
|
|
|
|
OPTSET(UNP_WANTCRED_ALWAYS, UNP_WANTCRED_ONESHOT);
|
2005-04-13 00:01:46 +00:00
|
|
|
break;
|
2007-02-26 20:47:52 +00:00
|
|
|
|
2005-04-13 00:01:46 +00:00
|
|
|
case LOCAL_CONNWAIT:
|
2020-11-03 01:17:45 +00:00
|
|
|
OPTSET(UNP_CONNWAIT, 0);
|
2005-04-13 00:01:46 +00:00
|
|
|
break;
|
2007-02-26 20:47:52 +00:00
|
|
|
|
2005-04-13 00:01:46 +00:00
|
|
|
default:
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
#undef OPTSET
|
|
|
|
default:
|
|
|
|
error = ENOPROTOOPT;
|
|
|
|
break;
|
|
|
|
}
|
2005-04-25 00:48:04 +00:00
|
|
|
break;
|
2007-02-26 20:47:52 +00:00
|
|
|
|
2001-08-17 22:01:18 +00:00
|
|
|
default:
|
|
|
|
error = EOPNOTSUPP;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
return (error);
|
|
|
|
}
|
2004-01-11 19:48:19 +00:00
|
|
|
|
1995-12-14 09:55:16 +00:00
|
|
|
static int
|
2005-02-20 23:22:13 +00:00
|
|
|
unp_connect(struct socket *so, struct sockaddr *nam, struct thread *td)
|
2013-03-02 21:11:30 +00:00
|
|
|
{
|
|
|
|
|
|
|
|
return (unp_connectat(AT_FDCWD, so, nam, td));
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
unp_connectat(int fd, struct socket *so, struct sockaddr *nam,
|
|
|
|
struct thread *td)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2020-09-15 19:23:01 +00:00
|
|
|
struct mtx *vplock;
|
|
|
|
struct sockaddr_un *soun;
|
2005-02-20 23:22:13 +00:00
|
|
|
struct vnode *vp;
|
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
|
|
|
struct socket *so2;
|
2004-08-14 03:43:49 +00:00
|
|
|
struct unpcb *unp, *unp2, *unp3;
|
1994-05-24 10:09:53 +00:00
|
|
|
struct nameidata nd;
|
1997-08-16 19:16:27 +00:00
|
|
|
char buf[SOCK_MAXADDRLEN];
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
struct sockaddr *sa;
|
Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.
The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.
The structure definition looks like this:
struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};
The initial CAP_RIGHTS_VERSION is 0.
The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.
The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.
To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.
#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)
We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:
#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)
#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)
There is new API to manage the new cap_rights_t structure:
cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);
bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);
Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:
cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);
There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:
#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);
Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:
cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);
Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.
This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.
Sponsored by: The FreeBSD Foundation
2013-09-05 00:09:56 +00:00
|
|
|
cap_rights_t rights;
|
2020-09-15 19:23:22 +00:00
|
|
|
int error, len;
|
2020-09-15 19:23:01 +00:00
|
|
|
bool connreq;
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
|
2014-07-14 06:00:01 +00:00
|
|
|
if (nam->sa_family != AF_UNIX)
|
|
|
|
return (EAFNOSUPPORT);
|
2011-09-28 08:47:17 +00:00
|
|
|
if (nam->sa_len > sizeof(struct sockaddr_un))
|
|
|
|
return (EINVAL);
|
1997-08-16 19:16:27 +00:00
|
|
|
len = nam->sa_len - offsetof(struct sockaddr_un, sun_path);
|
|
|
|
if (len <= 0)
|
2004-01-11 19:48:19 +00:00
|
|
|
return (EINVAL);
|
2020-09-15 19:23:01 +00:00
|
|
|
soun = (struct sockaddr_un *)nam;
|
2008-07-03 23:26:10 +00:00
|
|
|
bcopy(soun->sun_path, buf, len);
|
|
|
|
buf[len] = 0;
|
2007-02-26 20:47:52 +00:00
|
|
|
|
2018-05-17 17:59:35 +00:00
|
|
|
unp = sotounpcb(so);
|
2007-02-26 20:47:52 +00:00
|
|
|
UNP_PCB_LOCK(unp);
|
2020-09-15 19:23:22 +00:00
|
|
|
for (;;) {
|
|
|
|
/*
|
|
|
|
* Wait for connection state to stabilize. If a connection
|
|
|
|
* already exists, give up. For datagram sockets, which permit
|
|
|
|
* multiple consecutive connect(2) calls, upper layers are
|
|
|
|
* responsible for disconnecting in advance of a subsequent
|
|
|
|
* connect(2), but this is not synchronized with PCB connection
|
|
|
|
* state.
|
|
|
|
*
|
|
|
|
* Also make sure that no threads are currently attempting to
|
|
|
|
* lock the peer socket, to ensure that unp_conn cannot
|
|
|
|
* transition between two valid sockets while locks are dropped.
|
|
|
|
*/
|
|
|
|
if (unp->unp_conn != NULL) {
|
|
|
|
UNP_PCB_UNLOCK(unp);
|
|
|
|
return (EISCONN);
|
|
|
|
}
|
|
|
|
if ((unp->unp_flags & UNP_CONNECTING) != 0) {
|
|
|
|
UNP_PCB_UNLOCK(unp);
|
|
|
|
return (EALREADY);
|
|
|
|
}
|
|
|
|
if (unp->unp_pairbusy > 0) {
|
|
|
|
unp->unp_flags |= UNP_WAITING;
|
|
|
|
mtx_sleep(unp, UNP_PCB_LOCKPTR(unp), 0, "unpeer", 0);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
break;
|
2006-07-23 12:01:14 +00:00
|
|
|
}
|
2007-02-13 21:00:57 +00:00
|
|
|
unp->unp_flags |= UNP_CONNECTING;
|
2007-02-26 20:47:52 +00:00
|
|
|
UNP_PCB_UNLOCK(unp);
|
|
|
|
|
2020-09-15 19:23:01 +00:00
|
|
|
connreq = (so->so_proto->pr_flags & PR_CONNREQUIRED) != 0;
|
|
|
|
if (connreq)
|
|
|
|
sa = malloc(sizeof(struct sockaddr_un), M_SONAME, M_WAITOK);
|
|
|
|
else
|
|
|
|
sa = NULL;
|
2013-03-02 21:11:30 +00:00
|
|
|
NDINIT_ATRIGHTS(&nd, LOOKUP, FOLLOW | LOCKSHARED | LOCKLEAF,
|
Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.
The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.
The structure definition looks like this:
struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};
The initial CAP_RIGHTS_VERSION is 0.
The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.
The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.
To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.
#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)
We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:
#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)
#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)
There is new API to manage the new cap_rights_t structure:
cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);
bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);
Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:
cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);
There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:
#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);
Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:
cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);
Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.
This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.
Sponsored by: The FreeBSD Foundation
2013-09-05 00:09:56 +00:00
|
|
|
UIO_SYSSPACE, buf, fd, cap_rights_init(&rights, CAP_CONNECTAT), td);
|
1994-10-02 17:35:40 +00:00
|
|
|
error = namei(&nd);
|
|
|
|
if (error)
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
vp = NULL;
|
|
|
|
else
|
|
|
|
vp = nd.ni_vp;
|
|
|
|
ASSERT_VOP_LOCKED(vp, "unp_connect");
|
1999-12-15 23:02:35 +00:00
|
|
|
NDFREE(&nd, NDF_ONLY_PNBUF);
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
if (error)
|
|
|
|
goto bad;
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
if (vp->v_type != VSOCK) {
|
|
|
|
error = ENOTSOCK;
|
|
|
|
goto bad;
|
|
|
|
}
|
2007-02-22 09:37:44 +00:00
|
|
|
#ifdef MAC
|
2007-10-24 19:04:04 +00:00
|
|
|
error = mac_vnode_check_open(td->td_ucred, vp, VWRITE | VREAD);
|
2007-02-22 09:37:44 +00:00
|
|
|
if (error)
|
|
|
|
goto bad;
|
|
|
|
#endif
|
2002-02-27 18:32:23 +00:00
|
|
|
error = VOP_ACCESS(vp, VWRITE, td->td_ucred, td);
|
1994-10-02 17:35:40 +00:00
|
|
|
if (error)
|
1994-05-24 10:09:53 +00:00
|
|
|
goto bad;
|
2007-02-26 20:47:52 +00:00
|
|
|
|
2004-08-14 03:43:49 +00:00
|
|
|
unp = sotounpcb(so);
|
2006-03-17 13:52:57 +00:00
|
|
|
KASSERT(unp != NULL, ("unp_connect: unp == NULL"));
|
2007-02-26 20:47:52 +00:00
|
|
|
|
2018-05-17 17:59:35 +00:00
|
|
|
vplock = mtx_pool_find(mtxpool_sleep, vp);
|
|
|
|
mtx_lock(vplock);
|
2017-06-02 17:31:25 +00:00
|
|
|
VOP_UNP_CONNECT(vp, &unp2);
|
|
|
|
if (unp2 == NULL) {
|
1994-05-24 10:09:53 +00:00
|
|
|
error = ECONNREFUSED;
|
2004-07-18 01:29:43 +00:00
|
|
|
goto bad2;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2017-06-02 17:31:25 +00:00
|
|
|
so2 = unp2->unp_socket;
|
1994-05-24 10:09:53 +00:00
|
|
|
if (so->so_type != so2->so_type) {
|
|
|
|
error = EPROTOTYPE;
|
2004-07-18 01:29:43 +00:00
|
|
|
goto bad2;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2020-09-15 19:23:01 +00:00
|
|
|
if (connreq) {
|
2007-02-26 20:47:52 +00:00
|
|
|
if (so2->so_options & SO_ACCEPTCONN) {
|
2011-02-16 21:29:13 +00:00
|
|
|
CURVNET_SET(so2->so_vnet);
|
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
|
|
|
so2 = sonewconn(so2, 0);
|
2011-02-16 21:29:13 +00:00
|
|
|
CURVNET_RESTORE();
|
2007-02-26 20:47:52 +00:00
|
|
|
} else
|
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
|
|
|
so2 = NULL;
|
|
|
|
if (so2 == NULL) {
|
1994-05-24 10:09:53 +00:00
|
|
|
error = ECONNREFUSED;
|
2018-05-20 21:20:26 +00:00
|
|
|
goto bad2;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
|
|
|
unp3 = sotounpcb(so2);
|
2020-09-15 19:22:16 +00:00
|
|
|
unp_pcb_lock_pair(unp2, unp3);
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
if (unp2->unp_addr != NULL) {
|
|
|
|
bcopy(unp2->unp_addr, sa, unp2->unp_addr->sun_len);
|
|
|
|
unp3->unp_addr = (struct sockaddr_un *) sa;
|
|
|
|
sa = NULL;
|
|
|
|
}
|
2009-01-01 20:03:22 +00:00
|
|
|
|
2018-08-03 01:37:00 +00:00
|
|
|
unp_copy_peercred(td, unp3, unp, unp2);
|
2009-01-01 20:03:22 +00:00
|
|
|
|
2007-02-26 20:47:52 +00:00
|
|
|
UNP_PCB_UNLOCK(unp2);
|
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
|
|
|
unp2 = unp3;
|
2020-09-15 19:23:22 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* It is safe to block on the PCB lock here since unp2 is
|
|
|
|
* nascent and cannot be connected to any other sockets.
|
|
|
|
*/
|
|
|
|
UNP_PCB_LOCK(unp);
|
2002-07-31 03:03:22 +00:00
|
|
|
#ifdef MAC
|
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
|
|
|
mac_socketpeer_set_from_socket(so, so2);
|
|
|
|
mac_socketpeer_set_from_socket(so2, so);
|
2002-07-31 03:03:22 +00:00
|
|
|
#endif
|
2018-05-24 18:22:05 +00:00
|
|
|
} else {
|
2020-09-15 19:22:16 +00:00
|
|
|
unp_pcb_lock_pair(unp, unp2);
|
2018-05-24 18:22:05 +00:00
|
|
|
}
|
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
|
|
|
KASSERT(unp2 != NULL && so2 != NULL && unp2->unp_socket == so2 &&
|
|
|
|
sotounpcb(so2) == unp2,
|
|
|
|
("%s: unp2 %p so2 %p", __func__, unp2, so2));
|
2005-04-13 00:01:46 +00:00
|
|
|
error = unp_connect2(so, so2, PRU_CONNECT);
|
2020-09-15 19:22:16 +00:00
|
|
|
unp_pcb_unlock_pair(unp, unp2);
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
bad2:
|
2018-05-17 17:59:35 +00:00
|
|
|
mtx_unlock(vplock);
|
1994-05-24 10:09:53 +00:00
|
|
|
bad:
|
2018-05-17 17:59:35 +00:00
|
|
|
if (vp != NULL) {
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
vput(vp);
|
2018-05-17 17:59:35 +00:00
|
|
|
}
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
free(sa, M_SONAME);
|
2007-02-26 20:47:52 +00:00
|
|
|
UNP_PCB_LOCK(unp);
|
2020-09-15 19:23:22 +00:00
|
|
|
KASSERT((unp->unp_flags & UNP_CONNECTING) != 0,
|
|
|
|
("%s: unp %p has UNP_CONNECTING clear", __func__, unp));
|
2006-07-23 12:01:14 +00:00
|
|
|
unp->unp_flags &= ~UNP_CONNECTING;
|
2007-02-26 20:47:52 +00:00
|
|
|
UNP_PCB_UNLOCK(unp);
|
1994-05-24 10:09:53 +00:00
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2018-08-03 01:37:00 +00:00
|
|
|
/*
|
|
|
|
* Set socket peer credentials at connection time.
|
|
|
|
*
|
|
|
|
* The client's PCB credentials are copied from its process structure. The
|
|
|
|
* server's PCB credentials are copied from the socket on which it called
|
|
|
|
* listen(2). uipc_listen cached that process's credentials at the time.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
unp_copy_peercred(struct thread *td, struct unpcb *client_unp,
|
|
|
|
struct unpcb *server_unp, struct unpcb *listen_unp)
|
|
|
|
{
|
2019-05-30 14:24:26 +00:00
|
|
|
cru2xt(td, &client_unp->unp_peercred);
|
2018-08-03 01:37:00 +00:00
|
|
|
client_unp->unp_flags |= UNP_HAVEPC;
|
|
|
|
|
|
|
|
memcpy(&server_unp->unp_peercred, &listen_unp->unp_peercred,
|
|
|
|
sizeof(server_unp->unp_peercred));
|
|
|
|
server_unp->unp_flags |= UNP_HAVEPC;
|
2020-11-03 01:17:45 +00:00
|
|
|
client_unp->unp_flags |= (listen_unp->unp_flags & UNP_WANTCRED_MASK);
|
2018-08-03 01:37:00 +00:00
|
|
|
}
|
|
|
|
|
2004-03-31 01:41:30 +00:00
|
|
|
static int
|
2005-04-13 00:01:46 +00:00
|
|
|
unp_connect2(struct socket *so, struct socket *so2, int req)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2007-02-26 20:47:52 +00:00
|
|
|
struct unpcb *unp;
|
2005-02-20 23:22:13 +00:00
|
|
|
struct unpcb *unp2;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2007-02-26 20:47:52 +00:00
|
|
|
unp = sotounpcb(so);
|
|
|
|
KASSERT(unp != NULL, ("unp_connect2: unp == NULL"));
|
|
|
|
unp2 = sotounpcb(so2);
|
|
|
|
KASSERT(unp2 != NULL, ("unp_connect2: unp2 == NULL"));
|
|
|
|
|
|
|
|
UNP_PCB_LOCK_ASSERT(unp);
|
|
|
|
UNP_PCB_LOCK_ASSERT(unp2);
|
2020-09-15 19:23:22 +00:00
|
|
|
KASSERT(unp->unp_conn == NULL,
|
|
|
|
("%s: socket %p is already connected", __func__, unp));
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
if (so2->so_type != so->so_type)
|
|
|
|
return (EPROTOTYPE);
|
|
|
|
unp->unp_conn = unp2;
|
2018-05-17 17:59:35 +00:00
|
|
|
unp_pcb_hold(unp2);
|
|
|
|
unp_pcb_hold(unp);
|
1994-05-24 10:09:53 +00:00
|
|
|
switch (so->so_type) {
|
|
|
|
case SOCK_DGRAM:
|
2018-05-17 17:59:35 +00:00
|
|
|
UNP_REF_LIST_LOCK();
|
1998-05-15 20:11:40 +00:00
|
|
|
LIST_INSERT_HEAD(&unp2->unp_refs, unp, unp_reflink);
|
2018-05-17 17:59:35 +00:00
|
|
|
UNP_REF_LIST_UNLOCK();
|
1994-05-24 10:09:53 +00:00
|
|
|
soisconnected(so);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case SOCK_STREAM:
|
2009-10-05 14:49:16 +00:00
|
|
|
case SOCK_SEQPACKET:
|
2020-09-15 19:23:22 +00:00
|
|
|
KASSERT(unp2->unp_conn == NULL,
|
|
|
|
("%s: socket %p is already connected", __func__, unp2));
|
1994-05-24 10:09:53 +00:00
|
|
|
unp2->unp_conn = unp;
|
2005-04-13 00:01:46 +00:00
|
|
|
if (req == PRU_CONNECT &&
|
|
|
|
((unp->unp_flags | unp2->unp_flags) & UNP_CONNWAIT))
|
|
|
|
soisconnecting(so);
|
|
|
|
else
|
|
|
|
soisconnected(so);
|
1994-05-24 10:09:53 +00:00
|
|
|
soisconnected(so2);
|
|
|
|
break;
|
|
|
|
|
|
|
|
default:
|
|
|
|
panic("unp_connect2");
|
|
|
|
}
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
1995-12-14 09:55:16 +00:00
|
|
|
static void
|
2007-02-26 20:47:52 +00:00
|
|
|
unp_disconnect(struct unpcb *unp, struct unpcb *unp2)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2018-05-17 17:59:35 +00:00
|
|
|
struct socket *so, *so2;
|
2020-09-15 19:22:37 +00:00
|
|
|
#ifdef INVARIANTS
|
|
|
|
struct unpcb *unptmp;
|
|
|
|
#endif
|
2007-02-26 20:47:52 +00:00
|
|
|
|
|
|
|
UNP_PCB_LOCK_ASSERT(unp);
|
|
|
|
UNP_PCB_LOCK_ASSERT(unp2);
|
2020-09-15 19:22:37 +00:00
|
|
|
KASSERT(unp->unp_conn == unp2,
|
|
|
|
("%s: unpcb %p is not connected to %p", __func__, unp, unp2));
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
|
2004-03-30 02:16:25 +00:00
|
|
|
unp->unp_conn = NULL;
|
2018-05-17 17:59:35 +00:00
|
|
|
so = unp->unp_socket;
|
|
|
|
so2 = unp2->unp_socket;
|
1994-05-24 10:09:53 +00:00
|
|
|
switch (unp->unp_socket->so_type) {
|
|
|
|
case SOCK_DGRAM:
|
2018-05-17 17:59:35 +00:00
|
|
|
UNP_REF_LIST_LOCK();
|
2020-09-15 19:22:37 +00:00
|
|
|
#ifdef INVARIANTS
|
|
|
|
LIST_FOREACH(unptmp, &unp2->unp_refs, unp_reflink) {
|
|
|
|
if (unptmp == unp)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
KASSERT(unptmp != NULL,
|
|
|
|
("%s: %p not found in reflist of %p", __func__, unp, unp2));
|
|
|
|
#endif
|
1998-05-15 20:11:40 +00:00
|
|
|
LIST_REMOVE(unp, unp_reflink);
|
2018-05-17 17:59:35 +00:00
|
|
|
UNP_REF_LIST_UNLOCK();
|
|
|
|
if (so) {
|
|
|
|
SOCK_LOCK(so);
|
|
|
|
so->so_state &= ~SS_ISCONNECTED;
|
|
|
|
SOCK_UNLOCK(so);
|
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
break;
|
|
|
|
|
|
|
|
case SOCK_STREAM:
|
2009-10-05 14:49:16 +00:00
|
|
|
case SOCK_SEQPACKET:
|
2018-05-17 17:59:35 +00:00
|
|
|
if (so)
|
|
|
|
soisdisconnected(so);
|
|
|
|
MPASS(unp2->unp_conn == unp);
|
2004-03-30 02:16:25 +00:00
|
|
|
unp2->unp_conn = NULL;
|
2018-05-17 17:59:35 +00:00
|
|
|
if (so2)
|
|
|
|
soisdisconnected(so2);
|
1994-05-24 10:09:53 +00:00
|
|
|
break;
|
|
|
|
}
|
2020-09-15 19:22:37 +00:00
|
|
|
|
|
|
|
if (unp == unp2) {
|
|
|
|
unp_pcb_rele_notlast(unp);
|
|
|
|
if (!unp_pcb_rele(unp))
|
|
|
|
UNP_PCB_UNLOCK(unp);
|
|
|
|
} else {
|
|
|
|
if (!unp_pcb_rele(unp))
|
|
|
|
UNP_PCB_UNLOCK(unp);
|
|
|
|
if (!unp_pcb_rele(unp2))
|
|
|
|
UNP_PCB_UNLOCK(unp2);
|
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
/*
|
2007-05-11 12:10:45 +00:00
|
|
|
* unp_pcblist() walks the global list of struct unpcb's to generate a
|
|
|
|
* pointer list, bumping the refcount on each unpcb. It then copies them out
|
|
|
|
* sequentially, validating the generation number on each to see if it has
|
|
|
|
* been detached. All of this is necessary because copyout() may sleep on
|
|
|
|
* disk I/O.
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
*/
|
1998-05-15 20:11:40 +00:00
|
|
|
static int
|
2000-07-04 11:25:35 +00:00
|
|
|
unp_pcblist(SYSCTL_HANDLER_ARGS)
|
1998-05-15 20:11:40 +00:00
|
|
|
{
|
|
|
|
struct unpcb *unp, **unp_list;
|
|
|
|
unp_gen_t gencnt;
|
2001-08-18 02:53:50 +00:00
|
|
|
struct xunpgen *xug;
|
1998-05-15 20:11:40 +00:00
|
|
|
struct unp_head *head;
|
2001-08-18 02:53:50 +00:00
|
|
|
struct xunpcb *xu;
|
2018-01-22 02:08:10 +00:00
|
|
|
u_int i;
|
2020-09-15 19:21:58 +00:00
|
|
|
int error, n;
|
1998-05-15 20:11:40 +00:00
|
|
|
|
2009-10-05 14:49:16 +00:00
|
|
|
switch ((intptr_t)arg1) {
|
|
|
|
case SOCK_STREAM:
|
|
|
|
head = &unp_shead;
|
|
|
|
break;
|
|
|
|
|
|
|
|
case SOCK_DGRAM:
|
|
|
|
head = &unp_dhead;
|
|
|
|
break;
|
|
|
|
|
|
|
|
case SOCK_SEQPACKET:
|
|
|
|
head = &unp_sphead;
|
|
|
|
break;
|
|
|
|
|
|
|
|
default:
|
2009-10-05 22:23:12 +00:00
|
|
|
panic("unp_pcblist: arg1 %d", (int)(intptr_t)arg1);
|
2009-10-05 14:49:16 +00:00
|
|
|
}
|
1998-05-15 20:11:40 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The process of preparing the PCB list is too time-consuming and
|
|
|
|
* resource-intensive to repeat twice on every request.
|
|
|
|
*/
|
2004-03-30 02:16:25 +00:00
|
|
|
if (req->oldptr == NULL) {
|
1998-05-15 20:11:40 +00:00
|
|
|
n = unp_count;
|
2001-08-18 02:53:50 +00:00
|
|
|
req->oldidx = 2 * (sizeof *xug)
|
1998-05-15 20:11:40 +00:00
|
|
|
+ (n + n/8) * sizeof(struct xunpcb);
|
2004-01-11 19:48:19 +00:00
|
|
|
return (0);
|
1998-05-15 20:11:40 +00:00
|
|
|
}
|
|
|
|
|
2004-03-30 02:16:25 +00:00
|
|
|
if (req->newptr != NULL)
|
2004-01-11 19:48:19 +00:00
|
|
|
return (EPERM);
|
1998-05-15 20:11:40 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* OK, now we're committed to doing something.
|
|
|
|
*/
|
2018-11-22 20:49:41 +00:00
|
|
|
xug = malloc(sizeof(*xug), M_TEMP, M_WAITOK | M_ZERO);
|
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
|
|
|
UNP_LINK_RLOCK();
|
1998-05-15 20:11:40 +00:00
|
|
|
gencnt = unp_gencnt;
|
|
|
|
n = unp_count;
|
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
|
|
|
UNP_LINK_RUNLOCK();
|
1998-05-15 20:11:40 +00:00
|
|
|
|
2001-08-18 02:53:50 +00:00
|
|
|
xug->xug_len = sizeof *xug;
|
|
|
|
xug->xug_count = n;
|
|
|
|
xug->xug_gen = gencnt;
|
|
|
|
xug->xug_sogen = so_gencnt;
|
|
|
|
error = SYSCTL_OUT(req, xug, sizeof *xug);
|
|
|
|
if (error) {
|
|
|
|
free(xug, M_TEMP);
|
2004-01-11 19:48:19 +00:00
|
|
|
return (error);
|
2001-08-18 02:53:50 +00:00
|
|
|
}
|
1998-05-15 20:11:40 +00:00
|
|
|
|
2003-02-19 05:47:46 +00:00
|
|
|
unp_list = malloc(n * sizeof *unp_list, M_TEMP, M_WAITOK);
|
2004-01-11 19:48:19 +00:00
|
|
|
|
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
|
|
|
UNP_LINK_RLOCK();
|
1999-11-16 10:56:05 +00:00
|
|
|
for (unp = LIST_FIRST(head), i = 0; unp && i < n;
|
|
|
|
unp = LIST_NEXT(unp, unp_link)) {
|
2007-02-26 20:47:52 +00:00
|
|
|
UNP_PCB_LOCK(unp);
|
2001-10-09 21:40:30 +00:00
|
|
|
if (unp->unp_gencnt <= gencnt) {
|
2002-02-27 18:32:23 +00:00
|
|
|
if (cr_cansee(req->td->td_ucred,
|
2007-02-26 20:47:52 +00:00
|
|
|
unp->unp_socket->so_cred)) {
|
|
|
|
UNP_PCB_UNLOCK(unp);
|
2001-10-05 07:06:32 +00:00
|
|
|
continue;
|
2007-02-26 20:47:52 +00:00
|
|
|
}
|
1998-05-15 20:11:40 +00:00
|
|
|
unp_list[i++] = unp;
|
2018-05-17 17:59:35 +00:00
|
|
|
unp_pcb_hold(unp);
|
2001-10-05 07:06:32 +00:00
|
|
|
}
|
2007-02-26 20:47:52 +00:00
|
|
|
UNP_PCB_UNLOCK(unp);
|
1998-05-15 20:11:40 +00:00
|
|
|
}
|
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
|
|
|
UNP_LINK_RUNLOCK();
|
2006-07-22 17:24:55 +00:00
|
|
|
n = i; /* In case we lost some during malloc. */
|
1998-05-15 20:11:40 +00:00
|
|
|
|
|
|
|
error = 0;
|
2005-05-07 00:41:36 +00:00
|
|
|
xu = malloc(sizeof(*xu), M_TEMP, M_WAITOK | M_ZERO);
|
1998-05-15 20:11:40 +00:00
|
|
|
for (i = 0; i < n; i++) {
|
|
|
|
unp = unp_list[i];
|
2007-02-26 20:47:52 +00:00
|
|
|
UNP_PCB_LOCK(unp);
|
2020-09-15 19:21:58 +00:00
|
|
|
if (unp_pcb_rele(unp))
|
|
|
|
continue;
|
2018-05-17 17:59:35 +00:00
|
|
|
|
2020-09-15 19:21:58 +00:00
|
|
|
if (unp->unp_gencnt <= gencnt) {
|
2001-08-18 02:53:50 +00:00
|
|
|
xu->xu_len = sizeof *xu;
|
2018-07-10 13:03:06 +00:00
|
|
|
xu->xu_unpp = (uintptr_t)unp;
|
1998-05-15 20:11:40 +00:00
|
|
|
/*
|
|
|
|
* XXX - need more locking here to protect against
|
|
|
|
* connect/disconnect races for SMP.
|
|
|
|
*/
|
2004-03-30 02:16:25 +00:00
|
|
|
if (unp->unp_addr != NULL)
|
2004-01-11 19:48:19 +00:00
|
|
|
bcopy(unp->unp_addr, &xu->xu_addr,
|
1998-05-15 20:11:40 +00:00
|
|
|
unp->unp_addr->sun_len);
|
2017-10-02 23:29:56 +00:00
|
|
|
else
|
|
|
|
bzero(&xu->xu_addr, sizeof(xu->xu_addr));
|
2004-03-30 02:16:25 +00:00
|
|
|
if (unp->unp_conn != NULL &&
|
|
|
|
unp->unp_conn->unp_addr != NULL)
|
1998-05-15 20:11:40 +00:00
|
|
|
bcopy(unp->unp_conn->unp_addr,
|
2001-08-18 02:53:50 +00:00
|
|
|
&xu->xu_caddr,
|
1998-05-15 20:11:40 +00:00
|
|
|
unp->unp_conn->unp_addr->sun_len);
|
2017-10-02 23:29:56 +00:00
|
|
|
else
|
|
|
|
bzero(&xu->xu_caddr, sizeof(xu->xu_caddr));
|
2018-07-10 13:03:06 +00:00
|
|
|
xu->unp_vnode = (uintptr_t)unp->unp_vnode;
|
|
|
|
xu->unp_conn = (uintptr_t)unp->unp_conn;
|
|
|
|
xu->xu_firstref = (uintptr_t)LIST_FIRST(&unp->unp_refs);
|
|
|
|
xu->xu_nextref = (uintptr_t)LIST_NEXT(unp, unp_reflink);
|
2017-10-02 23:29:56 +00:00
|
|
|
xu->unp_gencnt = unp->unp_gencnt;
|
2001-08-18 02:53:50 +00:00
|
|
|
sotoxsocket(unp->unp_socket, &xu->xu_socket);
|
2007-02-26 20:47:52 +00:00
|
|
|
UNP_PCB_UNLOCK(unp);
|
2001-08-18 02:53:50 +00:00
|
|
|
error = SYSCTL_OUT(req, xu, sizeof *xu);
|
2020-09-15 19:21:58 +00:00
|
|
|
} else {
|
2007-02-26 20:47:52 +00:00
|
|
|
UNP_PCB_UNLOCK(unp);
|
2020-09-15 19:21:58 +00:00
|
|
|
}
|
1998-05-15 20:11:40 +00:00
|
|
|
}
|
2001-08-18 02:53:50 +00:00
|
|
|
free(xu, M_TEMP);
|
1998-05-15 20:11:40 +00:00
|
|
|
if (!error) {
|
|
|
|
/*
|
2006-07-22 17:24:55 +00:00
|
|
|
* Give the user an updated idea of our state. If the
|
|
|
|
* generation differs from what we told her before, she knows
|
|
|
|
* that something happened while we were processing this
|
|
|
|
* request, and it might be necessary to retry.
|
1998-05-15 20:11:40 +00:00
|
|
|
*/
|
2001-08-18 02:53:50 +00:00
|
|
|
xug->xug_gen = unp_gencnt;
|
|
|
|
xug->xug_sogen = so_gencnt;
|
|
|
|
xug->xug_count = unp_count;
|
|
|
|
error = SYSCTL_OUT(req, xug, sizeof *xug);
|
1998-05-15 20:11:40 +00:00
|
|
|
}
|
|
|
|
free(unp_list, M_TEMP);
|
2001-08-18 02:53:50 +00:00
|
|
|
free(xug, M_TEMP);
|
2004-01-11 19:48:19 +00:00
|
|
|
return (error);
|
1998-05-15 20:11:40 +00:00
|
|
|
}
|
|
|
|
|
2020-02-26 14:26:36 +00:00
|
|
|
SYSCTL_PROC(_net_local_dgram, OID_AUTO, pcblist,
|
|
|
|
CTLTYPE_OPAQUE | CTLFLAG_RD | CTLFLAG_MPSAFE,
|
2011-01-18 21:14:18 +00:00
|
|
|
(void *)(intptr_t)SOCK_DGRAM, 0, unp_pcblist, "S,xunpcb",
|
|
|
|
"List of active local datagram sockets");
|
2020-02-26 14:26:36 +00:00
|
|
|
SYSCTL_PROC(_net_local_stream, OID_AUTO, pcblist,
|
|
|
|
CTLTYPE_OPAQUE | CTLFLAG_RD | CTLFLAG_MPSAFE,
|
2011-01-18 21:14:18 +00:00
|
|
|
(void *)(intptr_t)SOCK_STREAM, 0, unp_pcblist, "S,xunpcb",
|
|
|
|
"List of active local stream sockets");
|
|
|
|
SYSCTL_PROC(_net_local_seqpacket, OID_AUTO, pcblist,
|
2020-02-26 14:26:36 +00:00
|
|
|
CTLTYPE_OPAQUE | CTLFLAG_RD | CTLFLAG_MPSAFE,
|
2011-01-18 21:14:18 +00:00
|
|
|
(void *)(intptr_t)SOCK_SEQPACKET, 0, unp_pcblist, "S,xunpcb",
|
|
|
|
"List of active local seqpacket sockets");
|
1998-05-15 20:11:40 +00:00
|
|
|
|
1995-12-14 09:55:16 +00:00
|
|
|
static void
|
2005-02-20 23:22:13 +00:00
|
|
|
unp_shutdown(struct unpcb *unp)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2007-02-26 20:47:52 +00:00
|
|
|
struct unpcb *unp2;
|
1994-05-24 10:09:53 +00:00
|
|
|
struct socket *so;
|
|
|
|
|
2007-02-26 20:47:52 +00:00
|
|
|
UNP_PCB_LOCK_ASSERT(unp);
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
|
2007-02-26 20:47:52 +00:00
|
|
|
unp2 = unp->unp_conn;
|
2009-10-05 14:49:16 +00:00
|
|
|
if ((unp->unp_socket->so_type == SOCK_STREAM ||
|
|
|
|
(unp->unp_socket->so_type == SOCK_SEQPACKET)) && unp2 != NULL) {
|
2007-02-26 20:47:52 +00:00
|
|
|
so = unp2->unp_socket;
|
|
|
|
if (so != NULL)
|
|
|
|
socantrcvmore(so);
|
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
1995-12-14 09:55:16 +00:00
|
|
|
static void
|
2016-02-26 12:46:34 +00:00
|
|
|
unp_drop(struct unpcb *unp)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
|
|
|
struct socket *so = unp->unp_socket;
|
2007-02-26 20:47:52 +00:00
|
|
|
struct unpcb *unp2;
|
Introduce a subsystem lock around UNIX domain sockets in order to protect
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collection passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundatiuon
Obtained from: BSD/OS 5 snapshot provided by BSDi
2004-06-10 21:34:38 +00:00
|
|
|
|
2016-02-26 12:46:34 +00:00
|
|
|
/*
|
|
|
|
* Regardless of whether the socket's peer dropped the connection
|
|
|
|
* with this socket by aborting or disconnecting, POSIX requires
|
|
|
|
* that ECONNRESET is returned.
|
|
|
|
*/
|
2018-05-17 17:59:35 +00:00
|
|
|
|
|
|
|
UNP_PCB_LOCK(unp);
|
|
|
|
if (so)
|
|
|
|
so->so_error = ECONNRESET;
|
2020-09-15 19:23:22 +00:00
|
|
|
if ((unp2 = unp_pcb_lock_peer(unp)) != NULL) {
|
|
|
|
/* Last reference dropped in unp_disconnect(). */
|
2020-09-15 19:22:37 +00:00
|
|
|
unp_pcb_rele_notlast(unp);
|
2018-05-17 17:59:35 +00:00
|
|
|
unp_disconnect(unp, unp2);
|
2020-09-15 19:22:37 +00:00
|
|
|
} else if (!unp_pcb_rele(unp)) {
|
2018-05-17 17:59:35 +00:00
|
|
|
UNP_PCB_UNLOCK(unp);
|
2020-09-15 19:22:37 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
2001-10-04 13:11:48 +00:00
|
|
|
static void
|
2013-03-03 23:39:30 +00:00
|
|
|
unp_freerights(struct filedescent **fdep, int fdcount)
|
2001-10-04 13:11:48 +00:00
|
|
|
{
|
|
|
|
struct file *fp;
|
Merge Capsicum overhaul:
- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.
- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.
- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.
- The cap_getrights(2) syscall is renamed to cap_rights_get(2).
- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.
- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).
- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.
- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.
- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:
CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.
Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).
Added CAP_SYMLINKAT:
- Allow for symlinkat(2).
Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).
Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.
Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.
Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.
Removed CAP_MAPEXEC.
CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.
Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).
Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.
CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).
CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).
Added convinient defines:
#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE
#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)
Added defines for backward API compatibility:
#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)
Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib
2013-03-02 00:53:12 +00:00
|
|
|
int i;
|
2001-10-04 13:11:48 +00:00
|
|
|
|
2013-06-04 11:19:08 +00:00
|
|
|
KASSERT(fdcount > 0, ("%s: fdcount %d", __func__, fdcount));
|
|
|
|
|
2013-03-03 23:39:30 +00:00
|
|
|
for (i = 0; i < fdcount; i++) {
|
|
|
|
fp = fdep[i]->fde_file;
|
|
|
|
filecaps_free(&fdep[i]->fde_caps);
|
2001-10-04 13:11:48 +00:00
|
|
|
unp_discard(fp);
|
|
|
|
}
|
2013-03-03 23:39:30 +00:00
|
|
|
free(fdep[0], M_FILECAPS);
|
2001-10-04 13:11:48 +00:00
|
|
|
}
|
|
|
|
|
2008-10-03 13:01:56 +00:00
|
|
|
static int
|
2013-03-19 20:58:17 +00:00
|
|
|
unp_externalize(struct mbuf *control, struct mbuf **controlp, int flags)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2001-09-12 08:38:13 +00:00
|
|
|
struct thread *td = curthread; /* XXX */
|
2001-10-04 13:11:48 +00:00
|
|
|
struct cmsghdr *cm = mtod(control, struct cmsghdr *);
|
|
|
|
int i;
|
|
|
|
int *fdp;
|
Merge Capsicum overhaul:
- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.
- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.
- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.
- The cap_getrights(2) syscall is renamed to cap_rights_get(2).
- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.
- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).
- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.
- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.
- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:
CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.
Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).
Added CAP_SYMLINKAT:
- Allow for symlinkat(2).
Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).
Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.
Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.
Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.
Removed CAP_MAPEXEC.
CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.
Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).
Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.
CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).
CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).
Added convinient defines:
#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE
#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)
Added defines for backward API compatibility:
#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)
Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib
2013-03-02 00:53:12 +00:00
|
|
|
struct filedesc *fdesc = td->td_proc->p_fd;
|
2015-06-14 14:08:52 +00:00
|
|
|
struct filedescent **fdep;
|
2001-10-04 13:11:48 +00:00
|
|
|
void *data;
|
|
|
|
socklen_t clen = control->m_len, datalen;
|
|
|
|
int error, newfds;
|
|
|
|
u_int newlen;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2009-03-08 21:48:29 +00:00
|
|
|
UNP_LINK_UNLOCK_ASSERT();
|
2004-08-19 01:45:16 +00:00
|
|
|
|
2001-10-04 13:11:48 +00:00
|
|
|
error = 0;
|
|
|
|
if (controlp != NULL) /* controlp == NULL => free control messages */
|
|
|
|
*controlp = NULL;
|
|
|
|
while (cm != NULL) {
|
|
|
|
if (sizeof(*cm) > clen || cm->cmsg_len > clen) {
|
|
|
|
error = EINVAL;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
data = CMSG_DATA(cm);
|
|
|
|
datalen = (caddr_t)cm + cm->cmsg_len - (caddr_t)data;
|
|
|
|
if (cm->cmsg_level == SOL_SOCKET
|
|
|
|
&& cm->cmsg_type == SCM_RIGHTS) {
|
Merge Capsicum overhaul:
- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.
- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.
- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.
- The cap_getrights(2) syscall is renamed to cap_rights_get(2).
- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.
- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).
- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.
- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.
- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:
CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.
Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).
Added CAP_SYMLINKAT:
- Allow for symlinkat(2).
Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).
Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.
Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.
Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.
Removed CAP_MAPEXEC.
CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.
Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).
Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.
CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).
CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).
Added convinient defines:
#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE
#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)
Added defines for backward API compatibility:
#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)
Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib
2013-03-02 00:53:12 +00:00
|
|
|
newfds = datalen / sizeof(*fdep);
|
2013-06-04 11:19:08 +00:00
|
|
|
if (newfds == 0)
|
|
|
|
goto next;
|
Merge Capsicum overhaul:
- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.
- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.
- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.
- The cap_getrights(2) syscall is renamed to cap_rights_get(2).
- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.
- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).
- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.
- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.
- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:
CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.
Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).
Added CAP_SYMLINKAT:
- Allow for symlinkat(2).
Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).
Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.
Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.
Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.
Removed CAP_MAPEXEC.
CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.
Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).
Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.
CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).
CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).
Added convinient defines:
#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE
#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)
Added defines for backward API compatibility:
#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)
Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib
2013-03-02 00:53:12 +00:00
|
|
|
fdep = data;
|
2001-10-04 13:11:48 +00:00
|
|
|
|
2003-03-23 19:41:34 +00:00
|
|
|
/* If we're not outputting the descriptors free them. */
|
2001-10-04 13:11:48 +00:00
|
|
|
if (error || controlp == NULL) {
|
Merge Capsicum overhaul:
- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.
- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.
- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.
- The cap_getrights(2) syscall is renamed to cap_rights_get(2).
- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.
- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).
- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.
- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.
- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:
CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.
Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).
Added CAP_SYMLINKAT:
- Allow for symlinkat(2).
Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).
Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.
Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.
Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.
Removed CAP_MAPEXEC.
CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.
Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).
Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.
CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).
CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).
Added convinient defines:
#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE
#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)
Added defines for backward API compatibility:
#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)
Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib
2013-03-02 00:53:12 +00:00
|
|
|
unp_freerights(fdep, newfds);
|
2001-10-04 13:11:48 +00:00
|
|
|
goto next;
|
|
|
|
}
|
Merge Capsicum overhaul:
- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.
- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.
- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.
- The cap_getrights(2) syscall is renamed to cap_rights_get(2).
- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.
- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).
- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.
- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.
- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:
CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.
Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).
Added CAP_SYMLINKAT:
- Allow for symlinkat(2).
Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).
Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.
Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.
Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.
Removed CAP_MAPEXEC.
CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.
Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).
Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.
CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).
CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).
Added convinient defines:
#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE
#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)
Added defines for backward API compatibility:
#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)
Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib
2013-03-02 00:53:12 +00:00
|
|
|
FILEDESC_XLOCK(fdesc);
|
2008-10-03 09:01:55 +00:00
|
|
|
|
2000-03-09 15:15:27 +00:00
|
|
|
/*
|
2006-07-22 17:24:55 +00:00
|
|
|
* Now change each pointer to an fd in the global
|
|
|
|
* table to an integer that is the index to the local
|
|
|
|
* fd table entry that we set up to point to the
|
|
|
|
* global one we are transferring.
|
2000-03-09 15:15:27 +00:00
|
|
|
*/
|
2001-10-04 13:11:48 +00:00
|
|
|
newlen = newfds * sizeof(int);
|
|
|
|
*controlp = sbcreatecontrol(NULL, newlen,
|
|
|
|
SCM_RIGHTS, SOL_SOCKET);
|
|
|
|
if (*controlp == NULL) {
|
Merge Capsicum overhaul:
- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.
- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.
- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.
- The cap_getrights(2) syscall is renamed to cap_rights_get(2).
- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.
- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).
- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.
- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.
- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:
CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.
Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).
Added CAP_SYMLINKAT:
- Allow for symlinkat(2).
Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).
Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.
Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.
Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.
Removed CAP_MAPEXEC.
CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.
Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).
Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.
CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).
CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).
Added convinient defines:
#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE
#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)
Added defines for backward API compatibility:
#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)
Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib
2013-03-02 00:53:12 +00:00
|
|
|
FILEDESC_XUNLOCK(fdesc);
|
2001-10-04 13:11:48 +00:00
|
|
|
error = E2BIG;
|
Merge Capsicum overhaul:
- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.
- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.
- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.
- The cap_getrights(2) syscall is renamed to cap_rights_get(2).
- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.
- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).
- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.
- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.
- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:
CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.
Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).
Added CAP_SYMLINKAT:
- Allow for symlinkat(2).
Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).
Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.
Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.
Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.
Removed CAP_MAPEXEC.
CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.
Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).
Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.
CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).
CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).
Added convinient defines:
#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE
#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)
Added defines for backward API compatibility:
#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)
Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib
2013-03-02 00:53:12 +00:00
|
|
|
unp_freerights(fdep, newfds);
|
2001-10-04 13:11:48 +00:00
|
|
|
goto next;
|
|
|
|
}
|
|
|
|
|
|
|
|
fdp = (int *)
|
|
|
|
CMSG_DATA(mtod(*controlp, struct cmsghdr *));
|
2013-04-14 17:08:34 +00:00
|
|
|
if (fdallocn(td, 0, fdp, newfds) != 0) {
|
2015-06-12 06:28:22 +00:00
|
|
|
FILEDESC_XUNLOCK(fdesc);
|
2013-04-14 17:08:34 +00:00
|
|
|
error = EMSGSIZE;
|
|
|
|
unp_freerights(fdep, newfds);
|
|
|
|
m_freem(*controlp);
|
|
|
|
*controlp = NULL;
|
|
|
|
goto next;
|
|
|
|
}
|
2013-03-03 23:39:30 +00:00
|
|
|
for (i = 0; i < newfds; i++, fdp++) {
|
2015-06-14 14:08:52 +00:00
|
|
|
_finstall(fdesc, fdep[i]->fde_file, *fdp,
|
|
|
|
(flags & MSG_CMSG_CLOEXEC) != 0 ? UF_EXCLOSE : 0,
|
|
|
|
&fdep[i]->fde_caps);
|
|
|
|
unp_externalize_fp(fdep[i]->fde_file);
|
2001-10-04 13:11:48 +00:00
|
|
|
}
|
2018-08-07 16:36:48 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The new type indicates that the mbuf data refers to
|
|
|
|
* kernel resources that may need to be released before
|
|
|
|
* the mbuf is freed.
|
|
|
|
*/
|
|
|
|
m_chtype(*controlp, MT_EXTCONTROL);
|
Merge Capsicum overhaul:
- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.
- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.
- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.
- The cap_getrights(2) syscall is renamed to cap_rights_get(2).
- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.
- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).
- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.
- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.
- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:
CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.
Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).
Added CAP_SYMLINKAT:
- Allow for symlinkat(2).
Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).
Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.
Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.
Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.
Removed CAP_MAPEXEC.
CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.
Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).
Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.
CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).
CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).
Added convinient defines:
#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE
#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)
Added defines for backward API compatibility:
#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)
Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib
2013-03-02 00:53:12 +00:00
|
|
|
FILEDESC_XUNLOCK(fdesc);
|
2013-06-04 11:19:08 +00:00
|
|
|
free(fdep[0], M_FILECAPS);
|
2006-07-22 17:24:55 +00:00
|
|
|
} else {
|
|
|
|
/* We can just copy anything else across. */
|
2001-10-04 13:11:48 +00:00
|
|
|
if (error || controlp == NULL)
|
|
|
|
goto next;
|
|
|
|
*controlp = sbcreatecontrol(NULL, datalen,
|
|
|
|
cm->cmsg_type, cm->cmsg_level);
|
|
|
|
if (*controlp == NULL) {
|
|
|
|
error = ENOBUFS;
|
|
|
|
goto next;
|
|
|
|
}
|
|
|
|
bcopy(data,
|
|
|
|
CMSG_DATA(mtod(*controlp, struct cmsghdr *)),
|
|
|
|
datalen);
|
2000-03-09 15:15:27 +00:00
|
|
|
}
|
2001-10-04 13:11:48 +00:00
|
|
|
controlp = &(*controlp)->m_next;
|
|
|
|
|
|
|
|
next:
|
|
|
|
if (CMSG_SPACE(datalen) < clen) {
|
|
|
|
clen -= CMSG_SPACE(datalen);
|
|
|
|
cm = (struct cmsghdr *)
|
|
|
|
((caddr_t)cm + CMSG_SPACE(datalen));
|
|
|
|
} else {
|
|
|
|
clen = 0;
|
|
|
|
cm = NULL;
|
2000-03-09 15:15:27 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2000-03-09 15:15:27 +00:00
|
|
|
|
2001-10-04 13:11:48 +00:00
|
|
|
m_freem(control);
|
|
|
|
return (error);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
2006-04-21 09:25:40 +00:00
|
|
|
static void
|
|
|
|
unp_zone_change(void *tag)
|
|
|
|
{
|
|
|
|
|
|
|
|
uma_zone_set_max(unp_zone, maxsockets);
|
|
|
|
}
|
|
|
|
|
2020-09-15 19:21:58 +00:00
|
|
|
#ifdef INVARIANTS
|
|
|
|
static void
|
|
|
|
unp_zdtor(void *mem, int size __unused, void *arg __unused)
|
|
|
|
{
|
|
|
|
struct unpcb *unp;
|
|
|
|
|
|
|
|
unp = mem;
|
|
|
|
|
|
|
|
KASSERT(LIST_EMPTY(&unp->unp_refs),
|
|
|
|
("%s: unpcb %p has lingering refs", __func__, unp));
|
|
|
|
KASSERT(unp->unp_socket == NULL,
|
|
|
|
("%s: unpcb %p has socket backpointer", __func__, unp));
|
|
|
|
KASSERT(unp->unp_vnode == NULL,
|
|
|
|
("%s: unpcb %p has vnode references", __func__, unp));
|
|
|
|
KASSERT(unp->unp_conn == NULL,
|
|
|
|
("%s: unpcb %p is still connected", __func__, unp));
|
|
|
|
KASSERT(unp->unp_addr == NULL,
|
|
|
|
("%s: unpcb %p has leaked addr", __func__, unp));
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2008-10-03 13:01:56 +00:00
|
|
|
static void
|
1998-05-15 20:11:40 +00:00
|
|
|
unp_init(void)
|
|
|
|
{
|
2020-09-15 19:21:58 +00:00
|
|
|
uma_dtor dtor;
|
2006-07-22 17:24:55 +00:00
|
|
|
|
Change the curvnet variable from a global const struct vnet *,
previously always pointing to the default vnet context, to a
dynamically changing thread-local one. The currvnet context
should be set on entry to networking code via CURVNET_SET() macros,
and reverted to previous state via CURVNET_RESTORE(). Recursions
on curvnet are permitted, though strongly discuouraged.
This change should have no functional impact on nooptions VIMAGE
kernel builds, where CURVNET_* macros expand to whitespace.
The curthread->td_vnet (aka curvnet) variable's purpose is to be an
indicator of the vnet context in which the current network-related
operation takes place, in case we cannot deduce the current vnet
context from any other source, such as by looking at mbuf's
m->m_pkthdr.rcvif->if_vnet, sockets's so->so_vnet etc. Moreover, so
far curvnet has turned out to be an invaluable consistency checking
aid: it helps to catch cases when sockets, ifnets or any other
vnet-aware structures may have leaked from one vnet to another.
The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros
was a result of an empirical iterative process, whith an aim to
reduce recursions on CURVNET_SET() to a minimum, while still reducing
the scope of CURVNET_SET() to networking only operations - the
alternative would be calling CURVNET_SET() on each system call entry.
In general, curvnet has to be set in three typicall cases: when
processing socket-related requests from userspace or from within the
kernel; when processing inbound traffic flowing from device drivers
to upper layers of the networking stack, and when executing
timer-driven networking functions.
This change also introduces a DDB subcommand to show the list of all
vnet instances.
Approved by: julian (mentor)
2009-05-05 10:56:12 +00:00
|
|
|
#ifdef VIMAGE
|
|
|
|
if (!IS_DEFAULT_VNET(curvnet))
|
|
|
|
return;
|
|
|
|
#endif
|
2020-09-15 19:21:58 +00:00
|
|
|
|
|
|
|
#ifdef INVARIANTS
|
|
|
|
dtor = unp_zdtor;
|
|
|
|
#else
|
|
|
|
dtor = NULL;
|
|
|
|
#endif
|
|
|
|
unp_zone = uma_zcreate("unpcb", sizeof(struct unpcb), NULL, dtor,
|
2018-05-17 17:59:35 +00:00
|
|
|
NULL, NULL, UMA_ALIGN_CACHE, 0);
|
2006-04-21 09:25:40 +00:00
|
|
|
uma_zone_set_max(unp_zone, maxsockets);
|
2012-12-07 22:30:30 +00:00
|
|
|
uma_zone_set_warning(unp_zone, "kern.ipc.maxsockets limit reached");
|
2006-04-21 09:25:40 +00:00
|
|
|
EVENTHANDLER_REGISTER(maxsockets_change, unp_zone_change,
|
|
|
|
NULL, EVENTHANDLER_PRI_ANY);
|
1998-05-15 20:11:40 +00:00
|
|
|
LIST_INIT(&unp_dhead);
|
|
|
|
LIST_INIT(&unp_shead);
|
2009-10-05 14:49:16 +00:00
|
|
|
LIST_INIT(&unp_sphead);
|
2010-12-03 16:15:44 +00:00
|
|
|
SLIST_INIT(&unp_defers);
|
2012-11-20 15:45:48 +00:00
|
|
|
TIMEOUT_TASK_INIT(taskqueue_thread, &unp_gc_task, 0, unp_gc, NULL);
|
2010-12-03 16:15:44 +00:00
|
|
|
TASK_INIT(&unp_defer_task, 0, unp_process_defers, NULL);
|
2009-03-08 21:48:29 +00:00
|
|
|
UNP_LINK_LOCK_INIT();
|
2010-12-03 16:15:44 +00:00
|
|
|
UNP_DEFERRED_LOCK_INIT();
|
1998-05-15 20:11:40 +00:00
|
|
|
}
|
|
|
|
|
2019-07-19 20:51:39 +00:00
|
|
|
static void
|
|
|
|
unp_internalize_cleanup_rights(struct mbuf *control)
|
|
|
|
{
|
|
|
|
struct cmsghdr *cp;
|
|
|
|
struct mbuf *m;
|
|
|
|
void *data;
|
|
|
|
socklen_t datalen;
|
|
|
|
|
|
|
|
for (m = control; m != NULL; m = m->m_next) {
|
|
|
|
cp = mtod(m, struct cmsghdr *);
|
|
|
|
if (cp->cmsg_level != SOL_SOCKET ||
|
|
|
|
cp->cmsg_type != SCM_RIGHTS)
|
|
|
|
continue;
|
|
|
|
data = CMSG_DATA(cp);
|
|
|
|
datalen = (caddr_t)cp + cp->cmsg_len - (caddr_t)data;
|
|
|
|
unp_freerights(data, datalen / sizeof(struct filedesc *));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
1995-12-14 09:55:16 +00:00
|
|
|
static int
|
2005-02-20 23:22:13 +00:00
|
|
|
unp_internalize(struct mbuf **controlp, struct thread *td)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2019-07-19 20:51:39 +00:00
|
|
|
struct mbuf *control, **initial_controlp;
|
|
|
|
struct proc *p;
|
|
|
|
struct filedesc *fdesc;
|
2013-02-15 13:00:20 +00:00
|
|
|
struct bintime *bt;
|
2019-07-19 20:51:39 +00:00
|
|
|
struct cmsghdr *cm;
|
2001-10-04 13:11:48 +00:00
|
|
|
struct cmsgcred *cmcred;
|
2013-03-03 23:39:30 +00:00
|
|
|
struct filedescent *fde, **fdep, *fdev;
|
2001-10-04 13:11:48 +00:00
|
|
|
struct file *fp;
|
|
|
|
struct timeval *tv;
|
2017-01-16 17:46:38 +00:00
|
|
|
struct timespec *ts;
|
2001-10-04 13:11:48 +00:00
|
|
|
void *data;
|
2019-07-19 20:51:39 +00:00
|
|
|
socklen_t clen, datalen;
|
2019-07-21 15:07:12 +00:00
|
|
|
int i, j, error, *fdp, oldfds;
|
2000-03-09 15:15:27 +00:00
|
|
|
u_int newlen;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2009-03-08 21:48:29 +00:00
|
|
|
UNP_LINK_UNLOCK_ASSERT();
|
2004-08-19 01:45:16 +00:00
|
|
|
|
2019-07-19 20:51:39 +00:00
|
|
|
p = td->td_proc;
|
|
|
|
fdesc = p->p_fd;
|
2001-10-04 13:11:48 +00:00
|
|
|
error = 0;
|
2019-07-19 20:51:39 +00:00
|
|
|
control = *controlp;
|
|
|
|
clen = control->m_len;
|
2001-10-04 13:11:48 +00:00
|
|
|
*controlp = NULL;
|
2019-07-19 20:51:39 +00:00
|
|
|
initial_controlp = controlp;
|
|
|
|
for (cm = mtod(control, struct cmsghdr *); cm != NULL;) {
|
2001-10-04 13:11:48 +00:00
|
|
|
if (sizeof(*cm) > clen || cm->cmsg_level != SOL_SOCKET
|
2014-06-27 05:04:36 +00:00
|
|
|
|| cm->cmsg_len > clen || cm->cmsg_len < sizeof(*cm)) {
|
2001-10-04 13:11:48 +00:00
|
|
|
error = EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
data = CMSG_DATA(cm);
|
|
|
|
datalen = (caddr_t)cm + cm->cmsg_len - (caddr_t)data;
|
|
|
|
|
|
|
|
switch (cm->cmsg_type) {
|
|
|
|
/*
|
|
|
|
* Fill in credential information.
|
|
|
|
*/
|
|
|
|
case SCM_CREDS:
|
|
|
|
*controlp = sbcreatecontrol(NULL, sizeof(*cmcred),
|
|
|
|
SCM_CREDS, SOL_SOCKET);
|
|
|
|
if (*controlp == NULL) {
|
|
|
|
error = ENOBUFS;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
cmcred = (struct cmsgcred *)
|
|
|
|
CMSG_DATA(mtod(*controlp, struct cmsghdr *));
|
|
|
|
cmcred->cmcred_pid = p->p_pid;
|
2002-02-27 18:32:23 +00:00
|
|
|
cmcred->cmcred_uid = td->td_ucred->cr_ruid;
|
|
|
|
cmcred->cmcred_gid = td->td_ucred->cr_rgid;
|
|
|
|
cmcred->cmcred_euid = td->td_ucred->cr_uid;
|
|
|
|
cmcred->cmcred_ngroups = MIN(td->td_ucred->cr_ngroups,
|
2008-10-03 13:01:56 +00:00
|
|
|
CMGROUP_MAX);
|
2001-10-04 13:11:48 +00:00
|
|
|
for (i = 0; i < cmcred->cmcred_ngroups; i++)
|
|
|
|
cmcred->cmcred_groups[i] =
|
2002-02-27 18:32:23 +00:00
|
|
|
td->td_ucred->cr_groups[i];
|
2001-10-04 13:11:48 +00:00
|
|
|
break;
|
Add support to sendmsg()/recvmsg() for passing credentials between
processes using AF_LOCAL sockets. This hack is going to be used with
Secure RPC to duplicate a feature of STREAMS which has no real counterpart
in sockets (with STREAMS/TLI, you can apparently use t_getinfo() to learn
UID of a local process on the other side of a transport endpoint).
What happens is this: the client sets up a sendmsg() call with ancillary
data using the SCM_CREDS socket-level control message type. It does not
need to fill in the structure. When the kernel notices the data,
unp_internalize() fills in the cmesgcred structure with the sending
process' credentials (UID, EUID, GID, and ancillary groups). This data
is later delivered to the receiving process. The receiver can then
perform the follwing tests:
- Did the client send ancillary data?
o Yes, proceed.
o No, refuse to authenticate the client.
- The the client send data of type SCM_CREDS?
o Yes, proceed.
o No, refuse to authenticate the client.
- Is the cmsgcred structure the right size?
o Yes, proceed.
o No, signal a possible error.
The receiver can now inspect the credential information and use it to
authenticate the client.
1997-03-21 16:12:32 +00:00
|
|
|
|
2001-10-04 13:11:48 +00:00
|
|
|
case SCM_RIGHTS:
|
|
|
|
oldfds = datalen / sizeof (int);
|
2013-06-04 11:19:08 +00:00
|
|
|
if (oldfds == 0)
|
|
|
|
break;
|
2001-10-04 13:11:48 +00:00
|
|
|
/*
|
2006-07-22 17:24:55 +00:00
|
|
|
* Check that all the FDs passed in refer to legal
|
|
|
|
* files. If not, reject the entire operation.
|
2001-10-04 13:11:48 +00:00
|
|
|
*/
|
|
|
|
fdp = data;
|
Merge Capsicum overhaul:
- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.
- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.
- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.
- The cap_getrights(2) syscall is renamed to cap_rights_get(2).
- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.
- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).
- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.
- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.
- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:
CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.
Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).
Added CAP_SYMLINKAT:
- Allow for symlinkat(2).
Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).
Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.
Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.
Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.
Removed CAP_MAPEXEC.
CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.
Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).
Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.
CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).
CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).
Added convinient defines:
#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE
#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)
Added defines for backward API compatibility:
#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)
Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib
2013-03-02 00:53:12 +00:00
|
|
|
FILEDESC_SLOCK(fdesc);
|
2014-07-23 18:04:52 +00:00
|
|
|
for (i = 0; i < oldfds; i++, fdp++) {
|
|
|
|
fp = fget_locked(fdesc, *fdp);
|
|
|
|
if (fp == NULL) {
|
Merge Capsicum overhaul:
- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.
- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.
- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.
- The cap_getrights(2) syscall is renamed to cap_rights_get(2).
- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.
- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).
- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.
- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.
- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:
CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.
Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).
Added CAP_SYMLINKAT:
- Allow for symlinkat(2).
Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).
Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.
Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.
Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.
Removed CAP_MAPEXEC.
CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.
Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).
Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.
CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).
CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).
Added convinient defines:
#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE
#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)
Added defines for backward API compatibility:
#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)
Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib
2013-03-02 00:53:12 +00:00
|
|
|
FILEDESC_SUNLOCK(fdesc);
|
2001-10-04 13:11:48 +00:00
|
|
|
error = EBADF;
|
|
|
|
goto out;
|
|
|
|
}
|
2003-02-15 06:04:55 +00:00
|
|
|
if (!(fp->f_ops->fo_flags & DFLAG_PASSABLE)) {
|
Merge Capsicum overhaul:
- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.
- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.
- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.
- The cap_getrights(2) syscall is renamed to cap_rights_get(2).
- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.
- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).
- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.
- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.
- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:
CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.
Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).
Added CAP_SYMLINKAT:
- Allow for symlinkat(2).
Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).
Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.
Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.
Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.
Removed CAP_MAPEXEC.
CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.
Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).
Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.
CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).
CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).
Added convinient defines:
#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE
#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)
Added defines for backward API compatibility:
#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)
Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib
2013-03-02 00:53:12 +00:00
|
|
|
FILEDESC_SUNLOCK(fdesc);
|
2003-02-15 06:04:55 +00:00
|
|
|
error = EOPNOTSUPP;
|
|
|
|
goto out;
|
|
|
|
}
|
2001-10-04 13:11:48 +00:00
|
|
|
}
|
Replace custom file descriptor array sleep lock constructed using a mutex
and flags with an sxlock. This leads to a significant and measurable
performance improvement as a result of access to shared locking for
frequent lookup operations, reduced general overhead, and reduced overhead
in the event of contention. All of these are imported for threaded
applications where simultaneous access to a shared file descriptor array
occurs frequently. Kris has reported 2x-4x transaction rate improvements
on 8-core MySQL benchmarks; smaller improvements can be expected for many
workloads as a result of reduced overhead.
- Generally eliminate the distinction between "fast" and regular
acquisisition of the filedesc lock; the plan is that they will now all
be fast. Change all locking instances to either shared or exclusive
locks.
- Correct a bug (pointed out by kib) in fdfree() where previously msleep()
was called without the mutex held; sx_sleep() is now always called with
the sxlock held exclusively.
- Universally hold the struct file lock over changes to struct file,
rather than the filedesc lock or no lock. Always update the f_ops
field last. A further memory barrier is required here in the future
(discussed with jhb).
- Improve locking and reference management in linux_at(), which fails to
properly acquire vnode references before using vnode pointers. Annotate
improper use of vn_fullpath(), which will be replaced at a future date.
In fcntl(), we conservatively acquire an exclusive lock, even though in
some cases a shared lock may be sufficient, which should be revisited.
The dropping of the filedesc lock in fdgrowtable() is no longer required
as the sxlock can be held over the sleep operation; we should consider
removing that (pointed out by attilio).
Tested by: kris
Discussed with: jhb, kris, attilio, jeff
2007-04-04 09:11:34 +00:00
|
|
|
|
2001-10-04 13:11:48 +00:00
|
|
|
/*
|
2008-10-03 13:01:56 +00:00
|
|
|
* Now replace the integer FDs with pointers to the
|
Merge Capsicum overhaul:
- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.
- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.
- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.
- The cap_getrights(2) syscall is renamed to cap_rights_get(2).
- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.
- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).
- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.
- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.
- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:
CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.
Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).
Added CAP_SYMLINKAT:
- Allow for symlinkat(2).
Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).
Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.
Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.
Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.
Removed CAP_MAPEXEC.
CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.
Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).
Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.
CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).
CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).
Added convinient defines:
#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE
#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)
Added defines for backward API compatibility:
#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)
Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib
2013-03-02 00:53:12 +00:00
|
|
|
* file structure and capability rights.
|
2001-10-04 13:11:48 +00:00
|
|
|
*/
|
2013-03-03 23:39:30 +00:00
|
|
|
newlen = oldfds * sizeof(fdep[0]);
|
2001-10-04 13:11:48 +00:00
|
|
|
*controlp = sbcreatecontrol(NULL, newlen,
|
|
|
|
SCM_RIGHTS, SOL_SOCKET);
|
|
|
|
if (*controlp == NULL) {
|
Merge Capsicum overhaul:
- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.
- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.
- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.
- The cap_getrights(2) syscall is renamed to cap_rights_get(2).
- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.
- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).
- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.
- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.
- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:
CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.
Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).
Added CAP_SYMLINKAT:
- Allow for symlinkat(2).
Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).
Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.
Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.
Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.
Removed CAP_MAPEXEC.
CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.
Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).
Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.
CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).
CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).
Added convinient defines:
#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE
#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)
Added defines for backward API compatibility:
#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)
Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib
2013-03-02 00:53:12 +00:00
|
|
|
FILEDESC_SUNLOCK(fdesc);
|
2001-10-04 13:11:48 +00:00
|
|
|
error = E2BIG;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
fdp = data;
|
2019-07-21 15:07:12 +00:00
|
|
|
for (i = 0; i < oldfds; i++, fdp++) {
|
|
|
|
if (!fhold(fdesc->fd_ofiles[*fdp].fde_file)) {
|
|
|
|
fdp = data;
|
|
|
|
for (j = 0; j < i; j++, fdp++) {
|
|
|
|
fdrop(fdesc->fd_ofiles[*fdp].
|
|
|
|
fde_file, td);
|
|
|
|
}
|
|
|
|
FILEDESC_SUNLOCK(fdesc);
|
|
|
|
error = EBADF;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
fdp = data;
|
2013-03-03 23:39:30 +00:00
|
|
|
fdep = (struct filedescent **)
|
2001-10-04 13:11:48 +00:00
|
|
|
CMSG_DATA(mtod(*controlp, struct cmsghdr *));
|
2013-03-03 23:39:30 +00:00
|
|
|
fdev = malloc(sizeof(*fdev) * oldfds, M_FILECAPS,
|
|
|
|
M_WAITOK);
|
|
|
|
for (i = 0; i < oldfds; i++, fdev++, fdp++) {
|
Merge Capsicum overhaul:
- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.
- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.
- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.
- The cap_getrights(2) syscall is renamed to cap_rights_get(2).
- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.
- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).
- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.
- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.
- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:
CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.
Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).
Added CAP_SYMLINKAT:
- Allow for symlinkat(2).
Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).
Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.
Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.
Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.
Removed CAP_MAPEXEC.
CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.
Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).
Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.
CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).
CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).
Added convinient defines:
#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE
#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)
Added defines for backward API compatibility:
#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)
Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib
2013-03-02 00:53:12 +00:00
|
|
|
fde = &fdesc->fd_ofiles[*fdp];
|
2013-03-03 23:39:30 +00:00
|
|
|
fdep[i] = fdev;
|
|
|
|
fdep[i]->fde_file = fde->fde_file;
|
|
|
|
filecaps_copy(&fde->fde_caps,
|
2015-09-07 20:02:56 +00:00
|
|
|
&fdep[i]->fde_caps, true);
|
2013-03-03 23:39:30 +00:00
|
|
|
unp_internalize_fp(fdep[i]->fde_file);
|
2001-10-04 13:11:48 +00:00
|
|
|
}
|
Merge Capsicum overhaul:
- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.
- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.
- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.
- The cap_getrights(2) syscall is renamed to cap_rights_get(2).
- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.
- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).
- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.
- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.
- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:
CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.
Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).
Added CAP_SYMLINKAT:
- Allow for symlinkat(2).
Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).
Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.
Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.
Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.
Removed CAP_MAPEXEC.
CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.
Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).
Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.
CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).
CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).
Added convinient defines:
#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE
#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)
Added defines for backward API compatibility:
#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)
Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib
2013-03-02 00:53:12 +00:00
|
|
|
FILEDESC_SUNLOCK(fdesc);
|
2001-10-04 13:11:48 +00:00
|
|
|
break;
|
2000-03-09 15:15:27 +00:00
|
|
|
|
2001-10-04 13:11:48 +00:00
|
|
|
case SCM_TIMESTAMP:
|
|
|
|
*controlp = sbcreatecontrol(NULL, sizeof(*tv),
|
|
|
|
SCM_TIMESTAMP, SOL_SOCKET);
|
|
|
|
if (*controlp == NULL) {
|
|
|
|
error = ENOBUFS;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
tv = (struct timeval *)
|
|
|
|
CMSG_DATA(mtod(*controlp, struct cmsghdr *));
|
|
|
|
microtime(tv);
|
|
|
|
break;
|
|
|
|
|
2013-02-15 13:00:20 +00:00
|
|
|
case SCM_BINTIME:
|
|
|
|
*controlp = sbcreatecontrol(NULL, sizeof(*bt),
|
|
|
|
SCM_BINTIME, SOL_SOCKET);
|
|
|
|
if (*controlp == NULL) {
|
|
|
|
error = ENOBUFS;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
bt = (struct bintime *)
|
|
|
|
CMSG_DATA(mtod(*controlp, struct cmsghdr *));
|
|
|
|
bintime(bt);
|
|
|
|
break;
|
|
|
|
|
2017-01-16 17:46:38 +00:00
|
|
|
case SCM_REALTIME:
|
|
|
|
*controlp = sbcreatecontrol(NULL, sizeof(*ts),
|
|
|
|
SCM_REALTIME, SOL_SOCKET);
|
|
|
|
if (*controlp == NULL) {
|
|
|
|
error = ENOBUFS;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
ts = (struct timespec *)
|
|
|
|
CMSG_DATA(mtod(*controlp, struct cmsghdr *));
|
|
|
|
nanotime(ts);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case SCM_MONOTONIC:
|
|
|
|
*controlp = sbcreatecontrol(NULL, sizeof(*ts),
|
|
|
|
SCM_MONOTONIC, SOL_SOCKET);
|
|
|
|
if (*controlp == NULL) {
|
|
|
|
error = ENOBUFS;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
ts = (struct timespec *)
|
|
|
|
CMSG_DATA(mtod(*controlp, struct cmsghdr *));
|
|
|
|
nanouptime(ts);
|
|
|
|
break;
|
|
|
|
|
2001-10-04 13:11:48 +00:00
|
|
|
default:
|
|
|
|
error = EINVAL;
|
|
|
|
goto out;
|
2000-03-09 15:15:27 +00:00
|
|
|
}
|
2001-10-04 13:11:48 +00:00
|
|
|
|
2019-10-08 23:34:48 +00:00
|
|
|
if (*controlp != NULL)
|
|
|
|
controlp = &(*controlp)->m_next;
|
2001-10-04 13:11:48 +00:00
|
|
|
if (CMSG_SPACE(datalen) < clen) {
|
|
|
|
clen -= CMSG_SPACE(datalen);
|
|
|
|
cm = (struct cmsghdr *)
|
|
|
|
((caddr_t)cm + CMSG_SPACE(datalen));
|
|
|
|
} else {
|
|
|
|
clen = 0;
|
|
|
|
cm = NULL;
|
2000-03-09 15:15:27 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2001-10-04 13:11:48 +00:00
|
|
|
|
|
|
|
out:
|
2019-07-19 20:51:39 +00:00
|
|
|
if (error != 0 && initial_controlp != NULL)
|
|
|
|
unp_internalize_cleanup_rights(*initial_controlp);
|
2001-10-04 13:11:48 +00:00
|
|
|
m_freem(control);
|
|
|
|
return (error);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
2007-02-20 10:50:02 +00:00
|
|
|
static struct mbuf *
|
2020-11-17 20:01:21 +00:00
|
|
|
unp_addsockcred(struct thread *td, struct mbuf *control, int mode)
|
2005-04-13 00:01:46 +00:00
|
|
|
{
|
2006-06-13 14:33:35 +00:00
|
|
|
struct mbuf *m, *n, *n_prev;
|
|
|
|
const struct cmsghdr *cm;
|
2020-11-17 20:01:21 +00:00
|
|
|
int ngroups, i, cmsgtype;
|
|
|
|
size_t ctrlsz;
|
2005-04-13 00:01:46 +00:00
|
|
|
|
|
|
|
ngroups = MIN(td->td_ucred->cr_ngroups, CMGROUP_MAX);
|
2020-11-17 20:01:21 +00:00
|
|
|
if (mode & UNP_WANTCRED_ALWAYS) {
|
|
|
|
ctrlsz = SOCKCRED2SIZE(ngroups);
|
|
|
|
cmsgtype = SCM_CREDS2;
|
|
|
|
} else {
|
|
|
|
ctrlsz = SOCKCREDSIZE(ngroups);
|
|
|
|
cmsgtype = SCM_CREDS;
|
|
|
|
}
|
|
|
|
|
|
|
|
m = sbcreatecontrol(NULL, ctrlsz, cmsgtype, SOL_SOCKET);
|
2005-04-13 00:01:46 +00:00
|
|
|
if (m == NULL)
|
|
|
|
return (control);
|
|
|
|
|
2020-11-17 20:01:21 +00:00
|
|
|
if (mode & UNP_WANTCRED_ALWAYS) {
|
|
|
|
struct sockcred2 *sc;
|
|
|
|
|
|
|
|
sc = (void *)CMSG_DATA(mtod(m, struct cmsghdr *));
|
|
|
|
sc->sc_version = 0;
|
|
|
|
sc->sc_pid = td->td_proc->p_pid;
|
|
|
|
sc->sc_uid = td->td_ucred->cr_ruid;
|
|
|
|
sc->sc_euid = td->td_ucred->cr_uid;
|
|
|
|
sc->sc_gid = td->td_ucred->cr_rgid;
|
|
|
|
sc->sc_egid = td->td_ucred->cr_gid;
|
|
|
|
sc->sc_ngroups = ngroups;
|
|
|
|
for (i = 0; i < sc->sc_ngroups; i++)
|
|
|
|
sc->sc_groups[i] = td->td_ucred->cr_groups[i];
|
|
|
|
} else {
|
|
|
|
struct sockcred *sc;
|
|
|
|
|
|
|
|
sc = (void *)CMSG_DATA(mtod(m, struct cmsghdr *));
|
|
|
|
sc->sc_uid = td->td_ucred->cr_ruid;
|
|
|
|
sc->sc_euid = td->td_ucred->cr_uid;
|
|
|
|
sc->sc_gid = td->td_ucred->cr_rgid;
|
|
|
|
sc->sc_egid = td->td_ucred->cr_gid;
|
|
|
|
sc->sc_ngroups = ngroups;
|
|
|
|
for (i = 0; i < sc->sc_ngroups; i++)
|
|
|
|
sc->sc_groups[i] = td->td_ucred->cr_groups[i];
|
|
|
|
}
|
2005-04-13 00:01:46 +00:00
|
|
|
|
|
|
|
/*
|
2006-07-22 17:24:55 +00:00
|
|
|
* Unlink SCM_CREDS control messages (struct cmsgcred), since just
|
|
|
|
* created SCM_CREDS control message (struct sockcred) has another
|
|
|
|
* format.
|
2005-04-13 00:01:46 +00:00
|
|
|
*/
|
2020-11-17 20:01:21 +00:00
|
|
|
if (control != NULL && cmsgtype == SCM_CREDS)
|
2006-06-13 14:33:35 +00:00
|
|
|
for (n = control, n_prev = NULL; n != NULL;) {
|
|
|
|
cm = mtod(n, struct cmsghdr *);
|
|
|
|
if (cm->cmsg_level == SOL_SOCKET &&
|
|
|
|
cm->cmsg_type == SCM_CREDS) {
|
|
|
|
if (n_prev == NULL)
|
|
|
|
control = n->m_next;
|
|
|
|
else
|
|
|
|
n_prev->m_next = n->m_next;
|
|
|
|
n = m_free(n);
|
|
|
|
} else {
|
|
|
|
n_prev = n;
|
|
|
|
n = n->m_next;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Prepend it to the head. */
|
|
|
|
m->m_next = control;
|
|
|
|
return (m);
|
2005-04-13 00:01:46 +00:00
|
|
|
}
|
|
|
|
|
2007-12-30 01:42:15 +00:00
|
|
|
static struct unpcb *
|
|
|
|
fptounp(struct file *fp)
|
|
|
|
{
|
|
|
|
struct socket *so;
|
|
|
|
|
|
|
|
if (fp->f_type != DTYPE_SOCKET)
|
|
|
|
return (NULL);
|
|
|
|
if ((so = fp->f_data) == NULL)
|
|
|
|
return (NULL);
|
|
|
|
if (so->so_proto->pr_domain != &localdomain)
|
|
|
|
return (NULL);
|
|
|
|
return sotounpcb(so);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
unp_discard(struct file *fp)
|
|
|
|
{
|
2010-12-03 16:15:44 +00:00
|
|
|
struct unp_defer *dr;
|
|
|
|
|
|
|
|
if (unp_externalize_fp(fp)) {
|
|
|
|
dr = malloc(sizeof(*dr), M_TEMP, M_WAITOK);
|
|
|
|
dr->ud_fp = fp;
|
|
|
|
UNP_DEFERRED_LOCK();
|
|
|
|
SLIST_INSERT_HEAD(&unp_defers, dr, ud_link);
|
|
|
|
UNP_DEFERRED_UNLOCK();
|
|
|
|
atomic_add_int(&unp_defers_count, 1);
|
|
|
|
taskqueue_enqueue(taskqueue_thread, &unp_defer_task);
|
|
|
|
} else
|
|
|
|
(void) closef(fp, (struct thread *)NULL);
|
|
|
|
}
|
2007-12-30 01:42:15 +00:00
|
|
|
|
2010-12-03 16:15:44 +00:00
|
|
|
static void
|
|
|
|
unp_process_defers(void *arg __unused, int pending)
|
|
|
|
{
|
|
|
|
struct unp_defer *dr;
|
|
|
|
SLIST_HEAD(, unp_defer) drl;
|
|
|
|
int count;
|
|
|
|
|
|
|
|
SLIST_INIT(&drl);
|
|
|
|
for (;;) {
|
|
|
|
UNP_DEFERRED_LOCK();
|
|
|
|
if (SLIST_FIRST(&unp_defers) == NULL) {
|
|
|
|
UNP_DEFERRED_UNLOCK();
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
SLIST_SWAP(&unp_defers, &drl, unp_defer);
|
|
|
|
UNP_DEFERRED_UNLOCK();
|
|
|
|
count = 0;
|
|
|
|
while ((dr = SLIST_FIRST(&drl)) != NULL) {
|
|
|
|
SLIST_REMOVE_HEAD(&drl, ud_link);
|
|
|
|
closef(dr->ud_fp, NULL);
|
|
|
|
free(dr, M_TEMP);
|
|
|
|
count++;
|
|
|
|
}
|
|
|
|
atomic_add_int(&unp_defers_count, -count);
|
|
|
|
}
|
2007-12-30 01:42:15 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
unp_internalize_fp(struct file *fp)
|
|
|
|
{
|
|
|
|
struct unpcb *unp;
|
|
|
|
|
2009-03-08 21:48:29 +00:00
|
|
|
UNP_LINK_WLOCK();
|
2007-12-30 01:42:15 +00:00
|
|
|
if ((unp = fptounp(fp)) != NULL) {
|
|
|
|
unp->unp_file = fp;
|
|
|
|
unp->unp_msgcount++;
|
|
|
|
}
|
|
|
|
unp_rights++;
|
2009-03-08 21:48:29 +00:00
|
|
|
UNP_LINK_WUNLOCK();
|
2007-12-30 01:42:15 +00:00
|
|
|
}
|
|
|
|
|
2010-12-03 16:15:44 +00:00
|
|
|
static int
|
2007-12-30 01:42:15 +00:00
|
|
|
unp_externalize_fp(struct file *fp)
|
|
|
|
{
|
|
|
|
struct unpcb *unp;
|
2010-12-03 16:15:44 +00:00
|
|
|
int ret;
|
2007-12-30 01:42:15 +00:00
|
|
|
|
2009-03-08 21:48:29 +00:00
|
|
|
UNP_LINK_WLOCK();
|
2010-12-03 16:15:44 +00:00
|
|
|
if ((unp = fptounp(fp)) != NULL) {
|
2007-12-30 01:42:15 +00:00
|
|
|
unp->unp_msgcount--;
|
2010-12-03 16:15:44 +00:00
|
|
|
ret = 1;
|
|
|
|
} else
|
|
|
|
ret = 0;
|
2007-12-30 01:42:15 +00:00
|
|
|
unp_rights--;
|
2009-03-08 21:48:29 +00:00
|
|
|
UNP_LINK_WUNLOCK();
|
2010-12-03 16:15:44 +00:00
|
|
|
return (ret);
|
2007-12-30 01:42:15 +00:00
|
|
|
}
|
|
|
|
|
2004-08-25 21:24:36 +00:00
|
|
|
/*
|
Correct a number of serious and closely related bugs in the UNIX domain
socket file descriptor garbage collection code, which is intended to
detect and clear cycles of orphaned file descriptors that are "in-flight"
in a socket when that socket is closed before they are received. The
algorithm present was both run at poor times (resulting in recursion and
reentrance), and also buggy in the presence of parallelism. In order to
fix these problems, make the following changes:
- When there are in-flight sockets and a UNIX domain socket is destroyed,
asynchronously schedule the garbage collector, rather than running it
synchronously in the current context. This avoids lock order issues
when the garbage collection code reenters the UNIX domain socket code,
avoiding lock order reversals, deadlocks, etc. Run the code
asynchronously in a task queue.
- In the garbage collector, when skipping file descriptors that have
entered a closing state (i.e., have f_count == 0), re-test the FDEFER
flag, and decrement unp_defer. As file descriptors can now transition
to a closed state, while the garbage collector is running, it is no
longer the case that unp_defer will remain an accurate count of
deferred sockets in the mark portion of the GC algorithm. Otherwise,
the garbage collector will loop waiting waiting for unp_defer to reach
zero, which it will never do as it is skipping file descriptors that
were marked in an earlier pass, but now closed.
- Acquire the UNIX domain socket subsystem lock in unp_discard() when
modifying the unp_rights counter, or a read/write race is risked with
other threads also manipulating the counter.
While here:
- Remove #if 0'd code regarding acquiring the socket buffer sleep lock in
the garbage collector, this is not required as we are able to use the
socket buffer receive lock to protect scanning the receive buffer for
in-flight file descriptors on the socket buffer.
- Annotate that the description of the garbage collector implementation
is increasingly inaccurate and needs to be updated.
- Add counters of the number of deferred garbage collections and recycled
file descriptors. This will be removed and is here temporarily for
debugging purposes.
With these changes in place, the unp_passfd regression test now appears
to be passed consistently on UP and SMP systems for extended runs,
whereas before it hung quickly or panicked, depending on which bug was
triggered.
Reported by: Philip Kizer <pckizer at nostrum dot com>
MFC after: 2 weeks
2005-11-10 16:06:04 +00:00
|
|
|
* unp_defer indicates whether additional work has been defered for a future
|
|
|
|
* pass through unp_gc(). It is thread local and does not require explicit
|
|
|
|
* synchronization.
|
2004-08-25 21:24:36 +00:00
|
|
|
*/
|
2007-12-30 01:42:15 +00:00
|
|
|
static int unp_marked;
|
Correct a number of serious and closely related bugs in the UNIX domain
socket file descriptor garbage collection code, which is intended to
detect and clear cycles of orphaned file descriptors that are "in-flight"
in a socket when that socket is closed before they are received. The
algorithm present was both run at poor times (resulting in recursion and
reentrance), and also buggy in the presence of parallelism. In order to
fix these problems, make the following changes:
- When there are in-flight sockets and a UNIX domain socket is destroyed,
asynchronously schedule the garbage collector, rather than running it
synchronously in the current context. This avoids lock order issues
when the garbage collection code reenters the UNIX domain socket code,
avoiding lock order reversals, deadlocks, etc. Run the code
asynchronously in a task queue.
- In the garbage collector, when skipping file descriptors that have
entered a closing state (i.e., have f_count == 0), re-test the FDEFER
flag, and decrement unp_defer. As file descriptors can now transition
to a closed state, while the garbage collector is running, it is no
longer the case that unp_defer will remain an accurate count of
deferred sockets in the mark portion of the GC algorithm. Otherwise,
the garbage collector will loop waiting waiting for unp_defer to reach
zero, which it will never do as it is skipping file descriptors that
were marked in an earlier pass, but now closed.
- Acquire the UNIX domain socket subsystem lock in unp_discard() when
modifying the unp_rights counter, or a read/write race is risked with
other threads also manipulating the counter.
While here:
- Remove #if 0'd code regarding acquiring the socket buffer sleep lock in
the garbage collector, this is not required as we are able to use the
socket buffer receive lock to protect scanning the receive buffer for
in-flight file descriptors on the socket buffer.
- Annotate that the description of the garbage collector implementation
is increasingly inaccurate and needs to be updated.
- Add counters of the number of deferred garbage collections and recycled
file descriptors. This will be removed and is here temporarily for
debugging purposes.
With these changes in place, the unp_passfd regression test now appears
to be passed consistently on UP and SMP systems for extended runs,
whereas before it hung quickly or panicked, depending on which bug was
triggered.
Reported by: Philip Kizer <pckizer at nostrum dot com>
MFC after: 2 weeks
2005-11-10 16:06:04 +00:00
|
|
|
|
2007-12-30 01:42:15 +00:00
|
|
|
static void
|
2020-01-25 08:57:26 +00:00
|
|
|
unp_remove_dead_ref(struct filedescent **fdep, int fdcount)
|
2007-12-30 01:42:15 +00:00
|
|
|
{
|
|
|
|
struct unpcb *unp;
|
2013-03-11 22:59:07 +00:00
|
|
|
struct file *fp;
|
|
|
|
int i;
|
2007-12-30 01:42:15 +00:00
|
|
|
|
2020-01-25 08:57:26 +00:00
|
|
|
/*
|
|
|
|
* This function can only be called from the gc task.
|
|
|
|
*/
|
|
|
|
KASSERT(taskqueue_member(taskqueue_thread, curthread) != 0,
|
|
|
|
("%s: not on gc callout", __func__));
|
|
|
|
UNP_LINK_LOCK_ASSERT();
|
|
|
|
|
2013-03-11 22:59:07 +00:00
|
|
|
for (i = 0; i < fdcount; i++) {
|
|
|
|
fp = fdep[i]->fde_file;
|
|
|
|
if ((unp = fptounp(fp)) == NULL)
|
|
|
|
continue;
|
2020-01-25 08:57:26 +00:00
|
|
|
if ((unp->unp_gcflag & UNPGC_DEAD) == 0)
|
2013-03-11 22:59:07 +00:00
|
|
|
continue;
|
2020-01-25 08:57:26 +00:00
|
|
|
unp->unp_gcrefs--;
|
2013-03-11 22:59:07 +00:00
|
|
|
}
|
2007-12-30 01:42:15 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
2020-01-25 08:57:26 +00:00
|
|
|
unp_restore_undead_ref(struct filedescent **fdep, int fdcount)
|
2007-12-30 01:42:15 +00:00
|
|
|
{
|
2020-01-25 08:57:26 +00:00
|
|
|
struct unpcb *unp;
|
2007-12-30 01:42:15 +00:00
|
|
|
struct file *fp;
|
2020-01-25 08:57:26 +00:00
|
|
|
int i;
|
2008-10-03 09:01:55 +00:00
|
|
|
|
2007-12-30 01:42:15 +00:00
|
|
|
/*
|
2020-01-25 08:57:26 +00:00
|
|
|
* This function can only be called from the gc task.
|
2007-12-30 01:42:15 +00:00
|
|
|
*/
|
2020-01-25 08:57:26 +00:00
|
|
|
KASSERT(taskqueue_member(taskqueue_thread, curthread) != 0,
|
|
|
|
("%s: not on gc callout", __func__));
|
|
|
|
UNP_LINK_LOCK_ASSERT();
|
|
|
|
|
|
|
|
for (i = 0; i < fdcount; i++) {
|
|
|
|
fp = fdep[i]->fde_file;
|
|
|
|
if ((unp = fptounp(fp)) == NULL)
|
|
|
|
continue;
|
|
|
|
if ((unp->unp_gcflag & UNPGC_DEAD) == 0)
|
|
|
|
continue;
|
|
|
|
unp->unp_gcrefs++;
|
|
|
|
unp_marked++;
|
2007-12-30 01:42:15 +00:00
|
|
|
}
|
2020-01-25 08:57:26 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
unp_gc_scan(struct unpcb *unp, void (*op)(struct filedescent **, int))
|
|
|
|
{
|
|
|
|
struct socket *so, *soa;
|
2008-10-03 09:01:55 +00:00
|
|
|
|
2007-12-30 01:42:15 +00:00
|
|
|
so = unp->unp_socket;
|
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
|
|
|
SOCK_LOCK(so);
|
|
|
|
if (SOLISTENING(so)) {
|
|
|
|
/*
|
|
|
|
* Mark all sockets in our accept queue.
|
|
|
|
*/
|
|
|
|
TAILQ_FOREACH(soa, &so->sol_comp, so_list) {
|
|
|
|
if (sotounpcb(soa)->unp_gcflag & UNPGC_IGNORE_RIGHTS)
|
|
|
|
continue;
|
|
|
|
SOCKBUF_LOCK(&soa->so_rcv);
|
2020-01-25 08:57:26 +00:00
|
|
|
unp_scan(soa->so_rcv.sb_mb, op);
|
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
|
|
|
SOCKBUF_UNLOCK(&soa->so_rcv);
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Mark all sockets we reference with RIGHTS.
|
|
|
|
*/
|
|
|
|
if ((unp->unp_gcflag & UNPGC_IGNORE_RIGHTS) == 0) {
|
|
|
|
SOCKBUF_LOCK(&so->so_rcv);
|
2020-01-25 08:57:26 +00:00
|
|
|
unp_scan(so->so_rcv.sb_mb, op);
|
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
|
|
|
SOCKBUF_UNLOCK(&so->so_rcv);
|
|
|
|
}
|
2007-12-30 01:42:15 +00:00
|
|
|
}
|
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
|
|
|
SOCK_UNLOCK(so);
|
2007-12-30 01:42:15 +00:00
|
|
|
}
|
Correct a number of serious and closely related bugs in the UNIX domain
socket file descriptor garbage collection code, which is intended to
detect and clear cycles of orphaned file descriptors that are "in-flight"
in a socket when that socket is closed before they are received. The
algorithm present was both run at poor times (resulting in recursion and
reentrance), and also buggy in the presence of parallelism. In order to
fix these problems, make the following changes:
- When there are in-flight sockets and a UNIX domain socket is destroyed,
asynchronously schedule the garbage collector, rather than running it
synchronously in the current context. This avoids lock order issues
when the garbage collection code reenters the UNIX domain socket code,
avoiding lock order reversals, deadlocks, etc. Run the code
asynchronously in a task queue.
- In the garbage collector, when skipping file descriptors that have
entered a closing state (i.e., have f_count == 0), re-test the FDEFER
flag, and decrement unp_defer. As file descriptors can now transition
to a closed state, while the garbage collector is running, it is no
longer the case that unp_defer will remain an accurate count of
deferred sockets in the mark portion of the GC algorithm. Otherwise,
the garbage collector will loop waiting waiting for unp_defer to reach
zero, which it will never do as it is skipping file descriptors that
were marked in an earlier pass, but now closed.
- Acquire the UNIX domain socket subsystem lock in unp_discard() when
modifying the unp_rights counter, or a read/write race is risked with
other threads also manipulating the counter.
While here:
- Remove #if 0'd code regarding acquiring the socket buffer sleep lock in
the garbage collector, this is not required as we are able to use the
socket buffer receive lock to protect scanning the receive buffer for
in-flight file descriptors on the socket buffer.
- Annotate that the description of the garbage collector implementation
is increasingly inaccurate and needs to be updated.
- Add counters of the number of deferred garbage collections and recycled
file descriptors. This will be removed and is here temporarily for
debugging purposes.
With these changes in place, the unp_passfd regression test now appears
to be passed consistently on UP and SMP systems for extended runs,
whereas before it hung quickly or panicked, depending on which bug was
triggered.
Reported by: Philip Kizer <pckizer at nostrum dot com>
MFC after: 2 weeks
2005-11-10 16:06:04 +00:00
|
|
|
|
|
|
|
static int unp_recycled;
|
2008-07-26 00:55:35 +00:00
|
|
|
SYSCTL_INT(_net_local, OID_AUTO, recycled, CTLFLAG_RD, &unp_recycled, 0,
|
|
|
|
"Number of unreachable sockets claimed by the garbage collector.");
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2007-12-30 01:42:15 +00:00
|
|
|
static int unp_taskcount;
|
2008-07-26 00:55:35 +00:00
|
|
|
SYSCTL_INT(_net_local, OID_AUTO, taskcount, CTLFLAG_RD, &unp_taskcount, 0,
|
|
|
|
"Number of times the garbage collector has run.");
|
2007-12-30 01:42:15 +00:00
|
|
|
|
2020-01-25 08:57:26 +00:00
|
|
|
SYSCTL_UINT(_net_local, OID_AUTO, sockcount, CTLFLAG_RD, &unp_count, 0,
|
|
|
|
"Number of active local sockets.");
|
|
|
|
|
1995-12-14 09:55:16 +00:00
|
|
|
static void
|
Correct a number of serious and closely related bugs in the UNIX domain
socket file descriptor garbage collection code, which is intended to
detect and clear cycles of orphaned file descriptors that are "in-flight"
in a socket when that socket is closed before they are received. The
algorithm present was both run at poor times (resulting in recursion and
reentrance), and also buggy in the presence of parallelism. In order to
fix these problems, make the following changes:
- When there are in-flight sockets and a UNIX domain socket is destroyed,
asynchronously schedule the garbage collector, rather than running it
synchronously in the current context. This avoids lock order issues
when the garbage collection code reenters the UNIX domain socket code,
avoiding lock order reversals, deadlocks, etc. Run the code
asynchronously in a task queue.
- In the garbage collector, when skipping file descriptors that have
entered a closing state (i.e., have f_count == 0), re-test the FDEFER
flag, and decrement unp_defer. As file descriptors can now transition
to a closed state, while the garbage collector is running, it is no
longer the case that unp_defer will remain an accurate count of
deferred sockets in the mark portion of the GC algorithm. Otherwise,
the garbage collector will loop waiting waiting for unp_defer to reach
zero, which it will never do as it is skipping file descriptors that
were marked in an earlier pass, but now closed.
- Acquire the UNIX domain socket subsystem lock in unp_discard() when
modifying the unp_rights counter, or a read/write race is risked with
other threads also manipulating the counter.
While here:
- Remove #if 0'd code regarding acquiring the socket buffer sleep lock in
the garbage collector, this is not required as we are able to use the
socket buffer receive lock to protect scanning the receive buffer for
in-flight file descriptors on the socket buffer.
- Annotate that the description of the garbage collector implementation
is increasingly inaccurate and needs to be updated.
- Add counters of the number of deferred garbage collections and recycled
file descriptors. This will be removed and is here temporarily for
debugging purposes.
With these changes in place, the unp_passfd regression test now appears
to be passed consistently on UP and SMP systems for extended runs,
whereas before it hung quickly or panicked, depending on which bug was
triggered.
Reported by: Philip Kizer <pckizer at nostrum dot com>
MFC after: 2 weeks
2005-11-10 16:06:04 +00:00
|
|
|
unp_gc(__unused void *arg, int pending)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2009-10-05 14:49:16 +00:00
|
|
|
struct unp_head *heads[] = { &unp_dhead, &unp_shead, &unp_sphead,
|
|
|
|
NULL };
|
2007-12-30 01:42:15 +00:00
|
|
|
struct unp_head **head;
|
2020-01-25 08:57:26 +00:00
|
|
|
struct unp_head unp_deadhead; /* List of potentially-dead sockets. */
|
2011-02-01 13:33:49 +00:00
|
|
|
struct file *f, **unref;
|
2020-01-25 08:57:26 +00:00
|
|
|
struct unpcb *unp, *unptmp;
|
|
|
|
int i, total, unp_unreachable;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2020-01-25 08:57:26 +00:00
|
|
|
LIST_INIT(&unp_deadhead);
|
Correct a number of serious and closely related bugs in the UNIX domain
socket file descriptor garbage collection code, which is intended to
detect and clear cycles of orphaned file descriptors that are "in-flight"
in a socket when that socket is closed before they are received. The
algorithm present was both run at poor times (resulting in recursion and
reentrance), and also buggy in the presence of parallelism. In order to
fix these problems, make the following changes:
- When there are in-flight sockets and a UNIX domain socket is destroyed,
asynchronously schedule the garbage collector, rather than running it
synchronously in the current context. This avoids lock order issues
when the garbage collection code reenters the UNIX domain socket code,
avoiding lock order reversals, deadlocks, etc. Run the code
asynchronously in a task queue.
- In the garbage collector, when skipping file descriptors that have
entered a closing state (i.e., have f_count == 0), re-test the FDEFER
flag, and decrement unp_defer. As file descriptors can now transition
to a closed state, while the garbage collector is running, it is no
longer the case that unp_defer will remain an accurate count of
deferred sockets in the mark portion of the GC algorithm. Otherwise,
the garbage collector will loop waiting waiting for unp_defer to reach
zero, which it will never do as it is skipping file descriptors that
were marked in an earlier pass, but now closed.
- Acquire the UNIX domain socket subsystem lock in unp_discard() when
modifying the unp_rights counter, or a read/write race is risked with
other threads also manipulating the counter.
While here:
- Remove #if 0'd code regarding acquiring the socket buffer sleep lock in
the garbage collector, this is not required as we are able to use the
socket buffer receive lock to protect scanning the receive buffer for
in-flight file descriptors on the socket buffer.
- Annotate that the description of the garbage collector implementation
is increasingly inaccurate and needs to be updated.
- Add counters of the number of deferred garbage collections and recycled
file descriptors. This will be removed and is here temporarily for
debugging purposes.
With these changes in place, the unp_passfd regression test now appears
to be passed consistently on UP and SMP systems for extended runs,
whereas before it hung quickly or panicked, depending on which bug was
triggered.
Reported by: Philip Kizer <pckizer at nostrum dot com>
MFC after: 2 weeks
2005-11-10 16:06:04 +00:00
|
|
|
unp_taskcount++;
|
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
|
|
|
UNP_LINK_RLOCK();
|
2007-12-30 01:42:15 +00:00
|
|
|
/*
|
2020-01-25 08:57:26 +00:00
|
|
|
* First determine which sockets may be in cycles.
|
2007-12-30 01:42:15 +00:00
|
|
|
*/
|
2020-01-25 08:57:26 +00:00
|
|
|
unp_unreachable = 0;
|
|
|
|
|
2007-12-30 01:42:15 +00:00
|
|
|
for (head = heads; *head != NULL; head++)
|
2020-01-25 08:57:26 +00:00
|
|
|
LIST_FOREACH(unp, *head, unp_link) {
|
|
|
|
KASSERT((unp->unp_gcflag & ~UNPGC_IGNORE_RIGHTS) == 0,
|
|
|
|
("%s: unp %p has unexpected gc flags 0x%x",
|
|
|
|
__func__, unp, (unsigned int)unp->unp_gcflag));
|
|
|
|
|
|
|
|
f = unp->unp_file;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check for an unreachable socket potentially in a
|
|
|
|
* cycle. It must be in a queue as indicated by
|
|
|
|
* msgcount, and this must equal the file reference
|
|
|
|
* count. Note that when msgcount is 0 the file is
|
|
|
|
* NULL.
|
|
|
|
*/
|
|
|
|
if (f != NULL && unp->unp_msgcount != 0 &&
|
2020-11-05 02:12:33 +00:00
|
|
|
refcount_load(&f->f_count) == unp->unp_msgcount) {
|
2020-01-25 08:57:26 +00:00
|
|
|
LIST_INSERT_HEAD(&unp_deadhead, unp, unp_dead);
|
|
|
|
unp->unp_gcflag |= UNPGC_DEAD;
|
|
|
|
unp->unp_gcrefs = unp->unp_msgcount;
|
|
|
|
unp_unreachable++;
|
|
|
|
}
|
|
|
|
}
|
2008-10-03 09:01:55 +00:00
|
|
|
|
2004-01-11 19:48:19 +00:00
|
|
|
/*
|
2020-01-25 08:57:26 +00:00
|
|
|
* Scan all sockets previously marked as potentially being in a cycle
|
|
|
|
* and remove the references each socket holds on any UNPGC_DEAD
|
|
|
|
* sockets in its queue. After this step, all remaining references on
|
|
|
|
* sockets marked UNPGC_DEAD should not be part of any cycle.
|
|
|
|
*/
|
|
|
|
LIST_FOREACH(unp, &unp_deadhead, unp_dead)
|
|
|
|
unp_gc_scan(unp, unp_remove_dead_ref);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If a socket still has a non-negative refcount, it cannot be in a
|
|
|
|
* cycle. In this case increment refcount of all children iteratively.
|
2007-12-30 01:42:15 +00:00
|
|
|
* Stop the scan once we do a complete loop without discovering
|
|
|
|
* a new reachable socket.
|
1996-12-05 22:41:13 +00:00
|
|
|
*/
|
1994-05-24 10:09:53 +00:00
|
|
|
do {
|
2007-12-30 01:42:15 +00:00
|
|
|
unp_marked = 0;
|
2020-01-25 08:57:26 +00:00
|
|
|
LIST_FOREACH_SAFE(unp, &unp_deadhead, unp_dead, unptmp)
|
|
|
|
if (unp->unp_gcrefs > 0) {
|
|
|
|
unp->unp_gcflag &= ~UNPGC_DEAD;
|
|
|
|
LIST_REMOVE(unp, unp_dead);
|
|
|
|
KASSERT(unp_unreachable > 0,
|
|
|
|
("%s: unp_unreachable underflow.",
|
|
|
|
__func__));
|
|
|
|
unp_unreachable--;
|
|
|
|
unp_gc_scan(unp, unp_restore_undead_ref);
|
|
|
|
}
|
2007-12-30 01:42:15 +00:00
|
|
|
} while (unp_marked);
|
2020-01-25 08:57:26 +00:00
|
|
|
|
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
|
|
|
UNP_LINK_RUNLOCK();
|
2020-01-25 08:57:26 +00:00
|
|
|
|
2007-12-30 01:42:15 +00:00
|
|
|
if (unp_unreachable == 0)
|
|
|
|
return;
|
2008-10-03 09:01:55 +00:00
|
|
|
|
2007-12-30 01:42:15 +00:00
|
|
|
/*
|
2020-01-25 08:57:26 +00:00
|
|
|
* Allocate space for a local array of dead unpcbs.
|
|
|
|
* TODO: can this path be simplified by instead using the local
|
|
|
|
* dead list at unp_deadhead, after taking out references
|
|
|
|
* on the file object and/or unpcb and dropping the link lock?
|
2007-12-30 01:42:15 +00:00
|
|
|
*/
|
|
|
|
unref = malloc(unp_unreachable * sizeof(struct file *),
|
|
|
|
M_TEMP, M_WAITOK);
|
2008-10-03 09:01:55 +00:00
|
|
|
|
2007-12-30 01:42:15 +00:00
|
|
|
/*
|
|
|
|
* Iterate looking for sockets which have been specifically marked
|
2020-01-25 08:57:26 +00:00
|
|
|
* as unreachable and store them locally.
|
2007-12-30 01:42:15 +00:00
|
|
|
*/
|
2011-02-01 13:33:49 +00:00
|
|
|
UNP_LINK_RLOCK();
|
2020-01-25 08:57:26 +00:00
|
|
|
total = 0;
|
|
|
|
LIST_FOREACH(unp, &unp_deadhead, unp_dead) {
|
|
|
|
KASSERT((unp->unp_gcflag & UNPGC_DEAD) != 0,
|
|
|
|
("%s: unp %p not marked UNPGC_DEAD", __func__, unp));
|
|
|
|
unp->unp_gcflag &= ~UNPGC_DEAD;
|
|
|
|
f = unp->unp_file;
|
|
|
|
if (unp->unp_msgcount == 0 || f == NULL ||
|
2020-11-05 02:12:33 +00:00
|
|
|
refcount_load(&f->f_count) != unp->unp_msgcount ||
|
2020-01-25 08:57:26 +00:00
|
|
|
!fhold(f))
|
|
|
|
continue;
|
|
|
|
unref[total++] = f;
|
|
|
|
KASSERT(total <= unp_unreachable,
|
|
|
|
("%s: incorrect unreachable count.", __func__));
|
|
|
|
}
|
2011-02-01 13:33:49 +00:00
|
|
|
UNP_LINK_RUNLOCK();
|
2008-10-03 09:01:55 +00:00
|
|
|
|
2004-01-11 19:48:19 +00:00
|
|
|
/*
|
2007-12-30 01:42:15 +00:00
|
|
|
* Now flush all sockets, free'ing rights. This will free the
|
|
|
|
* struct files associated with these sockets but leave each socket
|
|
|
|
* with one remaining ref.
|
1996-12-05 22:41:13 +00:00
|
|
|
*/
|
2011-02-16 21:29:13 +00:00
|
|
|
for (i = 0; i < total; i++) {
|
|
|
|
struct socket *so;
|
|
|
|
|
|
|
|
so = unref[i]->f_data;
|
|
|
|
CURVNET_SET(so->so_vnet);
|
|
|
|
sorflush(so);
|
|
|
|
CURVNET_RESTORE();
|
|
|
|
}
|
2008-10-03 09:01:55 +00:00
|
|
|
|
2007-12-30 01:42:15 +00:00
|
|
|
/*
|
|
|
|
* And finally release the sockets so they can be reclaimed.
|
|
|
|
*/
|
2011-02-01 13:33:49 +00:00
|
|
|
for (i = 0; i < total; i++)
|
2007-12-30 01:42:15 +00:00
|
|
|
fdrop(unref[i], NULL);
|
2011-02-01 13:33:49 +00:00
|
|
|
unp_recycled += total;
|
2007-12-30 01:42:15 +00:00
|
|
|
free(unref, M_TEMP);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
2008-10-03 13:01:56 +00:00
|
|
|
static void
|
2016-08-31 21:48:22 +00:00
|
|
|
unp_dispose_mbuf(struct mbuf *m)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
1997-02-10 02:22:35 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
if (m)
|
2013-03-11 22:59:07 +00:00
|
|
|
unp_scan(m, unp_freerights);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
2015-07-14 02:00:50 +00:00
|
|
|
/*
|
|
|
|
* Synchronize against unp_gc, which can trip over data as we are freeing it.
|
|
|
|
*/
|
|
|
|
static void
|
2016-08-31 21:48:22 +00:00
|
|
|
unp_dispose(struct socket *so)
|
2015-07-14 02:00:50 +00:00
|
|
|
{
|
|
|
|
struct unpcb *unp;
|
|
|
|
|
|
|
|
unp = sotounpcb(so);
|
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
|
|
|
UNP_LINK_WLOCK();
|
2015-07-14 02:00:50 +00:00
|
|
|
unp->unp_gcflag |= UNPGC_IGNORE_RIGHTS;
|
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
|
|
|
UNP_LINK_WUNLOCK();
|
|
|
|
if (!SOLISTENING(so))
|
|
|
|
unp_dispose_mbuf(so->so_rcv.sb_mb);
|
2015-07-14 02:00:50 +00:00
|
|
|
}
|
|
|
|
|
1995-12-14 09:55:16 +00:00
|
|
|
static void
|
2013-03-11 22:59:07 +00:00
|
|
|
unp_scan(struct mbuf *m0, void (*op)(struct filedescent **, int))
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2001-10-04 13:11:48 +00:00
|
|
|
struct mbuf *m;
|
|
|
|
struct cmsghdr *cm;
|
|
|
|
void *data;
|
|
|
|
socklen_t clen, datalen;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2004-03-30 02:16:25 +00:00
|
|
|
while (m0 != NULL) {
|
2001-10-04 13:11:48 +00:00
|
|
|
for (m = m0; m; m = m->m_next) {
|
2001-10-29 20:04:03 +00:00
|
|
|
if (m->m_type != MT_CONTROL)
|
2001-10-04 13:11:48 +00:00
|
|
|
continue;
|
|
|
|
|
|
|
|
cm = mtod(m, struct cmsghdr *);
|
|
|
|
clen = m->m_len;
|
|
|
|
|
|
|
|
while (cm != NULL) {
|
|
|
|
if (sizeof(*cm) > clen || cm->cmsg_len > clen)
|
|
|
|
break;
|
|
|
|
|
|
|
|
data = CMSG_DATA(cm);
|
|
|
|
datalen = (caddr_t)cm + cm->cmsg_len
|
|
|
|
- (caddr_t)data;
|
|
|
|
|
|
|
|
if (cm->cmsg_level == SOL_SOCKET &&
|
|
|
|
cm->cmsg_type == SCM_RIGHTS) {
|
2013-03-11 22:59:07 +00:00
|
|
|
(*op)(data, datalen /
|
|
|
|
sizeof(struct filedescent *));
|
2001-10-04 13:11:48 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
if (CMSG_SPACE(datalen) < clen) {
|
|
|
|
clen -= CMSG_SPACE(datalen);
|
|
|
|
cm = (struct cmsghdr *)
|
|
|
|
((caddr_t)cm + CMSG_SPACE(datalen));
|
|
|
|
} else {
|
|
|
|
clen = 0;
|
|
|
|
cm = NULL;
|
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2001-10-04 13:11:48 +00:00
|
|
|
}
|
2014-07-17 05:21:16 +00:00
|
|
|
m0 = m0->m_nextpkt;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2012-02-25 10:15:41 +00:00
|
|
|
/*
|
|
|
|
* A helper function called by VFS before socket-type vnode reclamation.
|
|
|
|
* For an active vnode it clears unp_vnode pointer and decrements unp_vnode
|
|
|
|
* use count.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
vfs_unp_reclaim(struct vnode *vp)
|
|
|
|
{
|
|
|
|
struct unpcb *unp;
|
|
|
|
int active;
|
2018-05-17 17:59:35 +00:00
|
|
|
struct mtx *vplock;
|
2012-02-25 10:15:41 +00:00
|
|
|
|
|
|
|
ASSERT_VOP_ELOCKED(vp, "vfs_unp_reclaim");
|
|
|
|
KASSERT(vp->v_type == VSOCK,
|
|
|
|
("vfs_unp_reclaim: vp->v_type != VSOCK"));
|
|
|
|
|
|
|
|
active = 0;
|
2018-05-17 17:59:35 +00:00
|
|
|
vplock = mtx_pool_find(mtxpool_sleep, vp);
|
|
|
|
mtx_lock(vplock);
|
2017-06-02 17:31:25 +00:00
|
|
|
VOP_UNP_CONNECT(vp, &unp);
|
2012-02-25 10:15:41 +00:00
|
|
|
if (unp == NULL)
|
|
|
|
goto done;
|
|
|
|
UNP_PCB_LOCK(unp);
|
2012-02-29 21:38:31 +00:00
|
|
|
if (unp->unp_vnode == vp) {
|
|
|
|
VOP_UNP_DETACH(vp);
|
2012-02-25 10:15:41 +00:00
|
|
|
unp->unp_vnode = NULL;
|
|
|
|
active = 1;
|
|
|
|
}
|
|
|
|
UNP_PCB_UNLOCK(unp);
|
2018-05-17 17:59:35 +00:00
|
|
|
done:
|
|
|
|
mtx_unlock(vplock);
|
2012-02-25 10:15:41 +00:00
|
|
|
if (active)
|
|
|
|
vunref(vp);
|
|
|
|
}
|
|
|
|
|
2007-05-29 12:36:00 +00:00
|
|
|
#ifdef DDB
|
|
|
|
static void
|
|
|
|
db_print_indent(int indent)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < indent; i++)
|
|
|
|
db_printf(" ");
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
db_print_unpflags(int unp_flags)
|
|
|
|
{
|
|
|
|
int comma;
|
|
|
|
|
|
|
|
comma = 0;
|
|
|
|
if (unp_flags & UNP_HAVEPC) {
|
|
|
|
db_printf("%sUNP_HAVEPC", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
2020-11-03 01:17:45 +00:00
|
|
|
if (unp_flags & UNP_WANTCRED_ALWAYS) {
|
|
|
|
db_printf("%sUNP_WANTCRED_ALWAYS", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (unp_flags & UNP_WANTCRED_ONESHOT) {
|
|
|
|
db_printf("%sUNP_WANTCRED_ONESHOT", comma ? ", " : "");
|
2007-05-29 12:36:00 +00:00
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (unp_flags & UNP_CONNWAIT) {
|
|
|
|
db_printf("%sUNP_CONNWAIT", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (unp_flags & UNP_CONNECTING) {
|
|
|
|
db_printf("%sUNP_CONNECTING", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (unp_flags & UNP_BINDING) {
|
|
|
|
db_printf("%sUNP_BINDING", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
db_print_xucred(int indent, struct xucred *xu)
|
|
|
|
{
|
|
|
|
int comma, i;
|
|
|
|
|
|
|
|
db_print_indent(indent);
|
2019-05-30 14:24:26 +00:00
|
|
|
db_printf("cr_version: %u cr_uid: %u cr_pid: %d cr_ngroups: %d\n",
|
|
|
|
xu->cr_version, xu->cr_uid, xu->cr_pid, xu->cr_ngroups);
|
2007-05-29 12:36:00 +00:00
|
|
|
db_print_indent(indent);
|
|
|
|
db_printf("cr_groups: ");
|
|
|
|
comma = 0;
|
|
|
|
for (i = 0; i < xu->cr_ngroups; i++) {
|
|
|
|
db_printf("%s%u", comma ? ", " : "", xu->cr_groups[i]);
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
db_printf("\n");
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
db_print_unprefs(int indent, struct unp_head *uh)
|
|
|
|
{
|
|
|
|
struct unpcb *unp;
|
|
|
|
int counter;
|
|
|
|
|
|
|
|
counter = 0;
|
|
|
|
LIST_FOREACH(unp, uh, unp_reflink) {
|
|
|
|
if (counter % 4 == 0)
|
|
|
|
db_print_indent(indent);
|
|
|
|
db_printf("%p ", unp);
|
|
|
|
if (counter % 4 == 3)
|
|
|
|
db_printf("\n");
|
|
|
|
counter++;
|
|
|
|
}
|
|
|
|
if (counter != 0 && counter % 4 != 0)
|
|
|
|
db_printf("\n");
|
|
|
|
}
|
|
|
|
|
|
|
|
DB_SHOW_COMMAND(unpcb, db_show_unpcb)
|
|
|
|
{
|
|
|
|
struct unpcb *unp;
|
|
|
|
|
|
|
|
if (!have_addr) {
|
|
|
|
db_printf("usage: show unpcb <addr>\n");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
unp = (struct unpcb *)addr;
|
|
|
|
|
|
|
|
db_printf("unp_socket: %p unp_vnode: %p\n", unp->unp_socket,
|
|
|
|
unp->unp_vnode);
|
|
|
|
|
2012-09-27 23:30:49 +00:00
|
|
|
db_printf("unp_ino: %ju unp_conn: %p\n", (uintmax_t)unp->unp_ino,
|
2007-05-29 12:36:00 +00:00
|
|
|
unp->unp_conn);
|
|
|
|
|
|
|
|
db_printf("unp_refs:\n");
|
|
|
|
db_print_unprefs(2, &unp->unp_refs);
|
|
|
|
|
|
|
|
/* XXXRW: Would be nice to print the full address, if any. */
|
|
|
|
db_printf("unp_addr: %p\n", unp->unp_addr);
|
|
|
|
|
Replace 4.4BSD Lite's unix domain socket backpressure hack with a cleaner
mechanism, based on the new SB_STOP sockbuf flag. The old hack dynamically
changed the sending sockbuf's high water mark whenever adding or removing
data from the receiving sockbuf. It worked for stream sockets, but it never
worked for SOCK_SEQPACKET sockets because of their atomic nature. If the
sockbuf was partially full, it might return EMSGSIZE instead of blocking.
The new solution is based on DragonFlyBSD's fix from commit
3a6117bbe0ed6a87605c1e43e12a1438d8844380 on 2008-05-27. It adds an SB_STOP
flag to sockbufs. Whenever uipc_send surpasses the socket's size limit, it
sets SB_STOP on the sending sockbuf. sbspace() will then return 0 for that
sockbuf, causing sosend_generic and friends to block. uipc_rcvd will
likewise clear SB_STOP. There are two fringe benefits: uipc_{send,rcvd} no
longer need to call chgsbsize() on every send and receive because they don't
change the sockbuf's high water mark. Also, uipc_sense no longer needs to
acquire the UIPC linkage lock, because it's simpler to compute the
st_blksizes.
There is one drawback: since sbspace() will only ever return 0 or the
maximum, sosend_generic will allow the sockbuf to exceed its nominal maximum
size by at most one packet of size less than the max. I don't think that's
a serious problem. In fact, I'm not even positive that FreeBSD guarantees a
socket will always stay within its nominal size limit.
sys/sys/sockbuf.h
Add the SB_STOP flag and adjust sbspace()
sys/sys/unpcb.h
Delete the obsolete unp_cc and unp_mbcnt fields from struct unpcb.
sys/kern/uipc_usrreq.c
Adjust uipc_rcvd, uipc_send, and uipc_sense to use the SB_STOP
backpressure mechanism. Removing obsolete unpcb fields from
db_show_unpcb.
tests/sys/kern/unix_seqpacket_test.c
Clear expected failures from ATF.
Obtained from: DragonFly BSD
PR: kern/185812
Reviewed by: silence from freebsd-net@ and rwatson@
MFC after: 3 weeks
Sponsored by: Spectra Logic Corporation
2014-03-13 18:42:12 +00:00
|
|
|
db_printf("unp_gencnt: %llu\n",
|
2007-05-29 12:36:00 +00:00
|
|
|
(unsigned long long)unp->unp_gencnt);
|
|
|
|
|
|
|
|
db_printf("unp_flags: %x (", unp->unp_flags);
|
|
|
|
db_print_unpflags(unp->unp_flags);
|
|
|
|
db_printf(")\n");
|
|
|
|
|
|
|
|
db_printf("unp_peercred:\n");
|
|
|
|
db_print_xucred(2, &unp->unp_peercred);
|
|
|
|
|
|
|
|
db_printf("unp_refcount: %u\n", unp->unp_refcount);
|
|
|
|
}
|
|
|
|
#endif
|