freebsd-dev/share/man/man4/unix.4
Gleb Smirnoff 458f475df8 unix/dgram: smart socket buffers for one-to-many sockets
A one-to-many unix/dgram socket is a socket that has been bound
with bind(2) and can get multiple connections.  A typical example
is /var/run/log bound by syslogd(8) and receiving multiple
connections from libc syslog(3) API.  Until now all of these
connections shared the same receive socket buffer of the bound
socket.  This made the socket vulnerable to an overflow attack.
See 240d5a9b1c for a historical attempt to work around the problem.

This commit creates a per-connection socket buffer for every single
connected socket and eliminates the problem.  The new behavior will
favor infrequent writers over frequent writers.  See the added test case
scenarios and code comments for a more detailed description of the
new behavior.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35303
2022-06-24 09:09:11 -07:00

.\" Copyright (c) 1991, 1993
.\" The Regents of the University of California. All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" @(#)unix.4 8.1 (Berkeley) 6/9/93
.\" $FreeBSD$
.\"
.Dd June 24, 2022
.Dt UNIX 4
.Os
.Sh NAME
.Nm unix
.Nd UNIX-domain protocol family
.Sh SYNOPSIS
.In sys/types.h
.In sys/un.h
.Sh DESCRIPTION
The
.Ux Ns -domain
protocol family is a collection of protocols
that provides local (on-machine) interprocess
communication through the normal
.Xr socket 2
mechanisms.
The
.Ux Ns -domain
family supports the
.Dv SOCK_STREAM ,
.Dv SOCK_SEQPACKET ,
and
.Dv SOCK_DGRAM
socket types and uses
file system pathnames for addressing.
.Sh ADDRESSING
.Ux Ns -domain
addresses are variable-length file system pathnames of
at most 104 characters.
The include file
.In sys/un.h
defines this address:
.Bd -literal -offset indent
struct sockaddr_un {
	u_char	sun_len;
	u_char	sun_family;
	char	sun_path[104];
};
.Ed
.Pp
Binding a name to a
.Ux Ns -domain
socket with
.Xr bind 2
causes a socket file to be created in the file system.
This file is
.Em not
removed when the socket is closed \(em
.Xr unlink 2
must be used to remove the file.
.Pp
The length of a
.Ux Ns -domain
address, required by
.Xr bind 2
and
.Xr connect 2 ,
can be calculated by the macro
.Fn SUN_LEN
defined in
.In sys/un.h .
The
.Va sun_path
field must be terminated by a
.Dv NUL
character to be used with
.Fn SUN_LEN ,
but the terminating
.Dv NUL
is
.Em not
part of the address.
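.Pp
As a minimal sketch of the points above, and not part of the interface itself,
the following C fragment binds a socket using a length computed with
.Fn SUN_LEN .
The helper name
.Fn bind_unix
is hypothetical, and the fallback definition of
.Fn SUN_LEN
is an assumption for systems whose headers do not expose the macro.
.Bd -literal
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>

#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Assumed fallback for systems whose <sys/un.h> does not expose SUN_LEN. */
#ifndef SUN_LEN
#define SUN_LEN(su) (offsetof(struct sockaddr_un, sun_path) + strlen((su)->sun_path))
#endif

/*
 * Hypothetical helper: create a socket of the given type and bind it to
 * path.  Returns the descriptor, or -1 with errno set on failure.
 */
static int
bind_unix(int type, const char *path)
{
	struct sockaddr_un addr;
	size_t len;
	int fd;

	len = strlen(path);
	if (len >= sizeof(addr.sun_path)) {
		errno = ENAMETOOLONG;	/* no room for path plus the NUL */
		return (-1);
	}
	/* Zeroing the structure also NUL-terminates sun_path for SUN_LEN. */
	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	memcpy(addr.sun_path, path, len);

	fd = socket(AF_UNIX, type, 0);
	if (fd == -1)
		return (-1);
	/* bind creates the socket file; it fails if the file already exists. */
	if (bind(fd, (struct sockaddr *)&addr, (socklen_t)SUN_LEN(&addr)) == -1) {
		(void)close(fd);
		return (-1);
	}
	return (fd);
}
```
.Ed
.Pp
A caller would typically
.Xr close 2
the returned descriptor and then
.Xr unlink 2
the path, since closing alone leaves the socket file in place.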
.Pp
The
.Ux Ns -domain
protocol family does not support broadcast addressing or any form
of
.Dq wildcard
matching on incoming messages.
All addresses are absolute or relative pathnames
of other
.Ux Ns -domain
sockets.
Normal file system access-control mechanisms are also
applied when referencing pathnames; e.g., the destination
of a
.Xr connect 2
or
.Xr sendto 2
must be writable.
.Sh CONTROL MESSAGES
The
.Ux Ns -domain
sockets support the communication of
.Ux
file descriptors and process credentials through the use of the
.Va msg_control
field in the
.Fa msg
argument to
.Xr sendmsg 2
and
.Xr recvmsg 2 .
The items to be passed are described using a
.Vt "struct cmsghdr"
that is defined in the include file
.In sys/socket.h .
.Pp
To send file descriptors, the type of the message is
.Dv SCM_RIGHTS ,
and the data portion of the message is an array of integers
representing the file descriptors to be passed.
The number of descriptors being passed is defined
by the length field of the message;
the length field is the sum of the size of the header
plus the size of the array of file descriptors.
.Pp
The received descriptor is a
.Em duplicate
of the sender's descriptor, as if it were created via
.Li dup(fd)
or
.Li fcntl(fd, F_DUPFD_CLOEXEC, 0)
depending on whether
.Dv MSG_CMSG_CLOEXEC
is passed in the
.Xr recvmsg 2
call.
Descriptors that are awaiting delivery, or that are
purposely not received, are automatically closed by the system
when the destination socket is closed.
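.Pp
The descriptor-passing rules above can be sketched in C as follows.
This is an illustration under stated assumptions rather than a reference
implementation: the names
.Fn send_fd ,
.Fn recv_fd
and
.Vt "union fd_ctl"
are hypothetical, exactly one descriptor is passed, and one byte of ordinary
data accompanies the control message.
.Bd -literal
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

#include <string.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_ctl {
	struct cmsghdr	hdr;
	char		buf[CMSG_SPACE(sizeof(int))];
};

/* Hypothetical helper: pass fd over the connected socket sock. */
static int
send_fd(int sock, int fd)
{
	union fd_ctl ctl;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	struct iovec iov;
	char byte = 'F';	/* one byte of ordinary data */

	memset(&ctl, 0, sizeof(ctl));
	memset(&msg, 0, sizeof(msg));
	iov.iov_base = &byte;
	iov.iov_len = sizeof(byte);
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	/* Length is the header plus the descriptor array (one entry here). */
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return (sendmsg(sock, &msg, 0) == 1 ? 0 : -1);
}

/* Hypothetical helper: receive one descriptor, or return -1. */
static int
recv_fd(int sock)
{
	union fd_ctl ctl;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	struct iovec iov;
	char byte;
	int fd;

	memset(&ctl, 0, sizeof(ctl));
	memset(&msg, 0, sizeof(msg));
	iov.iov_base = &byte;
	iov.iov_len = sizeof(byte);
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return (-1);
	cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return (-1);
	/* The value is a new descriptor number in the receiving process. */
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return (fd);
}
```
.Ed
.Pp
Because the received descriptor is a duplicate, the sender may close its own
copy once
.Fn send_fd
has returned successfully.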
.Pp
Credentials of the sending process can be transmitted explicitly using a
control message of type
.Dv SCM_CREDS
with a data portion of type
.Vt "struct cmsgcred" ,
defined in
.In sys/socket.h
as follows:
.Bd -literal
struct cmsgcred {
	pid_t	cmcred_pid;		/* PID of sending process */
	uid_t	cmcred_uid;		/* real UID of sending process */
	uid_t	cmcred_euid;		/* effective UID of sending process */
	gid_t	cmcred_gid;		/* real GID of sending process */
	short	cmcred_ngroups;		/* number of groups */
	gid_t	cmcred_groups[CMGROUP_MAX];	/* groups */
};
.Ed
.Pp
The sender should pass a zeroed buffer which will be filled in by the system.
.Pp
The group list is truncated to at most
.Dv CMGROUP_MAX
GIDs.
.Pp
The process ID
.Fa cmcred_pid
should not be looked up (such as via the
.Dv KERN_PROC_PID
sysctl) for making security decisions.
The sending process could have exited and its process ID already been
reused for a new process.
.Sh SOCKET OPTIONS
.Ux Ns -domain
sockets support a number of socket options at the option level
.Dv SOL_LOCAL ,
which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width ".Dv LOCAL_CREDS_PERSISTENT"
.It Dv LOCAL_CREDS
This option may be enabled on
.Dv SOCK_DGRAM ,
.Dv SOCK_SEQPACKET ,
or a
.Dv SOCK_STREAM
socket.
This option provides a mechanism for the receiver to
receive the credentials of the process calling
.Xr write 2 ,
.Xr send 2 ,
.Xr sendto 2
or
.Xr sendmsg 2
as a
.Xr recvmsg 2
control message.
The
.Va msg_control
field in the
.Vt msghdr
structure points to a buffer that contains a
.Vt cmsghdr
structure followed by a variable length
.Vt sockcred
structure, defined in
.In sys/socket.h
as follows:
.Bd -literal
struct sockcred {
	uid_t	sc_uid;		/* real user id */
	uid_t	sc_euid;	/* effective user id */
	gid_t	sc_gid;		/* real group id */
	gid_t	sc_egid;	/* effective group id */
	int	sc_ngroups;	/* number of supplemental groups */
	gid_t	sc_groups[1];	/* variable length */
};
.Ed
.Pp
The current implementation truncates the group list to at most
.Dv CMGROUP_MAX
groups.
.Pp
The
.Fn SOCKCREDSIZE
macro computes the size of the
.Vt sockcred
structure for a specified number
of groups.
The
.Vt cmsghdr
fields have the following values:
.Bd -literal
cmsg_len = CMSG_LEN(SOCKCREDSIZE(ngroups))
cmsg_level = SOL_SOCKET
cmsg_type = SCM_CREDS
.Ed
.Pp
On
.Dv SOCK_STREAM
and
.Dv SOCK_SEQPACKET
sockets, credentials are passed only on the first read from a socket;
the system then clears the option on the socket.
.Pp
This option and the above explicit
.Vt "struct cmsgcred"
both use the same value
.Dv SCM_CREDS
but incompatible control messages.
If this option is enabled and the sender attached a
.Dv SCM_CREDS
control message with a
.Vt "struct cmsgcred" ,
it will be discarded and a
.Vt "struct sockcred"
will be included.
.Pp
Many setuid programs will
.Xr write 2
data at least partially controlled by the invoker,
such as error messages.
Therefore, a message accompanied by a particular
.Fa sc_euid
value should not be trusted as being from that user.
.It Dv LOCAL_CREDS_PERSISTENT
This option is similar to
.Dv LOCAL_CREDS ,
except that socket credentials are passed on every read from a
.Dv SOCK_STREAM
or
.Dv SOCK_SEQPACKET
socket, instead of just the first read.
Additionally, the
.Va msg_control
field in the
.Vt msghdr
structure points to a buffer that contains a
.Vt cmsghdr
structure followed by a variable length
.Vt sockcred2
structure, defined in
.In sys/socket.h
as follows:
.Bd -literal
struct sockcred2 {
	int	sc_version;	/* version of this structure */
	pid_t	sc_pid;		/* PID of sending process */
	uid_t	sc_uid;		/* real user id */
	uid_t	sc_euid;	/* effective user id */
	gid_t	sc_gid;		/* real group id */
	gid_t	sc_egid;	/* effective group id */
	int	sc_ngroups;	/* number of supplemental groups */
	gid_t	sc_groups[1];	/* variable length */
};
.Ed
.Pp
The current version is zero.
.Pp
The
.Vt cmsghdr
fields have the following values:
.Bd -literal
cmsg_len = CMSG_LEN(SOCKCRED2SIZE(ngroups))
cmsg_level = SOL_SOCKET
cmsg_type = SCM_CREDS2
.Ed
.Pp
The
.Dv LOCAL_CREDS
and
.Dv LOCAL_CREDS_PERSISTENT
options are mutually exclusive.
.It Dv LOCAL_CONNWAIT
Used with
.Dv SOCK_STREAM
sockets, this option causes the
.Xr connect 2
function to block until
.Xr accept 2
has been called on the listening socket.
.It Dv LOCAL_PEERCRED
When requested via
.Xr getsockopt 2
on a
.Dv SOCK_STREAM
or
.Dv SOCK_SEQPACKET
socket, this option returns the credentials of the remote side.
The credentials are returned in a filled-in
.Vt xucred
structure, defined in
.In sys/ucred.h
as follows:
.Bd -literal
struct xucred {
	u_int	cr_version;		/* structure layout version */
	uid_t	cr_uid;			/* effective user id */
	short	cr_ngroups;		/* number of groups */
	gid_t	cr_groups[XU_NGROUPS];	/* groups */
	pid_t	cr_pid;			/* process id of the sending process */
};
.Ed
.Pp
The
.Va cr_version
field should be checked against the
.Dv XUCRED_VERSION
define.
.Pp
The credentials presented to the server (the
.Xr listen 2
caller) are those of the client when it called
.Xr connect 2 ;
the credentials presented to the client (the
.Xr connect 2
caller) are those of the server when it called
.Xr listen 2 .
This mechanism is reliable; there is no way for either party to influence
the credentials presented to its peer except by calling the appropriate
system call (e.g.,
.Xr connect 2
or
.Xr listen 2 )
under different effective credentials.
.Pp
To reliably obtain peer credentials on a
.Dv SOCK_DGRAM
socket, refer to the
.Dv LOCAL_CREDS
socket option.
.El
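.Pp
As a hedged illustration of the
.Dv LOCAL_PEERCRED
description above, the following C sketch fetches the effective user ID of the
peer and checks the structure version.
The helper name
.Fn peer_euid
is hypothetical, and the preprocessor guard is an assumption added so that the
fragment also compiles on systems that lack the option, where it simply fails
with
.Er ENOTSUP .
.Bd -literal
```c
#include <sys/param.h>
#include <sys/socket.h>
#include <sys/un.h>
#ifdef LOCAL_PEERCRED
#include <sys/ucred.h>
#endif

#include <errno.h>

/*
 * Hypothetical helper: store the effective UID of the peer of fd in *euid.
 * Returns 0 on success, or -1 with errno set.
 */
static int
peer_euid(int fd, uid_t *euid)
{
#ifdef LOCAL_PEERCRED
	struct xucred cred;
	socklen_t len = sizeof(cred);

	if (getsockopt(fd, SOL_LOCAL, LOCAL_PEERCRED, &cred, &len) == -1)
		return (-1);
	/* Reject a structure layout this code was not built for. */
	if (cred.cr_version != XUCRED_VERSION) {
		errno = EINVAL;
		return (-1);
	}
	*euid = cred.cr_uid;
	return (0);
#else
	/* Assumed fallback: the option is not available on this system. */
	(void)fd;
	(void)euid;
	errno = ENOTSUP;
	return (-1);
#endif
}
```
.Ed
.Pp
A server would normally call such a helper on a descriptor returned by
.Xr accept 2 ,
before acting on any request read from it.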
.Sh BUFFERING
Due to the local nature of the
.Ux Ns -domain
sockets, they do not implement send buffers.
The
.Xr send 2
and
.Xr write 2
families of system calls attempt to write data to the receive buffer of the
destination socket.
.Pp
The default buffer sizes for
.Dv SOCK_STREAM
and
.Dv SOCK_SEQPACKET
.Ux Ns -domain
sockets can be configured with
.Va net.local.stream
and
.Va net.local.seqpacket
branches of the
.Xr sysctl 3
MIB, respectively.
Note that setting the send buffer size (sendspace) affects only the maximum
write size.
.Pp
The
.Ux Ns -domain
sockets of type
.Dv SOCK_DGRAM
are unreliable and always non-blocking for write operations.
The default receive buffer can be configured with
.Va net.local.dgram.recvspace .
The maximum allowed datagram size is limited by
.Va net.local.dgram.maxdgram .
A
.Dv SOCK_DGRAM
socket that has been bound with
.Xr bind 2
can have multiple peers connected
at the same time.
The modern
.Fx
implementation will allocate
.Va net.local.dgram.recvspace
sized private buffers in the receive buffer of the bound socket for every
connected socket, preventing a single writer from exhausting
all of the buffer space.
Messages coming from unconnected sends using
.Xr sendto 2
land in the shared buffer of the receiving socket, which has the same
size limit.
A side effect of the implementation is that it does not guarantee
that writes from different senders will arrive at the receiver in the same
chronological order they were sent.
The order is preserved for writes coming through a particular connection.
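.Pp
The per-connection ordering described above can be exercised with a small C
sketch.
The names
.Fn make_addr
and
.Fn check_per_connection_order
and the constants
.Dv NSENDERS
and
.Dv NMSGS
are hypothetical and chosen only for illustration.
The check deliberately makes no assumption about how datagrams from different
senders interleave.
.Bd -literal
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>

#include <stddef.h>
#include <string.h>
#include <unistd.h>

#define NSENDERS	2	/* illustrative values, not from this manual */
#define NMSGS		3

/* Fill addr with path; return the address length, or 0 if path is too long. */
static socklen_t
make_addr(struct sockaddr_un *addr, const char *path)
{
	size_t len = strlen(path);

	if (len >= sizeof(addr->sun_path))
		return (0);
	memset(addr, 0, sizeof(*addr));
	addr->sun_family = AF_UNIX;
	memcpy(addr->sun_path, path, len);
	return ((socklen_t)(offsetof(struct sockaddr_un, sun_path) + len));
}

/*
 * Bind a datagram socket at path, connect NSENDERS sockets to it, and let
 * each send NMSGS two-byte datagrams holding {sender, sequence number}.
 * Returns 0 if every sender's datagrams were received in the order that
 * sender wrote them, -1 otherwise.
 */
static int
check_per_connection_order(const char *path)
{
	struct sockaddr_un addr;
	socklen_t alen;
	unsigned char msg[2], next[NSENDERS];
	int rx, tx[NSENDERS], i, s, rv = -1;

	for (s = 0; s < NSENDERS; s++)
		tx[s] = -1;
	if ((alen = make_addr(&addr, path)) == 0)
		return (-1);
	if ((rx = socket(AF_UNIX, SOCK_DGRAM, 0)) == -1)
		return (-1);
	if (bind(rx, (struct sockaddr *)&addr, alen) == -1)
		goto out;
	for (s = 0; s < NSENDERS; s++) {
		tx[s] = socket(AF_UNIX, SOCK_DGRAM, 0);
		if (tx[s] == -1 ||
		    connect(tx[s], (struct sockaddr *)&addr, alen) == -1)
			goto out_unlink;
	}
	/* Interleave the writers so that several connections are active. */
	for (i = 0; i < NMSGS; i++) {
		for (s = 0; s < NSENDERS; s++) {
			msg[0] = (unsigned char)s;
			msg[1] = (unsigned char)i;
			if (send(tx[s], msg, sizeof(msg), 0) !=
			    (ssize_t)sizeof(msg))
				goto out_unlink;
		}
	}
	/* Each sender's sequence numbers must arrive as 0, 1, 2, ... */
	memset(next, 0, sizeof(next));
	for (i = 0; i < NSENDERS * NMSGS; i++) {
		if (recv(rx, msg, sizeof(msg), 0) != (ssize_t)sizeof(msg))
			goto out_unlink;
		if (msg[0] >= NSENDERS || msg[1] != next[msg[0]]++)
			goto out_unlink;
	}
	rv = 0;
out_unlink:
	(void)unlink(path);
out:
	for (s = 0; s < NSENDERS; s++)
		if (tx[s] != -1)
			(void)close(tx[s]);
	(void)close(rx);
	return (rv);
}
```
.Ed
.Pp
The message count is kept small so that every datagram fits in the receive
buffers without any reads in between.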
.Sh SEE ALSO
.Xr connect 2 ,
.Xr dup 2 ,
.Xr fcntl 2 ,
.Xr getsockopt 2 ,
.Xr listen 2 ,
.Xr recvmsg 2 ,
.Xr sendto 2 ,
.Xr setsockopt 2 ,
.Xr socket 2 ,
.Xr CMSG_DATA 3 ,
.Xr intro 4 ,
.Xr sysctl 8
.Rs
.%T "An Introductory 4.3 BSD Interprocess Communication Tutorial"
.%B PS1
.%N 7
.Re
.Rs
.%T "An Advanced 4.3 BSD Interprocess Communication Tutorial"
.%B PS1
.%N 8
.Re