A one-to-many unix/dgram socket is a socket that has been bound
with bind(2) and can get multiple connections. A typical example
is /var/run/log bound by syslogd(8) and receiving multiple
connections from libc syslog(3) API. Until now all of these
connections shared the same receive socket buffer of the bound
socket. This made the socket vulnerable to overflow attack.
See 240d5a9b1c for a historical attempt to workaround the problem.
This commit creates a per-connection socket buffer for every single
connected socket and eliminates the problem. The new behavior will
optimize seldom writers over frequent writers. See added test case
scenarios and code comments for more detailed description of the
new behavior.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35303
o Use m_pkthdr.memlen from m_uiotombuf()
o Modify unp_internalize() to keep track of allocated space and memory
as well as pointer to the last buffer.
o Modify unp_addsockcred() to keep track of allocated space and memory
as well as pointer to the last buffer.
o Record the datagram len/memlen/ctllen in the first (from) mbuf of the
chain in uipc_sosend_dgram() and reuse it in uipc_soreceive_dgram().
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35302
Data allocated by m_uiotombuf() usually goes into a socket buffer.
We are interested in the length of useful data to be added to sb_acc,
as well as total memory used by mbufs. The later would be added to
sb_mbcnt. Calculating this value at allocation time allows to save
on extra traversal of the mbuf chain.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35301
This change fully splits away PF_UNIX/SOCK_DGRAM from other socket
buffer implementations, without any behavior changes.
Generic socket implementation is reduced down to one STAILQ and very
little code.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35300
Use this new version in unix/dgram socket when sending to a target
address. This removes extra lock release/acquisition and possible
counter-intuitive ENOTCONN.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35298
This allows to remove one M_NOWAIT allocation and also makes it
more clear what's going on.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35297
With this second step PF_UNIX/SOCK_DGRAM has protocol specific
implementation. This gives some possibility performance
optimizations. However, it still operates on the same struct
socket as all other sockets do.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35296
Remove the dead code. The new uipc_sosend_dgram() handles send()
on PF_UNIX/SOCK_DGRAM in full.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35294
This is first step towards splitting classic BSD socket
implementation into separate classes. The first to be
split is PF_UNIX/SOCK_DGRAM as it has most differencies
to SOCK_STREAM sockets and to PF_INET sockets.
Historically a protocol shall provide two methods for sendmsg(2):
pru_sosend and pru_send. The former is a generic send method,
e.g. sosend_generic() which would internally call the latter,
uipc_send() in our case. There is one important exception, though,
the sendfile(2) code will call pru_send directly. But sendfile
doesn't work on SOCK_DGRAM, so we can do the trick. We will create
socket class specific uipc_sosend_dgram() which will carry only
important bits from sosend_generic() and uipc_send().
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35293
Make it a complex, but a single for(;;) statement. The previous cycle
with some loop logic in the beginning and some loop logic at the end
was confusing. Both me and markj@ were misleaded to a conclusion that
some checks are unnecessary, while they actually were necessary.
While here, handle an edge case found by Mark, when on 64-bit platform
an incorrect message from userland would underflow length counter, but
return without any error. Provide a test case for such message.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35375
In this function we always work with mbufs that we previously
created ourselves in unp_internalize(). They must be valid.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35319
Now that we call sbcreatecontrol() with M_WAITOK, we are expected to
pass a valid size. Return same error code, we are returning for an
oversized control from sockargs().
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35317
This change is also a preparation for further optimization to
allow locked return on success.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35182
Assert that sockets are of the same type. unp_connectat() already did
this check. Add the check to uipc_connect2().
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35181
Since c67f3b8b78 the sockbuf mutexes belong to the containing socket,
and socket buffers just point to it. In 74a68313b5 macros that access
this mutex directly were added. Go over the core socket code and
eliminate code that reaches the mutex by dereferencing the sockbuf
compatibility pointer.
This change requires a KPI change, as some functions were given the
sockbuf pointer only without any hint if it is a receive or send buffer.
This change doesn't cover the whole kernel, many protocols still use
compatibility pointers internally. However, it allows operation of a
protocol that doesn't use them.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35152
The hack can be tracked down to 4.4BSD, where copy was performed
under splimp() and then after splx() dom_dispose was called.
Stevens has a chapter on this function, but he doesn't answer why
this trick is necessary. Why can't we call into dom_dispose under
splimp()? Anyway, with multithreaded kernel the hack seems to be
necessary to avoid LORs between socket buffer lock and different
filesystem locks, especially network file systems.
The new socket buffers KPI sbcut() from 1d2df300e9 allow us to get
rid of the hack.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35125
sorflush() already did the right thing, so only sofree() needed
a fix. Turn check into assertion in our only dom_dispose method.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35124
This gets rid of the error prone naming where fget_unlocked returns with
a ref held, while fget_locked requires a lock but provides nothing in
terms of making sure the file lives past unlock.
No functional changes.
This is behavior what some programs expect and what Linux does. For
example nginx expects EAGAIN when sending messages to /var/run/log,
which it connects to with O_NONBLOCK. Particularly with nginx the
problem is magnified by the fact that a ENOBUFS on send(2) is also
logged, so situation creates a log-bomb - a failed log message
triggers another log message.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D34187
There left only three modules that used dom_init(). And netipsec
was the last one to use dom_destroy().
Differential revision: https://reviews.freebsd.org/D33540
syslog(3) was recently change to support larger messages, up to 8KB.
Our syslogd handles this fine, as it adjusts /dev/log's recv buffer to a
large size. rsyslog, however, uses the system default of 4KB. This
leads to problems since our syslog(3) retries indefinitely when a send()
returns ENOBUFS, but if the message is large enough this will never
succeed.
Increase the default recv buffer size for datagram sockets to support
8KB syslog messages without requiring the logging daemon to adjust its
buffers.
PR: 260126
Reviewed by: asomers
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33380
We need to load the socket pointer after locking the PCB, otherwise
the socket may have been detached and freed by the time that unp_drop()
sets so_error.
This previously went unnoticed as the socket zone was _NOFREE.
Reported by: pho
MFC after: 1 week
Currently, most protocols implement pru_listen with something like the
following:
SOCK_LOCK(so);
error = solisten_proto_check(so);
if (error) {
SOCK_UNLOCK(so);
return (error);
}
solisten_proto(so);
SOCK_UNLOCK(so);
solisten_proto_check() fails if the socket is connected or connecting.
However, the socket lock is not used during I/O, so this pattern is
racy.
The change modifies solisten_proto_check() to additionally acquire
socket buffer locks, and the calling thread holds them until
solisten_proto() or solisten_proto_abort() is called. Now that the
socket buffer locks are preserved across a listen(2), this change allows
socket I/O paths to properly interlock with listen(2).
This fixes a large number of syzbot reports, only one is listed below
and the rest will be dup'ed to it.
Reported by: syzbot+9fece8a63c0e27273821@syzkaller.appspotmail.com
Reviewed by: tuexen, gallatin
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31659
SO_RERROR indicates that receive buffer overflows should be handled as
errors. Historically receive buffer overflows have been ignored and
programs could not tell if they missed messages or messages had been
truncated because of overflows. Since programs historically do not
expect to get receive overflow errors, this behavior is not the
default.
This is really really important for programs that use route(4) to keep
in sync with the system. If we loose a message then we need to reload
the full system state, otherwise the behaviour from that point is
undefined and can lead to chasing bogus bug reports.
Reviewed by: philip (network), kbowling (transport), gbe (manpages)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D26652
Some code was using it already, but in many places we were testing
SO_ACCEPTCONN directly. As a small step towards fixing some bugs
involving synchronization with listen(2), make the kernel consistently
use SOLISTENING(). No functional change intended.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Capsicum did not prevent certain privileged networking operations,
specifically creation of raw sockets and network configuration ioctls.
However, these facilities can be used to circumvent some of the
restrictions that capability mode is supposed to enforce.
Add capability mode checks to disallow network configuration ioctls and
creation of sockets other than PF_LOCAL and SOCK_DGRAM/STREAM/SEQPACKET
internet sockets.
Reviewed by: oshogbo
Discussed with: emaste
Reported by: manu
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D29423
MSG_CMSG_CLOEXEC has not been working since 2015 (SVN r284380) because
_finstall expects O_CLOEXEC and not UF_EXCLOSE as the flags argument.
This was probably not noticed because we don't have a test for this flag
so this commit adds one. I found this problem because one of the
libwayland tests was failing.
Fixes: ea31808c3b ("fd: move out actual fp installation to _finstall")
MFC after: 3 days
Reviewed By: mjg, kib
Differential Revision: https://reviews.freebsd.org/D29328
The current list is limited to the cases where UFS needs to handle
vput(dvp) specially. Which means VOP_CREATE(), VOP_MKDIR(), VOP_MKNOD(),
VOP_LINK(), and VOP_SYMLINK().
Reviewed by: chs, mkcusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Historically receive buffer overflows have been ignored and programs
could not tell if they missed messages or messages had been truncated
because of overflows. Since programs historically do not expect to get
receive overflow errors, this behavior is not the default.
This is really really important for programs that use route(4) to keep in sync
with the system. If we loose a message then we need to reload the full system
state, otherwise the behaviour from that point is undefined and can lead
to chasing bogus bug reports.
No functional change intended.
Tracking these structures separately for each proc enables future work to
correctly emulate clone(2) in linux(4).
__FreeBSD_version is bumped (to 1300130) for consumption by, e.g., lsof.
Reviewed by: kib
Discussed with: markj, mjg
Differential Revision: https://reviews.freebsd.org/D27037
As this ABI is still fresh (r367287), let's correct some mistakes now:
- Version the structure to allow for future changes
- Include sender's pid in control message structure
- Use a distinct control message type from the cmsgcred / sockcred mess
Discussed with: kib, markj, trasz
Differential Revision: https://reviews.freebsd.org/D27084