Commit Graph

407 Commits

Author SHA1 Message Date
Gleb Smirnoff
458f475df8 unix/dgram: smart socket buffers for one-to-many sockets
A one-to-many unix/dgram socket is a socket that has been bound
with bind(2) and can get multiple connections.  A typical example
is /var/run/log bound by syslogd(8) and receiving multiple
connections from libc syslog(3) API.  Until now all of these
connections shared the same receive socket buffer of the bound
socket.  This made the socket vulnerable to an overflow attack.
See 240d5a9b1c for a historical attempt to work around the problem.

This commit creates a per-connection socket buffer for every single
connected socket and eliminates the problem.  The new behavior will
favor infrequent writers over frequent writers.  See added test case
scenarios and code comments for more detailed description of the
new behavior.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35303
2022-06-24 09:09:11 -07:00
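
The one-to-many arrangement described in 458f475df8 can be reproduced from userland. The following is only a minimal sketch of the socket topology, not of the kernel change itself: one bound SOCK_DGRAM receiver and several connected senders. The helper name one_to_many_demo(), the MAX_SENDERS limit and the socket path are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#define	MAX_SENDERS	16	/* arbitrary limit for this sketch */

/*
 * Hypothetical helper: bind one datagram receiver at `path`, connect
 * `nsenders` datagram sockets to it, send one byte from each, and
 * return how many datagrams the receiver reads back (-1 on setup error).
 */
static int
one_to_many_demo(const char *path, int nsenders)
{
	struct sockaddr_un sun;
	int snd[MAX_SENDERS];
	int rcv, i, sent = 0, got = 0;
	char c;

	if (nsenders < 0 || nsenders > MAX_SENDERS ||
	    strlen(path) >= sizeof(sun.sun_path))
		return (-1);
	memset(&sun, 0, sizeof(sun));
	sun.sun_family = AF_UNIX;
	strcpy(sun.sun_path, path);	/* length checked above */
	(void)unlink(path);

	/* The "one" side: a single socket bound with bind(2). */
	rcv = socket(AF_UNIX, SOCK_DGRAM, 0);
	if (rcv < 0)
		return (-1);
	if (bind(rcv, (struct sockaddr *)&sun, sizeof(sun)) != 0) {
		close(rcv);
		return (-1);
	}

	/* The "many" side: every sender connects to the same address. */
	for (i = 0; i < nsenders; i++) {
		snd[i] = socket(AF_UNIX, SOCK_DGRAM, 0);
		if (snd[i] >= 0 &&
		    connect(snd[i], (struct sockaddr *)&sun, sizeof(sun)) == 0 &&
		    send(snd[i], "x", 1, 0) == 1)
			sent++;
	}

	/* Read only as many datagrams as were sent, so recv() never hangs. */
	while (got < sent && recv(rcv, &c, 1, 0) == 1)
		got++;

	for (i = 0; i < nsenders; i++)
		if (snd[i] >= 0)
			close(snd[i]);
	close(rcv);
	(void)unlink(path);
	return (got);
}
```

Calling one_to_many_demo("/tmp/example.sock", 3) should return 3 on a system where the path is writable. The senders are kept open until the receiver has drained its queue, so the sketch does not depend on what happens to queued data when a sender disconnects.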
Gleb Smirnoff
1093f16487 unix/dgram: reduce mbuf chain traversals in send(2) and recv(2)
o Use m_pkthdr.memlen from m_uiotombuf()
o Modify unp_internalize() to keep track of allocated space and memory
  as well as pointer to the last buffer.
o Modify unp_addsockcred() to keep track of allocated space and memory
  as well as pointer to the last buffer.
o Record the datagram len/memlen/ctllen in the first (from) mbuf of the
  chain in uipc_sosend_dgram() and reuse it in uipc_soreceive_dgram().

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35302
2022-06-24 09:09:11 -07:00
Gleb Smirnoff
9b841b0e23 m_uiotombuf: write total memory length of the allocated chain in pkthdr
Data allocated by m_uiotombuf() usually goes into a socket buffer.
We are interested in the length of useful data to be added to sb_acc,
as well as total memory used by mbufs.  The latter would be added to
sb_mbcnt.  Calculating this value at allocation time saves an
extra traversal of the mbuf chain.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35301
2022-06-24 09:09:11 -07:00
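
The idea in 9b841b0e23, recording both totals while the chain is being built instead of walking it again later, can be illustrated with a simplified userland model. This is a sketch under stated assumptions: struct buf, struct chain, chain_alloc() and chain_free() are hypothetical stand-ins, not the kernel's struct mbuf or m_uiotombuf().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for one buffer in a chain. */
struct buf {
	struct buf *next;
	size_t len;		/* useful bytes stored in this buffer */
	size_t memsize;		/* memory consumed by this buffer */
};

/* Hypothetical chain header carrying both totals. */
struct chain {
	struct buf *head;
	size_t len;		/* sum of useful bytes (cf. sb_acc) */
	size_t memlen;		/* sum of memory used (cf. sb_mbcnt) */
};

static void
chain_free(struct chain *c)
{
	struct buf *b;

	while ((b = c->head) != NULL) {
		c->head = b->next;
		free(b);
	}
	c->len = c->memlen = 0;
}

/*
 * Allocate enough fixed-size buffers to hold `total` bytes and account
 * for both totals during allocation, so no second traversal is needed.
 * Returns 0 on success, -1 on failure.
 */
static int
chain_alloc(struct chain *c, size_t total, size_t bufsize)
{
	struct buf **tail = &c->head;
	struct buf *b;
	size_t n;

	c->head = NULL;
	c->len = c->memlen = 0;
	if (bufsize == 0)
		return (-1);
	while (total > 0) {
		n = total < bufsize ? total : bufsize;
		b = malloc(sizeof(*b) + bufsize);
		if (b == NULL) {
			chain_free(c);
			return (-1);
		}
		b->next = NULL;
		b->len = n;
		b->memsize = sizeof(*b) + bufsize;
		*tail = b;
		tail = &b->next;
		c->len += n;		/* useful data */
		c->memlen += b->memsize;	/* total memory */
		total -= n;
	}
	return (0);
}
```

A consumer that appends the chain to a queue can then add c.len and c.memlen to its counters directly, which is the saving the commit message describes.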
Gleb Smirnoff
a7444f807e unix/dgram: use minimal possible socket buffer for PF_UNIX/SOCK_DGRAM
This change fully splits away PF_UNIX/SOCK_DGRAM from other socket
buffer implementations, without any behavior changes.

Generic socket implementation is reduced down to one STAILQ and very
little code.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35300
2022-06-24 09:09:11 -07:00
Gleb Smirnoff
315167c0de unix: provide an option to return locked from unp_connectat()
Use this new version in unix/dgram socket when sending to a target
address.  This removes extra lock release/acquisition and possible
counter-intuitive ENOTCONN.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35298
2022-06-24 09:09:10 -07:00
Gleb Smirnoff
5dc8dd5f3a unix/dgram: inline sbappendaddr_locked() into uipc_sosend_dgram()
This allows removing one M_NOWAIT allocation and also makes it
clearer what's going on.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35297
2022-06-24 09:09:10 -07:00
Gleb Smirnoff
e3fbbf965e unix/dgram: add a specific receive method - uipc_soreceive_dgram
With this second step PF_UNIX/SOCK_DGRAM has a protocol-specific
implementation.  This opens up some possible performance
optimizations.  However, it still operates on the same struct
socket as all other sockets do.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35296
2022-06-24 09:09:10 -07:00
Gleb Smirnoff
f384a97c83 unix/dgram: cleanup uipc_send of PF_UNIX/SOCK_DGRAM, step 2
Just remove one level of indentation, as the case clause always matches.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35295
2022-06-24 09:09:10 -07:00
Gleb Smirnoff
7e5b6b391e unix/dgram: cleanup uipc_send of PF_UNIX/SOCK_DGRAM, step 1
Remove the dead code.  The new uipc_sosend_dgram() handles send()
on PF_UNIX/SOCK_DGRAM in full.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35294
2022-06-24 09:09:10 -07:00
Gleb Smirnoff
3464958246 unix/dgram: add a specific send method - uipc_sosend_dgram()
This is the first step towards splitting the classic BSD socket
implementation into separate classes.  The first to be
split is PF_UNIX/SOCK_DGRAM, as it has the most differences
from SOCK_STREAM sockets and from PF_INET sockets.

Historically a protocol shall provide two methods for sendmsg(2):
pru_sosend and pru_send.  The former is a generic send method,
e.g. sosend_generic() which would internally call the latter,
uipc_send() in our case.  There is one important exception, though,
the sendfile(2) code will call pru_send directly.  But sendfile
doesn't work on SOCK_DGRAM, so we can do the trick.  We will create
socket class specific uipc_sosend_dgram() which will carry only
important bits from sosend_generic() and uipc_send().

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35293
2022-06-24 09:09:10 -07:00
Gleb Smirnoff
d97922c6c6 unix/*: rewrite unp_internalize() cmsg parsing cycle
Make it a complex, but single, for(;;) statement.  The previous cycle
with some loop logic in the beginning and some loop logic at the end
was confusing.  Both markj@ and I were misled into concluding that
some checks are unnecessary, while they actually were necessary.

While here, handle an edge case found by Mark, where on a 64-bit platform
an incorrect message from userland would underflow the length counter, but
return without any error.  Provide a test case for such a message.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35375
2022-06-06 10:05:28 -07:00
Gleb Smirnoff
2573e6ced9 unix/dgram: rename unpdg_sendspace to unpdg_maxdgram
Matches the meaning of the variable and sysctl node name.
2022-06-03 12:55:44 -07:00
Gleb Smirnoff
d64f2f42c1 unix: unp_externalize() can M_WAITOK
Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35318
2022-05-27 20:48:38 -07:00
Gleb Smirnoff
75e7e3ce34 unix: fix incorrect assertion in 4682ac697c
Pointy hat to:	glebius
Fixes:		4682ac697c
2022-05-26 11:35:05 -07:00
Gleb Smirnoff
4682ac697c unix: turn check in unp_externalize() into assertion
In this function we always work with mbufs that we previously
created ourselves in unp_internalize().  They must be valid.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35319
2022-05-25 13:29:20 -07:00
Gleb Smirnoff
579b45e203 unix/*: check new control size in unp_internalize()
Now that we call sbcreatecontrol() with M_WAITOK, we are expected to
pass a valid size.  Return the same error code that we return for an
oversized control from sockargs().

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35317
2022-05-25 13:29:13 -07:00
Gleb Smirnoff
b46667c63e sockbuf: merge two versions of sbcreatecontrol() into one
No functional change.
2022-05-17 10:10:42 -07:00
Gleb Smirnoff
eac7f0798b unix: garbage collect unp_dispose_mbuf() for brevity
2022-05-17 10:10:41 -07:00
Gleb Smirnoff
2e5bf7c49f unix: fix mbuf leak on close of socket with data
Fixes:	1f32cef471
2022-05-17 10:10:41 -07:00
Gleb Smirnoff
bb35a4e11d unix: microoptimize unp_connectat() - one less lock on success
This change is also a preparation for further optimization to
allow locked return on success.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35182
2022-05-12 13:22:39 -07:00
Gleb Smirnoff
08f17d1432 unix: make unp_connect2() void
Assert that sockets are of the same type.  unp_connectat() already did
this check.  Add the check to uipc_connect2().

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35181
2022-05-12 13:22:39 -07:00
Gleb Smirnoff
4328318445 sockets: use socket buffer mutexes in struct socket directly
Since c67f3b8b78 the sockbuf mutexes belong to the containing socket,
and socket buffers just point to it.  In 74a68313b5 macros that access
this mutex directly were added.  Go over the core socket code and
eliminate code that reaches the mutex by dereferencing the sockbuf
compatibility pointer.

This change requires a KPI change, as some functions were given the
sockbuf pointer only without any hint if it is a receive or send buffer.

This change doesn't cover the whole kernel, many protocols still use
compatibility pointers internally.  However, it allows operation of a
protocol that doesn't use them.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35152
2022-05-12 13:22:12 -07:00
Gleb Smirnoff
01235012e5 unix/dgram: uipc_listen() is specific for SOCK_STREAM and SOCK_SEQPACKET
Rely on pr_usrreqs_init() to init SOCK_DGRAM to pru_listen_notsupp().
2022-05-12 11:04:40 -07:00
Gleb Smirnoff
3c87ba3c3b unix/dgram: pru_rcvd never called since PR_WANTRCVD not set
2022-05-12 11:04:40 -07:00
Gleb Smirnoff
1f32cef471 unix: don't call sbrelease() in uipc_detach()
Since a982ce0442 the socket buffer is already cleared and released in
unp_dispose() that is called just before uipc_detach().
2022-05-12 11:02:50 -07:00
Gleb Smirnoff
a982ce0442 sockets: remove the socket-on-stack hack from sorflush()
The hack can be tracked down to 4.4BSD, where copy was performed
under splimp() and then after splx() dom_dispose was called.
Stevens has a chapter on this function, but he doesn't answer why
this trick is necessary.  Why can't we call into dom_dispose under
splimp()?  Anyway, with multithreaded kernel the hack seems to be
necessary to avoid LORs between socket buffer lock and different
filesystem locks, especially network file systems.

The new socket buffer KPI sbcut() from 1d2df300e9 allows us to get
rid of the hack.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35125
2022-05-09 10:43:01 -07:00
Gleb Smirnoff
42f2fa9953 sockets: don't call dom_dispose() on a listening socket
sorflush() already did the right thing, so only sofree() needed
a fix.  Turn check into assertion in our only dom_dispose method.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35124
2022-05-09 10:42:57 -07:00
Gleb Smirnoff
24df85d29a unix/*: unp_internalize() can sleep, so allocate mbufs with M_WAITOK
2022-05-09 10:42:48 -07:00
Mateusz Guzik
bb92cd7bcd vfs: NDFREE(&nd, NDF_ONLY_PNBUF) -> NDFREE_PNBUF(&nd)
2022-03-24 10:20:51 +00:00
Mateusz Guzik
f17ef28674 fd: rename fget*_locked to fget*_noref
This gets rid of the error prone naming where fget_unlocked returns with
a ref held, while fget_locked requires a lock but provides nothing in
terms of making sure the file lives past unlock.

No functional changes.
2022-02-22 18:53:43 +00:00
Gleb Smirnoff
65572cade3 unix/dgram: return EAGAIN instead of ENOBUFS when O_NONBLOCK set
This is the behavior that some programs expect and what Linux does.  For
example, nginx expects EAGAIN when sending messages to /var/run/log,
which it connects to with O_NONBLOCK.  Particularly with nginx the
problem is magnified by the fact that an ENOBUFS on send(2) is also
logged, so the situation creates a log-bomb: a failed log message
triggers another log message.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D34187
2022-02-14 09:21:55 -08:00
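
From the sender's point of view, the effect of 65572cade3 can be observed by filling a non-blocking unix datagram socket until send(2) fails. The following is a hedged userland sketch: fill_until_error() and is_transient_full() are hypothetical helper names, and a socketpair(2) stands in for a connection to /var/run/log. Kernels with this change (and Linux) are expected to report EAGAIN, while older FreeBSD kernels reported ENOBUFS, so the sketch treats both as "try again later".

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: send datagrams on a non-blocking socket that
 * nobody reads from, and return the errno of the first failed send(2)
 * (0 if no send failed within the iteration limit).
 */
static int
fill_until_error(void)
{
	char msg[1024];
	int sv[2], flags, i, err = 0;

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) != 0)
		return (errno);
	flags = fcntl(sv[0], F_GETFL);
	if (flags == -1 || fcntl(sv[0], F_SETFL, flags | O_NONBLOCK) == -1) {
		err = errno;
		goto out;
	}
	memset(msg, 'x', sizeof(msg));
	for (i = 0; i < 100000; i++) {
		if (send(sv[0], msg, sizeof(msg), 0) < 0) {
			err = errno;
			break;
		}
	}
out:
	close(sv[0]);
	close(sv[1]);
	return (err);
}

/*
 * A sender that wants to avoid the log-bomb described above can treat
 * all of these as "receiver is full, drop or retry later" rather than
 * as an error worth logging.
 */
static int
is_transient_full(int err)
{
	return (err == EAGAIN || err == EWOULDBLOCK || err == ENOBUFS);
}
```

Accepting ENOBUFS alongside EAGAIN keeps such a sender portable across kernels from before and after this commit.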
Gleb Smirnoff
24e1c6ae7d domains: init with standard SYSINIT(9) or VNET_SYSINIT()
Only three modules were left that used dom_init().  And netipsec
was the last one to use dom_destroy().

Differential revision:	https://reviews.freebsd.org/D33540
2022-01-03 10:15:22 -08:00
Mark Johnston
d157f2627b unix: Increase the default datagram recv buffer size
syslog(3) was recently changed to support larger messages, up to 8KB.
Our syslogd handles this fine, as it adjusts /dev/log's recv buffer to a
large size.  rsyslog, however, uses the system default of 4KB.  This
leads to problems since our syslog(3) retries indefinitely when a send()
returns ENOBUFS, but if the message is large enough this will never
succeed.

Increase the default recv buffer size for datagram sockets to support
8KB syslog messages without requiring the logging daemon to adjust its
buffers.

PR:		260126
Reviewed by:	asomers
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33380
2021-12-17 13:09:49 -05:00
Mateusz Guzik
7e1d3eefd4 vfs: remove the unused thread argument from NDINIT*
See b4a58fbf64 ("vfs: remove cn_thread")

Bump __FreeBSD_version to 1400043.
2021-11-25 22:50:42 +00:00
Mark Johnston
42188bb5c1 unix: Remove a write-only local variable
Reported by:	clang
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-11-16 13:30:22 -05:00
Mark Johnston
50b07c1f71 unix: Fix a use-after-free in unp_drop()
We need to load the socket pointer after locking the PCB, otherwise
the socket may have been detached and freed by the time that unp_drop()
sets so_error.

This previously went unnoticed as the socket zone was _NOFREE.

Reported by:	pho
MFC after:	1 week
2021-09-18 10:38:39 -04:00
Mark Johnston
bd4a39cc93 socket: Properly interlock when transitioning to a listening socket
Currently, most protocols implement pru_listen with something like the
following:

	SOCK_LOCK(so);
	error = solisten_proto_check(so);
	if (error) {
		SOCK_UNLOCK(so);
		return (error);
	}
	solisten_proto(so);
	SOCK_UNLOCK(so);

solisten_proto_check() fails if the socket is connected or connecting.
However, the socket lock is not used during I/O, so this pattern is
racy.

The change modifies solisten_proto_check() to additionally acquire
socket buffer locks, and the calling thread holds them until
solisten_proto() or solisten_proto_abort() is called.  Now that the
socket buffer locks are preserved across a listen(2), this change allows
socket I/O paths to properly interlock with listen(2).

This fixes a large number of syzbot reports, only one is listed below
and the rest will be dup'ed to it.

Reported by:	syzbot+9fece8a63c0e27273821@syzkaller.appspotmail.com
Reviewed by:	tuexen, gallatin
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31659
2021-09-07 17:11:43 -04:00
Roy Marples
7045b1603b socket: Implement SO_RERROR
SO_RERROR indicates that receive buffer overflows should be handled as
errors. Historically receive buffer overflows have been ignored and
programs could not tell if they missed messages or messages had been
truncated because of overflows. Since programs historically do not
expect to get receive overflow errors, this behavior is not the
default.

This is really really important for programs that use route(4) to keep
in sync with the system. If we lose a message then we need to reload
the full system state, otherwise the behaviour from that point is
undefined and can lead to chasing bogus bug reports.

Reviewed by:	philip (network), kbowling (transport), gbe (manpages)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D26652
2021-07-28 09:35:09 -07:00
Mark Johnston
f4bb1869dd Consistently use the SOLISTENING() macro
Some code was using it already, but in many places we were testing
SO_ACCEPTCONN directly.  As a small step towards fixing some bugs
involving synchronization with listen(2), make the kernel consistently
use SOLISTENING().  No functional change intended.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-06-14 17:32:27 -04:00
Mark Johnston
274579831b capsicum: Limit socket operations in capability mode
Capsicum did not prevent certain privileged networking operations,
specifically creation of raw sockets and network configuration ioctls.
However, these facilities can be used to circumvent some of the
restrictions that capability mode is supposed to enforce.

Add capability mode checks to disallow network configuration ioctls and
creation of sockets other than PF_LOCAL and SOCK_DGRAM/STREAM/SEQPACKET
internet sockets.

Reviewed by:	oshogbo
Discussed with:	emaste
Reported by:	manu
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D29423
2021-04-07 14:32:56 -04:00
Alex Richardson
6ceacebdf5 Unbreak MSG_CMSG_CLOEXEC
MSG_CMSG_CLOEXEC has not been working since 2015 (SVN r284380) because
_finstall expects O_CLOEXEC and not UF_EXCLOSE as the flags argument.
This was probably not noticed because we don't have a test for this flag
so this commit adds one. I found this problem because one of the
libwayland tests was failing.

Fixes:		ea31808c3b ("fd: move out actual fp installation to _finstall")
MFC after:	3 days
Reviewed By:	mjg, kib
Differential Revision: https://reviews.freebsd.org/D29328
2021-03-18 20:52:20 +00:00
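
The flag fixed by 6ceacebdf5 can be exercised from userland by passing a descriptor over a unix socket and checking the close-on-exec bit on the received copy. This is a minimal sketch, not the test added by the commit: recv_fd_cloexec() is a hypothetical helper, and it passes one end of a socketpair(2) over itself simply because that descriptor is at hand.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: send a descriptor with SCM_RIGHTS, receive it
 * with MSG_CMSG_CLOEXEC, and return 1 if the received descriptor has
 * FD_CLOEXEC set, 0 if it does not, or -1 on any error.
 */
static int
recv_fd_cloexec(void)
{
	union {
		struct cmsghdr hdr;	/* forces correct alignment */
		char buf[CMSG_SPACE(sizeof(int))];
	} ctl;
	struct msghdr msg;
	struct cmsghdr *cm;
	struct iovec iov;
	char byte = 'x';
	int sv[2], fd, flags, newfd = -1, ret = -1;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
		return (-1);

	/* Sender side: one data byte plus an SCM_RIGHTS control message. */
	memset(&msg, 0, sizeof(msg));
	memset(&ctl, 0, sizeof(ctl));
	iov.iov_base = &byte;
	iov.iov_len = 1;
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);
	cm = CMSG_FIRSTHDR(&msg);
	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = SCM_RIGHTS;
	cm->cmsg_len = CMSG_LEN(sizeof(int));
	fd = sv[0];
	memcpy(CMSG_DATA(cm), &fd, sizeof(fd));
	if (sendmsg(sv[0], &msg, 0) != 1)
		goto out;

	/* Receiver side: ask the kernel to set close-on-exec atomically. */
	memset(&msg, 0, sizeof(msg));
	memset(&ctl, 0, sizeof(ctl));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);
	if (recvmsg(sv[1], &msg, MSG_CMSG_CLOEXEC) != 1)
		goto out;
	cm = CMSG_FIRSTHDR(&msg);
	if (cm == NULL || cm->cmsg_level != SOL_SOCKET ||
	    cm->cmsg_type != SCM_RIGHTS)
		goto out;
	memcpy(&newfd, CMSG_DATA(cm), sizeof(newfd));
	flags = fcntl(newfd, F_GETFD);
	if (flags != -1)
		ret = (flags & FD_CLOEXEC) != 0;
out:
	if (newfd >= 0)
		close(newfd);
	close(sv[0]);
	close(sv[1]);
	return (ret);
}
```

On a kernel where the flag works, recv_fd_cloexec() should return 1; a return of 0 would correspond to the breakage described in the commit message.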
Konstantin Belousov
3b2aa36024 Use VOP_VPUT_PAIR() for eligible VFS syscalls.
The current list is limited to the cases where UFS needs to handle
vput(dvp) specially. Which means VOP_CREATE(), VOP_MKDIR(), VOP_MKNOD(),
VOP_LINK(), and VOP_SYMLINK().

Reviewed by:	chs, mkcusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:20 +02:00
Alexander V. Chernikov
924d1c9a05 Revert "SO_RERROR indicates that receive buffer overflows should be handled as errors."
The wrong version of the change was pushed inadvertently.

This reverts commit 4a01b854ca.
2021-02-08 22:32:32 +00:00
Alexander V. Chernikov
4a01b854ca SO_RERROR indicates that receive buffer overflows should be handled as errors.
Historically receive buffer overflows have been ignored and programs
could not tell if they missed messages or messages had been truncated
because of overflows. Since programs historically do not expect to get
receive overflow errors, this behavior is not the default.

This is really really important for programs that use route(4) to keep in sync
with the system. If we lose a message then we need to reload the full system
state, otherwise the behaviour from that point is undefined and can lead
to chasing bogus bug reports.
2021-02-08 21:42:20 +00:00
Mateusz Guzik
4faa375cdd fd: provide a dedicated closef variant for unix socket code
This avoids testing for td != NULL.
2021-01-13 03:27:03 +01:00
Mateusz Guzik
cdb62ab74e vfs: add NDFREE_NOTHING and convert several NDFREE_PNBUF callers
Check the comment above the routine for reasoning.
2021-01-12 13:16:10 +00:00
Mateusz Guzik
6b3a9a0f3d Convert remaining cap_rights_init users to cap_rights_init_one
semantic patch:

@@

expression rights, r;

@@

- cap_rights_init(&rights, r)
+ cap_rights_init_one(&rights, r)
2021-01-12 13:16:10 +00:00
Mateusz Guzik
6404d7ffc1 uipc: disable prediction in unp_pcb_lock_peer
The branch is not very predictable one way or the other, at least during
buildkernel where it only correctly matched 57% of calls.
2020-12-13 21:32:19 +00:00
Conrad Meyer
85078b8573 Split out cwd/root/jail, cmask state from filedesc table
No functional change intended.

Tracking these structures separately for each proc enables future work to
correctly emulate clone(2) in linux(4).

__FreeBSD_version is bumped (to 1300130) for consumption by, e.g., lsof.

Reviewed by:	kib
Discussed with:	markj, mjg
Differential Revision:	https://reviews.freebsd.org/D27037
2020-11-17 21:14:13 +00:00
Conrad Meyer
ede4af47ae unix(4): Enhance LOCAL_CREDS_PERSISTENT ABI
As this ABI is still fresh (r367287), let's correct some mistakes now:

- Version the structure to allow for future changes
- Include sender's pid in control message structure
- Use a distinct control message type from the cmsgcred / sockcred mess

Discussed with:	kib, markj, trasz
Differential Revision:	https://reviews.freebsd.org/D27084
2020-11-17 20:01:21 +00:00