Commit Graph

559 Commits

Author SHA1 Message Date
Eugene Grosbein
4824d78872 listen(2): improve administrator control over logging
As documented in listen.2 manual page, the kernel emits a LOG_DEBUG
syslog message if a socket listen queue overflows. For some appliances,
it may be desirable to change the priority to some higher value
like LOG_INFO while keeping other debugging suppressed.

OTOH there are cases when such overflows are normal and expected.
Then it may be desirable to suppress overflow logging altogether,
so that dmesg buffer is not flooded over long run.

In addition to existing sysctl kern.ipc.sooverinterval,
introduce new sysctl kern.ipc.sooverprio that defaults to 7 (LOG_DEBUG)
to preserve current behavior. It may be changed to any value
in a range of 0..7 for corresponding priority or to -1 to suppress logging.
Document it in the listen.2 manual page.

MFC after:	1 month
2023-05-01 03:26:44 +07:00
Mark Johnston
b4b33821fa ktls: Fix interlocking between ktls_enable_rx() and listen(2)
The TCP_TXTLS_ENABLE and TCP_RXTLS_ENABLE socket option handlers check
whether the socket is listening socket and fail if so, but this check is
racy.  Since we have to lock the socket buffer later anyway, defer the
check to that point.

ktls_enable_tx() locks the send buffer's I/O lock, which will fail if
the socket is a listening socket, so no explicit checks are needed.  In
ktls_enable_rx(), which does not acquire the I/O lock (see the review
for some discussion on this), use an explicit SOLISTENING() check after
locking the recv socket buffer.

Otherwise, a concurrent solisten_proto() call can trigger crashes and
memory leaks by wiping out socket buffers as ktls_enable_*() is
modifying them.

Also make sure that a KTLS-enabled socket can't be converted to a
listening socket, and use SOCK_(SEND|RECV)BUF_LOCK macros instead of the
old ones while here.

Add some simple regression tests involving listen(2).

Reported by:	syzkaller
MFC after:	2 weeks
Reviewed by:	gallatin, glebius, jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D38504
2023-03-21 16:04:00 -04:00
Mark Johnston
636b19ead4 tcp: Disallow re-connection of a connected socket
soconnectat() tries to ensure that one cannot connect a connected
socket.  However, the check is racy and does not really prevent two
threads from attempting to connect the same TCP socket.

Modify tcp_connect() and tcp6_connect() to perform the check again, this
time synchronized by the inpcb lock, under which we call
soisconnecting().

Reported by:	syzkaller
Reviewed by:	glebius
MFC after:	2 weeks
Sponsored by:	Klara, Inc.
Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D38507
2023-02-14 10:07:19 -05:00
Gleb Smirnoff
a0102dee34 sockets: in sousrsend() pass down the error to aio(4)
This somewhat undermines the initial goal of sousrsend() to have all
the special error handling for a write on a socket in a single place.
The aio(4) needs to see EWOULDBLOCK to re-schedule the job.  Because
aio(4) handles return from soreceive() and sousrsend() with the same
code, we can't check for (error == 0 && done < job_nbytes).  Keeping
this exclusion for aio(4) seems a lesser evil.

Fixes:	7a2c93b86e
2023-02-01 13:03:10 -08:00
Gleb Smirnoff
7a2c93b86e sockets: provide sousrsend() that does socket specific error handling
Sockets have special handling for EPIPE on a write, that was spread out
into several places.  Treating transient errors is also special - if
protocol is atomic, than we should ignore any changes to uio_resid, a
transient error means the write had completely failed (see d2b3a0ed31).

- Provide sousrsend() that expects a valid uio, and leave sosend() for
  kernel consumers only.  Do all special error handling right here.
- In dofilewrite() don't do special handling of error for DTYPE_SOCKET.
- For send(2), write(2) and aio_write(2) call into sousrsend() and remove
  error handling for kern_sendit(), soo_write() and soaio_process_job().

PR:			265087
Reported by:            rz-rpi03 at h-ka.de
Reviewed by:            markj
Differential revision:	https://reviews.freebsd.org/D35863
2022-12-14 10:02:44 -08:00
Mateusz Guzik
ebdf27b6f3 uipc: remove accept_mtx
It is unused since 779f106aa1 ("Listening sockets improvements.")

Sponsored by:	Rubicon Communications, LLC ("Netgate")
2022-12-11 02:47:07 +00:00
Michael Tuexen
bc0d407676 Revert "listen(): improve POSIX compliance"
This reverts commit 76e6e4d72f.

Several programs in the tree use -1 instead of INT_MAX to use
the maximum value. Thanks to Eugene Grosbein for pointing this
out.
2022-10-12 04:33:00 +02:00
Michael Tuexen
76e6e4d72f listen(): improve POSIX compliance
Ensure that a negative backlog argument is handled as it if was 0.

Reviewed by:		markj@, glebius@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D31821
2022-10-11 22:46:51 +02:00
Alexander V. Chernikov
f66968564d protocols: make socket buffers ioctl handler changeable
Allow to set custom per-protocol handlers for the socket buffers
 ioctls by introducing pr_setsbopt callback with the default value
 set to the currently-used sbsetopt().

Reviewed by:	glebius
Differential Revision: https://reviews.freebsd.org/D36746
2022-09-28 10:20:09 +00:00
Gleb Smirnoff
e80062a2d4 tcp: avoid call to soisconnected() on transition to ESTABLISHED
This call existed since pre-FreeBSD times, and it is hard to understand
why it was there in the first place.  After 6f3caa6d81 it definitely
became necessary always and commit message from f1ee30ccd6 confirms that.
Now that 6f3caa6d81 is effectively backed out by 07285bb4c2, the call
appears to be useful only for sockets that landed on the incomplete queue,
e.g. sockets that have accept_filter(9) enabled on them.

Provide a new TCP flag to mark connections that are known to be on the
incomplete queue, and call soisconnected() only for those connections.

Reviewed by:		rrs, tuexen
Differential revision:	https://reviews.freebsd.org/D36488
2022-09-08 09:16:04 -07:00
Gleb Smirnoff
24af7808fa protosw: repair protocol selection logic in socket(2)
Pointy hat to:	glebius
Fixes:		61f7427f02
2022-08-30 21:19:46 -07:00
Gleb Smirnoff
61f7427f02 protosw: cleanup protocols that existed merely to provide pr_input
Since 4.4BSD the protosw was used to implement socket types created
by socket(2) syscall and at the same to demultiplex incoming IPv4
datagrams (later copied to IPv6).  This story ended with 78b1fc05b2.

These entries (e.g. IPPROTO_ICMP) in inetsw that were added to catch
packets in ip_input(), they would also be returned by pffindproto()
if user says socket(AF_INET, SOCK_RAW, IPPROTO_ICMP).  Thus, for raw
sockets to work correctly, all the entries were pointing at raw_usrreq
differentiating only in the value of pr_protocol.

With 78b1fc05b2 all these entries are no longer needed, as ip_protox
is independent of protosw.  Any socket syscall requesting SOCK_RAW type
would end up with rip_protosw.  And this protosw has its pr_protocol
set to 0, allowing to mark socket with any protocol.

For IPv6 raw socket the change required two small fixes:
o Validate user provided protocol value
o Always use protocol number stored in inp in rip6_attach, instead
  of protosw value, which is now always 0.

Differential revision:	https://reviews.freebsd.org/D36380
2022-08-30 15:09:21 -07:00
Gleb Smirnoff
8624f4347e divert: declare PF_DIVERT domain and stop abusing PF_INET
The divert(4) is not a protocol of IPv4.  It is a socket to
intercept packets from ipfw(4) to userland and re-inject them
back.  It can divert and re-inject IPv4 and IPv6 packets today,
but potentially it is not limited to these two protocols.  The
IPPROTO_DIVERT does not belong to known IP protocols, it
doesn't even fit into u_char.  I guess, the implementation of
divert(4) was done the way it is done basically because it was
easier to do it this way, back when protocols for sockets were
intertwined with IP protocols and domains were statically
compiled in.

Moving divert(4) out of inetsw accomplished two important things:

1) IPDIVERT is getting much closer to be not dependent on INET.
   This will be finalized in following changes.
2) Now divert socket no longer aliases with raw IPv4 socket.
   Domain/proto selection code won't need a hack for SOCK_RAW and
   multiple entries in inetsw implementing different flavors of
   raw socket can merge into one without requirement of raw IPv4
   being the last member of dom_protosw.

Differential revision:	https://reviews.freebsd.org/D36379
2022-08-30 15:09:21 -07:00
Gleb Smirnoff
e7d02be19d protosw: refactor protosw and domain static declaration and load
o Assert that every protosw has pr_attach.  Now this structure is
  only for socket protocols declarations and nothing else.
o Merge struct pr_usrreqs into struct protosw.  This was suggested
  in 1996 by wollman@ (see 7b187005d1), and later reiterated
  in 2006 by rwatson@ (see 6fbb9cf860).
o Make struct domain hold a variable sized array of protosw pointers.
  For most protocols these pointers are initialized statically.
  Those domains that may have loadable protocols have spacers. IPv4
  and IPv6 have 8 spacers each (andre@ dff3237ee5).
o For inetsw and inet6sw leave a comment noting that many protosw
  entries very likely are dead code.
o Refactor pf_proto_[un]register() into protosw_[un]register().
o Isolate pr_*_notsupp() methods into uipc_domain.c

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D36232
2022-08-17 11:50:32 -07:00
Gleb Smirnoff
f277746e13 protosw: change prototype for pr_control
For some reason protosw.h is used during world complation and userland
is not aware of caddr_t, a relic from the first version of C.  Broken
buildworld is good reason to get rid of yet another caddr_t in kernel.

Fixes:	886fc1e804
2022-08-12 12:08:18 -07:00
Gleb Smirnoff
07285bb4c2 tcp: utilize new solisten_clone() and solisten_enqueue()
This streamlines cloning of a socket from a listener.  Now we do not
drop the inpcb lock during creation of a new socket, do not do useless
state transitions, and put a fully initialized socket+inpcb+tcpcb into
the listen queue.

Before this change, first we would allocate the socket and inpcb+tcpcb via
tcp_usr_attach() as TCPS_CLOSED, link them into global list of pcbs, unlock
pcb and put this onto incomplete queue (see 6f3caa6d81).  Then, after
sonewconn() we would lock it again, transition into TCPS_SYN_RECEIVED,
insert into inpcb hash, finalize initialization of tcpcb.  And then, in
call into tcp_do_segment() and upon transition to TCPS_ESTABLISHED call
soisconnected().  This call would lock the listening socket once again
with a LOR protection sequence and then we would relocate the socket onto
the complete queue and only now it is ready for accept(2).

Reviewed by:		rrs, tuexen
Differential revision:	https://reviews.freebsd.org/D36064
2022-08-10 11:09:34 -07:00
Gleb Smirnoff
8f5a0a2e4f sockets: provide solisten_clone(), solisten_enqueue()
as alternative KPI to sonewconn().  The latter has three stages:
- check the listening socket queue limits
- allocate a new socket
- call into protocol attach method
- link the new socket into the listen queue of the listening socket

The attach method, originally designed for a creation of socket by the
socket(2) syscall has slightly different semantics than attach of a socket
cloned by listener.  Make it possible for protocols to call into the
first stage, then perform a different attach, and then call into the
final stage.  The first stage, that checks limits and clones a socket
is called solisten_clone(), and the function that enqueues the socket
is solisten_enqueue().

Reviewed by:		tuexen
Differential revision:	https://reviews.freebsd.org/D36063
2022-08-10 11:09:34 -07:00
Alexander V. Chernikov
be1f485d7d sockets: add MSG_TRUNC flag handling for recvfrom()/recvmsg().
Implement Linux-variant of MSG_TRUNC input flag used in recv(), recvfrom() and recvmsg().
Posix defines MSG_TRUNC as an output flag, indicating packet/datagram truncation.
Linux extended it a while (~15+ years) ago to act as input flag,
resulting in returning the full packet size regarless of the input
buffer size.
It's a (relatively) popular pattern to do recvmsg( MSG_PEEK | MSG_TRUNC) to get the
packet size, allocate the buffer and issue another call to fetch the packet.
In particular, it's popular in userland netlink code, which is the primary driving factor of this change.

This commit implements the MSG_TRUNC support for SOCK_DGRAM sockets (udp, unix and all soreceive_generic() users).

PR:		kern/176322
Reviewed by:	pauamma(doc)
Differential Revision: https://reviews.freebsd.org/D35909
MFC after:	1 month
2022-07-30 18:21:51 +00:00
Gleb Smirnoff
c261510ef5 sockets: fix setsockopt(SO_RCVTIMEO) on a listening socket
MFC after:	3 weeks
2022-07-08 11:33:24 -07:00
Gleb Smirnoff
d8596171c5 sockets: use only soref()/sorele() as socket reference count
o Retire SS_FDREF as it is basically a debug flag on top of already
  existing soref()/sorele().
o Convert SS_PROTOREF into soref()/sorele().
o Change reference model for the listen queues, see below.
o Make sofree() private.  The correct KPI to use is only sorele().
o Make soabort() respect the model and sorele() instead of sofree().

Note on listening queues.  Until now the sockets on a queue had zero
reference count.  And the reference were given only upon accept(2).  The
assumption was that there is no way to see the queued socket from anywhere
except its head.  This is not true, since queued sockets already have pcbs,
which are linked at least into the global pcb lists.  With this change we
put the reference right in the sonewconn() and on accept(2) path we just
hand the existing reference to the file descriptor.

Differential revision:	https://reviews.freebsd.org/D35679
2022-07-04 12:40:51 -07:00
Gleb Smirnoff
bc7605647c sockets: use positive flag for file descriptor socket reference
Rename SS_NOFDREF to SS_FDREF and flip all bitwise operations.
Mark sockets created by socreate() with SS_FDREF.

This change is mostly illustrative. With it we see that SS_FDREF
is a debugging flag, since:
* socreate() takes a reference with soref().
* on accept path solisten_dequeue() takes a reference
  with soref() and then soaccept() sets SS_FDREF.
* soclose() checks SS_FDREF, removes it and does sorele().

Reviewed by:		tuexen
Differential revision:	https://reviews.freebsd.org/D35678
2022-07-04 12:40:51 -07:00
Gleb Smirnoff
66c8e3fccf socket: fix listen(2) on an already listening socket
Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35669
Fixes:			141fe2dcee
2022-06-30 07:50:29 -07:00
Gleb Smirnoff
a4fc41423f sockets: enable protocol specific socket buffers
Split struct sockbuf into common shared fields and protocol specific
union, where protocols are free to implement whatever buffer they
want.  Such protocols should mark themselves with PR_SOCKBUF and are
expected to initialize their buffers in their pr_attach and tear
them down in pr_detach.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35299
2022-06-24 09:09:10 -07:00
Mark Johnston
f6379f7fde socket: Fix a race between kevent(2) and listen(2)
When locking the knote list for a socket, we check whether the socket is
a listening socket in order to select the appropriate mutex; a listening
socket uses the socket lock, while data sockets use socket buffer
mutexes.

If SOLISTENING(so) is false and the knote lock routine locks a socket
buffer, then it must re-check whether the socket is a listening socket
since solisten_proto() could have changed the socket's identity while we
were blocked on the socket buffer lock.

Reported by:	syzkaller
Reviewed by:	glebius
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35492
2022-06-16 10:20:04 -04:00
Gleb Smirnoff
a8e286bb5d sockets: use socket buffer mutexes in struct socket directly
Convert more generic socket code to not use sockbuf compat pointer.
Continuation of 4328318445.
2022-06-03 12:55:44 -07:00
Rick Macklem
373511338d uipc_socket.c: Modify MSG_TLSAPPDATA to only do Alert Records
Without this patch, the MSG_TLSAPPDATA flag would cause
soreceive_generic() to return ENXIO for any non-application
data record in a TLS receive stream.

This works ok for TLS1.2, since Alert records appear to be
the only non-application data records received.
However, for TLS1.3, there can be post-handshake handshake
records, such as NewSessionKey sent to the client from the
server. These handshake records cannot be handled by the
upcall which does an SSL_read() with length == 0.

It appears that the client can simply throw away these
NewSessionKey records, but to do so, it needs to receive
them within the kernel.

This patch modifies the semantics of MSG_TLSAPPDATA slightly,
so that it only applies to Alert records and not Handshake
records. It is needed to allow the krpc to work with KTLS1.3.

Reviewed by:	hselasky
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D35170
2022-05-14 12:56:50 -07:00
Gleb Smirnoff
4328318445 sockets: use socket buffer mutexes in struct socket directly
Since c67f3b8b78 the sockbuf mutexes belong to the containing socket,
and socket buffers just point to it.  In 74a68313b5 macros that access
this mutex directly were added.  Go over the core socket code and
eliminate code that reaches the mutex by dereferencing the sockbuf
compatibility pointer.

This change requires a KPI change, as some functions were given the
sockbuf pointer only without any hint if it is a receive or send buffer.

This change doesn't cover the whole kernel, many protocols still use
compatibility pointers internally.  However, it allows operation of a
protocol that doesn't use them.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35152
2022-05-12 13:22:12 -07:00
Gleb Smirnoff
2e4e5ee23f sockets: delete stale comment from sofree()
First  paragraph refers to old past "we used to" and is no longer
important today.  Second paragraph has just a wrong statement that
socket buffer is destroyed before pru_detach.
2022-05-12 11:02:50 -07:00
Gleb Smirnoff
a982ce0442 sockets: remove the socket-on-stack hack from sorflush()
The hack can be tracked down to 4.4BSD, where copy was performed
under splimp() and then after splx() dom_dispose was called.
Stevens has a chapter on this function, but he doesn't answer why
this trick is necessary.  Why can't we call into dom_dispose under
splimp()?  Anyway, with multithreaded kernel the hack seems to be
necessary to avoid LORs between socket buffer lock and different
filesystem locks, especially network file systems.

The new socket buffers KPI sbcut() from 1d2df300e9 allow us to get
rid of the hack.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35125
2022-05-09 10:43:01 -07:00
Gleb Smirnoff
42f2fa9953 sockets: don't call dom_dispose() on a listening socket
sorflush() already did the right thing, so only sofree() needed
a fix.  Turn check into assertion in our only dom_dispose method.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35124
2022-05-09 10:42:57 -07:00
Gleb Smirnoff
c17418a0ba sockets: assert that any protocol with PR_RIGHTS has dom_dispose()
Through the entire history only PF_UNIX has this feature.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35123
2022-05-09 10:42:48 -07:00
Gleb Smirnoff
97f8198e95 sockets: make SO_SND/SO_RCV a enum
Not a functional change now. The enum will also be used for other socket
buffer related KPIs.
2022-05-09 10:42:47 -07:00
Alexander Leidinger
aeb91e95cf Log euid, rgid and jail on listen queue overflow
If you have numerous jails with multiple similar services running,
this helps to narrow down which services this log is referring to.
2022-03-26 11:17:55 +01:00
Mark Johnston
5de79eeddb ktls: Disallow transmitting empty frames outside of TLS 1.0/CBC mode
There was nothing preventing one from sending an empty fragment on an
arbitrary KTLS TX-enabled socket, but ktls_frame() asserts that this
could not happen.  Though the transmit path handles this case for TLS
1.0 with AES-CBC, we should be strict and allow empty fragments only in
modes where it is explicitly allowed.

Modify sosend_generic() to reject writes to a KTLS-enabled socket if the
number of data bytes is zero, so that userspace cannot trigger the
aforementioned assertion.

Add regression tests to exercise this case.

Reported by:	syzkaller
Reviewed by:	gallatin, jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D34195
2022-02-08 12:40:41 -05:00
Alexander Motin
fe27f1db5f kern: Remove CTLFLAG_NEEDGIANT from some sysctls.
MFC after:	2 weeks
2021-12-26 12:03:33 -05:00
Hans Petter Selasky
f9978339d1 Remove dead code.
The variable orig_resid is always set to zero right after the while loop
where it is cleared.

Reviewed by:	gallatin@ and glebius@
Differential Revision:	https://reviews.freebsd.org/D33589
MFC after:	1 week
Sponsored by:	NVIDIA Networking
2021-12-21 18:35:03 +01:00
John Baldwin
e3ba94d4f3 Don't require the socket lock for sorele().
Previously, sorele() always required the socket lock and dropped the
lock if the released reference was not the last reference.  Many
callers locked the socket lock just before calling sorele() resulting
in a wasted lock/unlock when not dropping the last reference.

Move the previous implementation of sorele() into a new
sorele_locked() function and use it instead of sorele() for various
places in uipc_socket.c that called sorele() while already holding the
socket lock.

The sorele() macro now uses refcount_release_if_not_last() try to drop
the socket reference without locking the socket.  If that shortcut
fails, it locks the socket and calls sorele_locked().

Reviewed by:	kib, markj
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D32741
2021-11-09 10:50:12 -08:00
Allan Jude
c441592a0e Allow kern.ipc.maxsockets to be set to current value without error
Normally setting kern.ipc.maxsockets returns EINVAL if the new value
is not greater than the previous value. This can cause spurious
error messages when sysctl.conf is processed multiple times, or when
automation systems try to ensure the sysctl is set to the correct
value. If the value is unchanged, then just do nothing.

PR:	243532
Reviewed by:	markj
MFC after:	3 days
Sponsored by:	Modirum MDPay
Sponsored by:	Klara Inc.
Differential Revision:	https://reviews.freebsd.org/D32775
2021-11-04 12:56:09 +00:00
Gleb Smirnoff
a37e4fd1ea Re-style dfcef87714 to keep the code and variables related to
listening sockets separated from code for generic sockets.

No objection:	markj
2021-10-01 13:38:24 -07:00
Mark Johnston
ade1daa5c0 socket: Synchronize soshutdown() with listen(2) and AIO
To handle shutdown(SHUT_RD) we flush the receive buffer of the socket.
This may involve searching for control messages of type SCM_RIGHTS,
since we need to close the file references.  Closing arbitrary files
with socket buffer locks held is undesirable, mainly due to lock
ordering issues, so we instead make a copy of the socket buffer and
operate on that without any locks.  Fields in the original buffer are
cleared.

This behaviour clobbered the AIO job queue associated with a receive
buffer.  It could also cause us to leak a KTLS session reference.
Reorder socket buffer fields to address this.

An alternate solution would be to remove the hack in sorflush(), but
this is not quite feasible (yet).  In particular, though sorflush()
flags the sockbuf with SBS_CANTRCVMORE, it is possible for more data to
be queued - the flag just prevents userspace from reading more data.  I
suspect we should fix this; SBS_CANTRCVMORE represents a terminal state
and protocols can likely just drop any data destined for such a buffer.
Many of them already do, but in some cases the check is racy, and some
KPI churn will be needed to fix everything.  This approach is more
straightforward for now.

Reported by:	syzbot+104d8ee3430361cb2795@syzkaller.appspotmail.com
Reported by:	syzbot+5bd2e7d05f84a59d0d1b@syzkaller.appspotmail.com
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31976
2021-09-17 14:19:06 -04:00
Mark Johnston
883761f0a8 socket: Remove NOFREE from the socket zone
This flag was added during the transition away from the legacy zone
allocator, commit c897b81311.  The old
zone allocator effectively provided _NOFREE semantics, but it seems that
they are not required for sockets.  In particular, we use reference
counting to keep sockets live.

One somewhat dangerous case is sonewconn(), which returns a pointer to a
socket with reference count 0.  This socket is still effectively owned
by the listening socket.  Protocols must therefore be careful to
synchronize sonewconn() calls with their pru_close implementations,
since for listening sockets soclose() will abort the child sockets.  For
example, TCP holds the listening socket's PCB read locked across the
sonewconn() call, which blocks tcp_usr_close(), and sofree()
synchronizes with a concurrent soabort() of the nascent socket.
However, _NOFREE semantics are not required here.

Eliminating _NOFREE has several benefits: it enables use-after-free
detection (e.g., by KASAN) and lets the system reclaim memory from the
socket zone under memory pressure.  No functional change intended.

Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31975
2021-09-17 14:19:06 -04:00
Mark Johnston
6b288408ca socket: Add assertions around naked refcount decrements
Sockets in a listen queue hold a reference to the parent listening
socket.  Several code paths release this reference manually when moving
a child socket out of the queue.

Replace comments about the expected post-decrement refcount value with
assertions.  Use refcount_load() instead of a plain load.  No functional
change intended.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31974
2021-09-17 14:19:06 -04:00
Mark Johnston
dfcef87714 socket: Fix a use-after-free in soclose()
After releasing the fd reference to a socket "so", we should avoid
testing SOLISTENING(so) since the socket may have been freed.  Instead,
directly test whether the list of unaccepted sockets is empty.

Fixes:		f4bb1869dd ("Consistently use the SOLISTENING() macro")
Pointy hat:	markj
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31973
2021-09-17 14:19:05 -04:00
Mark Johnston
fa0463c384 socket: De-duplicate SBLOCKWAIT() definitions
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-09-14 09:01:32 -04:00
Mark Johnston
141fe2dcee aio: Interlock with listen(2)
soo_aio_queue() did not handle the possibility that the provided socket
is a listening socket.  Up until recently, to fix this one would have to
acquire the socket lock first and check, since the socket buffer locks
were destroyed by listen(2).

Now that the socket buffer locks belong to the socket, simply check
SOLISTENING(so) after acquiring them, and make listen(2) return an error
if any AIO jobs are enqueued on the socket.

Add a couple of simple regression test cases.

Note that this fixes things only for the default AIO implementation;
cxgbe(4)'s TCP offload has a separate pru_aio_queue implementation which
requires its own solution.

Reported by:	syzbot+c8aa122fa2c6a4e2a28b@syzkaller.appspotmail.com
Reported by:	syzbot+39af117d43d4f0faf512@syzkaller.appspotmail.com
Reported by:	syzbot+60cceb9569145a0b993b@syzkaller.appspotmail.com
Reported by:	syzbot+2d522c5db87710277ca5@syzkaller.appspotmail.com
Reviewed by:	tuexen, gallatin, jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31901
2021-09-10 17:21:11 -04:00
Mark Johnston
523d58aad1 socket: Remove unneeded SOLISTENING checks
Now that SOCK_IO_*_LOCK() checks for listening sockets, we can eliminate
some racy SOLISTENING() checks.  No functional change intended.

Reviewed by:	tuexen
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31660
2021-09-07 17:12:09 -04:00
Mark Johnston
bd4a39cc93 socket: Properly interlock when transitioning to a listening socket
Currently, most protocols implement pru_listen with something like the
following:

	SOCK_LOCK(so);
	error = solisten_proto_check(so);
	if (error) {
		SOCK_UNLOCK(so);
		return (error);
	}
	solisten_proto(so);
	SOCK_UNLOCK(so);

solisten_proto_check() fails if the socket is connected or connecting.
However, the socket lock is not used during I/O, so this pattern is
racy.

The change modifies solisten_proto_check() to additionally acquire
socket buffer locks, and the calling thread holds them until
solisten_proto() or solisten_proto_abort() is called.  Now that the
socket buffer locks are preserved across a listen(2), this change allows
socket I/O paths to properly interlock with listen(2).

This fixes a large number of syzbot reports, only one is listed below
and the rest will be dup'ed to it.

Reported by:	syzbot+9fece8a63c0e27273821@syzkaller.appspotmail.com
Reviewed by:	tuexen, gallatin
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31659
2021-09-07 17:11:43 -04:00
Mark Johnston
f94acf52a4 socket: Rename sb(un)lock() and interlock with listen(2)
In preparation for moving sockbuf locks into the containing socket,
provide alternative macros for the sockbuf I/O locks:
SOCK_IO_SEND_(UN)LOCK() and SOCK_IO_RECV_(UN)LOCK().  These operate on a
socket rather than a socket buffer.  Note that these locks are used only
to prevent concurrent readers and writters from interleaving I/O.

When locking for I/O, return an error if the socket is a listening
socket.  Currently the check is racy since the sockbuf sx locks are
destroyed during the transition to a listening socket, but that will no
longer be true after some follow-up changes.

Modify a few places to check for errors from
sblock()/SOCK_IO_(SEND|RECV)_LOCK() where they were not before.  In
particular, add checks to sendfile() and sorflush().

Reviewed by:	tuexen, gallatin
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31657
2021-09-07 15:06:48 -04:00
Roy Marples
7045b1603b socket: Implement SO_RERROR
SO_RERROR indicates that receive buffer overflows should be handled as
errors. Historically receive buffer overflows have been ignored and
programs could not tell if they missed messages or messages had been
truncated because of overflows. Since programs historically do not
expect to get receive overflow errors, this behavior is not the
default.

This is really really important for programs that use route(4) to keep
in sync with the system. If we loose a message then we need to reload
the full system state, otherwise the behaviour from that point is
undefined and can lead to chasing bogus bug reports.

Reviewed by:	philip (network), kbowling (transport), gbe (manpages)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D26652
2021-07-28 09:35:09 -07:00
Mark Johnston
a100217489 Consistently use the SOCKBUF_MTX() and SOCK_MTX() macros
This makes it easier to change the socket locking protocols.  No
functional change intended.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-06-14 17:32:32 -04:00