freebsd-dev

Author	SHA1	Message	Date
John Baldwin	e3ba94d4f3	Don't require the socket lock for sorele(). Previously, sorele() always required the socket lock and dropped the lock if the released reference was not the last reference. Many callers locked the socket lock just before calling sorele() resulting in a wasted lock/unlock when not dropping the last reference. Move the previous implementation of sorele() into a new sorele_locked() function and use it instead of sorele() for various places in uipc_socket.c that called sorele() while already holding the socket lock. The sorele() macro now uses refcount_release_if_not_last() try to drop the socket reference without locking the socket. If that shortcut fails, it locks the socket and calls sorele_locked(). Reviewed by: kib, markj Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D32741	2021-11-09 10:50:12 -08:00
Allan Jude	c441592a0e	Allow kern.ipc.maxsockets to be set to current value without error Normally setting kern.ipc.maxsockets returns EINVAL if the new value is not greater than the previous value. This can cause spurious error messages when sysctl.conf is processed multiple times, or when automation systems try to ensure the sysctl is set to the correct value. If the value is unchanged, then just do nothing. PR: 243532 Reviewed by: markj MFC after: 3 days Sponsored by: Modirum MDPay Sponsored by: Klara Inc. Differential Revision: https://reviews.freebsd.org/D32775	2021-11-04 12:56:09 +00:00
Gleb Smirnoff	a37e4fd1ea	Re-style `dfcef87714` to keep the code and variables related to listening sockets separated from code for generic sockets. No objection: markj	2021-10-01 13:38:24 -07:00
Mark Johnston	ade1daa5c0	socket: Synchronize soshutdown() with listen(2) and AIO To handle shutdown(SHUT_RD) we flush the receive buffer of the socket. This may involve searching for control messages of type SCM_RIGHTS, since we need to close the file references. Closing arbitrary files with socket buffer locks held is undesirable, mainly due to lock ordering issues, so we instead make a copy of the socket buffer and operate on that without any locks. Fields in the original buffer are cleared. This behaviour clobbered the AIO job queue associated with a receive buffer. It could also cause us to leak a KTLS session reference. Reorder socket buffer fields to address this. An alternate solution would be to remove the hack in sorflush(), but this is not quite feasible (yet). In particular, though sorflush() flags the sockbuf with SBS_CANTRCVMORE, it is possible for more data to be queued - the flag just prevents userspace from reading more data. I suspect we should fix this; SBS_CANTRCVMORE represents a terminal state and protocols can likely just drop any data destined for such a buffer. Many of them already do, but in some cases the check is racy, and some KPI churn will be needed to fix everything. This approach is more straightforward for now. Reported by: syzbot+104d8ee3430361cb2795@syzkaller.appspotmail.com Reported by: syzbot+5bd2e7d05f84a59d0d1b@syzkaller.appspotmail.com Reviewed by: jhb Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31976	2021-09-17 14:19:06 -04:00
Mark Johnston	883761f0a8	socket: Remove NOFREE from the socket zone This flag was added during the transition away from the legacy zone allocator, commit `c897b81311`. The old zone allocator effectively provided _NOFREE semantics, but it seems that they are not required for sockets. In particular, we use reference counting to keep sockets live. One somewhat dangerous case is sonewconn(), which returns a pointer to a socket with reference count 0. This socket is still effectively owned by the listening socket. Protocols must therefore be careful to synchronize sonewconn() calls with their pru_close implementations, since for listening sockets soclose() will abort the child sockets. For example, TCP holds the listening socket's PCB read locked across the sonewconn() call, which blocks tcp_usr_close(), and sofree() synchronizes with a concurrent soabort() of the nascent socket. However, _NOFREE semantics are not required here. Eliminating _NOFREE has several benefits: it enables use-after-free detection (e.g., by KASAN) and lets the system reclaim memory from the socket zone under memory pressure. No functional change intended. Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31975	2021-09-17 14:19:06 -04:00
Mark Johnston	6b288408ca	socket: Add assertions around naked refcount decrements Sockets in a listen queue hold a reference to the parent listening socket. Several code paths release this reference manually when moving a child socket out of the queue. Replace comments about the expected post-decrement refcount value with assertions. Use refcount_load() instead of a plain load. No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31974	2021-09-17 14:19:06 -04:00
Mark Johnston	dfcef87714	socket: Fix a use-after-free in soclose() After releasing the fd reference to a socket "so", we should avoid testing SOLISTENING(so) since the socket may have been freed. Instead, directly test whether the list of unaccepted sockets is empty. Fixes: `f4bb1869dd` ("Consistently use the SOLISTENING() macro") Pointy hat: markj MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31973	2021-09-17 14:19:05 -04:00
Mark Johnston	fa0463c384	socket: De-duplicate SBLOCKWAIT() definitions MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-09-14 09:01:32 -04:00
Mark Johnston	141fe2dcee	aio: Interlock with listen(2) soo_aio_queue() did not handle the possibility that the provided socket is a listening socket. Up until recently, to fix this one would have to acquire the socket lock first and check, since the socket buffer locks were destroyed by listen(2). Now that the socket buffer locks belong to the socket, simply check SOLISTENING(so) after acquiring them, and make listen(2) return an error if any AIO jobs are enqueued on the socket. Add a couple of simple regression test cases. Note that this fixes things only for the default AIO implementation; cxgbe(4)'s TCP offload has a separate pru_aio_queue implementation which requires its own solution. Reported by: syzbot+c8aa122fa2c6a4e2a28b@syzkaller.appspotmail.com Reported by: syzbot+39af117d43d4f0faf512@syzkaller.appspotmail.com Reported by: syzbot+60cceb9569145a0b993b@syzkaller.appspotmail.com Reported by: syzbot+2d522c5db87710277ca5@syzkaller.appspotmail.com Reviewed by: tuexen, gallatin, jhb Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31901	2021-09-10 17:21:11 -04:00
Mark Johnston	523d58aad1	socket: Remove unneeded SOLISTENING checks Now that SOCK_IO_*_LOCK() checks for listening sockets, we can eliminate some racy SOLISTENING() checks. No functional change intended. Reviewed by: tuexen MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31660	2021-09-07 17:12:09 -04:00
Mark Johnston	bd4a39cc93	socket: Properly interlock when transitioning to a listening socket Currently, most protocols implement pru_listen with something like the following: SOCK_LOCK(so); error = solisten_proto_check(so); if (error) { SOCK_UNLOCK(so); return (error); } solisten_proto(so); SOCK_UNLOCK(so); solisten_proto_check() fails if the socket is connected or connecting. However, the socket lock is not used during I/O, so this pattern is racy. The change modifies solisten_proto_check() to additionally acquire socket buffer locks, and the calling thread holds them until solisten_proto() or solisten_proto_abort() is called. Now that the socket buffer locks are preserved across a listen(2), this change allows socket I/O paths to properly interlock with listen(2). This fixes a large number of syzbot reports, only one is listed below and the rest will be dup'ed to it. Reported by: syzbot+9fece8a63c0e27273821@syzkaller.appspotmail.com Reviewed by: tuexen, gallatin MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31659	2021-09-07 17:11:43 -04:00
Mark Johnston	f94acf52a4	socket: Rename sb(un)lock() and interlock with listen(2) In preparation for moving sockbuf locks into the containing socket, provide alternative macros for the sockbuf I/O locks: SOCK_IO_SEND_(UN)LOCK() and SOCK_IO_RECV_(UN)LOCK(). These operate on a socket rather than a socket buffer. Note that these locks are used only to prevent concurrent readers and writters from interleaving I/O. When locking for I/O, return an error if the socket is a listening socket. Currently the check is racy since the sockbuf sx locks are destroyed during the transition to a listening socket, but that will no longer be true after some follow-up changes. Modify a few places to check for errors from sblock()/SOCK_IO_(SEND\|RECV)_LOCK() where they were not before. In particular, add checks to sendfile() and sorflush(). Reviewed by: tuexen, gallatin MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31657	2021-09-07 15:06:48 -04:00
Roy Marples	7045b1603b	socket: Implement SO_RERROR SO_RERROR indicates that receive buffer overflows should be handled as errors. Historically receive buffer overflows have been ignored and programs could not tell if they missed messages or messages had been truncated because of overflows. Since programs historically do not expect to get receive overflow errors, this behavior is not the default. This is really really important for programs that use route(4) to keep in sync with the system. If we loose a message then we need to reload the full system state, otherwise the behaviour from that point is undefined and can lead to chasing bogus bug reports. Reviewed by: philip (network), kbowling (transport), gbe (manpages) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D26652	2021-07-28 09:35:09 -07:00
Mark Johnston	a100217489	Consistently use the SOCKBUF_MTX() and SOCK_MTX() macros This makes it easier to change the socket locking protocols. No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-06-14 17:32:32 -04:00
Mark Johnston	f4bb1869dd	Consistently use the SOLISTENING() macro Some code was using it already, but in many places we were testing SO_ACCEPTCONN directly. As a small step towards fixing some bugs involving synchronization with listen(2), make the kernel consistently use SOLISTENING(). No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-06-14 17:32:27 -04:00
John Baldwin	aa341db39b	Rename m_unmappedtouio() to m_unmapped_uiomove(). This function doesn't only copy data into a uio but instead is a variant of uiomove() similar to uiomove_fromphys(). Reviewed by: gallatin, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30444	2021-05-25 16:59:18 -07:00
Mark Johnston	916c61a5ed	Fix handling of errors from pru_send(PRUS_NOTREADY) PRUS_NOTREADY indicates that the caller has not yet populated the chain with data, and so it is not ready for transmission. This is used by sendfile (for async I/O) and KTLS (for encryption). In particular, if pru_send returns an error, the caller is responsible for freeing the chain since other implicit references to the data buffers exist. For async sendfile, it happens that an error will only be returned if the connection was dropped, in which case tcp_usr_ready() will handle freeing the chain. But since KTLS can be used in conjunction with the regular socket I/O system calls, many more error cases - which do not result in the connection being dropped - are reachable. In these cases, KTLS was effectively assuming success. So: - Change sosend_generic() to free the mbuf chain if pru_send(PRUS_NOTREADY) fails. Nothing else owns a reference to the chain at that point. - Similarly, in vn_sendfile() change the !async I/O && KTLS case to free the chain. - If async I/O is still outstanding when pru_send fails in vn_sendfile(), set an error in the sfio structure so that the connection is aborted and the mbuf chain is freed. Reviewed by: gallatin, tuexen Discussed with: jhb MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30349	2021-05-21 17:45:19 -04:00
Lv Yunlong	b295c5ddce	socket: Release cred reference later in sodealloc() We dereference so->so_cred to update the per-uid socket buffer accounting, so the crfree() call must be deferred until after that point. PR: 255869 MFC after: 1 week	2021-05-18 15:25:40 -04:00
Mark Johnston	d8acd2681b	Fix mbuf leaks in various pru_send implementations The various protocol implementations are not very consistent about freeing mbufs in error paths. In general, all protocols must free both "m" and "control" upon an error, except if PRUS_NOTREADY is specified (this is only implemented by TCP and unix(4) and requires further work not handled in this diff), in which case "control" still must be freed. This diff plugs various leaks in the pru_send implementations. Reviewed by: tuexen MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30151	2021-05-12 13:00:09 -04:00
Thomas Munro	3aaaa2efde	poll(2): Add POLLRDHUP. Teach poll(2) to support Linux-style POLLRDHUP events for sockets, if requested. Triggered when the remote peer shuts down writing or closes its end. Reviewed by: kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D29757	2021-04-28 23:00:31 +12:00
Mark Johnston	274579831b	capsicum: Limit socket operations in capability mode Capsicum did not prevent certain privileged networking operations, specifically creation of raw sockets and network configuration ioctls. However, these facilities can be used to circumvent some of the restrictions that capability mode is supposed to enforce. Add capability mode checks to disallow network configuration ioctls and creation of sockets other than PF_LOCAL and SOCK_DGRAM/STREAM/SEQPACKET internet sockets. Reviewed by: oshogbo Discussed with: emaste Reported by: manu Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D29423	2021-04-07 14:32:56 -04:00
Kyle Evans	f187d6dfbf	base: remove if_wg(4) and associated utilities, manpage After length decisions, we've decided that the if_wg(4) driver and related work is not yet ready to live in the tree. This driver has larger security implications than many, and thus will be held to more scrutiny than other drivers. Please also see the related message sent to the freebsd-hackers@ and freebsd-arch@ lists by Kyle Evans <kevans@FreeBSD.org> on 2021/03/16, with the subject line "Removing WireGuard Support From Base" for additional context.	2021-03-17 09:14:48 -05:00
Kyle Evans	74ae3f3e33	if_wg: import latest fixup work from the wireguard-freebsd project This is the culmination of about a week of work from three developers to fix a number of functional and security issues. This patch consists of work done by the following folks: - Jason A. Donenfeld <Jason@zx2c4.com> - Matt Dunwoodie <ncon@noconroy.net> - Kyle Evans <kevans@FreeBSD.org> Notable changes include: - Packets are now correctly staged for processing once the handshake has completed, resulting in less packet loss in the interim. - Various race conditions have been resolved, particularly w.r.t. socket and packet lifetime (panics) - Various tests have been added to assure correct functionality and tooling conformance - Many security issues have been addressed - if_wg now maintains jail-friendly semantics: sockets are created in the interface's home vnet so that it can act as the sole network connection for a jail - if_wg no longer fails to remove peer allowed-ips of 0.0.0.0/0 - if_wg now exports via ioctl a format that is future proof and complete. It is additionally supported by the upstream wireguard-tools (which we plan to merge in to base soon) - if_wg now conforms to the WireGuard protocol and is more closely aligned with security auditing guidelines Note that the driver has been rebased away from using iflib. iflib poses a number of challenges for a cloned device trying to operate in a vnet that are non-trivial to solve and adds complexity to the implementation for little gain. The crypto implementation that was previously added to the tree was a super complex integration of what previously appeared in an old out of tree Linux module, which has been reduced to crypto.c containing simple boring reference implementations. This is part of a near-to-mid term goal to work with FreeBSD kernel crypto folks and take advantage of or improve accelerated crypto already offered elsewhere. There's additional test suite effort underway out-of-tree taking advantage of the aforementioned jail-friendly semantics to test a number of real-world topologies, based on netns.sh. Also note that this is still a work in progress; work going further will be much smaller in nature. MFC after: 1 month (maybe)	2021-03-14 23:52:04 -05:00
Kyle Evans	504ebd612e	kern: sonewconn: set so_options before pru_attach() Protocol attachment has historically been able to observe and modify so->so_options as needed, and it still can for newly created sockets. `779f106aa1` moved this to after pru_attach() when we re-acquire the lock on the listening socket. Restore the historical behavior so that pru_attach implementations can consistently use it. Note that some pru_attach() do currently rely on this, though that may change in the future. D28265 contains a change to remove the use in TCP and IB/SDP bits, as resetting the requested linger time on incoming connections seems questionable at best. This does move the assignment out from under the head's listen lock, but glebius notes that head won't be going away and applications cannot assume any specific ordering with a race between a connection coming in and the application changing socket options anyways. Discussed-with: glebius MFC-after: 1 week	2021-02-08 21:44:43 -06:00
Alexander V. Chernikov	924d1c9a05	Revert "SO_RERROR indicates that receive buffer overflows should be handled as errors." Wrong version of the change was pushed inadvertenly. This reverts commit `4a01b854ca`.	2021-02-08 22:32:32 +00:00
Alexander V. Chernikov	4a01b854ca	SO_RERROR indicates that receive buffer overflows should be handled as errors. Historically receive buffer overflows have been ignored and programs could not tell if they missed messages or messages had been truncated because of overflows. Since programs historically do not expect to get receive overflow errors, this behavior is not the default. This is really really important for programs that use route(4) to keep in sync with the system. If we loose a message then we need to reload the full system state, otherwise the behaviour from that point is undefined and can lead to chasing bogus bug reports.	2021-02-08 21:42:20 +00:00
Kyle Evans	34af05ead3	kern: soclose: don't sleep on SO_LINGER w/ timeout=0 This is a valid scenario that's handled in the various protocol layers where it makes sense (e.g., tcp_disconnect and sctp_disconnect). Given that it indicates we should immediately drop the connection, it makes little sense to sleep on it. This could lead to panics with INVARIANTS. On non-INVARIANTS kernels, this could result in the thread hanging until a signal interrupts it if the protocol does not mark the socket as disconnected for whatever reason. Reported by: syzbot+e625d92c1dd74e402c81@syzkaller.appspotmail.com Reviewed by: glebius, markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D27407	2020-12-04 04:39:48 +00:00
Mateusz Guzik	e90afaa015	kqueue: save space by using only one func pointer for assertions	2020-11-09 00:04:35 +00:00
Mateusz Guzik	6fed89b179	kern: clean up empty lines in .c and .h files	2020-09-01 22:12:32 +00:00
Rick Macklem	102829aa92	Add the MSG_TLSAPPDATA flag to indicate "return ENXIO" for non-application TLS data records. The kernel RPC cannot process non-application data records when using TLS. It must to an upcall to a userspace daemon that will call SSL_read() to process them. This patch adds a new flag called MSG_TLSAPPDATA that the kernel RPC can use to tell sorecieve() to return ENXIO instead of a non-application data record, when that is what is at the top of the receive queue. I put the code in #ifdef KERN_TLS/#endif, although it will build without that, so that it is recognized as only useful when KERN_TLS is enabled. The alternative to doing this is to have the kernel RPC re-queue the non-application data message after receiving it, but that seems more complicated and might introduce message ordering issues when there are multiple non-application data records one after another. I do not know what, if any, changes will be required to support TLS1.3. Reviewed by: glebius Differential Revision: https://reviews.freebsd.org/D25923	2020-08-19 23:42:33 +00:00
John Baldwin	0f70a1489d	Properly handle a closed TLS socket with pending receive data. If the remote end closes a TLS socket and the socket buffer still contains not-yet-decrypted TLS records but no decrypted TLS records, soreceive needs to block or fail with EWOULDBLOCK. Previously it was trying to return data and dereferencing a NULL pointer. Reviewed by: np Sponsored by: Chelsio Differential Revision: https://reviews.freebsd.org/D25838	2020-07-29 23:24:32 +00:00
John Baldwin	3c0e568505	Add support for KTLS RX via software decryption. Allow TLS records to be decrypted in the kernel after being received by a NIC. At a high level this is somewhat similar to software KTLS for the transmit path except in reverse. Protocols enqueue mbufs containing encrypted TLS records (or portions of records) into the tail of a socket buffer and the KTLS layer decrypts those records before returning them to userland applications. However, there is an important difference: - In the transmit case, the socket buffer is always a single "record" holding a chain of mbufs. Not-yet-encrypted mbufs are marked not ready (M_NOTREADY) and released to protocols for transmit by marking mbufs ready once their data is encrypted. - In the receive case, incoming (encrypted) data appended to the socket buffer is still a single stream of data from the protocol, but decrypted TLS records are stored as separate records in the socket buffer and read individually via recvmsg(). Initially I tried to make this work by marking incoming mbufs as M_NOTREADY, but there didn't seemed to be a non-gross way to deal with picking a portion of the mbuf chain and turning it into a new record in the socket buffer after decrypting the TLS record it contained (along with prepending a control message). Also, such mbufs would also need to be "pinned" in some way while they are being decrypted such that a concurrent sbcut() wouldn't free them out from under the thread performing decryption. As such, I settled on the following solution: - Socket buffers now contain an additional chain of mbufs (sb_mtls, sb_mtlstail, and sb_tlscc) containing encrypted mbufs appended by the protocol layer. These mbufs are still marked M_NOTREADY, but soreceive*() generally don't know about them (except that they will block waiting for data to be decrypted for a blocking read). - Each time a new mbuf is appended to this TLS mbuf chain, the socket buffer peeks at the TLS record header at the head of the chain to determine the encrypted record's length. If enough data is queued for the TLS record, the socket is placed on a per-CPU TLS workqueue (reusing the existing KTLS workqueues and worker threads). - The worker thread loops over the TLS mbuf chain decrypting records until it runs out of data. Each record is detached from the TLS mbuf chain while it is being decrypted to keep the mbufs "pinned". However, a new sb_dtlscc field tracks the character count of the detached record and sbcut()/sbdrop() is updated to account for the detached record. After the record is decrypted, the worker thread first checks to see if sbcut() dropped the record. If so, it is freed (can happen when a socket is closed with pending data). Otherwise, the header and trailer are stripped from the original mbufs, a control message is created holding the decrypted TLS header, and the decrypted TLS record is appended to the "normal" socket buffer chain. (Side note: the SBCHECK() infrastucture was very useful as I was able to add assertions there about the TLS chain that caught several bugs during development.) Tested by: rmacklem (various versions) Relnotes: yes Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D24628	2020-07-23 23:48:18 +00:00
Mark Johnston	95033af923	Add the SCTP_SUPPORT kernel option. This is in preparation for enabling a loadable SCTP stack. Analogous to IPSEC/IPSEC_SUPPORT, the SCTP_SUPPORT kernel option must be configured in order to support a loadable SCTP implementation. Discussed with: tuexen MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2020-06-18 19:32:34 +00:00
John Baldwin	2684603c5f	Permit SO_NO_DDP and SO_NO_OFFLOAD to be read via getsockopt(2). MFC after: 2 weeks Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D24627	2020-05-29 00:09:12 +00:00
Rick Macklem	469f2e9e9a	Fix sosend() for the case where mbufs are passed in while doing ktls. For kernel tls, sosend() needs to call ktls_frame() on the mbuf list to be sent. Without this patch, this was only done when sosend()'s arguments used a uio_iov and not when an mbuf list is passed in. At this time, sosend() is never called with an mbuf list argument when kernel tls is in use, but will be once nfs-over-tls has been incorporated into head. Reviewed by: gallatin, glebius Differential Revision: https://reviews.freebsd.org/D24674	2020-05-27 23:20:35 +00:00
Konstantin Belousov	0532a7a2df	Fix r361037. Reorder flag manipulations and use barrier to ensure that the program order is followed by compiler and CPU, for unlocked reader of so_state. In collaboration with: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24842	2020-05-14 20:17:09 +00:00
Konstantin Belousov	39845728a1	Fix spurious ENOTCONN from closed unix domain socket other' side. Sometimes, when doing read(2) over unix domain socket, for which the other side socket was closed, read(2) returns -1/ENOTCONN instead of EOF AKA zero-size read. This is because soreceive_generic() does not lock socket when testing the so_state SS_ISCONNECTED\|SS_ISCONNECTING flags. It could end up that we do not observe so->so_rcv.sb_state bit SBS_CANTRCVMORE, and then miss SS_ flags. Change the test to check that the socket was never connected before returning ENOTCONN, by adding all state bits for connected. Reported and tested by: pho In collaboration with: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D24819	2020-05-14 17:54:08 +00:00
Gleb Smirnoff	6edfd179c8	Step 4.1: mechanically rename M_NOMAP to M_EXTPG Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598	2020-05-03 00:21:11 +00:00
Rick Macklem	0306689367	Fix sosend_generic() so that it can handle a list of ext_pgs mbufs. Without this patch, sosend_generic() will try to use top->m_pkthdr.len, assuming that the first mbuf has a pkthdr. When a list of ext_pgs mbufs is passed in, the first mbuf is not a pkthdr and cannot be post-r359919. As such, the value of top->m_pkthdr.len is bogus (0 for my testing). This patch fixes sosend_generic() to handle this case, calculating the total length via m_length() for this case. There is currently nothing that hands a list of ext_pgs mbufs to sosend_generic(), but the nfs-over-tls kernel RPC code in projects/nfs-over-tls will do that and was used to test this patch. Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24568	2020-04-27 23:55:09 +00:00
John Baldwin	f1f9347546	Initial support for kernel offload of TLS receive. - Add a new TCP_RXTLS_ENABLE socket option to set the encryption and authentication algorithms and keys as well as the initial sequence number. - When reading from a socket using KTLS receive, applications must use recvmsg(). Each successful call to recvmsg() will return a single TLS record. A new TCP control message, TLS_GET_RECORD, will contain the TLS record header of the decrypted record. The regular message buffer passed to recvmsg() will receive the decrypted payload. This is similar to the interface used by Linux's KTLS RX except that Linux does not return the full TLS header in the control message. - Add plumbing to the TOE KTLS interface to request either transmit or receive KTLS sessions. - When a socket is using receive KTLS, redirect reads from soreceive_stream() into soreceive_generic(). - Note that this interface is currently only defined for TLS 1.1 and 1.2, though I believe we will be able to reuse the same interface and structures for 1.3.	2020-04-27 23:17:19 +00:00
Jonathan T. Looney	fb401f1bba	Make sonewconn() overflow messages have per-socket rate-limits and values. sonewconn() emits debug-level messages when a listen socket's queue overflows. Currently, sonewconn() tracks overflows on a global basis. It will only log one message every 60 seconds, regardless of how many sockets experience overflows. And, when it next logs at the end of the 60 seconds, it records a single message referencing a single PCB with the total number of overflows across all sockets. This commit changes to per-socket overflow tracking. The code will now log one message every 60 seconds per socket. And, the code will provide per-socket queue length and overflow counts. It also provides a way to change the period between log messages using a sysctl. Reviewed by: jhb (previous version), bcr (manpages) MFC after: 2 weeks Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D24316	2020-04-14 15:38:18 +00:00
Jonathan T. Looney	f6ab9795d4	Print more detail as part of the sonewconn() overflow message. When a socket's listen queue overflows, sonewconn() emits a debug-level log message. These messages are sometimes useful to systems administrators in highlighting a process which is not keeping up with its listen queue. This commit attempts to enhance the usefulness of this message by printing more details about the socket's address. If all else fails, it will at least print the domain name of the socket. Reviewed by: bz, jhb, kbowling MFC after: 2 weeks Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D24272	2020-04-14 15:30:34 +00:00
Pawel Biernacki	7029da5c36	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT Approved by: kib (mentor, blanket) Commented by: kib, gallatin, melifaro Differential Revision: https://reviews.freebsd.org/D23718	2020-02-26 14:26:36 +00:00
Gleb Smirnoff	f85e1a806b	Make ktls_frame() never fail. Caller must supply correct mbufs. This makes sendfile code a bit simplier.	2020-02-25 19:26:40 +00:00
Gleb Smirnoff	975b8f8462	Cleanup unneeded includes that crept in with r353292.	2019-10-09 16:59:42 +00:00
John Baldwin	9e14430d46	Add a TOE KTLS mode and a TOE hook for allocating TLS sessions. This adds the glue to allocate TLS sessions and invokes it from the TLS enable socket option handler. This also adds some counters for active TOE sessions. The TOE KTLS mode is returned by getsockopt(TLSTX_TLS_MODE) when TOE KTLS is in use on a socket, but cannot be set via setsockopt(). To simplify various checks, a TLS session now includes an explicit 'mode' member set to the value returned by TLSTX_TLS_MODE. Various places that used to check 'sw_encrypt' against NULL to determine software vs ifnet (NIC) TLS now check 'mode' instead. Reviewed by: np, gallatin Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21891	2019-10-08 21:34:06 +00:00
Gleb Smirnoff	b8a6e03fac	Widen NET_EPOCH coverage. When epoch(9) was introduced to network stack, it was basically dropped in place of existing locking, which was mutexes and rwlocks. For the sake of performance mutex covered areas were as small as possible, so became epoch covered areas. However, epoch doesn't introduce any contention, it just delays memory reclaim. So, there is no point to minimise epoch covered areas in sense of performance. Meanwhile entering/exiting epoch also has non-zero CPU usage, so doing this less often is a win. Not the least is also code maintainability. In the new paradigm we can assume that at any stage of processing a packet, we are inside network epoch. This makes coding both input and output path way easier. On output path we already enter epoch quite early - in the ip_output(), in the ip6_output(). This patch does the same for the input path. All ISR processing, network related callouts, other ways of packet injection to the network stack shall be performed in net_epoch. Any leaf function that walks network configuration now asserts epoch. Tricky part is configuration code paths - ioctls, sysctls. They also call into leaf functions, so some need to be changed. This patch would introduce more epoch recursions (see EPOCH_TRACE) than we had before. They will be cleaned up separately, as several of them aren't trivial. Note, that unlike a lock recursion the epoch recursion is safe and just wastes a bit of resources. Reviewed by: gallatin, hselasky, cy, adrian, kristof Differential Revision: https://reviews.freebsd.org/D19111	2019-10-07 22:40:05 +00:00
John Baldwin	b2e60773c6	Add kernel-side support for in-kernel TLS. KTLS adds support for in-kernel framing and encryption of Transport Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports offload of TLS for transmitted data. Key negotation must still be performed in userland. Once completed, transmit session keys for a connection are provided to the kernel via a new TCP_TXTLS_ENABLE socket option. All subsequent data transmitted on the socket is placed into TLS frames and encrypted using the supplied keys. Any data written to a KTLS-enabled socket via write(2), aio_write(2), or sendfile(2) is assumed to be application data and is encoded in TLS frames with an application data type. Individual records can be sent with a custom type (e.g. handshake messages) via sendmsg(2) with a new control message (TLS_SET_RECORD_TYPE) specifying the record type. At present, rekeying is not supported though the in-kernel framework should support rekeying. KTLS makes use of the recently added unmapped mbufs to store TLS frames in the socket buffer. Each TLS frame is described by a single ext_pgs mbuf. The ext_pgs structure contains the header of the TLS record (and trailer for encrypted records) as well as references to the associated TLS session. KTLS supports two primary methods of encrypting TLS frames: software TLS and ifnet TLS. Software TLS marks mbufs holding socket data as not ready via M_NOTREADY similar to sendfile(2) when TLS framing information is added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then called to schedule TLS frames for encryption. In the case of sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving the mbufs marked M_NOTREADY until encryption is completed. For other writes (vn_sendfile when pages are available, write(2), etc.), the PRUS_NOTREADY is set when invoking pru_send() along with invoking ktls_enqueue(). A pool of worker threads (the "KTLS" kernel process) encrypts TLS frames queued via ktls_enqueue(). Each TLS frame is temporarily mapped using the direct map and passed to a software encryption backend to perform the actual encryption. (Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if someone wished to make this work on architectures without a direct map.) KTLS supports pluggable software encryption backends. Internally, Netflix uses proprietary pure-software backends. This commit includes a simple backend in a new ktls_ocf.ko module that uses the kernel's OpenCrypto framework to provide AES-GCM encryption of TLS frames. As a result, software TLS is now a bit of a misnomer as it can make use of hardware crypto accelerators. Once software encryption has finished, the TLS frame mbufs are marked ready via pru_ready(). At this point, the encrypted data appears as regular payload to the TCP stack stored in unmapped mbufs. ifnet TLS permits a NIC to offload the TLS encryption and TCP segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS) is allocated on the interface a socket is routed over and associated with a TLS session. TLS records for a TLS session using ifnet TLS are not marked M_NOTREADY but are passed down the stack unencrypted. The ip_output_send() and ip6_output_send() helper functions that apply send tags to outbound IP packets verify that the send tag of the TLS record matches the outbound interface. If so, the packet is tagged with the TLS send tag and sent to the interface. The NIC device driver must recognize packets with the TLS send tag and schedule them for TLS encryption and TCP segmentation. If the the outbound interface does not match the interface in the TLS send tag, the packet is dropped. In addition, a task is scheduled to refresh the TLS send tag for the TLS session. If a new TLS send tag cannot be allocated, the connection is dropped. If a new TLS send tag is allocated, however, subsequent packets will be tagged with the correct TLS send tag. (This latter case has been tested by configuring both ports of a Chelsio T6 in a lagg and failing over from one port to another. As the connections migrated to the new port, new TLS send tags were allocated for the new port and connections resumed without being dropped.) ifnet TLS can be enabled and disabled on supported network interfaces via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported across both vlan devices and lagg interfaces using failover, lacp with flowid enabled, or lacp with flowid enabled. Applications may request the current KTLS mode of a connection via a new TCP_TXTLS_MODE socket option. They can also use this socket option to toggle between software and ifnet TLS modes. In addition, a testing tool is available in tools/tools/switch_tls. This is modeled on tcpdrop and uses similar syntax. However, instead of dropping connections, -s is used to force KTLS connections to switch to software TLS and -i is used to switch to ifnet TLS. Various sysctls and counters are available under the kern.ipc.tls sysctl node. The kern.ipc.tls.enable node must be set to true to enable KTLS (it is off by default). The use of unmapped mbufs must also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS. KTLS is enabled via the KERN_TLS kernel option. This patch is the culmination of years of work by several folks including Scott Long and Randall Stewart for the original design and implementation; Drew Gallatin for several optimizations including the use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records awaiting software encryption, and pluggable software crypto backends; and John Baldwin for modifications to support hardware TLS offload. Reviewed by: gallatin, hselasky, rrs Obtained from: Netflix Sponsored by: Netflix, Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21277	2019-08-27 00:01:56 +00:00
Andrey V. Elsukov	75697b16b6	Use TAILQ_FOREACH_SAFE() macro to avoid use after free in soclose(). PR: 239893 MFC after: 1 week	2019-08-19 12:42:03 +00:00
Michael Tuexen	a85b7f125b	Improve the input validation for l_linger. When using the SOL_SOCKET level socket option SO_LINGER, the structure struct linger is used as the option value. The component l_linger is of type int, but internally copied to the field so_linger of the structure struct socket. The type of so_linger is short, but it is assumed to be non-negative and the value is used to compute ticks to be stored in a variable of type int. Therefore, perform input validation on l_linger similar to the one performed by NetBSD and OpenBSD. Thanks to syzkaller for making me aware of this issue. Thanks to markj@ for pointing out that a similar check should be added to so_linger_set(). Reviewed by: markj@ MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D20948	2019-07-14 21:44:18 +00:00

1 2 3 4 5 ...

523 Commits