freebsd-dev

Author	SHA1	Message	Date
Mark Johnston	7b92493ab1	inpcb: Avoid inp_cred dereferences in SMR-protected lookup The SMR-protected inpcb lookup algorithm currently has to check whether a matching inpcb belongs to a jail, in order to prioritize jailed bound sockets. To do this it has to maintain a ucred reference, and for this to be safe, the reference can't be released until the UMA destructor is called, and this will not happen within any bounded time period. Changing SMR to periodically recycle garbage is not trivial. Instead, let's implement SMR-synchronized lookup without needing to dereference inp_cred. This will allow the inpcb code to free the inp_cred reference immediately when a PCB is freed, ensuring that ucred (and thus jail) references are released promptly. Commit `220d892129` ("inpcb: immediately return matching pcb on lookup") gets us part of the way there. This patch goes further to handle lookups of unconnected sockets. Here, the strategy is to maintain a well-defined order of items within a hash chain so that a wild lookup can simply return the first match and preserve existing semantics. This makes insertion of listening sockets more complicated in order to make lookup simpler, which seems like the right tradeoff anyway given that bind() is already a fairly expensive operation and lookups are more common. In particular, when inserting an unconnected socket, in_pcbinhash() now keeps the following ordering: - jailed sockets before non-jailed sockets, - specified local addresses before unspecified local addresses. Most of the change adds a separate SMR-based lookup path for inpcb hash lookups. When a match is found, we try to lock the inpcb and re-validate its connection info. In the common case, this works well and we can simply return the inpcb. If this fails, typically because something is concurrently modifying the inpcb, we go to the slow path, which performs a serialized lookup. 
Note, I did not touch lbgroup lookup, since there the credential reference is formally synchronized by net_epoch, not SMR. In particular, lbgroups are rarely allocated or freed. I think it is possible to simplify in_pcblookup_hash_wild_locked() now, but I didn't do it in this patch. Discussed with: glebius Tested by: glebius Sponsored by: Klara, Inc. Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D38572	2023-04-20 12:13:06 -04:00
Mark Johnston	3e98dcb3d5	inpcb: Move inpcb matching logic into separate functions These functions will get some additional callers in future revisions. No functional change intended. Discussed with: glebius Tested by: glebius Sponsored by: Modirum MDPay Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D38571	2023-04-20 12:13:06 -04:00
Mark Johnston	fdb987bebd	inpcb: Split PCB hash tables Currently we use a single hash table per PCB database for connected and bound PCBs. Since we started using net_epoch to synchronize hash table lookups, there's been a bug, noted in a comment above in_pcbrehash(): connecting a socket can cause an inpcb to move between hash chains, and this can cause a concurrent lookup to follow the wrong linkage pointers. I believe this could cause rare, spurious ECONNREFUSED errors in the worst case. Address the problem by introducing a second hash table and adding more linkage pointers to struct inpcb. Now the database has one table each for connected and unconnected sockets. When inserting an inpcb into the hash table, in_pcbinhash() now looks at the foreign address of the inpcb to figure out which table to use. This ensures that queue linkage pointers are stable until the socket is disconnected, so the problem described above goes away. There is also a small benefit in that in_pcblookup_*() can now search just one of the two possible hash buckets. I also made the "rehash" parameter of in(6)_pcbconnect() unused. This parameter seems confusing and it is simpler to let the inpcb code figure out what to do using the existing INP_INHASHLIST flag. UDP sockets pose a special problem since they can be connected and disconnected multiple times during their lifecycle. To handle this, the patch plugs a hole in the inpcb structure and uses it to store an SMR sequence number. When an inpcb is disconnected - an operation which requires the global PCB database hash lock - the write sequence number is advanced, and in order to reconnect, the connecting thread must wait for readers to drain before reusing the inpcb's hash chain linkage pointers. raw_ip (ab)uses the hash table without using the corresponding accessors. Since there are now two hash tables, it arbitrarily uses the "connected" table for all of its PCBs. This will be addressed in some way in the future. 
inp iterators which specify a hash bucket will only visit connected PCBs. This is not really correct, but nothing in the tree uses that functionality except raw_ip, which as mentioned above places all of its PCBs in the "connected" table and so is unaffected. Discussed with: glebius Tested by: glebius Sponsored by: Klara, Inc. Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D38569	2023-04-20 12:13:06 -04:00
Mark Johnston	713264f6b8	netinet: Tighten checks for unspecified source addresses The assertions added in commit `b0ccf53f24` ("inpcb: Assert against wildcard addrs in in_pcblookup_hash_locked()") revealed that protocol layers may pass the unspecified address to in_pcblookup(). Add some checks to filter out such packets before we attempt an inpcb lookup: - Disallow the use of an unspecified source address in in_pcbladdr() and in6_pcbladdr(). - Disallow IP packets with an unspecified destination address. - Disallow TCP packets with an unspecified source address, and add an assertion to verify the comment claiming that the case of an unspecified destination address is handled by the IP layer. Reported by: syzbot+9ca890fb84e984e82df2@syzkaller.appspotmail.com Reported by: syzbot+ae873c71d3c71d5f41cb@syzkaller.appspotmail.com Reported by: syzbot+e3e689aba1d442905067@syzkaller.appspotmail.com Reviewed by: glebius, melifaro MFC after: 2 weeks Sponsored by: Klara, Inc. Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D38570	2023-03-06 15:06:00 -05:00
Mark Johnston	3aff4ccdd7	netinet: Remove IP(V6)_BINDMULTI This option was added in commit `0a100a6f1e` but was never completed. In particular, there is no logic to map flowids to different listening sockets, so it accomplishes basically the same thing as SO_REUSEPORT. Meanwhile, we've since added SO_REUSEPORT_LB, which at least tries to balance among listening sockets using a hash of the 4-tuple and some optional NUMA policy. The option was never documented or completed, and an exp-run revealed nothing using it in the ports tree. Moreover, it complicates the already very complicated in_pcbbind_setup(), and the checking in in_pcbbind_check_bindmulti() is insufficient. So, let's remove it. PR: 261398 (exp-run) Reviewed by: glebius Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D38574	2023-02-27 10:03:11 -05:00
Gleb Smirnoff	96871af013	inpcb: use family specific sockaddr argument for bind functions Do the cast from sockaddr to either IPv4 or IPv6 sockaddr in the protocol's pr_bind method and from there on go down the call stack with family specific argument. Reviewed by: zlei, melifaro, markj Differential Revision: https://reviews.freebsd.org/D38601	2023-02-15 10:30:16 -08:00
Mark Johnston	4130ea611f	inpcb: Split in_pcblookup_hash_locked() and clean up a bit Split the in_pcblookup_hash_locked() function into several independent subroutine calls, each of which does some kind of hash table lookup. This refactoring makes it easier to introduce variants of the lookup algorithm that behave differently depending on whether they are synchronized by SMR or the PCB database hash lock. While here, do some related cleanup: - Remove an unused ifnet parameter from internal functions. Keep it in external functions so that it can be used in the future to derive a v6 scopeid. - Reorder the parameters to in_pcblookup_lbgroup() to be consistent with the other lookup functions. - Remove an always-true check from in_pcblookup_lbgroup(): we can assume that we're performing a wildcard match. No functional change intended. Reviewed by: glebius Differential Revision: https://reviews.freebsd.org/D38364	2023-02-09 16:15:03 -05:00
Gleb Smirnoff	220d892129	inpcb: immediately return matching pcb on lookup This saves a lot of CPU cycles if you have a large connection table. The code removed originates from `413628a7e3`, a very large changeset. After discussing this with Bjoern and Jamie, we could not determine why we would ever have identical 4-tuples in the hash, even in the presence of jails. Bjoern did a test that confirms that it is impossible to allocate an identical connection from a jail to a host. Code review also confirms that the system shouldn't allow such connections to exist. Lacking a proper test suite, we decided to take a risk and go forward with removing that code. Reviewed by: gallatin, bz, markj Differential Revision: https://reviews.freebsd.org/D38015	2023-02-07 09:21:52 -08:00
Gleb Smirnoff	a9d22cce10	inpcb: use family specific sockaddr argument for connect functions Do the cast from sockaddr to either IPv4 or IPv6 sockaddr in the protocol's pr_connect method and from there on go down the call stack with family specific argument. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D38356	2023-02-03 11:33:36 -08:00
Gleb Smirnoff	3d76be28ec	netinet6: require network epoch for in6_pcbconnect() This removes recursive epoch entry in the syncache case. Fixes unprotected access to V_in6_ifaddrhead in in6_pcbladdr(), as well as access to prison IP address lists. It also matches what IPv4 in_pcbconnect() does. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D38355	2023-02-03 11:33:36 -08:00
Gleb Smirnoff	221b9e3d06	inpcb: merge two versions of in6_pcbconnect() into one No functional change. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D38354	2023-02-03 11:33:35 -08:00
Mark Johnston	2589ec0f36	pcb: Move an assignment into in_pcbdisconnect() All callers of in_pcbdisconnect() clear the local address, so let's just do that in the function itself. Note that the inp's local address is not a parameter to the inp hash functions. No functional change intended. Reviewed by: glebius MFC after: 2 weeks Sponsored by: Klara, Inc. Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D38362	2023-02-03 11:48:25 -05:00
Mark Johnston	b0ccf53f24	inpcb: Assert against wildcard addrs in in_pcblookup_hash_locked() No functional change intended. Reviewed by: glebius MFC after: 1 week Sponsored by: Klara, Inc. Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D38361	2023-02-03 11:48:25 -05:00
Mark Johnston	675e2618ae	inpcb: Deduplicate some assertions It makes more sense to check lookupflags in the function which actually uses SMR. No functional change intended. Reviewed by: glebius MFC after: 1 week Sponsored by: Klara, Inc. Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D38359	2023-02-03 11:48:25 -05:00
Gleb Smirnoff	e68b379244	tcp: embed inpcb into tcpcb For the TCP protocol inpcb storage specify allocation size that would provide space to most of the data a TCP connection needs, embedding into struct tcpcb several structures, that previously were allocated separately. The most important one is the inpcb itself. With embedding we can provide strong guarantee that with a valid TCP inpcb the tcpcb is always valid and vice versa. Also we reduce number of allocs/frees per connection. The embedded inpcb is placed in the beginning of the struct tcpcb, since in_pcballoc() requires that. However, later we may want to move it around for cache line efficiency, and this can be done with a little effort. The new intotcpcb() macro is ready for such move. The congestion algorithm data, the TCP timers and osd(9) data are also embedded into tcpcb, and temporary struct tcpcb_mem goes away. There was no extra allocation here, but we went through extra pointer every time we accessed this data. One interesting side effect is that now TCP data is allocated from SMR-protected zone. Potentially this allows the TCP stacks or other TCP related modules to utilize that for their own synchronization. Large part of the change was done with sed script: s/tp->ccv->/tp->t_ccv./g s/tp->ccv/\&tp->t_ccv/g s/tp->cc_algo/tp->t_cc/g s/tp->t_timers->tt_/tp->tt_/g s/CCV\(ccv, osd\)/\&CCV(ccv, t_osd)/g Dependency side effect is that code that needs to know struct tcpcb should also know struct inpcb, that added several <netinet/in_pcb.h>. Differential revision: https://reviews.freebsd.org/D37127	2022-12-07 09:00:48 -08:00
Mark Johnston	d93ec8cb13	inpcb: Allow SO_REUSEPORT_LB to be used in jails Currently SO_REUSEPORT_LB silently does nothing when set by a jailed process. It is trivial to support this option in VNET jails, but it's also useful in traditional jails. This patch enables LB groups in jails with the following semantics: - all PCBs in a group must belong to the same jail, - PCB lookup prefers jailed groups to non-jailed groups This is a straightforward extension of the semantics used for individual listening sockets. One pre-existing quirk of the lbgroup implementation is that non-jailed lbgroups are searched before jailed listening sockets; that is preserved with this change. Discussed with: glebius MFC after: 1 month Sponsored by: Modirum MDPay Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D37029	2022-11-02 13:46:24 -04:00
Mark Johnston	ac1750dd14	inpcb: Remove NULL checks of credential references Some auditing of the code shows that "cred" is never NULL in these functions, either because all callers pass a non-NULL reference or because they unconditionally dereference "cred". So, let's simplify the code a bit and remove NULL checks. No functional change intended. Reviewed by: glebius MFC after: 1 week Sponsored by: Modirum MDPay Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D37025	2022-11-02 13:46:24 -04:00
Gleb Smirnoff	53af690381	tcp: remove INP_TIMEWAIT flag Mechanically cleanup INP_TIMEWAIT from the kernel sources. After `0d7445193a`, this commit shall not cause any functional changes. Note: this flag was very often checked together with INP_DROPPED. If we modify in_pcblookup*() not to return INP_DROPPED pcbs, we will be able to remove most of this checks and turn them to assertions. Some of them can be turned into assertions right now, but that should be carefully done on a case by case basis. Differential revision: https://reviews.freebsd.org/D36400	2022-10-06 19:24:37 -07:00
Gleb Smirnoff	0d7445193a	tcp: remove tcptw, the compressed timewait state structure The memory savings the tcptw brought back in 2003 (see `340c35de6a`) no longer justify the complexity required to maintain it. For longer explanation please check out the email [1]. Surprisingly, through almost 20 years the TCP stack functionality of handling the TIME_WAIT state with a normal tcpcb did not bitrot. The existing tcp_input() properly handles a tcpcb in TCPS_TIME_WAIT state, which is confirmed by the packetdrill tcp-testsuite [2]. This change just removes tcptw and leaves INP_TIMEWAIT. The flag will be removed in a separate commit. This makes it easier to review and possibly debug the changes. [1] https://lists.freebsd.org/archives/freebsd-net/2022-January/001206.html [2] https://github.com/freebsd-net/tcp-testsuite Differential revision: https://reviews.freebsd.org/D36398	2022-10-06 19:22:23 -07:00
Gleb Smirnoff	fcb3f813f3	netinet: remove PRC_ constants and streamline ICMP processing In the original design of the network stack the protocol control input method pr_ctlinput was used to notify the protocols about two very different kinds of events: internal system events and receipt of ICMP messages from outside. These events were coded with PRC_ codes. Today these methods are removed from the protosw(9) and are isolated to IPv4 and IPv6 stacks and are called only from icmp_input(). The PRC_ codes now just create a shim layer between ICMP codes and errors or actions taken by protocols. - Change ipproto_ctlinput_t to pass just pointer to ICMP header. This allows protocols to not deduce it from the internal IP header. - Change ip6proto_ctlinput_t to pass just struct ip6ctlparam pointer. It has all the information needed to the protocols. In the structure, change ip6c_finaldst fields to sockaddr_in6. The reason is that icmp6_input() already has this address wrapped in sockaddr, and the protocols want this address as sockaddr. - For UDP tunneling control input, as well as for IPSEC control input, change the prototypes to accept a transparent union of either ICMP header pointer or struct ip6ctlparam pointer. - In icmp_input() and icmp6_input() do only validation of ICMP header and count bad packets. The translation of ICMP codes to errors/actions is done by protocols. - Provide icmp_errmap() and icmp6_errmap() as substitute to inetctlerrmap, inet6ctlerrmap arrays. - In protocol ctlinput methods either trust what icmp_errmap() recommends, or do our own logic based on the ICMP header. Differential revision: https://reviews.freebsd.org/D36731	2022-10-03 20:53:04 -07:00
Gleb Smirnoff	43d39ca7e5	netinet*: de-void control input IP protocol methods After decoupling of protosw(9) and IP wire protocols in `78b1fc05b2` for IPv4 we got vector ip_ctlprotox[] that is executed only and only from icmp_input() and respectively for IPv6 we got ip6_ctlprotox[] executed only and only from icmp6_input(). This allows to use protocol specific argument types in these methods instead of struct sockaddr and void. Reviewed by: melifaro Differential revision: https://reviews.freebsd.org/D36727	2022-10-03 20:53:04 -07:00
Gleb Smirnoff	a057769205	in_pcb: use jenkins hash over the entire IPv6 (or IPv4) address The intent is to provide more entropy than can be provided by just the 32-bits of the IPv6 address which overlaps with 6to4 tunnels. This is needed to mitigate potential algorithmic complexity attacks from attackers who can control large numbers of IPv6 addresses. Together with: gallatin Reviewed by: dwmalone, rscheff Differential revision: https://reviews.freebsd.org/D33254	2021-12-26 10:47:28 -08:00
Gleb Smirnoff	185e659c40	inpcb: use locked variant of prison_check_ip*() The pcb lookup always happens in the network epoch and in SMR section. We can't block on a mutex due to the latter. Right now this patch opens up a race. But soon that will be addressed by D33339. Reviewed by: markj, jamie Differential revision: https://reviews.freebsd.org/D33340 Fixes: `de2d47842e`	2021-12-14 09:38:52 -08:00
Cy Schubert	db0ac6ded6	Revert "wpa: Import wpa_supplicant/hostapd commit 14ab4a816" This reverts commit `266f97b5e9`, reversing changes made to `a10253cffe`. A mismerge of a merge to catch up to main resulted in files being committed which should not have been.	2021-12-02 14:45:04 -08:00
Cy Schubert	266f97b5e9	wpa: Import wpa_supplicant/hostapd commit 14ab4a816 This is the November update to vendor/wpa committed upstream 2021-11-26. MFC after: 1 month	2021-12-02 13:35:14 -08:00
Gleb Smirnoff	de2d47842e	SMR protection for inpcbs With introduction of epoch(9) synchronization to network stack the inpcb database became protected by the network epoch together with static network data (interfaces, addresses, etc). However, inpcb aren't static in nature, they are created and destroyed all the time, which creates some traffic on the epoch(9) garbage collector. Fairly new feature of uma(9) - Safe Memory Reclamation allows to safely free memory in page-sized batches, with virtually zero overhead compared to uma_zfree(). However, unlike epoch(9), it puts stricter requirement on the access to the protected memory, needing the critical(9) section to access it. Details: - The database is already built on CK lists, thanks to epoch(9). - For write access nothing is changed. - For a lookup in the database SMR section is now required. Once the desired inpcb is found we need to transition from SMR section to r/w lock on the inpcb itself, with a check that inpcb isn't yet freed. This requires some complexity, since SMR section itself is a critical(9) section. The complexity is hidden from KPI users in inp_smr_lock(). - For an inpcb list traversal (a pcblist sysctl, or broadcast notification) also a new KPI is provided, that hides internals of the database - inp_next(struct inp_iterator *). Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D33022	2021-12-02 10:48:48 -08:00
Gleb Smirnoff	565655f4e3	inpcb: reduce some aliased functions after removal of PCBGROUP. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D33021	2021-12-02 10:48:48 -08:00
Gleb Smirnoff	93c67567e0	Remove "options PCBGROUP" With upcoming changes to the inpcb synchronisation it is going to be broken. Even its current status after the move of PCB synchronization to the network epoch is very questionable. This experimental feature was sponsored by Juniper but ended never to be used in Juniper and doesn't exist in their source tree [sjg@, stevek@, jtl@]. In the past (AFAIK, pre-epoch times) it was tried out at Netflix [gallatin@, rrs@] with no positive result and at Yandex [ae@, melifaro@]. I'm up to resurrecting it back if there is any interest from anybody. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D33020	2021-12-02 10:48:48 -08:00
Roy Marples	5c5340108e	net: Allow binding of unspecified address without address existance Previously in_pcbbind_setup returned EADDRNOTAVAIL for empty V_in_ifaddrhead (i.e., no IPv4 addresses configured) and in6_pcbbind did the same for empty V_in6_ifaddrhead (no IPv6 addresses). An equivalent test has existed since 4.4-Lite. It was presumably done to avoid extra work (assuming the address isn't going to be found later). In normal system operation *_ifaddrhead will not be empty: they will at least have the loopback address(es). In practice no work will be avoided. Further, this case caused net/dhcpd to fail when run early in boot before assignment of any addresses. It should be possible to bind the unspecified address even if no addresses have been configured yet, so just remove the tests. The now-removed "XXX broken" comments were added in `59562606b9`, which converted the ifaddr lists to TAILQs. As far as I (emaste) can tell the brokenness is the issue described above, not some aspect of the TAILQ conversion. PR: 253166 Reviewed by: ae, bz, donner, emaste, glebius MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D32563	2021-10-20 19:25:51 -04:00
Gleb Smirnoff	0f617ae48a	Add in_pcb_var.h for KPIs that are private to in_pcb.c and in6_pcb.c.	2021-10-18 10:19:57 -07:00
Gleb Smirnoff	147f018a72	Move in6_pcbsetport() to in6_pcb.c This function was originally carved out of in6_pcbbind(), which is in in6_pcb.c. This function also uses KPI private to the PCB database - in_pcb_lport().	2021-10-18 10:19:03 -07:00
Gordon Bergling	04389c855e	Fix some common typos in comments - s/configuraiton/configuration/ - s/specifed/specified/ - s/compatiblity/compatibility/ MFC after: 5 days	2021-08-08 10:16:06 +02:00
Mark Johnston	f161d294b9	Add missing sockaddr length and family validation to various protocols Several protocol methods take a sockaddr as input. In some cases the sockaddr lengths were not being validated, or were validated after some out-of-bounds accesses could occur. Add requisite checking to various protocol entry points, and convert some existing checks to assertions where appropriate. Reported by: syzkaller+KASAN Reviewed by: tuexen, melifaro MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29519	2021-05-03 13:35:19 -04:00
Gleb Smirnoff	1db08fbe3f	tcp_input: always request read-locking of PCB for any pure SYN segment. This is further rework of `08d9c92027`. Now we carry the knowledge of lock type all the way through tcp_input() and also into tcp_twcheck(). Ideally the rlocking for pure SYNs should propagate all the way into the alternative TCP stacks, but not yet today. This should close a race when socket is bind(2)-ed but not yet listen(2)-ed and a SYN-packet arrives racing with listen(2), discovered recently by pho@.	2021-04-20 10:02:20 -07:00
Gleb Smirnoff	08d9c92027	tcp_input/syncache: acquire only read lock on PCB for SYN,!ACK packets When packet is a SYN packet, we don't need to modify any existing PCB. Normally SYN arrives on a listening socket, we either create a syncache entry or generate syncookie, but we don't modify anything with the listening socket or associated PCB. Thus create a new PCB lookup mode - rlock if listening. This removes the primary contention point under SYN flood - the listening socket PCB. Sidenote: when SYN arrives on a synchronized connection, we still don't need write access to PCB to send a challenge ACK or just to drop. There is only one exclusion - tcptw recycling. However, existing entanglement of tcp_input + stacks doesn't allow to make this change small. Consider this patch as first approach to the problem. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D29576	2021-04-12 08:25:31 -07:00
Alexander V. Chernikov	605284b894	Enforce net epoch in in6_selectsrc(). in6_selectsrc() may call fib6_lookup() in some cases, which requires epoch. Wrap in6_selectsrc* calls into epoch inside its users. Mark it as requiring epoch by adding NET_EPOCH_ASSERT(). MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28647	2021-02-15 22:33:12 +00:00
Andrew Gallatin	a034518ac8	Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain In order to efficiently serve web traffic on a NUMA machine, one must avoid as many NUMA domain crossings as possible. With SO_REUSEPORT_LB, a number of workers can share a listen socket. However, even if a worker sets affinity to a core or set of cores on a NUMA domain, it will receive connections associated with all NUMA domains in the system. This will lead to cross-domain traffic when the server writes to the socket or calls sendfile(), and memory is allocated on the server's local NUMA node, but transmitted on the NUMA node associated with the TCP connection. Similarly, when the server reads from the socket, he will likely be reading memory allocated on the NUMA domain associated with the TCP connection. This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A server can now tell the kernel to filter traffic so that only incoming connections associated with the desired NUMA domain are given to the server. (Of course, in the case where there are no servers sharing the listen socket on some domain, then as a fallback, traffic will be hashed as normal to all servers sharing the listen socket regardless of domain). This allows a server to deal only with traffic that is local to its NUMA domain, and avoids cross-domain traffic in most cases. This patch, and a corresponding small patch to nginx to use TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted https media content from dual-socket Xeons with only 13% (as measured by pcm.x) cross domain traffic on the memory controller. Reviewed by: jhb, bz (earlier version), bcr (man page) Tested by: gonzo Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21636	2020-12-19 22:04:46 +00:00
Jonathan T. Looney	440598dd9e	Fix implicit automatic local port selection for IPv6 during connect calls. When a user creates a TCP socket and tries to connect to the socket without explicitly binding the socket to a local address, the connect call implicitly chooses an appropriate local port. When evaluating candidate local ports, the algorithm checks for conflicts with existing ports by doing a lookup in the connection hash table. In this circumstance, both the IPv4 and IPv6 code look for exact matches in the hash table. However, the IPv4 code goes a step further and checks whether the proposed 4-tuple will match wildcard (e.g. TCP "listen") entries. The IPv6 code has no such check. The missing wildcard check can cause problems when connecting to a local server. It is possible that the algorithm will choose the same value for the local port as the foreign port uses. This results in a connection with identical source and destination addresses and ports. Changing the IPv6 code to align with the IPv4 code's behavior fixes this problem. Reviewed by: tuexen Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D27164	2020-11-14 14:50:34 +00:00
Alexander V. Chernikov	0c325f53f1	Implement flowid calculation for outbound connections to balance connections over multiple paths. Multipath routing relies on mbuf flowid data for both transit and outbound traffic. Current code fills mbuf flowid from inp_flowid for connection-oriented sockets. However, inp_flowid is currently not calculated for outbound connections. This change creates simple hashing functions and starts calculating hashes for TCP,UDP/UDP-Lite and raw IP if multipath routes are present in the system. Reviewed by: glebius (previous version),ae Differential Revision: https://reviews.freebsd.org/D26523	2020-10-18 17:15:47 +00:00
Mike Karels	2510235150	Allow TCP to reuse local port with different destinations Previously, tcp_connect() would bind a local port before connecting, forcing the local port to be unique across all outgoing TCP connections for the address family. Instead, choose a local port after selecting the destination and the local address, requiring only that the tuple is unique and does not match a wildcard binding. Reviewed by: tuexen (rscheff, rrs previous version) MFC after: 1 month Sponsored by: Forcepoint LLC Differential Revision: https://reviews.freebsd.org/D24781	2020-05-18 22:53:12 +00:00
Alexander V. Chernikov	983066f05b	Convert route caching to nexthop caching. This change is built on top of nexthop objects introduced in r359823. Nexthops are separate datastructures, containing all necessary information to perform packet forwarding such as gateway interface and mtu. Nexthops are shared among the routes, providing more pre-computed cache-efficient data while requiring less memory. Splitting the LPM code and the attached data solves multiple long-standing problems in the routing layer, drastically reduces the coupling with other parts of the stack and allows to transparently introduce faster lookup algorithms. Route caching was (re)introduced to minimise (slow) routing lookups, allowing for notably better performance for large TCP senders. Caching works by acquiring rtentry reference, which is protected by per-rtentry mutex. If the routing table is changed (checked by comparing the rtable generation id) or link goes down, cache record gets withdrawn. Nexthops have the same reference counting interface, backed by refcount(9). This change merely replaces rtentry with the actual forwarding nexthop as a cached object, which is mostly mechanical. Other moving parts like cache cleanup on rtable change remain the same. Differential Revision: https://reviews.freebsd.org/D24340	2020-04-25 09:06:11 +00:00
Michael Tuexen	fe1274ee39	Fix race when accepting TCP connections. When expanding a SYN-cache entry to a socket/inp a two step approach was taken: 1) The local address was filled in, then the inp was added to the hash table. 2) The remote address was filled in and the inp was relocated in the hash table. Before the epoch changes, a write lock was held when this happens and the code looking up entries was holding a corresponding read lock. Since the read lock is gone away after the introduction of the epochs, the half populated inp was found during lookup. This resulted in processing TCP segments in the context of the wrong TCP connection. This patch changes the above procedure in a way that the inp is fully populated before inserted into the hash table. Thanks to Paul <devgs@ukr.net> for reporting the issue on the net@ mailing list and for testing the patch! Reviewed by: rrs@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D22971	2020-01-12 17:52:32 +00:00
Gleb Smirnoff	c17cd08f53	It is unclear why in6_pcblookup_local() would require write access to the PCB hash. The function doesn't modify the hash. It always asserted write lock historically, but with epoch conversion this fails in some special cases. Reviewed by: rwatson, bz Reported-by: syzbot+0b0488ca537e20cb2429@syzkaller.appspotmail.com	2019-11-11 06:28:25 +00:00
Gleb Smirnoff	d797164a86	Since r353292 on input path we are always in network epoch, when we lookup PCBs. Thus, do not enter epoch recursively in in_pcblookup_hash() and in6_pcblookup_hash(). Same applies to tcp_ctlinput() and tcp6_ctlinput(). This leaves several sysctl(9) handlers that return PCB credentials unprotected. Add epoch enter/exit to all of them. Differential Revision: https://reviews.freebsd.org/D22197	2019-11-07 20:49:56 +00:00
Bjoern A. Zeeb	0ecd976e80	IPv6 cleanup: kernel Finish what was started a few years ago and harmonize IPv6 and IPv4 kernel names. We are down to very few places now, so it is feasible to do the change for everything remaining without causing too much disturbance. Remove "aliases" for IPv6 names which confusingly could indicate that we are talking about a different data structure or field or have two fields, one for each address family. Try to follow common conventions used in FreeBSD. * Rename sin6p to sin6 as that is how it is spelt in most places. * Remove "aliases" (#defines) for: - in6pcb which really is an inpcb and nothing separate - sotoin6pcb which is sotoinpcb (as per above) - in6p_sp which is inp_sp - in6p_flowinfo which is inp_flow * Try to use ia6 for in6_addr rather than in6p. * With all these gone also rename the in6p variables to inp as that is what we call it in most of the network stack including parts of netinet6. The reasons behind this cleanup are that we try to further unify netinet and netinet6 code where possible and that people will be less likely to ignore one or the other protocol family when doing code changes, as they may not have spotted places due to different names for the same thing. No functional changes. Discussed with: tuexen (SCTP changes) MFC after: 3 months Sponsored by: Netflix	2019-08-02 07:41:36 +00:00
Hans Petter Selasky	59854ecf55	Convert all IPv4 and IPv6 multicast memberships into using a STAILQ instead of a linear array. The multicast memberships for the inpcb structure are protected by a non-sleepable lock, INP_WLOCK(), which needs to be dropped when calling the underlying possibly sleeping if_ioctl() method. When using a linear array to keep track of multicast memberships, the computed memory location of the multicast filter may suddenly change, due to concurrent insertion or removal of elements in the linear array. This in turn leads to various invalid memory access issues and kernel panics. To avoid this problem, put all multicast memberships on a STAILQ based list. Then the memory location of the IPv4 and IPv6 multicast filters become fixed during their lifetime and use after free and memory leak issues are easier to track, for example by: vmstat -m \| grep multi All list manipulation has been factored into inline functions including some macros, to easily allow for a future hash-list implementation, if needed. This patch has been tested by pho@ . Differential Revision: https://reviews.freebsd.org/D20080 Reviewed by: markj @ MFC after: 1 week Sponsored by: Mellanox Technologies	2019-06-25 11:54:41 +00:00
Gleb Smirnoff	a68cc38879	Mechanical cleanup of epoch(9) usage in network stack. - Remove macros that covertly create epoch_tracker on thread stack. Such macros are quite unsafe, e.g. they will produce buggy code if the same macro is used in nested scopes. Explicitly declare epoch_tracker always. - Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read locking macros to what they actually are - the net_epoch. Keeping them as is is very misleading. They all are named FOO_RLOCK(), while they no longer have lock semantics. Now they allow recursion and, more importantly, they no longer guarantee protection against their companion WLOCK macros. Note: INP_HASH_RLOCK() has the same problems, but is not touched by this commit. This is a non-functional mechanical change. The only functionally changed functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter epoch recursively. Discussed with: jtl, gallatin	2019-01-09 01:11:19 +00:00
Mateusz Guzik	cc426dd319	Remove unused argument to priv_check_cred. Patch mostly generated with coccinelle: @@ expression E1,E2; @@ - priv_check_cred(E1,E2,0) + priv_check_cred(E1,E2) Sponsored by: The FreeBSD Foundation	2018-12-11 19:32:16 +00:00
Mark Johnston	9d2877fc3d	Clamp the INPCB port hash tables to IPPORT_MAX + 1 chains. Memory beyond that limit was previously unused, wasting roughly 1MB per 8GB of RAM. Also retire INP_PCBLBGROUP_PORTHASH, which was identical to INP_PCBPORTHASH. Reviewed by: glebius MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D17803	2018-12-05 17:06:00 +00:00
Mark Johnston	d9ff5789be	Remove redundant checks for a NULL lbgroup table. No functional change intended. MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17108	2018-11-01 15:52:49 +00:00
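
The clamp described in commit `9d2877fc3d` can be illustrated with a minimal sketch. It assumes the port hash picks a chain by masking the 16-bit port number, so chains beyond index `IPPORT_MAX` can never be selected. The names `SKETCH_IPPORT_MAX`, `sketch_clamp_porthash()` and `sketch_porthash()` are hypothetical and are not taken from the FreeBSD sources.

```c
#include <stdint.h>

/* Assumption: highest valid port number, as IPPORT_MAX is in FreeBSD. */
#define SKETCH_IPPORT_MAX 65535u

/* Hypothetical helper: cap a requested chain count at one chain per port. */
static unsigned int
sketch_clamp_porthash(unsigned int nelements)
{
	unsigned int limit = SKETCH_IPPORT_MAX + 1;

	return (nelements > limit ? limit : nelements);
}

/* Hypothetical helper: mask-based chain index for a host-order port. */
static unsigned int
sketch_porthash(uint16_t port, unsigned int mask)
{
	return (port & mask);
}
```

Because a port fits in 16 bits, `sketch_porthash()` never returns an index above 65535 regardless of how large the mask is, which is why any chains allocated past that point would sit unused.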


222 Commits