RFC 4620 is an experimental RFC that can be used to request information
about a host, including:
- the fully-qualified or single-component name
- some set of the Responder's IPv6 unicast addresses
- some set of the Responder's IPv4 unicast addresses
This is not something that should be made available by default.
PR: 257709
Submitted by: ruben@verweg.com
Reviewed by: melifaro
Relnotes: Yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D39778
The SMR-protected inpcb lookup algorithm currently has to check whether
a matching inpcb belongs to a jail, in order to prioritize jailed
bound sockets. To do this it has to maintain a ucred reference, and for
this to be safe, the reference can't be released until the UMA
destructor is called, and this will not happen within any bounded time
period.
Changing SMR to periodically recycle garbage is not trivial. Instead,
let's implement SMR-synchronized lookup without needing to dereference
inp_cred. This will allow the inpcb code to free the inp_cred reference
immediately when a PCB is freed, ensuring that ucred (and thus jail)
references are released promptly.
Commit 220d892129 ("inpcb: immediately return matching pcb on lookup")
gets us part of the way there. This patch goes further to handle
lookups of unconnected sockets. Here, the strategy is to maintain a
well-defined order of items within a hash chain so that a wild lookup
can simply return the first match and preserve existing semantics. This
makes insertion of listening sockets more complicated in order to make
lookup simpler, which seems like the right tradeoff anyway given that
bind() is already a fairly expensive operation and lookups are more
common.
In particular, when inserting an unconnected socket, in_pcbinhash() now
keeps the following ordering:
- jailed sockets before non-jailed sockets,
- specified local addresses before unspecified local addresses.
Most of the change adds a separate SMR-based lookup path for inpcb hash
lookups. When a match is found, we try to lock the inpcb and
re-validate its connection info. In the common case, this works well
and we can simply return the inpcb. If this fails, typically because
something is concurrently modifying the inpcb, we go to the slow path,
which performs a serialized lookup.
Note, I did not touch lbgroup lookup, since there the credential
reference is formally synchronized by net_epoch, not SMR. In
particular, lbgroups are rarely allocated or freed.
I think it is possible to simplify in_pcblookup_hash_wild_locked() now,
but I didn't do it in this patch.
Discussed with: glebius
Tested by: glebius
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38572
These functions will get some additional callers in future revisions.
No functional change intended.
Discussed with: glebius
Tested by: glebius
Sponsored by: Modirum MDPay
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D38571
Currently we use a single hash table per PCB database for connected and
bound PCBs. Since we started using net_epoch to synchronize hash table
lookups, there's been a bug, noted in a comment above in_pcbrehash():
connecting a socket can cause an inpcb to move between hash chains, and
this can cause a concurrent lookup to follow the wrong linkage pointers.
I believe this could cause rare, spurious ECONNREFUSED errors in the
worse case.
Address the problem by introducing a second hash table and adding more
linkage pointers to struct inpcb. Now the database has one table each
for connected and unconnected sockets.
When inserting an inpcb into the hash table, in_pcbinhash() now looks at
the foreign address of the inpcb to figure out which table to use. This
ensures that queue linkage pointers are stable until the socket is
disconnected, so the problem described above goes away. There is also a
small benefit in that in_pcblookup_*() can now search just one of the
two possible hash buckets.
I also made the "rehash" parameter of in(6)_pcbconnect() unused. This
parameter seems confusing and it is simpler to let the inpcb code figure
out what to do using the existing INP_INHASHLIST flag.
UDP sockets pose a special problem since they can be connected and
disconnected multiple times during their lifecycle. To handle this, the
patch plugs a hole in the inpcb structure and uses it to store an SMR
sequence number. When an inpcb is disconnected - an operation which
requires the global PCB database hash lock - the write sequence number
is advanced, and in order to reconnect, the connecting thread must wait
for readers to drain before reusing the inpcb's hash chain linkage
pointers.
raw_ip (ab)uses the hash table without using the corresponding
accessors. Since there are now two hash tables, it arbitrarily uses the
"connected" table for all of its PCBs. This will be addressed in some
way in the future.
inp interators which specify a hash bucket will only visit connected
PCBs. This is not really correct, but nothing in the tree uses that
functionality except raw_ip, which as mentioned above places all of its
PCBs in the "connected" table and so is unaffected.
Discussed with: glebius
Tested by: glebius
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38569
This is a total hack/bare minimum which follows inet4.
Otherwise 2 threads removing the same address can easily crash.
Reviewed by: kp
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D39317
ip6_input() and ip6_destroy() both directly reference ifnet members.
This file was missed in 3d0d5b21
Fixes: 3d0d5b21 ("IfAPI: Explicitly include <net/if_private.h>...")
Sponsored by: Juniper Networks, Inc.
Re-introduce PFIL_FWD, because pf's pf_refragment6() needs to know if
we're ip6_forward()-ing or ip6_output()-ing.
ip6_forward() relies on m->m_pkthdr.rcvif, at least for link-local
traffic (for in6_get_unicast_scopeid()). rcvif is not set for locally
generated traffic (e.g. from icmp6_reflect()), so we need to call the
correct output function.
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revisi: https://reviews.freebsd.org/D39061
these functions exclusively return (0) and (1), so convert them to bool
We also convert some networking related jail functions from int to bool
some of which were returning an error that was never used.
Differential Revision: https://reviews.freebsd.org/D29659
Reviewed by: imp, jamie (earlier version)
Pull Request: https://github.com/freebsd/freebsd-src/pull/663
RFC 4443 specifies cases where certain packets, like those originating from
local-scope addresses destined outside of the scope shouldn't be forwarded.
The current practice is to drop them, send ICMPv6 message where appropriate,
and log the message:
cannot forward src fe80:10::426:82ff:fe36:1d8, dst 2001:db8:db8::10, nxt
58, rcvif vlan5, outif vlan2
At times the volume of such messages cat get very high. Let's allow local
admins to disable such messages on per vnet basis, keeping the current
default (log).
Reported by: zarychtam@plan-b.pwste.edu.pl
Reviewed by: zlei (previous version), pauamma (docs)
Differential Revision: https://reviews.freebsd.org/D38644
The assertions added in commit b0ccf53f24 ("inpcb: Assert against
wildcard addrs in in_pcblookup_hash_locked()") revealed that protocol
layers may pass the unspecified address to in_pcblookup().
Add some checks to filter out such packets before we attempt an inpcb
lookup:
- Disallow the use of an unspecified source address in in_pcbladdr() and
in6_pcbladdr().
- Disallow IP packets with an unspecified destination address.
- Disallow TCP packets with an unspecified source address, and add an
assertion to verify the comment claiming that the case of an
unspecified destination address is handled by the IP layer.
Reported by: syzbot+9ca890fb84e984e82df2@syzkaller.appspotmail.com
Reported by: syzbot+ae873c71d3c71d5f41cb@syzkaller.appspotmail.com
Reported by: syzbot+e3e689aba1d442905067@syzkaller.appspotmail.com
Reviewed by: glebius, melifaro
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38570
It has no effect, and an exp-run revealed that it is not in use.
PR: 261398 (exp-run)
Reviewed by: mjg, glebius
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D38822
This option was added in commit 0a100a6f1e but was never completed.
In particular, there is no logic to map flowids to different listening
sockets, so it accomplishes basically the same thing as SO_REUSEPORT.
Meanwhile, we've since added SO_REUSEPORT_LB, which at least tries to
balance among listening sockets using a hash of the 4-tuple and some
optional NUMA policy.
The option was never documented or completed, and an exp-run revealed
nothing using it in the ports tree. Moreover, it complicates the
already very complicated in_pcbbind_setup(), and the checking in
in_pcbbind_check_bindmulti() is insufficient. So, let's remove it.
PR: 261398 (exp-run)
Reviewed by: glebius
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D38574
Do the cast from sockaddr to either IPv4 or IPv6 sockaddr in the
protocol's pr_bind method and from there on go down the call
stack with family specific argument.
Reviewed by: zlei, melifaro, markj
Differential Revision: https://reviews.freebsd.org/D38601
Split the in_pcblookup_hash_locked() function into several independent
subroutine calls, each of which does some kind of hash table lookup.
This refactoring makes it easier to introduce variants of the lookup
algorithm that behave differently depending on whether they are
synchronized by SMR or the PCB database hash lock.
While here, do some related cleanup:
- Remove an unused ifnet parameter from internal functions. Keep it in
external functions so that it can be used in the future to derive a v6
scopeid.
- Reorder the parameters to in_pcblookup_lbgroup() to be consistent with
the other lookup functions.
- Remove an always-true check from in_pcblookup_lbgroup(): we can assume
that we're performing a wildcard match.
No functional change intended.
Reviewed by: glebius
Differential Revision: https://reviews.freebsd.org/D38364
This saves a lot of CPU cycles if you got large connection table.
The code removed originates from 413628a7e3, a very large changeset.
Discussed that with Bjoern, Jamie we can't recover why would we ever
have identical 4-tuples in the hash, even in the presence of jails.
Bjoern did a test that confirms that it is impossible to allocate an
identical connection from a jail to a host. Code review also confirms
that system shouldn't allow for such connections to exist.
With a lack of proper test suite we decided to take a risk and go
forward with removing that code.
Reviewed by: gallatin, bz, markj
Differential Revision: https://reviews.freebsd.org/D38015
Do the cast from sockaddr to either IPv4 or IPv6 sockaddr in the
protocol's pr_connect method and from there on go down the call
stack with family specific argument.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D38356
This removes recursive epoch entry in the syncache case. Fixes
unprotected access to V_in6_ifaddrhead in in6_pcbladdr(), as
well as access to prison IP address lists. It also matches what
IPv4 in_pcbconnect() does.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D38355
All callers of in_pcbdisconnect() clear the local address, so let's just
do that in the function itself.
Note that the inp's local address is not a parameter to the inp hash
functions. No functional change intended.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38362
It makes more sense to check lookupflags in the function which actually
uses SMR. No functional change intended.
Reviewed by: glebius
MFC after: 1 week
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38359
Summary:
In preparation of making if_t completely opaque outside of the netstack,
explicitly include the header. <net/if_var.h> will stop including the
header in the future.
Sponsored by: Juniper Networks, Inc.
Reviewed by: glebius, melifaro
Differential Revision: https://reviews.freebsd.org/D38200
Summary:
in6m_lookup_locked() iterates over the ifnet's multiaddrs list. Keep
this implementation detail private, by moving the implementation to the
netstack source from the header.
Sponsored by: Juniper Networks, Inc.
Reviewed by: glebius, melifaro
Differential Revision: https://reviews.freebsd.org/D38201
Currently, under the conditions specified below, IPv6 ingress packet
processing can ignore blackhole/reject flag on the prefix. The packet
will instead be looped locally till TTL expiration and a single ICMPv6
unreachable message will be send to the source even in case of
RTF_BLACKHOLE.
The following conditions needs hold to make the scenario happen:
* IPv6 forwarding is enabled
* Packet is not fast-forwarded
* Destination prefix has either RTF_BLACKHOLE or RTF_REJECT flag
Fix this behavior by checking for the blackhole/reject flags in
ip6_forward().
Reported by: Dmitriy Smirnov <fox@sage.su>
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D38164
MFC after: 3 days
nd6_resolve_slow() can be called without mbuf. If the LLE entry
is not reachable, nd6_resolve_slow() will add this NULL mbuf to
the holdchain via lltable_append_entry_queue, which will "append"
NULL to the end of the queue (effectively no-op) and bump la_numhold
value. When this entry gets freed, the kernel will panic due to the
inconsistency between the amount of mbufs in the queue and the value
of la_numhold.
Fix the panic by checking of mbuf is not NULL prior to inserting it
into the holdchain.
Reported by: kib
MFC after: 3 days
mli_delete_locked() is the only function that takes a const ifnet.
Since it's a static function there's no advantage to keeping it const.
Since `if_t` is not a const struct (currently) the compiler throws an
error passing the ifp around to ifnet functions.
Fixes: eb1da3e525
Sponsored by: Juniper Networks, Inc.
They are shared between UDP over IPv4 and over IPv6. To prevent all
possible kernel build failures wrap them in #ifdef _SYS_PROTOSW_H_.
Prompted by feedback from jhb@ and jrtc27@ on c93db4abf4.
For the TCP protocol inpcb storage specify allocation size that would
provide space to most of the data a TCP connection needs, embedding
into struct tcpcb several structures, that previously were allocated
separately.
The most import one is the inpcb itself. With embedding we can provide
strong guarantee that with a valid TCP inpcb the tcpcb is always valid
and vice versa. Also we reduce number of allocs/frees per connection.
The embedded inpcb is placed in the beginning of the struct tcpcb,
since in_pcballoc() requires that. However, later we may want to move
it around for cache line efficiency, and this can be done with a little
effort. The new intotcpcb() macro is ready for such move.
The congestion algorithm data, the TCP timers and osd(9) data are
also embedded into tcpcb, and temprorary struct tcpcb_mem goes away.
There was no extra allocation here, but we went through extra pointer
every time we accessed this data.
One interesting side effect is that now TCP data is allocated from
SMR-protected zone. Potentially this allows the TCP stacks or other
TCP related modules to utilize that for their own synchronization.
Large part of the change was done with sed script:
s/tp->ccv->/tp->t_ccv./g
s/tp->ccv/\&tp->t_ccv/g
s/tp->cc_algo/tp->t_cc/g
s/tp->t_timers->tt_/tp->tt_/g
s/CCV\(ccv, osd\)/\&CCV(ccv, t_osd)/g
Dependency side effect is that code that needs to know struct tcpcb
should also know struct inpcb, that added several <netinet/in_pcb.h>.
Differential revision: https://reviews.freebsd.org/D37127
If udp[6]_append() returns non-zero, it is because the inp has gone
away (inpcbrele_rlocked returned 1 after running the tunnel function).
Reviewed by: ae
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D37511
Currently SO_REUSEPORT_LB silently does nothing when set by a jailed
process. It is trivial to support this option in VNET jails, but it's
also useful in traditional jails.
This patch enables LB groups in jails with the following semantics:
- all PCBs in a group must belong to the same jail,
- PCB lookup prefers jailed groups to non-jailed groups
This is a straightforward extension of the semantics used for individual
listening sockets. One pre-existing quirk of the lbgroup implementation
is that non-jailed lbgroups are searched before jailed listening
sockets; that is preserved with this change.
Discussed with: glebius
MFC after: 1 month
Sponsored by: Modirum MDPay
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D37029
It is ok to use memcmp() in the kernel. No functional change intended.
Reviewed by: glebius, melifaro
MFC after: 1 week
Sponsored by: Modirum MDPay
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D37028
Some auditing of the code shows that "cred" is never non-NULL in these
functions, either because all callers pass a non-NULL reference or
because they unconditionally dereference "cred". So, let's simplify the
code a bit and remove NULL checks. No functional change intended.
Reviewed by: glebius
MFC after: 1 week
Sponsored by: Modirum MDPay
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D37025
This commit brings back the driver from FreeBSD commit
f187d6dfbf plus subsequent fixes from
upstream.
Relative to upstream this commit includes a few other small fixes such
as additional INET and INET6 #ifdef's, #include cleanups, and updates
for recent API changes in main.
Reviewed by: pauamma, gbe, kevans, emaste
Obtained from: git@git.zx2c4.com:wireguard-freebsd @ 3cc22b2
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D36909
Mechanically cleanup INP_TIMEWAIT from the kernel sources. After
0d7445193a, this commit shall not cause any functional changes.
Note: this flag was very often checked together with INP_DROPPED.
If we modify in_pcblookup*() not to return INP_DROPPED pcbs, we
will be able to remove most of this checks and turn them to
assertions. Some of them can be turned into assertions right now,
but that should be carefully done on a case by case basis.
Differential revision: https://reviews.freebsd.org/D36400
The memory savings the tcptw brought back in 2003 (see 340c35de6a) no
longer justify the complexity required to maintain it. For longer
explanation please check out the email [1].
Surpisingly through almost 20 years the TCP stack functionality of
handling the TIME_WAIT state with a normal tcpcb did not bitrot. The
existing tcp_input() properly handles a tcpcb in TCPS_TIME_WAIT state,
which is confirmed by the packetdrill tcp-testsuite [2].
This change just removes tcptw and leaves INP_TIMEWAIT. The flag will
be removed in a separate commit. This makes it easier to review and
possibly debug the changes.
[1] https://lists.freebsd.org/archives/freebsd-net/2022-January/001206.html
[2] https://github.com/freebsd-net/tcp-testsuite
Differential revision: https://reviews.freebsd.org/D36398