Currently, under the conditions specified below, IPv6 ingress packet
processing can ignore blackhole/reject flag on the prefix. The packet
will instead be looped locally till TTL expiration and a single ICMPv6
unreachable message will be send to the source even in case of
RTF_BLACKHOLE.
The following conditions needs hold to make the scenario happen:
* IPv6 forwarding is enabled
* Packet is not fast-forwarded
* Destination prefix has either RTF_BLACKHOLE or RTF_REJECT flag
Fix this behavior by checking for the blackhole/reject flags in
ip6_forward().
Reported by: Dmitriy Smirnov <fox@sage.su>
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D38164
MFC after: 3 days
nd6_resolve_slow() can be called without mbuf. If the LLE entry
is not reachable, nd6_resolve_slow() will add this NULL mbuf to
the holdchain via lltable_append_entry_queue, which will "append"
NULL to the end of the queue (effectively no-op) and bump la_numhold
value. When this entry gets freed, the kernel will panic due to the
inconsistency between the amount of mbufs in the queue and the value
of la_numhold.
Fix the panic by checking of mbuf is not NULL prior to inserting it
into the holdchain.
Reported by: kib
MFC after: 3 days
mli_delete_locked() is the only function that takes a const ifnet.
Since it's a static function there's no advantage to keeping it const.
Since `if_t` is not a const struct (currently) the compiler throws an
error passing the ifp around to ifnet functions.
Fixes: eb1da3e525
Sponsored by: Juniper Networks, Inc.
They are shared between UDP over IPv4 and over IPv6. To prevent all
possible kernel build failures wrap them in #ifdef _SYS_PROTOSW_H_.
Prompted by feedback from jhb@ and jrtc27@ on c93db4abf4.
For the TCP protocol inpcb storage specify allocation size that would
provide space to most of the data a TCP connection needs, embedding
into struct tcpcb several structures, that previously were allocated
separately.
The most import one is the inpcb itself. With embedding we can provide
strong guarantee that with a valid TCP inpcb the tcpcb is always valid
and vice versa. Also we reduce number of allocs/frees per connection.
The embedded inpcb is placed in the beginning of the struct tcpcb,
since in_pcballoc() requires that. However, later we may want to move
it around for cache line efficiency, and this can be done with a little
effort. The new intotcpcb() macro is ready for such move.
The congestion algorithm data, the TCP timers and osd(9) data are
also embedded into tcpcb, and temprorary struct tcpcb_mem goes away.
There was no extra allocation here, but we went through extra pointer
every time we accessed this data.
One interesting side effect is that now TCP data is allocated from
SMR-protected zone. Potentially this allows the TCP stacks or other
TCP related modules to utilize that for their own synchronization.
Large part of the change was done with sed script:
s/tp->ccv->/tp->t_ccv./g
s/tp->ccv/\&tp->t_ccv/g
s/tp->cc_algo/tp->t_cc/g
s/tp->t_timers->tt_/tp->tt_/g
s/CCV\(ccv, osd\)/\&CCV(ccv, t_osd)/g
Dependency side effect is that code that needs to know struct tcpcb
should also know struct inpcb, that added several <netinet/in_pcb.h>.
Differential revision: https://reviews.freebsd.org/D37127
If udp[6]_append() returns non-zero, it is because the inp has gone
away (inpcbrele_rlocked returned 1 after running the tunnel function).
Reviewed by: ae
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D37511
Currently SO_REUSEPORT_LB silently does nothing when set by a jailed
process. It is trivial to support this option in VNET jails, but it's
also useful in traditional jails.
This patch enables LB groups in jails with the following semantics:
- all PCBs in a group must belong to the same jail,
- PCB lookup prefers jailed groups to non-jailed groups
This is a straightforward extension of the semantics used for individual
listening sockets. One pre-existing quirk of the lbgroup implementation
is that non-jailed lbgroups are searched before jailed listening
sockets; that is preserved with this change.
Discussed with: glebius
MFC after: 1 month
Sponsored by: Modirum MDPay
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D37029
It is ok to use memcmp() in the kernel. No functional change intended.
Reviewed by: glebius, melifaro
MFC after: 1 week
Sponsored by: Modirum MDPay
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D37028
Some auditing of the code shows that "cred" is never non-NULL in these
functions, either because all callers pass a non-NULL reference or
because they unconditionally dereference "cred". So, let's simplify the
code a bit and remove NULL checks. No functional change intended.
Reviewed by: glebius
MFC after: 1 week
Sponsored by: Modirum MDPay
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D37025
This commit brings back the driver from FreeBSD commit
f187d6dfbf plus subsequent fixes from
upstream.
Relative to upstream this commit includes a few other small fixes such
as additional INET and INET6 #ifdef's, #include cleanups, and updates
for recent API changes in main.
Reviewed by: pauamma, gbe, kevans, emaste
Obtained from: git@git.zx2c4.com:wireguard-freebsd @ 3cc22b2
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D36909
Mechanically cleanup INP_TIMEWAIT from the kernel sources. After
0d7445193a, this commit shall not cause any functional changes.
Note: this flag was very often checked together with INP_DROPPED.
If we modify in_pcblookup*() not to return INP_DROPPED pcbs, we
will be able to remove most of this checks and turn them to
assertions. Some of them can be turned into assertions right now,
but that should be carefully done on a case by case basis.
Differential revision: https://reviews.freebsd.org/D36400
The memory savings the tcptw brought back in 2003 (see 340c35de6a) no
longer justify the complexity required to maintain it. For longer
explanation please check out the email [1].
Surpisingly through almost 20 years the TCP stack functionality of
handling the TIME_WAIT state with a normal tcpcb did not bitrot. The
existing tcp_input() properly handles a tcpcb in TCPS_TIME_WAIT state,
which is confirmed by the packetdrill tcp-testsuite [2].
This change just removes tcptw and leaves INP_TIMEWAIT. The flag will
be removed in a separate commit. This makes it easier to review and
possibly debug the changes.
[1] https://lists.freebsd.org/archives/freebsd-net/2022-January/001206.html
[2] https://github.com/freebsd-net/tcp-testsuite
Differential revision: https://reviews.freebsd.org/D36398
When system acts as CARP BACKUP ignore received IPv6 Neighbor Advertisements
to ensure that neighbor cache will not be changed.
Also do not send IPv6 Neighbor Solicitation from CARP BACKUP source address.
Such packets can confuse network switch and it detects MAC addresses
flapping.
Obtained from: Yandex LLC
MFC after: 2 weeks
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D36649
In the original design of the network stack from the protocol control
input method pr_ctlinput was used notify the protocols about two very
different kinds of events: internal system events and receival of an
ICMP messages from outside. These events were coded with PRC_ codes.
Today these methods are removed from the protosw(9) and are isolated
to IPv4 and IPv6 stacks and are called only from icmp*_input(). The
PRC_ codes now just create a shim layer between ICMP codes and errors
or actions taken by protocols.
- Change ipproto_ctlinput_t to pass just pointer to ICMP header. This
allows protocols to not deduct it from the internal IP header.
- Change ip6proto_ctlinput_t to pass just struct ip6ctlparam pointer.
It has all the information needed to the protocols. In the structure,
change ip6c_finaldst fields to sockaddr_in6. The reason is that
icmp6_input() already has this address wrapped in sockaddr, and the
protocols want this address as sockaddr.
- For UDP tunneling control input, as well as for IPSEC control input,
change the prototypes to accept a transparent union of either ICMP
header pointer or struct ip6ctlparam pointer.
- In icmp_input() and icmp6_input() do only validation of ICMP header and
count bad packets. The translation of ICMP codes to errors/actions is
done by protocols.
- Provide icmp_errmap() and icmp6_errmap() as substitute to inetctlerrmap,
inet6ctlerrmap arrays.
- In protocol ctlinput methods either trust what icmp_errmap() recommend,
or do our own logic based on the ICMP header.
Differential revision: https://reviews.freebsd.org/D36731
Now these functions are called only from icmp*_input(). The pointer
to the ICMP data is never NULL and cmd has a limited set of values.
In the past the functions were demultiplexing control messages from
ICMP layer, as well as internally generated events. In the latter
case the the pointer to IP would be NULL.
Differential revision: https://reviews.freebsd.org/D36729
and mark those PRC_* codes, that are used. The rest are dead code.
This is not a functional change, but illustrative to make easier
review of following changes.
After decoupling of protosw(9) and IP wire protocols in 78b1fc05b2 for
IPv4 we got vector ip_ctlprotox[] that is executed only and only from
icmp_input() and respectively for IPv6 we got ip6_ctlprotox[] executed
only and only from icmp6_input(). This allows to use protocol specific
argument types in these methods instead of struct sockaddr and void.
Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36727
The netinet/ipprotosw.h and netinet6/ip6protosw.h were KAME relics, with
the former removed in f0ffb944d2 in 2001 and the latter survived until
today. It has been reduced down to only one useful declaration that
moves to ip6_var.h
Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36726
Summary:
Move SIOCAIFADDR_IN6 (current "primary" ioctl to add an IPv6
interface address) handling code to the dedicated in6_addifaddr()
function and make it a part of KPI. This allows in-kernel users to
add/delete interfaces addresses without relying on ioctl interface.
Subscribers: imp, ae, glebius
Differential Revision: https://reviews.freebsd.org/D36713
It is now unused and not having it allows further clean ups.
Reviewed by: cy, glebius, kp
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D36452
Fix nd6_na_input() eventhandler call: run eventhandler after lle
state transition.
Old behaviour (as seen by event handler):
* fe80::5054:ff:fe8c:63e9 dev vtnet0 lladdr 52:54:00:8c:63:e9 INCOMPLETE
New behaviour:
* fe80::5054:ff:fe8c:63e9 dev vtnet0 lladdr 52:54:00:8c:63:e9 REACHABLE
MFC after: 2 weeks
With 61f7427f02 raw sockets protosw has wildcard pr_protocol. Protocol
of a specific pcb is stored in inp_ip_p.
Reviewed by: karels
Reported by: karels
Differential revision: https://reviews.freebsd.org/D36429
Fixes: 61f7427f02
Since 4.4BSD the protosw was used to implement socket types created
by socket(2) syscall and at the same to demultiplex incoming IPv4
datagrams (later copied to IPv6). This story ended with 78b1fc05b2.
These entries (e.g. IPPROTO_ICMP) in inetsw that were added to catch
packets in ip_input(), they would also be returned by pffindproto()
if user says socket(AF_INET, SOCK_RAW, IPPROTO_ICMP). Thus, for raw
sockets to work correctly, all the entries were pointing at raw_usrreq
differentiating only in the value of pr_protocol.
With 78b1fc05b2 all these entries are no longer needed, as ip_protox
is independent of protosw. Any socket syscall requesting SOCK_RAW type
would end up with rip_protosw. And this protosw has its pr_protocol
set to 0, allowing to mark socket with any protocol.
For IPv6 raw socket the change required two small fixes:
o Validate user provided protocol value
o Always use protocol number stored in inp in rip6_attach, instead
of protosw value, which is now always 0.
Differential revision: https://reviews.freebsd.org/D36380
The divert(4) is not a protocol of IPv4. It is a socket to
intercept packets from ipfw(4) to userland and re-inject them
back. It can divert and re-inject IPv4 and IPv6 packets today,
but potentially it is not limited to these two protocols. The
IPPROTO_DIVERT does not belong to known IP protocols, it
doesn't even fit into u_char. I guess, the implementation of
divert(4) was done the way it is done basically because it was
easier to do it this way, back when protocols for sockets were
intertwined with IP protocols and domains were statically
compiled in.
Moving divert(4) out of inetsw accomplished two important things:
1) IPDIVERT is getting much closer to be not dependent on INET.
This will be finalized in following changes.
2) Now divert socket no longer aliases with raw IPv4 socket.
Domain/proto selection code won't need a hack for SOCK_RAW and
multiple entries in inetsw implementing different flavors of
raw socket can merge into one without requirement of raw IPv4
being the last member of dom_protosw.
Differential revision: https://reviews.freebsd.org/D36379
Construct the desired hexthops directly instead of using the
"translation" layer in form of filling rt_addrinfo data.
Simplify V_rt_add_addr_allfibs handling by using recently-added
rib_copy_route() to propagate the routes to the non-primary address
fibs.
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D36166
Summary:
Currently netinet6/ code allocates IPv6 prefixes (nd_prefix) for
both manually-assigned addresses and advertised prefixes. As a result,
prefixes from manually-assigned prefixes can be seen in `ndp -p` list
and be cleared via `ndp -P`. The latter relies on the SIOCSPFXFLUSH_IN6
ioctl to clear to prefix list.
The original intent of the SIOCSPFXFLUSH_IN6 was to clear prefixes
originated from the advertising routers:
```
1998-09-02 JINMEI, Tatuya <jinmei@isl.rdc.toshiba.co.jp>
* nd6.c (nd6_ioctl): added 2 new ioctls; SIOCSRTRFLUSH_IN6 and
SIOCSPFXFLUSH_IN6. The former is to flush all default routers
in the default router list, and the latter is to flush all the
prefixes and the addresses derived from them in the prefix list.
```
Restore the intent by marking prefixes derived from the RA messages
with newly-added ndpr_flags.ra_derived flag and skip prefixes not marked
with such flag during deletion and listing.
Differential Revision: https://reviews.freebsd.org/D36312
MFC after: 2 weeks
o Assert that every protosw has pr_attach. Now this structure is
only for socket protocols declarations and nothing else.
o Merge struct pr_usrreqs into struct protosw. This was suggested
in 1996 by wollman@ (see 7b187005d1), and later reiterated
in 2006 by rwatson@ (see 6fbb9cf860).
o Make struct domain hold a variable sized array of protosw pointers.
For most protocols these pointers are initialized statically.
Those domains that may have loadable protocols have spacers. IPv4
and IPv6 have 8 spacers each (andre@ dff3237ee5).
o For inetsw and inet6sw leave a comment noting that many protosw
entries very likely are dead code.
o Refactor pf_proto_[un]register() into protosw_[un]register().
o Isolate pr_*_notsupp() methods into uipc_domain.c
Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36232
The method was called for two different conditions: 1) the VM layer is
low on pages or 2) one of UMA zones of mbuf allocator exhausted.
This change 2) into a new event handler, but all affected network
subsystems modified to subscribe to both, so this change shall not
bring functional changes under different low memory situations.
There were three subsystems still using pr_drain: TCP, SCTP and frag6.
The latter had its protosw entry for the only reason to register its
pr_drain method.
Reviewed by: tuexen, melifaro
Differential revision: https://reviews.freebsd.org/D36164