Currently we use pre-calculated headers inside LLE entries as prepend data
for `if_output` functions. Using these headers allows saving some
CPU cycles/memory accesses on the fast path.
However, this approach makes adding L2 header for IPv4 traffic with IPv6
nexthops more complex, as it is not possible to store multiple
pre-calculated headers inside lle. Additionally, the solution space is
limited by the fact that PCB caching saves LLEs in addition to the nexthop.
Thus, add support for creating special "child" LLEs for the purpose of holding
custom family encaps and store mbufs pending resolution. To simplify handling
of those LLEs, store them in a linked-list inside a "parent" (e.g. normal) LLE.
Such LLEs are not visible when iterating LLE table. Their lifecycle is bound
to the "parent" LLE - it is not possible to delete "child" when parent is alive.
Furthermore, "child" LLEs are static (RTF_STATIC), avoding complex state
machine used by the standard LLEs.
nd6_lookup() and nd6_resolve() now accepts an additional argument, family,
allowing to return such child LLEs. This change uses `LLE_SF()` macro which
packs family and flags in a single int field. This is done to simplify merging
back to stable/. Once this code lands, most of the cases will be converted to
use a dedicated `family` parameter.
Differential Revision: https://reviews.freebsd.org/D31379
MFC after: 2 weeks
The keyword adds nothing as all operations on the var are performed
through atomic_*
Reviewed by: kp
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D31528
The use of Giant here is vestigal and does not provide any useful
synchronization. Furthermore, non-MPSAFE callouts can cause the
softclock threads to block waiting for long-running newbus operations to
complete.
Reported by: mav
Reviewed by: bz
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31470
We need to do so to safely traverse the ifnet list.
Reviewed by: bz
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31470
Fix getsockopt() for the IPPROTO_IPV6 level socket options with the
following names: IPV6_HOPOPTS, IPV6_RTHDR, IPV6_RTHDRDSTOPTS,
IPV6_DSTOPTS, and IPV6_NEXTHOP.
Reviewed by: markj
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D31458
Factor out lltable locking logic from lltable_try_set_entry_addr()
into a separate lltable_acquire_wlock(), so the latter can be used
in other parts of the code w/o duplication.
Create nd6_try_set_entry_addr() to avoid code duplication in nd6.c
and nd6_nbr.c.
Move lle creation logic from nd6_resolve_slow() into a separate
nd6_get_llentry() to simplify the former.
These changes serve as a pre-requisite for implementing
RFC8950 (IPv4 prefixes with IPv6 nexthops).
Differential Revision: https://reviews.freebsd.org/D31432
MFC after: 2 weeks
Add check that ifp supports IPv6 multicasts in in6_getmulti.
This fixes panic when user application tries to join into multicast
group on an interface that doesn't support IPv6 multicasts, like
IFT_PFLOG interfaces.
PR: 257302
Reviewed by: melifaro
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D31420
Use newly-create llentry_request_feedback(),
llentry_mark_used() and llentry_get_hittime() to
request datapatch usage check and fetch the results
in the same fashion both in IPv4 and IPv6.
While here, simplify llentry_provide_feedback() wrapper
by eliminating 1 condition check.
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D31390
SO_RERROR indicates that receive buffer overflows should be handled as
errors. Historically receive buffer overflows have been ignored and
programs could not tell if they missed messages or messages had been
truncated because of overflows. Since programs historically do not
expect to get receive overflow errors, this behavior is not the
default.
This is really really important for programs that use route(4) to keep
in sync with the system. If we loose a message then we need to reload
the full system state, otherwise the behaviour from that point is
undefined and can lead to chasing bogus bug reports.
Reviewed by: philip (network), kbowling (transport), gbe (manpages)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D26652
ip6_setpktopts() can look up ifnets via ifnet_by_index(), which
is only safe in the net epoch. Ensure that callers are in the net
epoch before calling this function.
Sponsored by: Dell EMC Isilon
MFC after: 4 weeks
Reviewed by: donner, kp
Differential Revision: https://reviews.freebsd.org/D30630
The various protocol implementations are not very consistent about
freeing mbufs in error paths. In general, all protocols must free both
"m" and "control" upon an error, except if PRUS_NOTREADY is specified
(this is only implemented by TCP and unix(4) and requires further work
not handled in this diff), in which case "control" still must be freed.
This diff plugs various leaks in the pru_send implementations.
Reviewed by: tuexen
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30151
Commit 81728a538 ("Split rtinit() into multiple functions.") removed
the initialization of sa6, but not one of its uses. This meant that we
were passing an uninitialized sockaddr as the address to
lltable_prefix_free(). Remove the variable outright to fix the problem.
The caller is expected to hold a reference on pr.
Fixes: 81728a538 ("Split rtinit() into multiple functions.")
Reported by: KMSAN
Reviewed by: donner
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30166
Distinguish between truly invalid requests and those that fail because
we've already joined the group. Both cases fail, but differentiating
them allows userspace to make more informed decisions about what the
error means.
For example. radvd tries to join the all-routers group on every SIGHUP.
This fails, because it's already joined it, but this failure should be
ignored (rather than treated as a sign that the interface's multicast is
broken).
This puts us in line with OpenBSD, NetBSD and Linux.
Reviewed by: donner
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D30111
Several protocol methods take a sockaddr as input. In some cases the
sockaddr lengths were not being validated, or were validated after some
out-of-bounds accesses could occur. Add requisite checking to various
protocol entry points, and convert some existing checks to assertions
where appropriate.
Reported by: syzkaller+KASAN
Reviewed by: tuexen, melifaro
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29519
This reverts a portion of 274579831b ("capsicum: Limit socket
operations in capability mode") as at least rtsol and dhcpcd rely on
being able to configure network interfaces while in capability mode.
Reported by: bapt, Greg V
Sponsored by: The FreeBSD Foundation
This is further rework of 08d9c92027. Now we carry the knowledge of
lock type all the way through tcp_input() and also into tcp_twcheck().
Ideally the rlocking for pure SYNs should propagate all the way into
the alternative TCP stacks, but not yet today.
This should close a race when socket is bind(2)-ed but not yet
listen(2)-ed and a SYN-packet arrives racing with listen(2), discovered
recently by pho@.
Adding support for TCP over UDP allows communication with
TCP stacks which can be implemented in userspace without
requiring special priviledges or specific support by the OS.
This is joint work with rrs.
Reviewed by: rrs
Sponsored by: Netflix, Inc.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D29469
When packet is a SYN packet, we don't need to modify any existing PCB.
Normally SYN arrives on a listening socket, we either create a syncache
entry or generate syncookie, but we don't modify anything with the
listening socket or associated PCB. Thus create a new PCB lookup
mode - rlock if listening. This removes the primary contention point
under SYN flood - the listening socket PCB.
Sidenote: when SYN arrives on a synchronized connection, we still
don't need write access to PCB to send a challenge ACK or just to
drop. There is only one exclusion - tcptw recycling. However,
existing entanglement of tcp_input + stacks doesn't allow to make
this change small. Consider this patch as first approach to the problem.
Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D29576
Capsicum did not prevent certain privileged networking operations,
specifically creation of raw sockets and network configuration ioctls.
However, these facilities can be used to circumvent some of the
restrictions that capability mode is supposed to enforce.
Add capability mode checks to disallow network configuration ioctls and
creation of sockets other than PF_LOCAL and SOCK_DGRAM/STREAM/SEQPACKET
internet sockets.
Reviewed by: oshogbo
Discussed with: emaste
Reported by: manu
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D29423
After length decisions, we've decided that the if_wg(4) driver and
related work is not yet ready to live in the tree. This driver has
larger security implications than many, and thus will be held to
more scrutiny than other drivers.
Please also see the related message sent to the freebsd-hackers@
and freebsd-arch@ lists by Kyle Evans <kevans@FreeBSD.org> on
2021/03/16, with the subject line "Removing WireGuard Support From Base"
for additional context.
This is the culmination of about a week of work from three developers to
fix a number of functional and security issues. This patch consists of
work done by the following folks:
- Jason A. Donenfeld <Jason@zx2c4.com>
- Matt Dunwoodie <ncon@noconroy.net>
- Kyle Evans <kevans@FreeBSD.org>
Notable changes include:
- Packets are now correctly staged for processing once the handshake has
completed, resulting in less packet loss in the interim.
- Various race conditions have been resolved, particularly w.r.t. socket
and packet lifetime (panics)
- Various tests have been added to assure correct functionality and
tooling conformance
- Many security issues have been addressed
- if_wg now maintains jail-friendly semantics: sockets are created in
the interface's home vnet so that it can act as the sole network
connection for a jail
- if_wg no longer fails to remove peer allowed-ips of 0.0.0.0/0
- if_wg now exports via ioctl a format that is future proof and
complete. It is additionally supported by the upstream
wireguard-tools (which we plan to merge in to base soon)
- if_wg now conforms to the WireGuard protocol and is more closely
aligned with security auditing guidelines
Note that the driver has been rebased away from using iflib. iflib
poses a number of challenges for a cloned device trying to operate in a
vnet that are non-trivial to solve and adds complexity to the
implementation for little gain.
The crypto implementation that was previously added to the tree was a
super complex integration of what previously appeared in an old out of
tree Linux module, which has been reduced to crypto.c containing simple
boring reference implementations. This is part of a near-to-mid term
goal to work with FreeBSD kernel crypto folks and take advantage of or
improve accelerated crypto already offered elsewhere.
There's additional test suite effort underway out-of-tree taking
advantage of the aforementioned jail-friendly semantics to test a number
of real-world topologies, based on netns.sh.
Also note that this is still a work in progress; work going further will
be much smaller in nature.
MFC after: 1 month (maybe)
Summary:
This fixes rtentry leak for the cloned interfaces created inside the
VNET.
PR: 253998
Reported by: rashey at superbox.pl
MFC after: 3 days
Loopback teardown order is `SI_SUB_INIT_IF`, which happens after `SI_SUB_PROTO_DOMAIN` (route table teardown).
Thus, any route table operations are too late to schedule.
As the intent of the vnet teardown procedures to minimise the amount of effort by doing global cleanups instead of per-interface ones, address this by adding a relatively light-weight routing table cleanup function, `rib_flush_routes()`.
It removes all remaining routes from the routing table and schedules the deletion, which will happen later, when `rtables_destroy()` waits for the current epoch to finish.
Test Plan:
```
set_skip:set_skip_group_lo -> passed [0.053s]
tail -n 200 /var/log/messages | grep rtentry
```
Reviewers: #network, kp, bz
Reviewed By: kp
Subscribers: imp, ae
Differential Revision: https://reviews.freebsd.org/D29116
P2P ifa may require 2 routes: one is the loopback route, another is
the "prefix" route towards its destination.
Current code marks loopback routes existence with IFA_RTSELF and
"prefix" p2p routes with IFA_ROUTE.
For historic reasons, we fill in ifa_dstaddr for loopback interfaces.
To avoid installing the same route twice, we preemptively set
IFA_RTSELF when adding "prefix" route for loopback.
However, the teardown part doesn't have this hack, so we try to
remove the same route twice.
Fix this by checking if ifa_dstaddr is different from the ifa_addr
and moving this logic into a separate function.
Reviewed By: kp
Differential Revision: https://reviews.freebsd.org/D29121
MFC after: 3 days
The current preference number were copied from IPv4 code,
assuming 500k routes to be the full-view. Adjust with the current
reality (100k full-view).
Reported by: Marek Zarychta <zarychtam at plan-b.pwste.edu.pl>
MFC after: 3 days
Introduce convenience macros to retrieve the DSCP, ECN or traffic class
bits from an IPv6 header.
Use them where appropriate.
Reviewed by: ae (previous version), rscheff, tuexen, rgrimes
MFC after: 2 weeks
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D29056
When tearing down vnet jails we can move an if_bridge out (as
part of the normal vnet_if_return()). This can, when it's clearing out
its list of member interfaces, change its link layer address.
That sends an iflladdr_event, but at that point we've already freed the
AF_INET/AF_INET6 if_afdata pointers.
In other words: when the iflladdr_event callbacks fire we can't assume
that ifp->if_afdata[AF_INET] will be set.
Reviewed by: donner@, melifaro@
MFC after: 1 week
Sponsored by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D28860
Currently ip6_input() calls in6ifa_ifwithaddr() for
every local packet, in order to check if the target ip
belongs to the local ifa in proper state and increase
its counters.
in6ifa_ifwithaddr() references found ifa.
With epoch changes, both `ip6_input()` and all other current callers
of `in6ifa_ifwithaddr()` do not need this reference
anymore, as epoch provides stability guarantee.
Given that, update `in6ifa_ifwithaddr()` to allow
it to return ifa without referencing it, while preserving
option for getting referenced ifa if so desired.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D28648
in6_selectsrc() may call fib6_lookup() in some cases, which requires
epoch. Wrap in6_selectsrc* calls into epoch inside its users.
Mark it as requiring epoch by adding NET_EPOCH_ASSERT().
MFC after: 1 weeek
Differential Revision: https://reviews.freebsd.org/D28647
The only place where in6_ifawithifp() is used is ip6_output(),
which uses the returned ifa to bump traffic counters.
Given ifa stability guarantees is provided by epoch, do not refcount ifa.
This eliminates 2 atomic ops from IPv6 fast path.
Reviewed By: rstone
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D28649
The lookup for a IPv6 multicast addresses corresponding to
the destination address in the datagram is protected by the
NET_EPOCH section. Access to each PCB is protected by INP_RLOCK
during comparing. But access to socket's so_options field is
not protected. And in some cases it is possible, that PCB
pointer is still valid, but inp_socket is not. The patch wides
lock holding to protect access to inp_socket. It copies locking
strategy from IPv4 UDP handling.
PR: 232192
Obtained from: Yandex LLC
MFC after: 3 days
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D28232
Historically receive buffer overflows have been ignored and programs
could not tell if they missed messages or messages had been truncated
because of overflows. Since programs historically do not expect to get
receive overflow errors, this behavior is not the default.
This is really really important for programs that use route(4) to keep in sync
with the system. If we loose a message then we need to reload the full system
state, otherwise the behaviour from that point is undefined and can lead
to chasing bogus bug reports.
Originally IFCAP_NOMAP meant that the mbuf has external storage pointer
that points to unmapped address. Then, this was extended to array of
such pointers. Then, such mbufs were augmented with header/trailer.
Basically, extended mbufs are extended, and set of features is subject
to change. The new name should be generic enough to avoid further
renaming.
we need to make sure that the m_nextpkt field is NULL
else the lower layers may do unwanted things.
Reviewed By: gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D28377
it gets unused.
Currently if_purgeifaddrs() uses in6_purgeaddr() to remove IPv6
ifaddrs. in6_purgeaddr() does not trrigger prefix removal if
number of linked ifas goes to 0, as this is a low-level function.
As a result, if_purgeifaddrs() purges all IPv4/IPv6 addresses but
keeps corresponding IPv6 prefixes.
Fix this by creating higher-level wrapper which handles unused
prefix usecase and use it in if_purgeifaddrs().
Differential revision: https://reviews.freebsd.org/D28128
rtinit[1]() is a function used to add or remove interface address prefix routes,
similar to ifa_maintain_loopback_route().
It was intended to be family-agnostic. There is a problem with this approach
in reality.
1) IPv6 code does not use it for the ifa routes. There is a separate layer,
nd6_prelist_(), providing interface for maintaining interface routes. Its part,
responsible for the actual route table interaction, mimics rtenty() code.
2) rtinit tries to combine multiple actions in the same function: constructing
proper route attributes and handling iterations over multiple fibs, for the
non-zero net.add_addr_allfibs use case. It notably increases the code complexity.
3) dstaddr handling. flags parameter re-uses RTF_ flags. As there is no special flag
for p2p connections, host routes and p2p routes are handled in the same way.
Additionally, mapping IFA flags to RTF flags makes the interface pretty messy.
It make rtinit() to clash with ifa_mainain_loopback_route() for IPV4 interface
aliases.
4) rtinit() is the last customer passing non-masked prefixes to rib_action(),
complicating rib_action() implementation.
5) rtinit() coupled ifa announce/withdrawal notifications, producing "false positive"
ifa messages in certain corner cases.
To address all these points, the following has been done:
* rtinit() has been split into multiple functions:
- Route attribute construction were moved to the per-address-family functions,
dealing with (2), (3) and (4).
- funnction providing net.add_addr_allfibs handling and route rtsock notificaions
is the new routing table inteface.
- rtsock ifa notificaion has been moved out as well. resulting set of funcion are only
responsible for the actual route notifications.
Side effects:
* /32 alias does not result in interface routes (/32 route and "host" route)
* RTF_PINNED is now set for IPv6 prefixes corresponding to the interface addresses
Differential revision: https://reviews.freebsd.org/D28186
Currently we create link-local route by creating an always-on IPv6 prefix
in the prefix list. This prefix is not tied to the link-local ifa.
This leads to the following problems:
First, when flushing interface addresses we skip on-link route, leaving
fe80::/64 prefix on the interface without any IPv6 addresses.
Second, when creating and removing link-local alias we lose fe80::/64 prefix
from the routing table.
Fix this by attaching link-local prefix to the ifa at the initial creation.
Reviewed by: hrs
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D28129
Relevant inet/inet6 code has the control over deciding what
the RIB lookup function currently is. With that in mind,
explicitly set it to the current value (rn_match) in the
datapath lookups. This avoids cost on indirect call.
Differential Revision: https://reviews.freebsd.org/D28066
Currently default behaviour is to keep only 1 packet per unresolved entry.
Ability to queue more than one packet was added 10 years ago, in r215207,
though the default value was kep intact.
Things have changed since that time. Systems tend to initiate multiple
connections at once for a variety of reasons.
For example, recent kern/252278 bug report describe happy-eyeball DNS
behaviour sending multiple requests to the DNS server.
The primary driver for upper value for the queue length determination is
memory consumption. Remote actors should not be able to easily exhaust
local memory by sending packets to unresolved arp/ND entries.
For now, bump value to 16 packets, to match Darwin implementation.
The proper approach would be to switch the limit to calculate memory
consumption instead of packet count and limit based on memory.
We should MFC this with a variation of D22447.
Reviewers: #manpages, #network, bz, emaste
Reviewed By: emaste, gbe(doc), jilles(doc)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D28068