The only place to execute this method was raw_usend(). Only those
protocols that used raw socket were able to actually enter that method.
All pr_output assignments being deleted by this commit were a dead code
for many years.
Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36126
destinations.
The current assumption is that kernel-handled rtadv prefixes along with
the interface address prefixes are the only prefixes considered in
the ND neighbor eligibility code.
Change this by allowing any non-gatewaye routes to be eligible. This
will allow DHCPv6-controlled routes to be correctly handled by
the ND code.
Refactor nd6_is_new_addr_neighbor() to enable more deterministic
performance in "found" case and remove non-needed
V_rt_add_addr_allfibs handling logic.
Reviewed By: kbowling
Differential Revision: https://reviews.freebsd.org/D23695
MFC after: 1 month
* Make nhgrp_get_nhops() return const struct weightened_nhop to
indicate that the list is immutable
* Make nhgrp_get_group() return the actual group, instead of
group+weight.
MFC after: 2 weeks
Currently we accept any pmtu between IPV6_MMTU(1280B) and the link mtu.
In some network topologies could allow a bad actor to perform a DOS attack.
Contrary to IPv4 in IPv6 oversized packets are dropped, and a ICMP
PACKET_TOO_BIG message is sent back to the sender.
After receiving an ICMPv6 packet with pmtu bigger than the
current one the victim will start sending frames that will be dropped
a router with reduced MTU.
Although it will eventually receive another message with correct pmtu,
an attacker can still just inject their spoofed packets frequently
enough to overwrite the correct value.
This issue is described in detail in RFC8201, section 6.
Fix this by checking the current pmtu, and accepting the new one only
if it's smaller.
Approved by: mw(mentor)
Reviewed by: tuexen
MFC after: 1 week
Sponsored by: Stormshield
Obtained from: Semihalf
Differential Revision: https://reviews.freebsd.org/D35871
With clang 15, the following -Werror warning is produced:
sys/netinet6/nd6.c:247:12: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
nd6_destroy()
^
void
This is nd6_destroy() is declared with a (void) argument list, but
defined with an empty argument list. Make the definition match the
declaration.
MFC after: 3 days
Currently, processing of IPv6 local traffic is partially broken:
link-local connection fails and global unicast connect() takes
3 seconds to complete.
This happens due to the combination of multiple factors.
IPv6 code passes original interface "origifp" when passing
traffic via loopack to retain the scope that is mandatory for the
correct hadling of link-local traffic. First problem is that the logic
of passing source interface is not working correcly for TCP connections,
resulting in passing "origifp" on the first 2 connection attempts and
lo0 on the subsequent ones. Second problem is that source address
validation logic skips its checks iff the source interface is loopback,
which doesn't cover "origifp" case.
More detailed description is available at https://reviews.freebsd.org/D35732
Fix the first problem by untangling&simplifying ifp/origifp logic.
Fix the second problem by switching source address validation check to
using M_LOOP mbuf flag instead of interface type.
PR: 265089
Reviewed by: ae, bz(previous version)
Differential Revision: https://reviews.freebsd.org/D35732
MFC after: 2 weeks
Effectively selectroute() addresses two different cases:
providing interface info for multicast destinations and providing
nexthop data for unicast ones. Current implementation intertwines
handling of both cases, especially in the error handling part.
Factor out all route lookup logic in a separate function,
lookup_route() to simplify the code.
Ensure consistent KPI: no error means *retifp is set and otherwise.
Differential Revision: https://reviews.freebsd.org/D35711
MFC after: 2 weeks
Currently selectroute() contains two nearly-identical versions of
the route lookup logic - one for original destination and another
for the case when IPV6_NEXTHOP option was set on the socket.
Factor out handling these route lookups in a separation function to
improve readability.
This change also fixes handling of link-local IPV6_NEXTHOPs.
Differential Revision: https://reviews.freebsd.org/D35710
MFC after: 2 weeks
Currently, some per-mbuf multicast statistics is stored in
the per-interface ip6stat.ip6s_m2m[] array of size 32 (IP6S_M2MMAX).
Check that loopback ifindex falls within 0.. IP6S_M2MMAX-1 range to
avoid silent data corruption. The latter cat happen with large
number of VNETs.
Reviewed by: glebius
Differential Revision: https://reviews.freebsd.org/D35715
MFC after: 2 weeks
Commit d6cd20cc5 ("netinet6: fix ndp proxying") caused us to panic when
unloading pfsync:
Fatal trap 12: page fault while in kernel mode
cpuid = 19; apic id = 38
fault virtual address = 0x20
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80dfe7f4
stack pointer = 0x28:0xfffffe015d4f8ac0
frame pointer = 0x28:0xfffffe015d4f8ae0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 5477 (kldunload)
trap number = 12
panic: page fault
cpuid = 19
time = 1654023100
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe015d4f8880
vpanic() at vpanic+0x17f/frame 0xfffffe015d4f88d0
panic() at panic+0x43/frame 0xfffffe015d4f8930
trap_fatal() at trap_fatal+0x387/frame 0xfffffe015d4f8990
trap_pfault() at trap_pfault+0xab/frame 0xfffffe015d4f89f0
calltrap() at calltrap+0x8/frame 0xfffffe015d4f89f0
--- trap 0xc, rip = 0xffffffff80dfe7f4, rsp = 0xfffffe015d4f8ac0, rbp = 0xfffffe015d4f8ae0 ---
in6_purge_proxy_ndp() at in6_purge_proxy_ndp+0x14/frame 0xfffffe015d4f8ae0
if_purgeaddrs() at if_purgeaddrs+0x24/frame 0xfffffe015d4f8b90
if_detach_internal() at if_detach_internal+0x1c2/frame 0xfffffe015d4f8bf0
if_detach() at if_detach+0x71/frame 0xfffffe015d4f8c20
pfsync_clone_destroy() at pfsync_clone_destroy+0x1dd/frame 0xfffffe015d4f8c70
if_clone_destroyif() at if_clone_destroyif+0x239/frame 0xfffffe015d4f8cc0
if_clone_detach() at if_clone_detach+0xc8/frame 0xfffffe015d4f8cf0
vnet_pfsync_uninit() at vnet_pfsync_uninit+0xda/frame 0xfffffe015d4f8d10
vnet_deregister_sysuninit() at vnet_deregister_sysuninit+0x85/frame 0xfffffe015d4f8d40
linker_file_sysuninit() at linker_file_sysuninit+0x147/frame 0xfffffe015d4f8d70
linker_file_unload() at linker_file_unload+0x269/frame 0xfffffe015d4f8db0
kern_kldunload() at kern_kldunload+0x18d/frame 0xfffffe015d4f8e00
amd64_syscall() at amd64_syscall+0x12e/frame 0xfffffe015d4f8f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe015d4f8f30
--- syscall (444, FreeBSD ELF64, sys_kldunloadf), rip = 0x1601eab28cba, rsp = 0x1601e9c363f8, rbp = 0x1601e9c36c50 ---
This happens because ifp->if_afdata[AF_INET6] is NULL. Check for this,
just as we already do in a few other places.
See also c139b3c19b ("arp/nd: Cope with late calls to
iflladdr_event").
Reviewed by: melifaro
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D35374
Mbufs leak when manually removing incomplete NDP records with pending packet via ndp -d.
It happens because lltable_drop_entry_queue() rely on `la_numheld`
counter when dropping NDP entries (lles). It turned out NDP code never
increased `la_numheld`, so the actual free never happened.
Fix the issue by introducing unified lltable_append_entry_queue(),
common for both ARP and NDP code, properly addressing packet queue
maintenance.
Reviewed By: melifaro
Differential Revision: https://reviews.freebsd.org/D35365
MFC after: 2 weeks
We could insert proxy NDP entries by the ndp command, but the host
with proxy ndp entries had not responded to Neighbor Solicitations.
Change the following points for proxy NDP to work as expected:
* join solicited-node multicast addresses for proxy NDP entries
in order to receive Neighbor Solicitations.
* look up proxy NDP entries not on the routing table but on the
link-level address table when receiving Neighbor Solicitations.
Reviewed By: melifaro
Differential Revision: https://reviews.freebsd.org/D35307
MFC after: 2 weeks
In order to decrease ifdef INET/INET6s in the lltable implementation,
introduce the llt_post_resolved callback and implement protocol-dependent
code in the protocol-dependent part.
Reviewed By: melifaro
Differential Revision: https://reviews.freebsd.org/D35322
MFC after: 2 weeks
Implement the same filter feature we implemented for UDP over IPv6 in
742e7210d. This was missed in that commit.
Pointed out by: markj
Sponsored by: Rubicon Communications, LLC ("Netgate")
mld_domifattach() does a memory allocation under the global MLD mutex
and so can fail, but no error handling prevents a null pointer
dereference in this case. The mutex is only needed when updating the
global softc list; the allocation and static initialization of the softc
does not require this mutex. So, reduce the scope of the mutex and use
M_WAITOK for the allocation.
PR: 261457
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34943
Allow udp tunnel functions to indicate they have not taken ownership of
the packet, and that normal UDP processing should continue.
This is especially useful for scenarios where the kernel has taken
ownership of a socket that was originally created by userspace. It
allows the tunnel function to pass through certain packets for userspace
processing.
The primary user of this is if_ovpn, when it receives messages from
unknown peers (which might be a new client).
Reviewed by: tuexen
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D34883
For IPv4 use dst pointer as destination address in fib4_lookup().
It keeps destination address from IPv4 header and can be changed
when PACKET_TAG_IPFORWARD tag was set by packet filter.
For IPv6 override destination address with address from dst_sa.sin6_addr,
that was set from PACKET_TAG_IPFORWARD tag.
Reviewed by: eugen
MFC after: 1 week
PR: 256828, 261697, 255705
Differential Revision: https://reviews.freebsd.org/D34732
Also convert raw epoch_call() calls to lltable_free_entry() calls, no
functional change intended. There's no need to asynchronously free the
LLEs in that case to begin with, but we might as well use the lltable
interfaces consistently.
Noticed by code inspection; I believe lltable_calc_llheader() failures
do not generally happen in practice.
Reviewed by: bz
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34832
Historically, lltable_try_set_entry_addr() would release the LLE lock
upon failure. After some refactoring, it no longer does so, but
consumers were not adjusted accordingly.
Also fix a leak that can occur if lltable_calc_llheader() fails in the
ARP code, but I suspect that such a failure can only occur due to a code
bug.
Reviewed by: bz, melifaro
Reported by: pho
Fixes: 0b79b007eb ("[lltable] Restructure nd6 code.")
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34831
For nallow and nblock, move the variables under #ifdef KTR.
For return values from functions logged in KTR traces, mark the
variables as __unused rather than having to #ifdef the assignment of
the function return value.
When ip_output_send() returns EAGAIN due to issues with send tags (route
change, lagg failover, etc), it must free the mbuf. This is because
ip_output_send() was written as a wrapper/replacement for a direct
call to if_output(), and the contract with if_output() has
historically been that it owns the mbufs once called. When
ip_output_send() failed to free mbufs, it violated this assumption
and lead to leaked mbufs.
This was noticed when using NIC TLS in combination with hardware
rate-limited connections. When seeing lots of NIC output drops
triggered ratelimit send tag changes, we noticed we were leaking
ktls_sessions, send tags and mbufs. This was due ip_output_send()
leaking mbufs which held references to ktls_sessions, which in
turn held references to send tags.
Many thanks to jbh, rrs, hselasky and markj for their help in
debugging this.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D34054
Reviewed by: hselasky, jhb, rrs
MFC after: 2 weeks
When sending an NS, check if we are using a IPv6 CARP address
and if we do, then put proper CARP link level address into
ND_OPT_SOURCE_LINKADDR option and also put PACKET_TAG_CARP tag
on the packet. The latter will enforce CARP link level address
at the data link layer too, which might be necessary for broken
implementations.
The code really follows what NA sending code has been doing since
introduction of carp(4). While here, bring to style(9) the whole
block of code.
PR: 193280
Differential revision: https://reviews.freebsd.org/D33858
Now that each module handles its global and VNET initialization
itself, there is no VNET related stuff left to do in domain_init().
Differential revision: https://reviews.freebsd.org/D33541
The historical BSD network stack loop that rolls over domains and
over protocols has no advantages over more modern SYSINIT(9).
While doing the sweep, split global and per-VNET initializers.
Getting rid of pr_init allows to achieve several things:
o Get rid of ifdef's that protect against double foo_init() when
both INET and INET6 are compiled in.
o Isolate initializers statically to the module they init.
o Makes code easier to understand and maintain.
Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D33537
Allow the resending of DATA chunks to be controlled by the caller,
which allows retiring sctp_mtu_size_reset() in a separate commit.
Also improve the computaion of the overhead and use 32-bit integers
consistently.
Thanks to Timo Voelker for pointing me to the code.
MFC after: 3 days
Introduce a new function, lltable_get(), to retrieve lltable pointer
for the specified interface and family.
Use it to avoid all-iftable list traversal when adding or deleting
ARP/ND records.
Differential Revision: https://reviews.freebsd.org/D33660
MFC after: 2 weeks
The intent is to provide more entropy than can be provided
by just the 32-bits of the IPv6 address which overlaps with
6to4 tunnels. This is needed to mitigate potential algorithmic
complexity attacks from attackers who can control large
numbers of IPv6 addresses.
Together with: gallatin
Reviewed by: dwmalone, rscheff
Differential revision: https://reviews.freebsd.org/D33254
Now struct prison has two pointers (IPv4 and IPv6) of struct
prison_ip type. Each points into epoch context, address count
and variable size array of addresses. These structures are
freed with network epoch deferred free and are not edited in
place, instead a new structure is allocated and set.
While here, the change also generalizes a lot (but not enough)
of IPv4 and IPv6 processing. E.g. address family agnostic helpers
for kern_jail_set() are provided, that reduce v4-v6 copy-paste.
The fast-path prison_check_ip[46]_locked() is also generalized
into prison_ip_check() that can be executed with network epoch
protection only.
Reviewed by: jamie
Differential revision: https://reviews.freebsd.org/D33339
Running sys/netpfil/pf/fragmentation v6 results in:
lock order reversal:
1st 0xfffffe00050429a8 rip (rip, sleep mutex) @ /usr/src/sys/netinet6/raw_ip6.c:803
2nd 0xfffff8009491e1d0 rawinp (rawinp, rw) @ /usr/src/sys/netinet6/raw_ip6.c:804
lock order rawinp -> rip established at:
0xffffffff8068e26a at witness_lock_order_add+0x28a
0xffffffff8068d087 at witness_checkorder+0x627
0xffffffff805a9f05 at __mtx_lock_flags+0x205
0xffffffff808102e4 at in_pcballoc+0x204
0xffffffff808d53c6 at rip6_attach+0x116
0xffffffff806dc4e8 at socreate+0x368
0xffffffff806eaedc at kern_socket+0xfc
0xffffffff806eadcd at sys_socket+0x2d
0xffffffff80abc774 at syscallenter+0x5c4
0xffffffff80abbeeb at amd64_syscall+0x1b
0xffffffff80a8044b at fast_syscall_common+0xf8
lock order rip -> rawinp attempted at:
0xffffffff8068dc2a at witness_checkorder+0x11ca
0xffffffff805d1b7f at _rw_wlock_cookie+0x18f
0xffffffff808d596c at rip6_connect+0x19c
0xffffffff806e0842 at soconnectat+0x142
0xffffffff806ebe36 at kern_connectat+0x136
0xffffffff806ebcdf at sys_connect+0x4f
0xffffffff80abc774 at syscallenter+0x5c4
0xffffffff80abbeeb at amd64_syscall+0x1b
0xffffffff80a8044b at fast_syscall_common+0xf8
Reviewed by: glebius
Fixes: de2d47842e ("SMR protection for inpcbs")
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D33508
ip6_setpktopt() can call ifnet_byindex() which requires epoch. Mark the
function as requiring NET_EPOCH, and ensure we enter it priot to calling
it.
Reported-by: syzbot+92526116441688fea8a3@syzkaller.appspotmail.com
Reviewed by: glebius
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D33462
The pcb lookup always happens in the network epoch and in SMR section.
We can't block on a mutex due to the latter. Right now this patch opens
up a race. But soon that will be addressed by D33339.
Reviewed by: markj, jamie
Differential revision: https://reviews.freebsd.org/D33340
Fixes: de2d47842e
Sweep over potentially unsafe calls to ifnet_byindex() and wrap them
in epoch. Most of the code touched remains unsafe, as the returned
pointer is being used after epoch exit. Mark that with a comment.
Validate the index argument inside the function, reducing argument
validation requirement from the callers and making V_if_index
private to if.c.
Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D33263
This reverts commit 266f97b5e9, reversing
changes made to a10253cffe.
A mismerge of a merge to catch up to main resulted in files being
committed which should not have been.
With introduction of epoch(9) synchronization to network stack the
inpcb database became protected by the network epoch together with
static network data (interfaces, addresses, etc). However, inpcb
aren't static in nature, they are created and destroyed all the
time, which creates some traffic on the epoch(9) garbage collector.
Fairly new feature of uma(9) - Safe Memory Reclamation allows to
safely free memory in page-sized batches, with virtually zero
overhead compared to uma_zfree(). However, unlike epoch(9), it
puts stricter requirement on the access to the protected memory,
needing the critical(9) section to access it. Details:
- The database is already build on CK lists, thanks to epoch(9).
- For write access nothing is changed.
- For a lookup in the database SMR section is now required.
Once the desired inpcb is found we need to transition from SMR
section to r/w lock on the inpcb itself, with a check that inpcb
isn't yet freed. This requires some compexity, since SMR section
itself is a critical(9) section. The complexity is hidden from
KPI users in inp_smr_lock().
- For a inpcb list traversal (a pcblist sysctl, or broadcast
notification) also a new KPI is provided, that hides internals of
the database - inp_next(struct inp_iterator *).
Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D33022
With upcoming changes to the inpcb synchronisation it is going to be
broken. Even its current status after the move of PCB synchronization
to the network epoch is very questionable.
This experimental feature was sponsored by Juniper but ended never to
be used in Juniper and doesn't exist in their source tree [sjg@, stevek@,
jtl@]. In the past (AFAIK, pre-epoch times) it was tried out at Netflix
[gallatin@, rrs@] with no positive result and at Yandex [ae@, melifaro@].
I'm up to resurrecting it back if there is any interest from anybody.
Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D33020
Drop packets arriving from the network that have our source IPv6
address. If maliciously crafted they can create evil effects
like an RST exchange between two of our listening TCP ports.
Such packets just can't be legitimate. Enable the tunable
by default. Long time due for a modern Internet host.
Reviewed by: melifaro, donner, kp
Differential revision: https://reviews.freebsd.org/D32915
In most cases blackholing for locally originated packets is undesired,
leads to different kind of lags and delays. Provide sysctls to enforce
it, e.g. for debugging purposes.
Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D32718
Previously in_pcbbind_setup returned EADDRNOTAVAIL for empty
V_in_ifaddrhead (i.e., no IPv4 addresses configured) and in6_pcbbind
did the same for empty V_in6_ifaddrhead (no IPv6 addresses).
An equivalent test has existed since 4.4-Lite. It was presumably done
to avoid extra work (assuming the address isn't going to be found
later).
In normal system operation *_ifaddrhead will not be empty: they will
at least have the loopback address(es). In practice no work will be
avoided.
Further, this case caused net/dhcpd to fail when run early in boot
before assignment of any addresses. It should be possible to bind the
unspecified address even if no addresses have been configured yet, so
just remove the tests.
The now-removed "XXX broken" comments were added in 59562606b9,
which converted the ifaddr lists to TAILQs. As far as I (emaste) can
tell the brokenness is the issue described above, not some aspect of
the TAILQ conversion.
PR: 253166
Reviewed by: ae, bz, donner, emaste, glebius
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D32563
This function was originally carved out of in6_pcbbind(), which
is in in6_pcb.c. This function also uses KPI private to the PCB
database - in_pcb_lport().
All callers of sctp_aloc_assoc() mark the PCB as connected after a
successful call (for one-to-one-style sockets). In all cases this is
done without the PCB lock, so the PCB's flags can be corrupted. We also
do not atomically check whether a one-to-one-style socket is a listening
socket, which violates various assumptions in solisten_proto().
We need to hold the PCB lock across all of sctp_aloc_assoc() to fix
this. In order to do that without introducing lock order reversals, we
have to hold the global info lock as well.
So:
- Convert sctp_aloc_assoc() so that the inp and info locks are
consistently held. It returns with the association lock held, as
before.
- Fix an apparent bug where we failed to remove an association from a
global hash if sctp_add_remote_addr() fails.
- sctp_select_a_tag() is called when initializing an association, and it
acquires the global info lock. To avoid lock recursion, push locking
into its callers.
- Introduce sctp_aloc_assoc_connected(), which atomically checks for a
listening socket and sets SCTP_PCB_FLAGS_CONNECTED.
There is still one edge case in sctp_process_cookie_new() where we do
not update PCB/socket state correctly.
Reviewed by: tuexen
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31908
- Protect the `expire_upcalls` callout with the MFC6 mutex. The callout
handler needs this mutex anyway.
- Convert the MROUTER6 mutex to a sleepable sx lock. It is only used
when configuring the global v6 multicast routing socket, so is only
used in system call paths where sleeping is safe. This lets us drain
the callout without having to drop the lock.
- For all locking macros in the file, convert to using a _LOCKPTR macro.
Reported by: mav
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31836
Interface addresses with pending duplicate address detection (DAD) live
in a global queue. In this case, a callout is associated with each
entry. The callout transmits neighbour solicitations until the system
decides the address is no longer tentative, or until a duplicate address
is discovered. At this point the entry is dequeued and freed. DAD may
be manually stopped as well.
The callout currently runs (and potentially transmits packets) with
Giant held. Reorganize DAD queue locking to interlock properly with the
callout:
- Configure the callout to acquire the DAD queue lock before running.
The lock is dropped before transmitting any packets. Stop protecting
the callout with Giant.
- When looking up DAD queue entries for an incoming NS or NA, don't
bother fiddling with the DAD queue entry reference count.
- Split nd6_dad_starttimer() so that the caller is responsible to
transmitting a NS if it so desires.
- Remove the DAD entry from the queue before stopping the timer. Use a
temporary reference to make sure that the entry doesn't get freed by
the callout while we're draining.
Reported by: mav
Reviewed by: bz, hrs
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D31826
Currently we use pre-calculated headers inside LLE entries as prepend data
for `if_output` functions. Using these headers allows saving some
CPU cycles/memory accesses on the fast path.
However, this approach makes adding L2 header for IPv4 traffic with IPv6
nexthops more complex, as it is not possible to store multiple
pre-calculated headers inside lle. Additionally, the solution space is
limited by the fact that PCB caching saves LLEs in addition to the nexthop.
Thus, add support for creating special "child" LLEs for the purpose of holding
custom family encaps and store mbufs pending resolution. To simplify handling
of those LLEs, store them in a linked-list inside a "parent" (e.g. normal) LLE.
Such LLEs are not visible when iterating LLE table. Their lifecycle is bound
to the "parent" LLE - it is not possible to delete "child" when parent is alive.
Furthermore, "child" LLEs are static (RTF_STATIC), avoding complex state
machine used by the standard LLEs.
nd6_lookup() and nd6_resolve() now accepts an additional argument, family,
allowing to return such child LLEs. This change uses `LLE_SF()` macro which
packs family and flags in a single int field. This is done to simplify merging
back to stable/. Once this code lands, most of the cases will be converted to
use a dedicated `family` parameter.
Differential Revision: https://reviews.freebsd.org/D31379
MFC after: 2 weeks
The keyword adds nothing as all operations on the var are performed
through atomic_*
Reviewed by: kp
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D31528
The use of Giant here is vestigal and does not provide any useful
synchronization. Furthermore, non-MPSAFE callouts can cause the
softclock threads to block waiting for long-running newbus operations to
complete.
Reported by: mav
Reviewed by: bz
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31470
We need to do so to safely traverse the ifnet list.
Reviewed by: bz
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31470
Fix getsockopt() for the IPPROTO_IPV6 level socket options with the
following names: IPV6_HOPOPTS, IPV6_RTHDR, IPV6_RTHDRDSTOPTS,
IPV6_DSTOPTS, and IPV6_NEXTHOP.
Reviewed by: markj
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D31458
Factor out lltable locking logic from lltable_try_set_entry_addr()
into a separate lltable_acquire_wlock(), so the latter can be used
in other parts of the code w/o duplication.
Create nd6_try_set_entry_addr() to avoid code duplication in nd6.c
and nd6_nbr.c.
Move lle creation logic from nd6_resolve_slow() into a separate
nd6_get_llentry() to simplify the former.
These changes serve as a pre-requisite for implementing
RFC8950 (IPv4 prefixes with IPv6 nexthops).
Differential Revision: https://reviews.freebsd.org/D31432
MFC after: 2 weeks
Add check that ifp supports IPv6 multicasts in in6_getmulti.
This fixes panic when user application tries to join into multicast
group on an interface that doesn't support IPv6 multicasts, like
IFT_PFLOG interfaces.
PR: 257302
Reviewed by: melifaro
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D31420
Use newly-create llentry_request_feedback(),
llentry_mark_used() and llentry_get_hittime() to
request datapatch usage check and fetch the results
in the same fashion both in IPv4 and IPv6.
While here, simplify llentry_provide_feedback() wrapper
by eliminating 1 condition check.
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D31390
SO_RERROR indicates that receive buffer overflows should be handled as
errors. Historically receive buffer overflows have been ignored and
programs could not tell if they missed messages or messages had been
truncated because of overflows. Since programs historically do not
expect to get receive overflow errors, this behavior is not the
default.
This is really really important for programs that use route(4) to keep
in sync with the system. If we loose a message then we need to reload
the full system state, otherwise the behaviour from that point is
undefined and can lead to chasing bogus bug reports.
Reviewed by: philip (network), kbowling (transport), gbe (manpages)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D26652
ip6_setpktopts() can look up ifnets via ifnet_by_index(), which
is only safe in the net epoch. Ensure that callers are in the net
epoch before calling this function.
Sponsored by: Dell EMC Isilon
MFC after: 4 weeks
Reviewed by: donner, kp
Differential Revision: https://reviews.freebsd.org/D30630
The various protocol implementations are not very consistent about
freeing mbufs in error paths. In general, all protocols must free both
"m" and "control" upon an error, except if PRUS_NOTREADY is specified
(this is only implemented by TCP and unix(4) and requires further work
not handled in this diff), in which case "control" still must be freed.
This diff plugs various leaks in the pru_send implementations.
Reviewed by: tuexen
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30151
Commit 81728a538 ("Split rtinit() into multiple functions.") removed
the initialization of sa6, but not one of its uses. This meant that we
were passing an uninitialized sockaddr as the address to
lltable_prefix_free(). Remove the variable outright to fix the problem.
The caller is expected to hold a reference on pr.
Fixes: 81728a538 ("Split rtinit() into multiple functions.")
Reported by: KMSAN
Reviewed by: donner
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30166
Distinguish between truly invalid requests and those that fail because
we've already joined the group. Both cases fail, but differentiating
them allows userspace to make more informed decisions about what the
error means.
For example. radvd tries to join the all-routers group on every SIGHUP.
This fails, because it's already joined it, but this failure should be
ignored (rather than treated as a sign that the interface's multicast is
broken).
This puts us in line with OpenBSD, NetBSD and Linux.
Reviewed by: donner
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D30111
Several protocol methods take a sockaddr as input. In some cases the
sockaddr lengths were not being validated, or were validated after some
out-of-bounds accesses could occur. Add requisite checking to various
protocol entry points, and convert some existing checks to assertions
where appropriate.
Reported by: syzkaller+KASAN
Reviewed by: tuexen, melifaro
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29519
This reverts a portion of 274579831b ("capsicum: Limit socket
operations in capability mode") as at least rtsol and dhcpcd rely on
being able to configure network interfaces while in capability mode.
Reported by: bapt, Greg V
Sponsored by: The FreeBSD Foundation
This is further rework of 08d9c92027. Now we carry the knowledge of
lock type all the way through tcp_input() and also into tcp_twcheck().
Ideally the rlocking for pure SYNs should propagate all the way into
the alternative TCP stacks, but not yet today.
This should close a race when socket is bind(2)-ed but not yet
listen(2)-ed and a SYN-packet arrives racing with listen(2), discovered
recently by pho@.
Adding support for TCP over UDP allows communication with
TCP stacks which can be implemented in userspace without
requiring special priviledges or specific support by the OS.
This is joint work with rrs.
Reviewed by: rrs
Sponsored by: Netflix, Inc.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D29469
When packet is a SYN packet, we don't need to modify any existing PCB.
Normally SYN arrives on a listening socket, we either create a syncache
entry or generate syncookie, but we don't modify anything with the
listening socket or associated PCB. Thus create a new PCB lookup
mode - rlock if listening. This removes the primary contention point
under SYN flood - the listening socket PCB.
Sidenote: when SYN arrives on a synchronized connection, we still
don't need write access to PCB to send a challenge ACK or just to
drop. There is only one exclusion - tcptw recycling. However,
existing entanglement of tcp_input + stacks doesn't allow to make
this change small. Consider this patch as first approach to the problem.
Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D29576
Capsicum did not prevent certain privileged networking operations,
specifically creation of raw sockets and network configuration ioctls.
However, these facilities can be used to circumvent some of the
restrictions that capability mode is supposed to enforce.
Add capability mode checks to disallow network configuration ioctls and
creation of sockets other than PF_LOCAL and SOCK_DGRAM/STREAM/SEQPACKET
internet sockets.
Reviewed by: oshogbo
Discussed with: emaste
Reported by: manu
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D29423
After length decisions, we've decided that the if_wg(4) driver and
related work is not yet ready to live in the tree. This driver has
larger security implications than many, and thus will be held to
more scrutiny than other drivers.
Please also see the related message sent to the freebsd-hackers@
and freebsd-arch@ lists by Kyle Evans <kevans@FreeBSD.org> on
2021/03/16, with the subject line "Removing WireGuard Support From Base"
for additional context.
This is the culmination of about a week of work from three developers to
fix a number of functional and security issues. This patch consists of
work done by the following folks:
- Jason A. Donenfeld <Jason@zx2c4.com>
- Matt Dunwoodie <ncon@noconroy.net>
- Kyle Evans <kevans@FreeBSD.org>
Notable changes include:
- Packets are now correctly staged for processing once the handshake has
completed, resulting in less packet loss in the interim.
- Various race conditions have been resolved, particularly w.r.t. socket
and packet lifetime (panics)
- Various tests have been added to assure correct functionality and
tooling conformance
- Many security issues have been addressed
- if_wg now maintains jail-friendly semantics: sockets are created in
the interface's home vnet so that it can act as the sole network
connection for a jail
- if_wg no longer fails to remove peer allowed-ips of 0.0.0.0/0
- if_wg now exports via ioctl a format that is future proof and
complete. It is additionally supported by the upstream
wireguard-tools (which we plan to merge in to base soon)
- if_wg now conforms to the WireGuard protocol and is more closely
aligned with security auditing guidelines
Note that the driver has been rebased away from using iflib. iflib
poses a number of challenges for a cloned device trying to operate in a
vnet that are non-trivial to solve and adds complexity to the
implementation for little gain.
The crypto implementation that was previously added to the tree was a
super complex integration of what previously appeared in an old out of
tree Linux module, which has been reduced to crypto.c containing simple
boring reference implementations. This is part of a near-to-mid term
goal to work with FreeBSD kernel crypto folks and take advantage of or
improve accelerated crypto already offered elsewhere.
There's additional test suite effort underway out-of-tree taking
advantage of the aforementioned jail-friendly semantics to test a number
of real-world topologies, based on netns.sh.
Also note that this is still a work in progress; work going further will
be much smaller in nature.
MFC after: 1 month (maybe)
Summary:
This fixes rtentry leak for the cloned interfaces created inside the
VNET.
PR: 253998
Reported by: rashey at superbox.pl
MFC after: 3 days
Loopback teardown order is `SI_SUB_INIT_IF`, which happens after `SI_SUB_PROTO_DOMAIN` (route table teardown).
Thus, any route table operations are too late to schedule.
As the intent of the vnet teardown procedures to minimise the amount of effort by doing global cleanups instead of per-interface ones, address this by adding a relatively light-weight routing table cleanup function, `rib_flush_routes()`.
It removes all remaining routes from the routing table and schedules the deletion, which will happen later, when `rtables_destroy()` waits for the current epoch to finish.
Test Plan:
```
set_skip:set_skip_group_lo -> passed [0.053s]
tail -n 200 /var/log/messages | grep rtentry
```
Reviewers: #network, kp, bz
Reviewed By: kp
Subscribers: imp, ae
Differential Revision: https://reviews.freebsd.org/D29116
P2P ifa may require 2 routes: one is the loopback route, another is
the "prefix" route towards its destination.
Current code marks loopback routes existence with IFA_RTSELF and
"prefix" p2p routes with IFA_ROUTE.
For historic reasons, we fill in ifa_dstaddr for loopback interfaces.
To avoid installing the same route twice, we preemptively set
IFA_RTSELF when adding "prefix" route for loopback.
However, the teardown part doesn't have this hack, so we try to
remove the same route twice.
Fix this by checking if ifa_dstaddr is different from the ifa_addr
and moving this logic into a separate function.
Reviewed By: kp
Differential Revision: https://reviews.freebsd.org/D29121
MFC after: 3 days
The current preference number were copied from IPv4 code,
assuming 500k routes to be the full-view. Adjust with the current
reality (100k full-view).
Reported by: Marek Zarychta <zarychtam at plan-b.pwste.edu.pl>
MFC after: 3 days