When using lagg failover mode neither Gratuitous ARP (IPv4) or Unsolicited
Neighbour Advertisements (IPv6) are sent to notify other nodes that the
address may have moved.
This results is slow failover, dropped packets and network outages for the
lagg interface when the primary link goes down.
We now use the new if_link_state_change_cond with the force param set to
allow lagg to force through link state changes and hence fire a
ifnet_link_event which are now monitored by rip and nd6.
Upon receiving these events each protocol trigger the relevant
notifications:
* inet4 => Gratuitous ARP
* inet6 => Unsolicited Neighbour Announce
This also fixes the carp IPv6 NA's that stopped working after r251584 which
added the ipv6_route__llma route.
The new behavour can be controlled using the sysctls:
* net.link.ether.inet.arp_on_link
* net.inet6.icmp6.nd6_on_link
Also removed unused param from lagg_port_state and added descriptions for the
sysctls while here.
PR: 156226
MFC after: 1 month
Sponsored by: Multiplay
Differential Revision: https://reviews.freebsd.org/D4111
LLE structure is mostly unchanged during its lifecycle: there are only 2
things relevant for fast path lookup code:
1) link-level address change. Since r286722, these updates are performed
under AFDATA WLOCK.
2) Some sort of feedback indicating that this particular entry is used so
we send NS to perform reachability verification instead of expiring entry.
The only signal that is needed from fast path is something like binary
yes/no.
The latter is solved by the following changes:
Special r_skip_req (introduced in D3688) value is used for fast path feedback.
It is read lockless by fast path, but updated under req_mutex mutex. If this
field is non-zero, then fast path will acquire lock and set it back to 0.
After transitioning to STALE state, callout timer is armed to run each
V_nd6_delay seconds to make sure that if packet was transmitted at the start
of given interval, we would be able to switch to PROBE state in V_nd6_delay
seconds as user expects.
(in STALE state) timer is rescheduled until original V_nd6_gctimer expires
keeping lle in STALE state (remaining timer value stored in lle_remtime).
(in STALE state) timer is rescheduled if packet was transmitted less that
V_nd6_delay seconds ago to make sure we transition to PROBE state exactly
after V_n6_delay seconds.
As a result, all packets towards lle in REACHABLE/STALE/PROBE states are handled
by fast path without acquiring lle read lock.
Differential Revision: https://reviews.freebsd.org/D3780
catches cases that DAD probes cannot be sent because of
IFF_UP && !IFF_DRV_RUNNING.
- nd6_dad_starttimer() now calls nd6_dad_ns_output(), instead of
calling it before nd6_dad_starttimer().
- Do not release an entry in dadq when a duplicate entry is being
added.
Initially function was introduced in r53541 (KAME initial commit) to
"provide hints from upper layer protocols that indicate a connection
is making "forward progress"" (quote from RFC 2461 7.3.1 Reachability
Confirmation).
However, it was converted to do nothing (e.g. just return) in r122922
(tcp_hostcache implementation) back in 2003. Some defines were moved
to tcp_var.h in r169541. Then, it was broken (for non-corner cases)
by r186119 (L2<>L3 split) in 2008 (NULL ifp in nd6_lookup). So,
right now this code is broken and has no "real" base users.
Differential Revision: https://reviews.freebsd.org/D3699
function. The change is mostly mechanical with the following exception:
Last piece of nd6_resolve_slow() was refactored: ND6_LLINFO_PERMANENT
condition was removed as always-true, explicit ND6_LLINFO_NOSTATE ->
ND6_LLINFO_INCOMPLETE state transition was removed as duplicate.
Reviewed by: ae
Sponsored by: Yandex LLC
interface but in6if_do_dad() already had a check for IFF_LOOPBACK.
- Remove in6if_do_dad() check in in6_broadcast_ifa(). An address
which needs DAD always has IN6_IFF_TENTATIVE there.
- in6if_do_dad() now returns EAGAIN when the interface is not ready
since DAD callout handler ignores such an interface.
- In DAD callout handler, mark an address as IN6_IFF_TENTATIVE
when the interface has ND6_IFF_IFDISABLED. And Do IFF_UP and
IFF_DRV_RUNNING check consistently when DAD is required.
- draft-ietf-6man-enhanced-dad is now published as RFC 7527.
- Fix some typos.
This permits us having all (not fully true yet) all the info
needed in lookup process in first 64 bytes of 'struct llentry'.
struct llentry layout:
BEFORE:
[rwlock .. state .. state .. MAC ] (lle+1) [sockaddr_in[6]]
AFTER
[ in[6]_addr MAC .. state .. rwlock ]
Currently, address part of struct llentry has only 16 bytes for the key.
However, lltable does not restrict any custom lltable consumers with long
keys use the previous approach (store key at (lle+1)).
Sponsored by: Yandex LLC
because a link where looped back NS messages are permanently observed
does not work with either NDP or ARP for IPv4.
- draft-ietf-6man-enhanced-dad is now RFC 7527.
Discussed with: hiren
MFC after: 3 days
It is acceptable that the size can be equal to MCLBYTES. In the later
KAME's code this check has been moved under DIAGNOSTIC ifdef, because
the size of NA and NS is much smaller than MCLBYTES. So, it is safe to
replace the check with KASSERT.
PR: 199304
Discussed with: glebius
MFC after: 1 week
- Add no_dad and ignoreloop per-IF knob. no_dad disables DAD completely,
and ignoreloop is to prevent infinite loop in loopback probing state when
loopback is permanently expected.
to initialize mbuf's fibnum. Uninitialized fibnum value can lead to
panic in the routing code. Currently we use only RT_DEFAULT_FIB value
for initialization.
Differential Revision: https://reviews.freebsd.org/D1998
Reviewed by: hrs (previous version)
Sponsored by: Yandex LLC
draft-ietf-6man-enhanced-dad-13.
This basically adds a random nonce option (RFC 3971) to NS messages
for DAD probe to detect a looped back packet. This looped back packet
prevented DAD on some pseudo-interfaces which aggregates multiple L2 links
such as lagg(4).
The length of the nonce is set to 6 bytes. This algorithm can be disabled by
setting net.inet6.ip6.dad_enhanced sysctl to 0 in a per-vnet basis.
Reported by: hiren
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D1835
hold mbuf chain instead of calling full-blown nd6_output_lle()
for each packet. This simplifies both callers and nd6_output_lle()
implementation.
* Make nd6_output_lle() static and remove now-unused lle and chain
arguments.
* Rename nd6_output_flush() -> nd6_flush_holdchain() to be consistent.
* Move all pre-send transmit hooks to newly-created nd6_output_ifp().
Now nd6_output(), nd6_output_lle() and nd6_flush_holdchain() are using
it to send mbufs to if_output.
* Remove SeND hook from nd6_na_input() because it was implemented
incorrectly since the beginning (r211501):
- it tagged initial input mbuf (m) instead of m_hold
- tagging _all_ mbufs in holdchain seems to be wrong anyway.
the knowledge of mbuf layout, and in particular constants such as M_EXT,
MLEN, MHLEN, and so on, in mbuf consumers by unifying various alignment
utility functions (M_ALIGN(), MH_ALIGN(), MEXT_ALIGN() in a single
M_ALIGN() macro, implemented by a now-inlined m_align() function:
- Move m_align() from uipc_mbuf.c to mbuf.h; mark as __inline.
- Reimplement M_ALIGN(), MH_ALIGN(), and MEXT_ALIGN() using m_align().
- Update consumers around the tree to simply use M_ALIGN().
This change eliminates a number of cases where mbuf consumers must be aware
of whether or not mbufs returned by the allocator use external storage, but
also assumptions about the size of the returned mbuf. This will make it
easier to introduce changes in how we use external storage, as well as
features such as variable-size mbufs.
Differential Revision: https://reviews.freebsd.org/D1436
Reviewed by: glebius, trasz, gnn, bz
Sponsored by: EMC / Isilon Storage Division
number of races which could cause double frees or use-after-frees when
performing DAD on an address. In particular, an IPv6 address can now only be
marked as a duplicate from the DAD callout.
Differential Revision: https://reviews.freebsd.org/D1258
Reviewed by: ae, hrs
Reported by: rstone
MFC after: 1 month
possible and do not clear IN6_IFF_TENTATIVE. If IFDISABLED was accidentally
set after a DAD started, TENTATIVE could be cleared because no NA was
received due to IFDISABLED, and as a result it could prevent DAD when
manually clearing IFDISABLED after that.
Thus we can use IF_AFDATA_RLOCK() instead of IF_AFDATA_LOCK() when doing
lla_lookup() without LLE_CREATE flag.
Reviewed by: glebius, adrian
MFC after: 1 week
Sponsored by: Yandex LLC
unlocked route. Use in6_rtalloc() instead of in6_rtalloc1. This helps
simplify the code and remove several now unused variables.
PR: 156283
MFC after: 2 weeks
- Do not calculate constant length values at run time,
CTASSERT() their sanity.
- Remove superfluous cleaning of mbuf fields after allocation.
- Replace compat macros with function calls.
Sponsored by: Nginx, Inc.
it skips FLOWTABLE lookup. However, the non-NULL ro has dual meaning
here: it may be supplied to provide route, and it may be supplied to
store and return to caller the route that ip_output()/ip6_output()
finds. In the latter case skipping FLOWTABLE lookup is pessimisation.
The difference between struct route filled by FLOWTABLE and filled
by rtalloc() family is that the former doesn't hold a reference on
its rtentry. Reference is hold by flow entry, and it is about to
be released in future. Thus, route filled by FLOWTABLE shouldn't
be passed to RTFREE() macro.
- Introduce new flag for struct route/route_in6, that marks route
not holding a reference on rtentry.
- Introduce new macro RO_RTFREE() that cleans up a struct route
depending on its kind.
- All callers to ip_output()/ip6_output() that do supply non-NULL
but empty route should use RO_RTFREE() to free results of
lookup.
- ip_output()/ip6_output() now do FLOWTABLE lookup always when
ro->ro_rt == NULL.
Tested by: tuexen (SCTP part)
Extend the so far IPv4-only support for multiple routing tables (FIBs)
introduced in r178888 to IPv6 providing feature parity.
This includes an extended rtalloc(9) KPI for IPv6, the necessary
adjustments to the network stack, and user land support as in netstat.
Sponsored by: Cisco Systems, Inc.
Reviewed by: melifaro (basically)
MFC after: 10 days
from scratch, copying needed functionality from the old implemenation
on demand, with a thorough review of all code. The main change is that
interface layer has been removed from the CARP. Now redundant addresses
are configured exactly on the interfaces, they run on.
The CARP configuration itself is, as before, configured and read via
SIOCSVH/SIOCGVH ioctls. A new prefix created with SIOCAIFADDR or
SIOCAIFADDR_IN6 may now be configured to a particular virtual host id,
which makes the prefix redundant.
ifconfig(8) semantics has been changed too: now one doesn't need
to clone carpXX interface, he/she should directly configure a vhid
on a Ethernet interface.
To supply vhid data from the kernel to an application the getifaddrs(8)
function had been changed to pass ifam_data with each address. [1]
The new implementation definitely closes all PRs related to carp(4)
being an interface, and may close several others. It also allows
to run a single redundant IP per interface.
Big thanks to Bjoern Zeeb for his help with inet6 part of patch, for
idea on using ifam_data and for several rounds of reviewing!
PR: kern/117000, kern/126945, kern/126714, kern/120130, kern/117448
Reviewed by: bz
Submitted by: bz [1]
- A new per-interface knob IFF_ND6_NO_RADR and sysctl IPV6CTL_NO_RADR.
This controls if accepting a route in an RA message as the default route.
The default value for each interface can be set by net.inet6.ip6.no_radr.
The system wide default value is 0.
- A new sysctl: net.inet6.ip6.norbit_raif. This controls if setting R-bit in
NA on RA accepting interfaces. The default is 0 (R-bit is set based on
net.inet6.ip6.forwarding).
Background:
IPv6 host/router model suggests a router sends an RA and a host accepts it for
router discovery. Because of that, KAME implementation does not allow
accepting RAs when net.inet6.ip6.forwarding=1. Accepting RAs on a router can
make the routing table confused since it can change the default router
unintentionally.
However, in practice there are cases where we cannot distinguish a host from
a router clearly. For example, a customer edge router often works as a host
against the ISP, and as a router against the LAN at the same time. Another
example is a complex network configurations like an L2TP tunnel for IPv6
connection to Internet over an Ethernet link with another native IPv6 subnet.
In this case, the physical interface for the native IPv6 subnet works as a
host, and the pseudo-interface for L2TP works as the default IP forwarding
route.
Problem:
Disabling processing RA messages when net.inet6.ip6.forwarding=1 and
accepting them when net.inet6.ip6.forward=0 cause the following practical
issues:
- A router cannot perform SLAAC. It becomes a problem if a box has
multiple interfaces and you want to use SLAAC on some of them, for
example. A customer edge router for IPv6 Internet access service
using an IPv6-over-IPv6 tunnel sometimes needs SLAAC on the
physical interface for administration purpose; updating firmware
and so on (link-local addresses can be used there, but GUAs by
SLAAC are often used for scalability).
- When a host has multiple IPv6 interfaces and it receives multiple RAs on
them, controlling the default route is difficult. Router preferences
defined in RFC 4191 works only when the routers on the links are
under your control.
Details of Implementation Changes:
Router Advertisement messages will be accepted even when
net.inet6.ip6.forwarding=1. More precisely, the conditions are as
follow:
(ACCEPT_RTADV && !NO_RADR && !ip6.forwarding)
=> Normal RA processing on that interface. (as IPv6 host)
(ACCEPT_RTADV && (NO_RADR || ip6.forwarding))
=> Accept RA but add the router to the defroute list with
rtlifetime=0 unconditionally. This effectively prevents
from setting the received router address as the box's
default route.
(!ACCEPT_RTADV)
=> No RA processing on that interface.
ACCEPT_RTADV and NO_RADR are per-interface knob. In short, all interface
are classified as "RA-accepting" or not. An RA-accepting interface always
processes RA messages regardless of ip6.forwarding. The difference caused by
NO_RADR or ip6.forwarding is whether the RA source address is considered as
the default router or not.
R-bit in NA on the RA accepting interfaces is set based on
net.inet6.ip6.forwarding. While RFC 6204 W-1 rule (for CPE case) suggests
a router should disable the R-bit completely even when the box has
net.inet6.ip6.forwarding=1, I believe there is no technical reason with
doing so. This behavior can be set by a new sysctl net.inet6.ip6.norbit_raif
(the default is 0).
Usage:
# ifconfig fxp0 inet6 accept_rtadv
=> accept RA on fxp0
# ifconfig fxp0 inet6 accept_rtadv no_radr
=> accept RA on fxp0 but ignore default route information in it.
# sysctl net.inet6.ip6.norbit_no_radr=1
=> R-bit in NAs on RA accepting interfaces will always be set to 0.
passing the cached proxydl reference (sockaddr_dl initialized or not) to
nd6_na_output(). nd6_na_output() will thus assume a proxy NA. Revert to
conditionally passing either &proxydl or NULL if no proxy case desired.
Tested by: ipv6gw and ref9-i386
Reported by: Pete French (petefrench ingresso.co.uk on stable)
Reported by: bz, simon on Y! cluster
Reported by: kib
PR: kern/151908
MFC after: 3 days
even after dropping the reference and unlocking. Previously we
have dereferenced a NULL pointer (after r121765).
Simply unlocking after the block does not work either because of
lock ordering (see r121765) and in addition we would still hold
a pointer to something that might be gone by the time we access it.
Thus take a copy of the value rather than just caching the pointer.
PR: kern/151908
Submitted by: chenyl (netstar2008 126.com) (initial version)
MFC after: 2 weeks
Call the handler function with the lock held, return unlocked as we
might free the entry. Rework functions later in the call graph to be
either called with the lock held or, only if needed, unlocked.
Place asserts to document and tighten assumptions on various lle locking,
which were not always true before.
We call nd6_ns_output() unlocked and the assignment of ip6->ip6_src was
decentralized to minimize possible complexity introduced with the formerly
missing locking there. This also resulted in a push down of local
variable scopes into smaller blocks.
Reported by: many
PR: kern/148857
Submitted by: Dmitrij Tejblum (tejblum yandex-team.ru) (original version)
MFC After: 4 days
DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various
people working on the affected files. A better long-term solution is
still being considered. This reversal may give some modules empty
set_pcpu or set_vnet sections, but these are harmless.
Changes reverted:
------------------------------------------------------------------------
r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines
Instead of unconditionally emitting .globl's for the __start_set_xxx and
__stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu
sections are actually defined.
------------------------------------------------------------------------
r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines
Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.
------------------------------------------------------------------------
r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines
Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.
Add kernel side support for Secure Neighbor Discovery (SeND), RFC 3971.
The implementation consists of a kernel module that gets packets from
the nd6 code, sends them to user space on a dedicated socket and reinjects
them back for further processing.
Hooks are used from nd6 code paths to divert relevant packets to the
send implementation for processing in user space. The hooks are only
triggered if the send module is loaded. In case no user space
application is connected to the send socket, processing continues
normaly as if the module would not be loaded. Unloading the module
is not possible at this time due to missing nd6 locking.
The native SeND socket is similar to a raw IPv6 socket but with its own,
internal pseudo-protocol.
Approved by: bz (mentor)
bridge(4), lagg(4) etc. and make use of function pointers and
pf_proto_register() to hook carp into the network stack.
Currently, because of the uncertainty about whether the unload path is free
of race condition panics, unloads are disallowed by default. Compiling with
CARPMOD_CAN_UNLOAD in CFLAGS removes this anti foot shooting measure.
This commit requires IP6PROTOSPACER, introduced in r211115.
Reviewed by: bz, simon
Approved by: ken (mentor)
MFC after: 2 weeks