option, unbreak the lock tracing release semantic by embedding
calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
version of the releasing functions for mutex, rwlock and sxlock.
Failing to do so skips the lockstat_probe_func invokation for
unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
kernel compiled without lock debugging options, potentially every
consumer must be compiled including opt_kdtrace.h.
Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
is linked there and it is only used as a compile-time stub [0].
[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested. As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while. Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].
Sponsored by: EMC / Isilon storage division
Discussed with: rstone
[0] Reported by: rstone
[1] Discussed with: philip
unlocked route. Use in6_rtalloc() instead of in6_rtalloc1. This helps
simplify the code and remove several now unused variables.
PR: 156283
MFC after: 2 weeks
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
four counters to struct ifaddr. This kills '+=' on a variables shared
between processors for every packet.
- Nuke struct if_data from struct ifaddr.
- In ip_input() do not put a reference on ifaddr, instead update statistics
right now in place and do IN_IFADDR_RUNLOCK(). These removes atomic(9)
for every packet. [1]
- To properly support NET_RT_IFLISTL sysctl used by getifaddrs(3), in
rtsock.c fill if_data fields using counter_u64_fetch().
- Accidentially fix bug in COMPAT_32 version of NET_RT_IFLISTL, which
took if_data not from the ifaddr, but from ifaddr's ifnet. [2]
Submitted by: melifaro [1], pluknet[2]
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
transmission which could be tricked into rounding up to the nearest
page size, leaking up to a page of kernel memory. [13:11]
In IPv6 and NetATM, stop SIOCSIFADDR, SIOCSIFBRDADDR, SIOCSIFDSTADDR
and SIOCSIFNETMASK at the socket layer rather than pass them on to the
link layer without validation or credential checks. [SA-13:12]
Prevent cross-mount hardlinks between different nullfs mounts of the
same underlying filesystem. [SA-13:13]
Security: CVE-2013-5666
Security: FreeBSD-SA-13:11.sendfile
Security: CVE-2013-5691
Security: FreeBSD-SA-13:12.ifioctl
Security: CVE-2013-5710
Security: FreeBSD-SA-13:13.nullfs
Approved by: re
matches the types used when computing hash indices and the type of the
maximum size of mfchashtbl[].
PR: kern/181821
Submitted by: Sven-Thorsten Dietrich <sven@vyatta.com> (IPv4)
MFC after: 1 week
dynamic translation so that their arguments match the definitions for
these providers in Solaris and illumos. Thus, existing scripts for these
providers should work unmodified on FreeBSD.
Tested by: gnn, hiren
MFC after: 1 month
features. The changes in particular are:
o Remove rarely used "header" pointer and replace it with a 64bit protocol/
layer specific union PH_loc for local use. Protocols can flexibly overlay
their own 8 to 64 bit fields to store information while the packet is
worked on.
o Mechanically convert IP reassembly, IGMP/MLD and ATM to use pkthdr.PH_loc
instead of pkthdr.header.
o Extend csum_flags to 64bits to allow for additional future offload
information to be carried (e.g. iSCSI, IPsec offload, and others).
o Move the RSS hash type enumerator from abusing m_flags to its own 8bit
rsstype field. Adjust accessor macros.
o Add cosqos field to store Class of Service / Quality of Service information
with the packet. It is not yet supported in any drivers but allows us to
get on par with Cisco/Juniper in routing applications (plus MPLS QoS) with
a modernized ALTQ.
o Add four 8 bit fields l[2-5]hlen to store the relative header offsets
from the start of the packet. This is important for various offload
capabilities and to relieve the drivers from having to parse the packet
and protocol headers to find out location of checksums and other
information. Header parsing in drivers is a lot of copy-paste and
unhandled corner cases which we want to avoid.
o Add another flexible 64bit union to map various additional persistent
packet information, like ether_vtag, tso_segsz and csum fields.
Depending on the csum_flags settings some fields may have different usage
making it very flexible and adaptable to future capabilities.
o Restructure the CSUM flags to better signify their outbound (down the
stack) and inbound (up the stack) use. The CSUM flags used to be a bit
chaotic and rather poorly documented leading to incorrect use in many
places. Bring clarity into their use through better naming.
Compatibility mappings are provided to preserve the API. The drivers
can be corrected one by one and MFC'd without issue.
o The size of pkthdr stays the same at 48/56bytes (32/64bit architectures).
Sponsored by: The FreeBSD Foundation
flag instead. The flag is only used within the IP and IPv6 layer 3
protocols.
Because some firewall packages treat IPv4 and IPv6 packets the same the
flag should have the same value for both.
Discussed with: trociny, glebius
PF_INET6 in kernel. This fixes various malfunction when the wall time
clock is changed. Bump __FreeBSD_version to 1000041.
- Use clock_gettime(CLOCK_MONOTONIC_FAST) in userland utilities.
MFC after: 1 month
structure is used, but they already have equal fields in the struct
newipsecstat, that was introduced with FAST_IPSEC and then was merged
together with old ipsecstat structure.
This fixes kernel stack overflow on some architectures after migration
ipsecstat to PCPU counters.
Reported by: Taku YAMAMOTO, Maciej Milewski
duplicated sockets a multicast address is bound and either
SO_REUSEPORT or SO_REUSEADDR is set.
But actually it works for the following combinations:
* SO_REUSEPORT is set for the fist socket and SO_REUSEPORT for the new;
* SO_REUSEADDR is set for the fist socket and SO_REUSEADDR for the new;
* SO_REUSEPORT is set for the fist socket and SO_REUSEADDR for the new;
and fails for this:
* SO_REUSEADDR is set for the fist socket and SO_REUSEPORT for the new.
Fix the last case.
PR: 179901
MFC after: 1 month
dereferencing, when checking for SO_REUSEPORT option (and SO_REUSEADDR
for multicast), INP_REUSEPORT flag was introduced to cache the socket
option. It was decided then that one flag would be enough to cache
both SO_REUSEPORT and SO_REUSEADDR: when processing SO_REUSEADDR
setsockopt(2), it was checked if it was called for a multicast address
and INP_REUSEPORT was set accordingly.
Unfortunately that approach does not work when setsockopt(2) is called
before binding to a multicast address: the multicast check fails and
INP_REUSEPORT is not set.
Fix this by adding INP_REUSEADDR flag to unconditionally cache
SO_REUSEADDR.
PR: 179901
Submitted by: Michael Gmelin freebsd grem.de (initial version)
Reviewed by: rwatson
MFC after: 1 week
is initialized with !ND6_IFF_AUTO_LINKLOCAL && !ND6_IFF_ACCEPT_RTADV
regardless of net.inet6.ip6.accept_rtadv and net.inet6.ip6.auto_linklocal.
To configure an autoconfigured link-local address (RFC 4862), the
following rc.conf(5) configuration can be used:
ifconfig_bridge0_ipv6="inet6 auto_linklocal"
- if_bridge(4) now removes IPv6 addresses on a member interface to be
added when the parent interface or one of the existing member
interfaces has an IPv6 address. if_bridge(4) merges each link-local
scope zone which the member interfaces form respectively, so it causes
address scope violation. Removal of the IPv6 addresses prevents it.
- if_lagg(4) now removes IPv6 addresses on a member interfaces
unconditionally.
- Set reasonable flags to non-IPv6-capable interfaces. [*]
Submitted by: rpaulo [*]
MFC after: 1 week
Address. Although KAME implementation used FF02:0:0:0:0:2::/96 based on
older versions of draft-ietf-ipngwg-icmp-name-lookup, it has been changed
in RFC 4620.
The kernel always joins the /104-prefixed address, and additionally does
/96-prefixed one only when net.inet6.icmp6.nodeinfo_oldmcprefix=1.
The default value of the sysctl is 1.
ping6(8) -N flag now uses /104-prefixed one. When this flag is specified
twice, it uses /96-prefixed one instead.
Reviewed by: ume
Based on work by: Thomas Scheffler
PR: conf/174957
MFC after: 2 weeks
for the set of IPv6 addresses. Now each attempt goes into IPv6 statistics,
even if given rule did not won. Change this and take into account only
those rules, that won. Also add accounting for cases, when algorithm
fails to select an address.
- Do not calculate constant length values at run time,
CTASSERT() their sanity.
- Remove superfluous cleaning of mbuf fields after allocation.
- Replace compat macros with function calls.
Sponsored by: Nginx, Inc.
Currently we use interface indeces as zone IDs for link-local and
interface-local scopes, and since we don't have any tool to configure
zone IDs, there is no need to acquire the afdata lock several times per
packet only to read if_index value.
So, now in6_setscope reads zone IDs for interface-local, link-local and
global scopes without a lock.
Sponsored by: Yandex LLC
MFC after: 2 weeks
It stops treating the address on the interface as special by source
address selection rule even when the interface is outgoing interface.
This is desired in some situation.
Requested by: hrs
Reviewed by: IHANet folks including hrs
MFC after: 1 week
and embeds it into address. Inside the kernel we keep addresses with
embedded zone id only for two scopes: link-local and interface-local.
For other scopes this function is nop in most cases. To reduce an
overhead of locking, first check that address is capable for embedding.
Also, handle the loopback address before acquire the lock.
Sponsored by: Yandex LLC
MFC after: 1 week
all interested parties in case if interface flag IFF_UP has changed.
However, not only SIOCSIFFLAGS can raise the flag, but SIOCAIFADDR
and SIOCAIFADDR_IN6 can, too. The actual |= is done not in the protocol
code, but in code of interface drivers. To fix this historical layering
violation, we will check whether ifp->if_ioctl(SIOCSIFADDR) raised the
IFF_UP flag, and if it did, run the if_up() handler.
This fixes configuring an address under CARP control on an interface
that was initially !IFF_UP.
P.S. I intentionally omitted handling the IFF_SMART flag. This flag was
never ever used in any driver since it was introduced, and since it
means another layering violation, it should be garbage collected instead
of pretended to be supported.
For now use 256 buckets and fnv_hash function. Use xor'ed 32-bit
s6_addr32 parts of in6_addr structure as a hash key. Update
in6_localip and in6_is_addr_deprecated to use hash table for fastest
lookup.
Sponsored by: Yandex LLC
Discussed with: dwmalone, glebius, bz
entering llentry_free(), and in case if we lose the race, we should simply
perform LLE_FREE_LOCKED(). Otherwise, if the race is lost by the thread
performing arptimer(), it will remove two references from the lle instead
of one.
Reported by: Ian FREISLICH <ianf clue.co.za>
- Deembed scope id in L3 address in in6_lltable_dump().
- Simplify scope id recovery in rtsock routines.
- Remove embedded scope id handling in ndp(8) and route(8) completely.
Instead, add protocol specific mbuf flags M_IP_NEXTHOP and
M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain
contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup
only when this flag is set.
Suggested by: andre
on the related functionality in the runtime via the sysctl variable
net.pfil.forward. It is turned off by default.
Sponsored by: Yandex LLC
Discussed with: net@
MFC after: 2 weeks
in network byte order. Any host byte order processing is
done in local variables and host byte order values are
never[1] written to a packet.
After this change a packet processed by the stack isn't
modified at all[2] except for TTL.
After this change a network stack hacker doesn't need to
scratch his head trying to figure out what is the byte order
at the given place in the stack.
[1] One exception still remains. The raw sockets convert host
byte order before pass a packet to an application. Probably
this would remain for ages for compatibility.
[2] The ip_input() still subtructs header len from ip->ip_len,
but this is planned to be fixed soon.
Reviewed by: luigi, Maxim Dounin <mdounin mdounin.ru>
Tested by: ray, Olivier Cochard-Labbe <olivier cochard.me>
are using (different) ND-based approach described in RFC 4861. This change
is similar to r241406 which conditionally skips the same check in IPv4.
This change is part of bigger patch eliminating rte locking.
Sponsored by: Yandex LLC.
OK'd by: hrs
MFC after: 2 weeks
host byte order, was sometimes called with net byte order. Since we are
moving towards net byte order throughout the stack, the function was
converted to expect net byte order, and its consumers fixed appropriately:
- ip_output(), ipfilter(4) not changed, since already call
in_delayed_cksum() with header in net byte order.
- divert(4), ng_nat(4), ipfw_nat(4) now don't need to swap byte order
there and back.
- mrouting code and IPv6 ipsec now need to switch byte order there and
back, but I hope, this is temporary solution.
- In ipsec(4) shifted switch to net byte order prior to in_delayed_cksum().
- pf_route() catches up on r241245 changes to ip_output().
llentry_free() and arptimer():
o Use callout_init_rw() for lle timeout, this allows us safely
disestablish them.
- This allows us to simplify the arptimer() and make it
race safe.
o Consistently use ifp->if_afdata_lock to lock access to
linked lists in the lle hashes.
o Introduce new lle flag LLE_LINKED, which marks an entry that
is attached to the hash.
- Use LLE_LINKED to avoid double unlinking via consequent
calls to llentry_free().
- Mark lle with LLE_DELETED via |= operation istead of =,
so that other flags won't be lost.
o Make LLE_ADDREF(), LLE_REMREF() and LLE_FREE_LOCKED() more
consistent and provide more informative KASSERTs.
The patch is a collaborative work of all submitters and myself.
PR: kern/165863
Submitted by: Andrey Zonov <andrey zonov.org>
Submitted by: Ryan Stone <rysto32 gmail.com>
Submitted by: Eric van Gyzen <eric_van_gyzen dell.com>
adding any extension header, or rather before calling into IPsec
processing as we may send the packet and not return to IPv6 output
processing here.
PR: kern/170116
MFC After: 3 days
implementation of RFC 3484 for this purpose for a long time and "prefer_source"
was never implemented actually. ND6_IFF_PREFER_SOURCE macro is left intact.
it skips FLOWTABLE lookup. However, the non-NULL ro has dual meaning
here: it may be supplied to provide route, and it may be supplied to
store and return to caller the route that ip_output()/ip6_output()
finds. In the latter case skipping FLOWTABLE lookup is pessimisation.
The difference between struct route filled by FLOWTABLE and filled
by rtalloc() family is that the former doesn't hold a reference on
its rtentry. Reference is hold by flow entry, and it is about to
be released in future. Thus, route filled by FLOWTABLE shouldn't
be passed to RTFREE() macro.
- Introduce new flag for struct route/route_in6, that marks route
not holding a reference on rtentry.
- Introduce new macro RO_RTFREE() that cleans up a struct route
depending on its kind.
- All callers to ip_output()/ip6_output() that do supply non-NULL
but empty route should use RO_RTFREE() to free results of
lookup.
- ip_output()/ip6_output() now do FLOWTABLE lookup always when
ro->ro_rt == NULL.
Tested by: tuexen (SCTP part)
Possibly do some entra work in case we would not get into the
ifa0 != NULL paths later as we already do for the mltaddr before.
XXX We should possibly error in case in6_setscope fails.
Reference: http://lists.freebsd.org/pipermail/freebsd-net/2011-September/029829.html
Submitted by: bz
MFC after: 1 week
Interface routes are refcounted as packets move through the stack,
and there's garbage collection tied to it so that route changes can
safely propagate while traffic is flowing. In our setup, we weren't
changing or deleting any routes, but the refcounting logic in
ip6_input() was wrong and caused a reference leak on every inbound
V6 packet. This eventually caused a 32bit overflow, and the resulting
0 value caused the garbage collection to run on the active route.
That then snowballed into the panic.
Reviewed by: scottl
MFC after: 3 days
headers for TSO but also for generic checksum offloading. Ideally we
would only have one common function shared amongst all drivers, and
perhaps when updating them for IPv6 we should introduce that.
Eventually we should provide the meta information along with mbufs to
avoid (re-)parsing entirely.
To not break IPv6 (checksums and offload) and to be able to MFC the
changes without risking to hurt 3rd party drivers, duplicate the v4
framework, as other OSes have done as well.
Introduce interface capability flags for TX/RX checksum offload with
IPv6, to allow independent toggling (where possible). Add CSUM_*_IPV6
flags for UDP/TCP over IPv6, and reserve further for SCTP, and IPv6
fragmentation. Define CSUM_DELAY_DATA_IPV6 as we do for legacy IP and
add an alias for CSUM_DATA_VALID_IPV6.
This pretty much brings IPv6 handling in line with IPv4.
TSO is still handled in a different way and not via if_hwassist.
Update ifconfig to allow (un)setting of the new capability flags.
Update loopback to announce the new capabilities and if_hwassist flags.
Individual driver updates will have to follow, as will SCTP.
Reported by: gallatin, dim, ..
Reviewed by: gallatin (glanced at?)
MFC after: 3 days
X-MFC with: r235961,235959,235958
already plan to support >64k payload here, the IPv6 header payload
length obviously is only 16 bit and the calculations need to be right.
Reported by: dim
Tested by: dim
MFC after: 1 day
X-MFC: with r235958
Use M_ZERO with malloc rather than calling bzero() ourselves.
Change if () panic() checks to KASSERT()s as they are only
catching invariants in code flow but not dependent on network
input/output.
Move initial assigments indirecting pointers after the lock
has been aquired.
Passing layer boundries, reset M_PROTOFLAGS.
Remove a NULL assignment before free.
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
Reviewed by: gnn (as part of the whole)
MFC After: 3 days
Factor out Hop-By-Hop option processing. It's still not heavily used,
it reduces the footprint of ip6_input() and makes ip6_input() more
readable.
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
Reviewed by: gnn (as part of the whole)
MFC After: 3 days
Defer checksum calulations on UDP6 output and respect the mbuf
flags set by NICs having done checksum validation for us already,
thus saving the computing time in the input path as well.
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
Reviewed by: gnn (as part of the whole)
MFC After: 3 days
Add support for delayed checksum calculations in the IPv6
output path. We currently cannot offload to the card if we
add extension headers (which incl. fragmentation).
Fix two SCTP offload support copy&paste bugs: calculate
checksums if fragmenting and no need to flag IPv4 header
checksums in the IPv6 forwarding path.
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
Reviewed by: gnn (as part of the whole)
MFC After: 3 days
Hide the ip6aux functions. The only one referenced outside ip6_input.c
is not compiled in yet (__notyet__) in route6.c (r235954). We do have
accessor functions that should be used.
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
Reviewed by: gnn (as part of the whole)
MFC After: 3 days
X-MFC: KPI?
Simplify the code removing a return from an earlier else case,
not differing from the default function return called now.
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
Reviewed by: gnn (as part of the whole)
MFC After: 3 days
We currently nowhere set IP6A_SWAP making the entire check useless
with the current code. Keep around but do not compile in.
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
Reviewed by: gnn (as part of the whole)
MFC After: 3 days
No need to hold the (expensive) rt lock over (expensive) logging.
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
Reviewed by: gnn (as part of the whole)
MFC After: 3 days
Introduce a (for now copied stripped down) in6_cksum_pseudo()
function. We should be able to use this from in6_cksum() but
we should also ponder possible MD specific improvements.
It takes an extra csum argument to allow for easy checks as
will be done by the upper layer protocol input paths.
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
Reviewed by: gnn (as part of the whole)
MFC After: 3 days
Optimize in6_cksum(), re-ordering work and limiting variable
initialization, removing a bzero() for mostly re-initialized
struct values, making use of the newly introduced in6_getscope(),
as well as converting an if/panic to a KASSERT().
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
Reviewed by: gnn (as part of the whole)
MFC After: 3 days
Introduce in6_getscope() to allow more effective checksum
computations without the need to copy the address to clear the
scope.
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
Reviewed by: gnn (as part of the whole)
MFC After: 3 days
casted to structs by getting rid of these buffers entirely. In r169832, it
was tried to paper over this issue by 32-bit aligning the buffers. Depending
on compiler optimizations that still was insufficient for 64-bit architectures
with strong alignment requirements though.
While at it, add comments regarding the total lack of locking in this area.
Tested by: bz
Reviewed by: bz (slightly earlier version), yongari (earlier version)
MFC after: 1 week
inp_socket. To avoid panic, do not dereference inp_socket,
but obtain reuse port option from inp_flags2, like this
is done after next call to in_pcblookup_local() a few lines
down below.
Submitted by: rwatson
call in an #if 0 section.
In in6_selecthlim() optimize a case where in6p cannot be NULL due to an
earlier check.
More consistently use u_int instead of int for fibnum function arguments.
Sponsored by: Cisco Systems, Inc.
MFC after: 3 days
at which the lle_tbl pointer points to freed memory and the llt_free pointer is no longer
valid.
Move the free pointer in to the llentry itself and update the initalization sites.
MFC after: 2 weeks
return ifnet double pointer.
Pass that hint down to in6_selectif() to be used when i) the default FIB
is queried and ii) route lookup fails because the network is not present
(i.e. someone deleted the connected subnet).
This hint should not be generally used from anywhere outside the neighbor
discovery code. We just make use of it from nd6_ns_output().
Extend the nd6_na_output() interface by a nd6_na_output_fib() version
and pass the FIB number from the NS mbuf on to NA to allow the new mbuf
to inherit the FIB tag and a later lookup from ip6_output() to succeed
in the aformentioned example case.
Provide a wrapper function for the old public interface also used from
CARP but mark it with BURN_BRIDGES to cleanup in HEAD after MFC.
Sponsored by: Cisco Systems, Inc.
the original IPv4 implementation from r178888:
- Use RT_DEFAULT_FIB in the IPv4 implementation where noticed.
- Use rt*fib() KPI with explicit RT_DEFAULT_FIB where applicable in
the NFS code.
- Use the new in6_rt* KPI in TCP, gif(4), and the IPv6 network stack
where applicable.
- Split in6_rtqtimo() and in6_mtutimo() as done in IPv4 and equally
prevent multiple initializations of callouts in in6_inithead().
- Use wrapper functions where needed to preserve the current KPI to
ease MFCs. Use BURN_BRIDGES to indicate expected future cleanup.
- Fix (related) comments (both technical or style).
- Convert to rtinit() where applicable and only use custom loops where
currently not possible otherwise.
- Multicast group, most neighbor discovery address actions and faith(4)
are locked to the default FIB. Individual IPv6 addresses will only
appear in the default FIB, however redirect information and prefixes
of connected subnets are automatically propagated to all FIBs by
default (mimicking IPv4 behavior as closely as possible).
Sponsored by: Cisco Systems, Inc.
The actual ia6->ia6_lifetime access is hidden in
IFA6_IS_INVALID/IFA6_IS_DEPRECATED macros since a long time ago
(see netinet6/nd6.c, r1.104 of KAME for the reference).
MFC after: 3 days
comments to longer, also refining strange ones.
Properly use #ifdef rather than #if defined() where possible. Four
#if defined(PCBGROUP) occurances (netinet and netinet6) were ignored to
avoid conflicts with eventually upcoming changes for RSS.
Reported by: bde (most)
Reviewed by: bde
MFC after: 3 days
If set to 1, no ABORT is sent back in response to an incoming
INIT. If set to 2, no ABORT is sent back in response to
an out of the blue packet. If set to 0 (the default), ABORTs
are sent.
Discussed with rrs@.
MFC after: 1 month.
in6m_release_locked() to defer calls to mld_v1_transmit_report() until
after the IF_ADDR_LOCK is dropped. This removes a race where the lock
is dropped and reacquired while attempting to walk an interface's
address list.
Reviewed by: bz
MFC after: 1 week
reference on a group in the leaving state while iterating over the loop.
Instead, use the same approach used in igmp_ifdetach() and mld_ifdetach()
of placing the groups to free on pending release list and then releasing
the references after dropping the IF_ADDR_LOCK. This closes an ugly race
where the code was dropping the lock in the middle of iterating over the
list. It also fixes some additional potential use-after-free bugs since
the cancellation routine also applied other changes to the group after
dropping the reference. Now those changes are performed before the
reference is dropped and the group is potentially freed.
Prodded to fix by: glebius
Reviewed by: bz
MFC after: 1 week
of the SIOC[DG]LIFADDR icotls before dropping the IF_ADDR_LOCK() and
release the reference after using it. This prevents the address from
being potentially freed out from under the ioctl handler.
Reviewed by: bz
MFC after: 1 week
from scratch, copying needed functionality from the old implemenation
on demand, with a thorough review of all code. The main change is that
interface layer has been removed from the CARP. Now redundant addresses
are configured exactly on the interfaces, they run on.
The CARP configuration itself is, as before, configured and read via
SIOCSVH/SIOCGVH ioctls. A new prefix created with SIOCAIFADDR or
SIOCAIFADDR_IN6 may now be configured to a particular virtual host id,
which makes the prefix redundant.
ifconfig(8) semantics has been changed too: now one doesn't need
to clone carpXX interface, he/she should directly configure a vhid
on a Ethernet interface.
To supply vhid data from the kernel to an application the getifaddrs(8)
function had been changed to pass ifam_data with each address. [1]
The new implementation definitely closes all PRs related to carp(4)
being an interface, and may close several others. It also allows
to run a single redundant IP per interface.
Big thanks to Bjoern Zeeb for his help with inet6 part of patch, for
idea on using ifam_data and for several rounds of reviewing!
PR: kern/117000, kern/126945, kern/126714, kern/120130, kern/117448
Reviewed by: bz
Submitted by: bz [1]
after its installation. This removal may be accidental and can
prevent the default route from being installed in the future if
the associated default router has the best preference. The cause
is the lack of status update in the default router on the state
of its route installation in the kernel FIB. This patch fixes
the described problem.
Reviewed by: hrs, discussed with hrs
MFC after: 5 days
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
inp_socket->so_options dereference when we may not acquire the lock on
the inpcb.
This fixes the crash due to NULL pointer dereference in
in_pcbbind_setup() when inp_socket->so_options in a pcb returned by
in_pcblookup_local() was checked.
Reported by: dave jones <s.dave.jones@gmail.com>, Arnaud Lacombe <lacombar@gmail.com>
Suggested by: rwatson
Glanced by: rwatson
Tested by: dave jones <s.dave.jones@gmail.com>
separation rewrite changes. r196865 was committed to fix a scope
violation problem in the following test scenario:
box-1# ifconfig em0 inet6 2001:db8:1:: prefixlen 64 anycast
box-1# ifconfig em1 inet6 2001:db8:2::1 prefixlen 64
box-2# ifconfig re0 inet6 2001:db8:1::6 prefixlen 64
em0 and re0 are on the same link.
box-2# ping6 2001:db8:1::
PING6(56=40+8+8 bytes) 2001:db8:1::6 --> 2001:db8:1::
the ICMPv6 response should have a source address of em1, which
is 2001:db8:2::1, not the link-local address of em0.
That code is no longer necessary and breaks the IPv6-Ready logo
testing, so revert it now.
Reviewed by: hrs
MFC after: 3 days
inlined by Qing Li in his big new-ARP commit. I am going to utilize
them in my newcarp work, and also these functions left declared
in in6_var.h for all the time they were absent.
Reviewed by: bz
determine if a loopback route should be installed for an interface
IPv6 address. Another condition is the address must not belong to a
looopback interface.
Reviewed by: hrs
MFC after: 3 days
appending the new mbuf to the chain reference but possibly causing an mbuf
nextpkt loop leading to a memory used after handoff (or having been freed)
and leaking an mbuf here.
Reviewed by: rwatson, brooks
MFC after: 3 days
(r225485). When setting an interface name to it, the following
configurations will be enabled:
1. "no_radr" is set to all IPv6 interfaces automatically.
2. "-no_radr accept_rtadv" will be set only for $ipv6_cpe_wanif. This is
done just before evaluating $ifconfig_IF_ipv6 in the rc.d scripts (this
means you can manually supersede this configuration if necessary).
3. The node will add RA-sending routers to the default router list
even if net.inet6.ip6.forwarding=1.
This mode is added to conform to RFC 6204 (a router which connects
the end-user network to a service provider network). To enable
packet forwarding, you still need to set ipv6_gateway_enable=YES.
Note that accepting router entries into the default router list when
packet forwarding capability and a routing daemon are enabled can
result in messing up the routing table. To minimize such unexpected
behaviors, "no_radr" is set on all interfaces but $ipv6_cpe_wanif.
Approved by: re (bz)
mld_set_version() is called only from mld_v1_input_query() and
mld_v2_input_query() both holding the if_addr_mtx lock, and then calling
into mld_v2_cancel_link_timers() acquires it the second time, which results
in mtx recursion. To avoid that, delay if_addr_mtx acquisition until after
mld_set_version() is called; while here, further reduce locking scope
to protect only the needed pieces: if_multiaddrs, in6m_lookup_locked().
PR: kern/158426
Reported by: Thomas <tps vr-web.de>,
Tom Vijlbrief <tom.vijlbrief xs4all.nl>
Tested by: Tom Vijlbrief
Reviewed by: bz
Approved by: re (kib)
Distinguish IPv4 and IPv6 addresses and optional port numbers in
user space to set the option for the correct protocol family.
Add support in the kernel for carrying the new IPv6 destination
address and port.
Add support to TCP and UDP for IPv6 and fix UDP IPv4 to not change
the address in the IP header.
Add support for IPv6 forwarding to a non-local destination.
Add a regession test uitilizing VIMAGE to check all 20 possible
combinations I could think of.
Obtained from: David Dolson at Sandvine Incorporated
(original version for ipfw fwd IPv6 support)
Sponsored by: Sandvine Incorporated
PR: bin/117214
MFC after: 4 weeks
Approved by: re (kib)
people think: returning true for an address in any connected subnet, not
necessarily on the local machine.
Sponsored by: Sandvine Incorporated
MFC after: 2 weeks
Approved by: re (kib)
* Decouple the path supervision using a separate HB timer per path.
* Add support for potentially failed state.
* Bring back RTO.min to 1 second.
* Accept packets on IP-addresses already announced via an ASCONF
* While there: do some cleanups.
Approved by: re@
MFC after: 2 months.
same as the host address. This already works fine for INET6 and ND6.
While here, remove two function pointers from struct lltable which are
only initialized but never used.
MFC after: 3 days
whether decapsulated IPsec packets will be passed to pfil again depending
on the setting of the net.ip6.ipsec6.filtertunnel sysctl.
PR: kern/157670
Submitted by: Manuel Kasper (mk neon1.net)
MFC after: 2 weeks
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups base on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
an explicit action for INET6 configuration happens. The changes are:
1. When an ND6 flag is changed via SIOCSIFINFO_FLAGS ioctl,
setting ND6_IFF_ACCEPT_RTADV and/or ND6_IFF_AUTO_LINKLOCAL now triggers
an attempt to clear the ND6_IFF_IFDISABLED flag.
2. When an AF_INET6 address is added successfully to an interface and
it is marked as ND6_IFF_IFDISABLED, an attempt to clear the
ND6_IFF_IFDISABLED happens.
This simplifies ND6_IFF_IFDISABLED flag manipulation by users via ifconfig(8);
in most cases manual configuration is no longer needed.
- When ND6_IFF_AUTO_LINKLOCAL is set and no link-local address is assigned to
an interface, SIOCSIFINFO_FLAGS ioctl now calls in6_ifattach() to configure
a link-local address.
This change ensures link-local address configuration when "ifconfig IF inet6"
command is invoked. For example, "ifconfig IF inet6 auto_linklocal" now
always try to configure an LL addr even if ND6_IFF_AUTO_LINKLOCAL is already
set to 1 (i.e. down/up cycle is no longer needed).
Reviewed by: bz
- A new per-interface knob IFF_ND6_NO_RADR and sysctl IPV6CTL_NO_RADR.
This controls if accepting a route in an RA message as the default route.
The default value for each interface can be set by net.inet6.ip6.no_radr.
The system wide default value is 0.
- A new sysctl: net.inet6.ip6.norbit_raif. This controls if setting R-bit in
NA on RA accepting interfaces. The default is 0 (R-bit is set based on
net.inet6.ip6.forwarding).
Background:
IPv6 host/router model suggests a router sends an RA and a host accepts it for
router discovery. Because of that, KAME implementation does not allow
accepting RAs when net.inet6.ip6.forwarding=1. Accepting RAs on a router can
make the routing table confused since it can change the default router
unintentionally.
However, in practice there are cases where we cannot distinguish a host from
a router clearly. For example, a customer edge router often works as a host
against the ISP, and as a router against the LAN at the same time. Another
example is a complex network configurations like an L2TP tunnel for IPv6
connection to Internet over an Ethernet link with another native IPv6 subnet.
In this case, the physical interface for the native IPv6 subnet works as a
host, and the pseudo-interface for L2TP works as the default IP forwarding
route.
Problem:
Disabling processing RA messages when net.inet6.ip6.forwarding=1 and
accepting them when net.inet6.ip6.forward=0 cause the following practical
issues:
- A router cannot perform SLAAC. It becomes a problem if a box has
multiple interfaces and you want to use SLAAC on some of them, for
example. A customer edge router for IPv6 Internet access service
using an IPv6-over-IPv6 tunnel sometimes needs SLAAC on the
physical interface for administration purpose; updating firmware
and so on (link-local addresses can be used there, but GUAs by
SLAAC are often used for scalability).
- When a host has multiple IPv6 interfaces and it receives multiple RAs on
them, controlling the default route is difficult. Router preferences
defined in RFC 4191 works only when the routers on the links are
under your control.
Details of Implementation Changes:
Router Advertisement messages will be accepted even when
net.inet6.ip6.forwarding=1. More precisely, the conditions are as
follow:
(ACCEPT_RTADV && !NO_RADR && !ip6.forwarding)
=> Normal RA processing on that interface. (as IPv6 host)
(ACCEPT_RTADV && (NO_RADR || ip6.forwarding))
=> Accept RA but add the router to the defroute list with
rtlifetime=0 unconditionally. This effectively prevents
from setting the received router address as the box's
default route.
(!ACCEPT_RTADV)
=> No RA processing on that interface.
ACCEPT_RTADV and NO_RADR are per-interface knob. In short, all interface
are classified as "RA-accepting" or not. An RA-accepting interface always
processes RA messages regardless of ip6.forwarding. The difference caused by
NO_RADR or ip6.forwarding is whether the RA source address is considered as
the default router or not.
R-bit in NA on the RA accepting interfaces is set based on
net.inet6.ip6.forwarding. While RFC 6204 W-1 rule (for CPE case) suggests
a router should disable the R-bit completely even when the box has
net.inet6.ip6.forwarding=1, I believe there is no technical reason with
doing so. This behavior can be set by a new sysctl net.inet6.ip6.norbit_raif
(the default is 0).
Usage:
# ifconfig fxp0 inet6 accept_rtadv
=> accept RA on fxp0
# ifconfig fxp0 inet6 accept_rtadv no_radr
=> accept RA on fxp0 but ignore default route information in it.
# sysctl net.inet6.ip6.norbit_no_radr=1
=> R-bit in NAs on RA accepting interfaces will always be set to 0.
hash install, etc. For now, these are arguments are unused, but as we add
RSS support, we will want to use hashes extracted from mbufs, rather than
manually calculated hashes of header fields, due to the expensive of the
software version of Toeplitz (and similar hashes).
Add notes that it would be nice to be able to pass mbufs into lookup
routines in pf(4), optimising firewall lookup in the same way, but the
code structure there doesn't facilitate that currently.
(In principle there is no reason this couldn't be MFCed -- the change
extends rather than modifies the KBI. However, it won't be useful without
other previous possibly less MFCable changes.)
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
feature_present(3) to dynamically decide whether to use one or the
other family.
Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 10 days
in_pcb_lport(), in_pcblookup_local(), and in_pcblookup_hash(), and similarly
for IPv6 functions. In the future, we would like to support other flags
relating to locking strategy.
This change doesn't appear to modify the KBI in practice, as callers already
passed in INPLOOKUP_WILDCARD rather than a simple boolean.
MFC after: 3 weeks
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
interface is brought down, even though the interface address is still
valid. This patch maintains the permanent ARP entries as long as the
interface address (having the same prefix as that of the ARP entries)
is valid.
Reviewed by: delphij
MFC after: 5 days
Some bugs where fixed while doing this:
* ASCONF-ACK messages might use wrong port number when using
IPv6.
* Checking for additional addresses takes the correct address
into account and also does not do more comparisons than
necessary.
This patch is based on one received from bz@ who was
sponsored by The FreeBSD Foundation and iXsystems.
MFC after: 1 week
as well compiling out most functions adding or extending #ifdef INET
coverage.
Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 4 days
Unfold the IPSEC_COMMON_INPUT_CB() macro in xform_{ah,esp,ipcomp}.c
to not need three different versions depending on INET, INET6 or both.
Mark two places preparing for not yet supported functionality with IPv6.
Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 4 days
Not compiling in and not initializing from inetsw from in_proto.c for
IPv6 only, we need to initialize upper layer protocols from inet6sw.
Make sure to not initialize them twice in a Dual-Stack
environment but only conditionally on no INET as we have done for
TCP for a long time. Otherwise we would leak resources.
Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 3 days
passing the cached proxydl reference (sockaddr_dl initialized or not) to
nd6_na_output(). nd6_na_output() will thus assume a proxy NA. Revert to
conditionally passing either &proxydl or NULL if no proxy case desired.
Tested by: ipv6gw and ref9-i386
Reported by: Pete French (petefrench ingresso.co.uk on stable)
Reported by: bz, simon on Y! cluster
Reported by: kib
PR: kern/151908
MFC after: 3 days
working. We store v4 and v6 addresses as a union but for v4-mapped
addresses only store the 32bits w/o the ::ffff: word. That failed the
check as for example 127.0.0.1 would be ::7f00:1 rather than ::ffff:7f00:1
and the IN6_IS_ADDR_V4MAPPED() never worked here. Given we can hardly get
here with an unbound local address or invalid inp_vflags remove the check.
Reported by: tuexen
Reviewed by: tuexen
MFC after: 3 days
some more. Similar to what we do for TCP check for v4-mapped
addresses and then handle them or the normal v6 address case.
For either set inp_vflags before calling into the pcb connect
function so that we have an unambiguous view in case we need to
set the local address or port.
Looked at: tuexen (as part of more)
MFC after: 3 days
callers. This also fixes a problem when the prison call could set
the inp->in6p_laddr (laddr) and a following priv_check_cred() call
would return an error and will allow us to merge the IPv4 and IPv6
implementation.
MFC after: 2 weeks
even after dropping the reference and unlocking. Previously we
have dereferenced a NULL pointer (after r121765).
Simply unlocking after the block does not work either because of
lock ordering (see r121765) and in addition we would still hold
a pointer to something that might be gone by the time we access it.
Thus take a copy of the value rather than just caching the pointer.
PR: kern/151908
Submitted by: chenyl (netstar2008 126.com) (initial version)
MFC after: 2 weeks
* Store the flowid when receiving an SCTP/IPv6 packet.
* Store the flowid when receiving an SCTP packet with wrong CRC.
* Initilize flowid correctly.
* Put test code under INVARIANTS.
MFC after: 3 months.
Call the handler function with the lock held, return unlocked as we
might free the entry. Rework functions later in the call graph to be
either called with the lock held or, only if needed, unlocked.
Place asserts to document and tighten assumptions on various lle locking,
which were not always true before.
We call nd6_ns_output() unlocked and the assignment of ip6->ip6_src was
decentralized to minimize possible complexity introduced with the formerly
missing locking there. This also resulted in a push down of local
variable scopes into smaller blocks.
Reported by: many
PR: kern/148857
Submitted by: Dmitrij Tejblum (tejblum yandex-team.ru) (original version)
MFC After: 4 days
DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various
people working on the affected files. A better long-term solution is
still being considered. This reversal may give some modules empty
set_pcpu or set_vnet sections, but these are harmless.
Changes reverted:
------------------------------------------------------------------------
r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines
Instead of unconditionally emitting .globl's for the __start_set_xxx and
__stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu
sections are actually defined.
------------------------------------------------------------------------
r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines
Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.
------------------------------------------------------------------------
r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines
Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.
Consistently use the LLE_ prefix for lla_lookup() and the ND6_ prefix
for nd6_lookup() even though both are defined the same. Use the right
flag variable when checking each.
No real functional change.
MFC after: 4 days
legacy and IPv6 route destination address.
Previously in case of IPv6, there was a memory overwrite due to not enough
space for the IPv6 address.
PR: kern/122565
MFC After: 2 weeks
un-expiring.
The previous version of code have no locking when testing rt_refcnt.
The result of the lack of locking may result in a condition where
a routing entry have a reference count but at the same time have
RTPRF_OURS bit set and an expiration timer. These would eventually
lead to a panic:
panic: rtqkill route really not free
When the system have ICMP redirects accepted from local gateway
in a moderate frequency, for instance.
Commit this workaround for now until we have some better solution.
PR: kern/149804
Reviewed by: bz
Tested by: Zhao Xin, Pete French
MFC after: 2 weeks
In protosw we define pr_protocol as short, while on the wire
it is an uint8_t. That way we can have "internal" protocols
like DIVERT, SEND or gaps for modules (PROTO_SPACER).
Switch ipproto_{un,}register to accept a short protocol number(*)
and do an upfront check for valid boundries. With this we
also consistently report EPROTONOSUPPORT for out of bounds
protocols, as we did for proto == 0. This allows a caller
to not error for this case, which is especially important
if we want to automatically call these from domain handling.
(*) the functions have been without any in-tree consumer
since the initial introducation, so this is considered save.
Implement ip6proto_{un,}register() similarly to their legacy IP
counter parts to allow modules to hook up dynamically.
Reviewed by: philip, will
MFC after: 1 week
Fix the switching on/off of PF and NR-SACKs using sysctl.
Add minor improvement in handling malloc failures.
Improve the address checks when sending.
MFC after: 4 weeks
Add kernel side support for Secure Neighbor Discovery (SeND), RFC 3971.
The implementation consists of a kernel module that gets packets from
the nd6 code, sends them to user space on a dedicated socket and reinjects
them back for further processing.
Hooks are used from nd6 code paths to divert relevant packets to the
send implementation for processing in user space. The hooks are only
triggered if the send module is loaded. In case no user space
application is connected to the send socket, processing continues
normaly as if the module would not be loaded. Unloading the module
is not possible at this time due to missing nd6 locking.
The native SeND socket is similar to a raw IPv6 socket but with its own,
internal pseudo-protocol.
Approved by: bz (mentor)
and go to the next iteration early if multicast filtering would decide that
this socket shall not receive the data.
Unlock the pcb in that case or we leak the read lock and next time trying
to get a write lock, would hang forever.
PR: kern/149608
Submitted by: Chris Luke (chrisy flirble.org)
MFC after: 3 days
bridge(4), lagg(4) etc. and make use of function pointers and
pf_proto_register() to hook carp into the network stack.
Currently, because of the uncertainty about whether the unload path is free
of race condition panics, unloads are disallowed by default. Compiling with
CARPMOD_CAN_UNLOAD in CFLAGS removes this anti foot shooting measure.
This commit requires IP6PROTOSPACER, introduced in r211115.
Reviewed by: bz, simon
Approved by: ken (mentor)
MFC after: 2 weeks
Add proto spacers to inet6sw like we have for legacy IP. This allows us
to dynamically pf_proto_register() for INET6 from modules, needed by
upcoming CARP changes and SeND.
MC and SCTP could make use of it as well in theory in the future after
upcoming VIMAGE vnet teardown work.
Discussed with: will, anchie
MFC after: 10 days
nd6_llinfo_timer() functions with a KASSERT().
Note: there is no need to return after panic.
In the legacy IP case, only assign the arg after the check,
in the IPv6 case, remove the extra checks for the table and
interface as they have to be there unless we freed and forgot
to cancel the timer. It doesn't matter anyway as we would
panic on the NULL pointer deref immediately and the bug is
elsewhere.
This unifies the code of both address families to some extend.
Reviewed by: rwatson
MFC after: 6 days
working anymore. In addition more checks and operations were missing.
In case lla_lookup results in a match, get the ifaddr to update the
statistics counters, and check that the address is neither tentative,
duplicate or otherwise invalid before accepting the packet. If ok,
record the address information in the mbuf. [ as is done in case
lla_lookup does not return a result and we go through the FIB ].
Reported by: remko
Tested by: remko
MFC after: 2 weeks
We do not respect rules 3 and 4 in the required list:
1. omit leading zeros
2. "::" used to their maximum extent whenever possible
3. "::" used where shortens address the most
4. "::" used in the former part in case of a tie breaker
5. do not shorten one 16 bit 0 field
6. use lower case
http://tools.ietf.org/html/draft-ietf-6man-text-addr-representation-04.html
Submitted by: Kalluru Abhiram @ Juniper Networks
Obtained from: Juniper Networks
Reviewed by: hrs, dougb
"Whitspace" churn after the VIMAGE/VNET whirls.
Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.
Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOABLS (before r185088) and try to
reduce the diff to stable/7 and earlier as good as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.
This also removes some header file pollution for putatively
static global variables.
Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.
Reviewed by: jhb
Discussed with: rwatson
Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
MFC after: 6 days
that we allow all possible jail IPs as source address rather than
forcing the "primary". While IPv6 naturally has source address
selection, for legacy IP we do not go through the pain in case
IP_HDRINCL was not set. People should bind(2) for that.
This will, for example, allow ping(|6) -S to work correctly for
non-primary addresses.
Reported by: (ten 211.ru)
Tested by: (ten 211.ru)
MFC after: 4 days
addresses while walking the IPv6 address list if in the jail case
something is connecting to ::1.
Reported by: Pieter de Boer (pieter thedarkside.nl)
Tested by: Pieter de Boer (pieter thedarkside.nl)
MFC after: 4 days
prevented the link-layer entry from being freed.
In both in.c and in6.c (though that code path seems to be basically dead)
plug a reference leak in case of a pending callout being drained.
In if_ether.c consistently add a reference before resetting the callout
and in case we canceled a pending one remove the reference for that.
In the final case in arptimer, before freeing the expired entry, remove
the reference again and explicitly call callout_stop() to clear the active
flag.
In nd6.c:nd6_free() we are only ever called from the callout function and
thus need to remove the reference there as well before calling into
llentry_free().
In if_llatbl.c when freeing entire tables make sure that in case we cancel
a pending callout to remove the reference as well.
Reviewed by: qingli (earlier version)
MFC after: 10 days
Problem observed, patch tested by: simon on ipv6gw.f.o,
Christian Kratzer (ck cksoft.de),
Evgenii Davidov (dado korolev-net.ru)
PR: kern/144564
Configurations still affected: with options FLOWTABLE
being embedded is in fact link-local, before attempting to embed it.
Note that this operation is a side-effect of trying to avoid recursion on
the IN6 scope lock.
PR: 144560
Submitted by: Petr Lampa
MFC after: 3 days
* Fix handling of mapping arrays when draining mbufs or processing
FORWARD-TSN chunks.
* Cleanup code (no duplicate code anymore for SACKs and NR-SACKs).
Part of this code was developed together with rrs.
MFC after: 2 weeks.
have the delayed function take an argument as to the offset
to the SCTP header. This allows it to work for V4 and V6.
This of course means changing all callers of the function
to either pass the header len, if they have it, or create
it (ip_hl << 2 or sizeof(ip6_hdr)).
PR: 144529
MFC after: 2 weeks
no delayed checksum was added to the ip6 output code. This
causes cards that do not support SCTP checksum offload to
have SCTP packets that are IPv6 NOT have the sctp checksum
performed. Thus you could not communicate with a peer. This
adds the missing bits to make the checksum happen for these cards.
PR: 144529
MFC after: 2 weeks
whether to use source address selection (default) or the primary
jail address for unbound outgoing connections.
This is intended to be used by people upgrading from single-IP
jails to multi-IP jails but not having to change firewall rules,
application ACLs, ... but to force their connections (unless
otherwise changed) to the primry jail IP they had been used for
years, as well as for people prefering to implement similar policies.
Note that for IPv6, if configured incorrectly, this might lead to
scope violations, which single-IPv6 jails could as well, as by the
design of jails. [1]
Reviewed by: jamie, hrs (ipv6 part)
Pointed out by: hrs [1]
MFC After: 2 weeks
Asked for by: Jase Thew (bazerka beardz.net)
for the interface address. This marker is necessary to properly support
PPP types of links where multiple links can have the same local end
IP address. The IFA_RTSELF flag bit maps to the RTF_HOST value, which
was combined into the route flag bits during prefix installation in
IPv6. This inclusion causing the prefix route to be unusable. This
patch fixes this bug by excluding the IFA_RTSELF flag during route
installation.
MFC after: 5 days
same interface. The first address will install the prefix route into
the kernel routing table and that prefix will be marked as on-link.
Without RADIX_MPATH enabled, the other address aliases of the same
prefix will update the prefix reference count but no other routes
will be installed. Consequently the prefixes associated with these
addresses would not be marked as on-link. As such, incoming packets
destined to these address aliases will fail the ND6 on-link check
on input. This patch fixes the above problem by searching the kernel
routing table and try to find an on-link prefix on the given interface.
MFC after: 5 days
IFF_POINTOPOINT link types. The reason was due to the routing
entry returned from the kernel covering the remote end is of an
interface type that does not support ARP. This patch fixes this
problem by providing a hint to the kernel routing code, which
indicates the prefix route instead of the PPP host route should
be returned to the caller. Since a host route to the local end
point is also added into the routing table, and there could be
multiple such instantiations due to multiple PPP links can be
created with the same local end IP address, this patch also fixes
the loopback route installation failure problem observed prior to
this patch. The reference count of loopback route to local end would
be either incremented or decremented. The first instantiation would
create the entry and the last removal would delete the route entry.
MFC after: 5 days
with SSM MLDv2 by default.
This is current practice and complies with RFC 4604, as well as being
required by production IPv6 networks in Japan.
The behaviour may be disabled by setting the net.inet6.mld.use_allow
sysctl/tunable to 0.
Requested by: Hideki Yamamoto
MFC after: 1 week
if (jailed(cred))
left. If you are running with a vnet (virtual network stack) those will
return true and defer you to classic IP-jails handling and thus things
will be "denied" or returned with an error.
Work around this problem by introducing another "jailed()" function,
jailed_without_vnet(), that also takes vnets into account, and permits
the calls, should the jail from the given cred have its own virtual
network stack.
We cannot change the classic jailed() call to do that, as it is used
outside the network stack as well.
Discussed with: julian, zec, jamie, rwatson (back in Sept)
MFC after: 5 days
Don't allow joins w/o source on an existing group.
This is almost always pilot error.
We don't need to check for group filter UNDEFINED state at t1,
because we only ever allocate filters with their groups, so we
unconditionally reject such calls with EINVAL.
Trying to change the active filter mode w/o going through IPV6_MSFILTER
is also disallowed.
MFC after: 1 day
Tighten input checking in in6p_join_group():
* Don't try to use the source address, when its family is unspecified.
* If we get a join without a source, on an existing inclusive
mode group, this is an error, as it would change the filter mode.
Fix a problem with the handling of in6_mfilter for new memberships:
* Do not rely on im6f being NULL; it is explicitly initialized to a
non-NULL pointer when constructing a membership.
* Explicitly initialize *im6f to EX mode when the source address
is unspecified.
This fixes a problem with in_mfilter slot recycling in the join path.
MFC after: 1 day
Fix an obvious logic error in the IPv4 multicast leave processing,
where the filter mode vector was not updated correctly after the leave.
MFC after: 1 day
we did not add. Call LLE_REMREF() only when callout_stop()
actually canceled a pending callout.
- callout_reset() may cancel a pending callout. When
callout_reset() canceled a pending callout, call LLE_REMREF()
to drop a reference for the canceled callout.
MFC after: 1 week
Adding a tentative address is useless.
- Comment out a confused warning message when
in6_ifattach_linklocal() fails. This can occur when the
interface does not support ioctl(SIOCAIFADDR) (interfaces
associated with 802.11 wireless network device drivers, for
example).
packet filters. ALso allows ipfw to be enabled on on ejail and disabled
on another. In 8.0 it's a global setting.
Sitting aroung in tree waiting to commit for: 2 months
MFC after: 2 months
Note that when the interface has ND6_IFF_IFDISABLED, a newly-added
address is always marked as IN6_IFF_TENTATIVE so that the interface
can perform DAD after the ND6_IFF_IFDISABLED is cleared.
and address aliases. After an interface is brought down and brought
back up again, those self pointing routes disappeared. This patch
ensures after an interface is brought back up, the loopback routes
are reinstalled properly.
Reviewed by: bz
MFC after: immediately
automatic link-local address configuration:
- Convert a sysctl net.inet6.ip6.accept_rtadv to one for the
default value of a per-IF flag ND6_IFF_ACCEPT_RTADV, not a
global knob. The default value of the sysctl is 0.
- Add a new per-IF flag ND6_IFF_AUTO_LINKLOCAL and convert a
sysctl net.inet6.ip6.auto_linklocal to one for its default
value. The default value of the sysctl is 1.
- Make ND6_IFF_IFDISABLED more robust. It can be used to disable
IPv6 functionality of an interface now.
- Receiving RA is allowed if ip6_forwarding==0 *and*
ND6_IFF_ACCEPT_RTADV is set on that interface. The former
condition will be revisited later to support a "host + router" box
like IPv6 CPE router. The current behavior is compatible with
the older releases of FreeBSD.
- The ifconfig(8) now supports these ND6 flags as well as "nud",
"prefer_source", and "disabled" in ndp(8). The ndp(8) now
supports "auto_linklocal".
Discussed with: bz and jinmei
Reviewed by: bz
MFC after: 3 days
scenario where an anycast address is assigned on one interface,
and a global address with the same scope is assigned on another
interface. In other words, the interface owns the anycast
address has only the link-local address as one other address.
Without this patch, "ping6" the anycast address from another
station will observe the source address of the returned ICMP6
echo reply has the link-local address, not the global address
that exists on the other interface in the same node.
Reviewed by: bz
MFC after: immediately
- Interface link-local address is not reachable within the
node that owns the interface, this is due to the mismatch
in address scope as the result of the installed interface
address loopback route. Therefore for each interface
address loopback route, the rt_gateway field (of AF_LINK
type) will be used to track which interface a given
address belongs to. This will aid the address source to
use the proper interface for address scope/zone validation.
- The loopback address is not reachable. The root cause is
the same as the above.
- Empty nd6 entries are created for the IPv6 loopback addresses
only for validation reason. Doing so will eliminate as much
of the special case (loopback addresses) handling code
as possible, however, these empty nd6 entries should not
be returned to the userland applications such as the
"ndp" command.
Since both of the above issues contain common files, these
files are committed together.
Reviewed by: bz
MFC after: immediately
configured prefixes. Since these statically configured prefixes
do not have any associated advertising routers, these prefixes
are treated as unreachable and those prefix routes are deleted
from the routing table. Therefore bypass prefixes that are not
learned from router advertisements during prefix on-link check.
Reviewed by: hrs
an IPv6 address assigned to it, and if an incoming packet received on
one interface has a packet destination address that belongs to another
interface, the routing table is consulted to determine how to reach this
packet destination. Since the packet destination is an interface address,
the route table will return a host route with the loopback interface as
rt_ifp. The input code must recognize this fact, instead of using the
loopback interface, the input code performs a search to find the right
interface that owns the given IPv6 address.
Reviewed by: bz, gnn, kmacy
MFC after: immediately
list/index locks, to protect link layer address tables. This avoids
lock order issues during interface teardown, but maintains the bug that
sysctl copy routines may be called while a non-sleepable lock is held.
Reviewed by: bz, kmacy
MFC after: 3 days
several critical bugs, including race conditions and lock order issues:
Replace the single rwlock, ifnet_lock, with two locks, an rwlock and an
sxlock. Either can be held to stablize the lists and indexes, but both
are required to write. This allows the list to be held stable in both
network interrupt contexts and sleepable user threads across sleeping
memory allocations or device driver interactions. As before, writes to
the interface list must occur from sleepable contexts.
Reviewed by: bz, julian
MFC after: 3 days
address is configured with a /128 prefix. This is no longer necessary due
to r192011. In fact that code conflicts with r192011. This patch removes
the host route installation when detecting the /128 prefix, and instead
let the code added by r192011 to install the loopback route for that IPv6
interface address.
Reviewed by: bz
Approved by: re
all pertinent statatistics for the subsystem. These structures are
sometimes "borrowed" by kernel modules that require a place to store
statistics for similar events.
Add KPI accessor functions for statistics structures referenced by kernel
modules so that they no longer encode certain specifics of how the data
structures are named and stored. This change is intended to make it
easier to move to per-CPU network stats following 8.0-RELEASE.
The following modules are affected by this change:
if_bridge
if_cxgb
if_gif
ip_mroute
ipdivert
pf
In practice, most of these statistics consumers should, in fact, maintain
their own statistics data structures rather than borrowing structures
from the base network stack. However, that change is too agressive for
this point in the release cycle.
Reviewed by: bz
Approved by: re (kib)
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.
Reviewed by: bz
Approved by: re (vimage blanket)
- Allow loopback route to be installed for address assigned to
interface of IFF_POINTOPOINT type.
- Install loopback route for an IPv4 interface addreess when the
"useloopback" sysctl variable is enabled. Similarly, install
loopback route for an IPv6 interface address when the sysctl variable
"nd6_useloopback" is enabled. Deleting loopback routes for interface
addresses is unconditional in case these sysctl variables were
disabled after an interface address has been assigned.
Reviewed by: bz
Approved by: re
network stacks, VNET_SYSINIT:
- Add VNET_SYSINIT and VNET_SYSUNINIT macros to declare events that will
occur each time a network stack is instantiated and destroyed. In the
!VIMAGE case, these are simply mapped into regular SYSINIT/SYSUNINIT.
For the VIMAGE case, we instead use SYSINIT's to track their order and
properties on registration, using them for each vnet when created/
destroyed, or immediately on module load for already-started vnets.
- Remove vnet_modinfo mechanism that existed to serve this purpose
previously, as well as its dependency scheme: we now just use the
SYSINIT ordering scheme.
- Implement VNET_DOMAIN_SET() to allow protocol domains to declare that
they want init functions to be called for each virtual network stack
rather than just once at boot, compiling down to DOMAIN_SET() in the
non-VIMAGE case.
- Walk all virtualized kernel subsystems and make use of these instead
of modinfo or DOMAIN_SET() for init/uninit events. In some cases,
convert modular components from using modevent to using sysinit (where
appropriate). In some cases, do minor rejuggling of SYSINIT ordering
to make room for or better manage events.
Portions submitted by: jhb (VNET_SYSINIT), bz (cleanup)
Discussed with: jhb, bz, julian, zec
Reviewed by: bz
Approved by: re (VIMAGE blanket)
non-vrtiualized sysctls so we cannot used one common function.
Add a macro to convert the arg1 in the virtualized case to
vnet.h to not expose the maths to all over the code.
Add a wrapper for the single virtualized call, properly handling
arg1 and call the default implementation from there.
Convert the two over places to use the new macro.
Reviewed by: rwatson
Approved by: re (kib)
nor destructors, as there's no actual work to do.
In most cases, the constructors weren't needed because of the existing
protocol initialization functions run by net_init_domain() as part of
VNET_MOD_NET, or they were eliminated when support for static
initialization of virtualized globals was added.
Garbage collect dependency references to modules without constructors or
destructors, notably VNET_MOD_INET and VNET_MOD_INET6.
Reviewed by: bz
Approved by: re (vimage blanket)
unused custom mutex/condvar-based sleep locks with two locks: an
rwlock (for non-sleeping use) and sxlock (for sleeping use). Either
acquired for read is sufficient to stabilize the vnet list, but both
must be acquired for write to modify the list.
Replace previous no-op read locking macros, used in various places
in the stack, with actual locking to prevent race conditions. Callers
must declare when they may perform unbounded sleeps or not when
selecting how to lock.
Refactor vnet sysinits so that the vnet list and locks are initialized
before kernel modules are linked, as the kernel linker will use them
for modules loaded by the boot loader.
Update various consumers of these KPIs based on whether they may sleep
or not.
Reviewed by: bz
Approved by: re (kib)
a valid zone ID or interface identifier in a v6 multicast leave, would
trigger a fairly paranoid KASSERT().
Observed with Boost++ regression tests on ref8.freebsd.org.
Approved by: re (kib)
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.
Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.
Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.
This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.
Bump __FreeBSD_version and update UPDATING.
Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)
to a non loopback/ppp link type) through the loopback interface. Prior
to the new L2/L3 rewrite, this host route was explicitly created when
processing the IPv6 address assignment. This loopback host route is
deleted when that IPv6 address is removed from the interface.
Reviewed by: bz, gnn
Approved by: re
in one additional case, avoiding an ifaddr reference leak.
Defer releasing the in6_ifaddr's in6_ifaddrhead reference until the
end of in6_unlink_ifa(), as callers are inconsistent regarding whether
or not they hold a reference across the call. This avoids using the
ifaddr after it may have been freed.
Reported by: tegge
Reviewed by: tegge
Approved by: re (blanket)
MFC after: 6 weeks