have chosen different (and more traditional) stateless/statuful
NAT64 as translation mechanism. Last non-trivial commits to both
faith(4) and faithd(8) happened more than 12 years ago, so I assume
it is time to drop RFC3142 in FreeBSD.
No objections from: net@
It isn't safe to keep unreferenced ifaddrs. Use in6ifa_ifwithaddr() to
determine ifaddr corresponding to destination address. Since currently
we keep addresses with embedded scope zone, in6ifa_ifwithaddr is called
with zero zoneid and marked with XXX.
Also remove route and lle lookups from ip6_input. Use in6ifa_ifwithaddr()
instead.
Sponsored by: Yandex LLC
Split it into two modules: if_gre(4) for GRE encapsulation and
if_me(4) for minimal encapsulation within IP.
gre(4) changes:
* convert to if_transmit;
* rework locking: protect access to softc with rmlock,
protect from concurrent ioctls with sx lock;
* correct interface accounting for outgoing datagramms (count only payload size);
* implement generic support for using IPv6 as delivery header;
* make implementation conform to the RFC 2784 and partially to RFC 2890;
* add support for GRE checksums - calculate for outgoing datagramms and check
for inconming datagramms;
* add support for sending sequence number in GRE header;
* remove support of cached routes. This fixes problem, when gre(4) doesn't
work at system startup. But this also removes support for having tunnels with
the same addresses for inner and outer header.
* deprecate support for various GREXXX ioctls, that doesn't used in FreeBSD.
Use our standard ioctls for tunnels.
me(4):
* implementation conform to RFC 2004;
* use if_transmit;
* use the same locking model as gre(4);
PR: 164475
Differential Revision: D1023
No objections from: net@
Relnotes: yes
Sponsored by: Yandex LLC
Some virtual if drivers has (ab)used ifa ifa_rtrequest hook to enforce
route MTU to be not bigger that interface MTU. While ifa_rtrequest hooking
might be an option in some situation, it is not feasible to do MTU checks
there: generic (or per-domain) routing code is perfectly capable of doing
this.
We currrently have 3 places where MTU is altered:
1) route addition.
In this case domain overrides radix _addroute callback (in[6]_addroute)
and all necessary checks/fixes are/can be done there.
2) route change (especially, GW change).
In this case, there are no explicit per-domain calls, but one can
override rte by setting ifa_rtrequest hook to domain handler
(inet6 does this).
3) ifconfig ifaceX mtu YYYY
In this case, we have no callbacks, but ip[6]_output performes runtime
checks and decreases rt_mtu if necessary.
Generally, the goals are to be able to handle all MTU changes in
control plane, not in runtime part, and properly deal with increased
interface MTU.
This commit changes the following:
* removes hooks setting MTU from drivers side
* adds proper per-doman MTU checks for case 1)
* adds generic MTU check for case 2)
* The latter is done by using new dom_ifmtu callback since
if_mtu denotes L3 interface MTU, e.g. maximum trasmitted _packet_ size.
However, IPv6 mtu might be different from if_mtu one (e.g. default 1280)
for some cases, so we need an abstract way to know maximum MTU size
for given interface and domain.
* moves rt_setmetrics() before MTU/ifa_rtrequest hooks since it copies
user-supplied data which must be checked.
* removes RT_LOCK_ASSERT() from other ifa_rtrequest hooks to be able to
use this functions on new non-inserted rte.
More changes will follow soon.
MFC after: 1 month
Sponsored by: Yandex LLC
According to IANA RPC uaddr registry, there are no AFs
except IPv4 and IPv6, so it's not worth being too abstract here.
Remove ne_rtable[AF_MAX+1] and use explicit per-AF radix tries.
Use own initialization without relying on domattach code.
While I admit that this was one of the rare places in kernel
networking code which really was capable of doing multi-AF
without any AF-depended code, it is not possible anymore to
rely on dom* code.
While here, change terrifying "Invalid radix node head, rn:" message,
to different non-understandable "netcred already exists for given addr/mask",
but less terrifying. Since we know that rn_addaddr() returns NULL if
the same record already exists, we should provide more friendly error.
MFC after: 1 month
When multicast capable interface goes away, it leaves multicast groups,
this leads to generate MLD reports, but MLD code does deffered send and
MLD reports are queued in the in6_multi's in6m_scq ifq. The problem is
that in6_multi structures are freed when interface leaves multicast groups
and thread that does deffered send will not take these queued packets.
PR: 194577
MFC after: 1 week
Sponsored by: Yandex LLC
o convert to if_transmit;
o use rmlock to protect access to gif_softc;
o use sx lock to protect from concurrent ioctls;
o remove a lot of unneeded and duplicated code;
o remove cached route support (it won't work with concurrent io);
o style fixes.
Reviewed by: melifaro
Obtained from: Yandex LLC
MFC after: 1 month
Sponsored by: Yandex LLC
data in an mbuf, use M_WRITABLE() instead of a direct test of M_EXT;
the latter both unnecessarily exposes mbuf-allocator internals in the
protocol stack and is also insufficient to catch all cases of
non-writability.
(NB: m_pullup() does not actually guarantee that a writable mbuf is
returned, so further refinement of all of these code paths continues to
be required.)
Reviewed by: bz
MFC after: 3 days
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D900
These are needed for the forthcoming vxlan implementation. The context
pointer means we do not have to use a spare pointer field in the inpcb,
and the source address is required to populate vxlan's forwarding table.
While I highly doubt there is an out of tree consumer of the UDP
tunneling callback, this change may be a difficult to eventually MFC.
Phabricator: https://reviews.freebsd.org/D383
Reviewed by: gnn
usage of a function computing the checksum only over a part of the function.
Therefore introduce in6_cksum_partial() and implement in6_cksum() based
on that.
While there, ensure that the UDPLite packet contains at least enough bytes
to contain the header.
Reviewed by: kevlo
MFC after: 3 days
that this means full checksum coverage for received packets.
If an application is willing to accept packets with partial
coverage, it is expected to use the socekt option and provice
the minimum coverage it accepts.
Reviewed by: kevlo
MFC after: 3 days
ifa_ifwithdstaddr. For the sake of backwards compatibility, the new
arguments were added to new functions named ifa_ifwithnet_fib and
ifa_ifwithdstaddr_fib, while the old functions became wrappers around the
new ones that passed RT_ALL_FIBS for the fib argument. However, the
backwards compatibility is not desired for FreeBSD 11, because there are
numerous other incompatible changes to the ifnet(9) API. We therefore
decided to remove it from head but leave it in place for stable/9 and
stable/10. In addition, this commit adds the fib argument to
ifa_ifwithbroadaddr for consistency's sake.
sys/sys/param.h
Increment __FreeBSD_version
sys/net/if.c
sys/net/if_var.h
sys/net/route.c
Add fibnum argument to ifa_ifwithbroadaddr, and remove the _fib
versions of ifa_ifwithdstaddr, ifa_ifwithnet, and ifa_ifwithroute.
sys/net/route.c
sys/net/rtsock.c
sys/netinet/in_pcb.c
sys/netinet/ip_options.c
sys/netinet/ip_output.c
sys/netinet6/nd6.c
Fixup calls of modified functions.
share/man/man9/ifnet.9
Document changed API.
CR: https://reviews.freebsd.org/D458
MFC after: Never
Sponsored by: Spectra Logic
Add in6ifa_ifwithaddr() function. It is similar to ifa_ifwithaddr,
but does fast lookup in the hash of inet6 addresses.
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
* new macro to remove magic number - IPV6_ADDR_SCOPES_COUNT;
* sa6_checkzone() - this function checks sockaddr_in6 structure
for correctness of sin6_scope_id. It also can fill correct
value sometimes.
* in6_getscopezone() - this function returns scope zone id for
specified interface and scope.
* in6_getlinkifnet() - this function returns struct ifnet for
corresponding zone id of link-local scope.
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
* use IN6_IS_ADDR_XXX() macro instead of hardcoded values;
* for multicast addresses just return scope value, the only exception
is addresses with 0x0F scope value (RFC 4291 p2.7.0);
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
* Return ENETDOWN when interface specified by ipi6_ifindex is not
enabled for IPv6 use.
* Return EADDRNOTAVAIL when ipi6_ifindex specifies an interface, but the
address ipi6_addr is not available for use on that interface.
* Return EINVAL when ipi6_addr is multicast address.
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
It affects the IPv6 source address selection algorithm (RFC 6724)
and allows override the last rule ("longest matching prefix") for
choosing among equivalent addresses. The address with `prefer_source'
will be preferred source address.
Obtained from: Yandex LLC
MFC after: 1 month
Sponsored by: Yandex LLC
This doesn't include the same kind of userland overriding that the IPv4
path has; nor does it yet know about 2-tuple versus 4-tuple hashing.
That'll come later.
Differential Revision: https://reviews.freebsd.org/D527
Reviewed by: grehan
eliminiates some warnings when building in userland.
Thanks to Patrick Laimbock for reporting this issue.
Remove also some unnecessary casts.
There should be no functional change.
MFC after: 1 week
handling ioctls. While here, remove duplicated checks for a NULL ifp in
in6_control(): this check is already done near the beginning of the
function.
PR: 189117
Reviewed by: hrs
MFC after: 2 weeks
PF_LINK, and multicast/broadcast flag should always be dropped because
the outer protocol uses unicast even when the inner address is not for
unicast. It had been broken since r236951 when gif_output() started to
use IFQ_HANDOFF().
These changes prevent sysctl(8) from returning proper output,
such as:
1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.
Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
ifa_ifwithnet() and ifa_ifwithdstaddr() The legacy functions will call the
_fib() versions with RT_ALL_FIBS, preserving legacy behavior.
sys/net/if_var.h
sys/net/if.c
Add legacy-compatible functions as described above. Ensure legacy
behavior when RT_ALL_FIBS is passed as fibnum.
sys/netinet/in_pcb.c
sys/netinet/ip_output.c
sys/netinet/ip_options.c
sys/net/route.c
sys/net/rtsock.c
sys/netinet6/nd6.c
Call with _fib() functions if we must use a specific fib, or the
legacy functions otherwise.
tests/sys/netinet/fibs_test.sh
tests/sys/netinet/udp_dontroute.c
Improve the udp_dontroute test. The bug that this test exercises is
that ifa_ifwithnet() will return the wrong address, if multiple
interfaces have addresses on the same subnet but with different
fibs. The previous version of the test only considered one possible
failure mode: that ifa_ifwithnet_fib() might fail to find any
suitable address at all. The new version also checks whether
ifa_ifwithnet_fib() finds the correct address by checking where the
ARP request goes.
Reported by: bz, hrs
Reviewed by: hrs
MFC after: 1 week
X-MFC-with: 264905
Sponsored by: Spectra Logic
For IPv6-in-IPv4, you may need to do the following command
on the tunnel interface if it is configured as IPv4 only:
ifconfig <interface> inet6 -ifdisabled
Code logic inspired from NetBSD.
PR: kern/169438
Submitted by: emeric.poupon@netasq.com
Reviewed by: fabient, ae
Obtained from: NETASQ
possible and do not clear IN6_IFF_TENTATIVE. If IFDISABLED was accidentally
set after a DAD started, TENTATIVE could be cleared because no NA was
received due to IFDISABLED, and as a result it could prevent DAD when
manually clearing IFDISABLED after that.
These two bugs are closely related. The root cause is that ifa_ifwithnet
does not consider FIBs when searching for an interface address.
sys/net/if_var.h
sys/net/if.c
Add a fib argument to ifa_ifwithnet and ifa_ifwithdstadddr. Those
functions will only return an address whose interface fib equals the
argument.
sys/net/route.c
Update calls to ifa_ifwithnet and ifa_ifwithdstaddr with fib
arguments.
sys/netinet/in.c
Update in_addprefix to consider the interface fib when adding
prefixes. This will prevent it from not adding a subnet route when
one already exists on a different fib.
sys/net/rtsock.c
sys/netinet/in_pcb.c
sys/netinet/ip_output.c
sys/netinet/ip_options.c
sys/netinet6/nd6.c
Add RT_DEFAULT_FIB arguments to ifa_ifwithdstaddr and ifa_ifwithnet.
In some cases it there wasn't a clear specific fib number to use.
In others, I was unable to test those functions so I chose
RT_DEFAULT_FIB to minimize divergence from current behavior. I will
fix some of the latter changes along with PR kern/187553.
tests/sys/netinet/fibs_test.sh
tests/sys/netinet/udp_dontroute.c
tests/sys/netinet/Makefile
Revert r263738. The udp_dontroute test was right all along.
However, bugs kern/187550 and kern/187553 cancelled each other out
when it came to this test. Because of kern/187553, ifa_ifwithnet
searched the default fib instead of the requested one, but because
of kern/187550, there was an applicable subnet route on the default
fib. The new test added in r263738 doesn't work right, however. I
can verify with dtrace that ifa_ifwithnet returned the wrong address
before I applied this commit, but route(8) miraculously found the
correct interface to use anyway. I don't know how.
Clear expected failure messages for kern/187550 and kern/187552.
PR: kern/187550
PR: kern/187552
Reviewed by: melifaro
MFC after: 3 weeks
Sponsored by: Spectra Logic
linking NIC Receive Side Scaling (RSS) to the network stack's
connection-group implementation. This prototype (and derived patches)
are in use at Juniper and several other FreeBSD-using companies, so
despite some reservations about its maturity, merge the patch to the
base tree so that it can be iteratively refined in collaboration rather
than maintained as a set of gradually diverging patch sets.
(1) Merge a software implementation of the Toeplitz hash specified in
RSS implemented by David Malone. This is used to allow suitable
pcbgroup placement of connections before the first packet is
received from the NIC. Software hashing is generally avoided,
however, due to high cost of the hash on general-purpose CPUs.
(2) In in_rss.c, maintain authoritative versions of RSS state intended
to be pushed to each NIC, including keying material, hash
algorithm/ configuration, and buckets. Provide software-facing
interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both
the RSS standardised Toeplitz and a 'naive' variation with a hash
efficient in software but with poor distribution properties.
Implement rss_m2cpuid()to be used by netisr and other load
balancing code to look up the CPU on which an mbuf should be
processed.
(3) In the Ethernet link layer, allow netisr distribution using RSS as
a source of policy as an alternative to source ordering; continue
to default to direct dispatch (i.e., don't try and requeue packets
for processing on the 'right' CPU if they arrive in a directly
dispatchable context).
(4) Allow RSS to control tuning of connection groups in order to align
groups with RSS buckets. If a packet arrives on a protocol using
connection groups, and contains a suitable hardware-generated
hash, use that hash value to select the connection group for pcb
lookup for both IPv4 and IPv6. If no hardware-generated Toeplitz
hash is available, we fall back on regular PCB lookup risking
contention rather than pay the cost of Toeplitz in software --
this is a less scalable but, at my last measurement, faster
approach. As core counts go up, we may want to revise this
strategy despite CPU overhead.
Where device drivers suitably configure NICs, and connection groups /
RSS are enabled, this should avoid both lock and line contention during
connection lookup for TCP. This commit does not modify any device
drivers to tune device RSS configuration to the global RSS
configuration; patches are in circulation to do this for at least
Chelsio T3 and Intel 1G/10G drivers. Currently, the KPI for device
drivers is not particularly robust, nor aware of more advanced features
such as runtime reconfiguration/rebalancing. This will hopefully prove
a useful starting point for refinement.
No MFC is scheduled as we will first want to nail down a more mature
and maintainable KPI/KBI for device drivers.
Sponsored by: Juniper Networks (original work)
Sponsored by: EMC/Isilon (patch update and merge)
the protocol specific mbuf flags are shared between them.
- Move all M_FOO definitions into a single place: netinet/in6.h, to
avoid future clashes.
- Resolve clash between M_DECRYPTED and M_SKIP_FIREWALL which resulted
in a failure of operation of IPSEC and packet filters.
Thanks to Nicolas and Georgios for all the hard work on bisecting,
testing and finally finding the root of the problem.
PR: kern/186755
PR: kern/185876
In collaboration with: Georgios Amanakis <gamanakis gmail.com>
In collaboration with: Nicolas DEFFAYET <nicolas-ml deffayet.com>
Sponsored by: Nginx, Inc.
- Use counter(9) for rt_pksent (former rt_rmx.rmx_pksent). This
removes another cache trashing ++ from packet forwarding path.
- Create zini/fini methods for the rtentry UMA zone. Via initialize
mutex and counter in them.
- Fix reporting of rmx_pksent to routing socket.
- Fix netstat(1) to report "Use" both in kvm(3) and sysctl(3) mode.
The change is mostly targeted for stable/10 merge. For head,
rt_pksent is expected to just disappear.
Discussed with: melifaro
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
were primarily used to size the sysctl name list macros that were removed
in r254295. A few other constants either did not have an associated
sysctl node, or the associated node used OID_AUTO instead.
PR: ports/184525 (exp-run)
tested and is unfinished. However, I've tested my version,
it works okay. As before it is unfinished: timeout aren't
driven by TCP session state. To enable the HASH_ALL mode,
one needs in kernel config:
options FLOWTABLE_HASH_ALL
o Reduce the alignment on flentry to 64 bytes. Without
the FLOWTABLE_HASH_ALL option, twice less memory would
be consumed by flows.
o API to ip_output()/ip6_output() got even more thin: 1 liner.
o Remove unused unions. Simply use fle->f_key[].
o Merge all IPv4 code into flowtable_lookup_ipv4(), and do same
flowtable_lookup_ipv6(). Stop copying data to on stack
sockaddr structures, simply use key[] on stack.
o Move code from flowtable_lookup_common() that actually works
on insertion into flowtable_insert().
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
Currently we have 3 usage patterns:
1) nd6_output (most traffic flow, no lle supplied, lle RLOCK sufficient)
2) corner cases for output (no lle, STALE lle, so on). lle WLOCK needed.
3) nd* iunternal machinery (WLOCK'ed lle provided, perform packet queing).
We separate case 1 and implement it inside its only customer - nd6_output.
This leads to some code duplication (especialy SEND stuff, which should be
hooked to output in a different way), but simplifies locking and control
flow logic fir nd6_output_lle.
Reviewed by: ae
MFC after: 3 weeks
Sponsored by: Yandex LLC
* Check ND6_IFF_IFDISABLED before acquiring any locks
* Assume m is always non-NULL
* remove 'bad' case not used anymore
* Simply if_output conditional
MFC after: 2 weeks
Sponsored by: Yandex LLC
- ip_output() and ip_output6() simply call flowtable_lookup(),
passing mbuf and address family. That's the only code under
#ifdef FLOWTABLE in the protocols code now.
o Revamp statistics gathering and export.
- Remove hand made pcpu stats, and utilize counter(9).
- Snapshot of statistics is available via 'netstat -rs'.
- All sysctls are moved into net.flowtable namespace, since
spreading them over net.inet isn't correct.
o Properly separate at compile time INET and INET6 parts.
o General cleanup.
- Remove chain of multiple flowtables. We simply have one for
IPv4 and one for IPv6.
- Flowtables are allocated in flowtable.c, symbols are static.
- With proper argument to SYSINIT() we no longer need flowtable_ready.
- Hash salt doesn't need to be per-VNET.
- Removed rudimentary debugging, which use quite useless in dtrace era.
The runtime behavior of flowtable shouldn't be changed by this commit.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
* Set ia address/mask values BEFORE attaching to address lists.
Inet6 address assignment is not atomic, so the simplest way to
do this atomically is to fill in ia before attach.
* Validate irfa->ia_addr field before use (we permit ANY sockaddr in old code).
* Do some renamings:
in6_ifinit -> in6_notify_ifa (interaction with other subsystems is here)
in6_setup_ifa -> in6_broadcast_ifa (LLE/Multicast/DaD code)
in6_ifaddloop -> nd6_add_ifa_lle
in6_ifremloop -> nd6_rem_ifa_lle
* Split working with LLE and route announce code for last two.
Add temporary in6_newaddrmsg() function to mimic current rtsock behaviour.
* Call device SIOCSIFADDR handler IFF we're adding first address.
In IPv4 we have to call it on every address change since ARP record
is installed by arp_ifinit() which is called by given handler.
IPv6 stack, on the opposite is responsible to call nd6_add_ifa_lle() so
there is no reason to call SIOCSIFADDR often.
included.. netstat uses -DKERNEL=1 to get these parts and breaks the
build w/o it...
melifaro@ says that ae@ is probably asleep, and the PR doesn't have
this part of the patch... Probably a local change got in by accident..
PR: 185148
Pointy hat to: ae@