Commit Graph

1799 Commits

Author SHA1 Message Date
ae
7cdbaef028 Fix the NULL pointer dereference for unresolved link layer entries in
the netinet6 code. Copy link layer address only when corresponding entry
has LLE_VALID flag.

PR:		210379
Approved by:	re (kib)
2016-06-22 11:29:21 +00:00
bz
7a1c0b1ad1 Get closer to a VIMAGE network stack teardown from top to bottom rather
than removing the network interfaces first. This change is rather larger
and convoluted as the ordering requirements cannot be separated.

Move the pfil(9) framework to SI_SUB_PROTO_PFIL, move Firewalls and
related modules to their own SI_SUB_PROTO_FIREWALL.
Move initialization of "physical" interfaces to SI_SUB_DRIVERS,
move virtual (cloned) interfaces to SI_SUB_PSEUDO.
Move Multicast to SI_SUB_PROTO_MC.

Re-work parts of multicast initialisation and teardown, not taking the
huge amount of memory into account if used as a module yet.

For interface teardown we try to do as many of them as we can on
SI_SUB_INIT_IF, but for some this makes no sense, e.g., when tunnelling
over a higher layer protocol such as IP. In that case the interface
has to go along (or before) the higher layer protocol is shutdown.

Kernel hhooks need to go last on teardown as they may be used at various
higher layers and we cannot remove them before we cleaned up the higher
layers.

For interface teardown there are multiple paths:
(a) a cloned interface is destroyed (inside a VIMAGE or in the base system),
(b) any interface is moved from a virtual network stack to a different
network stack ("vmove"), or (c) a virtual network stack is being shut down.
All code paths go through if_detach_internal() where we, depending on the
vmove flag or the vnet state, make a decision on how much to shut down;
in case we are destroying a VNET the individual protocol layers will
cleanup their own parts thus we cannot do so again for each interface as
we end up with, e.g., double-frees, destroying locks twice or acquiring
already destroyed locks.
When calling into protocol cleanups we equally have to tell them
whether they need to detach upper layer protocols ("ulp") or not
(e.g., in6_ifdetach()).

Provide or enahnce helper functions to do proper cleanup at a protocol
rather than at an interface level.

Approved by:		re (hrs)
Obtained from:		projects/vnet
Reviewed by:		gnn, jhb
Sponsored by:		The FreeBSD Foundation
MFC after:		2 weeks
Differential Revision:	https://reviews.freebsd.org/D6747
2016-06-21 13:48:49 +00:00
pfg
0f12e1993e Remove the SIOCSIFALIFETIME_IN6 ioctl.
The SIOCSIFALIFETIME_IN6 provided by the kame project is unused,
it can't really be used safely and has been completely removed from
NetBSD and OpenBSD.

Obtained from:	NetBSD (kern/35897)
PR:		210148 (exp-run)
Reviewed by:	ae, hrs
Relnotes:	yes
Approved by:	re (glebius)
Differential Revision:	https://reviews.freebsd.org/D5491
2016-06-13 22:31:16 +00:00
ae
48b268cd67 Cleanup unneded include "opt_ipfw.h".
It was used for conditional build IPFIREWALL_FORWARD support.
But IPFIREWALL_FORWARD option was removed a long time ago.
2016-06-09 05:48:34 +00:00
bz
5baf25edd0 Make KASSERT message more useful by printing the variables on which
we assert.

Obtained from:	projects/vnet
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2016-06-06 22:34:12 +00:00
bz
aaac6c5a13 Move the callout_reset() to the end of the work not having it stick
before we do anything.

Obtained from:	projects/vnet
MFC after:	2 week
Sponsored by:	The FreeBSD Foundation
2016-06-06 14:01:09 +00:00
bz
69cdb2137c Introduce a per-VNET flag to enable/disable netisr prcessing on that VNET.
Add accessor functions to toggle the state per VNET.
The base system (vnet0) will always enable itself with the normal
registration. We will share the registered protocol handlers in all
VNETs minimising duplication and management.
Upon disabling netisr processing for a VNET drain the netisr queue from
packets for that VNET.

Update netisr consumers to (de)register on a per-VNET start/teardown using
VNET_SYS(UN)INIT functionality.

The change should be transparent for non-VIMAGE kernels.

Reviewed by:	gnn (, hiren)
Obtained from:	projects/vnet
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D6691
2016-06-03 13:57:10 +00:00
gnn
d75e0c471e This change re-adds L2 caching for TCP and UDP, as originally added in D4306
but removed due to other changes in the system. Restore the llentry pointer
to the "struct route", and use it to cache the L2 lookup (ARP or ND6) as
appropriate.

Submitted by:	Mike Karels
Differential Revision:	https://reviews.freebsd.org/D6262
2016-06-02 17:51:29 +00:00
markj
8b17712ca6 Exploit r301213 to fix in6 ifaddr locking in pfxlist_onlink_check().
Reviewed by:	ae, hrs
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D6639
2016-06-02 17:21:57 +00:00
markj
71ff51c027 Always start IPv6 DAD asynchronously.
Otherwise we transmit the first neighbour solicitation in the context of the
caller of nd6_dad_start(), which can easily result in lock recursion. When
DAD is to be started after some delay, we send the first NS from the DAD
callout handler, so just change the implementation to do this in the
non-delayed case as well.

Reviewed by:	ae, hrs
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D6639
2016-06-02 17:17:15 +00:00
bz
fac944a70a The pr_destroy field does not allow us to run the teardown code in a
specific order.  VNET_SYSUNINITs however are doing exactly that.
Thus remove the VIMAGE conditional field from the domain(9) protosw
structure and replace it with VNET_SYSUNINITs.
This also allows us to change some order and to make the teardown functions
file local static.
Also convert divert(4) as it uses the same mechanism ip(4) and ip6(4) use
internally.

Slightly reshuffle the SI_SUB_* fields in kernel.h and add a new ones, e.g.,
for pfil consumers (firewalls), partially for this commit and for others
to come.

Reviewed by:		gnn, tuexen (sctp), jhb (kernel.h)
Obtained from:		projects/vnet
MFC after:		2 weeks
X-MFC:			do not remove pr_destroy
Sponsored by:		The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D6652
2016-06-01 10:14:04 +00:00
tuexen
525f16040f Add PR_CONNREQUIRED for SOCK_STREAM sockets using SCTP.
This is required to signal connetion setup on non-blocking sockets
via becoming writable. This still allows for implicit connection
setup.

MFC after:	1 week
2016-05-30 18:24:23 +00:00
glebius
bbfa6d2853 Plug route reference underleak that happens with FLOWTABLE after r297225.
Submitted by:	Mike Karels <mike karels.net>
2016-05-27 17:31:02 +00:00
markj
9cb221bdb0 Mark the prefix and default router list sysctl handlers MPSAFE.
MFC after:	2 weeks
2016-05-23 20:18:11 +00:00
markj
86c15ee95b Acquire the nd6 lock in the prefix list sysctl handler.
The nd6 lock will be used to synchronize access to the NDP prefix list.

MFC after:	2 weeks
Tested by:	Jason Wolfe (as part of a larger change)
2016-05-23 20:15:08 +00:00
ae
78161c3462 Remove ip6 adjusting from the place where pointer couldn't be changed.
And add comment after calling PFIL hooks, where it could be changed.
2016-05-20 12:17:40 +00:00
ae
eef5384953 Remove ip6 pointer initialization and strange check from the beginning
of ip6_output(). It isn't used until the first time adjusted.
Remove the comment about adjusting where it is actually initialized.
2016-05-20 12:09:10 +00:00
markj
a09fd6097b Move IPv6 malloc tag definitions into the IPv6 code. 2016-05-20 04:45:08 +00:00
ae
0412106b46 Since PFIL can change destination address, use its always actual value
from mbuf when calculating path mtu. Remove now unused finaldst variable.
Also constify dst argument in ip6_getpmtu() and ip6_getpmtu_ctl().

Reviewed by:	melifaro
Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2016-05-19 12:45:20 +00:00
ae
d1f53cbfea Call RO_RTFREE() when we have detected the change of destination
address, otherwise the old route will be used with new destination.

MFC after:	1 week
2016-05-17 14:06:55 +00:00
markj
a0640b8262 Use Node Information flag names instead of hard-coding their values.
MFC after:	1 week
2016-05-15 03:22:13 +00:00
markj
43beb421e5 Add sysctl descriptions for net.inet6.ip6 and net.inet6.icmp6.
icmp6.redirtimeout, icmp6.nd6_maxnudhint and ip6.rr_prune are left
undocumented as they appear to have no effect. Some existing sysctl
descriptions were modified for consistency and style, and the
ip6.tempvltime and ip6.temppltime handlers were rewritten to be a bit
simpler and to avoid setting the sysctl value before validating it.

MFC after:	3 weeks
2016-05-15 03:18:03 +00:00
markj
c5c6630f07 Remove an always-false error check in the AIFADDR_IN6 handler.
CID:		1250792
MFC after:	1 week
2016-05-15 03:01:40 +00:00
markj
4b9a93dfb3 Remove obsolescent comments from nd6_purge().
MFC after:	1 week
2016-05-09 23:43:12 +00:00
markj
94a1c25725 Clean up callers of nd6_prelist_add().
nd6_prelist_add() sets *newp if and only if it is successful, so there's no
need for code that handles the case where the return value is 0 and
*newp == NULL. Fix some style bugs in nd6_prelist_add() while here.

MFC after:	1 week
2016-05-07 03:41:29 +00:00
markj
557551b31f Remove two useless local variables from prelist_update().
MFC after:	1 week
2016-05-07 03:32:29 +00:00
pfg
d9c9113377 sys/net*: minor spelling fixes.
No functional change.
2016-05-03 18:05:43 +00:00
tuexen
a750782f5b When a client uses UDP encapsulation and lists IP addresses in the INIT
chunk, enable UDP encapsulation for all those addresses.
This helps clients using a userland stack to support multihoming if
they are not behind a NAT.

MFC after: 1 week
2016-05-01 21:48:55 +00:00
tuexen
3e7292aa0b Use correct order of source and destination address and port. 2016-04-29 20:13:35 +00:00
rrs
64e463c093 Complete the UDP tunneling of ICMP msgs to those protocols
interested in having tunneled UDP and finding out about the
ICMP (tested by Michael Tuexen with SCTP.. soon to be using
this feature).

Differential Revision:	http://reviews.freebsd.org/D5875
2016-04-28 15:53:10 +00:00
cem
23a478288f in_lltable_alloc and in6 copy: Don't leak LLE in error path
Fix a memory leak in error conditions introduced in r292978.

Reported by:	Coverity
CIDs:		1347009, 1347010
Sponsored by:	EMC / Isilon Storage Division
2016-04-26 23:13:48 +00:00
loos
cfc8d71705 Fixes the comment to reflect the code.
Sponsored by:	Rubicon Communications (Netgate)
2016-04-25 23:12:39 +00:00
pfg
32dcf3933a Indentation issues.
Contract some lines leftover from r298310.

Mea culpa.
2016-04-20 16:19:44 +00:00
pfg
a7d40a88c9 kernel: use our nitems() macro when it is available through param.h.
No functional change, only trivial cases are done in this sweep,

Discussed in:	freebsd-current
2016-04-19 23:48:27 +00:00
tuexen
f78898772a Address issues found by the XCode code analyzer. 2016-04-18 20:16:41 +00:00
tuexen
42159e8af3 Fix the ICMP6 handling for SCTP.
Keep the IPv4 code in sync.

MFC after:	1 week
2016-04-16 21:34:49 +00:00
pfg
12232f8463 sys/net* : for pointers replace 0 with NULL.
Mostly cosmetical, no functional change.

Found with devel/coccinelle.
2016-04-15 17:30:33 +00:00
ae
3f81fe2ce0 Fix regression introduced in r296986.
Currently we don't keep zoneid in in6_ifaddr structure, because there
is still some code, that doesn't properly initialize sin6_scope_id,
but some functions use sa_equal() for addresses comparison. sa_equal()
compares full sockaddr_in6 structures and such comparison will fail.
For now use zero zoneid in in6ifa_ifwithaddr(). It is safe, because
used address is in embedded form. In future we will use zoneid, so mark it
with XXX comment.

Reported by:	kp
Tested by:	kp
2016-04-08 11:13:24 +00:00
gnn
43026a8c5f Unbreak the RSS/PCBGROUp build. 2016-03-31 00:53:23 +00:00
markj
4081472216 Fix the lladdr copy in in6_lltable_dump_entry() after r292978.
This bug caused "ndp -a" to show the wrong link layer address for neighbour
cache entries.

PR:	208067
2016-03-30 00:03:59 +00:00
markj
db6b45eb6c Modify nd6_llinfo_timer() to acquire the nd6 lock before the LLE lock.
When expiring a neighbour cache entry we may need to look up the associated
default router, which requires the nd6 read lock. To avoid an LOR, the nd6
lock should be acquired first.

X-MFC-With:	r296063
Tested by:	Larry Rosenman <ler@lerctr.org> (previous revision)
2016-03-29 19:23:00 +00:00
gnn
c3d5404bbe FreeBSD previously provided route caching for TCP (and UDP). Re-add
route caching for TCP, with some improvements. In particular, invalidate
the route cache if a new route is added, which might be a better match.
The cache is automatically invalidated if the old route is deleted.

Submitted by:	Mike Karels
Reviewed by:	gnn
Differential Revision:	https://reviews.freebsd.org/D4306
2016-03-24 07:54:56 +00:00
bz
d4cf530887 Mfp4 @180378:
Factor out nd6 and in6_attach initialization to their own files.
  Also move destruction into those files though still called from
  the central initialization.

  Sponsored by:	CK Software GmbH
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Reviewed by:	gnn
Differential Revision:	https://reviews.freebsd.org/D5033
2016-03-22 15:43:47 +00:00
markj
8c873a2b1b Modify defrouter_remove() to perform the router lookup before removal.
This allows some simplification of its callers. No functional change
intended.

Tested by:	Larry Rosenman (as part of a larger change)
MFC after:	1 month
2016-03-17 19:01:44 +00:00
ae
2d399eaa6e Reduce the number of local variables. Remove redundant check that inp
pointer isn't NULL, it is safe, because we are handling IPV6_PKTINFO
socket option in this block of code. Also, use in6ifa_withaddr() instead
of ifa_withaddr().
2016-03-17 11:10:44 +00:00
ae
a3a4fe0039 Change in6_selectsrc() to allow usage of non-local IPv6 addresses in
IPV6_PKTINFO ancillary data when IPV6_BINDANY socket option is set.

Submitted by:	n_hibma
MFC after:	2 weeks
2016-03-17 10:59:30 +00:00
glebius
163857deb4 New way to manage reference counting of mbuf external storage.
The m_ext.ext_cnt pointer becomes a union. It can now hold the refcount
value itself. To tell that m_ext.ext_flags flag EXT_FLAG_EMBREF is used.
The first mbuf to attach a cluster stores the refcount. The further mbufs
to reference the cluster point at refcount in the first mbuf. The first
mbuf is freed only when the last reference is freed.

The benefit over refcounts stored in separate slabs is that now refcounts
of different, unrelated mbufs do not share a cache line.

For EXT_EXTREF mbufs the zone_ext_refcnt is no longer needed, and m_extadd()
becomes void, making widely used M_EXTADD macro safe.

For EXT_SFBUF mbufs the sf_ext_ref() is removed, which was an optimization
exactly against the cache aliasing problem with regular refcounting.

Discussed with:		rrs, rwatson, gnn, hiren, sbruno, np
Reviewed by:		rrs
Differential Revision:	https://reviews.freebsd.org/D5396
Sponsored by:		Netflix
2016-03-01 00:17:14 +00:00
markj
9524b1b571 Lock the NDP default router list and count defrouter references.
This addresses a number of race conditions that can cause crashes as a
result of unsynchronized access to the list.

PR:		206904
Tested by:	Larry Rosenman <ler@lerctr.org>,
		Kevin Bowling <kevin.bowling@kev009.com>
MFC after:	2 months
Differential Revision: https://reviews.freebsd.org/D5315
2016-02-25 20:12:05 +00:00
tuexen
63d9199ac6 Don't leak an address in an error path.
CID:		1351729
MFC after:	3 days
2016-02-23 18:50:34 +00:00
tuexen
3c0af23540 Fix reporting of mapped addressed in getpeername() and getsockname() for
IPv6 SCTP sockets.
This bugs were found because of an issue reported by PVS / D5245.
2016-02-18 21:05:04 +00:00
markj
2bf18fe55d Release the ref acquired in nd6_dad_find() if DAD is already in progress.
MFC after:	1 week
2016-02-18 00:00:51 +00:00
markj
bbc632b86e Use pfxrtr_del() instead of freeing advertising routers directly.
MFC after:	1 week
2016-02-17 23:55:24 +00:00
markj
7f716b8a59 Remove a prototype for the non-existent prelist_del().
MFC after:	1 week
2016-02-17 23:53:24 +00:00
glebius
c905b9e75c Ternary operator has lower priority than OR.
Found by:	PVS-Studio
2016-02-17 21:17:14 +00:00
markj
67a2e23340 Add a missing newline to a log message.
MFC after:	1 week
2016-02-12 21:17:00 +00:00
markj
9ec0a4438f Rename the flags field of struct nd_defrouter to "raflags".
This field contains the flags inherited from the corresponding router
advertisement message and is not for storing private state.

MFC after:	1 week
2016-02-12 21:15:57 +00:00
markj
e659a689f9 Simplify defrtrlist_update() slightly in preparation for future changes.
No functional change intended.

MFC after:	1 week
2016-02-12 21:06:48 +00:00
markj
151140ce44 Remove a bogus comment from nd6_na_input().
The splnet() call that it refers to has been removed, and a lock for the
default router list is in fact needed.

MFC after:	1 week
2016-02-12 21:01:53 +00:00
markj
88e4f89c6d Remove superfluous return statements from the neighbour discovery code.
MFC after:	1 week
2016-02-12 20:55:22 +00:00
markj
0309ab8781 Fix style around allocations from M_IP6NDP.
- Don't cast the return value of malloc(9).
- Use M_ZERO instead of explicitly calling bzero(9).

MFC after:	1 week
2016-02-12 20:52:53 +00:00
markj
7326876626 Remove some unreferenced NDP debug variable definitions.
MFC after:	1 week
2016-02-12 20:46:53 +00:00
dteske
51b30e8967 Merge SVN r295220 (bz) from projects/vnet/
Fix a panic that occurs when a vnet interface is unavailable at the time the
vnet jail referencing said interface is stopped.

Sponsored by:	FIS Global, Inc.
2016-02-11 17:07:19 +00:00
glebius
306a6faf84 These files were getting sys/malloc.h and vm/uma.h with header pollution
via sys/mbuf.h
2016-02-01 17:41:21 +00:00
melifaro
23582454c7 MFP r287070,r287073: split radix implementation and route table structure.
There are number of radix consumers in kernel land (pf,ipfw,nfs,route)
  with different requirements. In fact, first 3 don't have _any_ requirements
  and first 2 does not use radix locking. On the other hand, routing
  structure do have these requirements (rnh_gen, multipath, custom
  to-be-added control plane functions, different locking).
Additionally, radix should not known anything about its consumers internals.

So, radix code now uses tiny 'struct radix_head' structure along with
  internal 'struct radix_mask_head' instead of 'struct radix_node_head'.
  Existing consumers still uses the same 'struct radix_node_head' with
  slight modifications: they need to pass pointer to (embedded)
  'struct radix_head' to all radix callbacks.

Routing code now uses new 'struct rib_head' with different locking macro:
  RADIX_NODE_HEAD prefix was renamed to RIB_ (which stands for routing
  information base).

New net/route_var.h header was added to hold routing subsystem internal
  data. 'struct rib_head' was placed there. 'struct rtentry' will also
  be moved there soon.
2016-01-25 06:33:15 +00:00
melifaro
962b1b7134 Fix rte refcount leak in ip6_forward().
Reviewed by:	ae
MFC after:	2 weeks
Sponsored by:	Yandex LLC
2016-01-20 11:25:30 +00:00
glebius
51f55053b6 Verify the packet length in sctp6_input().
The sctp6_ctlinput() function does not properly check the length of the packet
it receives from the ICMP6 input routine. This means that an attacker can craft
a packet that will cause a kernel panic.

When the kernel receives an ICMP6 error message with one of the types/codes
it handles, it calls icmp6_notify_error() to deliver it to the upper-level
protocol. icmp6_notify_error() cycles through the extension headers (if any)
to find the protocol number of the first non-extension header. It does NOT
verify the length of the non-extension header.

It passes information about the packet (including the actual packet) to the
upper-level protocol's pr_ctlinput function. In the case of SCTP for IPv6,
icmp6_notify_error() calls sctp6_ctlinput().

sctp6_ctlinput() assumes that the incoming packet contains a sufficiently-long
SCTP header and calls m_copydata() to extract a copy of that header. In turn,
m_copydata() assumes that the caller has already verified that the offset and
length parameters are correct. If they are incorrect, it will dereference a
NULL pointer and cause a kernel panic.

In short, no one is sufficiently verifying the input, and the result is a
kernel panic.

Submitted by:	jtl
Security:	SA-16:01.sctp
2016-01-14 10:11:10 +00:00
melifaro
7cc47d54cd Bring RADIX_MPATH support to new routing KPI to ease migration.
Move actual rte selection process from rtalloc_mpath_fib()
  to the rt_path_selectrte() function. Add public
  rt_mpath_select() to use in fibX_lookup_ functions.
2016-01-11 08:45:28 +00:00
melifaro
21632a9bd9 Split in6_selectsrc() into in6_selectsrc_addr() and in6_selectsrc_socket().
in6_selectsrc() has 2 class of users: socket-based one (raw/udp/pcb/etc) and
  socket-less (ND code). The main reason for that change is inability to
  specify non-default FIB for callers w/o socket since (internally) inpcb
  is used to determine fib.

As as result, add 2 wrappers for in6_selectsrc() (making in6_selectsrc()
  static):
1) in6_selectsrc_socket() for the former class. Embed scope_ambiguous check
  along with returning hop limit when needed.
2) in6_selectsrc_addr() for the latter case. Add 'fibnum' argument and
  pass IPv6 address  w/ explicitly specified scope as separate argument.

Reviewed by:	ae (previous version)
2016-01-10 13:40:29 +00:00
melifaro
62e557108e Do not hold ifaddr reference for the whole icmp6_reflect() exec time.
Copy source address, calculate hlim and release refcount instead.
2016-01-10 11:59:55 +00:00
melifaro
fda8b5679f Remove prefix check from in6_addroute().
This check was added in initial? netinet6/ import
  back in 1999 (r53541).
It effectively became unnecessary after 'address/prefix clean-ups'
  KAME commit 90ff8792e676132096a440dd787f99a5a5860ee8 (github) in 2001
  (merged to FreeBSD in r78064) where prefix check was added to
  nd6_prefix_onlink(). Similar IPv4 check (in_addroute() was added
  in r137628).
Additionally, the right plance for this (or similar) check is the prefix
  addition code (nd6_prefix_onlink(), nd6_prefix_onlink_rtrequest(),
  in_addprefix() or rtinit()), but not the generic radix insert routine.
2016-01-09 11:41:37 +00:00
melifaro
14cf7637d1 Remove sys/eventhandler.h from net/route.h
Reviewed by:	ae
2016-01-09 09:34:39 +00:00
melifaro
4fe868c921 Finish r293098: make ip6_getpmtu() and ip6_getpmtu_ctl() use new routing API 2016-01-04 18:32:24 +00:00
melifaro
31d78f6810 Add rib_lookup_info() to provide API for retrieving individual route
entries data in unified format.

There are control plane functions that require information other than
  just next-hop data (e.g. individual rtentry fields like flags or
  prefix/mask). Given that the goal is to avoid rte reference/refcounting,
  re-use rt_addrinfo structure to store most rte fields. If caller wants
  to retrieve key/mask or gateway (which are sockaddrs and are allocated
  separately), it needs to provide sufficient-sized sockaddrs structures
  w/ ther pointers saved in passed rt_addrinfo.

Convert:
  * lltable new records checks (in_lltable_rtcheck(),
    nd6_is_new_addr_neighbor().
  * rtsock pre-add/change route check.
  * IPv6 NS ND-proxy check (RADIX_MPATH code was eliminated because
     1) we don't support RTF_ANNOUNCE ND-proxy for networks and there should
       not be multiple host routes for such hosts 2) if we have multiple
       routes we should inspect them (which is not done). 3) the entire idea
       of abusing KRT as storage for ND proxy seems odd. Userland programs
       should be used for that purpose).
2016-01-04 15:03:20 +00:00
melifaro
113d546f8e Remove 'struct route_int6' argument from in6_selectsrc() and
in6_selectif().

The main task of in6_selectsrc() is to return IPv6 SAS (along with
  output interface used for scope checks). No data-path code uses
  route argument for caching. The only users are icmp6 (reflect code),
  ND6 ns/na generation code. All this fucntions are control-plane, so
  there is no reason to try to 'optimize' something by passing cached
  route into to ip6_output(). Given that, simplify code by eliminating
  in6_selectsrc() 'struct route_in6' argument. Since in6_selectif() is
  used only by in6_selectsrc(), eliminate its 'struct route_in6' argument,
  too. While here, reshape rte-related code inside in6_selectif() to
  free lookup result immediately after saving all the needed fields.
2016-01-03 10:43:23 +00:00
melifaro
c0fd3127f0 Handle IPV6_PATHMTU option by spliting ip6_getpmtu_ctl() from ip6_getpmtu().
Add ro_mtu field to 'struct route' to be able to pass lookup MTU back to
  the caller.

Currently, ip6_getpmtu() has 2 totally different use cases:
1) control plane (IPV6_PATHMTU req), where we just need to calculate MTU
  and return it, w/o any reusability.
2) Actual ip6_output() data path where we (nearly) always use the provided
  route lookup data. If this data is not 'valid' we need to perform another
  lookup and save the result (which cannot be re-used by ip6_output()).

Given that, handle 1) by calling separate function doing rte lookup itself.
  Resulting MTU is calculated by (newly-added) ip6_calcmtu() used by both
  ip6_getpmtu_ctl() and ip6_getpmtu().
For 2) instead of storing ref'ed rte, store mtu (the only needed data
  from the lookup result) inside newly-added ro_mtu field.
  'struct route' was shrinked by 8(or 4 bytes) in r292978. Grow it again
  by 4 bytes. New ro_mtu field will be used in other places like
  ip/tcp_output (EMSGSIZE handling from output routines).

Reviewed by:	ae
2016-01-03 09:54:03 +00:00
melifaro
eceeaeddc0 Use lltable_get_ifp() instead of direct access to lltable fields. 2016-01-01 12:35:33 +00:00
melifaro
93152c67c9 Implement interface link header precomputation API.
Add if_requestencap() interface method which is capable of calculating
  various link headers for given interface. Right now there is support
  for INET/INET6/ARP llheader calculation (IFENCAP_LL type request).
  Other types are planned to support more complex calculation
  (L2 multipath lagg nexthops, tunnel encap nexthops, etc..).

Reshape 'struct route' to be able to pass additional data (with is length)
  to prepend to mbuf.

These two changes permits routing code to pass pre-calculated nexthop data
  (like L2 header for route w/gateway) down to the stack eliminating the
  need for other lookups. It also brings us closer to more complex scenarios
  like transparently handling MPLS nexthops and tunnel interfaces.
  Last, but not least, it removes layering violation introduced by flowtable
  code (ro_lle) and simplifies handling of existing if_output consumers.

ARP/ND changes:
Make arp/ndp stack pre-calculate link header upon installing/updating lle
  record. Interface link address change are handled by re-calculating
  headers for all lles based on if_lladdr event. After these changes,
  arpresolve()/nd6_resolve() returns full pre-calculated header for
  supported interfaces thus simplifying if_output().
Move these lookups to separate ether_resolve_addr() function which ether
  returs error or fully-prepared link header. Add <arp|nd6_>resolve_addr()
  compat versions to return link addresses instead of pre-calculated data.

BPF changes:
Raw bpf writes occupied _two_ cases: AF_UNSPEC and pseudo_AF_HDRCMPLT.
Despite the naming, both of there have ther header "complete". The only
  difference is that interface source mac has to be filled by OS for
  AF_UNSPEC (controlled via BIOCGHDRCMPLT). This logic has to stay inside
  BPF and not pollute if_output() routines. Convert BPF to pass prepend data
  via new 'struct route' mechanism. Note that it does not change
  non-optimized if_output(): ro_prepend handling is purely optional.
Side note: hackish pseudo_AF_HDRCMPLT is supported for ethernet and FDDI.
  It is not needed for ethernet anymore. The only remaining FDDI user is
  dev/pdq mostly untouched since 2007. FDDI support was eliminated from
  OpenBSD in 2013 (sys/net/if_fddisubr.c rev 1.65).

Flowtable changes:
  Flowtable violates layering by saving (and not correctly managing)
  rtes/lles. Instead of passing lle pointer, pass pointer to pre-calculated
  header data from that lle.

Differential Revision:	https://reviews.freebsd.org/D4102
2015-12-31 05:03:27 +00:00
jtl
3504bdc019 Add the appropriate case statement for IPV6_BINDMULTI so the option can be
retrieved with getsockopt().

CID:	1229928
Differential Revision:	https://reviews.freebsd.org/D4737
Reviewed by:	adrian
Sponsored by:	Juniper Networks
2015-12-30 18:08:05 +00:00
bz
be33cc799d This code is not in modules that need KPI stability so no need to use
the wrapper functions as used in r252511.  We can directly use the
locking macros.

Reviewed by:		jtl, rwatson
MFC after:		2 weeks
Differential Revision:	https://reviews.freebsd.org/D4731
2015-12-30 17:10:03 +00:00
wollman
fc57e5b82d in6_if2idlen: treat bridge(4) interfaces like other Ethernet interfaces
bridge(4) interfaces have an if_type of IFT_BRIDGE, rather than
IFT_ETHER, even though they only support Ethernet-style links.  This
caused in6_if2idlen to emit an "unknown link type (209)" warning to
the console every time it was called.  Add IFT_BRIDGE to the case
statement in the appropriate place, indicating that it uses the same
IPv6 address format as other Ethernet-like interfaces.

MFC after:	1 week
2015-12-28 18:29:47 +00:00
bz
bd04375ea4 Remove superfluous return (1) missed in r292601.
Reported by:	Matthew D. Fuller (fullermd over-yonder.net),
		Kevin Bowling (kevin.bowling kev009.com)
MFC after:	13 days
X-MFC with:	r292601
Sponsored by:	The FreeBSD Foundation
2015-12-23 10:23:47 +00:00
bz
982684552b Since r256624 we've been leaking routing table allocations
on vnet enabled jail shutdown. Call the provided cleanup
routines for IP versions 4 and 6 to plug these leaks.

Sponsored by:		The FreeBSD Foundation
MFC atfer:		2 weeks
Reviewed by:		gnn
Differential Revision:	https://reviews.freebsd.org/D4530
2015-12-22 14:53:19 +00:00
smh
45d5617154 Revert r292275 & r292379
glebius has concerns about these changes so reverting those can be discussed
and addressed.

Sponsored by:	Multiplay
2015-12-17 14:41:30 +00:00
smh
0813e34dbc Fix issues introduced by r292275
* Fix panic for etherswitches which don't have a LLADDR.
* Disabled DELAY in unsolicited NDA, which needs further work.
* Fixed missing DELAY in carp_send_na.
* style(9) fix.

Reported by:	kp & melifaro
X-MFC-With:	r292275
MFC after:	1 month
Sponsored by:	Multiplay
2015-12-16 22:26:28 +00:00
melifaro
fb12a509fe Provide additional lle data in IPv6 lltable dump used by ndp(8).
Before the change, things like lle state were queried via
  SIOCGNBRINFO_IN6 by ndp(8) for _each_ lle entry in dump.
This ioctl was added in 1999, probably to avoid touching rtsock code.

This change maps SIOCGNBRINFO_IN6 data to standard rtsock dump the
 following way:
  expire (already) maps to rtm_rmx.rmx_expire
  isrouter -> rtm_flags & RTF_GATEWAY
  asked -> rtm_rmx.rmx_pksent
  state -> rtm_rmx.rmx_state (maps to rmx_weight via define)

Reviewed by:	ae
2015-12-16 10:14:16 +00:00
smh
864cf18128 Fix lagg failover due to missing notifications
When using lagg failover mode neither Gratuitous ARP (IPv4) or Unsolicited
Neighbour Advertisements (IPv6) are sent to notify other nodes that the
address may have moved.

This results is slow failover, dropped packets and network outages for the
lagg interface when the primary link goes down.

We now use the new if_link_state_change_cond with the force param set to
allow lagg to force through link state changes and hence fire a
ifnet_link_event which are now monitored by rip and nd6.

Upon receiving these events each protocol trigger the relevant
notifications:
* inet4 => Gratuitous ARP
* inet6 => Unsolicited Neighbour Announce

This also fixes the carp IPv6 NA's that stopped working after r251584 which
added the ipv6_route__llma route.

The new behavour can be controlled using the sysctls:
* net.link.ether.inet.arp_on_link
* net.inet6.icmp6.nd6_on_link

Also removed unused param from lagg_port_state and added descriptions for the
sysctls while here.

PR:		156226
MFC after:	1 month
Sponsored by:	Multiplay
Differential Revision:	https://reviews.freebsd.org/D4111
2015-12-15 16:02:11 +00:00
kp
30ed5ce1ff inet6: Do not assume every interface has ip6 enabled.
Certain interfaces (e.g. pfsync0) do not have ip6 addresses (in other words,
ifp->if_afdata[AF_INET6] is NULL). Ensure we don't panic when the MTU is
updated.

pfsync interfaces will never have ip6 support, because it's explicitly disabled
in in6_domifattach().

PR:		205194
Reviewed by:	melifaro, hrs
Differential Revision:	https://reviews.freebsd.org/D4522
2015-12-14 19:44:49 +00:00
melifaro
5acc305ae0 Remove LLE read lock from IPv6 fast path.
LLE structure is mostly unchanged during its lifecycle: there are only 2
things relevant for fast path lookup code:
1) link-level address change. Since r286722, these updates are performed
  under AFDATA WLOCK.
2) Some sort of feedback indicating that this particular entry is used so
  we send NS to perform reachability verification instead of expiring entry.
  The only signal that is needed from fast path is something like binary
  yes/no.
The latter is solved by the following changes:

Special r_skip_req (introduced in D3688) value is used for fast path feedback.
  It is read lockless by fast path, but updated under req_mutex mutex. If this
  field is non-zero, then fast path will acquire lock and set it back to 0.

After transitioning to STALE state, callout timer is armed to run each
  V_nd6_delay seconds to make sure that if packet was transmitted at the start
  of given interval, we would be able to switch to PROBE state in V_nd6_delay
  seconds as user expects.
(in STALE state) timer is rescheduled until original V_nd6_gctimer expires
  keeping lle in STALE state (remaining timer value stored in lle_remtime).
(in STALE state) timer is rescheduled if packet was transmitted less that
  V_nd6_delay seconds ago to make sure we transition to PROBE state exactly
  after V_n6_delay seconds.

As a result, all packets towards lle in REACHABLE/STALE/PROBE states are handled
  by fast path without acquiring lle read lock.

Differential Revision:		https://reviews.freebsd.org/D3780
2015-12-13 07:39:49 +00:00
melifaro
802824b70a Use correct lookup key for gif route lookups.
This fixes r291993 change.
2015-12-09 22:09:33 +00:00
melifaro
2bb0e924cc Make in_arpinput(), inp_lookup_mcast_ifp(), icmp_reflect(),
ip_dooptions(), icmp6_redirect_input(), in6_lltable_rtcheck(),
  in6p_lookup_mcast_ifp() and in6_selecthlim() use new routing api.

Eliminate now-unused ip_rtaddr().
Fix lookup key fib6_lookup_nh_basic() which was lost diring merge.
Make fib6_lookup_nh_basic() and fib6_lookup_nh_extended() always
  return IPv6 destination address with embedded scope. Currently
  rw_gateway has it scope embedded, do the same for non-gatewayed
  destinations.

Sponsored by:	Yandex LLC
2015-12-09 11:14:27 +00:00
melifaro
ca13483a3c Merge helper fib* functions used for basic lookups.
Vast majority of rtalloc(9) users require only basic info from
route table (e.g. "does the rtentry interface match with the interface
  I have?". "what is the MTU?", "Give me the IPv4 source address to use",
  etc..).
Instead of hand-rolling lookups, checking if rtentry is up, valid,
  dealing with IPv6 mtu, finding "address" ifp (almost never done right),
  provide easy-to-use API hiding all the complexity and returning the
  needed info into small on-stack structure.

This change also helps hiding route subsystem internals (locking, direct
  rtentry accesses).
Additionaly, using this API improves lookup performance since rtentry is not
  locked.
(This is safe, since all the rtentry changes happens under both radix WLOCK
  and rtentry WLOCK).

Sponsored by:	Yandex LLC
2015-12-08 10:50:03 +00:00
tuexen
23770ab942 Fix the allocation of outgoing streams:
* When processing a cookie, use the number of
  streams announced in the INIT-ACK.
* When sending an INIT-ACK for an existing
  association, use the value from the association,
  not from the end-point.

MFC after:	1 week
2015-12-06 16:17:57 +00:00
ae
e9c29fe907 mld_v2_dispatch_general_query() is used by mld_fasttimo_vnet() to send
a reply to the MLDv2 General Query. In case when router has a lot of
multicast groups, the reply can take several packets due to MTU limitation.
Also we have a limit MLD_MAX_RESPONSE_BURST == 4, that limits the number
of packets we send in one shot. Then we recalculate the timer value and
schedule the remaining packets for sending.
The problem is that when we call mld_v2_dispatch_general_query() to send
remaining packets, we queue new reply in the same mbuf queue. And when
number of packets is bigger than MLD_MAX_RESPONSE_BURST, we get endless
reply of MLDv2 reports.
To fix this, add the check for remaining packets in the queue.

PR:		204831
MFC after:	1 week
Sponsored by:	Yandex LLC
2015-12-01 11:17:41 +00:00
melifaro
e198456483 Add new rt_foreach_fib_walk_del() function for deleting route entries
by filter function instead of picking into routing table details in
  each consumer.
Remove now-unused rt_expunge() (eliminating last external RTF_RNH_LOCKED
 user).
This simplifies future nexthops/mulitipath changes and rtrequest1_fib()
  locking refactoring.

Actual changes:
Add "rt_chain" field to permit rte grouping while doing batched delete
  from routing table (thus growing rte 200->208 on amd64).
Add "rti_filter" /  "rti_filterdata" / "rti_spare" fields to rt_addrinfo
  to pass filter function to various routing subsystems in standard way.
Convert all rt_expunge() customers to new rt_addinfo-based api and eliminate
  rt_expunge().
2015-11-30 05:51:14 +00:00
ae
d81208c948 Overhaul if_enc(4) and make it loadable in run-time.
Use hhook(9) framework to achieve ability of loading and unloading
if_enc(4) kernel module. INET and INET6 code on initialization registers
two helper hooks points in the kernel. if_enc(4) module uses these helper
hook points and registers its hooks. IPSEC code uses these hhook points
to call helper hooks implemented in if_enc(4).
2015-11-25 07:31:59 +00:00
cem
8922a29059 in6_mc_get: Fix recursion on if_addr_lock on malloc failure
Analogously to r291040, in6_mc_get recurses on if_addr_lock if the
M_NOWAIT allocation fails.  The fix is the same.

Suggested by:	Andrey V. Elsukov
Reviewed by:	jhb (ip4 version)
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D4138 (ip4 version)
2015-11-19 00:27:26 +00:00
melifaro
2bf2184989 Bring back the ability of passing cached route via nd6_output_ifp(). 2015-11-15 16:02:22 +00:00
rrs
dc494194a2 This fixes several places where callout_stops return is examined. The
new return codes of -1 were mistakenly being considered "true". Callout_stop
now returns -1 to indicate the callout had either already completed or
was not running and 0 to indicate it could not be stopped.  Also update
the manual page to make it more consistent no non-zero in the callout_stop
or callout_reset descriptions.

MFC after:	1 Month with associated callout change.
2015-11-13 22:51:35 +00:00
melifaro
595bcb4ce1 Unify setting lladdr for AF_INET[6]. 2015-11-07 11:12:00 +00:00
adrian
7ba24ae636 [netinet6]: Create a new IPv6 netisr which expects the frames to have been verified.
This is required for fragments and encapsulated data (eg tunneling) to be redistributed
to the RSS bucket based on the eventual IPv6 header and protocol (TCP, UDP, etc) header.

* Add an mbuf tag with the state of IPv6 options parsing before the frame is queued
  into the direct dispatch handler;
* Continue processing and complete the frame reception in the correct RSS bucket /
  netisr context.

Testing results are in the phabricator review.

Differential Revision:	https://reviews.freebsd.org/D3563
Submitted by:	Tiwei Bie <btw@mail.ustc.edu.cn>
2015-11-06 23:07:43 +00:00
melifaro
4ea35c2935 Use m_cat() to reassembly IPv6 packets.
Submitted by:	jonloony_gmail.com
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D3863
2015-10-27 22:11:09 +00:00
melifaro
e2681931e4 Invoke lle_event for new entry iff it has lladdr set. 2015-10-04 19:10:27 +00:00
melifaro
7867205355 Simplify if (lladdr) condition in nd6_cache_lladdr():
For case (7) (new entry) nothing has to be done except lle_event.
  Invoke this event directly from "create new lle" code block.
  For case (4) (existing entry, same mac) useless mac update was performed,
  along with LLENTRY_RESOLVED lle_event. There was no sense in doing that,
  since nothing really had changed. Simply avoid this condition instead.
  Given that, condition was simplified to (3),(5) states which can be merged
  with previous block.
2015-10-04 12:42:07 +00:00
melifaro
0dee60f8c8 Eliminate nd6_llinfo_settimer(). All consumers were converted to
use nd6_llinfo_settimer_locked() in r216022.
Make nd6_llinfo_settimer_locked() static: last external consumer was
converted in r288124.
2015-10-04 08:33:16 +00:00
melifaro
02d9938404 Add __noinline attribute to several functions to ease dtrace instrumentation 2015-10-04 08:21:15 +00:00
melifaro
15bc65f144 Fix condition for nd6_llinfo_getholdsrc() introduced in r287484.
Effectively it always returned NULL so SAS was always performed and
  sometimes the result might have been different.

Fix state machine change accidentally introduced in r287985:
  state (4) inside nd6_cache_lladdr() (existing entry got nd message
  with the same lladdress) started to cause lle state transition to STALE
  instead of no-action.
2015-10-04 07:02:17 +00:00
hrs
defd45464b - Schedule DAD for IN6_IFF_TENTATIVE addresses in nd6_timer(). This
catches cases that DAD probes cannot be sent because of
  IFF_UP && !IFF_DRV_RUNNING.

- nd6_dad_starttimer() now calls nd6_dad_ns_output(), instead of
  calling it before nd6_dad_starttimer().

- Do not release an entry in dadq when a duplicate entry is being
  added.
2015-10-03 12:09:12 +00:00
ae
c3f8d46dc4 Take extra reference to security policy before calling crypto_dispatch().
Currently we perform crypto requests for IPSEC synchronous for most of
crypto providers (software, aesni) and only VIA padlock calls crypto
callback asynchronous. In synchronous mode it is possible, that security
policy will be removed during the processing crypto request. And crypto
callback will release the last reference to SP. Then upon return into
ipsec[46]_process_packet() IPSECREQUEST_UNLOCK() will be called to already
freed request. To prevent this we will take extra reference to SP.

PR:		201876
Sponsored by:	Yandex LLC
2015-09-30 08:16:33 +00:00
melifaro
91b3356875 Eliminate nd6_nud_hint() and its TCP bindings.
Initially function was introduced in r53541 (KAME initial commit) to
  "provide hints from upper layer protocols that indicate a connection
  is making "forward progress"" (quote from RFC 2461 7.3.1 Reachability
  Confirmation).
However, it was converted to do nothing (e.g. just return) in r122922
  (tcp_hostcache implementation) back in 2003. Some defines were moved
  to tcp_var.h in r169541. Then, it was broken (for non-corner cases)
  by r186119 (L2<>L3 split) in 2008 (NULL ifp in nd6_lookup). So,
  right now this code is broken and has no "real" base users.

Differential Revision:	https://reviews.freebsd.org/D3699
2015-09-27 05:29:34 +00:00
melifaro
4fed811000 rtsock requests for deleting interface address lles started to return EPERM
instead of old "ignore-and-return 0" in r287789. This broke arp -da /
  ndp -cn behavior (they exit on rtsock command failure). Fix this by
  translating LLE_IFADDR to RTM_PINNED flag, passing it to userland and
  making arp/ndp ignore these entries in batched delete.

MFC after:	2 weeks
2015-09-27 04:54:29 +00:00
melifaro
88b54de46b Use standard lle LLE_EXCLUSIVE request flags instead of
its redefined version.
2015-09-22 20:45:04 +00:00
bz
34885c7609 Compare mbuf pointer to NULL rather than to 0.
No functional change.

MFC after:	2 weeks
2015-09-21 12:53:26 +00:00
bz
3e1ba7cafd In the UDP over IPv6 implementation several cases are using the wrong protocol,
e.g., based on wrong "next header" assumptions (which does not have to point to
the upper layer protocol), or using hard-coded UDP instead of UDP or UDP-Lite
possibly switching protocols.  Fix those cases for UDP-Lite to work correctly.

PR:			202788
Submitted by:		Tiwei Bie (btw mail.ustc.edu.cn) [parts]
Reviewed by:		gnn, Tiwei Bie (btw mail.ustc.edu.cn),
			kevlo (earlier version)
MFC after:		2 weeks
Differential Revision:	https://reviews.freebsd.org/D3686
2015-09-21 12:32:36 +00:00
melifaro
536c20752f Unify nd6 state switching by using newly-created nd6_llinfo_setstate()
function. The change is mostly mechanical with the following exception:
Last piece of nd6_resolve_slow() was refactored: ND6_LLINFO_PERMANENT
  condition was removed as always-true, explicit ND6_LLINFO_NOSTATE ->
  ND6_LLINFO_INCOMPLETE state transition was removed as duplicate.

Reviewed by:	ae
Sponsored by:	Yandex LLC
2015-09-21 11:19:53 +00:00
melifaro
009169304f Add "stale" timer back to nd6_cache_lladdr().
Setting timer was accidentally removed in r276844 due to misleading
  comment on its meaningless. Add it back to restore proper behaviour.
2015-09-21 10:24:34 +00:00
melifaro
d6bd5ed2ba Cleanup nd6_cache_lladdr(). No functional changes.
* Since new extries are now allocated explicitly, fill in
  all the necessary fields for lle _before_ attaching it to the table.
* Remove ND6_LLINFO_INCOMPLETE check which was unused even in
  first KAME merge (r53541).
* After that, the only new state that function can set, was
  ND6_LLINFO_STALE. Given everything above, simplify logic besides
  do_update and is_newentry.
* Fix nd_resolve() comment.
2015-09-19 11:50:02 +00:00
melifaro
8da907f42b * Simplify logic besides llchange variable.
* Refresh nd6_is_router() comment.
2015-09-18 07:18:10 +00:00
melifaro
493325342d Simplify the way of attaching IPv6 link-layer header.
Problem description:
How do we currently perform layer 2 resolution and header imposition:

For IPv4 we have the following chain:
  ip_output() -> (ether|atm|whatever)_output() -> arpresolve()

Lookup is done in proper place (link-layer output routine) and it is possible
  to provide cached lle data.

For IPv6 situation is more complex:
  ip6_output() -> nd6_output() -> nd6_output_ifp() -> (whatever)_output() ->
    nd6_storelladdr()

We have ip6_ouput() which calls nd6_output() instead of link output routine.
nd6_output() does the following:
  * checks if lle exists, creates it if needed (similar to arpresolve())
  * performes lle state transitions (similar to arpresolve())
  * calls nd6_output_ifp() which pushes packets to link output routine along
    with running SeND/MAC hooks regardless of lle state
    (e.g. works as run-hooks placeholder).

After that, iface output routine like ether_output() calls nd6_storelladdr()
  which performs lle lookup once again.

As a result, we perform lookup twice for each outgoing packet for most types
  of interfaces. We also need to maintain runtime-checked table of 'nd6-free'
  interfaces (see nd6_need_cache()).

Fix this behavior by eliminating first ND lookup. To be more specific:
  * make all nd6_output() consumers use nd6_output_ifp() instead
  * rename nd6_output[_slow]() to nd6_resolve_[slow]()
  * convert nd6_resolve() and nd6_resolve_slow() to arpresolve() semantics,
    e.g. copy L2 address to buffer instead of pushing packet towards lower
    layers
  * Make all nd6_storelladdr() users use nd6_resolve()
  * eliminate nd6_storelladdr()

The resulting callchain is the following:
  ip6_output() -> nd6_output_ifp() -> (whatever)_output() -> nd6_resolve()

Error handling:
Currently sending packet to non-existing la results in ip6_<output|forward>
  -> nd6_output() -> nd6_output _lle() which returns 0.
In new scenario packet is propagated to <ether|whatever>_output() ->
  nd6_resolve() which will return EWOULDBLOCK, and that result
  will be converted to 0.

(And EWOULDBLOCK is actually used by IB/TOE code).

Sponsored by:		Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D1469
2015-09-16 14:26:28 +00:00
melifaro
daa0d6c584 Constantify lookup key in several nd6_* functions. 2015-09-16 11:06:07 +00:00
melifaro
f6e576f283 Simplify nd6_cache_lladdr:
* Move isRouter calculation code to separate nd6_is_router() function.
* Make nd6_cache_lladdr() return void: its return value hasn't been used
  since r53541 KAME import in 1999.

Sponsored by:	Yandex LLC
2015-09-15 17:16:31 +00:00
melifaro
b42c7af3b5 * Require explicitl lle unlink prior to calling llentry_delete().
This one slightly decreases time of holding afdata wlock.
* While here, make nd6_free() return void. No one has used its return value
  since r186119.
2015-09-15 06:48:19 +00:00
vangyzen
75d72d4482 Fix the handling of IPv6 On-Link Redirects.
On receipt of a redirect message, install an interface route for the
redirected destination.  On removal of the corresponding Neighbor Cache
entry, remove the interface route.

This requires changes in rtredirect_fib() to cope with an AF_LINK
address for the gateway and with the absence of RTF_GATEWAY.

This fixes the "Redirected On-Link" test cases in the Tahi IPv6 Ready Logo
Phase 2 test suite.

Unrelated to the above, fix a recursion on the radix node head lock
triggered by the Tahi Redirected to Alternate Router test cases.

When I first wrote this patch in October 2012, all Section 2
(Neighbor Discovery) test cases passed on 10-CURRENT, 9-STABLE,
and 8-STABLE.  cem@ recently rebased the 10.x patch onto head and reported
that it passes Tahi.  (Thanks!)

These other test cases also passed in 2012:

* the RTF_MODIFIED case, with IPv4 and IPv6 (using a
  RTF_HOST|RTF_GATEWAY route for the destination)

* the redirected-to-self case, with IPv4 and IPv6

* a valid IPv4 redirect

All testing in 2012 was done with WITNESS and INVARIANTS.

Tested by:    EMC / Isilon Storage Division via Conrad Meyer (cem) in 2015,
              Mark Kelley <mark_kelley@dell.com> in 2012,
              TC Telkamp <terence_telkamp@dell.com> in 2012
PR:           152791
Reviewed by:  melifaro (current rev), bz (earlier rev)
Approved by:  kib (mentor)
MFC after:    1 month
Relnotes:     yes
Sponsored by: Dell Inc.
Differential Revision: https://reviews.freebsd.org/D3602
2015-09-14 19:17:25 +00:00
melifaro
5ad1f2444d * Do more fine-grained locking: call eventhandlers/free_entry
without holding afdata wlock
* convert per-af delete_address callback to global lltable_delete_entry() and
  more low-level "delete this lle" per-af callback
* fix some bugs/inconsistencies in IPv4/IPv6 ifscrub procedures

Sponsored by:		Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D3573
2015-09-14 16:48:19 +00:00
hrs
0529f0c80f Remove SIOCGDRLST_IN6 and SIOCGPRLST_IN6 forgotten in the previous commit.
MFC after:	3 days
2015-09-10 08:37:03 +00:00
hrs
e16dfdb9ef - Remove SIOCGDRLST_IN6 and SIOCGPRLST_IN6. These are quite old APIs and
there is no consumer now.

MFC after:	3 days
2015-09-10 06:31:24 +00:00
hrs
e5a6c91e16 - Remove SIOCGDRLST_IN6 and SIOCGPRLST_IN6. These are quite old APIs and
there is no consumer now.

- Simplify first and duplicate LLA check.

MFC after:	3 days
2015-09-10 06:29:18 +00:00
hrs
e161347f7d Do not add IN6_IFF_TENTATIVE when ND6_IFF_NO_DAD.
MFC after:	3 days
2015-09-10 06:10:30 +00:00
hrs
e5cef93a39 Remove IN6_IFF_NOPFX. This flag was no longer used.
MFC after:	3 days
2015-09-10 06:08:42 +00:00
adrian
43407a0ac4 Add support for receiving flowtype, flowid and RSS bucket information as part of recvmsg().
Submitted by:	Tiwei Bie <btw@mail.ustc.edu.cn>
Differential Revision:	https://reviews.freebsd.org/D3562
2015-09-06 20:57:57 +00:00
melifaro
e31cc5ffc6 Do not pass lle to nd6_ns_output(). Use newly-added
nd6_llinfo_get_holdsrc() to extract desired IPv6 source
  from holdchain and pass it to the nd6_ns_output().
2015-09-05 14:14:03 +00:00
melifaro
d013cff635 Do not skip entries without LLE_VALID flag.
This one fixes showing incomplete entries in ndp -an.

MFC after:	2 weeks
2015-09-05 06:24:00 +00:00
melifaro
3e1524c83e Make in6ifa_ifpwithaddr() take const param.
Remove unneded DECONST from in6_lltable_rtcheck().
2015-09-05 05:54:09 +00:00
melifaro
3f9699bcfe Simplify lla_rt_output()/nd6_add_ifa_lle() by setting lle state in
alloc handler, based on flags.
2015-08-31 05:03:36 +00:00
adrian
a3c3341951 Implement RSS hashing/re-hashing for IPv6 ingress packets.
This mirrors the basic IPv4 implementation - IPv6 packets under RSS
now are checked for a correct RSS hash and if one isn't provided,
it's done in software.

This only handles the initial receive - it doesn't yet handle
reinjecting / rehashing packets after being decapsulated from
various tunneling setups.  That'll come in some follow-up work.

For non-RSS users, this is almost a giant no-op.

It does change a couple of ipv6 methods to use const mbuf * instead of
mbuf * but it doesn't have any functional changes.

So, the following now occurs:

* If the NIC doesn't do any RSS hashing, it's all done in software.
  Single-queue, non-RSS NICs will now have the RX path distributed
  into multiple receive netisr queues.

* If the NIC provides the wrong hash (eg only IPv6 hash when we needed
  an IPv6 TCP hash, or IPv6 UDP hash when we expected IPv6 hash)
  then the hash is recalculated.

* .. if the hash is recalculated, it'll end up being injected into
  the correct netisr queue for v6 processing.

Submitted by:	Tiwei Bie <btw@mail.ustc.edu.cn>
Differential Revision:	https://reviews.freebsd.org/D3504
2015-08-29 07:14:29 +00:00
bz
7bff80af22 remove a left-over after r220463 empty #ifdef INET check.
MFC after:	1 week
2015-08-28 09:38:18 +00:00
adrian
2d6b12b499 Replace the printf()s with optional rate limited debugging for RSS.
Submitted by:	Tiwei Bie <btw@mail.ustc.edu.cn>
Differential Revision:	https://reviews.freebsd.org/D3471
2015-08-28 05:58:16 +00:00
bz
6d4420afb7 get_inpcbinfo() and get_pcblist() are UDP local functions and
do not do what one would expect by name. Prefix them with "udp_"
to at least obviously limit the scope.

This is a non-functional change.

Reviewed by:		gnn, rwatson
MFC after:		2 weeks
Differential Revision:	https://reviews.freebsd.org/D3505
2015-08-27 15:27:41 +00:00
adrian
abcfdaaaf0 Call the new RSS hash calculation function to correctly calculate a hash
based on the configured requirements for the protocol.

Tested:

* UDP IPv6 TX/RX testing, w/ RSS enabled, 82599 ixgbe(4) hardware
2015-08-25 06:12:59 +00:00
adrian
830cda9ae5 Implement the IPv6 RSS software hash function.
This isn't yet linked into the receive/transmit paths anywhere just yet.

This is part of a GSoC 2015 project.

Submitted by:	Tiwei Bie <btw@mail.ustc.edu.cn>
Reviewed by:	hiren, gnn
Differential Revision:	https://reviews.freebsd.org/D3423
2015-08-24 05:36:08 +00:00
hrs
d27954934d - Deprecate IN6_IFF_NODAD. It was used to prevent DAD on a loopback
interface but in6if_do_dad() already had a check for IFF_LOOPBACK.

- Remove in6if_do_dad() check in in6_broadcast_ifa().  An address
  which needs DAD always has IN6_IFF_TENTATIVE there.

- in6if_do_dad() now returns EAGAIN when the interface is not ready
  since DAD callout handler ignores such an interface.

- In DAD callout handler, mark an address as IN6_IFF_TENTATIVE
  when the interface has ND6_IFF_IFDISABLED.  And Do IFF_UP and
  IFF_DRV_RUNNING check consistently when DAD is required.

- draft-ietf-6man-enhanced-dad is now published as RFC 7527.

- Fix some typos.
2015-08-24 05:21:49 +00:00
melifaro
54b3b78856 * Split allocation and table linking for lle's.
Before that, the logic besides lle_create() was the following:
  return existing if found, create if not. This behaviour was error-prone
  since we had to deal with 'sudden' static<>dynamic lle changes.
  This commit fixes bunch of different issues like:
  - refcount leak when lle is converted to static.
    Simple check case:
    console 1:
    while true;
      do for i in `arp -an|awk '$4~/incomp/{print$2}'|tr -d '()'`;
        do arp -s $i 00:22:44:66:88:00 ; arp -d $i;
      done;
    done
   console 2:
    ping -f any-dead-host-in-L2
   console 3:
    # watch for memory consumption:
    vmstat -m | awk '$1~/lltable/{print$2}'
  - possible problems in arptimer() / nd6_timer() when dropping/reacquiring
   lock.
  New logic explicitly handles use-or-create cases in every lla_create
  user. Basically, most of the changes are purely mechanical. However,
  we explicitly avoid using existing lle's for interface/static LLE records.
* While here, call lle_event handlers on all real table lle change.
* Create lltable_free_entry() calling existing per-lltable
  lle_free_t callback for entry deletion
2015-08-20 12:05:17 +00:00
melifaro
0c24547a66 Use single 'lle_timer' callout in lltable instead of
two different names of the same timer.
2015-08-11 12:38:54 +00:00
melifaro
d8f92ce2cf Store addresses instead of sockaddrs inside llentry.
This permits us having all (not fully true yet) all the info
needed in lookup process in first 64 bytes of 'struct llentry'.

struct llentry layout:
BEFORE:
[rwlock .. state .. state .. MAC ] (lle+1) [sockaddr_in[6]]
AFTER
[ in[6]_addr MAC .. state .. rwlock ]

Currently, address part of struct llentry has only 16 bytes for the key.
However, lltable does not restrict any custom lltable consumers with long
keys use the previous approach (store key at (lle+1)).

Sponsored by:	Yandex LLC
2015-08-11 09:26:11 +00:00
melifaro
8e6b3a8d59 MFP r276712.
* Split lltable_init() into lltable_allocate_htbl() (alloc
  hash table with default callbacks) and lltable_link() (
  links any lltable to the list).
* Switch from LLTBL_HASHTBL_SIZE to per-lltable hash size field.
* Move lltable setup to separate functions in in[6]_domifattach.
2015-08-11 05:51:00 +00:00
melifaro
ba06112c24 Rename rt_foreach_fib() to rt_foreach_fib_walk().
Suggested by:	julian
2015-08-10 20:50:31 +00:00
melifaro
4f240a9c31 Partially merge r274887,r275334,r275577,r275578,r275586 to minimize
differences between projects/routing and HEAD.

This commit tries to keep code logic the same while changing underlying
code to use unified callbacks.

* Add llt_foreach_entry method to traverse all entries in given llt
* Add llt_dump_entry method to export particular lle entry in sysctl/rtsock
  format (code is not indented properly to minimize diff). Will be fixed
  in the next commits.
* Add llt_link_entry/llt_unlink_entry methods to link/unlink particular lle.
* Add llt_fill_sa_entry method to export address in the lle to sockaddr
  format.
* Add llt_hash method to use in generic hash table support code.
* Add llt_free_entry method which is used in llt_prefix_free code.

* Prepare for fine-grained locking by separating lle unlink and deletion in
  lltable_free() and lltable_prefix_free().

* Provide lltable_get<ifp|af>() functions to reduce direct 'struct lltable'
 access by external callers.

* Remove @llt agrument from lle_free() lle callback since it was unused.
* Temporarily add L3_CADDR() macro for 'const' sockaddr typecasting.
* Switch to per-af hashing code.
* Rename LLE_FREE_LOCKED() callback from in[6]_lltable_free() to
  in_[6]lltable_destroy() to avoid clashing with llt_free_entry() method.
  Update description from these functions.
* Use unified lltable_free_entry() function instead of per-af one.

Reviewed by:	ae
2015-08-10 12:03:59 +00:00
marius
fbf037b786 Fix compilation after r286457 w/o INVARIANTS or INVARIANT_SUPPORT. 2015-08-08 21:41:59 +00:00
melifaro
33c52eed18 MFP r274295:
* Move interface route cleanup to route.c:rt_flushifroutes()
* Convert most of "for (fibnum = 0; fibnum < rt_numfibs; fibnum++)" users
  to use new rt_foreach_fib() instead of hand-rolling cycles.
2015-08-08 18:14:59 +00:00
melifaro
20bb5966e2 MFP r274553:
* Move lle creation/deletion from lla_lookup to separate functions:
  lla_lookup(LLE_CREATE) -> lla_create
  lla_lookup(LLE_DELETE) -> lla_delete
lla_create now returns with LLE_EXCLUSIVE lock for lle.
* Provide typedefs for new/existing lltable callbacks.

Reviewed by:	ae
2015-08-08 17:48:54 +00:00
melifaro
a915efe931 Simplify ip[6] simploop:
Do not pass 'dst' sockaddr to ip[6]_mloopback:
  - We have explicit check for AF_INET in ip_output()
  - We assume ip header inside passed mbuf in ip_mloopback
  - We assume ip6 header inside passed mbuf in ip6_mloopback
2015-08-08 15:58:35 +00:00
jch
67927a7a7c Decompose TCP INP_INFO lock to increase short-lived TCP connections scalability:
- The existing TCP INP_INFO lock continues to protect the global inpcb list
  stability during full list traversal (e.g. tcp_pcblist()).

- A new INP_LIST lock protects inpcb list actual modifications (inp allocation
  and free) and inpcb global counters.

It allows to use TCP INP_INFO_RLOCK lock in critical paths (e.g. tcp_input())
and INP_INFO_WLOCK only in occasional operations that walk all connections.

PR:			183659
Differential Revision:	https://reviews.freebsd.org/D2599
Reviewed by:		jhb, adrian
Tested by:		adrian, nitroboost-gmail.com
Sponsored by:		Verisign, Inc.
2015-08-03 12:13:54 +00:00
ae
55f7ded2fa Properly handle IPV6_NEXTHOP socket option in selectroute().
o remove disabled code;
 o if nexthop address is link-local, use embedded scope zone id to
   determine outgoing interface;
 o properly fill ro_dst before doing route lookup;
 o remove LLE lookup, instead check rt_flags for RTF_GATEWAY bit.

Sponsored by:	Yandex LLC
2015-08-02 12:40:56 +00:00
ae
99ebe8411a Remove redundant check. 2015-08-02 11:58:24 +00:00
ae
271b2043d8 Eliminate the use of m_copydata() in gif_encapcheck().
ip_encap already has inspected mbuf's data, at least an IP header.
And it is safe to use mtod() and do direct access to needed fields.
Add M_ASSERTPKTHDR() to gif_encapcheck(), since the code expects that
mbuf has a packet header.
Move the code from gif_validate[46] into in[6]_gif_encapcheck(), also
remove "martian filters" checks. According to RFC 4213 it is enough to
verify that the source address is the address of the encapsulator, as
configured on the decapsulator.

Reviewed by:	melifaro
Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2015-07-29 14:07:43 +00:00
ae
75425458ac Convert in_ifaddr_lock and in6_ifaddr_lock to rmlock.
Both are used to protect access to IP addresses lists and they can be
acquired for reading several times per packet. To reduce lock contention
it is better to use rmlock here.

Reviewed by:	gnn (previous version)
Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D3149
2015-07-29 08:12:05 +00:00
tuexen
c0e1a0d3a9 Move including netinet/icmp6.h around to avoid a problem when including
netinet/icmp6.h and net/netmap.h. Both use ni_flags...
This allows to build multistack with SCTP support.

MFC after: 1 week
2015-07-25 18:26:09 +00:00
rrs
e8d0638773 Fix inverted logic bug that David Wolfskill found (thanks David!)
MFC after:	3 Weeks
2015-07-22 09:29:50 +00:00
rrs
e4870a15db When a tunneling protocol is being used with UDP we must release the
lock on the INP before calling the tunnel protocol, else a LOR
may occur (it does with SCTP for sure). Instead we must acquire a
ref count and release the lock, taking care to allow for the case
where the UDP socket has gone away and *not* unlocking since the
refcnt decrement on the inp will do the unlock in that case.

Reviewed by:	tuexen
MFC after:	3 weeks
2015-07-21 09:54:31 +00:00
ae
8ada02a437 Add LLE event handler to report ND6 events to userland via rtsock.
Obtained from:	Yandex LLC
MFC after:	2 weeks
Sponsored by:	Yandex LLC
2015-07-20 06:58:32 +00:00
ae
291abe56fe Invoke LLE event handler when entry is deleted.
MFC after:	2 weeks
Sponsored by:	Yandex LLC
2015-07-20 06:54:50 +00:00
ae
cce3941676 Keep IPv6 address specified by IPV6_PKTINFO socket option in kernel
internal form to be able handle link-local IPv6 addresses.

Reported by:	kp
Tested by:	kp
2015-07-03 19:01:38 +00:00
bz
4c7c9799ed Move comment to the right position.
PR:		152791
Submitted by:	vangyzen (as part of the functional change)
MFC after:	3 days
2015-07-03 09:53:56 +00:00
tuexen
2af840e2ac Add FIB support for SCTP.
This fixes https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=200379

MFC after: 3 days
2015-06-17 15:20:14 +00:00
ae
a1e13e9e59 Move RTM announces into generic code to be independent from Layer2 code.
This fixes bug introduced in 274988, when announces about new addresses
don't sent for tunneling interfaces.

Reported by:	tuexen@
MFC after:	1 week
2015-05-29 10:24:16 +00:00
tuexen
a82f33e60c Fix and cleanup the debug information. This has no user-visible changes.
Thanks to Irene Ruengeler for proving a patch.

MFC after: 3 days
2015-05-28 16:00:23 +00:00
jkim
318c4f97e6 CALLOUT_MPSAFE has lost its meaning since r141428, i.e., for more than ten
years for head.  However, it is continuously misused as the mpsafe argument
for callout_init(9).  Deprecate the flag and clean up callout_init() calls
to make them more consistent.

Differential Revision:	https://reviews.freebsd.org/D2613
Reviewed by:	jhb
MFC after:	2 weeks
2015-05-22 17:05:21 +00:00
ae
cbc4e577f0 Add an ability accept encapsulated packets from different sources by one
gif(4) interface. Add new option "ignore_source" for gif(4) interface.
When it is enabled, gif's encapcheck function requires match only for
packet's destination address.

Differential Revision:	https://reviews.freebsd.org/D2004
Obtained from:	Yandex LLC
MFC after:	2 weeks
Sponsored by:	Yandex LLC
2015-05-15 12:19:45 +00:00
hrs
64bd31eb61 - Remove ND6_IFF_IGNORELOOP. This functionality was useless in practice
because a link where looped back NS messages are permanently observed
  does not work with either NDP or ARP for IPv4.

- draft-ietf-6man-enhanced-dad is now RFC 7527.

Discussed with:	hiren
MFC after:	3 days
2015-05-12 03:31:57 +00:00
ae
85230adb3c Mark data checksum as valid for multicast packets, that we send back
to myself via simloop.
Also remove duplicate check under #ifdef DIAGNOSTIC.

PR:		180065
MFC after:	1 week
2015-05-07 14:17:43 +00:00
ae
efaf13e547 Remove unneded #ifdef INET6 and IPSEC. This file compiled only when
both options are defined.
Include opt_sctp.h and sctp_crc32.h to enable #ifdef SCTP code block
and delayed checksum calculation for SCTP.
2015-05-07 12:15:45 +00:00
glebius
1ca5d173ab Remove #ifdef IFT_FOO.
Submitted by:	Guy Yur <guyyur gmail.com>
2015-05-02 20:31:27 +00:00
ae
b38774ffc9 Remove now unneded KEY_FREESP() for case when ipsec[46]_process_packet()
returns EJUSTRETURN.

Sponsored by:	Yandex LLC
2015-04-27 01:11:09 +00:00
ae
5a6412a276 Fix possible use after free due to security policy deletion.
When we are passing mbuf to IPSec processing via ipsec[46]_process_packet(),
we hold one reference to security policy and release it just after return
from this function. But IPSec processing can be deffered and when we release
reference to security policy after ipsec[46]_process_packet(), user can
delete this security policy from SPDB. And when IPSec processing will be
done, xform's callback function will do access to already freed memory.

To fix this move KEY_FREESP() into callback function. Now IPSec code will
release reference to SP after processing will be finished.

Differential Revision:	https://reviews.freebsd.org/D2324
No objections from:	#network
Sponsored by:	Yandex LLC
2015-04-27 00:55:56 +00:00
glebius
37c56fd2f8 Fix r281649: don't call in6_clearscope() twice.
Submitted by:	ae
2015-04-17 15:26:08 +00:00
glebius
14b7122d6d Provide functions to determine presence of a given address
configured on a given interface.

Discussed with:	np
Sponsored by:	Nginx, Inc.
2015-04-17 11:57:06 +00:00
markj
47f557e75d Fix a possible refcount leak in regen_tmpaddr().
public_ifa6 may be set to NULL after taking a reference to a previous
address list element. Instead, only take the reference after leaving the
loop but before releasing the address list lock.

Differential Revision:	https://reviews.freebsd.org/D2253
Reviewed by:		ae
MFC after:		2 weeks
2015-04-13 01:55:42 +00:00
ae
77d9811f99 Fix the IPV6_MULTICAST_IF sockopt handling. RFC 3493 says when the
interface index is specified as zero, the system should select the
interface to use for outgoing multicast packets. Even the comment
for the in6p_set_multicast_if() function says about index of zero.
But in fact for zero index the function just returns EADDRNOTAVAIL.

I.e. if you first set some interface and then will try reset it
with zero ifindex, you will get EADDRNOTAVAIL.

Reset im6o_multicast_ifp to NULL when interface index specified as
zero. Also return EINVAL in case when ifnet_byindex() returns NULL.
This will be the same behaviour as when ifindex is bigger than
V_if_index. And return EADDRNOTAVAIL only when interface is not
multicast capable.

Reported by:	Olivier Cochard-Labbé
MFC after:	2 weeks
Sponsored by:	Yandex LLC
2015-04-10 19:09:51 +00:00
ae
1af0fcc87b Fix the check for maximum mbuf's size needed to send ND6 NA and NS.
It is acceptable that the size can be equal to MCLBYTES. In the later
KAME's code this check has been moved under DIAGNOSTIC ifdef, because
the size of NA and NS is much smaller than MCLBYTES. So, it is safe to
replace the check with KASSERT.

PR:		199304
Discussed with:	glebius
MFC after:	1 week
2015-04-09 12:57:58 +00:00
kp
86419259e3 Evaluate packet size after the firewall had its chance
Defer the packet size check until after the firewall has had a look at it. This
means that the firewall now has the opportunity to (re-)fragment an oversized
packet.

Differential Revision:	https://reviews.freebsd.org/D1815
Reviewed by:	ae
Approved by:	gnn (mentor)
2015-04-07 20:29:03 +00:00
delphij
250c8d6f6d Mitigate Local Denial of Service with IPv6 Router Advertisements
and log attack attempts.

Submitted by:	hrs
Security:	FreeBSD-SA-15:09.nd6
Security:	CVE-2015-2923
2015-04-07 20:20:09 +00:00
glebius
8ea08edd44 o Make net.inet6.ip6.mif6table return special API structure, that doesn't
contain kernel pointers, and instead has interface index.
  Bump __FreeBSD_version for that change.
o Now, netstat/mroute6.c no longer needs to kvm_read(3) struct ifnet, and
  no longer needs to include if_var.h

Note that this change is far from being a complete move of IPv6 multicast
routing to a proper API. Other structures are still dumped into their
sysctls as is, requiring userland application to #define _KERNEL when
including ip6_mroute.h and then call kvm_read(3) to gather all bits and
pieces. But fixing this is out of scope of the opaque ifnet project.

Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2015-04-06 22:12:18 +00:00
kp
b57f3509a8 Remove duplicate code
We'll just fall into the same local delivery block under the
'if (m->m_flags & M_FASTFWD_OURS)'.

Suggested by:	ae
Differential Revision:	https://reviews.freebsd.org/D2225
Approved by:	gnn (mentor)
2015-04-06 19:08:44 +00:00
kp
86dedea3cb Preserve IPv6 fragment IDs accross reassembly and refragmentation
When forwarding fragmented IPv6 packets and filtering with PF we
reassemble and refragment. That means we generate new fragment headers
and a new fragment ID.

We already save the fragment IDs so we can do the reassembly so it's
straightforward to apply the incoming fragment ID on the refragmented
packets.

Differential Revision:	https://reviews.freebsd.org/D2188
Approved by:		gnn (mentor)
2015-04-01 12:15:01 +00:00
glebius
09c0e42529 Move ip6_sprintf() declaration from in6_var.h to in6.h. This is a simple
function that works with in6_addr and it is not related to the INET6
stack implementation.

Sponsored by:	Nginx, Inc.
2015-03-24 16:45:50 +00:00
ae
a41131da37 To avoid a possible race, release the reference to ifa after return
from nd6_dad_na_input().

Submitted by:	Alexandre Martins
MFC after:	1 week
2015-03-19 00:04:25 +00:00
ae
3169d96832 tcp6_ctlinput() doesn't pass MTU value to in6_pcbnotify().
Check cmdarg isn't NULL before dereference, this check was in the
ip6_notify_pmtu() before r279588.

Reported by:	Florian Smeets
MFC after:	1 week
2015-03-06 05:50:39 +00:00
hrs
218de7f1d6 - Implement loopback probing state in enhanced DAD algorithm.
- Add no_dad and ignoreloop per-IF knob.  no_dad disables DAD completely,
  and ignoreloop is to prevent infinite loop in loopback probing state when
  loopback is permanently expected.
2015-03-05 21:27:49 +00:00
ae
a312c1bedf Fix deadlock in IPv6 PCB code.
When several threads are trying to send datagram to the same destination,
but fragmentation is disabled and datagram size exceeds link MTU,
ip6_output() calls pfctlinput2(PRC_MSGSIZE). It does notify all
sockets wanted to know MTU to this destination. And since all threads
hold PCB lock while sending, taking the lock for each PCB in the
in6_pcbnotify() leads to deadlock.

RFC 3542 p.11.3 suggests notify all application wanted to receive
IPV6_PATHMTU ancillary data for each ICMPv6 packet too big message.
But it doesn't require this, when we don't receive ICMPv6 message.

Change ip6_notify_pmtu() function to be able use it directly from
ip6_output() to notify only one socket, and to notify all sockets
when ICMPv6 packet too big message received.

PR:		197059
Differential Revision:	https://reviews.freebsd.org/D1949
Reviewed by:	no objection from #network
Obtained from:	Yandex LLC
MFC after:	1 week
Sponsored by:	Yandex LLC
2015-03-04 11:20:01 +00:00
ae
aadc4e1de2 Create nd6_ns_output_fib() function with extra argument fibnum. Use it
to initialize mbuf's fibnum. Uninitialized fibnum value can lead to
panic in the routing code. Currently we use only RT_DEFAULT_FIB value
for initialization.

Differential Revision:	https://reviews.freebsd.org/D1998
Reviewed by:	hrs (previous version)
Sponsored by:	Yandex LLC
2015-03-03 10:50:03 +00:00
hrs
09859de9fa Nonce has to be non-NULL for DAD even if net.inet6.ip6.dad_enhanced=0. 2015-03-03 04:28:19 +00:00
hrs
9d87f6a3c3 Implement Enhanced DAD algorithm for IPv6 described in
draft-ietf-6man-enhanced-dad-13.

This basically adds a random nonce option (RFC 3971) to NS messages
for DAD probe to detect a looped back packet.  This looped back packet
prevented DAD on some pseudo-interfaces which aggregates multiple L2 links
such as lagg(4).

The length of the nonce is set to 6 bytes.  This algorithm can be disabled by
setting net.inet6.ip6.dad_enhanced sysctl to 0 in a per-vnet basis.

Reported by:		hiren
Reviewed by:		ae
Differential Revision:	https://reviews.freebsd.org/D1835
2015-03-02 17:30:26 +00:00
glebius
90eb9ef3d2 Now that all users of _WANT_IFADDR are fixed, remove this crutch and
hide ifaddr, in_ifaddr and in6_ifaddr under _KERNEL.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2015-02-19 23:16:10 +00:00
glebius
4ec4d85cb2 - Rename 'struct mld_ifinfo' into 'struct mld_ifsoftc', since it really
represents a context.
- Preserve name 'struct mld_ifinfo' for a new structure, that will be stable
  API between userland and kernel.
- Make sysctl_mld_ifinfo() return the new 'struct mld_ifinfo', instead of
  old one, which had a bunch of internal kernel structures in it.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2015-02-19 22:37:01 +00:00
glebius
d505f11c5d Widen _KERNEL ifdef to hide more kernel network stack structures from userland. 2015-02-19 06:24:27 +00:00
glebius
4a529a64c2 Use new struct mbufq instead of struct ifqueue to manage packet queues in
IPv6 multicast code.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2015-02-19 01:21:23 +00:00
glebius
1b68ebd476 Factor out ip6_fragment() function, to be used in IPv6 stack and pf(4).
Submitted by:		Kristof Provost
Differential Revision:	D1766
2015-02-16 06:30:27 +00:00
glebius
f3e67710e4 Move ip6_deletefraghdr() to frag6.c.
Suggested by:	bz
2015-02-16 05:58:32 +00:00
glebius
35ef97e1c7 Factor out ip6_deletefraghdr() function, to be shared between IPv6
stack and pf(4).

Submitted by:	Kristof Provost
Reviewed by:	ae
Differential Revision:	D1764
2015-02-16 01:12:20 +00:00
rrs
e83a007798 This fixes a bug in the way that the LLE timers for nd6
and arp were being used. They basically would pass in the
mutex to the callout_init. Because they used this method
to the callout system, it was possible to "stop" the callout.
When flushing the table and you stopped the running callout, the
callout_stop code would return 1 indicating that it was going
to stop the callout (that was about to run on the callout_wheel blocked
by the function calling the stop). Now when 1 was returned, it would
lower the reference count one extra time for the stopped timer, then
a few lines later delete the memory. Of course the callout_wheel was
stuck in the lock code and would then crash since it was accessing
freed memory. By using callout_init(c, 1) we always get a 0 back
and the reference counting bug does not rear its head. We do have
to make a few adjustments to the callouts themselves though to make
sure it does the proper thing if rescheduled as well as gets the lock.

Commented upon by hiren and sbruno
See Phabricator D1777 for more details.

Commented upon by hiren and sbruno
Reviewed by:	adrian, jhb and bz
Sponsored by:	Netflix Inc.
2015-02-09 19:28:11 +00:00
ae
e079051f88 Print IPv6 address in log message instead of address of pointer.
MFC after:	1 week
2015-02-05 16:29:26 +00:00
adrian
f1c9d3f332 Refactor / restructure the RSS code into generic, IPv4 and IPv6 specific
bits.

The motivation here is to eventually teach netisr and potentially
other networking subsystems a bit more about how RSS work queues / buckets
are configured so things have a hope of auto-configuring in the future.

* net/rss_config.[ch] takes care of the generic bits for doing
  configuration, hash function selection, etc;
* topelitz.[ch] is now in net/ rather than netinet/;
* (and would be in libkern if it didn't directly include RSS_KEYSIZE;
  that's a later thing to fix up.)
* netinet/in_rss.[ch] now just contains the IPv4 specific methods;
* and netinet/in6_rss.[ch] now just contains the IPv6 specific methods.

This should have no functional impact on anyone currently using
the RSS support.

Differential Revision:	D1383
Reviewed by:	gnn, jfv (intel driver bits)
2015-01-18 18:06:40 +00:00
glebius
2cf1c6ce95 Do not go one layer down to check ifqueue length. First, not all drivers
use ifqueue at all. Second, there is no point in this lockless check.
Either positive or negative result of the check could be incorrect after
a tick.

Sponsored by:	Nginx, Inc.
2015-01-12 14:52:43 +00:00
tuexen
c118a100d9 Minimize the usage of SCTP_BUF_IS_EXTENDED.
This should help Robert...
2015-01-10 20:49:57 +00:00
melifaro
900b596ba0 * Deal with ARCNET L2 multicast mapping for IPv6 the same way as in IPv4:
handle it in arc_output() instead of nd6_storelladdr().
* Remove IFT_ARCNET check from arpresolve() since arc_output() does not
  use arpresolve() to handle broadcast/multicast. This check was there
  since r84931. It looks like it was not used since r89099 (initial
  import of Arcnet support where multicast is handled separately).
* Remove IFT_IEEE1394 case from nd6_storelladdr() since firewire_output()
  calles nd6_storelladdr() for unicast addresses only.
* Remove IFT_ARCNET case from nd6_storelladdr() since arc_output() now
  handles multicast by itself.

As a result, we have the following pattern: all non-ethernet-style
media have their own multicast map handling inside their appropriate
routines. On the other hand, arpresolve() (and nd6_storelladdr()) which
meant to be 'generic' ones de-facto handles ethernet-only multicast maps.

MFC after:	3 weeks
2015-01-09 12:56:51 +00:00
melifaro
4b0aa49e7d Add forgotten definition for nd6_output_ifp(). 2015-01-08 18:29:54 +00:00
melifaro
abec420f04 * Use newly-created nd6_grab_holdchain() function to retrieve lle
hold mbuf chain instead of calling full-blown nd6_output_lle()
  for each packet. This simplifies both callers and nd6_output_lle()
  implementation.
* Make nd6_output_lle() static and remove now-unused lle and chain
  arguments.
* Rename nd6_output_flush() -> nd6_flush_holdchain() to be consistent.
* Move all pre-send transmit hooks to newly-created nd6_output_ifp().
  Now nd6_output(), nd6_output_lle() and nd6_flush_holdchain() are using
  it to send mbufs to if_output.
* Remove SeND hook from nd6_na_input() because it was implemented
  incorrectly since the beginning (r211501):
  - it tagged initial input mbuf (m) instead of m_hold
  - tagging _all_ mbufs in holdchain seems to be wrong anyway.
2015-01-08 18:02:05 +00:00
rwatson
1c44e71143 To ease changes to underlying mbuf structure and the mbuf allocator, reduce
the knowledge of mbuf layout, and in particular constants such as M_EXT,
MLEN, MHLEN, and so on, in mbuf consumers by unifying various alignment
utility functions (M_ALIGN(), MH_ALIGN(), MEXT_ALIGN() in a single
M_ALIGN() macro, implemented by a now-inlined m_align() function:

- Move m_align() from uipc_mbuf.c to mbuf.h; mark as __inline.
- Reimplement M_ALIGN(), MH_ALIGN(), and MEXT_ALIGN() using m_align().
- Update consumers around the tree to simply use M_ALIGN().

This change eliminates a number of cases where mbuf consumers must be aware
of whether or not mbufs returned by the allocator use external storage, but
also assumptions about the size of the returned mbuf. This will make it
easier to introduce changes in how we use external storage, as well as
features such as variable-size mbufs.

Differential Revision:	https://reviews.freebsd.org/D1436
Reviewed by:	glebius, trasz, gnn, bz
Sponsored by:	EMC / Isilon Storage Division
2015-01-05 09:58:32 +00:00
adrian
06c1511222 Migrate the RSS IPv6 hash code to use pointers to the v6 addresses
rather than passing them in by value.

The eventual aim is to do incremental hash construction rather than
all of the memcpy()'ing into a contiguous buffer for the hash
function, which does show up as taking quite a bit of CPU during
profiling.

Tested:

* a variety of laptops/desktop setups I have, with v6 connectivity

Differential Revision:	D1404
Reviewed by:	bz, rpaulo
2014-12-31 22:52:43 +00:00
ae
a5140616af Extern declarations in C files loses compile-time checking that
the functions' calls match their definitions. Move them to header files.

Reviewed by:	jilles (previous version)
2014-12-25 21:32:37 +00:00
ae
aec9c75c1b Remove in_gif.h and in6_gif.h files. They only contain function
declarations used by gif(4). Instead declare these functions in C files.
Also make some variables static.
2014-12-23 16:17:37 +00:00
tuexen
2d11eaedd1 Plug a memory leak in an error code path.
Reported by:	Coverity
CID:		1018936
MFC after: 	3 days
2014-12-17 20:19:57 +00:00
ae
7c61e1dea8 Do not count security policy violation twice.
ipsec*_in_reject() do this by their own.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-12-11 19:20:13 +00:00
ae
c022ef3630 Use ipsec6_in_reject() to simplify ip6_ipsec_fwd() and ip6_ipsec_input().
ipsec6_in_reject() does the same things, also it counts policy violation
errors.

Do IPSEC check in the ip6_forward() after addresses checks.
Also use ip6_ipsec_fwd() to make code similar to IPv4 implementation.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-12-11 19:09:57 +00:00
ae
3665df88dc Remove flag/flags argument from the following functions:
ipsec_getpolicybyaddr()
 ipsec4_checkpolicy()
 ip_ipsec_output()
 ip6_ipsec_output()

The only flag used here was IP_FORWARDING.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-12-11 18:35:34 +00:00
ae
4b1c09e909 Move ip_ipsec_fwd() from ip_input() into ip_forward().
Remove check for presence PACKET_TAG_IPSEC_IN_DONE mbuf tag from
ip_ipsec_fwd(). PACKET_TAG_IPSEC_IN_DONE tag means that packet is
already handled by IPSEC code. This means that before IPSEC processing
it was destined to our address and security policy was checked in
the ip_ipsec_input(). After IPSEC processing packet has new IP
addresses and destination address isn't our own. So, anyway we can't
check security policy from the mbuf tag, because it corresponds
to different addresses.

We should check security policy that corresponds to packet
attributes in both cases - when it has a mbuf tag and when it has not.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-12-11 16:53:29 +00:00
ae
8e6349d4bc Remove PACKET_TAG_IPSEC_IN_DONE mbuf tag lookup and usage of its
security policy. The changed block of code in ip*_ipsec_input() is
called when packet has ESP/AH header. Presence of
PACKET_TAG_IPSEC_IN_DONE mbuf tag in the same time means that
packet was already handled by IPSEC and reinjected in the netisr,
and it has another ESP/AH headers (encrypted twice?).
Since it was already processed by IPSEC code, the AH/ESP headers
was already stripped (and probably outer IP header was stripped too)
and security policy from the tdb_ident was applied to those headers.
It is incorrect to apply this security policy to current headers.

Also make ip_ipsec_input() prototype similar to ip6_ipsec_input().

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-12-11 14:58:55 +00:00
ae
1cca983d1b Remove check for presence of PACKET_TAG_IPSEC_PENDING_TDB and
PACKET_TAG_IPSEC_OUT_CRYPTO_NEEDED mbuf tags. They aren't used in FreeBSD.

Instead check presence of PACKET_TAG_IPSEC_OUT_DONE mbuf tag. If it
is found, bypass security policy lookup as described in the comment.

PACKET_TAG_IPSEC_OUT_DONE tag added to mbuf when IPSEC code finishes
ESP/AH processing. Since it was already finished, this means the security
policy placed in the tdb_ident was already checked. And there is no reason
to check it again here.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-12-11 14:43:44 +00:00
markj
b1aa319778 Revert r275695: nd6_dad_find() was already correct.
Reported by:	ae, kib
Pointy hat to:	markj
2014-12-11 09:16:45 +00:00
markj
6eef5a7d4d Fix a bug in r266857: nd6_dad_find() must return NULL if it doesn't find
a matching element in the DAD queue.

Reported by:	Holger Hans Peter Freyther <holger@freyther.de>
MFC after:	3 days
2014-12-11 00:41:54 +00:00
markj
f0265d5d83 Add refcounting to IPv6 DAD objects and simplify the DAD code to fix a
number of races which could cause double frees or use-after-frees when
performing DAD on an address. In particular, an IPv6 address can now only be
marked as a duplicate from the DAD callout.

Differential Revision:	https://reviews.freebsd.org/D1258
Reviewed by:		ae, hrs
Reported by:		rstone
MFC after:		1 month
2014-12-08 04:44:40 +00:00
tuexen
48d05792ee This is the SCTP specific companion of
https://svnweb.freebsd.org/changeset/base/275358
which was provided by Hans Petter Selasky.
2014-12-04 21:17:50 +00:00
ae
3e1d0ef3ba Remove unneded check. No need to do m_pullup to the size that we prepended.
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-12-02 05:41:03 +00:00
ae
b82eb2f5d9 Remove route chaching support from ipsec code. It isn't used for some time.
* remove sa_route_union declaration and route_cache member from struct secashead;
* remove key_sa_routechange() call from ICMP and ICMPv6 code;
* simplify ip_ipsec_mtu();
* remove #include <net/route.h>;

Sponsored by:	Yandex LLC
2014-12-02 04:20:50 +00:00
hselasky
12fec3618b Start process of removing the use of the deprecated "M_FLOWID" flag
from the FreeBSD network code. The flag is still kept around in the
"sys/mbuf.h" header file, but does no longer have any users. Instead
the "m_pkthdr.rsstype" field in the mbuf structure is now used to
decide the meaning of the "m_pkthdr.flowid" field. To modify the
"m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX"
macros as defined in the "sys/mbuf.h" header file.

This patch introduces new behaviour in the transmit direction.
Previously network drivers checked if "M_FLOWID" was set in "m_flags"
before using the "m_pkthdr.flowid" field. This check has now now been
replaced by checking if "M_HASHTYPE_GET(m)" is different from
"M_HASHTYPE_NONE". In the future more hashtypes will be added, for
example hashtypes for hardware dedicated flows.

"M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is
valid and has no particular type. This change removes the need for an
"if" statement in TCP transmit code checking for the presence of a
valid flowid value. The "if" statement mentioned above is now a direct
variable assignment which is then later checked by the respective
network drivers like before.

Additional notes:
- The SCTP code changes will be committed as a separate patch.
- Removal of the "M_FLOWID" flag will also be done separately.
- The FreeBSD version has been bumped.

MFC after:	1 month
Sponsored by:	Mellanox Technologies
2014-12-01 11:45:24 +00:00
melifaro
95c680b9a3 Do not return unlocked/unreferenced lle in arpresolve/nd6_storelladdr -
return lle flags IFF needed.
Do not pass rte to arpresolve - pass is_gateway flag instead.
2014-11-27 23:06:25 +00:00
ae
0d350694ed Skip L2 addresses lookups for p2p interfaces.
Discussed with:	melifaro
Sponsored by:	Yandex LLC
2014-11-24 21:51:43 +00:00
melifaro
f8d64c469a Finish r274175: do control plane MTU tracking.
Update route MTU in case of ifnet MTU change.
Add new RTF_FIXEDMTU to track explicitly specified MTU.

Old behavior:
ifconfig em0 mtu 1500->9000 -> all routes traversing em0 do not change MTU.
User has to manually update all routes.
ifconfig em0 mtu 9000->1500 -> all routes traversing em0 do not change MTU.
However, if ip[6]_output finds route with rt_mtu > interface mtu, rt_mtu
gets updated.

New behavior:
ifconfig em0 mtu 1500->9000 -> all interface routes in all fibs gets updated
with new MTU unless RTF_FIXEDMTU flag set on them.
ifconfig em0 mtu 9000->1500 -> all routes in all fibs gets updated with new
MTU unless RTF_FIXEDMTU flag set on them AND rt_mtu is less than ifp mtu.

route add ... -mtu XXX automatically sets RTF_FIXEDMTU flag.
route change .. -mtu 0 automatically removes RTF_FIXEDMTU flag.

PR:		194238
MFC after:	1 month
CR:		D1125
2014-11-17 01:05:29 +00:00
ae
2bb499d06b We don't return sp pointer, thus NULL assignment isn't needed.
And reference to sp will be freed at the end.

MFC after:	1 week
Sponsored by:	Yandex LLC
2014-11-12 22:58:52 +00:00
melifaro
12580bcaa8 Kill custom in_matroute() radix mathing function removing one rte mutex lock.
Initially in_matrote() in_clsroute() in their current state was introduced by
r4105 20 years ago. Instead of deleting inactive routes immediately, we kept them
in route table, setting RTPRF_OURS flag and some expire time. After that, either
GC came or RTPRF_OURS got removed on first-packet. It was a good solution
in that days (and probably another decade after that) to keep TCP metrics.
However, after moving metrics to TCP hostcache in r122922, most of in_rmx
functionality became unused. It might had been used for flushing icmp-originated
routes before rte mutexes/refcounting, but I'm not sure about that.

So it looks like this is nearly impossible to make GC do its work nowadays:

in_rtkill() ignores non-RTPRF_OURS routes.
route can only become RTPRF_OURS after dropping last reference via rtfree()
which calls in_clsroute(), which, it turn, ignores UP and non-RTF_DYNAMIC routes.

Dynamic routes can still be installed via received redirect, but they
have default lifetime (no specific rt_expire) and no one has another trie walker
to call RTFREE() on them.

So, the changelist:
* remove custom rnh_match / rnh_close matching function.
* remove all GC functions
* partially revert r256695 (proto3 is no more used inside kernel,
  it is not possible to use rt_expire from user point of view, proto3 support
  is not complete)
* Finish r241884 (similar to this commit) and remove remaining IPv6 parts

MFC after:	1 month
2014-11-11 02:52:40 +00:00
ae
253f06cd2a Add sa6_checkzone_ifp() function. It checks correctness of struct
sockaddr_in6, usually obtained from the user level through ioctl.
It initializes sin6_scope_id using given interface.

Sponsored by:	Yandex LLC
2014-11-10 16:12:51 +00:00
melifaro
89eb9b8cc2 * Make nd6_dad_duplicated() constant.
* Simplify refcounting by using nd6_dad_add() / nd6_dad_del().

Reviewed by:	ae
MFC after:	2 weeks
Sponsored by:	Yandex LLC
2014-11-10 16:01:39 +00:00
ae
e9254e774f Remove link-local multicast routes remnants from in6_purgeaddr.
Also merge in6_purgeaddr_mc with in6_purgeaddr.

Sponsored by:	Yandex LLC
2014-11-10 16:01:31 +00:00
glebius
f28e253225 Consistently use if_link.
Reviewed by:	ae, melifaro
2014-11-10 15:56:30 +00:00
ae
4a7f34f96c For now handle only multicast addresses, we still use routes to
LLA unicasts yet.

Sponsored by:	Yandex LLC
2014-11-10 10:59:08 +00:00
ae
57dd09b78b Use embedded scope zone id to determine outgoing interface for link-local
and node-local addresses.
2014-11-09 22:54:40 +00:00
melifaro
b5d711d3a6 Renove faith(4) and faithd(8) from base. It looks like industry
have chosen different (and more traditional) stateless/statuful
NAT64 as translation mechanism. Last non-trivial commits to both
faith(4) and faithd(8) happened more than 12 years ago, so I assume
it is time to drop RFC3142 in FreeBSD.

No objections from:	net@
2014-11-09 21:33:01 +00:00
melifaro
94efd07208 Remove unused 'struct route *' argument from nd6_output_flush(). 2014-11-09 16:20:27 +00:00
ae
61303c568a Remove ip6_getdstifaddr() and all functions to work with auxiliary data.
It isn't safe to keep unreferenced ifaddrs. Use in6ifa_ifwithaddr() to
determine ifaddr corresponding to destination address. Since currently
we keep addresses with embedded scope zone, in6ifa_ifwithaddr is called
with zero zoneid and marked with XXX.

Also remove route and lle lookups from ip6_input. Use in6ifa_ifwithaddr()
instead.

Sponsored by:	Yandex LLC
2014-11-08 19:38:34 +00:00
ae
7144dc8bc2 Overhaul if_gre(4).
Split it into two modules: if_gre(4) for GRE encapsulation and
if_me(4) for minimal encapsulation within IP.

gre(4) changes:
* convert to if_transmit;
* rework locking: protect access to softc with rmlock,
  protect from concurrent ioctls with sx lock;
* correct interface accounting for outgoing datagramms (count only payload size);
* implement generic support for using IPv6 as delivery header;
* make implementation conform to the RFC 2784 and partially to RFC 2890;
* add support for GRE checksums - calculate for outgoing datagramms and check
  for inconming datagramms;
* add support for sending sequence number in GRE header;
* remove support of cached routes. This fixes problem, when gre(4) doesn't
  work at system startup. But this also removes support for having tunnels with
  the same addresses for inner and outer header.
* deprecate support for various GREXXX ioctls, that doesn't used in FreeBSD.
  Use our standard ioctls for tunnels.

me(4):
* implementation conform to RFC 2004;
* use if_transmit;
* use the same locking model as gre(4);

PR:		164475
Differential Revision:	D1023
No objections from:	net@
Relnotes:	yes
Sponsored by:	Yandex LLC
2014-11-07 19:13:19 +00:00
glebius
99f4ec50e8 Remove SYSCTL_VNET_* macros, and simply put CTLFLAG_VNET where needed.
Sponsored by:	Nginx, Inc.
2014-11-07 09:39:05 +00:00
glebius
6088b7572e Remove VNET_SYSCTL_ARG(). The generic sysctl(9) code handles that.
Reviewed by:	ae
Sponsored by:	Nginx, Inc.
2014-11-07 08:58:05 +00:00
melifaro
5c54c0c246 Finish r274118: remove useless fields from struct domain.
Sponsored by:	Yandex LLC
2014-11-06 14:39:04 +00:00
melifaro
11af63037f Make checks for rt_mtu generic:
Some virtual if drivers has (ab)used ifa ifa_rtrequest hook to enforce
route MTU to be not bigger that interface MTU. While ifa_rtrequest hooking
might be an option in some situation, it is not feasible to do MTU checks
there: generic (or per-domain) routing code is perfectly capable of doing
this.

We currrently have 3 places where MTU is altered:

1) route addition.
 In this case domain overrides radix _addroute callback (in[6]_addroute)
 and all necessary checks/fixes are/can be done there.

2) route change (especially, GW change).
 In this case, there are no explicit per-domain calls, but one can
 override rte by setting ifa_rtrequest hook to domain handler
 (inet6 does this).

3) ifconfig ifaceX mtu YYYY
 In this case, we have no callbacks, but ip[6]_output performes runtime
 checks and decreases rt_mtu if necessary.

Generally, the goals are to be able to handle all MTU changes in
 control plane, not in runtime part, and properly deal with increased
 interface MTU.

This commit changes the following:
* removes hooks setting MTU from drivers side
* adds proper per-doman MTU checks for case 1)
* adds generic MTU check for case 2)

* The latter is done by using new dom_ifmtu callback since
 if_mtu denotes L3 interface MTU, e.g. maximum trasmitted _packet_ size.
 However, IPv6 mtu might be different from if_mtu one (e.g. default 1280)
 for some cases, so we need an abstract way to know maximum MTU size
 for given interface and domain.
* moves rt_setmetrics() before MTU/ifa_rtrequest hooks since it copies
  user-supplied data which must be checked.
* removes RT_LOCK_ASSERT() from other ifa_rtrequest hooks to be able to
  use this functions on new non-inserted rte.

More changes will follow soon.

MFC after:	1 month
Sponsored by:	Yandex LLC
2014-11-06 13:13:09 +00:00
melifaro
c2069a39a4 Remove old hack abusing domattach from NFS code.
According to IANA RPC uaddr registry, there are no AFs
except IPv4 and IPv6, so it's not worth being too abstract here.

Remove ne_rtable[AF_MAX+1] and use explicit per-AF radix tries.
Use own initialization without relying on domattach code.

While I admit that this was one of the rare places in kernel
networking code which really was capable of doing multi-AF
without any AF-depended code, it is not possible anymore to
rely on dom* code.

While here, change terrifying "Invalid radix node head, rn:" message,
to different non-understandable "netcred already exists for given addr/mask",
but less terrifying. Since we know that rn_addaddr() returns NULL if
the same record already exists, we should provide more friendly error.

MFC after:	1 month
2014-11-05 00:58:01 +00:00
hrs
948508e096 Fix a bug which prevented ND6_IFF_IFDISABLED flag from clearing when
the newly-added IPv6 address was /128.

PR:	188032
2014-11-02 21:58:31 +00:00
ae
e74f24ecf2 Remove redundant code.
if_detach already did these steps. Also, now we didn't keep routes to link-local
addresses.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-10-30 12:44:46 +00:00
ae
3820ebf4a8 Move ifq drain into in6m_purge().
Suggested by:	bms
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-10-30 11:34:07 +00:00
ae
41aec014af Fix mbuf leak in IPv6 multicast code.
When multicast capable interface goes away, it leaves multicast groups,
this leads to generate MLD reports, but MLD code does deffered send and
MLD reports are queued in the in6_multi's in6m_scq ifq. The problem is
that in6_multi structures are freed when interface leaves multicast groups
and thread that does deffered send will not take these queued packets.

PR:		194577
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-10-30 10:59:57 +00:00
ae
b27ddf1ff3 Do not automatically install routes to link-local and interface-local multicast
addresses.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-10-27 16:15:15 +00:00
ae
8d3b28e72b Remove unused function.
Sponsored by:	Yandex LLC
2014-10-27 10:34:09 +00:00