Commit Graph

1632 Commits

Author SHA1 Message Date
Andrey V. Elsukov
4e870f943f Move RTM announces into generic code to be independent from Layer2 code.
This fixes bug introduced in 274988, when announces about new addresses
don't sent for tunneling interfaces.

Reported by:	tuexen@
MFC after:	1 week
2015-05-29 10:24:16 +00:00
Michael Tuexen
b7d130befc Fix and cleanup the debug information. This has no user-visible changes.
Thanks to Irene Ruengeler for proving a patch.

MFC after: 3 days
2015-05-28 16:00:23 +00:00
Jung-uk Kim
fd90e2ed54 CALLOUT_MPSAFE has lost its meaning since r141428, i.e., for more than ten
years for head.  However, it is continuously misused as the mpsafe argument
for callout_init(9).  Deprecate the flag and clean up callout_init() calls
to make them more consistent.

Differential Revision:	https://reviews.freebsd.org/D2613
Reviewed by:	jhb
MFC after:	2 weeks
2015-05-22 17:05:21 +00:00
Andrey V. Elsukov
c1b4f79dfa Add an ability accept encapsulated packets from different sources by one
gif(4) interface. Add new option "ignore_source" for gif(4) interface.
When it is enabled, gif's encapcheck function requires match only for
packet's destination address.

Differential Revision:	https://reviews.freebsd.org/D2004
Obtained from:	Yandex LLC
MFC after:	2 weeks
Sponsored by:	Yandex LLC
2015-05-15 12:19:45 +00:00
Hiroki Sato
59333867ff - Remove ND6_IFF_IGNORELOOP. This functionality was useless in practice
because a link where looped back NS messages are permanently observed
  does not work with either NDP or ARP for IPv4.

- draft-ietf-6man-enhanced-dad is now RFC 7527.

Discussed with:	hiren
MFC after:	3 days
2015-05-12 03:31:57 +00:00
Andrey V. Elsukov
654bdb5abb Mark data checksum as valid for multicast packets, that we send back
to myself via simloop.
Also remove duplicate check under #ifdef DIAGNOSTIC.

PR:		180065
MFC after:	1 week
2015-05-07 14:17:43 +00:00
Andrey V. Elsukov
db037aa4ed Remove unneded #ifdef INET6 and IPSEC. This file compiled only when
both options are defined.
Include opt_sctp.h and sctp_crc32.h to enable #ifdef SCTP code block
and delayed checksum calculation for SCTP.
2015-05-07 12:15:45 +00:00
Gleb Smirnoff
0fa5aacd8b Remove #ifdef IFT_FOO.
Submitted by:	Guy Yur <guyyur gmail.com>
2015-05-02 20:31:27 +00:00
Andrey V. Elsukov
3e92c37f32 Remove now unneded KEY_FREESP() for case when ipsec[46]_process_packet()
returns EJUSTRETURN.

Sponsored by:	Yandex LLC
2015-04-27 01:11:09 +00:00
Andrey V. Elsukov
3d80e82d60 Fix possible use after free due to security policy deletion.
When we are passing mbuf to IPSec processing via ipsec[46]_process_packet(),
we hold one reference to security policy and release it just after return
from this function. But IPSec processing can be deffered and when we release
reference to security policy after ipsec[46]_process_packet(), user can
delete this security policy from SPDB. And when IPSec processing will be
done, xform's callback function will do access to already freed memory.

To fix this move KEY_FREESP() into callback function. Now IPSec code will
release reference to SP after processing will be finished.

Differential Revision:	https://reviews.freebsd.org/D2324
No objections from:	#network
Sponsored by:	Yandex LLC
2015-04-27 00:55:56 +00:00
Gleb Smirnoff
210b5c73e7 Fix r281649: don't call in6_clearscope() twice.
Submitted by:	ae
2015-04-17 15:26:08 +00:00
Gleb Smirnoff
28ebe80cab Provide functions to determine presence of a given address
configured on a given interface.

Discussed with:	np
Sponsored by:	Nginx, Inc.
2015-04-17 11:57:06 +00:00
Mark Johnston
dff78447a4 Fix a possible refcount leak in regen_tmpaddr().
public_ifa6 may be set to NULL after taking a reference to a previous
address list element. Instead, only take the reference after leaving the
loop but before releasing the address list lock.

Differential Revision:	https://reviews.freebsd.org/D2253
Reviewed by:		ae
MFC after:		2 weeks
2015-04-13 01:55:42 +00:00
Andrey V. Elsukov
e2956804dd Fix the IPV6_MULTICAST_IF sockopt handling. RFC 3493 says when the
interface index is specified as zero, the system should select the
interface to use for outgoing multicast packets. Even the comment
for the in6p_set_multicast_if() function says about index of zero.
But in fact for zero index the function just returns EADDRNOTAVAIL.

I.e. if you first set some interface and then will try reset it
with zero ifindex, you will get EADDRNOTAVAIL.

Reset im6o_multicast_ifp to NULL when interface index specified as
zero. Also return EINVAL in case when ifnet_byindex() returns NULL.
This will be the same behaviour as when ifindex is bigger than
V_if_index. And return EADDRNOTAVAIL only when interface is not
multicast capable.

Reported by:	Olivier Cochard-Labbé
MFC after:	2 weeks
Sponsored by:	Yandex LLC
2015-04-10 19:09:51 +00:00
Andrey V. Elsukov
efb19cf6db Fix the check for maximum mbuf's size needed to send ND6 NA and NS.
It is acceptable that the size can be equal to MCLBYTES. In the later
KAME's code this check has been moved under DIAGNOSTIC ifdef, because
the size of NA and NS is much smaller than MCLBYTES. So, it is safe to
replace the check with KASSERT.

PR:		199304
Discussed with:	glebius
MFC after:	1 week
2015-04-09 12:57:58 +00:00
Kristof Provost
53deb05c36 Evaluate packet size after the firewall had its chance
Defer the packet size check until after the firewall has had a look at it. This
means that the firewall now has the opportunity to (re-)fragment an oversized
packet.

Differential Revision:	https://reviews.freebsd.org/D1815
Reviewed by:	ae
Approved by:	gnn (mentor)
2015-04-07 20:29:03 +00:00
Xin LI
dd3856601d Mitigate Local Denial of Service with IPv6 Router Advertisements
and log attack attempts.

Submitted by:	hrs
Security:	FreeBSD-SA-15:09.nd6
Security:	CVE-2015-2923
2015-04-07 20:20:09 +00:00
Gleb Smirnoff
c151f24d08 o Make net.inet6.ip6.mif6table return special API structure, that doesn't
contain kernel pointers, and instead has interface index.
  Bump __FreeBSD_version for that change.
o Now, netstat/mroute6.c no longer needs to kvm_read(3) struct ifnet, and
  no longer needs to include if_var.h

Note that this change is far from being a complete move of IPv6 multicast
routing to a proper API. Other structures are still dumped into their
sysctls as is, requiring userland application to #define _KERNEL when
including ip6_mroute.h and then call kvm_read(3) to gather all bits and
pieces. But fixing this is out of scope of the opaque ifnet project.

Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2015-04-06 22:12:18 +00:00
Kristof Provost
31e2e88c27 Remove duplicate code
We'll just fall into the same local delivery block under the
'if (m->m_flags & M_FASTFWD_OURS)'.

Suggested by:	ae
Differential Revision:	https://reviews.freebsd.org/D2225
Approved by:	gnn (mentor)
2015-04-06 19:08:44 +00:00
Kristof Provost
798318490e Preserve IPv6 fragment IDs accross reassembly and refragmentation
When forwarding fragmented IPv6 packets and filtering with PF we
reassemble and refragment. That means we generate new fragment headers
and a new fragment ID.

We already save the fragment IDs so we can do the reassembly so it's
straightforward to apply the incoming fragment ID on the refragmented
packets.

Differential Revision:	https://reviews.freebsd.org/D2188
Approved by:		gnn (mentor)
2015-04-01 12:15:01 +00:00
Gleb Smirnoff
20778ab5b4 Move ip6_sprintf() declaration from in6_var.h to in6.h. This is a simple
function that works with in6_addr and it is not related to the INET6
stack implementation.

Sponsored by:	Nginx, Inc.
2015-03-24 16:45:50 +00:00
Andrey V. Elsukov
ff9f2a36de To avoid a possible race, release the reference to ifa after return
from nd6_dad_na_input().

Submitted by:	Alexandre Martins
MFC after:	1 week
2015-03-19 00:04:25 +00:00
Andrey V. Elsukov
fd8dd3a6d7 tcp6_ctlinput() doesn't pass MTU value to in6_pcbnotify().
Check cmdarg isn't NULL before dereference, this check was in the
ip6_notify_pmtu() before r279588.

Reported by:	Florian Smeets
MFC after:	1 week
2015-03-06 05:50:39 +00:00
Hiroki Sato
23e9ffb0e1 - Implement loopback probing state in enhanced DAD algorithm.
- Add no_dad and ignoreloop per-IF knob.  no_dad disables DAD completely,
  and ignoreloop is to prevent infinite loop in loopback probing state when
  loopback is permanently expected.
2015-03-05 21:27:49 +00:00
Andrey V. Elsukov
8f1beb889e Fix deadlock in IPv6 PCB code.
When several threads are trying to send datagram to the same destination,
but fragmentation is disabled and datagram size exceeds link MTU,
ip6_output() calls pfctlinput2(PRC_MSGSIZE). It does notify all
sockets wanted to know MTU to this destination. And since all threads
hold PCB lock while sending, taking the lock for each PCB in the
in6_pcbnotify() leads to deadlock.

RFC 3542 p.11.3 suggests notify all application wanted to receive
IPV6_PATHMTU ancillary data for each ICMPv6 packet too big message.
But it doesn't require this, when we don't receive ICMPv6 message.

Change ip6_notify_pmtu() function to be able use it directly from
ip6_output() to notify only one socket, and to notify all sockets
when ICMPv6 packet too big message received.

PR:		197059
Differential Revision:	https://reviews.freebsd.org/D1949
Reviewed by:	no objection from #network
Obtained from:	Yandex LLC
MFC after:	1 week
Sponsored by:	Yandex LLC
2015-03-04 11:20:01 +00:00
Andrey V. Elsukov
1eef8a6c08 Create nd6_ns_output_fib() function with extra argument fibnum. Use it
to initialize mbuf's fibnum. Uninitialized fibnum value can lead to
panic in the routing code. Currently we use only RT_DEFAULT_FIB value
for initialization.

Differential Revision:	https://reviews.freebsd.org/D1998
Reviewed by:	hrs (previous version)
Sponsored by:	Yandex LLC
2015-03-03 10:50:03 +00:00
Hiroki Sato
8d56075939 Nonce has to be non-NULL for DAD even if net.inet6.ip6.dad_enhanced=0. 2015-03-03 04:28:19 +00:00
Hiroki Sato
11d8451df3 Implement Enhanced DAD algorithm for IPv6 described in
draft-ietf-6man-enhanced-dad-13.

This basically adds a random nonce option (RFC 3971) to NS messages
for DAD probe to detect a looped back packet.  This looped back packet
prevented DAD on some pseudo-interfaces which aggregates multiple L2 links
such as lagg(4).

The length of the nonce is set to 6 bytes.  This algorithm can be disabled by
setting net.inet6.ip6.dad_enhanced sysctl to 0 in a per-vnet basis.

Reported by:		hiren
Reviewed by:		ae
Differential Revision:	https://reviews.freebsd.org/D1835
2015-03-02 17:30:26 +00:00
Gleb Smirnoff
e072c794ad Now that all users of _WANT_IFADDR are fixed, remove this crutch and
hide ifaddr, in_ifaddr and in6_ifaddr under _KERNEL.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2015-02-19 23:16:10 +00:00
Gleb Smirnoff
9e62a5a379 - Rename 'struct mld_ifinfo' into 'struct mld_ifsoftc', since it really
represents a context.
- Preserve name 'struct mld_ifinfo' for a new structure, that will be stable
  API between userland and kernel.
- Make sysctl_mld_ifinfo() return the new 'struct mld_ifinfo', instead of
  old one, which had a bunch of internal kernel structures in it.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2015-02-19 22:37:01 +00:00
Gleb Smirnoff
fd1b2a7c57 Widen _KERNEL ifdef to hide more kernel network stack structures from userland. 2015-02-19 06:24:27 +00:00
Gleb Smirnoff
a99c84d4e6 Use new struct mbufq instead of struct ifqueue to manage packet queues in
IPv6 multicast code.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2015-02-19 01:21:23 +00:00
Gleb Smirnoff
6c269f6912 Factor out ip6_fragment() function, to be used in IPv6 stack and pf(4).
Submitted by:		Kristof Provost
Differential Revision:	D1766
2015-02-16 06:30:27 +00:00
Gleb Smirnoff
e5ee706031 Move ip6_deletefraghdr() to frag6.c.
Suggested by:	bz
2015-02-16 05:58:32 +00:00
Gleb Smirnoff
0b438b0fb8 Factor out ip6_deletefraghdr() function, to be shared between IPv6
stack and pf(4).

Submitted by:	Kristof Provost
Reviewed by:	ae
Differential Revision:	D1764
2015-02-16 01:12:20 +00:00
Randall Stewart
2575fbb827 This fixes a bug in the way that the LLE timers for nd6
and arp were being used. They basically would pass in the
mutex to the callout_init. Because they used this method
to the callout system, it was possible to "stop" the callout.
When flushing the table and you stopped the running callout, the
callout_stop code would return 1 indicating that it was going
to stop the callout (that was about to run on the callout_wheel blocked
by the function calling the stop). Now when 1 was returned, it would
lower the reference count one extra time for the stopped timer, then
a few lines later delete the memory. Of course the callout_wheel was
stuck in the lock code and would then crash since it was accessing
freed memory. By using callout_init(c, 1) we always get a 0 back
and the reference counting bug does not rear its head. We do have
to make a few adjustments to the callouts themselves though to make
sure it does the proper thing if rescheduled as well as gets the lock.

Commented upon by hiren and sbruno
See Phabricator D1777 for more details.

Commented upon by hiren and sbruno
Reviewed by:	adrian, jhb and bz
Sponsored by:	Netflix Inc.
2015-02-09 19:28:11 +00:00
Andrey V. Elsukov
46386183da Print IPv6 address in log message instead of address of pointer.
MFC after:	1 week
2015-02-05 16:29:26 +00:00
Adrian Chadd
b2bdc62a95 Refactor / restructure the RSS code into generic, IPv4 and IPv6 specific
bits.

The motivation here is to eventually teach netisr and potentially
other networking subsystems a bit more about how RSS work queues / buckets
are configured so things have a hope of auto-configuring in the future.

* net/rss_config.[ch] takes care of the generic bits for doing
  configuration, hash function selection, etc;
* topelitz.[ch] is now in net/ rather than netinet/;
* (and would be in libkern if it didn't directly include RSS_KEYSIZE;
  that's a later thing to fix up.)
* netinet/in_rss.[ch] now just contains the IPv4 specific methods;
* and netinet/in6_rss.[ch] now just contains the IPv6 specific methods.

This should have no functional impact on anyone currently using
the RSS support.

Differential Revision:	D1383
Reviewed by:	gnn, jfv (intel driver bits)
2015-01-18 18:06:40 +00:00
Gleb Smirnoff
ffec6ee527 Do not go one layer down to check ifqueue length. First, not all drivers
use ifqueue at all. Second, there is no point in this lockless check.
Either positive or negative result of the check could be incorrect after
a tick.

Sponsored by:	Nginx, Inc.
2015-01-12 14:52:43 +00:00
Michael Tuexen
4be807c4d6 Minimize the usage of SCTP_BUF_IS_EXTENDED.
This should help Robert...
2015-01-10 20:49:57 +00:00
Alexander V. Chernikov
d63e657c04 * Deal with ARCNET L2 multicast mapping for IPv6 the same way as in IPv4:
handle it in arc_output() instead of nd6_storelladdr().
* Remove IFT_ARCNET check from arpresolve() since arc_output() does not
  use arpresolve() to handle broadcast/multicast. This check was there
  since r84931. It looks like it was not used since r89099 (initial
  import of Arcnet support where multicast is handled separately).
* Remove IFT_IEEE1394 case from nd6_storelladdr() since firewire_output()
  calles nd6_storelladdr() for unicast addresses only.
* Remove IFT_ARCNET case from nd6_storelladdr() since arc_output() now
  handles multicast by itself.

As a result, we have the following pattern: all non-ethernet-style
media have their own multicast map handling inside their appropriate
routines. On the other hand, arpresolve() (and nd6_storelladdr()) which
meant to be 'generic' ones de-facto handles ethernet-only multicast maps.

MFC after:	3 weeks
2015-01-09 12:56:51 +00:00
Alexander V. Chernikov
abc1be9062 Add forgotten definition for nd6_output_ifp(). 2015-01-08 18:29:54 +00:00
Alexander V. Chernikov
d7968c29ec * Use newly-created nd6_grab_holdchain() function to retrieve lle
hold mbuf chain instead of calling full-blown nd6_output_lle()
  for each packet. This simplifies both callers and nd6_output_lle()
  implementation.
* Make nd6_output_lle() static and remove now-unused lle and chain
  arguments.
* Rename nd6_output_flush() -> nd6_flush_holdchain() to be consistent.
* Move all pre-send transmit hooks to newly-created nd6_output_ifp().
  Now nd6_output(), nd6_output_lle() and nd6_flush_holdchain() are using
  it to send mbufs to if_output.
* Remove SeND hook from nd6_na_input() because it was implemented
  incorrectly since the beginning (r211501):
  - it tagged initial input mbuf (m) instead of m_hold
  - tagging _all_ mbufs in holdchain seems to be wrong anyway.
2015-01-08 18:02:05 +00:00
Alexander V. Chernikov
3a7498636a * Allocate hash tables separately
* Make llt_hash() callback more flexible
* Default hash size and hashing method is now per-af
* Move lltable allocation to separate function
2015-01-05 17:23:02 +00:00
Alexander V. Chernikov
df485dbe3e Do not call LLE_WUNLOCK() for deleted lle. 2015-01-05 16:10:54 +00:00
Robert Watson
ed6a66ca6c To ease changes to underlying mbuf structure and the mbuf allocator, reduce
the knowledge of mbuf layout, and in particular constants such as M_EXT,
MLEN, MHLEN, and so on, in mbuf consumers by unifying various alignment
utility functions (M_ALIGN(), MH_ALIGN(), MEXT_ALIGN() in a single
M_ALIGN() macro, implemented by a now-inlined m_align() function:

- Move m_align() from uipc_mbuf.c to mbuf.h; mark as __inline.
- Reimplement M_ALIGN(), MH_ALIGN(), and MEXT_ALIGN() using m_align().
- Update consumers around the tree to simply use M_ALIGN().

This change eliminates a number of cases where mbuf consumers must be aware
of whether or not mbufs returned by the allocator use external storage, but
also assumptions about the size of the returned mbuf. This will make it
easier to introduce changes in how we use external storage, as well as
features such as variable-size mbufs.

Differential Revision:	https://reviews.freebsd.org/D1436
Reviewed by:	glebius, trasz, gnn, bz
Sponsored by:	EMC / Isilon Storage Division
2015-01-05 09:58:32 +00:00
Alexander V. Chernikov
b44a7d5d87 * Use unified code for deleting entry by sockaddr instead of per-af one.
* Remove now unused llt_delete_addr callback.
2015-01-03 19:09:06 +00:00
Alexander V. Chernikov
20dd899505 * Hide lltable implementation details in if_llatbl_var.h
* Make most of lltable_* methods 'normal' functions instead of inline
* Add lltable_get_<af|ifp>() functions to access given lltable fields
* Temporarily resurrect nd6_lookup() function
2015-01-03 16:04:28 +00:00
Alexander V. Chernikov
787cea14a5 Since @ln is the result of LLTABLE6(ifp) lookup its originating interface
must always be @ifp. So change ln->lle_tbl->llt_ifp to ifp.
2015-01-03 14:18:48 +00:00
Alexander V. Chernikov
d2e0f37c22 Finish r275628 #2: remove remaining 'base' references. 2015-01-03 14:09:35 +00:00
Adrian Chadd
492ccbe14d Migrate the RSS IPv6 hash code to use pointers to the v6 addresses
rather than passing them in by value.

The eventual aim is to do incremental hash construction rather than
all of the memcpy()'ing into a contiguous buffer for the hash
function, which does show up as taking quite a bit of CPU during
profiling.

Tested:

* a variety of laptops/desktop setups I have, with v6 connectivity

Differential Revision:	D1404
Reviewed by:	bz, rpaulo
2014-12-31 22:52:43 +00:00
Andrey V. Elsukov
f188f14d43 Extern declarations in C files loses compile-time checking that
the functions' calls match their definitions. Move them to header files.

Reviewed by:	jilles (previous version)
2014-12-25 21:32:37 +00:00
Andrey V. Elsukov
132c449079 Remove in_gif.h and in6_gif.h files. They only contain function
declarations used by gif(4). Instead declare these functions in C files.
Also make some variables static.
2014-12-23 16:17:37 +00:00
Michael Tuexen
caeae63f97 Plug a memory leak in an error code path.
Reported by:	Coverity
CID:		1018936
MFC after: 	3 days
2014-12-17 20:19:57 +00:00
Andrey V. Elsukov
44eb8bbe7b Do not count security policy violation twice.
ipsec*_in_reject() do this by their own.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-12-11 19:20:13 +00:00
Andrey V. Elsukov
49ada98eac Use ipsec6_in_reject() to simplify ip6_ipsec_fwd() and ip6_ipsec_input().
ipsec6_in_reject() does the same things, also it counts policy violation
errors.

Do IPSEC check in the ip6_forward() after addresses checks.
Also use ip6_ipsec_fwd() to make code similar to IPv4 implementation.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-12-11 19:09:57 +00:00
Andrey V. Elsukov
0275b2e369 Remove flag/flags argument from the following functions:
ipsec_getpolicybyaddr()
 ipsec4_checkpolicy()
 ip_ipsec_output()
 ip6_ipsec_output()

The only flag used here was IP_FORWARDING.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-12-11 18:35:34 +00:00
Andrey V. Elsukov
8922ddbe40 Move ip_ipsec_fwd() from ip_input() into ip_forward().
Remove check for presence PACKET_TAG_IPSEC_IN_DONE mbuf tag from
ip_ipsec_fwd(). PACKET_TAG_IPSEC_IN_DONE tag means that packet is
already handled by IPSEC code. This means that before IPSEC processing
it was destined to our address and security policy was checked in
the ip_ipsec_input(). After IPSEC processing packet has new IP
addresses and destination address isn't our own. So, anyway we can't
check security policy from the mbuf tag, because it corresponds
to different addresses.

We should check security policy that corresponds to packet
attributes in both cases - when it has a mbuf tag and when it has not.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-12-11 16:53:29 +00:00
Andrey V. Elsukov
e58320f127 Remove PACKET_TAG_IPSEC_IN_DONE mbuf tag lookup and usage of its
security policy. The changed block of code in ip*_ipsec_input() is
called when packet has ESP/AH header. Presence of
PACKET_TAG_IPSEC_IN_DONE mbuf tag in the same time means that
packet was already handled by IPSEC and reinjected in the netisr,
and it has another ESP/AH headers (encrypted twice?).
Since it was already processed by IPSEC code, the AH/ESP headers
was already stripped (and probably outer IP header was stripped too)
and security policy from the tdb_ident was applied to those headers.
It is incorrect to apply this security policy to current headers.

Also make ip_ipsec_input() prototype similar to ip6_ipsec_input().

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-12-11 14:58:55 +00:00
Andrey V. Elsukov
dd9cd45b44 Remove check for presence of PACKET_TAG_IPSEC_PENDING_TDB and
PACKET_TAG_IPSEC_OUT_CRYPTO_NEEDED mbuf tags. They aren't used in FreeBSD.

Instead check presence of PACKET_TAG_IPSEC_OUT_DONE mbuf tag. If it
is found, bypass security policy lookup as described in the comment.

PACKET_TAG_IPSEC_OUT_DONE tag added to mbuf when IPSEC code finishes
ESP/AH processing. Since it was already finished, this means the security
policy placed in the tdb_ident was already checked. And there is no reason
to check it again here.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-12-11 14:43:44 +00:00
Mark Johnston
a37271c3b8 Revert r275695: nd6_dad_find() was already correct.
Reported by:	ae, kib
Pointy hat to:	markj
2014-12-11 09:16:45 +00:00
Mark Johnston
97712e3efc Fix a bug in r266857: nd6_dad_find() must return NULL if it doesn't find
a matching element in the DAD queue.

Reported by:	Holger Hans Peter Freyther <holger@freyther.de>
MFC after:	3 days
2014-12-11 00:41:54 +00:00
Alexander V. Chernikov
ee7e9a4e17 * Do not assume lle has sockaddr key after struct lle:
use llt_fill_sa_entry() llt method to store lle address in sa.
* Eliminate L3_ADDR macro and either reference IPv4/IPv6 address
   directly from lle or use newly-created llt_fill_sa_entry().
* Do not store sockaddr inside arp/ndp lle anymore.
2014-12-09 00:48:08 +00:00
Alexander V. Chernikov
d82ed5051c Simplify lle lookup/create api by using addresses instead of sockaddrs. 2014-12-08 23:23:53 +00:00
Mark Johnston
d6ad6a865a Add refcounting to IPv6 DAD objects and simplify the DAD code to fix a
number of races which could cause double frees or use-after-frees when
performing DAD on an address. In particular, an IPv6 address can now only be
marked as a duplicate from the DAD callout.

Differential Revision:	https://reviews.freebsd.org/D1258
Reviewed by:		ae, hrs
Reported by:		rstone
MFC after:		1 month
2014-12-08 04:44:40 +00:00
Alexander V. Chernikov
73b52ad896 Use llt_prepare_static_entry method to prepare valid per-af static entry. 2014-12-07 23:59:44 +00:00
Alexander V. Chernikov
0368226e65 * Retire abstract llentry_free() in favor of lltable_drop_entry_queue()
and explicit calls to RTENTRY_FREE_LOCKED()
* Use lltable_prefix_free() in arp_ifscrub to be consistent with nd6.
* Rename <lltable_|llt>_delete function to _delete_addr() to note that
   this function is used to external callers. Make this function maintain
   its own locking.
* Use lookup/unlink/clear call chain from internal callers instead of
    delete_addr.
* Fix LLE_DELETED flag handling
2014-12-07 23:08:07 +00:00
Alexander V. Chernikov
721cd2e032 Do not enforce particular lle storage scheme:
* move lltable allocation to per-domain callbacks.
* make llentry_link/unlink functions overridable llt methods.
* make hash table traversal another overridable llt method.
2014-12-07 17:32:06 +00:00
Alexander V. Chernikov
a743ccd468 * Add llt_clear_entry() callback which is able to do all lle
cleanup including unlinking/freeing
* Relax locking in lltable_prefix_free_af/lltable_free
* Do not pass @llt to lle free callback: it is always NULL now.
* Unify arptimer/nd6_llinfo_timer: explicitly unlock lle avoiding
   unlock/lock sequinces
* Do not pass unlocked lle to nd6_ns_output(): add nd6_llinfo_get_holdsrc()
   to retrieve preferred source address from lle hold queue and pass it
   instead of lle.
* Finally, make nd6_create() create and return unlocked lle
* Separate defrtr handling code from nd6_free():
   use nd6_check_del_defrtr() to check if we need to keep entry instead of
    performing GC,
   use nd6_check_recalc_defrtr() to perform actual recalc on lle removal.
* Move isRouter handling from nd6_cache_lladdr() to separate
   nd6_check_router()
* Add initial code to maintain lle runtime flags in sync.
2014-12-07 15:42:46 +00:00
Michael Tuexen
457b4b8836 This is the SCTP specific companion of
https://svnweb.freebsd.org/changeset/base/275358
which was provided by Hans Petter Selasky.
2014-12-04 21:17:50 +00:00
Andrey V. Elsukov
2dfcd0ae9d Remove unneded check. No need to do m_pullup to the size that we prepended.
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-12-02 05:41:03 +00:00
Andrey V. Elsukov
2d957916ef Remove route chaching support from ipsec code. It isn't used for some time.
* remove sa_route_union declaration and route_cache member from struct secashead;
* remove key_sa_routechange() call from ICMP and ICMPv6 code;
* simplify ip_ipsec_mtu();
* remove #include <net/route.h>;

Sponsored by:	Yandex LLC
2014-12-02 04:20:50 +00:00
Alexander V. Chernikov
9b65db85e2 Do more fine-grained locking in lltable code: lltable_create_lle()
does actual new lle creation without extensive locking and existing
  lle search.
Move lle updating code from gigantic in_arpinput() to arp_update_llle()
  and some other functions.

IPv6 changes to follow.
2014-12-01 21:43:48 +00:00
Hans Petter Selasky
c25290420e Start process of removing the use of the deprecated "M_FLOWID" flag
from the FreeBSD network code. The flag is still kept around in the
"sys/mbuf.h" header file, but does no longer have any users. Instead
the "m_pkthdr.rsstype" field in the mbuf structure is now used to
decide the meaning of the "m_pkthdr.flowid" field. To modify the
"m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX"
macros as defined in the "sys/mbuf.h" header file.

This patch introduces new behaviour in the transmit direction.
Previously network drivers checked if "M_FLOWID" was set in "m_flags"
before using the "m_pkthdr.flowid" field. This check has now now been
replaced by checking if "M_HASHTYPE_GET(m)" is different from
"M_HASHTYPE_NONE". In the future more hashtypes will be added, for
example hashtypes for hardware dedicated flows.

"M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is
valid and has no particular type. This change removes the need for an
"if" statement in TCP transmit code checking for the presence of a
valid flowid value. The "if" statement mentioned above is now a direct
variable assignment which is then later checked by the respective
network drivers like before.

Additional notes:
- The SCTP code changes will be committed as a separate patch.
- Removal of the "M_FLOWID" flag will also be done separately.
- The FreeBSD version has been bumped.

MFC after:	1 month
Sponsored by:	Mellanox Technologies
2014-12-01 11:45:24 +00:00
Alexander V. Chernikov
ce313fdd71 * Unify lle table dump/prefix removal code.
* Rename lla_XXX -> lltable_XXX_lle to reduce number of name prefixes
  used by lltable code.
2014-11-30 14:35:01 +00:00
Alexander V. Chernikov
5d14e4cd76 Provide rte_<get|set> methods to access rtentry for external consumers. 2014-11-29 19:27:43 +00:00
Alexander V. Chernikov
74860d4f7c Do not return unlocked/unreferenced lle in arpresolve/nd6_storelladdr -
return lle flags IFF needed.
Do not pass rte to arpresolve - pass is_gateway flag instead.
2014-11-27 23:06:25 +00:00
Andrey V. Elsukov
af6209a133 Skip L2 addresses lookups for p2p interfaces.
Discussed with:	melifaro
Sponsored by:	Yandex LLC
2014-11-24 21:51:43 +00:00
Alexander V. Chernikov
73d770287d Do more fine-grained lltable locking: use table runtime lock as rare
as we can.
2014-11-23 15:38:06 +00:00
Alexander V. Chernikov
9479029b1f * Add lltable llt_hash callback
* Move lltable items insertions/deletions to generic llt code.
2014-11-23 12:15:28 +00:00
Alexander V. Chernikov
7c066c18db Use less-invasive approach for IF_AFDATA lock: convert into 2 locks:
use rwlock accessible via external functions
    (IF_AFDATA_CFG_* -> if_afdata_cfg_*()) for all control plane tasks
  use rmlock (IF_AFDATA_RUN_*) for fast-path lookups.
2014-11-22 19:53:36 +00:00
Alexander V. Chernikov
27688dfe1d Temporarily revert r274774. 2014-11-22 17:57:54 +00:00
Alexander V. Chernikov
4194b42144 Another r274774 fix. 2014-11-21 23:37:14 +00:00
Alexander V. Chernikov
86b94cffe4 Finish r274774: add more headers/fix build for non-debug case. 2014-11-21 23:36:21 +00:00
Alexander V. Chernikov
9883e41b4b Switch IF_AFDATA lock to rmlock 2014-11-21 02:28:56 +00:00
Alexander V. Chernikov
4d56c133fb Sync to HEAD@r274766 2014-11-21 01:22:33 +00:00
Alexander V. Chernikov
f9723c7705 Simplify API: use new NHOP_LOOKUP_AIFP flag to select what ifp
we need to return.
Rename fib[64]_lookup_nh_basic to fib[64]_lookup_nh, add flags
fields for all relevant functions.
2014-11-20 22:41:59 +00:00
Alexander V. Chernikov
7f948f12f6 Finish r274175: do control plane MTU tracking.
Update route MTU in case of ifnet MTU change.
Add new RTF_FIXEDMTU to track explicitly specified MTU.

Old behavior:
ifconfig em0 mtu 1500->9000 -> all routes traversing em0 do not change MTU.
User has to manually update all routes.
ifconfig em0 mtu 9000->1500 -> all routes traversing em0 do not change MTU.
However, if ip[6]_output finds route with rt_mtu > interface mtu, rt_mtu
gets updated.

New behavior:
ifconfig em0 mtu 1500->9000 -> all interface routes in all fibs gets updated
with new MTU unless RTF_FIXEDMTU flag set on them.
ifconfig em0 mtu 9000->1500 -> all routes in all fibs gets updated with new
MTU unless RTF_FIXEDMTU flag set on them AND rt_mtu is less than ifp mtu.

route add ... -mtu XXX automatically sets RTF_FIXEDMTU flag.
route change .. -mtu 0 automatically removes RTF_FIXEDMTU flag.

PR:		194238
MFC after:	1 month
CR:		D1125
2014-11-17 01:05:29 +00:00
Alexander V. Chernikov
df629abf3e Rework LLE code locking:
* struct llentry is now basically split into 2 pieces:
  all fields within 64 bytes (amd64) are now protected by both
  ifdata lock AND lle lock, e.g. you require both locks to be held
  exclusively for modification. All data necessary for fast path
  operations is kept here. Some fields were added:
  - r_l3addr - makes lookup key liev within first 64 bytes.
  - r_flags - flags, containing pre-compiled decision whether given
    lle contains usable data or not. Current the only flag is RLLE_VALID.
  - r_len - prepend data len, currently unused
  - r_kick - used to provide feedback to control plane (see below).
  All other fields are protected by lle lock.
* Add simple state machine for ARP to handle "about to expire" case:
  Current model (for the fast path) is the following:
  - rlock afdata
  - find / rlock rte
  - runlock afdata
  - see if "expire time" is approaching
    (time_uptime + la->la_preempt > la->la_expire)
  - if true, call arprequest() and decrease la_preempt
  - store MAC and runlock rte
  New model (data plane):
  - rlock afdata
  - find rte
  - check if it can be used using r_* fields only
  - if true, store MAC
  - if r_kick field != 0 set it to 0.
  - runlock afdata
  New mode (control plane):
  - schedule arptimer to be called in (V_arpt_keep - V_arp_maxtries)
    seconds instead of V_arpt_keep.
  - on first timer invocation change state from ARP_LLINFO_REACHABLE
    to ARP_LLINFO_VERIFY, sets r_kick to 1 and shedules next call in
    V_arpt_rexmit (default to 1 sec).
  - on subsequent timer invocations in ARP_LLINFO_VERIFY state, checks
    for r_kick value: reschedule if not changed, and send arprequest()
    if set to zero (e.g. entry was used).
* Convert IPv4 path to use new single-lock approach. IPv6 bits to follow.
* Slow down in_arpinput(): now valid reply will (in most cases) require
  acquiring afdata WLOCK twice. This is requirement for storing changed
  lle data. This change will be slightly optimized in future.
* Provide explicit hash link/unlink functions for both ipv4/ipv6 code.
  This will probably be moved to generic lle code once we have per-AF
  hashing callback inside lltable.
* Perform lle unlink on deletion immediately instead of delaying it to
  the timer routine.
* Make r244183 more explicit: use new LLE_CALLOUTREF flag to indicate the
  presence of lle reference used for safe callout calls.
2014-11-16 20:12:49 +00:00
Alexander V. Chernikov
b4b1367ae4 * Move lle creation/deletion from lla_lookup to separate functions:
lla_lookup(LLE_CREATE) -> lla_create
  lla_lookup(LLE_DELETE) -> lla_delete
  Assume lla_create to return LLE_EXCLUSIVE lock for lle.
* Rework lla_rt_output to perform all lle changes under afdata WLOCK.
* change arp_ifscrub() ackquire afdata WLOCK, the same as arp_ifinit().
2014-11-15 18:54:07 +00:00
Andrey V. Elsukov
794a349c6f We don't return sp pointer, thus NULL assignment isn't needed.
And reference to sp will be freed at the end.

MFC after:	1 week
Sponsored by:	Yandex LLC
2014-11-12 22:58:52 +00:00
Alexander V. Chernikov
670e8b3b8c Kill custom in_matroute() radix mathing function removing one rte mutex lock.
Initially in_matrote() in_clsroute() in their current state was introduced by
r4105 20 years ago. Instead of deleting inactive routes immediately, we kept them
in route table, setting RTPRF_OURS flag and some expire time. After that, either
GC came or RTPRF_OURS got removed on first-packet. It was a good solution
in that days (and probably another decade after that) to keep TCP metrics.
However, after moving metrics to TCP hostcache in r122922, most of in_rmx
functionality became unused. It might had been used for flushing icmp-originated
routes before rte mutexes/refcounting, but I'm not sure about that.

So it looks like this is nearly impossible to make GC do its work nowadays:

in_rtkill() ignores non-RTPRF_OURS routes.
route can only become RTPRF_OURS after dropping last reference via rtfree()
which calls in_clsroute(), which, it turn, ignores UP and non-RTF_DYNAMIC routes.

Dynamic routes can still be installed via received redirect, but they
have default lifetime (no specific rt_expire) and no one has another trie walker
to call RTFREE() on them.

So, the changelist:
* remove custom rnh_match / rnh_close matching function.
* remove all GC functions
* partially revert r256695 (proto3 is no more used inside kernel,
  it is not possible to use rt_expire from user point of view, proto3 support
  is not complete)
* Finish r241884 (similar to this commit) and remove remaining IPv6 parts

MFC after:	1 month
2014-11-11 02:52:40 +00:00
Andrey V. Elsukov
002c24396d Add sa6_checkzone_ifp() function. It checks correctness of struct
sockaddr_in6, usually obtained from the user level through ioctl.
It initializes sin6_scope_id using given interface.

Sponsored by:	Yandex LLC
2014-11-10 16:12:51 +00:00
Alexander V. Chernikov
e0c0711e01 * Make nd6_dad_duplicated() constant.
* Simplify refcounting by using nd6_dad_add() / nd6_dad_del().

Reviewed by:	ae
MFC after:	2 weeks
Sponsored by:	Yandex LLC
2014-11-10 16:01:39 +00:00
Andrey V. Elsukov
06fec20791 Remove link-local multicast routes remnants from in6_purgeaddr.
Also merge in6_purgeaddr_mc with in6_purgeaddr.

Sponsored by:	Yandex LLC
2014-11-10 16:01:31 +00:00
Gleb Smirnoff
e6abaf91f4 Consistently use if_link.
Reviewed by:	ae, melifaro
2014-11-10 15:56:30 +00:00
Andrey V. Elsukov
45d1880a36 For now handle only multicast addresses, we still use routes to
LLA unicasts yet.

Sponsored by:	Yandex LLC
2014-11-10 10:59:08 +00:00
Alexander V. Chernikov
f7bab8d0dd Switch route radix to dual-lock model:
use rmlock for data patch access, and config rwlock
for conrol plane processing. Route table changes require
bock locks held.
2014-11-10 00:07:06 +00:00
Andrey V. Elsukov
ea455de91d Use embedded scope zone id to determine outgoing interface for link-local
and node-local addresses.
2014-11-09 22:54:40 +00:00
Alexander V. Chernikov
36f34ac70b Fix nd6_output_flush() prototype.
Remove 'net/route_internal.h' header from stf.
2014-11-09 22:16:50 +00:00
Alexander V. Chernikov
603eaf792b Renove faith(4) and faithd(8) from base. It looks like industry
have chosen different (and more traditional) stateless/statuful
NAT64 as translation mechanism. Last non-trivial commits to both
faith(4) and faithd(8) happened more than 12 years ago, so I assume
it is time to drop RFC3142 in FreeBSD.

No objections from:	net@
2014-11-09 21:33:01 +00:00
Alexander V. Chernikov
d0f9fca40d Remove forgotten arguments. 2014-11-09 16:57:31 +00:00
Alexander V. Chernikov
033074c440 Replace 'struct route *' if_output() argument with 'struct nhop_info *'.
Leave 'struct route' as is for legacy routing api users.
Remove most of rtalloc_ign*-derived functions.
2014-11-09 16:33:04 +00:00
Alexander V. Chernikov
9c9bde01d1 Remove unused 'struct route *' argument from nd6_output_flush(). 2014-11-09 16:20:27 +00:00
Alexander V. Chernikov
55e5eda676 Separate radix and routing: use different structures for route and
for other customers.

Introduce new 'struct rib_head' for routing purposes and make
all routing api use it.
2014-11-09 00:36:39 +00:00
Andrey V. Elsukov
3e88eb903b Remove ip6_getdstifaddr() and all functions to work with auxiliary data.
It isn't safe to keep unreferenced ifaddrs. Use in6ifa_ifwithaddr() to
determine ifaddr corresponding to destination address. Since currently
we keep addresses with embedded scope zone, in6ifa_ifwithaddr is called
with zero zoneid and marked with XXX.

Also remove route and lle lookups from ip6_input. Use in6ifa_ifwithaddr()
instead.

Sponsored by:	Yandex LLC
2014-11-08 19:38:34 +00:00
Alexander V. Chernikov
a9413f6ca0 Sync to HEAD@r274297. 2014-11-08 18:13:35 +00:00
Alexander V. Chernikov
1398ffe5bc Convert most of "for (fibnum = 0; fibnum < rt_numfibs; fibnum++)" users
to use new rt_foreach_fib() instead of hand-rolling cycles.
2014-11-08 16:38:15 +00:00
Alexander V. Chernikov
3939f50c88 Finish r274290#2: remove unused IPv6 code. 2014-11-08 16:31:11 +00:00
Alexander V. Chernikov
22b08fd8b7 Split radix implementation and system route table structure:
use new 'struct radix_head' for radix.
2014-11-07 22:52:02 +00:00
Andrey V. Elsukov
f325335caf Overhaul if_gre(4).
Split it into two modules: if_gre(4) for GRE encapsulation and
if_me(4) for minimal encapsulation within IP.

gre(4) changes:
* convert to if_transmit;
* rework locking: protect access to softc with rmlock,
  protect from concurrent ioctls with sx lock;
* correct interface accounting for outgoing datagramms (count only payload size);
* implement generic support for using IPv6 as delivery header;
* make implementation conform to the RFC 2784 and partially to RFC 2890;
* add support for GRE checksums - calculate for outgoing datagramms and check
  for inconming datagramms;
* add support for sending sequence number in GRE header;
* remove support of cached routes. This fixes problem, when gre(4) doesn't
  work at system startup. But this also removes support for having tunnels with
  the same addresses for inner and outer header.
* deprecate support for various GREXXX ioctls, that doesn't used in FreeBSD.
  Use our standard ioctls for tunnels.

me(4):
* implementation conform to RFC 2004;
* use if_transmit;
* use the same locking model as gre(4);

PR:		164475
Differential Revision:	D1023
No objections from:	net@
Relnotes:	yes
Sponsored by:	Yandex LLC
2014-11-07 19:13:19 +00:00
Gleb Smirnoff
6df8a71067 Remove SYSCTL_VNET_* macros, and simply put CTLFLAG_VNET where needed.
Sponsored by:	Nginx, Inc.
2014-11-07 09:39:05 +00:00
Gleb Smirnoff
428cf06b31 Remove VNET_SYSCTL_ARG(). The generic sysctl(9) code handles that.
Reviewed by:	ae
Sponsored by:	Nginx, Inc.
2014-11-07 08:58:05 +00:00
Alexander V. Chernikov
064b1bdb2d Convert lle rtchecks to use new routing API.
For inet/ case, this involves reverting r225947
which seem to be pretty strange commit and should
be reverted in HEAD ad well.
2014-11-06 23:35:22 +00:00
Alexander V. Chernikov
146a181f28 Finish r274118: remove useless fields from struct domain.
Sponsored by:	Yandex LLC
2014-11-06 14:39:04 +00:00
Alexander V. Chernikov
1a75e3b20f Make checks for rt_mtu generic:
Some virtual if drivers has (ab)used ifa ifa_rtrequest hook to enforce
route MTU to be not bigger that interface MTU. While ifa_rtrequest hooking
might be an option in some situation, it is not feasible to do MTU checks
there: generic (or per-domain) routing code is perfectly capable of doing
this.

We currrently have 3 places where MTU is altered:

1) route addition.
 In this case domain overrides radix _addroute callback (in[6]_addroute)
 and all necessary checks/fixes are/can be done there.

2) route change (especially, GW change).
 In this case, there are no explicit per-domain calls, but one can
 override rte by setting ifa_rtrequest hook to domain handler
 (inet6 does this).

3) ifconfig ifaceX mtu YYYY
 In this case, we have no callbacks, but ip[6]_output performes runtime
 checks and decreases rt_mtu if necessary.

Generally, the goals are to be able to handle all MTU changes in
 control plane, not in runtime part, and properly deal with increased
 interface MTU.

This commit changes the following:
* removes hooks setting MTU from drivers side
* adds proper per-doman MTU checks for case 1)
* adds generic MTU check for case 2)

* The latter is done by using new dom_ifmtu callback since
 if_mtu denotes L3 interface MTU, e.g. maximum trasmitted _packet_ size.
 However, IPv6 mtu might be different from if_mtu one (e.g. default 1280)
 for some cases, so we need an abstract way to know maximum MTU size
 for given interface and domain.
* moves rt_setmetrics() before MTU/ifa_rtrequest hooks since it copies
  user-supplied data which must be checked.
* removes RT_LOCK_ASSERT() from other ifa_rtrequest hooks to be able to
  use this functions on new non-inserted rte.

More changes will follow soon.

MFC after:	1 month
Sponsored by:	Yandex LLC
2014-11-06 13:13:09 +00:00
Alexander V. Chernikov
9f25cbe45e Remove old hack abusing domattach from NFS code.
According to IANA RPC uaddr registry, there are no AFs
except IPv4 and IPv6, so it's not worth being too abstract here.

Remove ne_rtable[AF_MAX+1] and use explicit per-AF radix tries.
Use own initialization without relying on domattach code.

While I admit that this was one of the rare places in kernel
networking code which really was capable of doing multi-AF
without any AF-depended code, it is not possible anymore to
rely on dom* code.

While here, change terrifying "Invalid radix node head, rn:" message,
to different non-understandable "netcred already exists for given addr/mask",
but less terrifying. Since we know that rn_addaddr() returns NULL if
the same record already exists, we should provide more friendly error.

MFC after:	1 month
2014-11-05 00:58:01 +00:00
Alexander V. Chernikov
69b74805d5 Convert gif and stf to use new routing api. 2014-11-04 18:48:13 +00:00
Alexander V. Chernikov
5c9ef37854 Sync to HEAD@r274095. 2014-11-04 18:22:33 +00:00
Alexander V. Chernikov
8c3cfe0be0 Hide 'struct rtentry' and all its macro inside new header:
net/route_internal.h
The goal is to make its opaque for all code except route/rtsock and
proto domain _rmx.
2014-11-04 17:28:13 +00:00
Alexander V. Chernikov
a9ac00b76b Convert in6p_lookup_mcast_ifp() to use new routing api.
* Add special fib6_lookup_nh_ifp() to return rt_ifp
  instead of rt_ifa->ifa_ifp for that.
2014-11-04 17:05:24 +00:00
Alexander V. Chernikov
257480b8ab Convert netinet6/ to use new routing API.
* Remove &ifpp from ip6_output() in favor of ri->ri_nh_info
* Provide different wrappers to in6_selectsrc:
  Currently it is used by 2 differenct type of customers:
  - socket-based one, which all are unsure about provided
   address scope and
  - in-kernel ones (ND code mostly), which don't have
    any sockets, options, crededentials, etc.
  So, we provide two different wrappers to in6_selectsrc()
  returning select source.
* Make different versions of selectroute():
  Currenly selectroute() is used in two scenarios:
  - SAS, via in6_selecsrc() -> in6_selectif() -> selectroute()
  - output, via in6_output -> wrapper -> selectroute()
  Provide different versions for each customer:
  - fib6_lookup_nh_basic()-based in6_selectif() which is
    capable of returning interface only, without MTU/NHOP/L2
    calculations
  - full-blown fib6_selectroute() with cached route/multipath/
    MTU/L2
* Stop using routing table for link-local address lookups
* Add in6_ifawithifp_lla() to make for-us check faster for link-local
* Add in6_splitscope / in6_setllascope for faster embed/deembed scopes
2014-11-04 15:39:56 +00:00
Hiroki Sato
da1304cb42 Fix a bug which prevented ND6_IFF_IFDISABLED flag from clearing when
the newly-added IPv6 address was /128.

PR:	188032
2014-11-02 21:58:31 +00:00
Andrey V. Elsukov
94a43496c2 Remove redundant code.
if_detach already did these steps. Also, now we didn't keep routes to link-local
addresses.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-10-30 12:44:46 +00:00
Andrey V. Elsukov
3c268b3afc Move ifq drain into in6m_purge().
Suggested by:	bms
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-10-30 11:34:07 +00:00
Andrey V. Elsukov
8ff1eae10d Fix mbuf leak in IPv6 multicast code.
When multicast capable interface goes away, it leaves multicast groups,
this leads to generate MLD reports, but MLD code does deffered send and
MLD reports are queued in the in6_multi's in6m_scq ifq. The problem is
that in6_multi structures are freed when interface leaves multicast groups
and thread that does deffered send will not take these queued packets.

PR:		194577
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-10-30 10:59:57 +00:00
Andrey V. Elsukov
c56173a626 Do not automatically install routes to link-local and interface-local multicast
addresses.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-10-27 16:15:15 +00:00
Andrey V. Elsukov
8e4bdfa2db Remove unused function.
Sponsored by:	Yandex LLC
2014-10-27 10:34:09 +00:00
Alexander V. Chernikov
30514718e7 Convert several places inside netinet6/ to new api. 2014-10-25 22:53:08 +00:00
Andrey V. Elsukov
a663aa4ce8 Remove redundant check and m_pullup() call. 2014-10-24 13:34:22 +00:00
Andrey V. Elsukov
0b9f5f8a5f Overhaul if_gif(4):
o convert to if_transmit;
 o use rmlock to protect access to gif_softc;
 o use sx lock to protect from concurrent ioctls;
 o remove a lot of unneeded and duplicated code;
 o remove cached route support (it won't work with concurrent io);
 o style fixes.

Reviewed by:	melifaro
Obtained from:	Yandex LLC
MFC after:	1 month
Sponsored by:	Yandex LLC
2014-10-14 13:31:47 +00:00
Robert Watson
f0cace5d94 When deciding whether to call m_pullup() even though there is adequate
data in an mbuf, use M_WRITABLE() instead of a direct test of M_EXT;
the latter both unnecessarily exposes mbuf-allocator internals in the
protocol stack and is also insufficient to catch all cases of
non-writability.

(NB: m_pullup() does not actually guarantee that a writable mbuf is
returned, so further refinement of all of these code paths continues to
be required.)

Reviewed by:	bz
MFC after:	3 days
Sponsored by:	EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D900
2014-10-12 15:49:52 +00:00
Bryan Venteicher
81d3ec1763 Add context pointer and source address to the UDP tunnel callback
These are needed for the forthcoming vxlan implementation. The context
pointer means we do not have to use a spare pointer field in the inpcb,
and the source address is required to populate vxlan's forwarding table.

While I highly doubt there is an out of tree consumer of the UDP
tunneling callback, this change may be a difficult to eventually MFC.

Phabricator:	https://reviews.freebsd.org/D383
Reviewed by:	gnn
2014-10-10 06:08:59 +00:00
Bryan Venteicher
a0a9e1b57c Add missing UDP multicast receive dtrace probes
Phabricator:	https://reviews.freebsd.org/D924
Reviewed by:	rpaulo markj
MFC after:	1 month
2014-10-09 22:36:21 +00:00
Bryan Venteicher
514929b193 Move the calls to u_tun_func() into udp6_append()
A similar cleanup for UDPv4 was performed in r220620.

Phabricator:	https://reviews.freebsd.org/D383
Reviewed by:	gnn
MFC after:	1 month
2014-10-09 05:42:07 +00:00
Michael Tuexen
5558cc334d Fix a bug introduced in
https://svnweb.freebsd.org/base?view=revision&revision=272347

MFC after: 3 days
2014-10-07 16:01:17 +00:00
Michael Tuexen
4e1730b532 UPD and UDPLite require a checksum. So check for it.
MFC after: 3 days
2014-10-03 08:46:49 +00:00
Michael Tuexen
5055cfcb4d Check for UDP/IPv6 packets that the length in the UDP header is at least
the minimum. Make the check similar to the one for UDPLite/IPv6.

MFC after: 3 days
2014-10-02 10:49:01 +00:00
Michael Tuexen
76b96fbc9e Fix the checksum computation for UDPLite/IPv6. This requires the
usage of a function computing the checksum only over a part of the function.
Therefore introduce in6_cksum_partial() and implement in6_cksum() based
on that.
While there, ensure that the UDPLite packet contains at least enough bytes
to contain the header.

Reviewed by: kevlo
MFC after: 3 days
2014-10-02 10:32:24 +00:00
Hiroki Sato
9c57a5b630 Add an additional routing table lookup when m->m_pkthdr.fibnum is changed
at a PFIL hook in ip{,6}_output().  IPFW setfib rule did not perform
a routing table lookup when the destination address was not changed.

CR:	D805
2014-10-02 00:25:57 +00:00
Alexander V. Chernikov
31f0d081d8 Remove lock init from radix.c.
Radix has never managed its locking itself.
The only consumer using radix with embeded rwlock
is system routing table. Move per-AF lock inits there.
2014-10-01 14:39:06 +00:00
Michael Tuexen
83e95fb30b The default for UDPLITE_RECV_CSCOV is zero. RFC 3828 recommend
that this means full checksum coverage for received packets.
If an application is willing to accept packets with partial
coverage, it is expected to use the socekt option and provice
the minimum coverage it accepts.

Reviewed by: kevlo
MFC after: 3 days
2014-10-01 05:43:29 +00:00
Michael Tuexen
0f4a03663b If the checksum coverage field in the UDPLITE header is the length
of the complete UDPLITE packet, the packet has full checksum coverage.
SO fix the condition.

Reviewed by: kevlo
MFC after: 3 days
2014-09-30 18:17:28 +00:00
Andrey V. Elsukov
d1729484d4 Remove redundant call to ipsec_getpolicybyaddr().
ipsec_hdrsiz() will call it internally.

Sponsored by:	Yandex LLC
2014-09-30 13:15:19 +00:00
Kevin Lo
0bc40ebf00 When plen != ulen, it should only be checked when this is UDP.
Spotted by:	bryanv
2014-09-30 07:28:31 +00:00
Alan Somers
4f8585e021 Revisions 264905 and 266860 added a "int fib" argument to ifa_ifwithnet and
ifa_ifwithdstaddr. For the sake of backwards compatibility, the new
arguments were added to new functions named ifa_ifwithnet_fib and
ifa_ifwithdstaddr_fib, while the old functions became wrappers around the
new ones that passed RT_ALL_FIBS for the fib argument. However, the
backwards compatibility is not desired for FreeBSD 11, because there are
numerous other incompatible changes to the ifnet(9) API. We therefore
decided to remove it from head but leave it in place for stable/9 and
stable/10. In addition, this commit adds the fib argument to
ifa_ifwithbroadaddr for consistency's sake.

sys/sys/param.h
	Increment __FreeBSD_version

sys/net/if.c
sys/net/if_var.h
sys/net/route.c
	Add fibnum argument to ifa_ifwithbroadaddr, and remove the _fib
	versions of ifa_ifwithdstaddr, ifa_ifwithnet, and ifa_ifwithroute.

sys/net/route.c
sys/net/rtsock.c
sys/netinet/in_pcb.c
sys/netinet/ip_options.c
sys/netinet/ip_output.c
sys/netinet6/nd6.c
	Fixup calls of modified functions.

share/man/man9/ifnet.9
	Document changed API.

CR:		https://reviews.freebsd.org/D458
MFC after:	Never
Sponsored by:	Spectra Logic
2014-09-11 20:21:03 +00:00
Andrey V. Elsukov
343e440f63 Add const qualifier to in6_addrhash() function.
Add in6ifa_ifwithaddr() function. It is similar to ifa_ifwithaddr,
but does fast lookup in the hash of inet6 addresses.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-09-11 13:18:41 +00:00
Andrey V. Elsukov
80803aa289 * use M_ZERO flag with malloc instead of explicit zeroing.
* remove MULTI_SCOPE ifdef.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-09-11 12:54:17 +00:00
Andrey V. Elsukov
41874e85d6 Introduce new scope related functions.
* new macro to remove magic number - IPV6_ADDR_SCOPES_COUNT;
* sa6_checkzone() - this function checks sockaddr_in6 structure
  for correctness of sin6_scope_id. It also can fill correct
  value sometimes.
* in6_getscopezone() - this function returns scope zone id for
  specified interface and scope.
* in6_getlinkifnet() - this function returns struct ifnet for
  corresponding zone id of link-local scope.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-09-11 12:33:37 +00:00
Andrey V. Elsukov
573791d01c * constify argument of in6_addrscope();
* use IN6_IS_ADDR_XXX() macro instead of hardcoded values;
* for multicast addresses just return scope value, the only exception
  is addresses with 0x0F scope value (RFC 4291 p2.7.0);

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-09-11 10:27:59 +00:00
Andrey V. Elsukov
9196891fc9 Add additional checks for IPV6_PKTINFO handling (RFC 3542):
* Return ENETDOWN when interface specified by ipi6_ifindex is not
  enabled for IPv6 use.
* Return EADDRNOTAVAIL when ipi6_ifindex specifies an interface, but the
  address ipi6_addr is not available for use on that interface.
* Return EINVAL when ipi6_addr is multicast address.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-09-10 14:32:07 +00:00
Andrey V. Elsukov
a7e201bbac Make in6_pcblookup_hash_locked and in6_pcbladdr static.
Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-09-10 13:17:35 +00:00
Andrey V. Elsukov
1b44e5ffe3 Introduce INP6_PCBHASHKEY macro. Replace usage of hardcoded part of
IPv6 address as hash key in all places.

Obtained from:	Yandex LLC
2014-09-10 12:35:42 +00:00
Andrey V. Elsukov
5dbfa43f65 Add the ability to set `prefer_source' flag to an IPv6 address.
It affects the IPv6 source address selection algorithm (RFC 6724)
and allows override the last rule ("longest matching prefix") for
choosing among equivalent addresses. The address with `prefer_source'
will be preferred source address.

Obtained from:	Yandex LLC
MFC after:	1 month
Sponsored by:	Yandex LLC
2014-09-09 10:52:50 +00:00
Adrian Chadd
a4d98bf442 Add basic RSS awareness for the UDPv6 send path.
This doesn't include the same kind of userland overriding that the IPv4
path has; nor does it yet know about 2-tuple versus 4-tuple hashing.
That'll come later.

Differential Revision:	https://reviews.freebsd.org/D527
Reviewed by:	grehan
2014-09-09 04:20:53 +00:00
Adrian Chadd
b174de323a Add IP_NODEFAULTFLOWID awareness to ip6_output().
Differential Revision:	https://reviews.freebsd.org/D527
2014-09-09 00:21:21 +00:00
Michael Tuexen
24aaac8d59 Use union sctp_sockstore instead of struct sockaddr_storage. This
eliminiates some warnings when building in userland.
Thanks to Patrick Laimbock for reporting this issue.
Remove also some unnecessary casts.
There should be no functional change.

MFC after: 1 week
2014-09-07 09:06:26 +00:00
Andrey V. Elsukov
ccc53de916 Add the reverse part to rule #9. Also change its description in the
netstat(8) output.

MFC after:	1 week
2014-09-01 09:30:34 +00:00
Mark Johnston
5fc2632281 Add some missing checks for unsupported interfaces (e.g. pflog(4)) when
handling ioctls. While here, remove duplicated checks for a NULL ifp in
in6_control(): this check is already done near the beginning of the
function.

PR:		189117
Reviewed by:	hrs
MFC after:	2 weeks
2014-08-22 19:21:08 +00:00
Kevin Lo
73d76e77b6 Change pr_output's prototype to avoid the need for explicit casts.
This is a follow up to r269699.

Phabric:	D564
Reviewed by:	jhb
2014-08-15 02:43:02 +00:00
Kevin Lo
8f5a8818f5 Merge 'struct ip6protosw' and 'struct protosw' into one. Now we have
only one protocol switch structure that is shared between ipv4 and ipv6.

Phabric:	D476
Reviewed by:	jhb
2014-08-08 01:57:15 +00:00
Andrey V. Elsukov
d6e6b9943b Add new rule to source address selection algorithm. It prefers address
with better virtual status. Use ifa_preferred() to choose better address.

PR:		187341
Tested by:	des
MFC after:	1 week
2014-07-30 15:08:12 +00:00
Gleb Smirnoff
9753faf553 Garbage collect couple of unused fields from struct ifaddr:
- ifa_claim_addr() unused since removal of NetAtalk
- ifa_metric seems to be never utilized, always a copy of if_metric
2014-07-29 15:01:29 +00:00
Hiroki Sato
9be09a6e43 Fix EtherIP. TOS field must be initialized when the inner protocol is
PF_LINK, and multicast/broadcast flag should always be dropped because
the outer protocol uses unicast even when the inner address is not for
unicast.  It had been broken since r236951 when gif_output() started to
use IFQ_HANDOFF().
2014-07-24 10:42:47 +00:00
Adrian Chadd
0ae3f42231 When it's time to do 4-tuple UDP IPv6 hashing, make sure this is a known
type.
2014-07-20 07:39:54 +00:00
Adrian Chadd
c7c0d94874 Add IPv6 flowid, bindmulti and RSS awareness. 2014-07-12 05:46:33 +00:00
Adrian Chadd
a8a2d8003a Add INP_RSS_BUCKET_SET awareness for IPv6 pcbgroup entries.
This ensures that a listen socket with INP_RSS_BUCKET_SET set will use
the pre-determined PCBGROUP rather than what the hashing path chooses.
2014-07-12 05:45:53 +00:00
Adrian Chadd
6e4405cee1 Add the IPv6 versions of the multi-bind, hash/hash type and RSS options. 2014-07-12 05:44:16 +00:00
Andrey V. Elsukov
ff899182ec Fix condition.
Sponsored by:	Yandex LLC
2014-07-11 06:34:15 +00:00
Bryan Venteicher
6700a7d44b Use the appropriate IPv6 hashtype defines when looking up the PCBGROUP
Reviewed by:	adrian@
2014-07-07 00:02:49 +00:00
Hans Petter Selasky
af3b2549c4 Pull in r267961 and r267973 again. Fix for issues reported will follow. 2014-06-28 03:56:17 +00:00
Glen Barber
37a107a407 Revert r267961, r267973:
These changes prevent sysctl(8) from returning proper output,
such as:

 1) no output from sysctl(8)
 2) erroneously returning ENOMEM with tools like truss(1)
    or uname(1)
 truss: can not get etype: Cannot allocate memory
2014-06-27 22:05:21 +00:00
Hans Petter Selasky
3da1cf1e88 Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
Hajimu UMEMOTO
f4839cbc0a Make nd6_gctimer tunable.
MFC after:	1 week
2014-06-23 16:27:29 +00:00
Kevin Lo
ea93c6a613 Catch up with r186809, correct comments. 2014-06-23 05:17:39 +00:00
Andrey V. Elsukov
45b4fb0449 Remove unused variable.
Sponsored by:	Yandex LLC
2014-06-08 09:08:51 +00:00
Alan Somers
2f308a343f Fix unintended KBI change from r264905. Add _fib versions of
ifa_ifwithnet() and ifa_ifwithdstaddr()  The legacy functions will call the
_fib() versions with RT_ALL_FIBS, preserving legacy behavior.

sys/net/if_var.h
sys/net/if.c
	Add legacy-compatible functions as described above.  Ensure legacy
	behavior when RT_ALL_FIBS is passed as fibnum.

sys/netinet/in_pcb.c
sys/netinet/ip_output.c
sys/netinet/ip_options.c
sys/net/route.c
sys/net/rtsock.c
sys/netinet6/nd6.c
	Call with _fib() functions if we must use a specific fib, or the
	legacy functions otherwise.

tests/sys/netinet/fibs_test.sh
tests/sys/netinet/udp_dontroute.c
	Improve the udp_dontroute test.  The bug that this test exercises is
	that ifa_ifwithnet() will return the wrong address, if multiple
	interfaces have addresses on the same subnet but with different
	fibs.  The previous version of the test only considered one possible
	failure mode: that ifa_ifwithnet_fib() might fail to find any
	suitable address at all.  The new version also checks whether
	ifa_ifwithnet_fib() finds the correct address by checking where the
	ARP request goes.

Reported by:	bz, hrs
Reviewed by:	hrs
MFC after:	1 week
X-MFC-with:	264905
Sponsored by:	Spectra Logic
2014-05-29 21:03:49 +00:00
Hiroki Sato
82a9fa4a1d Add rwlock to struct dadq. A panic could occur when a large number of
addresses performed DAD at the same time.
2014-05-29 20:53:53 +00:00
VANHULLEBUS Yvan
aaf2cfc0d6 Fixed IPv4-in-IPv6 and IPv6-in-IPv4 IPsec tunnels.
For IPv6-in-IPv4, you may need to do the following command
on the tunnel interface if it is configured as IPv4 only:
ifconfig <interface> inet6 -ifdisabled

Code logic inspired from NetBSD.

PR: kern/169438
Submitted by: emeric.poupon@netasq.com
Reviewed by: fabient, ae
Obtained from: NETASQ
2014-05-28 12:45:27 +00:00
Hiroki Sato
705bef548a Cancel DAD for an ifa when the ifp has ND6_IFF_IFDISABLED as early as
possible and do not clear IN6_IFF_TENTATIVE.  If IFDISABLED was accidentally
set after a DAD started, TENTATIVE could be cleared because no NA was
received due to IFDISABLED, and as a result it could prevent DAD when
manually clearing IFDISABLED after that.
2014-05-16 15:53:31 +00:00
Alexander V. Chernikov
b980262e63 Pass radix head ptr along with rte to rtexpunge().
Rename rtexpunge to rt_expunge().
2014-05-03 16:28:54 +00:00
Alexander V. Chernikov
cf58751a44 Use "hash" value in rtalloc_mpath_fib() instead of RTF_ANNOUNCE flag.
Hashing method is the same as in in6_src.c. (Probably we need better one).

MFC after:	2 weeks
2014-04-26 16:46:33 +00:00
Alexander V. Chernikov
36d55f0f9d Unify sa_equal() macro usage.
MFC after:	2 weeks
2014-04-26 14:52:03 +00:00
Alan Somers
0cfee0c223 Fix subnet and default routes on different FIBs on the same subnet.
These two bugs are closely related.  The root cause is that ifa_ifwithnet
does not consider FIBs when searching for an interface address.

sys/net/if_var.h
sys/net/if.c
	Add a fib argument to ifa_ifwithnet and ifa_ifwithdstadddr.  Those
	functions will only return an address whose interface fib equals the
	argument.

sys/net/route.c
	Update calls to ifa_ifwithnet and ifa_ifwithdstaddr with fib
	arguments.

sys/netinet/in.c
	Update in_addprefix to consider the interface fib when adding
	prefixes.  This will prevent it from not adding a subnet route when
	one already exists on a different fib.

sys/net/rtsock.c
sys/netinet/in_pcb.c
sys/netinet/ip_output.c
sys/netinet/ip_options.c
sys/netinet6/nd6.c
	Add RT_DEFAULT_FIB arguments to ifa_ifwithdstaddr and ifa_ifwithnet.
	In some cases it there wasn't a clear specific fib number to use.
	In others, I was unable to test those functions so I chose
	RT_DEFAULT_FIB to minimize divergence from current behavior.  I will
	fix some of the latter changes along with PR kern/187553.

tests/sys/netinet/fibs_test.sh
tests/sys/netinet/udp_dontroute.c
tests/sys/netinet/Makefile
	Revert r263738.  The udp_dontroute test was right all along.
	However, bugs kern/187550 and kern/187553 cancelled each other out
	when it came to this test.  Because of kern/187553, ifa_ifwithnet
	searched the default fib instead of the requested one, but because
	of kern/187550, there was an applicable subnet route on the default
	fib.  The new test added in r263738 doesn't work right, however.  I
	can verify with dtrace that ifa_ifwithnet returned the wrong address
	before I applied this commit, but route(8) miraculously found the
	correct interface to use anyway.  I don't know how.

	Clear expected failure messages for kern/187550 and kern/187552.

PR:		kern/187550
PR:		kern/187552
Reviewed by:	melifaro
MFC after:	3 weeks
Sponsored by:	Spectra Logic
2014-04-24 23:56:56 +00:00
Andrey V. Elsukov
52c57247d3 Remove unused variable.
PR:		173521
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-04-17 06:40:11 +00:00
Andrey V. Elsukov
4fd913364f Properly release the in6_multi lock.
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-04-12 02:05:31 +00:00
Kevin Lo
d1b18731d9 Minor style cleanups. 2014-04-07 01:55:53 +00:00
Kevin Lo
e06e816f67 Add support for UDP-Lite protocol (RFC 3828) to IPv4 and IPv6 stacks.
Tested with vlc and a test suite [1].

[1] http://www.erg.abdn.ac.uk/~gerrit/udp-lite/files/udplite_linux.tar.gz

Reviewed by:	jhb, glebius, adrian
2014-04-07 01:53:03 +00:00
Andrey V. Elsukov
cd71804c84 Remove unused label.
MFC after:	1 week
2014-03-31 14:40:35 +00:00
Andrey V. Elsukov
27aa751c90 Don't generate an ICMPv6 error message if packet was consumed by filter.
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-03-31 14:27:22 +00:00
Robert Watson
7527624efa Several years after initial development, merge prototype support for
linking NIC Receive Side Scaling (RSS) to the network stack's
connection-group implementation.  This prototype (and derived patches)
are in use at Juniper and several other FreeBSD-using companies, so
despite some reservations about its maturity, merge the patch to the
base tree so that it can be iteratively refined in collaboration rather
than maintained as a set of gradually diverging patch sets.

(1) Merge a software implementation of the Toeplitz hash specified in
    RSS implemented by David Malone.  This is used to allow suitable
    pcbgroup placement of connections before the first packet is
    received from the NIC.  Software hashing is generally avoided,
    however, due to high cost of the hash on general-purpose CPUs.

(2) In in_rss.c, maintain authoritative versions of RSS state intended
    to be pushed to each NIC, including keying material, hash
    algorithm/ configuration, and buckets.  Provide software-facing
    interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both
    the RSS standardised Toeplitz and a 'naive' variation with a hash
    efficient in software but with poor distribution properties.
    Implement rss_m2cpuid()to be used by netisr and other load
    balancing code to look up the CPU on which an mbuf should be
    processed.

(3) In the Ethernet link layer, allow netisr distribution using RSS as
    a source of policy as an alternative to source ordering; continue
    to default to direct dispatch (i.e., don't try and requeue packets
    for processing on the 'right' CPU if they arrive in a directly
    dispatchable context).

(4) Allow RSS to control tuning of connection groups in order to align
    groups with RSS buckets.  If a packet arrives on a protocol using
    connection groups, and contains a suitable hardware-generated
    hash, use that hash value to select the connection group for pcb
    lookup for both IPv4 and IPv6.  If no hardware-generated Toeplitz
    hash is available, we fall back on regular PCB lookup risking
    contention rather than pay the cost of Toeplitz in software --
    this is a less scalable but, at my last measurement, faster
    approach.  As core counts go up, we may want to revise this
    strategy despite CPU overhead.

Where device drivers suitably configure NICs, and connection groups /
RSS are enabled, this should avoid both lock and line contention during
connection lookup for TCP.  This commit does not modify any device
drivers to tune device RSS configuration to the global RSS
configuration; patches are in circulation to do this for at least
Chelsio T3 and Intel 1G/10G drivers.  Currently, the KPI for device
drivers is not particularly robust, nor aware of more advanced features
such as runtime reconfiguration/rebalancing.  This will hopefully prove
a useful starting point for refinement.

No MFC is scheduled as we will first want to nail down a more mature
and maintainable KPI/KBI for device drivers.

Sponsored by:   Juniper Networks (original work)
Sponsored by:   EMC/Isilon (patch update and merge)
2014-03-15 00:57:50 +00:00
Gleb Smirnoff
aa69c61235 Since both netinet/ and netinet6/ call into netipsec/ and netpfil/,
the protocol specific mbuf flags are shared between them.

- Move all M_FOO definitions into a single place: netinet/in6.h, to
  avoid future  clashes.
- Resolve clash between M_DECRYPTED and M_SKIP_FIREWALL which resulted
  in a failure of operation of IPSEC and packet filters.

Thanks to Nicolas and Georgios for all the hard work on bisecting,
testing and finally finding the root of the problem.

PR:			kern/186755
PR:			kern/185876
In collaboration with:	Georgios Amanakis <gamanakis gmail.com>
In collaboration with:	Nicolas DEFFAYET <nicolas-ml deffayet.com>
Sponsored by:		Nginx, Inc.
2014-03-12 14:29:08 +00:00
Gleb Smirnoff
e3a7aa6f56 - Remove rt_metrics_lite and simply put its members into rtentry.
- Use counter(9) for rt_pksent (former rt_rmx.rmx_pksent). This
  removes another cache trashing ++ from packet forwarding path.
- Create zini/fini methods for the rtentry UMA zone. Via initialize
  mutex and counter in them.
- Fix reporting of rmx_pksent to routing socket.
- Fix netstat(1) to report "Use" both in kvm(3) and sysctl(3) mode.

The change is mostly targeted for stable/10 merge. For head,
rt_pksent is expected to just disappear.

Discussed with:		melifaro
Sponsored by:		Netflix
Sponsored by:		Nginx, Inc.
2014-03-05 01:17:47 +00:00
John Baldwin
5b26ea5df3 Remove more constants related to static sysctl nodes. The MAXID constants
were primarily used to size the sysctl name list macros that were removed
in r254295.  A few other constants either did not have an associated
sysctl node, or the associated node used OID_AUTO instead.

PR:		ports/184525 (exp-run)
2014-02-25 18:44:33 +00:00
Craig Rodrigues
47a79fadc6 Remove KASSERT from in6p_lookup_mcast_ifp().
When the devel/jenkins port, version 1.551 was started,
the kernel would panic if INVARIANTS was enabled in the kernel config.

Suggested by: bms
2014-02-23 01:27:22 +00:00
Gleb Smirnoff
0ff96b4f55 o Remove at compile time the HASH_ALL code, that was never
tested and is unfinished. However, I've tested my version,
  it works okay. As before it is unfinished: timeout aren't
  driven by TCP session state. To enable the HASH_ALL mode,
  one needs in kernel config:

	options FLOWTABLE_HASH_ALL

o Reduce the alignment on flentry to 64 bytes. Without
  the FLOWTABLE_HASH_ALL option, twice less memory would
  be consumed by flows.
o API to ip_output()/ip6_output() got even more thin: 1 liner.
o Remove unused unions. Simply use fle->f_key[].
o Merge all IPv4 code into flowtable_lookup_ipv4(), and do same
  flowtable_lookup_ipv6(). Stop copying data to on stack
  sockaddr structures, simply use key[] on stack.
o Move code from flowtable_lookup_common() that actually works
  on insertion into flowtable_insert().

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-02-17 11:50:56 +00:00
Alexander V. Chernikov
f6990c4e3e Further simplify nd6_output_lle.
Currently we have 3 usage patterns:
1) nd6_output (most traffic flow, no lle supplied, lle RLOCK sufficient)
2) corner cases for output (no lle, STALE lle, so on). lle WLOCK needed.
3) nd* iunternal machinery (WLOCK'ed lle provided, perform packet queing).

We separate case 1 and implement it inside its only customer - nd6_output.
This leads to some code duplication (especialy SEND stuff, which should be
hooked to output in a different way), but simplifies locking and control
flow logic fir nd6_output_lle.

Reviewed by:	ae
MFC after:	3 weeks
Sponsored by:	Yandex LLC
2014-02-13 19:09:04 +00:00
Andrey V. Elsukov
e4c77ca0c0 Drop packets to multicast address whose scop field contains the
reserved value 0.

MFC after:	1 week
Sponsored by:	Yandex LLC
2014-02-13 14:10:44 +00:00
Christian Brueffer
d37872314f Only count table lookups when we're actually processing packets.
PR:		183462
Submitted by:	Sven-Thorsten Dietrich <thebigcorporation at gmail.com>
Reviewed by:	bms
MFC after:	1 month
2014-02-10 14:47:51 +00:00
Christian Brueffer
1b55364ed9 For IPv6, return the same error code as IPv4 when mrouter is not initialized.
PR:		178472
Submitted by:	Sven-Thorsten Dietrich <sven at vyatta.com>
Reviewed by:	bms
2014-02-10 14:36:51 +00:00