Commit Graph

5261 Commits

Author SHA1 Message Date
Will Andrews
bb269f3ae4 Log hardware interface up/down as "hardware" rather than just "hw".
Suggested by:	glebius
MFC after:	1 week
MFC with:	277530
2015-01-23 14:30:24 +00:00
Will Andrews
369a670857 When a CARP state change is caused by an ifconfig request, log it accordingly.
Suggested by:	glebius
MFC after:	1 week
MFC with:	277530
2015-01-23 14:28:12 +00:00
Will Andrews
d01641e2c1 Improve CARP logging so that all state transitions are logged.
sys/netinet/ip_carp.c:
	Add a "reason" string parameter to carp_set_state() and
	carp_master_down_locked() allowing more specific logging
	information to be passed into these apis.

	Refactor existing state transition logging into a single
	log call in carp_set_state().

	Update all calls to carp_set_state() and
	carp_master_down_locked() to pass an appropriate reason
	string.  For state transitions that were previously logged,
	the output should be unchanged.

Submitted by:	gibbs (original), asomers (updated)
MFC after:	1 week
Sponsored by:	Spectra Logic
MFSpectraBSD:	1039697 on 2014/02/11 (original)
		1049992 on 2014/03/21 (updated)
2015-01-22 17:09:54 +00:00
Michael Tuexen
bcbf8c2105 Remove comparisons which are not necessary.
Reported by:	Coverity
CID:		1237826, 1237844, 1237847
MFC after:	1 week
2015-01-20 19:08:55 +00:00
Michael Tuexen
2010054d91 Code cleanup.
Reported by:	Coverity
CID:		749578
MFC after:	1 week
2015-01-19 11:52:08 +00:00
Michael Tuexen
e1600e5058 Fix a bug which only shows up when an mbuf allocation failed.
Therefore chances are low that we hit this.

Reported by:	Coverity
CID:		1018886
MFC after:	1 week
2015-01-18 22:00:39 +00:00
Michael Tuexen
d6165c1fca Remove an unnecessary check.
Reported by:	Coverity
CID:		749576
MFC after:	1 week
2015-01-18 21:16:22 +00:00
Michael Tuexen
3ff78fbbd9 Add protection code to free memory in case of processing an address which
is neither IPv4 or IPv6.

Reported by:	Coverity
CID:		749311
MFC after:	1 week
2015-01-18 20:53:20 +00:00
Michael Tuexen
61330de4b0 Remove an unused variable.
Reported by:	Coverity
CID:		750999
MFC after:	1 week
2015-01-18 20:20:27 +00:00
Adrian Chadd
b2bdc62a95 Refactor / restructure the RSS code into generic, IPv4 and IPv6 specific
bits.

The motivation here is to eventually teach netisr and potentially
other networking subsystems a bit more about how RSS work queues / buckets
are configured so things have a hope of auto-configuring in the future.

* net/rss_config.[ch] takes care of the generic bits for doing
  configuration, hash function selection, etc;
* topelitz.[ch] is now in net/ rather than netinet/;
* (and would be in libkern if it didn't directly include RSS_KEYSIZE;
  that's a later thing to fix up.)
* netinet/in_rss.[ch] now just contains the IPv4 specific methods;
* and netinet/in6_rss.[ch] now just contains the IPv6 specific methods.

This should have no functional impact on anyone currently using
the RSS support.

Differential Revision:	D1383
Reviewed by:	gnn, jfv (intel driver bits)
2015-01-18 18:06:40 +00:00
Gleb Smirnoff
fc2517100b Do not go one layer down to check ifqueue length. First, not all drivers
use ifqueue at all. Second, there is no point in this lockless check.
Either positive or negative result of the check could be incorrect after
a tick.

Reviewed by:	tuexen
Sponsored by:	Nginx, Inc.
2015-01-12 18:06:22 +00:00
Gleb Smirnoff
fc7ea8b690 Remove incorrect layering violating code that:
a) assumed that ifqueue length is measured in bytes, instead of packets
b) assumed that any interface has working ifqueue
c) incremented global counter instead of ifi_oqdrops

Sponsored by:	Nginx, Inc.
2015-01-12 09:41:12 +00:00
Hiren Panchasara
64807b300f DCTCP (Data Center TCP) implementation.
DCTCP congestion control algorithm aims to maximise throughput and minimise
latency in data center networks by utilising the proportion of Explicit
Congestion Notification (ECN) marked packets received from capable hardware as a
congestion signal.

Highlights:
Implemented as a mod_cc(4) module.
ECN (Explicit congestion notification) processing is done differently from
RFC3168.
Takes one-sided DCTCP into consideration where only one of the sides is using
DCTCP and other is using standard ECN.

IETF draft: http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00
Thesis report by Midori Kato: https://eggert.org/students/kato-thesis.pdf

Submitted by:	Midori Kato <katoon@sfc.wide.ad.jp> and
		Lars Eggert <lars@netapp.com>
		with help and modifications from
		hiren
Differential Revision:	https://reviews.freebsd.org/D604
Reviewed by:	gnn
2015-01-12 08:33:04 +00:00
Michael Tuexen
d89abe19b0 Remove dead code.
Reported by:	Coverity
CID:		748664
MFC after:	1 week
2015-01-12 07:55:16 +00:00
Michael Tuexen
df26ea6839 Remove dead code.
Reported by:	Coverity
CID:		1018052
MFC after:	1 week
2015-01-12 07:39:52 +00:00
Michael Tuexen
f104b614a0 Remove dead code.
Reported by:	Coverity
CID:		1018053
MFC after:	1 week
2015-01-12 07:29:35 +00:00
Michael Tuexen
f0dc2113ca Remove dead code.
Reported by:	Coverity
CID:		748663
MFC after:	1 week
2015-01-11 22:49:20 +00:00
Michael Tuexen
448e859674 Remove dead code.
Reported by:	Coverity
CID:		748660, 748661
MFC after:	1 week
2015-01-11 22:23:39 +00:00
Michael Tuexen
e88f89a393 Remove dead code.
Reported by:	Coverity
CID:		748665
MFC after:	1 week
2015-01-11 21:55:30 +00:00
Michael Tuexen
d3cfd43074 Remove dead code.
Reported by:	Coverity
CID:		748666
MFC after:	1 week
2015-01-11 21:44:56 +00:00
Michael Tuexen
4be807c4d6 Minimize the usage of SCTP_BUF_IS_EXTENDED.
This should help Robert...
2015-01-10 20:49:57 +00:00
Michael Tuexen
296d0b9495 Retire SCTP_BUF_EXTEND_SIZE. This patch was suggested by
Robert Watson.
2015-01-10 13:56:26 +00:00
Alexander V. Chernikov
d63e657c04 * Deal with ARCNET L2 multicast mapping for IPv6 the same way as in IPv4:
handle it in arc_output() instead of nd6_storelladdr().
* Remove IFT_ARCNET check from arpresolve() since arc_output() does not
  use arpresolve() to handle broadcast/multicast. This check was there
  since r84931. It looks like it was not used since r89099 (initial
  import of Arcnet support where multicast is handled separately).
* Remove IFT_IEEE1394 case from nd6_storelladdr() since firewire_output()
  calles nd6_storelladdr() for unicast addresses only.
* Remove IFT_ARCNET case from nd6_storelladdr() since arc_output() now
  handles multicast by itself.

As a result, we have the following pattern: all non-ethernet-style
media have their own multicast map handling inside their appropriate
routines. On the other hand, arpresolve() (and nd6_storelladdr()) which
meant to be 'generic' ones de-facto handles ethernet-only multicast maps.

MFC after:	3 weeks
2015-01-09 12:56:51 +00:00
Robert Watson
e1165035a6 Use M_WRITABLE() and M_LEADINGSPACE() rather than checking M_EXT and
doing hand-crafted length calculations in the IP options code.

Reviewed by:	bz
Sponsored by:	EMC / Isilon Storage Division
2015-01-06 14:32:28 +00:00
Luiz Otavio O Souza
57c5139c46 Remove the check that prevent carp(4) advskew to be set to '0'.
CARP devices are created with advskew set to '0' and once you set it to
any other value in the valid range (0..254) you can't set it back to zero.

The code in question is also used to prevent that zeroed values overwrite
the CARP defaults when a new CARP device is created.  Since advskew already
defaults to '0' for newly created devices and the new value is guaranteed
to be within the valid range, it is safe to overwrite it here.

PR:		194672
Reported by:	cmb@pfsense.org
In collaboration with:	garga
Tested by:	garga
MFC after:	2 weeks
2015-01-06 13:07:13 +00:00
Alexander V. Chernikov
3a7498636a * Allocate hash tables separately
* Make llt_hash() callback more flexible
* Default hash size and hashing method is now per-af
* Move lltable allocation to separate function
2015-01-05 17:23:02 +00:00
Robert Watson
ed6a66ca6c To ease changes to underlying mbuf structure and the mbuf allocator, reduce
the knowledge of mbuf layout, and in particular constants such as M_EXT,
MLEN, MHLEN, and so on, in mbuf consumers by unifying various alignment
utility functions (M_ALIGN(), MH_ALIGN(), MEXT_ALIGN() in a single
M_ALIGN() macro, implemented by a now-inlined m_align() function:

- Move m_align() from uipc_mbuf.c to mbuf.h; mark as __inline.
- Reimplement M_ALIGN(), MH_ALIGN(), and MEXT_ALIGN() using m_align().
- Update consumers around the tree to simply use M_ALIGN().

This change eliminates a number of cases where mbuf consumers must be aware
of whether or not mbufs returned by the allocator use external storage, but
also assumptions about the size of the returned mbuf. This will make it
easier to introduce changes in how we use external storage, as well as
features such as variable-size mbufs.

Differential Revision:	https://reviews.freebsd.org/D1436
Reviewed by:	glebius, trasz, gnn, bz
Sponsored by:	EMC / Isilon Storage Division
2015-01-05 09:58:32 +00:00
Alexander V. Chernikov
b44a7d5d87 * Use unified code for deleting entry by sockaddr instead of per-af one.
* Remove now unused llt_delete_addr callback.
2015-01-03 19:09:06 +00:00
Alexander V. Chernikov
20dd899505 * Hide lltable implementation details in if_llatbl_var.h
* Make most of lltable_* methods 'normal' functions instead of inline
* Add lltable_get_<af|ifp>() functions to access given lltable fields
* Temporarily resurrect nd6_lookup() function
2015-01-03 16:04:28 +00:00
Alexander V. Chernikov
73db4e007a Finish r275628: remove remaining 'base' references. 2015-01-03 13:41:53 +00:00
Adrian Chadd
492ccbe14d Migrate the RSS IPv6 hash code to use pointers to the v6 addresses
rather than passing them in by value.

The eventual aim is to do incremental hash construction rather than
all of the memcpy()'ing into a contiguous buffer for the hash
function, which does show up as taking quite a bit of CPU during
profiling.

Tested:

* a variety of laptops/desktop setups I have, with v6 connectivity

Differential Revision:	D1404
Reviewed by:	bz, rpaulo
2014-12-31 22:52:43 +00:00
Andrey V. Elsukov
f188f14d43 Extern declarations in C files loses compile-time checking that
the functions' calls match their definitions. Move them to header files.

Reviewed by:	jilles (previous version)
2014-12-25 21:32:37 +00:00
Andrey V. Elsukov
132c449079 Remove in_gif.h and in6_gif.h files. They only contain function
declarations used by gif(4). Instead declare these functions in C files.
Also make some variables static.
2014-12-23 16:17:37 +00:00
Michael Tuexen
f3ba71bee4 Don't check twice that inp is not NULL.
Reported by:	Coverity
CID:		748671
MFC after:	3 days
2014-12-21 13:58:53 +00:00
Warner Losh
61f26cae7d Where appropriate, use the modern terms for the one true time base
(UTC) rather than the archaic (GMT) in comments. Except where the
comments are making fun of people doing this (and pedants who insist
on the new terms).
2014-12-21 05:07:11 +00:00
Michael Tuexen
b03b5d729a Fix and harmonize the validation of PR-SCTP policies.
Reported by:	Coverity
CID:		1232044
MFC after:	3 days
2014-12-20 21:17:28 +00:00
Michael Tuexen
ca10a8d944 Cleanup the code.
Reported by:	Coverity
CID:		1232003
2014-12-20 13:47:38 +00:00
Michael Tuexen
142a4d9e86 Add a missing break.
Reported by:	Coverity
CID:		1232014
MFC after: 	3 days
2014-12-17 20:34:38 +00:00
Andrey V. Elsukov
44eb8bbe7b Do not count security policy violation twice.
ipsec*_in_reject() do this by their own.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-12-11 19:20:13 +00:00
Andrey V. Elsukov
0332a55f0f Use ipsec4_in_reject() to simplify ip_ipsec_fwd() and ip_ipsec_input().
ipsec4_in_reject() does the same things, also it counts policy violation
errors.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-12-11 18:55:54 +00:00
Andrey V. Elsukov
0275b2e369 Remove flag/flags argument from the following functions:
ipsec_getpolicybyaddr()
 ipsec4_checkpolicy()
 ip_ipsec_output()
 ip6_ipsec_output()

The only flag used here was IP_FORWARDING.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-12-11 18:35:34 +00:00
Andrey V. Elsukov
619764beab Remove flags and tunalready arguments from ipsec4_process_packet()
and make its prototype similar to ipsec6_process_packet.
The flags argument isn't used here, tunalready is always zero.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-12-11 17:34:49 +00:00
Andrey V. Elsukov
8922ddbe40 Move ip_ipsec_fwd() from ip_input() into ip_forward().
Remove check for presence PACKET_TAG_IPSEC_IN_DONE mbuf tag from
ip_ipsec_fwd(). PACKET_TAG_IPSEC_IN_DONE tag means that packet is
already handled by IPSEC code. This means that before IPSEC processing
it was destined to our address and security policy was checked in
the ip_ipsec_input(). After IPSEC processing packet has new IP
addresses and destination address isn't our own. So, anyway we can't
check security policy from the mbuf tag, because it corresponds
to different addresses.

We should check security policy that corresponds to packet
attributes in both cases - when it has a mbuf tag and when it has not.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-12-11 16:53:29 +00:00
Andrey V. Elsukov
e58320f127 Remove PACKET_TAG_IPSEC_IN_DONE mbuf tag lookup and usage of its
security policy. The changed block of code in ip*_ipsec_input() is
called when packet has ESP/AH header. Presence of
PACKET_TAG_IPSEC_IN_DONE mbuf tag in the same time means that
packet was already handled by IPSEC and reinjected in the netisr,
and it has another ESP/AH headers (encrypted twice?).
Since it was already processed by IPSEC code, the AH/ESP headers
was already stripped (and probably outer IP header was stripped too)
and security policy from the tdb_ident was applied to those headers.
It is incorrect to apply this security policy to current headers.

Also make ip_ipsec_input() prototype similar to ip6_ipsec_input().

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-12-11 14:58:55 +00:00
Andrey V. Elsukov
dd9cd45b44 Remove check for presence of PACKET_TAG_IPSEC_PENDING_TDB and
PACKET_TAG_IPSEC_OUT_CRYPTO_NEEDED mbuf tags. They aren't used in FreeBSD.

Instead check presence of PACKET_TAG_IPSEC_OUT_DONE mbuf tag. If it
is found, bypass security policy lookup as described in the comment.

PACKET_TAG_IPSEC_OUT_DONE tag added to mbuf when IPSEC code finishes
ESP/AH processing. Since it was already finished, this means the security
policy placed in the tdb_ident was already checked. And there is no reason
to check it again here.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-12-11 14:43:44 +00:00
Alexander V. Chernikov
ee7e9a4e17 * Do not assume lle has sockaddr key after struct lle:
use llt_fill_sa_entry() llt method to store lle address in sa.
* Eliminate L3_ADDR macro and either reference IPv4/IPv6 address
   directly from lle or use newly-created llt_fill_sa_entry().
* Do not store sockaddr inside arp/ndp lle anymore.
2014-12-09 00:48:08 +00:00
Alexander V. Chernikov
d82ed5051c Simplify lle lookup/create api by using addresses instead of sockaddrs. 2014-12-08 23:23:53 +00:00
Alexander V. Chernikov
73b52ad896 Use llt_prepare_static_entry method to prepare valid per-af static entry. 2014-12-07 23:59:44 +00:00
Alexander V. Chernikov
0368226e65 * Retire abstract llentry_free() in favor of lltable_drop_entry_queue()
and explicit calls to RTENTRY_FREE_LOCKED()
* Use lltable_prefix_free() in arp_ifscrub to be consistent with nd6.
* Rename <lltable_|llt>_delete function to _delete_addr() to note that
   this function is used to external callers. Make this function maintain
   its own locking.
* Use lookup/unlink/clear call chain from internal callers instead of
    delete_addr.
* Fix LLE_DELETED flag handling
2014-12-07 23:08:07 +00:00
Alexander V. Chernikov
721cd2e032 Do not enforce particular lle storage scheme:
* move lltable allocation to per-domain callbacks.
* make llentry_link/unlink functions overridable llt methods.
* make hash table traversal another overridable llt method.
2014-12-07 17:32:06 +00:00
Alexander V. Chernikov
a743ccd468 * Add llt_clear_entry() callback which is able to do all lle
cleanup including unlinking/freeing
* Relax locking in lltable_prefix_free_af/lltable_free
* Do not pass @llt to lle free callback: it is always NULL now.
* Unify arptimer/nd6_llinfo_timer: explicitly unlock lle avoiding
   unlock/lock sequinces
* Do not pass unlocked lle to nd6_ns_output(): add nd6_llinfo_get_holdsrc()
   to retrieve preferred source address from lle hold queue and pass it
   instead of lle.
* Finally, make nd6_create() create and return unlocked lle
* Separate defrtr handling code from nd6_free():
   use nd6_check_del_defrtr() to check if we need to keep entry instead of
    performing GC,
   use nd6_check_recalc_defrtr() to perform actual recalc on lle removal.
* Move isRouter handling from nd6_cache_lladdr() to separate
   nd6_check_router()
* Add initial code to maintain lle runtime flags in sync.
2014-12-07 15:42:46 +00:00
Michael Tuexen
39cbb549cc Include the received chunk padding when reporting an unknown chunk.
MFC after: 1 week
2014-12-06 22:57:19 +00:00
Michael Tuexen
d59107f700 Fix the support of mapped IPv4 addresses.
Thanks to Mark Bonnekessel and Markus Boese for making me aware of the
problems.
MFC after: 1 week
2014-12-06 20:00:08 +00:00
Craig Rodrigues
a8da5dd658 MFp4: @181627
Allow UMA allocated memory to be freed when VNET jails are torn down.

Differential Revision: D1201
Submitted by: bz
Reviewed by: rwatson, gnn
2014-12-06 02:59:59 +00:00
Michael Tuexen
457b4b8836 This is the SCTP specific companion of
https://svnweb.freebsd.org/changeset/base/275358
which was provided by Hans Petter Selasky.
2014-12-04 21:17:50 +00:00
Michael Tuexen
4e88d37a2a Do the renaming of sb_cc to sb_ccc in a way with less code changes by
using a macro.
This is an alternate approach to
https://svnweb.freebsd.org/changeset/base/275326
which is easier to handle upstream.

Discussed with: rrs, glebius
2014-12-02 20:29:29 +00:00
Andrey V. Elsukov
2d957916ef Remove route chaching support from ipsec code. It isn't used for some time.
* remove sa_route_union declaration and route_cache member from struct secashead;
* remove key_sa_routechange() call from ICMP and ICMPv6 code;
* simplify ip_ipsec_mtu();
* remove #include <net/route.h>;

Sponsored by:	Yandex LLC
2014-12-02 04:20:50 +00:00
Alexander V. Chernikov
a6e934e359 Put bandaid on arptimer-derived crashed for 'arp -da' deleted records.
The proper fix will arrive after convering lltable 'delete' method.
2014-12-01 22:37:36 +00:00
Alexander V. Chernikov
9b65db85e2 Do more fine-grained locking in lltable code: lltable_create_lle()
does actual new lle creation without extensive locking and existing
  lle search.
Move lle updating code from gigantic in_arpinput() to arp_update_llle()
  and some other functions.

IPv6 changes to follow.
2014-12-01 21:43:48 +00:00
Hans Petter Selasky
c25290420e Start process of removing the use of the deprecated "M_FLOWID" flag
from the FreeBSD network code. The flag is still kept around in the
"sys/mbuf.h" header file, but does no longer have any users. Instead
the "m_pkthdr.rsstype" field in the mbuf structure is now used to
decide the meaning of the "m_pkthdr.flowid" field. To modify the
"m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX"
macros as defined in the "sys/mbuf.h" header file.

This patch introduces new behaviour in the transmit direction.
Previously network drivers checked if "M_FLOWID" was set in "m_flags"
before using the "m_pkthdr.flowid" field. This check has now now been
replaced by checking if "M_HASHTYPE_GET(m)" is different from
"M_HASHTYPE_NONE". In the future more hashtypes will be added, for
example hashtypes for hardware dedicated flows.

"M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is
valid and has no particular type. This change removes the need for an
"if" statement in TCP transmit code checking for the presence of a
valid flowid value. The "if" statement mentioned above is now a direct
variable assignment which is then later checked by the respective
network drivers like before.

Additional notes:
- The SCTP code changes will be committed as a separate patch.
- Removal of the "M_FLOWID" flag will also be done separately.
- The FreeBSD version has been bumped.

MFC after:	1 month
Sponsored by:	Mellanox Technologies
2014-12-01 11:45:24 +00:00
Alexander V. Chernikov
ce313fdd71 * Unify lle table dump/prefix removal code.
* Rename lla_XXX -> lltable_XXX_lle to reduce number of name prefixes
  used by lltable code.
2014-11-30 14:35:01 +00:00
Gleb Smirnoff
2cbcd3c198 Merge from projects/sendfile:
- Provide pru_ready function for TCP.
- Don't call tcp_output() from tcp_usr_send() if no ready data was put
  into the socket buffer.
- In case of dropped connection don't try to m_freem() not ready data.

Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2014-11-30 13:43:52 +00:00
Gleb Smirnoff
651e4e6a30 Merge from projects/sendfile: extend protocols API to support
sending not ready data:
o Add new flag to pru_send() flags - PRUS_NOTREADY.
o Add new protocol method pru_ready().

Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2014-11-30 13:24:21 +00:00
Gleb Smirnoff
0f9d0a73a4 Merge from projects/sendfile:
o Introduce a notion of "not ready" mbufs in socket buffers.  These
mbufs are now being populated by some I/O in background and are
referenced outside.  This forces following implications:
- An mbuf which is "not ready" can't be taken out of the buffer.
- An mbuf that is behind a "not ready" in the queue neither.
- If sockbet buffer is flushed, then "not ready" mbufs shouln't be
  freed.

o In struct sockbuf the sb_cc field is split into sb_ccc and sb_acc.
  The sb_ccc stands for ""claimed character count", or "committed
  character count".  And the sb_acc is "available character count".
  Consumers of socket buffer API shouldn't already access them directly,
  but use sbused() and sbavail() respectively.
o Not ready mbufs are marked with M_NOTREADY, and ready but blocked ones
  with M_BLOCKED.
o New field sb_fnrdy points to the first not ready mbuf, to avoid linear
  search.
o New function sbready() is provided to activate certain amount of mbufs
  in a socket buffer.

A special note on SCTP:
  SCTP has its own sockbufs.  Unfortunately, FreeBSD stack doesn't yet
allow protocol specific sockbufs.  Thus, SCTP does some hacks to make
itself compatible with FreeBSD: it manages sockbufs on its own, but keeps
sb_cc updated to inform the stack of amount of data in them.  The new
notion of "not ready" data isn't supported by SCTP.  Instead, only a
mechanical substitute is done: s/sb_cc/sb_ccc/.
  A proper solution would be to take away struct sockbuf from struct
socket and allow protocols to implement their own socket buffers, like
SCTP already does.  This was discussed with rrs@.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-11-30 12:52:33 +00:00
Gleb Smirnoff
300fa232ee Missed in r274421: use sbavail() instead of bare access to sb_cc. 2014-11-30 12:11:01 +00:00
Alexander V. Chernikov
5d14e4cd76 Provide rte_<get|set> methods to access rtentry for external consumers. 2014-11-29 19:27:43 +00:00
Alexander V. Chernikov
74860d4f7c Do not return unlocked/unreferenced lle in arpresolve/nd6_storelladdr -
return lle flags IFF needed.
Do not pass rte to arpresolve - pass is_gateway flag instead.
2014-11-27 23:06:25 +00:00
Alexander V. Chernikov
4c9df1c5b8 Fix r274855: use proper unlock method. 2014-11-23 16:40:33 +00:00
Alexander V. Chernikov
73d770287d Do more fine-grained lltable locking: use table runtime lock as rare
as we can.
2014-11-23 15:38:06 +00:00
Alexander V. Chernikov
9479029b1f * Add lltable llt_hash callback
* Move lltable items insertions/deletions to generic llt code.
2014-11-23 12:15:28 +00:00
Alexander V. Chernikov
7c066c18db Use less-invasive approach for IF_AFDATA lock: convert into 2 locks:
use rwlock accessible via external functions
    (IF_AFDATA_CFG_* -> if_afdata_cfg_*()) for all control plane tasks
  use rmlock (IF_AFDATA_RUN_*) for fast-path lookups.
2014-11-22 19:53:36 +00:00
Alexander V. Chernikov
27688dfe1d Temporarily revert r274774. 2014-11-22 17:57:54 +00:00
Alexander V. Chernikov
d7ebebfba1 Fix non-debug build after r274855. 2014-11-22 17:30:37 +00:00
Alexander V. Chernikov
8f465f6690 Convert &in_ifaddr_lock to dual-locking model:
use rwlock accessible via external functions
   (IN_IFADDR_CFG_* -> in_ifaddr_cfg_*()) for all control plane tasks
  use rmlock (IN_IFADDR_RUN_*) for fast-path lookups.
2014-11-22 16:27:51 +00:00
Alexander V. Chernikov
2e47d2f953 Mark ifaddr/rtsock static entries RLLE_VALID. 2014-11-21 23:37:59 +00:00
Alexander V. Chernikov
86b94cffe4 Finish r274774: add more headers/fix build for non-debug case. 2014-11-21 23:36:21 +00:00
Alexander V. Chernikov
9883e41b4b Switch IF_AFDATA lock to rmlock 2014-11-21 02:28:56 +00:00
Alexander V. Chernikov
4d56c133fb Sync to HEAD@r274766 2014-11-21 01:22:33 +00:00
Alexander V. Chernikov
f9723c7705 Simplify API: use new NHOP_LOOKUP_AIFP flag to select what ifp
we need to return.
Rename fib[64]_lookup_nh_basic to fib[64]_lookup_nh, add flags
fields for all relevant functions.
2014-11-20 22:41:59 +00:00
Julien Charbon
71da715374 Re-introduce padding fields removed with r264321 to keep
struct tcptw ABI unchanged.

Suggested by:	jhb
Approved by:	jhb (mentor)
MFC after:	1 day
X-MFC-With:	r264321
2014-11-17 14:56:02 +00:00
Alexander V. Chernikov
7f948f12f6 Finish r274175: do control plane MTU tracking.
Update route MTU in case of ifnet MTU change.
Add new RTF_FIXEDMTU to track explicitly specified MTU.

Old behavior:
ifconfig em0 mtu 1500->9000 -> all routes traversing em0 do not change MTU.
User has to manually update all routes.
ifconfig em0 mtu 9000->1500 -> all routes traversing em0 do not change MTU.
However, if ip[6]_output finds route with rt_mtu > interface mtu, rt_mtu
gets updated.

New behavior:
ifconfig em0 mtu 1500->9000 -> all interface routes in all fibs gets updated
with new MTU unless RTF_FIXEDMTU flag set on them.
ifconfig em0 mtu 9000->1500 -> all routes in all fibs gets updated with new
MTU unless RTF_FIXEDMTU flag set on them AND rt_mtu is less than ifp mtu.

route add ... -mtu XXX automatically sets RTF_FIXEDMTU flag.
route change .. -mtu 0 automatically removes RTF_FIXEDMTU flag.

PR:		194238
MFC after:	1 month
CR:		D1125
2014-11-17 01:05:29 +00:00
Alexander V. Chernikov
df629abf3e Rework LLE code locking:
* struct llentry is now basically split into 2 pieces:
  all fields within 64 bytes (amd64) are now protected by both
  ifdata lock AND lle lock, e.g. you require both locks to be held
  exclusively for modification. All data necessary for fast path
  operations is kept here. Some fields were added:
  - r_l3addr - makes lookup key liev within first 64 bytes.
  - r_flags - flags, containing pre-compiled decision whether given
    lle contains usable data or not. Current the only flag is RLLE_VALID.
  - r_len - prepend data len, currently unused
  - r_kick - used to provide feedback to control plane (see below).
  All other fields are protected by lle lock.
* Add simple state machine for ARP to handle "about to expire" case:
  Current model (for the fast path) is the following:
  - rlock afdata
  - find / rlock rte
  - runlock afdata
  - see if "expire time" is approaching
    (time_uptime + la->la_preempt > la->la_expire)
  - if true, call arprequest() and decrease la_preempt
  - store MAC and runlock rte
  New model (data plane):
  - rlock afdata
  - find rte
  - check if it can be used using r_* fields only
  - if true, store MAC
  - if r_kick field != 0 set it to 0.
  - runlock afdata
  New mode (control plane):
  - schedule arptimer to be called in (V_arpt_keep - V_arp_maxtries)
    seconds instead of V_arpt_keep.
  - on first timer invocation change state from ARP_LLINFO_REACHABLE
    to ARP_LLINFO_VERIFY, sets r_kick to 1 and shedules next call in
    V_arpt_rexmit (default to 1 sec).
  - on subsequent timer invocations in ARP_LLINFO_VERIFY state, checks
    for r_kick value: reschedule if not changed, and send arprequest()
    if set to zero (e.g. entry was used).
* Convert IPv4 path to use new single-lock approach. IPv6 bits to follow.
* Slow down in_arpinput(): now valid reply will (in most cases) require
  acquiring afdata WLOCK twice. This is requirement for storing changed
  lle data. This change will be slightly optimized in future.
* Provide explicit hash link/unlink functions for both ipv4/ipv6 code.
  This will probably be moved to generic lle code once we have per-AF
  hashing callback inside lltable.
* Perform lle unlink on deletion immediately instead of delaying it to
  the timer routine.
* Make r244183 more explicit: use new LLE_CALLOUTREF flag to indicate the
  presence of lle reference used for safe callout calls.
2014-11-16 20:12:49 +00:00
Alexander V. Chernikov
b9497d2c91 Move "slow path" handling from arpresolve() to newly-created
arpresolve_slow().
2014-11-15 20:02:22 +00:00
Alexander V. Chernikov
b4b1367ae4 * Move lle creation/deletion from lla_lookup to separate functions:
lla_lookup(LLE_CREATE) -> lla_create
  lla_lookup(LLE_DELETE) -> lla_delete
  Assume lla_create to return LLE_EXCLUSIVE lock for lle.
* Rework lla_rt_output to perform all lle changes under afdata WLOCK.
* change arp_ifscrub() ackquire afdata WLOCK, the same as arp_ifinit().
2014-11-15 18:54:07 +00:00
Gleb Smirnoff
cfa6009e36 In preparation of merging projects/sendfile, transform bare access to
sb_cc member of struct sockbuf to a couple of inline functions:

sbavail() and sbused()

Right now they are equal, but once notion of "not ready socket buffer data",
will be checked in, they are going to be different.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-11-12 09:57:15 +00:00
Hans Petter Selasky
3c7c188c16 Fix some minor TSO issues:
- Improve description of TSO limits.
- Remove a not needed KASSERT()
- Remove some not needed variable casts.

Sponsored by:	Mellanox Technologies
Discussed with:	lstewart @
MFC after:	1 week
2014-11-11 12:05:59 +00:00
Alexander V. Chernikov
670e8b3b8c Kill custom in_matroute() radix mathing function removing one rte mutex lock.
Initially in_matrote() in_clsroute() in their current state was introduced by
r4105 20 years ago. Instead of deleting inactive routes immediately, we kept them
in route table, setting RTPRF_OURS flag and some expire time. After that, either
GC came or RTPRF_OURS got removed on first-packet. It was a good solution
in that days (and probably another decade after that) to keep TCP metrics.
However, after moving metrics to TCP hostcache in r122922, most of in_rmx
functionality became unused. It might had been used for flushing icmp-originated
routes before rte mutexes/refcounting, but I'm not sure about that.

So it looks like this is nearly impossible to make GC do its work nowadays:

in_rtkill() ignores non-RTPRF_OURS routes.
route can only become RTPRF_OURS after dropping last reference via rtfree()
which calls in_clsroute(), which, it turn, ignores UP and non-RTF_DYNAMIC routes.

Dynamic routes can still be installed via received redirect, but they
have default lifetime (no specific rt_expire) and no one has another trie walker
to call RTFREE() on them.

So, the changelist:
* remove custom rnh_match / rnh_close matching function.
* remove all GC functions
* partially revert r256695 (proto3 is no more used inside kernel,
  it is not possible to use rt_expire from user point of view, proto3 support
  is not complete)
* Finish r241884 (similar to this commit) and remove remaining IPv6 parts

MFC after:	1 month
2014-11-11 02:52:40 +00:00
Alexander V. Chernikov
d1f79a3bfc Remove kernel handling of ICMP_SOURCEQUENCH.
It hasn't been used for a very long time.
Additionally, it was deprecated by RFC 6633.
2014-11-10 23:10:01 +00:00
Alexander V. Chernikov
f7bab8d0dd Switch route radix to dual-lock model:
use rmlock for data patch access, and config rwlock
for conrol plane processing. Route table changes require
bock locks held.
2014-11-10 00:07:06 +00:00
Alexander V. Chernikov
603eaf792b Renove faith(4) and faithd(8) from base. It looks like industry
have chosen different (and more traditional) stateless/statuful
NAT64 as translation mechanism. Last non-trivial commits to both
faith(4) and faithd(8) happened more than 12 years ago, so I assume
it is time to drop RFC3142 in FreeBSD.

No objections from:	net@
2014-11-09 21:33:01 +00:00
Alexander V. Chernikov
033074c440 Replace 'struct route *' if_output() argument with 'struct nhop_info *'.
Leave 'struct route' as is for legacy routing api users.
Remove most of rtalloc_ign*-derived functions.
2014-11-09 16:33:04 +00:00
Alexander V. Chernikov
55e5eda676 Separate radix and routing: use different structures for route and
for other customers.

Introduce new 'struct rib_head' for routing purposes and make
all routing api use it.
2014-11-09 00:36:39 +00:00
Andrey V. Elsukov
3e88eb903b Remove ip6_getdstifaddr() and all functions to work with auxiliary data.
It isn't safe to keep unreferenced ifaddrs. Use in6ifa_ifwithaddr() to
determine ifaddr corresponding to destination address. Since currently
we keep addresses with embedded scope zone, in6ifa_ifwithaddr is called
with zero zoneid and marked with XXX.

Also remove route and lle lookups from ip6_input. Use in6ifa_ifwithaddr()
instead.

Sponsored by:	Yandex LLC
2014-11-08 19:38:34 +00:00
Alexander V. Chernikov
a9413f6ca0 Sync to HEAD@r274297. 2014-11-08 18:13:35 +00:00
Alexander V. Chernikov
1398ffe5bc Convert most of "for (fibnum = 0; fibnum < rt_numfibs; fibnum++)" users
to use new rt_foreach_fib() instead of hand-rolling cycles.
2014-11-08 16:38:15 +00:00
Alexander V. Chernikov
2c498ee6be Finish r274290: arg.nextstop / arg.updating are not used anymore. 2014-11-08 15:58:17 +00:00
Alexander V. Chernikov
9c83df7d48 Retire rtq_timeout altering code. This code was initially introduced
by r6399 to enhance expiring large number of host cache routes.
Since we don't have route cloning anymore and no one altered
V_rtq_toomany default (which is 128) in nearly 20 years, I assume
this can be safely deleted.
2014-11-08 15:39:32 +00:00
Alexander V. Chernikov
22b08fd8b7 Split radix implementation and system route table structure:
use new 'struct radix_head' for radix.
2014-11-07 22:52:02 +00:00
Andrey V. Elsukov
f325335caf Overhaul if_gre(4).
Split it into two modules: if_gre(4) for GRE encapsulation and
if_me(4) for minimal encapsulation within IP.

gre(4) changes:
* convert to if_transmit;
* rework locking: protect access to softc with rmlock,
  protect from concurrent ioctls with sx lock;
* correct interface accounting for outgoing datagramms (count only payload size);
* implement generic support for using IPv6 as delivery header;
* make implementation conform to the RFC 2784 and partially to RFC 2890;
* add support for GRE checksums - calculate for outgoing datagramms and check
  for inconming datagramms;
* add support for sending sequence number in GRE header;
* remove support of cached routes. This fixes problem, when gre(4) doesn't
  work at system startup. But this also removes support for having tunnels with
  the same addresses for inner and outer header.
* deprecate support for various GREXXX ioctls, that doesn't used in FreeBSD.
  Use our standard ioctls for tunnels.

me(4):
* implementation conform to RFC 2004;
* use if_transmit;
* use the same locking model as gre(4);

PR:		164475
Differential Revision:	D1023
No objections from:	net@
Relnotes:	yes
Sponsored by:	Yandex LLC
2014-11-07 19:13:19 +00:00
Gleb Smirnoff
6df8a71067 Remove SYSCTL_VNET_* macros, and simply put CTLFLAG_VNET where needed.
Sponsored by:	Nginx, Inc.
2014-11-07 09:39:05 +00:00
Alexander V. Chernikov
930d2a42c9 Remove legacy inet lookup functions. 2014-11-07 00:29:00 +00:00
Alexander V. Chernikov
064b1bdb2d Convert lle rtchecks to use new routing API.
For inet/ case, this involves reverting r225947
which seem to be pretty strange commit and should
be reverted in HEAD ad well.
2014-11-06 23:35:22 +00:00
Alexander V. Chernikov
146a181f28 Finish r274118: remove useless fields from struct domain.
Sponsored by:	Yandex LLC
2014-11-06 14:39:04 +00:00
Alexander V. Chernikov
1a75e3b20f Make checks for rt_mtu generic:
Some virtual if drivers has (ab)used ifa ifa_rtrequest hook to enforce
route MTU to be not bigger that interface MTU. While ifa_rtrequest hooking
might be an option in some situation, it is not feasible to do MTU checks
there: generic (or per-domain) routing code is perfectly capable of doing
this.

We currrently have 3 places where MTU is altered:

1) route addition.
 In this case domain overrides radix _addroute callback (in[6]_addroute)
 and all necessary checks/fixes are/can be done there.

2) route change (especially, GW change).
 In this case, there are no explicit per-domain calls, but one can
 override rte by setting ifa_rtrequest hook to domain handler
 (inet6 does this).

3) ifconfig ifaceX mtu YYYY
 In this case, we have no callbacks, but ip[6]_output performes runtime
 checks and decreases rt_mtu if necessary.

Generally, the goals are to be able to handle all MTU changes in
 control plane, not in runtime part, and properly deal with increased
 interface MTU.

This commit changes the following:
* removes hooks setting MTU from drivers side
* adds proper per-doman MTU checks for case 1)
* adds generic MTU check for case 2)

* The latter is done by using new dom_ifmtu callback since
 if_mtu denotes L3 interface MTU, e.g. maximum trasmitted _packet_ size.
 However, IPv6 mtu might be different from if_mtu one (e.g. default 1280)
 for some cases, so we need an abstract way to know maximum MTU size
 for given interface and domain.
* moves rt_setmetrics() before MTU/ifa_rtrequest hooks since it copies
  user-supplied data which must be checked.
* removes RT_LOCK_ASSERT() from other ifa_rtrequest hooks to be able to
  use this functions on new non-inserted rte.

More changes will follow soon.

MFC after:	1 month
Sponsored by:	Yandex LLC
2014-11-06 13:13:09 +00:00
Alexander V. Chernikov
9f25cbe45e Remove old hack abusing domattach from NFS code.
According to IANA RPC uaddr registry, there are no AFs
except IPv4 and IPv6, so it's not worth being too abstract here.

Remove ne_rtable[AF_MAX+1] and use explicit per-AF radix tries.
Use own initialization without relying on domattach code.

While I admit that this was one of the rare places in kernel
networking code which really was capable of doing multi-AF
without any AF-depended code, it is not possible anymore to
rely on dom* code.

While here, change terrifying "Invalid radix node head, rn:" message,
to different non-understandable "netcred already exists for given addr/mask",
but less terrifying. Since we know that rn_addaddr() returns NULL if
the same record already exists, we should provide more friendly error.

MFC after:	1 month
2014-11-05 00:58:01 +00:00
Alexander V. Chernikov
69b74805d5 Convert gif and stf to use new routing api. 2014-11-04 18:48:13 +00:00
Alexander V. Chernikov
5c9ef37854 Sync to HEAD@r274095. 2014-11-04 18:22:33 +00:00
Alexander V. Chernikov
8c3cfe0be0 Hide 'struct rtentry' and all its macro inside new header:
net/route_internal.h
The goal is to make its opaque for all code except route/rtsock and
proto domain _rmx.
2014-11-04 17:28:13 +00:00
Alexander V. Chernikov
4003643db1 Convert tcp_maxmtu6() to use new routing api. 2014-11-04 17:01:40 +00:00
Alexander V. Chernikov
257480b8ab Convert netinet6/ to use new routing API.
* Remove &ifpp from ip6_output() in favor of ri->ri_nh_info
* Provide different wrappers to in6_selectsrc:
  Currently it is used by 2 differenct type of customers:
  - socket-based one, which all are unsure about provided
   address scope and
  - in-kernel ones (ND code mostly), which don't have
    any sockets, options, crededentials, etc.
  So, we provide two different wrappers to in6_selectsrc()
  returning select source.
* Make different versions of selectroute():
  Currenly selectroute() is used in two scenarios:
  - SAS, via in6_selecsrc() -> in6_selectif() -> selectroute()
  - output, via in6_output -> wrapper -> selectroute()
  Provide different versions for each customer:
  - fib6_lookup_nh_basic()-based in6_selectif() which is
    capable of returning interface only, without MTU/NHOP/L2
    calculations
  - full-blown fib6_selectroute() with cached route/multipath/
    MTU/L2
* Stop using routing table for link-local address lookups
* Add in6_ifawithifp_lla() to make for-us check faster for link-local
* Add in6_splitscope / in6_setllascope for faster embed/deembed scopes
2014-11-04 15:39:56 +00:00
Hans Petter Selasky
4952ad427a Restore spares used in "struct tcpcb" and bump "__FreeBSD_version" to
indicate need for kernel module re-compilation.

Sponsored by:	Mellanox Technologies
2014-11-03 13:01:58 +00:00
Michael Tuexen
f885296d70 Don't zero the stats before they are read out.
MFC after: 3 days
2014-11-01 10:35:45 +00:00
Andrey V. Elsukov
1d904a55c8 Remove the check for packets with broadcast source from if_gif's encapcheck.
The check was recommened in the draft-ietf-ngtrans-mech-05.txt. But it isn't
clear, should it compare the source with all direct broadcast addresses in the
system or not.
RFC 4213 says it is enough to verify that the source address is the address
of the encapsulator, as configured on the decapsulator. And this verification
can be extended by administrator with any other forms of IPv4 ingress filtering.

Discussed with:	glebius, melifaro
Sponsored by:	Yandex LLC
2014-10-31 15:23:24 +00:00
Andrey V. Elsukov
7e4217558c Fix typo. 2014-10-31 11:40:49 +00:00
Julien Charbon
cea40c4888 Fix a race condition in TCP timewait between tcp_tw_2msl_reuse() and
tcp_tw_2msl_scan().  This race condition drives unplanned timewait
timeout cancellation.  Also simplify implementation by holding inpcb
reference and removing tcptw reference counting.

Differential Revision:	https://reviews.freebsd.org/D826
Submitted by:		Marc De la Gueronniere <mdelagueronniere@verisign.com>
Submitted by:		jch
Reviewed By:		jhb (mentor), adrian, rwatson
Sponsored by:		Verisign, Inc.
MFC after:		2 weeks
X-MFC-With:		r264321
2014-10-30 08:53:56 +00:00
Hans Petter Selasky
0e1152fcc2 The SYSCTL data pointers can come from userspace and must not be
directly accessed. Although this will work on some platforms, it can
throw an exception if the pointer is invalid and then panic the kernel.

Add a missing SYSCTL_IN() of "SCTP_BASE_STATS" structure.

MFC after:	3 days
Sponsored by:	Mellanox Technologies
2014-10-28 12:00:39 +00:00
Hans Petter Selasky
614b50ae8b Preserve limitation of "TCP_CA_NAME_MAX" when matching the algorithm
name.

MFC after:	3 days
Suggested by:	gnn @
2014-10-27 16:08:41 +00:00
Hans Petter Selasky
60a945f95d Make assignments to "net.inet.tcp.cc.algorithm" work by fixing a bad
string comparison.

MFC after:	3 days
Reported by:	Jukka Ukkonen <jau789@gmail.com>
Sponsored by:	Mellanox Technologies
2014-10-27 11:21:47 +00:00
Mateusz Guzik
e015b1ab0a Avoid dynamic syscall overhead for statically compiled modules.
The kernel tracks syscall users so that modules can safely unregister them.

But if the module is not unloadable or was compiled into the kernel, there is
no need to do this.

Achieve this by adding SY_THR_STATIC_KLD macro which expands to SY_THR_STATIC
during kernel build and 0 otherwise.

Reviewed by:	kib (previous version)
MFC after:	2 weeks
2014-10-26 19:42:44 +00:00
Alexander V. Chernikov
7b42f6fae2 * Convert TOE framework to use new routing api.
* Add fib6_lookup_nh_ext().
* Rename union structures:
  nhop64_basic -> nhopu_basic,
  nhop64_extended -> nhopu_extended
2014-10-25 18:25:00 +00:00
Alexander V. Chernikov
9f65116cc1 * Increase nh_flags to be u16 thus reducing nhop payload to be 48 bytes
* Use NHF_ namespace for all nhop flags
* Rename nhop_data -> nhop_prepend
* Rename fib4_lookup_nh_extended -> fib4_lookup_nh_ext
* Add "flags" argument to fib4_lookup_nh_ext() to specify whether we want
  returned nh_ext structure to be refcounted or not.
2014-10-25 15:32:56 +00:00
Michael Tuexen
b3817112b4 Fix a use of an uninitialized variable by makeing sure
that sctp_med_chunk_output() always initialized the reason_code
instead of relying on the caller.
The variable is only used for debugging purpose.
This issue was reported by Peter Bostroem from Google.

MFC after: 3 days
2014-10-25 09:25:29 +00:00
Alexander V. Chernikov
7de36f627e Convert ip_fastfwd() to use new routing api. 2014-10-24 23:08:44 +00:00
Alexander V. Chernikov
b863adaaf3 Convert last piece of ip_forward to use new rouing api. 2014-10-24 22:00:25 +00:00
Alexander V. Chernikov
0652a307ca Convert arpinput() to use new routing api. 2014-10-24 19:38:05 +00:00
Alexander V. Chernikov
5bc9d581fd Convert all ip_rtaddr() users to fib4_lookup_nh_extended().
Remove ip_rtaddr().
2014-10-24 17:40:32 +00:00
Andrey V. Elsukov
a663aa4ce8 Remove redundant check and m_pullup() call. 2014-10-24 13:34:22 +00:00
Alexander V. Chernikov
f50706648c Add new fib4_lookup_nh_extended() which fills in nhop4_extended
structure without doinf L2 resolve. It also requires freeing
 references by calling fib4_free_nh_ext().

Convert in_pcbladdr() to use it.
Convert tcp_maxmtu() to use it.
2014-10-23 23:11:04 +00:00
Alexander V. Chernikov
18d8739c6b Convert inp_lookup_mcast_ifp() to new routing api. 2014-10-23 21:38:54 +00:00
Alexander V. Chernikov
2bb83c79f6 Rename ip_sendmbuf to fib4_sendmbuf() and move it to rt_nhops api.
Convert IPv4 SAS to use new routing api.
2014-10-23 21:09:14 +00:00
Hans Petter Selasky
f0188618f2 Fix multiple incorrect SYSCTL arguments in the kernel:
- Wrong integer type was specified.

- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.

- Logical OR where binary OR was expected.

- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.

- Properly assert the the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, hence
there is no easy way to assert types in the C language outside a
C-function.

- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.

- Updated "EXAMPLES" section in SYSCTL manual page.

MFC after:	3 days
Sponsored by:	Mellanox Technologies
2014-10-21 07:31:21 +00:00
Alexander V. Chernikov
b4e8f808bf Switch IPv4 output path to use new routing api.
The goals of the new API is to provide consumers with minimal
  needed information, but as fast as possible. So we provide
  full nexthop info copied into alighed on-cache structure
  instead of rte/ia pointers, their refcounts and locks.
  This does not provide solution for protecting from egress
  ifp destruction, but does not make it any worse.

Current changes:

nhops:
Add fib4_lookup_prepend() function which stores either full
L2+L3 prepend info (e.g. MAC header in case of plain IPv4) or
L3 info with NH_FLAGS_L2_INCOMPLETE flag indicating that no valid L2
info exists and we have to take "slow" path.

ip_output:
Currently ip[ 46]_output consumers use 'struct route' for
the following purposes:
  1) double lookup avoidance(route caching)
  2) plain route caching
  3) get path MTU to be able to notify source.
The former pattern is mostly used by various tunnels
 (gif, gre, stf). (Actually, gre is the only remaining,
 others were already converted. Their locking model did
 not scale good enogh to benefit from such caching, so
 we have (temporarily) removed it without any performance
 loss).
Plain route caching used by SCTP is simply wrong and should be removed.
  Temporary break it for now just to be able to compile.
Optimize path mtu reporting by providing it in new 'route_info' stucture.

Minimize games with @ia locking/refcounting for route lookup:
  add special nhop[46]_extended structure to store more route attributes.
  Pointer to given structure can be passed to fib4_lookup_prepend() to indicate
  we want this info (we actually needs it for UDP and raw IP).

ether_output:
Provide light-weight ether_output2() call to deal with
transmitting L2 frame (e.g. properly handle broadcast/simloop/bridge/
  other L2 hooks before actually transmitting frame by if_transmit()).
Add a hack based on new RT_NHOP ro_flag to distinguish which version should
  we call. Better way is probably to add a new "if_output_frame" driver
  callbacks.

 Next steps:
* Convert ip_fastfwd part
* Implement auto-growing array for per-radix nexthops
* Implement LLE tracking for nexthop calculations to be able to
  immediately provide all necessary info in single route lookup
  for gateway routes
* Switch radix locking scheme to runtime/cfg lock
* Implement multipath support for rtsock
* Implement "tracked nexthops" for tunnels (e.g. _proper_
  nexthop caching)
* Add IPv6 support for remaining parts (postponed not to
   interfere with user/ae/inet6 branch)
* Consider adding "if_output_frame" driver call to
  ease logical frame pushing.
2014-10-19 21:07:35 +00:00
Michael Tuexen
84f3b49ac9 Fix the reported streams in a SCTP_STREAM_RESET_EVENT, if a
sent incoming stream reset request was responded with failed
or denied.
Thanks to Peter Bostroem from Google for reporting the issue.

MFC after: 3 days
2014-10-16 15:36:04 +00:00
Andrey V. Elsukov
0b9f5f8a5f Overhaul if_gif(4):
o convert to if_transmit;
 o use rmlock to protect access to gif_softc;
 o use sx lock to protect from concurrent ioctls;
 o remove a lot of unneeded and duplicated code;
 o remove cached route support (it won't work with concurrent io);
 o style fixes.

Reviewed by:	melifaro
Obtained from:	Yandex LLC
MFC after:	1 month
Sponsored by:	Yandex LLC
2014-10-14 13:31:47 +00:00
Sean Bruno
882ac53ed7 Handle small file case with regards to plpmtud blackhole detection.
Submitted by:	Mikhail <mp@lenta.ru>
MFC after:	2 weeks
Relnotes:	yes
2014-10-13 21:06:21 +00:00
Sean Bruno
0f3e3bc526 Catch ipv6 case when attempting to do PLPMTUD blackhole detection.
Submitted by:	Mikhail <mp@lenta.ru>
MFC after:	2 weeks
Relnotes:	yes
2014-10-13 21:05:29 +00:00
Alexander V. Chernikov
2930362fb1 Fix matching default rule on clear/show commands.
Found by:	Oleg Ginzburg
2014-10-13 13:49:28 +00:00
Julien Charbon
489dcc9262 A connection in TIME_WAIT state before calling close() actually did not
received any RST packet.  Do not set error to ECONNRESET in this case.

Differential Revision:	https://reviews.freebsd.org/D879
Reviewed by:		rpaulo, adrian
Approved by:		jhb (mentor)
Sponsored by:		Verisign, Inc.
2014-10-12 23:01:25 +00:00
Robert Watson
f0cace5d94 When deciding whether to call m_pullup() even though there is adequate
data in an mbuf, use M_WRITABLE() instead of a direct test of M_EXT;
the latter both unnecessarily exposes mbuf-allocator internals in the
protocol stack and is also insufficient to catch all cases of
non-writability.

(NB: m_pullup() does not actually guarantee that a writable mbuf is
returned, so further refinement of all of these code paths continues to
be required.)

Reviewed by:	bz
MFC after:	3 days
Sponsored by:	EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D900
2014-10-12 15:49:52 +00:00
John Baldwin
585a4290ab Update ip_divert.ko to depend on version 3 of ipfw. 2014-10-11 16:08:54 +00:00
Bryan Venteicher
81d3ec1763 Add context pointer and source address to the UDP tunnel callback
These are needed for the forthcoming vxlan implementation. The context
pointer means we do not have to use a spare pointer field in the inpcb,
and the source address is required to populate vxlan's forwarding table.

While I highly doubt there is an out of tree consumer of the UDP
tunneling callback, this change may be a difficult to eventually MFC.

Phabricator:	https://reviews.freebsd.org/D383
Reviewed by:	gnn
2014-10-10 06:08:59 +00:00
Bryan Venteicher
a0a9e1b57c Add missing UDP multicast receive dtrace probes
Phabricator:	https://reviews.freebsd.org/D924
Reviewed by:	rpaulo markj
MFC after:	1 month
2014-10-09 22:36:21 +00:00
Michael Tuexen
e03159ea69 Ensure that the flags field of sctp_tmit_chunks is initialized.
Thanks to Peter Bostroem from Google for reporting the issue.

MFC after: 3 days
2014-10-09 20:08:12 +00:00
Alexander V. Chernikov
779b53d008 Sync to HEAD@r272825. 2014-10-09 15:35:28 +00:00
Marcel Moolenaar
80b47aefa1 Move the SCTP syscalls to netinet with the rest of the SCTP code. The
syscalls themselves are tightly coupled with the network stack and
therefore should not be in the generic socket code.

The following four syscalls have been marked as NOSTD so they can be
dynamically registered in sctp_syscalls_init() function:
  sys_sctp_peeloff
  sys_sctp_generic_sendmsg
  sys_sctp_generic_sendmsg_iov
  sys_sctp_generic_recvmsg

The syscalls are also set up to be dynamically registered when COMPAT32
option is configured.

As a side effect of moving the SCTP syscalls, getsock_cap needs to be
made available outside of the uipc_syscalls.c source file.  A proper
prototype has been added to the sys/socketvar.h header file.

API tests from the SCTP reference implementation have been run to ensure
compatibility. (http://code.google.com/p/sctp-refimpl/source/checkout)

Submitted by:	Steve Kiernan <stevek@juniper.net>
Reviewed by:	tuexen, rrs
Obtained from:	Juniper Networks, Inc.
2014-10-09 15:16:52 +00:00
Bryan Venteicher
c19f98eb74 Check for mbuf copy failure when there are multiple multicast sockets
This partitular case is the only path where the mbuf could be NULL.
udp_append() checked for a NULL mbuf only after invoking the tunneling
callback. Our only in tree tunneling callback - SCTP - assumed a non
NULL mbuf, and it is a bit odd to make the callbacks responsible for
checking this condition.

This also reduces the differences between the IPv4 and IPv6 code.

MFC after:	1 month
2014-10-09 05:17:47 +00:00
Andrey V. Elsukov
5b7a43f546 When tunneling interface is going to insert mbuf into netisr queue after stripping
outer header, consider it as new packet and clear the protocols flags.

This fixes problems when IPSEC traffic goes through various tunnels and router
doesn't send ICMP/ICMPv6 errors.

PR:		174602
Obtained from:	Yandex LLC
MFC after:	2 weeks
Sponsored by:	Yandex LLC
2014-10-08 21:23:34 +00:00
Michael Tuexen
9ba6106020 Ensure that the list of streams sent in a stream reset parameter fits
in an mbuf-cluster.
Thanks to Peter Bostroem for drawing my attention to this part of the code.
2014-10-08 15:30:59 +00:00
Michael Tuexen
e29127de2e Ensure that the number of stream reported in srs_number_streams is
consistent with the amount of data provided in the SCTP_RESET_STREAMS
socket option.
Thanks to Peter Bostroem from Google for drawing my attention to
this part of the code.
2014-10-08 15:29:49 +00:00
Alexander V. Chernikov
be8bc45790 Add IP_FW_DUMP_SOPTCODES sopt to be able to determine
which opcodes are currently available in kernel.
2014-10-08 11:12:14 +00:00
Sean Bruno
f6f6703f27 Implement PLPMTUD blackhole detection (RFC 4821), inspired by code
from xnu sources.  If we encounter a network where ICMP is blocked
the Needs Frag indicator may not propagate back to us.  Attempt to
downshift the mss once to a preconfigured value.

Default this feature to off for now while we do not have a full PLPMTUD
implementation in our stack.

Adds the following new sysctl's for control:
net.inet.tcp.pmtud_blackhole_detection -- turns on/off this feature
net.inet.tcp.pmtud_blackhole_mss       -- mss to try for ipv4
net.inet.tcp.v6pmtud_blackhole_mss     -- mss to try for ipv6

Adds the following new sysctl's for monitoring:
-- Number of times the code was activated to attempt a mss downshift
net.inet.tcp.pmtud_blackhole_activated
-- Number of times the blackhole mss was used in an attempt to downshift
net.inet.tcp.pmtud_blackhole_min_activated
-- Number of times that we failed to connect after we downshifted the mss
net.inet.tcp.pmtud_blackhole_failed

Phabricator:	https://reviews.freebsd.org/D506
Reviewed by:	rpaulo bz
MFC after:	2 weeks
Relnotes:	yes
Sponsored by:	Limelight Networks
2014-10-07 21:50:28 +00:00
Alexander V. Chernikov
a5fedf11fc Sync to HEAD@r272609. 2014-10-06 11:29:50 +00:00
Hans Petter Selasky
b228e6bf57 Minor code styling.
Suggested by:	glebius @
2014-10-06 06:19:54 +00:00
Michael Tuexen
041353aba4 Remove unused MC_ALIGN macro as suggested by Robert.
MFC after: 1 week
2014-10-05 20:30:49 +00:00
Robert Watson
6c572040c6 Eliminate use of M_EXT in IP6_EXTHDR_CHECK() by trimming a redundant
'if'/'else' case: it matches the simple 'else' case that follows.

This reduces awareness of external-storage mechanics outside of the
mbuf allocator.

Reviewed by:	bz
MFC after:	3 days
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D900
2014-10-05 06:28:53 +00:00
Alexander V. Chernikov
1ce4b35740 Sync to HEAD@r272516. 2014-10-04 12:42:37 +00:00
Hiroki Sato
9c57a5b630 Add an additional routing table lookup when m->m_pkthdr.fibnum is changed
at a PFIL hook in ip{,6}_output().  IPFW setfib rule did not perform
a routing table lookup when the destination address was not changed.

CR:	D805
2014-10-02 00:25:57 +00:00
Mark Johnston
00cb6bef99 Add a sysctl, net.inet.icmp.tstamprepl, which can be used to disable replies
to ICMP Timestamp packets.

PR:		193689
Submitted by:	Anthony Cornehl <accornehl@gmail.com>
MFC after:	3 weeks
Sponsored by:	EMC / Isilon Storage Division
2014-10-01 18:07:34 +00:00
Alexander V. Chernikov
31f0d081d8 Remove lock init from radix.c.
Radix has never managed its locking itself.
The only consumer using radix with embeded rwlock
is system routing table. Move per-AF lock inits there.
2014-10-01 14:39:06 +00:00
Michael Tuexen
83e95fb30b The default for UDPLITE_RECV_CSCOV is zero. RFC 3828 recommend
that this means full checksum coverage for received packets.
If an application is willing to accept packets with partial
coverage, it is expected to use the socekt option and provice
the minimum coverage it accepts.

Reviewed by: kevlo
MFC after: 3 days
2014-10-01 05:43:29 +00:00
Michael Tuexen
c6d81a3445 UDPLite requires a checksum. Therefore, discard a received packet if
the checksum is 0.

MFC after: 3 days
2014-09-30 20:29:58 +00:00
Michael Tuexen
0f4a03663b If the checksum coverage field in the UDPLITE header is the length
of the complete UDPLITE packet, the packet has full checksum coverage.
SO fix the condition.

Reviewed by: kevlo
MFC after: 3 days
2014-09-30 18:17:28 +00:00
John Baldwin
a9456c081a Only define the full inm_print() if KTR_IGMPV3 is enabled at compile time. 2014-09-30 17:26:34 +00:00
Michael Tuexen
03f90784bf Checksum coverage values larger than 65535 for UDPLite are invalid.
Check for this when the user calls setsockopt using UDPLITE_{SEND,RECV}CSCOV.

Reviewed by: kevlo
MFC after: 3 days
2014-09-28 17:22:45 +00:00
Alexander V. Chernikov
29c47f18da * Split tcp_signature_compute() into 2 pieces:
- tcp_get_sav() - SADB key lookup
 - tcp_signature_do_compute() - actual computation
* Fix TCP signature case for listening socket:
  do not assume EVERY connection coming to socket
  with TCP_SIGNATURE set to be md5 signed regardless
  of SADB key existance for particular address. This
  fixes the case for routing software having _some_
  BGP sessions secured by md5.
* Simplify TCP_SIGNATURE handling in tcp_input()

MFC after:	2 weeks
2014-09-27 07:04:12 +00:00
Adrian Chadd
3aac064c2f Remove an un-needed bit of pre-processor work - it all lives inside
#ifdef RSS.
2014-09-27 05:14:02 +00:00
John-Mark Gurney
469c4e0465 drop unnecessary ifdef IPSEC's. This file is only compiled when IPSEC
is defined...

Differential Revision:	D839
Reviewed by:	bz, glebius, gnn
Sponsered by:	EuroBSDCon DevSummit
2014-09-26 12:48:54 +00:00
Navdeep Parhar
5acf7269da Catch up with r271119. 2014-09-24 20:12:40 +00:00
Hans Petter Selasky
9fd573c39d Improve transmit sending offload, TSO, algorithm in general.
The current TSO limitation feature only takes the total number of
bytes in an mbuf chain into account and does not limit by the number
of mbufs in a chain. Some kinds of hardware is limited by two
factors. One is the fragment length and the second is the fragment
count. Both of these limits need to be taken into account when doing
TSO. Else some kinds of hardware might have to drop completely valid
mbuf chains because they cannot loaded into the given hardware's DMA
engine. The new way of doing TSO limitation has been made backwards
compatible as input from other FreeBSD developers and will use
defaults for values not set.

Reviewed by:	adrian, rmacklem
Sponsored by:	Mellanox Technologies
MFC after:	1 week
2014-09-22 08:27:27 +00:00
Hiroki Sato
cc45ae406d Add a change missing in r271916. 2014-09-21 04:38:50 +00:00
Hiroki Sato
89c58b73e0 - Virtualize interface cloner for gre(4). This fixes a panic when destroying
a vnet jail which has a gre(4) interface.

- Make net.link.gre.max_nesting vnet-local.
2014-09-21 03:56:06 +00:00
Gleb Smirnoff
32c7c51c2a Mechanically convert to if_inc_counter(). 2014-09-19 10:19:51 +00:00
Gleb Smirnoff
22bfa4f5b1 Remove disabled code, that is very unlikely to be ever enabled again,
as well as the comment that explains why is it disabled.
2014-09-19 05:23:47 +00:00
Alan Somers
58a39d8c5b Fix source address selection on unbound sockets in the presence of multiple
fibs. Use the mbuf's or the socket's fib instead of RT_ALL_FIBS. Fixes PR
187553. Also fixes netperf's UDP_STREAM test on a nondefault fib.

sys/netinet/ip_output.c
	In ip_output, lookup the source address using the mbuf's fib instead
	of RT_ALL_FIBS.

sys/netinet/in_pcb.c
	in in_pcbladdr, lookup the source address using the socket's fib,
	because we don't seem to have the mbuf fib. They should be the same,
	though.

tests/sys/net/fibs_test.sh
	Clear the expected failure on udp_dontroute.

PR:		187553
CR:		https://reviews.freebsd.org/D772
MFC after:	3 weeks
Sponsored by:	Spectra Logic
2014-09-16 15:28:19 +00:00
Michael Tuexen
b60b0fe6fd Add a explict cast to silence a warning when building
the userland stack on Windows.
This issue was reported by Peter Kasting from Google.

MFC after: 3 days
2014-09-16 14:39:24 +00:00
Michael Tuexen
47b80412cd Use a consistent type for the number of HMAC algorithms.
This fixes a bug which resulted in a warning on the userland
stack, when compiled on Windows.
Thanks to Peter Kasting from Google for reporting the issue and
provinding a potential fix.

MFC after: 3 days
2014-09-16 14:20:33 +00:00
Michael Tuexen
667eb48763 Small cleanup which addresses a warning regaring the truncation
of a 64-bit entity to a 32-bit entity. This issue was reported by
Peter Kasting from Google.

MFC after: 3 days
2014-09-16 13:48:46 +00:00
Gleb Smirnoff
3220a2121c FreeBSD-SA-14:19.tcp raised attention to the state of our stack
towards blind SYN/RST spoofed attack.

Originally our stack used in-window checks for incoming SYN/RST
as proposed by RFC793. Later, circa 2003 the RST attack was
mitigated using the technique described in P. Watson
"Slipping in the window" paper [1].

After that, the checks were only relaxed for the sake of
compatibility with some buggy TCP stacks. First, r192912
introduced the vulnerability, just fixed by aforementioned SA.
Second, r167310 had slightly relaxed the default RST checks,
instead of utilizing net.inet.tcp.insecure_rst sysctl.

In 2010 a new technique for mitigation of these attacks was
proposed in RFC5961 [2]. The idea is to send a "challenge ACK"
packet to the peer, to verify that packet arrived isn't spoofed.
If peer receives challenge ACK it should regenerate its RST or
SYN with correct sequence number. This should not only protect
against attacks, but also improve communication with broken
stacks, so authors of reverted r167310 and r192912 won't be
disappointed.

[1] http://bandwidthco.com/whitepapers/netforensics/tcpip/TCP Reset Attacks.pdf
[2] http://www.rfc-editor.org/rfc/rfc5961.txt

Changes made:

o Revert r167310.
o Implement "challenge ACK" protection as specificed in RFC5961
  against RST attack. On by default.
  - Carefully preserve r138098, which handles empty window edge
    case, not described by the RFC.
  - Update net.inet.tcp.insecure_rst description.
o Implement "challenge ACK" protection as specificed in RFC5961
  against SYN attack. On by default.
  - Provide net.inet.tcp.insecure_syn sysctl, to turn off
    RFC5961 protection.

The changes were tested at Netflix. The tested box didn't show
any anomalies compared to control box, except slightly increased
number of TCP connection in LAST_ACK state.

Reviewed by:	rrs
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-09-16 11:07:25 +00:00
Michael Tuexen
8a0834ec28 Make a type conversion explicit. When compiling this code on
Windows as part of the SCTP userland stack, this fixes a
warning reported by Peter Kasting from Google.

MFC after: 3 days
2014-09-16 10:57:55 +00:00
Xin LI
831ad37ef2 Fix Denial of Service in TCP packet processing.
Submitted by:	glebius
Security:	FreeBSD-SA-14:19.tcp
2014-09-16 09:48:24 +00:00
Michael Tuexen
43f9f175c5 The MTU is handled as a 32-bit entity within the SCTP stack.
This was reported by Peter Kasting from Google.

MFC after: 3 days
2014-09-16 09:22:43 +00:00
Adrian Chadd
f4659f4c27 Ensure the correct software IPv4 hash is done based on the configured
RSS parameters, rather than assuming we're hashing IPv4+UDP and IPv4+TCP.
2014-09-16 03:26:42 +00:00
Michael Tuexen
aa7e5af86f Chunk IDs are 8 bit entities, not 16 bit.
Thanks to Peter Kasting from Google for drawing
my attention to it.

MFC after: 3 days
2014-09-15 19:38:34 +00:00
Hiroki Sato
9bc11d7bd7 Use generic SYSCTL_* macro instead of deprecated SYSCTL_VNET_*.
Suggested by:	glebius
2014-09-15 14:43:58 +00:00
Hiroki Sato
348aae2398 Make net.inet.ip.sourceroute, net.inet.ip.accept_sourceroute, and
net.inet.ip.process_options vnet-aware.  Revert changes in r271545.

Suggested by:	bz
2014-09-15 07:20:40 +00:00
Hans Petter Selasky
72f3100047 Revert r271504. A new patch to solve this issue will be made.
Suggested by:	adrian @
2014-09-13 20:52:01 +00:00
Hans Petter Selasky
eb93b77ae4 Improve transmit sending offload, TSO, algorithm in general.
The current TSO limitation feature only takes the total number of
bytes in an mbuf chain into account and does not limit by the number
of mbufs in a chain. Some kinds of hardware is limited by two
factors. One is the fragment length and the second is the fragment
count. Both of these limits need to be taken into account when doing
TSO. Else some kinds of hardware might have to drop completely valid
mbuf chains because they cannot loaded into the given hardware's DMA
engine. The new way of doing TSO limitation has been made backwards
compatible as input from other FreeBSD developers and will use
defaults for values not set.

MFC after:	1 week
Sponsored by:	Mellanox Technologies
2014-09-13 08:26:09 +00:00
Alan Somers
4f8585e021 Revisions 264905 and 266860 added a "int fib" argument to ifa_ifwithnet and
ifa_ifwithdstaddr. For the sake of backwards compatibility, the new
arguments were added to new functions named ifa_ifwithnet_fib and
ifa_ifwithdstaddr_fib, while the old functions became wrappers around the
new ones that passed RT_ALL_FIBS for the fib argument. However, the
backwards compatibility is not desired for FreeBSD 11, because there are
numerous other incompatible changes to the ifnet(9) API. We therefore
decided to remove it from head but leave it in place for stable/9 and
stable/10. In addition, this commit adds the fib argument to
ifa_ifwithbroadaddr for consistency's sake.

sys/sys/param.h
	Increment __FreeBSD_version

sys/net/if.c
sys/net/if_var.h
sys/net/route.c
	Add fibnum argument to ifa_ifwithbroadaddr, and remove the _fib
	versions of ifa_ifwithdstaddr, ifa_ifwithnet, and ifa_ifwithroute.

sys/net/route.c
sys/net/rtsock.c
sys/netinet/in_pcb.c
sys/netinet/ip_options.c
sys/netinet/ip_output.c
sys/netinet6/nd6.c
	Fixup calls of modified functions.

share/man/man9/ifnet.9
	Document changed API.

CR:		https://reviews.freebsd.org/D458
MFC after:	Never
Sponsored by:	Spectra Logic
2014-09-11 20:21:03 +00:00
Andrey V. Elsukov
028bdf289d Add scope zone id to the in_endpoints and hc_metrics structures.
A non-global IPv6 address can be used in more than one zone of the same
scope. This zone index is used to identify to which zone a non-global
address belongs.

Also we can have many foreign hosts with equal non-global addresses,
but from different zones. So, they can have different metrics in the
host cache.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-09-10 16:26:18 +00:00
Andrey V. Elsukov
a7e201bbac Make in6_pcblookup_hash_locked and in6_pcbladdr static.
Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-09-10 13:17:35 +00:00
Andrey V. Elsukov
1b44e5ffe3 Introduce INP6_PCBHASHKEY macro. Replace usage of hardcoded part of
IPv6 address as hash key in all places.

Obtained from:	Yandex LLC
2014-09-10 12:35:42 +00:00
Adrian Chadd
8ad1a83b48 Calculate the RSS hash for outbound UDPv4 frames.
Differential Revision:	https://reviews.freebsd.org/D527
Reviewed by:	grehan
2014-09-09 04:19:36 +00:00
Adrian Chadd
b8bc95cd49 Update the IPv4 input path to handle reassembled frames and incoming frames
with no RSS hash.

When doing RSS:

* Create a new IPv4 netisr which expects the frames to have been verified;
  it just directly dispatches to the IPv4 input path.
* Once IPv4 reassembly is done, re-calculate the RSS hash with the new
  IP and L3 header; then reinject it as appropriate.
* Update the IPv4 netisr to be a CPU affinity netisr with the RSS hash
  function (rss_soft_m2cpuid) - this will do a software hash if the
  hardware doesn't provide one.

NICs that don't implement hardware RSS hashing will now benefit from RSS
distribution - it'll inject into the correct destination netisr.

Note: the netisr distribution doesn't work out of the box - netisr doesn't
query RSS for how many CPUs and the affinity setup.  Yes, netisr likely
shouldn't really be doing CPU stuff anymore and should be "some kind of
'thing' that is a workqueue that may or may not have any CPU affinity";
that's for a later commit.

Differential Revision:	https://reviews.freebsd.org/D527
Reviewed by:	grehan
2014-09-09 04:18:20 +00:00
Adrian Chadd
72d33245f5 Implement IPv4 RSS software hash functions to use during packet ingress
and egress.

* rss_mbuf_software_hash_v4 - look at the IPv4 mbuf to fetch the IPv4 details
  + direction to calculate a hash.
* rss_proto_software_hash_v4 - hash the given source/destination IPv4 address,
  port and direction.
* rss_soft_m2cpuid - map the given mbuf to an RSS CPU ("bucket" for now)

These functions are intended to be used by the stack to support
the following:

* Not all NICs do RSS hashing, so we should support some way of doing
  a hash in software;
* The NIC / driver may not hash frames the way we want (eg UDP 4-tuple
  hashing when the stack is only doing 2-tuple hashing for UDP); so we
  may need to re-hash frames;
* .. same with IPv4 fragments - they will need to be re-hashed after
  reassembly;
* .. and same with things like IP tunneling and such;
* The transmit path for things like UDP, RAW and ICMP don't currently
  have any RSS information attached to them - so they'll need an
  RSS calculation performed before transmit.

TODO:

* Counters! Everywhere!
* Add a debug mode that software hashes received frames and compares them
  to the hardware hash provided by the hardware to ensure they match.

The IPv6 part of this is missing - I'm going to do some re-juggling of
where various parts of the RSS framework live before I add the IPv6
code (read: the IPv6 code is going to go into netinet6/in6_rss.[ch],
rather than living here.)

Note: This API is still fluid.  Please keep that in mind.

Differential Revision:	https://reviews.freebsd.org/D527
Reviewed by:	grehan
2014-09-09 03:10:21 +00:00
Adrian Chadd
9d3ddf4384 Add support for receiving and setting flowtype, flowid and RSS bucket
information as part of recvmsg().

This is primarily used for debugging/verification of the various
processing paths in the IP, PCB and driver layers.

Unfortunately the current implementation of the control message path
results in a ~10% or so drop in UDP frame throughput when it's used.

Differential Revision:	https://reviews.freebsd.org/D527
Reviewed by:	grehan
2014-09-09 01:45:39 +00:00
Adrian Chadd
061a4b4c36 Add a flag to ip_output() - IP_NODEFAULTFLOWID - which prevents it from
overriding an existing flowid/flowtype field in the outbound mbuf with
the inp_flowid/inp_flowtype details.

The upcoming RSS UDP support calculates a valid RSS value for outbound
mbufs and since it may change per send, it doesn't cache it in the inpcb.
So overriding it here would be wrong.

Differential Revision:	https://reviews.freebsd.org/D527
Reviewed by:	grehan
2014-09-09 00:19:02 +00:00
Alexander V. Chernikov
d6164b77f8 Make ipfw_nat module use IP_FW3 codes.
Kernel changes:
* Split kernel/userland nat structures eliminating IPFW_INTERNAL hack.
* Add IP_FW_NAT44_* codes resemblin old ones.
* Assume that instances can be named (no kernel support currently).
* Use both UH+WLOCK locks for all configuration changes.
* Provide full ABI support for old sockopts.

Userland changes:
* Use IP_FW_NAT44_* codes for nat operations.
* Remove undocumented ability to show ranges of nat "log" entries.
2014-09-07 18:30:29 +00:00
Michael Tuexen
ad234e3c3d Address warnings generated by the clang analyzer.
MFC after: 1 week
2014-09-07 18:05:37 +00:00
Michael Tuexen
23602b60fb Address another warnings reported by Patrick Laimbock when compiling
in userspace. While there, improve consistency.

MFC after: 1 week
2014-09-07 17:07:19 +00:00
Michael Tuexen
24aaac8d59 Use union sctp_sockstore instead of struct sockaddr_storage. This
eliminiates some warnings when building in userland.
Thanks to Patrick Laimbock for reporting this issue.
Remove also some unnecessary casts.
There should be no functional change.

MFC after: 1 week
2014-09-07 09:06:26 +00:00