Commit Graph

70 Commits

Author SHA1 Message Date
Zhenlei Huang
5a8abd0a29 lacp: Use C99 bool for boolean return value
This improves readability.

No functional change intended.

MFC after:	1 week
2023-04-01 01:48:36 +08:00
Justin Hibbits
2c2b37ad25 ifnet/API: Move struct ifnet definition to a <net/if_private.h>
Hide the ifnet structure definition, no user serviceable parts inside,
it's a netstack implementation detail.  Include it temporarily in
<net/if_var.h> until all drivers are updated to use the accessors
exclusively.

Reviewed by:	glebius
Sponsored by:	Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D38046
2023-01-24 14:36:30 -05:00
Andrew Gallatin
43c72c45a1 lacp: Remove racy kassert
In lacp_select_tx_port_by_hash(), we assert that the selected port is
DISTRIBUTING. However, the port state is protected by the LACP_LOCK(),
which is not held around lacp_select_tx_port_by_hash().  So this
assertion is racy, and can result in a spurious panic when links
are flapping.

It is certainly possible to fix it by acquiring LACP_LOCK(),
but this seems like an early development assert, and it seems best
to just remove it, rather than add complexity inside an ifdef
INVARIANTS.

Sponsored by: Netflix
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D35396
2022-06-13 11:32:10 -04:00
Andrey V. Elsukov
f2ab916084 [vlan + lagg] add IFNET_EVENT_UPDATE_BAUDRATE event
use it to update if_baudrate for vlan interfaces created on the LACP lagg.

Differential revision:	https://reviews.freebsd.org/D33405
2022-05-20 06:38:43 +02:00
Greg Foster
00a80538b4 lacp: short timeout erroneously declares link-flapping
Panasas was seeing a higher-than-expected number of link-flap events.
After joint debugging with the switch vendor, we determined there were
problems on both sides; either of which might cause the occasional
event, but together caused lots of them.

On the switch side, an internal queuing issue was causing LACP PDUs --
which should be sent every second, in short-timeout mode -- to sometimes
be sent slightly later than they should have been. In some cases, two
successive PDUs were late, but we never saw three late PDUs in a row.

On the FreeBSD side, we saw a link-flap event every time there were two
late PDUs, while the spec says that it takes *three* seconds of downtime
to trigger that event. It turns out that if a PDU was received shortly
before the timer code was run, it would decrement less than a full
second after the PDU arrived. Then two delayed PDUs would cause two
additional decrements, causing it to reach zero less than three seconds
after the most-recent on-time PDU.

The solution is to note the time a PDU arrives, and only decrement if at
least a full second has elapsed since then.

Reported by:	Greg Foster <gfoster@panasas.com>
Reviewed by:	gallatin
Tested by:	Greg Foster <gfoster@panasas.com>
MFC after:	3 days
Sponsored by:	Panasas
Differential Revision:	https://reviews.freebsd.org/D35070
2022-04-27 12:41:30 -07:00
Arnaud Ysmal
0b92a7fe47 LACP: Do not wait response for marker messages not sent
The error returned when a marker message can not be emitted on a port is not handled.

This cause the lacp to block all emissions until the timeout of 3 seconds is reached.

To fix this issue, I just clear the LACP_PORT_MARK flag when the packet could not be emitted.

Differential revision:	https://reviews.freebsd.org/D30467
Obtained from:		Stormshield
2021-09-23 10:57:11 +02:00
Andrew Gallatin
8732245d29 LACP: When suppressing distributing, return ENOBUFS
When links come and go, lacp goes into a "suppress distributing" mode
where it drops traffic for 3 seconds. When in this mode, lagg/lacp
historiclally drops traffic with ENETDOWN. That return value causes TCP
to close any connection where it gets that value back from the lower
parts of the stack.  This means that any TCP connection with active
traffic during a 3-second windown when an LACP link comes or goes
would get closed.

TCP treats return values of ENOBUFS as transient errors, and re-schedules
transmission later. So rather than returning ENETDOWN, lets
return ENOBUFS instead.  This allows TCP connections to be preserved.

I've tested this by repeatedly bouncing links on a Netlfix CDN server
under a moderate (20Gb/s) load and overved ENOBUFS reported back to
the TCP stack (as reported by a RACK TCP sysctl).

Reviewed by:	jhb, jtl, rrs
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27188
2020-11-18 14:55:49 +00:00
Hans Petter Selasky
a92c4bb62a Add support for IP over infiniband, IPoIB, to lagg(4). Currently only
the failover protocol is supported due to limitations in the IPoIB
architecture. Refer to the lagg(4) manual page for how to configure
and use this new feature. A new network interface type,
IFT_INFINIBANDLAG, has been added, similar to the existing
IFT_IEEE8023ADLAG .

ifconfig(8) has been updated to accept a new laggtype argument when
creating lagg(4) network interfaces. This new argument is used to
distinguish between ethernet and infiniband type of lagg(4) network
interface. The laggtype argument is optional and defaults to
ethernet. The lagg(4) command line syntax is backwards compatible.

Differential Revision:	https://reviews.freebsd.org/D26254
Reviewed by:		melifaro@
MFC after:		1 week
Sponsored by:		Mellanox Technologies // NVIDIA Networking
2020-10-22 09:47:12 +00:00
Mitchell Horne
ceff9b9d25 if_media: definitions for 40GE LM4 ethernet media type
Reviewed by:	erj
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26276
2020-09-16 14:45:16 +00:00
Mateusz Guzik
662c13053f net: clean up empty lines in .c and .h files 2020-09-01 21:19:14 +00:00
Andrew Gallatin
98085bae8c make lacp's use_numa hashing aware of send tags
When I did the use_numa support, I missed the fact that there is
a separate hash function for send tag nic selection. So when
use_numa is enabled, ktls offload does not work properly, as it
does not reliably allocate a send tag on the proper egress nic
since different egress nics are selected for send-tag allocation
and packet transmit. To fix this, this change:

- refectors lacp_select_tx_port_by_hash() and
     lacp_select_tx_port() to make lacp_select_tx_port_by_hash()
     always called by lacp_select_tx_port()

-   pre-shifts flowids to convert them to hashes when calling lacp_select_tx_port_by_hash()

-   adds a numa_domain field to if_snd_tag_alloc_params

-   plumbs the numa domain into places where we allocate send tags

In testing with NIC TLS setup on a NUMA machine, I see thousands
of output errors before the change when enabling
kern.ipc.tls.ifnet.permitted=1. After the change, I see no
errors, and I see the NIC sysctl counters showing active TLS
offload sessions.

Reviewed by:	rrs, hselasky, jhb
Sponsored by:	Netflix
2020-03-09 13:44:51 +00:00
Pawel Biernacki
7029da5c36 Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE.  All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by:	kib (mentor, blanket)
Commented by:	kib, gallatin, melifaro
Differential Revision:	https://reviews.freebsd.org/D23718
2020-02-26 14:26:36 +00:00
Konstantin Belousov
5d1277ca9a if_media.h: Add 50G KR4 ethernet media type.
Submitted by:	Adam Peace <adam.e.peace@gmail.com>
Reviewed by:	hselasky
Differential revision:	https://reviews.freebsd.org/D23620
2020-02-11 18:03:45 +00:00
John Baldwin
b2e60773c6 Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets.  KTLS only supports
offload of TLS for transmitted data.  Key negotation must still be
performed in userland.  Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option.  All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.

Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type.  Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.

At present, rekeying is not supported though the in-kernel framework
should support rekeying.

KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer.  Each TLS frame is described by a single
ext_pgs mbuf.  The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.

KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.

Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame().  ktls_enqueue() is then
called to schedule TLS frames for encryption.  In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed.  For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().

A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue().  Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.

(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)

KTLS supports pluggable software encryption backends.  Internally,
Netflix uses proprietary pure-software backends.  This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames.  As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.

Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready().  At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.

ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation.  In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session.  TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted.  The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface.  If so, the packet is tagged
with the TLS send tag and sent to the interface.  The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation.  If the the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped.  In addition, a task is scheduled to refresh the TLS send
tag for the TLS session.  If a new TLS send tag cannot be allocated,
the connection is dropped.  If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag.  (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another.  As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)

ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8).  ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.

Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option.  They can also use this socket
option to toggle between software and ifnet TLS modes.

In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax.  However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.

Various sysctls and counters are available under the kern.ipc.tls
sysctl node.  The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default).  The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.

KTLS is enabled via the KERN_TLS kernel option.

This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.

Reviewed by:	gallatin, hselasky, rrs
Obtained from:	Netflix
Sponsored by:	Netflix, Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D21277
2019-08-27 00:01:56 +00:00
Andrew Gallatin
35961dce98 Select lacp egress ports based on NUMA domain
This change creates an array of port maps indexed by numa domain
for lacp port selection. If we have lacp interfaces in more than
one domain, then we select the egress port by indexing into the
numa port maps and picking a port on the appropriate numa domain.

This is behavior is controlled by the new ifconfig use_numa flag
and net.link.lagg.use_numa sysctl/tunable (both modeled after the
existing use_flowid), which default to enabled.

Reviewed by:	bz, hselasky, markj (and scottl, earlier version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20060
2019-05-03 14:43:21 +00:00
Eric Joyner
fe2bf351fe if_media: Add new 2.5G/5G/25G/40G/50G/100G/200G/400G media types
Upcoming Ethernet hardware will support new media types that aren't in the kernel
yet, so they are added here. These mostly include new 25G/50G/100G media types;
and this commit introduces new 200G/400G speeds and media.

Reviewed by:	hselasky@, jhb@
MFC after:	1 week
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D16731
2018-08-22 18:19:56 +00:00
Andrew Gallatin
5ccac9f972 lagg: allow lacp to manage the link state
Lacp needs to manage the link state itself. Unlike other
lagg protocols, the ability of lacp to pass traffic
depends not only on the lagg members having link, but also
on the lacp protocol converging to a distributing state with the
link partner.

If we prematurely mark the link as up, then we will send a
gratuitous arp (via arp_handle_ifllchange()) before the lacp
interface is capable of passing traffic. When this happens,
the gratuitous arp is lost, and our link partner may cache
a stale mac address (eg, when the base mac address for the
lagg bundle changes, due to a BIOS change re-ordering NIC
unit numbers)

Reviewed by: jtl, hselasky
Sponsored by: Netflix
2018-08-13 14:13:25 +00:00
Andrew Turner
5f901c92a8 Use the new VNET_DEFINE_STATIC macro when we are defining static VNET
variables.

Reviewed by:	bz
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D16147
2018-07-24 16:35:52 +00:00
Steven Hartland
6fb1399a4c Added missing CTLFLAG_VNET to lacp default_strict_mode
Added CTLFLAG_VNET to net.link.lagg.lacp.default_strict_mode which was missed
in r290450.

Reported by:	julian@
MFC after:	1 week
Sponsored by:	Multiplay
2018-01-24 10:13:14 +00:00
Pedro F. Giffuni
fe267a5590 sys: general adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

No functional change intended.
2017-11-27 15:23:17 +00:00
Navdeep Parhar
aa0186bc36 Make LACP based lagg work with interfaces (like 100Gbps and 25Gbps) that
report extended media types.

lacp_aggregator_bandwidth() uses the media to determine the speed of the
interface and returns 0 for IFM_OTHER without the bits in the extended
range.

Reported by:	kbowling@
Reviewed by:	eugen_grosbein.net, mjoras@
MFC after:	1 week
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D12188
2017-09-06 14:36:35 +00:00
Eric Joyner
6e105d4e35 Add several new media types to if_media.h
These include several 25G types (for active direct attach cables and LR modules),
and a missing type for 10G active direct attach.

Differential Revision:	https://reviews.freebsd.org/D10425
Reviewed by:	smh, imp
MFC after:	3 days
Sponsored by:	Intel Corporation
2017-05-10 18:33:40 +00:00
Jonathan T. Looney
e80039007a Do some minimal work to better conform to the 802.3ad (LACP) standard.
In particular, don't set the synchronized bit for the peer unless it truly
appears to be synchronized to us. Also, don't set our own synchronized bit
unless we have actually seen a remote system.

Prior to this change, we were seeing some strange behavior, such as:

1. We send an advertisement with the Activity, Aggregation, and Default
flags, followed by an advertisement with the Activity, Aggregation,
Synchronization, and Default flags. However, we hadn't seen an
advertisement from another peer and were still advertising the default
(NULL) peer. A closer examination of the in-kernel data structures (using
kgdb) showed that the system had added the default (NULL) peer as a valid
aggregator for the segment.
2. We were receiving an advertisement from a peer that included the
default (NULL) peer instead of including our system information. However,
we responded with an advertisement that included the Synchronization flag
for both our system and the peer. (Since the peer's advertisement did not
include our system information, we shouldn't add the synchronization bit
for the peer.)

This patch corrects those two items.

Reviewed by:	smh
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D9485
2017-02-26 00:19:02 +00:00
Ravi Pokala
d592868ebf Eliminate misleading comments and dead code in lacp_port_create()
Variables "fast" and "active" are both constant in lacp_port_create(), but
comments mispleadingly suggest that "fast" can be changed via ioctl. The
constant values control the value of "lp->lp_state", so it too is constant,
and the code for assigning different value to it is essentially dead.

Remove both "fast" and "active", and set "lp->lp_state" unconditionally;
that gets rid of the dead code and misleading comments.

CID: 1305692
CID: 1305734

Reported by:	asomers
Reviewed by:	asomers
MFC after:	1 week
Sponsored by:	Panasas
Differential Revision:	https://reviews.freebsd.org/D9302
2017-01-24 01:39:40 +00:00
Hans Petter Selasky
f3e7afe2d7 Implement kernel support for hardware rate limited sockets.
- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.

- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API support rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.

- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().

- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.

- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.

- How rate limiting works:

1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.

2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the rate
limited traffic will be rate limited.

3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN which causes the ip_output() to release and clear the send
tag. Upon next ip_output() a new "snd_tag" will be tried allocated.

4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.

Reviewed by:		wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision:	https://reviews.freebsd.org/D3687
Sponsored by:		Mellanox Technologies
MFC after:		3 months
2017-01-18 13:31:17 +00:00
Steven Hartland
c1be893c44 Add sysctl to control LACP strict compliance default
Add net.link.lagg.lacp.default_strict_mode which defines
the default value for LACP strict compliance for created
lagg devices.

Also:
* Add lacp_strict option to ifconfig(8).
* Fix lagg(4) creation examples.
* Minor style(9) fix.

MFC after:	1 week
2015-11-06 15:33:27 +00:00
Hiren Panchasara
0e02b43a07 Make LAG LACP fast timeout tunable through IOCTL.
Differential Revision:	D3300
Submitted by:		LN Sundararajan <lakshmi.n at msystechnologies>
Reviewed by:		wblock, smh, gnn, hiren, rpokala at panasas
MFC after:		2 weeks
Sponsored by:		Panasas
2015-08-12 20:21:04 +00:00
Eric Joyner
eb7e25b22f ifmedia changes:
- Extend the number of available subtypes for Ethernet media by using some
of the ifmedia word's option bits to help denote subtypes. As a result, the
number of possible Ethernet subtype values increases from 31 to 511.

- Use some of those new values to define new media types.

- lacp_compose_key() recgonizes the new Ethernet media types added.
  (Change made as required by a comment in if_media.h)

- New ioctl, SIOGIFXMEDIA, to handle getting the new extended media types.
  SIOCGIFMEDIA is retained for backwards compatibility.

- Changes to ifconfig to allow it to handle the new extended media types.

Submitted by:	mike@karels.net (original), hselasky
Reviewed by:	jfvogel, gnn, hselasky
Approved by:	jfvogel (mentor), gnn (mentor)
Differential Revision: http://reviews.freebsd.org/D1965
2015-04-07 21:31:17 +00:00
Hans Petter Selasky
b7ba031ff7 Factor out mbuf hashing code from LAGG driver so that other network
drivers can use it. This avoids some code duplication. Add missing
default case to all switch statements while at it. Also move the
hashing of the IPv6 flow field to layer 4 because the IPv6 flow field
is constant on a per L4 connection basis and not on a per L3 network.

Differential Revision:	https://reviews.freebsd.org/D1987
Sponsored by:		Mellanox Technologies
MFC after:		1 month
2015-03-11 16:02:24 +00:00
Will Andrews
0e5f55bb95 Improve the distribution of LAGG port traffic.
I edited the original change to retain the use of arc4random() as a seed for
the hashing as a very basic defense against intentional lagg port selection.

The author's original commit message (edited slightly):

sys/net/ieee8023ad_lacp.c
sys/net/if_lagg.c
	In lagg_hashmbuf, use the FNV hash instead of the old
	hash32_buf.  The hash32 family of functions operate one octet
	at a time, and when run on a string s of length n, their output
	is equivalent to :

		   ----- i=n-1
		   \
	       n    \           (n-i-1)              32
	( seed^  +  /        33^        * s[i] ) % 2^
		   /
		   ----- i=0

	The problem is that the last five bytes of input don't get
	multiplied by sufficiently many powers of 33 to rollover 2^32.
	That means that changing the last few bytes (but obviously not
	the very last) of input will always change the value of the
	hash by a multiple of 33.  In the case of lagg_hashmbuf() with
	ipv4 input, the last four bytes are the TCP or UDP port
	numbers.  Since the output of lagg_hashmbuf is always taken
	modulo the port count, and 3 is a common port count for a lagg,
	that's bad.  It means that the UDP or TCP source port will
	never affect which lagg member is selected on a 3-port lagg.

	At 10Gbps, I was not able to measure any difference in CPU
	consumption between the old and new hash.

Submitted by:	asomers (original commit)
Reviewed by:	emaste, glebius
MFC after:	1 week
Sponsored by:	Spectra Logic
MFSpectraBSD:	1001723 on 2013/08/28 (original)
		1114258 on 2015/01/22 (edit)
2015-01-23 00:06:35 +00:00
Hans Petter Selasky
c25290420e Start process of removing the use of the deprecated "M_FLOWID" flag
from the FreeBSD network code. The flag is still kept around in the
"sys/mbuf.h" header file, but does no longer have any users. Instead
the "m_pkthdr.rsstype" field in the mbuf structure is now used to
decide the meaning of the "m_pkthdr.flowid" field. To modify the
"m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX"
macros as defined in the "sys/mbuf.h" header file.

This patch introduces new behaviour in the transmit direction.
Previously network drivers checked if "M_FLOWID" was set in "m_flags"
before using the "m_pkthdr.flowid" field. This check has now now been
replaced by checking if "M_HASHTYPE_GET(m)" is different from
"M_HASHTYPE_NONE". In the future more hashtypes will be added, for
example hashtypes for hardware dedicated flows.

"M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is
valid and has no particular type. This change removes the need for an
"if" statement in TCP transmit code checking for the presence of a
valid flowid value. The "if" statement mentioned above is now a direct
variable assignment which is then later checked by the respective
network drivers like before.

Additional notes:
- The SCTP code changes will be committed as a separate patch.
- Removal of the "M_FLOWID" flag will also be done separately.
- The FreeBSD version has been bumped.

MFC after:	1 month
Sponsored by:	Mellanox Technologies
2014-12-01 11:45:24 +00:00
Hiroki Sato
6d47816791 - Move L2 addr configuration for the primary port to a taskqueue. This fixes
LOR of softc rmlock in iflladdr_event handlers.

- Call if_delmulti_ifma() after LACP_UNLOCK().  This fixes another LOR.

- Fix a panic in lacp_transit_expire().

- Fix a panic in lagg_input() upon shutting down a port.
2014-10-05 02:34:21 +00:00
Hiroki Sato
939a050ad9 Virtualize lagg(4) cloner. This change fixes a panic when tearing down
if_lagg(4) interfaces which were cloned in a vnet jail.

Sysctl nodes which are dynamically generated for each cloned interface
(net.link.lagg.N.*) have been removed, and use_flowid and flowid_shift
ifconfig(8) parameters have been added instead.  Flags and per-interface
statistics counters are displayed in "ifconfig -v".

CR:	D842
2014-10-01 21:37:32 +00:00
Gleb Smirnoff
eade13f9d2 Remove macros that hide access to struct ifnet fields. 2014-09-26 13:02:29 +00:00
Gleb Smirnoff
6900d0d328 - Whitespace.
- Remove caddr_t.
2014-09-26 12:35:58 +00:00
Gleb Smirnoff
09c7577ef3 - When reconfiguring protocol on a lagg, first set it to LAGG_PROTO_NONE,
then drop lock, run the attach routines, and then set it to specific
  proto. This removes tons of WITNESS warnings.
- Make lagg protocol attach handlers not failing and allocate memory
  with M_WAITOK.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-09-26 08:42:32 +00:00
Gleb Smirnoff
b1bbc5b3d1 Make lagg protocols detach methods returning void.
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-09-26 07:12:40 +00:00
Hans Petter Selasky
af3b2549c4 Pull in r267961 and r267973 again. Fix for issues reported will follow. 2014-06-28 03:56:17 +00:00
Glen Barber
37a107a407 Revert r267961, r267973:
These changes prevent sysctl(8) from returning proper output,
such as:

 1) no output from sysctl(8)
 2) erroneously returning ENOMEM with tools like truss(1)
    or uname(1)
 truss: can not get etype: Cannot allocate memory
2014-06-27 22:05:21 +00:00
Hans Petter Selasky
3da1cf1e88 Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
Alan Somers
f544a74870 Fix a panic caused by doing "ifconfig -am" while a lagg is being destroyed.
The thread that is destroying the lagg has already set sc->sc_psc=NULL when
the "ifconfig -am" thread gets to lacp_req().  It tries to dereference
sc->sc_psc and panics.  The solution is for lacp_req() to check the value of
sc->sc_psc.  If NULL, harmlessly return an lacp_opreq structure full of
zeros.  Full details in GNATS.

PR:		kern/189003
Reviewed by:	timeout on freebsd-net@
MFC after:	3 weeks
Sponsored by:	Spectra Logic Corporation
2014-05-02 16:24:09 +00:00
Alexander V. Chernikov
95fbe4d0cc Simplify filling sockaddr_dl structure for if_resolvemulti()
callback providers. link_init_sdl() function can be used to
fill most of the parameters. Use caller stack instead of
allocation / freing memory for each request. Do not drop support
for extra-long (probably non-existing) link-layer protocols by
introducing link_alloc_sdl() (used by if_resolvemulti() callback)
and link_free_sdl() (used by caller).
Since this change breaks KBI, MFC requires slightly different approach
(link_init_sdl() auto-allocating buffer if necessary to handle cases
 with unmodified if_resolvemulti() callers).

MFC after:	2 weeks
2014-01-18 23:24:51 +00:00
Scott Long
1a8959dac6 Multi-queue NIC drivers and multi-port lagg tend to use the same lower
bits of the flowid as each other, resulting in a poor distribution of
packets among queues in certain cases.  Work around this by adding a
set of sysctls for controlling a bit-shift on the flowid when doing
multi-port aggrigation in lagg and lacp.  By default, lagg/lacp will
now use bits 16 and higher instead of 0 and higher.

Reviewed by:	max
Obtained from:	Netflix
MFC after:	3 days
2013-12-30 01:32:17 +00:00
Gleb Smirnoff
c3322cb91c Include necessary headers that now are available due to pollution
via if_var.h.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-28 07:29:16 +00:00
Gleb Smirnoff
76039bc84f The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-26 17:58:36 +00:00
Andrey V. Elsukov
d5773da82a Use the same actor key for media types of the same speed.
PR:		176097
MFC after:	2 weeks
2013-10-17 15:14:58 +00:00
Adrian Chadd
49de4f2214 Break out the static, global LACP debug options into a per-lagg unit
sysctl tree.

* Create a net.link.lagg.X.lacp node
* Add a debug node under that for tx_test and rx_test
* Add lacp_strict_mode, defaulting to 1

tx_test and rx_test are still a bitmap of unit numbers for now.
At some point it would be nice to create child nodes of the lagg bundle
for each sub-interface, and then populate those with various knobs
and statistics.

Sponsored by:	Netflix
2013-07-26 19:41:13 +00:00
Adrian Chadd
387e754ae5 Fix typo.
Sponsored by:	Netflix
2013-07-25 19:10:23 +00:00
Adrian Chadd
31402c27b8 Bring over some link aggregation / LACP protocol improvements and debugging
additions.

* Add some new tracing events to aid in debugging.
* Add in a debugging mode to drop transmit and received frames, specifically
  to test whether seeing or hearing heartbeats correctly cause LACP to
  drop the port.
* Add in (and make default) a strict LACP mode, which requires the
  heartbeat on a port to be heard before it's used.  Sometimes vendor ports
  will hang but the link layer stays up, resulting in hung traffic.
* Add logging the number of link status flaps, again to aid in debugging
  badly behaving switch ports.
* Calculate the lagg interface port speed as the multiple of the
  configured ports, rather than the largest.

Obtained from:	Netflix
MFC after:	2 weeks
2013-07-13 04:25:03 +00:00
Gleb Smirnoff
eb1b1807af Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually
2012-12-05 08:04:20 +00:00