Commit Graph

165 Commits

Author SHA1 Message Date
Ravi Pokala
2a73c8f5e1 Decode the "LACP Fast Timeout" LAGG option flag
r286700 added the "lacp_fast_timeout" option to `ifconfig', but we forgot to
include the new option in the string used to decode the option bits. Add
"LACP_FAST_TIMO" to LAGG_OPT_BITS.

Also, s/LAGG_OPT_LACP_TIMEOUT/LAGG_OPT_LACP_FAST_TIMO/g , to be clearer that
the flag indicates "Fast Timeout" mode.

Reported by:	Greg Foster <gfoster at panasas dot com>
Reviewed by:	jpaetzel
MFC after:	1 week
Sponsored by:	Panasas
Differential Revision:	https://reviews.freebsd.org/D25239
2020-06-11 22:46:08 +00:00
Andrew Gallatin
bd673b9942 lagg: stop double-counting output errors and counting drops as errors
Before this change, lagg double-counted errors from lagg members, and counted
every drop by a lagg member as an error.  Eg, if lagg sent a packet, and the
underlying hardware driver dropped it, a counter would be incremented by both
lagg and the underlying driver.

This change attempts to fix that by incrementing lagg's counters only for
errors that do not come from underlying drivers.

Reviewed by:	hselasky, jhb
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D24331
2020-04-13 23:06:56 +00:00
Andrew Gallatin
98085bae8c make lacp's use_numa hashing aware of send tags
When I did the use_numa support, I missed the fact that there is
a separate hash function for send tag nic selection. So when
use_numa is enabled, ktls offload does not work properly, as it
does not reliably allocate a send tag on the proper egress nic
since different egress nics are selected for send-tag allocation
and packet transmit. To fix this, this change:

- refectors lacp_select_tx_port_by_hash() and
     lacp_select_tx_port() to make lacp_select_tx_port_by_hash()
     always called by lacp_select_tx_port()

-   pre-shifts flowids to convert them to hashes when calling lacp_select_tx_port_by_hash()

-   adds a numa_domain field to if_snd_tag_alloc_params

-   plumbs the numa domain into places where we allocate send tags

In testing with NIC TLS setup on a NUMA machine, I see thousands
of output errors before the change when enabling
kern.ipc.tls.ifnet.permitted=1. After the change, I see no
errors, and I see the NIC sysctl counters showing active TLS
offload sessions.

Reviewed by:	rrs, hselasky, jhb
Sponsored by:	Netflix
2020-03-09 13:44:51 +00:00
Pawel Biernacki
7029da5c36 Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE.  All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by:	kib (mentor, blanket)
Commented by:	kib, gallatin, melifaro
Differential Revision:	https://reviews.freebsd.org/D23718
2020-02-26 14:26:36 +00:00
Alexander Motin
84becee1ac Update route MTUs for bridge, lagg and vlan interfaces.
Those interfaces may implicitly change their MTU on addition of parent
interface in addition to normal SIOCSIFMTU ioctl path, where the route
MTUs are updated normally.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2020-01-22 20:36:45 +00:00
Gleb Smirnoff
2a4bd982d0 Introduce NET_EPOCH_CALL() macro and use it everywhere where we free
data based on the network epoch.   The macro reverses the argument
order of epoch_call(9) - first function, then its argument. NFC
2020-01-15 06:05:20 +00:00
Gleb Smirnoff
97168be809 Mechanically substitute assertion of in_epoch(net_epoch_preempt) to
NET_EPOCH_ASSERT(). NFC
2020-01-15 05:45:27 +00:00
Mark Johnston
c23df8eafa lagg: Further cleanup of the rr_limit option.
Add an option flag so that arbitrary updates to a lagg's configuration
do not clear sc_stride.  Preseve compatibility for old ifconfig
binaries.  Update ifconfig to use the new flag and improve the casting
used when parsing the option parameter.

Modify the RR transmit function to avoid locklessly reading sc_stride
twice.  Ensure that sc_stride is always 1 or greater.

Reviewed by:	hselasky
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23092
2020-01-09 14:58:41 +00:00
Mark Johnston
c104c2990d lagg: Clean up handling of the rr_limit option.
- Don't allow an unprivileged user to set the stride. [1]
- Only set the stride under the softc lock.
- Rename the internal fields to accurately reflect their use.  Keep
  ro_bkt to avoid changing the user API.
- Simplify the implementation.  The port index is just sc_seq / stride.
- Document rr_limit in ifconfig.8.

Reported by:	Ilja Van Sprundel <ivansprundel@ioactive.com> [1]
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22857
2019-12-22 21:56:47 +00:00
John Baldwin
b2e60773c6 Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets.  KTLS only supports
offload of TLS for transmitted data.  Key negotation must still be
performed in userland.  Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option.  All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.

Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type.  Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.

At present, rekeying is not supported though the in-kernel framework
should support rekeying.

KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer.  Each TLS frame is described by a single
ext_pgs mbuf.  The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.

KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.

Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame().  ktls_enqueue() is then
called to schedule TLS frames for encryption.  In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed.  For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().

A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue().  Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.

(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)

KTLS supports pluggable software encryption backends.  Internally,
Netflix uses proprietary pure-software backends.  This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames.  As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.

Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready().  At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.

ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation.  In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session.  TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted.  The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface.  If so, the packet is tagged
with the TLS send tag and sent to the interface.  The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation.  If the the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped.  In addition, a task is scheduled to refresh the TLS send
tag for the TLS session.  If a new TLS send tag cannot be allocated,
the connection is dropped.  If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag.  (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another.  As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)

ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8).  ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.

Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option.  They can also use this socket
option to toggle between software and ifnet TLS modes.

In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax.  However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.

Various sysctls and counters are available under the kern.ipc.tls
sysctl node.  The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default).  The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.

KTLS is enabled via the KERN_TLS kernel option.

This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.

Reviewed by:	gallatin, hselasky, rrs
Obtained from:	Netflix
Sponsored by:	Netflix, Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D21277
2019-08-27 00:01:56 +00:00
Randall Stewart
20abea6663 This adds the third step in getting BBR into the tree. BBR and
an updated rack depend on having access to the new
ratelimit api in this commit.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D20953
2019-08-01 14:17:31 +00:00
Mark Johnston
ce7fb386d8 Restore the comment removed in r348745.
LAGG_RLOCK() enters an epoch section, so the comment wasn't stale.

Reported by:	jhb
MFC with:	r348745
2019-06-06 17:20:35 +00:00
Mark Johnston
9995dfd364 Conditionalize an in_epoch() call on INVARIANTS.
Its result is only used to determine whether to perform further
INVARIANTS-only checks.  Remove a stale comment while here.

Submitted by:	Sebastian Huber <sebastian.huber@embedded-brains.de>
MFC after:	1 week
2019-06-06 16:22:29 +00:00
John Baldwin
fb3bc59600 Restructure mbuf send tags to provide stronger guarantees.
- Perform ifp mismatch checks (to determine if a send tag is allocated
  for a different ifp than the one the packet is being output on), in
  ip_output() and ip6_output().  This avoids sending packets with send
  tags to ifnet drivers that don't support send tags.

  Since we are now checking for ifp mismatches before invoking
  if_output, we can now try to allocate a new tag before invoking
  if_output sending the original packet on the new tag if allocation
  succeeds.

  To avoid code duplication for the fragment and unfragmented cases,
  add ip_output_send() and ip6_output_send() as wrappers around
  if_output and nd6_output_ifp, respectively.  All of the logic for
  setting send tags and dealing with send tag-related errors is done
  in these wrapper functions.

  For pseudo interfaces that wrap other network interfaces (vlan and
  lagg), wrapper send tags are now allocated so that ip*_output see
  the wrapper ifp as the ifp in the send tag.  The if_transmit
  routines rewrite the send tags after performing an ifp mismatch
  check.  If an ifp mismatch is detected, the transmit routines fail
  with EAGAIN.

- To provide clearer life cycle management of send tags, especially
  in the presence of vlan and lagg wrapper tags, add a reference count
  to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
  Provide a helper function (m_snd_tag_init()) for use by drivers
  supporting send tags.  m_snd_tag_init() takes care of the if_ref
  on the ifp meaning that code alloating send tags via if_snd_tag_alloc
  no longer has to manage that manually.  Similarly, m_snd_tag_rele
  drops the refcount on the ifp after invoking if_snd_tag_free when
  the last reference to a send tag is dropped.

  This also closes use after free races if there are pending packets in
  driver tx rings after the socket is closed (e.g. from tcpdrop).

  In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
  csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
  Drivers now also check this flag instead of checking snd_tag against
  NULL.  This avoids false positive matches when a forwarded packet
  has a non-NULL rcvif that was treated as a send tag.

- cxgbe was relying on snd_tag_free being called when the inp was
  detached so that it could kick the firmware to flush any pending
  work on the flow.  This is because the driver doesn't require ACK
  messages from the firmware for every request, but instead does a
  kind of manual interrupt coalescing by only setting a flag to
  request a completion on a subset of requests.  If all of the
  in-flight requests don't have the flag when the tag is detached from
  the inp, the flow might never return the credits.  The current
  snd_tag_free command issues a flush command to force the credits to
  return.  However, the credit return is what also frees the mbufs,
  and since those mbufs now hold references on the tag, this meant
  that snd_tag_free would never be called.

  To fix, explicitly drop the mbuf's reference on the snd tag when the
  mbuf is queued in the firmware work queue.  This means that once the
  inp's reference on the tag goes away and all in-flight mbufs have
  been queued to the firmware, tag's refcount will drop to zero and
  snd_tag_free will kick in and send the flush request.  Note that we
  need to avoid doing this in the middle of ethofld_tx(), so the
  driver grabs a temporary reference on the tag around that loop to
  defer the free to the end of the function in case it sends the last
  mbuf to the queue after the inp has dropped its reference on the
  tag.

- mlx5 preallocates send tags and was using the ifp pointer even when
  the send tag wasn't in use.  Explicitly use the ifp from other data
  structures instead.

- Sprinkle some assertions in various places to assert that received
  packets don't have a send tag, and that other places that overwrite
  rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.

Reviewed by:	gallatin, hselasky, rgrimes, ae
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20117
2019-05-24 22:30:40 +00:00
Andrew Gallatin
35961dce98 Select lacp egress ports based on NUMA domain
This change creates an array of port maps indexed by numa domain
for lacp port selection. If we have lacp interfaces in more than
one domain, then we select the egress port by indexing into the
numa port maps and picking a port on the appropriate numa domain.

This is behavior is controlled by the new ifconfig use_numa flag
and net.link.lagg.use_numa sysctl/tunable (both modeled after the
existing use_flowid), which default to enabled.

Reviewed by:	bz, hselasky, markj (and scottl, earlier version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20060
2019-05-03 14:43:21 +00:00
John Baldwin
841613dcdc Use a dedicated malloc type for lagg(4)'s structures.
Reviewed by:	gallatin
MFC after:	1 month
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19719
2019-03-28 21:00:54 +00:00
John Baldwin
2f59b04af1 Remove nested epochs from lagg(4).
lagg_bcast_start appeared to have a bug in that was using the last
lagg port structure after exiting the epoch that was keeping that
structure alive.  However, upon further inspection, the epoch was
already entered by the caller (lagg_transmit), so the epoch enter/exit
in lagg_bcast_start was actually unnecessary.

This commit generally removes uses of the net epoch via LAGG_RLOCK to
protect the list of ports when the list of ports was already protected
by an existing LAGG_RLOCK in a caller, or the LAGG_XLOCK.

It also adds a missing epoch enter/exit in lagg_snd_tag_alloc while
accessing the lagg port structures.  An ifp is still accessed via an
unsafe reference after the epoch is exited, but that is true in the
current code and will be fixed in a future change.

Reviewed by:	gallatin
MFC after:	1 month
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19718
2019-03-28 20:25:36 +00:00
Randall Stewart
fa91f84502 This commit adds the missing release mechanism for the
ratelimiting code. The two modules (lagg and vlan) did have
allocation routines, and even though they are indirect (and
vector down to the underlying interfaces) they both need to
have a free routine (that also vectors down to the actual interface).

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D19032
2019-02-13 14:57:59 +00:00
John Baldwin
829c56fc08 Don't set IFCAP_TXRTLMT during lagg_clone_create().
lagg_capabilities() will set the capability once interfaces supporting
the feature are added to the lagg.  Setting it on a lagg without any
interfaces is pointless as the if_snd_tag_alloc call will always fail
in that case.

Reviewed by:	hselasky, gallatin
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19040
2019-01-31 21:35:37 +00:00
Marcelo Araujo
fbd8c33022 Allow changing lagg(4) MTU.
Previously, changing the MTU would require destroying the lagg and
creating a new one. Now it is allowed to change the MTU of
the lagg interface and the MTU of the ports will be set to match.

If any port cannot set the new MTU, all ports are reverted to the original
MTU of the lagg. Additionally, when adding ports, the MTU of a port will be
automatically set to the MTU of the lagg. As always, the MTU of the lagg is
initially determined by the MTU of the first port added. If adding an
interface as a port for some reason fails, that interface is reverted to its
original MTU.

Submitted by:	Ryan Moeller <ryan@freqlabs.com>
Reviewed by:	mav
Relnotes:	Yes
Sponsored by:	iXsystems Inc.
Differential Revision:	https://reviews.freebsd.org/D17576
2018-10-30 09:53:57 +00:00
Jonathan T. Looney
13c6ba6d94 There are three places where we return from a function which entered an
epoch section without exiting that epoch section. This is bad for two
reasons: the epoch section won't exit, and we will leave the epoch tracker
from the stack on the epoch list.

Fix the epoch leak by making sure we exit epoch sections before returning.

Reviewed by:	ae, gallatin, mmacy
Approved by:	re (gjb, kib)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D17450
2018-10-09 13:26:06 +00:00
Andrew Gallatin
5ccac9f972 lagg: allow lacp to manage the link state
Lacp needs to manage the link state itself. Unlike other
lagg protocols, the ability of lacp to pass traffic
depends not only on the lagg members having link, but also
on the lacp protocol converging to a distributing state with the
link partner.

If we prematurely mark the link as up, then we will send a
gratuitous arp (via arp_handle_ifllchange()) before the lacp
interface is capable of passing traffic. When this happens,
the gratuitous arp is lost, and our link partner may cache
a stale mac address (eg, when the base mac address for the
lagg bundle changes, due to a BIOS change re-ordering NIC
unit numbers)

Reviewed by: jtl, hselasky
Sponsored by: Netflix
2018-08-13 14:13:25 +00:00
Andrew Turner
5f901c92a8 Use the new VNET_DEFINE_STATIC macro when we are defining static VNET
variables.

Reviewed by:	bz
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D16147
2018-07-24 16:35:52 +00:00
Matt Macy
6573d7580b epoch(9): allow preemptible epochs to compose
- Add tracker argument to preemptible epochs
- Inline epoch read path in kernel and tied modules
- Change in_epoch to take an epoch as argument
- Simplify tfb_tcp_do_segment to not take a ti_locked argument,
  there's no longer any benefit to dropping the pcbinfo lock
  and trying to do so just adds an error prone branchfest to
  these functions
- Remove cases of same function recursion on the epoch as
  recursing is no longer free.
- Remove the the TAILQ_ENTRY and epoch_section from struct
  thread as the tracker field is now stack or heap allocated
  as appropriate.

Tested by: pho and Limelight Networks
Reviewed by: kbowling at llnw dot com
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16066
2018-07-04 02:47:16 +00:00
Matt Macy
0f8d79d977 CK: update consumers to use CK macros across the board
r334189 changed the fields to have names distinct from those in queue.h
in order to expose the oversights as compile time errors
2018-05-24 23:21:23 +00:00
Mark Johnston
db5a36bddf Simplify lagg_input().
No functional change intended.

MFC after:	2 weeks
2018-05-22 15:35:38 +00:00
Matt Macy
46d0f824be net: fix set but not used 2018-05-19 05:27:49 +00:00
Matt Macy
d7c5a620e2 ifnet: Replace if_addr_lock rwlock with epoch + mutex
Run on LLNW canaries and tested by pho@

gallatin:
Using a 14-core, 28-HTT single socket E5-2697 v3 with a 40GbE MLX5
based ConnectX 4-LX NIC, I see an almost 12% improvement in received
packet rate, and a larger improvement in bytes delivered all the way
to userspace.

When the host receiving 64 streams of netperf -H $DUT -t UDP_STREAM -- -m 1,
I see, using nstat -I mce0 1 before the patch:

InMpps OMpps  InGbs  OGbs err TCP Est %CPU syscalls csw     irq GBfree
4.98   0.00   4.42   0.00 4235592     33   83.80 4720653 2149771   1235 247.32
4.73   0.00   4.20   0.00 4025260     33   82.99 4724900 2139833   1204 247.32
4.72   0.00   4.20   0.00 4035252     33   82.14 4719162 2132023   1264 247.32
4.71   0.00   4.21   0.00 4073206     33   83.68 4744973 2123317   1347 247.32
4.72   0.00   4.21   0.00 4061118     33   80.82 4713615 2188091   1490 247.32
4.72   0.00   4.21   0.00 4051675     33   85.29 4727399 2109011   1205 247.32
4.73   0.00   4.21   0.00 4039056     33   84.65 4724735 2102603   1053 247.32

After the patch

InMpps OMpps  InGbs  OGbs err TCP Est %CPU syscalls csw     irq GBfree
5.43   0.00   4.20   0.00 3313143     33   84.96 5434214 1900162   2656 245.51
5.43   0.00   4.20   0.00 3308527     33   85.24 5439695 1809382   2521 245.51
5.42   0.00   4.19   0.00 3316778     33   87.54 5416028 1805835   2256 245.51
5.42   0.00   4.19   0.00 3317673     33   90.44 5426044 1763056   2332 245.51
5.42   0.00   4.19   0.00 3314839     33   88.11 5435732 1792218   2499 245.52
5.44   0.00   4.19   0.00 3293228     33   91.84 5426301 1668597   2121 245.52

Similarly, netperf reports 230Mb/s before the patch, and 270Mb/s after the patch

Reviewed by:	gallatin
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15366
2018-05-18 20:13:34 +00:00
Matt Macy
70398c2f86 epoch(9): Make epochs non-preemptible by default
There are risks associated with waiting on a preemptible epoch section.
Change the name to make them not be the default and document the issue
under CAVEATS.

Reported by:	markj
2018-05-18 17:29:43 +00:00
Stephen Hurd
99031b8f7d Replace rmlock with epoch in lagg
Use the new epoch based reclamation API. Now the hot paths will not
block at all, and the sx lock is used for the softc data.  This fixes LORs
reported where the rwlock was obtained when the sxlock was held.

Submitted by:	mmacy
Reported by:	Harry Schmalzbauer <freebsd@omnilan.de>
Reviewed by:	sbruno
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15355
2018-05-14 20:06:49 +00:00
Steven Hartland
fd3bb7aa46 Disabled the use of flowid for lagg by default
Disabled the use of RSS hash from the network card aka flowid for
lagg(4) interfaces by default as it's currently incompatible with
the lacp and loadbalance protocols.

The incompatibility is due to the fact that the flowid isn't know
for the first packet of a new outbound stream which can result in
the hash calculation method changing and hence a stream being
incorrectly split across multiple interfaces during normal
operation.

This can be re-enabled by setting the following in loader.conf:
net.link.lagg.default_use_flowid="1"

Discussed with: kmacy
Sponsored by:	Multiplay
2018-01-04 20:05:47 +00:00
Sean Bruno
51352d9d81 Don't hold the RM lock during lagg_proto_addport() to avoid an LOR.
Submitted by:	Kevin Bowling <kevin.bowling@kev009.com>
Reviewed by:	mav
MFC after:	1 week
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D11711
2017-07-25 14:41:50 +00:00
Alexander Motin
41cf0d54a2 Call VLAN_CAPABILITIES() when LAGG capabilities change.
This makes VLAN on top of LAGG to expose proper capabilities if they are
changed after creation.

MFC after:	1 week
2017-05-26 22:22:48 +00:00
Alexander Motin
8403ab7919 Improve applying unified capabilities to the lagg ports.
Some NICs have some capabilities dependent, so that disabling one require
disabling some other (TXCSUM/RXCSUM on em).  This code tries to reach the
consensus more insistently.

PR:		219453
MFC after:	1 week
2017-05-26 20:15:33 +00:00
Alexander Motin
e3d90506c4 Remove some code, dead from the day one. 2017-05-25 23:19:09 +00:00
Alexander Motin
bbfc32a6b5 Relax r317696 locking to not drain taskqueue under the lock.
MFC after:	11 days
2017-05-05 16:51:53 +00:00
Alexander Motin
e83177fba8 Fix r317696 build without debug.
MFC after:	2 weeks
2017-05-03 02:54:11 +00:00
Alexander Motin
2f86d4b001 Introduce sleepable locks into if_lagg.
Before this change if_lagg was using nonsleepable rmlocks to protect its
internal state.  This patch introduces another sx lock to protect code
paths that require sleeping, while still uses old rmlock to protect hot
nonsleepable data paths.

This change allows to remove taskqueue decoupling used before to change
interface addresses without holding the lock.  Instead it uses sx lock to
protect direct if_ioctl() calls.

As another bonus, the new code synchronizes enabled capabilities of member
interfaces, and allows to control them with ifconfig laggX, that was
impossible before.  This part should fix interoperation with if_bridge,
that may need to disable some capabilities, such as TXCSUM or LRO, to allow
bridging with noncapable interfaces.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D10514
2017-05-02 19:09:11 +00:00
Alexander Motin
1e04441a9d Remove unneeded conditions.
MFC after:	2 weeks
2017-04-22 08:38:49 +00:00
Alexander Motin
b98b5ae8ec Add interface reference counting to if_lagg.
Using plain ifunit() looks like request for troubles.

MFC after:	2 weeks
2017-04-21 13:45:01 +00:00
Luiz Otavio O Souza
13157b2baf Do not update the lagg link layer address when destroying a lagg clone.
This would enqueue an event to send the gratuitous arp on a dying lagg
interface without any physical ports attached to it.

Apart from that, the taskqueue_drain() on lagg_clone_destroy() runs too
late, when the ifp data structure is already freed.  Fix that too.

Obtained from:	pfSense
MFC after:	2 weeks
Sponsored by:	Rubicon Communications, LLC (Netgate)
2017-01-30 03:04:33 +00:00
Hans Petter Selasky
f3e7afe2d7 Implement kernel support for hardware rate limited sockets.
- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.

- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API support rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.

- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().

- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.

- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.

- How rate limiting works:

1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.

2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the rate
limited traffic will be rate limited.

3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN which causes the ip_output() to release and clear the send
tag. Upon next ip_output() a new "snd_tag" will be tried allocated.

4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.

Reviewed by:		wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision:	https://reviews.freebsd.org/D3687
Sponsored by:		Mellanox Technologies
MFC after:		3 months
2017-01-18 13:31:17 +00:00
Alan Somers
8a73c85db3 Remove stray debugging code from r310180
Reported by:	rstone
Pointy hat to:	asomers
MFC after:	3 weeks
X-MFC-with:	310180
Sponsored by:	Spectra Logic Corp
2016-12-20 15:45:53 +00:00
Alan Somers
d9fa2d67eb Fix panic during lagg destruction with simultaneous status check
If you run "ifconfig lagg0 destroy" and "ifconfig lagg0" at the same time a
page fault may result. The first process will destroy ifp->if_lagg in
lagg_clone_destroy (called by if_clone_destroy). Then the second process
will observe that ifp->if_lagg is NULL at the top of lagg_port_ioctl and
goto fallback: where it will promptly dereference ifp->if_lagg anyway.

The solution is to repeat the NULL check for ifp->if_lagg

MFC after:	4 weeks
Sponsored by:	Spectra Logic Corp
Differential Revision:	https://reviews.freebsd.org/D8512
2016-12-16 22:39:30 +00:00
Bjoern A. Zeeb
89856f7e2d Get closer to a VIMAGE network stack teardown from top to bottom rather
than removing the network interfaces first. This change is rather larger
and convoluted as the ordering requirements cannot be separated.

Move the pfil(9) framework to SI_SUB_PROTO_PFIL, move Firewalls and
related modules to their own SI_SUB_PROTO_FIREWALL.
Move initialization of "physical" interfaces to SI_SUB_DRIVERS,
move virtual (cloned) interfaces to SI_SUB_PSEUDO.
Move Multicast to SI_SUB_PROTO_MC.

Re-work parts of multicast initialisation and teardown, not taking the
huge amount of memory into account if used as a module yet.

For interface teardown we try to do as many of them as we can on
SI_SUB_INIT_IF, but for some this makes no sense, e.g., when tunnelling
over a higher layer protocol such as IP. In that case the interface
has to go along (or before) the higher layer protocol is shutdown.

Kernel hhooks need to go last on teardown as they may be used at various
higher layers and we cannot remove them before we cleaned up the higher
layers.

For interface teardown there are multiple paths:
(a) a cloned interface is destroyed (inside a VIMAGE or in the base system),
(b) any interface is moved from a virtual network stack to a different
network stack ("vmove"), or (c) a virtual network stack is being shut down.
All code paths go through if_detach_internal() where we, depending on the
vmove flag or the vnet state, make a decision on how much to shut down;
in case we are destroying a VNET the individual protocol layers will
cleanup their own parts thus we cannot do so again for each interface as
we end up with, e.g., double-frees, destroying locks twice or acquiring
already destroyed locks.
When calling into protocol cleanups we equally have to tell them
whether they need to detach upper layer protocols ("ulp") or not
(e.g., in6_ifdetach()).

Provide or enahnce helper functions to do proper cleanup at a protocol
rather than at an interface level.

Approved by:		re (hrs)
Obtained from:		projects/vnet
Reviewed by:		gnn, jhb
Sponsored by:		The FreeBSD Foundation
MFC after:		2 weeks
Differential Revision:	https://reviews.freebsd.org/D6747
2016-06-21 13:48:49 +00:00
Pedro F. Giffuni
a4641f4eaa sys/net*: minor spelling fixes.
No functional change.
2016-05-03 18:05:43 +00:00
Ravi Pokala
729a4cff7e Revert accidental submit of WIP as part of r297609
Pointyhat to:	rpokala
2016-04-06 04:58:20 +00:00
Ravi Pokala
06152bf0e1 Storage Controller Interface driver - typo in unimplemented macro in
scic_sds_controller_registers.h

s/contoller/controller/

PR:		207336
Submitted by:	Tony Narlock <tony @ git-pull.com>
2016-04-06 04:50:28 +00:00
Marcelo Araujo
d931334bd4 Fix regression introduced on 272446r.
lagg(4) supports the protocol none, where it disables any traffic without
disabling the lagg(4) interface itself.

PR:		206921
Submitted by:	Pushkar Kothavade <pushkarbk@gmail.com>
Reviewed by:	rpokala
Approved by:	bapt (mentor)
MFC after:	3 weeks
Sponsored by:	gandi.net
Differential Revision:	https://reviews.freebsd.org/D5076
2016-02-19 06:35:53 +00:00
Marcelo Araujo
d62edc5eb5 Add an IOCTL rr_limit to let users fine tuning the number of packets to be
sent using roundrobin protocol and set a better granularity and distribution
among the interfaces. Tuning the number of packages sent by interface can
increase throughput and reduce unordered packets as well as reduce SACK.

Example of usage:
# ifconfig bge0 up
# ifconfig bge1 up
# ifconfig lagg0 create
# ifconfig lagg0 laggproto roundrobin laggport bge0 laggport bge1 \
	192.168.1.1 netmask 255.255.255.0
# ifconfig lagg0 rr_limit 500

Reviewed by:	thompsa, glebius, adrian (old patch)
Approved by:	bapt (mentor)
Relnotes:	Yes
Differential Revision:	https://reviews.freebsd.org/D540
2016-01-23 04:18:44 +00:00