Commit Graph

219 Commits

Author SHA1 Message Date
Adrian Chadd
b2bdc62a95 Refactor / restructure the RSS code into generic, IPv4 and IPv6 specific
bits.

The motivation here is to eventually teach netisr and potentially
other networking subsystems a bit more about how RSS work queues / buckets
are configured so things have a hope of auto-configuring in the future.

* net/rss_config.[ch] takes care of the generic bits for doing
  configuration, hash function selection, etc;
* topelitz.[ch] is now in net/ rather than netinet/;
* (and would be in libkern if it didn't directly include RSS_KEYSIZE;
  that's a later thing to fix up.)
* netinet/in_rss.[ch] now just contains the IPv4 specific methods;
* and netinet/in6_rss.[ch] now just contains the IPv6 specific methods.

This should have no functional impact on anyone currently using
the RSS support.

Differential Revision:	D1383
Reviewed by:	gnn, jfv (intel driver bits)
2015-01-18 18:06:40 +00:00
Jack F Vogel
4da1bbcda5 Revert r275136, it was not approved, it was sloppy, if a feature
like this is needed please resubmit for Intel's approval.
2014-12-02 23:02:57 +00:00
Hans Petter Selasky
c25290420e Start process of removing the use of the deprecated "M_FLOWID" flag
from the FreeBSD network code. The flag is still kept around in the
"sys/mbuf.h" header file, but does no longer have any users. Instead
the "m_pkthdr.rsstype" field in the mbuf structure is now used to
decide the meaning of the "m_pkthdr.flowid" field. To modify the
"m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX"
macros as defined in the "sys/mbuf.h" header file.

This patch introduces new behaviour in the transmit direction.
Previously network drivers checked if "M_FLOWID" was set in "m_flags"
before using the "m_pkthdr.flowid" field. This check has now now been
replaced by checking if "M_HASHTYPE_GET(m)" is different from
"M_HASHTYPE_NONE". In the future more hashtypes will be added, for
example hashtypes for hardware dedicated flows.

"M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is
valid and has no particular type. This change removes the need for an
"if" statement in TCP transmit code checking for the presence of a
valid flowid value. The "if" statement mentioned above is now a direct
variable assignment which is then later checked by the respective
network drivers like before.

Additional notes:
- The SCTP code changes will be committed as a separate patch.
- Removal of the "M_FLOWID" flag will also be done separately.
- The FreeBSD version has been bumped.

MFC after:	1 month
Sponsored by:	Mellanox Technologies
2014-12-01 11:45:24 +00:00
Alfred Perlstein
56c14bca7e Make igb and ixgbe check tunables at probe time.
This allows one to make a kernel module to tune the
number of queues before the driver loads.

This is needed so that a module at SI_SUB_CPU can set
tunables for these drivers to take.  Otherwise getenv
is called too early by the TUNABLE macros.

Reviewed by: smh
Phabric: https://reviews.freebsd.org/D1149
2014-11-26 20:19:36 +00:00
Alexander V. Chernikov
86e10a0c6a Fix r273112: do not turn DROP_EN by default.
Due to adapter->hw.fc.requested_mode is filled with default value
after ixgbe_initialize_receive_units(), this leads to enabling
DROP_EN in most cases.

Tested by:	ae
MFC after:	1 week
2014-11-16 18:08:00 +00:00
Hans Petter Selasky
f0188618f2 Fix multiple incorrect SYSCTL arguments in the kernel:
- Wrong integer type was specified.

- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.

- Logical OR where binary OR was expected.

- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.

- Properly assert the the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, hence
there is no easy way to assert types in the C language outside a
C-function.

- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.

- Updated "EXAMPLES" section in SYSCTL manual page.

MFC after:	3 days
Sponsored by:	Mellanox Technologies
2014-10-21 07:31:21 +00:00
Adrian Chadd
bba61d1a89 Set the DROP_EN bit before the RX queue is brought up and active.
He noticed issues setting this bit in SRRCTL after the queue was up,
so doing it from the sysctl handler isn't enough and may not actually
work correctly.

This commit doesn't remove the sysctl path or try to change its
behaviour.  I'll talk with others about how to finish fixing that
before I tackle that.

PR:		kern/194311
Submitted by:	luigi
MFC after:	3 days
Sponsored by:	Norse Corp, Inc
2014-10-15 01:22:56 +00:00
Gleb Smirnoff
058a38660e Convert to if_get_counter(). 2014-09-28 07:29:45 +00:00
Gleb Smirnoff
5fe12776aa Mechanically switch ixv(4) to if_inc_counter(). 2014-09-28 07:19:32 +00:00
Adrian Chadd
e45d876dd7 The error bits are not valid with EOP=0; so intermediary fragments should
not be discarded.

Submitted by:	Marc De La Gueronniere <mdelagueronniere@verisign.com>
MFC after:	1 week
Sponsored by:	Verisign, Inc.
2014-09-15 20:54:12 +00:00
Adrian Chadd
5894690d0d Fix a double-free of mbufs in rx_ixgbe_discard().
fmp->buf at the free point is already part of the chain being freed,
so double-freeing is counter-productive.

Submitted by:	Marc De La Gueronniere <mdelagueronniere@verisign.com>
MFC after:	1 week
Sponsored by:	Verisign, Inc.
2014-09-15 20:50:26 +00:00
Christian Brueffer
380064c8dc Use the right constants in comparisons. This is currently a nop, as
MIN_RXD == MIN_TXD and MAX_RXD == MAX_TXD.

Reviewed by:	Eric Joyner @ Intel
MFC after:	1 week
2014-09-08 19:24:25 +00:00
Gleb Smirnoff
1bffa9511f Use define from if_var.h to access a field inside struct if_data,
that resides in struct ifnet.

Sponsored by:	Nginx, Inc.
2014-08-30 19:55:54 +00:00
Alexander V. Chernikov
ea463f2dc0 * Add SIOCGI2C driver ioctl used to retrieve i2c info.
* Convert ixgbe to use this ioctl
* Convert ifconfig to use generic i2c handler for  "ix" interfaces.

Approved by:	Eric Joyner (ixgbe part)
MFC after:	2 weeks
Sponsored by:	Yandex LLC
2014-08-29 18:02:58 +00:00
Luigi Rizzo
4bf50f18eb Update to the current version of netmap.
Mostly bugfixes or features developed in the past 6 months,
so this is a 10.1 candidate.

Basically no user API changes (some bugfixes in sys/net/netmap_user.h).

In detail:

1. netmap support for virtio-net, including in netmap mode.
  Under bhyve and with a netmap backend [2] we reach over 1Mpps
  with standard APIs (e.g. libpcap), and 5-8 Mpps in netmap mode.

2. (kernel) add support for multiple memory allocators, so we can
  better partition physical and virtual interfaces giving access
  to separate users. The most visible effect is one additional
  argument to the various kernel functions to compute buffer
  addresses. All netmap-supported drivers are affected, but changes
  are mechanical and trivial

3. (kernel) simplify the prototype for *txsync() and *rxsync()
  driver methods. All netmap drivers affected, changes mostly mechanical.

4. add support for netmap-monitor ports. Think of it as a mirroring
  port on a physical switch: a netmap monitor port replicates traffic
  present on the main port. Restrictions apply. Drive carefully.

5. if_lem.c: support for various paravirtualization features,
  experimental and disabled by default.
  Most of these are described in our ANCS'13 paper [1].
  Paravirtualized support in netmap mode is new, and beats the
  numbers in the paper by a large factor (under qemu-kvm,
  we measured gues-host throughput up to 10-12 Mpps).

A lot of refactoring and additional documentation in the files
in sys/dev/netmap, but apart from #2 and #3 above, almost nothing
of this stuff is visible to other kernel parts.

Example programs in tools/tools/netmap have been updated with bugfixes
and to support more of the existing features.

This is meant to go into 10.1 so we plan an MFC before the Aug.22 deadline.

A lot of this code has been contributed by my colleagues at UNIPI,
including Giuseppe Lettieri, Vincenzo Maffione, Stefano Garzarella.

MFC after:	3 days.
2014-08-16 15:00:01 +00:00
Steven Hartland
d7b5bae512 Renamed hw.ixgbe.unsupported_sfp -> hw.ix.unsupported_sfp
This now matches all other ixgbe sysctl / tunables.

Sponsored by:	Multiplay
2014-08-14 13:25:05 +00:00
Adrian Chadd
ce44e5f776 Add the UDP hash -> RSS mbuf hash type for the ixgbe(4) driver. 2014-07-20 08:43:53 +00:00
Adrian Chadd
e3af537e75 Teach ixgbe(4) about rss_gethashconfig().
If RSS is enabled, ixgbe(4) will query the RSS API for the types of hashes
which should be used.  It'll then only enable hashes that are exposed via
the RSS layer.

This way it won't try to do things like enable UDP hashing if RSS explicitly
states that it isn't supported in lookups.

Tested:

* 82599EB ixgbe(4) NIC
2014-07-20 07:45:48 +00:00
Adrian Chadd
e965b0dcd1 Disable the ixgbe(4) UDP 4-tuple hashing for the time being.
A mix of fragmented and non-fragmented UDP in a single stream will end up
being hashed differently, resulting in out-of-order behaviour in the receive
path.

This was done in the linux e1000 driver in 2011.

Discussed with:	jfv
2014-07-20 07:43:41 +00:00
Adrian Chadd
c64a6bc62d Correctly program the RSS redirection table entries.
Without this, the RSS bucket assignments aren't correct - they're
DCBA instead of ABCD in each DWORD.

Tested:	82599EB ixgbe(4), TCP and UDP RSS
2014-07-20 04:11:18 +00:00
Hiren Panchasara
2814ea66e3 Fix a typo.
PR:		191898
Submitted by:	vsjcfm@gmail.com
2014-07-17 06:21:58 +00:00
Adrian Chadd
8c0d2adf3f Initialise these variables so gcc doesn't complain.
Submitted by:	luigi
2014-06-30 23:34:36 +00:00
Adrian Chadd
7063e348ab Add initial RSS awareness to the ixgbe(4) driver.
The ixgbe(4) hardware is capable of RSS hashing RX packets and doing RSS
queue selection for up to 8 queues.

However, even if multi-queue is enabled for ixgbe(4), the RX path doesn't use
the RSS flowid from the received descriptor.  It just uses the MSIX queue id.

This patch does a handful of things if RSS is enabled:

* Instead of using a random key at boot, fetch the RSS key from the RSS code
  and program that in to the RSS redirection table.

  That whole chunk of code should be double checked for endian correctness.

* Use the RSS queue mapping to CPU ID to figure out where to thread pin
  the RX swi thread and the taskqueue threads for each queue.

* The software queue is now really an "RSS bucket".

* When programming the RSS indirection table, use the RSS code to
  figure out which RSS bucket each slot in the indirection table maps
  to.

* When transmitting, use the flowid RSS mapping if the mbuf has
  an RSS aware hash.  The existing method wasn't guaranteed to align
  correctly with the destination RSS bucket (and thus CPU ID.)

This code warns if the number of RSS buckets isn't the same as the
automatically configured number of hardware queues.  The administrator
will have to tweak one of them for better performance.

There's currently no way to re-balance the RSS indirection table after
startup.  I'll worry about that later.

Additionally, it may be worthwhile to always use the full 32 bit flowid if
multi-queue is enabled.  It'll make things like lagg(4) behave better with
respect to traffic distribution.
2014-06-30 04:38:29 +00:00
Hans Petter Selasky
af3b2549c4 Pull in r267961 and r267973 again. Fix for issues reported will follow. 2014-06-28 03:56:17 +00:00
Glen Barber
37a107a407 Revert r267961, r267973:
These changes prevent sysctl(8) from returning proper output,
such as:

 1) no output from sysctl(8)
 2) erroneously returning ENOMEM with tools like truss(1)
    or uname(1)
 truss: can not get etype: Cannot allocate memory
2014-06-27 22:05:21 +00:00
Hans Petter Selasky
3da1cf1e88 Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
John Baldwin
46e89834dc - Don't compare bus_dma map pointers for static DMA allocations against
NULL to determine if bus_dmamap_unload() or bus_dmamem_free() should be
  called.  Instead, check the associated bus and virtual addresses.
- Don't clear static DMA maps to NULL.

Reviewed by:	jfv
2014-06-12 11:15:19 +00:00
Luigi Rizzo
c7156fe92f make sure if_transmit returns 0 if the mbuf is enqueued.
ixgbe/ixv.c still needs a similar fix but it takes a little
more restructuring of the code.

MFC after:	3 days
2014-06-06 20:49:56 +00:00
Gleb Smirnoff
b245f96c44 Since 32-bit if_baudrate isn't enough to describe a baud rate of a 10 Gbit
interface, in the r241616 a crutch was provided. It didn't work well, and
finally we decided that it is time to break ABI and simply make if_baudrate
a 64-bit value. Meanwhile, the entire struct if_data was reviewed.

o Remove the if_baudrate_pf crutch.

o Make all fields of struct if_data fixed machine independent size. The
  notion of data (packet counters, etc) are by no means MD. And it is a
  bug that on amd64 we've got a 64-bit counters, while on i386 32-bit,
  which at modern speeds overflow within a second.

  This also removes quite a lot of COMPAT_FREEBSD32 code.

o Give 16 bit for the ifi_datalen field. This field was provided to
  make future changes to if_data less ABI breaking. Unfortunately the
  8 bit size of it had effectively limited sizeof if_data to 256 bytes.

o Give 32 bits to ifi_mtu and ifi_metric.
o Give 64 bits to the rest of fields, since they are counters.

__FreeBSD_version bumped.

Discussed with:	emax
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-03-13 03:42:24 +00:00
Luigi Rizzo
17885a7bfd It is 2014 and we have a new version of netmap.
Most relevant features:

- netmap emulation on any NIC, even those without native netmap support.

  On the ixgbe we have measured about 4Mpps/core/queue in this mode,
  which is still a lot more than with sockets/bpf.

- seamless interconnection of VALE switch, NICs and host stack.

  If you disable accelerations on your NIC (say em0)

        ifconfig em0 -txcsum -txcsum

  you can use the VALE switch to connect the NIC and the host stack:

        vale-ctl -h valeXX:em0

  allowing sharing the NIC with other netmap clients.

- THE USER API HAS SLIGHTLY CHANGED (head/cur/tail pointers
  instead of pointers/count as before). This was unavoidable to support,
  in the future, multiple threads operating on the same rings.
  Netmap clients require very small source code changes to compile again.
      On the plus side, the new API should be easier to understand
  and the internals are a lot simpler.

The manual page has been updated extensively to reflect the current
features and give some examples.

This is the result of work of several people including Giuseppe Lettieri,
Vincenzo Maffione, Michio Honda and myself, and has been financially
supported by EU projects CHANGE and OPENLAB, from NetApp University
Research Fund, NEC, and of course the Universita` di Pisa.
2014-01-06 12:53:15 +00:00
Gleb Smirnoff
f56831a217 Fix build broken in r259644.
Submitted by:	tuexen
Pointy hat to:	glebius
2013-12-20 13:18:50 +00:00
Gleb Smirnoff
46bf53de69 ixgbe(4) takes packet counters from hardware in ixgbe_update_stats_counters(),
so we don't need to do a per packet increment, which trashes cache line.

Submitted by:	oleg
2013-12-20 10:57:47 +00:00
Oleg Bulyzhin
ba36d317f6 - Fix link loss on vlan reconfiguration.
- Fix issues with 'vlanhwfilter'.

MFC after:	1 week
Silence from:	jfv 5 weeks
2013-11-05 09:46:01 +00:00
Luigi Rizzo
ce3ee1e7c4 update to the latest netmap snapshot.
This includes the following:
- use separate memory regions for VALE ports
- locking fixes
- some simplifications in the NIC-specific routines
- performance improvements for the VALE switch
- some new features in the pkt-gen test program
- documentation updates

There are small API changes that require programs to be recompiled
(NETMAP_API has been bumped so you will detect old binaries at runtime).

In particular:
- struct netmap_slot now is 16 bytes to support an extra pointer,
  which may save one data copy when using VALE ports or VMs;
- the struct netmap_if has two extra fields;

MFC after:	3 days
2013-11-01 21:21:14 +00:00
Gleb Smirnoff
76039bc84f The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-26 17:58:36 +00:00
Gleb Smirnoff
4cdc1f5421 There are some high performance NICs that count statistics in hardware,
and there are ifnets, that do that via counter(9). Provide a flag that
would skip cache line trashing '+=' operation in ether_input().

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
Reviewed by:	melifaro, adrian
Approved by:	re (marius)
2013-10-09 19:04:40 +00:00
Hiren Panchasara
5b9d734b08 Expose system level ixgbe sysctls.
Device level sysctls are already exposed as dev.ix.<device>

Fixing the case where number of queues for igb is auto-tuned and
hw.igb.num_queues does not return current/updated value.

Reviewed by:	jfv
Approved by:	re (delphij)
MFC after:	2 weeks
2013-10-05 19:17:56 +00:00
Andre Oppermann
1b4381afbb Restructure the mbuf pkthdr to make it fit for upcoming capabilities and
features.  The changes in particular are:

o Remove rarely used "header" pointer and replace it with a 64bit protocol/
  layer specific union PH_loc for local use.  Protocols can flexibly overlay
  their own 8 to 64 bit fields to store information while the packet is
  worked on.

o Mechanically convert IP reassembly, IGMP/MLD and ATM to use pkthdr.PH_loc
  instead of pkthdr.header.

o Extend csum_flags to 64bits to allow for additional future offload
  information to be carried (e.g. iSCSI, IPsec offload, and others).

o Move the RSS hash type enumerator from abusing m_flags to its own 8bit
  rsstype field.  Adjust accessor macros.

o Add cosqos field to store Class of Service / Quality of Service information
  with the packet.  It is not yet supported in any drivers but allows us to
  get on par with Cisco/Juniper in routing applications (plus MPLS QoS) with
  a modernized ALTQ.

o Add four 8 bit fields l[2-5]hlen to store the relative header offsets
  from the start of the packet.  This is important for various offload
  capabilities and to relieve the drivers from having to parse the packet
  and protocol headers to find out location of checksums and other
  information.  Header parsing in drivers is a lot of copy-paste and
  unhandled corner cases which we want to avoid.

o Add another flexible 64bit union to map various additional persistent
  packet information, like ether_vtag, tso_segsz and csum fields.
  Depending on the csum_flags settings some fields may have different usage
  making it very flexible and adaptable to future capabilities.

o Restructure the CSUM flags to better signify their outbound (down the
  stack) and inbound (up the stack) use.  The CSUM flags used to be a bit
  chaotic and rather poorly documented leading to incorrect use in many
  places.  Bring clarity into their use through better naming.
  Compatibility mappings are provided to preserve the API.  The drivers
  can be corrected one by one and MFC'd without issue.

o The size of pkthdr stays the same at 48/56bytes (32/64bit architectures).

Sponsored by:	The FreeBSD Foundation
2013-08-24 19:51:18 +00:00
Scott Long
c68534f1d5 Update PCI drivers to no longer look at the MEMIO-enabled bit in the PCI
command register.  The lazy BAR allocation code in FreeBSD sometimes
disables this bit when it detects a range conflict, and will re-enable
it on demand when a driver allocates the BAR.  Thus, the bit is no longer
a reliable indication of capability, and should not be checked.  This
results in the elimination of a lot of code from drivers, and also gives
the opportunity to simplify a lot of drivers to use a helper API to set
the busmaster enable bit.

This changes fixes some recent reports of disk controllers and their
associated drives/enclosures disappearing during boot.

Submitted by:	jhb
Reviewed by:	jfv, marius, achadd, achim
MFC after:	1 day
2013-08-12 23:30:01 +00:00
Jack F Vogel
4dc63104ae Improve the MSIX setup code in the drivers, thanks to Marius for
the changes. Make sure that pci_alloc_msix() does give us the vectors
we need and fall back to MSI when it doesn't, also release any that
were allocated when insufficient.

MFC after: 3 days
2013-08-12 22:54:38 +00:00
Jack F Vogel
d0913b7f25 Make the various driver MSIX setup routines fallback to MSI more
gracefully. This change was suggested by Marius Strobl, thank you.

PR: kern/181016
MFC after: ASAP
2013-08-06 21:01:38 +00:00
Jack F Vogel
7301d64aba Correct a fat-finger in the last delta.
MFC after: ASAP
2013-08-05 16:16:50 +00:00
Jack F Vogel
cbe75ae8f5 A number of important fixes:
- mbuf reused after an RX_COPY optimized operation can sometimes have
    a bogus cached address, resulting in TCP hangs. Add critical save points
    to the cached address. Thanks to Michael and the team at Verisign for
    finding this problem.
  - A couple more spots where the rxbuf->flags member should be cleared just
    to be sure no incorrect RX_COPY state is left around. Thanks to Adrian
    for tracking these down.
  - Remove the rearm_queues function from the driver, this was found to be
    responsible for some out-of-order packets by Verisign, and was always a
    bandaid, with the other fixes in this delta the bandaid can finally be
    removed.
  - In the other/link interrupt handler the entire state of the EICS register
    was being writen back into EICR (which clears causes and thus re-enables
    those interrupts), this was wrong, so now mask off the queue portion of
    the register value, so we only clear the other/link interrupt we intend.
    Marc from Verisign found this.
  - Make the SFP+ unsupported option tuneable now, by customer request.
  - Finally, just a couple of minor DEBUG string fixes.

I want to call out and thank all the participants in the 10G community/Intel
calls for helping track down these problems and make the driver better for
everyone!

MFC after:	3 days, these are critical fixes for 9.2!
2013-08-01 20:10:16 +00:00
Jack F Vogel
bae4b87e8a Opps, need to change the VF code as well.
MFC after:	ASAP
2013-07-12 21:21:15 +00:00
Jack F Vogel
ee738eea01 Remove the conditional define around the option headers,
when building the driver as a module the result of the present
system results in INET and INET6 being undefined, and will cause
the panic in ixgbe_tso_setup(). The Makefile in the module directory
now renders the conditional in the source unnecessary and wrong.

MFC after: ASAP - the panic as a module must not get into 9.2
2013-07-12 21:14:42 +00:00
Jack F Vogel
3f80cc03fd Fix my last commit, flags rather than flag... duh.
MFC after: 2 days
2013-07-11 03:44:06 +00:00
Jack F Vogel
804d70535a Fix to a panic found internally, bad pointer during rxeof
processing. Thanks for John Baldwin for catching this. Not
clearing the flag member of the rxbuf could result in a NULL
mbuf pointer being used.

MFC after:	2 days (this needs to get into 9.2!)
2013-07-10 23:14:24 +00:00
Jack F Vogel
fd75b91d13 Add quad port probe support, this gives the admin proper information about the slot
(which should be a PCIE Gen 3 slot for this adapter) by looking back thru the PCI
parent devices to the slot device.

The fix above also corrects the bandwidth display to GT/s rather than the
incorrect Gb/s

Next, allow the use of ALTQ if you select the compile option IXGBE_LEGACY_TX.

Allow the use of 'unsupported' optic modules by a compile option as well.

Add a phy reset capability into the stop code, this is so a static configured
driver will still behave properly when taken down (not being able to unload it).

This revision synchronizes the shared code with Intel internal current code,
and note that it now includes DCB supporting code, this was necessitated by
some internal changes with the code, but it also will provide the opportunity
to develop this feature in the core driver down the road.

I have edited the README to get rid of some of the worse anachronisms in it
as well, its by no means as robust as I might wish at this point however.

Oh, I also have included some conditional stuff in the code so it will be
compatible in both the 9.X and 10 environments.

Performance has been a focus in recent changes and I believe this revision
driver will perform very well in most workloads.

MFC after: 2 weeks
2013-06-18 21:28:19 +00:00
Luigi Rizzo
d61ba75247 use netmap_rx_irq() / netmap_tx_irq() to handle interrupts in
netmap mode, removing the logic from individual drivers.

(note: if_lem.c not updated yet due to some other pending modifications)
2013-04-30 16:18:29 +00:00
Jack F Vogel
7bdac10465 Two small fixes:
Set promiscuous code was unconditionally turning off multicast when
  turning off promiscuous mode, this should only be done when there are
  less than MAX groups. Thanks to Mike Karels for this correction.

  Second, the overtmp interrupt setup/detection was wrong, correcting it.

MFC after:	one week
2013-03-29 18:03:00 +00:00
Jack F Vogel
facc592d88 Fix a small, but important bug, a task drain was mistakenly
being compiled only when setting LEGACY_TX, this means you would
not get the drain when needed on detach!!

Thanks to Bryan Venteicher (bryanv@freebsd.org) for catching this
little gremlin!! :)
2013-03-04 23:15:07 +00:00
Jack F Vogel
0ecc2ff0e8 First, sync to internal shared code, and then
Fixes:
	- flow control - don't override user value on re-init
	- fix to make 1G optics work correctly
	- change to interrupt enabling - some bits were incorrect
	  for certain hardware.
	- certain stats fixes, remove a duplicate increment of
	  ierror, thanks to Scott Long for pointing these out.
	- shared code link interface changed, requiring some
	  core code changes to accomodate this.
	- add an m_adj() to ETHER_ALIGN on the recieve side, this
	  was requested by Mike Karels, thanks Mike.
	- Multicast code corrections also thanks to Mike Karels.
2013-03-04 23:07:40 +00:00
Dag-Erling Smørgrav
cdc1296734 revert 247035 2013-02-20 21:16:50 +00:00
Dag-Erling Smørgrav
c9263bd288 Reduce excessive nesting. 2013-02-20 12:59:21 +00:00
Randall Stewart
ded5ea6a25 This fixes a out-of-order problem with several
of the newer drivers. The basic problem was
that the driver was pulling the mbuf off the
drbr ring and then when sending with xmit(), encounting
a full transmit ring. Thus the lower layer
xmit() function would return an error, and the
drivers would then append the data back on to the ring.
For TCP this is a horrible scenario sure to bring
on a fast-retransmit.

The fix is to use drbr_peek() to pull the data pointer
but not remove it from the ring. If it fails then
we either call the new drbr_putback or drbr_advance
method. Advance moves it forward (we do this sometimes
when the xmit() function frees the mbuf). When
we succeed we always call advance. The
putback will always copy the mbuf back to the top
of the ring. Note that the putback *cannot* be used
with a drbr_dequeue() only with drbr_peek(). We most
of the time, in putback, would not need to copy it
back since most likey the mbuf is still the same, but
sometimes xmit() functions will change the mbuf via
a pullup or other call. So the optimial case for
the single consumer is to always copy it back. If
we ever do a multiple_consumer (for lagg?) we
will  need a test and atomic in the put back possibly
a seperate putback_mc() in the ring buf.

Reviewed by:	jhb@freebsd.org, jlv@freebsd.org
2013-02-07 15:20:54 +00:00
Sofian Brabez
61bfd86762 Use DEVMETHOD_END macro defined in sys/bus.h instead of {0, 0} sentinel on device_method_t arrays
Reviewed by:	cognet
Approved by:	cognet
2013-01-30 18:01:20 +00:00
Pedro F. Giffuni
646a7fea0c Clean some 'svn:executable' properties in the tree.
Submitted by:	Christoph Mallon
MFC after:	3 days
2013-01-26 22:08:21 +00:00
Luigi Rizzo
60372f6f58 rename the 'tag' and 'map' fields used the rx ring to their
previous names, 'ptag' and 'pmap' -- p stands for packet.

This change reduces the difference between the code in stable/9
and head, and also helps using the same ixgbe_netmap.h on both branches.

Approved by:	Jack Vogel
2012-12-20 22:26:03 +00:00
Gleb Smirnoff
c6499eccad Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags in sys/dev.
2012-12-04 09:32:43 +00:00
Jack F Vogel
4153fe7216 Remove the sysctl process_limit interface, after some
thought I've decided its overkill,a simple tuneable for
each RX and TX limit, and then init sets the ring values
based on that, should be sufficient.

More importantly, fix a bug causing a panic, when changing
the define style to IXGBE_LEGACY_TX a taskqueue init was
inadvertently set #ifdef when it should be #ifndef.
2012-12-03 21:38:02 +00:00
Jack F Vogel
39aa926bb3 Patch #12 OK, I said there was only 11 patches, but unfortunately
the revamped sysctl code did not work, and needed a change. This
makes the limit get set at the time that all sysctl stats are
created and is actually more elegant imho anyway.
2012-12-01 01:24:40 +00:00
Jack F Vogel
5a5d90a268 Patch #11 - The final patch: this one greatly improves the
TX hot path by getting rid of index calculations and simply
managing pointers. Much of the creative code is due to my
coworker here at Intel, Alex Duyck, thanks Alex!

Also, this whole series of patches was given the critical
eye of Gleb Smirnoff and is all the better for it, thanks
Gleb!
2012-12-01 00:11:24 +00:00
Jack F Vogel
d777904f05 Patch #10 Performance - this changes the protocol offload
interface and code in the TX path,making it tighter and
hopefully more efficient.
2012-12-01 00:03:58 +00:00
Jack F Vogel
df51baf38f Patch #9 Performance - improve the tx dma failure
path, similar to a change done in igb long ago.
2012-11-30 23:54:57 +00:00
Jack F Vogel
47dd71a877 Patch #8 Performance changes - this one improves locality,
moving some counters and data to the ring struct from
the adapter struct, also compressing some data in the
move.
2012-11-30 23:45:55 +00:00
Jack F Vogel
27329b1a91 Patch #7 This is primarily about processing limit control.
- add a limit for both RX and TX, change the default to 256
- change the sysctl usage to be common, and now to be called
during init for each ring.
- the TX limit is not yet used, but the changes in the last
patch in this series uses the value.
- the motivation behind these changes is to improve data
locality in the final code.
- rxeof interface changes since it now gets limit from the
ring struct
2012-11-30 23:28:01 +00:00
Jack F Vogel
01816c875d Patch #6 Whitespace cleanup, and removal of some very old
defines (at Gleb's request). Also, change the defines around
the old transmit code to IXGBE_LEGACY_TX, I do this to make
it possible to define this regardless of the OS level (it is
not defined by default). There are also a couple changed
comments for clarity.
2012-11-30 23:13:56 +00:00
Jack F Vogel
0c2f38e43b Patch #5 Cleanup unused IEEE1588 code fragments, the day may
come when this feature gets implemented, but its not here yet
and I see no reason to leave this laying around.
2012-11-30 23:06:27 +00:00
Jack F Vogel
6d3e416bc4 Patch #4 - this does two things, it removes a number of statistics,
these are FCOE stats (fiber channel over ethernet), something that
FreeBSD does not yet have, they were mistaken for flow control by
the implementor I believe. Secondly, the real flow control stats
are oddly named with a 'link' tag on the front, it was requested
by my validation engineer to make these stats have the same name as
the igb driver for clarity and that seemed reasonable to me.
2012-11-30 22:54:14 +00:00
Jack F Vogel
6a59dfbb86 Patch #3 - Add a new ioctl to access SFP+ module diagnostic
data via the I2C routines in shared code.
2012-11-30 22:41:32 +00:00
Jack F Vogel
35bbbdaa3b Patch #2 - remove OACTIVE and DEPLETED notions from the
multiqueue code, this functionality has proven to be more
trouble than it was worth. Thanks to Gleb for a second
critical look over my code and help in the patches!
2012-11-30 22:33:21 +00:00
Jack F Vogel
7d1157eec8 First of a series of 11 patches leading to new ixgbe version 2.5.0
This removes the header split and supporting code from the driver.
2012-11-30 22:19:18 +00:00
Jack F Vogel
8fce93a144 A few important fixes:
- Testing TSO6 has led me to discover that HW RSC is
    a problematic feature, it is ONLY designed to work
    with IPv4 in the first place, and if IP forwarding
    is done it can't be disabled as LRO in the stack,
    also initial testing we've done at Intel shows an
    equal performance using TSO[46] on the TX and LRO
    on RX, if you ran older code on 82599 or later hardware
    you actually could have detrimental performance for
    this reason. So I am disabling the feature by default
    and all our adapters will now use LRO instead.

  - If you have flow control off and multiple queues it
    was possible when the buffer of one queue becomes
    full that all RX movement is stalled, to eliminate
    this problem a feature bit is now set that will allow
    packets to be dropped when full rather than stall.
    Note, the default is to have flow control on, and this
    keeps this from happening.

  - Because of the recent fixes in the stack, LRO is now
    auto-disabled when problematic, so I have decided to
    enable it by default in the capabilities in the driver.

  - There are some 1G modules used by some customers, a couple
    small tweaks to properly support those in the media code.

  - A note: we have now done some testing of TSO6 and using
    LRO with IPv6 and it all works great!! Seeing line rate
    in both directions in best cases. Thanks bz for your
    excellent work!!
2012-10-31 23:50:36 +00:00
Jack F Vogel
89da5b3198 Correct code that was lost somewhere in the past,
this was designed to keep duplicate null vlan tags from
being added. When doing vlans purely via the switch
this problem will occur. Reported by external customer.
2012-10-31 18:16:42 +00:00
Eitan Adler
2da1951583 Now that device disabling is generic, remove extraneous code from the
device drivers that used to provide this feature.

This is a subset of 241856 (which was reverted)

Reviewed by:	des
Approved by:	cperciva (implicit)
MFC after:	1 week
2012-10-22 22:29:48 +00:00
Eitan Adler
a8de37b024 This isn't functionally identical. In some cases a hint to disable
unit 0 would in fact disable all units.

This reverts r241856

Approved by: cperciva (implicit)
2012-10-22 13:06:09 +00:00
Eitan Adler
76b7512247 Now that device disabling is generic, remove extraneous code from the
device drivers that used to provide this feature.

Reviewed by:	des
Approved by:	cperciva
MFC after:	1 week
2012-10-22 03:41:14 +00:00
Maksim Yevmenkin
608ae712d3 provide helper if_initbaudrate() to set if_baudrate_pf and if_baudrate_pf.
again, use ixgbe(4) as an example of how to use new helper function.

Reviewed by:	jhb
MFC after:	1 week
2012-10-17 19:24:13 +00:00
Maksim Yevmenkin
0fef97fea3 introduce concept of ifi_baudrate power factor. the idea is to work
around the problem where high speed interfaces (such as ixgbe(4))
are not able to report real ifi_baudrate. bascially, take a spare
byte from struct if_data and use it to store ifi_baudrate power
factor. in other words,

real ifi_baudrate = ifi_baudrate * 10 ^ ifi_baudrate power factor

this should be backwards compatible with old binaries. use ixgbe(4)
as an example on how drivers would set ifi_baudrate power factor

Discussed with:	kib, scottl, glebius
MFC after:	1 week
2012-10-16 20:18:15 +00:00
Gleb Smirnoff
063efed28c The drbr(9) API appeared to be so unclear, that most drivers in
tree used it incorrectly, which lead to inaccurate overrated
if_obytes accounting. The drbr(9) used to update ifnet stats on
drbr_enqueue(), which is not accurate since enqueuing doesn't
imply successful processing by driver. Dequeuing neither mean
that. Most drivers also called drbr_stats_update() which did
accounting again, leading to doubled if_obytes statistics. And
in case of severe transmitting, when a packet could be several
times enqueued and dequeued it could have been accounted several
times.

o Thus, make drbr(9) API thinner. Now drbr(9) merely chooses between
  ALTQ queueing or buf_ring(9) queueing.
  - It doesn't touch the buf_ring stats any more.
  - It doesn't touch ifnet stats anymore.
  - drbr_stats_update() no longer exists.

o buf_ring(9) handles its stats itself:
  - It handles br_drops itself.
  - br_prod_bytes stats are dropped. Rationale: no one ever
    reads them but update of a common counter on every packet
    negatively affects performance due to excessive cache
    invalidation.
  - buf_ring_enqueue_bytes() reduced to buf_ring_enqueue(), since
    we no longer account bytes.

o Drivers handle their stats theirselves: if_obytes, if_omcasts.

o mlx4(4), igb(4), em(4), vxge(4), oce(4) and  ixv(4) no longer
  use drbr_stats_update(), and update ifnet stats theirselves.

o bxe(4) was the most correct driver, it didn't call
  drbr_stats_update(), thus it was the only driver accurate under
  moderate load. Now it also maintains stats itself.

o ixgbe(4) had already taken stats from hardware, so just
  - drop software stats updating.
  - take multicast packet count from hardware as well.

o mxge(4) just no longer needs NO_SLOW_STATS define.

o cxgb(4), cxgbe(4) need no change, since they obtain stats
  from hardware.

Reviewed by:	jfv, gnn
2012-09-28 18:28:27 +00:00
John Baldwin
aceb040376 Merge similar fixes from 223198 from igb to ixgbe:
- Use a dedicated task to handle deferred transmits from the if_transmit
  method instead of reusing the existing per-queue interrupt task.
  Reusing the per-queue interrupt task could result in both an interrupt
  thread and the taskqueue thread trying to handle received packets on a
  single queue resulting in out-of-order packet processing and lock
  contention.
- Don't define ixgbe_start() at all where if_transmit is used.

Tested by:	Vijay Singh
Reviewed by:	jfv
MFC after:	2 weeks
2012-09-26 18:11:43 +00:00
Eitan Adler
2bfb8a83f6 Define missing DEBUGOUT# macros. DEBUGOUT[45] are not yet used but are
being defined pre-emptively to avoid future build breakage

PR:		kern/168967
Submitted by:	fuzhli <fuzl@arraynetworks.com.cn>
Approved by:	cperciva
MFC after:	1 week
2012-09-13 14:40:24 +00:00
Scott Long
a46570c76d Remove a prefetch() directive that, after careful testing, does more harm
than good.

Submitted by:	Fabien Thomas
Reviewed by:	jfv
2012-09-11 16:59:04 +00:00
Kevin Lo
3aa5b33a42 Add missing braces
Obtained from:	DragonFly
2012-09-06 02:07:58 +00:00
Scott Long
cfc0969ad4 Heavily optimize the case of small RX packets of 160 bytes or less. For
this case, allocate a plain mbuf and copy the frame into it, then send the
copy up the stack, leaving the original mbuf+cluster in place in the
receive ring for immediate re-use.  This saves a trip through 2 of the
3 zones of the compound mbuf allocator, a trip through busdma, and a trip
through the 1 of the 3 mbuf destructors.  For our load at Netflix, this can
lower CPU consumption by as much as 20%.  The copy algorithm is based on
investigative work from Luigi Rizzo earlier in the year.

Reviewed by:	jfv
Obtained from:	Netflix
2012-08-31 10:07:38 +00:00
Jack F Vogel
a621e3c8b5 Update to the ixgbe driver:
- Add a couple of new devices
  - Flow control changes in shared and core code
  - Bug fix to Flow Director for 82598
  - Shared code sync to internal with required core change

Thanks to those helping in the testing and improvements to this driver!

MFC after:5 days
2012-07-05 20:51:44 +00:00
Maksim Yevmenkin
c5d8a885d4 Correct typo(?) and actually set PTHRESH to 32 and not 16 as per Intel
Linux driver 3.8.21.

MFC after:	1 week
2012-06-07 22:57:26 +00:00
Maksim Yevmenkin
cd1fb2e095 Before it gets lost in the noise.
Put a bandaid to prevent ixgbe(4) from completely locking up the system
under high load. Our platform has a few CPU cores and a single active
ixgbe(4) port with 4 queues. Under high enough traffic load, at about
7.5GBs and 700,000 packets/sec (outbound), the entire system would
deadlock. What we found was that each CPU was in an endless loop on a
different ix taskqueue thread. The OACTIVE flag had gotten set on each
queue, and the ixgbe_handle_queue() function was continuously rescheduling
itself via the taskqueue_enqueue. Since all CPUs were busy with their
taskqueue threads, the ixgbe_local_timer() function couldn't run to clear
the OACTIVE flag.

Submitted by:	scottl
MFC after:	1 week
2012-06-05 18:48:02 +00:00
Bjoern A. Zeeb
e2c0161e2e MFp4 bz_ipv6_fast:
Add TSO6 and LRO/IPv6 support.
  Fix the module Makefile to at least properly inlcude opt_inet6.h
  and allow builds without INET or INET6.

  Sponsored by:	The FreeBSD Foundation
  Sponsored by:	iXsystems

Reviewed by:	gnn (as part of the whole)
MFC After:	3 days
2012-05-25 03:02:56 +00:00
Luigi Rizzo
f9125c3ec9 fix a typo in a comment 2012-05-17 14:36:19 +00:00
Bjoern A. Zeeb
39fc714a6f If we pass down 64k - L2 hdr size + 1 to 64K L3+ data adding an ether
header will make the data go over the 64k limits announced to busdma as
maxsize and the transaction will fail.

With TSO this can result in a TCP regression due to the lost packet.

According to the data sheets ixgbe(4) 82598 and 82599 can handle up to
256k so increase the maximum.

Reported by:	Jon Kåre Hellan, UNINETT (jon.kare.hellan uninett.no)
Tested by:	Jon Kåre Hellan, UNINETT (jon.kare.hellan uninett.no)
MFC after:	1 week
2012-04-23 22:05:09 +00:00
Luigi Rizzo
9b034c6f08 Properly disable crc stripping when operating in netmap mode.
Contrarily to what i wrote in my previous commit, the 82599
does include the CRC in the length. The operating mode is
reset in ixgbe_init_locked() and so we need to hook into
the places where the two registers (HLREG0 and RDRXCTL) are
modified.
2012-04-13 16:42:54 +00:00
Luigi Rizzo
aa15c59eb1 Enable prefetching of descriptors on the TX ring, using the same
values as in the Intel driver 3.8.21 for linux.  The fact that it
is standard in the above driver suggests that it has no bad side
effects.

But of course there must be a reason for enabling features, not
just "it does not harm", so here it is a good one:

Prefetching enables full line rate even using a single queue (14.88
Mpps, compared to ~12 Mpps without prefetch).  This in turn is
terribly useful when one wants to schedule traffic.

For obvious reasons the difference is only visible with netmap
or other high speed solutions, but presumably the advantage
should be in the order of a fraction of a microsecond when
starting transmission on an empty queue.

Discussed with Jack Vogel.

MFC after:	1 week
2012-04-11 15:02:14 +00:00
Scott Long
62ce43ccc8 More conversions of drivers to use the PCI parent DMA tag. 2012-03-12 18:15:08 +00:00
Luigi Rizzo
64ae02c365 A bunch of netmap fixes:
USERSPACE:
1. add support for devices with different number of rx and tx queues;

2. add better support for zero-copy operation, adding an extra field
   to the netmap ring to indicate how many buffers we have already processed
   but not yet released (with help from Eddie Kohler);

3. The two changes above unfortunately require an API change, so while
   at it add a version field and some spares to the ioctl() argument
   to help detect mismatches.

4. update the manual page for the two changes above;

5. update sample applications in tools/tools/netmap

KERNEL:

1. simplify the internal structures moving the global wait queues
   to the 'struct netmap_adapter';

2. simplify the functions that map kring<->nic ring indexes

3. normalize device-specific code, helps mainteinance;

4. start exploring the impact of micro-optimizations (prefetch etc.)
   in the ixgbe driver.
   Use 'legacy' descriptors on the tx ring and prefetch slots gives
   about 20% speedup at 900 MHz. Another 7-10% would come from removing
   the explict calls to bus_dmamap* in the core (they are effectively
   NOPs in this case, but it takes expensive load of the per-buffer
   dma maps to figure out that they are all NULL.

   Rx performance not investigated.

I am postponing the MFC so i can import a few more improvements
before merging.
2012-02-27 19:05:01 +00:00
Luigi Rizzo
5644ccec61 (This commit only touches code within the DEV_NETMAP blocks)
Introduce some functions to map NIC ring indexes into netmap ring
indexes and vice versa. This way we can implement the bound
checks only in one place (and hopefully in a correct way).

On passing, make the code and comments more uniform across the
various drivers.
2012-02-15 23:13:29 +00:00
Jack F Vogel
3e52ad9cc6 Wrap the bool typedef 2012-01-30 23:03:21 +00:00
Jack F Vogel
85d0a26ed4 New hardware support: Intel X540 adapter support added.
Some shared code reorganization along with the new adapter.
Sync changes to OACTIVE in igb into this driver.
Misc small fixes.
2012-01-30 16:42:02 +00:00
Luigi Rizzo
2157a17ce2 ixgbe changes:
- remove experimental code for disabling CRC
- use the correct constant for conversion between interrupt rate
  and EITR values (the previous values were off by a factor of 2)
- make dev.ix.N.queueM.interrupt_rate a RW sysctl variable.
  Changing individual values affects the queue immediately,
  and propagates to all interfaces at the next reinit.
- add dev.ix.N.queueM.irqs rdonly sysctl, to export the actual
  interrupt counts

Netmap-related changes for ixgbe:
- use the "new" format for TX descriptors in netmap mode.
- pass interrupt mitigation delays to the user process doing poll()
  on a netmap file descriptor.
  On the RX side this means we will not check the ring more than once
  per interrupt. This gives the process a chance to sleep and process
  packets in larger batches, thus reducing CPU usage.
  On the TX side we take this even further: completed transmissions are
  reclaimed every half ring even if the NIC interrupts more often.
  This saves even more CPU without any additional tx delays.

Generic Netmap-related changes:
- align the netmap_kring to cache lines so that there is no false sharing
  (possibly useful for multiqueue NICs and MSIX interrupts, which are
  handled by different cores). It's a minor improvement but it does not
  cost anything.

Reviewed by:	Jack Vogel
Approved by:	Jack Vogel
2012-01-26 09:55:16 +00:00
Luigi Rizzo
e3ca4599b0 netmap-related changes:
1. correct the initialization of RDT when there is an ixgbe_init()
   while a netmap client is active. This code was previously
   in ixgbe_initialize_receive_units() but RDT is overwritten
   shortly afterwards in ixgbe_init_locked()

2. add code (not active yet) to disable CRCSTRIP while in netmap mode.
   From all evidence i could gather, it seems that when the 82599 has to
   write a data block that is not a full cache line, it first reads
   the line (64 bytes) and then writes back the updated version.
   This hurts reception of min-sized frames, which are only 60 bytes
   if the CRC is stripped: i could never get above 11Mpps
   (received from one queue) with CRCSTRIP enabled, whyle 64+4-byte
   packets reach 14.2 Mpps (the theoretical maximum).
   Leaving the CRC in gets us 14.88Mpps for 60+4 byte frames,
   (and penalizes 64+4). The min-size case is important not just because
   it looks good in benchmarks, but also because this is the size
   of pure acks.
   Note we cannot leave CRCSTRIP on by default because it is
   incompatible with some other features (LRO etc.)
2012-01-19 09:36:19 +00:00