Commit Graph

119 Commits

Author SHA1 Message Date
jfv
879e516752 Fix a small, but important bug, a task drain was mistakenly
being compiled only when setting LEGACY_TX, this means you would
not get the drain when needed on detach!!

Thanks to Bryan Venteicher (bryanv@freebsd.org) for catching this
little gremlin!! :)
2013-03-04 23:15:07 +00:00
jfv
7b20f97709 First, sync to internal shared code, and then
Fixes:
	- flow control - don't override user value on re-init
	- fix to make 1G optics work correctly
	- change to interrupt enabling - some bits were incorrect
	  for certain hardware.
	- certain stats fixes, remove a duplicate increment of
	  ierror, thanks to Scott Long for pointing these out.
	- shared code link interface changed, requiring some
	  core code changes to accomodate this.
	- add an m_adj() to ETHER_ALIGN on the recieve side, this
	  was requested by Mike Karels, thanks Mike.
	- Multicast code corrections also thanks to Mike Karels.
2013-03-04 23:07:40 +00:00
des
93c06c0a21 revert 247035 2013-02-20 21:16:50 +00:00
des
9f6e838358 Reduce excessive nesting. 2013-02-20 12:59:21 +00:00
rrs
75ad250e97 This fixes a out-of-order problem with several
of the newer drivers. The basic problem was
that the driver was pulling the mbuf off the
drbr ring and then when sending with xmit(), encounting
a full transmit ring. Thus the lower layer
xmit() function would return an error, and the
drivers would then append the data back on to the ring.
For TCP this is a horrible scenario sure to bring
on a fast-retransmit.

The fix is to use drbr_peek() to pull the data pointer
but not remove it from the ring. If it fails then
we either call the new drbr_putback or drbr_advance
method. Advance moves it forward (we do this sometimes
when the xmit() function frees the mbuf). When
we succeed we always call advance. The
putback will always copy the mbuf back to the top
of the ring. Note that the putback *cannot* be used
with a drbr_dequeue() only with drbr_peek(). We most
of the time, in putback, would not need to copy it
back since most likey the mbuf is still the same, but
sometimes xmit() functions will change the mbuf via
a pullup or other call. So the optimial case for
the single consumer is to always copy it back. If
we ever do a multiple_consumer (for lagg?) we
will  need a test and atomic in the put back possibly
a seperate putback_mc() in the ring buf.

Reviewed by:	jhb@freebsd.org, jlv@freebsd.org
2013-02-07 15:20:54 +00:00
sbz
4d7bb3e81a Use DEVMETHOD_END macro defined in sys/bus.h instead of {0, 0} sentinel on device_method_t arrays
Reviewed by:	cognet
Approved by:	cognet
2013-01-30 18:01:20 +00:00
pfg
245e35ae97 Clean some 'svn:executable' properties in the tree.
Submitted by:	Christoph Mallon
MFC after:	3 days
2013-01-26 22:08:21 +00:00
luigi
8231b45a41 rename the 'tag' and 'map' fields used the rx ring to their
previous names, 'ptag' and 'pmap' -- p stands for packet.

This change reduces the difference between the code in stable/9
and head, and also helps using the same ixgbe_netmap.h on both branches.

Approved by:	Jack Vogel
2012-12-20 22:26:03 +00:00
glebius
a69aaa7721 Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags in sys/dev.
2012-12-04 09:32:43 +00:00
jfv
77a5b5a73b Remove the sysctl process_limit interface, after some
thought I've decided its overkill,a simple tuneable for
each RX and TX limit, and then init sets the ring values
based on that, should be sufficient.

More importantly, fix a bug causing a panic, when changing
the define style to IXGBE_LEGACY_TX a taskqueue init was
inadvertently set #ifdef when it should be #ifndef.
2012-12-03 21:38:02 +00:00
jfv
3648c2c9ff Patch #12 OK, I said there was only 11 patches, but unfortunately
the revamped sysctl code did not work, and needed a change. This
makes the limit get set at the time that all sysctl stats are
created and is actually more elegant imho anyway.
2012-12-01 01:24:40 +00:00
jfv
029de4582c Patch #11 - The final patch: this one greatly improves the
TX hot path by getting rid of index calculations and simply
managing pointers. Much of the creative code is due to my
coworker here at Intel, Alex Duyck, thanks Alex!

Also, this whole series of patches was given the critical
eye of Gleb Smirnoff and is all the better for it, thanks
Gleb!
2012-12-01 00:11:24 +00:00
jfv
a273fc1acd Patch #10 Performance - this changes the protocol offload
interface and code in the TX path,making it tighter and
hopefully more efficient.
2012-12-01 00:03:58 +00:00
jfv
3ea10f121c Patch #9 Performance - improve the tx dma failure
path, similar to a change done in igb long ago.
2012-11-30 23:54:57 +00:00
jfv
3a1ec1177a Patch #8 Performance changes - this one improves locality,
moving some counters and data to the ring struct from
the adapter struct, also compressing some data in the
move.
2012-11-30 23:45:55 +00:00
jfv
867ef5fd86 Patch #7 This is primarily about processing limit control.
- add a limit for both RX and TX, change the default to 256
- change the sysctl usage to be common, and now to be called
during init for each ring.
- the TX limit is not yet used, but the changes in the last
patch in this series uses the value.
- the motivation behind these changes is to improve data
locality in the final code.
- rxeof interface changes since it now gets limit from the
ring struct
2012-11-30 23:28:01 +00:00
jfv
07398249f3 Patch #6 Whitespace cleanup, and removal of some very old
defines (at Gleb's request). Also, change the defines around
the old transmit code to IXGBE_LEGACY_TX, I do this to make
it possible to define this regardless of the OS level (it is
not defined by default). There are also a couple changed
comments for clarity.
2012-11-30 23:13:56 +00:00
jfv
95184bc12a Patch #5 Cleanup unused IEEE1588 code fragments, the day may
come when this feature gets implemented, but its not here yet
and I see no reason to leave this laying around.
2012-11-30 23:06:27 +00:00
jfv
552da73086 Patch #4 - this does two things, it removes a number of statistics,
these are FCOE stats (fiber channel over ethernet), something that
FreeBSD does not yet have, they were mistaken for flow control by
the implementor I believe. Secondly, the real flow control stats
are oddly named with a 'link' tag on the front, it was requested
by my validation engineer to make these stats have the same name as
the igb driver for clarity and that seemed reasonable to me.
2012-11-30 22:54:14 +00:00
jfv
6b9e611792 Patch #3 - Add a new ioctl to access SFP+ module diagnostic
data via the I2C routines in shared code.
2012-11-30 22:41:32 +00:00
jfv
80b1e3c11a Patch #2 - remove OACTIVE and DEPLETED notions from the
multiqueue code, this functionality has proven to be more
trouble than it was worth. Thanks to Gleb for a second
critical look over my code and help in the patches!
2012-11-30 22:33:21 +00:00
jfv
f33a5b80e4 First of a series of 11 patches leading to new ixgbe version 2.5.0
This removes the header split and supporting code from the driver.
2012-11-30 22:19:18 +00:00
jfv
95dd9f9254 A few important fixes:
- Testing TSO6 has led me to discover that HW RSC is
    a problematic feature, it is ONLY designed to work
    with IPv4 in the first place, and if IP forwarding
    is done it can't be disabled as LRO in the stack,
    also initial testing we've done at Intel shows an
    equal performance using TSO[46] on the TX and LRO
    on RX, if you ran older code on 82599 or later hardware
    you actually could have detrimental performance for
    this reason. So I am disabling the feature by default
    and all our adapters will now use LRO instead.

  - If you have flow control off and multiple queues it
    was possible when the buffer of one queue becomes
    full that all RX movement is stalled, to eliminate
    this problem a feature bit is now set that will allow
    packets to be dropped when full rather than stall.
    Note, the default is to have flow control on, and this
    keeps this from happening.

  - Because of the recent fixes in the stack, LRO is now
    auto-disabled when problematic, so I have decided to
    enable it by default in the capabilities in the driver.

  - There are some 1G modules used by some customers, a couple
    small tweaks to properly support those in the media code.

  - A note: we have now done some testing of TSO6 and using
    LRO with IPv6 and it all works great!! Seeing line rate
    in both directions in best cases. Thanks bz for your
    excellent work!!
2012-10-31 23:50:36 +00:00
jfv
82e50468a7 Correct code that was lost somewhere in the past,
this was designed to keep duplicate null vlan tags from
being added. When doing vlans purely via the switch
this problem will occur. Reported by external customer.
2012-10-31 18:16:42 +00:00
eadler
903b8e0551 Now that device disabling is generic, remove extraneous code from the
device drivers that used to provide this feature.

This is a subset of 241856 (which was reverted)

Reviewed by:	des
Approved by:	cperciva (implicit)
MFC after:	1 week
2012-10-22 22:29:48 +00:00
eadler
92f340b6e7 This isn't functionally identical. In some cases a hint to disable
unit 0 would in fact disable all units.

This reverts r241856

Approved by: cperciva (implicit)
2012-10-22 13:06:09 +00:00
eadler
bc26a2b3b0 Now that device disabling is generic, remove extraneous code from the
device drivers that used to provide this feature.

Reviewed by:	des
Approved by:	cperciva
MFC after:	1 week
2012-10-22 03:41:14 +00:00
emax
0dfb309a1f provide helper if_initbaudrate() to set if_baudrate_pf and if_baudrate_pf.
again, use ixgbe(4) as an example of how to use new helper function.

Reviewed by:	jhb
MFC after:	1 week
2012-10-17 19:24:13 +00:00
emax
214df82afa introduce concept of ifi_baudrate power factor. the idea is to work
around the problem where high speed interfaces (such as ixgbe(4))
are not able to report real ifi_baudrate. bascially, take a spare
byte from struct if_data and use it to store ifi_baudrate power
factor. in other words,

real ifi_baudrate = ifi_baudrate * 10 ^ ifi_baudrate power factor

this should be backwards compatible with old binaries. use ixgbe(4)
as an example on how drivers would set ifi_baudrate power factor

Discussed with:	kib, scottl, glebius
MFC after:	1 week
2012-10-16 20:18:15 +00:00
glebius
4b29d585cf The drbr(9) API appeared to be so unclear, that most drivers in
tree used it incorrectly, which lead to inaccurate overrated
if_obytes accounting. The drbr(9) used to update ifnet stats on
drbr_enqueue(), which is not accurate since enqueuing doesn't
imply successful processing by driver. Dequeuing neither mean
that. Most drivers also called drbr_stats_update() which did
accounting again, leading to doubled if_obytes statistics. And
in case of severe transmitting, when a packet could be several
times enqueued and dequeued it could have been accounted several
times.

o Thus, make drbr(9) API thinner. Now drbr(9) merely chooses between
  ALTQ queueing or buf_ring(9) queueing.
  - It doesn't touch the buf_ring stats any more.
  - It doesn't touch ifnet stats anymore.
  - drbr_stats_update() no longer exists.

o buf_ring(9) handles its stats itself:
  - It handles br_drops itself.
  - br_prod_bytes stats are dropped. Rationale: no one ever
    reads them but update of a common counter on every packet
    negatively affects performance due to excessive cache
    invalidation.
  - buf_ring_enqueue_bytes() reduced to buf_ring_enqueue(), since
    we no longer account bytes.

o Drivers handle their stats theirselves: if_obytes, if_omcasts.

o mlx4(4), igb(4), em(4), vxge(4), oce(4) and  ixv(4) no longer
  use drbr_stats_update(), and update ifnet stats theirselves.

o bxe(4) was the most correct driver, it didn't call
  drbr_stats_update(), thus it was the only driver accurate under
  moderate load. Now it also maintains stats itself.

o ixgbe(4) had already taken stats from hardware, so just
  - drop software stats updating.
  - take multicast packet count from hardware as well.

o mxge(4) just no longer needs NO_SLOW_STATS define.

o cxgb(4), cxgbe(4) need no change, since they obtain stats
  from hardware.

Reviewed by:	jfv, gnn
2012-09-28 18:28:27 +00:00
jhb
085c4798a8 Merge similar fixes from 223198 from igb to ixgbe:
- Use a dedicated task to handle deferred transmits from the if_transmit
  method instead of reusing the existing per-queue interrupt task.
  Reusing the per-queue interrupt task could result in both an interrupt
  thread and the taskqueue thread trying to handle received packets on a
  single queue resulting in out-of-order packet processing and lock
  contention.
- Don't define ixgbe_start() at all where if_transmit is used.

Tested by:	Vijay Singh
Reviewed by:	jfv
MFC after:	2 weeks
2012-09-26 18:11:43 +00:00
eadler
a312b7ef4e Define missing DEBUGOUT# macros. DEBUGOUT[45] are not yet used but are
being defined pre-emptively to avoid future build breakage

PR:		kern/168967
Submitted by:	fuzhli <fuzl@arraynetworks.com.cn>
Approved by:	cperciva
MFC after:	1 week
2012-09-13 14:40:24 +00:00
scottl
076656eb8b Remove a prefetch() directive that, after careful testing, does more harm
than good.

Submitted by:	Fabien Thomas
Reviewed by:	jfv
2012-09-11 16:59:04 +00:00
kevlo
f9b0e29ecf Add missing braces
Obtained from:	DragonFly
2012-09-06 02:07:58 +00:00
scottl
693652fc56 Heavily optimize the case of small RX packets of 160 bytes or less. For
this case, allocate a plain mbuf and copy the frame into it, then send the
copy up the stack, leaving the original mbuf+cluster in place in the
receive ring for immediate re-use.  This saves a trip through 2 of the
3 zones of the compound mbuf allocator, a trip through busdma, and a trip
through the 1 of the 3 mbuf destructors.  For our load at Netflix, this can
lower CPU consumption by as much as 20%.  The copy algorithm is based on
investigative work from Luigi Rizzo earlier in the year.

Reviewed by:	jfv
Obtained from:	Netflix
2012-08-31 10:07:38 +00:00
jfv
8f97f54d9e Update to the ixgbe driver:
- Add a couple of new devices
  - Flow control changes in shared and core code
  - Bug fix to Flow Director for 82598
  - Shared code sync to internal with required core change

Thanks to those helping in the testing and improvements to this driver!

MFC after:5 days
2012-07-05 20:51:44 +00:00
emax
3479b2c8d6 Correct typo(?) and actually set PTHRESH to 32 and not 16 as per Intel
Linux driver 3.8.21.

MFC after:	1 week
2012-06-07 22:57:26 +00:00
emax
ce863bea70 Before it gets lost in the noise.
Put a bandaid to prevent ixgbe(4) from completely locking up the system
under high load. Our platform has a few CPU cores and a single active
ixgbe(4) port with 4 queues. Under high enough traffic load, at about
7.5GBs and 700,000 packets/sec (outbound), the entire system would
deadlock. What we found was that each CPU was in an endless loop on a
different ix taskqueue thread. The OACTIVE flag had gotten set on each
queue, and the ixgbe_handle_queue() function was continuously rescheduling
itself via the taskqueue_enqueue. Since all CPUs were busy with their
taskqueue threads, the ixgbe_local_timer() function couldn't run to clear
the OACTIVE flag.

Submitted by:	scottl
MFC after:	1 week
2012-06-05 18:48:02 +00:00
bz
1a89a35532 MFp4 bz_ipv6_fast:
Add TSO6 and LRO/IPv6 support.
  Fix the module Makefile to at least properly inlcude opt_inet6.h
  and allow builds without INET or INET6.

  Sponsored by:	The FreeBSD Foundation
  Sponsored by:	iXsystems

Reviewed by:	gnn (as part of the whole)
MFC After:	3 days
2012-05-25 03:02:56 +00:00
luigi
36ef5534f7 fix a typo in a comment 2012-05-17 14:36:19 +00:00
bz
22fae261d4 If we pass down 64k - L2 hdr size + 1 to 64K L3+ data adding an ether
header will make the data go over the 64k limits announced to busdma as
maxsize and the transaction will fail.

With TSO this can result in a TCP regression due to the lost packet.

According to the data sheets ixgbe(4) 82598 and 82599 can handle up to
256k so increase the maximum.

Reported by:	Jon Kåre Hellan, UNINETT (jon.kare.hellan uninett.no)
Tested by:	Jon Kåre Hellan, UNINETT (jon.kare.hellan uninett.no)
MFC after:	1 week
2012-04-23 22:05:09 +00:00
luigi
3cef61f727 Properly disable crc stripping when operating in netmap mode.
Contrarily to what i wrote in my previous commit, the 82599
does include the CRC in the length. The operating mode is
reset in ixgbe_init_locked() and so we need to hook into
the places where the two registers (HLREG0 and RDRXCTL) are
modified.
2012-04-13 16:42:54 +00:00
luigi
bef4cb5b9b Enable prefetching of descriptors on the TX ring, using the same
values as in the Intel driver 3.8.21 for linux.  The fact that it
is standard in the above driver suggests that it has no bad side
effects.

But of course there must be a reason for enabling features, not
just "it does not harm", so here it is a good one:

Prefetching enables full line rate even using a single queue (14.88
Mpps, compared to ~12 Mpps without prefetch).  This in turn is
terribly useful when one wants to schedule traffic.

For obvious reasons the difference is only visible with netmap
or other high speed solutions, but presumably the advantage
should be in the order of a fraction of a microsecond when
starting transmission on an empty queue.

Discussed with Jack Vogel.

MFC after:	1 week
2012-04-11 15:02:14 +00:00
scottl
2e7ae86807 More conversions of drivers to use the PCI parent DMA tag. 2012-03-12 18:15:08 +00:00
luigi
3ac0fcfb97 A bunch of netmap fixes:
USERSPACE:
1. add support for devices with different number of rx and tx queues;

2. add better support for zero-copy operation, adding an extra field
   to the netmap ring to indicate how many buffers we have already processed
   but not yet released (with help from Eddie Kohler);

3. The two changes above unfortunately require an API change, so while
   at it add a version field and some spares to the ioctl() argument
   to help detect mismatches.

4. update the manual page for the two changes above;

5. update sample applications in tools/tools/netmap

KERNEL:

1. simplify the internal structures moving the global wait queues
   to the 'struct netmap_adapter';

2. simplify the functions that map kring<->nic ring indexes

3. normalize device-specific code, helps mainteinance;

4. start exploring the impact of micro-optimizations (prefetch etc.)
   in the ixgbe driver.
   Use 'legacy' descriptors on the tx ring and prefetch slots gives
   about 20% speedup at 900 MHz. Another 7-10% would come from removing
   the explict calls to bus_dmamap* in the core (they are effectively
   NOPs in this case, but it takes expensive load of the per-buffer
   dma maps to figure out that they are all NULL.

   Rx performance not investigated.

I am postponing the MFC so i can import a few more improvements
before merging.
2012-02-27 19:05:01 +00:00
luigi
626117dcad (This commit only touches code within the DEV_NETMAP blocks)
Introduce some functions to map NIC ring indexes into netmap ring
indexes and vice versa. This way we can implement the bound
checks only in one place (and hopefully in a correct way).

On passing, make the code and comments more uniform across the
various drivers.
2012-02-15 23:13:29 +00:00
jfv
ae08051fd8 Wrap the bool typedef 2012-01-30 23:03:21 +00:00
jfv
f19302c07e New hardware support: Intel X540 adapter support added.
Some shared code reorganization along with the new adapter.
Sync changes to OACTIVE in igb into this driver.
Misc small fixes.
2012-01-30 16:42:02 +00:00
luigi
ef0f580b11 ixgbe changes:
- remove experimental code for disabling CRC
- use the correct constant for conversion between interrupt rate
  and EITR values (the previous values were off by a factor of 2)
- make dev.ix.N.queueM.interrupt_rate a RW sysctl variable.
  Changing individual values affects the queue immediately,
  and propagates to all interfaces at the next reinit.
- add dev.ix.N.queueM.irqs rdonly sysctl, to export the actual
  interrupt counts

Netmap-related changes for ixgbe:
- use the "new" format for TX descriptors in netmap mode.
- pass interrupt mitigation delays to the user process doing poll()
  on a netmap file descriptor.
  On the RX side this means we will not check the ring more than once
  per interrupt. This gives the process a chance to sleep and process
  packets in larger batches, thus reducing CPU usage.
  On the TX side we take this even further: completed transmissions are
  reclaimed every half ring even if the NIC interrupts more often.
  This saves even more CPU without any additional tx delays.

Generic Netmap-related changes:
- align the netmap_kring to cache lines so that there is no false sharing
  (possibly useful for multiqueue NICs and MSIX interrupts, which are
  handled by different cores). It's a minor improvement but it does not
  cost anything.

Reviewed by:	Jack Vogel
Approved by:	Jack Vogel
2012-01-26 09:55:16 +00:00
luigi
c20448129b netmap-related changes:
1. correct the initialization of RDT when there is an ixgbe_init()
   while a netmap client is active. This code was previously
   in ixgbe_initialize_receive_units() but RDT is overwritten
   shortly afterwards in ixgbe_init_locked()

2. add code (not active yet) to disable CRCSTRIP while in netmap mode.
   From all evidence i could gather, it seems that when the 82599 has to
   write a data block that is not a full cache line, it first reads
   the line (64 bytes) and then writes back the updated version.
   This hurts reception of min-sized frames, which are only 60 bytes
   if the CRC is stripped: i could never get above 11Mpps
   (received from one queue) with CRCSTRIP enabled, whyle 64+4-byte
   packets reach 14.2 Mpps (the theoretical maximum).
   Leaving the CRC in gets us 14.88Mpps for 60+4 byte frames,
   (and penalizes 64+4). The min-size case is important not just because
   it looks good in benchmarks, but also because this is the size
   of pure acks.
   Note we cannot leave CRCSTRIP on by default because it is
   incompatible with some other features (LRO etc.)
2012-01-19 09:36:19 +00:00