95 Commits

Author SHA1 Message Date
Matt Macy
d7c5a620e2 ifnet: Replace if_addr_lock rwlock with epoch + mutex
Run on LLNW canaries and tested by pho@

gallatin:
Using a 14-core, 28-HTT single socket E5-2697 v3 with a 40GbE MLX5
based ConnectX 4-LX NIC, I see an almost 12% improvement in received
packet rate, and a larger improvement in bytes delivered all the way
to userspace.

When the host receiving 64 streams of netperf -H $DUT -t UDP_STREAM -- -m 1,
I see, using nstat -I mce0 1 before the patch:

InMpps OMpps  InGbs  OGbs err TCP Est %CPU syscalls csw     irq GBfree
4.98   0.00   4.42   0.00 4235592     33   83.80 4720653 2149771   1235 247.32
4.73   0.00   4.20   0.00 4025260     33   82.99 4724900 2139833   1204 247.32
4.72   0.00   4.20   0.00 4035252     33   82.14 4719162 2132023   1264 247.32
4.71   0.00   4.21   0.00 4073206     33   83.68 4744973 2123317   1347 247.32
4.72   0.00   4.21   0.00 4061118     33   80.82 4713615 2188091   1490 247.32
4.72   0.00   4.21   0.00 4051675     33   85.29 4727399 2109011   1205 247.32
4.73   0.00   4.21   0.00 4039056     33   84.65 4724735 2102603   1053 247.32

After the patch

InMpps OMpps  InGbs  OGbs err TCP Est %CPU syscalls csw     irq GBfree
5.43   0.00   4.20   0.00 3313143     33   84.96 5434214 1900162   2656 245.51
5.43   0.00   4.20   0.00 3308527     33   85.24 5439695 1809382   2521 245.51
5.42   0.00   4.19   0.00 3316778     33   87.54 5416028 1805835   2256 245.51
5.42   0.00   4.19   0.00 3317673     33   90.44 5426044 1763056   2332 245.51
5.42   0.00   4.19   0.00 3314839     33   88.11 5435732 1792218   2499 245.52
5.44   0.00   4.19   0.00 3293228     33   91.84 5426301 1668597   2121 245.52

Similarly, netperf reports 230Mb/s before the patch, and 270Mb/s after the patch

Reviewed by:	gallatin
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15366
2018-05-18 20:13:34 +00:00
Pedro F. Giffuni
df57947f08 spdx: initial adoption of licensing ID tags.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.

Initially, only tag files that use BSD 4-Clause "Original" license.

RelNotes:	yes
Differential Revision:	https://reviews.freebsd.org/D13133
2017-11-18 14:26:50 +00:00
Gleb Smirnoff
41acb7e12c Mechanically convert to if_inc_counter(). 2014-09-18 20:30:47 +00:00
Gleb Smirnoff
1bffa9511f Use define from if_var.h to access a field inside struct if_data,
that resides in struct ifnet.

Sponsored by:	Nginx, Inc.
2014-08-30 19:55:54 +00:00
John Baldwin
068d8643ad Fix various NIC drivers to properly cleanup static DMA resources.
In particular, don't check the value of the bus_dma map against NULL
to determine if either bus_dmamem_alloc() or bus_dmamap_load() succeeded.
Instead, assume that bus_dmamap_load() succeeeded (and thus that
bus_dmamap_unload() should be called) if the bus address for a resource
is non-zero, and assume that bus_dmamem_alloc() succeeded (and thus
that bus_dmamem_free() should be called) if the virtual address for a
resource is not NULL.

In many cases these bugs could result in leaks when a driver was detached.

Reviewed by:	yongari
MFC after:	2 weeks
2014-06-11 14:53:58 +00:00
Gleb Smirnoff
76039bc84f The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-26 17:58:36 +00:00
Gleb Smirnoff
c6499eccad Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags in sys/dev.
2012-12-04 09:32:43 +00:00
Kevin Lo
f8d4925a39 Remove unused variable mii 2012-02-11 08:12:52 +00:00
Marius Strobl
4b7ec27007 - There's no need to overwrite the default device method with the default
one. Interestingly, these are actually the default for quite some time
  (bus_generic_driver_added(9) since r52045 and bus_generic_print_child(9)
  since r52045) but even recently added device drivers do this unnecessarily.
  Discussed with: jhb, marcel
- While at it, use DEVMETHOD_END.
  Discussed with: jhb
- Also while at it, use __FBSDID.
2011-11-22 21:28:20 +00:00
Pyun YongHyeon
17ff418d13 Announce flow control capability to underlying PHY driver.
Pause timer value is initialized to 0xFFFF. Controller allows just
4 different TX pause thresholds. The lowest possible threshold
value looks too aggressive so use next available threshold value.
2011-11-22 20:57:06 +00:00
Pyun YongHyeon
66c6108d5d Rework link establishment and link state detection logic.
- Remove MIIBUS statchg callback and program VGE_DIAGCTL before
   initiating link establishment.  Previously driver used to
   program VGE_DIAGCTL after getting a link in statchg callback.
   It seems the VGE_DIAGCTL register works like a kind of MII
   register such that it requires setting a 'to be' mode in advance
   rather than relying on resolved speed/duplex of established link.
   This means the statchg callback is not needed in driver.  In
   addition, if there was no link at the time of media change, this
   was not called at all.
 - Introduce vge_ifmedia_upd_locked() to change current media to
   configured one.  Actual media change is performed only after PHY
   reset and VGE_DIAGCTL setup.
 - In WOL configuration, make sure to clear forced mode such that
   controller can rely on auto-negotiation.
 - Unlike most other drivers that use miibus(4), vge(4) used
   controller's auto-polling feature for link state tracking via
   interrupt.  This came from controller's inefficient mechanism to
   access MII registers.  On link state change interrupt, vge(4)
   used to get current link state with series of MII register
   accesses.  Because vge(4) already enabled auto polling, read PHY
   status register to resolved speed/duplex/flow control parameters.

vge(4) still does not drive MII_TICK to reduce number of MII
register accesses which in turn means the driver does not know the
status of auto-negotiation.  This was a one of long standing
issue of vge(4).  Probably driver may be able to implement a timer
that keeps track of auto-negotiation state and restart
auto-negotiation when driver couldn't establish a link within a
specified period.  However the controller does not provide a
reliable way to detect auto-negotiation failure so I'm not sure
whether it's worth to implement it in driver.

Alternatively driver can completely disable MII auto-polling and
let miibus(4) poll link state by driving MII_TICK.  This may reduce
unnecessary overhead of stopping/restarting MII auto-polling of
controller.  Unfortunately it was known that some variants of
controller does not work correctly if MII auto-polling is disabled.
2011-11-22 20:45:09 +00:00
Pyun YongHyeon
471ad1d097 Always start MII auto polling before accessing any MII registers. 2011-11-22 18:58:39 +00:00
Pyun YongHyeon
57c81d92ae Close a race where SIOCGIFMEDIA ioctl get inconsistent link status.
Because driver is accessing a common MII structure in
mii_pollstat(), updating user supplied structure should be done
before dropping a driver lock.

Reported by:	Karim (fodillemlinkarimi <> gmail dot com)
2011-10-17 19:49:00 +00:00
Pyun YongHyeon
a3f4b4527c vge(4) hardwares poll media status and generates an interrupt
whenever the link state is changed.  Using software based polling
for media status tracking is known to cause MII access failure
under certain conditions once link is established so vge(4) used to
rely on link status change interrupt.
However DEVICE_POLLING completely disables generation of all kind
of interrupts on vge(4) such that this resulted in not detecting
link state change event.  This means vge(4) does not correctly
detect established/lost link with DEVICE_POLLING.  Losing the
interrupt made vge(4) not to send any packets to peer since vge(4)
does not try to send any packets when there is no established link.

Work around the issue by generating link state change interrupt
with DEVICE_POLLING.

PR:		kern/160442
Approved by:	re (kib)
2011-09-07 16:57:43 +00:00
Pyun YongHyeon
7ba75dc4e9 Datasheet says vge(4) controllers support DAC but it seems that's
not true on old PCI based controllers.  DAC configuration is read
from EEPROM in device reset phase and driver can override DAC
configuration.  However I guess there is an undocumented reason why
EEPROM configuration does not enable DAC so do not blindly override
DAC configuration.  Recent PCIe based controllers are supposed to
support 64bit DMA so allow 64bit DMA only on PCIe based controllers.

PR:		kern/157184
MFC after:	1 week
2011-05-20 18:27:13 +00:00
John Baldwin
3b0a4aef96 Do a sweep of the tree replacing calls to pci_find_extcap() with calls to
pci_find_cap() instead.
2011-03-23 13:10:15 +00:00
Marius Strobl
8e5d93dbb4 Convert the PHY drivers to honor the mii_flags passed down and convert
the NIC drivers as well as the PHY drivers to take advantage of the
mii_attach() introduced in r213878 to get rid of certain hacks. For
the most part these were:
- Artificially limiting miibus_{read,write}reg methods to certain PHY
  addresses; we now let mii_attach() only probe the PHY at the desired
  address(es) instead.
- PHY drivers setting MIIF_* flags based on the NIC driver they hang
  off from, partly even based on grabbing and using the softc of the
  parent; we now pass these flags down from the NIC to the PHY drivers
  via mii_attach(). This got us rid of all such hacks except those of
  brgphy() in combination with bce(4) and bge(4), which is way beyond
  what can be expressed with simple flags.

While at it, I took the opportunity to change the NIC drivers to pass
up the error returned by mii_attach() (previously by mii_phy_probe())
and unify the error message used in this case where and as appropriate
as mii_attach() actually can fail for a number of reasons, not just
because of no PHY(s) being present at the expected address(es).

Reviewed by:	jhb, yongari
2010-10-15 14:52:11 +00:00
Pyun YongHyeon
55219c9044 Remove wrong assertion. 2009-12-25 00:23:47 +00:00
Pyun YongHyeon
33a0d70b86 Disable jumbo frame support for PCIe VT6130/VT6132 controllers.
Quite contrary to VT6130 datasheet which says it supports up to 8K
jumbo frame, VT6130 does not seem to send jumbo frame that is
larger than 4K in length. Trying to send a frame that is larger
than 4K cause TX MAC hang.
Even though it's possible to allow 4K jumbo frame for VT6130, I
think it's meaningless to allow 4K jumbo frame. I'm not sure VT6132
also has the same limitation but I guess it uses the same MAC of
VT6130.
2009-12-20 19:45:46 +00:00
Pyun YongHyeon
564340e7cf VT6130 datasheet was wrong. If VT6130 receive a jumbo frame the
controller will split the jumbo frame into multiple RX buffers.
However it seems the hardware always dma the frame to 8 bytes
boundary for the split frames. Only the first part of the fragment
can have 4 byte alignment and subsequent buffers should be 8 bytes
aligned. Change RX buffer the alignment requirement to 8 bytes from
4 bytes.
2009-12-20 19:11:32 +00:00
Pyun YongHyeon
c743849615 Correct fragment bit definition in comments. 2009-12-20 18:53:34 +00:00
Pyun YongHyeon
12ea9beca9 Swap VGE_TXQTIMER and VGE_RXQTIMER register definition. Pending
timer for Tx queue is at 0x3E.
2009-12-19 20:45:23 +00:00
Pyun YongHyeon
7fc94bc4e0 Add rudimentary WOL support. While I'm here remove enabling
busmastering/memory address in resume path. Bus driver will handle
that.
2009-12-18 22:14:28 +00:00
Pyun YongHyeon
dd77a0532b Remove unused member variable of softc. 2009-12-17 19:48:54 +00:00
Pyun YongHyeon
610dfa9399 Actually clear interrupts. Writing 0 has no effect. 2009-12-17 18:03:05 +00:00
Pyun YongHyeon
3b2b8afb3c Implement interrupt moderation scheme supported by VT61xx
controllers. TX/RX interrupt mitigation is controlled by
VGE_TXSUPPTHR and VGE_RXSUPPTHR register. These registers suppress
generation of interrupts until the programmed frames counter equals
to the registers. VT61xx also supports interrupt hold off timer
register. If this interrupt hold off timer is active all interrupts
would be disabled until the timer reaches to 0. The timer value is
reloaded whenever VGE_ISR register written. The timer resolution is
about 20us.

Previously vge(4) used single shot timer to reduce Tx completion
interrupts. This required VGE_CRS1 register access in Tx
start/completion handler to rearm new timeout value and it did not
show satisfactory result(more than 50k interrupts under load). Rx
interrupts was not moderated at all such that vge(4) used to
generate too many interrupts which in turn made polling(4) better
approach under high network load.

This change activates all interrupt moderation mechanism and
initial values were tuned to generate interrupt less than 8k per
second. That number of interrupts wouldn't add additional packet
latencies compared to polling(4). These interrupt parameters could
be changed with sysctl.
dev.vge.%d.int_holdoff
dev.vge.%d.rx_coal_pkt
dev.vge.%d.tx_coal_pkt
Interface has be brought down and up again before change take
effect.

With interrupt moderation there is no more need to loop in
interrupt handler. This loop always added one more register access.
While I'm here remove dead code which tried to implement subset of
interrupt moderation.
2009-12-17 18:00:25 +00:00
Pyun YongHyeon
ed03d795ff Remove unused VGE_ETHER_ALIGN definition. 2009-12-17 17:38:06 +00:00
Pyun YongHyeon
83accfdbd5 Add "Velocity" to probe message which will make it clearer which
ethernet controller was recognized. VIA consistently calls
"Velocity" family for gigabit ethernet controllers. For fast
ethernet controllers they uses "Rhine" family(vr(4) controllers))
and vr(4) already shows "Rhine" in probe message.
2009-12-16 20:03:43 +00:00
Pyun YongHyeon
a931e5497b Add new flag VGE_FLAG_SUSPENDED to mark suspended state and
remove suspended member in softc.
2009-12-16 19:49:23 +00:00
Pyun YongHyeon
7129fb20d6 Add hardware MAC statistics support. This statistics could be
extracted from dev.vge.%d.stats sysctl node.
2009-12-16 19:41:40 +00:00
Pyun YongHyeon
5f07fd19e2 Rewrite RX filter setup and simplify code.
Now promiscuous mode and multicast handling is performed in single
function, vge_rxfilter().
2009-12-16 19:32:44 +00:00
Pyun YongHyeon
38aa43c511 All vge(4) controllers support RX/TX checksum offloading for VLAN
tagged frames so add checksum offloading capabilities. Also add
missing VLAN hardware tagging control in ioctl handler and let
upper stack know current VLAN capabilities.
2009-12-16 18:03:25 +00:00
Pyun YongHyeon
0c003e99df Tell upper layer vge(4) supports long frames. This should be done
after ether_ifattach(), as ether_ifattach() initializes it with
ETHER_HDR_LEN.
While I'm here remove setting if_mtu, it's already handled in
ether_ifattach().
2009-12-14 22:55:20 +00:00
Pyun YongHyeon
5f26dcd859 Don't report current link status if interface is not UP.
If interface is not UP, the current link status wouldn't
reflect the negotiated status.
2009-12-14 22:30:07 +00:00
Pyun YongHyeon
6f53098372 Report media change result to caller instead of returning success
without regard to the result.
2009-12-14 22:23:06 +00:00
Pyun YongHyeon
e7b2d9b80f Whenever link state change interrupt is raised, vge_tick() is
called and vge(4) used to drive auto-negotiation timer(mii_tick) in
vge_tick(). Therefore the mii_tick was not called for every hz such
that auto-negotiation complete was never handled in vge(4).
Use mii_pollstat to extract current negotiated speed/duplex instead
of mii_tick. The latter is valid only for auto-negotiation case.
While I'm here change the confusing function name vge_tick() to
vge_link_statchg().
2009-12-14 22:20:05 +00:00
Pyun YongHyeon
e4027c4968 Sort function prototyes. 2009-12-14 22:00:11 +00:00
Pyun YongHyeon
20c3cb15df We don't have to reload EEPROM in vge_reset(). Because vge_reset()
is called in vge_init_lock(), vge(4) always used to reload EEPROM.
Also add more comment why vge(4) clears VGE_CHIPCFG0_PACPI bit.
While I'm here add missing new line in vge_reset().
2009-12-14 21:16:02 +00:00
Pyun YongHyeon
623fa7184b Increase output queue size from 64 to 255. 2009-12-14 20:59:18 +00:00
Pyun YongHyeon
5957cc2a24 Add MSI support for VT613x controllers. 2009-12-14 20:49:50 +00:00
Pyun YongHyeon
643e9ee9fb Save PHY address by reading VGE_MIICFG register. For PCIe
controllers(VT613x), we assume the PHY address is 1.
Use the saved PHY address in MII register access routines and
remove accessing VGE_MIICFG register.
While I'm here save PCI express capability register which will be
used in near future.
2009-12-14 20:39:42 +00:00
Pyun YongHyeon
4d7235dd49 Introduce vge_flags member in softc. The vge_flags member will
record device specific bits. Remove vge_link and use vge_flags.
While here, move clearing link state before mii_mediachg() as
mii_mediachg() may affect link state.
2009-12-14 20:17:53 +00:00
Pyun YongHyeon
77ec83ced8 style(9). 2009-12-14 20:07:25 +00:00
Pyun YongHyeon
c3c74c6107 s/u_intXX_t/uintXX_t/g 2009-12-14 19:53:57 +00:00
Pyun YongHyeon
e44b7069a9 Remove unnecessary return statement. 2009-12-14 19:49:20 +00:00
Pyun YongHyeon
6afe22a8e6 Use ANSI function definations. 2009-12-14 19:44:54 +00:00
Pyun YongHyeon
420d0abff8 Clear VGE_TXDESC_Q bit for transmitted frames. The VGE_TXDESC_Q bit
seems to work like a tag that indicates 'not list end' of queued
frames. Without having a VGE_TXDESC_Q bit indicates 'list end'. So
the last frame of multiple queued frames has no VGE_TXDESC_Q bit.
The hardware has peculiar behavior for VGE_TXDESC_Q bit handling.
If the VGE_TXDESC_Q bit of descriptor was set the controller would
fetch next descriptor. However if next descriptor's OWN bit was
cleared but VGE_TXDESC_Q was set, it could confuse controller.
Clearing VGE_TXDESC_Q bit for transmitted frames ensure correct
behavior.
2009-12-14 19:08:11 +00:00
Pyun YongHyeon
bf93bee5c5 Fix typo in register definition. 2009-12-14 18:50:26 +00:00
Pyun YongHyeon
4baee89724 Use PCIR_BAR instead of hard-coded value. 2009-12-14 18:49:16 +00:00
Pyun YongHyeon
410f4c60ad Overhaul bus_dma(9) usage and fix various things.
o Separate TX/RX buffer DMA tag from TX/RX descriptor ring DMA tag.
 o Separate RX buffer DMA tag from common buffer DMA tag. RX DMA
   tag has different restriction compared to TX DMA tag.
 o Add 40bit DMA address support.
 o Adjust TX/RX descriptor ring alignment to 64 bytes from 256
   bytes as documented in datasheet.
 o Added check to ensure TX/RX ring reside within a 4GB boundary.
   Since TX/RX ring shares the same high address register they
   should have the same high address.
 o TX/RX side bus_dmamap_load_mbuf_sg(9) support.
 o Add lock assertion to vge_setmulti().
 o Add RX spare DMA map to recover from DMA map load failure.
 o Add optimized RX buffer handler, vge_discard_rxbuf which is
   activated when vge(4) sees bad frames.
 o Don't blindly update VGE_RXDESC_RESIDUECNT register. Datasheet
   says the register should be updated only when number of
   available RX descriptors are multiple of 4.
 o Use __NO_STRICT_ALIGNMENT instead of defining VGE_FIXUP_RX which
   is only set for i386 architecture. Previously vge(4) also
   performed expensive copy operation to align IP header on amd64.
   This change should give RX performance boost on amd64
   architecture.
 o Don't reinitialize controller if driver is already running. This
   should reduce number of link state flipping.
 o Since vge(4) drops a driver lock before passing received frame
   to upper layer, make sure vge(4) is still running after
   re-acquiring driver lock.
 o Add second argument count to vge_rxeof(). The argument will
   limit number of packets could be processed in RX handler.
 o Rearrange vge_rxeof() not to allocate RX buffer if received
   frame was bad packet.
 o Removed if_printf that prints DMA map failure. This type of
   message shouldn't be used in fast path of driver.
 o Reduce number of allowed TX buffer fragments to 6 from 7. A TX
   descriptor allows 7 fragments of a frame. However the CMZ field
   of descriptor has just 3bits and the controller wants to see
   fragment + 1 in the field. So if we have 7 fragments the field
   value would be 0 which seems to cause unexpected results under
   certain conditions. This change should fix occasional TX hang
   observed on vge(4).
 o Simplify vge_stat_locked() and add number of available TX
   descriptor check.
 o vge(4) controllers lack padding short frames. Make sure to fill
   zero for the padded bytes. This closes unintended information
   disclosure.
 o Don't set VGE_TDCTL_JUMBO flag. Datasheet is not clear whether
   this bit should be set by driver or write-back status bit after
   transmission. At least vendor's driver does not set this bit so
   remove it. Without this bit vge(4) still can send jumbo frames.
 o Don't start driver when vge(4) know there are not enough RX
   buffers.
 o Remove volatile keyword in RX descriptor structure. This should
   be handled by bus_dma(9).
 o Collapse two 16bits member of TX/RX descriptor into single 32bits
   member.
 o Reduce number of RX descriptors to 252 from 256. The
   VGE_RXDESCNUM is 16bits register but only lower 8bits are valid.
   So the maximum number of RX descriptors would be 255. However
   the number of should be multiple of 4 as controller wants to
   update 4 RX descriptors at a time. This limits the maximum
   number of RX descriptor to be 252.

Tested by:	Dewayne Geraghty (dewayne.geraghty <> heuristicsystems dot com dot au)
		Carey Jones (m.carey.jones <> gmail dot com)
		Yoshiaki Kasahara (kasahara <> nc dor kyushu-u dot ac dotjp)
2009-12-14 18:44:23 +00:00