o Increased number of Rx/Tx descriptors to 256 for 8169 GigEs
because it's hard to push the hardware to the limit with the
default 64 descriptors.
TSO requires a large number of Tx descriptors to pass a full-sized
TCP segment (65535-byte IP packet) to the hardware. Previously it
consumed 32 Tx descriptors, assuming an MCLBYTES DMA segment size,
to send the TCP segment, which means re(4) couldn't queue more
than two full-sized IP packets.
The 8139C+ still uses 64 Rx/Tx descriptors due to its hardware
limitations. With this change there is a (very) small waste of
memory for 8139C+ users, but I don't think it would affect 8139C+
users in most cases.
o Various bus_dma(9) fixes.
- The hardware supports DAC so allow 64bit DMA operations.
- Removed BUS_DMA_ALLOC_NOW flag.
- Increased DMA segment size to 4096 from MCLBYTES because TSO
consumes too many descriptors with MCLBYTES DMA segment size.
- Tx/Rx side bus_dmamap_load_mbuf_sg(9) support. With these
changes the code is more readable than the previous version and
performs (slightly) better as it doesn't need to pass/decode
arguments to/from a callback function.
- Removed unnecessary callback function re_dmamap_desc() and
nuked rl_dmaload_arg structure which was used in the callback.
- Additional protection for DMA map load failure. In case of
failure reuse current map instead of returning a bogus DMA
map.
- Deferred DMA map unloading/sync operation for maximum
performance until we really need to load a new DMA map. If we
happen to reuse the current map (e.g. on an input error) there is
no need to sync/unload/load again.
- The number of allowable Tx DMA segments for an mbuf chain is
now 32 instead of a magic nseg value. If the number of available
Tx descriptors is too small to send a highly fragmented mbuf
chain, an optimized re_defrag() is called to collapse the mbuf
chain, which is supposed to be much faster than m_defrag(9).
re_defrag() was borrowed from ath(4).
- Separated Rx/Tx DMA tags from a common DMA tag such that the Rx
DMA tag correctly uses DMA maps that were created with the DMA
alignment restriction (8-byte alignment). The Tx DMA tag does not
have such an alignment limitation.
- Added additional sanity checks for DMA ring map load failure.
- Added additional spare Rx DMA map for graceful handling of Rx
DMA map load failure.
- Fixed misused bus_dmamap_sync(9) and added missing
bus_dmamap_sync(9) in re_encap()/re_txeof()/re_rxeof().
o Enabled TSO again now that re(4) has a reasonable number of Tx
descriptors.
o Don't touch DMA address of a Tx descriptor in re_txeof(). It's
not needed.
o Fix incorrect update of if_ierrors counter. For Rx buffer
shortage it should update if_qdrops as the buffer is reused.
o Added checks for unsupported H/W revisions and return ENXIO for
such hardware. This is required to remove resource allocation
code in re_probe, as other drivers do in their device probe routines.
o Modified descriptor index manipulation macros as it's now possible
to have different number of descriptors for Rx/Tx.
o In re_start, to save a lock operation, use IFQ_DRV_IS_EMPTY before
trying to invoke IFQ_DRV_DEQUEUE. Also don't blindly call re_encap
since we already know the number of available Tx descriptors in
advance.
o Removed RL_TX_DESC_THLD which was used to reserve RL_TX_DESC_THLD
descriptors in the Tx path. There is no such limitation mentioned in
the 8139C+/8169/8110/8168/8101/8111 datasheets and it seems to work ok
without reserving RL_TX_DESC_THLD descriptors.
o Fix a comment for RL_GTXSTART. The register is an 8-bit register.
o Added comments for 8169/8139C+ hardware restrictions on descriptors.
o Removed forward declaration for "struct rl_softc", it's not needed.
o Added a new structure rl_txdesc for Tx descriptor management and
a structure rl_rxdesc for Rx descriptor management.
o Removed unused member variable rl_intlock in driver softc. There are
still several unused member variables which are supposed to be used
to access hardware statistics counters. But it seems that access to
the hardware counters has not been implemented yet.
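The descriptor arithmetic behind the ring-size and DMA-segment-size changes
above can be checked with a short sketch. It assumes MCLBYTES is 2048 bytes,
which is consistent with the 32-descriptor figure quoted above, and it
assumes the best case where every DMA segment is completely filled. The
helper names are made up for illustration and are not part of re(4).

```c
#include <assert.h>

/* Assumed values for this sketch only. */
#define SKETCH_MCLBYTES		2048u	/* assumption: cluster size */
#define SKETCH_TSO_SEGSIZE	4096u	/* new DMA segment size from the log */
#define SKETCH_MAX_IP_PACKET	65535u	/* full-sized TSO payload */

/*
 * Descriptors needed to map pkt_len bytes when each DMA segment holds at
 * most seg_size bytes (best case: no partially filled segments except
 * the last one).
 */
static unsigned
sketch_descs_needed(unsigned pkt_len, unsigned seg_size)
{
	assert(seg_size > 0);
	return ((pkt_len + seg_size - 1) / seg_size);
}

/* Number of full-sized packets that fit in a ring of ring_size descriptors. */
static unsigned
sketch_packets_per_ring(unsigned ring_size, unsigned seg_size)
{
	return (ring_size / sketch_descs_needed(SKETCH_MAX_IP_PACKET, seg_size));
}
```

Under these assumptions a 64-entry ring with MCLBYTES segments holds two
full-sized packets, while a 256-entry ring with 4096-byte segments holds
sixteen. Real mbuf chains can be more fragmented, so these are upper bounds.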
as multicast/broadcast frames. Previously re(4) ignored multicast
frames in promiscuous mode. The RTL8169 datasheet was not clear
how it handles multicast frames in promiscuous mode.
PR: kern/118572
MFC after: 3 days
Ethernet Controller. Multicast filtering wasn't tested and needs more
exploration. While I'm here, replace the complex if statements with a
switch statement, which should improve readability.
Reported by: Abdullah Ibn Hamad Al-Marri < wearabnet AT yahoo DOT ca >
Tested by: Abdullah Ibn Hamad Al-Marri < wearabnet AT yahoo DOT ca >
Without this the PHY wouldn't work as expected. This should fix
dual-boot Windows XP machines where the RealTek Windows drivers put
the PHY in power down mode during shutdown. The magic PHY register
accesses come from the RealTek driver. No datasheets mention the magic
PHY registers.
In general, the PHY wakeup code should go into the PHY driver. However,
it seems that it only applies to the RTL8169S single chip, and it would
be another hack to have rgephy(4) check what parent driver/chip model
is attached.
Reported by: lofi, Laurens Timmermans ( laurens AT timkapel DOT nl )
Tested by: lofi
Obtained from: RealTek FreeBSD driver
Approved by: re (Ken Smith)
to clear RL_TDESC_VLANCTL_TAG). This fixes sending packets in the
native VLAN when running both tagged and an untagged VLAN over the
same trunk and descriptors are recycled.
Approved by: re (kensmith)
MFC after: 1 week
Ever since switching to adaptive polling re(4) occasionally spews
watchdog timeouts on systems with MSI capability. This change is
the minimal one needed to support MSI; re(4) will also need MSI-X
support for the RTL8111C in the future. Because the softc structure
of re(4) is shared with rl(4), rl(4) was touched to use the modified
softc.
Reported by: cnst
Tested by: cnst
Approved by: re (kensmith)
would be 93C46 (1Kbit) or 93C56 (2Kbit). One of the differences between
them is the number of address lines required to access the EEPROM. For
example, the 93C56 EEPROM needs 8 address lines to read/write data. If
the 93C56 sees the serial clock (CLK) end before the number of cycles
required to set the OP code/address of the EEPROM, the result is
unexpected behavior.
Previously the driver tried to detect a 93C46, which requires 6 address
lines, and then assumed a 93C56 if the data read was not the expected
value. However, this approach didn't work in some models/situations
as the 93C56 requires 8 address lines to access its data. To fix
it, change the EEPROM probing order such that the 93C56 is detected
reliably.
While I'm here, replace hard-coded address line numbers with defined
constants to enhance readability.
PR: 112710
Approved by: re (mux)
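To illustrate why the address width matters in the EEPROM probing change
above, here is a minimal sketch of how a serial READ command could be
assembled for each part. It assumes the usual 93Cx6 framing of a start bit
followed by the two-bit READ opcode (binary 110 in total) and then the
address; the macro and function names are hypothetical and are not taken
from the driver.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_EE_93C46_ADDR_BITS	6
#define SKETCH_EE_93C56_ADDR_BITS	8
/* Assumption: start bit (1) followed by the READ opcode (10). */
#define SKETCH_EE_READ			0x6u

/* Build the bit pattern shifted out to the EEPROM for a READ of 'addr'. */
static uint16_t
sketch_ee_read_cmd(unsigned addr, unsigned addr_bits)
{
	assert(addr_bits > 0 && addr_bits <= 8);
	assert(addr < (1u << addr_bits));
	return ((uint16_t)((SKETCH_EE_READ << addr_bits) | addr));
}

/* Clock cycles needed to shift the whole command out, MSB first. */
static unsigned
sketch_ee_cmd_clocks(unsigned addr_bits)
{
	return (3 + addr_bits);
}
```

With these definitions a 93C46-style command takes 9 clocks and a
93C56-style command takes 11, so a 93C56 probed with the shorter command
sees the clock stop two cycles early, which is the failure described above.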
Without the bus_dma cleanup and an increase in the number of Tx
descriptors it's hard to guarantee correct Tx operation in the TSO
case. TSO support will be enabled again when I get more feedback on
the re(4) patch posted to current.
Previously, whenever PROMISC mode was turned on/off a link
renegotiation occurred, which could result in network unavailability
for several seconds. (Depending on switch STP settings it could last
several tens of seconds.)
Reported by: Prokofiev S.P. < proks AT logos DOT uptel DOT net >
Tested by: Prokofiev S.P. < proks AT logos DOT uptel DOT net >
If these drivers are setting M_VLANTAG because they are stripping the
layer 2 802.1Q headers, then they need to be re-inserting them so any
bpf(4) peers can properly decode them.
It should be noted that this is compile tested only.
MFC after: 3 weeks
apparently be confused by short TCP segments that have been manually
padded to the minimum ethernet frame size. The driver does short frame
padding in software as a workaround for a bug in the 8169 PCI devices
that causes short IP fragments to be corrupted due to an apparent
conflict between the hardware autopadding and hardware IP checksumming.
To fix this, we avoid software padding for short TCP segments, since
the hardware seems to autopad and checksum these correctly (even the
older 8169 NICs get these right). Short UDP packets appear to be
handled correctly in all cases. This should work around the IP header
checksum bug in the 8169 while not tripping the TCP checksum bug in
the 8111B/8168B and 8101E.
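One reading of the rule described above, as a small decision function. The
flag bits and names are hypothetical stand-ins (they are not the kernel's
CSUM_* values), and the sketch assumes that frames without any checksum
offload request are left to the hardware autopadding, as in the earlier
workaround.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits for this sketch only. */
#define SKETCH_CSUM_IP		0x1u
#define SKETCH_CSUM_TCP		0x2u
#define SKETCH_CSUM_UDP		0x4u

#define SKETCH_MIN_FRAMELEN	60u	/* minimum Ethernet frame, without FCS */

/* Decide whether the driver should pad a frame in software before Tx. */
static bool
sketch_needs_sw_pad(size_t frame_len, unsigned csum_flags)
{
	if (frame_len >= SKETCH_MIN_FRAMELEN)
		return (false);	/* long enough, nothing to pad */
	if (csum_flags == 0)
		return (false);	/* no offload requested, autopad is safe */
	if ((csum_flags & SKETCH_CSUM_TCP) != 0)
		return (false);	/* short TCP segment: let the hardware pad */
	return (true);		/* short IP fragment or UDP: pad in software */
}
```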
addresses shall access invalid descriptor DMA addresses on PCIe
hardware and then panic the system.
To fix it set descriptor DMA addresses before enabling Tx and Rx
such that hardware can see valid descriptor DMA addresses. Also
set RL_EARLY_TX_THRESH before starting Tx and Rx.
Reported by: steve.tell AT crashmail DOT de
Tested by: steve.tell AT crashmail DOT de
Obtained from: NetBSD
MFC after: 1 week
operation as it ran out of free descriptors or if there are too many
segments in the first place, call bus_dmamap_unload() in order to
unload the already loaded segments.
For trying to map the defragmented mbuf (chain) in re_encap() this
introduces re_dma_map_desc() setting arg.rl_maxsegs to 0 as a new
failure mode. Previously we just ignored this case, corrupting our
view of the TX ring.
o In re_txeof():
- Don't clear IFF_DRV_OACTIVE unless there are at least 4 free TX
descriptors. Further down the road re_encap() will bail if there
aren't at least 4 free TX descriptors, causing re_start() to
abort and prepend the dequeued mbuf again so it makes no sense
to pretend we could process mbufs again when in fact we won't.
While at it replace this magic 4 with a macro RL_TX_DESC_THLD
throughout this driver.
- Don't cancel the watchdog timeout as soon as there's at least one
free TX descriptor but instead only if all descriptors have been
handled. It's perfectly normal, especially in the DEVICE_POLLING
case, that re_txeof() is called when only a part of the enqueued
TX descriptors have been handled, causing the watchdog to be
disarmed prematurely.
o In re_encap():
- If m_defrag() fails just drop the packet like other NIC drivers
do. This should only happen when there's a mbuf shortage, in which
case it was possible to end up with an IFQ full of packets which
couldn't be processed as they couldn't be defragmented as they
were taking up all the mbufs themselves. This includes adjusting
re_start() to not try to prepend the mbuf (chain) if re_encap()
has freed it.
- Remove dupe initialization of members of struct rl_dmaload_arg to
values that didn't change since trying to process the fragmented
mbuf chain.
While at it remove an unused member from struct rl_dmaload_arg.
o In re_start() remove an abandoned, banal comment. The corresponding
code was moved to re_attach() some time ago.
With these changes re(4) now survives one day (until stopped) of
hammering out packets here.
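The two re_txeof() rules above can be sketched with a toy model of the Tx
ring. The structure, its fields and the ring size are invented for
illustration; only the threshold of 4 (RL_TX_DESC_THLD) and the two
conditions come from the description above.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_TX_DESC_CNT	64	/* assumed ring size */
#define SKETCH_TX_DESC_THLD	4	/* the former "magic 4" */

struct sketch_txring {
	int	free;		/* free Tx descriptors */
	bool	oactive;	/* stand-in for IFF_DRV_OACTIVE */
	int	timer;		/* stand-in for the watchdog, 0 = disarmed */
};

/* Account for 'reclaimed' descriptors that the hardware has completed. */
static void
sketch_txeof(struct sketch_txring *r, int reclaimed)
{
	assert(reclaimed >= 0 && r->free + reclaimed <= SKETCH_TX_DESC_CNT);
	r->free += reclaimed;

	/* Only claim we can transmit again if the encap path would accept it. */
	if (r->free >= SKETCH_TX_DESC_THLD)
		r->oactive = false;

	/* Disarm the watchdog only once every descriptor has been handled. */
	if (r->free == SKETCH_TX_DESC_CNT)
		r->timer = 0;
}
```

Starting from a full ring with the watchdog armed, reclaiming three
descriptors changes neither flag; the fourth clears the output-active flag,
and the watchdog stays armed until the whole ring is free.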
Reviewed by: yongari
MFC after: 2 weeks
re_watchdog() in order to avoid races accessing if_timer.
- Use bus_get_dma_tag() so re(4) works on platforms requiring it.
- Remove invalid BUS_DMA_ALLOCNOW when creating the parent DMA tag
and the tags that are used for static memory allocations.
- Don't bother to set if_mtu to ETHERMTU, ether_ifattach() does that.
- Remove an unused variable in re_intr().
m_pkthdr.ether_vtag. The presence of the M_VLANTAG flag on the mbuf
signifies the presence and validity of its content.
Drivers that support hardware VLAN tag stripping fill in the received
VLAN tag (containing both vlan and priority information) into the
ether_vtag mbuf packet header field:
	m->m_pkthdr.ether_vtag = vlan_id;	/* ntohs()? */
	m->m_flags |= M_VLANTAG;
to mark the packet m with the specified VLAN tag.
On output the driver should check the mbuf for the M_VLANTAG flag to
see if a VLAN tag is present and valid:
	if (m->m_flags & M_VLANTAG) {
		... = m->m_pkthdr.ether_vtag;	/* htons()? */
		... pass tag to hardware ...
	}
VLAN tags are stored in host byte order. Byte swapping may be necessary.
(Note: This driver conversion was mechanical and did not add or remove any
byte swapping in the drivers.)
Remove zone_mtag_vlan UMA zone and MTAG_VLAN definition. No more tag
memory allocation has to be done.
Reviewed by: thompsa, yar
Sponsored by: TCP/IP Optimization Fundraise 2005
if_watchdog, etc., or in functions used only in these methods.
In all other functions in the driver use device_printf().
- Use __func__ instead of typing function name.
Submitted by: Alex Lyashkov <umka sevcity.net>
Ever since rev 1.68 re(4) checks the validity of link in re_start.
But rlphy(4) got a garbled data due to a different bit layout used on
8139C+ and it couldn't report correct link state. To fix it, ignore
BMCR_LOOP and BMCR_ISO bits which have different meanings on 8139C+.
I think this also makes dhclient(8) work on 8139C+.
Reported by: Gerrit Kuehn <gerrit AT pmp DOT uni-hannover DOT de>
Tested by: Gerrit Kuehn <gerrit AT pmp DOT uni-hannover DOT de>
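A minimal sketch of the masking idea described above. The bit values are
the standard MII BMCR positions; the function name is hypothetical, and how
rlphy(4) actually applies the mask is not shown in this log.

```c
#include <assert.h>
#include <stdint.h>

/* Standard MII basic mode control register bits. */
#define SKETCH_BMCR_LOOP	0x4000u
#define SKETCH_BMCR_AUTOEN	0x1000u
#define SKETCH_BMCR_ISO		0x0400u

/*
 * Drop the two bits that carry a different meaning on the 8139C+ before
 * the value is interpreted as a standard BMCR.
 */
static uint16_t
sketch_8139cp_bmcr(uint16_t raw)
{
	return ((uint16_t)(raw & ~(SKETCH_BMCR_LOOP | SKETCH_BMCR_ISO)));
}
```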
- Change the workaround for the autopad/checksum offload bug so that
instead of lying about the map size, we actually create a properly
padded mbuf and map it as usual. The other trick works, but is ugly.
This approach also gives us a chance to zero the pad space to avoid
possibly leaking data.
- With the PCIe devices, it looks like issuing a TX command while there's
already a transmission in progress doesn't have any effect. In other
words, if you send two packets in rapid succession, the second one may
end up sitting in the TX DMA ring until another transmit command is
issued later in the future. Basically, if re_txeof() sees that there
are still descriptors outstanding, it needs to manually resume the
TX DMA channel by issuing another TX command to make sure all
transmissions are flushed out. (The PCI devices seem to keep the
TX channel moving until all descriptors have been consumed. I'm not
sure why the PCIe devices behave differently.)
(You can see this issue if you do the following test: plug an re(4)
interface into another host via crossover cable, and from the other
host do 'ping -c 2 <host with re(4) NIC>' to prime the ARP cache,
then do 'ping -c 1 -s 1473 <host with re(4) NIC>'. You're supposed
to see two packets sent in response, but you may only see one. If
you do 'ping -c 1 -s 1473 <host with re(4) NIC>' again, you'll
see two packets, but one will be the missing fragment from the last
ping, followed by one of the fragments from this ping.)
- Add the PCI ID for the US Robotics 997902 NIC, which is based on
the RTL8169S.
- Add a tsleep() of 1 second in re_detach() after the interrupt handler
is disconnected. This should allow any tasks queued up by the ISR
to drain. Now, I know you're supposed to use taskqueue_drain() for
this, but something about the way taskqueue_drain() works with
taskqueue_fast queues doesn't seem quite right, and I refuse to be
tricked into fixing it.
- Correct the PCI ID for the 8169SC/8110SC in the device list (I added
the macro for it to if_rlreg.h before, but forgot to use it.)
- Remove the extra interrupt spinlock I added previously. After giving it
some more thought, it's not really needed.
- Work around a hardware bug in some versions of the 8169. When sending
very small IP datagrams with checksum offload enabled, a conflict can
occur between the TX autopadding feature and the hardware checksumming
that can corrupt the outbound packet. This is the reason that checksum
offload sometimes breaks NFS: if you're using NFS over UDP, and you're
very unlucky, you might find yourself doing a fragmented NFS write where
the last fragment is smaller than the minimum ethernet frame size (60
bytes). (It's rare, but if you keep NFS running long enough it'll
happen.) If checksum offload is enabled, the chip will have to both
autopad the fragment and calculate its checksum header. This confuses
some revs of the 8169, causing the packet that appears on the wire
to be corrupted. (The IP addresses and the checksum field are mangled.)
This will cause the NFS write to fail. Unfortunately, when NFS retries,
it sends the same write request over and over again, and it keeps
failing, so NFS stays wedged.
(A simple way to provoke the failure is to connect the failing system
to a network with a known good machine and do "ping -s 1473 <badhost>"
from the good system. The ping will fail.)
Someone had previously worked around this using the heavy-handed
approach of just disabling checksum offload. The correct fix is to
manually pad short frames where the TCP/IP stack has requested
checksum offloading. This allows us to have checksum offload turned
on by default but still let NFS work right.
- Not a bug, but change the ID strings for devices with hardware rev
0x30000000 and 0x38000000 to both be 8168B/8111B. According to RealTek,
they're both the same device, but 0x30000000 is an earlier silicon spin.
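The padding workaround from the list above, sketched on a flat buffer
rather than an mbuf chain. The function name and the buffer interface are
assumptions made for the example; the driver itself works on mbufs. The
point shown is that the pad bytes are zeroed explicitly so that stale
buffer contents never reach the wire.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MIN_FRAMELEN	60u	/* minimum Ethernet frame, without FCS */

/*
 * Pad the frame held in buf (len bytes used, cap bytes available) up to
 * the minimum frame size. Returns the new length, or 0 if the buffer is
 * too small to hold a padded frame.
 */
static size_t
sketch_pad_frame(unsigned char *buf, size_t len, size_t cap)
{
	if (len >= SKETCH_MIN_FRAMELEN)
		return (len);		/* nothing to do */
	if (cap < SKETCH_MIN_FRAMELEN)
		return (0);		/* caller must get a bigger buffer */
	memset(buf + len, 0, SKETCH_MIN_FRAMELEN - len);
	return (SKETCH_MIN_FRAMELEN);
}
```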
cards: the chips are all marked "RTL8111B", but they put stickers on the
back that say "RTL8168B/8111B". The manual says there's only one HWREV code
for both the 8111B and 8168B devices, which is 0x30000000, but the cards
they sent me actually report HWREV of 0x38000000. Deciding to trust the
hardware in front of me rather than a possibly incorrect manual (it wouldn't
be the first time the HWREVs were incorrectly documented), I changed the
8168 revision code. It turns out this was a mistake though: 0x30000000
really is valid for the 8168.
There are two possible reasons for there to be two different HWREVs:
1) 0x30000000 is used only for the 8168B and 0x38000000 is only for
the 8111B.
2) There were 8111/8168 rev A devices which both used code 0x30000000,
and the 8111B/8168B both use 0x38000000.
The product list on the RealTek website doesn't mention the existence of
any 8168/8111 rev A chips being in production though, and I've never seen
one, so until I get clarification from RealTek, I'm going to assume that
0x30000000 is just for the 8168B and 0x38000000 is for the 8111B only.
So, the HWREV code for the 8168 has been put back to 0x30000000,
a new 8111 HWREV code has been added, and there are now separate
entries for recognizing both devices in the device list. This will
allow all devices to work, though if it turns out I'm wrong I may
need to change the ID strings.
latter is a PCIe 10/100 chip.
Finally fix the EEPROM reading code so that we can access the EEPROMs on all
devices. In order to access the EEPROM, we must select 'EEPROM programming'
mode, and then set the EEPROM chip select bit. Previously, we were setting
both bits simultaneously, which doesn't work: they must be set in the
right sequence.
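The ordering requirement above can be shown with a toy register model. The
bit values and all names here are hypothetical placeholders, not the
definitions from if_rlreg.h; the sketch only demonstrates two separate
writes in the required order instead of one combined write.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit assignments for this sketch. */
#define SKETCH_EEMODE_PROGRAM	0x80u	/* 'EEPROM programming' mode */
#define SKETCH_EE_SEL		0x08u	/* EEPROM chip select */

struct sketch_eecmd {
	uint8_t		reg;		/* current register contents */
	uint8_t		log[4];		/* values written, in order */
	unsigned	nwrites;
};

/* Stand-in for a register write; records each value for inspection. */
static void
sketch_write(struct sketch_eecmd *e, uint8_t v)
{
	assert(e->nwrites < 4);
	e->reg = v;
	e->log[e->nwrites++] = v;
}

/* Enter programming mode first, then raise chip select in a second write. */
static void
sketch_eeprom_select(struct sketch_eecmd *e)
{
	sketch_write(e, (uint8_t)(e->reg | SKETCH_EEMODE_PROGRAM));
	sketch_write(e, (uint8_t)(e->reg | SKETCH_EE_SEL));
}
```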
Always obtain the station address from the EEPROM, now that EEPROM
reading works correctly.
Make the TX interrupt moderation code based on the internal timer
optional and turned off by default.
Make the re_diag() routine conditional and off by default. When it is
on, only use it for the original 8169, which was the only device that
really needed it.
Modify interrupt handling to use a fast interrupt handler and fast
taskqueue.
Correct the rgephy driver so that it only applies the DSP fixup for
PHY revs 0 and 1. Later chips are fixed and don't need the fixup.
Make the rgephy driver advertise both 1000_FD and 1000_HD bits in
autoneg mode. A couple of the devices don't autoneg correctly unless
configured this way.
case if memory allocation failed.
- Remove fourth argument from VLAN_INPUT_TAG(), that was used
incorrectly in almost all drivers. Indicate failure with
mbuf value of NULL.
In collaboration with: yongari, ru, sam
rather than in ifindex_table[]; all (except one) accesses are
through ifp anyway. IF_LLADDR() works faster, and all (except
one) ifaddr_byindex() users were converted to use ifp->if_addr.
- Stop storing a (pointer to) Ethernet address in "struct arpcom",
and drop the IFP2ENADDR() macro; all users have been converted
to use IF_LLADDR() instead.
cards and teach the re(4) driver to attach to revision 3 cards.
Submitted by: Fredrik Lindberg fli+freebsd-current at shapeshifter dot se
MFC after: 2 weeks
Reviewed by: imp, mdodd
opt_device_polling.h
- Include opt_device_polling.h into appropriate files.
- Wrap the include in HAVE_KERNEL_OPTION_HEADERS in the files that
can be compiled as loadable modules.
Reviewed by: bde
o Axe poll in trap.
o Axe IFF_POLLING flag from if_flags.
o Rework revision 1.21 (Giant removal), in such a way that
poll_mtx is not dropped during call to polling handler.
This fixes problem with idle polling.
o Make registration and deregistration from polling in a
functional way, instead of next tick/interrupt.
o Obsolete kern.polling.enable. Polling is turned on/off
with ifconfig.
Detailed kern_poll.c changes:
- Remove polling handler flags, introduced in 1.21. They are not
needed now.
- Forget and do not check if_flags, if_capenable and if_drv_flags.
- Call all registered polling handlers unconditionally.
- Do not drop poll_mtx, when entering polling handlers.
- In ether_poll() NET_LOCK_GIANT prior to locking poll_mtx.
- In netisr_poll() axe the block, where polling code asks drivers
to unregister.
- In netisr_poll() and ether_poll() do polling always, if any
handlers are present.
- In ether_poll_[de]register() remove a lot of error hiding code. Assert
that arguments are correct, instead.
- In ether_poll_[de]register() use standard return values in case of
error or success.
- Introduce poll_switch() that is a sysctl handler for kern.polling.enable.
poll_switch() goes through interface list and enables/disables polling.
A message that kern.polling.enable is deprecated is printed.
Detailed driver changes:
- On attach driver announces IFCAP_POLLING in if_capabilities, but
not in if_capenable.
- On detach driver calls ether_poll_deregister() if polling is enabled.
- In polling handler driver obtains its lock and checks IFF_DRV_RUNNING
flag. If the flag is not set, it unlocks and returns.
- In ioctl handler driver checks for IFCAP_POLLING flag requested to
be set or cleared. Driver first calls ether_poll_[de]register(), then
obtains driver lock and [dis/en]ables interrupts.
- In interrupt handler driver checks IFCAP_POLLING flag in if_capenable.
If present, then returns. This is important to protect from spurious
interrupts.
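The driver-side contract listed above can be modelled in user space as
follows. Everything here (the structure, the flag value, the function
names) is a hypothetical stand-in; the real driver would call
ether_poll_register()/ether_poll_deregister() and take its own lock at
the points marked in the comments.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_IFCAP_POLLING	0x1u	/* hypothetical flag value */

struct sketch_softc {
	unsigned	capabilities;	/* what the driver can do */
	unsigned	capenable;	/* what is currently enabled */
	bool		running;	/* stand-in for IFF_DRV_RUNNING */
	bool		intr_enabled;	/* hardware interrupts on/off */
	int		handled;	/* units of work processed */
};

/* Attach: announce polling support but do not enable it. */
static void
sketch_attach(struct sketch_softc *sc)
{
	sc->capabilities |= SKETCH_IFCAP_POLLING;
	sc->intr_enabled = true;
}

/* Ioctl: (de)register first, then lock and flip the interrupt state. */
static void
sketch_set_polling(struct sketch_softc *sc, bool on)
{
	/* ether_poll_register() or ether_poll_deregister() would go here. */
	/* The driver lock would be taken here. */
	if (on) {
		sc->capenable |= SKETCH_IFCAP_POLLING;
		sc->intr_enabled = false;
	} else {
		sc->capenable &= ~SKETCH_IFCAP_POLLING;
		sc->intr_enabled = true;
	}
}

/* Interrupt handler: ignore spurious interrupts while polling. */
static void
sketch_intr(struct sketch_softc *sc)
{
	if ((sc->capenable & SKETCH_IFCAP_POLLING) != 0)
		return;
	sc->handled++;
}

/* Polling handler: do nothing unless the interface is running. */
static void
sketch_poll(struct sketch_softc *sc)
{
	/* The driver lock would be held around this check. */
	if (!sc->running)
		return;
	sc->handled++;
}
```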
Reviewed by: ru, sam, jhb
the softc.
- Use callout_init_mtx() rather than timeout/untimeout in both rl(4)
and re(4).
- Fix locking for ifmedia by locking the driver in the ifmedia handlers
rather than in the miibus functions. (re(4) didn't lock the mii stuff
at all!)
- Fix some locking in re_ioctl().
Note: the two drivers share the same softc declared in if_rlreg.h, so they
had to be changed simultaneously.
MFC after: 1 week
Tested by: several on rl(4), none on re(4)
could get an interrupt after we free the ifp, and the interrupt
handler depended on the ifp being still alive, this could, in theory,
cause a crash. Eliminate this possibility by moving the if_free to
after the bus_teardown_intr() call.