a pointer to an ifaddr matching the passed socket address, returns a
boolean indicating whether one was present. In the (near) future,
ifa_ifwithaddr() will return a referenced ifaddr rather than a raw
ifaddr pointer, and the new wrapper will allow callers that care only
about the boolean condition to avoid having to free that reference.
MFC after: 3 weeks
- remove mbuf iovec - useful, but adds too much complexity when isolated to
the driver
- remove driver private caching - insufficient benefit over UMA to justify
the added complexity and maintenance overhead
- remove separate logic for managing multiple transmit queues, with the
new drbr routines the control flow can be made to much more closely resemble
legacy drivers
- remove dedicated service threads, with per-cpu callouts one can get the same
benefit much more simply by registering a callout 1 tick in the future if there
are still buffered packets
- remove embedded mbuf usage - Jeffr's changes will (I hope) soon be integrated
greatly reducing the overhead of using kernel APIs for reference counting
clusters
- add hysteresis to descriptor coalescing logic
- add coalesce threshold sysctls to allow users to decide at run-time
between optimizing for forwarding / UDP or optimizing for TCP
- add once per second watchdog to effectively close the very rare races
occurring from coalescing
- incorporate Navdeep's changes to the initialization path required to
convert port and adapter locks back to ordinary mutexes (silencing BPF
LOR complaints)
- enable prefetches in get_packet and tx cleaning
Reviewed by: navdeep@
MFC after: 2 weeks
by if_free (w/o doing if_attach); move ifq_attach to if_alloc and
rename ifq_attach/detach to ifq_init/ifq_delete to better identify
their purpose
Reviewed by: jhb, kmacy
the ROUTETABLES kernel option thus there is no need to include opt_route.h
anymore in all consumers of vnet.h and no longer depend on it for module
builds.
Remove the hidden include in flowtable.h as well and leave the two
explicit #includes in ip_input.c and ip_output.c.
- Each socket upcall is now invoked with the appropriate socket buffer
locked. It is not permissible to call soisconnected() with this lock
held; however, so socket upcalls now return an integer value. The two
possible values are SU_OK and SU_ISCONNECTED. If an upcall returns
SU_ISCONNECTED, then the soisconnected() will be invoked on the
socket after the socket buffer lock is dropped.
- A new API is provided for setting and clearing socket upcalls. The
API consists of soupcall_set() and soupcall_clear().
- To simplify locking, each socket buffer now has a separate upcall.
- When a socket upcall returns SU_ISCONNECTED, the upcall is cleared from
the receive socket buffer automatically. Note that a SO_SND upcall
should never return SU_ISCONNECTED.
- All this means that accept filters should now return SU_ISCONNECTED
instead of calling soisconnected() directly. They also no longer need
to explicitly clear the upcall on the new socket.
- The HTTP accept filter still uses soupcall_set() to manage its internal
state machine, but other accept filters no longer have any explicit
knowlege of socket upcall internals aside from their return value.
- The various RPC client upcalls currently drop the socket buffer lock
while invoking soreceive() as a temporary band-aid. The plan for
the future is to add a new flag to allow soreceive() to be called with
the socket buffer locked.
- The AIO callback for socket I/O is now also invoked with the socket
buffer locked. Previously sowakeup() would drop the socket buffer
lock only to call aio_swake() which immediately re-acquired the socket
buffer lock for the duration of the function call.
Discussed with: rwatson, rmacklem
Calculate the exact number of vectors we'll use before calling
pci_alloc_msix. Don't grab nine all the time.
Call cxgb_setup_interrupts once per T3, not once per port. Ditto
for cxgb_teardown_interrupts.
Don't leak resources when interrupt setup fails in the middle.
Obtained from: Navdeep Parhar
MFC after: 10 days
only if prepping the adapter failed.
Slight adjustment to comments.
Fix a bug whereby downing the interface didn't preven it from
processing packets.
Submitted by: Navdeep Parhar
MFC after: 1 week
1) Add a sysctl that will say what type of PHYs exist on the card.
2) Fix a bug that occurs when an AEL 2005 PHY resets without a transciever
in the card.
3) Unify the PHY link detection code.
Obtained from: Navdeep Parhar
MFC after: 10 days
and down more cleanly. This addresses a problem where if we have the
link flap during boot the driver would lock up the system.
Reviewed by: jhb
MFC after: 1 week
previously always pointing to the default vnet context, to a
dynamically changing thread-local one. The currvnet context
should be set on entry to networking code via CURVNET_SET() macros,
and reverted to previous state via CURVNET_RESTORE(). Recursions
on curvnet are permitted, though strongly discuouraged.
This change should have no functional impact on nooptions VIMAGE
kernel builds, where CURVNET_* macros expand to whitespace.
The curthread->td_vnet (aka curvnet) variable's purpose is to be an
indicator of the vnet context in which the current network-related
operation takes place, in case we cannot deduce the current vnet
context from any other source, such as by looking at mbuf's
m->m_pkthdr.rcvif->if_vnet, sockets's so->so_vnet etc. Moreover, so
far curvnet has turned out to be an invaluable consistency checking
aid: it helps to catch cases when sockets, ifnets or any other
vnet-aware structures may have leaked from one vnet to another.
The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros
was a result of an empirical iterative process, whith an aim to
reduce recursions on CURVNET_SET() to a minimum, while still reducing
the scope of CURVNET_SET() to networking only operations - the
alternative would be calling CURVNET_SET() on each system call entry.
In general, curvnet has to be set in three typicall cases: when
processing socket-related requests from userspace or from within the
kernel; when processing inbound traffic flowing from device drivers
to upper layers of the networking stack, and when executing
timer-driven networking functions.
This change also introduces a DDB subcommand to show the list of all
vnet instances.
Approved by: julian (mentor)
TCPSTAT_INC(), rather than directly manipulating the fields across the
kernel. This will make it easier to change the implementation of
these statistics, such as using per-CPU versions of the data structures.
MFC after: 3 days
The tick routine was not being restarted in the init_locked routine
which could resulted in loss of carrier when updating the MTU.
Submitted by: Navdeep Parhar at Chelsio
MFC after: 3 weeks
Firmware upgraded to 7.1.0 (from 5.0.0).
T3C EEPROM and SRAM added; Code to update eeprom/sram fixed.
fl_empty and rx_fifo_ovfl counters can be observed via sysctl.
Two new cxgbtool commands to get uP logic analyzer info and uP IOQs
Synced up with Chelsio's "common code" (as of 03/03/09)
Submitted by: Navdeep Parhar at Chelsio
Reviewed by: gnn
MFC after: 2 weeks
net/route.h.
Remove the hidden include of opt_route.h and net/route.h from net/vnet.h.
We need to make sure that both opt_route.h and net/route.h are included
before net/vnet.h because of the way MRT figures out the number of FIBs
from the kernel option. If we do not, we end up with the default number
of 1 when including net/vnet.h and array sizes are wrong.
This does not change the list of files which depend on opt_route.h
but we can identify them now more easily.
had been the only flag with random usage patterns.
Switch inc_flags to be used as a real bit field by using
INC_ISIPV6 with bitops to check for the 'isipv6' condition.
While here fix a place or two where in case of v4 inc_flags
were not properly initialized before.[1]
Found by: rwatson during review [1]
Discussed with: rwatson
Reviewed by: rwatson
MFC after: 4 weeks
1. separating L2 tables (ARP, NDP) from the L3 routing tables
2. removing as much locking dependencies among these layers as
possible to allow for some parallelism in the search operations
3. simplify the logic in the routing code,
The most notable end result is the obsolescent of the route
cloning (RTF_CLONING) concept, which translated into code reduction
in both IPv4 ARP and IPv6 NDP related modules, and size reduction in
struct rtentry{}. The change in design obsoletes the semantics of
RTF_CLONING, RTF_WASCLONE and RTF_LLINFO routing flags. The userland
applications such as "arp" and "ndp" have been modified to reflect
those changes. The output from "netstat -r" shows only the routing
entries.
Quite a few developers have contributed to this project in the
past: Glebius Smirnoff, Luigi Rizzo, Alessandro Cerri, and
Andre Oppermann. And most recently:
- Kip Macy revised the locking code completely, thus completing
the last piece of the puzzle, Kip has also been conducting
active functional testing
- Sam Leffler has helped me improving/refactoring the code, and
provided valuable reviews
- Julian Elischer setup the perforce tree for me and has helped
me maintaining that branch before the svn conversion
packet loss, of between 10-30%. The fix is to put the PHY into
and take it out of local loopback mode when resetting the interface.
Obtained from: Chelsio Inc.
MFC after: 3 days
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.
For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.
Reviewed by: brooks, gnn, des, zec, imp
Sponsored by: The FreeBSD Foundation
when it sees only received packets. In some cases where a device only
recieves data it mistakenly thinks that its transmitting side is broken
and resets the device.
Obtained from: Chelsio Inc.
MFC after: 3 days
- break complex conditionals in to multiple lines to avoid wrapping
- remove copious unused debug statements
- be more aggressive about cleaning in the calling thread
- eliminate usage of ENOSPC
- increase number of iterations that cxgbsp can do
- eliminate "initerr" usage to simplify ENOBUFS handling
- when coalescing pass all packets to BPF
- always set overrun if hardware queue is full
- invert sense of hw.cxgb.singleq tunable to hw.cxgb.multiq
- don't wake up transmitting thread by default
- add per tx queue ifaltq to handle ALTQ
- remove several unused functions in cxgb_multiq.c
- add several sysctls: multiq_tx_enable, coalesce_tx_enable,
and wakeup_tx_thread
- this obsoletes the hw.cxgb.snd_queue_len as ifq is replaced
by a buf_ring
and ifnet functions
- add memory barriers to <machine/atomic.h>
- update drivers to only conditionally define their own
- add lockless producer / consumer ring buffer
- remove ring buffer implementation from cxgb and update its callers
- add if_transmit(struct ifnet *ifp, struct mbuf *m) to ifnet to
allow drivers to efficiently manage multiple hardware queues
(i.e. not serialize all packets through one ifq)
- expose if_qflush to allow drivers to flush any driver managed queues
This work was supported by Bitgravity Inc. and Chelsio Inc.
1) Fix a bug in dealing with the Alerus 1006 PHY which prevented the
device from ever coming back up once it had been set to down.
2) Add a kernel tunable (hw.cxgb.snd_queue_len) which makes it possible
to give the device more than IFQ_MAXLEN entries in its send queue. The
default remains 50.
3) Add code to place the card'd identification and serial number into
its description (%desc) so that users can tell which card they have
installed.
for virtualization.
Instead of initializing the affected global variables at instatiation,
assign initial values to them in initializer functions. As a rule,
initialization at instatiation for such variables should never be
introduced again from now on. Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.
Essentialy, this change should have zero functional impact. In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to intialize multiple instances of such container structures.
Discussed at: devsummit Strassburg
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit
Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.
Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().
Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).
All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).
(*) netipsec/keysock.c did not validate depending on compile time options.
Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
- simplify page hold logic
- allow pages for processes other than that of curthread to
have pages held
- normalize the interface to more closely resemble the functions in
sys/vm
MFC after: 1 week
but an RW mapping exists for the underlying page. This change fixes the bug by using the
page / NULL returned from pmap_extract_and_hold to determine whether or not vm_fault needs
to be called.
The bug was pointed out by alc.
MFC after: 3 days
years by the priv_check(9) interface and just very few places are left.
Note that compatibility stub with older FreeBSD version
(all above the 8 limit though) are left in order to reduce diffs against
old versions. It is responsibility of the maintainers for any module, if
they think it is the case, to axe out such cases.
This patch breaks KPI so __FreeBSD_version will be bumped into a later
commit.
This patch needs to be credited 50-50 with rwatson@ as he found time to
explain me how the priv_check() works in detail and to review patches.
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
Reviewed by: rwatson
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
- add / remove clients from cxgb_main.c now
- change ifdef TOE_ENABLED to TCP_OFFLOAD_DISABLE
- update copyrights
- fix transmit data mismatch bug caused by not setting SB_NOCOALESCE
on tx sockbuf on passive connections
- fix receive sequence mismatch bug caused by not setting SB_NOCOALESCE
on rx sockbuf on passive connections
- don't sleep without checking SBS_CANTRCVMORE first
- various ddp ordering fixes
Supported by: Chelsio Inc.
move most offload functionality from NIC to TOE
factor out all socket and inpcb direct access
factor out access to locking in incpb, pcbinfo, and sockbuf
explicitly select write locking for all use of the inpcb mutex.
Update some pcbinfo lock assertions to assert locked rather than
write-locked, although in practice almost all uses of the pcbinfo
rwlock main exclusive, and all instances of inpcb lock acquisition
are exclusive.
This change should introduce (ideally) little functional change.
However, it lays the groundwork for significantly increased
parallelism in the TCP/IP code.
MFC after: 3 months
Tested by: kris (superset of committered patch)
- add support for T3C
- add DDP support (zero-copy receive)
- fix TOE transmit of large requests
- fix shutdown so that sockets don't remain in CLOSING state indefinitely
- register listeners when an interface is brought up after tom is loaded
- fix setting of multicast filter
- enable link at device attach
- exit tick handler if shutdown is in progress
- add helper for logging TCB
- add sysctls for dumping transmit queues
- note that TOE wxill not be MFC'd until after 7.0 has been finalized
MFC after: 3 days
free function controlable, instead of passing the KVA of the buffer
storage as the first argument.
Fix all conventional users of the API to pass the KVA of the buffer
as the first argument, to make this a no-op commit.
Likely break the only non-convetional user of the API, after informing
the relevant committer.
Update the mbuf(9) manual page, which was already out of sync on
this point.
Bump __FreeBSD_version to 800016 as there is no way to tell how
many arguments a CPP macro needs any other way.
This paves the way for giving sendfile(9) a way to wait for the
passed storage to have been accessed before returning.
This does not affect the memory layout or size of mbufs.
Parental oversight by: sam and rwatson.
No MFC is anticipated.
- Track packet zone mbufs separately from other mbufs
- free packet zone buffers via m_free rather than trying to manage the refcount
as with clusters - its refcount and management seems to be "special"
- increase asserts for mbuf accounting
- track outstanding mbufs (maps very closely to leaked)
- actually only create one thread per port if !multiq
Oddly enough this fixes the use after free
- move txq_segs to stack in t3_encap
- add checks that pidx doesn't move pass cidx
- simplify mbuf free logic in collapse mbufs routine
- move cxgb_tx_common in to cxgb_multiq.c and rename to cxgb_tx
- move cxgb_tx_common dependencies
- further simplify cxgb_dequeue_packet for the non-multiqueue case
- only launch one service thread per port in the non-multiq case
- remove dead cleaning code from cxgb_sge.c
- simplify PIO case substantially in by returning directly from mbuf collapse
and just using m_copydata
- remove gratuitous m_gethdr in the rx path
- clarify freeing of mbufs in collapse
rspq lock. Not doing so was causing us to skip re-enabling the interrupt.
- remove duplicate credits sysctl
- add support for dumping hardware context of the txq
- decrement budget_left when we break out of the process_responses loop
- return the error from cxgb_tx_common so that when an error is hit we dont
spin forever in the taskq thread
- remove unused rxsd_ref
- simplify header_offset calculation for embedded mbuf headers
- fix memory leak by making sure that mbuf header initialization took place
- disable printf's for stalled queue, don't do offload/ctrl queue restart
when tunnel queue is restarted
- add more diagnostic information about the txq state
- add facility to dump the actual contents of the hardware queue using sysctl
of two compares against 0. The negative effect of cache flushing
is probably more than the gain by not doing the two compares (the
value is almost certainly in register or at worst, cache).
Note that the uses of m_freem() are in error cases and m_freem()
handles NULL anyhow. So fast-path really isn't changed much at all.
and t3_push_frames).
- Import latest changes to cxgb_main.c and cxgb_sge.c from toestack p4 branch
- make driver local copy of tcp_subr.c and tcp_usrreq.c and override tcp_usrreqs so
TOE can also functions on versions with unmodified TCP
- add cxgb back to the build
- fix the use after free seen when sending packets small enough to fit as an immediate
and bpf peers are present
- update to firmware rev 4.7 along with various small vendor fixes
Supported by: Chelsio
Approved by: re (blanket)
MFC after: 3 days
- remove cpl->iff panic - we can't know the port number from the rspq on the 4-port
- pick the ifnet based on the interface in the CPL header
- switch to using qset 0 for egress on the 4-port for now - may change
when we start using RSS
- move ether_ifdetach to before the port lock gets deinitialized to avoid
hang in the case where there are BPF peers (cxgb_ioctl is called indirectly
when BPF peers are present)
- don't call t3_mac_reset if multiport is set, this was causing tx errors
by misconfiguring the MAC on the 4-port
- change V_TXPKT_INTF to use txpkt_intf as the interfaces are not contiguous
- free the mbuf immediately in the case where the payload is small enough to be copied
into the rspq
- only update the coalesce timer if for a queue if packets were taken off of it
- add in missed 20ms DELAY in initializaton vsc8211
- prompt MFC as this only applies to the 4-port which is currently completely
broken - OK'd by kensmith
Supported by: Chelsio
Approved by: re (blanket)
MFC after: 0 days
- reduce cpu usage by as much as 25% (40% -> 30) by doing txq reclaim more efficiently
- use mtx_trylock when trying to grab the lock to avoid spinning during long encap loop
- add per-txq reclaim task
- if mbufs were successfully re-claimed try another pass
- track txq overruns with sysctl
Approved by: re (blanket)
can be allocated atomically
- add debug macros for printing lock initialization / teardown
- add buffers to port_info and adapter to allow each lock to have a
unique name
- destroy mutexes initialized by cxgb_offload_init
- remove recursive calls to ADAPTER_LOCK
- move callout_drain calls so that they don't occur with the lock held
- ensure that only as many qsets as are needed are initialized and
destroyed
MFC after: 3 days
Sponsored by: Chelsio Inc.
- update to firmware version 4.1.0
- switch over to standard method for initializing cdevs (contributed by scottl@)
- break out timer_reclaim_task to be per-port
- move msix teardown into separate function
- fix bus_setup_intr for msi-x for the multi-port case so that msi-x resources
are not corrupted on unload
- handle 10/100/1000 base-T media and auto negotiation
- bind qset to cpu even for singleq case
- white space cleanups
- remove recursive PORT_LOCK
- move mtu setting to separate function
- stop and re-init port when changing mtu
- replace all direct references to m_data with calls to mtod
- handle attach failure better by not trying to de-initialize
taskqueues when they have not been allocated
- no longer default to jumbo frames
Sponsored by: Chelsio
MFC after: 3 days
- Double the number of descriptors that a single call to send can use
- Quadruple the number of descriptors that can be reclaimed per pass
- only run reclaim twice per second
- increase coalesce timer from 3.5us to 5us
fix printf warning on 64-bit platforms
- upgrade to reflect state of 1.0.0.86
- move from firmware rev 3.2 to 4.0.0
- import driver bits for offload functionality
- remove binary distribution clause from top level files as it
runs counter to the intent of purely supporting the hardware
MFC after: 3 days