freebsd-dev

Author	SHA1	Message	Date
Navdeep Parhar	5aaa3bc3b9	cxgbe/t4_tom: toepcb should be all-zero on allocation because the code that cleans up on failure assumes that non-NULL values indicate initialized items. Sponsored by: Chelsio Communications	2016-09-05 19:37:47 +00:00
Navdeep Parhar	a9feb2cdbb	cxgbe/t4_tom: Two new routines to allocate and write page pods for a buffer in the kernel's address space.	2016-09-01 00:51:59 +00:00
Navdeep Parhar	968267fdb8	cxgbe/t4_tom: Add general purpose routines to deal with page pod regions and allocations within them. Switch to these routines to manage the TOE DDP region. Sponsored by: Chelsio Communications	2016-08-31 23:23:46 +00:00
Navdeep Parhar	9217931fb4	cxgbe/t4_tom: The page pod arena allocates from pod address space and not index space. The minimum valid allocation out of this arena is the size of a single page pod. Sponsored by: Chelsio Communications	2016-08-04 17:29:42 +00:00
Navdeep Parhar	515b36c5b5	cxgbe/t4_tom: Read the chip's DDP page sizes and save them in a per-adapter data structure. This replaces a global array with hardcoded page sizes. Sponsored by: Chelsio Communications	2016-08-02 23:54:21 +00:00
John Baldwin	07159830be	Add support for zero-copy aio_write() on TOE sockets. AIO write requests for a TOE socket on a Chelsio T4+ adapter can now DMA directly from the user-supplied buffer. This is implemented by wiring the pages backing the user-supplied buffer and queueing special mbufs backed by raw VM pages to the socket buffer. The TOE code recognizes these special mbufs and builds a sglist from the VM page array associated with the mbuf when queueing a work request to the TOE. Because these mbufs do not have an associated virtual address, m_data is not valid. Thus, the AIO handler does not invoke sosend() directly for these mbufs but instead inlines portions of sosend_generic() and tcp_usr_send(). An aiotx_buffer structure is used to describe the user buffer (e.g. it holds the array of VM pages and a reference to the AIO job). The special mbufs reference this structure via m_ext. Note that a single job might be split across multiple mbufs (e.g. if it is larger than the socket buffer size). The 'ext_arg2' member of each mbuf gives an offset relative to the backing aiotx_buffer. The AIO job associated with an aiotx_buffer structure is completed when the last reference to the structure is released. Zero-copy aio_write()'s for connections associated with a given adapter can be enabled/disabled at runtime via the 'dev.t[45]nex.N.toe.tx_zcopy' sysctl. MFC after: 1 month Relnotes: yes Sponsored by: Chelsio Communications	2016-07-27 18:29:35 +00:00
Enji Cooper	092af585e1	Remove redundant declaration for tcp_dooptions, similar to r302576 netinet/tcp_var.h already defines this function Differential Revision: https://reviews.freebsd.org/D7189 MFC after: 1 week PR: 209920 Reported by: Mark Millard <markmi@dsl-only.net> Reviewed by: np Tested with: clang 3.8.0, gcc 4.2.1, gcc 5.3.0 Sponsored by: EMC / Isilon Storage Division	2016-07-11 17:11:18 +00:00
Navdeep Parhar	671bf2b8b2	cxgbe(4): Changes to the CPL-handler registration mechanism and code related to "shared" CPLs. a) Combine t4_set_tcb_field and t4_set_tcb_field_rpl into a single function. Allow callers to direct the response to any iq. Tidy up set_ulp_mode_iscsi while there to use names from t4_tcb.h instead of magic constants. b) Remove all CPL handler tables from struct adapter. This reduces its size by around 2KB. All handlers are now registered at MOD_LOAD instead of attach or some kind of initialization/activation. The registration functions do not need an adapter parameter any more. c) Add per-iq handlers to deal with CPLs whose destination cannot be determined solely from the opcode. There are 2 such CPLs in use right now: SET_TCB_RPL and L2T_WRITE_RPL. The base driver continues to send filter and L2T_WRITEs over the mgmtq and solicits the reply on fwq. t4_tom (including the DDP code) now uses the port's ctrlq to send L2T_WRITEs and SET_TCB_FIELDs and solicits the reply on an ofld_rxq. fwq and ofld_rxq have different handlers that know what kind of tid to expect in the reply. Update t4_write_l2e and callers to support any wrq/iq combination. Approved by: re@ (kib@) Sponsored by: Chelsio Communications	2016-07-05 01:29:24 +00:00
Navdeep Parhar	5e03372b18	cxgbe(4): Do not bring up an interface when IFCAP_TOE is enabled on it. The interface's queues are functional after VI_INIT_DONE (which is short of interface-up) and that's all that's needed for t4_tom to communicate with the chip. Approved by: re@ (gjb@) Sponsored by: Chelsio Communications	2016-06-29 06:55:30 +00:00
John Baldwin	b1012d8036	Account for AIO socket operations in thread/process resource usage. File and disk-backed I/O requests store counts of read/written disk blocks in each AIO job so that they can be charged to the thread that completes an AIO request via aio_return() or aio_waitcomplete(). This change extends AIO jobs to store counts of received/sent messages and updates socket backends to set these counts accordingly. Note that the socket backends are careful to only charge a single message for each AIO request even though a single request on a blocking socket might invoke sosend or soreceive multiple times. This is to mimic the resource accounting of synchronous read/write. Adjust the UNIX socketpair AIO test to verify that the message resource usage counts update accordingly for aio_read and aio_write. Approved by: re (hrs) Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D6911	2016-06-21 22:19:06 +00:00
John Baldwin	ae0b1ccbab	Use sbused() instead of sbspace() to avoid signed issues. Inserting a full mbuf with an external cluster into the socket buffer resulted in sbspace() returning -MLEN. However, since sb_hiwat is unsigned, the -MLEN value was converted to unsigned in comparisons. As a result, the socket buffer was never autosized. Note that sb_lowat is signed to permit direct comparisons with sbspace(), but sb_hiwat is unsigned. Follow suit with what tcp_output() does and compare the value of sbused() with sb_hiwat instead. Approved by: re (gjb) Sponsored by: Chelsio Communications	2016-06-15 21:08:51 +00:00
John Baldwin	fe0bdd1d2c	Move backend-specific fields of kaiocb into a union. This reduces the size of kaiocb slightly. I've also added some generic fields that other backends can use in place of the BIO-specific fields. Change the socket and Chelsio DDP backends to use 'backend3' instead of abusing _aiocb_private.status directly. This confines the use of _aiocb_private to the AIO internals in vfs_aio.c. Reviewed by: kib (earlier version) Approved by: re (gjb) Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D6547	2016-06-15 20:56:45 +00:00
Navdeep Parhar	c4765d2743	cxgbe/t4_tom: Fix inverted assertion in r300895. It is RDMA connections and not others that are allowed to fail the receive window check. Approved by: re (gjb@)	2016-06-14 21:09:00 +00:00
Navdeep Parhar	addd6a52c4	cxgbe/t4_tom: Exempt RDMA connections from a TCP sanity test for now, to avoid panicking debug kernels. t4_tom does not keep track of a connection once it switches to ULP mode iWARP. If the connection falls out of ULP mode the driver/hardware seq# etc. are out of sync. A better fix would be to figure out what the current seq# are, update the driver's state, and perform all sanity checks as usual.	2016-05-28 00:38:17 +00:00
John Baldwin	1081d2766c	Move the KTR for the update of ddp_active_id on each completion under VERBOSE_TRACES. Sponsored by: Chelsio Communications	2016-05-20 23:08:22 +00:00
John Baldwin	dc9643853d	Use DDP to implement zerocopy TCP receive with aio_read(). Chelsio's TCP offload engine supports direct DMA of received TCP payload into wired user buffers. This feature is known as Direct-Data Placement. However, to scale well the adapter needs to prepare buffers for DDP before data arrives. aio_read() is more amenable to this requirement than read() as applications often call read() only after data is available in the socket buffer. When DDP is enabled, TOE sockets use the recently added pru_aio_queue protocol hook to claim aio_read(2) requests instead of letting them use the default AIO socket logic. The DDP feature supports scheduling DMA to two buffers at a time so that the second buffer is ready for use after the first buffer is filled. The aio/DDP code optimizes the case of an application ping-ponging between two buffers (similar to the zero-copy bpf(4) code) by keeping the two most recently used AIO buffers wired. If a buffer is reused, the aio/DDP code is able to reuse the vm_page_t array as well as page pod mappings (a kind of MMU mapping the Chelsio NIC uses to describe user buffers). The generation of the vmspace of the calling process is used in conjunction with the user buffer's address and length to determine if a user buffer matches a previously used buffer. If an application queues a buffer for AIO that does not match a previously used buffer then the least recently used buffer is unwired before the new buffer is wired. This ensures that no more than two user buffers per socket are ever wired. Note that this feature is best suited to applications sending a steady stream of data vs short bursts of traffic. Discussed with: np Relnotes: yes Sponsored by: Chelsio Communications	2016-05-07 00:33:35 +00:00
John Baldwin	826c2372c5	Set the correct vnet in TOE event handlers. Differential Revision: https://reviews.freebsd.org/D6152	2016-05-06 23:49:10 +00:00
Pedro F. Giffuni	b66bb393f2	Cleanup redundant parenthesis from existing howmany()/roundup() macro uses.	2016-04-22 16:57:42 +00:00
John Baldwin	113f2316c6	Add a 'show t4 tcb <nexus> <tid>' command to dump a TCB from DDB. This allows the contents of a TCB to be extracted from a T4/T5 card in DDB after a panic.	2016-04-10 05:06:58 +00:00
Navdeep Parhar	40bf7442fa	cxgbe: catch up with the latest hardware-related definitions. Obtained from: Chelsio Communications Sponsored by: Chelsio Communications	2016-02-19 00:29:16 +00:00
Gleb Smirnoff	f353ae1c62	More fixes to the build.	2016-01-27 05:15:53 +00:00
Gleb Smirnoff	57a78e3bae	Augment struct tcpstat with tcps_states[], which is used for book-keeping the amount of TCP connections by state. Provides a cheap way to get connection count without traversing the whole pcb list. Sponsored by: Netflix	2016-01-27 00:45:46 +00:00
Alexander V. Chernikov	8a9f7532b0	Convert cxgb/cxgbe to the new routing API. Discussed with: np	2016-01-07 08:07:17 +00:00
Gleb Smirnoff	0c39d38d21	Historically we have two fields in tcpcb to describe sender MSS: t_maxopd, and t_maxseg. This dualism emerged with T/TCP, but was not properly cleaned up after T/TCP removal. After all permutations over the years the result is that t_maxopd stores a minimum of peer offered MSS and MTU reduced by minimum protocol header. And t_maxseg stores (t_maxopd - TCPOLEN_TSTAMP_APPA) if timestamps are in action, or is equal to t_maxopd otherwise. That's a very rough estimate of MSS reduced by options length. Throughout the code it was used in places, where preciseness was not important, like cwnd or ssthresh calculations. With this change: - t_maxopd goes away. - t_maxseg now stores MSS not adjusted by options. - new function tcp_maxseg() is provided, that calculates MSS reduced by options length. The functions gives a better estimate, since it takes into account SACK state as well. Reviewed by: jtl Differential Revision: https://reviews.freebsd.org/D3593	2016-01-07 00:14:42 +00:00
Alexander V. Chernikov	4fb3a8208c	Implement interface link header precomputation API. Add if_requestencap() interface method which is capable of calculating various link headers for given interface. Right now there is support for INET/INET6/ARP llheader calculation (IFENCAP_LL type request). Other types are planned to support more complex calculation (L2 multipath lagg nexthops, tunnel encap nexthops, etc..). Reshape 'struct route' to be able to pass additional data (with its length) to prepend to mbuf. These two changes permit routing code to pass pre-calculated nexthop data (like L2 header for route w/gateway) down to the stack eliminating the need for other lookups. It also brings us closer to more complex scenarios like transparently handling MPLS nexthops and tunnel interfaces. Last, but not least, it removes layering violation introduced by flowtable code (ro_lle) and simplifies handling of existing if_output consumers. ARP/ND changes: Make arp/ndp stack pre-calculate link header upon installing/updating lle record. Interface link address changes are handled by re-calculating headers for all lles based on if_lladdr event. After these changes, arpresolve()/nd6_resolve() returns full pre-calculated header for supported interfaces thus simplifying if_output(). Move these lookups to separate ether_resolve_addr() function which either returns an error or a fully-prepared link header. Add <arp|nd6_>resolve_addr() compat versions to return link addresses instead of pre-calculated data. BPF changes: Raw bpf writes occupied _two_ cases: AF_UNSPEC and pseudo_AF_HDRCMPLT. Despite the naming, both of these have their header "complete". The only difference is that interface source mac has to be filled by OS for AF_UNSPEC (controlled via BIOCGHDRCMPLT). This logic has to stay inside BPF and not pollute if_output() routines. Convert BPF to pass prepend data via new 'struct route' mechanism. Note that it does not change non-optimized if_output(): ro_prepend handling is purely optional. Side note: hackish pseudo_AF_HDRCMPLT is supported for ethernet and FDDI. It is not needed for ethernet anymore. The only remaining FDDI user is dev/pdq mostly untouched since 2007. FDDI support was eliminated from OpenBSD in 2013 (sys/net/if_fddisubr.c rev 1.65). Flowtable changes: Flowtable violates layering by saving (and not correctly managing) rtes/lles. Instead of passing lle pointer, pass pointer to pre-calculated header data from that lle. Differential Revision: https://reviews.freebsd.org/D4102	2015-12-31 05:03:27 +00:00
Navdeep Parhar	9eb533d3b4	cxgbe(4): Updates to the base NIC driver and t4_tom to support the iSCSI offload driver. These changes come from projects/cxl_iscsi.	2015-12-26 00:26:02 +00:00
John Baldwin	fe2ebb7644	Add support for configuring additional virtual interfaces (VIs) on a port. Each virtual interface has its own MAC address, queues, and statistics. The dedicated netmap interfaces (ncxgbeX / ncxlX) were already implemented as additional VIs on each port. This change allows additional non-netmap interfaces to be configured on each port. Additional virtual interfaces use the naming scheme vcxgbeX or vcxlX. Additional VIs are enabled by setting the hw.cxgbe.num_vis tunable to a value greater than 1 before loading the cxgbe(4) or cxl(4) driver. NB: The first VI on each port is the "main" interface (cxgbeX or cxlX). T4/T5 NICs provide a limited number of MAC addresses for each physical port. As a result, a maximum of six VIs can be configured on each port (including the "main" interface and the netmap interface when netmap is enabled). One user-visible result is that when netmap is enabled, packets received or transmitted via the netmap interface are no longer counted in the stats for the "main" interface, but are not accounted to the netmap interface. The netmap interfaces now also have a new-bus device and export various information sysctl nodes via dev.n(cxgbe|cxl).X. The cxgbetool 'clearstats' command clears the stats for all VIs on the specified port along with the port's stats. There is currently no way to clear the stats of an individual VI. Reviewed by: np MFC after: 1 month Sponsored by: Chelsio	2015-12-03 00:02:01 +00:00
Navdeep Parhar	baa7d0bf9d	cxgbe/tom: decide whether to shove segments or not only if there is payload to transmit. MFC after: 1 week	2015-10-30 01:18:07 +00:00
John Baldwin	8fb15ddb00	Add a comment to clarify how to determine the amount of received DDP data. Reviewed by: np Differential Revision: https://reviews.freebsd.org/D3619	2015-09-10 21:41:11 +00:00
Julien Charbon	ff9b006d61	Decompose TCP INP_INFO lock to increase short-lived TCP connections scalability: - The existing TCP INP_INFO lock continues to protect the global inpcb list stability during full list traversal (e.g. tcp_pcblist()). - A new INP_LIST lock protects inpcb list actual modifications (inp allocation and free) and inpcb global counters. It allows to use TCP INP_INFO_RLOCK lock in critical paths (e.g. tcp_input()) and INP_INFO_WLOCK only in occasional operations that walk all connections. PR: 183659 Differential Revision: https://reviews.freebsd.org/D2599 Reviewed by: jhb, adrian Tested by: adrian, nitroboost-gmail.com Sponsored by: Verisign, Inc.	2015-08-03 12:13:54 +00:00
Andrey V. Elsukov	cc0a3c8ca4	Convert in_ifaddr_lock and in6_ifaddr_lock to rmlock. Both are used to protect access to IP addresses lists and they can be acquired for reading several times per packet. To reduce lock contention it is better to use rmlock here. Reviewed by: gnn (previous version) Obtained from: Yandex LLC Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D3149	2015-07-29 08:12:05 +00:00
Navdeep Parhar	c7dbd80213	cxgbe(4): Update T4 and T5 firmwares to 1.14.2.0. Obtained from: Chelsio Communications MFC after: 3 days	2015-07-14 08:02:05 +00:00
Gleb Smirnoff	28ebe80cab	Provide functions to determine presence of a given address configured on a given interface. Discussed with: np Sponsored by: Nginx, Inc.	2015-04-17 11:57:06 +00:00
Navdeep Parhar	7ef00d7884	cxgbe/tom: return rx credits promptly if the socket buffer's low water mark cannot be reached because the window advertised to the peer isn't wide enough. While here, tweak the normal credit return too. MFC after: 1 month	2015-03-31 01:22:20 +00:00
John Baldwin	b12c0a9eeb	Move special DDP handling for closing a connection into a new handle_ddp_close() function in t4_ddp.c as the logic is similar to handle_ddp_data(). This allows all knowledge of the special DDP mbufs to be private to t4_ddp.c as well.	2015-03-16 15:56:06 +00:00
John Baldwin	69a088631e	Resize receive socket buffers that support autosizing when receiving TCP data via direct data placement. Sponsored by: Chelsio MFC after: 1 week	2015-03-11 17:35:07 +00:00
Navdeep Parhar	b3d44a6800	cxgbe(4): tidy up some of the interaction between the Upper Layer Drivers (ULDs) and the base if_cxgbe driver. Track the per-adapter activation of ULDs in a new "active_ulds" field. This was done pretty arbitrarily before this change -- via TOM_INIT_DONE in adapter->flags for TOM, and the (1 << MAX_NPORTS) bit in adapter->offload_map for iWARP. iWARP and hw-accelerated iSCSI rely on the TOE (supported by the TOM ULD). The rules are: a) If the iWARP and/or iSCSI ULDs are available when TOE is enabled then iWARP and/or iSCSI are enabled too. b) When the iWARP and iSCSI modules are loaded they go looking for adapters with TOE enabled and enable themselves on that adapter. c) You cannot deactivate or unload the TOM module from underneath iWARP or iSCSI. Any such attempt will fail with EBUSY. MFC after: 2 weeks	2015-02-08 09:28:55 +00:00
John Baldwin	86f05ea6cf	Lock the socket buffer before jumping to the 'out' label if sblock() fails in t4_soreceive_ddp().	2015-01-26 16:32:41 +00:00
John Baldwin	de5a10ecbc	- Update a disabled KASSERT() to use sbused() instead of accessing the no-longer existent sb_cc sockbuf member. - Use sbavail() instead of sbused() in t4_soreceive_ddp() to match the usage in soreceive_stream() on which it is based. Discussed with: glebius (2)	2015-01-26 16:29:14 +00:00
Navdeep Parhar	db8bcd1b21	cxgbe/tom: allocate page pod addresses instead of ppod#. MFC after: 2 weeks	2015-01-07 06:20:33 +00:00
Navdeep Parhar	f8c479085f	cxgbe/tom: use vmem(9) as the DDP page pod allocator. MFC after: 1 month	2015-01-06 01:30:32 +00:00
Navdeep Parhar	79b93bf6a3	cxgbe/tom: do not engage the TOE's payload chopper for payload < 2 MSS or for 10Gbps ports. MFC after: 2 weeks	2015-01-03 00:09:21 +00:00
Navdeep Parhar	402873f32a	cxgbe/tom: fix the MSS calculation for IPv6 connections handled by the TOE. MFC after: 1 week	2015-01-02 21:13:24 +00:00
Navdeep Parhar	dd1be4d418	cxgbe/tom: log some more details in send_flowc_wr. MFC after: 1 week	2015-01-02 20:52:51 +00:00
John Baldwin	5ad25ceb41	Check for SS_NBIO in so->so_state instead of sb->sb_flags in soreceive_stream(). Differential Revision: https://reviews.freebsd.org/D1299 Reviewed by: bz, gnn MFC after: 1 week	2014-12-15 17:52:08 +00:00
Navdeep Parhar	a7570ee305	Move KTR_CXGBE from t4_tom.h to adapter.h so that the base if_cxgbe code can use it too. MFC after: 1 week	2014-12-12 21:54:59 +00:00
Gleb Smirnoff	651e4e6a30	Merge from projects/sendfile: extend protocols API to support sending not ready data: o Add new flag to pru_send() flags - PRUS_NOTREADY. o Add new protocol method pru_ready(). Sponsored by: Nginx, Inc. Sponsored by: Netflix	2014-11-30 13:24:21 +00:00
Gleb Smirnoff	0f9d0a73a4	Merge from projects/sendfile: o Introduce a notion of "not ready" mbufs in socket buffers. These mbufs are now being populated by some I/O in background and are referenced outside. This forces following implications: - An mbuf which is "not ready" can't be taken out of the buffer. - An mbuf that is behind a "not ready" in the queue neither. - If socket buffer is flushed, then "not ready" mbufs shouldn't be freed. o In struct sockbuf the sb_cc field is split into sb_ccc and sb_acc. The sb_ccc stands for "claimed character count", or "committed character count". And the sb_acc is "available character count". Consumers of socket buffer API shouldn't already access them directly, but use sbused() and sbavail() respectively. o Not ready mbufs are marked with M_NOTREADY, and ready but blocked ones with M_BLOCKED. o New field sb_fnrdy points to the first not ready mbuf, to avoid linear search. o New function sbready() is provided to activate certain amount of mbufs in a socket buffer. A special note on SCTP: SCTP has its own sockbufs. Unfortunately, FreeBSD stack doesn't yet allow protocol specific sockbufs. Thus, SCTP does some hacks to make itself compatible with FreeBSD: it manages sockbufs on its own, but keeps sb_cc updated to inform the stack of amount of data in them. The new notion of "not ready" data isn't supported by SCTP. Instead, only a mechanical substitute is done: s/sb_cc/sb_ccc/. A proper solution would be to take away struct sockbuf from struct socket and allow protocols to implement their own socket buffers, like SCTP already does. This was discussed with rrs@. Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-11-30 12:52:33 +00:00
Gleb Smirnoff	cfa6009e36	In preparation of merging projects/sendfile, transform bare access to sb_cc member of struct sockbuf to a couple of inline functions: sbavail() and sbused() Right now they are equal, but once notion of "not ready socket buffer data", will be checked in, they are going to be different. Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-11-12 09:57:15 +00:00
Navdeep Parhar	527e4e62ac	Always request a completion for every work request for iWARP. The initial MPA exchange must be tracked this way so that t4_tom's state for the tid is all clean at the time the tid transitions to RDMA mode. Once it does, t4_tom is out of the way and iw_cxgbe uses the qp endpoints directly. Sponsored by: Chelsio Communications	2014-10-28 18:10:57 +00:00

107 Commits