206 lines
5.8 KiB
C
Raw Normal View History

lib/idxd: add low level idxd library Module, etc., will follow. Notes: * IDXD is an Intel silicon feature available in future Intel CPUs. Initial development is being done on a simulator. Once HW is available and the code fully tested the experimental label will be lifted. Spec can be found here: https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification * The current implementation will only work with VFIO. * DSA has a number of engines that can be grouped based on application need such as type of memory being served or QoS. Engines are processing units and are assigned to groups. Work queues are on device structures that act as front-end groups for queueing descriptors. Full details on what is configurable & how will come in later doc patches. * There is a finite number of work queue slots that are divided amongst the number of desired work queues in some fashion (ie evenly). * SW (outside of the idxd lib) is required to manage flow control, to not over-run the work queues.This is provided in the accel plug-in module. The upper layers use public API to manage this. * Work queue submissions are done with a 64 byte atomic instruction * The design here creates a set of descriptor rings per channel that match the size of the work queues. Then, an spdk_bit_array is used to make sure we don't overrun a queue. If there are not slots available, the operation is put on a linked list to be retried later from the poller. * As we need to support any number of channels (we can't limit ourselves to the number of work queues) we need to dynamically size/resize our per channel descriptor rings based on the number of current channels. This is done from upper layers via public API into the lib. * As channels are created, the total number of work queue slots is divided across the channels evenly. Same thing when they are destroyed, remaining channels with see the ring sizes increase. This is done from upper layers via public API into the lib. * The sim has 64 total work queue entries (WQE) that get dolled out to the work queues (WQ) evenly. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: I899bbeda3cef3db05bea4197b8757e89dddb579d Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/1809 Community-CI: Mellanox Build Bot Reviewed-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com> Reviewed-by: Shuhei Matsumoto <shuhei.matsumoto.xt@hitachi.com> Reviewed-by: Vitaliy Mysak <vitaliy.mysak@intel.com> Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
2020-04-10 15:29:01 +00:00
/*-
* BSD LICENSE
*
* Copyright (c) Intel Corporation.
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
*
* * Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* * Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in
* the documentation and/or other materials provided with the
* distribution.
* * Neither the name of Intel Corporation nor the names of its
* contributors may be used to endorse or promote products derived
* from this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
* "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
* LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
* A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
* OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
* LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
* DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
* THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#ifndef __IDXD_H__
#define __IDXD_H__
#include "spdk/stdinc.h"
#include "spdk/idxd.h"
#include "spdk/queue.h"
#include "spdk/mmio.h"
#include "spdk/bit_array.h"
#include "idxd_spec.h"
#ifdef __cplusplus
extern "C" {
#endif
/* TODO: get the gcc intrinsic to work. */
#define nop() asm volatile ("nop")
static inline void movdir64b(void *dst, const void *src)
{
asm volatile(".byte 0x66, 0x0f, 0x38, 0xf8, 0x02"
: "=m"(*(char *)dst)
: "d"(src), "a"(dst));
}
#define IDXD_REGISTER_TIMEOUT_US 50
idxd: add batch capability to accel framework and IDXD back-end This patch only includes the basic framework for batching and the ability to batch one type of command, copy. Follow-on patches will add the ability to batch other commands and include an example of how to do so via the accel perf tool. SW engine support for batching will also come in a future patch. Documentation will also be coming. Batching allows the application to submit a list of independent descriptors to DSA with one single "batch" descriptor. This is beneficial when the application is in a position to have several operations ready at once; batching saves the overhead of submitting each one separately. The way batching works in SPDK is as follows: 1) The app gets a handle to a new batch with spdk_accel_batch_create() 2) The app uses that handle to prepare a command to be included in the batch. For copy the command is spdk_accel_batch_prep_copy(). The app many continue to prep commands for the batch up to the max via calling spdk_accel_batch_get_max() 3) The app then submits the batch with spdk_accel_batch_submit() 4) The callback provided for each command in the batch will be called as they complete, the callback provided to the batch submit itself will be called then the entire batch is done. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: I4102e9291fe59a245cedde6888f42a923b6dbafd Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/2248 Community-CI: Mellanox Build Bot Tested-by: SPDK CI Jenkins <sys_sgci@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com> Reviewed-by: Jim Harris <james.r.harris@intel.com>
2020-05-07 14:45:15 -04:00
#define IDXD_DRAIN_TIMEOUT_US 500000
lib/idxd: add low level idxd library Module, etc., will follow. Notes: * IDXD is an Intel silicon feature available in future Intel CPUs. Initial development is being done on a simulator. Once HW is available and the code fully tested the experimental label will be lifted. Spec can be found here: https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification * The current implementation will only work with VFIO. * DSA has a number of engines that can be grouped based on application need such as type of memory being served or QoS. Engines are processing units and are assigned to groups. Work queues are on device structures that act as front-end groups for queueing descriptors. Full details on what is configurable & how will come in later doc patches. * There is a finite number of work queue slots that are divided amongst the number of desired work queues in some fashion (ie evenly). * SW (outside of the idxd lib) is required to manage flow control, to not over-run the work queues.This is provided in the accel plug-in module. The upper layers use public API to manage this. * Work queue submissions are done with a 64 byte atomic instruction * The design here creates a set of descriptor rings per channel that match the size of the work queues. Then, an spdk_bit_array is used to make sure we don't overrun a queue. If there are not slots available, the operation is put on a linked list to be retried later from the poller. * As we need to support any number of channels (we can't limit ourselves to the number of work queues) we need to dynamically size/resize our per channel descriptor rings based on the number of current channels. This is done from upper layers via public API into the lib. * As channels are created, the total number of work queue slots is divided across the channels evenly. Same thing when they are destroyed, remaining channels with see the ring sizes increase. This is done from upper layers via public API into the lib. * The sim has 64 total work queue entries (WQE) that get dolled out to the work queues (WQ) evenly. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: I899bbeda3cef3db05bea4197b8757e89dddb579d Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/1809 Community-CI: Mellanox Build Bot Reviewed-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com> Reviewed-by: Shuhei Matsumoto <shuhei.matsumoto.xt@hitachi.com> Reviewed-by: Vitaliy Mysak <vitaliy.mysak@intel.com> Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
2020-04-10 15:29:01 +00:00
#define WQ_MODE_DEDICATED 1
lib/idxd: refactor batching for increased performance And to eliminate an artificial constraint on # of user descriptors. The main idea here was to move from a single ring that covered all user descriptors to a pre-allocated ring per pre-allocated batch. In addition, the other major change here is in how we poll for completions. We used to poll the batch rings then the main ring. Now when commands are prepared their completion address is added to a per channel list and the poller simply runs through that list not caring which ring the completion address belongs too. This simplifies the completion logic considerably and will avoid polling locations that can't potentially have a completion. Some minor rework was included as well, mainly getting rid of the ring_ctrl struct as it didn't serve much of a purpose anyway and with how things are setup now its easier to read with all the elements in the channel struct. Also, a change that came in while this was WIP needed a few fixes to function correctly. Addressed those and moved them to a helper function so we have one point of control for xlations. Added support for NOP in cases where a batch is submitted with only 1 descriptor. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: Ie201b28118823100e908e0d1b08e7c10bb8fa9e7 Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/3654 Tested-by: SPDK CI Jenkins <sys_sgci@intel.com> Community-CI: Mellanox Build Bot Reviewed-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com>
2020-08-03 11:54:55 -04:00
/* TODO: consider setting the max per batch limit via RPC. */
/* The following sets up a max desc count per batch of 16 */
#define LOG2_WQ_MAX_BATCH 4 /* 2^4 = 16 */
#define DESC_PER_BATCH (1 << LOG2_WQ_MAX_BATCH)
/* We decide how many batches we want to support based on what max queue
* depth makes sense resource wise. There is a small price to pay with
* larger numbers wrt polling for completions.
*/
#define NUM_BATCHES_PER_CHANNEL 0x400
#define MIN_USER_DESC_COUNT 2
lib/idxd: add low level idxd library Module, etc., will follow. Notes: * IDXD is an Intel silicon feature available in future Intel CPUs. Initial development is being done on a simulator. Once HW is available and the code fully tested the experimental label will be lifted. Spec can be found here: https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification * The current implementation will only work with VFIO. * DSA has a number of engines that can be grouped based on application need such as type of memory being served or QoS. Engines are processing units and are assigned to groups. Work queues are on device structures that act as front-end groups for queueing descriptors. Full details on what is configurable & how will come in later doc patches. * There is a finite number of work queue slots that are divided amongst the number of desired work queues in some fashion (ie evenly). * SW (outside of the idxd lib) is required to manage flow control, to not over-run the work queues.This is provided in the accel plug-in module. The upper layers use public API to manage this. * Work queue submissions are done with a 64 byte atomic instruction * The design here creates a set of descriptor rings per channel that match the size of the work queues. Then, an spdk_bit_array is used to make sure we don't overrun a queue. If there are not slots available, the operation is put on a linked list to be retried later from the poller. * As we need to support any number of channels (we can't limit ourselves to the number of work queues) we need to dynamically size/resize our per channel descriptor rings based on the number of current channels. This is done from upper layers via public API into the lib. * As channels are created, the total number of work queue slots is divided across the channels evenly. Same thing when they are destroyed, remaining channels with see the ring sizes increase. This is done from upper layers via public API into the lib. * The sim has 64 total work queue entries (WQE) that get dolled out to the work queues (WQ) evenly. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: I899bbeda3cef3db05bea4197b8757e89dddb579d Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/1809 Community-CI: Mellanox Build Bot Reviewed-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com> Reviewed-by: Shuhei Matsumoto <shuhei.matsumoto.xt@hitachi.com> Reviewed-by: Vitaliy Mysak <vitaliy.mysak@intel.com> Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
2020-04-10 15:29:01 +00:00
#define LOG2_WQ_MAX_XFER 30 /* 2^30 = 1073741824 */
#define WQCFG_NUM_DWORDS 8
#define WQ_PRIORITY_1 1
#define IDXD_MAX_QUEUES 64
lib/idxd: refactor batching for increased performance And to eliminate an artificial constraint on # of user descriptors. The main idea here was to move from a single ring that covered all user descriptors to a pre-allocated ring per pre-allocated batch. In addition, the other major change here is in how we poll for completions. We used to poll the batch rings then the main ring. Now when commands are prepared their completion address is added to a per channel list and the poller simply runs through that list not caring which ring the completion address belongs too. This simplifies the completion logic considerably and will avoid polling locations that can't potentially have a completion. Some minor rework was included as well, mainly getting rid of the ring_ctrl struct as it didn't serve much of a purpose anyway and with how things are setup now its easier to read with all the elements in the channel struct. Also, a change that came in while this was WIP needed a few fixes to function correctly. Addressed those and moved them to a helper function so we have one point of control for xlations. Added support for NOP in cases where a batch is submitted with only 1 descriptor. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: Ie201b28118823100e908e0d1b08e7c10bb8fa9e7 Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/3654 Tested-by: SPDK CI Jenkins <sys_sgci@intel.com> Community-CI: Mellanox Build Bot Reviewed-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com>
2020-08-03 11:54:55 -04:00
/* Each pre-allocated batch structure goes on a per channel list and
* contains the memory for both user descriptors.
lib/idxd: refactor batching for increased performance And to eliminate an artificial constraint on # of user descriptors. The main idea here was to move from a single ring that covered all user descriptors to a pre-allocated ring per pre-allocated batch. In addition, the other major change here is in how we poll for completions. We used to poll the batch rings then the main ring. Now when commands are prepared their completion address is added to a per channel list and the poller simply runs through that list not caring which ring the completion address belongs too. This simplifies the completion logic considerably and will avoid polling locations that can't potentially have a completion. Some minor rework was included as well, mainly getting rid of the ring_ctrl struct as it didn't serve much of a purpose anyway and with how things are setup now its easier to read with all the elements in the channel struct. Also, a change that came in while this was WIP needed a few fixes to function correctly. Addressed those and moved them to a helper function so we have one point of control for xlations. Added support for NOP in cases where a batch is submitted with only 1 descriptor. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: Ie201b28118823100e908e0d1b08e7c10bb8fa9e7 Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/3654 Tested-by: SPDK CI Jenkins <sys_sgci@intel.com> Community-CI: Mellanox Build Bot Reviewed-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com>
2020-08-03 11:54:55 -04:00
*/
idxd: add batch capability to accel framework and IDXD back-end This patch only includes the basic framework for batching and the ability to batch one type of command, copy. Follow-on patches will add the ability to batch other commands and include an example of how to do so via the accel perf tool. SW engine support for batching will also come in a future patch. Documentation will also be coming. Batching allows the application to submit a list of independent descriptors to DSA with one single "batch" descriptor. This is beneficial when the application is in a position to have several operations ready at once; batching saves the overhead of submitting each one separately. The way batching works in SPDK is as follows: 1) The app gets a handle to a new batch with spdk_accel_batch_create() 2) The app uses that handle to prepare a command to be included in the batch. For copy the command is spdk_accel_batch_prep_copy(). The app many continue to prep commands for the batch up to the max via calling spdk_accel_batch_get_max() 3) The app then submits the batch with spdk_accel_batch_submit() 4) The callback provided for each command in the batch will be called as they complete, the callback provided to the batch submit itself will be called then the entire batch is done. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: I4102e9291fe59a245cedde6888f42a923b6dbafd Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/2248 Community-CI: Mellanox Build Bot Tested-by: SPDK CI Jenkins <sys_sgci@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com> Reviewed-by: Jim Harris <james.r.harris@intel.com>
2020-05-07 14:45:15 -04:00
struct idxd_batch {
lib/idxd: refactor batching for increased performance And to eliminate an artificial constraint on # of user descriptors. The main idea here was to move from a single ring that covered all user descriptors to a pre-allocated ring per pre-allocated batch. In addition, the other major change here is in how we poll for completions. We used to poll the batch rings then the main ring. Now when commands are prepared their completion address is added to a per channel list and the poller simply runs through that list not caring which ring the completion address belongs too. This simplifies the completion logic considerably and will avoid polling locations that can't potentially have a completion. Some minor rework was included as well, mainly getting rid of the ring_ctrl struct as it didn't serve much of a purpose anyway and with how things are setup now its easier to read with all the elements in the channel struct. Also, a change that came in while this was WIP needed a few fixes to function correctly. Addressed those and moved them to a helper function so we have one point of control for xlations. Added support for NOP in cases where a batch is submitted with only 1 descriptor. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: Ie201b28118823100e908e0d1b08e7c10bb8fa9e7 Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/3654 Tested-by: SPDK CI Jenkins <sys_sgci@intel.com> Community-CI: Mellanox Build Bot Reviewed-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com>
2020-08-03 11:54:55 -04:00
struct idxd_hw_desc *user_desc;
struct idxd_comp *user_completions;
idxd: add batch capability to accel framework and IDXD back-end This patch only includes the basic framework for batching and the ability to batch one type of command, copy. Follow-on patches will add the ability to batch other commands and include an example of how to do so via the accel perf tool. SW engine support for batching will also come in a future patch. Documentation will also be coming. Batching allows the application to submit a list of independent descriptors to DSA with one single "batch" descriptor. This is beneficial when the application is in a position to have several operations ready at once; batching saves the overhead of submitting each one separately. The way batching works in SPDK is as follows: 1) The app gets a handle to a new batch with spdk_accel_batch_create() 2) The app uses that handle to prepare a command to be included in the batch. For copy the command is spdk_accel_batch_prep_copy(). The app many continue to prep commands for the batch up to the max via calling spdk_accel_batch_get_max() 3) The app then submits the batch with spdk_accel_batch_submit() 4) The callback provided for each command in the batch will be called as they complete, the callback provided to the batch submit itself will be called then the entire batch is done. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: I4102e9291fe59a245cedde6888f42a923b6dbafd Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/2248 Community-CI: Mellanox Build Bot Tested-by: SPDK CI Jenkins <sys_sgci@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com> Reviewed-by: Jim Harris <james.r.harris@intel.com>
2020-05-07 14:45:15 -04:00
uint32_t remaining;
lib/idxd: refactor batching for increased performance And to eliminate an artificial constraint on # of user descriptors. The main idea here was to move from a single ring that covered all user descriptors to a pre-allocated ring per pre-allocated batch. In addition, the other major change here is in how we poll for completions. We used to poll the batch rings then the main ring. Now when commands are prepared their completion address is added to a per channel list and the poller simply runs through that list not caring which ring the completion address belongs too. This simplifies the completion logic considerably and will avoid polling locations that can't potentially have a completion. Some minor rework was included as well, mainly getting rid of the ring_ctrl struct as it didn't serve much of a purpose anyway and with how things are setup now its easier to read with all the elements in the channel struct. Also, a change that came in while this was WIP needed a few fixes to function correctly. Addressed those and moved them to a helper function so we have one point of control for xlations. Added support for NOP in cases where a batch is submitted with only 1 descriptor. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: Ie201b28118823100e908e0d1b08e7c10bb8fa9e7 Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/3654 Tested-by: SPDK CI Jenkins <sys_sgci@intel.com> Community-CI: Mellanox Build Bot Reviewed-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com>
2020-08-03 11:54:55 -04:00
uint8_t index;
idxd: add batch capability to accel framework and IDXD back-end This patch only includes the basic framework for batching and the ability to batch one type of command, copy. Follow-on patches will add the ability to batch other commands and include an example of how to do so via the accel perf tool. SW engine support for batching will also come in a future patch. Documentation will also be coming. Batching allows the application to submit a list of independent descriptors to DSA with one single "batch" descriptor. This is beneficial when the application is in a position to have several operations ready at once; batching saves the overhead of submitting each one separately. The way batching works in SPDK is as follows: 1) The app gets a handle to a new batch with spdk_accel_batch_create() 2) The app uses that handle to prepare a command to be included in the batch. For copy the command is spdk_accel_batch_prep_copy(). The app many continue to prep commands for the batch up to the max via calling spdk_accel_batch_get_max() 3) The app then submits the batch with spdk_accel_batch_submit() 4) The callback provided for each command in the batch will be called as they complete, the callback provided to the batch submit itself will be called then the entire batch is done. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: I4102e9291fe59a245cedde6888f42a923b6dbafd Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/2248 Community-CI: Mellanox Build Bot Tested-by: SPDK CI Jenkins <sys_sgci@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com> Reviewed-by: Jim Harris <james.r.harris@intel.com>
2020-05-07 14:45:15 -04:00
TAILQ_ENTRY(idxd_batch) link;
};
lib/idxd: add low level idxd library Module, etc., will follow. Notes: * IDXD is an Intel silicon feature available in future Intel CPUs. Initial development is being done on a simulator. Once HW is available and the code fully tested the experimental label will be lifted. Spec can be found here: https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification * The current implementation will only work with VFIO. * DSA has a number of engines that can be grouped based on application need such as type of memory being served or QoS. Engines are processing units and are assigned to groups. Work queues are on device structures that act as front-end groups for queueing descriptors. Full details on what is configurable & how will come in later doc patches. * There is a finite number of work queue slots that are divided amongst the number of desired work queues in some fashion (ie evenly). * SW (outside of the idxd lib) is required to manage flow control, to not over-run the work queues.This is provided in the accel plug-in module. The upper layers use public API to manage this. * Work queue submissions are done with a 64 byte atomic instruction * The design here creates a set of descriptor rings per channel that match the size of the work queues. Then, an spdk_bit_array is used to make sure we don't overrun a queue. If there are not slots available, the operation is put on a linked list to be retried later from the poller. * As we need to support any number of channels (we can't limit ourselves to the number of work queues) we need to dynamically size/resize our per channel descriptor rings based on the number of current channels. This is done from upper layers via public API into the lib. * As channels are created, the total number of work queue slots is divided across the channels evenly. Same thing when they are destroyed, remaining channels with see the ring sizes increase. This is done from upper layers via public API into the lib. * The sim has 64 total work queue entries (WQE) that get dolled out to the work queues (WQ) evenly. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: I899bbeda3cef3db05bea4197b8757e89dddb579d Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/1809 Community-CI: Mellanox Build Bot Reviewed-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com> Reviewed-by: Shuhei Matsumoto <shuhei.matsumoto.xt@hitachi.com> Reviewed-by: Vitaliy Mysak <vitaliy.mysak@intel.com> Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
2020-04-10 15:29:01 +00:00
struct device_config {
uint8_t config_num;
uint8_t num_groups;
uint16_t total_wqs;
uint16_t total_engines;
};
lib/idxd: refactor batching for increased performance And to eliminate an artificial constraint on # of user descriptors. The main idea here was to move from a single ring that covered all user descriptors to a pre-allocated ring per pre-allocated batch. In addition, the other major change here is in how we poll for completions. We used to poll the batch rings then the main ring. Now when commands are prepared their completion address is added to a per channel list and the poller simply runs through that list not caring which ring the completion address belongs too. This simplifies the completion logic considerably and will avoid polling locations that can't potentially have a completion. Some minor rework was included as well, mainly getting rid of the ring_ctrl struct as it didn't serve much of a purpose anyway and with how things are setup now its easier to read with all the elements in the channel struct. Also, a change that came in while this was WIP needed a few fixes to function correctly. Addressed those and moved them to a helper function so we have one point of control for xlations. Added support for NOP in cases where a batch is submitted with only 1 descriptor. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: Ie201b28118823100e908e0d1b08e7c10bb8fa9e7 Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/3654 Tested-by: SPDK CI Jenkins <sys_sgci@intel.com> Community-CI: Mellanox Build Bot Reviewed-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com>
2020-08-03 11:54:55 -04:00
struct idxd_comp ;
lib/idxd: add low level idxd library Module, etc., will follow. Notes: * IDXD is an Intel silicon feature available in future Intel CPUs. Initial development is being done on a simulator. Once HW is available and the code fully tested the experimental label will be lifted. Spec can be found here: https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification * The current implementation will only work with VFIO. * DSA has a number of engines that can be grouped based on application need such as type of memory being served or QoS. Engines are processing units and are assigned to groups. Work queues are on device structures that act as front-end groups for queueing descriptors. Full details on what is configurable & how will come in later doc patches. * There is a finite number of work queue slots that are divided amongst the number of desired work queues in some fashion (ie evenly). * SW (outside of the idxd lib) is required to manage flow control, to not over-run the work queues.This is provided in the accel plug-in module. The upper layers use public API to manage this. * Work queue submissions are done with a 64 byte atomic instruction * The design here creates a set of descriptor rings per channel that match the size of the work queues. Then, an spdk_bit_array is used to make sure we don't overrun a queue. If there are not slots available, the operation is put on a linked list to be retried later from the poller. * As we need to support any number of channels (we can't limit ourselves to the number of work queues) we need to dynamically size/resize our per channel descriptor rings based on the number of current channels. This is done from upper layers via public API into the lib. * As channels are created, the total number of work queue slots is divided across the channels evenly. Same thing when they are destroyed, remaining channels with see the ring sizes increase. This is done from upper layers via public API into the lib. * The sim has 64 total work queue entries (WQE) that get dolled out to the work queues (WQ) evenly. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: I899bbeda3cef3db05bea4197b8757e89dddb579d Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/1809 Community-CI: Mellanox Build Bot Reviewed-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com> Reviewed-by: Shuhei Matsumoto <shuhei.matsumoto.xt@hitachi.com> Reviewed-by: Vitaliy Mysak <vitaliy.mysak@intel.com> Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
2020-04-10 15:29:01 +00:00
lib/idxd: refactor batching for increased performance And to eliminate an artificial constraint on # of user descriptors. The main idea here was to move from a single ring that covered all user descriptors to a pre-allocated ring per pre-allocated batch. In addition, the other major change here is in how we poll for completions. We used to poll the batch rings then the main ring. Now when commands are prepared their completion address is added to a per channel list and the poller simply runs through that list not caring which ring the completion address belongs too. This simplifies the completion logic considerably and will avoid polling locations that can't potentially have a completion. Some minor rework was included as well, mainly getting rid of the ring_ctrl struct as it didn't serve much of a purpose anyway and with how things are setup now its easier to read with all the elements in the channel struct. Also, a change that came in while this was WIP needed a few fixes to function correctly. Addressed those and moved them to a helper function so we have one point of control for xlations. Added support for NOP in cases where a batch is submitted with only 1 descriptor. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: Ie201b28118823100e908e0d1b08e7c10bb8fa9e7 Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/3654 Tested-by: SPDK CI Jenkins <sys_sgci@intel.com> Community-CI: Mellanox Build Bot Reviewed-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com>
2020-08-03 11:54:55 -04:00
struct spdk_idxd_io_channel {
struct spdk_idxd_device *idxd;
/* The portal is the address that we write descriptors to for submission. */
void *portal;
lib/idxd: add low level idxd library Module, etc., will follow. Notes: * IDXD is an Intel silicon feature available in future Intel CPUs. Initial development is being done on a simulator. Once HW is available and the code fully tested the experimental label will be lifted. Spec can be found here: https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification * The current implementation will only work with VFIO. * DSA has a number of engines that can be grouped based on application need such as type of memory being served or QoS. Engines are processing units and are assigned to groups. Work queues are on device structures that act as front-end groups for queueing descriptors. Full details on what is configurable & how will come in later doc patches. * There is a finite number of work queue slots that are divided amongst the number of desired work queues in some fashion (ie evenly). * SW (outside of the idxd lib) is required to manage flow control, to not over-run the work queues.This is provided in the accel plug-in module. The upper layers use public API to manage this. * Work queue submissions are done with a 64 byte atomic instruction * The design here creates a set of descriptor rings per channel that match the size of the work queues. Then, an spdk_bit_array is used to make sure we don't overrun a queue. If there are not slots available, the operation is put on a linked list to be retried later from the poller. * As we need to support any number of channels (we can't limit ourselves to the number of work queues) we need to dynamically size/resize our per channel descriptor rings based on the number of current channels. This is done from upper layers via public API into the lib. * As channels are created, the total number of work queue slots is divided across the channels evenly. Same thing when they are destroyed, remaining channels with see the ring sizes increase. This is done from upper layers via public API into the lib. * The sim has 64 total work queue entries (WQE) that get dolled out to the work queues (WQ) evenly. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: I899bbeda3cef3db05bea4197b8757e89dddb579d Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/1809 Community-CI: Mellanox Build Bot Reviewed-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com> Reviewed-by: Shuhei Matsumoto <shuhei.matsumoto.xt@hitachi.com> Reviewed-by: Vitaliy Mysak <vitaliy.mysak@intel.com> Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
2020-04-10 15:29:01 +00:00
uint16_t ring_size;
/*
lib/idxd: refactor batching for increased performance And to eliminate an artificial constraint on # of user descriptors. The main idea here was to move from a single ring that covered all user descriptors to a pre-allocated ring per pre-allocated batch. In addition, the other major change here is in how we poll for completions. We used to poll the batch rings then the main ring. Now when commands are prepared their completion address is added to a per channel list and the poller simply runs through that list not caring which ring the completion address belongs too. This simplifies the completion logic considerably and will avoid polling locations that can't potentially have a completion. Some minor rework was included as well, mainly getting rid of the ring_ctrl struct as it didn't serve much of a purpose anyway and with how things are setup now its easier to read with all the elements in the channel struct. Also, a change that came in while this was WIP needed a few fixes to function correctly. Addressed those and moved them to a helper function so we have one point of control for xlations. Added support for NOP in cases where a batch is submitted with only 1 descriptor. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: Ie201b28118823100e908e0d1b08e7c10bb8fa9e7 Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/3654 Tested-by: SPDK CI Jenkins <sys_sgci@intel.com> Community-CI: Mellanox Build Bot Reviewed-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com>
2020-08-03 11:54:55 -04:00
* Descriptors and completions share the same index. User descriptors
* (those included in a batch) are managed independently from data descriptors
* and are located in the batch structure.
lib/idxd: add low level idxd library Module, etc., will follow. Notes: * IDXD is an Intel silicon feature available in future Intel CPUs. Initial development is being done on a simulator. Once HW is available and the code fully tested the experimental label will be lifted. Spec can be found here: https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification * The current implementation will only work with VFIO. * DSA has a number of engines that can be grouped based on application need such as type of memory being served or QoS. Engines are processing units and are assigned to groups. Work queues are on device structures that act as front-end groups for queueing descriptors. Full details on what is configurable & how will come in later doc patches. * There is a finite number of work queue slots that are divided amongst the number of desired work queues in some fashion (ie evenly). * SW (outside of the idxd lib) is required to manage flow control, to not over-run the work queues.This is provided in the accel plug-in module. The upper layers use public API to manage this. * Work queue submissions are done with a 64 byte atomic instruction * The design here creates a set of descriptor rings per channel that match the size of the work queues. Then, an spdk_bit_array is used to make sure we don't overrun a queue. If there are not slots available, the operation is put on a linked list to be retried later from the poller. * As we need to support any number of channels (we can't limit ourselves to the number of work queues) we need to dynamically size/resize our per channel descriptor rings based on the number of current channels. This is done from upper layers via public API into the lib. * As channels are created, the total number of work queue slots is divided across the channels evenly. Same thing when they are destroyed, remaining channels with see the ring sizes increase. This is done from upper layers via public API into the lib. * The sim has 64 total work queue entries (WQE) that get dolled out to the work queues (WQ) evenly. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: I899bbeda3cef3db05bea4197b8757e89dddb579d Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/1809 Community-CI: Mellanox Build Bot Reviewed-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com> Reviewed-by: Shuhei Matsumoto <shuhei.matsumoto.xt@hitachi.com> Reviewed-by: Vitaliy Mysak <vitaliy.mysak@intel.com> Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
2020-04-10 15:29:01 +00:00
*/
idxd: add batch capability to accel framework and IDXD back-end This patch only includes the basic framework for batching and the ability to batch one type of command, copy. Follow-on patches will add the ability to batch other commands and include an example of how to do so via the accel perf tool. SW engine support for batching will also come in a future patch. Documentation will also be coming. Batching allows the application to submit a list of independent descriptors to DSA with one single "batch" descriptor. This is beneficial when the application is in a position to have several operations ready at once; batching saves the overhead of submitting each one separately. The way batching works in SPDK is as follows: 1) The app gets a handle to a new batch with spdk_accel_batch_create() 2) The app uses that handle to prepare a command to be included in the batch. For copy the command is spdk_accel_batch_prep_copy(). The app many continue to prep commands for the batch up to the max via calling spdk_accel_batch_get_max() 3) The app then submits the batch with spdk_accel_batch_submit() 4) The callback provided for each command in the batch will be called as they complete, the callback provided to the batch submit itself will be called then the entire batch is done. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: I4102e9291fe59a245cedde6888f42a923b6dbafd Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/2248 Community-CI: Mellanox Build Bot Tested-by: SPDK CI Jenkins <sys_sgci@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com> Reviewed-by: Jim Harris <james.r.harris@intel.com>
2020-05-07 14:45:15 -04:00
struct idxd_hw_desc *desc;
lib/idxd: add low level idxd library Module, etc., will follow. Notes: * IDXD is an Intel silicon feature available in future Intel CPUs. Initial development is being done on a simulator. Once HW is available and the code fully tested the experimental label will be lifted. Spec can be found here: https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification * The current implementation will only work with VFIO. * DSA has a number of engines that can be grouped based on application need such as type of memory being served or QoS. Engines are processing units and are assigned to groups. Work queues are on device structures that act as front-end groups for queueing descriptors. Full details on what is configurable & how will come in later doc patches. * There is a finite number of work queue slots that are divided amongst the number of desired work queues in some fashion (ie evenly). * SW (outside of the idxd lib) is required to manage flow control, to not over-run the work queues.This is provided in the accel plug-in module. The upper layers use public API to manage this. * Work queue submissions are done with a 64 byte atomic instruction * The design here creates a set of descriptor rings per channel that match the size of the work queues. Then, an spdk_bit_array is used to make sure we don't overrun a queue. If there are not slots available, the operation is put on a linked list to be retried later from the poller. * As we need to support any number of channels (we can't limit ourselves to the number of work queues) we need to dynamically size/resize our per channel descriptor rings based on the number of current channels. This is done from upper layers via public API into the lib. * As channels are created, the total number of work queue slots is divided across the channels evenly. Same thing when they are destroyed, remaining channels with see the ring sizes increase. This is done from upper layers via public API into the lib. * The sim has 64 total work queue entries (WQE) that get dolled out to the work queues (WQ) evenly. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: I899bbeda3cef3db05bea4197b8757e89dddb579d Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/1809 Community-CI: Mellanox Build Bot Reviewed-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com> Reviewed-by: Shuhei Matsumoto <shuhei.matsumoto.xt@hitachi.com> Reviewed-by: Vitaliy Mysak <vitaliy.mysak@intel.com> Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
2020-04-10 15:29:01 +00:00
struct idxd_comp *completions;
lib/idxd: refactor batching for increased performance And to eliminate an artificial constraint on # of user descriptors. The main idea here was to move from a single ring that covered all user descriptors to a pre-allocated ring per pre-allocated batch. In addition, the other major change here is in how we poll for completions. We used to poll the batch rings then the main ring. Now when commands are prepared their completion address is added to a per channel list and the poller simply runs through that list not caring which ring the completion address belongs too. This simplifies the completion logic considerably and will avoid polling locations that can't potentially have a completion. Some minor rework was included as well, mainly getting rid of the ring_ctrl struct as it didn't serve much of a purpose anyway and with how things are setup now its easier to read with all the elements in the channel struct. Also, a change that came in while this was WIP needed a few fixes to function correctly. Addressed those and moved them to a helper function so we have one point of control for xlations. Added support for NOP in cases where a batch is submitted with only 1 descriptor. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: Ie201b28118823100e908e0d1b08e7c10bb8fa9e7 Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/3654 Tested-by: SPDK CI Jenkins <sys_sgci@intel.com> Community-CI: Mellanox Build Bot Reviewed-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com>
2020-08-03 11:54:55 -04:00
/* Current list of oustanding completion addresses to poll. */
TAILQ_HEAD(, idxd_comp) comp_ctx_oustanding;
lib/idxd: add low level idxd library Module, etc., will follow. Notes: * IDXD is an Intel silicon feature available in future Intel CPUs. Initial development is being done on a simulator. Once HW is available and the code fully tested the experimental label will be lifted. Spec can be found here: https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification * The current implementation will only work with VFIO. * DSA has a number of engines that can be grouped based on application need such as type of memory being served or QoS. Engines are processing units and are assigned to groups. Work queues are on device structures that act as front-end groups for queueing descriptors. Full details on what is configurable & how will come in later doc patches. * There is a finite number of work queue slots that are divided amongst the number of desired work queues in some fashion (ie evenly). * SW (outside of the idxd lib) is required to manage flow control, to not over-run the work queues.This is provided in the accel plug-in module. The upper layers use public API to manage this. * Work queue submissions are done with a 64 byte atomic instruction * The design here creates a set of descriptor rings per channel that match the size of the work queues. Then, an spdk_bit_array is used to make sure we don't overrun a queue. If there are not slots available, the operation is put on a linked list to be retried later from the poller. * As we need to support any number of channels (we can't limit ourselves to the number of work queues) we need to dynamically size/resize our per channel descriptor rings based on the number of current channels. This is done from upper layers via public API into the lib. * As channels are created, the total number of work queue slots is divided across the channels evenly. Same thing when they are destroyed, remaining channels with see the ring sizes increase. This is done from upper layers via public API into the lib. * The sim has 64 total work queue entries (WQE) that get dolled out to the work queues (WQ) evenly. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: I899bbeda3cef3db05bea4197b8757e89dddb579d Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/1809 Community-CI: Mellanox Build Bot Reviewed-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com> Reviewed-by: Shuhei Matsumoto <shuhei.matsumoto.xt@hitachi.com> Reviewed-by: Vitaliy Mysak <vitaliy.mysak@intel.com> Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
2020-04-10 15:29:01 +00:00
/*
* We use one bit array to track ring slots for both
idxd: add batch capability to accel framework and IDXD back-end This patch only includes the basic framework for batching and the ability to batch one type of command, copy. Follow-on patches will add the ability to batch other commands and include an example of how to do so via the accel perf tool. SW engine support for batching will also come in a future patch. Documentation will also be coming. Batching allows the application to submit a list of independent descriptors to DSA with one single "batch" descriptor. This is beneficial when the application is in a position to have several operations ready at once; batching saves the overhead of submitting each one separately. The way batching works in SPDK is as follows: 1) The app gets a handle to a new batch with spdk_accel_batch_create() 2) The app uses that handle to prepare a command to be included in the batch. For copy the command is spdk_accel_batch_prep_copy(). The app many continue to prep commands for the batch up to the max via calling spdk_accel_batch_get_max() 3) The app then submits the batch with spdk_accel_batch_submit() 4) The callback provided for each command in the batch will be called as they complete, the callback provided to the batch submit itself will be called then the entire batch is done. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: I4102e9291fe59a245cedde6888f42a923b6dbafd Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/2248 Community-CI: Mellanox Build Bot Tested-by: SPDK CI Jenkins <sys_sgci@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com> Reviewed-by: Jim Harris <james.r.harris@intel.com>
2020-05-07 14:45:15 -04:00
* desc and completions.
lib/idxd: refactor batching for increased performance And to eliminate an artificial constraint on # of user descriptors. The main idea here was to move from a single ring that covered all user descriptors to a pre-allocated ring per pre-allocated batch. In addition, the other major change here is in how we poll for completions. We used to poll the batch rings then the main ring. Now when commands are prepared their completion address is added to a per channel list and the poller simply runs through that list not caring which ring the completion address belongs too. This simplifies the completion logic considerably and will avoid polling locations that can't potentially have a completion. Some minor rework was included as well, mainly getting rid of the ring_ctrl struct as it didn't serve much of a purpose anyway and with how things are setup now its easier to read with all the elements in the channel struct. Also, a change that came in while this was WIP needed a few fixes to function correctly. Addressed those and moved them to a helper function so we have one point of control for xlations. Added support for NOP in cases where a batch is submitted with only 1 descriptor. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: Ie201b28118823100e908e0d1b08e7c10bb8fa9e7 Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/3654 Tested-by: SPDK CI Jenkins <sys_sgci@intel.com> Community-CI: Mellanox Build Bot Reviewed-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com>
2020-08-03 11:54:55 -04:00
*
* TODO: We can get rid of the bit array and just use a uint
* to manage flow control as the current implementation saves
* enough info in comp_ctx that it doesn't need the index. Keeping
* the bit arrays for now as (a) they provide some extra debug benefit
* until we have silicon and (b) they may still be needed depending on
* polling implementation experiments that we need to run with real silicon.
lib/idxd: add low level idxd library Module, etc., will follow. Notes: * IDXD is an Intel silicon feature available in future Intel CPUs. Initial development is being done on a simulator. Once HW is available and the code fully tested the experimental label will be lifted. Spec can be found here: https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification * The current implementation will only work with VFIO. * DSA has a number of engines that can be grouped based on application need such as type of memory being served or QoS. Engines are processing units and are assigned to groups. Work queues are on device structures that act as front-end groups for queueing descriptors. Full details on what is configurable & how will come in later doc patches. * There is a finite number of work queue slots that are divided amongst the number of desired work queues in some fashion (ie evenly). * SW (outside of the idxd lib) is required to manage flow control, to not over-run the work queues.This is provided in the accel plug-in module. The upper layers use public API to manage this. * Work queue submissions are done with a 64 byte atomic instruction * The design here creates a set of descriptor rings per channel that match the size of the work queues. Then, an spdk_bit_array is used to make sure we don't overrun a queue. If there are not slots available, the operation is put on a linked list to be retried later from the poller. * As we need to support any number of channels (we can't limit ourselves to the number of work queues) we need to dynamically size/resize our per channel descriptor rings based on the number of current channels. This is done from upper layers via public API into the lib. * As channels are created, the total number of work queue slots is divided across the channels evenly. Same thing when they are destroyed, remaining channels with see the ring sizes increase. This is done from upper layers via public API into the lib. * The sim has 64 total work queue entries (WQE) that get dolled out to the work queues (WQ) evenly. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: I899bbeda3cef3db05bea4197b8757e89dddb579d Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/1809 Community-CI: Mellanox Build Bot Reviewed-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com> Reviewed-by: Shuhei Matsumoto <shuhei.matsumoto.xt@hitachi.com> Reviewed-by: Vitaliy Mysak <vitaliy.mysak@intel.com> Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
2020-04-10 15:29:01 +00:00
*/
struct spdk_bit_array *ring_slots;
uint32_t max_ring_slots;
idxd: add batch capability to accel framework and IDXD back-end This patch only includes the basic framework for batching and the ability to batch one type of command, copy. Follow-on patches will add the ability to batch other commands and include an example of how to do so via the accel perf tool. SW engine support for batching will also come in a future patch. Documentation will also be coming. Batching allows the application to submit a list of independent descriptors to DSA with one single "batch" descriptor. This is beneficial when the application is in a position to have several operations ready at once; batching saves the overhead of submitting each one separately. The way batching works in SPDK is as follows: 1) The app gets a handle to a new batch with spdk_accel_batch_create() 2) The app uses that handle to prepare a command to be included in the batch. For copy the command is spdk_accel_batch_prep_copy(). The app many continue to prep commands for the batch up to the max via calling spdk_accel_batch_get_max() 3) The app then submits the batch with spdk_accel_batch_submit() 4) The callback provided for each command in the batch will be called as they complete, the callback provided to the batch submit itself will be called then the entire batch is done. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: I4102e9291fe59a245cedde6888f42a923b6dbafd Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/2248 Community-CI: Mellanox Build Bot Tested-by: SPDK CI Jenkins <sys_sgci@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com> Reviewed-by: Jim Harris <james.r.harris@intel.com>
2020-05-07 14:45:15 -04:00
lib/idxd: refactor batching for increased performance And to eliminate an artificial constraint on # of user descriptors. The main idea here was to move from a single ring that covered all user descriptors to a pre-allocated ring per pre-allocated batch. In addition, the other major change here is in how we poll for completions. We used to poll the batch rings then the main ring. Now when commands are prepared their completion address is added to a per channel list and the poller simply runs through that list not caring which ring the completion address belongs too. This simplifies the completion logic considerably and will avoid polling locations that can't potentially have a completion. Some minor rework was included as well, mainly getting rid of the ring_ctrl struct as it didn't serve much of a purpose anyway and with how things are setup now its easier to read with all the elements in the channel struct. Also, a change that came in while this was WIP needed a few fixes to function correctly. Addressed those and moved them to a helper function so we have one point of control for xlations. Added support for NOP in cases where a batch is submitted with only 1 descriptor. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: Ie201b28118823100e908e0d1b08e7c10bb8fa9e7 Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/3654 Tested-by: SPDK CI Jenkins <sys_sgci@intel.com> Community-CI: Mellanox Build Bot Reviewed-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com>
2020-08-03 11:54:55 -04:00
/* Lists of batches, free and in use. */
TAILQ_HEAD(, idxd_batch) batch_pool;
TAILQ_HEAD(, idxd_batch) batches;
lib/idxd: add low level idxd library Module, etc., will follow. Notes: * IDXD is an Intel silicon feature available in future Intel CPUs. Initial development is being done on a simulator. Once HW is available and the code fully tested the experimental label will be lifted. Spec can be found here: https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification * The current implementation will only work with VFIO. * DSA has a number of engines that can be grouped based on application need such as type of memory being served or QoS. Engines are processing units and are assigned to groups. Work queues are on device structures that act as front-end groups for queueing descriptors. Full details on what is configurable & how will come in later doc patches. * There is a finite number of work queue slots that are divided amongst the number of desired work queues in some fashion (ie evenly). * SW (outside of the idxd lib) is required to manage flow control, to not over-run the work queues.This is provided in the accel plug-in module. The upper layers use public API to manage this. * Work queue submissions are done with a 64 byte atomic instruction * The design here creates a set of descriptor rings per channel that match the size of the work queues. Then, an spdk_bit_array is used to make sure we don't overrun a queue. If there are not slots available, the operation is put on a linked list to be retried later from the poller. * As we need to support any number of channels (we can't limit ourselves to the number of work queues) we need to dynamically size/resize our per channel descriptor rings based on the number of current channels. This is done from upper layers via public API into the lib. * As channels are created, the total number of work queue slots is divided across the channels evenly. Same thing when they are destroyed, remaining channels with see the ring sizes increase. This is done from upper layers via public API into the lib. * The sim has 64 total work queue entries (WQE) that get dolled out to the work queues (WQ) evenly. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: I899bbeda3cef3db05bea4197b8757e89dddb579d Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/1809 Community-CI: Mellanox Build Bot Reviewed-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com> Reviewed-by: Shuhei Matsumoto <shuhei.matsumoto.xt@hitachi.com> Reviewed-by: Vitaliy Mysak <vitaliy.mysak@intel.com> Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
2020-04-10 15:29:01 +00:00
lib/idxd: refactor batching for increased performance And to eliminate an artificial constraint on # of user descriptors. The main idea here was to move from a single ring that covered all user descriptors to a pre-allocated ring per pre-allocated batch. In addition, the other major change here is in how we poll for completions. We used to poll the batch rings then the main ring. Now when commands are prepared their completion address is added to a per channel list and the poller simply runs through that list not caring which ring the completion address belongs too. This simplifies the completion logic considerably and will avoid polling locations that can't potentially have a completion. Some minor rework was included as well, mainly getting rid of the ring_ctrl struct as it didn't serve much of a purpose anyway and with how things are setup now its easier to read with all the elements in the channel struct. Also, a change that came in while this was WIP needed a few fixes to function correctly. Addressed those and moved them to a helper function so we have one point of control for xlations. Added support for NOP in cases where a batch is submitted with only 1 descriptor. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: Ie201b28118823100e908e0d1b08e7c10bb8fa9e7 Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/3654 Tested-by: SPDK CI Jenkins <sys_sgci@intel.com> Community-CI: Mellanox Build Bot Reviewed-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com>
2020-08-03 11:54:55 -04:00
void *batch_base;
lib/idxd: add low level idxd library Module, etc., will follow. Notes: * IDXD is an Intel silicon feature available in future Intel CPUs. Initial development is being done on a simulator. Once HW is available and the code fully tested the experimental label will be lifted. Spec can be found here: https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification * The current implementation will only work with VFIO. * DSA has a number of engines that can be grouped based on application need such as type of memory being served or QoS. Engines are processing units and are assigned to groups. Work queues are on device structures that act as front-end groups for queueing descriptors. Full details on what is configurable & how will come in later doc patches. * There is a finite number of work queue slots that are divided amongst the number of desired work queues in some fashion (ie evenly). * SW (outside of the idxd lib) is required to manage flow control, to not over-run the work queues.This is provided in the accel plug-in module. The upper layers use public API to manage this. * Work queue submissions are done with a 64 byte atomic instruction * The design here creates a set of descriptor rings per channel that match the size of the work queues. Then, an spdk_bit_array is used to make sure we don't overrun a queue. If there are not slots available, the operation is put on a linked list to be retried later from the poller. * As we need to support any number of channels (we can't limit ourselves to the number of work queues) we need to dynamically size/resize our per channel descriptor rings based on the number of current channels. This is done from upper layers via public API into the lib. * As channels are created, the total number of work queue slots is divided across the channels evenly. Same thing when they are destroyed, remaining channels with see the ring sizes increase. This is done from upper layers via public API into the lib. * The sim has 64 total work queue entries (WQE) that get dolled out to the work queues (WQ) evenly. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: I899bbeda3cef3db05bea4197b8757e89dddb579d Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/1809 Community-CI: Mellanox Build Bot Reviewed-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com> Reviewed-by: Shuhei Matsumoto <shuhei.matsumoto.xt@hitachi.com> Reviewed-by: Vitaliy Mysak <vitaliy.mysak@intel.com> Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
2020-04-10 15:29:01 +00:00
};
struct pci_dev_id {
int vendor_id;
int device_id;
};
struct idxd_group {
struct spdk_idxd_device *idxd;
struct idxd_grpcfg grpcfg;
struct pci_dev_id pcidev;
int num_engines;
int num_wqs;
int id;
uint8_t tokens_allowed;
bool use_token_limit;
uint8_t tokens_reserved;
int tc_a;
int tc_b;
};
/*
* This struct wraps the hardware completion record which is 32 bytes in
* size and must be 32 byte aligned.
*/
struct idxd_comp {
struct idxd_hw_comp_record hw;
void *cb_arg;
lib/idxd: add low level idxd library Module, etc., will follow. Notes: * IDXD is an Intel silicon feature available in future Intel CPUs. Initial development is being done on a simulator. Once HW is available and the code fully tested the experimental label will be lifted. Spec can be found here: https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification * The current implementation will only work with VFIO. * DSA has a number of engines that can be grouped based on application need such as type of memory being served or QoS. Engines are processing units and are assigned to groups. Work queues are on device structures that act as front-end groups for queueing descriptors. Full details on what is configurable & how will come in later doc patches. * There is a finite number of work queue slots that are divided amongst the number of desired work queues in some fashion (ie evenly). * SW (outside of the idxd lib) is required to manage flow control, to not over-run the work queues.This is provided in the accel plug-in module. The upper layers use public API to manage this. * Work queue submissions are done with a 64 byte atomic instruction * The design here creates a set of descriptor rings per channel that match the size of the work queues. Then, an spdk_bit_array is used to make sure we don't overrun a queue. If there are not slots available, the operation is put on a linked list to be retried later from the poller. * As we need to support any number of channels (we can't limit ourselves to the number of work queues) we need to dynamically size/resize our per channel descriptor rings based on the number of current channels. This is done from upper layers via public API into the lib. * As channels are created, the total number of work queue slots is divided across the channels evenly. Same thing when they are destroyed, remaining channels with see the ring sizes increase. This is done from upper layers via public API into the lib. * The sim has 64 total work queue entries (WQE) that get dolled out to the work queues (WQ) evenly. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: I899bbeda3cef3db05bea4197b8757e89dddb579d Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/1809 Community-CI: Mellanox Build Bot Reviewed-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com> Reviewed-by: Shuhei Matsumoto <shuhei.matsumoto.xt@hitachi.com> Reviewed-by: Vitaliy Mysak <vitaliy.mysak@intel.com> Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
2020-04-10 15:29:01 +00:00
spdk_idxd_req_cb cb_fn;
idxd: add batch capability to accel framework and IDXD back-end This patch only includes the basic framework for batching and the ability to batch one type of command, copy. Follow-on patches will add the ability to batch other commands and include an example of how to do so via the accel perf tool. SW engine support for batching will also come in a future patch. Documentation will also be coming. Batching allows the application to submit a list of independent descriptors to DSA with one single "batch" descriptor. This is beneficial when the application is in a position to have several operations ready at once; batching saves the overhead of submitting each one separately. The way batching works in SPDK is as follows: 1) The app gets a handle to a new batch with spdk_accel_batch_create() 2) The app uses that handle to prepare a command to be included in the batch. For copy the command is spdk_accel_batch_prep_copy(). The app many continue to prep commands for the batch up to the max via calling spdk_accel_batch_get_max() 3) The app then submits the batch with spdk_accel_batch_submit() 4) The callback provided for each command in the batch will be called as they complete, the callback provided to the batch submit itself will be called then the entire batch is done. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: I4102e9291fe59a245cedde6888f42a923b6dbafd Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/2248 Community-CI: Mellanox Build Bot Tested-by: SPDK CI Jenkins <sys_sgci@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com> Reviewed-by: Jim Harris <james.r.harris@intel.com>
2020-05-07 14:45:15 -04:00
struct idxd_batch *batch;
lib/idxd: refactor batching for increased performance And to eliminate an artificial constraint on # of user descriptors. The main idea here was to move from a single ring that covered all user descriptors to a pre-allocated ring per pre-allocated batch. In addition, the other major change here is in how we poll for completions. We used to poll the batch rings then the main ring. Now when commands are prepared their completion address is added to a per channel list and the poller simply runs through that list not caring which ring the completion address belongs too. This simplifies the completion logic considerably and will avoid polling locations that can't potentially have a completion. Some minor rework was included as well, mainly getting rid of the ring_ctrl struct as it didn't serve much of a purpose anyway and with how things are setup now its easier to read with all the elements in the channel struct. Also, a change that came in while this was WIP needed a few fixes to function correctly. Addressed those and moved them to a helper function so we have one point of control for xlations. Added support for NOP in cases where a batch is submitted with only 1 descriptor. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: Ie201b28118823100e908e0d1b08e7c10bb8fa9e7 Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/3654 Tested-by: SPDK CI Jenkins <sys_sgci@intel.com> Community-CI: Mellanox Build Bot Reviewed-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com>
2020-08-03 11:54:55 -04:00
bool batch_op;
struct idxd_hw_desc *desc;
uint32_t index;
char pad[3];
TAILQ_ENTRY(idxd_comp) link;
};
SPDK_STATIC_ASSERT(sizeof(struct idxd_comp) == 96, "size mismatch");
lib/idxd: add low level idxd library Module, etc., will follow. Notes: * IDXD is an Intel silicon feature available in future Intel CPUs. Initial development is being done on a simulator. Once HW is available and the code fully tested the experimental label will be lifted. Spec can be found here: https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification * The current implementation will only work with VFIO. * DSA has a number of engines that can be grouped based on application need such as type of memory being served or QoS. Engines are processing units and are assigned to groups. Work queues are on device structures that act as front-end groups for queueing descriptors. Full details on what is configurable & how will come in later doc patches. * There is a finite number of work queue slots that are divided amongst the number of desired work queues in some fashion (ie evenly). * SW (outside of the idxd lib) is required to manage flow control, to not over-run the work queues.This is provided in the accel plug-in module. The upper layers use public API to manage this. * Work queue submissions are done with a 64 byte atomic instruction * The design here creates a set of descriptor rings per channel that match the size of the work queues. Then, an spdk_bit_array is used to make sure we don't overrun a queue. If there are not slots available, the operation is put on a linked list to be retried later from the poller. * As we need to support any number of channels (we can't limit ourselves to the number of work queues) we need to dynamically size/resize our per channel descriptor rings based on the number of current channels. This is done from upper layers via public API into the lib. * As channels are created, the total number of work queue slots is divided across the channels evenly. Same thing when they are destroyed, remaining channels with see the ring sizes increase. This is done from upper layers via public API into the lib. * The sim has 64 total work queue entries (WQE) that get dolled out to the work queues (WQ) evenly. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: I899bbeda3cef3db05bea4197b8757e89dddb579d Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/1809 Community-CI: Mellanox Build Bot Reviewed-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com> Reviewed-by: Shuhei Matsumoto <shuhei.matsumoto.xt@hitachi.com> Reviewed-by: Vitaliy Mysak <vitaliy.mysak@intel.com> Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
2020-04-10 15:29:01 +00:00
struct idxd_wq {
struct spdk_idxd_device *idxd;
struct idxd_group *group;
union idxd_wqcfg wqcfg;
};
struct spdk_idxd_device {
struct spdk_pci_device *device;
void *reg_base;
void *portals;
int socket_id;
int wq_id;
uint32_t num_channels;
bool needs_rebalance;
pthread_mutex_t num_channels_lock;
lib/idxd: add low level idxd library Module, etc., will follow. Notes: * IDXD is an Intel silicon feature available in future Intel CPUs. Initial development is being done on a simulator. Once HW is available and the code fully tested the experimental label will be lifted. Spec can be found here: https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification * The current implementation will only work with VFIO. * DSA has a number of engines that can be grouped based on application need such as type of memory being served or QoS. Engines are processing units and are assigned to groups. Work queues are on device structures that act as front-end groups for queueing descriptors. Full details on what is configurable & how will come in later doc patches. * There is a finite number of work queue slots that are divided amongst the number of desired work queues in some fashion (ie evenly). * SW (outside of the idxd lib) is required to manage flow control, to not over-run the work queues.This is provided in the accel plug-in module. The upper layers use public API to manage this. * Work queue submissions are done with a 64 byte atomic instruction * The design here creates a set of descriptor rings per channel that match the size of the work queues. Then, an spdk_bit_array is used to make sure we don't overrun a queue. If there are not slots available, the operation is put on a linked list to be retried later from the poller. * As we need to support any number of channels (we can't limit ourselves to the number of work queues) we need to dynamically size/resize our per channel descriptor rings based on the number of current channels. This is done from upper layers via public API into the lib. * As channels are created, the total number of work queue slots is divided across the channels evenly. Same thing when they are destroyed, remaining channels with see the ring sizes increase. This is done from upper layers via public API into the lib. * The sim has 64 total work queue entries (WQE) that get dolled out to the work queues (WQ) evenly. Signed-off-by: paul luse <paul.e.luse@intel.com> Change-Id: I899bbeda3cef3db05bea4197b8757e89dddb579d Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/1809 Community-CI: Mellanox Build Bot Reviewed-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com> Reviewed-by: Shuhei Matsumoto <shuhei.matsumoto.xt@hitachi.com> Reviewed-by: Vitaliy Mysak <vitaliy.mysak@intel.com> Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
2020-04-10 15:29:01 +00:00
struct idxd_registers registers;
uint32_t ims_offset;
uint32_t msix_perm_offset;
uint32_t wqcfg_offset;
uint32_t grpcfg_offset;
uint32_t perfmon_offset;
struct idxd_group *groups;
struct idxd_wq *queues;
};
#ifdef __cplusplus
}
#endif
#endif /* __IDXD_H__ */