Commit Graph

4329 Commits

Author SHA1 Message Date
Andy Green
54a93341cc eal: explicit cast of builtin for bsf32
rte_common.h:416:9:
warning: conversion to 'uint32_t' {aka 'unsigned int'} from
'int' may change the sign of the result [-Wsign-conversion]
  return __builtin_ctz(v);
         ^~~~~~~~~~~~~~~~

The builtin is defined to return int, but we want to
return it as uint32_t.  Its only defined valid return
values are positive integers or zero, which is OK for
uint32_t.  So just add an explicit cast.

Fixes: 03f6bced5b ("eal: use intrinsic function")
Cc: stable@dpdk.org

Signed-off-by: Andy Green <andy@warmcat.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
2018-05-13 22:45:05 +02:00
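To illustrate the cast described in 54a93341cc above, here is a minimal sketch. The function name `bsf32_sketch` is hypothetical and only stands in for the helper in rte_common.h; the `v != 0` precondition reflects the fact that `__builtin_ctz()` is undefined for a zero argument.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the helper touched by the commit:
 * index of the least significant set bit of v. */
static inline uint32_t
bsf32_sketch(uint32_t v)
{
	/* __builtin_ctz() is undefined for 0, so callers must pass v != 0. */
	assert(v != 0);
	/* The builtin returns int; the result lies in 0..31, so the explicit
	 * cast preserves the value and avoids the -Wsign-conversion warning. */
	return (uint32_t)__builtin_ctz(v);
}
```

For example, `bsf32_sketch(12)` yields 2, since 12 is binary 1100.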
Bruce Richardson
bd0d13a8c4 build: ensure compatibility with future meson versions
Meson 0.46 fixed a bug where "extract_all_objects" would not recursively
extract objects not compiled from source for a target. To keep backward
compatibility, a "recursive" keyword-arg was added to make this optional.
The value is "false" by default for now, but will change to "true" in
future, so we hard-code it to "false" in our code to ensure future
compatibility.

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
Acked-by: Luca Boccassi <bluca@debian.org>
2018-05-08 22:22:02 +02:00
Konstantin Ananyev
a93ff62a89 bpf: introduce basic Rx/Tx filters
Introduce API to install BPF based filters on ethdev RX/TX path.
Current implementation is pure SW one, based on ethdev RX/TX
callback mechanism.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Ferruh Yigit <ferruh.yigit@intel.com>
2018-05-12 00:36:34 +02:00
Konstantin Ananyev
cc752e43e0 bpf: add JIT compilation for x86_64 ISA
Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Ferruh Yigit <ferruh.yigit@intel.com>
2018-05-12 00:36:27 +02:00
Konstantin Ananyev
6e12ec4c4d bpf: add more checks
Add checks for:
 - all instructions are valid ones
   (known opcodes, correct syntax, valid reg/off/imm values, etc.)
 - no unreachable instructions
 - no loops
 - basic stack boundaries checks
 - division by zero

Still need to add checks for:
 - use/return only initialized registers and stack data.
 - memory boundaries violation

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Ferruh Yigit <ferruh.yigit@intel.com>
2018-05-12 00:35:23 +02:00
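As a rough illustration of one check listed in 6e12ec4c4d above (division by zero), the sketch below scans a program for DIV/MOD instructions whose immediate operand is zero. The struct `insn_sketch` and the function name are hypothetical and local to this example; only the opcode bit values come from the eBPF instruction encoding. How librte_bpf actually implements the check is not stated in the message.

```c
#include <stddef.h>
#include <stdint.h>

/* Opcode bit fields from the eBPF instruction encoding. */
#define SK_BPF_CLASS(code) ((code) & 0x07)
#define SK_BPF_OP(code)    ((code) & 0xf0)
#define SK_BPF_SRC(code)   ((code) & 0x08)
#define SK_BPF_ALU   0x04
#define SK_BPF_ALU64 0x07
#define SK_BPF_DIV   0x30
#define SK_BPF_MOD   0x90
#define SK_BPF_K     0x00 /* source operand is the immediate */

/* Local, simplified instruction layout used only by this sketch. */
struct insn_sketch {
	uint8_t code;
	uint8_t regs; /* dst/src register nibbles, unused here */
	int16_t off;
	int32_t imm;
};

/* Return the index of the first DIV/MOD by a constant zero, or -1 if none. */
static int
find_const_div_by_zero(const struct insn_sketch *ins, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		uint8_t cls = SK_BPF_CLASS(ins[i].code);
		uint8_t op = SK_BPF_OP(ins[i].code);

		if ((cls == SK_BPF_ALU || cls == SK_BPF_ALU64) &&
		    (op == SK_BPF_DIV || op == SK_BPF_MOD) &&
		    SK_BPF_SRC(ins[i].code) == SK_BPF_K && ins[i].imm == 0)
			return (int)i;
	}
	return -1;
}
```

A register divisor cannot be checked this way, because its value is only known at run time; this sketch covers the constant case only.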
Konstantin Ananyev
5dba93ae5f bpf: add ability to load eBPF program from ELF object file
Introduce rte_bpf_elf_load() function to provide ability to
load eBPF program from ELF object file.
It also adds dependency on libelf.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Ferruh Yigit <ferruh.yigit@intel.com>
2018-05-12 00:35:20 +02:00
Konstantin Ananyev
94972f35a0 bpf: add BPF loading and execution framework
librte_bpf provides a framework to load and execute eBPF bytecode
inside user-space DPDK-based applications.
It supports a basic set of features from the eBPF spec
(https://www.kernel.org/doc/Documentation/networking/filter.txt).

Not currently supported features:
 - JIT
 - cBPF
 - tail-pointer call
 - eBPF MAP
 - skb
 - function calls for 32-bit apps
 - mbuf pointer as input parameter for 32-bit apps

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Ferruh Yigit <ferruh.yigit@intel.com>
2018-05-12 00:35:15 +02:00
Kamil Chalupnik
58a695c6ec bbdev: split queue groups
Splitting Queue Groups into UL/DL Groups in Turbo Software
Driver. They are independent for Decode/Encode.
Release note updated accordingly.

Signed-off-by: Kamil Chalupnik <kamilx.chalupnik@intel.com>
Acked-by: Amr Mokhtar <amr.mokhtar@intel.com>
2018-05-10 17:46:20 +01:00
Kamil Chalupnik
864edd6935 bbdev: measure offload cost
New test created to measure offload cost.
Changes were introduced in API, turbo software driver
and test application

Signed-off-by: Kamil Chalupnik <kamilx.chalupnik@intel.com>
Acked-by: Amr Mokhtar <amr.mokhtar@intel.com>
2018-05-10 17:46:20 +01:00
Kamil Chalupnik
795ae2df4d baseband/turbo_sw: support optional CRC overlap
Support for optional CRC overlap in decode processing implemented
in Turbo Software driver

Signed-off-by: Kamil Chalupnik <kamilx.chalupnik@intel.com>
Acked-by: Amr Mokhtar <amr.mokhtar@intel.com>
2018-05-10 17:46:20 +01:00
Kamil Chalupnik
47d5a04969 baseband/turbo_sw: scale likelihood ratio input
Update Turbo Software driver for Wireless Baseband Device:
- function scaling input LLR values to specific range [-16, 16] added
- new test vectors to check device capabilities added
- release note updated accordingly

Signed-off-by: Kamil Chalupnik <kamilx.chalupnik@intel.com>
Acked-by: Amr Mokhtar <amr.mokhtar@intel.com>
2018-05-10 17:46:20 +01:00
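The message for 47d5a04969 above does not say how the LLR inputs are brought into [-16, 16]. The sketch below assumes simple saturation (clamping) of 8-bit values; the function name `llr_saturate` and the `int8_t` element type are assumptions, and the driver may rescale instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LLR_MIN (-16) /* bounds taken from the commit message */
#define LLR_MAX 16

/* Saturate every LLR value in place to [LLR_MIN, LLR_MAX]. */
static void
llr_saturate(int8_t *llr, size_t n)
{
	assert(llr != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (llr[i] < LLR_MIN)
			llr[i] = LLR_MIN;
		else if (llr[i] > LLR_MAX)
			llr[i] = LLR_MAX;
	}
}
```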
Kamil Chalupnik
6a1d032e79 baseband/turbo_sw: move macros to bbdev library
Signed-off-by: Kamil Chalupnik <kamilx.chalupnik@intel.com>
Acked-by: Amr Mokhtar <amr.mokhtar@intel.com>
2018-05-10 17:46:20 +01:00
Fiona Trahe
1466bafb9a compressdev: get device id from name
Added API to retrieve the device id provided the device name.

Signed-off-by: Fiona Trahe <fiona.trahe@intel.com>
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
Signed-off-by: Shally Verma <shally.verma@caviumnetworks.com>
Signed-off-by: Ashish Gupta <ashish.gupta@caviumnetworks.com>
2018-05-10 17:46:20 +01:00
Fiona Trahe
5d432f3640 compressdev: add device capabilities
Added structure which each PMD will fill out,
providing the capabilities of each driver
(containing mainly which compression services
it supports).

Signed-off-by: Fiona Trahe <fiona.trahe@intel.com>
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
Signed-off-by: Shally Verma <shally.verma@caviumnetworks.com>
Signed-off-by: Ashish Gupta <ashish.gupta@caviumnetworks.com>
2018-05-10 17:46:20 +01:00
Fiona Trahe
75736aa393 compressdev: add device stats
Signed-off-by: Fiona Trahe <fiona.trahe@intel.com>
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
Signed-off-by: Shally Verma <shally.verma@caviumnetworks.com>
Signed-off-by: Ashish Gupta <ashish.gupta@caviumnetworks.com>
2018-05-10 17:46:20 +01:00
Fiona Trahe
8f1e111539 compressdev: add compression service feature flags
Signed-off-by: Fiona Trahe <fiona.trahe@intel.com>
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
Signed-off-by: Shally Verma <shally.verma@caviumnetworks.com>
Signed-off-by: Ashish Gupta <ashish.gupta@caviumnetworks.com>
2018-05-10 17:46:20 +01:00
Fiona Trahe
f40d300a81 compressdev: add device feature flags
Signed-off-by: Fiona Trahe <fiona.trahe@intel.com>
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
Signed-off-by: Shally Verma <shally.verma@caviumnetworks.com>
Signed-off-by: Ashish Gupta <ashish.gupta@caviumnetworks.com>
2018-05-10 17:46:20 +01:00
Shally Verma
0d6717c437 compressdev: support hash operations
- Added hash algo enumeration and params in xform and rte_comp_op
- Updated compress/decompress xform to input hash algorithm
- Updated struct rte_comp_op to input hash buffer

Through a capability query, the user learns which hashes are supported
from the device info comp_feature_flag. If supported, the application can
set the desired algorithm enumeration in the xform structure and pass a
valid hash buffer during enqueue_burst().

Signed-off-by: Shally Verma <shally.verma@caviumnetworks.com>
Signed-off-by: Sunila Sahu <sunila.sahu@caviumnetworks.com>
Signed-off-by: Ashish Gupta <ashish.gupta@caviumnetworks.com>
2018-05-10 17:46:19 +01:00
Fiona Trahe
b342c57aae compressdev: support stateful operations
Added stream data (stream) in compression operation,
which will contain the private data from each PMD
to support stateful operations.
Also, added functions to create/free this data.

Signed-off-by: Fiona Trahe <fiona.trahe@intel.com>
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
Signed-off-by: Shally Verma <shally.verma@caviumnetworks.com>
Signed-off-by: Ashish Gupta <ashish.gupta@caviumnetworks.com>
2018-05-10 17:46:19 +01:00
Fiona Trahe
32176b0285 compressdev: support stateless operations
Added private transform data (priv_xform) in compression
operation, which will contain the private data from each
PMD to support stateless operations.
Also, added functions to create/free this data.

Signed-off-by: Fiona Trahe <fiona.trahe@intel.com>
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
Signed-off-by: Shally Verma <shally.verma@caviumnetworks.com>
Signed-off-by: Ashish Gupta <ashish.gupta@caviumnetworks.com>
2018-05-10 17:46:19 +01:00
Fiona Trahe
96086db5a3 compressdev: add operation management
Added functions to allocate and free compression operations.

Signed-off-by: Fiona Trahe <fiona.trahe@intel.com>
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
Signed-off-by: Shally Verma <shally.verma@caviumnetworks.com>
Signed-off-by: Ashish Gupta <ashish.gupta@caviumnetworks.com>
2018-05-10 17:46:19 +01:00
Fiona Trahe
63f4bfd532 compressdev: add enqueue/dequeue functions
Signed-off-by: Fiona Trahe <fiona.trahe@intel.com>
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
Signed-off-by: Shally Verma <shally.verma@caviumnetworks.com>
Signed-off-by: Ashish Gupta <ashish.gupta@caviumnetworks.com>
2018-05-10 17:46:19 +01:00
Fiona Trahe
f87bdc1ddc compressdev: add compression specific data
Added structures and enums specific to compression,
including the compression operation structure and the
different supported algorithms, checksums and compression
levels.

Signed-off-by: Fiona Trahe <fiona.trahe@intel.com>
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
Signed-off-by: Shally Verma <shally.verma@caviumnetworks.com>
Signed-off-by: Ashish Gupta <ashish.gupta@caviumnetworks.com>
2018-05-10 17:46:19 +01:00
Fiona Trahe
24a0fef851 compressdev: add queue pair management
Add functions to manage device queue pairs.

Signed-off-by: Fiona Trahe <fiona.trahe@intel.com>
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
Signed-off-by: Shally Verma <shally.verma@caviumnetworks.com>
Signed-off-by: Ashish Gupta <ashish.gupta@caviumnetworks.com>
2018-05-10 17:46:19 +01:00
Fiona Trahe
ed7dd94f7f compressdev: add basic device management
Add basic functions to manage compress devices,
including driver and device allocation, and the basic
interface with compressdev PMDs.

Signed-off-by: Fiona Trahe <fiona.trahe@intel.com>
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
Signed-off-by: Shally Verma <shally.verma@caviumnetworks.com>
Signed-off-by: Ashish Gupta <ashish.gupta@caviumnetworks.com>
2018-05-10 17:46:19 +01:00
Nikhil Rao
c2189c907d eventdev: make ethdev port identifiers 16-bit
Ethernet port ID data size has been extended to 16 bits since 17.11.
Update the Rx event adapter interface and implementation accordingly.

This commit bumps the library version to reflect the ABI change
caused by extending the ethernet port parameter in Rx adapter
functions from 8 to 16 bits.

Fixes: 9c38b704d2 ("eventdev: add eth Rx adapter implementation")
Cc: stable@dpdk.org

Signed-off-by: Nikhil Rao <nikhil.rao@intel.com>
2018-05-10 18:04:31 +02:00
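A small sketch of why the parameter width in c2189c907d above matters: a port ID above 255 does not survive an 8-bit parameter. Both function names are hypothetical and do not correspond to the real Rx adapter prototypes.

```c
#include <stdint.h>

/* Hypothetical pre-change prototype: the port ID travels in 8 bits. */
static inline unsigned int
port_seen_by_u8_api(uint8_t eth_dev_id)
{
	return eth_dev_id;
}

/* Hypothetical post-change prototype: the port ID travels in 16 bits. */
static inline unsigned int
port_seen_by_u16_api(uint16_t eth_dev_id)
{
	return eth_dev_id;
}
```

Passing port 300 through the 8-bit variant yields 44 (300 modulo 256), while the 16-bit variant preserves 300. Changing the parameter type alters the function signature, which is why the library version is bumped.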
Abhinandan Gujjar
7901eac340 eventdev: add crypto adapter implementation
This patch adds common code for the crypto adapter to support
SW and HW based transfer mechanisms. The adapter uses an EAL
service core function for SW based packet transfer and uses
the eventdev PMD functions to configure HW based packet
transfer between the crypto device and the event device.
This patch also adds adapter to the meson build system &
updates the necessary makefile & map file.

Signed-off-by: Abhinandan Gujjar <abhinandan.gujjar@intel.com>
Signed-off-by: Nikhil Rao <nikhil.rao@intel.com>
Signed-off-by: Gage Eads <gage.eads@intel.com>
Acked-by: Akhil Goyal <akhil.goyal@nxp.com>
2018-05-10 14:08:46 +02:00
Abhinandan Gujjar
9dc1bd7326 eventdev: add driver interface of crypto adapter
This patch defines capabilities & functions to be called
for eventdev PMDs.

Signed-off-by: Abhinandan Gujjar <abhinandan.gujjar@intel.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Acked-by: Akhil Goyal <akhil.goyal@nxp.com>
2018-05-10 14:07:37 +02:00
Abhinandan Gujjar
dbe869baf4 eventdev: introduce event crypto adapter
This patch introduces event crypto adapter APIs. It
also provides information on working model/adapter
modes & their usage. Application is expected to use
this interface to transfer packets between the crypto
device & the event device.

Signed-off-by: Abhinandan Gujjar <abhinandan.gujjar@intel.com>
Signed-off-by: Nikhil Rao <nikhil.rao@intel.com>
Signed-off-by: Gage Eads <gage.eads@intel.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Acked-by: Akhil Goyal <akhil.goyal@nxp.com>
2018-05-10 14:03:57 +02:00
Nikhil Rao
b2b8577da5 eventdev: convert eth Rx adapter files to SPDX license tag
Signed-off-by: Nikhil Rao <nikhil.rao@intel.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Acked-by: Hemant Agrawal <hemant.agrawal@nxp.com>
2018-05-10 14:03:20 +02:00
Jasvinder Singh
8ea4143883 table: add dedicated params struct for cuckoo hash
Add a dedicated parameter structure for cuckoo hash. The cuckoo hash from
librte_hash uses a slightly different prototype for the hash function (no
key_mask parameter, 32-bit seed and return value), which requires one
of the following approaches:
   1/ Function pointer conversion: gcc 8.1 warning [1], misleading [2]
   2/ Union within the parameter structure: pollutes a very generic API
      parameter structure with some implementation dependent detail
      (i.e. key mask not available for one of the available
      implementations)
   3/ Using opaque pointer for hash function: same issue from 2/
   4/ Different parameter structure: avoid issue from 2/; hopefully,
      it won't be long before librte_hash implements the key mask feature,
      so the generic API structure could be used.

[1] http://www.dpdk.org/ml/archives/dev/2018-April/094950.html
[2] http://www.dpdk.org/ml/archives/dev/2018-April/096250.html

Fixes: 5a80bf0ae6 ("table: add cuckoo hash")

Signed-off-by: Jasvinder Singh <jasvinder.singh@intel.com>
Acked-by: Cristian Dumitrescu <cristian.dumitrescu@intel.com>
2018-05-08 16:19:58 +02:00
Jasvinder Singh
4726fb245e sched: add post-init pipe profile API
Add a new API function to add more pipe configuration profiles
post initialization to the set of existing profiles specified during
the creation of the scheduler port.

This API removes the current limitation that forces the user
to define the full set of pipe profiles as the part of port parameters
while port is being created.

Signed-off-by: Jasvinder Singh <jasvinder.singh@intel.com>
2018-05-04 16:25:48 +02:00
Nikhil Rao
2fcf2f104f ethdev: support WRED thresholds in bytes
WRED thresholds can be specified in bytes if the TM leaf
node supports it. Also extend WRED thresholds to 32 bits from 16.

TM capability (port/level/queue) fields cman_wred_packet_mode_supported and
cman_wred_byte_mode_supported, when non-zero, indicate support for WRED
thresholds in packets and bytes respectively.

The packet_mode member of struct rte_tm_wred_params, when non-zero,
indicates that the min and max thresholds are specified in
packets and when zero, indicates that the min and max thresholds
are specified in bytes.

Signed-off-by: Nikhil Rao <nikhil.rao@intel.com>
2018-05-04 16:23:19 +02:00
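The packet/byte mode rules in 2fcf2f104f above can be sketched as a small validity check. Both structs are hypothetical, reduced stand-ins that model only the fields named in the message; the `min_th <= max_th` test is an added assumption, not something the message states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for struct rte_tm_wred_params. */
struct wred_params_sketch {
	uint32_t min_th; /* minimum threshold, 32 bits after this change */
	uint32_t max_th; /* maximum threshold */
	int packet_mode; /* non-zero: packets, zero: bytes */
};

/* Hypothetical capability flags, named after the fields in the message. */
struct wred_caps_sketch {
	int cman_wred_packet_mode_supported;
	int cman_wred_byte_mode_supported;
};

/* Return 1 if the parameters are usable with the given capabilities. */
static int
wred_params_ok(const struct wred_params_sketch *p,
	       const struct wred_caps_sketch *c)
{
	assert(p != NULL && c != NULL);

	if (p->min_th > p->max_th) /* assumption: thresholds must be ordered */
		return 0;
	return p->packet_mode ? (c->cman_wred_packet_mode_supported != 0)
			      : (c->cman_wred_byte_mode_supported != 0);
}
```

A byte threshold such as 131072 would not fit in the previous 16-bit field, which is one reason for widening the thresholds to 32 bits.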
Ben Shelton
50752f06d2 ethdev: fix TM API comment
The rte_tm_node_wfq_weight_mode_update() API function operates on
non-leaf nodes, not leaf nodes.

Signed-off-by: Ben Shelton <benjamin.h.shelton@intel.com>
2018-05-01 16:50:28 +02:00
Anatoly Burakov
0256386dc4 mem: add argument to memory event callback
It may be useful to pass arbitrary data to the callback (such
as device pointers), so add this to the mem event callback API.

Suggested-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>
2018-05-08 22:28:58 +02:00
Olivier Matz
5751ff40fe mempool: fix alignment of memzone length when populating
When populating a mempool with the default function, if there is not
enough virtually contiguous memory for the whole mempool, it will be
populated with several chunks. A chunk of the maximum available length
is requested with:

  mz = rte_memzone_reserve_aligned(..., len=0, ..., align=x)

If align is smaller than the page size, the address and the length of
the memzone may not be a multiple of the page size. This makes
rte_mempool_populate_virt() fail because it requires them to be
page-aligned. This patch fixes that.

The problem can be reproduced easily by allocating more than available
memory:
  ./build/app/testpmd -l 0,1 -- --total-num-mbufs=65536
  ...
  Cause: Creation of mbuf pool for socket 0 failed: Invalid argument

After the patch, the error code is correct:
  ./build/app/testpmd -l 0,1 -- --total-num-mbufs=65536
  ...
  Cause: Creation of mbuf pool for socket 0 failed: Cannot allocate memory

Fixes: ba0009560c ("mempool: support new allocation methods")

Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Andrew Rybchenko <arybchenko@solarflare.com>
2018-05-08 15:58:20 +02:00
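The message for 5751ff40fe above does not describe how the patch achieves page alignment. The sketch below only illustrates the requirement itself: trimming an arbitrary range to the largest sub-range whose start and length are multiples of the page size. The function name `page_align_range` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shrink [addr, addr + len) to the largest page-aligned sub-range.
 * pg_sz must be a power of two. Returns the aligned length (0 if no
 * whole page fits) and stores the aligned start in *aligned_addr. */
static size_t
page_align_range(uintptr_t addr, size_t len, size_t pg_sz,
		 uintptr_t *aligned_addr)
{
	assert(pg_sz != 0 && (pg_sz & (pg_sz - 1)) == 0);

	uintptr_t mask = ~(uintptr_t)(pg_sz - 1);
	uintptr_t start = (addr + pg_sz - 1) & mask; /* round start up */
	uintptr_t end = (addr + len) & mask;         /* round end down */

	*aligned_addr = start;
	return end > start ? (size_t)(end - start) : 0;
}
```

With 4 KB pages, a range starting at 0x1100 with length 0x3000 is trimmed to start 0x2000 and length 0x2000.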
Thomas Monjalon
7baac77594 version: 18.05-rc2
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
2018-05-02 23:12:16 +02:00
Konstantin Ananyev
51c7de38e2 eal/x86: fix atomic exchange for 32-bit
Should break out of loop when rte_atomic64_cmpset() returns non-zero.

Fixes: ff2863570f ("eal: introduce atomic exchange operation")

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
Tested-by: Ferruh Yigit <ferruh.yigit@intel.com>
2018-05-02 19:23:06 +02:00
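The loop condition fixed by 51c7de38e2 above can be sketched as follows. `cmpset64` is a hypothetical stand-in built on the GCC `__sync_bool_compare_and_swap` builtin; like the function named in the message, it returns non-zero on success, so the loop must repeat only while it returns zero.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit compare-and-set that returns
 * non-zero on success. */
static inline int
cmpset64(volatile uint64_t *dst, uint64_t exp, uint64_t src)
{
	return __sync_bool_compare_and_swap(dst, exp, src);
}

/* Atomically store val in *dst and return the previous value. */
static inline uint64_t
exchange64(volatile uint64_t *dst, uint64_t val)
{
	uint64_t old;

	do {
		old = *dst;
	} while (cmpset64(dst, old, val) == 0); /* retry only on failure */

	return old;
}
```

Inverting the condition would either spin forever after a successful swap or return after a failed one, depending on contention.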
Anatoly Burakov
0db6d2782c malloc: avoid padding elements on page deallocation
Currently, when deallocating pages, malloc will fix up other
elements' headers if there is not enough space to store a full
element in leftover space. This leads to race conditions because
there are some functions that check for pad size with an unlocked
heap, expecting pad size to be constant.

Fix it by being more conservative and only freeing pages when
there is enough space before and after the page to store a free
element.

Fixes: 1403f87d4f ("malloc: enable memory hotplug support")

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-05-02 18:35:19 +02:00
Anatoly Burakov
dc14d4f026 malloc: set pad to 0 on free
The pad value is not used unless element is in pad state, but it
will show up in heap dumps and may be confusing.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-05-02 18:35:19 +02:00
Jianfeng Tan
3a0d465d4c eal: fix use-after-free on control thread creation
After the commit below, we encounter some strange issues:
  1) Deadlock as described here:
     http://dpdk.org/ml/archives/dev/2018-April/099806.html
  2) SIGSEGV when starting testpmd in a VM.

Since the commit below switched the memory barrier from the stack to
dynamic memory, we suspect both are caused by a use-after-free.

Fixes: 3d09a6e26d ("eal: fix threads block on barrier")

Reported-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Reported-by: Lei Yao <lei.a.yao@intel.com>
Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
Suggested-by: Olivier Matz <olivier.matz@6wind.com>
Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Reviewed-by: Olivier Matz <olivier.matz@6wind.com>
2018-05-02 17:23:37 +02:00
Jianfeng Tan
e87923a9be eal: fix memory leak on control thread failure
params is not freed if pthread_create() fails. The fix is
straightforward.

Fixes: 3d09a6e26d ("eal: fix threads block on barrier")

Reported-by: Olivier Matz <olivier.matz@6wind.com>
Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Reviewed-by: Olivier Matz <olivier.matz@6wind.com>
2018-05-02 17:15:02 +02:00
Ferruh Yigit
af7551e2bf ethdev: remove error return on RSS hash check
Many sample applications fail because of
dev_info.flow_type_rss_offloads check in rte_eth_dev_configure()

The sample applications need to be fixed/updated before returning error
on rte_eth_dev_configure() and rte_eth_dev_rss_hash_update().

This patch keeps the error logs but removes returning errors.

Fixes: 8863a1fbfc ("ethdev: add supported hash function check")

Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
2018-05-01 17:55:15 +02:00
Anatoly Burakov
3eb9af3416 malloc: fix heap size not set on init
When heap initializes, we need to add already allocated segments
onto the heap. However, in doing that, we never increased total
heap size. Fix it by adding segment length to total heap length
when initializing the heap.

Fixes: 66cc45e293 ("mem: replace memseg with memseg lists")

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-04-30 15:33:49 +02:00
Anatoly Burakov
eb8d29f825 mem/linux: fix hugedir write deadlock
At hugepage info initialization, EAL takes out a write lock on
hugetlbfs directories, and drops it after the memory init is
finished. However, in non-legacy mode, if "-m" or "--socket-mem"
switches are passed, this leads to a deadlock because EAL tries
to allocate pages (and thus take out a write lock on hugedir)
while still holding a separate hugedir write lock in EAL.

Fix it by checking if write lock in hugepage info is active, and
not trying to lock the directory if the hugedir fd is valid.

Fixes: 1a7dc2252f ("mem: revert to using flock and add per-segment lockfiles")

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Tested-by: Shahaf Shuler <shahafs@mellanox.com>
Tested-by: Andrew Rybchenko <arybchenko@solarflare.com>
2018-04-30 15:23:17 +02:00
Thomas Monjalon
fcde84b5f8 version: 18.05-rc1
Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
2018-04-28 00:26:04 +02:00
Anatoly Burakov
1a7dc2252f mem: revert to using flock and add per-segment lockfiles
The original implementation used flock() locks, but was later
switched to using fcntl() locks for page locking, because
fcntl() locks allow locking parts of a file, which is useful
for single-file segments mode, where locking the entire file
isn't as useful because we still need to grow and shrink it.

However, according to fcntl()'s Ubuntu manpage [1], semantics of
fcntl() locks have a giant oversight:

  This interface follows the completely stupid semantics of System
  V and IEEE Std 1003.1-1988 (“POSIX.1”) that require that all
  locks associated with a file for a given process are removed
  when any file descriptor for that file is closed by that process.
  This semantic means that applications must be aware of any files
  that a subroutine library may access.

Basically, closing *any* fd with an fcntl() lock (which we do because
we don't want to leak fd's) will drop the lock completely.

So, in this commit, we will be reverting back to using flock() locks
everywhere. However, that still leaves the problem of locking parts
of a memseg list file in single file segments mode, and we will be
solving it by creating a separate lock file for each page, and
tracking those with flock().

We will also be removing all of this tailq business and replacing it
with a simple array - saving a few bytes is not worth the extra
hassle of dealing with pointers and potential memory allocation
failures. Also, remove the tailq lock since it is not needed - these
fd lists are per-process, and within a given process, it is always
only one thread handling access to hugetlbfs.

So, first one to allocate a segment will create a lockfile, and put
a shared lock on it. When we're shrinking the page file, we will be
trying to take out a write lock on that lockfile, which would fail if
any other process is holding onto the lockfile as well. This way, we
can know if we can shrink the segment file. Also, if no other locks
are found in the lock list for a given memseg list, the memseg list
fd is automatically closed.

One other thing to note is, according to flock() Ubuntu manpage [2],
upgrading the lock from shared to exclusive is implemented by dropping
and reacquiring the lock, which is not atomic and thus would have
created race conditions. So, on attempting to perform operations in
hugetlbfs, we will take out a write lock on the hugetlbfs directory, so
that only one process can perform hugetlbfs operations at a time.

[1] http://manpages.ubuntu.com/manpages/artful/en/man2/fcntl.2freebsd.html
[2] http://manpages.ubuntu.com/manpages/bionic/en/man2/flock.2.html

Fixes: 66cc45e293 ("mem: replace memseg with memseg lists")
Fixes: 582bed1e1d ("mem: support mapping hugepages at runtime")
Fixes: a5ff05d60f ("mem: support unmapping pages at runtime")
Fixes: 2a04139f66 ("eal: add single file segments option")

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
2018-04-27 23:52:51 +02:00
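The lockfile scheme described in 1a7dc2252f above (a shared flock() per user, and an exclusive attempt to find out whether anyone still uses the page) can be sketched as below. Both function names are hypothetical, and the sketch probes through a separate file descriptor instead of upgrading an existing lock, which is a simplification of what the message describes.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Open path (creating it if needed) and take a shared flock() on it.
 * Returns the fd, or -1 on error. Closing the fd drops the lock. */
static int
lockfile_acquire_shared(const char *path)
{
	int fd = open(path, O_CREAT | O_RDWR, 0600);

	if (fd < 0)
		return -1;
	if (flock(fd, LOCK_SH | LOCK_NB) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Return 1 if an exclusive lock could be taken (nobody else holds one),
 * 0 if another open file description still holds a lock, -1 on error. */
static int
lockfile_is_unused(const char *path)
{
	int fd = open(path, O_RDWR);
	int ret;

	if (fd < 0)
		return -1;
	if (flock(fd, LOCK_EX | LOCK_NB) == 0)
		ret = 1;
	else
		ret = (errno == EWOULDBLOCK) ? 0 : -1;
	close(fd); /* releases the probe lock, if it was taken */
	return ret;
}
```

Because flock() locks belong to the open file description, the probe also reports a shared lock held through another descriptor in the same process. Unlike fcntl() locks, closing the probe descriptor does not drop locks held through other descriptors.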
Anatoly Burakov
046aa5c447 mem: add memalloc init stage
Currently, memseg lists for secondary process are allocated on
sync (triggered by init), when they are accessed for the first
time. Move this initialization to a separate init stage for
memalloc.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
2018-04-27 23:52:51 +02:00
Anatoly Burakov
1be7644986 mem: improve autodetection of hugepage counts on 32-bit
For non-legacy mode, we are preallocating space for hugepages, so
we know in advance which pages we will be able to allocate, and
which we won't. However, the init procedure was using hugepage
counts gathered from sysfs and paid no attention to hugepage
sizes that were actually available for reservation, and failed
on attempts to reserve unavailable pages.

Fix this by limiting total page counts by number of pages
actually preallocated.

Also, VA preallocate procedure only looks at mountpoints that are
available, and expects pages to exist if a mountpoint exists. That
might not necessarily be the case, so also check if there are
hugepages available for a particular page size on a particular
NUMA node.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Jananee Parthasarathy <jananeex.m.parthasarathy@intel.com>
2018-04-27 23:52:51 +02:00
Anatoly Burakov
e82ca1a75e mem: improve preallocation on 32-bit
Previously, if we couldn't preallocate VA space on 32-bit for
one page size, we simply bailed out, even though we could've
tried allocating VA space with other page sizes.

For example, if user had both 1G and 2M pages enabled, and
has asked DPDK to allocate memory on both sockets, DPDK
would've tried to allocate VA space for 1x1G page on both
sockets, failed and never tried again, even though it
could've allocated the same 1G of VA space for 512x2M pages.

Fix this by retrying with different page sizes if VA space
reservation failed.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Jananee Parthasarathy <jananeex.m.parthasarathy@intel.com>
2018-04-27 23:52:51 +02:00
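The retry logic described in e82ca1a75e above can be sketched as a loop over the available page sizes. All names here are hypothetical; `reserve_only_2m` is a demonstration hook that pretends only 2 MB pages can be reserved, mirroring the example in the message where 1 GB of VA space is covered by 512 pages of 2 MB.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reservation hook: returns 0 on success, -1 on failure. */
typedef int (*reserve_fn)(uint64_t page_sz, uint64_t n_pages);

static uint64_t last_n_pages; /* records the most recent successful request */

/* Demonstration hook: pretend only 2 MB pages can be reserved. */
static int
reserve_only_2m(uint64_t page_sz, uint64_t n_pages)
{
	if (page_sz != (UINT64_C(2) << 20))
		return -1;
	last_n_pages = n_pages;
	return 0;
}

/* Try to reserve total_bytes of VA space, walking page_sizes[] in order
 * until one size succeeds. Returns the page size used, or 0 on failure. */
static uint64_t
reserve_va_any_size(uint64_t total_bytes, const uint64_t *page_sizes,
		    size_t n_sizes, reserve_fn reserve)
{
	for (size_t i = 0; i < n_sizes; i++) {
		uint64_t pg = page_sizes[i];

		if (pg == 0 || total_bytes % pg != 0)
			continue; /* this size cannot cover the request exactly */
		if (reserve(pg, total_bytes / pg) == 0)
			return pg;
	}
	return 0;
}
```

Before the change, the equivalent of this loop stopped at the first failing page size.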