Commit Graph

69 Commits

Anatoly Burakov
6080796f65 mem: make base address hint OS specific
Not all OSes follow Linux's memory layout, which may lead to
problems with the suggested common address hint in the absence
of a base-virtaddr flag. Make this address hint OS-specific.
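
As a hedged illustration (not part of this commit), an application that needs a specific layout can bypass the hint entirely by supplying its own base address; the value below is an assumption and must suit the target OS:

```
#include <rte_eal.h>

int
main(void)
{
	/* the address value is an assumption; pick one that suits the OS */
	char *eal_args[] = { "app", "--base-virtaddr=0x200000000" };

	if (rte_eal_init(2, eal_args) < 0)
		return -1;
	return rte_eal_cleanup();
}
```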

Cc: stable@dpdk.org

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2019-10-26 18:03:24 +02:00
Anatoly Burakov
028669bc9f eal: hide shared memory config
Now that everything that has ever accessed the shared memory
config is doing so through the public APIs, we can make it
internal. Since we're removing quite a few headers from
rte_eal_memconfig.h, we need to add them back in places
where this header is used.

This bumps the ABI, so also change all build files and
update the documentation.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: David Marchand <david.marchand@redhat.com>
2019-07-06 10:32:34 +02:00
Anatoly Burakov
76f80881ef mem: add API to lock/unlock memory hotplug
Currently, the memory hotplug subsystem is locked automatically
by all memory-related _walk() functions, but sometimes locking
the memory subsystem outside of them is needed. There is no
public API to do that, which creates a dependency on the shared
memory config remaining public. Fix this by introducing a new
API to lock/unlock the memory hotplug subsystem.

Create a new common file for all things mem config, add a
new API namespace rte_mcfg_*, and search-and-replace all
usages of the locks with the new API.
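
A minimal sketch of the new namespace in use, assuming the rte_mcfg_mem_* names introduced here; the lock pairs naturally with a thread-unsafe walk:

```
#include <rte_eal_memconfig.h>
#include <rte_memory.h>

static int
count_segs(const struct rte_memseg_list *msl,
	   const struct rte_memseg *ms, void *arg)
{
	(void)msl; (void)ms;
	(*(int *)arg)++;
	return 0; /* zero continues the walk */
}

int
count_all_segs(void)
{
	int n = 0;

	rte_mcfg_mem_read_lock();
	/* safe: we hold the hotplug lock ourselves */
	rte_memseg_walk_thread_unsafe(count_segs, &n);
	rte_mcfg_mem_read_unlock();
	return n;
}
```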

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: David Marchand <david.marchand@redhat.com>
2019-07-05 22:12:40 +02:00
David Marchand
cfe3aeb170 remove experimental tags from all symbol definitions
We had some inconsistencies between function prototypes and actual
definitions.
Let's avoid this by only adding the experimental tag to the prototypes.
Tests with gcc and clang show this is enough.

git grep -l __rte_experimental |grep \.c$ |while read file; do
	sed -i -e '/^__rte_experimental$/d' $file;
	sed -i -e 's/  *__rte_experimental//' $file;
	sed -i -e 's/__rte_experimental  *//' $file;
done

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
2019-06-29 19:04:43 +02:00
David Marchand
146e002c68 mem: remove incorrect experimental tag on static symbol
This function is not visible from outside this code unit.

Fixes: 84e7477e10 ("mem: add thread unsafe version for DMA mask check")
Cc: stable@dpdk.org

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
2019-06-29 19:04:39 +02:00
Shahaf Shuler
237060c4ad mem: limit use of address hint
The commit below added an address hint as the starting address for
64-bit systems in case an explicit base virtual address was not set
by the user.

The justification for such a hint was to help devices that work in VA
mode and have an address range limitation to work smoothly with the
EAL memory subsystem.

While the base address value selected may work fine for EAL
initialization, it easily breaks when trying to register external
memory using the rte_extmem_register API.

Trying to register anonymous memory on a Red Hat x86_64 machine took
several minutes, during which the function eal_get_virtual_area
repeatedly scanned for a good VA candidate.

Attempting to guess which VA address will be free for mapping will
always result in non-portable, error-prone code:
* different applications may use different libraries alongside DPDK.
  One can never guess which library was called first and how much
  virtual memory it consumed.
* external memory can be registered at any point in the application's
  run time.

In order not to break the existing secondary process design, this
patch only limits the maximum number of tries that will be made with
the address hint.
When the number of tries exceeds the threshold, the code falls back
to the address suggested by the kernel.
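
A hypothetical sketch of the bounded-retry logic described above; the constant, names, and fallback are illustrative, not DPDK's actual code:

```
#include <stddef.h>
#include <sys/mman.h>

#define MAX_HINT_TRIES 10 /* assumed threshold */

static void *
map_with_bounded_hint(void *hint, size_t len, size_t page_sz)
{
	int tries;

	for (tries = 0; tries < MAX_HINT_TRIES; tries++) {
		void *va = mmap(hint, len, PROT_NONE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (va == hint)
			return va;       /* kernel respected the hint */
		if (va != MAP_FAILED)
			munmap(va, len); /* wrong spot, try a higher hint */
		hint = (char *)hint + page_sz;
	}
	/* give up on the hint; take whatever the kernel suggests */
	return mmap(NULL, len, PROT_NONE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```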

Fixes: 1df2170287 ("mem: use address hint for mapping hugepages")
Cc: stable@dpdk.org

Signed-off-by: Shahaf Shuler <shahafs@mellanox.com>
Tested-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Alejandro Lucero <alejandro.lucero@netronome.com>
2019-04-03 19:10:47 +02:00
Anatoly Burakov
525670756a mem: fix segment fd API error code for external segment
The segment fd API does not support getting segment fds from
externally allocated memory, so return a proper error code on
any attempt to do so. This changes API behavior, so document
the change as well.
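
A sketch of the resulting caller-side behavior; the specific error code (ENODEV) is an assumption here, so check the API docs:

```
#include <errno.h>
#include <rte_errno.h>
#include <rte_memory.h>

static int
try_get_fd(const struct rte_memseg *ms)
{
	int fd = rte_memseg_get_fd(ms);

	if (fd < 0 && rte_errno == ENODEV) {
		/* externally allocated segment: no fd to hand out */
		return -1;
	}
	return fd;
}
```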

Fixes: 5282bb1c36 ("mem: allow memseg lists to be marked as external")
Cc: stable@dpdk.org

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Tiwei Bie <tiwei.bie@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
2018-12-20 22:51:49 +01:00
Anatoly Burakov
bed7941886 mem: allow usage of non-heap external memory in multiprocess
Add multiprocess support for externally allocated memory areas that
are not added to DPDK heap (and add relevant doc sections).

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
2018-12-20 18:14:55 +01:00
Anatoly Burakov
950e8fb4e1 mem: allow registering external memory areas
The general use case of external memory is well covered by the
existing external memory APIs. However, certain use cases require
manual management of externally allocated memory areas, so this
memory should not be added to the heap. It should, however, be
added to DPDK's internal structures, so that APIs like
``rte_virt2memseg`` work on such external memory segments.

This commit adds such an API to DPDK: new functions to register
and unregister externally allocated memory areas, along with
documentation for them.
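
A minimal sketch of the new API in use, under the assumption that NULL IOVAs are acceptable when I/O addresses are unknown; the helper name and sizes are illustrative:

```
#include <sys/mman.h>
#include <rte_memory.h>

#define EXT_LEN (2 * 1024 * 1024)

int
register_external_area(void)
{
	size_t page_sz = 4096; /* assumed system page size */
	void *va = mmap(NULL, EXT_LEN, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (va == MAP_FAILED)
		return -1;

	/* iova_addrs == NULL: IOVAs unknown (assumption, see API docs) */
	if (rte_extmem_register(va, EXT_LEN, NULL, 0, page_sz) != 0) {
		munmap(va, EXT_LEN);
		return -1;
	}
	/* ... rte_mem_virt2memseg(va, NULL) now works on this area ... */
	rte_extmem_unregister(va, EXT_LEN);
	munmap(va, EXT_LEN);
	return 0;
}
```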

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
2018-12-20 18:14:55 +01:00
Alejandro Lucero
ee0e074f81 mem: fix DMA mask width sanity check
Current code has different max DMA mask width values for 32- and
64-bit systems. IOMMU hardware could report a higher supported width
than the current MAX_DMA_MASK_BITS when RTE_ARCH_64 is not defined.
This is actually the case with a 32-bit kernel running on a 64-bit
server with IOMMU hardware. It could also be a problem on embedded
systems using an IOMMU designed for 64 bits in a 32-bit system.

This patch leaves a single max DMA mask width, which ensures the
mask width is within the range of the 64-bit variables used for
DMA masks. It also avoids wrong values, because any width higher
than 64 bits is likely wrong.

Fixes: 223b7f1d5e ("mem: add function for checking memseg IOVA")

Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Ferruh Yigit <ferruh.yigit@intel.com>
2018-11-07 14:42:28 +01:00
Alejandro Lucero
84e7477e10 mem: add thread unsafe version for DMA mask check
During memory initialization, calling rte_mem_check_dma_mask
leads to a deadlock because memory_hotplug_lock is locked by a
writer (the code currently executing), and rte_memseg_walk
tries to take it as a reader.

This patch adds a thread-unsafe version which calls the final
function specifying that memory_hotplug_lock does not need to be
acquired. The patch also turns rte_mem_check_dma_mask into an
intermediate step which calls the final function as before,
implying memory_hotplug_lock will be acquired.

PMDs should always use the version that acquires the lock; the
thread-unsafe one is just for internal EAL memory code.
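
A sketch contrasting the two flavors; the 40-bit device limit and wrapper names are hypothetical:

```
#include <rte_memory.h>

#define MY_PMD_DMA_BITS 40 /* hypothetical device limit */

int
my_pmd_check_dma(void)
{
	/* PMD context: the hotplug lock is taken internally */
	if (rte_mem_check_dma_mask(MY_PMD_DMA_BITS) != 0)
		return -1; /* some memseg IOVA exceeds 40-bit addressing */
	return 0;
}

int
eal_internal_check(void)
{
	/* EAL init context: the hotplug lock is already held by a writer */
	return rte_mem_check_dma_mask_thread_unsafe(MY_PMD_DMA_BITS);
}
```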

Fixes: 223b7f1d5e ("mem: add function for checking memseg IOVA")

Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com>
Tested-by: Ferruh Yigit <ferruh.yigit@intel.com>
2018-11-05 01:02:14 +01:00
Alejandro Lucero
9d15773606 mem: add function for setting DMA mask
This patch adds the possibility of setting a DMA mask to be used
once the memory initialization is done.

This is currently needed when IOVA mode is set by PCI-related
code and an x86 IOMMU hardware unit is present. Current code calls
rte_mem_check_dma_mask, but it is wrong to do so at that point
because the memory has not been initialized yet.
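
A short sketch of the intended usage; the 39-bit width is an illustrative x86 IOMMU value, not a fixed constant:

```
#include <rte_memory.h>

int
bus_set_iommu_limit(void)
{
	/* recorded now, checked by EAL once memory init completes */
	return rte_mem_set_dma_mask(39); /* e.g. an x86 IOMMU VA width */
}
```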

Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com>
Tested-by: Ferruh Yigit <ferruh.yigit@intel.com>
2018-11-05 01:02:04 +01:00
Alejandro Lucero
0de9eb6138 mem: rename DMA mask check with proper prefix
The current name rte_eal_check_dma_mask does not follow the naming
used in the rest of the file.

Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com>
Tested-by: Ferruh Yigit <ferruh.yigit@intel.com>
2018-11-05 01:01:54 +01:00
Alejandro Lucero
1df2170287 mem: use address hint for mapping hugepages
The Linux kernel uses a very high address as the starting address
for serving mmap calls. If there are addressing limitations and
IOVA mode is VA, this starting address is likely too high for
those devices. However, it is possible to use a lower address in
the process virtual address space, as 64-bit systems have plenty
of available space.

This patch adds an address hint as the starting address for 64-bit
systems and increments the hint on subsequent invocations. If the
mmap call does not return the hint address, the mmap call is
repeated using the hint address incremented by the page size.

Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com>
Reviewed-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-10-28 22:06:05 +01:00
Alejandro Lucero
223b7f1d5e mem: add function for checking memseg IOVA
A device can suffer from addressing limitations. This function
checks that memsegs have IOVAs within the range supported by a
given DMA mask.

PMDs should use this function during initialization if the device
has addressing limitations, and return an error if memsegs are
reported out of range.

Another usage is for emulated IOMMU hardware with addressing
limitations.

It is necessary to save the most restrictive DMA mask, so that
memory allocated dynamically after initialization can be checked
as well.

Signed-off-by: Alejandro Lucero <alejandro.lucero@netronome.com>
Reviewed-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-10-28 22:04:34 +01:00
Anatoly Burakov
3717943819 mem: improve musl compatibility
When built against musl, fcntl.h is not implicitly included.
Fix by including it explicitly.

Bugzilla ID: 31

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
2018-10-22 11:29:37 +02:00
Anatoly Burakov
5282bb1c36 mem: allow memseg lists to be marked as external
When we allocate and use DPDK memory, we need to be able to
differentiate between DPDK hugepage segments and segments that
were made part of DPDK but are externally allocated. Add such
a property to memseg lists.

This breaks the ABI, so document the change in the release notes.
This also breaks a few internal assumptions about memory
contiguity, so adjust the malloc code in a few places.

All current calls to memseg walk functions were adjusted to
ignore external segments where it made sense.

Mempools are a special case, because we may be asked to allocate
a mempool on a specific socket, and we need to ignore all page
sizes on other heaps or other sockets. Previously, this
assumption of knowing all page sizes was not a problem, but it
will be now, so we have to match socket ID with page size when
calculating the minimum page size for a mempool.
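
A sketch of a walk honoring the new flag; the `external` field name follows this commit, and the helper is hypothetical:

```
#include <stdint.h>
#include <rte_memory.h>

static int
min_internal_pgsz(const struct rte_memseg_list *msl, void *arg)
{
	uint64_t *min = arg;

	if (msl->external)
		return 0; /* skip externally allocated lists */
	if (*min == 0 || msl->page_sz < *min)
		*min = msl->page_sz;
	return 0; /* zero keeps walking */
}

uint64_t
min_internal_page_size(void)
{
	uint64_t min = 0;

	rte_memseg_list_walk(min_internal_pgsz, &min);
	return min;
}
```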

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
2018-10-11 10:24:29 +02:00
Anatoly Burakov
4104b2a485 mem: add length to memseg list
Previously, to calculate the length of the memory area covered by a
memseg list, we would have needed to multiply the page size by the
length of the fbarray backing that memseg list. This is not obvious
and unnecessarily low level, so store the length in the memseg list
itself.

This breaks the ABI, so bump the EAL ABI version and document the
change. Also, while we're breaking the ABI, pack the members a
little better.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Shreyansh Jain <shreyansh.jain@nxp.com>
2018-10-11 10:24:16 +02:00
Anatoly Burakov
3a44687139 mem: allow querying offset into segment fd
In a few cases, the user may need to query the offset into the fd
for a particular memory segment (for example, to selectively map
pages). This commit adds a new API to do that.
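
A sketch combining the fd and offset APIs inside a memseg walk callback; error handling is trimmed and the helper is hypothetical:

```
#include <sys/mman.h>
#include <unistd.h>
#include <rte_memory.h>

static int
remap_one(const struct rte_memseg_list *msl,
	  const struct rte_memseg *ms, void *arg)
{
	size_t offset;
	int fd = rte_memseg_get_fd(ms);

	(void)msl; (void)arg;
	if (fd < 0)
		return 0; /* no fd for this segment (e.g. external memory) */
	if (rte_memseg_get_fd_offset(ms, &offset) == 0) {
		void *copy = mmap(NULL, ms->len, PROT_READ, MAP_SHARED,
				  fd, (off_t)offset);
		if (copy != MAP_FAILED)
			munmap(copy, ms->len);
	}
	close(fd); /* the fd is a duplicate (see the commit below) */
	return 0;
}
/* usage: rte_memseg_walk(remap_one, NULL); */
```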

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
2018-09-19 15:01:58 +02:00
Anatoly Burakov
41dbdb6872 mem: add external API to retrieve page fd
Now that we can retrieve page fds internally, we can expose this
as an external API. This adds two flavors of the API: thread-safe
and non-thread-safe. Fix up internal APIs to return the values we
need without modifying rte_errno internally when called from
within EAL.

We do not want calling code to accidentally close an internal fd, so
we make a duplicate of it before we return it to the user. The caller
is therefore responsible for closing this fd.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
2018-09-19 14:48:04 +02:00
Anatoly Burakov
1009ba1704 mem: add internal API to get and set segment fd
Enable setting and retrieving segment fds internally.

For now, retrieving fds will not be used anywhere until we
get an external API, but it will be useful for things like
virtio, where we wish to share segment fds.

Setting segment fds will not be available as a public API
at this time, but internally it is needed for legacy mode,
because we're not allocating our hugepages in memalloc in
the legacy mode case, and we still need to store the fd.

Another user of the get segment fd API is the memseg info dump,
to show which pages use which fds.

Not supported on FreeBSD.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
2018-09-19 14:46:34 +02:00
Anatoly Burakov
d5dd22c9f6 mem: fix alignment of requested virtual areas
The original code did not align any addresses that were requested as
page-aligned, but were different because addr_is_hint was set.

The fix by Dariusz below introduced an issue where all unaligned
addresses were left unaligned.

This patch is a partial revert of
commit 7fa7216ed4 ("mem: fix alignment of requested virtual areas")

and implements a proper fix for this issue, by asking for alignment in all
but the following two cases:

1) page size is equal to system page size, or
2) we got an aligned requested address, and will not accept a different one

This ensures that alignment is performed in all cases, except for
those where we can guarantee that the address will not need alignment.

Fixes: b7cc54187e ("mem: move virtual area function in common directory")
Fixes: 7fa7216ed4 ("mem: fix alignment of requested virtual areas")
Cc: stable@dpdk.org

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Lei Yao <lei.a.yao@intel.com>
Acked-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
2018-07-18 23:22:33 +02:00
Anatoly Burakov
e26415428f mem: provide thread-unsafe memseg list walk variant
Sometimes, user code needs to walk the memseg list while inside
a memory-related callback. Rather than making everyone copy around
the same iteration code and depend on DPDK internals, provide an
official way to do memseg_list_walk() inside callbacks.

Also, remove existing reimplementation from memalloc code and use
the new API instead.
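
A sketch of the intended usage pattern; the callback bodies are placeholders:

```
#include <rte_memory.h>

static int
dump_list(const struct rte_memseg_list *msl, void *arg)
{
	(void)msl; (void)arg;
	return 0; /* inspect the list here */
}

static void
mem_event_cb(enum rte_mem_event event, const void *addr,
	     size_t len, void *arg)
{
	(void)event; (void)addr; (void)len; (void)arg;
	/* the locking variant would deadlock here: use thread-unsafe */
	rte_memseg_list_walk_thread_unsafe(dump_list, NULL);
}
```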

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 11:21:25 +02:00
Anatoly Burakov
7c790af08f mem: provide thread-unsafe memseg walk variant
Sometimes, user code needs to walk the memseg list while inside
a memory-related callback. Rather than making everyone copy around
the same iteration code and depend on DPDK internals, provide an
official way to do memseg_walk() inside callbacks.

Also, remove existing reimplementation from sPAPR VFIO code and use
the new API instead.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 11:21:15 +02:00
Anatoly Burakov
b917147601 mem: provide thread-unsafe contig walk variant
Sometimes, user code needs to walk the memseg list while inside
a memory-related callback. Rather than making everyone copy around
the same iteration code and depend on DPDK internals, provide an
official way to do memseg_contig_walk() inside callbacks.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 11:20:06 +02:00
Anatoly Burakov
1d406458db mem: make segment preallocation OS-specific
In a perfect world, it wouldn't matter how much memory was
preallocated, because most of it was always going to be private
anonymous zero-page mappings for the duration of the program.
In practice, however, due to peculiarities of FreeBSD, we need
to additionally limit memory allocation there. This patch moves
the segment preallocation to EAL private functions that will be
implemented by an OS-specific EAL rather than being in the common
memory-related code.

Since there is no support for growing/shrinking memory use at
runtime on FreeBSD anyway, this does not inhibit any functionality
but makes core dumps faster even on default settings.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 00:59:18 +02:00
Dariusz Stojaczyk
6c0fb7547b mem: do not use --base-virtaddr in secondary processes
Since a secondary process's address space is largely dictated
by the primary process's mappings, it doesn't make much
sense to use base-virtaddr for secondary processes.

This patch is intended to fix PCI resource mapping
in secondary processes using the same base-virtaddr
as their primary processes. PCI uses the end of the hugepage
memory area to map all resources. [pci_find_max_end_va()]
It works for primary processes, but can't be mapped 1:1
by secondary ones, as the same addresses are currently always
occupied by shadow memseg lists, which were created with
eal_get_virtual_area(NULL, ...).

```
PRIMARY PROCESS
0x6e00e00000    388K rw-s- fbarray_memseg-2048k-1-3
0x6e01000000 16777216K r----   [ anon ]
0x7201000000     16K rw-s- resource0

SECONDARY PROCESS
0x6e00e00000    388K rw-s- fbarray_memseg-2048k-1-3
0x6e01000000 16777216K r----   [ anon ]
0x7201000000      4K rw-s- fbarray_memseg-1048576k-0-0_203213
```

Fixes: 524e43c2ad ("mem: prepare memseg lists for multiprocess sync")
Cc: stable@dpdk.org

Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 00:25:13 +02:00
Dariusz Stojaczyk
9dac150f98 mem: fix alignment requested with --base-virtaddr
Whenever a calculated base-virtaddr offset had to be
manually aligned to the requested page_sz, we did not take
that alignment into account when incrementing the base-virtaddr
offset further. The next requested virtual area could print
a warning "hint [...] not respected!" and let the system
pick an address instead. As a result, this breaks secondary
process support on many system configurations.

Fixes: b7cc54187e ("mem: move virtual area function in common directory")
Cc: stable@dpdk.org

Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 00:25:10 +02:00
Dariusz Stojaczyk
7fa7216ed4 mem: fix alignment of requested virtual areas
Although the alignment mechanism works as intended, the
`no_align` bool flag was set incorrectly. We were aligning
buffers that didn't need extra alignment, and weren't
aligning ones that really needed it.

Fixes: b7cc54187e ("mem: move virtual area function in common directory")
Cc: stable@dpdk.org

Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 00:25:09 +02:00
Dariusz Stojaczyk
09037cf36c mem: avoid crash on memseg query with invalid address
When called with an address that is not managed by DPDK,
the lookup would segfault due to a missing check. The doc
says this function returns either a pointer or NULL, so
let it do so.

Fixes: 66cc45e293 ("mem: replace memseg with memseg lists")
Cc: stable@dpdk.org

Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-07-13 00:25:08 +02:00
Anatoly Burakov
d2bd796d7b mem: fix potential underflow on mem size calculation
If total memory is already bigger than max memory, an underflow
will occur on subtraction. Fix it by simply stopping whenever we
already have an amount of memory bigger than the maximum.

Fixes: 66cc45e293 ("mem: replace memseg with memseg lists")

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
2018-05-14 03:15:31 +02:00
Anatoly Burakov
68c3603867 mem: unmap unneeded space
When we ask to reserve virtual areas, we usually include
alignment in the mapping size, and that memory ends up
being wasted. Wasting a gigabyte of VA space while trying to
reserve one gigabyte is pretty expensive on 32-bit, so after
we're done mapping, unmap unneeded space.
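
A hypothetical sketch of the reserve-then-trim technique described here (not DPDK's actual eal_get_virtual_area code); align is assumed to be a nonzero power of two:

```
#include <stdint.h>
#include <sys/mman.h>

static void *
reserve_aligned(size_t size, size_t align)
{
	uintptr_t base, start;
	void *va = mmap(NULL, size + align, PROT_NONE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (va == MAP_FAILED)
		return NULL;
	base = (uintptr_t)va;
	start = (base + align - 1) & ~((uintptr_t)align - 1);

	/* return the unneeded head and tail to the OS */
	if (start > base)
		munmap(va, start - base);
	if (base + align > start)
		munmap((void *)(start + size), base + align - start);
	return (void *)start;
}
```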

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
2018-05-14 01:32:38 +02:00
Anatoly Burakov
91fe57ac00 mem: check if allocation size is too big
The mapping size is a 64-bit integer, but mmap() accepts a size_t
for the mapping size. A user could request a mapping with an
alignment such that (size + alignment) would overflow size_t, so
check whether (size + alignment) overflows size_t.
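
A minimal sketch of the overflow guard, assuming the incoming values are 64-bit:

```
#include <stdint.h>

static int
map_size_ok(uint64_t size, uint64_t align)
{
	/* both must fit in size_t, and their sum must not wrap */
	return size <= SIZE_MAX && align <= SIZE_MAX - (size_t)size;
}
```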

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
2018-05-14 01:32:21 +02:00
Anatoly Burakov
0256386dc4 mem: add argument to memory event callback
It may be useful to pass arbitrary data to the callback (such
as device pointers), so add this to the mem event callback API.
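
A sketch of the extended registration, with a hypothetical device structure as the user argument:

```
#include <rte_memory.h>

struct my_dev { int id; };

static void
dev_mem_event(enum rte_mem_event event, const void *addr,
	      size_t len, void *arg)
{
	struct my_dev *dev = arg; /* arbitrary data from registration */

	(void)addr; (void)len; (void)dev;
	if (event == RTE_MEM_EVENT_ALLOC) {
		/* e.g. create DMA mappings for dev->id */
	} else { /* RTE_MEM_EVENT_FREE */
		/* e.g. tear down DMA mappings */
	}
}

int
attach_dev(struct my_dev *dev)
{
	return rte_mem_event_callback_register("my-dev", dev_mem_event, dev);
}
```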

Suggested-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>
2018-05-08 22:28:58 +02:00
Anatoly Burakov
046aa5c447 mem: add memalloc init stage
Currently, memseg lists for secondary processes are allocated on
sync (triggered by init), when they are accessed for the first
time. Move this initialization to a separate init stage for
memalloc.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
2018-04-27 23:52:51 +02:00
Anatoly Burakov
1be7644986 mem: improve autodetection of hugepage counts on 32-bit
For non-legacy mode, we are preallocating space for hugepages, so
we know in advance which pages we will be able to allocate, and
which we won't. However, the init procedure was using hugepage
counts gathered from sysfs and paid no attention to hugepage
sizes that were actually available for reservation, and failed
on attempts to reserve unavailable pages.

Fix this by limiting total page counts by number of pages
actually preallocated.

Also, the VA preallocation procedure only looks at mountpoints that
are available, and expects pages to exist if a mountpoint exists.
That might not necessarily be the case, so also check whether there
are hugepages available for a particular page size on a particular
NUMA node.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Jananee Parthasarathy <jananeex.m.parthasarathy@intel.com>
2018-04-27 23:52:51 +02:00
Anatoly Burakov
e82ca1a75e mem: improve preallocation on 32-bit
Previously, if we couldn't preallocate VA space on 32-bit for
one page size, we simply bailed out, even though we could've
tried allocating VA space with other page sizes.

For example, if the user had both 1G and 2M pages enabled and
had asked DPDK to allocate memory on both sockets, DPDK
would have tried to allocate VA space for 1x1G page on both
sockets, failed and never tried again, even though it
could have allocated the same 1G of VA space for 512x2M pages.

Fix this by retrying with different page sizes if VA space
reservation failed.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Jananee Parthasarathy <jananeex.m.parthasarathy@intel.com>
2018-04-27 23:52:51 +02:00
Anatoly Burakov
a99e8df63f mem: fix 32-bit memory upper limit for non-legacy mode
32-bit mode has an upper limit on amount of VA space it can preallocate,
but the original implementation used the wrong constant, resulting in
failure to initialize due to integer overflow. Fix it by using the
correct constant.

Fixes: 66cc45e293 ("mem: replace memseg with memseg lists")

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Jananee Parthasarathy <jananeex.m.parthasarathy@intel.com>
2018-04-27 23:52:51 +02:00
Anatoly Burakov
8f91f368a1 mem: log page address before unmapping
If the user has specified a flag to unmap the area right after
mapping it, we were passing an already-unmapped pointer to RTE_LOG.
This is not an issue, since RTE_LOG doesn't actually dereference the
pointer, but fix it anyway by moving the RTE_LOG call before the unmap.

Coverity issue: 272584
Fixes: b7cc54187e ("mem: move virtual area function in common directory")

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
2018-04-27 23:42:40 +02:00
Anatoly Burakov
2e378ff297 mem: add validator callback
This API enables applications to register for notifications
on page allocations that are about to happen, giving the application
a chance to allow or deny the allocation when the resulting total
memory utilization would exceed a specified limit on a specified
socket.
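
A sketch of a validator under the assumption that returning non-zero denies the allocation; the cap and names are hypothetical:

```
#include <rte_memory.h>

#define MEM_LIMIT (1ULL << 30) /* hypothetical 1G cap */

static int
deny_above_limit(int socket_id, size_t cur_limit, size_t new_len)
{
	(void)cur_limit;
	if (socket_id == 0 && new_len > MEM_LIMIT)
		return -1; /* deny: would exceed our cap */
	return 0; /* allow */
}

int
install_validator(void)
{
	return rte_mem_alloc_validator_register("cap-socket0",
						deny_above_limit,
						0, MEM_LIMIT);
}
```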

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
2018-04-11 21:45:56 +02:00
Anatoly Burakov
56efb4c117 malloc: support callbacks on memory events
Each process will have its own callbacks. Callbacks indicate
whether it is an allocation or a deallocation that has happened,
and also provide the start VA address and length of the affected
block.

Since memory hotplug isn't supported on FreeBSD or in legacy mem
mode, it will not be possible to register callbacks in either case.

Callbacks are called whenever something happens to the memory map of
the current process, so at those times the memory hotplug subsystem
is write-locked, which leads to deadlocks on any attempt to use
memory functions that take the same lock from within a callback.
Document the limitation.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
2018-04-11 21:45:55 +02:00
Anatoly Burakov
07dcbfe010 malloc: support multiprocess memory hotplug
This enables multiprocess synchronization for memory hotplug
requests at runtime (as opposed to initialization).

Basic workflow is the following. Primary process always does initial
mapping and unmapping, and secondary processes always follow primary
page map. Only one allocation request can be active at any one time.

When the primary allocates memory, it ensures that all other
processes have allocated the same set of hugepages successfully;
otherwise, any allocations made are rolled back and the heap is
freed. The heap is locked throughout the process, and there is
also a global memory hotplug lock, so no race conditions can
happen.

When the primary frees memory, it frees the heap, deallocates
affected pages, and notifies other processes of the deallocations.
Since the heap is freed from that memory chunk, the area basically
becomes invisible to other processes even if they happen to fail
to unmap that specific set of pages, so it's completely safe to
ignore the results of sync requests.

When a secondary allocates memory, it does not do so by itself.
Instead, it sends a request to the primary process to try to
allocate pages of the specified size and on the specified socket,
such that a specified heap allocation request could complete. The
primary process then sends all secondaries (including the
requestor) a separate notification of allocated pages, and expects
all secondary processes to report success before considering the
pages "allocated".

Only after the primary process ensures that all memory has been
successfully allocated in all secondary processes does it respond
positively to the initial request and let the secondary proceed
with the allocation. Since the heap now has memory that can satisfy
the allocation request, and it was locked all this time (so no
other allocations could take place), the secondary process will be
able to allocate memory from the heap.

When a secondary frees memory, it hides the pages to be deallocated
from the heap. Then, it sends a deallocation request to the primary
process, so that it deallocates the pages itself, and then sends a
separate sync request to all other processes (including the
requestor) to unmap the same pages. This way, even if the secondary
fails to notify other processes of this deallocation, that memory
will become invisible to other processes and will not be allocated
from again.

So, to summarize: address space will only become part of the heap
if the primary process can ensure that all other processes have
allocated this memory successfully. If anything goes wrong, the
worst that can happen is that a page will "leak" and be available
to neither DPDK nor the system, as some process will still hold
onto it. It's not an actual leak, as we can account for the page;
it's just that none of the processes will be able to use this page
for anything useful until the primary allocates from it again.

Because the underlying DPDK IPC implementation is single-threaded,
some asynchronous magic had to be done, as we need to complete
several requests before we can definitively allow a secondary
process to use allocated memory (namely, the memory has to be
present in all other secondary processes before it can be used).
Additionally, only one allocation request is allowed to be
submitted at once.

Memory allocation requests are only allowed when there are no
secondary processes currently initializing. To enforce that, a
shared rwlock is used: it is taken for reading on init (so that
several secondaries can initialize concurrently) and for writing
when making allocation requests (so that either secondary init has
to wait, or the allocation request has to wait until all processes
have initialized).

Any other function that wishes to iterate over memory or prevent
allocations should be using memory hotplug lock.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
2018-04-11 21:45:55 +02:00
Anatoly Burakov
6167d81488 mem: add secondary process init with memory hotplug
Secondary initialization will just sync memory map with
primary process.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
2018-04-11 21:45:55 +02:00
Anatoly Burakov
66cc45e293 mem: replace memseg with memseg lists
Before, we were aggregating multiple pages into one memseg, so the
number of memsegs was small. Now, each page gets its own memseg,
so the list of memsegs is huge. To accommodate the new memseg list
size and to keep the under-the-hood workings sane, the memseg list
is now not just a single list, but multiple lists. To be precise,
each hugepage size available on the system gets one or more memseg
lists, per socket.

In order to support dynamic memory allocation, we reserve all
memory in advance (unless we're in 32-bit legacy mode, in which
case we do not preallocate memory). As in, we do an anonymous
mmap() of the entire maximum size of memory per hugepage size, per
socket (which is limited to either RTE_MAX_MEMSEG_PER_TYPE pages or
RTE_MAX_MEM_MB_PER_TYPE megabytes worth of memory, whichever is the
smaller one), split over multiple lists (which are limited to
either RTE_MAX_MEMSEG_PER_LIST memsegs or RTE_MAX_MEM_MB_PER_LIST
megabytes per list, whichever is the smaller one). There is also
a global limit of CONFIG_RTE_MAX_MEM_MB megabytes, which is mainly
used for 32-bit targets to limit amounts of preallocated memory,
but can be used to place an upper limit on total amount of VA
memory that can be allocated by DPDK application.

So, for each hugepage size, we get (by default) up to 128G worth
of memory, per socket, split into chunks of up to 32G in size.
The address space is claimed at the start, in eal_common_memory.c.
The actual page allocation code is in eal_memalloc.c (Linux-only),
and largely consists of copied EAL memory init code.

Pages in the list are also indexed by address. That is, in order
to figure out where the page belongs, one can simply look at base
address for a memseg list. Similarly, figuring out IOVA address
of a memzone is a matter of finding the right memseg list, getting
offset and dividing by page size to get the appropriate memseg.

This commit also removes rte_eal_dump_physmem_layout() call,
according to deprecation notice [1], and removes that deprecation
notice as well.

On 32-bit targets, due to limited VA space, DPDK will no longer
spread memory across different sockets like before. Instead, it
will (by default) allocate all of the memory on the socket where
the master lcore is. To override this behavior, --socket-mem must
be used.

The rest of the changes are really ripple effects from the memseg
change - heap changes, compile fixes, and rewrites to support
fbarray-backed memseg lists. Due to earlier switch to _walk()
functions, most of the changes are simple fixes, however some
of the _walk() calls were switched to memseg list walk, where
it made sense to do so.

Additionally, we are switching locks from flock() to fcntl().
Down the line, we will be introducing single-file segments option,
and we cannot use flock() locks to lock parts of the file. Therefore,
we will use fcntl() locks for legacy mem as well, in case someone is
unfortunate enough to accidentally start legacy mem primary process
alongside an already working non-legacy mem-based primary process.

[1] http://dpdk.org/dev/patchwork/patch/34002/

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
2018-04-11 19:55:39 +02:00
Anatoly Burakov
f901e64d21 mem: add virt2memseg function
This can be used as a virt2iova function that only looks up
memory owned by DPDK (as opposed to doing pagemap walks). Using
this will result in less dependency on the internals of the mem API.
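
A sketch of a virt2iova helper built on the new lookup; the helper name is hypothetical, and a NULL list argument is assumed to mean "search all memseg lists":

```
#include <stdint.h>
#include <rte_memory.h>

rte_iova_t
my_virt2iova(const void *addr)
{
	const struct rte_memseg *ms =
		rte_mem_virt2memseg(addr, NULL);

	if (ms == NULL)
		return RTE_BAD_IOVA; /* not DPDK-owned memory */
	return ms->iova + ((uintptr_t)addr - (uintptr_t)ms->addr);
}
```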

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
2018-04-11 19:54:44 +02:00
Anatoly Burakov
eca28edd98 mem: add iova2virt function
This is a reverse lookup of PA to VA. Using this will make
other code less dependent on the internals of the mem API.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
2018-04-11 19:54:00 +02:00
Anatoly Burakov
552afc420a mem: add contig walk function
This function is meant to walk over the first segment of each
VA-contiguous group of memsegs.

For future users of this function, this is done so that there is
less dependency on the internals of the mem API and less noise in
later change sets.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
2018-04-11 19:53:38 +02:00
Anatoly Burakov
221b67bca0 eal: use memseg walk instead of iteration
Reduce dependency on internal details of EAL memory subsystem, and
simplify code.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
2018-04-11 19:48:15 +02:00
Anatoly Burakov
2b9f98d8a5 mem: add function to walk all memsegs
For code that might need to iterate over the list of allocated
segments, using this API will make it more resilient to internal
API changes and will prevent copying the same iteration code over
and over again.

Additionally, down the line locking will be implemented, so users
of this API will not need to care about locking either.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
2018-04-11 19:47:25 +02:00
Anatoly Burakov
b7cc54187e mem: move virtual area function in common directory
Move get_virtual_area out of linuxapp EAL memory and make it
common to EAL, so that other code can reserve virtual areas
as well.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Tested-by: Hemant Agrawal <hemant.agrawal@nxp.com>
Tested-by: Gowrishankar Muthukrishnan <gowrishankar.m@linux.vnet.ibm.com>
2018-04-11 19:33:06 +02:00