.\" Copyright (c) 2011-2013 Matteo Landi, Luigi Rizzo, Universita` di Pisa
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" This document is derived in part from the enet man page (enet.4)
.\" distributed with 4.3BSD Unix.
.\"
.\" $FreeBSD$
.\"
.Dd October 18, 2013
.Dt NETMAP 4
.Os
.Sh NAME
.Nm netmap
.Nd a framework for fast packet I/O
.Sh SYNOPSIS
.Cd device netmap
.Sh DESCRIPTION
.Nm
is a framework for extremely fast and efficient packet I/O
(reaching 14.88 Mpps with a single core at less than 1 GHz)
for both userspace and kernel clients.
Userspace clients can use the netmap API
to send and receive raw packets through physical interfaces
or ports of the
.Xr vale 4
switch.
.Pp
.Nm VALE
is a very fast (reaching 20 Mpps per port)
and modular software switch,
implemented within the kernel, which can interconnect
virtual ports, physical devices, and the native host stack.
.Pp
.Nm
uses a memory mapped region to share packet buffers,
descriptors and queues with the kernel.
Simple
.Pa ioctl()s
are used to bind interfaces/ports to file descriptors and
implement non-blocking I/O, whereas blocking I/O uses
.Pa select()/poll() .
.Nm
can exploit the parallelism in multiqueue devices and
multicore systems.
.Pp
For the best performance,
.Nm
requires explicit support in device drivers;
a generic emulation layer is available to implement the
.Nm
API on top of unmodified device drivers,
at the price of reduced performance
(but still better than what can be achieved with
sockets or BPF/pcap).
.Pp
For a list of devices with native
.Nm
support, see the end of this manual page.
.Sh OPERATION - THE NETMAP API
.Nm
clients must first
.Pa open("/dev/netmap") ,
and then issue an
.Pa ioctl(fd, NIOCREGIF, (struct nmreq *)arg)
to bind the file descriptor to a specific interface or port.
.Nm
has multiple modes of operation controlled by the
content of the
.Pa struct nmreq
passed to the
.Pa ioctl() .
In particular, the
.Em nr_name
field specifies whether the client operates on a physical network
interface or on a port of a
.Nm VALE
switch, as indicated below.
Additional fields in the
.Pa struct nmreq
control the details of operation.
.Pp
.Bl -tag -width XXXX
.It Dv Interface name (e.g. 'em0', 'eth1', ... )
The data path of the interface is disconnected from the host stack.
Depending on additional arguments,
the file descriptor is bound to the NIC (one or all queues),
or to the host stack.
.It Dv valeXXX:YYY (arbitrary XXX and YYY)
The file descriptor is bound to port YYY of a VALE switch called XXX,
where XXX and YYY are arbitrary alphanumeric strings.
The string cannot exceed IFNAMSIZ characters, and YYY cannot
match the name of any existing interface.
.Pp
The switch and the port are created if they do not exist.
.It Dv valeXXX:ifname (ifname is an existing interface)
Flags in the argument control whether the physical interface
(and optionally the corresponding host stack endpoint)
is connected to or disconnected from the VALE switch named XXX.
.Pp
In this case the
.Pa ioctl()
is used only for configuring the VALE switch, typically through the
.Nm vale-ctl
command.
The file descriptor cannot be used for I/O, and should be
.Pa close()d
after issuing the
.Pa ioctl() .
.El
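.Pp
As a minimal sketch of the naming rule for VALE ports, the helper below
builds a valeXXX:YYY string and rejects names that do not fit.
The function vale_port_name() is hypothetical and not part of the
.Nm
API; the sketch also assumes that the name, including its terminating
NUL, must fit in IFNAMSIZ bytes, and it falls back to an assumed value
of 16 when IFNAMSIZ is not already defined.
.Bd -literal
#include <stdio.h>

#ifndef IFNAMSIZ
#define IFNAMSIZ 16     /* assumed; normally provided by <net/if.h> */
#endif

/*
 * Hypothetical helper: write "vale<sw>:<port>" into dst.
 * Returns 0 on success, -1 if the result (with its NUL) does
 * not fit in IFNAMSIZ bytes.
 */
static int
vale_port_name(char dst[IFNAMSIZ], const char *sw, const char *port)
{
        int n = snprintf(dst, IFNAMSIZ, "vale%s:%s", sw, port);

        return (n < 0 || n >= IFNAMSIZ) ? -1 : 0;
}
.Ed
.Pp
On success the resulting string could be copied into nr_name before
issuing NIOCREGIF.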
.Pp
The binding can be removed (and the interface returns to
regular operation, or the virtual port is destroyed) with a
.Pa close()
on the file descriptor.
.Pp
The processes owning the file descriptor can then
.Pa mmap()
the memory region that contains pre-allocated
buffers, descriptors and queues, and use them to
read/write raw packets.
Non-blocking I/O is done with special
.Pa ioctl()s ,
whereas the file descriptor can be passed to
.Pa select()/poll()
to be notified about incoming packets or available transmit buffers.
.Ss DATA STRUCTURES
The data structures in the mmapped memory are described below
(see
.Pa sys/net/netmap.h
for reference).
All physical devices operating in
.Nm
mode use the same memory region,
shared by the kernel and all processes that own
.Pa /dev/netmap
descriptors bound to those devices
(NOTE: visibility may be restricted in future implementations).
Virtual ports instead use separate memory regions,
shared only with the kernel.
.Pp
All references between the shared data structures
are relative (offsets or indexes).
Some macros help convert
them into actual pointers.
.Bl -tag -width XXX
.It Dv struct netmap_if (one per interface)
indicates the number of rings supported by an interface, their
sizes, and the offsets of the
.Pa netmap_rings
associated to the interface.
.Pp
The offset of
.Pa struct netmap_if
in the shared memory region is indicated by the
.Pa nr_offset
field in the structure returned by
.Pa NIOCREGIF
(see below).
.Bd -literal
struct netmap_if {
    char          ni_name[IFNAMSIZ]; /* name of the interface. */
    const u_int   ni_version;        /* API version */
    const u_int   ni_rx_rings;       /* number of rx ring pairs */
    const u_int   ni_tx_rings;       /* if 0, same as ni_rx_rings */
    const ssize_t ring_ofs[];        /* offset of tx and rx rings */
};
.Ed
.It Dv struct netmap_ring (one per ring)
Contains the positions in the transmit and receive rings to
synchronize the kernel and the application,
and an array of
.Pa slots
describing the buffers.
The 'reserved' field is used in receive rings to tell the kernel
how many slots before 'cur' are still in use by the application.
.Pp
Each physical interface has one
.Pa netmap_ring
for each hardware transmit and receive ring,
plus one extra transmit and one receive structure
that connect to the host stack.
.Bd -literal
struct netmap_ring {
    const ssize_t  buf_ofs;     /* see details */
    const uint32_t num_slots;   /* number of slots in the ring */
    uint32_t       avail;       /* number of usable slots */
    uint32_t       cur;         /* 'current' read/write index */
    uint32_t       reserved;    /* not refilled before current */

    const uint16_t nr_buf_size;
    uint16_t       flags;
#define NR_TIMESTAMP 0x0002     /* set timestamp on *sync() */
#define NR_FORWARD   0x0004     /* enable NS_FORWARD for ring */
#define NR_RX_TSTMP  0x0008     /* set rx timestamp in slots */
    struct timeval ts;
    struct netmap_slot slot[0]; /* array of slots */
};
.Ed
.Pp
In transmit rings, after a system call 'cur' indicates
the first slot that can be used for transmissions,
and 'avail' reports how many of them are available.
Before the next netmap-related system call on the file
descriptor, the application should fill buffers and
slots with data, and update 'cur' and 'avail'
accordingly, as shown in the figure below:
.Bd -literal
             cur
              |----- avail ---|   (after syscall)
              v
    TX  [*****aaaaaaaaaaaaaaaaa**]
    TX  [*****TTTTTaaaaaaaaaaaa**]
                   ^
                   |-- avail --|   (before syscall)
                  cur
.Ed
.Pp
In receive rings, after a system call 'cur' indicates
the first slot that contains a valid packet,
and 'avail' reports how many of them are available.
Before the next netmap-related system call on the file
descriptor, the application can process buffers and
release them to the kernel by updating 'cur' and 'avail'
accordingly, as shown in the figure below.
Receive rings have an additional field called 'reserved'
to indicate how many buffers before 'cur' are still
under processing and cannot be released.
.Bd -literal
                cur
           |-res-|-- avail --|   (after syscall)
                 v
    RX  [**rrrrrrRRRRRRRRRRRR******]
    RX  [**...........rrrrRRR******]
                     |res|--|<avail (before syscall)
                          ^
                         cur
.Ed
.It Dv struct netmap_slot (one per packet)
contains the metadata for a packet:
.Bd -literal
struct netmap_slot {
    uint32_t buf_idx;   /* buffer index */
    uint16_t len;       /* packet length */
    uint16_t flags;     /* buf changed, etc. */
#define NS_BUF_CHANGED  0x0001  /* must resync, buffer changed */
#define NS_REPORT       0x0002  /* tell hw to report results
                                 * e.g. by generating an interrupt
                                 */
#define NS_FORWARD      0x0004  /* pass packet to the other endpoint
                                 * (host stack or device)
                                 */
#define NS_NO_LEARN     0x0008
#define NS_INDIRECT     0x0010
#define NS_MOREFRAG     0x0020
#define NS_PORT_SHIFT   8
#define NS_PORT_MASK    (0xff << NS_PORT_SHIFT)
#define NS_RFRAGS(_slot) ( ((_slot)->flags >> 8) & 0xff)
    uint64_t ptr;       /* buffer address (indirect buffers) */
};
.Ed
The flags control how the buffer associated to the slot
should be managed.
.It Dv packet buffers
are normally fixed size (2 Kbyte) buffers allocated by the kernel
that contain packet data.
Buffer addresses are computed through
macros.
.El
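.Pp
The bookkeeping of 'cur', 'avail' and 'reserved' described for
.Pa struct netmap_ring
can be sketched in isolation as follows.
This is an illustration under stated assumptions, not the
.Nm
implementation: struct demo_ring and the demo_* functions are
hypothetical stand-ins that model only the index fields,
with no buffers and no system calls.
.Bd -literal
#include <stdint.h>

/* Simplified stand-in for the index fields of struct netmap_ring. */
struct demo_ring {
        uint32_t num_slots;     /* number of slots in the ring */
        uint32_t avail;         /* number of usable slots */
        uint32_t cur;           /* current read/write index */
        uint32_t reserved;      /* rx only: slots before cur still held */
};

/* Index following i, wrapping at the end of the ring. */
static uint32_t
demo_ring_next(const struct demo_ring *r, uint32_t i)
{
        return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/* Transmit: use up to n available slots; returns how many were used. */
static uint32_t
demo_tx_queue(struct demo_ring *r, uint32_t n)
{
        uint32_t done = 0;

        while (done < n && r->avail > 0) {
                /* a real client fills slot[r->cur] and its buffer here */
                r->cur = demo_ring_next(r, r->cur);
                r->avail--;
                done++;
        }
        return done;
}

/* Receive: move past up to n packets while still holding their buffers. */
static uint32_t
demo_rx_take(struct demo_ring *r, uint32_t n)
{
        uint32_t done = 0;

        while (done < n && r->avail > 0) {
                r->cur = demo_ring_next(r, r->cur);
                r->avail--;
                r->reserved++;
                done++;
        }
        return done;
}

/* Receive: give back the k oldest held buffers to the kernel. */
static void
demo_rx_release(struct demo_ring *r, uint32_t k)
{
        r->reserved -= (k < r->reserved) ? k : r->reserved;
}
.Ed
.Pp
In a real client the updated fields take effect at the next
netmap-related system call on the file descriptor.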
.Pp
Some macros support the access to objects in the shared memory
region.
In particular:
.Bl -tag -width XXX
.It NETMAP_TXRING(nifp, i)
.It NETMAP_RXRING(nifp, i)
return the address of the i-th transmit and receive ring,
respectively.
.It NETMAP_BUF(ring, buf_idx)
returns the address of the buffer with index buf_idx
(which can be part of any ring for the given interface).
.El
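.Pp
To illustrate how relative offsets can be turned into pointers,
the following sketch computes a buffer address from a ring address.
It assumes that buf_ofs is a byte offset measured from the start of
the ring and that buffers are laid out contiguously, nr_buf_size bytes
apart; struct demo_bufring, demo_at() and demo_buf() are hypothetical
names, and the real macros in
.Pa net/netmap_user.h
should be used in actual clients.
.Bd -literal
#include <stddef.h>
#include <stdint.h>

/* Only the two ring fields needed for a buffer lookup. */
struct demo_bufring {
        ptrdiff_t buf_ofs;      /* assumed: offset of buffer 0 from ring */
        uint16_t  nr_buf_size;  /* size of each buffer */
};

/* Byte offset relative to base -> pointer. */
static void *
demo_at(void *base, ptrdiff_t ofs)
{
        return (char *)base + ofs;
}

/* Address of buffer idx, in the spirit of NETMAP_BUF(ring, buf_idx). */
static char *
demo_buf(struct demo_bufring *ring, uint32_t idx)
{
        return (char *)demo_at(ring, ring->buf_ofs) +
            (size_t)idx * ring->nr_buf_size;
}
.Ed
.Pp
Because only offsets are stored, the same layout stays valid at
whatever address each process maps the region.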
.Pp
Normally, buffers are associated to slots when interfaces are bound,
and one packet is fully contained in a single buffer.
Clients can however modify the mapping using the
following flags:
.Ss FLAGS
.Bl -tag -width XXX
.It NS_BUF_CHANGED
indicates that the buf_idx in the slot has changed.
This can be useful if the client wants to implement
some form of zero-copy forwarding (e.g. by passing buffers
from an input interface to an output interface), or
needs to process packets out of order.
.Pp
The flag MUST be used whenever the buffer index is changed.
.It NS_REPORT
indicates that we want to be woken up when this buffer
has been transmitted.
This reduces performance but ensures
a prompt notification when a buffer has been sent.
Normally,
.Nm
notifies transmit completions in batches, hence signals
can be delayed indefinitely.
However, we need such notifications
before closing a descriptor.
.It NS_FORWARD
When the device is open in 'transparent' mode,
the client can mark slots in receive rings with this flag.
Packets in marked slots are forwarded to
the other endpoint at the next system call, thus restoring
(in a selective way) the connection between the NIC and the
host stack.
.It NS_NO_LEARN
tells the forwarding code that the SRC MAC address for this
packet should not be used in the learning bridge.
.It NS_INDIRECT
indicates that the packet's payload is not in the netmap
supplied buffer, but in a user-supplied buffer whose
user virtual address is in the 'ptr' field of the slot.
The size can reach 65535 bytes.
.Em This is only supported on the transmit ring of virtual ports .
.It NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag clear.
The maximum length of a chain is 64 buffers.
.Em This is only supported on virtual ports .
.It NS_RFRAGS(slot)
on receive rings, returns the number of remaining buffers
in a packet, including this one.
Slots with a value greater than 1 also have NS_MOREFRAG set.
The length refers to the individual buffer; there is no
field for the total length.
.Pp
On transmit rings, if NS_DST is set, it is passed to the lookup
function, which can use it e.g. as the index of the destination
port instead of doing an address lookup.
.El
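.Pp
Two of the flags above can be illustrated with a short sketch.
The first function shows one possible zero-copy forwarding step:
the buffer indexes of a receive slot and a transmit slot are exchanged
and both slots are marked with NS_BUF_CHANGED.
The second adds up the per-buffer lengths of a packet split across
slots with NS_MOREFRAG, since there is no field for the total length.
struct demo_slot and the demo_* functions are hypothetical; only the
flag values are taken from the
.Pa struct netmap_slot
listing.
.Bd -literal
#include <stdint.h>

/* Flag values copied from the struct netmap_slot listing. */
#define NS_BUF_CHANGED  0x0001
#define NS_MOREFRAG     0x0020

/* Simplified stand-in for struct netmap_slot. */
struct demo_slot {
        uint32_t buf_idx;
        uint16_t len;
        uint16_t flags;
};

/* Exchange buffers between an rx and a tx slot and flag the change. */
static void
demo_swap_buffers(struct demo_slot *rx, struct demo_slot *tx)
{
        uint32_t tmp = tx->buf_idx;

        tx->buf_idx = rx->buf_idx;
        rx->buf_idx = tmp;
        tx->len = rx->len;
        tx->flags |= NS_BUF_CHANGED;
        rx->flags |= NS_BUF_CHANGED;
}

/*
 * Total length of the packet that starts at slot[first], following
 * NS_MOREFRAG until a slot has the flag clear (at most one lap).
 */
static uint32_t
demo_packet_len(const struct demo_slot *slot, uint32_t num_slots,
    uint32_t first)
{
        uint32_t i = first, total = 0, n;

        for (n = 0; n < num_slots; n++) {
                total += slot[i].len;
                if ((slot[i].flags & NS_MOREFRAG) == 0)
                        break;
                i = (i + 1 == num_slots) ? 0 : i + 1;
        }
        return total;
}
.Ed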
.Sh IOCTLS
.Nm
supports some
.Pa ioctl()s
to synchronize the state of the rings
between the kernel and the user processes, plus some
to query and configure the interface.
The former do not require any argument, whereas the latter
use a
.Pa struct nmreq
defined as follows:
.Bd -literal
struct nmreq {
    char     nr_name[IFNAMSIZ];
    uint32_t nr_version;    /* API version */
#define NETMAP_API 4        /* current version */
    uint32_t nr_offset;     /* nifp offset in the shared region */
    uint32_t nr_memsize;    /* size of the shared region */
    uint32_t nr_tx_slots;   /* slots in tx rings */
    uint32_t nr_rx_slots;   /* slots in rx rings */
    uint16_t nr_tx_rings;   /* number of tx rings */
    uint16_t nr_rx_rings;   /* number of rx rings */
    uint16_t nr_ringid;     /* ring(s) we care about */
#define NETMAP_HW_RING    0x4000  /* low bits indicate one hw ring */
#define NETMAP_SW_RING    0x2000  /* we process the sw ring */
#define NETMAP_NO_TX_POLL 0x1000  /* no gratuitous txsync on poll */
#define NETMAP_RING_MASK  0xfff   /* the actual ring number */
    uint16_t nr_cmd;
#define NETMAP_BDG_ATTACH     1   /* attach the NIC */
#define NETMAP_BDG_DETACH     2   /* detach the NIC */
#define NETMAP_BDG_LOOKUP_REG 3   /* register lookup function */
#define NETMAP_BDG_LIST       4   /* get bridge's info */
    uint16_t nr_arg1;
    uint16_t nr_arg2;
    uint32_t spare2[3];
};
.Ed
A device descriptor obtained through
.Pa /dev/netmap
also supports the ioctls supported by network devices.
.Pp
The netmap-specific
.Xr ioctl 2
command codes below are defined in
.In net/netmap.h
and are:
.Bl -tag -width XXXX
.It Dv NIOCGINFO
returns EINVAL if the named device does not support netmap.
Otherwise, it returns 0 and (advisory) information
about the interface.
Note that all the information below can change before the
interface is actually put in netmap mode.
.Pp
.Pa nr_memsize
indicates the size of the netmap
memory region.
Physical devices all share the same memory region,
whereas VALE ports may have independent regions for each port.
These sizes can be set through system-wide sysctl variables.
.Pa nr_tx_slots , nr_rx_slots
indicate the size of transmit and receive rings.
.Pa nr_tx_rings , nr_rx_rings
indicate the number of transmit
and receive rings.
Both the number of rings and their sizes may be configured at runtime
using interface-specific functions (e.g.
.Pa sysctl
or
.Pa ethtool ) .
.It Dv NIOCREGIF
puts the interface named in nr_name into netmap mode, disconnecting
it from the host stack, and/or defines which rings are controlled
through this file descriptor.
On return, it gives the same info as NIOCGINFO, and nr_ringid
indicates the identity of the rings controlled through the file
descriptor.
.Pp
Possible values for nr_ringid are
.Bl -tag -width XXXXX
.It 0
default, all hardware rings
.It NETMAP_SW_RING
the ``host rings'' connecting to the host stack
.It NETMAP_HW_RING + i
the i-th hardware ring
.El
By default, a
.Xr poll 2
or
.Xr select 2
call pushes out any pending packets on the transmit ring, even if
no write events are specified.
The feature can be disabled by or-ing
.Dv NETMAP_NO_TX_POLL
to nr_ringid.
Normally you should keep this feature unless you are using
separate file descriptors for the send and receive rings, because
otherwise packets are pushed out only if NIOCTXSYNC is called,
or the send queue is full.
.Pp
.Pa NIOCREGIF
can be used multiple times to change the association of a
file descriptor to a ring pair, always within the same device.
.Pp
When registering a virtual interface that is dynamically created on a
.Xr vale 4
switch, we can specify the desired number of rings (1 by default,
and currently up to 16) on it using the nr_tx_rings and nr_rx_rings
fields.
.It Dv NIOCTXSYNC
tells the hardware of new packets to transmit, and updates the
number of slots available for transmission.
.It Dv NIOCRXSYNC
tells the hardware of consumed packets, and asks for newly available
packets.
.El
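.Pp
The encoding of nr_ringid described under NIOCREGIF can be sketched as
plain bit manipulation.
The constants below are copied from the
.Pa struct nmreq
listing; the helpers demo_ringid_hw() and demo_ringid_index() are
hypothetical and only illustrate how the fields combine.
.Bd -literal
#include <stdint.h>

/* Values copied from the struct nmreq listing. */
#define NETMAP_HW_RING    0x4000
#define NETMAP_SW_RING    0x2000
#define NETMAP_NO_TX_POLL 0x1000
#define NETMAP_RING_MASK  0xfff

/* nr_ringid for hardware ring i, optionally without txsync on poll. */
static uint16_t
demo_ringid_hw(unsigned int i, int no_tx_poll)
{
        uint16_t id = (uint16_t)(NETMAP_HW_RING | (i & NETMAP_RING_MASK));

        if (no_tx_poll)
                id |= NETMAP_NO_TX_POLL;
        return id;
}

/* Ring number encoded in id, or -1 if id does not name one hw ring. */
static int
demo_ringid_index(uint16_t id)
{
        if ((id & NETMAP_HW_RING) == 0)
                return -1;
        return id & NETMAP_RING_MASK;
}
.Ed
.Pp
A value of 0 (all hardware rings) or NETMAP_SW_RING carries no ring
number, which is why the second helper reports -1 for them.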
.Sh SYSTEM CALLS
.Nm
uses
.Xr select 2
and
.Xr poll 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Pp
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
OS primitives, see
.Xr pthread 3 .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
.Sh EXAMPLES
The following code implements a traffic generator:
.Pp
.Bd -literal -compact
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <strings.h>
#include <net/if.h>
#include <net/netmap.h>
#include <net/netmap_user.h>

struct netmap_if *nifp;
struct netmap_ring *ring;
struct nmreq nmr;
struct pollfd fds;
char *p, *buf;
uint32_t i;
int fd;

fd = open("/dev/netmap", O_RDWR);
bzero(&nmr, sizeof(nmr));
strcpy(nmr.nr_name, "ix0");
nmr.nr_version = NETMAP_API;
ioctl(fd, NIOCREGIF, &nmr);
/* map the shared region read/write */
p = mmap(0, nmr.nr_memsize, PROT_READ | PROT_WRITE,
    MAP_SHARED, fd, 0);
nifp = NETMAP_IF(p, nmr.nr_offset);
ring = NETMAP_TXRING(nifp, 0);
fds.fd = fd;
fds.events = POLLOUT;
for (;;) {
        poll(&fds, 1, -1);
        for ( ; ring->avail > 0 ; ring->avail--) {
                i = ring->cur;
                buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
                ... prepare packet in buf ...
                ring->slot[i].len = ... packet length ...
                ring->cur = NETMAP_RING_NEXT(ring, i);
        }
}
.Ed
.Sh SUPPORTED INTERFACES
.Nm
provides native support for the following interfaces:
.Xr em 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr lem 4 ,
.Xr re 4 .
.Sh SEE ALSO
.Xr vale 4
.Pp
http://info.iet.unipi.it/~luigi/netmap/
.Pp
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
Communications of the ACM, 55 (3), pp.45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Sh AUTHORS
.An -nosplit
The
.Nm
framework was originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
and
.An Vincenzo Maffione .
.Pp
.Nm
and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).