558 lines
18 KiB
Groff

.\" Copyright (c) 2011-2013 Matteo Landi, Luigi Rizzo, Universita` di Pisa
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" This document is derived in part from the enet man page (enet.4)
.\" distributed with 4.3BSD Unix.
.\"
.\" $FreeBSD$
.\"
.Dd October 18, 2013
.Dt NETMAP 4
.Os
.Sh NAME
.Nm netmap
.Nd a framework for fast packet I/O
.Sh SYNOPSIS
.Cd device netmap
.Sh DESCRIPTION
.Nm
is a framework for extremely fast and efficient packet I/O
(reaching 14.88 Mpps with a single core at less than 1 GHz)
for both userspace and kernel clients.
Userspace clients can use the netmap API
to send and receive raw packets through physical interfaces
or ports of the
.Xr VALE 4
switch.
.Pp
.Nm VALE
is a very fast (reaching 20 Mpps per port)
and modular software switch,
implemented within the kernel, which can interconnect
virtual ports, physical devices, and the native host stack.
.Pp
.Nm
uses a memory mapped region to share packet buffers,
descriptors and queues with the kernel.
Simple
.Pa ioctl()s
are used to bind interfaces/ports to file descriptors and
implement non-blocking I/O, whereas blocking I/O uses
.Pa select()/poll() .
.Nm
can exploit the parallelism in multiqueue devices and
multicore systems.
.Pp
For the best performance,
.Nm
requires explicit support in device drivers;
a generic emulation layer is available to implement the
.Nm
API on top of unmodified device drivers,
at the price of reduced performance
(but still better than what can be achieved with
sockets or BPF/pcap).
.Pp
For a list of devices with native
.Nm
support, see the end of this manual page.
.Pp
.Sh OPERATION - THE NETMAP API
.Nm
clients must first
.Pa open("/dev/netmap") ,
and then issue an
.Pa ioctl(fd, NIOCREGIF, (struct nmreq *)arg)
to bind the file descriptor to a specific interface or port.
.Nm
has multiple modes of operation controlled by the
content of the
.Pa struct nmreq
passed to the
.Pa ioctl() .
In particular, the
.Em nr_name
field specifies whether the client operates on a physical network
interface or on a port of a
.Nm VALE
switch, as indicated below. Additional fields in the
.Pa struct nmreq
control the details of operation.
.Pp
.Bl -tag -width XXXX
.It Dv Interface name (e.g. 'em0', 'eth1', ... )
The data path of the interface is disconnected from the host stack.
Depending on additional arguments,
the file descriptor is bound to the NIC (one or all queues),
or to the host stack.
.It Dv valeXXX:YYY (arbitrary XXX and YYY)
The file descriptor is bound to port YYY of a VALE switch called XXX,
where XXX and YYY are arbitrary alphanumeric strings.
The string cannot exceed IFNAMSIZ characters, and YYY cannot
matching the name of any existing interface.
.Pp
The switch and the port are created if not existing.
.It Dv valeXXX:ifname (ifname is an existing interface)
Flags in the argument control whether the physical interface
(and optionally the corrisponding host stack endpoint)
are connected or disconnected from the VALE switch named XXX.
.Pp
In this case the
.Pa ioctl()
is used only for configuring the VALE switch, typically through the
.Nm vale-ctl
command.
The file descriptor cannot be used for I/O, and should be
.Pa close()d
after issuing the
.Pa ioctl().
.El
.Pp
The binding can be removed (and the interface returns to
regular operation, or the virtual port destroyed) with a
.Pa close()
on the file descriptor.
.Pp
The processes owning the file descriptor can then
.Pa mmap()
the memory region that contains pre-allocated
buffers, descriptors and queues, and use them to
read/write raw packets.
Non blocking I/O is done with special
.Pa ioctl()'s ,
whereas the file descriptor can be passed to
.Pa select()/poll()
to be notified about incoming packet or available transmit buffers.
.Ss DATA STRUCTURES
The data structures in the mmapped memory are described below
(see
.Xr sys/net/netmap.h
for reference).
All physical devices operating in
.Nm
mode use the same memory region,
shared by the kernel and all processes who own
.Pa /dev/netmap
descriptors bound to those devices
(NOTE: visibility may be restricted in future implementations).
Virtual ports instead use separate memory regions,
shared only with the kernel.
.Pp
All references between the shared data structure
are relative (offsets or indexes). Some macros help converting
them into actual pointers.
.Bl -tag -width XXX
.It Dv struct netmap_if (one per interface)
indicates the number of rings supported by an interface, their
sizes, and the offsets of the
.Pa netmap_rings
associated to the interface.
.Pp
.Pa struct netmap_if
is at offset
.Pa nr_offset
in the shared memory region is indicated by the
field in the structure returned by the
.Pa NIOCREGIF
(see below).
.Bd -literal
struct netmap_if {
char ni_name[IFNAMSIZ]; /* name of the interface. */
const u_int ni_version; /* API version */
const u_int ni_rx_rings; /* number of rx ring pairs */
const u_int ni_tx_rings; /* if 0, same as ni_rx_rings */
const ssize_t ring_ofs[]; /* offset of tx and rx rings */
};
.Ed
.It Dv struct netmap_ring (one per ring)
Contains the positions in the transmit and receive rings to
synchronize the kernel and the application,
and an array of
.Pa slots
describing the buffers.
'reserved' is used in receive rings to tell the kernel the
number of slots after 'cur' that are still in usr
indicates how many slots starting from 'cur'
the
.Pp
Each physical interface has one
.Pa netmap_ring
for each hardware transmit and receive ring,
plus one extra transmit and one receive structure
that connect to the host stack.
.Bd -literal
struct netmap_ring {
const ssize_t buf_ofs; /* see details */
const uint32_t num_slots; /* number of slots in the ring */
uint32_t avail; /* number of usable slots */
uint32_t cur; /* 'current' read/write index */
uint32_t reserved; /* not refilled before current */
const uint16_t nr_buf_size;
uint16_t flags;
#define NR_TIMESTAMP 0x0002 /* set timestamp on *sync() */
#define NR_FORWARD 0x0004 /* enable NS_FORWARD for ring */
#define NR_RX_TSTMP 0x0008 /* set rx timestamp in slots */
struct timeval ts;
struct netmap_slot slot[0]; /* array of slots */
}
.Ed
.Pp
In transmit rings, after a system call 'cur' indicates
the first slot that can be used for transmissions,
and 'avail' reports how many of them are available.
Before the next netmap-related system call on the file
descriptor, the application should fill buffers and
slots with data, and update 'cur' and 'avail'
accordingly, as shown in the figure below:
.Bd -literal
cur
|----- avail ---| (after syscall)
v
TX [*****aaaaaaaaaaaaaaaaa**]
TX [*****TTTTTaaaaaaaaaaaa**]
^
|-- avail --| (before syscall)
cur
.Ed
In receive rings, after a system call 'cur' indicates
the first slot that contains a valid packet,
and 'avail' reports how many of them are available.
Before the next netmap-related system call on the file
descriptor, the application can process buffers and
release them to the kernel updating
'cur' and 'avail' accordingly, as shown in the figure below.
Receive rings have an additional field called 'reserved'
to indicate how many buffers before 'cur' are still
under processing and cannot be released.
.Bd -literal
cur
|-res-|-- avail --| (after syscall)
v
RX [**rrrrrrRRRRRRRRRRRR******]
RX [**...........rrrrRRR******]
|res|--|<avail (before syscall)
^
cur
.Ed
.It Dv struct netmap_slot (one per packet)
contains the metadata for a packet:
.Bd -literal
struct netmap_slot {
uint32_t buf_idx; /* buffer index */
uint16_t len; /* packet length */
uint16_t flags; /* buf changed, etc. */
#define NS_BUF_CHANGED 0x0001 /* must resync, buffer changed */
#define NS_REPORT 0x0002 /* tell hw to report results
* e.g. by generating an interrupt
*/
#define NS_FORWARD 0x0004 /* pass packet to the other endpoint
* (host stack or device)
*/
#define NS_NO_LEARN 0x0008
#define NS_INDIRECT 0x0010
#define NS_MOREFRAG 0x0020
#define NS_PORT_SHIFT 8
#define NS_PORT_MASK (0xff << NS_PORT_SHIFT)
#define NS_RFRAGS(_slot) ( ((_slot)->flags >> 8) & 0xff)
uint64_t ptr; /* buffer address (indirect buffers) */
};
.Ed
The flags control how the the buffer associated to the slot
should be managed.
.It Dv packet buffers
are normally fixed size (2 Kbyte) buffers allocated by the kernel
that contain packet data. Buffers addresses are computed through
macros.
.El
.Pp
.Bl -tag -width XXX
Some macros support the access to objects in the shared memory
region. In particular,
.It NETMAP_TXRING(nifp, i)
.It NETMAP_RXRING(nifp, i)
return the address of the i-th transmit and receive ring,
respectively, whereas
.It NETMAP_BUF(ring, buf_idx)
returns the address of the buffer with index buf_idx
(which can be part of any ring for the given interface).
.El
.Pp
Normally, buffers are associated to slots when interfaces are bound,
and one packet is fully contained in a single buffer.
Clients can however modify the mapping using the
following flags:
.Ss FLAGS
.Bl -tag -width XXX
.It NS_BUF_CHANGED
indicates that the buf_idx in the slot has changed.
This can be useful if the client wants to implement
some form of zero-copy forwarding (e.g. by passing buffers
from an input interface to an output interface), or
needs to process packets out of order.
.Pp
The flag MUST be used whenever the buffer index is changed.
.It NS_REPORT
indicates that we want to be woken up when this buffer
has been transmitted. This reduces performance but insures
a prompt notification when a buffer has been sent.
Normally,
.Nm
notifies transmit completions in batches, hence signals
can be delayed indefinitely. However, we need such notifications
before closing a descriptor.
.It NS_FORWARD
When the device is open in 'transparent' mode,
the client can mark slots in receive rings with this flag.
For all marked slots, marked packets are forwarded to
the other endpoint at the next system call, thus restoring
(in a selective way) the connection between the NIC and the
host stack.
.It NS_NO_LEARN
tells the forwarding code that the SRC MAC address for this
packet should not be used in the learning bridge
.It NS_INDIRECT
indicates that the packet's payload is not in the netmap
supplied buffer, but in a user-supplied buffer whose
user virtual address is in the 'ptr' field of the slot.
The size can reach 65535 bytes.
.Em This is only supported on the transmit ring of virtual ports
.It NS_MOREFRAG
indicates that the packet continues with subsequent buffers;
the last buffer in a packet must have the flag clear.
The maximum length of a chain is 64 buffers.
.Em This is only supported on virtual ports
.It NS_RFRAGS(slot)
on receive rings, returns the number of remaining buffers
in a packet, including this one.
Slots with a value greater than 1 also have NS_MOREFRAG set.
The length refers to the individual buffer, there is no
field for the total length.
.Pp
On transmit rings, if NS_DST is set, it is passed to the lookup
function, which can use it e.g. as the index of the destination
port instead of doing an address lookup.
.El
.Sh IOCTLS
.Nm
supports some ioctl() to synchronize the state of the rings
between the kernel and the user processes, plus some
to query and configure the interface.
The former do not require any argument, whereas the latter
use a
.Pa struct nmreq
defined as follows:
.Bd -literal
struct nmreq {
char nr_name[IFNAMSIZ];
uint32_t nr_version; /* API version */
#define NETMAP_API 4 /* current version */
uint32_t nr_offset; /* nifp offset in the shared region */
uint32_t nr_memsize; /* size of the shared region */
uint32_t nr_tx_slots; /* slots in tx rings */
uint32_t nr_rx_slots; /* slots in rx rings */
uint16_t nr_tx_rings; /* number of tx rings */
uint16_t nr_rx_rings; /* number of tx rings */
uint16_t nr_ringid; /* ring(s) we care about */
#define NETMAP_HW_RING 0x4000 /* low bits indicate one hw ring */
#define NETMAP_SW_RING 0x2000 /* we process the sw ring */
#define NETMAP_NO_TX_POLL 0x1000 /* no gratuitous txsync on poll */
#define NETMAP_RING_MASK 0xfff /* the actual ring number */
uint16_t nr_cmd;
#define NETMAP_BDG_ATTACH 1 /* attach the NIC */
#define NETMAP_BDG_DETACH 2 /* detach the NIC */
#define NETMAP_BDG_LOOKUP_REG 3 /* register lookup function */
#define NETMAP_BDG_LIST 4 /* get bridge's info */
uint16_t nr_arg1;
uint16_t nr_arg2;
uint32_t spare2[3];
};
.Ed
A device descriptor obtained through
.Pa /dev/netmap
also supports the ioctl supported by network devices.
.Pp
The netmap-specific
.Xr ioctl 2
command codes below are defined in
.In net/netmap.h
and are:
.Bl -tag -width XXXX
.It Dv NIOCGINFO
returns EINVAL if the named device does not support netmap.
Otherwise, it returns 0 and (advisory) information
about the interface.
Note that all the information below can change before the
interface is actually put in netmap mode.
.Pp
.Pa nr_memsize
indicates the size of the netmap
memory region. Physical devices all share the same memory region,
whereas VALE ports may have independent regions for each port.
These sizes can be set through system-wise sysctl variables.
.Pa nr_tx_slots, nr_rx_slots
indicate the size of transmit and receive rings.
.Pa nr_tx_rings, nr_rx_rings
indicate the number of transmit
and receive rings.
Both ring number and sizes may be configured at runtime
using interface-specific functions (e.g.
.Pa sysctl
or
.Pa ethtool .
.It Dv NIOCREGIF
puts the interface named in nr_name into netmap mode, disconnecting
it from the host stack, and/or defines which rings are controlled
through this file descriptor.
On return, it gives the same info as NIOCGINFO, and nr_ringid
indicates the identity of the rings controlled through the file
descriptor.
.Pp
Possible values for nr_ringid are
.Bl -tag -width XXXXX
.It 0
default, all hardware rings
.It NETMAP_SW_RING
the ``host rings'' connecting to the host stack
.It NETMAP_HW_RING + i
the i-th hardware ring
.El
By default, a
.Nm poll
or
.Nm select
call pushes out any pending packets on the transmit ring, even if
no write events are specified.
The feature can be disabled by or-ing
.Nm NETMAP_NO_TX_SYNC
to nr_ringid.
But normally you should keep this feature unless you are using
separate file descriptors for the send and receive rings, because
otherwise packets are pushed out only if NETMAP_TXSYNC is called,
or the send queue is full.
.Pp
.Pa NIOCREGIF
can be used multiple times to change the association of a
file descriptor to a ring pair, always within the same device.
.Pp
When registering a virtual interface that is dynamically created to a
.Xr vale 4
switch, we can specify the desired number of rings (1 by default,
and currently up to 16) on it using nr_tx_rings and nr_rx_rings fields.
.It Dv NIOCTXSYNC
tells the hardware of new packets to transmit, and updates the
number of slots available for transmission.
.It Dv NIOCRXSYNC
tells the hardware of consumed packets, and asks for newly available
packets.
.El
.Sh SYSTEM CALLS
.Nm
uses
.Xr select 2
and
.Xr poll 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
.Pp
Applications may need to create threads and bind them to
specific cores to improve performance, using standard
OS primitives, see
.Xr pthread 3 .
In particular,
.Xr pthread_setaffinity_np 3
may be of use.
.Sh EXAMPLES
The following code implements a traffic generator
.Pp
.Bd -literal -compact
#include <net/netmap.h>
#include <net/netmap_user.h>
struct netmap_if *nifp;
struct netmap_ring *ring;
struct nmreq nmr;
fd = open("/dev/netmap", O_RDWR);
bzero(&nmr, sizeof(nmr));
strcpy(nmr.nr_name, "ix0");
nmr.nm_version = NETMAP_API;
ioctl(fd, NIOCREGIF, &nmr);
p = mmap(0, nmr.nr_memsize, fd);
nifp = NETMAP_IF(p, nmr.nr_offset);
ring = NETMAP_TXRING(nifp, 0);
fds.fd = fd;
fds.events = POLLOUT;
for (;;) {
poll(list, 1, -1);
for ( ; ring->avail > 0 ; ring->avail--) {
i = ring->cur;
buf = NETMAP_BUF(ring, ring->slot[i].buf_index);
... prepare packet in buf ...
ring->slot[i].len = ... packet length ...
ring->cur = NETMAP_RING_NEXT(ring, i);
}
}
.Ed
.Sh SUPPORTED INTERFACES
.Nm
supports the following interfaces:
.Xr em 4 ,
.Xr igb 4 ,
.Xr ixgbe 4 ,
.Xr lem 4 ,
.Xr re 4
.Sh SEE ALSO
.Xr vale 4
.Pp
http://info.iet.unipi.it/~luigi/netmap/
.Pp
Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
Communications of the ACM, 55 (3), pp.45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Sh AUTHORS
.An -nosplit
The
.Nm
framework has been originally designed and implemented at the
Universita` di Pisa in 2011 by
.An Luigi Rizzo ,
and further extended with help from
.An Matteo Landi ,
.An Gaetano Catalli ,
.An Giuseppe Lettieri ,
.An Vincenzo Maffione .
.Pp
.Nm
and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).