complete svn 261909 - new netmap version.

since i updated the manpage i might as well commit it.

MFC after:	3 days
Luigi Rizzo 2014-02-15 08:23:31 +00:00
parent a2c5809961
commit fa7db06b8f

@ -27,7 +27,7 @@
.\"
.\" $FreeBSD$
.\"
.Dd January 4, 2014
.Dd February 13, 2014
.Dt NETMAP 4
.Os
.Sh NAME
@ -36,6 +36,9 @@
.br
.Nm VALE
.Nd a fast VirtuAl Local Ethernet using the netmap API
.br
.Nm netmap pipes
.Nd a shared memory packet transport channel
.Sh SYNOPSIS
.Cd device netmap
.Sh DESCRIPTION
@ -45,38 +48,55 @@ for both userspace and kernel clients.
It runs on FreeBSD and Linux,
and includes
.Nm VALE ,
a very fast and modular in-kernel software switch/dataplane.
.Pp
.Nm
a very fast and modular in-kernel software switch/dataplane,
and
.Nm VALE
are one order of magnitude faster than sockets, bpf or
native switches based on
.Xr tun/tap 4 ,
reaching 14.88 Mpps with much less than one core on a 10 Gbit NIC,
and 20 Mpps per core for VALE ports.
.Nm netmap pipes ,
a shared memory packet transport channel.
All these are accessed interchangeably with the same API.
.Pp
.Nm , VALE
and
.Nm netmap pipes
are at least one order of magnitude faster than
standard OS mechanisms
(sockets, bpf, tun/tap interfaces, native switches, pipes),
reaching 14.88 million packets per second (Mpps)
with much less than one core on a 10 Gbit NIC,
about 20 Mpps per core for VALE ports,
and over 100 Mpps for netmap pipes.
.Pp
Userspace clients can dynamically switch NICs into
.Nm
mode and send and receive raw packets through
memory mapped buffers.
A selectable file descriptor supports
synchronization and blocking I/O.
.Pp
Similarly,
.Nm VALE
can dynamically create switch instances and ports,
switch instances and ports, and
.Nm netmap pipes
can be created dynamically,
providing high speed packet I/O between processes,
virtual machines, NICs and the host stack.
.Pp
.Nm
supports both non-blocking I/O through
.Xr ioctl 2 ,
synchronization and blocking I/O through a file descriptor
and standard OS mechanisms such as
.Xr select 2 ,
.Xr poll 2 ,
.Xr epoll 2 ,
.Xr kqueue 2 .
.Nm VALE
and
.Nm netmap pipes
are implemented by a single kernel module, which also emulates the
.Nm
API over standard drivers for devices without native
.Nm
support.
For best performance,
.Nm
requires explicit support in device drivers;
however, the
.Nm
API can be emulated on top of unmodified device drivers,
at the price of reduced performance
(but still better than sockets or BPF/pcap).
requires explicit support in device drivers.
.Pp
In the rest of this (long) manual page we document
various aspects of the
@ -114,10 +134,26 @@ mode use the same memory region,
accessible to all processes who own
.Nm /dev/netmap
file descriptors bound to NICs.
Independent
.Nm VALE
ports instead use separate memory regions.
and
.Nm netmap pipe
ports
by default use separate memory regions,
but can be independently configured to share memory.
.Pp
.Sh ENTERING AND EXITING NETMAP MODE
The following section describes the system calls to create
and control
.Nm netmap
ports (including
.Nm VALE
and
.Nm netmap pipe
ports).
Simpler, higher level functions are described in section
.Sx LIBRARIES .
.Pp
Ports and rings are created and controlled through a file descriptor,
created by opening a special device
.Dl fd = open("/dev/netmap");
@ -186,12 +222,11 @@ API. The main structures and fields are indicated below:
.Bd -literal
struct netmap_if {
...
const uint32_t ni_flags; /* properties */
const uint32_t ni_flags; /* properties */
...
const uint32_t ni_tx_rings; /* NIC tx rings */
const uint32_t ni_rx_rings; /* NIC rx rings */
const uint32_t ni_extra_tx_rings; /* extra tx rings */
const uint32_t ni_extra_rx_rings; /* extra rx rings */
const uint32_t ni_tx_rings; /* NIC tx rings */
const uint32_t ni_rx_rings; /* NIC rx rings */
uint32_t ni_bufs_head; /* head of extra bufs list */
...
};
.Ed
@ -204,11 +239,14 @@ The number of tx and rx rings
normally depends on the hardware.
NICs also have an extra tx/rx ring pair connected to the host stack.
.Em NIOCREGIF
can request additional tx/rx rings,
to be used between multiple processes/threads
accessing the same
.Nm
port.
can also request additional unbound buffers in the same memory space,
to be used as temporary storage for packets.
.Pa ni_bufs_head
contains the index of the first of these free buffers,
which are connected in a list (the first uint32_t of each
buffer being the index of the next buffer in the list).
A 0 indicates the end of the list.
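As a minimal sketch of how such a list can be walked, the fragment below
models the shared memory as a flat pool of fixed-size buffers.
The pool layout, the BUF_SIZE value and all function names are assumptions
made for illustration; they are not part of the netmap API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE 2048u	/* assumed buffer size, for this sketch only */

/* Address of buffer idx inside a flat pool of BUF_SIZE-byte buffers. */
static uint8_t *buf_at(uint8_t *pool, uint32_t idx)
{
	return pool + (size_t)idx * BUF_SIZE;
}

/* The first uint32_t of each free buffer holds the index of the next one. */
static uint32_t next_index(uint8_t *pool, uint32_t idx)
{
	uint32_t next;

	memcpy(&next, buf_at(pool, idx), sizeof(next));
	return next;
}

/* Helper used to build a list by hand: link buffer idx to next. */
static void set_next(uint8_t *pool, uint32_t idx, uint32_t next)
{
	memcpy(buf_at(pool, idx), &next, sizeof(next));
}

/* Count the buffers on the list starting at head; index 0 ends the list. */
static unsigned count_extra_bufs(uint8_t *pool, uint32_t head)
{
	unsigned n = 0;
	uint32_t i;

	for (i = head; i != 0; i = next_index(pool, i))
		n++;
	return n;
}

/* Detach the first buffer: returns its index (0 if the list is empty). */
static uint32_t pop_extra_buf(uint8_t *pool, uint32_t *head)
{
	uint32_t idx = *head;

	if (idx != 0)
		*head = next_index(pool, idx);
	return idx;
}
```

In a real program head would start from the value read in ni_bufs_head,
and the buffer addresses would come from the mapped netmap region rather
than from a private pool.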
.Pp
.It Dv struct netmap_ring (one per ring)
.Bd -literal
struct netmap_ring {
@ -221,9 +259,9 @@ struct netmap_ring {
const uint32_t tail; /* (k) first buf owned by kernel */
...
uint32_t flags;
struct timeval ts; /* (k) time of last rxsync() */
struct timeval ts; /* (k) time of last rxsync() */
...
struct netmap_slot slot[0]; /* array of slots */
struct netmap_slot slot[0]; /* array of slots */
};
.Ed
.Pp
@ -482,14 +520,16 @@ struct nmreq {
uint32_t nr_version; /* (i) API version */
uint32_t nr_offset; /* (o) nifp offset in mmap region */
uint32_t nr_memsize; /* (o) size of the mmap region */
uint32_t nr_tx_slots; /* (o) slots in tx rings */
uint32_t nr_rx_slots; /* (o) slots in rx rings */
uint16_t nr_tx_rings; /* (o) number of tx rings */
uint16_t nr_rx_rings; /* (o) number of tx rings */
uint16_t nr_ringid; /* (i) ring(s) we care about */
uint32_t nr_tx_slots; /* (i/o) slots in tx rings */
uint32_t nr_rx_slots; /* (i/o) slots in rx rings */
uint16_t nr_tx_rings; /* (i/o) number of tx rings */
uint16_t nr_rx_rings; /* (i/o) number of rx rings */
uint16_t nr_ringid; /* (i/o) ring(s) we care about */
uint16_t nr_cmd; /* (i) special command */
uint16_t nr_arg1; /* (i) extra arguments */
uint16_t nr_arg2; /* (i) extra arguments */
uint16_t nr_arg1; /* (i/o) extra arguments */
uint16_t nr_arg2; /* (i/o) extra arguments */
uint32_t nr_arg3; /* (i/o) extra arguments */
uint32_t nr_flags; /* (i/o) open mode */
...
};
.Ed
@ -537,20 +577,59 @@ it from the host stack.
Multiple file descriptors can be bound to the same port,
with proper synchronization left to the user.
.Pp
On return, it gives the same info as NIOCGINFO, and nr_ringid
indicates the identity of the rings controlled through the file
.Dv NIOCREGIF
can also bind a file descriptor to one endpoint of a
.Em netmap pipe ,
consisting of two netmap ports with a crossover connection.
A netmap pipe shares the same memory space as the parent port,
and is meant to enable configurations where a master process acts
as a dispatcher towards slave processes.
.Pp
To enable this function, the
.Pa nr_arg1
field of the structure can be used as a hint to the kernel to
indicate how many pipes we expect to use, and reserve extra space
in the memory region.
.Pp
On return, it gives the same info as NIOCGINFO,
with
.Pa nr_ringid
and
.Pa nr_flags
indicating the identity of the rings controlled through the file
descriptor.
.Pp
.Va nr_flags
.Va nr_ringid
selects which rings are controlled through this file descriptor.
Possible values are:
Possible values of
.Pa nr_flags
are indicated below, together with the naming schemes
that application libraries (such as the
.Nm nm_open
indicated below) can use to indicate the specific set of rings.
In the example below, "netmap:foo" is any valid netmap port name.
.Pp
.Bl -tag -width XXXXX
.It 0
(default) all hardware rings
.It NETMAP_SW_RING
.It NR_REG_ALL_NIC "netmap:foo"
(default) all hardware ring pairs
.It NR_REG_SW_NIC "netmap:foo^"
the ``host rings'', connecting to the host stack.
.It NETMAP_HW_RING | i
the i-th hardware ring .
.It NR_REG_NIC_SW "netmap:foo+"
all hardware rings and the host rings
.It NR_REG_ONE_NIC "netmap:foo-i"
only the i-th hardware ring pair, where the number is in
.Pa nr_ringid ;
.It NR_REG_PIPE_MASTER "netmap:foo{i"
the master side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid ;
.It NR_REG_PIPE_SLAVE "netmap:foo}i"
the slave side of the netmap pipe whose identifier (i) is in
.Pa nr_ringid .
.Pp
The identifier of a pipe should be thought of as part of the pipe name,
and does not need to be sequential. On return the pipe
will only have a single ring pair with index 0,
irrespective of the value of i.
.El
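The naming scheme listed above can be illustrated with a small parser.
This is only a sketch: the enum values are local stand-ins rather than the
constants from net/netmap.h, parse_port_name() is a hypothetical helper and
not a library function, and interface names that themselves contain one of
the suffix characters are not handled.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Local stand-ins for the nr_flags modes; NOT the values in net/netmap.h. */
enum port_mode {
	MODE_ALL_NIC,		/* "netmap:foo"   all hardware ring pairs */
	MODE_HOST,		/* "netmap:foo^"  host rings only */
	MODE_NIC_AND_HOST,	/* "netmap:foo+"  hardware and host rings */
	MODE_ONE_NIC,		/* "netmap:foo-i" i-th hardware ring pair */
	MODE_PIPE_MASTER,	/* "netmap:foo{i" master side of pipe i */
	MODE_PIPE_SLAVE		/* "netmap:foo}i" slave side of pipe i */
};

struct port_spec {
	enum port_mode mode;
	unsigned ringid;	/* ring number or pipe identifier */
	size_t namelen;		/* length of the name before the suffix */
};

/* Split a port name into mode and id; returns 0 on success, -1 on error. */
static int parse_port_name(const char *name, struct port_spec *out)
{
	size_t base = strcspn(name, "^+-{}");
	const char *suffix = name + base;
	unsigned long id;
	char *end;

	out->namelen = base;
	out->ringid = 0;
	if (base == 0)
		return -1;

	switch (*suffix) {
	case '\0':
		out->mode = MODE_ALL_NIC;
		return 0;
	case '^':
		out->mode = MODE_HOST;
		return suffix[1] == '\0' ? 0 : -1;
	case '+':
		out->mode = MODE_NIC_AND_HOST;
		return suffix[1] == '\0' ? 0 : -1;
	case '-':
		out->mode = MODE_ONE_NIC;
		break;
	case '{':
		out->mode = MODE_PIPE_MASTER;
		break;
	case '}':
		out->mode = MODE_PIPE_SLAVE;
		break;
	default:
		return -1;
	}

	/* The remaining modes require a decimal ring or pipe identifier. */
	if (suffix[1] < '0' || suffix[1] > '9')
		return -1;
	id = strtoul(suffix + 1, &end, 10);
	if (*end != '\0' || id > 0xffff)	/* nr_ringid is a uint16_t */
		return -1;
	out->ringid = (unsigned)id;
	return 0;
}
```

An application library would then copy the resulting mode into nr_flags and
the identifier into nr_ringid before issuing NIOCREGIF.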
.Pp
By default, a
@ -579,7 +658,7 @@ number of slots available for transmission.
tells the hardware of consumed packets, and asks for newly available
packets.
.El
.Sh SELECT AND POLL
.Sh SELECT, POLL, EPOLL, KQUEUE
.Xr select 2
and
.Xr poll 2
@ -588,16 +667,26 @@ on a
file descriptor process rings as indicated in
.Sx TRANSMIT RINGS
and
.Sx RECEIVE RINGS
when write (POLLOUT) and read (POLLIN) events are requested.
.Sx RECEIVE RINGS ,
respectively when write (POLLOUT) and read (POLLIN) events are requested.
Both block if no slots are available in the ring
.Va ( ring->cur == ring->tail ) .
Depending on the platform,
.Xr epoll 2
and
.Xr kqueue 2
are supported too.
.Pp
Both block if no slots are available in the ring (
.Va ring->cur == ring->tail )
.Pp
Packets in transmit rings are normally pushed out even without
Packets in transmit rings are normally pushed out
(and buffers reclaimed) even without
requesting write events. Passing the NETMAP_NO_TX_POLL flag to
.Em NIOCREGIF
disables this feature.
By default, receive rings are processed only if read
events are requested. Passing the NETMAP_DO_RX_POLL flag to
.Em NIOCREGIF
updates receive rings even without read events.
Note that on epoll and kqueue, NETMAP_NO_TX_POLL and NETMAP_DO_RX_POLL
only have an effect when some event is posted for the file descriptor.
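The blocking condition above (ring->cur == ring->tail) can be sketched with
a mock ring.
The struct below carries only the fields needed for the arithmetic and is
not the real struct netmap_ring; the num_slots field and the helper names
are assumptions made for illustration.

```c
#include <stdint.h>

/* Mock of the ring fields used here; not the real struct netmap_ring. */
struct mock_ring {
	uint32_t num_slots;	/* assumed: total slots in the ring */
	uint32_t cur;		/* next slot the user will process */
	uint32_t tail;		/* first slot owned by the kernel */
};

/* select/poll would block on this ring when no slot is available. */
static int ring_is_empty(const struct mock_ring *r)
{
	return r->cur == r->tail;
}

/* Slots available to the user: from cur up to, but not including, tail. */
static uint32_t ring_space(const struct mock_ring *r)
{
	int64_t n = (int64_t)r->tail - (int64_t)r->cur;

	if (n < 0)
		n += r->num_slots;	/* tail has wrapped around */
	return (uint32_t)n;
}

/* Index following i on the circular ring. */
static uint32_t ring_next(const struct mock_ring *r, uint32_t i)
{
	return (i + 1 == r->num_slots) ? 0 : i + 1;
}
```

A caller would typically process ring_space() slots, advancing with
ring_next(), and go back to poll only once the ring is empty again.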
.Sh LIBRARIES
The
.Nm
@ -620,7 +709,7 @@ before
.Pp
The following functions are available:
.Bl -tag -width XXXXX
.It Va struct nm_desc_t * nm_open(const char *ifname, const char *ring_name, int flags, int ring_flags)
.It Va struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg)
similar to
.Xr pcap_open ,
binds a file descriptor to a port.
@ -629,26 +718,36 @@ binds a file descriptor to a port.
is a port name, in the form "netmap:XXX" for a NIC and "valeXXX:YYY" for a
.Nm VALE
port.
.It Va req
provides the initial values for the argument to the NIOCREGIF ioctl.
The nr_flags and nr_ringid values are overwritten by parsing
ifname and flags, and other fields can be overridden through
the other two arguments.
.It Va arg
points to a struct nm_desc containing arguments (e.g. from a previously
open file descriptor) that should override the defaults.
The fields are used as described below.
.It Va flags
can be set to
.Va NETMAP_SW_RING
to bind to the host ring pair,
or to NETMAP_HW_RING to bind to a specific ring.
.Va ring_name
with NETMAP_HW_RING,
is interpreted as a string or an integer indicating the ring to use.
.It Va ring_flags
is copied directly into the ring flags, to specify additional parameters
such as NR_TIMESTAMP or NR_FORWARD.
can be set to a combination of the following flags:
.Va NETMAP_NO_TX_POLL ,
.Va NETMAP_DO_RX_POLL
(copied into nr_ringid);
.Va NM_OPEN_NO_MMAP
(if arg points to the same memory region,
avoids the mmap and uses the values from it);
.Va NM_OPEN_IFNAME
(ignores ifname and uses the values in arg);
.Va NM_OPEN_ARG1 ,
.Va NM_OPEN_ARG2 ,
.Va NM_OPEN_ARG3
(uses the fields from arg);
.Va NM_OPEN_RING_CFG
(uses the ring number and sizes from arg).
.El
.It Va int nm_close(struct nm_desc_t *d)
.It Va int nm_close(struct nm_desc *d)
closes the file descriptor, unmaps memory, frees resources.
.It Va int nm_inject(struct nm_desc_t *d, const void *buf, size_t size)
.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size)
similar to pcap_inject(), pushes a packet to a ring, returns the size
of the packet if successful, or 0 on error;
.It Va int nm_dispatch(struct nm_desc_t *d, int cnt, nm_cb_t cb, u_char *arg)
.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg)
similar to pcap_dispatch(), applies a callback to incoming packets
.It Va u_char * nm_nextpkt(struct nm_desc_t *d, struct nm_hdr_t *hdr)
.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr)
similar to pcap_next(), fetches the next packet
.Pp
.El
@ -740,9 +839,11 @@ performance.
.Sh SYSTEM CALLS
.Nm
uses
.Xr select 2
.Xr select 2 ,
.Xr poll 2 ,
.Xr epoll 2
and
.Xr poll 2
.Xr kqueue 2
to wake up processes when significant events occur, and
.Xr mmap 2
to map memory.
@ -872,10 +973,10 @@ A simple receiver can be implemented using the helper functions
...
void receiver(void)
{
struct nm_desc_t *d;
struct nm_desc *d;
struct pollfd fds;
u_char *buf;
struct nm_hdr_t h;
struct nm_pkthdr h;
...
d = nm_open("netmap:ix0", NULL, 0, 0);
fds.fd = NETMAP_FD(d);
@ -910,6 +1011,13 @@ to replenish the receive ring:
...
.Ed
.Ss ACCESSING THE HOST STACK
The host stack is for all practical purposes just a regular ring pair,
which you can access with the netmap API, e.g. with
.Dl nm_open("netmap:eth0^", ... ) ;
All packets that the host would send to an interface in
.Nm
mode end up in the RX ring, whereas all packets queued to the
TX ring are sent up to the host stack.
.Ss VALE SWITCH
A simple way to test the performance of a
.Nm VALE
@ -917,6 +1025,10 @@ switch is to attach a sender and a receiver to it,
e.g. running the following in two different terminals:
.Dl pkt-gen -i vale1:a -f rx # receiver
.Dl pkt-gen -i vale1:b -f tx # sender
The same example can be used to test netmap pipes, by simply
changing port names, e.g.
.Dl pkt-gen -i vale:x{3 -f rx # receiver on the master side
.Dl pkt-gen -i vale:x}3 -f tx # sender on the slave side
.Pp
The following command attaches an interface and the host stack
to a switch:
@ -935,6 +1047,14 @@ Communications of the ACM, 55 (3), pp.45-51, March 2012
.Pp
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
Usenix ATC'12, June 2012, Boston
.Pp
Luigi Rizzo, Giuseppe Lettieri,
VALE, a switched ethernet for virtual machines,
ACM CoNEXT'12, December 2012, Nice
.Pp
Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione,
Speeding up packet I/O in virtual machines,
ACM/IEEE ANCS'13, October 2013, San Jose
.Sh AUTHORS
.An -nosplit
The
@ -953,20 +1073,3 @@ and
.Nm VALE
have been funded by the European Commission within FP7 Projects
CHANGE (257422) and OPENLAB (287581).
.Pp
.Ss SPECIAL MODES
When the device name has the form
.Dl valeXXX:ifname (ifname is an existing interface)
the physical interface
(and optionally the corresponding host stack endpoint)
are connected or disconnected from the
.Nm VALE
switch named XXX.
In this case the
.Pa ioctl()
is used only for configuration, typically through the
.Xr vale-ctl
command.
The file descriptor cannot be used for I/O, and should be
closed after issuing the
.Pa ioctl() .