complete svn 261909 - new netmap version.
since i updated the manpage i might as well commit it. MFC after: 3 days
parent
a2c5809961
commit
fa7db06b8f
@ -27,7 +27,7 @@
|
||||
.\"
|
||||
.\" $FreeBSD$
|
||||
.\"
|
||||
.Dd January 4, 2014
|
||||
.Dd February 13, 2014
|
||||
.Dt NETMAP 4
|
||||
.Os
|
||||
.Sh NAME
|
||||
@ -36,6 +36,9 @@
|
||||
.br
|
||||
.Nm VALE
|
||||
.Nd a fast VirtuAl Local Ethernet using the netmap API
|
||||
.br
|
||||
.Nm netmap pipes
|
||||
.Nd a shared memory packet transport channel
|
||||
.Sh SYNOPSIS
|
||||
.Cd device netmap
|
||||
.Sh DESCRIPTION
|
||||
@ -45,38 +48,55 @@ for both userspace and kernel clients.
|
||||
It runs on FreeBSD and Linux,
|
||||
and includes
|
||||
.Nm VALE ,
|
||||
a very fast and modular in-kernel software switch/dataplane.
|
||||
.Pp
|
||||
.Nm
|
||||
a very fast and modular in-kernel software switch/dataplane,
|
||||
and
|
||||
.Nm VALE
|
||||
are one order of magnitude faster than sockets, bpf or
|
||||
native switches based on
|
||||
.Xr tun/tap 4 ,
|
||||
reaching 14.88 Mpps with much less than one core on a 10 Gbit NIC,
|
||||
and 20 Mpps per core for VALE ports.
|
||||
.Nm netmap pipes ,
|
||||
a shared memory packet transport channel.
|
||||
All these are accessed interchangeably with the same API.
|
||||
.Pp
|
||||
.Nm , VALE
|
||||
and
|
||||
.Nm netmap pipes
|
||||
are at least one order of magnitude faster than
|
||||
standard OS mechanisms
|
||||
(sockets, bpf, tun/tap interfaces, native switches, pipes),
|
||||
reaching 14.88 million packets per second (Mpps)
|
||||
with much less than one core on a 10 Gbit NIC,
|
||||
about 20 Mpps per core for VALE ports,
|
||||
and over 100 Mpps for netmap pipes.
|
||||
.Pp
|
||||
Userspace clients can dynamically switch NICs into
|
||||
.Nm
|
||||
mode and send and receive raw packets through
|
||||
memory mapped buffers.
|
||||
A selectable file descriptor supports
|
||||
synchronization and blocking I/O.
|
||||
.Pp
|
||||
Similarly,
|
||||
.Nm VALE
|
||||
can dynamically create switch instances and ports,
|
||||
switch instances and ports, and
|
||||
.Nm netmap pipes
|
||||
can be created dynamically,
|
||||
providing high speed packet I/O between processes,
|
||||
virtual machines, NICs and the host stack.
|
||||
.Pp
|
||||
.Nm
|
||||
supports both non-blocking I/O through
|
||||
.Xr ioctls() ,
|
||||
synchronization and blocking I/O through a file descriptor
|
||||
and standard OS mechanisms such as
|
||||
.Xr select 2 ,
|
||||
.Xr poll 2 ,
|
||||
.Xr epoll 2 ,
|
||||
.Xr kqueue 2 .
|
||||
.Nm VALE
|
||||
and
|
||||
.Nm netmap pipes
|
||||
are implemented by a single kernel module, which also emulates the
|
||||
.Nm
|
||||
API over standard drivers for devices without native
|
||||
.Nm
|
||||
support.
|
||||
For best performance,
|
||||
.Nm
|
||||
requires explicit support in device drivers;
|
||||
however, the
|
||||
.Nm
|
||||
API can be emulated on top of unmodified device drivers,
|
||||
at the price of reduced performance
|
||||
(but still better than sockets or BPF/pcap).
|
||||
requires explicit support in device drivers.
|
||||
.Pp
|
||||
In the rest of this (long) manual page we document
|
||||
various aspects of the
|
||||
@ -114,10 +134,26 @@ mode use the same memory region,
|
||||
accessible to all processes who own
|
||||
.Nm /dev/netmap
|
||||
file descriptors bound to NICs.
|
||||
Independent
|
||||
.Nm VALE
|
||||
ports instead use separate memory regions.
|
||||
and
|
||||
.Nm netmap pipe
|
||||
ports
|
||||
by default use separate memory regions,
|
||||
but can be independently configured to share memory.
|
||||
.Pp
|
||||
.Sh ENTERING AND EXITING NETMAP MODE
|
||||
The following section describes the system calls to create
|
||||
and control
|
||||
.Nm netmap
|
||||
ports (including
|
||||
.Nm VALE
|
||||
and
|
||||
.Nm netmap pipe
|
||||
ports).
|
||||
Simpler, higher level functions are described in section
|
||||
.Sx LIBRARIES .
|
||||
.Pp
|
||||
Ports and rings are created and controlled through a file descriptor,
|
||||
created by opening a special device
|
||||
.Dl fd = open("/dev/netmap");
|
||||
@ -186,12 +222,11 @@ API. The main structures and fields are indicated below:
|
||||
.Bd -literal
|
||||
struct netmap_if {
|
||||
...
|
||||
const uint32_t ni_flags; /* properties */
|
||||
const uint32_t ni_flags; /* properties */
|
||||
...
|
||||
const uint32_t ni_tx_rings; /* NIC tx rings */
|
||||
const uint32_t ni_rx_rings; /* NIC rx rings */
|
||||
const uint32_t ni_extra_tx_rings; /* extra tx rings */
|
||||
const uint32_t ni_extra_rx_rings; /* extra rx rings */
|
||||
const uint32_t ni_tx_rings; /* NIC tx rings */
|
||||
const uint32_t ni_rx_rings; /* NIC rx rings */
|
||||
uint32_t ni_bufs_head; /* head of extra bufs list */
|
||||
...
|
||||
};
|
||||
.Ed
|
||||
@ -204,11 +239,14 @@ The number of tx and rx rings
|
||||
normally depends on the hardware.
|
||||
NICs also have an extra tx/rx ring pair connected to the host stack.
|
||||
.Em NIOCREGIF
|
||||
can request additional tx/rx rings,
|
||||
to be used between multiple processes/threads
|
||||
accessing the same
|
||||
.Nm
|
||||
port.
|
||||
can also request additional unbound buffers in the same memory space,
|
||||
to be used as temporary storage for packets.
|
||||
.Pa ni_bufs_head
|
||||
contains the index of the first of these free buffers,
|
||||
which are connected in a list (the first uint32_t of each
|
||||
buffer being the index of the next buffer in the list).
|
||||
A 0 indicates the end of the list.
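As a rough illustration of the list layout described above, the sketch below walks such a chain in a plain array standing in for the shared memory region. The buffer size, the buf_at() helper and count_extra_bufs() are assumptions made for this example and are not part of the netmap API; real code would locate buffers through the macros in the netmap headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE 2048	/* assumed buffer size, for illustration only */

/* Hypothetical helper: address of buffer 'idx' in a flat pool. */
static char *
buf_at(char *pool, uint32_t idx)
{
	return pool + (size_t)idx * BUF_SIZE;
}

/*
 * Count the buffers on the list that starts at 'head' (the value
 * that would be read from ni_bufs_head). The first uint32_t of each
 * buffer holds the index of the next one; index 0 ends the list.
 * The 'n < nbufs' bound stops the walk if the list is corrupted
 * into a cycle.
 */
static uint32_t
count_extra_bufs(char *pool, uint32_t nbufs, uint32_t head)
{
	uint32_t n = 0, idx = head;

	while (idx != 0 && n < nbufs) {
		uint32_t next;

		assert(idx < nbufs);	/* stay inside the pool */
		memcpy(&next, buf_at(pool, idx), sizeof(next));
		idx = next;
		n++;
	}
	return n;
}
```

Because index 0 is the terminator, buffer 0 can never appear on the list; memcpy() is used to read the link so the sketch does not depend on buffer alignment.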
|
||||
.Pp
|
||||
.It Dv struct netmap_ring (one per ring)
|
||||
.Bd -literal
|
||||
struct netmap_ring {
|
||||
@ -221,9 +259,9 @@ struct netmap_ring {
|
||||
const uint32_t tail; /* (k) first buf owned by kernel */
|
||||
...
|
||||
uint32_t flags;
|
||||
struct timeval ts; /* (k) time of last rxsync() */
|
||||
struct timeval ts; /* (k) time of last rxsync() */
|
||||
...
|
||||
struct netmap_slot slot[0]; /* array of slots */
|
||||
struct netmap_slot slot[0]; /* array of slots */
|
||||
}
|
||||
.Ed
|
||||
.Pp
|
||||
@ -482,14 +520,16 @@ struct nmreq {
|
||||
uint32_t nr_version; /* (i) API version */
|
||||
uint32_t nr_offset; /* (o) nifp offset in mmap region */
|
||||
uint32_t nr_memsize; /* (o) size of the mmap region */
|
||||
uint32_t nr_tx_slots; /* (o) slots in tx rings */
|
||||
uint32_t nr_rx_slots; /* (o) slots in rx rings */
|
||||
uint16_t nr_tx_rings; /* (o) number of tx rings */
|
||||
uint16_t nr_rx_rings; /* (o) number of rx rings */
|
||||
uint16_t nr_ringid; /* (i) ring(s) we care about */
|
||||
uint32_t nr_tx_slots; /* (i/o) slots in tx rings */
|
||||
uint32_t nr_rx_slots; /* (i/o) slots in rx rings */
|
||||
uint16_t nr_tx_rings; /* (i/o) number of tx rings */
|
||||
uint16_t nr_rx_rings; /* (i/o) number of rx rings */
|
||||
uint16_t nr_ringid; /* (i/o) ring(s) we care about */
|
||||
uint16_t nr_cmd; /* (i) special command */
|
||||
uint16_t nr_arg1; /* (i) extra arguments */
|
||||
uint16_t nr_arg2; /* (i) extra arguments */
|
||||
uint16_t nr_arg1; /* (i/o) extra arguments */
|
||||
uint16_t nr_arg2; /* (i/o) extra arguments */
|
||||
uint32_t nr_arg3; /* (i/o) extra arguments */
|
||||
uint32_t nr_flags; /* (i/o) open mode */
|
||||
...
|
||||
};
|
||||
.Ed
|
||||
@ -537,20 +577,59 @@ it from the host stack.
|
||||
Multiple file descriptors can be bound to the same port,
|
||||
with proper synchronization left to the user.
|
||||
.Pp
|
||||
On return, it gives the same info as NIOCGINFO, and nr_ringid
|
||||
indicates the identity of the rings controlled through the file
|
||||
.Dv NIOCREGIF
can also bind a file descriptor to one endpoint of a
|
||||
.Em netmap pipe ,
|
||||
consisting of two netmap ports with a crossover connection.
|
||||
A netmap pipe shares the same memory space as the parent port,
and is meant to enable configurations in which a master process acts
|
||||
as a dispatcher towards slave processes.
|
||||
.Pp
|
||||
To enable this function, the
|
||||
.Pa nr_arg1
|
||||
field of the structure can be used as a hint to the kernel to
|
||||
indicate how many pipes we expect to use, and reserve extra space
|
||||
in the memory region.
|
||||
.Pp
|
||||
On return, it gives the same info as NIOCGINFO,
|
||||
with
|
||||
.Pa nr_ringid
|
||||
and
|
||||
.Pa nr_flags
|
||||
indicating the identity of the rings controlled through the file
|
||||
descriptor.
|
||||
.Pp
|
||||
.Va nr_flags
|
||||
.Va nr_ringid
|
||||
selects which rings are controlled through this file descriptor.
|
||||
Possible values are:
|
||||
Possible values of
|
||||
.Pa nr_flags
|
||||
are indicated below, together with the naming schemes
|
||||
that application libraries (such as the
|
||||
.Nm nm_open
|
||||
indicated below) can use to indicate the specific set of rings.
|
||||
In the example below, "netmap:foo" is any valid netmap port name.
|
||||
.Pp
|
||||
.Bl -tag -width XXXXX
|
||||
.It 0
|
||||
(default) all hardware rings
|
||||
.It NETMAP_SW_RING
|
||||
.It NR_REG_ALL_NIC "netmap:foo"
|
||||
(default) all hardware ring pairs
|
||||
.It NR_REG_SW_NIC "netmap:foo^"
|
||||
the ``host rings'', connecting to the host stack.
|
||||
.It NETMAP_HW_RING | i
|
||||
the i-th hardware ring .
|
||||
.It NR_RING_NIC_SW "netmap:foo+"
|
||||
all hardware rings and the host rings
|
||||
.It NR_REG_ONE_NIC "netmap:foo-i"
|
||||
only the i-th hardware ring pair, where the number is in
|
||||
.Pa nr_ringid ;
|
||||
.It NR_REG_PIPE_MASTER "netmap:foo{i"
|
||||
the master side of the netmap pipe whose identifier (i) is in
|
||||
.Pa nr_ringid ;
|
||||
.It NR_REG_PIPE_SLAVE "netmap:foo}i"
|
||||
the slave side of the netmap pipe whose identifier (i) is in
|
||||
.Pa nr_ringid .
|
||||
.Pp
|
||||
The identifier of a pipe must be thought of as part of the pipe name,
|
||||
and does not need to be sequential. On return the pipe
|
||||
will only have a single ring pair with index 0,
|
||||
irrespective of the value of i.
|
||||
.El
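The naming scheme listed above can be sketched as a small parser that turns a port name into a mode plus a ring or pipe identifier. This is only an illustration: the enum reg_mode values, struct port_sel and parse_port_suffix() are hypothetical names local to this example, not the definitions from the netmap headers, and the library's own parser may behave differently (for instance on interface names that contain one of the separator characters).

```c
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Local stand-ins for the register modes; names and values are illustrative. */
enum reg_mode {
	REG_ALL_NIC, REG_SW, REG_NIC_SW,
	REG_ONE_NIC, REG_PIPE_MASTER, REG_PIPE_SLAVE
};

struct port_sel {
	enum reg_mode mode;
	uint16_t ringid;	/* ring index or pipe identifier */
};

/* Parse "netmap:foo", "netmap:foo^", "netmap:foo+", "netmap:foo-i",
 * "netmap:foo{i" or "netmap:foo}i". Returns 0 on success, -1 on error. */
static int
parse_port_suffix(const char *name, struct port_sel *out)
{
	size_t pos;
	char *end;
	unsigned long v;

	if (strncmp(name, "netmap:", 7) != 0)
		return -1;
	name += 7;
	pos = strcspn(name, "^+-{}");	/* length of the bare port name */
	if (pos == 0)
		return -1;		/* empty port name */
	out->ringid = 0;
	switch (name[pos]) {
	case '\0':
		out->mode = REG_ALL_NIC;
		return 0;
	case '^':
		out->mode = REG_SW;
		return name[pos + 1] == '\0' ? 0 : -1;
	case '+':
		out->mode = REG_NIC_SW;
		return name[pos + 1] == '\0' ? 0 : -1;
	case '-':
		out->mode = REG_ONE_NIC;
		break;
	case '{':
		out->mode = REG_PIPE_MASTER;
		break;
	default:			/* '}' */
		out->mode = REG_PIPE_SLAVE;
		break;
	}
	/* The remaining modes require a decimal identifier. */
	if (!isdigit((unsigned char)name[pos + 1]))
		return -1;
	v = strtoul(name + pos + 1, &end, 10);
	if (*end != '\0' || v > UINT16_MAX)
		return -1;
	out->ringid = (uint16_t)v;
	return 0;
}
```

In this sketch the identifier is stored in a 16-bit field to mirror the width of nr_ringid shown in struct nmreq.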
|
||||
.Pp
|
||||
By default, a
|
||||
@ -579,7 +658,7 @@ number of slots available for transmission.
|
||||
tells the hardware of consumed packets, and asks for newly available
|
||||
packets.
|
||||
.El
|
||||
.Sh SELECT AND POLL
|
||||
.Sh SELECT, POLL, EPOLL, KQUEUE
|
||||
.Xr select 2
|
||||
and
|
||||
.Xr poll 2
|
||||
@ -588,16 +667,26 @@ on a
|
||||
file descriptor process rings as indicated in
|
||||
.Sx TRANSMIT RINGS
|
||||
and
|
||||
.Sx RECEIVE RINGS
|
||||
when write (POLLOUT) and read (POLLIN) events are requested.
|
||||
.Sx RECEIVE RINGS ,
|
||||
respectively when write (POLLOUT) and read (POLLIN) events are requested.
|
||||
Both block if no slots are available in the ring
|
||||
.Va ( ring->cur == ring->tail ) .
|
||||
Depending on the platform,
|
||||
.Xr epoll 2
|
||||
and
|
||||
.Xr kqueue 2
|
||||
are supported too.
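The blocking condition above (cur equal to tail) is the empty case of a simple circular-index computation. The sketch below shows that arithmetic on a stand-in structure; struct ring_idx and ring_space() are names invented for this example, and a real client would read the same fields from struct netmap_ring.

```c
#include <stdint.h>

/* Stand-in for the index fields of a ring; illustrative only. */
struct ring_idx {
	uint32_t cur;		/* next slot the application will use */
	uint32_t tail;		/* first slot owned by the kernel */
	uint32_t num_slots;	/* total slots in the ring */
};

/*
 * Number of slots the application may use without blocking:
 * the distance from cur to tail, going forward around the ring.
 * A result of 0 corresponds to cur == tail.
 */
static uint32_t
ring_space(const struct ring_idx *r)
{
	int64_t d = (int64_t)r->tail - (int64_t)r->cur;

	if (d < 0)
		d += r->num_slots;	/* tail has wrapped past the end */
	return (uint32_t)d;
}
```

A poll loop would typically call a function like this after each wakeup and process that many slots before sleeping again.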
|
||||
.Pp
|
||||
Both block if no slots are available in the ring (
|
||||
.Va ring->cur == ring->tail )
|
||||
.Pp
|
||||
Packets in transmit rings are normally pushed out even without
|
||||
Packets in transmit rings are normally pushed out
|
||||
(and buffers reclaimed) even without
|
||||
requesting write events. Passing the NETMAP_NO_TX_SYNC flag to
|
||||
.Em NIOCREGIF
|
||||
disables this feature.
|
||||
By default, receive rings are processed only if read
|
||||
events are requested. Passing the NETMAP_DO_RX_SYNC flag to
|
||||
.Em NIOCREGIF
updates receive rings even without read events.
|
||||
Note that on epoll and kqueue, NETMAP_NO_TX_SYNC and NETMAP_DO_RX_SYNC
|
||||
only have an effect when some event is posted for the file descriptor.
|
||||
.Sh LIBRARIES
|
||||
The
|
||||
.Nm
|
||||
@ -620,7 +709,7 @@ before
|
||||
.Pp
|
||||
The following functions are available:
|
||||
.Bl -tag -width XXXXX
|
||||
.It Va struct nm_desc_t * nm_open(const char *ifname, const char *ring_name, int flags, int ring_flags)
|
||||
.It Va struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg)
|
||||
similar to
|
||||
.Xr pcap_open ,
|
||||
binds a file descriptor to a port.
|
||||
@ -629,26 +718,36 @@ binds a file descriptor to a port.
|
||||
is a port name, in the form "netmap:XXX" for a NIC and "valeXXX:YYY" for a
|
||||
.Nm VALE
|
||||
port.
|
||||
.It Va req
|
||||
provides the initial values for the argument to the NIOCREGIF ioctl.
|
||||
The nr_flags and nr_ringid values are overwritten by parsing
|
||||
ifname and flags, and other fields can be overridden through
|
||||
the other two arguments.
|
||||
.It Va arg
|
||||
points to a struct nm_desc containing arguments (e.g. from a previously
|
||||
open file descriptor) that should override the defaults.
|
||||
The fields are used as described below.
|
||||
.It Va flags
|
||||
can be set to
|
||||
.Va NETMAP_SW_RING
|
||||
to bind to the host ring pair,
|
||||
or to NETMAP_HW_RING to bind to a specific ring.
|
||||
.Va ring_name
|
||||
with NETMAP_HW_RING,
|
||||
is interpreted as a string or an integer indicating the ring to use.
|
||||
.It Va ring_flags
|
||||
is copied directly into the ring flags, to specify additional parameters
|
||||
such as NR_TIMESTAMP or NR_FORWARD.
|
||||
can be set to a combination of the following flags:
|
||||
.Va NETMAP_NO_TX_POLL ,
|
||||
.Va NETMAP_DO_RX_POLL
|
||||
(copied into nr_ringid);
|
||||
.Va NM_OPEN_NO_MMAP (if arg points to the same memory region,
|
||||
avoids the mmap and uses the values from it);
|
||||
.Va NM_OPEN_IFNAME (ignores ifname and uses the values in arg);
|
||||
.Va NM_OPEN_ARG1 ,
|
||||
.Va NM_OPEN_ARG2 ,
|
||||
.Va NM_OPEN_ARG3 (uses the fields from arg);
|
||||
.Va NM_OPEN_RING_CFG (uses the ring number and sizes from arg).
|
||||
.El
|
||||
.It Va int nm_close(struct nm_desc_t *d)
|
||||
.It Va int nm_close(struct nm_desc *d)
|
||||
closes the file descriptor, unmaps memory, frees resources.
|
||||
.It Va int nm_inject(struct nm_desc_t *d, const void *buf, size_t size)
|
||||
.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size)
|
||||
similar to pcap_inject(), pushes a packet to a ring, returns the size
|
||||
of the packet if successful, or 0 on error;
|
||||
.It Va int nm_dispatch(struct nm_desc_t *d, int cnt, nm_cb_t cb, u_char *arg)
|
||||
.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg)
|
||||
similar to pcap_dispatch(), applies a callback to incoming packets
|
||||
.It Va u_char * nm_nextpkt(struct nm_desc_t *d, struct nm_hdr_t *hdr)
|
||||
.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr)
|
||||
similar to pcap_next(), fetches the next packet
|
||||
.Pp
|
||||
.El
|
||||
@ -740,9 +839,11 @@ performance.
|
||||
.Sh SYSTEM CALLS
|
||||
.Nm
|
||||
uses
|
||||
.Xr select 2
|
||||
.Xr select 2 ,
|
||||
.Xr poll 2 ,
|
||||
.Xr epoll 2
|
||||
and
|
||||
.Xr poll 2
|
||||
.Xr kqueue 2
|
||||
to wake up processes when significant events occur, and
|
||||
.Xr mmap 2
|
||||
to map memory.
|
||||
@ -872,10 +973,10 @@ A simple receiver can be implemented using the helper functions
|
||||
...
|
||||
void receiver(void)
|
||||
{
|
||||
struct nm_desc_t *d;
|
||||
struct nm_desc *d;
|
||||
struct pollfd fds;
|
||||
u_char *buf;
|
||||
struct nm_hdr_t h;
|
||||
struct nm_pkthdr h;
|
||||
...
|
||||
d = nm_open("netmap:ix0", NULL, 0, 0);
|
||||
fds.fd = NETMAP_FD(d);
|
||||
@ -910,6 +1011,13 @@ to replenish the receive ring:
|
||||
...
|
||||
.Ed
|
||||
.Ss ACCESSING THE HOST STACK
|
||||
The host stack is for all practical purposes just a regular ring pair,
|
||||
which you can access with the netmap API, e.g. with
.Dl nm_open("netmap:eth0^", ... ) ;
|
||||
All packets that the host would send to an interface in
|
||||
.Nm
|
||||
mode end up in the RX ring, whereas all packets queued to the
TX ring are sent up to the host stack.
|
||||
.Ss VALE SWITCH
|
||||
A simple way to test the performance of a
|
||||
.Nm VALE
|
||||
@ -917,6 +1025,10 @@ switch is to attach a sender and a receiver to it,
|
||||
e.g. running the following in two different terminals:
|
||||
.Dl pkt-gen -i vale1:a -f rx # receiver
|
||||
.Dl pkt-gen -i vale1:b -f tx # sender
|
||||
The same example can be used to test netmap pipes, by simply
|
||||
changing port names, e.g.
|
||||
.Dl pkt-gen -i vale:x{3 -f rx # receiver on the master side
|
||||
.Dl pkt-gen -i vale:x}3 -f tx # sender on the slave side
|
||||
.Pp
|
||||
The following command attaches an interface and the host stack
|
||||
to a switch:
|
||||
@ -935,6 +1047,14 @@ Communications of the ACM, 55 (3), pp.45-51, March 2012
|
||||
.Pp
|
||||
Luigi Rizzo, netmap: a novel framework for fast packet I/O,
|
||||
Usenix ATC'12, June 2012, Boston
|
||||
.Pp
|
||||
Luigi Rizzo, Giuseppe Lettieri,
|
||||
VALE, a switched ethernet for virtual machines,
|
||||
ACM CoNEXT'12, December 2012, Nice
|
||||
.Pp
|
||||
Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione,
|
||||
Speeding up packet I/O in virtual machines,
|
||||
ACM/IEEE ANCS'13, October 2013, San Jose
|
||||
.Sh AUTHORS
|
||||
.An -nosplit
|
||||
The
|
||||
@ -953,20 +1073,3 @@ and
|
||||
.Nm VALE
|
||||
have been funded by the European Commission within FP7 Projects
|
||||
CHANGE (257422) and OPENLAB (287581).
|
||||
.Pp
|
||||
.Ss SPECIAL MODES
|
||||
When the device name has the form
|
||||
.Dl valeXXX:ifname (ifname is an existing interface)
|
||||
the physical interface
|
||||
(and optionally the corresponding host stack endpoint)
|
||||
are connected or disconnected from the
|
||||
.Nm VALE
|
||||
switch named XXX.
|
||||
In this case the
|
||||
.Pa ioctl()
|
||||
is used only for configuration, typically through the
|
||||
.Xr vale-ctl
|
||||
command.
|
||||
The file descriptor cannot be used for I/O, and should be
|
||||
closed after issuing the
|
||||
.Pa ioctl() .
|
||||