mdoc(7) police: overhaul.
Approved by: re
parent f93e0dc3b7
commit 5fe302ba72
@@ -32,13 +32,16 @@
 .Nm zero_copy ,
 .Nm zero_copy_sockets
 .Sh SYNOPSIS
-.Cd options ZERO_COPY_SOCKETS
+.Cd "options ZERO_COPY_SOCKETS"
 .Sh DESCRIPTION
-The FreeBSD kernel includes a facility for eliminating data copies on
+The
+.Fx
+kernel includes a facility for eliminating data copies on
 socket reads and writes.
 .Pp
 This code is collectively known as the zero copy sockets code, because during
-normal network I/O, data will not be copied by the CPU at all. Rather it
+normal network I/O, data will not be copied by the CPU at all.
+Rather it
 will be DMAed from the user's buffer to the NIC (for sends), or DMAed from
 the NIC to a buffer that will then be given to the user (receives).
 .Pp
@@ -48,60 +51,79 @@ programmers should be aware of when trying to take advantage of this
 functionality.
 .Pp
 For sending data, there are no special requirements or capabilities that
-the sending NIC must have. The data written to the socket, though, must be
+the sending NIC must have.
+The data written to the socket, though, must be
 at least a page in size and page aligned in order to be mapped into the
-kernel. If it doesn't meet the page size and alignment constraints, it
+kernel.
+If it does not meet the page size and alignment constraints, it
 will be copied into the kernel, as is normally the case with socket I/O.
 .Pp
 The user should be careful not to overwrite buffers that have been written
 to the socket before the data has been freed by the kernel, and the
-copy-on-write mapping cleared. If a buffer is overwritten before it has
+copy-on-write mapping cleared.
+If a buffer is overwritten before it has
 been given up by the kernel, the data will be copied, and no savings in CPU
 utilization and memory bandwidth utilization will be realized.
 .Pp
 The
 .Xr socket 2
-API doesn't really give the user any indication of when his data has
+API does not really give the user any indication of when his data has
 actually been sent over the wire, or when the data has been freed from
-kernel buffers. For protocols like TCP, the data will be kept around in
+kernel buffers.
+For protocols like TCP, the data will be kept around in
 the kernel until it has been acknowledged by the other side; it must be
 kept until the acknowledgement is received in case retransmission is required.
 .Pp
 From an application standpoint, the best way to guarantee that the data has
 been sent out over the wire and freed by the kernel (for TCP-based sockets)
-is to set a socket buffer size (see the SO_SNDBUF socket option in the
+is to set a socket buffer size (see the
+.Dv SO_SNDBUF
+socket option in the
 .Xr setsockopt 2
 man page) appropriate for the application and network environment and then
 make sure you have sent out twice as much data as the socket buffer size
-before reusing a buffer. For TCP, the send and receive socket buffer sizes
-generally directly correspond to the TCP window size.
+before reusing a buffer.
+For TCP, the send and receive socket buffer sizes
+generally directly correspond to the TCP window size.
 .Pp
 For receiving data, in order to take advantage of the zero copy receive
 code, the user must have a NIC that is configured for an MTU greater than
-the architecture page size. (e.g., for alpha this would be 8KB, for i386,
-it would be 4KB) Additionally, in order for zero copy receive to work,
+the architecture page size.
+(E.g., for alpha this would be 8KB, for i386,
+it would be 4KB.)
+Additionally, in order for zero copy receive to work,
 packet payloads must be at least a page in size and page aligned.
 .Pp
 Achieving page aligned payloads requires a NIC that can split an incoming
-packet into multiple buffers. It also generally requires some sort of
+packet into multiple buffers.
+It also generally requires some sort of
 intelligence on the NIC to make sure that the payload starts in its own
-buffer. This is called "header splitting". Currently the only NICs with
+buffer.
+This is called
+.Dq "header splitting" .
+Currently the only NICs with
 support for header splitting are Alteon Tigon 2 based boards running
-slightly modified firmware. The FreeBSD
+slightly modified firmware.
+The
+.Fx
 .Xr ti 4
-driver includes modified firmware for Tigon 2 boards only. Header
+driver includes modified firmware for Tigon 2 boards only.
+Header
 splitting code can be written, however, for any NIC that allows putting
 received packets into multiple buffers and that has enough programability
 to determine that the header should go into one buffer and the payload into
 another.
 .Pp
-You can also do a form of header splitting that doesn't require any NIC
+You can also do a form of header splitting that does not require any NIC
 modifications if your NIC is at least capable of splitting packets into
-multiple buffers. This requires that you optimize the NIC driver for your
-most common packet header size. If that size (ethernet + IP + TCP headers)
+multiple buffers.
+This requires that you optimize the NIC driver for your
+most common packet header size.
+If that size (ethernet + IP + TCP headers)
 is generally 66 bytes, for instance, you would set the first buffer in a
 set for a particular packet to be 66 bytes long, and then subsequent
-buffers would be a page in size. For packets that have headers that are
+buffers would be a page in size.
+For packets that have headers that are
 exactly 66 bytes long, your payload will be page aligned.
 .Pp
 The other requirement for zero copy receive to work is that the buffer that
@@ -110,13 +132,15 @@ in size and page aligned.
 .Pp
 Obviously the requirements for receive side zero copy are impossible to
 meet without NIC hardware that is programmable enough to do header
-splitting of some sort. Since most NICs aren't that programmable, or their
-manufacturers won't share the source code to their firmware, this approach
-to zero copy receive isn't widely useful.
+splitting of some sort.
+Since most NICs are not that programmable, or their
+manufacturers will not share the source code to their firmware, this approach
+to zero copy receive is not widely useful.
 .Pp
 There are other approaches, such as RDMA and TCP Offload, that may
 potentially help alleviate the CPU overhead associated with copying data
-out of the kernel. Most known techniques require some sort of support at
+out of the kernel.
+Most known techniques require some sort of support at
 the NIC level to work, and describing such techniques is beyond the scope
 of this manual page.
 .Pp
@@ -128,15 +152,18 @@ and
 .Nm sysctl
 variables respectively.
 .Sh SEE ALSO
-.Xr socket 2 ,
+.Xr sendfile 2 ,
+.Xr socket 2 ,
 .Xr ti 4 ,
 .Xr jumbo 9
 .Sh HISTORY
-The zero copy sockets code first appeared in FreeBSD 5.0, although it has
+The zero copy sockets code first appeared in
+.Fx 5.0 ,
+although it has
 been in existence in patch form since at least mid-1999.
 .Sh AUTHORS
 .An -nosplit
 The zero copy sockets code was originally written by
 .An Andrew Gallatin Aq gallatin@FreeBSD.org
-and substantially modified and updated by
+and substantially modified and updated by
 .An Kenneth Merry Aq ken@FreeBSD.org .