.\" Copyright (c) 2003, David G. Lawrence
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\" notice unmodified, this list of conditions, and the following
.\" disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" $FreeBSD$
.\"
.Dd January 7, 2016
.Dt SENDFILE 2
.Os
.Sh NAME
.Nm sendfile
.Nd send a file to a socket
.Sh LIBRARY
.Lb libc
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In sys/uio.h
.Ft int
.Fo sendfile
.Fa "int fd" "int s" "off_t offset" "size_t nbytes"
.Fa "struct sf_hdtr *hdtr" "off_t *sbytes" "int flags"
.Fc
.Sh DESCRIPTION
The
.Fn sendfile
system call
sends a regular file or shared memory object specified by descriptor
.Fa fd
out a stream socket specified by descriptor
.Fa s .
.Pp
The
.Fa offset
argument specifies where to begin in the file.
Should
.Fa offset
fall beyond the end of file, the system will return
success and report 0 bytes sent as described below.
The
.Fa nbytes
argument specifies how many bytes of the file should be sent,
with 0 meaning that data is sent until the end of file is reached.
.Pp
An optional header and/or trailer can be sent before and after the file data by specifying
a pointer to a
.Vt "struct sf_hdtr" ,
which has the following structure:
.Pp
.Bd -literal -offset indent -compact
struct sf_hdtr {
    struct iovec *headers;   /* pointer to header iovecs */
    int hdr_cnt;             /* number of header iovecs */
    struct iovec *trailers;  /* pointer to trailer iovecs */
    int trl_cnt;             /* number of trailer iovecs */
};
.Ed
.Pp
The
.Fa headers
and
.Fa trailers
pointers, if
.Pf non- Dv NULL ,
point to arrays of
.Vt "struct iovec"
structures.
See
.Xr writev 2
for information on the iovec structure.
The number of iovecs in these
arrays is specified by
.Fa hdr_cnt
and
.Fa trl_cnt .
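.Pp
As a minimal sketch of the header and trailer mechanism described above, the following C fragment wraps two strings in iovecs and passes them through struct sf_hdtr.
The helper names iov_from_str, iov_total and send_framed are hypothetical and not part of any library.
The sendfile call is guarded by __FreeBSD__ because the seven-argument form documented here is assumed to be available only there.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <string.h>

/* Hypothetical helper: describe a NUL-terminated string as one iovec. */
static struct iovec
iov_from_str(const char *s)
{
	struct iovec v;

	v.iov_base = (void *)s;		/* sendfile only reads from the buffer */
	v.iov_len = strlen(s);
	return (v);
}

/* Hypothetical helper: total number of bytes described by an iovec array. */
static size_t
iov_total(const struct iovec *iov, int cnt)
{
	size_t total = 0;

	for (int i = 0; i < cnt; i++)
		total += iov[i].iov_len;
	return (total);
}

#ifdef __FreeBSD__
/*
 * Send the whole file fd on the connected stream socket s, preceded by
 * head and followed by tail.  Returns what sendfile() returns.
 */
static int
send_framed(int fd, int s, const char *head, const char *tail, off_t *sent)
{
	struct iovec hv = iov_from_str(head);
	struct iovec tv = iov_from_str(tail);
	struct sf_hdtr hdtr = {
		.headers = &hv, .hdr_cnt = 1,
		.trailers = &tv, .trl_cnt = 1,
	};

	/* offset 0 and nbytes 0: start at the beginning, stop at end of file. */
	return (sendfile(fd, s, 0, 0, &hdtr, sent, 0));
}
#endif
```

Because sbytes reports the total number of bytes sent on the socket, iov_total may help separate framing bytes from file bytes when interpreting it.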
.Pp
If
.Fa sbytes
is
.Pf non- Dv NULL ,
the system will write the total number of bytes sent on the socket to the
variable it points to.
.Pp
The least significant 16 bits of the
.Fa flags
argument are a bitmap of these values:
.Bl -tag -offset indent
.It Dv SF_NODISKIO
This flag causes
.Nm
to return
.Er EBUSY
instead of blocking when a busy page is encountered.
This rare situation can happen if some other process is currently working
with the same region of the file.
It is advised to retry the operation after a short period.
.Pp
Note that in older
.Fx
versions
.Dv SF_NODISKIO
had a slightly different meaning.
The flag prevented
.Nm
from running I/O operations when an invalid (not cached) page was encountered,
thus avoiding blocking on I/O.
Starting with
.Fx 11 ,
.Nm
does not block on I/O when sending files off the
.Xr ffs 7
file system
.Po see
.Sx IMPLEMENTATION NOTES
.Pc ,
so the condition no longer applies.
However, it is safe for an application to use
.Dv SF_NODISKIO
and, on
.Er EBUSY ,
perform the same action as it did in older
.Fx
versions, e.g.,
.Xr aio_read 2 ,
.Xr read 2
or
.Nm
in a different context.
.It Dv SF_NOCACHE
The data sent to the socket will not be cached by the virtual memory system,
and will be freed directly to the pool of free pages.
.It Dv SF_SYNC
.Nm
sleeps until the network stack no longer references the VM pages
of the file, making subsequent modifications to it safe.
Please note that this is not a guarantee that the data has actually
been sent.
.El
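.Pp
The advice to retry after a short period when SF_NODISKIO yields EBUSY could be followed along these lines.
This is only a sketch: retry_while_busy, the send_op callback type and the flaky_op stand-in are hypothetical names, and the callback is assumed to wrap the real call and to resume from wherever the previous attempt stopped, since the ERRORS section notes that partial data may have been sent.

```c
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback: returns 0 on success, -1 with errno set. */
typedef int (*send_op)(void *ctx);

/*
 * Call op until it succeeds, fails with something other than EBUSY,
 * or max_tries attempts have been made.  delay_ns must be below one
 * second; it is the pause between attempts.
 */
static int
retry_while_busy(send_op op, void *ctx, int max_tries, long delay_ns)
{
	struct timespec ts = { .tv_sec = 0, .tv_nsec = delay_ns };

	for (int i = 0; i < max_tries; i++) {
		if (op(ctx) == 0)
			return (0);
		if (errno != EBUSY)
			return (-1);
		nanosleep(&ts, NULL);	/* short pause before the next attempt */
	}
	errno = EBUSY;			/* still busy after max_tries attempts */
	return (-1);
}

/* Stand-in operation for demonstration: busy a fixed number of times. */
struct flaky {
	int busy_left;	/* how many more calls should report EBUSY */
	int calls;	/* how many times the operation was invoked */
};

static int
flaky_op(void *ctx)
{
	struct flaky *f = ctx;

	f->calls++;
	if (f->busy_left > 0) {
		f->busy_left--;
		errno = EBUSY;
		return (-1);
	}
	return (0);
}
```

In a real program the callback would issue the send with SF_NODISKIO set and keep its own record of the offset reached so far.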
.Pp
The most significant 16 bits of
.Fa flags
specify the number of pages that
.Nm
may read ahead when reading the file.
The
.Fn SF_FLAGS
macro is provided to combine the readahead amount and the flags.
The following example specifies a readahead of 16 pages and the
.Dv SF_NOCACHE
flag:
.Pp
.Bd -literal -offset indent -compact
SF_FLAGS(16, SF_NOCACHE)
.Ed
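.Pp
The 16/16 bit split described above can be illustrated with a short C sketch.
On FreeBSD the names are expected to come from the headers listed in the SYNOPSIS; the fallback definitions below are assumptions for building elsewhere.
The fallback SF_FLAGS expression simply follows the layout stated in the text, the fallback SF_NOCACHE value is illustrative only, and flags_readahead and flags_bitmap are hypothetical helpers.

```c
#include <sys/types.h>
#include <sys/socket.h>

/* ASSUMPTION: illustrative fallbacks, used only if the headers lack them. */
#ifndef SF_NOCACHE
#define SF_NOCACHE	0x00000010
#endif
#ifndef SF_FLAGS
#define SF_FLAGS(rh, flags)	(((rh) << 16) | (flags))
#endif

/* Readahead amount in pages: the most significant 16 bits of flags. */
static int
flags_readahead(int flags)
{
	return ((int)(((unsigned int)flags >> 16) & 0xffff));
}

/* Flag bitmap: the least significant 16 bits of flags. */
static int
flags_bitmap(int flags)
{
	return (flags & 0xffff);
}
```

Splitting a combined value back into its two halves this way can be handy when logging or validating the flags argument before a call.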
.Pp
When using a socket marked for non-blocking I/O,
.Fn sendfile
may send fewer bytes than requested.
In this case, the number of bytes successfully
written is returned in
.Fa *sbytes
(if specified),
and the error
.Er EAGAIN
is returned.
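.Pp
One way to resume after a short send on a non-blocking socket is to advance the file offset by the reported byte count and call again once the socket is writable.
The sketch below assumes no header or trailer is passed, so that sbytes counts file data only.
The struct xfer type and the xfer_advance and send_all functions are hypothetical names, and the loop itself is guarded by __FreeBSD__.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>

/* Hypothetical bookkeeping for a transfer that may take several calls. */
struct xfer {
	off_t	offset;		/* file offset for the next call */
	size_t	remaining;	/* bytes left; must be nonzero unless to_eof */
	int	to_eof;		/* nonzero: send until end of file */
};

/*
 * Account for the sbytes reported by one call.
 * Returns 1 if another call may be needed, 0 if a bounded transfer is done.
 */
static int
xfer_advance(struct xfer *x, off_t sbytes)
{
	x->offset += sbytes;
	if (x->to_eof)
		return (1);	/* completion is reported by sendfile() itself */
	if ((size_t)sbytes >= x->remaining) {
		x->remaining = 0;
		return (0);
	}
	x->remaining -= (size_t)sbytes;
	return (1);
}

#ifdef __FreeBSD__
/* Drive the non-blocking socket s until the transfer completes or fails. */
static int
send_all(int fd, int s, struct xfer *x)
{
	struct pollfd pfd = { .fd = s, .events = POLLOUT };

	for (;;) {
		off_t sbytes = 0;
		int rc = sendfile(fd, s, x->offset,
		    x->to_eof ? 0 : x->remaining, NULL, &sbytes, 0);
		int saved = errno;
		int more = xfer_advance(x, sbytes);

		if (rc == 0 || !more)
			return (0);
		if (saved != EAGAIN && saved != EINTR) {
			errno = saved;
			return (-1);
		}
		/* On EAGAIN, wait until the socket buffer has room again. */
		if (saved == EAGAIN && poll(&pfd, 1, -1) == -1 &&
		    errno != EINTR)
			return (-1);
	}
}
#endif
```

Keeping the bookkeeping in a separate function makes it easy to check the arithmetic without a socket; an event-driven server would likely replace the poll call with its own readiness notification.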
.Sh IMPLEMENTATION NOTES
The
.Fx
implementation of
.Fn sendfile
does not block on disk I/O when it sends a file off the
.Xr ffs 7
file system.
The syscall returns success before the actual I/O completes, and data
is put into the socket later unattended.
However, the order of data in the socket is preserved, so it is safe
to do further writes to the socket.
.Pp
The
.Fx
implementation of
.Fn sendfile
is
.Dq zero-copy ,
meaning that it has been optimized so that copying of the file data is avoided.
.Sh TUNING
On some architectures, this system call internally uses a special
.Fn sendfile
buffer
.Pq Vt "struct sf_buf"
to handle sending file data to the client.
If the sending socket is
blocking, and there are not enough
.Fn sendfile
buffers available,
.Fn sendfile
will block and report a state of
.Dq Li sfbufa .
If the sending socket is non-blocking and there are not enough
.Fn sendfile
buffers available, the call will block and wait for the
necessary buffers to become available before finishing the call.
.Pp
The number of
.Vt sf_buf Ns 's
allocated should be proportional to the number of nmbclusters used to
send data to a client via
.Fn sendfile .
Tune accordingly to avoid blocking!
Busy installations that make extensive use of
.Fn sendfile
may want to increase these values to be in line with their
.Va kern.ipc.nmbclusters
(see
.Xr tuning 7
for details).
.Pp
The number of
.Fn sendfile
buffers available is determined at boot time by either the
.Va kern.ipc.nsfbufs
.Xr loader.conf 5
variable or the
.Dv NSFBUFS
kernel configuration tunable.
The number of
.Fn sendfile
buffers scales with
.Va kern.maxusers .
The
.Va kern.ipc.nsfbufsused
and
.Va kern.ipc.nsfbufspeak
read-only
.Xr sysctl 8
variables show current and peak
.Fn sendfile
buffer usage respectively.
These values may also be viewed through
.Nm netstat Fl m .
.Pp
If a value of zero is reported for
.Va kern.ipc.nsfbufs ,
your architecture does not need to use
.Fn sendfile
buffers because their task can be efficiently performed
by the generic virtual memory structures.
.Sh RETURN VALUES
.Rv -std sendfile
.Sh ERRORS
.Bl -tag -width Er
.It Bq Er EAGAIN
The socket is marked for non-blocking I/O and not all data was sent due to
the socket buffer being filled.
If specified, the number of bytes successfully sent will be returned in
.Fa *sbytes .
.It Bq Er EBADF
The
.Fa fd
argument
is not a valid file descriptor.
.It Bq Er EBADF
The
.Fa s
argument
is not a valid socket descriptor.
.It Bq Er EBUSY
A busy page was encountered and
.Dv SF_NODISKIO
was specified.
Partial data may have been sent.
.It Bq Er EFAULT
An invalid address was specified for an argument.
.It Bq Er EINTR
A signal interrupted
.Fn sendfile
before it could be completed.
If specified, the number
of bytes successfully sent will be returned in
.Fa *sbytes .
.It Bq Er EINVAL
The
.Fa fd
argument
is not a regular file.
.It Bq Er EINVAL
The
.Fa s
argument
is not a
.Dv SOCK_STREAM
type socket.
.It Bq Er EINVAL
The
.Fa offset
argument
is negative.
.It Bq Er EIO
An error occurred while reading from
.Fa fd .
.It Bq Er ENOBUFS
The system was unable to allocate an internal buffer.
.It Bq Er ENOTCONN
The
.Fa s
argument
points to an unconnected socket.
.It Bq Er ENOTSOCK
The
.Fa s
argument
is not a socket.
.It Bq Er EOPNOTSUPP
The file system for descriptor
.Fa fd
does not support
.Fn sendfile .
.It Bq Er EPIPE
The socket peer has closed the connection.
.El
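.Pp
An application might group the error codes listed above by the reaction they call for.
The grouping below is one possible policy rather than anything mandated by the system, and the enum next_step type and classify_sendfile_errno function are hypothetical names.

```c
#include <errno.h>

/* Hypothetical reactions an application might choose from. */
enum next_step {
	STEP_RETRY,	/* transient: try again, resuming after *sbytes */
	STEP_FALLBACK,	/* sendfile unusable here: copy by other means */
	STEP_CLOSE,	/* connection is gone: drop the socket */
	STEP_FAIL	/* programming or I/O error: report it */
};

/* Map an errno value left by a failed call to a reaction. */
static enum next_step
classify_sendfile_errno(int err)
{
	switch (err) {
	case EAGAIN:	/* socket buffer full on a non-blocking socket */
	case EBUSY:	/* busy page encountered with SF_NODISKIO */
	case EINTR:	/* interrupted by a signal */
		return (STEP_RETRY);
	case EOPNOTSUPP:	/* file system does not support sendfile */
		return (STEP_FALLBACK);
	case EPIPE:	/* peer closed the connection */
	case ENOTCONN:	/* socket is not connected */
		return (STEP_CLOSE);
	default:	/* EBADF, EFAULT, EINVAL, EIO, ENOBUFS, ENOTSOCK, ... */
		return (STEP_FAIL);
	}
}
```

Whether ENOBUFS belongs with the transient errors is a judgment call; it is left in the failure group here to keep the sketch conservative.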
.Sh SEE ALSO
.Xr netstat 1 ,
.Xr open 2 ,
.Xr send 2 ,
.Xr socket 2 ,
.Xr writev 2 ,
.Xr tuning 7
.Rs
.%A K. Elmeleegy
.%A A. Chanda
.%A A. L. Cox
.%A W. Zwaenepoel
.%T A Portable Kernel Abstraction for Low-Overhead Ephemeral Mapping Management
.%J The Proceedings of the 2005 USENIX Annual Technical Conference
.%P pp 223-236
.%D 2005
.Re
.Sh HISTORY
The
.Fn sendfile
system call
first appeared in
.Fx 3.0 .
This manual page first appeared in
.Fx 3.1 .
Support for sending shared memory descriptors was introduced in
.Fx 10 .
A non-blocking implementation was introduced in
.Fx 11 .
.Sh AUTHORS
The initial implementation of the
.Fn sendfile
system call
and this manual page were written by
.An David G. Lawrence Aq Mt dg@dglawrence.com .
The
.Fx 11
implementation was written by
.An Gleb Smirnoff Aq Mt glebius@FreeBSD.org .