ignore the size of any headers that were passed with the sendfile(2)
system call. Otherwise the file sent will be truncated by the header
size if the nbytes parameter was provided. The bug doesn't show up
when either nbytes is zero, meaning send the whole file, or no header
iovec is provided.
Resolve a potential error aliasing of errors from the VM and sf_buf
parts and the protocol send parts where an error of the latter over-
writes one of the former.
Update comments.
The byte accounting bug wasn't seen in earlier because none of the popular
sendfile(2) consumers, Apache, lighttpd and our ftpd(8) use it in modes
that trigger it. The varnish HTTP proxy makes full use of it and exposed
the problem.
Bug found by: phk
Tested by: phk
on each socket buffer with the socket buffer's mutex. This sleep lock is
used to serialize I/O on sockets in order to prevent I/O interlacing.
This change replaces the custom sleep lock with an sx(9) lock, which
results in marginally better performance, better handling of contention
during simultaneous socket I/O across multiple threads, and a cleaner
separation between the different layers of locking in socket buffers.
Specifically, the socket buffer mutex is now solely responsible for
serializing simultaneous operation on the socket buffer data structure,
and not for I/O serialization.
While here, fix two historic bugs:
(1) a bug allowing I/O to be occasionally interlaced during long I/O
operations (discovere by Isilon).
(2) a bug in which failed non-blocking acquisition of the socket buffer
I/O serialization lock might be ignored (discovered by sam).
SCTP portion of this patch submitted by rrs.
When nbytes=0, sendfile(2) should use file size. Because of the bug, it
was sending half of a file. The bug is that 'off' variable can't be used
for size calculation, because it changes inside the loop, so we should
use uap->offset instead.
and flags with an sxlock. This leads to a significant and measurable
performance improvement as a result of access to shared locking for
frequent lookup operations, reduced general overhead, and reduced overhead
in the event of contention. All of these are imported for threaded
applications where simultaneous access to a shared file descriptor array
occurs frequently. Kris has reported 2x-4x transaction rate improvements
on 8-core MySQL benchmarks; smaller improvements can be expected for many
workloads as a result of reduced overhead.
- Generally eliminate the distinction between "fast" and regular
acquisisition of the filedesc lock; the plan is that they will now all
be fast. Change all locking instances to either shared or exclusive
locks.
- Correct a bug (pointed out by kib) in fdfree() where previously msleep()
was called without the mutex held; sx_sleep() is now always called with
the sxlock held exclusively.
- Universally hold the struct file lock over changes to struct file,
rather than the filedesc lock or no lock. Always update the f_ops
field last. A further memory barrier is required here in the future
(discussed with jhb).
- Improve locking and reference management in linux_at(), which fails to
properly acquire vnode references before using vnode pointers. Annotate
improper use of vn_fullpath(), which will be replaced at a future date.
In fcntl(), we conservatively acquire an exclusive lock, even though in
some cases a shared lock may be sufficient, which should be revisited.
The dropping of the filedesc lock in fdgrowtable() is no longer required
as the sxlock can be held over the sleep operation; we should consider
removing that (pointed out by attilio).
Tested by: kris
Discussed with: jhb, kris, attilio, jeff
- Close the new file objects created during socketpair() if the copyout of
the new file descriptors fails.
- Add a test to the socketpair regression test for this edge case.
- Remove also "MP SAFE" after prior "MPSAFE" pass. (suggested by bde)
- Remove extra blank lines in some cases.
- Add extra blank lines in some cases.
- Remove no-op comments consisting solely of the function name, the word
"syscall", or the system call name.
- Add punctuation.
- Re-wrap some comments.
system calls now enter without Giant held, and then in some cases, acquire
Giant explicitly.
Remove a number of other MPSAFE annotations in the credential code and
tweak one or two other adjacent comments.
written to the socket). The rewrite in revision 1.240 got confused by the
FreeBSD 4.x bug compatibility code.
For some reason lighttpd, that was used for testing the new sendfile code,
was not affected by the problem but apache and others using headers/trailers
in the sendfile call received incorrect sbytes values after return from non-
blocking sockets. This then lead to restarts with wrong offsets and thus
mixed up file contents when the socket was writeable again. All programs
not using headers/trailers, like ftpd, were not affected by the bug.
Reported by: Pawel Worach <pawel.worach-at-gmail.com>
Tested by: Pawel Worach <pawel.worach-at-gmail.com>
label after the sbunlock() part.
This correctly handles calls to sendfile(2) without valid parameters
that was broken in rev. 1.240.
Coverity error: 272162
work is not just mine, but it is also the works of Peter Lei
and Michael Tuexen. They both are my two key other developers
working on the project.. and they need ata-boy's too:
****
peterlei@cisco.comtuexen@fh-muenster.de
****
I did do a make sysent which updated the
syscall's and sysproto.. I hope that is correct... without
it you don't build since we have new syscalls for SCTP :-0
So go out and look at the NOTES, add
option SCTP (make sure inet and inet6 are present too)
and play with SCTP.
I will see about comitting some test tools I have after I
figure out where I should place them. I also have a
lib (libsctp.a) that adds some of the missing socketapi
functions that I need to put into lib's.. I will talk
to George about this :-)
There may still be some 64 bit issues in here, none of
us have a 64 bit processor to test with yet.. Michael
may have a MAC but thats another beast too..
If you have a mac and want to use SCTP contact Michael
he maintains a web site with a loadable module with
this code :-)
Reviewed by: gnn
Approved by: gnn
mbuf clusters. Add a flags parameter to accept M_PKTHDR and M_EOR mbuf
chain flags. Provide compatibility macro for m_getm() calling m_getm2()
with M_PKTHDR set.
Rewrite m_uiotombuf() to use m_getm2() for mbuf allocation and do the
uiomove() in a tight loop over the mbuf chain. Add a flags parameter to
accept mbuf flags to be passed to m_getm2(). Adjust all callers for the
extra parameter.
Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 month
VM pages into mbufs as it can -- up to the free send socket buffer space.
The outer loop then drops the whole mbuf chain into the send socket buffer,
calls tcp_output() on it and then waits until 50% of the socket buffer are
free again to repeat the cycle. This way tcp_output() gets the full amount
of data to work with and can issue up to 64K sends for TSO to chop up in
the network adapter without using any CPU cycles. Thus it gets very efficient
especially with the readahead the VM and I/O system do.
The previous sendfile(2) code simply looped over the file, turned each 4K
page into an mbuf and sent it off. This had the effect that TSO could only
generate 2 packets per send instead of up to 44 at its maximum of 64K.
Add experimental SF_MNOWAIT flag to sendfile(2) to return ENOMEM instead of
sleeping on mbuf allocation failures.
Benchmarking shows significant improvements (95% confidence):
45% less cpu (or 1.81 times better) with new sendfile vs. old sendfile (non-TSO)
83% less cpu (or 5.7 times better) with new sendfile vs. old sendfile (TSO)
(Sender AMD Opteron 852 (2.6GHz) with em(4) PCI-X-133 interface and receiver
DELL Poweredge SC1425 P-IV Xeon 3.2GHz with em(4) LOM connected back to back
at 1000Base-TX full duplex.)
Sponsored by: TCP/IP Optimization Fundraise 2005
MFC after: 3 month
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.
This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.
Obtained from: TrustedBSD Project
Sponsored by: SPARTA
synchronized by the lock on the object containing the page.
Transition PG_WANTED and PG_SWAPINPROG to use the new field,
eliminating the need for holding the page queues lock when setting
or clearing these flags. Rename PG_WANTED and PG_SWAPINPROG to
VPO_WANTED and VPO_SWAPINPROG, respectively.
Eliminate the assertion that the page queues lock is held in
vm_page_io_finish().
Eliminate the acquisition and release of the page queues lock
around calls to vm_page_io_finish() in kern_sendfile() and
vfs_unbusy_pages().
kern_accept() and accept1(). If another thread closed the new file
descriptor and the first thread later got an error trying to copyout the
socket address, then it would attempt to close the wrong file object. To
fix, add a struct file ** argument to kern_accept(). If it is non-NULL,
then on success kern_accept() will store a pointer to the new file object
there and not release any of the references. It is up to the calling code
to drop the references appropriately (including a call to fdclose() in case
of error to safely handle the aforementioned race). While I'm at it, go
ahead and fix the svr4 streams code to not leak the accept fd if it gets an
error trying to copyout the streams structures.
soreceive(), and sopoll(), which are wrappers for pru_sosend,
pru_soreceive, and pru_sopoll, and are now used univerally by socket
consumers rather than either directly invoking the old so*() functions
or directly invoking the protocol switch method (about an even split
prior to this commit).
This completes an architectural change that was begun in 1996 to permit
protocols to provide substitute implementations, as now used by UDP.
Consumers now uniformly invoke sosend(), soreceive(), and sopoll() to
perform these operations on sockets -- in particular, distributed file
systems and socket system calls.
Architectural head nod: sam, gnn, wollman
use by ABI emulators.
- Alter the interface of kern_recvit() somewhat. Specifically, go ahead
and hard code UIO_USERSPACE in the uio as that's what all the callers
specify. In place, add a new uioseg to indicate what type of pointer
is in mp->msg_name. Previously it was always a userland address, but
ABI emulators may pass in kernel-side sockaddrs. Also, remove the
namelenp field and instead require the two places that used it to
explicitly copy mp->msg_namelen out to userland.
- Use the patched kern_recvit() to replace svr4_recvit() and the stock
kern_sendit() to replace svr4_sendit().
- Use kern_bind() instead of stackgap use in ti_bind().
- Use kern_getpeername() and kern_getsockname() instead of stackgap in
svr4_stream_ti_ioctl().
- Use kern_connect() instead of stackgap in svr4_do_putmsg().
- Use kern_getpeername() and kern_accept() instead of stackgap in
svr4_do_getmsg().
- Retire the stackgap from SVR4 compat as it is no longer used.
in setsockopt so that they can be compared correctly against negative
values. Passing in a negative value had a rather negative effect
on our socket code, making it impossible to open new sockets.
PR: 98858
Submitted by: James.Juran@baesystems.com
MFC after: 1 week
- Move sonewconn(), which creates new sockets for incoming connections on
listen sockets, so that all socket allocate code is together in
uipc_socket.c.
- Move 'maxsockets' and associated sysctls to uipc_socket.c with the
socket allocation code.
- Move kern.ipc sysctl node to uipc_socket.c, add a SYSCTL_DECL() for it
to sysctl.h and remove lots of scattered implementations in various
IPC modules.
- Sort sodealloc() after soalloc() in uipc_socket.c for dependency order
reasons. Statisticize soalloc() and sodealloc() as they are now
required only in uipc_socket.c, and are internal to the socket
implementation.
After this change, socket allocation and deallocation is entirely
centralized in one file, and uipc_socket2.c consists entirely of socket
buffer manipulation and default protocol switch functions.
MFC after: 1 month
sendfile(). This causes sendfile() to use the file descriptor
reference to the socket instead of bumping the socket reference
count, which avoids an additional refcount operation, as well as a
potential expensive socket refcount drop, which can lead to
contention on the accept mutex. This change also has the side
effect of further reducing the number of cases where an in-progress
I/O operation can occur on a socket after close, as using the file
descriptor refcount prevents the socket from closing while in use.
MFC after: 3 months
file lock, in the style of fgetsock().
Modify accept1() to use getsock() instead of fgetsock(), relying on the
file descriptor reference rather than an acquired socket reference to
prevent the listen socket from being destroyed during accept(). This
avoids additional reference count operations, which should improve
performance, and also avoids accept1() operating on a socket whose file
descriptor has been torn down, which may have resulted in protocol
shutdown starting.
MFC after: 3 months
acquiring Giant in kern_sendfile().
Guard against the forced reclamation of a vnode in kern_sendfile().
Discussed with: jeff
Reviewed by: tegge
MFC after: 3 weeks
which is invoked from socket() and socketpair(), permitting MAC
policy modules to control the creation of sockets by domain, type, and
protocol.
Obtained from: TrustedBSD Project
Sponsored by: SPARTA, SPAWAR
Approved by: re (scottl)
Requested by: SCC
to the mbuf. Offset cannot exceed MHLEN bytes. This is currently used to
fix Ethernet header alignment problem on alpha and sparc64. Also change all
users of m_uiotombuf to pass proper offset.
Reviewed by: jmg, sam
Tested by: Sten Spans "sten AT blinkenlights DOT nl"
MFC after: 1 week
control socket poll() (select()), fstat(), and accept() operations,
required for some policies:
poll() mac_check_socket_poll()
fstat() mac_check_socket_stat()
accept() mac_check_socket_accept()
Update mac_stub and mac_test policies to be aware of these entry points.
While here, add missing entry point implementations for:
mac_stub.c stub_check_socket_receive()
mac_stub.c stub_check_socket_send()
mac_test.c mac_test_check_socket_send()
mac_test.c mac_test_check_socket_visible()
Obtained from: TrustedBSD Project
Sponsored by: SPAWAR, SPARTA
SIGPIPE signal for the duration of the sento-family syscalls. Use it to
replace previously added hack in Linux layer based on temporarily setting
SO_NOSIGPIPE flag.
Suggested by: alfred
soref() to also covering the update of so_state. While no other user
threads can update the socket state here as it's not yet hooked up to
the file descriptor array yet, the protocol could also frob the
socket state here, leading to a lost update to the so_state field.
No reported instances of this bug (as yet).
MFC after: 3 days
Use this in all the places where sleeping with the lock held is not
an issue.
The distinction will become significant once we finalize the exact
lock-type to use for this kind of case.
Change the spelling of the "catch" option to be consistent with the new
options. Implement the "no wait" option. An implementation of the "CPU
private" for i386 will be committed at a later date.