Commit Graph

228 Commits

Author SHA1 Message Date
Pawel Biernacki
e8cdbb4815 sysctl: hide 2.x era compat node
r23081 introduced kern.dummy oid as a semi ABI compat for kern.maxsockbuf
that was moved to a new namespace.  It never functioned as an alias of any
kind and was just returning 0 unconditionally, hence it was probably
provided to keep some 3rd party programmes happy about sysctl(3) not
reporting an error because of non-existing oid.
After nearly 23 years it seems reasonable to just hide it from sysctl(8)
list not to cause unnecessary confusion as for its purpose.

Reported by:	Antranig Vartanian <antranigv@freebsd.am>
Reviewed by:	kib (mentor)
Approved by:	kib (mentor)
Differential Revision:	https://reviews.freebsd.org/D22982
2020-01-02 01:23:43 +00:00
John Baldwin
b2e60773c6 Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets.  KTLS only supports
offload of TLS for transmitted data.  Key negotation must still be
performed in userland.  Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option.  All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.

Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type.  Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.

At present, rekeying is not supported though the in-kernel framework
should support rekeying.

KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer.  Each TLS frame is described by a single
ext_pgs mbuf.  The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.

KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.

Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame().  ktls_enqueue() is then
called to schedule TLS frames for encryption.  In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed.  For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().

A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue().  Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.

(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)

KTLS supports pluggable software encryption backends.  Internally,
Netflix uses proprietary pure-software backends.  This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames.  As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.

Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready().  At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.

ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation.  In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session.  TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted.  The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface.  If so, the packet is tagged
with the TLS send tag and sent to the interface.  The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation.  If the the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped.  In addition, a task is scheduled to refresh the TLS send
tag for the TLS session.  If a new TLS send tag cannot be allocated,
the connection is dropped.  If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag.  (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another.  As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)

ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8).  ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.

Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option.  They can also use this socket
option to toggle between software and ifnet TLS modes.

In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax.  However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.

Various sysctls and counters are available under the kern.ipc.tls
sysctl node.  The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default).  The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.

KTLS is enabled via the KERN_TLS kernel option.

This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.

Reviewed by:	gallatin, hselasky, rrs
Obtained from:	Netflix
Sponsored by:	Netflix, Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D21277
2019-08-27 00:01:56 +00:00
John Baldwin
3807631b8e Compress pending socket buffer data once it is marked ready.
Apply similar logic from sbcompress to pending data in the socket
buffer once it is marked ready via sbready.  Normally sbcompress
merges small mbufs to reduce the length of mbuf chains in the socket
buffer.  However, sbcompress cannot do this for mbufs marked
M_NOTREADY.  sbcompress_ready is now called from sbready when mbufs
are marked ready to merge small mbuf chains once the data is available
to copy.

Submitted by:	gallatin (earlier version)
Reviewed by:	gallatin, hselasky, rrs
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20616
2019-06-29 00:50:25 +00:00
John Baldwin
82334850ea Add an external mbuf buffer type that holds multiple unmapped pages.
Unmapped mbufs allow sendfile to carry multiple pages of data in a
single mbuf, without mapping those pages.  It is a requirement for
Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web
serving workloads when used by sendfile, due to effectively
compressing socket buffers by an order of magnitude, and hence
reducing cache misses.

For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer
now points to a struct mbuf_ext_pgs structure instead of a data
buffer.  This structure contains an array of physical addresses (this
reduces cache misses compared to an earlier version that stored an
array of vm_page_t pointers).  It also stores additional fields needed
for in-kernel TLS such as the TLS header and trailer data that are
currently unused.  To more easily detect these mbufs, the M_NOMAP flag
is set in m_flags in addition to M_EXT.

Various functions like m_copydata() have been updated to safely access
packet contents (using uiomove_fromphys()), to make things like BPF
safe.

NIC drivers advertise support for unmapped mbufs on transmit via a new
IFCAP_NOMAP capability.  This capability can be toggled via the new
'nomap' and '-nomap' ifconfig(8) commands.  For NIC drivers that only
transmit packet contents via DMA and use bus_dma, adding the
capability to if_capabilities and if_capenable should be all that is
required.

If a NIC does not support unmapped mbufs, they are converted to a
chain of mapped mbufs (using sf_bufs to provide the mapping) in
ip_output or ip6_output.  If an unmapped mbuf requires software
checksums, it is also converted to a chain of mapped mbufs before
computing the checksum.

Submitted by:	gallatin (earlier version)
Reviewed by:	gallatin, hselasky, rrs
Discussed with:	ae, kp (firewalls)
Relnotes:	yes
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20616
2019-06-29 00:48:33 +00:00
Kevin Bowling
2a24f4d911 Retire sbsndptr() KPI
As of r340465 all consumers use sbsndptr_adv and sbsndptr_noadv

Reviewed by:	gallatin
Approved by:	krion (mentor)
Differential Revision:	https://reviews.freebsd.org/D17998
2018-11-19 00:54:31 +00:00
Mark Johnston
5b0480f2cc Don't check rcv sockbuf limits when sending on a unix stream socket.
sosend_generic() performs an initial comparison of the amount of data
(including control messages) to be transmitted with the send buffer
size. When transmitting on a unix socket, we then compare the amount
of data being sent with the amount of space in the receive buffer size;
if insufficient space is available, sbappendcontrol() returns an error
and the data is lost.  This is easily triggered by sending control
messages together with an amount of data roughly equal to the send
buffer size, since the control message size may change in uipc_send()
as file descriptors are internalized.

Fix the problem by removing the space check in sbappendcontrol(),
whose only consumer is the unix sockets code.  The stream sockets code
uses the SB_STOP mechanism to ensure that senders will block if the
receive buffer fills up.

PR:		181741
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16515
2018-08-04 20:26:54 +00:00
Mark Johnston
4765717321 Remove a redundant check.
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2018-07-30 17:58:41 +00:00
Randall Stewart
89e560f441 This commit brings in a new refactored TCP stack called Rack.
Rack includes the following features:
 - A different SACK processing scheme (the old sack structures are not used).
 - RACK (Recent acknowledgment) where counting dup-acks is no longer done
        instead time is used to knwo when to retransmit. (see the I-D)
 - TLP (Tail Loss Probe) where we will probe for tail-losses to attempt
        to try not to take a retransmit time-out. (see the I-D)
 - Burst mitigation using TCPHTPS
 - PRR (partial rate reduction) see the RFC.

Once built into your kernel, you can select this stack by either
socket option with the name of the stack is "rack" or by setting
the global sysctl so the default is rack.

Note that any connection that does not support SACK will be kicked
back to the "default" base  FreeBSD stack (currently known as "default").

To build this into your kernel you will need to enable in your
kernel:
   makeoptions WITH_EXTRA_TCP_STACKS=1
   options TCPHPTS

Sponsored by:	Netflix Inc.
Differential Revision:		https://reviews.freebsd.org/D15525
2018-06-07 18:18:13 +00:00
Matt Macy
b203713694 fix uninitialized variable warning 2018-05-19 03:49:36 +00:00
Pedro F. Giffuni
51369649b0 sys: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
2017-11-20 19:43:44 +00:00
Gleb Smirnoff
555b3e2f2c Third take on the r319685 and r320480. Actually allow for call soisconnected()
via soisdisconnected(), and in the earlier unlock earlier to avoid lock
recursion.

This fixes a situation when a socket on accept queue is reset before being
accepted.

Reported by:	Jason Eggleston <jeggleston llnw.com>
2017-08-24 20:49:19 +00:00
Navdeep Parhar
98c9236978 Adjust sowakeup post-r319685 so that it continues to make upcalls but
still avoids calling soconnected during sodisconnected.

Discussed with:	glebius@
Sponsored by:	Chelsio Communications
2017-06-29 19:43:27 +00:00
Gleb Smirnoff
64290befc1 Provide sbsetopt() that handles socket buffer related socket options.
It distinguishes between data flow sockets and listening sockets, and
in case of the latter doesn't change resource limits, since listening
sockets don't hold any buffers, they only carry values to be inherited
by their children.
2017-06-25 01:41:07 +00:00
Gleb Smirnoff
779f106aa1 Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
  fields that belong to normal dataflow, and unionize them.  This
  shrinks the structure a bit.
  - Take out selinfo's from the socket buffers into the socket. The
    first reason is to support braindamaged scenario when a socket is
    added to kevent(2) and then listen(2) is cast on it. The second
    reason is that there is future plan to make socket buffers pluggable,
    so that for a dataflow socket a socket buffer can be changed, and
    in this case we also want to keep same selinfos through the lifetime
    of a socket.
  - Remove struct struct so_accf. Since now listening stuff no longer
    affects struct socket size, just move its fields into listening part
    of the union.
  - Provide sol_upcall field and enforce that so_upcall_set() may be called
    only on a dataflow socket, which has buffers, and for listening sockets
    provide solisten_upcall_set().

o Remove ACCEPT_LOCK() global.
  - Add a mutex to socket, to be used instead of socket buffer lock to lock
    fields of struct socket that don't belong to a socket buffer.
  - Allow to acquire two socket locks, but the first one must belong to a
    listening socket.
  - Make soref()/sorele() to use atomic(9).  This allows in some situations
    to do soref() without owning socket lock.  There is place for improvement
    here, it is possible to make sorele() also to lock optionally.
  - Most protocols aren't touched by this change, except UNIX local sockets.
    See below for more information.

o Reduce copy-and-paste in kernel modules that accept connections from
  listening sockets: provide function solisten_dequeue(), and use it in
  the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
  infiniband, rpc.

o UNIX local sockets.
  - Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
    local sockets.  Most races exist around spawning a new socket, when we
    are connecting to a local listening socket.  To cover them, we need to
    hold locks on both PCBs when spawning a third one.  This means holding
    them across sonewconn().  This creates a LOR between pcb locks and
    unp_list_lock.
  - To fix the new LOR, abandon the global unp_list_lock in favor of global
    unp_link_lock.  Indeed, separating these two locks didn't provide us any
    extra parralelism in the UNIX sockets.
  - Now call into uipc_attach() may happen with unp_link_lock hold if, we
    are accepting, or without unp_link_lock in case if we are just creating
    a socket.
  - Another problem in UNIX sockets is that uipc_close() basicly did nothing
    for a listening socket.  The vnode remained opened for connections.  This
    is fixed by removing vnode in uipc_close().  Maybe the right way would be
    to do it for all sockets (not only listening), simply move the vnode
    teardown from uipc_detach() to uipc_close()?

Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
Gleb Smirnoff
8d40bada3e Fix a degenerate case when soisdisconnected() would call soisconnected().
This happens when closing a socket with upcall, and trace is: soclose()->
... protocol ... -> soisdisconnected() -> socantrcvmore_locked() ->
sowakeup() -> soisconnected().

Right now this case is innocent for two reasons.  First, soisconnected()
doesn't clear SS_ISDISCONNECTED flag.  Second, the mutex to lock the
socket is the socket receive buffer mutex, and sodisconnected() first
disables the receive buffer. But in future code, the mutex to lock
socket is different to buffer mutex, and we would get undesired mutex
recursion.

The fix is to check SS_ISDISCONNECTED flag before calling upcall.
2017-06-08 06:16:47 +00:00
Andrey V. Elsukov
57386f5dce Fix the build.
Reported by:	lwhsu
2017-04-14 10:21:38 +00:00
Andrey V. Elsukov
c33a231337 Rework r316770 to make it protocol independent and general, like we
do for streaming sockets.

And do more cleanup in the sbappendaddr_locked_internal() to prevent
leak information from existing mbuf to the one, that will be possible
created later by netgraph.

Suggested by:	glebius
Tested by:	Irina Liakh <spell at itl ua>
MFC after:	1 week
2017-04-14 09:00:48 +00:00
Hiren Panchasara
f41b2de716 Fix the KASSERT check from r314813.
len being 0 is valid.

Submitted by:	ngie
Reported by:	ngie (via jenkins test run)
Sponsored by:	Limelight Networks
2017-03-07 06:46:38 +00:00
Hiren Panchasara
b5b023b91e We've found a recurring problem where some userland process would be
stuck spinning at 100% cpu around sbcut_internal(). Inside
sbflush_internal(), sb_ccc reached to about 4GB and before passing it
to sbcut_internal(), we type-cast it from uint to int making it -ve.

The root cause of sockbuf growing this large is unknown. Correct fix
is also not clear but based on mailing list discussions, adding
KASSERTs to panic instead of looping endlessly.

Reviewed by:		glebius
Sponsored by:		Limelight Networks
2017-03-07 00:20:01 +00:00
Ed Maste
69a2875821 Renumber license clauses in sys/kern to avoid skipping #3 2016-09-15 13:16:20 +00:00
Pedro F. Giffuni
b85f65af68 kern: for pointers replace 0 with NULL.
These are mostly cosmetical, no functional change.

Found with devel/coccinelle.
2016-04-15 16:10:11 +00:00
John Baldwin
f3215338ef Refactor the AIO subsystem to permit file-type-specific handling and
improve cancellation robustness.

Introduce a new file operation, fo_aio_queue, which is responsible for
queueing and completing an asynchronous I/O request for a given file.
The AIO subystem now exports library of routines to manipulate AIO
requests as well as the ability to run a handler function in the
"default" pool of AIO daemons to service a request.

A default implementation for file types which do not include an
fo_aio_queue method queues requests to the "default" pool invoking the
fo_read or fo_write methods as before.

The AIO subsystem permits file types to install a private "cancel"
routine when a request is queued to permit safe dequeueing and cleanup
of cancelled requests.

Sockets now use their own pool of AIO daemons and service per-socket
requests in FIFO order.  Socket requests will not block indefinitely
permitting timely cancellation of all requests.

Due to the now-tight coupling of the AIO subsystem with file types,
the AIO subsystem is now a standard part of all kernels.  The VFS_AIO
kernel option and aio.ko module are gone.

Many file types may block indefinitely in their fo_read or fo_write
callbacks resulting in a hung AIO daemon.  This can result in hung
user processes (when processes attempt to cancel all outstanding
requests during exit) or a hung system.  To protect against this, AIO
requests are only permitted for known "safe" files by default.  AIO
requests for all file types can be enabled by setting the new
vfs.aio.enable_usafe sysctl to a non-zero value.  The AIO tests have
been updated to skip operations on unsafe file types if the sysctl is
zero.

Currently, AIO requests on sockets and raw disks are considered safe
and are enabled by default.  aio_mlock() is also enabled by default.

Reviewed by:	cem, jilles
Discussed with:	kib (earlier version)
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D5289
2016-03-01 18:12:14 +00:00
Gleb Smirnoff
8ec07310fa These files were getting sys/malloc.h and vm/uma.h with header pollution
via sys/mbuf.h
2016-02-01 17:41:21 +00:00
Gleb Smirnoff
829fae9063 Make it possible for sbappend() to preserve M_NOTREADY on mbufs, just like
sbappendstream() does. Although, M_NOTREADY may appear only on SOCK_STREAM
sockets, due to sendfile(2) supporting only the latter, there is a corner
case of AF_UNIX/SOCK_STREAM socket, that still uses records for the sake
of control data, albeit being stream socket.

Provide private version of m_clrprotoflags(), which understands PRUS_NOTREADY,
similar to m_demote().
2016-01-08 19:03:20 +00:00
Mateusz Guzik
f6f6d24062 Implement lockless resource limits.
Use the same scheme implemented to manage credentials.

Code needing to look at process's credentials (as opposed to thred's) is
provided with *_proc variants of relevant functions.

Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.
2015-06-10 10:48:12 +00:00
Gleb Smirnoff
53b680caa2 In sbappend*() family of functions clear M_PROTO flags of incoming
mbufs. sbappendstream() already does this in m_demote().

PR:		196174
Sponsored by:	Nginx, Inc.
2014-12-22 15:39:24 +00:00
Gleb Smirnoff
e834a84026 Revert r274494, r274712, r275955 and provide extra comments explaining
why there could appear a zero-sized mbufs in socket buffers.

A proper fix would be to divorce record socket buffers and stream
socket buffers, and divorce pru_send that accepts normal data from
pru_send that accepts control data.
2014-12-20 22:12:04 +00:00
Gleb Smirnoff
b7413232f6 Add to sbappendstream_locked() a check against NULL mbuf, like it is done
in sbappend_locked() and sbappendrecord_locked().

This is a quick fix to the panic introduced by r274712.

A proper solution should be to make sosend_generic() avoid calling
pru_send() with NULL mbuf for the protocols that do not understand
control messages. Those protocols that understand control messages,
should be able to receive NULL mbuf, if control is non-NULL.
2014-12-20 14:19:46 +00:00
Gleb Smirnoff
651e4e6a30 Merge from projects/sendfile: extend protocols API to support
sending not ready data:
o Add new flag to pru_send() flags - PRUS_NOTREADY.
o Add new protocol method pru_ready().

Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2014-11-30 13:24:21 +00:00
Gleb Smirnoff
0f9d0a73a4 Merge from projects/sendfile:
o Introduce a notion of "not ready" mbufs in socket buffers.  These
mbufs are now being populated by some I/O in background and are
referenced outside.  This forces following implications:
- An mbuf which is "not ready" can't be taken out of the buffer.
- An mbuf that is behind a "not ready" in the queue neither.
- If sockbet buffer is flushed, then "not ready" mbufs shouln't be
  freed.

o In struct sockbuf the sb_cc field is split into sb_ccc and sb_acc.
  The sb_ccc stands for ""claimed character count", or "committed
  character count".  And the sb_acc is "available character count".
  Consumers of socket buffer API shouldn't already access them directly,
  but use sbused() and sbavail() respectively.
o Not ready mbufs are marked with M_NOTREADY, and ready but blocked ones
  with M_BLOCKED.
o New field sb_fnrdy points to the first not ready mbuf, to avoid linear
  search.
o New function sbready() is provided to activate certain amount of mbufs
  in a socket buffer.

A special note on SCTP:
  SCTP has its own sockbufs.  Unfortunately, FreeBSD stack doesn't yet
allow protocol specific sockbufs.  Thus, SCTP does some hacks to make
itself compatible with FreeBSD: it manages sockbufs on its own, but keeps
sb_cc updated to inform the stack of amount of data in them.  The new
notion of "not ready" data isn't supported by SCTP.  Instead, only a
mechanical substitute is done: s/sb_cc/sb_ccc/.
  A proper solution would be to take away struct sockbuf from struct
socket and allow protocols to implement their own socket buffers, like
SCTP already does.  This was discussed with rrs@.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-11-30 12:52:33 +00:00
Gleb Smirnoff
57f43a45a3 - Move sbcheck() declaration under SOCKBUF_DEBUG.
- Improve SOCKBUF_DEBUG macros.
- Improve sbcheck().

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-11-30 11:22:39 +00:00
Gleb Smirnoff
8967b220a3 Make sballoc() and sbfree() functions. Ideally, they could be marked
as static, but unfortunately Infiniband (ab)uses them.

Sponsored by:	Nginx, Inc.
2014-11-30 11:02:07 +00:00
Gleb Smirnoff
8146bcfea1 - Use NULL to compare a pointer.
- Use KASSERT() instead of panic.
- Remove useless 'continue', no need to restart cycle here.

Sponsored by:	Nginx, Inc.
2014-11-14 15:44:19 +00:00
Gleb Smirnoff
f274e25659 There should not be zero length mbufs in socket buffers. The code comes
from r1451, and thus can't be explained.  A patch with explicit panic()
here survived all tests.

Tested by:	pho
Sponsored by:	Nginx, Inc.
2014-11-14 06:02:29 +00:00
Hans Petter Selasky
9fd573c39d Improve transmit sending offload, TSO, algorithm in general.
The current TSO limitation feature only takes the total number of
bytes in an mbuf chain into account and does not limit by the number
of mbufs in a chain. Some kinds of hardware is limited by two
factors. One is the fragment length and the second is the fragment
count. Both of these limits need to be taken into account when doing
TSO. Else some kinds of hardware might have to drop completely valid
mbuf chains because they cannot loaded into the given hardware's DMA
engine. The new way of doing TSO limitation has been made backwards
compatible as input from other FreeBSD developers and will use
defaults for values not set.

Reviewed by:	adrian, rmacklem
Sponsored by:	Mellanox Technologies
MFC after:	1 week
2014-09-22 08:27:27 +00:00
Xin LI
2827952eb4 Don't leave the padding between the msg header and the cmsg data,
and the padding after the cmsg data un-initialized.

Submitted by:	tuexen
Security:	CVE-2014-3952
Security:	FreeBSD-SA-14:17.kmem
2014-07-08 21:54:23 +00:00
Alan Somers
8de34a88de Fix PR kern/185813 "SOCK_SEQPACKET AF_UNIX sockets with asymmetrical
buffers drop packets".  It was caused by a check for the space available
in a sockbuf, but it was checking the wrong sockbuf.

sys/sys/sockbuf.h
sys/kern/uipc_sockbuf.c
    Add sbappendaddr_nospacecheck_locked(), which is just like
    sbappendaddr_locked but doesn't validate the receiving socket's
    space.  Factor out common code into sbappendaddr_locked_internal().
    We shouldn't simply make sbappendaddr_locked check the space and
    then call sbappendaddr_nospacecheck_locked, because that would cause
    the O(n) function m_length to be called twice.

sys/kern/uipc_usrreq.c
    Use sbappendaddr_nospacecheck_locked for SOCK_SEQPACKET sockets,
    because the receiving sockbuf's size limit is irrelevant.

tests/sys/kern/unix_seqpacket_test.c
    Now that 185813 is fixed, pipe_128k_8k fails intermittently due to
    185812.  Make it fail every time by adding a usleep after starting
    the writer thread and before starting the reader thread in
    test_pipe.  That gives the writer time to fill up its send buffer.
    Also, clear the expected failure message due to 185813.  It actually
    said "185812", but that was a typo.

PR:		kern/185813
Reviewed by:	silence from freebsd-net@ and rwatson@
MFC after:	3 weeks
Sponsored by:	Spectra Logic Corporation
2014-03-06 20:24:15 +00:00
Gleb Smirnoff
761a9a1f8d Fix comment. 2014-01-17 11:09:05 +00:00
Gleb Smirnoff
1d2df300e9 - Substitute sbdrop_internal() with sbcut_internal(). The latter doesn't free
mbufs, but return chain of free mbufs to a caller. Caller can either reuse
  them or return to allocator in a batch manner.
- Implement sbdrop()/sbdrop_locked() as a wrapper around sbcut_internal().
- Expose sbcut_locked() for outside usage.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
Approved by:	re (marius)
2013-10-09 11:57:53 +00:00
Davide Italiano
7729cbf1a6 Fix socket buffer timeouts precision using the new sbintime_t KPI instead
of relying on the tvtohz() workaround. The latter has been introduced
lately by jhb@ (r254699) in order to have a fix that can be backported
to STABLE.

Reported by:	Vitja Makarov <vitja.makarov at gmail dot com>
Reviewed by:	jhb (earlier version)
2013-09-01 23:34:53 +00:00
Lawrence Stewart
0963c8e431 When a previous call to sbsndptr() leaves sb->sb_sndptroff at the start of an
mbuf that was fully consumed by the previous call, the mbuf ptr returned by the
current call ends up being the previous mbuf in the sb chain to the one that
contains the data we want.

This does not cause any observable issues because the mbuf copy routines happily
walk the mbuf chain to get to the data at the moff offset, which in this case
means they effectively skip over the mbuf returned by sbsndptr().

We can't adjust sb->sb_sndptr during the previous call for this case because the
next mbuf in the chain may not exist yet. We therefore need to detect the
condition and make the adjustment during the current call.

Fix by detecting the special case of moff being at the start of the next mbuf in
the chain and adjust the required accounting variables accordingly.

Reviewed by:	andre
MFC after:	2 weeks
2013-06-19 03:08:01 +00:00
Gleb Smirnoff
844cacd17c Once ng_ksocket(4) is fixed, re-apply r194662. See this revision for
longer description.

Discussed with:	andre, rwatson
Sponsored by:	Nginx, Inc.
2013-03-29 14:06:04 +00:00
Gleb Smirnoff
c8b59ea750 Use m_get() and m_getcl() instead of compat macros. 2013-03-15 10:21:18 +00:00
Gleb Smirnoff
eb1b1807af Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually
2012-12-05 08:04:20 +00:00
Eitan Adler
3eb9ab5255 Document a large number of currently undocumented sysctls. While here
fix some style(9) issues and reduce redundancy.

PR:		kern/155491
PR:		kern/155490
PR:		kern/155489
Submitted by:	Galimov Albert <wtfcrap@mail.ru>
Approved by:	bde
Reviewed by:	jhb
MFC after:	1 week
2011-12-13 00:38:50 +00:00
Bjoern A. Zeeb
b233773bb9 Increase the defaults for the maximum socket buffer limit,
and the maximum TCP send and receive buffer limits from 256kB
to 2MB.

For sb_max_adj we need to add the cast as already used in the sysctl
handler to not overflow the type doing the maths.

Note that this is just the defaults.  They will allow more memory
to be consumed per socket/connection if needed but not change the
default "idle" memory consumption.   All values are still tunable
by sysctls.

Suggested by:	gnn
Discussed on:	arch (Mar and Aug 2011)
MFC after:	3 weeks
Approved by:	re (kib)
2011-08-25 09:20:13 +00:00
Gleb Smirnoff
443301e296 Revert r194662, since it breaks ng_ksocket(4) and may break
other socket consumers with alternate sb_upcall.

PR:		kern/154676
Submitted by:	Arnaud Lacombe <lacombar gmail.com>
MFC after:	7 days
2011-04-14 14:54:22 +00:00
Andre Oppermann
4ad1c464fe In sbappendstream_locked() demote all incoming packet mbufs (and
chains) to pure data mbufs using m_demote().  This removes the
packet header and all m_tag information as they are not meaningful
anymore on a stream socket where mbufs are linked through m->m_next.
Strictly speaking a packet header can be only ever valid on the first
mbuf in an m_next chain.

sbcompress() was doing this already when the mbuf chain layout lent
itself to it (e.g. header splitting or merge-append), just not
consistently.

This frees resources at socket buffer append time instead of at
sbdrop_internal() time after data has been read from the socket.

For MAC the per packet information has done its duty and during
socket buffer appending the policy of the socket itself takes over.
With the append the packet boundaries disappear naturally and with
it any context that was based on it.  None of the residual information
from mbuf headers in the socket buffer on stream sockets was looked at.
2009-06-22 21:46:40 +00:00
John Baldwin
74fb0ba732 Rework socket upcalls to close some races with setup/teardown of upcalls.
- Each socket upcall is now invoked with the appropriate socket buffer
  locked.  It is not permissible to call soisconnected() with this lock
  held; however, so socket upcalls now return an integer value.  The two
  possible values are SU_OK and SU_ISCONNECTED.  If an upcall returns
  SU_ISCONNECTED, then the soisconnected() will be invoked on the
  socket after the socket buffer lock is dropped.
- A new API is provided for setting and clearing socket upcalls.  The
  API consists of soupcall_set() and soupcall_clear().
- To simplify locking, each socket buffer now has a separate upcall.
- When a socket upcall returns SU_ISCONNECTED, the upcall is cleared from
  the receive socket buffer automatically.  Note that a SO_SND upcall
  should never return SU_ISCONNECTED.
- All this means that accept filters should now return SU_ISCONNECTED
  instead of calling soisconnected() directly.  They also no longer need
  to explicitly clear the upcall on the new socket.
- The HTTP accept filter still uses soupcall_set() to manage its internal
  state machine, but other accept filters no longer have any explicit
  knowlege of socket upcall internals aside from their return value.
- The various RPC client upcalls currently drop the socket buffer lock
  while invoking soreceive() as a temporary band-aid.  The plan for
  the future is to add a new flag to allow soreceive() to be called with
  the socket buffer locked.
- The AIO callback for socket I/O is now also invoked with the socket
  buffer locked.  Previously sowakeup() would drop the socket buffer
  lock only to call aio_swake() which immediately re-acquired the socket
  buffer lock for the duration of the function call.

Discussed with:	rwatson, rmacklem
2009-06-01 21:17:03 +00:00
Maksim Yevmenkin
e72a94adc3 Fix sbappendrecord_locked().
The main problem is that sbappendrecord_locked() relies on sbcompress()
to set sb_mbtail. This will not happen if sbappendrecord_locked() is
called with mbuf chain made of exactly one mbuf (i.e. m0->m_next == NULL).
In this case sbcompress() will be called with m == NULL and will do
nothing. I'm not entirely sure if m == NULL is a valid argument for
sbcompress(), and, it rather pointless to call it like that, but keep
calling it so it can do SBLASTMBUFCHK().

The problem is triggered by the SOCKBUF_DEBUG kernel option that
enables SBLASTRECORDCHK() and SBLASTMBUFCHK() checks.

PR:			kern/126742
Investigated by:	pluknet < pluknet -at- gmail -dot- com >
No response from:	freebsd-current@, freebsd-bluetooth@
MFC after:		3 days
2009-04-21 19:14:13 +00:00