Commit Graph

439 Commits

Author SHA1 Message Date
Hartmut Brandt
4b481ba0ed Merge filt_soread and filt_solisten and decide what to do when checking
for EVFILT_READ at the point of the check not when the event is registers.
This fixes a problem with asio when accepting a connection.

Reviewed by:	kib@, Scott Mitchell
2017-02-01 13:12:07 +00:00
Hans Petter Selasky
f3e7afe2d7 Implement kernel support for hardware rate limited sockets.
- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.

- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API support rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.

- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().

- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.

- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.

- How rate limiting works:

1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.

2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the rate
limited traffic will be rate limited.

3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN which causes the ip_output() to release and clear the send
tag. Upon next ip_output() a new "snd_tag" will be tried allocated.

4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.

Reviewed by:		wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision:	https://reviews.freebsd.org/D3687
Sponsored by:		Mellanox Technologies
MFC after:		3 months
2017-01-18 13:31:17 +00:00
Maxim Sobolev
339efd75a4 Add a new socket option SO_TS_CLOCK to pick from several different clock
sources to return timestamps when SO_TIMESTAMP is enabled. Two additional
clock sources are:

o nanosecond resolution realtime clock (equivalent of CLOCK_REALTIME);
o nanosecond resolution monotonic clock (equivalent of CLOCK_MONOTONIC).

In addition to this, this option provides unified interface to get bintime
(equivalent of using SO_BINTIME), except it also supported with IPv6 where
SO_BINTIME has never been supported. The long term plan is to depreciate
SO_BINTIME and move everything to using SO_TS_CLOCK.

Idea for this enhancement has been briefly discussed on the Net session
during dev summit in Ottawa last June and the general input was positive.

This change is believed to benefit network benchmarks/profiling as well
as other scenarios where precise time of arrival measurement is necessary.

There are two regression test cases as part of this commit: one extends unix
domain test code (unix_cmsg) to test new SCM_XXX types and another one
implementis totally new test case which exchanges UDP packets between two
processes using both conventional methods (i.e. calling clock_gettime(2)
before recv(2) and after send(2)), as well as using setsockopt()+recv() in
receive path. The resulting delays are checked for sanity for all supported
clock types.

Reviewed by:    adrian, gnn
Differential Revision:  https://reviews.freebsd.org/D9171
2017-01-16 17:46:38 +00:00
Hiren Panchasara
7d03ff1fe9 Add kevent EVFILT_EMPTY for notification when a client has received all data
i.e. everything outstanding has been acked.

Reviewed by:	bz, gnn (previous version)
MFC after:	3 days
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D9150
2017-01-16 08:25:33 +00:00
John Baldwin
14da48cbe4 Set MORETOCOME for AIO write requests on a socket.
Add a MSG_MOREOTOCOME message flag. When this flag is set, sosend*
set PRUS_MOREOTOCOME when invoking the protocol send method. The aio
worker tasks for sending on a socket set this flag when there are
additional write jobs waiting on the socket buffer.

Reviewed by:	adrian
MFC after:	1 month
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D8955
2017-01-06 23:41:45 +00:00
Ruslan Bukin
dd7d4f199e Revert r306186 ("Adjust the sopt_val pointer on bigendian systems").
This logic doesn't work with bigger sopt_valsize (e.g. when ipfw
passing 2048 bytes rule).

Reported by:	adrian
Sponsored by:	DARPA, AFRL
2016-11-22 18:31:43 +00:00
Ruslan Bukin
30f3bfe58e Adjust the sopt_val pointer on bigendian systems (e.g. MIPS64EB).
sooptcopyin() checks if size of data provided by user is <= than we can
accept, else it strips down the size. On bigendian platforms we have to
move pointer as well so we copy the actual data.

Reviewed by:	gnn
Sponsored by:	DARPA, AFRL
Sponsored by:	HEIF5
Differential Revision:	https://reviews.freebsd.org/D7980
2016-09-22 12:41:53 +00:00
Ed Maste
69a2875821 Renumber license clauses in sys/kern to avoid skipping #3 2016-09-15 13:16:20 +00:00
Kevin Lo
c3bef61e58 Remove the 4.3BSD compatible macro m_copy(), use m_copym() instead.
Reviewed by:	gnn
Differential Revision:	https://reviews.freebsd.org/D7878
2016-09-15 07:41:48 +00:00
Baptiste Daroussin
306e53bce9 Fix typo introduced by me (not the submitter) when fixing typos 2016-05-22 13:10:48 +00:00
Baptiste Daroussin
2fd642c899 Fix typos in the comments
Submitted by:	cipherwraith666@gmail.com (via github)
2016-05-22 13:04:45 +00:00
Pedro F. Giffuni
e3043798aa sys/kern: spelling fixes in comments.
No functional change.
2016-04-29 22:15:33 +00:00
John Baldwin
8722384b22 Introduce a new protocol hook pru_aio_queue.
This allows a protocol to claim individual AIO requests instead of using
the default socket AIO handling.

Sponsored by:	Chelsio Communications
2016-04-29 20:11:09 +00:00
Maxim Konovalov
ab77175063 o "avaliable" -> "available".
PR:		208141
Submitted by:	Tyler Littlefield
2016-03-21 08:03:50 +00:00
John Baldwin
f3215338ef Refactor the AIO subsystem to permit file-type-specific handling and
improve cancellation robustness.

Introduce a new file operation, fo_aio_queue, which is responsible for
queueing and completing an asynchronous I/O request for a given file.
The AIO subystem now exports library of routines to manipulate AIO
requests as well as the ability to run a handler function in the
"default" pool of AIO daemons to service a request.

A default implementation for file types which do not include an
fo_aio_queue method queues requests to the "default" pool invoking the
fo_read or fo_write methods as before.

The AIO subsystem permits file types to install a private "cancel"
routine when a request is queued to permit safe dequeueing and cleanup
of cancelled requests.

Sockets now use their own pool of AIO daemons and service per-socket
requests in FIFO order.  Socket requests will not block indefinitely
permitting timely cancellation of all requests.

Due to the now-tight coupling of the AIO subsystem with file types,
the AIO subsystem is now a standard part of all kernels.  The VFS_AIO
kernel option and aio.ko module are gone.

Many file types may block indefinitely in their fo_read or fo_write
callbacks resulting in a hung AIO daemon.  This can result in hung
user processes (when processes attempt to cancel all outstanding
requests during exit) or a hung system.  To protect against this, AIO
requests are only permitted for known "safe" files by default.  AIO
requests for all file types can be enabled by setting the new
vfs.aio.enable_usafe sysctl to a non-zero value.  The AIO tests have
been updated to skip operations on unsafe file types if the sysctl is
zero.

Currently, AIO requests on sockets and raw disks are considered safe
and are enabled by default.  aio_mlock() is also enabled by default.

Reviewed by:	cem, jilles
Discussed with:	kib (earlier version)
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D5289
2016-03-01 18:12:14 +00:00
Alfred Perlstein
7325dfbb59 Increase max allowed backlog for listen sockets
from short to int.

PR: 203922
Submitted by: White Knight <white_knight@2ch.net>
MFC After: 4 weeks
2016-02-02 05:57:59 +00:00
Ed Schouten
b114aa7959 Make shutdown() return ENOTCONN as required by POSIX, part deux.
Summary:
Back in 2005, maxim@ attempted to fix shutdown() to return ENOTCONN in case the socket was not connected (r150152). This had to be rolled back (r150155), as it broke some of the existing programs that depend on this behavior. I reapplied this change on my system and indeed, syslogd failed to start up. I fixed this back in February (279016) and MFC'ed it to the supported stable branches. Apart from that, things seem to work out all right.

Since at least Linux and Mac OS X do the right thing, I'd like to go ahead and give this another try. To keep old copies of syslogd working, only start returning ENOTCONN for recent binaries.

I took a look at the XNU sources and they seem to test against both SS_ISCONNECTED, SS_ISCONNECTING and SS_ISDISCONNECTING, instead of just SS_ISCONNECTED. That seams reasonable, so let's do the same.

Test Plan:
This issue was uncovered while writing tests for shutdown() in CloudABI:

https://github.com/NuxiNL/cloudlibc/blob/master/src/libc/sys/socket/shutdown_test.c#L26

Reviewers: glebius, rwatson, #manpages, gnn, #network

Reviewed By: gnn, #network

Subscribers: bms, mjg, imp

Differential Revision: https://reviews.freebsd.org/D3039
2015-07-27 13:17:57 +00:00
Xin LI
1a7c14aec7 Fix a typo in comment.
Submitted by:	Yanhui Shen via twitter
MFC after:	3 days
2015-07-24 22:13:39 +00:00
Conrad Meyer
0c40f3532d Fix cleanup race between unp_dispose and unp_gc
unp_dispose and unp_gc could race to teardown the same mbuf chains, which
can lead to dereferencing freed filedesc pointers.

This patch adds an IGNORE_RIGHTS flag on unpcbs marking the unpcb's RIGHTS
as invalid/freed. The flag is protected by UNP_LIST_LOCK.

To serialize against unp_gc, unp_dispose needs the socket object. Change the
dom_dispose() KPI to take a socket object instead of an mbuf chain directly.

PR:		194264
Differential Revision:	https://reviews.freebsd.org/D3044
Reviewed by:	mjg (earlier version)
Approved by:	markj (mentor)
Obtained from:	mjg
MFC after:	1 month
Sponsored by:	EMC / Isilon Storage Division
2015-07-14 02:00:50 +00:00
Andrey V. Elsukov
e9b70483d1 soreceive_generic() still has similar KASSERT(), therefore instead of
remove KASSERT(), change it to check mbuf isn't NULL.

Suggested by:	kib
MFC after:	1 week
2015-02-23 15:24:43 +00:00
Andrey V. Elsukov
f21684bc75 In some cases soreceive_dgram() can return no data, but has control
message. This can happen when application is sending packets too big
for the path MTU and recvmsg() will return zero (indicating no data)
but there will be a cmsghdr with cmsg_type set to IPV6_PATHMTU.
Remove KASSERT() which does NULL pointer dereference in such case.
Also call m_freem() only when m isn't NULL.

PR:		197882
MFC after:	1 week
Sponsored by:	Yandex LLC
2015-02-23 13:41:35 +00:00
Davide Italiano
a76d4388e1 Don't access sockbuf fields directly, use accessor functions instead.
It is safe to move the call to socantsendmore_locked() after
sbdrop_locked() as long as we hold the sockbuf lock across the two
calls.

CR:	D1805
Reviewed by:	adrian, kmacy, julian, rwatson
2015-02-14 20:00:57 +00:00
Gleb Smirnoff
e834a84026 Revert r274494, r274712, r275955 and provide extra comments explaining
why there could appear a zero-sized mbufs in socket buffers.

A proper fix would be to divorce record socket buffers and stream
socket buffers, and divorce pru_send that accepts normal data from
pru_send that accepts control data.
2014-12-20 22:12:04 +00:00
John Baldwin
5ad25ceb41 Check for SS_NBIO in so->so_state instead of sb->sb_flags in
soreceive_stream().

Differential Revision:	https://reviews.freebsd.org/D1299
Reviewed by:	bz, gnn
MFC after:	1 week
2014-12-15 17:52:08 +00:00
Gleb Smirnoff
651e4e6a30 Merge from projects/sendfile: extend protocols API to support
sending not ready data:
o Add new flag to pru_send() flags - PRUS_NOTREADY.
o Add new protocol method pru_ready().

Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2014-11-30 13:24:21 +00:00
Gleb Smirnoff
0f9d0a73a4 Merge from projects/sendfile:
o Introduce a notion of "not ready" mbufs in socket buffers.  These
mbufs are now being populated by some I/O in background and are
referenced outside.  This forces following implications:
- An mbuf which is "not ready" can't be taken out of the buffer.
- An mbuf that is behind a "not ready" in the queue neither.
- If sockbet buffer is flushed, then "not ready" mbufs shouln't be
  freed.

o In struct sockbuf the sb_cc field is split into sb_ccc and sb_acc.
  The sb_ccc stands for ""claimed character count", or "committed
  character count".  And the sb_acc is "available character count".
  Consumers of socket buffer API shouldn't already access them directly,
  but use sbused() and sbavail() respectively.
o Not ready mbufs are marked with M_NOTREADY, and ready but blocked ones
  with M_BLOCKED.
o New field sb_fnrdy points to the first not ready mbuf, to avoid linear
  search.
o New function sbready() is provided to activate certain amount of mbufs
  in a socket buffer.

A special note on SCTP:
  SCTP has its own sockbufs.  Unfortunately, FreeBSD stack doesn't yet
allow protocol specific sockbufs.  Thus, SCTP does some hacks to make
itself compatible with FreeBSD: it manages sockbufs on its own, but keeps
sb_cc updated to inform the stack of amount of data in them.  The new
notion of "not ready" data isn't supported by SCTP.  Instead, only a
mechanical substitute is done: s/sb_cc/sb_ccc/.
  A proper solution would be to take away struct sockbuf from struct
socket and allow protocols to implement their own socket buffers, like
SCTP already does.  This was discussed with rrs@.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-11-30 12:52:33 +00:00
Gleb Smirnoff
67af272bcf Do not allocate zero-length mbuf in sosend_generic().
Found by:	pho
Sponsored by:	Nginx, Inc.
2014-11-19 14:27:38 +00:00
Gleb Smirnoff
6bf6b25e88 Merge from projects/sendfile:
Use sbcut_locked() instead of manually editing a sockbuf.

Sponsored by:	Nginx, Inc.
2014-11-14 15:33:40 +00:00
Gleb Smirnoff
cfa6009e36 In preparation of merging projects/sendfile, transform bare access to
sb_cc member of struct sockbuf to a couple of inline functions:

sbavail() and sbused()

Right now they are equal, but once notion of "not ready socket buffer data",
will be checked in, they are going to be different.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-11-12 09:57:15 +00:00
Hiroki Sato
714266373c - Make hhook_run_socket() vnet-aware instead of adding CURVNET_SET() around
the function calls.
- Fix a memory leak and stats in the case that hhook_run_socket() fails
  in soalloc().

PR:	193265
2014-09-08 09:04:22 +00:00
Gleb Smirnoff
9e739a5a05 Fix for r271182.
Submitted by:	mjg
Pointy hat to:	me, submitter and everyone who urged me to commit
2014-09-07 05:44:14 +00:00
Gleb Smirnoff
d9257d8b57 Set vnet context before accessing V_socket_hhh[].
Submitted by:	"Hiroo Ono (小野寛生)" <hiroo.ono+freebsd gmail.com>
2014-09-05 19:50:18 +00:00
Gleb Smirnoff
e86447ca44 - Remove socket file operations declaration from sys/file.h.
- Make them static in sys_socket.c.
- Provide generic invfo_truncate() instead of soo_truncate().

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-08-26 14:44:08 +00:00
Hiroki Sato
ed063112f4 Fix a panic which occurs in a VIMAGE-enabled kernel after r270158, and
separate socket_hhook_register() part and put it into VNET_SYS{,UN}INIT()
handler.

Discussed with:	marcel
2014-08-22 05:03:30 +00:00
Marcel Moolenaar
4ec7371233 For vendors like Juniper, extensibility for sockets is important. A
good example is socket options that aren't necessarily generic.  To
this end, OSD is added to the socket structure and hooks are defined
for key operations on sockets.  These are:
o   soalloc() and sodealloc()
o   Get and set socket options
o   Socket related kevent filters.

One aspect about hhook that appears to be not fully baked is the return
semantics (the return value from the hook is ignored in hhook_run_hooks()
at the time of commit).  To support return values, the socket_hhook_data
structure contains a 'status' field to hold return values.

Submitted by:	Anuranjan Shukla <anshukla@juniper.net>
Obtained from:	Juniper Networks, Inc.
2014-08-18 23:45:40 +00:00
Davide Italiano
4295aa9240 Fix an overflow in getsockopt(). optval isn't big enough to hold
sbintime_t.
Re-introduce r255030 behaviour capping socket timeouts to INT_32
if they're too large.

CR:	https://phabric.freebsd.org/D433
Reported by:	demon
Reviewed by:	bde [1], jhb [2]
MFC after:	2 weeks
2014-08-04 05:40:51 +00:00
Marcel Moolenaar
1e0a021e3d The accept filter code is not specific to the FreeBSD IPv4 network stack,
so it really should not be under "optional inet". The fact that uipc_accf.c
lives under kern/ lends some weight to making it a "standard" file.

Moving kern/uipc_accf.c from "optional inet" to "standard" eliminates the
need for #ifdef INET in kern/uipc_socket.c.

Also, this meant the net.inet.accf.unloadable sysctl needed to move, as
net.inet does not exist without networking compiled in (as it lives in
netinet/in_proto.c.) The new sysctl has been named net.accf.unloadable.

In order to support existing accept filter sysctls, the net.inet.accf node
has been added netinet/in_proto.c.

Submitted by:	Steve Kiernan <stevek@juniper.net>
Obtained from:	Juniper Networks, Inc.
2014-07-26 19:27:34 +00:00
Gleb Smirnoff
d978bbea8a Simplify wait/nowait code, eventually killing last remnant of
historical mbuf(9) allocator flag.

Sponsored by:	Nginx, Inc.
2014-01-16 13:45:41 +00:00
Hiren Panchasara
16ef0fa833 Fix typo in a comment. 2013-11-08 20:11:15 +00:00
Maksim Yevmenkin
f7a3a2a57c Rate limit (to once per minute) "Listen queue overflow" message in
sonewconn().

Reviewed by:	scottl, lstewart
Obtained from:	Netflix, Inc
MFC after:	2 weeks
2013-10-31 20:33:21 +00:00
Konstantin Belousov
3846a82284 Remove zero-copy sockets code. It only worked for anonymous memory,
and the equivalent functionality is now provided by sendfile(2) over
posix shared memory filedescriptor.

Remove the cow member of struct vm_page, and rearrange the remaining
members.  While there, make hold_count unsigned.

Requested and reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Approved by:	re (delphij)
2013-09-16 06:25:54 +00:00
Davide Italiano
7729cbf1a6 Fix socket buffer timeouts precision using the new sbintime_t KPI instead
of relying on the tvtohz() workaround. The latter has been introduced
lately by jhb@ (r254699) in order to have a fix that can be backported
to STABLE.

Reported by:	Vitja Makarov <vitja.makarov at gmail dot com>
Reviewed by:	jhb (earlier version)
2013-09-01 23:34:53 +00:00
John Baldwin
e289e9f2ca Don't return an error for socket timeouts that are too large. Just
cap them to INT_MAX ticks instead.

PR:		kern/181416 (r254699 really)
Requested by:	bde
MFC after:	2 weeks
2013-08-29 15:59:05 +00:00
John Baldwin
e77c507d60 Use tvtohz() to convert a socket buffer timeout to a tick value rather
than using a home-rolled version.  The home-rolled version could result
in shorter-than-requested sleeps.

Reported by:	Vitja Makarov <vitja.makarov@gmail.com>
MFC after:	2 weeks
2013-08-23 13:47:41 +00:00
Andre Oppermann
6753da1356 When the accept queue is full print the number of already pending
new connections instead of by how many we're over the limit, which
is always 1.

Noticed by:	jmallet
MFC after:	1 week
2013-05-08 14:13:14 +00:00
Andre Oppermann
f89d4c3acf Back out r249318, r249320 and r249327 due to a heisenbug most
likely related to a race condition in the ipi_hash_lock with
the exact cause currently unknown but under investigation.
2013-05-06 16:42:18 +00:00
Jilles Tjoelker
cd31b6dd08 socket: Make shutdown() wake up a blocked accept().
A blocking accept (and some other operations) waits on &so->so_timeo. Once
it wakes up, it will detect the SBS_CANTRCVMORE bit.

The error from accept() is [ECONNABORTED] which is not the nicest one -- the
thread calling accept() needs to know out-of-band what is happening.

A spurious wakeup on so->so_timeo appears harmless (sleep retried) except
when lingering on close (SO_LINGER, and in that case there is no descriptor
to call shutdown() on) so this should be fairly safe.

A shutdown() already woke up a blocked accept() for TCP sockets, but not for
Unix domain sockets. This fix is generic for all domains.

This patch was sent to -hackers@ and -net@ on April 5.

MFC after:	2 weeks
2013-04-30 15:06:30 +00:00
Jim Harris
d58a96538f Fix the build. 2013-04-10 00:35:08 +00:00
Andre Oppermann
e8b3186b6a Change certain heavily used network related mutexes and rwlocks to
reside on their own cache line to prevent false sharing with other
nearby structures, especially for those in the .bss segment.

NB: Those mutexes and rwlocks with variables next to them that get
changed on every invocation do not benefit from their own cache line.
Actually it may be net negative because two cache misses would be
incurred in those cases.
2013-04-09 21:02:20 +00:00
Gleb Smirnoff
a307eb26ed When soreceive_generic() hands off an mbuf from buffer,
clear its pointer to next record, since next record
belongs to the buffer, and shouldn't be leaked.

The ng_ksocket(4) used to clear this pointer itself,
but the correct place is here.

Sponsored by:	Nginx, Inc
2013-03-29 13:57:55 +00:00