When a newborn socket moves from the incomplete queue to the complete
one, we need to obtain the listening socket lock after the child's,
which is the wrong order. The old code did that in a potentially
endless loop of mtx_trylock(). The new code makes only one attempt
at mtx_trylock(), and in case of failure references the listening
socket, unlocks the child and locks everything in the right order. If
the listening socket shuts down during that, just bail out.
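A minimal sketch of that pattern using plain mtx(9) and refcount(9)
primitives; the structure and function names below are illustrative only,
not the actual socket code:

    /* Needs <sys/param.h>, <sys/lock.h>, <sys/mutex.h>, <sys/refcount.h>. */
    struct listener {
        struct mtx  l_lock;
        u_int       l_refs;
        bool        l_listening;
    };

    /* Called with the child lock held; the listener lock is also needed. */
    static bool
    lock_listener_from_child(struct listener *head, struct mtx *child_lock)
    {
        if (mtx_trylock(&head->l_lock) != 0)    /* single attempt only */
            return (true);
        refcount_acquire(&head->l_refs);        /* keep the listener around */
        mtx_unlock(child_lock);
        mtx_lock(&head->l_lock);                /* right order: listener... */
        mtx_lock(child_lock);                   /* ...then the child */
        if (!head->l_listening) {               /* shut down meanwhile */
            mtx_unlock(&head->l_lock);
            refcount_release(&head->l_refs);
            return (false);                     /* caller just bails out */
        }
        refcount_release(&head->l_refs);
        return (true);
    }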
Reported & tested by: Jason Eggleston <jeggleston llnw.com>
Reported & tested by: Jason Wolfe <jason llnw.com>
via soisdisconnected(), and unlock earlier to avoid lock
recursion.
This fixes a situation where a socket on the accept queue is reset before
being accepted.
Reported by: Jason Eggleston <jeggleston llnw.com>
socket structure into a listening socket. This resulted in an invalid
instruction fault for all 32-bit platforms.
When INVARIANTS is set the union where the two uninitialized fields
reside gets properly zeroed. This patch ensures the two uninitialized
fields are zeroed when INVARIANTS is undefined.
For 64-bit platforms this issue was not visible, because so->sol_upcall,
which is uninitialized, overlaps with so->so_rcv.sb_state, which is
already zero during soalloc().
For 32-bit platforms this issue was visible and resulted in an invalid
instruction fault, because so->sol_upcall overlaps with
so->so_rcv.sb_sel, which is always initialized to a valid data pointer
during soalloc().
Verifying that the offset locations mentioned above are identical is left
as an exercise for the reader.
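A minimal sketch of the fix described above; the exact place where the
conversion happens and the name of the second field (sol_upcallarg) are
assumptions based on the text:

    /* During conversion of a dataflow socket into a listening socket. */
    static void
    solisten_zero_upcall(struct socket *so)
    {
        /*
         * Without INVARIANTS the union is not pre-zeroed, so clear the
         * listener-side upcall fields explicitly.
         */
        so->sol_upcall = NULL;
        so->sol_upcallarg = NULL;       /* second field name assumed */
    }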
PR: 220452
PR: 220358
Reviewed by: ae (network), gallatin
Differential Revision: https://reviews.freebsd.org/D11475
Sponsored by: Mellanox Technologies
It distinguishes between dataflow sockets and listening sockets, and
in the case of the latter doesn't change resource limits, since listening
sockets don't hold any buffers; they only carry values to be inherited
by their children.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take the selinfos out of the socket buffers into the socket. The
first reason is to support the braindamaged scenario where a socket is
added to kevent(2) and then listen(2) is called on it. The second
reason is that there is a future plan to make socket buffers pluggable,
so that for a dataflow socket the socket buffer can be changed, and
in this case we also want to keep the same selinfos through the lifetime
of a socket.
- Remove struct so_accf. Since the listening stuff no longer
affects struct socket size, just move its fields into the listening part
of the union.
- Provide the sol_upcall field and enforce that so_upcall_set() may be
called only on a dataflow socket, which has buffers; for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to struct socket, to be used instead of the socket buffer
lock to lock fields of struct socket that don't belong to a socket buffer.
- Allow two socket locks to be acquired, but the first one must belong to
a listening socket.
- Make soref()/sorele() use atomic(9). This allows soref() to be done in
some situations without owning the socket lock. There is room for
improvement here; it is possible to make the locking in sorele() optional
as well.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide the function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc. A usage sketch follows this list.
o UNIX local sockets.
- Removal of the ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most of the races exist around spawning a new socket when
we are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of the
global unp_link_lock. Indeed, separating these two locks didn't provide
any extra parallelism in the UNIX sockets.
- A call into uipc_attach() may now happen with unp_link_lock held if we
are accepting, or without unp_link_lock if we are just creating a
socket.
- Another problem in UNIX sockets is that uipc_close() basically did
nothing for a listening socket. The vnode remained open for connections.
This is fixed by removing the vnode in uipc_close(). Maybe the right way
would be to do it for all sockets (not only listening ones): simply move
the vnode teardown from uipc_detach() to uipc_close()?
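A sketch of the solisten_dequeue() usage mentioned above, roughly what the
converted modules' accept loops look like; the locking contract shown
(caller takes SOLISTEN_LOCK(), the function returns with the lock dropped)
and the handler name are assumptions to verify against the tree:

    static void
    example_accept_loop(struct socket *head)
    {
        struct socket *so;
        int error;

        for (;;) {
            SOLISTEN_LOCK(head);
            /* Sleeps until a connection is ready or the listener dies. */
            error = solisten_dequeue(head, &so, 0);
            if (error != 0)
                break;
            example_handle_connection(so);      /* hypothetical handler */
        }
    }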
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
This check has been redundant since it was introduced in r162554.
Reviewed by: emaste, glebius
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D10322
by the shutdown(2) system call. This ability was lost as part of svn
revision 285910.
Reviewed by: ed, rwatson, glebius, hiren
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D10351
for EVFILT_READ at the point of the check, not when the event is
registered. This fixes a problem with asio when accepting a connection.
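A small userspace illustration of the scenario being fixed (a sketch;
address setup is minimal): the socket is registered with kevent(2) before
listen(2) is called, so whether it is a listening socket can only be
decided when the filter actually runs:

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <err.h>

    int
    main(void)
    {
        struct sockaddr_in sin = { .sin_len = sizeof(sin),
            .sin_family = AF_INET };
        struct kevent kev;
        int kq, s;

        if ((s = socket(AF_INET, SOCK_STREAM, 0)) == -1 ||
            (kq = kqueue()) == -1)
            err(1, "setup");
        if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1)
            err(1, "bind");
        /* Register EVFILT_READ while the socket is not listening yet. */
        EV_SET(&kev, s, EVFILT_READ, EV_ADD, 0, 0, NULL);
        if (kevent(kq, &kev, 1, NULL, 0, NULL) == -1)
            err(1, "kevent register");
        /* Only now is listen(2) cast on it; incoming connections must
         * still fire the filter, with kev.data = completed connections. */
        if (listen(s, 128) == -1)
            err(1, "listen");
        if (kevent(kq, NULL, 0, &kev, 1, NULL) == -1)   /* blocks */
            err(1, "kevent wait");
        return (0);
    }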
Reviewed by: kib@, Scott Mitchell
- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.
- Add support for hardware-driven, Receive Side Scaling (RSS) aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API supports rates in
the range from 1 to 4Gbytes/s, which is suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated; a usage
sketch also follows the list below.
- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().
- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.
- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.
- How rate limiting works:
1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.
2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the
traffic will actually be rate limited.
3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN, which causes ip_output() to release and clear the send
tag. Upon the next ip_output() call, allocation of a new "snd_tag" will
be attempted.
4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.
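The userspace side of step 1) boils down to a single setsockopt() call,
for example (a sketch; the option value is assumed here to be a uint32_t
in bytes per second, per the range quoted above; see setsockopt(2) for the
authoritative definition):

    #include <sys/socket.h>
    #include <stdint.h>
    #include <err.h>

    static void
    set_pacing_rate(int s, uint32_t bytes_per_second)
    {
        /* Request a hardware-enforced pacing rate on this connection. */
        if (setsockopt(s, SOL_SOCKET, SO_MAX_PACING_RATE,
            &bytes_per_second, sizeof(bytes_per_second)) == -1)
            err(1, "SO_MAX_PACING_RATE");
    }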
Reviewed by: wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision: https://reviews.freebsd.org/D3687
Sponsored by: Mellanox Technologies
MFC after: 3 months
sources to return timestamps when SO_TIMESTAMP is enabled. Two additional
clock sources are:
o nanosecond resolution realtime clock (equivalent of CLOCK_REALTIME);
o nanosecond resolution monotonic clock (equivalent of CLOCK_MONOTONIC).
In addition to this, this option provides a unified interface to get
bintime (equivalent of using SO_BINTIME), except it is also supported with
IPv6, where SO_BINTIME has never been supported. The long term plan is to
deprecate SO_BINTIME and move everything to using SO_TS_CLOCK.
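A sketch of the intended usage on the receive side, assuming the
SO_TS_MONOTONIC clock id and the SCM_MONOTONIC control message type;
consult <sys/socket.h> for the exact names and values:

    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <string.h>
    #include <time.h>
    #include <err.h>

    static void
    recv_with_monotonic_ts(int s)
    {
        char data[1500], cbuf[CMSG_SPACE(sizeof(struct timespec))];
        struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
        struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = cbuf, .msg_controllen = sizeof(cbuf) };
        struct cmsghdr *cm;
        struct timespec ts;
        int on = 1, clk = SO_TS_MONOTONIC;

        if (setsockopt(s, SOL_SOCKET, SO_TIMESTAMP, &on, sizeof(on)) == -1 ||
            setsockopt(s, SOL_SOCKET, SO_TS_CLOCK, &clk, sizeof(clk)) == -1)
            err(1, "setsockopt");
        if (recvmsg(s, &msg, 0) == -1)
            err(1, "recvmsg");
        /* The arrival time is delivered as a control message. */
        for (cm = CMSG_FIRSTHDR(&msg); cm != NULL;
            cm = CMSG_NXTHDR(&msg, cm))
            if (cm->cmsg_level == SOL_SOCKET &&
                cm->cmsg_type == SCM_MONOTONIC)
                memcpy(&ts, CMSG_DATA(cm), sizeof(ts));
    }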
Idea for this enhancement has been briefly discussed on the Net session
during dev summit in Ottawa last June and the general input was positive.
This change is believed to benefit network benchmarks/profiling as well
as other scenarios where precise time of arrival measurement is necessary.
There are two regression test cases as part of this commit: one extends the
unix domain test code (unix_cmsg) to test the new SCM_XXX types and another
one implements a totally new test case which exchanges UDP packets between
two processes using both conventional methods (i.e. calling clock_gettime(2)
before recv(2) and after send(2)), as well as using setsockopt()+recv() in
the receive path. The resulting delays are checked for sanity for all
supported clock types.
Reviewed by: adrian, gnn
Differential Revision: https://reviews.freebsd.org/D9171
Add a MSG_MORETOCOME message flag. When this flag is set, sosend*
sets PRUS_MORETOCOME when invoking the protocol send method. The aio
worker tasks for sending on a socket set this flag when there are
additional write jobs waiting on the socket buffer.
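A sketch of how a socket AIO send worker might use the flag; everything
except MSG_MORETOCOME itself (the job-queue field name, the helper) is an
assumption:

    static int
    example_aio_send_one(struct socket *so, struct mbuf *m, struct thread *td)
    {
        int flags = MSG_NOSIGNAL;

        if (!TAILQ_EMPTY(&so->so_snd.sb_aiojobq))       /* field assumed */
            flags |= MSG_MORETOCOME;    /* more write jobs are waiting */
        return (sosend(so, NULL, NULL, m, NULL, flags, td));
    }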
Reviewed by: adrian
MFC after: 1 month
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D8955
sooptcopyin() checks whether the size of the data provided by the user is
<= what we can accept, else it strips the size down. On big-endian
platforms we have to move the pointer as well, so that we copy the actual
data.
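A sketch of the adjustment being described (illustrative, not the actual
sooptcopyin() code):

    /* Needs <sys/param.h>, <sys/systm.h>; BYTE_ORDER per <sys/endian.h>. */
    static int
    copyin_clamped(const void *uaddr, void *kaddr, size_t ulen, size_t klen)
    {
        const char *src = uaddr;

        if (ulen > klen) {
    #if BYTE_ORDER == BIG_ENDIAN
            /* The low-order bytes live at the end of the user buffer. */
            src += ulen - klen;
    #endif
            ulen = klen;        /* strip the size down as before */
        }
        return (copyin(src, kaddr, ulen));
    }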
Reviewed by: gnn
Sponsored by: DARPA, AFRL
Sponsored by: HEIF5
Differential Revision: https://reviews.freebsd.org/D7980
improve cancellation robustness.
Introduce a new file operation, fo_aio_queue, which is responsible for
queueing and completing an asynchronous I/O request for a given file.
The AIO subsystem now exports a library of routines to manipulate AIO
requests as well as the ability to run a handler function in the
"default" pool of AIO daemons to service a request.
A default implementation for file types which do not include an
fo_aio_queue method queues requests to the "default" pool, invoking the
fo_read or fo_write methods as before.
The AIO subsystem permits file types to install a private "cancel"
routine when a request is queued to permit safe dequeueing and cleanup
of cancelled requests.
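A sketch of what a file type's fo_aio_queue method could look like. The
helper names aio_set_cancel_function(), aio_schedule(), aio_cancel() and
aio_complete() stand for the request-manipulation library described above;
their exact names and signatures should be taken from the AIO headers, so
treat them as assumptions here:

    static void
    example_aio_cancel(struct kaiocb *job)
    {
        /* Undo whatever queueing we did, then report the cancellation. */
        aio_cancel(job);
    }

    static void
    example_aio_handler(struct kaiocb *job)
    {
        /* Runs in an AIO daemon from the "default" pool. */
        aio_complete(job, /* status */ 0, /* error */ 0);
    }

    static int
    example_fo_aio_queue(struct file *fp, struct kaiocb *job)
    {
        if (!aio_set_cancel_function(job, example_aio_cancel))
            return (ECANCELED);         /* request already cancelled */
        aio_schedule(job, example_aio_handler);
        return (0);
    }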
Sockets now use their own pool of AIO daemons and service per-socket
requests in FIFO order. Socket requests will not block indefinitely,
permitting timely cancellation of all requests.
Due to the now-tight coupling of the AIO subsystem with file types,
the AIO subsystem is now a standard part of all kernels. The VFS_AIO
kernel option and aio.ko module are gone.
Many file types may block indefinitely in their fo_read or fo_write
callbacks resulting in a hung AIO daemon. This can result in hung
user processes (when processes attempt to cancel all outstanding
requests during exit) or a hung system. To protect against this, AIO
requests are only permitted for known "safe" files by default. AIO
requests for all file types can be enabled by setting the new
vfs.aio.enable_unsafe sysctl to a non-zero value. The AIO tests have
been updated to skip operations on unsafe file types if the sysctl is
zero.
Currently, AIO requests on sockets and raw disks are considered safe
and are enabled by default. aio_mlock() is also enabled by default.
Reviewed by: cem, jilles
Discussed with: kib (earlier version)
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D5289
Summary:
Back in 2005, maxim@ attempted to fix shutdown() to return ENOTCONN in case the socket was not connected (r150152). This had to be rolled back (r150155), as it broke some of the existing programs that depend on this behavior. I reapplied this change on my system and indeed, syslogd failed to start up. I fixed this back in February (279016) and MFC'ed it to the supported stable branches. Apart from that, things seem to work out all right.
Since at least Linux and Mac OS X do the right thing, I'd like to go ahead and give this another try. To keep old copies of syslogd working, only start returning ENOTCONN for recent binaries.
I took a look at the XNU sources and they seem to test against SS_ISCONNECTED, SS_ISCONNECTING and SS_ISDISCONNECTING, instead of just SS_ISCONNECTED. That seems reasonable, so let's do the same.
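A minimal userspace illustration of the new behavior for a freshly built
binary (sketch):

    #include <sys/socket.h>
    #include <assert.h>
    #include <errno.h>

    int
    main(void)
    {
        int s = socket(AF_INET, SOCK_STREAM, 0);

        assert(s != -1);
        /* Never connected, so shutting down must now fail. */
        assert(shutdown(s, SHUT_RDWR) == -1);
        assert(errno == ENOTCONN);
        return (0);
    }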
Test Plan:
This issue was uncovered while writing tests for shutdown() in CloudABI:
https://github.com/NuxiNL/cloudlibc/blob/master/src/libc/sys/socket/shutdown_test.c#L26
Reviewers: glebius, rwatson, #manpages, gnn, #network
Reviewed By: gnn, #network
Subscribers: bms, mjg, imp
Differential Revision: https://reviews.freebsd.org/D3039
unp_dispose and unp_gc could race to tear down the same mbuf chains, which
can lead to dereferencing freed filedesc pointers.
This patch adds an IGNORE_RIGHTS flag on unpcbs marking the unpcb's RIGHTS
as invalid/freed. The flag is protected by UNP_LIST_LOCK.
To serialize against unp_gc, unp_dispose needs the socket object. Change the
dom_dispose() KPI to take a socket object instead of an mbuf chain directly.
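The KPI change itself is small; roughly, per the description above (member
declarations shown as a before/after illustration):

    /* Old KPI: dispose of externalized rights in a raw mbuf chain. */
    void    (*dom_dispose)(struct mbuf *m);

    /* New KPI: take the socket, so the unpcb can be examined and the
     * teardown serialized against unp_gc(). */
    void    (*dom_dispose)(struct socket *so);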
PR: 194264
Differential Revision: https://reviews.freebsd.org/D3044
Reviewed by: mjg (earlier version)
Approved by: markj (mentor)
Obtained from: mjg
MFC after: 1 month
Sponsored by: EMC / Isilon Storage Division
message. This can happen when an application is sending packets too big
for the path MTU: recvmsg() will return zero (indicating no data),
but there will be a cmsghdr with cmsg_type set to IPV6_PATHMTU.
Remove the KASSERT(), which does a NULL pointer dereference in this case.
Also, call m_freem() only when m isn't NULL.
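On the userspace side the case looks roughly like this (sketch; assumes
IPV6_RECVPATHMTU has been enabled on the socket):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <string.h>

    static void
    handle_pathmtu(int s)
    {
        char cbuf[CMSG_SPACE(sizeof(struct ip6_mtuinfo))];
        struct msghdr msg = { .msg_control = cbuf,
            .msg_controllen = sizeof(cbuf) };
        struct cmsghdr *cm;
        struct ip6_mtuinfo mtu;

        /* A return value of 0 carries no data, only the PMTU report. */
        if (recvmsg(s, &msg, 0) != 0)
            return;
        for (cm = CMSG_FIRSTHDR(&msg); cm != NULL;
            cm = CMSG_NXTHDR(&msg, cm))
            if (cm->cmsg_level == IPPROTO_IPV6 &&
                cm->cmsg_type == IPV6_PATHMTU)
                memcpy(&mtu, CMSG_DATA(cm), sizeof(mtu));
    }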
PR: 197882
MFC after: 1 week
Sponsored by: Yandex LLC
It is safe to move the call to socantsendmore_locked() after
sbdrop_locked() as long as we hold the sockbuf lock across the two
calls.
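In other words, with so_snd locked once across both calls (sketch; the
wakeup inside socantsendmore_locked() is expected to take care of
releasing the lock):

    static void
    drop_and_mark(struct socket *so, int len)
    {
        SOCKBUF_LOCK(&so->so_snd);
        sbdrop_locked(&so->so_snd, len);    /* drop the acked data first */
        socantsendmore_locked(so);          /* then mark the buffer; its
                                               wakeup path drops the lock */
    }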
CR: D1805
Reviewed by: adrian, kmacy, julian, rwatson
why zero-sized mbufs could appear in socket buffers.
A proper fix would be to divorce record socket buffers from stream
socket buffers, and to divorce the pru_send that accepts normal data from
the pru_send that accepts control data.
sending not ready data:
o Add a new flag to pru_send() flags - PRUS_NOTREADY.
o Add a new protocol method pru_ready().
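Roughly, in terms of declarations (the flag value and the pru_ready()
signature are per my reading of the change and should be checked against
sys/protosw.h):

    /* pru_send() flag: the appended mbufs are not yet ready to be sent. */
    #define PRUS_NOTREADY   0x8             /* value illustrative */

    /* New method: mark "count" mbufs starting at "m" as ready to send. */
    int     (*pru_ready)(struct socket *so, struct mbuf *m, int count);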
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
o Introduce a notion of "not ready" mbufs in socket buffers. These
mbufs are now being populated by some I/O in the background and are
referenced outside. This has the following implications:
- An mbuf which is "not ready" can't be taken out of the buffer.
- Neither can an mbuf that is behind a "not ready" one in the queue.
- If the socket buffer is flushed, then "not ready" mbufs shouldn't be
freed.
o In struct sockbuf the sb_cc field is split into sb_ccc and sb_acc.
The sb_ccc stands for "claimed character count", or "committed
character count". The sb_acc stands for "available character count".
Consumers of the socket buffer API shouldn't access them directly
anymore, but use sbused() and sbavail() respectively.
o Not ready mbufs are marked with M_NOTREADY, and ready but blocked ones
with M_BLOCKED.
o New field sb_fnrdy points to the first not ready mbuf, to avoid linear
search.
o A new function sbready() is provided to activate a certain number of
mbufs in a socket buffer (sketched below).
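Conceptually, activating mbufs does something like the following (a
simplified sketch, not the real sbready() body):

    static void
    example_make_ready(struct sockbuf *sb, struct mbuf *m, int count)
    {
        /*
         * Clear the marker flags and credit the bytes from the "claimed"
         * count (sb_ccc, charged at append time) into the "available"
         * count (sb_acc), which is what sbavail() reports.
         */
        for (; count > 0 && m != NULL; m = m->m_next, count--) {
            m->m_flags &= ~(M_NOTREADY | M_BLOCKED);
            sb->sb_acc += m->m_len;
        }
        sb->sb_fnrdy = m;       /* next not-ready mbuf, if any remain */
    }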
A special note on SCTP:
SCTP has its own sockbufs. Unfortunately, the FreeBSD stack doesn't yet
allow protocol-specific sockbufs. Thus, SCTP does some hacks to make
itself compatible with FreeBSD: it manages sockbufs on its own, but keeps
sb_cc updated to inform the stack of the amount of data in them. The new
notion of "not ready" data isn't supported by SCTP. Instead, only a
mechanical substitution is done: s/sb_cc/sb_ccc/.
A proper solution would be to take away struct sockbuf from struct
socket and allow protocols to implement their own socket buffers, like
SCTP already does. This was discussed with rrs@.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
sb_cc member of struct sockbuf to a couple of inline functions:
sbavail() and sbused()
Right now they are equal, but once the notion of "not ready socket buffer
data" is checked in, they are going to be different.
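At this point the two accessors are trivially equal; a sketch of their
shape:

    static inline u_int
    sbavail(struct sockbuf *sb)
    {
        /* Bytes a consumer may read right now. */
        return (sb->sb_cc);
    }

    static inline u_int
    sbused(struct sockbuf *sb)
    {
        /* Bytes charged against the socket buffer's limits. */
        return (sb->sb_cc);
    }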
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
good example is socket options that aren't necessarily generic. To
this end, OSD is added to the socket structure and hooks are defined
for key operations on sockets. These are:
o soalloc() and sodealloc()
o Get and set socket options
o Socket related kevent filters.
One aspect about hhook that appears to be not fully baked is the return
semantics (the return value from the hook is ignored in hhook_run_hooks()
at the time of commit). To support return values, the socket_hhook_data
structure contains a 'status' field to hold return values.
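A sketch of what a module-side hook function looks like under this scheme;
the argument list follows the generic hhook(9) hook prototype, and the
claim that ctx_data points at a struct socket_hhook_data is per the
description above (registration details omitted):

    static int
    example_sockopt_hook(int32_t hhook_type, int32_t hhook_id, void *udata,
        void *ctx_data, void *hdata, struct osd *hosd)
    {
        struct socket_hhook_data *hd = ctx_data;

        /* Handle a non-generic socket option here. */
        hd->status = 0;         /* hhook ignores our return value, so the
                                   result is passed back via 'status' */
        return (0);
    }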
Submitted by: Anuranjan Shukla <anshukla@juniper.net>
Obtained from: Juniper Networks, Inc.
so it really should not be under "optional inet". The fact that uipc_accf.c
lives under kern/ lends some weight to making it a "standard" file.
Moving kern/uipc_accf.c from "optional inet" to "standard" eliminates the
need for #ifdef INET in kern/uipc_socket.c.
Also, this meant the net.inet.accf.unloadable sysctl needed to move, as
net.inet does not exist without networking compiled in (as it lives in
netinet/in_proto.c). The new sysctl has been named net.accf.unloadable.
In order to support existing accept filter sysctls, the net.inet.accf node
has been added to netinet/in_proto.c.
Submitted by: Steve Kiernan <stevek@juniper.net>
Obtained from: Juniper Networks, Inc.