freebsd-dev

Author	SHA1	Message	Date
Gleb Smirnoff	d978bbea8a	Simplify wait/nowait code, eventually killing last remnant of historical mbuf(9) allocator flag. Sponsored by: Nginx, Inc.	2014-01-16 13:45:41 +00:00
Hiren Panchasara	16ef0fa833	Fix typo in a comment.	2013-11-08 20:11:15 +00:00
Maksim Yevmenkin	f7a3a2a57c	Rate limit (to once per minute) "Listen queue overflow" message in sonewconn(). Reviewed by: scottl, lstewart Obtained from: Netflix, Inc MFC after: 2 weeks	2013-10-31 20:33:21 +00:00
Konstantin Belousov	3846a82284	Remove zero-copy sockets code. It only worked for anonymous memory, and the equivalent functionality is now provided by sendfile(2) over posix shared memory filedescriptor. Remove the cow member of struct vm_page, and rearrange the remaining members. While there, make hold_count unsigned. Requested and reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation Approved by: re (delphij)	2013-09-16 06:25:54 +00:00
Davide Italiano	7729cbf1a6	Fix socket buffer timeouts precision using the new sbintime_t KPI instead of relying on the tvtohz() workaround. The latter has been introduced lately by jhb@ (r254699) in order to have a fix that can be backported to STABLE. Reported by: Vitja Makarov <vitja.makarov at gmail dot com> Reviewed by: jhb (earlier version)	2013-09-01 23:34:53 +00:00
John Baldwin	e289e9f2ca	Don't return an error for socket timeouts that are too large. Just cap them to INT_MAX ticks instead. PR: kern/181416 (r254699 really) Requested by: bde MFC after: 2 weeks	2013-08-29 15:59:05 +00:00
John Baldwin	e77c507d60	Use tvtohz() to convert a socket buffer timeout to a tick value rather than using a home-rolled version. The home-rolled version could result in shorter-than-requested sleeps. Reported by: Vitja Makarov <vitja.makarov@gmail.com> MFC after: 2 weeks	2013-08-23 13:47:41 +00:00
Andre Oppermann	6753da1356	When the accept queue is full print the number of already pending new connections instead of by how many we're over the limit, which is always 1. Noticed by: jmallet MFC after: 1 week	2013-05-08 14:13:14 +00:00
Andre Oppermann	f89d4c3acf	Back out r249318, r249320 and r249327 due to a heisenbug most likely related to a race condition in the ipi_hash_lock with the exact cause currently unknown but under investigation.	2013-05-06 16:42:18 +00:00
Jilles Tjoelker	cd31b6dd08	socket: Make shutdown() wake up a blocked accept(). A blocking accept (and some other operations) waits on &so->so_timeo. Once it wakes up, it will detect the SBS_CANTRCVMORE bit. The error from accept() is [ECONNABORTED] which is not the nicest one -- the thread calling accept() needs to know out-of-band what is happening. A spurious wakeup on so->so_timeo appears harmless (sleep retried) except when lingering on close (SO_LINGER, and in that case there is no descriptor to call shutdown() on) so this should be fairly safe. A shutdown() already woke up a blocked accept() for TCP sockets, but not for Unix domain sockets. This fix is generic for all domains. This patch was sent to -hackers@ and -net@ on April 5. MFC after: 2 weeks	2013-04-30 15:06:30 +00:00
Jim Harris	d58a96538f	Fix the build.	2013-04-10 00:35:08 +00:00
Andre Oppermann	e8b3186b6a	Change certain heavily used network related mutexes and rwlocks to reside on their own cache line to prevent false sharing with other nearby structures, especially for those in the .bss segment. NB: Those mutexes and rwlocks with variables next to them that get changed on every invocation do not benefit from their own cache line. Actually it may be net negative because two cache misses would be incurred in those cases.	2013-04-09 21:02:20 +00:00
Gleb Smirnoff	a307eb26ed	When soreceive_generic() hands off an mbuf from buffer, clear its pointer to next record, since next record belongs to the buffer, and shouldn't be leaked. The ng_ksocket(4) used to clear this pointer itself, but the correct place is here. Sponsored by: Nginx, Inc	2013-03-29 13:57:55 +00:00
Jilles Tjoelker	c2e3c52e0d	Implement SOCK_CLOEXEC, SOCK_NONBLOCK and MSG_CMSG_CLOEXEC. This change allows creating file descriptors with close-on-exec set in some situations. SOCK_CLOEXEC and SOCK_NONBLOCK can be OR'ed in socket() and socketpair()'s type parameter, and MSG_CMSG_CLOEXEC to recvmsg() makes file descriptors (SCM_RIGHTS) atomically close-on-exec. The numerical values for SOCK_CLOEXEC and SOCK_NONBLOCK are as in NetBSD. MSG_CMSG_CLOEXEC is the first free bit for MSG_. The SOCK_ flags are not passed to MAC because this may cause incorrect failures and can be done later via fcntl() anyway. On the other hand, audit is expected to cope with the new flags. For MSG_CMSG_CLOEXEC, unp_externalize() is extended to take a flags argument. Reviewed by: kib	2013-03-19 20:58:17 +00:00
Michael Tuexen	fbb3471022	Return an error if sctp_peeloff() fails because a socket can't be allocated. MFC after: 3 days	2013-03-11 17:43:55 +00:00
Pawel Jakub Dawidek	7493f24ee6	- Implement two new system calls: int bindat(int fd, int s, const struct sockaddr addr, socklen_t addrlen); int connectat(int fd, int s, const struct sockaddr name, socklen_t namelen); which allow to bind and connect respectively to a UNIX domain socket with a path relative to the directory associated with the given file descriptor 'fd'. - Add manual pages for the new syscalls. - Make the new syscalls available for processes in capability mode sandbox. - Add capability rights CAP_BINDAT and CAP_CONNECTAT that has to be present on the directory descriptor for the syscalls to work. - Update audit(4) to support those two new syscalls and to handle path in sockaddr_un structure relative to the given directory descriptor. - Update procstat(1) to recognize the new capability rights. - Document the new capability rights in cap_rights_limit(2). Sponsored by: The FreeBSD Foundation Discussed with: rwatson, jilles, kib, des	2013-03-02 21:11:30 +00:00
Pawel Jakub Dawidek	6e0b674628	Configure UMA warnings for the following zones: - unp_zone: kern.ipc.maxsockets limit reached - socket_zone: kern.ipc.maxsockets limit reached - zone_mbuf: kern.ipc.nmbufs limit reached - zone_clust: kern.ipc.nmbclusters limit reached - zone_jumbop: kern.ipc.nmbjumbop limit reached - zone_jumbo9: kern.ipc.nmbjumbo9 limit reached - zone_jumbo16: kern.ipc.nmbjumbo16 limit reached Note that those warnings are printed not often than every five minutes and can be globally turned off by setting sysctl/tunable vm.zone_warnings to 0. Discussed on: arch Obtained from: WHEEL Systems MFC after: 2 weeks	2012-12-07 22:30:30 +00:00
Pawel Jakub Dawidek	94b0ae5d62	- Make socket_zone static - it is used only in this file. - Update maxsockets on uma_zone_set_max(). Obtained from: WHEEL Systems	2012-12-07 22:15:51 +00:00
Pawel Jakub Dawidek	68412f4179	Style cleanups.	2012-12-07 22:13:33 +00:00
Kevin Lo	b08d12d9be	- according to POSIX, make socket(2) return EAFNOSUPPORT rather than EPROTONOSUPPORT if the address family is not supported. - introduce pffinddomain() to find a domain by family and use it as appropriate. Reviewed by: glebius	2012-12-07 02:22:48 +00:00
Gleb Smirnoff	eb1b1807af	Mechanically substitute flags from historic mbuf allocator with malloc(9) flags within sys. Exceptions: - sys/contrib not touched - sys/mbuf.h edited manually	2012-12-05 08:04:20 +00:00
Andre Oppermann	358c7f47da	Fix r243627 by testing against the head socket instead of the socket just created. MFC after: 1 week X-MFC-with: r243627	2012-11-27 22:35:48 +00:00
Andre Oppermann	ead46972a4	Base the mbuf related limits on the available physical memory or kernel memory, whichever is lower. The overall mbuf related memory limit must be set so that mbufs (and clusters of various sizes) can't exhaust physical RAM or KVM. The limit is set to half of the physical RAM or KVM (whichever is lower) as the baseline. In any normal scenario we want to leave at least half of the physmem/kvm for other kernel functions and userspace to prevent it from swapping too easily. Via a tunable kern.maxmbufmem the limit can be upped to at most 3/4 of physmem/kvm. At the same time divorce maxfiles from maxusers and set maxfiles to physpages / 8 with a floor based on maxusers. This way busy servers can make use of the significantly increased mbuf limits with a much larger number of open sockets. Tidy up ordering in init_param2() and check up on some users of those values calculated here. Out of the overall mbuf memory limit 2K clusters and 4K (page size) clusters to get 1/4 each because these are the most heavily used mbuf sizes. 2K clusters are used for MTU 1500 ethernet inbound packets. 4K clusters are used whenever possible for sends on sockets and thus outbound packets. The larger cluster sizes of 9K and 16K are limited to 1/6 of the overall mbuf memory limit. When jumbo MTU's are used these large clusters will end up only on the inbound path. They are not used on outbound, there it's still 4K. Yes, that will stay that way because otherwise we run into lots of complications in the stack. And it really isn't a problem, so don't make a scene. Normal mbufs (256B) weren't limited at all previously. This was problematic as there are certain places in the kernel that on allocation failure of clusters try to piece together their packet from smaller mbufs. The mbuf limit is the number of all other mbuf sizes together plus some more to allow for standalone mbufs (ACK for example) and to send off a copy of a cluster. Unfortunately there isn't a way to set an overall limit for all mbuf memory together as UMA doesn't support such a limiting. NB: Every cluster also has an mbuf associated with it. Two examples on the revised mbuf sizing limits: 1GB KVM: 512MB limit for mbufs 419,430 mbufs 65,536 2K mbuf clusters 32,768 4K mbuf clusters 9,709 9K mbuf clusters 5,461 16K mbuf clusters 16GB RAM: 8GB limit for mbufs 33,554,432 mbufs 1,048,576 2K mbuf clusters 524,288 4K mbuf clusters 155,344 9K mbuf clusters 87,381 16K mbuf clusters These defaults should be sufficient for even the most demanding network loads. MFC after: 1 month	2012-11-27 21:19:58 +00:00
Andre Oppermann	2c3142c82c	Fix a race on listen socket teardown where while draining the accept queues a new socket/connection may be added to the queue due to a race on the ACCEPT_LOCK. The submitted patch is slightly changed in comments, teardown and locking order and extended with KASSERT's. Submitted by: Vijay Singh <vijju.singh-at-gmail-dot-com> Found by: His team. MFC after: 1 week	2012-11-27 20:04:52 +00:00
Andre Oppermann	e8ad36aba4	In soreceive_stream() don't drop an already dequeued mbuf chain by overwriting the return mbuf pointer with newly received data after a loop. Instead append the new mbuf chain to the existing one. Fix up sb_lastrecord when dequeuing mbuf's so that sbappend_stream() doesn't get confused. For the remainder copy case in the mbuf delivery part deduct the copied length len instead of the whole mbuf length. Additionally don't depend on 'n' being being available which isn't true in the case of MSG_PEEK. Fix the MSG_WAITALL case by comparing against sb_hiwat. Before it was looping for every receive as sb_lowat normally is zero. Add comment about issue with (MSG_WAITALL \| MSG_PEEK) which isn't properly handled. Submitted by: trociny (except for the change in last paragraph)	2012-10-29 12:31:12 +00:00
Andre Oppermann	fdd1b7f52a	Add logging for socket attach failures in sonewconn() during accept(2). Include the pointer to the PCB so it can be attributed to a particular application by corresponding it to "netstat -A" output. MFC after: 2 weeks	2012-10-29 12:14:57 +00:00
Andre Oppermann	e37e60c379	Replace the ill-named ZERO_COPY_SOCKET kernel option with two more appropriate named kernel options for the very distinct send and receive path. "options SOCKET_SEND_COW" enables VM page copy-on-write based sending of data on an outbound socket. NB: The COW based send mechanism is not safe and may result in kernel crashes. "options SOCKET_RECV_PFLIP" enables VM kernel/userspace page flipping for special disposable pages attached as external storage to mbufs. Only the naming of the kernel options is changed and their corresponding #ifdef sections are adjusted. No functionality is added or removed. Discussed with: alc (mechanism and limitations of send side COW)	2012-10-23 14:19:44 +00:00
Andre Oppermann	dc00208ec4	Grammar fixes to r241781. Submitted by: alc	2012-10-20 19:38:22 +00:00
Andre Oppermann	2bdf61ca29	Hide the unfortunate named sysctl kern.ipc.somaxconn from sysctl -a output and replace it with a new visible sysctl kern.ipc.acceptqueue of the same functionality. It specifies the maximum length of the accept queue on a listen socket. The old kern.ipc.somaxconn remains available for reading and writing for compatibility reasons so that existing programs, scripts and configurations continue to work. There no plans to ever remove the orginal and now hidden kern.ipc.somaxconn.	2012-10-20 12:53:14 +00:00
Andre Oppermann	1490de00a8	Tidy up somaxconn (accept queue limit) and related functions and move it together into one place.	2012-10-20 10:51:32 +00:00
Andre Oppermann	4b62fe5b0b	Move socket UMA zone initialization functionality together into one place.	2012-10-19 12:16:29 +00:00
Andre Oppermann	cf8e6069e8	Move UMA socket zone initialization from uipc_domain.c to uipc_socket.c into one place next to its other related functions to avoid confusion.	2012-10-19 10:15:32 +00:00
Andre Oppermann	d10733a8da	Remove unnecessary includes from sosend_copyin() and fix a couple of style issues.	2012-10-18 21:04:30 +00:00
Andre Oppermann	1d147759db	Remove double-wrapping of #ifdef ZERO_COPY_SOCKETS within zero copy specialized sosend_copyin() helper function.	2012-10-18 20:22:17 +00:00
Garrett Wollman	48b5c7410f	Fix spelling of the function name in two assertion messages.	2012-10-02 18:38:05 +00:00
Mikolaj Golub	bb9f214f64	In soreceive_generic() remove the optimization for the case when MSG_WAITALL is set, and it is possible to do the entire receive operation at once if we block (resid <= hiwat). Actually it might make the recv(2) with MSG_WAITALL flag get stuck when there is enough space in the receiver buffer to satisfy the request but not enough to open the window closed previously due to the buffer being full. The issue can be reproduced using the following scenario: On the sender side do 2 send(2) requests: 1) data of size much smaller than SOBUF_SIZE (e.g. SOBUF_SIZE / 10); 2) data of size equal to SOBUF_SIZE. On the receiver side do 2 recv(2) requests with MSG_WAITALL flag set: 1) recv() data of SOBUF_SIZE / 10 size; 2) recv() data of SOBUF_SIZE size; We totally fill the receiver buffer with one SOBUF_SIZE/10 size request and partial SOBUF_SIZE request. When the first request is processed we get SOBUF_SIZE/10 free space. It is just enough to receive the rest of bytes for the second request, and soreceive_generic() blocks in the part that is a subject of this change waiting for the rest. But the window was closed when the buffer was filled and to avoid silly window syndrome it opens only when available space is larger than sb_hiwat/4 or maxseg. So it is stuck and pending data is only sent via TCP window probes. Discussed with: kib (long ago) MFC after: 2 weeks	2012-09-02 07:33:52 +00:00
Mikolaj Golub	2ad099fcb1	In soreceive_generic() when checking if the type of mbuf has changed check it for MT_CONTROL type too, otherwise the assertion "m->m_type == MT_DATA" below may be triggered by the following scenario: - the sender sends some data (MT_DATA) and then a file descriptor (MT_CONTROL); - the receiver calls recv(2) with a MSG_WAITALL asking for data larger than the receive buffer (uio_resid > hiwat). MFC after: 2 week	2012-09-02 07:29:37 +00:00
Mikolaj Golub	e71a7957bd	Fix KASSERT message. MFC after: 3 days	2012-07-03 19:08:02 +00:00
Navdeep Parhar	60a305887a	- Remove redundant call to pr_ctloutput from code that handles SO_SETFIB. - Add a check for errors during copyin while here. Reviewed by: julian, bz MFC after: 2 weeks	2012-04-03 18:38:00 +00:00
Konstantin Belousov	747d2fa178	Add SO_PROTOCOL/SO_PROTOTYPE socket SOL_SOCKET-level option to get the socket protocol number. This is useful since the socket type can be implemented by different protocols in the same protocol family, e.g. SOCK_STREAM may be provided by both TCP and SCTP. Submitted by: Jukka A. Ukkonen <jau iki fi> PR: kern/162352 Discussed with: bz Reviewed by: glebius MFC after: 2 weeks	2012-02-26 13:55:43 +00:00
Konstantin Belousov	9493639e35	Remove apparently redundand checks for socket so_proto being non-NULL from sosetopt() and sogetopt(). No exposed sockets may have so_proto invalid. Discussed with: bz, rwatson Reviewed by: glebius MFC after: 2 weeks	2012-02-26 13:51:05 +00:00
Konstantin Belousov	526d0bd547	Fix found places where uio_resid is truncated to int. Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from the usermode. Discussed with: bde, das (previous versions) MFC after: 1 month	2012-02-21 01:05:12 +00:00
Bjoern A. Zeeb	ee799639e8	Add SO_SETFIB option support on PF_INET6 sockets and allow inheriting the FIB number from the process, as set by setfib(2), on socket creation. Sponsored by: Cisco Systems, Inc.	2012-02-03 11:00:53 +00:00
Robert Millan	ea4d9a14f1	Remove a few bits of FreeBSD 2.x compatibility code. Approved by: kib (mentor)	2011-11-14 18:21:27 +00:00
Attilio Rao	6aba400a70	Fix a deficiency in the selinfo interface: If a selinfo object is recorded (via selrecord()) and then it is quickly destroyed, with the waiters missing the opportunity to awake, at the next iteration they will find the selinfo object destroyed, causing a PF#. That happens because the selinfo interface has no way to drain the waiters before to destroy the registered selinfo object. Also this race is quite rare to get in practice, because it would require a selrecord(), a poll request by another thread and a quick destruction of the selrecord()'ed selinfo object. Fix this by adding the seldrain() routine which should be called before to destroy the selinfo objects (in order to avoid such case), and fix the present cases where it might have already been called. Sometimes, the context is safe enough to prevent this type of race, like it happens in device drivers which installs selinfo objects on poll callbacks. There, the destruction of the selinfo object happens at driver detach time, when all the filedescriptors should be already closed, thus there cannot be a race. For this case, mfi(4) device driver can be set as an example, as it implements a full correct logic for preventing this from happening. Sponsored by: Sandvine Incorporated Reported by: rstone Tested by: pluknet Reviewed by: jhb, kib Approved by: re (bz) MFC after: 3 weeks	2011-08-25 15:51:54 +00:00
Andre Oppermann	695da99eba	In the experimental soreceive_stream(): o Move the non-blocking socket test below the SBS_CANTRCVMORE so that EOF is correctly returned on a remote connection close. o In the non-blocking socket test compare SS_NBIO against the so->so_state field instead of the incorrect sb->sb_state field. o Simplify the ENOTCONN test by removing cases that can't occur. Submitted by: trociny (with some further tweaks by committer) Tested by: trociny	2011-07-08 10:50:13 +00:00
Andre Oppermann	1c6e7fa7f1	Remove the TCP_SORECEIVE_STREAM compile time option. The use of soreceive_stream() for TCP still has to be enabled with the loader tuneable net.inet.tcp.soreceive_stream. Suggested by: trociny and others	2011-07-07 10:37:14 +00:00
Mikolaj Golub	3204c8e596	In soreceive_generic(), if MSG_WAITALL is set but the request is larger than the receive buffer, we have to receive in sections. When notifying the protocol that some data has been drained the lock is released for a moment. Returning we block waiting for the rest of data. There is a race, when data could arrive while the lock was released and then the connection stalls in sbwait. Fix this by checking for data before blocking and skip blocking if there are some. PR: kern/154504 Reported by: Andrey Simonenko <simon@comsys.ntu-kpi.kiev.ua> Tested by: Andrey Simonenko <simon@comsys.ntu-kpi.kiev.ua> Reviewed by: rwatson Approved by: kib (co-mentor) MFC after: 2 weeks	2011-05-29 18:00:50 +00:00
Bjoern A. Zeeb	1fb51a12f2	Mfp4 CH=177274,177280,177284-177285,177297,177324-177325 VNET socket push back: try to minimize the number of places where we have to switch vnets and narrow down the time we stay switched. Add assertions to the socket code to catch possibly unset vnets as seen in r204147. While this reduces the number of vnet recursion in some places like NFS, POSIX local sockets and some netgraph, .. recursions are impossible to fix. The current expectations are documented at the beginning of uipc_socket.c along with the other information there. Sponsored by: The FreeBSD Foundation Sponsored by: CK Software GmbH Reviewed by: jhb Tested by: zec Tested by: Mikolaj Golub (to.my.trociny gmail.com) MFC after: 2 weeks	2011-02-16 21:29:13 +00:00
Daniel Eischen	f7e6ce6d7a	Allow the SO_SETFIB socket option to select the default (0) routing table. Reviewed by: julian	2011-02-13 00:14:13 +00:00

1 2 3 4 5 ...

402 Commits