freebsd-dev

Author	SHA1	Message	Date
Hans Petter Selasky	f3e7afe2d7	Implement kernel support for hardware rate limited sockets. - Add RATELIMIT kernel configuration keyword which must be set to enable the new functionality. - Add support for hardware driven, Receive Side Scaling, RSS aware, rate limited sendqueues and expose the functionality through the already established SO_MAX_PACING_RATE setsockopt(). The API support rates in the range from 1 to 4Gbytes/s which are suitable for regular TCP and UDP streams. The setsockopt(2) manual page has been updated. - Add rate limit function callback API to "struct ifnet" which supports the following operations: if_snd_tag_alloc(), if_snd_tag_modify(), if_snd_tag_query() and if_snd_tag_free(). - Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT flag, which tells if a network driver supports rate limiting or not. - This patch also adds support for rate limiting through VLAN and LAGG intermediate network devices. - How rate limiting works: 1) The userspace application calls setsockopt() after accepting or making a new connection to set the rate which is then stored in the socket structure in the kernel. Later on when packets are transmitted a check is made in the transmit path for rate changes. A rate change implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the destination network interface, which then sets up a custom sendqueue with the given rate limitation parameter. A "struct m_snd_tag" pointer is returned which serves as a "snd_tag" hint in the m_pkthdr for the subsequently transmitted mbufs. 2) When the network driver sees the "m->m_pkthdr.snd_tag" different from NULL, it will move the packets into a designated rate limited sendqueue given by the snd_tag pointer. It is up to the individual drivers how the rate limited traffic will be rate limited. 3) Route changes are detected by the NIC drivers in the ifp->if_transmit() routine when the ifnet pointer in the incoming snd_tag mismatches the one of the network interface. The network adapter frees the mbuf and returns EAGAIN which causes the ip_output() to release and clear the send tag. Upon next ip_output() a new "snd_tag" will be tried allocated. 4) When the PCB is detached the custom sendqueue will be released by a non-blocking ifp->if_snd_tag_free() call to the currently bound network interface. Reviewed by: wblock (manpages), adrian, gallatin, scottl (network) Differential Revision: https://reviews.freebsd.org/D3687 Sponsored by: Mellanox Technologies MFC after: 3 months	2017-01-18 13:31:17 +00:00
Maxim Sobolev	339efd75a4	Add a new socket option SO_TS_CLOCK to pick from several different clock sources to return timestamps when SO_TIMESTAMP is enabled. Two additional clock sources are: o nanosecond resolution realtime clock (equivalent of CLOCK_REALTIME); o nanosecond resolution monotonic clock (equivalent of CLOCK_MONOTONIC). In addition to this, this option provides unified interface to get bintime (equivalent of using SO_BINTIME), except it also supported with IPv6 where SO_BINTIME has never been supported. The long term plan is to depreciate SO_BINTIME and move everything to using SO_TS_CLOCK. Idea for this enhancement has been briefly discussed on the Net session during dev summit in Ottawa last June and the general input was positive. This change is believed to benefit network benchmarks/profiling as well as other scenarios where precise time of arrival measurement is necessary. There are two regression test cases as part of this commit: one extends unix domain test code (unix_cmsg) to test new SCM_XXX types and another one implementis totally new test case which exchanges UDP packets between two processes using both conventional methods (i.e. calling clock_gettime(2) before recv(2) and after send(2)), as well as using setsockopt()+recv() in receive path. The resulting delays are checked for sanity for all supported clock types. Reviewed by: adrian, gnn Differential Revision: https://reviews.freebsd.org/D9171	2017-01-16 17:46:38 +00:00
Conrad Meyer	1d64db52f3	Fix a variety of cosmetic typos and misspellings No functional change. PR: 216096, 216097, 216098, 216101, 216102, 216106, 216109, 216110 Reported by: Bulat <bltsrc at mail.ru> Sponsored by: Dell EMC Isilon	2017-01-15 18:00:45 +00:00
Gleb Smirnoff	0f7ddf91e9	Use getsock_cap() instead of deprecated fgetsock(). Reviewed by: tuexen	2017-01-13 16:54:44 +00:00
Michael Tuexen	24209f0122	Ensure that the buffer length and the length provided in the IPv4 header match when using a raw socket to send IPv4 packets and providing the header. If they don't match, let send return -1 and set errno to EINVAL. Before this patch is was only enforced that the length in the header is not larger then the buffer length. PR: 212283 Reviewed by: ae, gnn MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D9161	2017-01-13 10:55:26 +00:00
Maxim Sobolev	5e946c03c7	Fix slight type mismatch between so_options defined in sys/socketvar.h and tw_so_options defined here which is supposed to be a copy of the former (short vs u_short respectively). Switch tw_so_options to be "signed short" to match the type of the field it's inherited from.	2017-01-12 10:14:54 +00:00
Hiren Panchasara	b8a2fb91f6	sysctl net.inet.tcp.hostcache.list in a jail can see connections from other jails and the host. This commit fixes it. PR: 200361 Submitted by: bz (original version), hiren (minor corrections) Reported by: Marcus Reid <marcus at blazingdot dot com> Reviewed by: bz, gnn Tested by: Lohith Bellad <lohithbsd at gmail dot com> MFC after: 1 week Sponsored by: Limelight Networks (minor corrections)	2017-01-05 17:22:09 +00:00
George V. Neville-Neil	fad073dd44	Followup to mtod removal in main stack (r311225). Continued removal of mtod() calls from TCP_PROBE macros. MFC after: 1 week Sponsored by: Limelight Networks	2017-01-04 04:00:28 +00:00
George V. Neville-Neil	2b9c998413	Fix DTrace TCP tracepoints to not use mtod() as it is both unnecessary and dangerous. Those wanting data from an mbuf should use DTrace itself to get the data. PR: 203409 Reviewed by: hiren MFC after: 1 week Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D9035	2017-01-04 02:19:13 +00:00
Enji Cooper	cfff8d3dbd	Unbreak ip_carp with WITHOUT_INET6 enabled by conditionalizing all IPv6 structs under the INET6 #ifdef. Similarly (even though it doesn't seem to affect the build), conditionalize all IPv4 structs under the INET #ifdef This also unbreaks the LINT-NOINET6 tinderbox target on amd64; I have not verified other MACHINE/TARGET pairs (e.g. armv6/arm). MFC after: 2 weeks X-MFC with: r310847 Pointyhat to: jpaetzel Reported by: O. Hartmann <o.hartmann@walstatt.org>	2016-12-30 21:33:01 +00:00
Josh Paetzel	8151740c88	Harden CARP against network loops. If there is a loop in the network a CARP that is in MASTER state will see it's own broadcasts, which will then cause it to assume BACKUP state. When it assumes BACKUP it will stop sending advertisements. In that state it will no longer see advertisements and will assume MASTER... We can't catch all the cases where we are seeing our own CARP broadcast, but we can catch the obvious case. Submitted by: torek Obtained from: FreeNAS MFC after: 2 weeks Sponsored by: iXsystems	2016-12-30 18:46:21 +00:00
Andrey V. Elsukov	2e77d270c1	When we are sending IP fragments, update ip pointers in IP_PROBE() for each fragment. MFC after: 1 week	2016-12-29 19:57:46 +00:00
Michael Tuexen	2048d80aa3	Consistent handling of errors reported from the lower layer. MFC after: 3 days	2016-12-27 22:14:41 +00:00
Michael Tuexen	b7b84c0e02	Whitespace changes. The toolchain for processing the sources has been updated. No functional change. MFC after: 3 days	2016-12-26 11:06:41 +00:00
Michael Tuexen	d6194c562f	Remove a KASSERT which is not always true. In case of the empty queue tp->snd_holes and tcp_sackhole_insert() failing due to memory shortage, tp->snd_holes will be empty. This problem was hit when stress tests where performed by pho. PR: 215513 Reported by: pho Tested by: pho Sponsored by: Netflix, Inc.	2016-12-25 17:37:18 +00:00
Gleb Smirnoff	030b9c2f69	Remove assigned only variable.	2016-12-21 22:47:10 +00:00
Andrey V. Elsukov	ad9f4d6ab6	ip[6]_tryforward does inbound and outbound packet firewall processing. This can lead to change of mbuf pointer (packet filter could do m_pullup(), NAT, etc). Also in case of change of destination address, tryforward can decide that packet should be handled by local system. In this case modified mbuf can be returned to the ip[6]_input(). To handle this correctly, check M_FASTFWD_OURS flag after return from ip[6]_tryforward. And if it is present, update variables that depend from mbuf pointer and skip another inbound firewall processing. No objection from: #network MFC after: 3 weeks Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D8764	2016-12-19 11:02:49 +00:00
Michael Tuexen	3d6fe5d84c	Fix the handling of buffered messages in stream reset deferred handling. Thanks to Eugen-Andrei Gavriloaie for reporting the issue and providing substantial help in nailing down the issue. MFC after: 1 week	2016-12-17 22:31:30 +00:00
Hiren Panchasara	b6ff672460	We currently don't do TSO if ip options are present. In case of IPv6, we look at in6p_options to check that. That is incorrect as we carry ip options in in6p_outputopts. Also, just checking for in6p_outputopts being NULL won't suffice as we combine ip options and ip header fields both in that one field. The commit fixes this by using ip6_optlen() which correctly calculates length of only ip options for IPv6. Reviewed by: ae, bz MFC after: 3 weeks Sponsored by: Limelight Networks	2016-12-11 23:14:47 +00:00
Michael Tuexen	8b9c95f4a9	Ensure that the reported ppid and tsn are taken from the first fragment. This fixes a bug where the wrong ppid was reported, if * I-DATA was used on the first fragement was not received first * DATA was used and different ppids where used. Thanks to Julian Cordes for making me aware of the issue. MFC after: 1 week	2016-12-11 13:26:35 +00:00
Gleb Smirnoff	8c70a35334	Fix build for 32-bit machines. Submitted by: tuexen	2016-12-09 20:50:35 +00:00
Gleb Smirnoff	3cbee8caa1	Use counter_ratecheck() in the ICMP rate limiting. Together with: rrs, jtl	2016-12-09 17:59:15 +00:00
Michael Tuexen	ebecdad811	Don't bundle a SACK chunk with a SHUTDOWN chunk if it is not required. MFC after: 1 week	2016-12-09 17:58:07 +00:00
Michael Tuexen	8d0a31e19c	Don't send multiple SHUTDOWN chunks in a single packet. Thanks to Felix Weinrank for making me aware of this issue. MFC after: 1 week	2016-12-09 17:57:17 +00:00
Michael Tuexen	b594081bdf	Silence a warning produced by newer versions of gcc. MFC after: 1 week	2016-12-07 22:01:09 +00:00
Michael Tuexen	49656eefc8	Cleanup the names of SSN, SID, TSN, FSN, PPID and MID. This made a couple of bugs visible in handling SSN wrap-arounds when using DATA chunks. Now bulk transfer seems to work fine... This fixes the issue reported in https://github.com/sctplab/usrsctp/issues/111 MFC after: 1 week	2016-12-07 19:30:59 +00:00
Michael Tuexen	5b495f17a5	Whitespace changes. The tools using to generate the sources has been updated and produces different whitespaces. Commit this seperately to avoid intermixing these with real code changes. MFC after: 3 days	2016-12-06 10:21:25 +00:00
Michael Tuexen	4ddd5aadea	Fix the handling of TCP FIN-segments in the CLOSED state When a TCP segment with the FIN bit set was received in the CLOSED state, a TCP RST-ACK-segment is sent. When computing SEG.ACK for this, the FIN counts as one byte. This accounting was missing and is fixed by this patch. Reviewed by: hiren MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://svn.freebsd.org/base/head	2016-12-02 08:02:31 +00:00
Andrey V. Elsukov	dc9d21f8b0	Rework ip_tryforward() to use FIB4 KPI. Tested by: olivier Obtained from: Yandex LLC MFC after: 1 month Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D8526	2016-11-28 17:55:32 +00:00
Hiren Panchasara	2806b2933b	For RTT calculations mid-session, we explicitly ignore ACKs with tsecr of 0 as many borken middle-boxes tend to do that. But during 3whs, in syncache_expand(), we don't do that which causes us to send a RST to such a client. Relax this constraint by only using tsecr to compare against timestamp that we sent when it is not 0. As a result, we'd now accept the final ACK of 3whs with tsecr of 0. Reviewed by: jtl, gnn Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D8552	2016-11-21 20:53:11 +00:00
Michael Tuexen	35dfb8cb68	Ensure that TCP state changes to state-closing are reported via dtrace. This does not cover state changes from TIME-WAIT. Reviewed by: gnn MFC after: 3 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D8443	2016-11-19 14:45:08 +00:00
Michael Tuexen	6779a1a101	Notify the use via setting errno when a TCP RST segment is received either in the CLOSING or LAST-ACK state. Reviewed by: hiren MFC after: 3 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D8371	2016-11-17 08:15:02 +00:00
Andrey V. Elsukov	8432fa5fd9	Initialize ip6 pointer before use. PR: 214169 MFC after: 1 week	2016-11-06 02:33:04 +00:00
Hiren Panchasara	e04310d59b	Set slow start threshold more accurately on loss to be flightsize/2 instead of cwnd/2 as recommended by RFC5681. (spotted by mmacy at nextbsd dot org) Restore pre-r307901 behavior of aligning ssthresh/cwnd on mss boundary. (spotted by slawa at zxy dot spb dot ru) Tested by: dim, Slawa <slawa at zxy dot spb dot ru> MFC after: 1 month X-MFC with: r307901 Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D8349	2016-11-01 21:08:37 +00:00
Julien Charbon	f1ee30ccd6	Remove an extraneous call to soisconnected() in syncache_socket(), introduced with r261242. The useful and expected soisconnected() call is done in tcp_do_segment(). Has been found as part of unrelated PR:212920 investigation. Improve slightly (~2%) the maximum number of TCP accept per second. Tested by: kevin.bowling_kev009.com, jch Approved by: gnn, hiren MFC after: 1 week Sponsored by: Verisign, Inc Differential Revision: https://reviews.freebsd.org/D8072	2016-10-26 15:19:18 +00:00
Hiren Panchasara	4e7f755377	FreeBSD tcp stack used to inform respective congestion control module about the loss event but not use or obay the recommendations i.e. values set by it in some cases. Here is an attempt to solve that confusion by following relevant RFCs/drafts. Stack only sets congestion window/slow start threshold values when there is no CC module availalbe to take that action. All CC modules are inspected and updated when needed to take appropriate action on loss. tcp_stacks/fastpath module has been updated to adapt these changes. Note: Probably, the most significant change would be to not bring congestion window down to 1MSS on a loss signaled by 3-duplicate acks and letting respective CC decide that value. In collaboration with: Matt Macy <mmacy at nextbsd dot org> Discussed on: transport@ mailing list Reviewed by: jtl MFC after: 1 month Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D8225	2016-10-25 05:45:47 +00:00
Hiren Panchasara	dd13b7d387	Undo r307899. It needs a bit more work and proper commit log.	2016-10-25 05:07:51 +00:00
Hiren Panchasara	95d8236011	In Collaboration with: Matt Macy <mmacy at nextbsd dot com> Reviewed by: jtl Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D8225	2016-10-25 05:03:33 +00:00
Ryan Stone	6c1bd55875	Fix ip_output() on point-to-point links In r304435, ip_output() was changed to use the result of the route lookup to decide whether the outgoing packet was a broadcast or not. This introduced a regression on interfaces where IFF_BROADCAST was not set (e.g. point-to-point links), as the algorithm could incorrectly treat the destination address as a broadcast address, and ip_output() would subsequently drop the packet as broadcasting on a non-IFF_BROADCAST interface is not allowed. Differential Revision: https://reviews.freebsd.org/D8303 Reviewed by: jtl Reported by: ambrisko MFC after: 2 weeks X-MFC-With: r304435 Sponsored by: Dell EMC Isilon	2016-10-24 22:11:33 +00:00
Michael Tuexen	38d3251c3d	No functional changes, mostly getting the whitespace changes resulting from an updated formatting tool chain. MFC after: 1 month	2016-10-22 17:21:21 +00:00
Michael Tuexen	3e1465754f	Make ICMPv6 hard error handling for TCP consistent with the ICMPv4 handling. Ensure that: * Protocol unreachable errors are handled by indicating ECONNREFUSED to the TCP user for both IPv4 and IPv6. These were ignored for IPv6. * Communication prohibited errors are handled by indicating ECONNREFUSED to the TCP user for both IPv4 and IPv6. These were ignored for IPv6. * Hop Limited exceeded errors are handled by indicating EHOSTUNREACH to the TCP user for both IPv4 and IPv6. For IPv6 the TCP connected was dropped but errno wasn't set. Reviewed by: gallatin, rrs MFC after: 1 month Sponsored by: Netflix Differential Revision: 7904	2016-10-21 10:32:57 +00:00
Julien Charbon	f5cf1e5f5a	Fix a double-free when an inp transitions to INP_TIMEWAIT state after having been dropped. This fixes enforces in_pcbdrop() logic in tcp_input(): "in_pcbdrop() is used by TCP to mark an inpcb as unused and avoid future packet delivery or event notification when a socket remains open but TCP has closed." PR: 203175 Reported by: Palle Girgensohn, Slawa Olhovchenkov Tested by: Slawa Olhovchenkov Reviewed by: Slawa Olhovchenkov Approved by: gnn, Slawa Olhovchenkov Differential Revision: https://reviews.freebsd.org/D8211 MFC after: 1 week Sponsored by: Verisign, inc	2016-10-18 07:16:49 +00:00
Hiren Panchasara	784ce8fad2	Make sure tcp_mss() has the same check as tcp_mss_update() to have t_maxseg set to at least 64. This is still just a coverup to avoid kernel panic and not an actual fix. PR: 213232 Reviewed by: glebius MFC after: 1 week Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D8272	2016-10-18 02:40:25 +00:00
Patrick Kelsey	09c305eb65	Fix cases where the TFO pending counter would leak references, and eventually, memory. Also renamed some tfo labels and added/reworked comments for clarity. Based on an initial patch from jtl. PR: 213424 Reviewed by: jtl MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D8235	2016-10-15 01:41:28 +00:00
Jonathan T. Looney	82676a28eb	r307082 added the TCP_HHOOK kernel option and made some existing code only compile when that option is configured. In tcp_destroy(), the error variable is now only used in code enclosed in an '#ifdef TCP_HHOOK' block. This broke the build for VNET images. Enclose the error variable itself in an #ifdef block. Submitted by: Shawn Webb <shawn.webb at hardenedbsd.org> Reported by: Shawn Webb <shawn.webb at hardenedbsd.org> PointyHat to: jtl	2016-10-15 00:29:15 +00:00
Jonathan T. Looney	6d172f58a2	The code currently resets the keepalive timer each time a packet is received on a TCP session that has entered the ESTABLISHED state. This results in a lot of calls to reset the keepalive timer. This patch changes the behavior so we set the keepalive timer for the keepalive idle time (TP_KEEPIDLE). When the keepalive timer fires, it will first check to see if the session has been idle for TP_KEEPIDLE ticks. If not, it will reschedule the keepalive timer for the time the session will have been idle for TP_KEEPIDLE ticks. For a session with regular communication, the keepalive timer should fire approximately once every TP_KEEPIDLE ticks. For sessions with irregular communication, the keepalive timer might fire more often. But, the disruption from a periodic keepalive timer should be less than the regular cost of resetting the keepalive timer on every packet. (FWIW, this change saved approximately 1.73% of the busy CPU cycles on a particular test system with a heavy TCP output load. Of course, the actual impact is very specific to the particular hardware and workload.) Reviewed by: gallatin, rrs MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D8243	2016-10-14 14:57:43 +00:00
Gleb Smirnoff	cc94f0c2d7	- Revert r300854, r303657 which tried to fix regression from r297225. - Fix the regression proper way using RO_RTFREE(). Submitted by: ae	2016-10-13 20:15:47 +00:00
Gleb Smirnoff	ec7bbf1f79	With build without TCP_HHOOK and with INVARIANTS. Before mutex.h came via sys/hhook.h -> sys/rmlock.h -> sys/mutex.h.	2016-10-13 18:02:29 +00:00
Michael Tuexen	859422cc12	Mark the socket as un-writable when it is 1-to-1 and the SCTP association is freed. MFC after: 1 month	2016-10-13 13:53:01 +00:00
Michael Tuexen	4c7fb0cf6e	Whitespace changes. MFC after: 1 month	2016-10-13 13:38:14 +00:00

1 2 3 4 5 ...

5689 Commits