freebsd-skq

Author	SHA1	Message	Date
Michael Tuexen	47384244f9	Fix an issue I introuced in r367530: tcp_twcheck() can be called with to == NULL for SYN segments. So don't assume tp != NULL. Thanks to jhb@ for reporting and suggesting a fix. PR: 250499 MFC after: 1 week XMFC-with: r367530 Sponsored by: Netflix, Inc.	2020-11-20 13:00:28 +00:00
Michael Tuexen	283c76c7c3	RFC 7323 specifies that: * TCP segments without timestamps should be dropped when support for the timestamp option has been negotiated. * TCP segments with timestamps should be processed normally if support for the timestamp option has not been negotiated. This patch enforces the above. PR: 250499 Reviewed by: gnn, rrs MFC after: 1 week Sponsored by: Netflix, Inc Differential Revision: https://reviews.freebsd.org/D27148	2020-11-09 21:49:40 +00:00
Pawel Biernacki	7029da5c36	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT Approved by: kib (mentor, blanket) Commented by: kib, gallatin, melifaro Differential Revision: https://reviews.freebsd.org/D23718	2020-02-26 14:26:36 +00:00
Gleb Smirnoff	aed553598d	Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER() in TCP timewait manipulation leaf functions.	2019-11-07 21:29:38 +00:00
Gleb Smirnoff	a81e3ecf46	Since pfslowtimo() runs in the network epoch, tcp_slowtimo() also does. This allows to simplify tcp_tw_2msl_scan() and always require the network epoch in it.	2019-11-07 21:28:46 +00:00
Michael Tuexen	e82fdca156	Fix a byte ordering issue for the advertised receiver window in ACK segments sent in TIMEWAIT state, which I introduced in r336937. MFC after: 3 days Sponsored by: Netflix, Inc.	2019-02-15 09:45:17 +00:00
Michael Tuexen	e2662978b8	Send consistent SEG.WIN when using timewait codepath for TCP. When sending TCP segments from the timewait code path, a stored value of the last sent window is used. Use the same code for computing this in the timewait code path as in the main code path used in tcp_output() to avoiv inconsistencies. Reviewed by: rrs@ MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16503	2018-07-30 21:13:42 +00:00
Michael Tuexen	6138da62a9	Add missing send/recv dtrace probes for TCP. These missing probe are mostly in the syncache and timewait code. Reviewed by: markj@, rrs@ MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16369	2018-07-30 20:13:38 +00:00
Andrew Turner	5f901c92a8	Use the new VNET_DEFINE_STATIC macro when we are defining static VNET variables. Reviewed by: bz Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D16147	2018-07-24 16:35:52 +00:00
Matt Macy	6573d7580b	epoch(9): allow preemptible epochs to compose - Add tracker argument to preemptible epochs - Inline epoch read path in kernel and tied modules - Change in_epoch to take an epoch as argument - Simplify tfb_tcp_do_segment to not take a ti_locked argument, there's no longer any benefit to dropping the pcbinfo lock and trying to do so just adds an error prone branchfest to these functions - Remove cases of same function recursion on the epoch as recursing is no longer free. - Remove the the TAILQ_ENTRY and epoch_section from struct thread as the tracker field is now stack or heap allocated as appropriate. Tested by: pho and Limelight Networks Reviewed by: kbowling at llnw dot com Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16066	2018-07-04 02:47:16 +00:00
Matt Macy	9e58ff6ff9	convert inpcbinfo hash and info rwlocks to epoch + mutex - Convert inpcbinfo info & hash locks to epoch for read and mutex for write - Garbage collect code that handled INP_INFO_TRY_RLOCK failures as INP_INFO_RLOCK which can no longer fail When running 64 netperfs sending minimal sized packets on a 2x8x2 reduces unhalted core cycles samples in rwlock rlock/runlock in udp_send from 51% to 3%. Overall packet throughput rate limited by CPU affinity and NIC driver design choices. On the receiver unhalted core cycles samples in in_pcblookup_hash went from 13% to to 1.6% Tested by LLNW and pho@ Reviewed by: jtl Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15686	2018-06-19 01:54:00 +00:00
Matt Macy	f6960e207e	netinet silence warnings	2018-05-19 05:56:21 +00:00
Gleb Smirnoff	3108a71a34	Fix LINT-NOINET build initializing local to false. This is a dead code, since for NOINET build isipv6 is always true, but this dead code makes it compilable. Reported by: rpokala	2018-03-22 05:07:57 +00:00
Gleb Smirnoff	dd388cfd9b	The net.inet.tcp.nolocaltimewait=1 optimization prevents local TCP connections from entering the TIME_WAIT state. However, it omits sending the ACK for the FIN, which results in RST. This becomes a bigger deal if the sysctl net.inet.tcp.blackhole is 2. In this case RST isn't send, so the other side of the connection (also local) keeps retransmitting FINs. To fix that in tcp_twstart() we will not call tcp_close() immediately. Instead we will allocate a tcptw on stack and proceed to the end of the function all the way to tcp_twrespond(), to generate the correct ACK, then we will drop the last PCB reference. While here, make a few tiny improvements: - use bools for boolean variable - staticize nolocaltimewait - remove pointless acquisiton of socket lock Reported by: jtl Reviewed by: jtl Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D14697	2018-03-21 20:59:30 +00:00
Pedro F. Giffuni	51369649b0	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.	2017-11-20 19:43:44 +00:00
Julien Charbon	5fcd2d9bfc	Forgotten bits in r324179: Include sys/syslog.h if INVARIANTS is not defined MFC after: 1 week X-MFC with: r324179 Pointy hat to: jch	2017-10-02 09:45:17 +00:00
Julien Charbon	dfa1f80ce9	Fix an infinite loop in tcp_tw_2msl_scan() when an INP_TIMEWAIT inp has been destroyed before its tcptw with INVARIANTS undefined. This is a symmetric change of r307551: A INP_TIMEWAIT inp should not be destroyed before its tcptw, and INVARIANTS will catch this case. If INVARIANTS is undefined it will emit a log(LOG_ERR) and avoid a hard to debug infinite loop in tcp_tw_2msl_scan(). Reported by: Ben Rubson, hselasky Submitted by: hselasky Tested by: Ben Rubson, jch MFC after: 1 week Sponsored by: Verisign, inc Differential Revision: https://reviews.freebsd.org/D12267	2017-10-01 21:20:28 +00:00
Gleb Smirnoff	779f106aa1	Listening sockets improvements. o Separate fields of struct socket that belong to listening from fields that belong to normal dataflow, and unionize them. This shrinks the structure a bit. - Take out selinfo's from the socket buffers into the socket. The first reason is to support braindamaged scenario when a socket is added to kevent(2) and then listen(2) is cast on it. The second reason is that there is future plan to make socket buffers pluggable, so that for a dataflow socket a socket buffer can be changed, and in this case we also want to keep same selinfos through the lifetime of a socket. - Remove struct struct so_accf. Since now listening stuff no longer affects struct socket size, just move its fields into listening part of the union. - Provide sol_upcall field and enforce that so_upcall_set() may be called only on a dataflow socket, which has buffers, and for listening sockets provide solisten_upcall_set(). o Remove ACCEPT_LOCK() global. - Add a mutex to socket, to be used instead of socket buffer lock to lock fields of struct socket that don't belong to a socket buffer. - Allow to acquire two socket locks, but the first one must belong to a listening socket. - Make soref()/sorele() to use atomic(9). This allows in some situations to do soref() without owning socket lock. There is place for improvement here, it is possible to make sorele() also to lock optionally. - Most protocols aren't touched by this change, except UNIX local sockets. See below for more information. o Reduce copy-and-paste in kernel modules that accept connections from listening sockets: provide function solisten_dequeue(), and use it in the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4), infiniband, rpc. o UNIX local sockets. - Removal of ACCEPT_LOCK() global uncovered several races in the UNIX local sockets. Most races exist around spawning a new socket, when we are connecting to a local listening socket. To cover them, we need to hold locks on both PCBs when spawning a third one. This means holding them across sonewconn(). This creates a LOR between pcb locks and unp_list_lock. - To fix the new LOR, abandon the global unp_list_lock in favor of global unp_link_lock. Indeed, separating these two locks didn't provide us any extra parralelism in the UNIX sockets. - Now call into uipc_attach() may happen with unp_link_lock hold if, we are accepting, or without unp_link_lock in case if we are just creating a socket. - Another problem in UNIX sockets is that uipc_close() basicly did nothing for a listening socket. The vnode remained opened for connections. This is fixed by removing vnode in uipc_close(). Maybe the right way would be to do it for all sockets (not only listening), simply move the vnode teardown from uipc_detach() to uipc_close()? Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D9770	2017-06-08 21:30:34 +00:00
Warner Losh	fbbd9655e5	Renumber copyright clause 4 Renumber cluase 4 to 3, per what everybody else did when BSD granted them permission to remove clause 3. My insistance on keeping the same numbering for legal reasons is too pedantic, so give up on that point. Submitted by: Jan Schaumann <jschauma@stevens.edu> Pull Request: https://github.com/freebsd/freebsd/pull/96	2017-02-28 23:42:47 +00:00
Michael Tuexen	35dfb8cb68	Ensure that TCP state changes to state-closing are reported via dtrace. This does not cover state changes from TIME-WAIT. Reviewed by: gnn MFC after: 3 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D8443	2016-11-19 14:45:08 +00:00
Julien Charbon	f5cf1e5f5a	Fix a double-free when an inp transitions to INP_TIMEWAIT state after having been dropped. This fixes enforces in_pcbdrop() logic in tcp_input(): "in_pcbdrop() is used by TCP to mark an inpcb as unused and avoid future packet delivery or event notification when a socket remains open but TCP has closed." PR: 203175 Reported by: Palle Girgensohn, Slawa Olhovchenkov Tested by: Slawa Olhovchenkov Reviewed by: Slawa Olhovchenkov Approved by: gnn, Slawa Olhovchenkov Differential Revision: https://reviews.freebsd.org/D8211 MFC after: 1 week Sponsored by: Verisign, inc	2016-10-18 07:16:49 +00:00
Bjoern A. Zeeb	252a46f324	No longer mark TCP TW zone NO_FREE. Timewait code does a proper cleanup after itself. Reviewed by: gnn Approved by: re (gjb) Obtained from: projects/vnet MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D6922	2016-06-23 00:32:58 +00:00
Gleb Smirnoff	bf840a1707	Redo r294869. The array of counters for TCP states doesn't belong to struct tcpstat, because the structure can be zeroed out by netstat(1) -z, and of course running connection counts shouldn't be touched. Place running connection counts into separate array, and provide separate read-only sysctl oid for it.	2016-03-15 00:15:10 +00:00
Gleb Smirnoff	57a78e3bae	Augment struct tcpstat with tcps_states[], which is used for book-keeping the amount of TCP connections by state. Provides a cheap way to get connection count without traversing the whole pcb list. Sponsored by: Netflix	2016-01-27 00:45:46 +00:00
Julien Charbon	ff9b006d61	Decompose TCP INP_INFO lock to increase short-lived TCP connections scalability: - The existing TCP INP_INFO lock continues to protect the global inpcb list stability during full list traversal (e.g. tcp_pcblist()). - A new INP_LIST lock protects inpcb list actual modifications (inp allocation and free) and inpcb global counters. It allows to use TCP INP_INFO_RLOCK lock in critical paths (e.g. tcp_input()) and INP_INFO_WLOCK only in occasional operations that walk all connections. PR: 183659 Differential Revision: https://reviews.freebsd.org/D2599 Reviewed by: jhb, adrian Tested by: adrian, nitroboost-gmail.com Sponsored by: Verisign, Inc.	2015-08-03 12:13:54 +00:00
George V. Neville-Neil	f00543eeab	Add a state transition call to show that we have entered TIME_WAIT. Although this is not important to the rest of the TCP processing it is a conveneint way to make the DTrace state-transition probe catch this important state change. MFC after: 1 week	2015-05-01 12:49:03 +00:00
Gleb Smirnoff	6df8a71067	Remove SYSCTL_VNET_* macros, and simply put CTLFLAG_VNET where needed. Sponsored by: Nginx, Inc.	2014-11-07 09:39:05 +00:00
Julien Charbon	cea40c4888	Fix a race condition in TCP timewait between tcp_tw_2msl_reuse() and tcp_tw_2msl_scan(). This race condition drives unplanned timewait timeout cancellation. Also simplify implementation by holding inpcb reference and removing tcptw reference counting. Differential Revision: https://reviews.freebsd.org/D826 Submitted by: Marc De la Gueronniere <mdelagueronniere@verisign.com> Submitted by: jch Reviewed By: jhb (mentor), adrian, rwatson Sponsored by: Verisign, Inc. MFC after: 2 weeks X-MFC-With: r264321	2014-10-30 08:53:56 +00:00
Hiren Panchasara	76504ce978	Add a comment for easier code understanding.	2014-08-04 19:42:48 +00:00
Bjoern A. Zeeb	700515aa62	While PAWS is disabled, there are no consumers for the tcp options argument to tcp_twcheck(); thus mark it __unused. MFC after: 2 weeks	2014-05-30 22:34:06 +00:00
Bjoern A. Zeeb	4fd2b4eb53	Make tcp_twrespond() file local private; this removes it from the public KPI; it is not used anywhere else and seems it never was. MFC after: 2 weeks	2014-05-24 14:01:18 +00:00
Mike Silbersack	f1395664e5	Remove the function tcp_twrecycleable; it has been #if 0'd for eight years. The original concept was to improve the corner case where you run out of ephemeral ports, but it was causing performance problems and the mechanism of limiting the number of time_wait sockets serves the same purpose in the end. Reviewed by: bz	2014-05-16 01:38:38 +00:00
John Baldwin	b8c8c8c3c7	Some whitespace and style fixes. Submitted by: bde	2014-04-11 21:00:59 +00:00
John Baldwin	2ffb755cec	The tw_pcbrele() function does not need the global timewait lock. Submitted by: Julien Charbon Suggested by: glebius	2014-04-11 19:17:45 +00:00
John Baldwin	9941de49ad	Don't leak the TCP pcbinfo lock if a time wait connection is closed in between grabbing a reference on the connection structure and obtaining the pcbinfo lock. Reviewed by: Julien Charbon	2014-04-11 13:11:43 +00:00
John Baldwin	66eefb1eae	Currently, the TCP slow timer can starve TCP input processing while it walks the list of connections in TIME_WAIT closing expired connections due to contention on the global TCP pcbinfo lock. To remediate, introduce a new global lock to protect the list of connections in TIME_WAIT. Only acquire the TCP pcbinfo lock when closing an expired connection. This limits the window of time when TCP input processing is stopped to the amount of time needed to close a single connection. Submitted by: Julien Charbon <jcharbon@verisign.com> Reviewed by: rwatson, rrs, adrian MFC after: 2 months	2014-04-10 18:15:35 +00:00
Gleb Smirnoff	76039bc84f	The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare to this event, adding if_var.h to files that do need it. Also, include all includes that now are included due to implicit pollution via if_var.h Sponsored by: Netflix Sponsored by: Nginx, Inc.	2013-10-26 17:58:36 +00:00
Gleb Smirnoff	eb1b1807af	Mechanically substitute flags from historic mbuf allocator with malloc(9) flags within sys. Exceptions: - sys/contrib not touched - sys/mbuf.h edited manually	2012-12-05 08:04:20 +00:00
Roman Divacky	8252626fb4	Initialize hdrlen to 0 to avoid clang warning in NOINET case.	2012-11-10 10:41:00 +00:00
Gleb Smirnoff	8f134647ca	Switch the entire IPv4 stack to keep the IP packet header in network byte order. Any host byte order processing is done in local variables and host byte order values are never[1] written to a packet. After this change a packet processed by the stack isn't modified at all[2] except for TTL. After this change a network stack hacker doesn't need to scratch his head trying to figure out what is the byte order at the given place in the stack. [1] One exception still remains. The raw sockets convert host byte order before pass a packet to an application. Probably this would remain for ages for compatibility. [2] The ip_input() still subtructs header len from ip->ip_len, but this is planned to be fixed soon. Reviewed by: luigi, Maxim Dounin <mdounin mdounin.ru> Tested by: ray, Olivier Cochard-Labbe <olivier cochard.me>	2012-10-22 21:09:03 +00:00
Bjoern A. Zeeb	356ab07e2d	It turns out that too many drivers are not only parsing the L2/3/4 headers for TSO but also for generic checksum offloading. Ideally we would only have one common function shared amongst all drivers, and perhaps when updating them for IPv6 we should introduce that. Eventually we should provide the meta information along with mbufs to avoid (re-)parsing entirely. To not break IPv6 (checksums and offload) and to be able to MFC the changes without risking to hurt 3rd party drivers, duplicate the v4 framework, as other OSes have done as well. Introduce interface capability flags for TX/RX checksum offload with IPv6, to allow independent toggling (where possible). Add CSUM_*_IPV6 flags for UDP/TCP over IPv6, and reserve further for SCTP, and IPv6 fragmentation. Define CSUM_DELAY_DATA_IPV6 as we do for legacy IP and add an alias for CSUM_DATA_VALID_IPV6. This pretty much brings IPv6 handling in line with IPv4. TSO is still handled in a different way and not via if_hwassist. Update ifconfig to allow (un)setting of the new capability flags. Update loopback to announce the new capabilities and if_hwassist flags. Individual driver updates will have to follow, as will SCTP. Reported by: gallatin, dim, .. Reviewed by: gallatin (glanced at?) MFC after: 3 days X-MFC with: r235961,235959,235958	2012-05-28 09:30:13 +00:00
Bjoern A. Zeeb	45747ba53c	MFp4 bz_ipv6_fast: Add code to handle pre-checked TCP checksums as indicated by mbuf flags to save the entire computation for validation if not needed. In the IPv6 TCP output path only compute the pseudo-header checksum, set the checksum offset in the mbuf field along the appropriate flag as done in IPv4. In tcp_respond() just initialize the IPv6 payload length to 0 as ip6_output() will properly set it. Sponsored by: The FreeBSD Foundation Sponsored by: iXsystems Reviewed by: gnn (as part of the whole) MFC After: 3 days	2012-05-25 02:23:26 +00:00
Bjoern A. Zeeb	d8951c8a2f	Fix PAWS (Protect Against Wrapped Sequence numbers) in cases when hz >> 1000 and thus getting outside the timestamp clock frequenceny of 1ms < x < 1s per tick as mandated by RFC1323, leading to connection resets on idle connections. Always use a granularity of 1ms using getmicrouptime() making all but relevant callouts independent of hz. Use getmicrouptime(), not getmicrotime() as the latter may make a jump possibly breaking TCP nfsroot mounts having our timestamps move forward for more than 24.8 days in a second without having been idle for that long. PR: kern/61404 Reviewed by: jhb, mav, rrs Discussed with: silby, lstewart Sponsored by: Sandvine Incorporated (originally in 2011) MFC after: 6 weeks	2012-02-15 16:09:56 +00:00
John Baldwin	0ff93ca524	Tweak the last fix to match what was actually tested. Pointy hat to: jhb	2012-01-06 12:49:01 +00:00
Sergey Kandaurov	4f7fca1ca1	Fix a typo. X-MFC-with: 229665	2012-01-06 00:23:17 +00:00
John Baldwin	1e96ae8193	Remove the assertion from tcp_input() that rcv_nxt is always greater than or equal to rcv_adv and fix tcp_twstart() to handle this case by assuming the last window was zero rather than a negative value. The code in tcp_input() already safely handled this case. It can happen due to delayed ACKs along with a remote sender that sends data beyond the window we previously advertised. If we have room in our socket buffer for the extra data beyond the advertised window, we will accept it. However, if the ACK for that segment is delayed, then we will not effectively fixup rcv_adv to account for that extra data until the next segment arrives and forces out an ACK. When that next segment arrives, rcv_nxt will be beyond rcv_adv. Tested by: pjd MFC after: 1 week	2012-01-05 22:29:11 +00:00
John Baldwin	5891ebd6cd	Oops, fix order of sequence numbers in KASSERT()'s to catch negative receive windows to match the labels in the panic message. Submitted by: trociny	2011-05-14 14:41:40 +00:00
John Baldwin	f701e30d7f	Handle a rare edge case with nearly full TCP receive buffers. If a TCP buffer fills up causing the remote sender to enter into persist mode, but there is still room available in the receive buffer when a window probe arrives (either due to window scaling, or due to the local application very slowing draining data from the receive buffer), then the single byte of data in the window probe is accepted. However, this can cause rcv_nxt to be greater than rcv_adv. This condition will only last until the next ACK packet is pushed out via tcp_output(), and since the previous ACK advertised a zero window, the ACK should be pushed out while the TCP pcb is write-locked. During the window while rcv_nxt is greather than rcv_adv, a few places would compute the remaining receive window via rcv_adv - rcv_nxt. However, this value was then (uint32_t)-1. On a 64 bit machine this could expand to a positive 2^32 - 1 when cast to a long. In particular, when calculating the receive window in tcp_output(), the result would be that the receive window was computed as 2^32 - 1 resulting in advertising a far larger window to the remote peer than actually existed. Fix various places that compute the remaining receive window to either assert that it is not negative (i.e. rcv_nxt <= rcv_adv), or treat the window as full if rcv_nxt is greather than rcv_adv. Reviewed by: bz MFC after: 1 month	2011-05-02 21:05:52 +00:00
Bjoern A. Zeeb	b287c6c70c	Make the TCP code compile without INET. Sort #includes and add #ifdef INETs. Add some comments at #endifs given more nestedness. To make the compiler happy, some default initializations were added in accordance with the style on the files. Reviewed by: gnn Sponsored by: The FreeBSD Foundation Sponsored by: iXsystems MFC after: 4 days	2011-04-30 11:21:29 +00:00
Rebecca Cran	6bccea7c2b	Fix typos - remove duplicate "the". PR: bin/154928 Submitted by: Eitan Adler <lists at eitanadler.com> MFC after: 3 days	2011-02-21 09:01:34 +00:00

1 2 3 4 5 ...

359 Commits