freebsd-skq

Author	SHA1	Message	Date
trasz	6c99c1c64d	Make use of the stats(3) framework in the TCP stack. This makes it possible to retrieve per-connection statistical information such as the receive window size, RTT, or goodput, using a newly added TCP_STATS getsockopt(3) option, and extract them using the stats_voistat_fetch(3) API. See the net/tcprtt port for an example consumer of this API. Compared to the existing TCP_INFO system, the main differences are that this mechanism is easy to extend without breaking ABI, and provides statistical information instead of raw "snapshots" of values at a given point in time. stats(3) is more generic and can be used in both userland and the kernel. Reviewed by: thj Tested by: thj Obtained from: Netflix Relnotes: yes Sponsored by: Klara Inc, Netflix Differential Revision: https://reviews.freebsd.org/D20655	2019-12-02 20:58:04 +00:00
tuexen	0bf6180fd4	Move all ECN related flags from the flags to the flags2 field. This allows adding more ECN related flags in the future. No functional change intended. Submitted by: Richard Scheffenegger Reviewed by: rrs@, tuexen@ Differential Revision: https://reviews.freebsd.org/D22497	2019-12-01 21:01:33 +00:00
tuexen	b176c8c5c1	Make the TF_* flags easier readable by humans by adding leading zeroes to make them aligned. Submitted by: Richard Scheffenegger Reviewed by: rgrimes@, rrs@, tuexen@ Differential Revision: https://reviews.freebsd.org/D22428	2019-12-01 20:45:48 +00:00
tuexen	b352be6907	This is an initial step in implementing the new congestion window validation as specified in RFC 7661. Submitted by: Richard Scheffenegger Reviewed by: rrs@, tuexen@ Differential Revision: https://reviews.freebsd.org/D21798	2019-12-01 20:35:41 +00:00
tuexen	7ceb3af1dd	Make the IPTOS value available to all substate handlers. This will allow adding support for L4S or SCE, which require processing of the IP TOS field. Submitted by: Richard Scheffenegger Reviewed by: rgrimes@, rrs@, tuexen@ Differential Revision: https://reviews.freebsd.org/D22426	2019-12-01 18:47:53 +00:00
tuexen	9ee3bd429d	In order for the TCP Handshake to support ECN++, and further ECN-related improvements, the ECN bits need to be exposed to the TCP SYNcache. This change is a minimal modification to the function headers, without any functional change intended. Submitted by: Richard Scheffenegger Reviewed by: rgrimes@, rrs@, tuexen@ Differential Revision: https://reviews.freebsd.org/D22436	2019-12-01 18:05:02 +00:00
tuexen	bca0a73e05	When changing the MTU of an SCTP path, not only cancel all ongoing RTT measurements, but also schedule new ones for the future. Submitted by: Julius Flohr MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D22547	2019-12-01 17:35:36 +00:00
bz	c14bf147f4	Fix m_pullup() problem after removing PULLDOWN_TESTs and KAME EXT_*macros. r354748-354750 replaced the KAME macros with m_pulldown() calls. Contrary to the rest of the network stack m_len checks before m_pulldown() were not put in place (see r354748). Put these m_len checks in place for now (to go along with the style of the network stack since the initial commits). These are not put in for performance but to avoid an error scenario (even though it also will help performance at the moment as it avoids allocating an extra mbuf; not because of the unconditional function call). The observed error case went like this: (1) an mbuf with M_EXT arrives and we call m_pullup() unconditionally on it. (2) m_pullup() will call m_get() unless the requested length is larger than MHLEN (in which case it'll m_freem() the perfectly fine mbuf) and migrate the requested length of data and pkthdr into the new mbuf. (3) If m_get() succeeds, a further m_pullup() call going over MHLEN will fail. This was observed with failing auto-configuration as an RA packet of 200 bytes exceeded MHLEN and the m_pullup() called from nd6_ra_input() dropped the mbuf. (Re-)adding the m_len checks before m_pullup() calls avoids these problems with mbufs using external storage for now. MFC after: 3 weeks Sponsored by: Netflix	2019-12-01 00:22:04 +00:00
tuexen	ab6fc6c3e7	Really ignore the SCTP association identifier on 1-to-1 style sockets as required by the socket API specification. Thanks to Inaki Baz Castillo, who found this bug running the userland stack with valgrind and reported the issue in https://github.com/sctplab/usrsctp/issues/408 MFC after: 1 week	2019-11-28 12:50:25 +00:00
tuexen	bc0b5d2480	Plug two mbuf leaks during INIT-ACK handling. One leak happens when there is not enough memory to allocate the resources for streams. The other leak happens if there are unknown parameters in the received INIT-ACK chunk which require reporting and the INIT-ACK requires sending an ABORT due to illegal parameter combinations. Hopefully this fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=19083 MFC after: 1 week	2019-11-27 19:32:29 +00:00
rlibby	6c18602692	in_mcast.c: need if_addr_lock around inm_release_deferred Apply a similar fix as for in6_mcast.c. Reviewed by: hselasky Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D20740	2019-11-25 22:25:34 +00:00
bz	e4422fa90a	Reduce the vnet_set module size of ip_mroute to allow loading as a module. With VIMAGE kernels modules get special treatment as they need to also keep the original values and make copies for each instance. For that a few pages of vnet modspace are provided and the kernel-linker and the VNET framework know how to deal with things. When the modspace is (almost) full, other modules which would overflow the modspace cannot be loaded and kldload will fail. ip_mroute uses a lot of variable space, mostly by four big arrays: set_vnet 0000000000000510 vnet_entry_multicast_register_if set_vnet 0000000000000700 vnet_entry_viftable set_vnet 0000000000002000 vnet_entry_bw_meter_timers set_vnet 0000000000002800 vnet_entry_bw_upcalls Dynamically malloc the three big ones for each instance we need and free them again on vnet teardown (the 4th is an ifnet). That way they only need module space for a single pointer and allow a lot more modules using virtualized variables to be loaded on a VNET kernel. PR: 206583 Reviewed by: hselasky, kp MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D22443	2019-11-19 15:38:55 +00:00
tuexen	198ac30a9e	Add boundary and overflow checks to the formulas used in the TCP CUBIC congestion control module. Submitted by: Richard Scheffenegger Reviewed by: rgrimes@ Differential Revision: https://reviews.freebsd.org/D19118	2019-11-16 12:00:22 +00:00
tuexen	a132717c05	Improve TCP CUBIC specific after idle reaction. The adjustments are inspired by the Linux stack, which has had a functionally equivalent implementation for more than a decade now. Submitted by: Richard Scheffenegger Reviewed by: Cheng Cui Differential Revision: https://reviews.freebsd.org/D18982	2019-11-16 11:57:12 +00:00
tuexen	18a1c72f35	Implement a TCP CUBIC-specific after idle reaction. This patch addresses a very common case of frequent application stalls, where TCP runs idle and loses the state of the network. Submitted by: Richard Scheffenegger Reviewed by: Cheng Cui Differential Revision: https://reviews.freebsd.org/D18954	2019-11-16 11:37:26 +00:00
tuexen	a8e5b7d5f2	Revert https://svnweb.freebsd.org/changeset/base/354708 I used the wrong Differential Revision, so back it out and do it right in a follow-up commit.	2019-11-16 11:10:09 +00:00
bz	76f9308e3e	Remove now unused IPv6 macros and update docs. After r354748-354750 all uses of the IP6_EXTHDR_CHECK() and IP6_EXTHDR_GET() macros are gone from the kernel. IP6_EXTHDR_GET0() was unused. Remove the macros and update the documentation. Sponsored by: Netflix	2019-11-15 21:55:41 +00:00
bz	1d8998fe67	IP6_EXTHDR_CHECK(): remove the last instances While r354748 removed almost all IP6_EXTHDR_CHECK() calls, these are not part of the PULLDOWN_TESTS. Equally convert these IP6_EXTHDR_CHECK()s here to m_pullup() and remove the extra check and m_pullup() in tcp_input() under isipv6 given tcp6_input() has done exactly that pullup already. MFC after: 8 weeks Sponsored by: Netflix	2019-11-15 21:51:43 +00:00
bz	1feeff48a5	netinet*: replace IP6_EXTHDR_GET() In a few places we have IP6_EXTHDR_GET() left in upper layer protocols. The IP6_EXTHDR_GET() macro might perform an m_pulldown() in case the data fragment is not contiguous. Convert these last remaining instances into m_pullup()s instead. In CARP, for example, we will a few lines later call m_pullup() anyway, the IPsec code coming from OpenBSD would otherwise have done the m_pullup() and are copying the data a bit later anyway, so pulling it in seems no better or worse. Note: this leaves very few m_pulldown() cases behind in the tree and we might want to consider removing them as well to make mbuf management easier again on a path to variable size mbufs, especially given m_pulldown() still has an issue not re-checking M_WRITEABLE(). Reviewed by: gallatin MFC after: 8 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D22335	2019-11-15 21:44:17 +00:00
tuexen	8eade0cb10	For idle TCP sessions using the CUBIC congestion control, reset ssthresh to the higher of the previous ssthresh or 3/4 of the prior cwnd. Submitted by: Richard Scheffenegger Reviewed by: Cheng Cui Differential Revision: https://reviews.freebsd.org/D18982	2019-11-14 16:28:02 +00:00
bz	4f4dd10c54	netinet: update mp to pass the proper value back In ip6_[direct_]input() we are looping over the extension headers to deal with the next header. We pass a pointer to an mbuf pointer to the handling functions. In certain cases the mbuf can be updated there and we need to pass the new one back. That was missing in dest6_input() and route6_input(). In tcp6_input() we should also update it before we call tcp_input(). In addition to that mark the mbuf NULL all the times when we return that we are done with handling the packet and no next header should be checked (IPPROTO_DONE). This will eventually allow us to assert proper behaviour and catch the above kind of errors more easily, expecting *mp to always be set. This change is extracted from a larger patch and not an exhaustive change across the entire stack yet. PR: 240135 Reported by: prabhakar.lakhera gmail.com MFC after: 3 weeks Sponsored by: Netflix	2019-11-12 15:46:28 +00:00
glebius	92dda61f13	Remove now unused INP_INFO_RLOCK macros.	2019-11-07 22:26:54 +00:00
glebius	b1b7d9cff8	In TCP HPTS enter the epoch in tcp_hpts_thread() and assert it in the leaf functions.	2019-11-07 21:30:27 +00:00
glebius	004daff704	Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER() in TCP timewait manipulation leaf functions.	2019-11-07 21:29:38 +00:00
glebius	5ba7a56186	Since pfslowtimo() runs in the network epoch, tcp_slowtimo() also does. This allows simplifying tcp_tw_2msl_scan() and always requiring the network epoch in it.	2019-11-07 21:28:46 +00:00
glebius	6d3bde7c4a	Now that there is no R/W lock on PCB list the pcblist sysctls handlers can be greatly simplified. All the previous double cycling and complex locking was added to avoid these functions holding global PCB locks for extended period of time, preventing addition of new entries.	2019-11-07 21:27:32 +00:00
glebius	cb8ae442ac	Now that all of the tcp_input() and all its branches are executed in the network epoch, we can greatly simplify synchronization. Remove all unnecessary epoch enters hidden under INP_INFO_RLOCK macro. Remove some unnecessary assertions and convert necessary ones into the NET_EPOCH_ASSERT macro.	2019-11-07 21:23:07 +00:00
glebius	05dccf5517	Remove unnecessary recursive epoch enter via INP_INFO_RLOCK macro in udp_input(). It shall always run in the network epoch.	2019-11-07 21:08:49 +00:00
glebius	e3012914f1	Remove now unused INP_HASH_RLOCK() macros.	2019-11-07 21:03:15 +00:00
glebius	f83a4563fd	Now with epoch synchronized PCB lookup tables we can greatly simplify locking in udp_output() and udp6_output(). First, we select if we need read or write lock in PCB itself, we take the lock and enter network epoch. Then, we proceed for the rest of the function. In case if we need to modify PCB hash, we would take write lock on it for a short piece of code. We could exit the epoch before allocating an mbuf, but with this patch we are keeping it all the way into ip_output()/ip6_output(). Today this creates an epoch recursion, since ip_output() enters epoch itself. However, once all protocols are reviewed, ip_output() and ip6_output() would require epoch instead of entering it. Note: I'm not 100% sure that in udp6_output() the epoch is required. We don't do PCB hash lookup for a bound socket. And all branches of in6_select_src() don't require epoch, at least they lack assertions. Today inet6 address list is protected by rmlock, although it is CKLIST. AFAIU, the future plan is to protect it by network epoch. That would require epoch in in6_select_src(). Anyway, in future ip6_output() would require epoch, udp6_output() would need to enter it.	2019-11-07 21:01:36 +00:00
glebius	deb3c19c55	Add INP_UNLOCK() which will do whatever R/W unlock is required.	2019-11-07 20:57:51 +00:00
glebius	76a6e088e6	Since r353292 on input path we are always in network epoch, when we lookup PCBs. Thus, do not enter epoch recursively in in_pcblookup_hash() and in6_pcblookup_hash(). Same applies to tcp_ctlinput() and tcp6_ctlinput(). This leaves several sysctl(9) handlers that return PCB credentials unprotected. Add epoch enter/exit to all of them. Differential Revision: https://reviews.freebsd.org/D22197	2019-11-07 20:49:56 +00:00
glebius	d7a93f3623	Remove unnecessary recursive epoch enter via INP_INFO_RLOCK macro in divert_packet(). This function is called only from pfil(9) filters, which in their place always run in the network epoch.	2019-11-07 20:44:34 +00:00
glebius	61ee0e91c4	Remove unnecessary recursive epoch enter via INP_INFO_RLOCK macro in raw input functions for IPv4 and IPv6. They shall always run in the network epoch.	2019-11-07 20:40:44 +00:00
bz	be19ea6cb3	netinet*: variable cleanup In preparation for another change factor out various variable cleanups. These mainly include: (1) do not assign values to variables during declaration: this makes the code more readable and does allow for better grouping of variable declarations, (2) do not assign values to variables before need; e.g., if a variable is only used in the 2nd half of a function and we have multiple return paths before that, then do not set it before it is needed, and (3) try to avoid assigning the same value multiple times. MFC after: 3 weeks Sponsored by: Netflix	2019-11-07 18:29:51 +00:00
glebius	a09b9c105d	TCP timers are executed in callout context, so they need to enter network epoch to look into PCB lists. Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER(). No functional change here.	2019-11-07 00:27:23 +00:00
glebius	d85e3111e4	Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER() in TCP functions that are executed in syscall context. No functional change here.	2019-11-07 00:10:14 +00:00
glebius	62dc620e39	Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER(). Remove few outdated comments and extraneous assertions. No functional change here.	2019-11-07 00:08:34 +00:00
bz	d7908b9933	Properly set VNET when nuking recvif from fragment queues. In theory the eventhandler invoke should be in the same VNET as the current interface. We however cannot guarantee that for all cases in the future. So before checking if the fragmentation handling for this VNET is active, switch the VNET to the VNET of the interface to always get the one we want. Reviewed by: hselasky MFC after: 3 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D22153	2019-10-25 18:54:06 +00:00
tuexen	56626fc5ad	Ensure that the flags indicating IPv4/IPv6 are not changed by failing bind() calls. This would lead to inconsistent state resulting in a panic. A fix for stable/11 was committed in https://svnweb.freebsd.org/base?view=revision&revision=338986 An accelerated MFC is planned as discussed with emaste@. Reported by: syzbot+2609a378d89264ff5a42@syzkaller.appspotmail.com Obtained from: jtl@ MFC after: 1 day Sponsored by: Netflix, Inc.	2019-10-24 20:05:10 +00:00
tuexen	ddaae075ce	Store a handle for the event handler. This will be used when unloading the SCTP as a module. Obtained from: markj@	2019-10-24 09:22:23 +00:00
rrs	f651cbcfcc	Fix a small bug in bbr when running under a VM. Basically what happens is we are more delayed in the pacer calling in so we remove the stack from the pacer and recalculate how much time is left after all data has been acknowledged. However the comparison was backwards so we end up with a negative value in the last_pacing_delay time which causes us to add in a huge value to the next pacing time thus stalling the connection. Reported by: vm2.finance@gmail.com	2019-10-24 05:54:30 +00:00
tuexen	71224580cf	Fix compile issues when building a kernel without the VIMAGE option. Thanks to cem@ for discussing the issue which resulted in this patch. Reviewed by: cem@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D22089	2019-10-19 20:48:53 +00:00
cem	abc2745a10	debugnet(4): Add optional full-duplex mode It remains unattached to any client protocol. Netdump is unaffected (remaining half-duplex). The intended consumer is NetGDB. Submitted by: John Reimer <john.reimer AT emc.com> (earlier version) Discussed with: markj Differential Revision: https://reviews.freebsd.org/D21541	2019-10-17 20:25:15 +00:00
cem	f92f351606	debugnet(4): Infer non-server connection parameters Loosen requirements for connecting to debugnet-type servers. Only require a destination address; the rest can theoretically be inferred from the routing table. Relax corresponding constraints in netdump(4) and move ifp validation to debugnet connection time. Submitted by: John Reimer <john.reimer AT emc.com> (earlier version) Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D21482	2019-10-17 20:10:32 +00:00
cem	b0452a96d2	Add ddb(4) 'netdump' command to netdump a core without preconfiguration Add a 'X -s <server> -c <client> [-g <gateway>] -i <interface>' subroutine to the generic debugnet code. The imagined use is both netdump, shown here, and NetGDB (vaporware). It uses the ddb(4) lexer, with some new extensions, to parse out IPv4 addresses. 'Netdump' uses the generic debugnet routine to load a configuration and start a dump, without any netdump configuration prior to panic. Loosely derived from work by: John Reimer <john.reimer AT emc.com> Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D21460	2019-10-17 19:49:20 +00:00
glebius	a38665c7e7	Quickly fix up r353683: enter the epoch before calling into netisr_dispatch().	2019-10-17 17:02:50 +00:00
cem	f3a0ee41db	Split out a more generic debugnet(4) from netdump(4) Debugnet is a simplistic and specialized panic- or debug-time reliable datagram transport. It can drive a single connection at a time and is currently unidirectional (debug/panic machine transmit to remote server only). It is mostly a verbatim code lift from netdump(4). Netdump(4) remains the only consumer (until the rest of this patch series lands). The INET-specific logic has been extracted somewhat more thoroughly than previously in netdump(4), into debugnet_inet.c. UDP-layer logic and up, as much as possible as is protocol-independent, remains in debugnet.c. The separation is not perfect and future improvement is welcome. Supporting INET6 is a long-term goal. Much of the diff is "gratuitous" renaming from 'netdump_' or 'nd_' to 'debugnet_' or 'dn_' -- sorry. I thought keeping the netdump name on the generic module would be more confusing than the refactoring. The only functional change here is the mbuf allocation / tracking. Instead of initiating solely on netdump-configured interface(s) at dumpon(8) configuration time, we watch for any debugnet-enabled NIC for link activation and query it for mbuf parameters at that time. If they exceed the existing high-water mark allocation, we re-allocate and track the new high-water mark. Otherwise, we leave the pre-panic mbuf allocation alone. In a future patch in this series, this will allow initiating netdump from panic ddb(4) without pre-panic configuration. No other functional change intended. Reviewed by: markj (earlier version) Some discussion with: emaste, jhb Objection from: marius Differential Revision: https://reviews.freebsd.org/D21421	2019-10-17 16:23:03 +00:00
glebius	a6c3971a22	igmp_v1v2_queue_report() doesn't require epoch.	2019-10-17 16:02:34 +00:00
hselasky	94dc322ef6	Fix panic in network stack due to use after free when receiving partial fragmented packets before a network interface is detached. When sending IPv4 or IPv6 fragmented packets and a fragment is lost before the network device is freed, the mbuf making up the fragment will remain in the temporary hashed fragment list and cause a panic when it times out due to accessing a freed network interface structure. 1) Make sure the m_pkthdr.rcvif always points to a valid network interface. Else the rcvif field should be set to NULL. 2) Use the rcvif of the last received fragment as m_pkthdr.rcvif for the fully defragged packet, instead of the first received fragment. Panic backtrace for IPv6: panic() icmp6_reflect() # tries to access rcvif->if_afdata[AF_INET6]->xxx icmp6_error() frag6_freef() frag6_slowtimo() pfslowtimo() softclock_call_cc() softclock() ithread_loop() Reviewed by: bz Differential Revision: https://reviews.freebsd.org/D19622 MFC after: 1 week Sponsored by: Mellanox Technologies	2019-10-16 09:11:49 +00:00
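
The ssthresh reset described in commit 8eade0cb10 above (for an idle CUBIC session, take the higher of the previous ssthresh and 3/4 of the prior cwnd) can be sketched in standalone C as follows. This is only an illustration of the rule as stated in the commit message, not the kernel's CUBIC module code; the function name `cubic_after_idle_ssthresh` and the use of plain `uint32_t` byte counts are assumptions made for the example.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a FreeBSD kernel function): return the
 * ssthresh to use after an idle period, following the rule quoted in
 * the commit message: max(previous ssthresh, 3/4 of the prior cwnd).
 */
static uint32_t
cubic_after_idle_ssthresh(uint32_t prev_ssthresh, uint32_t prior_cwnd)
{
	/* Widen to 64 bits so that cwnd * 3 cannot overflow. */
	uint32_t three_quarters = (uint32_t)(((uint64_t)prior_cwnd * 3) / 4);

	return (prev_ssthresh > three_quarters ? prev_ssthresh : three_quarters);
}
```

With a previous ssthresh of 100 and a prior cwnd of 200 the helper returns 150; with a previous ssthresh of 500 and the same cwnd it keeps 500.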

6332 Commits