freebsd-skq

Author	SHA1	Message	Date
tuexen	0d946942e2	Fix a locking issue in sctp_accept. PR: 238520 Reported by: pho@ MFC after: 1 week	2019-08-06 10:29:19 +00:00
tuexen	622372367b	Fix build issues for the userland stack on Raspbian.	2019-08-06 08:33:21 +00:00
tuexen	d76ac3bf32	Improve consistency. No functional change. MFC after: 3 days	2019-08-05 13:22:15 +00:00
delphij	67daa950c9	Fix !INET build.	2019-08-02 22:43:09 +00:00
rrs	1ef9414634	Fix one more atomic for i386 Obtained from: mtuexen@freebsd.org	2019-08-02 11:17:07 +00:00
bz	f66d5bcdd2	IPv6 cleanup: kernel Finish what was started a few years ago and harmonize IPv6 and IPv4 kernel names. We are down to very few places now that it is feasible to do the change for everything remaining without causing too much disturbance. Remove "aliases" for IPv6 names which confusingly could indicate that we are talking about a different data structure or field or have two fields, one for each address family. Try to follow common conventions used in FreeBSD. * Rename sin6p to sin6 as that is how it is spelt in most places. * Remove "aliases" (#defines) for: - in6pcb which really is an inpcb and nothing separate - sotoin6pcb which is sotoinpcb (as per above) - in6p_sp which is inp_sp - in6p_flowinfo which is inp_flow * Try to use ia6 for in6_addr rather than in6p. * With all these gone also rename the in6p variables to inp as that is what we call it in most of the network stack including parts of netinet6. The reasons behind this cleanup are that we try to further unify netinet and netinet6 code where possible and that people will be less likely to ignore one or the other protocol family when doing code changes as they may not have spotted places due to different names for the same thing. No functional changes. Discussed with: tuexen (SCTP changes) MFC after: 3 months Sponsored by: Netflix	2019-08-02 07:41:36 +00:00
rrs	5f0d15ff5a	Oops, use fetchadd_u64, not long, to keep old 32-bit platforms happy.	2019-08-01 20:26:27 +00:00
tuexen	922cab2076	Fix the reporting of multiple unknown parameters in a received INIT chunk. This also plugs a potential mbuf leak. Thanks to Felix Weinrank for reporting this issue found by fuzz-testing the userland stack. MFC after: 3 days	2019-08-01 19:45:34 +00:00
tuexen	bc4fcf2068	When responding with an ABORT to an INIT chunk containing a HOSTNAME parameter or a parameter with an illegal length, only include an error cause indicating why the ABORT was sent. This also fixes an mbuf leak which could occur. MFC after: 3 days	2019-08-01 17:36:15 +00:00
rrs	8a34b17735	This adds the third step in getting BBR into the tree. BBR and an updated rack depend on having access to the new ratelimit api in this commit. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D20953	2019-08-01 14:17:31 +00:00
tuexen	21b067cdcc	Small cleanup, no functional change intended. MFC after: 3 days	2019-07-31 21:39:03 +00:00
tuexen	fcc3d645b6	Consistently cleanup mbufs in case of other memory errors. MFC after: 3 days	2019-07-31 21:29:17 +00:00
tuexen	0ee779dc75	When performing after_idle() or post_recovery(), don't disable the DCTCP specific methods. Also fallthrough NewReno for non ECN capable TCP connections and improve the integer arithmetic. Obtained from: Richard Scheffenegger MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D20550	2019-07-29 09:19:48 +00:00
tuexen	51e3e3ac0b	* Improve input validation of sysctl parameters for DCTCP. * Initialize the alpha parameter to a conservative value (like Linux) * Improve handling of arithmetic. * Improve man-page Obtained from: Richard Scheffenegger MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D20549	2019-07-29 08:50:35 +00:00
tuexen	bd266b70fa	Add a sysctl variable ts_offset_per_conn to change the computation of the TCP TS offset from taking the IP addresses and the TCP port numbers into account to a version just taking only the IP addresses into account. This works around broken middleboxes or endpoints. The default is to keep the behaviour, which is also the behaviour recommended in RFC 7323. Reported by: devgs@ukr.net Reviewed by: rrs@ MFC after: 2 weeks Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D20980	2019-07-23 21:28:20 +00:00
tuexen	04cac029e7	Don't hold a mutex while calling sbwait. This was found by syzkaller. Submitted by: rrs@ Reported by: markj@ MFC after: 1 week	2019-07-23 18:31:07 +00:00
tuexen	49092d73d3	Fix a LOR in SCTP which was found by running syzkaller. Submitted by: rrs@ Reported by: markj@ MFC after: 1 week	2019-07-23 18:07:36 +00:00
tuexen	79ab72bc8e	Wakeup the application when doing PD-API for unordered DATA chunks. Work done with rrs@. MFC after: 1 week	2019-07-22 18:11:35 +00:00
tuexen	3e4e5c46ac	Fix compilation on platforms using gcc. When compiling RACK on platforms using gcc, a warning that tcp_outflags is defined but not used is issued and terminates compilation on PPC64, for example. So don't indicate that tcp_outflags is used. Reviewed by: rrs@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D20971	2019-07-16 17:54:20 +00:00
tuexen	39695d1680	Don't free read control entries which are still on the stream queue when adding them to the read queue fails. MFC after: 1 week	2019-07-15 20:45:01 +00:00
tuexen	eb56d924ff	Add support for MSG_EOR and MSG_EOF in sendmsg() for SCTP. This is a FreeBSD extension, not covered by POSIX. This issue was found by running syzkaller. MFC after: 1 week	2019-07-15 14:54:04 +00:00
tuexen	e972541a65	Fix socket state handling when freeing an SCTP endpoint. This issue was found by running syzkaller. MFC after: 1 week	2019-07-15 14:52:52 +00:00
rrs	2bc1470fec	This is the second in a number of patches needed to get BBRv1 into the tree. This fixes the DSACK bug but is also needed by BBR. Two more are yet to come: one for the pacing code (tcp_ratelimit.c) and a second for the updated LRO code that allows a transport to know the arrival times of packets (tcp_lro.c). After that we should finally be able to get BBRv1 into head. Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D20908	2019-07-14 16:05:47 +00:00
tuexen	eabf786dc9	When calling sctp_initialize_auth_params(), the inp must have at least a read lock. To avoid more complex locking dances, just call it in sctp_aloc_assoc() when the write lock is still held. Reported by: syzbot+08a486f7e6966f1c3cfb@syzkaller.appspotmail.com MFC after: 1 week	2019-07-14 12:04:39 +00:00
rrs	cf7bf081a0	add back the comment around the pending DSACK fixes.	2019-07-12 11:45:42 +00:00
rrs	bf1f6e5c75	Update to jhb's other suggestion, use #error when we are missing HPTS.	2019-07-11 04:40:58 +00:00
rrs	358e84064e	Update copyright per jhb's suggestions. Thanks.	2019-07-11 04:38:33 +00:00
rrs	b80b5fa389	This commit updates rack to what is basically being used at NF and lays some of the groundwork for committing BBR. The hpts system is updated as well as some other needed utilities for the entrance of BBR. This is actually part 1 of 3 more needed commits which will finally complete with BBRv1 being added as a new tcp stack. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D20834	2019-07-10 20:40:39 +00:00
jhb	520aafe3ec	Add an external mbuf buffer type that holds multiple unmapped pages. Unmapped mbufs allow sendfile to carry multiple pages of data in a single mbuf, without mapping those pages. It is a requirement for Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web serving workloads when used by sendfile, due to effectively compressing socket buffers by an order of magnitude, and hence reducing cache misses. For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer now points to a struct mbuf_ext_pgs structure instead of a data buffer. This structure contains an array of physical addresses (this reduces cache misses compared to an earlier version that stored an array of vm_page_t pointers). It also stores additional fields needed for in-kernel TLS such as the TLS header and trailer data that are currently unused. To more easily detect these mbufs, the M_NOMAP flag is set in m_flags in addition to M_EXT. Various functions like m_copydata() have been updated to safely access packet contents (using uiomove_fromphys()), to make things like BPF safe. NIC drivers advertise support for unmapped mbufs on transmit via a new IFCAP_NOMAP capability. This capability can be toggled via the new 'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only transmit packet contents via DMA and use bus_dma, adding the capability to if_capabilities and if_capenable should be all that is required. If a NIC does not support unmapped mbufs, they are converted to a chain of mapped mbufs (using sf_bufs to provide the mapping) in ip_output or ip6_output. If an unmapped mbuf requires software checksums, it is also converted to a chain of mapped mbufs before computing the checksum. Submitted by: gallatin (earlier version) Reviewed by: gallatin, hselasky, rrs Discussed with: ae, kp (firewalls) Relnotes: yes Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20616	2019-06-29 00:48:33 +00:00
jhb	eb4237a478	Reject attempts to register a TCP stack being unloaded. Reviewed by: gallatin MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20617	2019-06-27 22:34:05 +00:00
hselasky	1a5fd513af	Convert all IPv4 and IPv6 multicast memberships into using a STAILQ instead of a linear array. The multicast memberships for the inpcb structure are protected by a non-sleepable lock, INP_WLOCK(), which needs to be dropped when calling the underlying possibly sleeping if_ioctl() method. When using a linear array to keep track of multicast memberships, the computed memory location of the multicast filter may suddenly change, due to concurrent insertion or removal of elements in the linear array. This in turn leads to various invalid memory access issues and kernel panics. To avoid this problem, put all multicast memberships on a STAILQ based list. Then the memory location of the IPv4 and IPv6 multicast filters become fixed during their lifetime and use after free and memory leak issues are easier to track, for example by: vmstat -m | grep multi All list manipulation has been factored into inline functions including some macros, to easily allow for a future hash-list implementation, if needed. This patch has been tested by pho@. Differential Revision: https://reviews.freebsd.org/D20080 Reviewed by: markj@ MFC after: 1 week Sponsored by: Mellanox Technologies	2019-06-25 11:54:41 +00:00
ae	c6d750cdc7	Add "tcpmss" opcode to match the TCP MSS value. With this opcode it is possible to match TCP packets with a specified MSS option whose value corresponds to the value configured in the opcode. It is allowed to specify a single value, a range of values, or an array of specific values or ranges. E.g. # ipfw add deny log tcp from any to any tcpmss 0-500 Reviewed by: melifaro,bcr Obtained from: Yandex LLC MFC after: 1 week Sponsored by: Yandex LLC	2019-06-21 10:54:51 +00:00
kp	5b2895d685	ip_output: pass PFIL_FWD in the slow path If we take the slow path for forwarding we should still tell our firewalls (hooked through pfil(9)) that we're forwarding. Pass the ip_output() flags to ip_output_pfil() so it can set the PFIL_FWD flag when we're forwarding. MFC after: 1 week Sponsored by: Axiado	2019-06-21 07:58:08 +00:00
jtl	a21d319b29	Add the ability to limit how much the code will fragment the RACK send map in response to SACKs. The default behavior is unchanged; however, the limit can be activated by changing the new net.inet.tcp.rack.split_limit sysctl. Submitted by: Peter Lei <peterlei@netflix.com> Reported by: jtl Reviewed by: lstewart (earlier version) Security: CVE-2019-5599	2019-06-19 13:55:00 +00:00
delphij	8581c5bfb9	Separate kernel crc32() implementation to its own header (gsb_crc32.h) and rename the source to gsb_crc32.c. This is a prerequisite of unifying kernel zlib instances. PR: 229763 Submitted by: Yoshihiro Ota <ota at j.email.ne.jp> Differential Revision: https://reviews.freebsd.org/D20193	2019-06-17 19:49:08 +00:00
jhb	6c30621191	Sort opt_foo.h #includes and add a missing blank line in ip_output().	2019-06-11 22:07:39 +00:00
bz	24f298a9c6	Fix dpcpu and vnet panics with complex types at the end of the section. Apply a linker script when linking i386 kernel modules to apply padding to a set_pcpu or set_vnet section. The padding value is kind-of random and is used to catch modules not compiled with the linker-script, so possibly still having problems leading to kernel panics. This is needed as the code generated on certain architectures for non-simple-types, e.g., an array can generate an absolute relocation on the edge (just outside) the section and thus will not be properly relocated. Adding the padding to the end of the section will ensure that even absolute relocations of complex types will be inside the section, if they are the last object in there and hence relocation will work properly and avoid panics such as observed with carp.ko or ipsec.ko. There is a rather lengthy discussion of various options to apply in the mentioned PRs and their depends/blocks, and the review. There seems to be no best solution working across multiple toolchains and multiple versions of them, so I took the liberty of taking one, as currently our users (and our CI system) are hitting this on just i386 and we need some solution. I wish we would have a proper fix rather than another "hack". Also backout r340009 which manually, temporarily fixed CARP before 12.0-R "by chance" after a lead-up of various other link-elf.c and related fixes. PR: 230857,238012 With suggestions from: arichardson (originally last year) Tested by: lwhsu Event: Waterloo Hackathon 2019 Reported by: lwhsu, olivier MFC after: 6 weeks Differential Revision: https://reviews.freebsd.org/D17512	2019-06-08 17:44:42 +00:00
tuexen	3b648a5e27	r347382 added receiver side DSACK support for the TCP base stack. The corresponding changes for the RACK stack were missed and are added by this commit. Reviewed by: Richard Scheffenegger, rrs@ MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D20372	2019-06-06 07:49:03 +00:00
bz	4dc2772cf1	After parts of the locking fixes in r346595, syzkaller found another one in udp_output(). This one is a race condition. We do check on the laddr and lport without holding a lock in order to determine whether we want a read or a write lock (this is in the "sendto/sendmsg" cases where addr (sin) is given). Instrumenting the kernel showed that after taking the lock, we had bound to a local port from a parallel thread on the same socket. If we find that case, unlock, and retry. Taking the write lock would not be a problem in the first place (apart from killing some parallelism). However the retry is needed as later on based on similar condition checks we do acquire the pcbinfo lock and if the conditions have changed, we might find ourselves with a lock inconsistency, hence at the end of the function when trying to unlock, hitting the KASSERT. Reported by: syzbot+bdf4caa36f3ceeac198f@syzkaller.appspotmail.com Reviewed by: markj MFC after: 6 weeks Event: Waterloo Hackathon 2019	2019-06-01 14:57:42 +00:00
markj	e9b44e8630	netdump: Buffer pages to avoid calling netdump_send() on each 4KB write. netdump waits for acknowledgement from the server for each write. When dumping page table pages, we perform many small writes, limiting throughput. Use the netdump client's buffer to buffer small contiguous writes before calling netdump_send() to flush the MAXDUMPPGS-sized buffer. This results in a significant reduction in the time taken to complete a netdump. Submitted by: Sam Gwydir <sam@samgwydir.com> Reviewed by: cem MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D20317	2019-05-31 18:29:12 +00:00
tuexen	6901687b42	When an ACK segment as the third message of the three way handshake is received and support for time stamps was negotiated in the SYN/SYNACK exchange, perform the PAWS check and only expand the syn cache entry if the check is passed. Without this check, endpoints may get stuck on the incomplete queue. Reviewed by: jtl@ MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D20374	2019-05-26 17:18:14 +00:00
jhb	5518ae8169	Restructure mbuf send tags to provide stronger guarantees. - Perform ifp mismatch checks (to determine if a send tag is allocated for a different ifp than the one the packet is being output on), in ip_output() and ip6_output(). This avoids sending packets with send tags to ifnet drivers that don't support send tags. Since we are now checking for ifp mismatches before invoking if_output, we can now try to allocate a new tag before invoking if_output sending the original packet on the new tag if allocation succeeds. To avoid code duplication for the fragment and unfragmented cases, add ip_output_send() and ip6_output_send() as wrappers around if_output and nd6_output_ifp, respectively. All of the logic for setting send tags and dealing with send tag-related errors is done in these wrapper functions. For pseudo interfaces that wrap other network interfaces (vlan and lagg), wrapper send tags are now allocated so that ip*_output see the wrapper ifp as the ifp in the send tag. The if_transmit routines rewrite the send tags after performing an ifp mismatch check. If an ifp mismatch is detected, the transmit routines fail with EAGAIN. - To provide clearer life cycle management of send tags, especially in the presence of vlan and lagg wrapper tags, add a reference count to send tags managed via m_snd_tag_ref() and m_snd_tag_rele(). Provide a helper function (m_snd_tag_init()) for use by drivers supporting send tags. m_snd_tag_init() takes care of the if_ref on the ifp meaning that code allocating send tags via if_snd_tag_alloc no longer has to manage that manually. Similarly, m_snd_tag_rele drops the refcount on the ifp after invoking if_snd_tag_free when the last reference to a send tag is dropped. This also closes use after free races if there are pending packets in driver tx rings after the socket is closed (e.g. from tcpdrop). In order for m_free to work reliably, add a new CSUM_SND_TAG flag in csum_flags to indicate 'snd_tag' is set (rather than 'rcvif'). 
Drivers now also check this flag instead of checking snd_tag against NULL. This avoids false positive matches when a forwarded packet has a non-NULL rcvif that was treated as a send tag. - cxgbe was relying on snd_tag_free being called when the inp was detached so that it could kick the firmware to flush any pending work on the flow. This is because the driver doesn't require ACK messages from the firmware for every request, but instead does a kind of manual interrupt coalescing by only setting a flag to request a completion on a subset of requests. If all of the in-flight requests don't have the flag when the tag is detached from the inp, the flow might never return the credits. The current snd_tag_free command issues a flush command to force the credits to return. However, the credit return is what also frees the mbufs, and since those mbufs now hold references on the tag, this meant that snd_tag_free would never be called. To fix, explicitly drop the mbuf's reference on the snd tag when the mbuf is queued in the firmware work queue. This means that once the inp's reference on the tag goes away and all in-flight mbufs have been queued to the firmware, tag's refcount will drop to zero and snd_tag_free will kick in and send the flush request. Note that we need to avoid doing this in the middle of ethofld_tx(), so the driver grabs a temporary reference on the tag around that loop to defer the free to the end of the function in case it sends the last mbuf to the queue after the inp has dropped its reference on the tag. - mlx5 preallocates send tags and was using the ifp pointer even when the send tag wasn't in use. Explicitly use the ifp from other data structures instead. - Sprinkle some assertions in various places to assert that received packets don't have a send tag, and that other places that overwrite rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer. 
Reviewed by: gallatin, hselasky, rgrimes, ae Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20117	2019-05-24 22:30:40 +00:00
bz	0f70df8712	Massively blow up the locking-related KASSERTs used to make sure that we end up in a consistent locking state at the end of udp_output() in order to be able to see what the values are based on which we once took a decision (note: some values may have changed). This helped to debug a syzkaller report. MFC after: 2 months Event: Waterloo Hackathon 2019	2019-05-21 19:23:56 +00:00
bz	f365d1c4d7	Similarly to r338257,338306 try to fold the two consecutive #ifdef RSS section in udp_output() into one by moving a '}' outside of the conditional block. MFC after: 2 months Event: Waterloo Hackathon 2019	2019-05-21 19:18:55 +00:00
cem	2e158b518b	Add two missing eventhandler.h headers These are obviously missing from the .c files, but don't show up in any tinderbox configuration (due to latent header pollution of some kind). It seems some configurations don't have this pollution, and the includes are obviously missing, so go ahead and add them. Reported by: Peter Jeremy <peter AT rulingia.com> X-MFC-With: r347984	2019-05-21 00:04:19 +00:00
cem	250e158ddf	Extract eventhandler declarations to sys/_eventhandler.h This allows replacing "sys/eventhandler.h" includes with "sys/_eventhandler.h" in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header pollution substantially. EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c files into appropriate headers (e.g., sys/proc.h, powernv/opal.h). As a side effect of reduced header pollution, many .c files and headers no longer contain needed definitions. The remainder of the patch addresses adding appropriate includes to fix those files. LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by sys/mutex.h since r326106 (but silently protected by header pollution prior to this change). No functional change (intended). Of course, any out of tree modules that relied on header pollution for sys/eventhandler.h, sys/lock.h, or sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.	2019-05-20 00:38:23 +00:00
tuexen	724fce5e3b	Allow sending on demand SCTP HEARTBEATS only in the ESTABLISHED state. This issue was found by running syzkaller. MFC after: 3 days	2019-05-19 17:53:36 +00:00
tuexen	568140500b	Improve input validation for the IPPROTO_SCTP level socket options SCTP_CONNECT_X and SCTP_CONNECT_X_DELAYED. Some issues were found by running syzkaller. MFC after: 3 days	2019-05-19 17:28:00 +00:00
markj	cd39cf0fa8	Revert r347582 for now. The inp lock still needs to be dropped when calling into the driver ioctl handler, as some drivers expect to be able to sleep. Reported by: kib	2019-05-16 13:04:26 +00:00
markj	46ad7dbca8	Close some races in multicast socket option handling. r333175 converted the global multicast lock to a sleepable sx lock, so the lock order with respect to the (non-sleepable) inp lock changed. To handle this, r333175 and r333505 added code to drop the inp lock, but this opened races that could leave multicast group description structures in an inconsistent state. This change fixes the problem by simply acquiring the global lock sooner. Along the way, this fixes some LORs and bogus error handling introduced in r333175, and commits some related cleanup. Reported by: syzbot+ba7c4943547e0604faca@syzkaller.appspotmail.com Reported by: syzbot+1b803796ab94d11a46f9@syzkaller.appspotmail.com Reviewed by: ae MFC after: 3 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20070	2019-05-14 21:30:55 +00:00

6216 Commits