freebsd-dev

Author	SHA1	Message	Date
Bjoern A. Zeeb	702828f643	frag6: do not leak counter in error cases When allocating the IPv6 fragement packet queue entry we do checks against counters and if we pass we increment one of the counters to claim the spot. Right after that we have two cases (malloc and MAC) which can both fail in which case we free the entry but never released our claim on the counter. In theory this can lead to not accepting new fragments after a long time, especially if it would be MAC "refusing" them. Rather than immediately subtracting the value in the error case, only increment it after these two cases so we can no longer leak it. MFC after: 3 weeks Sponsored by: Netflix	2019-10-25 16:29:09 +00:00
Bjoern A. Zeeb	619456bb59	frag6: prevent overwriting initial fragoff=0 packet meta-data. When we receive the packet with the first fragmented part (fragoff=0) we remember the length of the unfragmentable part and the next header (and should probably also remember ECN) as meta-data on the reassembly queue. Someone replying this packet so far could change these 2 (3) values. While changing the next header seems more severe, for a full size fragmented UDP packet, for example, adding an extension header to the unfragmentable part would go unnoticed (as the framented part would be considered an exact duplicate) but make reassembly fail. So do not allow updating the meta-data after we have seen the first fragmented part anymore. The frag6_20 test case is added which failed before triggering an ICMPv6 "param prob" due to the check for each queued fragment for a max-size violation if a fragoff=0 packet was received. MFC after: 3 weeks Sponsored by: Netflix	2019-10-24 22:07:45 +00:00
Bjoern A. Zeeb	cd188da20f	frag6: handling of overlapping fragments to conform to RFC 8200 While the comment was updated in r350746, the code was not. RFC8200 says that unless fragment overlaps are exact (same fragment twice) not only the current fragment but the entire reassembly queue for this packet must be silently discarded, which we now do if fragment offset and fragment length do not match. Obtained from: jtl MFC after: 3 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D16850	2019-10-24 20:22:52 +00:00
Michael Tuexen	4a91aa8fc9	Ensure that the flags indicating IPv4/IPv6 are not changed by failing bind() calls. This would lead to inconsistent state resulting in a panic. A fix for stable/11 was committed in https://svnweb.freebsd.org/base?view=revision&revision=338986 An accelerated MFC is planned as discussed with emaste@. Reported by: syzbot+2609a378d89264ff5a42@syzkaller.appspotmail.com Obtained from: jtl@ MFC after: 1 day Sponsored by: Netflix, Inc.	2019-10-24 20:05:10 +00:00
Bjoern A. Zeeb	53707abd41	frag6: export another counter read-only by sysctl Similar to the system global counter also export the per-VNET counter "frag6_nfragpackets" detailing the current number of fragment packets in this VNET's reassembly queues. The read-only counter is helpful for in-VNET statistical monitoring and for test-cases. MFC after: 3 weeks Sponsored by: Netflix	2019-10-24 20:00:37 +00:00
Bjoern A. Zeeb	dda02192f9	frag6: fix counter leak in error case and optimise code In case the first fragmented part (off=0) arrives we check for the maximum packet size for each fragmented part we already queued with the addition of the unfragmentable part from the first one. For one we do not have to enter the loop at all if this is the first fragmented part to arrive, and we can skip the check. Should we encounter an error case we send an ICMPv6 message for any fragment exceeding the maximum length limit. While dequeueing the original packet and freeing it, statistics were not updated and leaked both the reassembly queue count for the fragment and the global fragment count. Found by code inspection and confirmed by tightening test cases checking more statistical and system counters. While here properly wrap a line. MFC after: 3 weeks Sponsored by: Netflix	2019-10-24 19:57:18 +00:00
Bjoern A. Zeeb	e5fffe9a69	frag6.c: do not leak packet queue entry in error case When we are checking for the maximum reassembled packet size of the fragmentable part and run into the error case (packet too big), we are leaking the packet queue enntry if this was a first fragment to arrive. Properly cleanup, removing the queue entry from the bucket, decrementing counters, and freeing the memory. MFC after: 3 weeks Sponsored by: Netflix	2019-10-24 19:47:32 +00:00
Bjoern A. Zeeb	30809ba9e3	frag6: leave a note about upper layer header checks TBD Per sepcification the upper layer header needs to be within the first fragment. The check was not done so far and there is an open review for related work, so just leave a note as to where to put it. Move the extraction of frag offset up to this as it is needed to determine whether this is a first fragment or not. MFC after: 3 weeks Sponsored by: Netflix	2019-10-24 12:16:15 +00:00
Bjoern A. Zeeb	7715d794ef	frag6: check global limits before hash and lock Check whether we are accepting more fragments (based on global limits) before doing expensive operations of calculating the hash and taking the bucket lock. This slightly increases a "race" between check time and incrementing counters (which is already there) possibly allowing a few more fragments than the maximum limits. However, when under attack, we rather save this CPU time for other packets/work. MFC after: 3 weeks Sponsored by: Netflix	2019-10-24 11:58:24 +00:00
Bjoern A. Zeeb	efdfee93c0	frag6: small improvements Rather than walking the mbuf chain manually use m_last() which doing exactly that for us. Defer initializing srcifp for longer as there are multiple exit paths out of the function which do not need it set. Initialize before taking the lock though. Rename the mtx lock to match the type better. MFC after: 3 weeks Sponsored by: Netflix	2019-10-24 08:15:40 +00:00
Bjoern A. Zeeb	da89a0fe94	frag6: remove IP6_REASS_MBUF macro The IP6_REASS_MBUF() macro did some pointer gynmastics to end up with the same type as it gets in [(cast *)&]. Spelling it out instead saves all this and makes the code more readable and less obfuscated directly using the structure field. MFC after: 3 weeks Sponsored by: Netflix	2019-10-24 07:53:10 +00:00
Bjoern A. Zeeb	f1664f3258	frag6: add "big picture" Add some ASCII relation of how the bits plug together. The terminology difference of "fragmented packets" and "fragment packets" is subtle. While here clear up more whitespace and comments. No functional change. MFC after: 3 weeks Sponsored by: Netflix	2019-10-23 23:10:12 +00:00
Bjoern A. Zeeb	21f08a074d	frag6: replace KAME hand-rolled queues with queue(9) TAILQs Remove the KAME custom circular queue for fragments and fragmented packets and replace them with a standard TAILQ. This make the code a lot more understandable and maintainable and removes further hand-rolled code from the the tree using a standard interface instead. Hide the still public structures under #ifdef _KERNEL as there is no use for them in user space. The naming is a bit confusing now as struct ip6q and the ip6q[] buckets array are not the same anymore; sadly struct ip6q is also used by the MAC framework and we cannot rename it. Submitted by: jtl (initally) MFC after: 3 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D16847 (jtl's original)	2019-10-23 23:01:18 +00:00
Bjoern A. Zeeb	3c7165b35e	frag6: whitespace changes Remove trailing white space, add a blank line, and compress a comment. No functional changes. MFC after: 10 days Sponsored by: Netflix	2019-10-23 20:37:15 +00:00
Gleb Smirnoff	be0c32e2ff	Execute nd6_dad_timer() in the network epoch, since nd6_dad_duplicated() requires it. Make nd6_dad_starttimer() require network epoch. Two calls out of three happen from nd6_dad_timer(). Enter epoch in the remaining one.	2019-10-22 16:06:33 +00:00
Bjoern A. Zeeb	67a10c4644	frag6: fix vnet teardown leak When shutting down a VNET we did not cleanup the fragmentation hashes. This has multiple problems: (1) leak memory but also (2) leak on the global counters, which might eventually lead to a problem on a system starting and stopping a lot of vnets and dealing with a lot of IPv6 fragments that the counters/limits would be exhausted and processing would no longer take place. Unfortunately we do not have a useable variable to indicate when per-VNET initialization of frag6 has happened (or when destroy happened) so introduce a boolean to flag this. This is needed here as well as it was in r353635 for ip_reass.c in order to avoid tripping over the already destroyed locks if interfaces go away after the frag6 destroy. While splitting things up convert the TRY_LOCK to a LOCK operation in now frag6_drain_one(). The try-lock was derived from a manual hand-rolled implementation and carried forward all the time. We no longer can afford not to get the lock as that would mean we would continue to leak memory. Assert that all the buckets are empty before destroying to lock to ensure long-term stability of a clean shutdown. Reported by: hselasky Reviewed by: hselasky MFC after: 3 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D22054	2019-10-21 08:48:47 +00:00
Bjoern A. Zeeb	65456706c0	frag6: add read-only sysctl for nfrags. Add a read-only sysctl exporting the global number of fragments (base system and all vnets). This is helpful to (a) know how many fragments are currently being processed, (b) if there are possible leaks, (c) if vnet teardown is not working correctly, and lastly (d) it can be used as part of test-suits to ensure (a) to (c). MFC after: 3 weeks Sponsored by: Netflix	2019-10-21 08:36:15 +00:00
Hans Petter Selasky	a55383e720	Fix panic in network stack due to use after free when receiving partial fragmented packets before a network interface is detached. When sending IPv4 or IPv6 fragmented packets and a fragment is lost before the network device is freed, the mbuf making up the fragment will remain in the temporary hashed fragment list and cause a panic when it times out due to accessing a freed network interface structure. 1) Make sure the m_pkthdr.rcvif always points to a valid network interface. Else the rcvif field should be set to NULL. 2) Use the rcvif of the last received fragment as m_pkthdr.rcvif for the fully defragged packet, instead of the first received fragment. Panic backtrace for IPv6: panic() icmp6_reflect() # tries to access rcvif->if_afdata[AF_INET6]->xxx icmp6_error() frag6_freef() frag6_slowtimo() pfslowtimo() softclock_call_cc() softclock() ithread_loop() Reviewed by: bz Differential Revision: https://reviews.freebsd.org/D19622 MFC after: 1 week Sponsored by: Mellanox Technologies	2019-10-16 09:11:49 +00:00
Gleb Smirnoff	5f5ec65aaf	in6ifa_llaonifp() is never called from fast path, so do not require epoch being entered.	2019-10-14 15:33:53 +00:00
Michael Tuexen	583b625ba8	Remove line not needed. Submitted by: markj@ MFC after: 3 days	2019-10-13 09:35:03 +00:00
Gleb Smirnoff	ef2e580e56	Don't cover in6_ifattach() with network epoch, as it may call into network drivers ioctls, that may sleep. PR: 241223	2019-10-13 04:25:16 +00:00
Mark Johnston	49c5659e1c	Add a missing include of opt_sctp.h. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2019-10-12 22:58:33 +00:00
Gleb Smirnoff	1e4f4e56b9	ip6_output() has a complex set of gotos, and some can jump out of the epoch section towards return statement. Since entering epoch is cheap, it is easier to cover the whole function with epoch, rather than try to properly maintain its state.	2019-10-09 17:02:28 +00:00
Gleb Smirnoff	3af7f97c4e	Revert changes to rip6_bind() from r353292. This function is always called in syscall context, so it must enter epoch itself. This changeset originates from early version of the patch, and somehow slipped to the final version. Reported by: pho	2019-10-09 05:52:07 +00:00
Mark Johnston	cb49ec5431	Improve locking in the IPV6_V6ONLY socket option handler. Acquire the inp lock before checking whether the socket is already bound, and around updates to the inp_vflag field. MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21867	2019-10-07 23:35:23 +00:00
Gleb Smirnoff	b8a6e03fac	Widen NET_EPOCH coverage. When epoch(9) was introduced to network stack, it was basically dropped in place of existing locking, which was mutexes and rwlocks. For the sake of performance mutex covered areas were as small as possible, so became epoch covered areas. However, epoch doesn't introduce any contention, it just delays memory reclaim. So, there is no point to minimise epoch covered areas in sense of performance. Meanwhile entering/exiting epoch also has non-zero CPU usage, so doing this less often is a win. Not the least is also code maintainability. In the new paradigm we can assume that at any stage of processing a packet, we are inside network epoch. This makes coding both input and output path way easier. On output path we already enter epoch quite early - in the ip_output(), in the ip6_output(). This patch does the same for the input path. All ISR processing, network related callouts, other ways of packet injection to the network stack shall be performed in net_epoch. Any leaf function that walks network configuration now asserts epoch. Tricky part is configuration code paths - ioctls, sysctls. They also call into leaf functions, so some need to be changed. This patch would introduce more epoch recursions (see EPOCH_TRACE) than we had before. They will be cleaned up separately, as several of them aren't trivial. Note, that unlike a lock recursion the epoch recursion is safe and just wastes a bit of resources. Reviewed by: gallatin, hselasky, cy, adrian, kristof Differential Revision: https://reviews.freebsd.org/D19111	2019-10-07 22:40:05 +00:00
Michael Tuexen	e7a541b0b9	When processing an incoming IPv6 packet over the loopback interface which contains Hop-by-Hop options, the mbuf chain is potentially changed in ip6_hopopts_input(), called by ip6_input_hbh(). This can happen, because of the the use of IP6_EXTHDR_CHECK, which might call m_pullup(). So provide the updated pointer back to the called of ip6_input_hbh() to avoid using a freed mbuf chain in`ip6_input()`. Reviewed by: markj@ MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D21664	2019-09-19 10:22:29 +00:00
John Baldwin	b2e60773c6	Add kernel-side support for in-kernel TLS. KTLS adds support for in-kernel framing and encryption of Transport Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports offload of TLS for transmitted data. Key negotation must still be performed in userland. Once completed, transmit session keys for a connection are provided to the kernel via a new TCP_TXTLS_ENABLE socket option. All subsequent data transmitted on the socket is placed into TLS frames and encrypted using the supplied keys. Any data written to a KTLS-enabled socket via write(2), aio_write(2), or sendfile(2) is assumed to be application data and is encoded in TLS frames with an application data type. Individual records can be sent with a custom type (e.g. handshake messages) via sendmsg(2) with a new control message (TLS_SET_RECORD_TYPE) specifying the record type. At present, rekeying is not supported though the in-kernel framework should support rekeying. KTLS makes use of the recently added unmapped mbufs to store TLS frames in the socket buffer. Each TLS frame is described by a single ext_pgs mbuf. The ext_pgs structure contains the header of the TLS record (and trailer for encrypted records) as well as references to the associated TLS session. KTLS supports two primary methods of encrypting TLS frames: software TLS and ifnet TLS. Software TLS marks mbufs holding socket data as not ready via M_NOTREADY similar to sendfile(2) when TLS framing information is added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then called to schedule TLS frames for encryption. In the case of sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving the mbufs marked M_NOTREADY until encryption is completed. For other writes (vn_sendfile when pages are available, write(2), etc.), the PRUS_NOTREADY is set when invoking pru_send() along with invoking ktls_enqueue(). A pool of worker threads (the "KTLS" kernel process) encrypts TLS frames queued via ktls_enqueue(). Each TLS frame is temporarily mapped using the direct map and passed to a software encryption backend to perform the actual encryption. (Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if someone wished to make this work on architectures without a direct map.) KTLS supports pluggable software encryption backends. Internally, Netflix uses proprietary pure-software backends. This commit includes a simple backend in a new ktls_ocf.ko module that uses the kernel's OpenCrypto framework to provide AES-GCM encryption of TLS frames. As a result, software TLS is now a bit of a misnomer as it can make use of hardware crypto accelerators. Once software encryption has finished, the TLS frame mbufs are marked ready via pru_ready(). At this point, the encrypted data appears as regular payload to the TCP stack stored in unmapped mbufs. ifnet TLS permits a NIC to offload the TLS encryption and TCP segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS) is allocated on the interface a socket is routed over and associated with a TLS session. TLS records for a TLS session using ifnet TLS are not marked M_NOTREADY but are passed down the stack unencrypted. The ip_output_send() and ip6_output_send() helper functions that apply send tags to outbound IP packets verify that the send tag of the TLS record matches the outbound interface. If so, the packet is tagged with the TLS send tag and sent to the interface. The NIC device driver must recognize packets with the TLS send tag and schedule them for TLS encryption and TCP segmentation. If the the outbound interface does not match the interface in the TLS send tag, the packet is dropped. In addition, a task is scheduled to refresh the TLS send tag for the TLS session. If a new TLS send tag cannot be allocated, the connection is dropped. If a new TLS send tag is allocated, however, subsequent packets will be tagged with the correct TLS send tag. (This latter case has been tested by configuring both ports of a Chelsio T6 in a lagg and failing over from one port to another. As the connections migrated to the new port, new TLS send tags were allocated for the new port and connections resumed without being dropped.) ifnet TLS can be enabled and disabled on supported network interfaces via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported across both vlan devices and lagg interfaces using failover, lacp with flowid enabled, or lacp with flowid enabled. Applications may request the current KTLS mode of a connection via a new TCP_TXTLS_MODE socket option. They can also use this socket option to toggle between software and ifnet TLS modes. In addition, a testing tool is available in tools/tools/switch_tls. This is modeled on tcpdrop and uses similar syntax. However, instead of dropping connections, -s is used to force KTLS connections to switch to software TLS and -i is used to switch to ifnet TLS. Various sysctls and counters are available under the kern.ipc.tls sysctl node. The kern.ipc.tls.enable node must be set to true to enable KTLS (it is off by default). The use of unmapped mbufs must also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS. KTLS is enabled via the KERN_TLS kernel option. This patch is the culmination of years of work by several folks including Scott Long and Randall Stewart for the original design and implementation; Drew Gallatin for several optimizations including the use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records awaiting software encryption, and pluggable software crypto backends; and John Baldwin for modifications to support hardware TLS offload. Reviewed by: gallatin, hselasky, rrs Obtained from: Netflix Sponsored by: Netflix, Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21277	2019-08-27 00:01:56 +00:00
Bjoern A. Zeeb	1540a98e36	frag6: move public structure into file local space. Move ip6asfrag and the accompanying IP6_REASS_MBUF macro from ip6_var.h into frag6.c as they are not used outside frag6.c. Sadly struct ip6q is all over the mac framework so we have to leave it public. This reduces the public KPI space. MFC after: 3 months X-MFC: possibly MFC the #define only to stable branches Sponsored by: Netflix	2019-08-08 10:59:54 +00:00
Bjoern A. Zeeb	5778b399f1	frag6.c: cleanup varaibles and return statements. Consitently put () around return values. Do not assign variables at the time of variable declaration. Sort variables. Rename ia to ia6, remove/reuse some variables used only once or twice for temporary calculations. No functional changes intended. MFC after: 3 months Sponsored by: Netflix	2019-08-08 10:15:47 +00:00
Bjoern A. Zeeb	23d374aa14	frag6.c: initial comment and whitespace cleanup. Cleanup some comments (start with upper case, ends in punctuation, use width and do not consume vertical space). Update comments to RFC8200. Some whitespace changes. No functional changes. MFC after: 3 months Sponsored by: Netflix	2019-08-08 09:42:57 +00:00
Ed Maste	7f8c266da5	Correct ICMPv6/MLDv2 out-of-bounds memory access Previously the ICMPv6 input path incorrectly handled cases where an MLDv2 listener query packet was internally fragmented across multiple mbufs. admbugs: 921 Submitted by: jtl Reported by: CJD of Apple Approved by: so MFC after: 0 minutes Security: CVE-2019-5608	2019-08-06 17:11:30 +00:00
Michael Tuexen	94962f6ba0	Improve consistency. No functional change. MFC after: 3 days	2019-08-05 13:22:15 +00:00
Bjoern A. Zeeb	9cb1a47af2	frag6.c: rename ip6q[] to ipq6b[] and consistently use "bucket" The hash buckets array is called ip6q. The data structure ip6q is a description of different object, the one the array holds these days (since r337776). To clear some of this confusion, rename the array to ip6qb. When iterating over all buckets or addressing them directly, we use at least the variables i, hash, and bucket. To keep the terminology consistent use the variable name "bucket" and always make it an uint32_t and not sometimes an int. No functional behaviour changes intended. MFC after: 3 months Sponsored by: Netflix	2019-08-05 11:01:12 +00:00
Bjoern A. Zeeb	c00464a245	frag6.c: re-order functions within file Re-order functions within the file in preparation for an upcoming code simplification. No functional changes. MFC after: 3 months Sponsored by: Netflix	2019-08-05 09:49:24 +00:00
Bjoern A. Zeeb	f349c821f5	frag6.c: fix includes Bring back systm.h after r350532 and banish errno.h, time.h, and machine/atomic.h. Reported by: bde (Thank you!) Pointyhat to: bz MFC after: 12 weeks X-MFC: with r350532 Sponsored by: Netflix	2019-08-03 16:56:44 +00:00
Bjoern A. Zeeb	09b361c792	frag6.c: make compile with gcc Removing the prototype from the header and making the function static in r350533 makes architectures using gcc complain "function declaration isn't a prototype". Add the missing void given the function has no arguments. Reported by: the CI machinery Pointyhat to: bz MFC after: 3 months X-MFC with: r350533 Sponsored by: Netflix	2019-08-02 11:05:00 +00:00
Bjoern A. Zeeb	487a161cff	frag6.c: rename malloc type Rename M_FTABLE to M_FRAG6 as the former sounds very much like the former "flowtable" rather than anything to do with fragments and reassembly. While here, let malloc( , .. \| M_ZERO) do the zeroing rather than calling bzero() ourselves. MFC after: 3 months Sponsored by: Netflix	2019-08-02 10:54:57 +00:00
Bjoern A. Zeeb	a687de6aee	frag6.c: remove dead code Remove all the #if 0 and #if notyet blocks of dead code which have been there for at least 18 years from what I can see. No functional changes. MFC after: 3 months Sponsored by: Netflix	2019-08-02 10:41:51 +00:00
Bjoern A. Zeeb	757cb678e5	frag6.c: move variables and sysctls into local file Move the sysctls and the related variables only used in frag6.c into the file and out of in6_proto.c. That way everything belonging together is in one place. Sort the variables into global and per-vnet scopes and make them static. No longer export the (helper) function frag6_set_bucketsize() now also file-local only. Should be no functional changes, only reduced public KPI/KBI surface. MFC after: 3 months Sponsored by: Netflix	2019-08-02 10:29:53 +00:00
Bjoern A. Zeeb	1a3044fa2c	frag6.c: sort includes Sort includes and remove duplicate kernel.h as well as the unneeded systm.h. Hide the mac framework incude behind #fidef MAC. MFC after: 3 months Sponsored by: Netflix	2019-08-02 10:06:54 +00:00
Bjoern A. Zeeb	0ecd976e80	IPv6 cleanup: kernel Finish what was started a few years ago and harmonize IPv6 and IPv4 kernel names. We are down to very few places now that it is feasible to do the change for everything remaining with causing too much disturbance. Remove "aliases" for IPv6 names which confusingly could indicate that we are talking about a different data structure or field or have two fields, one for each address family. Try to follow common conventions used in FreeBSD. * Rename sin6p to sin6 as that is how it is spelt in most places. * Remove "aliases" (#defines) for: - in6pcb which really is an inpcb and nothing separate - sotoin6pcb which is sotoinpcb (as per above) - in6p_sp which is inp_sp - in6p_flowinfo which is inp_flow * Try to use ia6 for in6_addr rather than in6p. * With all these gone also rename the in6p variables to inp as that is what we call it in most of the network stack including parts of netinet6. The reasons behind this cleanup are that we try to further unify netinet and netinet6 code where possible and that people will less ignore one or the other protocol family when doing code changes as they may not have spotted places due to different names for the same thing. No functional changes. Discussed with: tuexen (SCTP changes) MFC after: 3 months Sponsored by: Netflix	2019-08-02 07:41:36 +00:00
Michael Tuexen	8a956abe12	When calling sctp_initialize_auth_params(), the inp must have at least a read lock. To avoid more complex locking dances, just call it in sctp_aloc_assoc() when the write lock is still held. Reported by: syzbot+08a486f7e6966f1c3cfb@syzkaller.appspotmail.com MFC after: 1 week	2019-07-14 12:04:39 +00:00
Michael Tuexen	9e44bc22d8	r348494 fixes a race in udp_output(). The same race exists in udp_output6(), therefore apply a similar patch to IPv6. Reported by: syzbot+c5ffbc8f14294c7b0e54@syzkaller.appspotmail.com Reviewed by: bz@, markj@ MFC after: 2 weeks Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D20936	2019-07-13 12:45:08 +00:00
John Baldwin	82334850ea	Add an external mbuf buffer type that holds multiple unmapped pages. Unmapped mbufs allow sendfile to carry multiple pages of data in a single mbuf, without mapping those pages. It is a requirement for Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web serving workloads when used by sendfile, due to effectively compressing socket buffers by an order of magnitude, and hence reducing cache misses. For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer now points to a struct mbuf_ext_pgs structure instead of a data buffer. This structure contains an array of physical addresses (this reduces cache misses compared to an earlier version that stored an array of vm_page_t pointers). It also stores additional fields needed for in-kernel TLS such as the TLS header and trailer data that are currently unused. To more easily detect these mbufs, the M_NOMAP flag is set in m_flags in addition to M_EXT. Various functions like m_copydata() have been updated to safely access packet contents (using uiomove_fromphys()), to make things like BPF safe. NIC drivers advertise support for unmapped mbufs on transmit via a new IFCAP_NOMAP capability. This capability can be toggled via the new 'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only transmit packet contents via DMA and use bus_dma, adding the capability to if_capabilities and if_capenable should be all that is required. If a NIC does not support unmapped mbufs, they are converted to a chain of mapped mbufs (using sf_bufs to provide the mapping) in ip_output or ip6_output. If an unmapped mbuf requires software checksums, it is also converted to a chain of mapped mbufs before computing the checksum. Submitted by: gallatin (earlier version) Reviewed by: gallatin, hselasky, rrs Discussed with: ae, kp (firewalls) Relnotes: yes Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20616	2019-06-29 00:48:33 +00:00
Hans Petter Selasky	59854ecf55	Convert all IPv4 and IPv6 multicast memberships into using a STAILQ instead of a linear array. The multicast memberships for the inpcb structure are protected by a non-sleepable lock, INP_WLOCK(), which needs to be dropped when calling the underlying possibly sleeping if_ioctl() method. When using a linear array to keep track of multicast memberships, the computed memory location of the multicast filter may suddenly change, due to concurrent insertion or removal of elements in the linear array. This in turn leads to various invalid memory access issues and kernel panics. To avoid this problem, put all multicast memberships on a STAILQ based list. Then the memory location of the IPv4 and IPv6 multicast filters become fixed during their lifetime and use after free and memory leak issues are easier to track, for example by: vmstat -m \| grep multi All list manipulation has been factored into inline functions including some macros, to easily allow for a future hash-list implementation, if needed. This patch has been tested by pho@ . Differential Revision: https://reviews.freebsd.org/D20080 Reviewed by: markj @ MFC after: 1 week Sponsored by: Mellanox Technologies	2019-06-25 11:54:41 +00:00
John Baldwin	77a0144145	Sort opt_foo.h #includes and add a missing blank line in ip_output().	2019-06-11 22:07:39 +00:00
John Baldwin	fb3bc59600	Restructure mbuf send tags to provide stronger guarantees. - Perform ifp mismatch checks (to determine if a send tag is allocated for a different ifp than the one the packet is being output on), in ip_output() and ip6_output(). This avoids sending packets with send tags to ifnet drivers that don't support send tags. Since we are now checking for ifp mismatches before invoking if_output, we can now try to allocate a new tag before invoking if_output sending the original packet on the new tag if allocation succeeds. To avoid code duplication for the fragment and unfragmented cases, add ip_output_send() and ip6_output_send() as wrappers around if_output and nd6_output_ifp, respectively. All of the logic for setting send tags and dealing with send tag-related errors is done in these wrapper functions. For pseudo interfaces that wrap other network interfaces (vlan and lagg), wrapper send tags are now allocated so that ip*_output see the wrapper ifp as the ifp in the send tag. The if_transmit routines rewrite the send tags after performing an ifp mismatch check. If an ifp mismatch is detected, the transmit routines fail with EAGAIN. - To provide clearer life cycle management of send tags, especially in the presence of vlan and lagg wrapper tags, add a reference count to send tags managed via m_snd_tag_ref() and m_snd_tag_rele(). Provide a helper function (m_snd_tag_init()) for use by drivers supporting send tags. m_snd_tag_init() takes care of the if_ref on the ifp meaning that code alloating send tags via if_snd_tag_alloc no longer has to manage that manually. Similarly, m_snd_tag_rele drops the refcount on the ifp after invoking if_snd_tag_free when the last reference to a send tag is dropped. This also closes use after free races if there are pending packets in driver tx rings after the socket is closed (e.g. from tcpdrop). In order for m_free to work reliably, add a new CSUM_SND_TAG flag in csum_flags to indicate 'snd_tag' is set (rather than 'rcvif'). Drivers now also check this flag instead of checking snd_tag against NULL. This avoids false positive matches when a forwarded packet has a non-NULL rcvif that was treated as a send tag. - cxgbe was relying on snd_tag_free being called when the inp was detached so that it could kick the firmware to flush any pending work on the flow. This is because the driver doesn't require ACK messages from the firmware for every request, but instead does a kind of manual interrupt coalescing by only setting a flag to request a completion on a subset of requests. If all of the in-flight requests don't have the flag when the tag is detached from the inp, the flow might never return the credits. The current snd_tag_free command issues a flush command to force the credits to return. However, the credit return is what also frees the mbufs, and since those mbufs now hold references on the tag, this meant that snd_tag_free would never be called. To fix, explicitly drop the mbuf's reference on the snd tag when the mbuf is queued in the firmware work queue. This means that once the inp's reference on the tag goes away and all in-flight mbufs have been queued to the firmware, tag's refcount will drop to zero and snd_tag_free will kick in and send the flush request. Note that we need to avoid doing this in the middle of ethofld_tx(), so the driver grabs a temporary reference on the tag around that loop to defer the free to the end of the function in case it sends the last mbuf to the queue after the inp has dropped its reference on the tag. - mlx5 preallocates send tags and was using the ifp pointer even when the send tag wasn't in use. Explicitly use the ifp from other data structures instead. - Sprinkle some assertions in various places to assert that received packets don't have a send tag, and that other places that overwrite rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer. Reviewed by: gallatin, hselasky, rgrimes, ae Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20117	2019-05-24 22:30:40 +00:00
Andrey V. Elsukov	b1536a812b	Restore IPV6_NEXTHOP option support that seem was partially broken since r286195. Do not forget results of route lookup and initialize rt and ifp pointers. PR: 238098 Submitted by: Masse Nicolas <nicolas.masse at stormshield eu> MFC after: 1 week	2019-05-24 11:45:32 +00:00
Alexander V. Chernikov	563ab4e400	Fix gateway setup for the interface routes. Currently rinit1() and its IPv6 counterpart nd6_prefix_onlink_rtrequest() uses dummy null_sdl gateway address during route insertion and change it afterwards. This behaviour brings complications to the routing stack and the users of its upcoming notification system. This change fixes both rinit1() and nd6_prefix_onlink_rtrequest() by filling in proper gateway in the beginning. It does not change any of the userland notifications as in both cases, they happen after the insertion and fixup process (rt_newaddrmsg_fib() and nd6_rtmsg()). MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D20328	2019-05-22 21:20:15 +00:00

1 2 3 4 5 ...

1941 Commits