freebsd-nq

Author	SHA1	Message	Date
Bjoern A. Zeeb	334fc5822b	vnet: virtualise more network stack sysctls. Virtualise tcp_always_keepalive, TCP and UDP log_in_vain. All three are set in the netoptions startup script, which we would love to run for VNETs as well [1]. While virtualising the log_in_vain sysctls seems pointless at first for as long as the kernel message buffer is not virtualised, it at least allows an administrator to debug the base system or an individual jail if needed without turning the logging on for all jails running on a system. PR: 243193 [1] MFC after: 2 weeks	2020-01-08 23:30:26 +00:00
Alexander V. Chernikov	e02d3fe70c	Fix rtsock route message generation for interface addresses. Reviewed by: olivier MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D22974	2020-01-07 21:16:30 +00:00
Gleb Smirnoff	e00ee1a9f4	In r343631 error code for a packet blocked by a firewall was changed from EACCES to EPERM. This change was not intentional, so fix that. Return EACCES if a firewall forbids sending. Noticed by: ae	2020-01-01 17:32:20 +00:00
Alexander V. Chernikov	bdb214a4a4	Remove useless code from in6_rmx.c The code in question walks the IPv6 tree every 60 seconds and looks into the routes with non-zero expiration time (typically, redirected routes). For each such route it sets the RTF_PROBEMTU flag at the expiration time. No other part of the kernel checks for the RTF_PROBEMTU flag. RTF_PROBEMTU was defined 21 years ago, 30 Jun 1999, as RTF_PROTO1. RTF_PROTO1 has been a de-facto standard indication of a route installed by a routing daemon for the last decade. Reviewed by: bz, ae MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D22865	2019-12-18 22:10:56 +00:00
Hans Petter Selasky	a4c5668d12	Leave multicast group before reaping and committing state for both IPv4 and IPv6. This fixes a regression issue after r349369. When trying to exit a multicast group before closing the socket, a multicast leave packet should be sent. Differential Revision: https://reviews.freebsd.org/D22848 PR: 242677 Reviewed by: bz (network) Tested by: Aleksandr Fedorov <aleksandr.fedorov@itglobal.com> MFC after: 1 week Sponsored by: Mellanox Technologies	2019-12-18 12:06:34 +00:00
Bjoern A. Zeeb	74ff87cd16	Update comment. Update the comment related to SIIT and v4mapped addresses being rejected by us when coming from the wire given we have supported IPv6-only kernels for a few years now. See also draft-itojun-v6ops-v4mapped-harmful. Suggested by: melifaro MFC after: 2 weeks	2019-12-06 16:53:42 +00:00
Bjoern A. Zeeb	b745e7623c	ip6_input: remove redundant v4mapped check In ip6_input() we apply the same v4mapped address check twice. The only case which skips the first one is M_FASTFWD_OURS which should have passed the check on the first input pass and passed the firewall. Remove the 2nd redundant check. Reviewed by: kp, melifaro MFC after: 2 weeks Sponsored by: Netflix (originally) Differential Revision: https://reviews.freebsd.org/D22462	2019-12-06 16:42:58 +00:00
Kristof Provost	200424235e	Remove useless NULL check Coverity points out that we've already dereferenced m by the time we check, so there's no reason to keep the check. Moreover, it's safe to pass NULL to m_freem() anyway. CID: 1019092	2019-12-05 16:50:54 +00:00
Bjoern A. Zeeb	0700d2c3f0	Make icmp6_reflect() static. icmp6_reflect() is not used anywhere outside icmp6.c, no reason to export it. Sponsored by: Netflix	2019-12-03 14:46:38 +00:00
Hans Petter Selasky	5b64b824b9	Use refcount from "in_joingroup_locked()" when joining multicast groups. Do not acquire additional references. This makes the IPv4 IGMP code in line with the IPv6 MLD code. Background: The IPv4 multicast code puts an extra reference on the in_multi struct when joining groups. This becomes visible when using daemons like igmpproxy from ports, that multicast entries do not disappear from the output of ifmcstat(8) when multicast streams are disconnected. This fixes a regression issue after r349762. While at it factor the ip_mfilter_insert() and ip6_mfilter_insert() calls to avoid repeated "is_new" check. Differential Revision: https://reviews.freebsd.org/D22595 Tested by: Guido van Rooij <guido@gvr.org> Reviewed by: rgrimes (network) MFC after: 1 week Sponsored by: Mellanox Technologies	2019-12-03 08:46:59 +00:00
Michael Tuexen	e25b0dab9a	Update the hostcache also for PTB messages received for SCTP/IPv6. The corresponding code for SCTP/IPv4 was introduced in https://svnweb.freebsd.org/base?view=revision&revision=317597 Submitted by: Julius Flohr MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D22605	2019-12-01 16:14:44 +00:00
Bjoern A. Zeeb	a4adf6cc65	Fix m_pullup() problem after removing PULLDOWN_TESTs and KAME EXT_*macros. r354748-354750 replaced the KAME macros with m_pulldown() calls. Contrary to the rest of the network stack m_len checks before m_pulldown() were not put in place (see r354748). Put these m_len checks in place for now (to go along with the style of the network stack since the initial commits). These are not put in for performance but to avoid an error scenario (even though it also will help performance at the moment as it avoids allocating an extra mbuf; not because of the unconditional function call). The observed error case went like this: (1) an mbuf with M_EXT arrives and we call m_pullup() unconditionally on it. (2) m_pullup() will call m_get() unless the requested length is larger than MHLEN (in which case it'll m_freem() the perfectly fine mbuf) and migrate the requested length of data and pkthdr into the new mbuf. (3) If m_get() succeeds, a further m_pullup() call going over MHLEN will fail. This was observed with failing auto-configuration as an RA packet of 200 bytes exceeded MHLEN and the m_pullup() called from nd6_ra_input() dropped the mbuf. (Re-)adding the m_len checks before m_pullup() calls avoids this problem with mbufs using external storage for now. MFC after: 3 weeks Sponsored by: Netflix	2019-12-01 00:22:04 +00:00
Ryan Libby	6afe56f9c3	in6_joingroup_locked: need if_addr_lock around in6m_disconnect_locked It looks like the call that requires the lock was introduced in r337866. Reviewed by: hselasky Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D20739	2019-11-25 22:25:10 +00:00
Bjoern A. Zeeb	f8d4f9bce9	in6: move include Move the include for sysctl.h out of the middle of the file to the includes at the beginning. This will make it easier to add new sysctls. No functional changes. MFC after: 3 weeks Sponsored by: Netflix	2019-11-19 21:14:15 +00:00
Bjoern A. Zeeb	3c5018ca10	nd6: sysctl Move the SYSCTL_DECL to the top of the file. Move the sysctl function before SYSCTL_PROC so that we don't need an extra function declaration in the middle of the file. No functional changes. MFC after: 3 weeks Sponsored by: Netflix	2019-11-19 21:08:18 +00:00
Bjoern A. Zeeb	6db6527385	nd6: make nd6_timer_ch static nd6_timer_ch is only used in file local context. There is no need to export it, so make it static. MFC after: 3 weeks Sponsored by: Netflix	2019-11-19 20:54:17 +00:00
Bjoern A. Zeeb	f77a6dbd1e	nd6_rtr: re-sort functions Resort functions within file in a way that they depend on each other as that makes it easier to rework various things. Also allows us to remove file local function declarations. No functional changes. MFC after: 3 weeks Sponsored by: Netflix	2019-11-19 20:34:33 +00:00
Bjoern A. Zeeb	b2b7a4b2ca	mld: fix epoch assertion in6ifa_ifpforlinklocal() asserts the net epoch. The test case from r354832 revealed code paths where we call into the function without having acquired the net epoch first and consequently we hit the assert. This happens in certain MLD states during VNET shutdown and most people normally do not notice this. For correctness acquire the net epoch around calls to mld_v1_transmit_report() in all cases to avoid the assertion firing. MFC after: 2 weeks Sponsored by: Netflix	2019-11-19 14:53:13 +00:00
Bjoern A. Zeeb	32af08ecad	icmpv6: Fix mbuf change in mld After r354748 mld_input() can change the mbuf. The new pointer is never returned to icmp6_input() and when passed to icmp6_rip6_input() the mbuf may no longer be valid, leading to a panic. Pass a pointer to the mbuf to mld_input() so we can return an updated version in the non-error case. Add a test case sending an MLD packet which will trigger this bug. Pointyhat to: bz Reported by: gallatin, thj MFC After: 2 weeks X-MFC with: r354748 Sponsored by: Netflix	2019-11-18 21:59:47 +00:00
Bjoern A. Zeeb	808c432f62	nd6: retire defrouter_select(), use _fib() variant. Burn bridges and replace the last two calls of defrouter_select() with defrouter_select_fib(). That allows us to retire defrouter_select() and make it more clear in the calling code that it applies to all FIBs. Sponsored by: Netflix	2019-11-16 00:17:35 +00:00
Bjoern A. Zeeb	f592d0c377	nd6_rtr: Pull in the TAILQ_HEAD() as it is not needed outside nd6_rtr.c. Rename the TAILQ_HEAD() struct and the nd_defrouter variable from "nd_" to "nd6_" as they are not part of the RFC 3542 API which uses "ND_". Ideally I'd like to also rename the struct nd_defrouter {} to "nd6_*" but given that is used externally there is more work to do. No functional changes. MFC after: 3 weeks Sponsored by: Netflix	2019-11-16 00:02:36 +00:00
Bjoern A. Zeeb	63abacc204	netinet*: replace IP6_EXTHDR_GET() In a few places we have IP6_EXTHDR_GET() left in upper layer protocols. The IP6_EXTHDR_GET() macro might perform an m_pulldown() in case the data fragment is not contiguous. Convert these last remaining instances into m_pullup()s instead. In CARP, for example, we will a few lines later call m_pullup() anyway, the IPsec code coming from OpenBSD would otherwise have done the m_pullup() and are copying the data a bit later anyway, so pulling it in seems no better or worse. Note: this leaves very few m_pulldown() cases behind in the tree and we might want to consider removing them as well to make mbuf management easier again on a path to variable size mbufs, especially given m_pulldown() still has an issue not re-checking M_WRITEABLE(). Reviewed by: gallatin MFC after: 8 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D22335	2019-11-15 21:44:17 +00:00
Bjoern A. Zeeb	a61b5cfbbf	netinet6: Remove PULLDOWN_TESTs. Remove the KAME introduced PULLDOWN_TESTs which did not even have a compile-time option in sys/conf to turn them on for a custom kernel build. They made the code a lot harder to read or more complicated in a few cases. Convert the IP6_EXTHDR_CHECK() calls into FreeBSD looking code. Rather than throwing the packet away if it would not fit the KAME mbuf expectations, convert the macros to m_pullup() calls. Do not do any extra manual conditional checks upfront as to whether the m_len would suffice (), simply let m_pullup() do its work (incl. an early check). Remove extra m_pullup() calls where earlier in the function or the only caller has already done the pullup. Discussed with: rwatson () Reviewed by: ae MFC after: 8 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D22334	2019-11-15 21:40:40 +00:00
Bjoern A. Zeeb	e20b5bc485	nd6: simplify code We are taking the same actions in both cases of the branch inside the block. Simplify that code as the extra branch is not needed. MFC after: 3 weeks Sponsored by: Netflix	2019-11-15 13:45:38 +00:00
Bjoern A. Zeeb	b3a25d2993	nd6: remove unused structs and defines Remove a collections of unused structs and #defines to make it easier to understand what is actually in use. Sponsored by: Netflix	2019-11-13 14:28:07 +00:00
Bjoern A. Zeeb	d64df9a2b2	nd6: make nd6_alloc() file static nd6_alloc() is a function used only locally. Make it static and no longer export it. Keeps the KPI smaller. Sponsored by: Netflix	2019-11-13 13:53:17 +00:00
Bjoern A. Zeeb	ad675b3279	nd6 defrouter: consolidate nd_defrouter manipulations in nd6_rtr.c Move the nd_defrouter along with the sysctl handler from nd6.c to nd6_rtr.c and make the variable file static. Provide (temporary) new accessor functions for code manipulating nd_defrouter from nd6.c, and stop exporting functions no longer needed outside nd6_rtr.c. This also shuffles a few functions around in nd6_rtr.c without functional changes. Given all nd_defrouter logic is now in one place we can tidy up the code, locking, and other open items. MFC after: 3 weeks X-MFC: keep exporting the functions Sponsored by: Netflix	2019-11-13 12:05:48 +00:00
Bjoern A. Zeeb	a8fe77d877	netinet: update mp to pass the proper value back In ip6_[direct_]input() we are looping over the extension headers to deal with the next header. We pass a pointer to an mbuf pointer to the handling functions. In certain cases the mbuf can be updated there and we need to pass the new one back. That is missing in dest6_input() and route6_input(). In tcp6_input() we should also update it before we call tcp_input(). In addition to that mark the mbuf NULL all the times when we return that we are done with handling the packet and no next header should be checked (IPPROTO_DONE). This will eventually allow us to assert proper behaviour and catch the above kind of errors more easily, expecting *mp to always be set. This change is extracted from a larger patch and not an exhaustive change across the entire stack yet. PR: 240135 Reported by: prabhakar.lakhera gmail.com MFC after: 3 weeks Sponsored by: Netflix	2019-11-12 15:46:28 +00:00
Gleb Smirnoff	c17cd08f53	It is unclear why in6_pcblookup_local() would require write access to the PCB hash. The function doesn't modify the hash. It always asserted write lock historically, but with epoch conversion this fails in some special cases. Reviewed by: rwatson, bz Reported-by: syzbot+0b0488ca537e20cb2429@syzkaller.appspotmail.com	2019-11-11 06:28:25 +00:00
Bjoern A. Zeeb	c1131de6f1	frag6: properly handle atomic fragments according to RFCs. RFC 8200 says: "If the fragment is a whole datagram (that is, both the Fragment Offset field and the M flag are zero), then it does not need any further reassembly and should be processed as a fully reassembled packet (i.e., updating Next Header, adjust Payload Length, removing the Fragment header, etc.). .." That means we should remove the fragment header and make all the adjustments rather than just skipping over the fragment header. The difference should be noticeable in that a properly handled atomic fragment triggering an ICMPv6 message at an upper layer (e.g. dest unreach, unreachable port) will not include the fragment header. Update the test cases to also test for an unfragmentable part. That is needed so that the next header is properly updated (not just lengths). MFC after: 3 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D22155	2019-11-08 14:36:44 +00:00
Gleb Smirnoff	2435e507de	Now with epoch synchronized PCB lookup tables we can greatly simplify locking in udp_output() and udp6_output(). First, we select if we need read or write lock in PCB itself, we take the lock and enter network epoch. Then, we proceed for the rest of the function. In case if we need to modify PCB hash, we would take write lock on it for a short piece of code. We could exit the epoch before allocating an mbuf, but with this patch we are keeping it all the way into ip_output()/ip6_output(). Today this creates an epoch recursion, since ip_output() enters epoch itself. However, once all protocols are reviewed, ip_output() and ip6_output() would require epoch instead of entering it. Note: I'm not 100% sure that in udp6_output() the epoch is required. We don't do PCB hash lookup for a bound socket. And all branches of in6_select_src() don't require epoch, at least they lack assertions. Today inet6 address list is protected by rmlock, although it is CKLIST. AFAIU, the future plan is to protect it by network epoch. That would require epoch in in6_select_src(). Anyway, in future ip6_output() would require epoch, udp6_output() would need to enter it.	2019-11-07 21:01:36 +00:00
Gleb Smirnoff	d797164a86	Since r353292 on input path we are always in network epoch, when we lookup PCBs. Thus, do not enter epoch recursively in in_pcblookup_hash() and in6_pcblookup_hash(). Same applies to tcp_ctlinput() and tcp6_ctlinput(). This leaves several sysctl(9) handlers that return PCB credentials unprotected. Add epoch enter/exit to all of them. Differential Revision: https://reviews.freebsd.org/D22197	2019-11-07 20:49:56 +00:00
Gleb Smirnoff	cf377af6e2	Remove unnecessary recursive epoch enter via INP_INFO_RLOCK macro in icmp6_rip6_input(). It shall always run in the network epoch.	2019-11-07 20:43:12 +00:00
Gleb Smirnoff	f42347c39a	Remove unnecessary recursive epoch enter via INP_INFO_RLOCK macro in raw input functions for IPv4 and IPv6. They shall always run in the network epoch.	2019-11-07 20:40:44 +00:00
Gleb Smirnoff	8d28524a90	Remove unnecessary recursive epoch enter via INP_INFO_RLOCK macro in udp6_input(). It shall always run in the network epoch.	2019-11-07 20:38:53 +00:00
Bjoern A. Zeeb	503f4e4736	netinet*: variable cleanup In preparation for another change factor out various variable cleanups. These mainly include: (1) do not assign values to variables during declaration: this makes the code more readable and does allow for better grouping of variable declarations, (2) do not assign values to variables before need; e.g., if a variable is only used in the 2nd half of a function and we have multiple return paths before that, then do not set it before it is needed, and (3) try to avoid assigning the same value multiple times. MFC after: 3 weeks Sponsored by: Netflix	2019-11-07 18:29:51 +00:00
Gleb Smirnoff	751d8d156a	Widen network epoch coverage in nd6_prefix_onlink() as in6ifa_ifpforlinklocal() requires the epoch. Reported by: bz Reviewed by: bz	2019-11-07 17:00:20 +00:00
Gleb Smirnoff	d6dbfed81e	In nd6_timer() enter the network epoch earlier. The defrouter_del() may call into leaf functions that require epoch. Since the function is already run in non-sleepable context, it should be safe to cover it whole with epoch. Reported by: syzkaller	2019-11-04 17:35:37 +00:00
Bjoern A. Zeeb	6e6b5143f5	Properly set VNET when nuking recvif from fragment queues. In theory the eventhandler invoke should be in the same VNET as the current interface. We however cannot guarantee that for all cases in the future. So before checking if the fragmentation handling for this VNET is active, switch the VNET to the VNET of the interface to always get the one we want. Reviewed by: hselasky MFC after: 3 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D22153	2019-10-25 18:54:06 +00:00
Bjoern A. Zeeb	702828f643	frag6: do not leak counter in error cases When allocating the IPv6 fragment packet queue entry we do checks against counters and if we pass we increment one of the counters to claim the spot. Right after that we have two cases (malloc and MAC) which can both fail in which case we free the entry but never released our claim on the counter. In theory this can lead to not accepting new fragments after a long time, especially if it would be MAC "refusing" them. Rather than immediately subtracting the value in the error case, only increment it after these two cases so we can no longer leak it. MFC after: 3 weeks Sponsored by: Netflix	2019-10-25 16:29:09 +00:00
Bjoern A. Zeeb	619456bb59	frag6: prevent overwriting initial fragoff=0 packet meta-data. When we receive the packet with the first fragmented part (fragoff=0) we remember the length of the unfragmentable part and the next header (and should probably also remember ECN) as meta-data on the reassembly queue. Someone replaying this packet so far could change these 2 (3) values. While changing the next header seems more severe, for a full size fragmented UDP packet, for example, adding an extension header to the unfragmentable part would go unnoticed (as the fragmented part would be considered an exact duplicate) but make reassembly fail. So do not allow updating the meta-data after we have seen the first fragmented part anymore. The frag6_20 test case is added which failed before triggering an ICMPv6 "param prob" due to the check for each queued fragment for a max-size violation if a fragoff=0 packet was received. MFC after: 3 weeks Sponsored by: Netflix	2019-10-24 22:07:45 +00:00
Bjoern A. Zeeb	cd188da20f	frag6: handling of overlapping fragments to conform to RFC 8200 While the comment was updated in r350746, the code was not. RFC8200 says that unless fragment overlaps are exact (same fragment twice) not only the current fragment but the entire reassembly queue for this packet must be silently discarded, which we now do if fragment offset and fragment length do not match. Obtained from: jtl MFC after: 3 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D16850	2019-10-24 20:22:52 +00:00
Michael Tuexen	4a91aa8fc9	Ensure that the flags indicating IPv4/IPv6 are not changed by failing bind() calls. This would lead to inconsistent state resulting in a panic. A fix for stable/11 was committed in https://svnweb.freebsd.org/base?view=revision&revision=338986 An accelerated MFC is planned as discussed with emaste@. Reported by: syzbot+2609a378d89264ff5a42@syzkaller.appspotmail.com Obtained from: jtl@ MFC after: 1 day Sponsored by: Netflix, Inc.	2019-10-24 20:05:10 +00:00
Bjoern A. Zeeb	53707abd41	frag6: export another counter read-only by sysctl Similar to the system global counter also export the per-VNET counter "frag6_nfragpackets" detailing the current number of fragment packets in this VNET's reassembly queues. The read-only counter is helpful for in-VNET statistical monitoring and for test-cases. MFC after: 3 weeks Sponsored by: Netflix	2019-10-24 20:00:37 +00:00
Bjoern A. Zeeb	dda02192f9	frag6: fix counter leak in error case and optimise code In case the first fragmented part (off=0) arrives we check for the maximum packet size for each fragmented part we already queued with the addition of the unfragmentable part from the first one. For one we do not have to enter the loop at all if this is the first fragmented part to arrive, and we can skip the check. Should we encounter an error case we send an ICMPv6 message for any fragment exceeding the maximum length limit. While dequeueing the original packet and freeing it, statistics were not updated and leaked both the reassembly queue count for the fragment and the global fragment count. Found by code inspection and confirmed by tightening test cases checking more statistical and system counters. While here properly wrap a line. MFC after: 3 weeks Sponsored by: Netflix	2019-10-24 19:57:18 +00:00
Bjoern A. Zeeb	e5fffe9a69	frag6.c: do not leak packet queue entry in error case When we are checking for the maximum reassembled packet size of the fragmentable part and run into the error case (packet too big), we are leaking the packet queue entry if this was a first fragment to arrive. Properly cleanup, removing the queue entry from the bucket, decrementing counters, and freeing the memory. MFC after: 3 weeks Sponsored by: Netflix	2019-10-24 19:47:32 +00:00
Bjoern A. Zeeb	30809ba9e3	frag6: leave a note about upper layer header checks TBD Per specification the upper layer header needs to be within the first fragment. The check was not done so far and there is an open review for related work, so just leave a note as to where to put it. Move the extraction of frag offset up to this as it is needed to determine whether this is a first fragment or not. MFC after: 3 weeks Sponsored by: Netflix	2019-10-24 12:16:15 +00:00
Bjoern A. Zeeb	7715d794ef	frag6: check global limits before hash and lock Check whether we are accepting more fragments (based on global limits) before doing expensive operations of calculating the hash and taking the bucket lock. This slightly increases a "race" between check time and incrementing counters (which is already there) possibly allowing a few more fragments than the maximum limits. However, when under attack, we rather save this CPU time for other packets/work. MFC after: 3 weeks Sponsored by: Netflix	2019-10-24 11:58:24 +00:00
Bjoern A. Zeeb	efdfee93c0	frag6: small improvements Rather than walking the mbuf chain manually use m_last() which does exactly that for us. Defer initializing srcifp for longer as there are multiple exit paths out of the function which do not need it set. Initialize before taking the lock though. Rename the mtx lock to match the type better. MFC after: 3 weeks Sponsored by: Netflix	2019-10-24 08:15:40 +00:00
Bjoern A. Zeeb	da89a0fe94	frag6: remove IP6_REASS_MBUF macro The IP6_REASS_MBUF() macro did some pointer gymnastics to end up with the same type as it gets in [(cast *)&]. Spelling it out instead saves all this and makes the code more readable and less obfuscated directly using the structure field. MFC after: 3 weeks Sponsored by: Netflix	2019-10-24 07:53:10 +00:00
Bjoern A. Zeeb	f1664f3258	frag6: add "big picture" Add some ASCII relation of how the bits plug together. The terminology difference of "fragmented packets" and "fragment packets" is subtle. While here clear up more whitespace and comments. No functional change. MFC after: 3 weeks Sponsored by: Netflix	2019-10-23 23:10:12 +00:00
Bjoern A. Zeeb	21f08a074d	frag6: replace KAME hand-rolled queues with queue(9) TAILQs Remove the KAME custom circular queue for fragments and fragmented packets and replace them with a standard TAILQ. This makes the code a lot more understandable and maintainable and removes further hand-rolled code from the tree using a standard interface instead. Hide the still public structures under #ifdef _KERNEL as there is no use for them in user space. The naming is a bit confusing now as struct ip6q and the ip6q[] buckets array are not the same anymore; sadly struct ip6q is also used by the MAC framework and we cannot rename it. Submitted by: jtl (initially) MFC after: 3 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D16847 (jtl's original)	2019-10-23 23:01:18 +00:00
Bjoern A. Zeeb	3c7165b35e	frag6: whitespace changes Remove trailing white space, add a blank line, and compress a comment. No functional changes. MFC after: 10 days Sponsored by: Netflix	2019-10-23 20:37:15 +00:00
Gleb Smirnoff	be0c32e2ff	Execute nd6_dad_timer() in the network epoch, since nd6_dad_duplicated() requires it. Make nd6_dad_starttimer() require network epoch. Two calls out of three happen from nd6_dad_timer(). Enter epoch in the remaining one.	2019-10-22 16:06:33 +00:00
Bjoern A. Zeeb	67a10c4644	frag6: fix vnet teardown leak When shutting down a VNET we did not clean up the fragmentation hashes. This has multiple problems: (1) leak memory but also (2) leak on the global counters, which might eventually lead to a problem on a system starting and stopping a lot of vnets and dealing with a lot of IPv6 fragments that the counters/limits would be exhausted and processing would no longer take place. Unfortunately we do not have a usable variable to indicate when per-VNET initialization of frag6 has happened (or when destroy happened) so introduce a boolean to flag this. This is needed here as well as it was in r353635 for ip_reass.c in order to avoid tripping over the already destroyed locks if interfaces go away after the frag6 destroy. While splitting things up convert the TRY_LOCK to a LOCK operation in now frag6_drain_one(). The try-lock was derived from a manual hand-rolled implementation and carried forward all the time. We no longer can afford not to get the lock as that would mean we would continue to leak memory. Assert that all the buckets are empty before destroying the lock to ensure long-term stability of a clean shutdown. Reported by: hselasky Reviewed by: hselasky MFC after: 3 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D22054	2019-10-21 08:48:47 +00:00
Bjoern A. Zeeb	65456706c0	frag6: add read-only sysctl for nfrags. Add a read-only sysctl exporting the global number of fragments (base system and all vnets). This is helpful to (a) know how many fragments are currently being processed, (b) if there are possible leaks, (c) if vnet teardown is not working correctly, and lastly (d) it can be used as part of test suites to ensure (a) to (c). MFC after: 3 weeks Sponsored by: Netflix	2019-10-21 08:36:15 +00:00
Hans Petter Selasky	a55383e720	Fix panic in network stack due to use after free when receiving partial fragmented packets before a network interface is detached. When sending IPv4 or IPv6 fragmented packets and a fragment is lost before the network device is freed, the mbuf making up the fragment will remain in the temporary hashed fragment list and cause a panic when it times out due to accessing a freed network interface structure. 1) Make sure the m_pkthdr.rcvif always points to a valid network interface. Else the rcvif field should be set to NULL. 2) Use the rcvif of the last received fragment as m_pkthdr.rcvif for the fully defragged packet, instead of the first received fragment. Panic backtrace for IPv6: panic() icmp6_reflect() # tries to access rcvif->if_afdata[AF_INET6]->xxx icmp6_error() frag6_freef() frag6_slowtimo() pfslowtimo() softclock_call_cc() softclock() ithread_loop() Reviewed by: bz Differential Revision: https://reviews.freebsd.org/D19622 MFC after: 1 week Sponsored by: Mellanox Technologies	2019-10-16 09:11:49 +00:00
Gleb Smirnoff	5f5ec65aaf	in6ifa_llaonifp() is never called from fast path, so do not require epoch being entered.	2019-10-14 15:33:53 +00:00
Michael Tuexen	583b625ba8	Remove line not needed. Submitted by: markj@ MFC after: 3 days	2019-10-13 09:35:03 +00:00
Gleb Smirnoff	ef2e580e56	Don't cover in6_ifattach() with network epoch, as it may call into network drivers ioctls, that may sleep. PR: 241223	2019-10-13 04:25:16 +00:00
Mark Johnston	49c5659e1c	Add a missing include of opt_sctp.h. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2019-10-12 22:58:33 +00:00
Gleb Smirnoff	1e4f4e56b9	ip6_output() has a complex set of gotos, and some can jump out of the epoch section towards return statement. Since entering epoch is cheap, it is easier to cover the whole function with epoch, rather than try to properly maintain its state.	2019-10-09 17:02:28 +00:00
Gleb Smirnoff	3af7f97c4e	Revert changes to rip6_bind() from r353292. This function is always called in syscall context, so it must enter epoch itself. This changeset originates from early version of the patch, and somehow slipped to the final version. Reported by: pho	2019-10-09 05:52:07 +00:00
Mark Johnston	cb49ec5431	Improve locking in the IPV6_V6ONLY socket option handler. Acquire the inp lock before checking whether the socket is already bound, and around updates to the inp_vflag field. MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21867	2019-10-07 23:35:23 +00:00
Gleb Smirnoff	b8a6e03fac	Widen NET_EPOCH coverage. When epoch(9) was introduced to network stack, it was basically dropped in place of existing locking, which was mutexes and rwlocks. For the sake of performance mutex covered areas were as small as possible, so became epoch covered areas. However, epoch doesn't introduce any contention, it just delays memory reclaim. So, there is no point to minimise epoch covered areas in sense of performance. Meanwhile entering/exiting epoch also has non-zero CPU usage, so doing this less often is a win. Not the least is also code maintainability. In the new paradigm we can assume that at any stage of processing a packet, we are inside network epoch. This makes coding both input and output path way easier. On output path we already enter epoch quite early - in the ip_output(), in the ip6_output(). This patch does the same for the input path. All ISR processing, network related callouts, other ways of packet injection to the network stack shall be performed in net_epoch. Any leaf function that walks network configuration now asserts epoch. Tricky part is configuration code paths - ioctls, sysctls. They also call into leaf functions, so some need to be changed. This patch would introduce more epoch recursions (see EPOCH_TRACE) than we had before. They will be cleaned up separately, as several of them aren't trivial. Note, that unlike a lock recursion the epoch recursion is safe and just wastes a bit of resources. Reviewed by: gallatin, hselasky, cy, adrian, kristof Differential Revision: https://reviews.freebsd.org/D19111	2019-10-07 22:40:05 +00:00
Michael Tuexen	e7a541b0b9	When processing an incoming IPv6 packet over the loopback interface which contains Hop-by-Hop options, the mbuf chain is potentially changed in ip6_hopopts_input(), called by ip6_input_hbh(). This can happen because of the use of IP6_EXTHDR_CHECK, which might call m_pullup(). So provide the updated pointer back to the caller of ip6_input_hbh() to avoid using a freed mbuf chain in ip6_input(). Reviewed by: markj@ MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D21664	2019-09-19 10:22:29 +00:00
John Baldwin	b2e60773c6	Add kernel-side support for in-kernel TLS. KTLS adds support for in-kernel framing and encryption of Transport Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports offload of TLS for transmitted data. Key negotiation must still be performed in userland. Once completed, transmit session keys for a connection are provided to the kernel via a new TCP_TXTLS_ENABLE socket option. All subsequent data transmitted on the socket is placed into TLS frames and encrypted using the supplied keys. Any data written to a KTLS-enabled socket via write(2), aio_write(2), or sendfile(2) is assumed to be application data and is encoded in TLS frames with an application data type. Individual records can be sent with a custom type (e.g. handshake messages) via sendmsg(2) with a new control message (TLS_SET_RECORD_TYPE) specifying the record type. At present, rekeying is not supported though the in-kernel framework should support rekeying. KTLS makes use of the recently added unmapped mbufs to store TLS frames in the socket buffer. Each TLS frame is described by a single ext_pgs mbuf. The ext_pgs structure contains the header of the TLS record (and trailer for encrypted records) as well as references to the associated TLS session. KTLS supports two primary methods of encrypting TLS frames: software TLS and ifnet TLS. Software TLS marks mbufs holding socket data as not ready via M_NOTREADY similar to sendfile(2) when TLS framing information is added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then called to schedule TLS frames for encryption. In the case of sendfile, sendfile_iodone() calls ktls_enqueue() instead of pru_ready(), leaving the mbufs marked M_NOTREADY until encryption is completed. For other writes (vn_sendfile when pages are available, write(2), etc.), the PRUS_NOTREADY is set when invoking pru_send() along with invoking ktls_enqueue(). A pool of worker threads (the "KTLS" kernel process) encrypts TLS frames queued via ktls_enqueue(). 
Each TLS frame is temporarily mapped using the direct map and passed to a software encryption backend to perform the actual encryption. (Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if someone wished to make this work on architectures without a direct map.) KTLS supports pluggable software encryption backends. Internally, Netflix uses proprietary pure-software backends. This commit includes a simple backend in a new ktls_ocf.ko module that uses the kernel's OpenCrypto framework to provide AES-GCM encryption of TLS frames. As a result, software TLS is now a bit of a misnomer as it can make use of hardware crypto accelerators. Once software encryption has finished, the TLS frame mbufs are marked ready via pru_ready(). At this point, the encrypted data appears as regular payload to the TCP stack stored in unmapped mbufs. ifnet TLS permits a NIC to offload the TLS encryption and TCP segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS) is allocated on the interface a socket is routed over and associated with a TLS session. TLS records for a TLS session using ifnet TLS are not marked M_NOTREADY but are passed down the stack unencrypted. The ip_output_send() and ip6_output_send() helper functions that apply send tags to outbound IP packets verify that the send tag of the TLS record matches the outbound interface. If so, the packet is tagged with the TLS send tag and sent to the interface. The NIC device driver must recognize packets with the TLS send tag and schedule them for TLS encryption and TCP segmentation. If the outbound interface does not match the interface in the TLS send tag, the packet is dropped. In addition, a task is scheduled to refresh the TLS send tag for the TLS session. If a new TLS send tag cannot be allocated, the connection is dropped. If a new TLS send tag is allocated, however, subsequent packets will be tagged with the correct TLS send tag. 
(This latter case has been tested by configuring both ports of a Chelsio T6 in a lagg and failing over from one port to another. As the connections migrated to the new port, new TLS send tags were allocated for the new port and connections resumed without being dropped.) ifnet TLS can be enabled and disabled on supported network interfaces via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported across both vlan devices and lagg interfaces using failover, lacp with flowid enabled, or lacp with flowid enabled. Applications may request the current KTLS mode of a connection via a new TCP_TXTLS_MODE socket option. They can also use this socket option to toggle between software and ifnet TLS modes. In addition, a testing tool is available in tools/tools/switch_tls. This is modeled on tcpdrop and uses similar syntax. However, instead of dropping connections, -s is used to force KTLS connections to switch to software TLS and -i is used to switch to ifnet TLS. Various sysctls and counters are available under the kern.ipc.tls sysctl node. The kern.ipc.tls.enable node must be set to true to enable KTLS (it is off by default). The use of unmapped mbufs must also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS. KTLS is enabled via the KERN_TLS kernel option. This patch is the culmination of years of work by several folks including Scott Long and Randall Stewart for the original design and implementation; Drew Gallatin for several optimizations including the use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records awaiting software encryption, and pluggable software crypto backends; and John Baldwin for modifications to support hardware TLS offload. Reviewed by: gallatin, hselasky, rrs Obtained from: Netflix Sponsored by: Netflix, Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21277	2019-08-27 00:01:56 +00:00
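The TLS framing described in the commit above puts a record header in front of each chunk of application data. A minimal userland sketch of that header layout follows, assuming the standard 5-byte TLS record header (content type, two version bytes, two length bytes). The function and macro names are hypothetical and are not taken from the KTLS code; the sketch only checks that the length fits in 16 bits, whereas real TLS limits are stricter.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; not part of the kernel sources. */
#define EXAMPLE_TLS_RT_APPLICATION_DATA	23	/* TLS application data type */
#define EXAMPLE_TLS_HEADER_LEN		5	/* type + version + length */

/*
 * Fill in a TLS 1.2 record header for a payload of payload_len bytes.
 * Returns 0 on success, or -1 if the length does not fit the 16-bit field.
 */
int
example_tls12_record_header(uint8_t hdr[EXAMPLE_TLS_HEADER_LEN], uint8_t type,
    size_t payload_len)
{
	if (payload_len > 0xffff)
		return (-1);
	hdr[0] = type;				/* content type */
	hdr[1] = 0x03;				/* TLS 1.2 major version */
	hdr[2] = 0x03;				/* TLS 1.2 minor version */
	hdr[3] = (uint8_t)(payload_len >> 8);	/* length, high byte */
	hdr[4] = (uint8_t)(payload_len & 0xff);	/* length, low byte */
	return (0);
}
```

Passing EXAMPLE_TLS_RT_APPLICATION_DATA as the type corresponds to the default case in the commit message; a custom record type would simply use a different value in the first byte.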
Bjoern A. Zeeb	1540a98e36	frag6: move public structure into file local space. Move ip6asfrag and the accompanying IP6_REASS_MBUF macro from ip6_var.h into frag6.c as they are not used outside frag6.c. Sadly struct ip6q is all over the mac framework so we have to leave it public. This reduces the public KPI space. MFC after: 3 months X-MFC: possibly MFC the #define only to stable branches Sponsored by: Netflix	2019-08-08 10:59:54 +00:00
Bjoern A. Zeeb	5778b399f1	frag6.c: cleanup variables and return statements. Consistently put () around return values. Do not assign variables at the time of variable declaration. Sort variables. Rename ia to ia6, remove/reuse some variables used only once or twice for temporary calculations. No functional changes intended. MFC after: 3 months Sponsored by: Netflix	2019-08-08 10:15:47 +00:00
Bjoern A. Zeeb	23d374aa14	frag6.c: initial comment and whitespace cleanup. Cleanup some comments (start with upper case, ends in punctuation, use width and do not consume vertical space). Update comments to RFC8200. Some whitespace changes. No functional changes. MFC after: 3 months Sponsored by: Netflix	2019-08-08 09:42:57 +00:00
Ed Maste	7f8c266da5	Correct ICMPv6/MLDv2 out-of-bounds memory access Previously the ICMPv6 input path incorrectly handled cases where an MLDv2 listener query packet was internally fragmented across multiple mbufs. admbugs: 921 Submitted by: jtl Reported by: CJD of Apple Approved by: so MFC after: 0 minutes Security: CVE-2019-5608	2019-08-06 17:11:30 +00:00
Michael Tuexen	94962f6ba0	Improve consistency. No functional change. MFC after: 3 days	2019-08-05 13:22:15 +00:00
Bjoern A. Zeeb	9cb1a47af2	frag6.c: rename ip6q[] to ip6qb[] and consistently use "bucket" The hash buckets array is called ip6q. The data structure ip6q is a description of a different object, the one the array holds these days (since r337776). To clear some of this confusion, rename the array to ip6qb. When iterating over all buckets or addressing them directly, we use at least the variables i, hash, and bucket. To keep the terminology consistent use the variable name "bucket" and always make it an uint32_t and not sometimes an int. No functional behaviour changes intended. MFC after: 3 months Sponsored by: Netflix	2019-08-05 11:01:12 +00:00
Bjoern A. Zeeb	c00464a245	frag6.c: re-order functions within file Re-order functions within the file in preparation for an upcoming code simplification. No functional changes. MFC after: 3 months Sponsored by: Netflix	2019-08-05 09:49:24 +00:00
Bjoern A. Zeeb	f349c821f5	frag6.c: fix includes Bring back systm.h after r350532 and banish errno.h, time.h, and machine/atomic.h. Reported by: bde (Thank you!) Pointyhat to: bz MFC after: 12 weeks X-MFC: with r350532 Sponsored by: Netflix	2019-08-03 16:56:44 +00:00
Bjoern A. Zeeb	09b361c792	frag6.c: make compile with gcc Removing the prototype from the header and making the function static in r350533 makes architectures using gcc complain "function declaration isn't a prototype". Add the missing void given the function has no arguments. Reported by: the CI machinery Pointyhat to: bz MFC after: 3 months X-MFC with: r350533 Sponsored by: Netflix	2019-08-02 11:05:00 +00:00
Bjoern A. Zeeb	487a161cff	frag6.c: rename malloc type Rename M_FTABLE to M_FRAG6 as the former sounds very much like the former "flowtable" rather than anything to do with fragments and reassembly. While here, let malloc( , .. \| M_ZERO) do the zeroing rather than calling bzero() ourselves. MFC after: 3 months Sponsored by: Netflix	2019-08-02 10:54:57 +00:00
Bjoern A. Zeeb	a687de6aee	frag6.c: remove dead code Remove all the #if 0 and #if notyet blocks of dead code which have been there for at least 18 years from what I can see. No functional changes. MFC after: 3 months Sponsored by: Netflix	2019-08-02 10:41:51 +00:00
Bjoern A. Zeeb	757cb678e5	frag6.c: move variables and sysctls into local file Move the sysctls and the related variables only used in frag6.c into the file and out of in6_proto.c. That way everything belonging together is in one place. Sort the variables into global and per-vnet scopes and make them static. No longer export the (helper) function frag6_set_bucketsize() now also file-local only. Should be no functional changes, only reduced public KPI/KBI surface. MFC after: 3 months Sponsored by: Netflix	2019-08-02 10:29:53 +00:00
Bjoern A. Zeeb	1a3044fa2c	frag6.c: sort includes Sort includes and remove duplicate kernel.h as well as the unneeded systm.h. Hide the mac framework include behind #ifdef MAC. MFC after: 3 months Sponsored by: Netflix	2019-08-02 10:06:54 +00:00
Bjoern A. Zeeb	0ecd976e80	IPv6 cleanup: kernel Finish what was started a few years ago and harmonize IPv6 and IPv4 kernel names. We are down to very few places now that it is feasible to do the change for everything remaining without causing too much disturbance. Remove "aliases" for IPv6 names which confusingly could indicate that we are talking about a different data structure or field or have two fields, one for each address family. Try to follow common conventions used in FreeBSD. * Rename sin6p to sin6 as that is how it is spelt in most places. * Remove "aliases" (#defines) for: - in6pcb which really is an inpcb and nothing separate - sotoin6pcb which is sotoinpcb (as per above) - in6p_sp which is inp_sp - in6p_flowinfo which is inp_flow * Try to use ia6 for in6_addr rather than in6p. * With all these gone also rename the in6p variables to inp as that is what we call it in most of the network stack including parts of netinet6. The reasons behind this cleanup are that we try to further unify netinet and netinet6 code where possible and that people will be less likely to ignore one or the other protocol family when doing code changes as they may not have spotted places due to different names for the same thing. No functional changes. Discussed with: tuexen (SCTP changes) MFC after: 3 months Sponsored by: Netflix	2019-08-02 07:41:36 +00:00
Michael Tuexen	8a956abe12	When calling sctp_initialize_auth_params(), the inp must have at least a read lock. To avoid more complex locking dances, just call it in sctp_aloc_assoc() when the write lock is still held. Reported by: syzbot+08a486f7e6966f1c3cfb@syzkaller.appspotmail.com MFC after: 1 week	2019-07-14 12:04:39 +00:00
Michael Tuexen	9e44bc22d8	r348494 fixes a race in udp_output(). The same race exists in udp_output6(), therefore apply a similar patch to IPv6. Reported by: syzbot+c5ffbc8f14294c7b0e54@syzkaller.appspotmail.com Reviewed by: bz@, markj@ MFC after: 2 weeks Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D20936	2019-07-13 12:45:08 +00:00
John Baldwin	82334850ea	Add an external mbuf buffer type that holds multiple unmapped pages. Unmapped mbufs allow sendfile to carry multiple pages of data in a single mbuf, without mapping those pages. It is a requirement for Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web serving workloads when used by sendfile, due to effectively compressing socket buffers by an order of magnitude, and hence reducing cache misses. For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer now points to a struct mbuf_ext_pgs structure instead of a data buffer. This structure contains an array of physical addresses (this reduces cache misses compared to an earlier version that stored an array of vm_page_t pointers). It also stores additional fields needed for in-kernel TLS such as the TLS header and trailer data that are currently unused. To more easily detect these mbufs, the M_NOMAP flag is set in m_flags in addition to M_EXT. Various functions like m_copydata() have been updated to safely access packet contents (using uiomove_fromphys()), to make things like BPF safe. NIC drivers advertise support for unmapped mbufs on transmit via a new IFCAP_NOMAP capability. This capability can be toggled via the new 'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only transmit packet contents via DMA and use bus_dma, adding the capability to if_capabilities and if_capenable should be all that is required. If a NIC does not support unmapped mbufs, they are converted to a chain of mapped mbufs (using sf_bufs to provide the mapping) in ip_output or ip6_output. If an unmapped mbuf requires software checksums, it is also converted to a chain of mapped mbufs before computing the checksum. Submitted by: gallatin (earlier version) Reviewed by: gallatin, hselasky, rrs Discussed with: ae, kp (firewalls) Relnotes: yes Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20616	2019-06-29 00:48:33 +00:00
Hans Petter Selasky	59854ecf55	Convert all IPv4 and IPv6 multicast memberships into using a STAILQ instead of a linear array. The multicast memberships for the inpcb structure are protected by a non-sleepable lock, INP_WLOCK(), which needs to be dropped when calling the underlying possibly sleeping if_ioctl() method. When using a linear array to keep track of multicast memberships, the computed memory location of the multicast filter may suddenly change, due to concurrent insertion or removal of elements in the linear array. This in turn leads to various invalid memory access issues and kernel panics. To avoid this problem, put all multicast memberships on a STAILQ based list. Then the memory location of the IPv4 and IPv6 multicast filters become fixed during their lifetime and use after free and memory leak issues are easier to track, for example by: vmstat -m \| grep multi All list manipulation has been factored into inline functions including some macros, to easily allow for a future hash-list implementation, if needed. This patch has been tested by pho@ . Differential Revision: https://reviews.freebsd.org/D20080 Reviewed by: markj @ MFC after: 1 week Sponsored by: Mellanox Technologies	2019-06-25 11:54:41 +00:00
John Baldwin	77a0144145	Sort opt_foo.h #includes and add a missing blank line in ip_output().	2019-06-11 22:07:39 +00:00
John Baldwin	fb3bc59600	Restructure mbuf send tags to provide stronger guarantees. - Perform ifp mismatch checks (to determine if a send tag is allocated for a different ifp than the one the packet is being output on), in ip_output() and ip6_output(). This avoids sending packets with send tags to ifnet drivers that don't support send tags. Since we are now checking for ifp mismatches before invoking if_output, we can now try to allocate a new tag before invoking if_output sending the original packet on the new tag if allocation succeeds. To avoid code duplication for the fragment and unfragmented cases, add ip_output_send() and ip6_output_send() as wrappers around if_output and nd6_output_ifp, respectively. All of the logic for setting send tags and dealing with send tag-related errors is done in these wrapper functions. For pseudo interfaces that wrap other network interfaces (vlan and lagg), wrapper send tags are now allocated so that ip*_output see the wrapper ifp as the ifp in the send tag. The if_transmit routines rewrite the send tags after performing an ifp mismatch check. If an ifp mismatch is detected, the transmit routines fail with EAGAIN. - To provide clearer life cycle management of send tags, especially in the presence of vlan and lagg wrapper tags, add a reference count to send tags managed via m_snd_tag_ref() and m_snd_tag_rele(). Provide a helper function (m_snd_tag_init()) for use by drivers supporting send tags. m_snd_tag_init() takes care of the if_ref on the ifp meaning that code allocating send tags via if_snd_tag_alloc no longer has to manage that manually. Similarly, m_snd_tag_rele drops the refcount on the ifp after invoking if_snd_tag_free when the last reference to a send tag is dropped. This also closes use after free races if there are pending packets in driver tx rings after the socket is closed (e.g. from tcpdrop). 
In order for m_free to work reliably, add a new CSUM_SND_TAG flag in csum_flags to indicate 'snd_tag' is set (rather than 'rcvif'). Drivers now also check this flag instead of checking snd_tag against NULL. This avoids false positive matches when a forwarded packet has a non-NULL rcvif that was treated as a send tag. - cxgbe was relying on snd_tag_free being called when the inp was detached so that it could kick the firmware to flush any pending work on the flow. This is because the driver doesn't require ACK messages from the firmware for every request, but instead does a kind of manual interrupt coalescing by only setting a flag to request a completion on a subset of requests. If all of the in-flight requests don't have the flag when the tag is detached from the inp, the flow might never return the credits. The current snd_tag_free command issues a flush command to force the credits to return. However, the credit return is what also frees the mbufs, and since those mbufs now hold references on the tag, this meant that snd_tag_free would never be called. To fix, explicitly drop the mbuf's reference on the snd tag when the mbuf is queued in the firmware work queue. This means that once the inp's reference on the tag goes away and all in-flight mbufs have been queued to the firmware, tag's refcount will drop to zero and snd_tag_free will kick in and send the flush request. Note that we need to avoid doing this in the middle of ethofld_tx(), so the driver grabs a temporary reference on the tag around that loop to defer the free to the end of the function in case it sends the last mbuf to the queue after the inp has dropped its reference on the tag. - mlx5 preallocates send tags and was using the ifp pointer even when the send tag wasn't in use. Explicitly use the ifp from other data structures instead. - Sprinkle some assertions in various places to assert that received packets don't have a send tag, and that other places that overwrite rcvif (e.g. 
802.11 transmit) don't clobber a send tag pointer. Reviewed by: gallatin, hselasky, rgrimes, ae Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20117	2019-05-24 22:30:40 +00:00
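The life cycle rule from the send tag commit above (the tag is freed only when the last reference is dropped, whether that reference is held by the inp or by an in-flight mbuf) can be sketched in userland C. This is a simplified illustration under stated assumptions: all names are hypothetical, the counter is a plain integer rather than the kernel's atomic reference count, and the free callback merely counts invocations.

```c
#include <stddef.h>

/* Hypothetical stand-in for a send tag; not the kernel's m_snd_tag. */
struct example_snd_tag {
	unsigned int refcount;
	void (*free_cb)(struct example_snd_tag *);
};

/* Counts how often the example free callback has run. */
int example_free_calls = 0;

/* Example free callback: a real driver would release hardware state here. */
void
example_count_free(struct example_snd_tag *tag)
{
	(void)tag;
	example_free_calls++;
}

/* The creator of the tag (e.g. the inp) holds the first reference. */
void
example_tag_init(struct example_snd_tag *tag,
    void (*free_cb)(struct example_snd_tag *))
{
	tag->refcount = 1;
	tag->free_cb = free_cb;
}

/* Take an additional reference, e.g. for a packet queued for transmit. */
void
example_tag_ref(struct example_snd_tag *tag)
{
	tag->refcount++;
}

/* Drop a reference; returns 1 if this was the last one and the tag was freed. */
int
example_tag_rele(struct example_snd_tag *tag)
{
	if (--tag->refcount > 0)
		return (0);
	if (tag->free_cb != NULL)
		tag->free_cb(tag);
	return (1);
}
```

With this shape, a packet that is still sitting in a transmit ring after the socket closes keeps the tag alive until the packet itself is released, which is the use-after-free case the commit message mentions.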
Andrey V. Elsukov	b1536a812b	Restore IPV6_NEXTHOP option support that seems to have been partially broken since r286195. Do not forget results of route lookup and initialize rt and ifp pointers. PR: 238098 Submitted by: Masse Nicolas <nicolas.masse at stormshield eu> MFC after: 1 week	2019-05-24 11:45:32 +00:00
Alexander V. Chernikov	563ab4e400	Fix gateway setup for the interface routes. Currently rinit1() and its IPv6 counterpart nd6_prefix_onlink_rtrequest() use a dummy null_sdl gateway address during route insertion and change it afterwards. This behaviour brings complications to the routing stack and the users of its upcoming notification system. This change fixes both rinit1() and nd6_prefix_onlink_rtrequest() by filling in the proper gateway at the beginning. It does not change any of the userland notifications as in both cases, they happen after the insertion and fixup process (rt_newaddrmsg_fib() and nd6_rtmsg()). MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D20328	2019-05-22 21:20:15 +00:00
Conrad Meyer	e2e050c8ef	Extract eventfilter declarations to sys/_eventfilter.h This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h" in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header pollution substantially. EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c files into appropriate headers (e.g., sys/proc.h, powernv/opal.h). As a side effect of reduced header pollution, many .c files and headers no longer contain needed definitions. The remainder of the patch addresses adding appropriate includes to fix those files. LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by sys/mutex.h since r326106 (but silently protected by header pollution prior to this change). No functional change (intended). Of course, any out of tree modules that relied on header pollution for sys/eventhandler.h, sys/lock.h, or sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.	2019-05-20 00:38:23 +00:00
Hiroki Sato	7460ef5d7a	Fix hostname to be returned in an ICMPv6 NI Reply message defined in RFC 4620, ICMPv6 Node Information Queries. A vnet jail with an IPv6 address sent a hostname of the host environment, not the jail, even if another hostname was set to the jail. This change can be tested by the following commands: # ifconfig epair0 create # jail -c -n j1 vnet host.hostname=vnetjail path=/ persist # ifconfig epair0b vnet j1 # ifconfig epair0a inet6 -ifdisabled auto_linklocal up # jexec j1 ifconfig epair0b inet6 -ifdisabled auto_linklocal up # ping6 -w ff02::1%epair0a Differential Revision: https://reviews.freebsd.org/D20207 MFC after: 1 week	2019-05-16 19:09:41 +00:00
Mark Johnston	f00876fb60	Revert r347582 for now. The inp lock still needs to be dropped when calling into the driver ioctl handler, as some drivers expect to be able to sleep. Reported by: kib	2019-05-16 13:04:26 +00:00
Mark Johnston	5a1e222bfd	Close some races in multicast socket option handling. r333175 converted the global multicast lock to a sleepable sx lock, so the lock order with respect to the (non-sleepable) inp lock changed. To handle this, r333175 and r333505 added code to drop the inp lock, but this opened races that could leave multicast group description structures in an inconsistent state. This change fixes the problem by simply acquiring the global lock sooner. Along the way, this fixes some LORs and bogus error handling introduced in r333175, and commits some related cleanup. Reported by: syzbot+ba7c4943547e0604faca@syzkaller.appspotmail.com Reported by: syzbot+1b803796ab94d11a46f9@syzkaller.appspotmail.com Reviewed by: ae MFC after: 3 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20070	2019-05-14 21:30:55 +00:00
John Baldwin	c9d337083f	Apply r280991 to ip6_fragment. This uses m_dup_pkthdr() to copy all of the metadata about a packet to each of its fragments including VLAN tags, mbuf tags, etc. instead of hand-copying a few fields. Reviewed by: bz MFC after: 1 month Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20117	2019-05-10 20:15:40 +00:00
Andrey V. Elsukov	50ec8b3b3e	In mld_v2_cancel_link_timers() check number of references and disconnect inm before releasing the last reference. This fixes possible panics and assertion. PR: 237329 Reviewed by: mmacy MFC after: 2 weeks	2019-05-09 07:57:33 +00:00
Andrew Gallatin	50575ce11c	Track TCP connection's NUMA domain in the inpcb Drivers can now pass up numa domain information via the mbuf numa domain field. This information is then used by TCP syncache_socket() to associate that information with the inpcb. The domain information is then fed back into transmitted mbufs in ip{6}_output(). This mechanism is nearly identical to what is done to track RSS hash values in the inp_flowid. Follow on changes will use this information for lacp egress port selection, binding TCP pacers to the appropriate NUMA domain, etc. Reviewed by: markj, kib, slavash, bz, scottl, jtl, tuexen Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20028	2019-04-25 15:37:28 +00:00
Andrey V. Elsukov	aee793eec9	Add GRE-in-UDP encapsulation support as defined in RFC8086. This GRE-in-UDP encapsulation allows the UDP source port field to be used as an entropy field for load-balancing of GRE traffic in transit networks. Also, most multiqueue network cards are able to distribute incoming UDP datagrams to different NIC queues, while very few are able to do this for GRE packets. When an administrator enables UDP encapsulation with command `ifconfig gre0 udpencap`, the driver creates a kernel socket that binds to the tunnel source address and, after udp_set_kernel_tunneling(), starts receiving all UDP packets destined to port 4754. Each kernel socket maintains a list of tunnels with different destination addresses. Thus, when several tunnels use the same source address, they are all handled by a single socket. The IP[V6]_BINDANY socket option is used to be able to bind the socket to the source address even if it is not yet available in the system. This may happen on system boot, when the gre(4) interface is created before the source address becomes available. The encapsulation and sending of packets is done directly from gre(4) into ip[6]_output() without using sockets. Reviewed by: eugen MFC after: 1 month Relnotes: yes Differential Revision: https://reviews.freebsd.org/D19921	2019-04-24 09:05:45 +00:00
Conrad Meyer	5947c05768	ip6_randomflowlabel: Avoid blocking if random(4) is not available If kern.random.initial_seeding.bypass_before_seeding is disabled, random(4) and arc4random(9) will block indefinitely until enough entropy is available to initially seed Fortuna. It seems that zero flowids are perfectly valid, so avoid blocking on random until initial seeding takes place. Discussed with: bz (earlier revision) Reviewed by: thj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D20011	2019-04-23 17:18:20 +00:00
Konstantin Belousov	c4cc609796	ipoib: assign link-local address according to RFC RFC 4391 specifies that the IB interface GID should be re-used as IPv6 link-local address. Since the code in in6_get_hw_ifid() ignored IFT_INFINIBAND case, ibX interfaces ended up with the local address borrowed from some other interface, which is non-compliant. Use lowest eight bytes from GID for filling the link-local address, same as Linux. Reviewed by: bz (previous version), ae, hselasky, slavash Sponsored by: Mellanox Technologies MFC after: 1 week Differential revision: https://reviews.freebsd.org/D20006	2019-04-23 12:23:44 +00:00
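The address derivation in the commit above can be illustrated with a short userland sketch. It assumes that "lowest eight bytes" means the last eight bytes of the 16-byte GID as stored in network byte order, and that they are placed in the interface identifier half of an fe80::/64 address. The function name is hypothetical and this is not the code from in6_get_hw_ifid().

```c
#include <stdint.h>
#include <string.h>

/*
 * Build an IPv6 link-local address (fe80::/64) whose interface
 * identifier is copied from the last eight bytes of a 16-byte GID.
 * Hypothetical helper for illustration only.
 */
void
example_gid_to_linklocal(const uint8_t gid[16], uint8_t addr[16])
{
	memset(addr, 0, 16);
	addr[0] = 0xfe;			/* fe80::/64 link-local prefix */
	addr[1] = 0x80;
	memcpy(&addr[8], &gid[8], 8);	/* interface identifier from the GID */
}
```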
Hans Petter Selasky	6bbdbbb830	Revert r346530 until further. MFC after: 1 week Sponsored by: Mellanox Technologies	2019-04-22 19:36:19 +00:00
Hans Petter Selasky	04f44499ca	Fix build for mips and powerpc after r346530. Need to include sys/kernel.h to define SYSINIT() which is used by sys/eventhandler.h . MFC after: 1 week Sponsored by: Mellanox Technologies	2019-04-22 08:32:00 +00:00
Hans Petter Selasky	40eb389666	Fix panic in network stack due to memory use after free in relation to fragmented packets. When sending IPv4 and IPv6 fragmented packets and a fragment is lost, the mbuf making up the fragment will remain in the temporary hashed fragment list for a while. If the network interface departs before the so-called slow timeout clears the packet, the fragment causes a panic when the timeout kicks in due to accessing a freed network interface structure. Make sure that when a network device is departing, all hashed IPv4 and IPv6 fragments belonging to it, get freed. Backtrace: panic() icmp6_reflect() hlim = ND_IFINFO(m->m_pkthdr.rcvif)->chlim; ^^^^ rcvif->if_afdata[AF_INET6] is NULL. icmp6_error() frag6_freef() frag6_slowtimo() pfslowtimo() softclock_call_cc() softclock() ithread_loop() Differential Revision: https://reviews.freebsd.org/D19622 Reviewed by: bz (network), adrian MFC after: 1 week Sponsored by: Mellanox Technologies	2019-04-22 07:27:24 +00:00
Michael Tuexen	fb288770e8	When an IPv6 packet is received for a raw socket which has the IPPROTO_IPV6 level socket option IPV6_CHECKSUM enabled and the checksum check fails, drop the message. Without this fix, an ICMP6 message was sent indicating a parameter problem. Thanks to bz@ for suggesting a way to simplify this fix. Reviewed by: bz@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D19969	2019-04-19 18:09:37 +00:00
Michael Tuexen	70a0f3dcdc	When a checksum has to be computed for a received IPv6 packet because it is requested by the application using the IPPROTO_IPV6 level socket option IPV6_CHECKSUM on a raw socket, ensure that the packet contains enough bytes to contain the checksum at the specified offset. Reported by: syzbot+6295fcc5a8aced81d599@syzkaller.appspotmail.com Reviewed by: bz@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D19968	2019-04-19 17:28:28 +00:00
Michael Tuexen	ae7c65b171	Avoid a buffer overwrite in rip6_output() when computing the checksum as requested by the user via the IPPROTO_IPV6 level socket option IPV6_CHECKSUM. The check if there are enough bytes in the packet to store the checksum at the requested offset was wrong by 1. Reviewed by: bz@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D19967	2019-04-19 17:21:35 +00:00
Michael Tuexen	2f041b74b9	Improve input validation for the socket option IPV6_CHECKSUM. When using the IPPROTO_IPV6 level socket option IPV6_CHECKSUM on a raw IPv6 socket, ensure that the value is either -1 or a non-negative even number. Reviewed by: bz@, thj@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D19966	2019-04-19 17:17:41 +00:00
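The three IPV6_CHECKSUM commits above describe two checks: the option value must be -1 or a non-negative even number, and a 2-byte checksum must fit inside the packet at the requested offset. A minimal sketch of both checks follows, assuming the offset is measured from the start of the buffer whose length is passed in. The function names are hypothetical and this is not the kernel's code.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Validate an IPV6_CHECKSUM option value: -1 disables checksumming,
 * any other accepted value must be a non-negative even offset.
 */
bool
example_cksum_optval_valid(int optval)
{
	return (optval == -1 || (optval >= 0 && optval % 2 == 0));
}

/*
 * Check that a 2-byte checksum stored at offset off lies entirely
 * within a buffer of len bytes. Comparing off + 1 against len
 * instead would be off by one and allow a write past the end.
 */
bool
example_cksum_fits(size_t len, int off)
{
	return (off >= 0 && (size_t)off + 2 <= len);
}
```

For example, with a 3-byte payload and offset 2 the checksum would occupy bytes 2 and 3, so the second check has to reject it.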
Tom Jones	2946a9415c	Add stat counter for ipv6 atomic fragments Add a stat counter to track ipv6 atomic fragments. Atomic fragments can be generated in response to invalid path MTU values, but are also a potential attack vector and considered harmful (see RFC6946 and RFC8021). While here add tracking of the atomic fragment counter to netstat and systat. Reviewed by: tuexen, jtl, bz Approved by: jtl (mentor), bz (mentor) Event: Aberdeen hackathon 2019 Differential Revision: https://reviews.freebsd.org/D17511	2019-04-19 17:06:43 +00:00
Mark Johnston	f1ef572a1e	Reinitialize multicast source filter structures after invalidation. When leaving a multicast group, a hole may be created in the inpcb's source filter and group membership arrays. To remove the hole, the succeeding array elements are copied over by one entry. The multicast code expects that a newly allocated array element is initialized, but the code which shifts a tail of the array was leaving stale data in the final entry. Fix this by explicitly reinitializing the last entry following such a copy. Reported by: syzbot+f8c3c564ee21d650475e@syzkaller.appspotmail.com Reviewed by: ae MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D19872	2019-04-11 08:00:59 +00:00
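The stale-entry problem in the commit above is easy to reproduce with any array that is compacted by shifting its tail. The sketch below shows the fixed pattern: after the shift, the now-unused last slot is explicitly reinitialized. The structure and function names are hypothetical, and zeroing stands in for whatever initialization the real multicast filter code performs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a per-membership source filter entry. */
struct example_filter {
	int mode;
	int nsources;
};

/*
 * Remove the entry at idx from an array holding *count entries by
 * shifting the tail down one slot, then reinitialize the vacated
 * final slot so that a later reuse does not see stale data.
 */
void
example_remove_at(struct example_filter *arr, size_t *count, size_t idx)
{
	assert(idx < *count);
	memmove(&arr[idx], &arr[idx + 1],
	    (*count - idx - 1) * sizeof(arr[0]));
	memset(&arr[*count - 1], 0, sizeof(arr[0]));	/* clear stale copy */
	(*count)--;
}
```

Without the memset() call, the last slot would still hold a copy of the entry that was shifted down, which matches the failure mode the commit message describes.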
Mark Johnston	ca1163bd5f	Do not perform DAD on stf(4) interfaces. stf(4) interfaces are not multicast-capable so they can't perform DAD. They also did not set IFF_DRV_RUNNING when an address was assigned, so the logic in nd6_timer() would periodically flag such an address as tentative, resulting in interface flapping. Fix the problem by setting IFF_DRV_RUNNING when an address is assigned, and do some related cleanup: - In in6if_do_dad(), remove a redundant check for !UP || !RUNNING. There is only one caller in the tree, and it only looks at whether the return value is non-zero. - Have in6if_do_dad() return false if the interface is not multicast-capable. - Set ND6_IFF_NO_DAD when an address is assigned to an stf(4) interface and the interface goes UP as a result. Note that this is not sufficient to fix the problem because the new address is marked as tentative and DAD is started before in6_ifattach() is called. However, setting no_dad is formally correct. - Change nd6_timer() to not flag addresses as tentative if no_dad is set. This is based on a patch from Viktor Dukhovni. Reported by: Viktor Dukhovni <ietf-dane@dukhovni.org> Reviewed by: ae MFC after: 3 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D19751	2019-03-30 18:00:44 +00:00
Andrey V. Elsukov	d18c1f26a4	Reapply r345274 with build fixes for 32-bit architectures. Update NAT64LSN implementation: o most of data structures and relations were modified to be able support large number of translation states. Now each supported protocol can use full ports range. Ports groups now are belongs to IPv4 alias addresses, not hosts. Each ports group can keep several states chunks. This is controlled with new `states_chunks` config option. States chunks allow to have several translation states for single alias address and port, but for different destination addresses. o by default all hash tables now use jenkins hash. o ConcurrencyKit and epoch(9) is used to make NAT64LSN lockless on fast path. o one NAT64LSN instance now can be used to handle several IPv6 prefixes, special prefix "::" value should be used for this purpose when instance is created. o due to modified internal data structures relations, the socket opcode that does states listing was changed. Obtained from: Yandex LLC MFC after: 1 month Sponsored by: Yandex LLC	2019-03-19 10:57:03 +00:00
Andrey V. Elsukov	d6369c2d18	Revert r345274. It appears that not all 32-bit architectures have necessary CK primitives.	2019-03-18 14:00:19 +00:00
Andrey V. Elsukov	d7a1cf06f3	Update NAT64LSN implementation: o most of data structures and relations were modified to be able support large number of translation states. Now each supported protocol can use full ports range. Ports groups now are belongs to IPv4 alias addresses, not hosts. Each ports group can keep several states chunks. This is controlled with new `states_chunks` config option. States chunks allow to have several translation states for single alias address and port, but for different destination addresses. o by default all hash tables now use jenkins hash. o ConcurrencyKit and epoch(9) is used to make NAT64LSN lockless on fast path. o one NAT64LSN instance now can be used to handle several IPv6 prefixes, special prefix "::" value should be used for this purpose when instance is created. o due to modified internal data structures relations, the socket opcode that does states listing was changed. Obtained from: Yandex LLC MFC after: 1 month Sponsored by: Yandex LLC	2019-03-18 12:59:08 +00:00
Andrey V. Elsukov	5c04f73e07	Add NAT64 CLAT implementation as defined in RFC6877. CLAT is customer-side translator that algorithmically translates 1:1 private IPv4 addresses to global IPv6 addresses, and vice versa. It is implemented as part of ipfw_nat64 kernel module. When module is loaded or compiled into the kernel, it registers "nat64clat" external action. External action named instance can be created using `create` command and then used in ipfw rules. The create command accepts two IPv6 prefixes `plat_prefix` and `clat_prefix`. If plat_prefix is omitted, IPv6 NAT64 Well-Known prefix 64:ff9b::/96 will be used. # ipfw nat64clat CLAT create clat_prefix SRC_PFX plat_prefix DST_PFX # ipfw add nat64clat CLAT ip4 from IPv4_PFX to any out # ipfw add nat64clat CLAT ip6 from DST_PFX to SRC_PFX in Obtained from: Yandex LLC Submitted by: Boris N. Lytochkin MFC after: 1 month Relnotes: yes Sponsored by: Yandex LLC	2019-03-18 11:44:53 +00:00
Andrey V. Elsukov	002cae78da	Add SPDX-License-Identifier and update year in copyright. MFC after: 1 month	2019-03-18 10:50:32 +00:00
Andrey V. Elsukov	b11efc1eb6	Modify struct nat64_config. Add second IPv6 prefix to generic config structure and rename another fields to conform to RFC6877. Now it contains two prefixes and length: PLAT is provider-side translator that translates N:1 global IPv6 addresses to global IPv4 addresses. CLAT is customer-side translator (XLAT) that algorithmically translates 1:1 IPv4 addresses to global IPv6 addresses. Use PLAT prefix in stateless (nat64stl) and stateful (nat64lsn) translators. Modify nat64_extract_ip4() and nat64_embed_ip4() functions to accept prefix length and use plat_plen to specify prefix length. Retire net.inet.ip.fw.nat64_allow_private sysctl variable. Add NAT64_ALLOW_PRIVATE flag and use "allow_private" config option to configure this ability separately for each NAT64 instance. Obtained from: Yandex LLC MFC after: 1 month Sponsored by: Yandex LLC	2019-03-18 10:39:14 +00:00
Bjoern A. Zeeb	30b450774e	Update for IETF draft-ietf-6man-ipv6only-flag. When we roam between networks and our link-state goes down, automatically remove the IPv6-Only flag from the interface. Otherwise we might switch from an IPv6-only to an IPv4-only network and the flag would stay and we would prevent IPv4 from working. While the actual function call to clear the flag is under EXPERIMENTAL, the eventhandler is not as we might want to re-use it for other functionality on link-down event (such as re-calculating default routers for example if there is more than one). Reviewed by: hrs Differential Revision: https://reviews.freebsd.org/D19487	2019-03-07 23:03:39 +00:00
Bjoern A. Zeeb	21231a7aa6	Update for IETF draft-ietf-6man-ipv6only-flag. All changes are hidden behind the EXPERIMENTAL option and are not compiled in by default. Add ND6_IFF_IPV6_ONLY_MANUAL to be able to set the interface into no-IPv4-mode manually without router advertisement options. This will allow developers to test software for the appropriate behaviour even on dual-stack networks or IPv6-Only networks without the option being set in RA messages. Update ifconfig to allow setting and displaying the flag. Update the checks for the filters to check for either the automatic or the manual flag to be set. Add REVARP to the list of filtered IPv4-related protocols and add an input filter similar to the output filter. Add a check, when receiving the IPv6-Only RA flag to see if the receiving interface has any IPv4 configured. If it does, ignore the IPv6-Only flag. Add a per-VNET global sysctl, which is on by default, to not process the automatic RA IPv6-Only flag. This way an administrator (if this is compiled in) has control over the behaviour in case the node still relies on IPv4.	2019-03-06 23:31:42 +00:00
Tom Jones	198fdaeda1	When dropping a fragment queue count the number of fragments in the queue When dropping a fragment queue, account for the number of fragments in the queue. This improves accounting between the number of fragments received and the number of fragments dropped. Reviewed by: jtl, bz, transport Approved by: jtl (mentor), bz (mentor) Differential Revision: https://review.freebsd.org/D17521	2019-02-19 19:57:55 +00:00
Gleb Smirnoff	b252313f0b	New pfil(9) KPI together with newborn pfil API and control utility. The KPI has been reviewed and cleansed of features that were planned back 20 years ago and never implemented. The pfil(9) internals have been made opaque to protocols with only returned types and function declarations exposed. The KPI is made more strict, but at the same time more extensible, as kernel uses same command structures that userland ioctl uses. In a nutshell [KA]PI is about declaring filtering points, declaring filters and linking and unlinking them together. New [KA]PI makes it possible to reconfigure pfil(9) configuration: change order of hooks, rehook filter from one filtering point to a different one, disconnect a hook on output leaving it on input only, prepend/append a filter to existing list of filters. Now it is possible for a single packet filter to provide multiple rulesets that may be linked to different points. Think of per-interface ACLs in Cisco or Juniper. None of existing packet filters yet support that, however limited usage is already possible, e.g. default ruleset can be moved to single interface, as soon as interface would provide their filtering points. Another future feature is possibility to create pfil heads, that provide not an mbuf pointer but just a memory pointer with length. That would allow filtering at very early stages of a packet lifecycle, e.g. when packet has just been received by a NIC and no mbuf was yet allocated. Differential Revision: https://reviews.freebsd.org/D18951	2019-01-31 23:01:03 +00:00
Hans Petter Selasky	2cd6ad766e	Fix refcounting leaks in IPv6 MLD code leading to loss of IPv6 connectivity. Looking at past changes in this area like r337866, some refcounting bugs have been introduced, one by one. For example like calling in6m_disconnect() and in6m_rele_locked() in mld_v1_process_group_timer() where previously no disconnect nor refcount decrement was done. Calling in6m_disconnect() when it shouldn't causes IPv6 solicitation to no longer work, because all the multicast addresses receiving the solicitation messages are now deleted from the network interface. This patch reverts some recent changes while improving the MLD refcounting and concurrency model after the MLD code was converted to using EPOCH(9). List changes: - All CK_STAILQ_FOREACH() macros are now properly enclosed into EPOCH(9) sections. This simplifies assertion of locking inside in6m_ifmultiaddr_get_inm(). - Corrected bad use of in6m_disconnect() leading to loss of IPv6 connectivity for MLD v1. - Factored out checks for valid inm structure into in6m_ifmultiaddr_get_inm(). PR: 233535 Differential Revision: https://reviews.freebsd.org/D18887 Reviewed by: bz (net) Tested by: ae MFC after: 1 week Sponsored by: Mellanox Technologies	2019-01-24 08:34:13 +00:00
Hans Petter Selasky	dea72f062a	When detaching a network interface drain the workqueue freeing the inm's because the destructor will access the if_ioctl() callback in the ifnet pointer which is about to be freed. This prevents use-after-free. PR: 233535 Differential Revision: https://reviews.freebsd.org/D18887 Reviewed by: bz (net) Tested by: ae MFC after: 1 week Sponsored by: Mellanox Technologies	2019-01-24 08:25:02 +00:00
Hans Petter Selasky	7a02897647	Add debugging sysctl to disable incoming MLD v2 messages similar to the existing sysctl for MLD v1 messages. PR: 233535 Differential Revision: https://reviews.freebsd.org/D18887 Reviewed by: bz (net) Tested by: ae MFC after: 1 week Sponsored by: Mellanox Technologies	2019-01-24 08:18:02 +00:00
Hans Petter Selasky	130f575d07	Fix duplicate acquiring of refcount when joining IPv6 multicast groups. This was observed by starting and stopping rpcbind(8) multiple times. PR: 233535 Differential Revision: https://reviews.freebsd.org/D18887 Reviewed by: bz (net) Tested by: ae MFC after: 1 week Sponsored by: Mellanox Technologies	2019-01-24 08:15:41 +00:00
Mark Johnston	49cf58e559	Style. Reviewed by: bz MFC after: 3 days Sponsored by: The FreeBSD Foundation	2019-01-23 22:19:49 +00:00
Mark Johnston	c06cc56e39	Fix an LLE lookup race. After the afdata read lock was converted to epoch(9), readers could observe a linked LLE and block on the LLE while a thread was unlinking the LLE. The writer would then release the lock and schedule the LLE for deferred free, allowing readers to continue and potentially schedule the LLE timer. By the point the timer fires, the structure is freed, typically resulting in a crash in the callout subsystem. Fix the problem by modifying the lookup path to check for the LLE_LINKED flag upon acquiring the LLE lock. If it's not set, the lookup fails. PR: 234296 Reviewed by: bz Tested by: sbruno, Victor <chernov_victor@list.ru>, Mike Andrews <mandrews@bit0.com> MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18906	2019-01-23 22:18:23 +00:00
Gleb Smirnoff	c962ca9f2d	Remove unnecessary ifdef. Without INVARIANTS all KASSERTs are empty statements, so won't be compiled in.	2019-01-10 00:52:06 +00:00
Hans Petter Selasky	ef0111fdf3	Fix loopback traffic when using non-lo0 link local IPv6 addresses. The loopback interface can only receive packets with a single scope ID, namely the scope ID of the loopback interface itself. To mitigate this packets which use the scope ID are appearing as received by the real network interface, see "origifp" in the patch. The current code would drop packets which are designated for loopback which use a link-local scope ID in the destination address or source address, because they won't match the lo0's scope ID. To fix this restore the network interface pointer from the scope ID in the destination address for the problematic cases. See comments added in patch for a more detailed description. This issue was introduced with route caching (ae@). Reviewed by: bz (network) Differential Revision: https://reviews.freebsd.org/D18769 MFC after: 1 week Sponsored by: Mellanox Technologies	2019-01-09 14:28:08 +00:00
Gleb Smirnoff	a68cc38879	Mechanical cleanup of epoch(9) usage in network stack. - Remove macros that covertly create epoch_tracker on thread stack. Such macros are quite unsafe, e.g. will produce buggy code if same macro is used in embedded scopes. Explicitly declare epoch_tracker always. - Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read locking macros to what they actually are - the net_epoch. Keeping them as is is very misleading. They all are named FOO_RLOCK(), while they no longer have lock semantics. Now they allow recursion and what's more important they now no longer guarantee protection against their companion WLOCK macros. Note: INP_HASH_RLOCK() has same problems, but not touched by this commit. This is non functional mechanical change. The only functionally changed functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter epoch recursively. Discussed with: jtl, gallatin	2019-01-09 01:11:19 +00:00
Mateusz Guzik	cc426dd319	Remove unused argument to priv_check_cred. Patch mostly generated with coccinelle: @@ expression E1,E2; @@ - priv_check_cred(E1,E2,0) + priv_check_cred(E1,E2) Sponsored by: The FreeBSD Foundation	2018-12-11 19:32:16 +00:00
Mark Johnston	9d2877fc3d	Clamp the INPCB port hash tables to IPPORT_MAX + 1 chains. Memory beyond that limit was previously unused, wasting roughly 1MB per 8GB of RAM. Also retire INP_PCBLBGROUP_PORTHASH, which was identical to INP_PCBPORTHASH. Reviewed by: glebius MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D17803	2018-12-05 17:06:00 +00:00
Mark Johnston	79db6fe7aa	Plug some networking sysctl leaks. Various network protocol sysctl handlers were not zero-filling their output buffers and thus would export uninitialized stack memory to userland. Fix a number of such handlers. Reported by: Thomas Barabosch, Fraunhofer FKIE Reviewed by: tuexen MFC after: 3 days Security: kernel memory disclosure Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18301	2018-11-22 20:49:41 +00:00
Andrey V. Elsukov	b2b5660688	Add ability to use dynamic external prefix in ipfw_nptv6 module. Now an interface name can be specified for nptv6 instance instead of ext_prefix. The module will track if_addr_ext events and when suitable IPv6 address will be added to specified interface, it will be configured as external prefix. When address disappears instance becomes unusable, i.e. it doesn't match any packets. Reviewed by: 0mp (manpages) Tested by: Dries Michiels <driesm dot michiels gmail com> MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D17765	2018-11-12 11:20:59 +00:00
Eric van Gyzen	68b840878c	in6_ifattach_linklocal: handle immediate removal of the new LLA If another thread immediately removes the link-local address added by in6_update_ifa(), in6ifa_ifpforlinklocal() can return NULL, so the following assertion (or dereference) is wrong. Remove the assertion, and handle NULL somewhat better than panicking. This matches all of the other callers of in6_update_ifa(). PR: 219250 Reviewed by: bz, dab (both an earlier version) MFC after: 1 week Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D17898	2018-11-08 19:50:23 +00:00
Mark Johnston	d9ff5789be	Remove redundant checks for a NULL lbgroup table. No functional change intended. MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17108	2018-11-01 15:52:49 +00:00
Bjoern A. Zeeb	201100c58b	Initial implementation of draft-ietf-6man-ipv6only-flag. This change defines the RA "6" (IPv6-Only) flag which routers may advertise, kernel logic to check if all routers on a link have the flag set and accordingly update a per-interface flag. If all routers agree that it is an IPv6-only link, ether_output_frame(), based on the interface flag, will filter out all ETHERTYPE_IP/ARP frames, drop them, and return EAFNOSUPPORT to upper layers. The change also updates ndp to show the "6" flag, ifconfig to display the IPV6_ONLY nd6 flag if set, and rtadvd to allow announcing the flag. Further changes to tcpdump (contrib code) are available and will be upstreamed. Tested the code (slightly earlier version) with 2 FreeBSD IPv6 routers, a FreeBSD laptop on ethernet as well as wifi, and with Win10 and OSX clients (which did not fall over with the "6" flag set but not understood). We may also want to (a) implement an RX filter, and (b) over time enhance user space to, say, stop dhclient from running when the interface flag is set. Also we might want to start IPv6 before IPv4 in the future. All the code is hidden under the EXPERIMENTAL option and not compiled by default as the draft is a work-in-progress and we cannot rely on the fact that IANA will assign the bits as requested by the draft and hence they may change. Dear 6man, you have running code. Discussed with: Bob Hinden, Brian E Carpenter	2018-10-30 20:08:48 +00:00
Bjoern A. Zeeb	1ff6e7a8a8	rip6_input() inp validation after epoch(9) After r335924 rip6_input() needs inp validation to avoid working on FREED inps. Apply the relevant bits from r335497,r335501 (rip_input() change) to the IPv6 counterpart. PR: 232194 Reviewed by: rgrimes, ae (,hps) MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D17594	2018-10-24 10:42:35 +00:00
Andrey V. Elsukov	8796e291f8	Add the check that current VNET is ready and access to srchash is allowed. This change is similar to r339646. The callback that checks for appearing and disappearing of tunnel ingress address can be called during VNET teardown. To prevent access to already freed memory, add check to the callback and epoch_wait() call to be sure that callback has finished its work. MFC after: 20 days	2018-10-23 13:11:45 +00:00
Mark Johnston	d3a4b0dabc	Fix style bugs in in6_pcblookup_lbgroup(). This should have been a part of r338470. No functional changes intended. Reported by: gallatin Reviewed by: gallatin, Johannes Lundberg <johalun0@gmail.com> MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17109	2018-10-22 16:09:01 +00:00
Andrey V. Elsukov	19873f4780	Add handling for appearing/disappearing of ingress addresses to if_gre(4). * register handler for ingress address appearing/disappearing; * add new srcaddr hash table for fast softc lookup by srcaddr; * when srcaddr disappears, clear IFF_DRV_RUNNING flag from interface, and set it otherwise; MFC after: 1 month Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D17214	2018-10-21 18:13:45 +00:00
Andrey V. Elsukov	009d82ee0f	Add handling for appearing/disappearing of ingress addresses to if_gif(4). * register handler for ingress address appearing/disappearing; * add new srcaddr hash table for fast softc lookup by srcaddr; * when srcaddr disappears, clear IFF_DRV_RUNNING flag from interface, and set it otherwise; * remove the note about ingress address from BUGS section. MFC after: 1 month Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D17134	2018-10-21 18:06:15 +00:00
Andrey V. Elsukov	64d63b1e03	Add ifaddr_event_ext event. It is similar to ifaddr_event, but the handler receives the type of event IFADDR_EVENT_ADD/IFADDR_EVENT_DEL, and the pointer to ifaddr. Also ifaddr_event now is implemented using ifaddr_event_ext handler. MFC after: 3 weeks Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D17100	2018-10-21 15:02:06 +00:00
Jonathan T. Looney	13c6ba6d94	There are three places where we return from a function which entered an epoch section without exiting that epoch section. This is bad for two reasons: the epoch section won't exit, and we will leave the epoch tracker from the stack on the epoch list. Fix the epoch leak by making sure we exit epoch sections before returning. Reviewed by: ae, gallatin, mmacy Approved by: re (gjb, kib) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D17450	2018-10-09 13:26:06 +00:00
Bjoern A. Zeeb	9cffbc68bd	After r338257 it was possible to trigger a KASSERT() in udp6_output() using an application trying to use a v4mapped destination address on a kernel without INET support or on a v6only socket. Catch this case and prevent the packet from going anywhere; else, without the KASSERT() armed, a v4mapped destination address might go out on the wire or other undefined behaviour might happen, while with the KASSERT() we panic. PR: 231728 Reported by: Jeremy Faulkner (gldisater gmail.com) Approved by: re (kib)	2018-10-02 17:29:56 +00:00
Bjoern A. Zeeb	e15e0e3e4d	In in6_pcbpurgeif0() called, e.g., from if_clone_destroy(), once we have a lock, make sure the inp is not marked freed. This can happen since the list traversal and locking was converted to epoch(9). If the inp is marked "freed", skip it. This prevents a NULL pointer deref panic later on. Reported by: slavash (Mellanox) Tested by: slavash (Mellanox) Reviewed by: markj (no formal review but caught my unlock mistake) Approved by: re (kib)	2018-09-27 15:32:37 +00:00
Bjoern A. Zeeb	6675bee81a	In icmp6_rip6_input(), once we have a lock, make sure the inp is not freed. This can happen since the list traversal and locking was converted to epoch(9). If the inp is marked "freed", skip it. This prevents a NULL pointer deref panic in ip6_savecontrol_v4() trying to access the socket hanging off the inp, which was gone by the time we got there. Reported by: andrew Tested by: andrew Approved by: re (gjb)	2018-09-20 15:45:53 +00:00
Bjoern A. Zeeb	997fecb5c2	Update udp6_output() inp locking to avoid concurrency issues with route cache updates. Bring over locking changes applied to udp_output() for the route cache in r297225 and fixed in r306559 which achieve multiple things: (1) acquire an exclusive inp lock earlier depending on the expected conditions; we add a comment explaining this in udp6, (2) having acquired the exclusive lock earlier eliminates a slight possible chance for a race condition which was present in v4 for multiple years as well and is now gone, and (3) only pass the inp_route6 to ip6_output() if we are holding an exclusive inp lock, so that possible route cache updates in case of routing table generation number changes can happen safely. In addition this change (as the legacy IP counterpart) decomposes the tracking of inp and pcbinfo lock and adds extra assertions, that the two together are acquired correctly. PR: 230950 Reviewed by: karels, markj Approved by: re (gjb) Pointyhat to: bz (for completely missing this bit) Differential Revision: https://reviews.freebsd.org/D17230	2018-09-19 18:49:37 +00:00
Mark Johnston	54af3d0dac	Fix synchronization of LB group access. Lookups are protected by an epoch section, so the LB group linkage must be a CK_LIST rather than a plain LIST. Furthermore, we were not deferring LB group frees, so in_pcbremlbgrouphash() could race with readers and cause a use-after-free. Reviewed by: sbruno, Johannes Lundberg <johalun0@gmail.com> Tested by: gallatin Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17031	2018-09-10 19:00:29 +00:00
Bjoern A. Zeeb	ec86402ecd	Replicate r328271 from legacy IP to IPv6 using a single macro to clear L2 and L3 route caches. Also mark one function argument as __unused. Reviewed by: karels, ae Approved by: re (rgrimes) Differential Revision: https://reviews.freebsd.org/D17007	2018-09-03 22:27:27 +00:00
Bjoern A. Zeeb	f6aeb1eee5	Replicate r307234 from legacy IP to IPv6 code, using the RO_RTFREE() macro rather than hand crafted code. No functional changes. Reviewed by: karels Approved by: re (rgrimes) Differential Revision: https://reviews.freebsd.org/D17006	2018-09-03 22:14:37 +00:00
Bjoern A. Zeeb	bc11a8829e	As discussed in D6262 post-commit review, change inp_route to inp_route6 for IPv6 code after r301217. This was most likely a c&p error from the legacy IP code, which did not matter as it is a union and both structures have the same layout at the beginning. No functional changes. Reviewed by: karels, ae Approved by: re (rgrimes) Differential Revision: https://reviews.freebsd.org/D17005	2018-09-03 22:12:48 +00:00

2080 Commits