freebsd-dev

Author	SHA1	Message	Date
Alexander V. Chernikov	e58c8da068	Map IPv6 link-local prefix to the link-local ifa. Currently we create link-local route by creating an always-on IPv6 prefix in the prefix list. This prefix is not tied to the link-local ifa. This leads to the following problems: First, when flushing interface addresses we skip on-link route, leaving fe80::/64 prefix on the interface without any IPv6 addresses. Second, when creating and removing link-local alias we lose fe80::/64 prefix from the routing table. Fix this by attaching link-local prefix to the ifa at the initial creation. Reviewed by: hrs MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D28129	2021-01-13 10:03:15 +00:00
Alexander V. Chernikov	2defbe9f0e	Use rn_match instead of doing indirect calls in fib_algo. Relevant inet/inet6 code has the control over deciding what the RIB lookup function currently is. With that in mind, explicitly set it to the current value (rn_match) in the datapath lookups. This avoids cost on indirect call. Differential Revision: https://reviews.freebsd.org/D28066	2021-01-11 23:30:35 +00:00
Alexander V. Chernikov	0da3f8c98d	Bump amount of queued packets for unresolved ARP/NDP entries to 16. Currently default behaviour is to keep only 1 packet per unresolved entry. Ability to queue more than one packet was added 10 years ago, in r215207, though the default value was kept intact. Things have changed since that time. Systems tend to initiate multiple connections at once for a variety of reasons. For example, recent kern/252278 bug report describes happy-eyeball DNS behaviour sending multiple requests to the DNS server. The primary driver for upper value for the queue length determination is memory consumption. Remote actors should not be able to easily exhaust local memory by sending packets to unresolved arp/ND entries. For now, bump value to 16 packets, to match Darwin implementation. The proper approach would be to switch the limit to calculate memory consumption instead of packet count and limit based on memory. We should MFC this with a variation of D22447. Reviewers: #manpages, #network, bz, emaste Reviewed By: emaste, gbe(doc), jilles(doc) MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D28068	2021-01-11 19:51:11 +00:00
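The per-entry hold queue changed by 0da3f8c98d can be pictured as a small bounded FIFO. This is a minimal sketch, assuming the oldest packet is dropped once the limit is reached (the commit message does not state the drop policy); `HoldQueue` and `MAXHOLD` are hypothetical names, not kernel identifiers.

```python
from collections import deque

MAXHOLD = 16  # new default from the commit message (previously 1)


class HoldQueue:
    """Packets waiting for one unresolved ARP/NDP entry (illustrative only)."""

    def __init__(self, maxhold=MAXHOLD):
        self.packets = deque()
        self.maxhold = maxhold
        self.dropped = 0

    def enqueue(self, pkt):
        # Assumption: when the limit is reached, the oldest packet is dropped.
        if len(self.packets) >= self.maxhold:
            self.packets.popleft()
            self.dropped += 1
        self.packets.append(pkt)

    def flush(self):
        """Return the held packets in arrival order once the entry resolves."""
        held = list(self.packets)
        self.packets.clear()
        return held
```

A packet-count limit like this bounds the number of held packets but not their total size, which is why the message suggests a memory-based limit as the proper long-term approach.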
Alexander V. Chernikov	d68cf57b7f	Refactor rt_addrmsg() and rt_routemsg(). Summary: * Refactor rt_addrmsg(): make V_rt_add_addr_allfibs decision locally. * Fix rt_routemsg() and multipath by accepting nexthop instead of interface pointer. * Refactor rtsock_routemsg(): avoid accessing rtentry fields directly. * Simplify in_addprefix() by moving prefix search to a separate function. Reviewers: #network Subscribers: imp, ae, bz Differential Revision: https://reviews.freebsd.org/D28011	2021-01-07 19:38:19 +00:00
Alexander V. Chernikov	f5baf8bb12	Add modular fib lookup framework. This change introduces framework that allows to dynamically attach or detach longest prefix match (lpm) lookup algorithms to speed up datapath route tables lookups. Framework takes care of handling initial synchronisation, route subscription, nhop/nhop groups reference and indexing, dataplane attachments and fib instance algorithm setup/teardown. Framework features automatic algorithm selection, allowing for picking the best matching algorithm on-the-fly based on the amount of routes in the routing table. Currently framework code is guarded under FIB_ALGO config option. An idea is to enable it by default in the next couple of weeks. The following algorithms are provided by default: IPv4: * bsearch4 (lockless binary search in a special IP array), tailored for small-fib (<16 routes) * radix4_lockless (lockless immutable radix, re-created on every rtable change), tailored for small-fib (<1000 routes) * radix4 (base system radix backend) * dpdk_lpm4 (DPDK DIR24-8-based lookups), lockless datastructure, optimized for large-fib (D27412) IPv6: * radix6_lockless (lockless immutable radix, re-created on every rtable change), tailored for small-fib (<1000 routes) * radix6 (base system radix backend) * dpdk_lpm6 (DPDK DIR24-8-based lookups), lockless datastructure, optimized for large-fib (D27412) Performance changes: Micro benchmarks (I7-7660U, single-core lookups, 2048k dst, code in D27604): IPv4: 8 routes: radix4: ~20mpps radix4_lockless: ~24.8mpps bsearch4: ~69mpps dpdk_lpm4: ~67 mpps 700k routes: radix4_lockless: 3.3mpps dpdk_lpm4: 46mpps IPv6: 8 routes: radix6_lockless: ~20mpps dpdk_lpm6: ~70mpps 100k routes: radix6_lockless: 13.9mpps dpdk_lpm6: 57mpps Forwarding benchmarks: + 10-15% IPv4 forwarding performance (small-fib, bsearch4) + 25% IPv4 forwarding performance (full-view, dpdk_lpm4) + 20% IPv6 forwarding performance (full-view, dpdk_lpm6) Control: Framework adds the following runtime sysctls: List algos * net.route.algo.inet.algo_list: bsearch4, radix4_lockless, radix4 * net.route.algo.inet6.algo_list: radix6_lockless, radix6, dpdk_lpm6 Debug level (7=LOG_DEBUG, per-route) net.route.algo.debug_level: 5 Algo selection (currently only for fib 0): net.route.algo.inet.algo: bsearch4 net.route.algo.inet6.algo: radix6_lockless Support for manually changing algos in non-default fib will be added soon. Some sysctl names will be changed in the near future. Differential Revision: https://reviews.freebsd.org/D27401	2020-12-25 11:33:17 +00:00
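Commit f5baf8bb12 describes bsearch4 only as a "lockless binary search in a special IP array". The sketch below illustrates the general idea of longest-prefix match by binary search over a sorted array of address boundaries. It is an assumption-laden illustration, not the kernel's data layout; `build_table` and `lookup` are hypothetical names.

```python
import bisect
import ipaddress


def build_table(routes):
    """routes: dict mapping 'a.b.c.d/len' -> nexthop name."""
    nets = [(ipaddress.ip_network(p), nh) for p, nh in routes.items()]
    # The best match can only change at the start of a prefix or at the
    # address just past its end, so those are the only points we store.
    points = {0}
    for net, _ in nets:
        points.add(int(net.network_address))
        end = int(net.broadcast_address) + 1
        if end < 2 ** 32:
            points.add(end)
    starts = sorted(points)
    best = []
    for s in starts:
        addr = ipaddress.ip_address(s)
        covering = [(net.prefixlen, nh) for net, nh in nets if addr in net]
        # Longest prefix wins; None means no route covers this range.
        best.append(max(covering)[1] if covering else None)
    return starts, best


def lookup(table, dst):
    """Binary-search the boundary array for the range containing dst."""
    starts, best = table
    i = bisect.bisect_right(starts, int(ipaddress.ip_address(dst))) - 1
    return best[i]
```

The table is rebuilt from scratch on every change and never modified in place, which is what would let readers use it without locks. Rebuilding costs more as the table grows, which fits the message's positioning of bsearch4 for very small route counts.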
Andrew Gallatin	a034518ac8	Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain In order to efficiently serve web traffic on a NUMA machine, one must avoid as many NUMA domain crossings as possible. With SO_REUSEPORT_LB, a number of workers can share a listen socket. However, even if a worker sets affinity to a core or set of cores on a NUMA domain, it will receive connections associated with all NUMA domains in the system. This will lead to cross-domain traffic when the server writes to the socket or calls sendfile(), and memory is allocated on the server's local NUMA node, but transmitted on the NUMA node associated with the TCP connection. Similarly, when the server reads from the socket, it will likely be reading memory allocated on the NUMA domain associated with the TCP connection. This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A server can now tell the kernel to filter traffic so that only incoming connections associated with the desired NUMA domain are given to the server. (Of course, in the case where there are no servers sharing the listen socket on some domain, then as a fallback, traffic will be hashed as normal to all servers sharing the listen socket regardless of domain). This allows a server to deal only with traffic that is local to its NUMA domain, and avoids cross-domain traffic in most cases. This patch, and a corresponding small patch to nginx to use TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted https media content from dual-socket Xeons with only 13% (as measured by pcm.x) cross domain traffic on the memory controller. Reviewed by: jhb, bz (earlier version), bcr (man page) Tested by: gonzo Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21636	2020-12-19 22:04:46 +00:00
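The selection rule described in a034518ac8 (prefer listeners on the connection's NUMA domain, otherwise hash across all listeners) can be sketched as follows. `pick_listener` and its arguments are hypothetical; the kernel's actual hashing and data structures are not shown in the message.

```python
def pick_listener(listeners, conn_hash, conn_domain):
    """listeners: list of (name, numa_domain) pairs sharing one listen group."""
    local = [entry for entry in listeners if entry[1] == conn_domain]
    # Fallback from the commit message: if no listener is registered on
    # this domain, hash across all listeners regardless of domain.
    candidates = local if local else listeners
    return candidates[conn_hash % len(candidates)][0]
```

For example, with workers `w0` and `w1` on domain 0 and `w2` on domain 1, a connection arriving on domain 1 always lands on `w2`, while a connection on a domain with no workers is spread over all three.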
Hans Petter Selasky	ac4dd4cd95	Expose nonstandard IPv6 kernel definitions to standalone builds. No functional change. Reviewed by: bz@ MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2020-12-04 21:51:47 +00:00
Alexander V. Chernikov	d1d941c5b9	Remove RADIX_MPATH config option. ROUTE_MPATH is the new config option controlling new multipath routing implementation. Remove the last pieces of RADIX_MPATH-related code and the config option. Reviewed by: glebius Differential Revision: https://reviews.freebsd.org/D27244	2020-11-29 19:43:33 +00:00
Alexander V. Chernikov	b712e3e343	Refactor fib4/fib6 functions. No functional changes. * Make lookup path of fib<4|6>_lookup_debugnet() separate functions (fib<46>_lookup_rt()). These will be used in the control plane code requiring unlocked radix operations and actual prefix pointer. * Make lookup part of fib<4|6>_check_urpf() separate functions. This change simplifies the switch to alternative lookup implementations, which helps algorithmic lookups introduction. * While here, use static initializers for IPv4/IPv6 keys Differential Revision: https://reviews.freebsd.org/D27405	2020-11-29 13:41:49 +00:00
Bjoern A. Zeeb	dd4d5a5ffb	IPv6: set ifdisabled in the kernel rather than in rc Enable ND6_IFF_IFDISABLED when the interface is created in the kernel before return to user space. This avoids a race when an interface is created by a program which also calls ifconfig IF inet6 -ifdisabled and races with the devd -> /etc/pccard_ether -> .. netif start IF -> ifdisabled calls (the devd/rc framework disabling IPv6 again after the program had enabled it already). In case the global net.inet6.ip6.accept_rtadv was turned on, we also default to enabling IPv6 on the interfaces, rather than disabling them. PR: 248172 Reported by: Gert Doering (gert greenie.muc.de) Reviewed by: glebius (, phk) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D27324	2020-11-25 20:58:01 +00:00
Alexander V. Chernikov	7511a63825	Refactor rib iterator functions. * Make rib_walk() order of arguments consistent with the rest of RIB api * Add rib_walk_ext() allowing to exec callback before/after iteration. * Rename rt_foreach_fib_walk_del -> rib_foreach_table_walk_del * Rename rt_forach_fib_walk -> rib_foreach_table_walk * Move rib_foreach_table_walk{_del} to route/route_helpers.c * Slightly refactor rib_foreach_table_walk{_del} to make the implementation consistent and prepare for upcoming iterator optimizations. Differential Revision: https://reviews.freebsd.org/D27219	2020-11-22 20:21:10 +00:00
Jonathan T. Looney	440598dd9e	Fix implicit automatic local port selection for IPv6 during connect calls. When a user creates a TCP socket and tries to connect to the socket without explicitly binding the socket to a local address, the connect call implicitly chooses an appropriate local port. When evaluating candidate local ports, the algorithm checks for conflicts with existing ports by doing a lookup in the connection hash table. In this circumstance, both the IPv4 and IPv6 code look for exact matches in the hash table. However, the IPv4 code goes a step further and checks whether the proposed 4-tuple will match wildcard (e.g. TCP "listen") entries. The IPv6 code has no such check. The missing wildcard check can cause problems when connecting to a local server. It is possible that the algorithm will choose the same value for the local port as the foreign port uses. This results in a connection with identical source and destination addresses and ports. Changing the IPv6 code to align with the IPv4 code's behavior fixes this problem. Reviewed by: tuexen Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D27164	2020-11-14 14:50:34 +00:00
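The conflict check described in 440598dd9e combines an exact-match lookup with a wildcard (listener) lookup. The sketch below models both tables as plain sets; `port_in_use` and `choose_local_port` are hypothetical names, and the real code works on the kernel's connection hash table.

```python
def port_in_use(conns, listeners, laddr, lport, faddr, fport):
    """Return True if the proposed 4-tuple conflicts with existing state."""
    # Exact match against existing connections.
    if (laddr, lport, faddr, fport) in conns:
        return True
    # Wildcard check: a listener bound to this address or to any address
    # ("*") on the same port would also match the proposed tuple.
    return (laddr, lport) in listeners or ("*", lport) in listeners


def choose_local_port(conns, listeners, laddr, faddr, fport, candidates):
    """Return the first candidate port without a conflict, or None."""
    for lport in candidates:
        if not port_in_use(conns, listeners, laddr, lport, faddr, fport):
            return lport
    return None
```

Without the wildcard check, a client connecting to a local server on port 8080 could be handed 8080 as its own local port, giving the identical source and destination pair the message describes.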
Alexander V. Chernikov	d9999ae9ca	Fix use-after-free in icmp6_notify_error(). Reported by: Maxime Villard <max at m00nbsd.net> Reviewed by: markj MFC after: 3 days	2020-10-28 20:22:20 +00:00
Mark Johnston	4caea9b169	icmp6: Count packets dropped due to an invalid hop limit Pad the icmp6stat structure so that we can add more counters in the future without breaking compatibility again, last done in r358620. Annotate the rarely executed error paths with __predict_false while here. Reviewed by: bz, melifaro Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D26578	2020-10-19 17:07:19 +00:00
Alexander V. Chernikov	0c325f53f1	Implement flowid calculation for outbound connections to balance connections over multiple paths. Multipath routing relies on mbuf flowid data for both transit and outbound traffic. Current code fills mbuf flowid from inp_flowid for connection-oriented sockets. However, inp_flowid is currently not calculated for outbound connections. This change creates simple hashing functions and starts calculating hashes for TCP,UDP/UDP-Lite and raw IP if multipath routes are present in the system. Reviewed by: glebius (previous version),ae Differential Revision: https://reviews.freebsd.org/D26523	2020-10-18 17:15:47 +00:00
Richard Scheffenegger	868aabb470	Add IP(V6)_VLAN_PCP to set 802.1 priority per-flow. This adds a new IP_PROTO / IPV6_PROTO setsockopt (getsockopt) option IP(V6)_VLAN_PCP, which can be set to -1 (interface default), or explicitly to any priority between 0 and 7. Note that for untagged traffic, explicitly adding a priority will insert a special 802.1Q vlan header with vlan ID = 0 to carry the priority setting. Reviewed by: gallatin, rrs MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D26409	2020-10-09 12:06:43 +00:00
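The priority-only tag mentioned in 868aabb470 is an ordinary 802.1Q header with VLAN ID 0. The sketch below shows how the 4-byte tag is laid out (TPID 0x8100, then a 16-bit TCI holding 3 bits of PCP, 1 bit of DEI and 12 bits of VLAN ID). `vlan_tag` is a hypothetical helper for illustration and does not use the socket option itself.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value identifying an 802.1Q tag


def vlan_tag(pcp, vid=0, dei=0):
    """Build the 4-byte 802.1Q tag for a given priority (PCP) and VLAN ID."""
    if not 0 <= pcp <= 7:
        raise ValueError("PCP must be between 0 and 7")
    if not 0 <= vid <= 0xFFF:
        raise ValueError("VLAN ID must fit in 12 bits")
    if dei not in (0, 1):
        raise ValueError("DEI must be 0 or 1")
    # TCI layout: PCP in bits 15..13, DEI in bit 12, VLAN ID in bits 11..0.
    tci = (pcp << 13) | (dei << 12) | vid
    return struct.pack("!HH", TPID_8021Q, tci)
```

With the default `vid=0`, the tag carries only the priority, which matches the untagged-traffic case in the message.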
Alexander V. Chernikov	fedeb08b6a	Introduce scalable route multipath. This change is based on the nexthop objects landed in D24232. The change introduces the concept of nexthop groups. Each group contains the collection of nexthops with their relative weights and a dataplane-optimized structure to enable efficient nexthop selection. Similar to the nexthops, nexthop groups are immutable. Dataplane part gets compiled during group creation and is basically an array of nexthop pointers, compiled w.r.t their weights. With this change, `rt_nhop` field of `struct rtentry` contains either nexthop or nexthop group. They are distinguished by the presence of NHF_MULTIPATH flag. All dataplane lookup functions return pointer to the nexthop object, leaving nexthop groups details inside routing subsystem. User-visible changes: The change is intended to be backward-compatible: all non-mpath operations should work as before with ROUTE_MPATH and net.route.multipath=1. All routes now come with weight, default weight is 1, maximum is 2^24-1. Current maximum multipath group width is statically set to 64. This will become sysctl-tunable in the followup changes. Using functionality: * Recompile kernel with ROUTE_MPATH * set net.route.multipath to 1 route add -6 2001:db8::/32 2001:db8::2 -weight 10 route add -6 2001:db8::/32 2001:db8::3 -weight 20 netstat -6On Nexthop groups data Internet6: GrpIdx NhIdx Weight Slots Gateway Netif Refcnt 1 ------- ------- ------- --------------------------------------- --------- 1 13 10 1 2001:db8::2 vlan2 14 20 2 2001:db8::3 vlan2 Next steps: * Land outbound hashing for locally-originated routes ( D26523 ). * Fix net/bird multipath (net/frr seems to work fine) * Add ROUTE_MPATH to GENERIC * Set net.route.multipath=1 by default Tested by: olivier Reviewed by: glebius Relnotes: yes Differential Revision: https://reviews.freebsd.org/D26449	2020-10-03 10:47:17 +00:00
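Commit fedeb08b6a describes the dataplane part of a nexthop group as "an array of nexthop pointers, compiled w.r.t their weights". One way to build such an array is sketched below, assuming weights are reduced by their greatest common divisor. That assumption reproduces the Slots column of the netstat example (weights 10 and 20 give 1 and 2 slots), but the kernel's actual compilation and its handling of the 64-entry width limit are not shown. `compile_group` and `select_nexthop` are hypothetical names.

```python
from functools import reduce
from math import gcd


def compile_group(nexthops):
    """nexthops: list of (name, weight) pairs. Returns the slot array."""
    # Assumption: reduce the weights by their GCD to keep the array short.
    divisor = reduce(gcd, (weight for _, weight in nexthops))
    slots = []
    for name, weight in nexthops:
        slots.extend([name] * (weight // divisor))
    return slots


def select_nexthop(slots, flowid):
    """Pick a nexthop for a flow; the same flowid always maps to the same slot."""
    return slots[flowid % len(slots)]
```

Because the array is built once at group creation and never modified, selection is a single index operation on the forwarding path.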
Alexander V. Chernikov	2259a03020	Rework part of routing code to reduce difference to D26449. * Split rt_setmetrics into get_info_weight() and rt_set_expire_info(), as these two can be applied at different entities and at different times. * Start filling route weight in route change notifications * Pass flowid to UDP/raw IP route lookups * Rework nd6_subscription_cb() and sysctl_dumpentry() to prepare for the fact that rtentry can contain multiple nexthops. Differential Revision: https://reviews.freebsd.org/D26497	2020-09-21 20:02:26 +00:00
Alexander V. Chernikov	1440f62266	Remove unused nhop_ref_any() function. Remove "opt_mpath.h" header where not needed. No functional changes.	2020-09-20 21:32:52 +00:00
Navdeep Parhar	b092fd6c97	if_vxlan(4): add support for hardware assisted checksumming, TSO, and RSS. This lets a VXLAN pseudo-interface take advantage of hardware checksumming (tx and rx), TSO, and RSS if the NIC is capable of performing these operations on inner VXLAN traffic. A VXLAN interface inherits the capabilities of its vxlandev interface if one is specified or of the interface that hosts the vxlanlocal address. If other interfaces will carry traffic for that VXLAN then they must have the same hardware capabilities. On transmit, if_vxlan verifies that the outbound interface has the required capabilities and then translates the CSUM_ flags to their inner equivalents. This tells the hardware ifnet that it needs to operate on the inner frame and not the outer VXLAN headers. An event is generated when a VXLAN ifnet starts. This allows hardware drivers to configure their devices to expect VXLAN traffic on the specified incoming port. On receive, the hardware does RSS and checksum verification on the inner frame. if_vxlan now does a direct netisr dispatch to take full advantage of RSS. It is not very clear why it didn't do this already. Future work: Rx: it should be possible to avoid the first trip up the protocol stack to get the frame to if_vxlan just so it can decapsulate and requeue for a second trip up the stack. The hardware NIC driver could directly call an if_vxlan receive routine for VXLAN traffic instead. Rx: LRO. depends on what happens with the previous item. There will have to be a mechanism to indicate that it's time for if_vxlan to flush its LRO state. Reviewed by: kib@ Relnotes: Yes Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D25873	2020-09-18 02:37:57 +00:00
Navdeep Parhar	72cc43df17	Add a knob to allow zero UDP checksums for UDP/IPv6 traffic on the given UDP port. This will be used by some upcoming changes to if_vxlan(4). RFC 7348 (VXLAN) says that the UDP checksum "SHOULD be transmitted as zero. When a packet is received with a UDP checksum of zero, it MUST be accepted for decapsulation." But the original IPv6 RFCs did not allow zero UDP checksum. RFC 6935 attempts to resolve this. Reviewed by: kib@ Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D25873	2020-09-18 02:21:15 +00:00
Mateusz Guzik	662c13053f	net: clean up empty lines in .c and .h files	2020-09-01 21:19:14 +00:00
Kyle Evans	1e9b8db9b2	ipv6: quit dropping packets looping back on p2p interfaces To paraphrase the below-referenced PR: This logic originated in the KAME project, and was even controversial when it was enabled there by default in 2001. No such equivalent logic exists in the IPv4 stack, and it turns out that this leads to us dropping valid traffic when the "point to point" interface is actually a 1:many tun interface, e.g. with the wireguard userland stack. Even in the case of true point-to-point links, this logic only avoids transient looping of packets sent by misconfigured applications or attackers, which can be subverted by proper route configuration rather than hardcoded logic in the kernel to drop packets. In the review, melifaro goes on to note that the kernel can't fix it, so it perhaps shouldn't try to be 'smart' about it. Additionally, that TTL will still kick in even with incorrect route configuration. PR: 247718 Reviewed by: melifaro, rgrimes MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D25567	2020-08-31 01:45:48 +00:00
Alexander V. Chernikov	a624ca3dff	Move net/route/shared.h definitions to net/route/route_var.h. No functional changes. net/route/shared.h was created in the initial phases of nexthop conversion. It was intended to serve the same purpose as route_var.h - share definitions of functions and structures between the routing subsystem components. At that time route_var.h was included by many files external to the routing subsystem, which largely defeats its purpose. As currently this is not the case anymore and amount of route_var.h includes is roughly the same as shared.h, retire the latter in favour of the former.	2020-08-28 22:50:20 +00:00
Alexander V. Chernikov	bec053ffe0	Make net.inet6.ip6.deembed_scopeid behaviour default & remove sysctl. Submitted by: Neel Chauhan <neel AT neelc DOT org> Differential Revision: https://reviews.freebsd.org/D25637	2020-08-15 11:37:44 +00:00
Alexander V. Chernikov	2f23f45b20	Simplify dom_<rtattach|rtdetach>. Remove unused arguments from dom_rtattach/dom_rtdetach functions and make them return/accept 'struct rib_head' instead of 'void **'. Declare inet/inet6 implementations in the relevant _var.h headers similar to domifattach / domifdetach. Add rib_subscribe_internal() function to accept subscriptions to the rnh directly. Differential Revision: https://reviews.freebsd.org/D26053	2020-08-14 21:29:56 +00:00
Hans Petter Selasky	b453d3d239	Use a static initializer for the multicast free tasks. This makes the SYSINIT() function updated in r364072 superfluous. Suggested by: glebius@ MFC after: 1 week Sponsored by: Mellanox Technologies	2020-08-11 08:31:40 +00:00
Alexander V. Chernikov	9a00f6d067	Fix rib_subscribe() waitok flag by performing allocation outside epoch. Make in6_inithead() use rib_subscribe with waitok to achieve reliable subscription allocation. Reviewed by: glebius	2020-08-11 07:05:30 +00:00
Bjoern A. Zeeb	f9461246a2	MC: add a note with reference to the discussion and history as-to why we are where we are now. The main thing is to try to get rid of the delayed freeing to avoid blocking on the taskq when shutting down vnets. X-Timeout: if you still see this before 14-RELEASE remove it.	2020-08-10 10:58:43 +00:00
Hans Petter Selasky	3689652c65	Make sure the multicast release tasks are properly drained when destroying a VNET or a network interface. Else the inm release tasks, both IPv4 and IPv6 may cause a panic accessing a freed VNET or network interface. Reviewed by: jmg@ Discussed with: bz@ Differential Revision: https://reviews.freebsd.org/D24914 MFC after: 1 week Sponsored by: Mellanox Technologies	2020-08-10 10:46:08 +00:00
Hans Petter Selasky	a95ef9d38d	Use proper prototype for SYSINIT() functions. Mark the unused argument using the __unused macro. Discussed with: kib@ MFC after: 1 week Sponsored by: Mellanox Technologies	2020-08-10 10:40:19 +00:00
Bjoern A. Zeeb	a9839c4aee	IPV6_PKTINFO support for v4-mapped IPv6 sockets When using v4-mapped IPv6 sockets with IPV6_PKTINFO we do not respect the given v4-mapped src address on the IPv4 socket. Implement the needed functionality. This allows single-socket UDP applications (such as OpenVPN) to work better on FreeBSD. Requested by: Gert Doering (gert greenie.net), pfsense Tested by: Gert Doering (gert greenie.net) Reviewed by: melifaro MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D24135	2020-08-07 15:13:53 +00:00
Andrey V. Elsukov	ce8875b6c4	Fix typo. Submitted by: Evgeniy Khramtsov <evgeniy at khramtsov org> MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D25932	2020-08-05 10:27:11 +00:00
Mark Johnston	cfae6a92ac	Remove an incorrect assertion from in6p_lookup_mcast_ifp(). The socket may be bound to an IPv4-mapped IPv6 address. However, the inp address is not relevant to the JOIN_GROUP or LEAVE_GROUP operations. While here remove an unnecessary check for inp == NULL. Reported by: syzbot+d01ab3d5e6c1516a393c@syzkaller.appspotmail.com Reviewed by: hselasky MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25888	2020-08-04 15:00:02 +00:00
Mark Johnston	19afc65a7e	ip6_output(): Check the return value of in6_getlinkifnet(). If the destination address has an embedded scope ID, make sure that it corresponds to a valid ifnet before proceeding. Otherwise a sendto() with a bogus link-local address can trigger a NULL pointer dereference. Reported by: syzkaller Reviewed by: ae Fixes: r358572 Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25887	2020-07-30 17:43:23 +00:00
Alexander V. Chernikov	e1c05fd290	Transition from rtrequest1_fib() to rib_action(). Remove all variations of rtrequest <rtrequest1_fib, rtrequest_fib, in6_rtrequest, rtrequest_fib> and their uses and switch to rib_action(). This is part of the new routing KPI. Submitted by: Neel Chauhan <neel AT neelc DOT org> Differential Revision: https://reviews.freebsd.org/D25546	2020-07-21 19:56:13 +00:00
Alexander V. Chernikov	725871230d	Temporarily revert r363319 to unbreak the build. Reported by: CI Pointy hat to: melifaro	2020-07-19 10:53:15 +00:00
Alexander V. Chernikov	8cee15d9e4	Transition from rtrequest1_fib() to rib_action(). Remove all variations of rtrequest <rtrequest1_fib, rtrequest_fib, in6_rtrequest, rtrequest_fib> and their uses and switch to rib_action(). This is part of the new routing KPI. Submitted by: Neel Chauhan <neel AT neelc DOT org> Differential Revision: https://reviews.freebsd.org/D25546	2020-07-19 09:29:27 +00:00
Alexander V. Chernikov	4c7ba83f9d	Switch inet6 default route subscription to the new rib subscription api. Old subscription model allowed only single customer. Switch inet6 to the new subscription api and eliminate the old model. Differential Revision: https://reviews.freebsd.org/D25615	2020-07-12 11:24:23 +00:00
Alexander V. Chernikov	eddfb2e86f	Fix IPv6 regression introduced by r362900. PR: kern/247729	2020-07-03 08:06:26 +00:00
Alexander V. Chernikov	6ad7446c6f	Complete conversions from fib<4|6>_lookup_nh_<basic|ext> to fib<4|6>_lookup(). fib[46]_lookup_nh_ represents pre-epoch generation of fib api, providing less guarantees over pointer validness and requiring on-stack data copying. With no callers remaining, remove fib[46]_lookup_nh_ functions. Submitted by: Neel Chauhan <neel AT neelc DOT org> Differential Revision: https://reviews.freebsd.org/D25445	2020-07-02 21:04:08 +00:00
Mark Johnston	95033af923	Add the SCTP_SUPPORT kernel option. This is in preparation for enabling a loadable SCTP stack. Analogous to IPSEC/IPSEC_SUPPORT, the SCTP_SUPPORT kernel option must be configured in order to support a loadable SCTP implementation. Discussed with: tuexen MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2020-06-18 19:32:34 +00:00
Michael Tuexen	70486b27ae	Retire SCTP_SO_LOCK_TESTING. This was intended to test the locking used in the MacOS X kernel on a FreeBSD system, to make use of WITNESS and other debugging infrastructure. This hasn't been used for ages, so take it out to reduce the #ifdef complexity. MFC after: 1 week	2020-06-07 14:39:20 +00:00
Ryan Moeller	78a3645fd2	Fix typo in previous commit Applied the wrong patch Reported by: Michael Butler <imb@protected-networks.net> Approved by: mav (mentor) Sponsored by: iXsystems.com	2020-06-03 17:26:00 +00:00
Ryan Moeller	f057d56c6c	scope6: Check for NULL afdata before dereferencing Narrows the race window with if_detach. Approved by: mav (mentor) MFC after: 3 days Sponsored by: iXsystems, Inc. Differential Revision: https://reviews.freebsd.org/D25017	2020-06-03 16:57:30 +00:00
Alexander V. Chernikov	da187ddb3d	* Add rib_<add|del|change>_route() functions to manipulate the routing table. The main driver for the change is the need to improve notification mechanism. Currently callers guess the operation data based on the rtentry structure returned in case of successful operation result. There are two problems with this approach. First is that it doesn't provide enough information for the upcoming multipath changes, where rtentry refers to a new nexthop group, and there is no way of guessing which paths were added during the change. Second is that some rtentry fields can change during notification and protecting from it by requiring customers to unlock rtentry is not desired. Additionally, as the consumers such as rtsock do know which operation they request in advance, making explicit add/change/del versions of the functions makes sense, especially given the functions don't share a lot of code. With that in mind, introduce rib_cmd_info notification structure and rib_<add|del|change>_route() functions, with mandatory rib_cmd_info pointer. It will be used in upcoming generalized notifications. * Move definitions of the new functions and some other functions/structures used for the routing table manipulation to a separate header file, net/route/route_ctl.h. net/route.h is a frequently used file included in ~140 places in kernel, and 90% of the users don't need these definitions. Reviewed by: ae Differential Revision: https://reviews.freebsd.org/D25067	2020-06-01 20:49:42 +00:00
Alexander V. Chernikov	e7403d0230	Revert r361704, it accidentally committed merged D25067 and D25070.	2020-06-01 20:40:40 +00:00
Alexander V. Chernikov	79674562b8	* Add rib_<add|del|change>_route() functions to manipulate the routing table. The main driver for the change is the need to improve notification mechanism. Currently callers guess the operation data based on the rtentry structure returned in case of successful operation result. There are two problems with this approach. First is that it doesn't provide enough information for the upcoming multipath changes, where rtentry refers to a new nexthop group, and there is no way of guessing which paths were added during the change. Second is that some rtentry fields can change during notification and protecting from it by requiring customers to unlock rtentry is not desired. Additionally, as the consumers such as rtsock do know which operation they request in advance, making explicit add/change/del versions of the functions makes sense, especially given the functions don't share a lot of code. With that in mind, introduce rib_cmd_info notification structure and rib_<add|del|change>_route() functions, with mandatory rib_cmd_info pointer. It will be used in upcoming generalized notifications. * Move definitions of the new functions and some other functions/structures used for the routing table manipulation to a separate header file, net/route/route_ctl.h. net/route.h is a frequently used file included in ~140 places in kernel, and 90% of the users don't need these definitions. Reviewed by: ae Differential Revision: https://reviews.freebsd.org/D25067	2020-06-01 20:32:02 +00:00
Alexander V. Chernikov	a37a5246ca	Use fib[46]_lookup() in mtu calculations. fib[46]_lookup_nh_ represents pre-epoch generation of fib api, providing less guarantees over pointer validness and requiring on-stack data copying. Conversion is straightforward, as the only 2 differences are requirement of running in network epoch and the need to handle RTF_GATEWAY case in the caller code. Differential Revision: https://reviews.freebsd.org/D24974	2020-05-28 08:00:08 +00:00
Alexander V. Chernikov	1483c1c508	Replace ip6_output fib6_lookup_nh_<ext|basic> calls with fib6_lookup(). fib6_lookup_nh_ represents pre-epoch generation of fib api, providing less guarantees over pointer validness and requiring on-stack data copying. Conversion is straightforward, as the only 2 differences are requirement of running in network epoch and the need to handle RTF_GATEWAY case in the caller code. Reviewed by: ae Differential Revision: https://reviews.freebsd.org/D24973	2020-05-28 07:29:44 +00:00
Alexander V. Chernikov	7bfc98af12	Switch gif(4) path verification to fib[46]_check_urpf(). fibX_lookup_nh_ represents pre-epoch generation of fib api, providing less guarantees over pointer validness and requiring on-stack data copying. Use specialized fib[46]_check_urpf() from newer KPI instead, to allow removal of older KPI. Reviewed by: ae Differential Revision: https://reviews.freebsd.org/D24978	2020-05-28 07:26:18 +00:00
Alexander V. Chernikov	4d2c2509f2	Move <add|del|change>_route() functions to route_ctl.c in preparation of multipath control plane changes described in D24141. Currently route.c contains core routing init/teardown functions, route table manipulation functions and various helper functions, resulting in >2KLOC file in total. This change moves most of the route table manipulation parts to a dedicated file, simplifying planned multipath changes and making route.c more manageable. Differential Revision: https://reviews.freebsd.org/D24870	2020-05-23 19:06:57 +00:00
Alexander V. Chernikov	2bbab0af6d	Use epoch(9) for rtentries to simplify control plane operations. Currently the only reason of refcounting rtentries is the need to report the rtable operation details immediately after the execution. Delaying rtentry reclamation allows to stop refcounting and simplify the code. Additionally, this change allows to reimplement rib_lookup_info(), which is used by some of the customers to get the matching prefix along with nexthops, in more efficient way. The change keeps per-vnet rtzone uma zone. It adds nh_vnet field to nhop_priv to be able to reliably set curvnet even during vnet teardown. Rest of the reference counting code will be removed in the D24867 . Differential Revision: https://reviews.freebsd.org/D24866	2020-05-23 10:21:02 +00:00
Mike Karels	2510235150	Allow TCP to reuse local port with different destinations Previously, tcp_connect() would bind a local port before connecting, forcing the local port to be unique across all outgoing TCP connections for the address family. Instead, choose a local port after selecting the destination and the local address, requiring only that the tuple is unique and does not match a wildcard binding. Reviewed by: tuexen (rscheff, rrs previous version) MFC after: 1 month Sponsored by: Forcepoint LLC Differential Revision: https://reviews.freebsd.org/D24781	2020-05-18 22:53:12 +00:00
Andrew Gallatin	bc74b81991	IPv6: Fix a panic in the nd6 code with unmapped mbufs. If the neighbor entry for an IPv6 TCP session using unmapped mbufs times out, IPv6 will send an icmp6 dest. unreachable message. In doing this, it will try to do a software checksum on the reflected packet. If this is a TCP session using unmapped mbufs, then there will be a kernel panic. To fix this, just free packets with unmapped mbufs, rather than sending the icmp. Reviewed by: np, rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D24821	2020-05-12 17:18:44 +00:00
Andrew Gallatin	d7452d89ad	IPv6: sync IP_NO_SND_TAG_RL support from IPv4 The IP_NO_SND_TAG_RL flag to ip{,6}_output() means that the packets being sent should bypass hardware rate limiting. This is typically used by modern TCP stacks for rexmits. This support was added to IPv4 in r352657, but never added to IPv6, even though rack and bbr call ip6_output() with this flag. Reviewed by: rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D24822	2020-05-12 14:01:12 +00:00
Andrew Gallatin	84af4cc153	Fix the build Back out the IPv6 portion of r360903, as the stamp_tag param is apparently not supported in upstream FreeBSD. Sponsored by: Netflix Pointy hat to: gallatin	2020-05-11 21:23:22 +00:00
Andrew Gallatin	6043ac201a	Ktls: never skip stamping tags for NIC TLS The newer RACK and BBR TCP stacks have added a mechanism to disable hardware packet pacing for TCP retransmits. This mechanism works by skipping the send-tag stamp on rate-limited connections when the TCP stack calls ip_output() with the IP_NO_SND_TAG_RL flag set. When doing NIC TLS, we must ignore this flag, as NIC TLS packets must always be stamped. Failure to stamp a NIC TLS packet will result in crypto issues. Reviewed by: hselasky, rrs Sponsored by: Netflix, Mellanox	2020-05-11 19:17:33 +00:00
Alexander V. Chernikov	9e02229580	Remove now-unused rt_ifp,rt_ifa,rt_gateway,rt_mtu rte fields. After converting routing subsystem customers to use nexthop objects defined in r359823, some fields in struct rtentry became unused. This commit removes rt_ifp, rt_ifa, rt_gateway and rt_mtu from struct rtentry along with the code initializing and updating these fields. Cleanup of the remaining fields will be addressed by D24669. This commit also changes the implementation of the RTM_CHANGE handling. Old implementation tried to perform the whole operation under radix WLOCK, resulting in slow performance and hacks like using RTF_RNH_LOCKED flag. New implementation looks up the route nexthop under radix RLOCK, creates new nexthop and tries to update rte nhop pointer. Only the last part is done under WLOCK. In the hypothetical scenarios where multiple rtsock clients repeatedly issue RTM_CHANGE requests for the same route, route may get updated between read and update operation. This is addressed by retrying the operation multiple (3) times before returning failure back to the caller. Differential Revision: https://reviews.freebsd.org/D24666	2020-05-04 14:31:45 +00:00
Gleb Smirnoff	7b6c99d08d	Step 3: anonymize struct mbuf_ext_pgs and move all its fields into mbuf within m_epg namespace. All edits except the 'struct mbuf' declaration and mb_dupcl() were done mechanically with sed: s/->m_ext_pgs.nrdy/->m_epg_nrdy/g s/->m_ext_pgs.hdr_len/->m_epg_hdrlen/g s/->m_ext_pgs.trail_len/->m_epg_trllen/g s/->m_ext_pgs.first_pg_off/->m_epg_1st_off/g s/->m_ext_pgs.last_pg_len/->m_epg_last_len/g s/->m_ext_pgs.flags/->m_epg_flags/g s/->m_ext_pgs.record_type/->m_epg_record_type/g s/->m_ext_pgs.enc_cnt/->m_epg_enc_cnt/g s/->m_ext_pgs.tls/->m_epg_tls/g s/->m_ext_pgs.so/->m_epg_so/g s/->m_ext_pgs.seqno/->m_epg_seqno/g s/->m_ext_pgs.stailq/->m_epg_stailq/g Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598	2020-05-03 00:12:56 +00:00
Alexander V. Chernikov	74787ef47b	Add nhop to the ifa_rtrequest() callback. With the upcoming multipath changes described in D24141, rt->rt_nhop can potentially point to a nexthop group instead of an individual nhop. To simplify caller handling of such cases, change ifa_rtrequest() callback to pass changed nhop directly. Differential Revision: https://reviews.freebsd.org/D24604	2020-04-29 19:28:56 +00:00
Alexander V. Chernikov	e7d8af4f65	Move route_temporal.c and route_var.h to net/route. Nexthop objects implementation, defined in r359823, introduced sys/net/route directory intended to hold all routing-related code. Move recently-introduced route_temporal.c and private route_var.h header there. Differential Revision: https://reviews.freebsd.org/D24597	2020-04-28 19:14:09 +00:00
Alexander V. Chernikov	fe6da72759	Move struct rtentry definition to nhop_var.h. One of the goals of the new routing KPI defined in r359823 is to entirely hide `struct rtentry` from the consumers. It will allow to improve routing subsystem internals and deliver features much faster. This is one of the last changes, effectively moving struct rtentry definition to a net/route_var.h header, internal to the routing subsystem. Differential Revision: https://reviews.freebsd.org/D24580	2020-04-28 18:42:30 +00:00
Alexander V. Chernikov	1b0051bada	Eliminate now-unused parts of old routing KPI. r360292 switched most of the remaining routing customers to a new KPI, leaving a bunch of wrappers for old routing lookup functions unused. Remove them from the tree as a part of routing cleanup. Differential Revision: https://reviews.freebsd.org/D24569	2020-04-28 07:25:34 +00:00
Alexander V. Chernikov	55f57ca9ac	Convert debugnet to the new routing KPI. Introduce new fib[46]_lookup_debugnet() functions serving as a special interface for the crash-time operations. Underlying implementation will try to return lookup result if datastructures are not corrupted, avoiding locking. Convert debugnet to use fib4_lookup_debugnet() and switch it to use nexthops instead of rtentries. Reviewed by: cem Differential Revision: https://reviews.freebsd.org/D24555	2020-04-26 18:42:38 +00:00
Alexander V. Chernikov	49c9f84f54	Fix IPv6 link-local operations with RADIX_MPATH. It was broken by r360292 as fib6_lookup() assumes de-embedded addresses while rtalloc_mpath_fib() requires sockaddr with embedded ones. New fib6_lookup() transparently supports multipath, hence remove old RADIX_MPATH condition.	2020-04-26 18:07:35 +00:00
Alexander V. Chernikov	983066f05b	Convert route caching to nexthop caching. This change is built on top of nexthop objects introduced in r359823. Nexthops are separate datastructures, containing all necessary information to perform packet forwarding such as gateway interface and mtu. Nexthops are shared among the routes, providing more pre-computed cache-efficient data while requiring less memory. Splitting the LPM code and the attached data solves multiple long-standing problems in the routing layer, drastically reduces the coupling with other parts of the stack and allows to transparently introduce faster lookup algorithms. Route caching was (re)introduced to minimise (slow) routing lookups, allowing for notably better performance for large TCP senders. Caching works by acquiring rtentry reference, which is protected by per-rtentry mutex. If the routing table is changed (checked by comparing the rtable generation id) or link goes down, cache record gets withdrawn. Nexthops have the same reference counting interface, backed by refcount(9). This change merely replaces rtentry with the actual forwarding nexthop as a cached object, which is mostly mechanical. Other moving parts like cache cleanup on rtable change remain the same. Differential Revision: https://reviews.freebsd.org/D24340	2020-04-25 09:06:11 +00:00
Alexander V. Chernikov	aaad3c4fca	Convert rtentry field accesses into nhop field accesses. One of the goals of the new routing KPI defined in r359823 is to entirely hide `struct rtentry` from the consumers. It will allow to improve routing subsystem internals and deliver more features much faster. This commit is mostly mechanical change to eliminate direct struct rtentry field accesses. The only notable difference is AF_LINK gateway encoding. AF_LINK gw is used in routing stack for operations with interface routes and host loopback routes. In the former case it indicates _some_ non-NULL gateway, as the interface is the same as in rt_ifp in kernel and rtm_ifindex in rtsock reporting. In the latter case the interface index inside gateway was used by the IPv6 datapath to verify address scope for link-local interfaces. Kernel uses struct sockaddr_dl for this type of gateway. This structure allows for specifying rich interface data, such as mac address and interface name. However, this results in relatively large structure size - 52 bytes. Routing stack fills in only 2 fields - sdl_index and sdl_type, which reside in the first 8 bytes of the structure. In the new KPI, struct nhop_object tries to be cache-efficient, hence embodies gateway address inside the structure. In the AF_LINK case it stores shortened version of the structure - struct sockaddr_dl_short, which occupies 16 bytes. After D24340 changes, the data inside AF_LINK gateway will not be used in the kernel at all, leaving rtsock as the only potential concern.
The difference in rtsock reporting: (old) got message of size 240 on Thu Apr 16 03:12:13 2020 RTM_ADD: Add Route: len 240, pid: 0, seq 0, errno 0, flags:<UP,DONE,PINNED> locks: inits: sockaddrs: <DST,GATEWAY,NETMASK> 10.0.0.0 link#5 255.255.255.0 (new) got message of size 200 on Sun Apr 19 09:46:32 2020 RTM_ADD: Add Route: len 200, pid: 0, seq 0, errno 0, flags:<UP,DONE,PINNED> locks: inits: sockaddrs: <DST,GATEWAY,NETMASK> 10.0.0.0 link#5 255.255.255.0 Note the 40-byte difference (52-16 + alignment). However, gateway is still a valid AF_LINK gateway with proper data filled in. It is worth noting that these particular messages (interface routes) are mostly ignored by routing daemons: * bird/quagga/frr use RTM_NEWADDR and ignore prefix route addition messages. * quagga/frr ignore routes without gateway. A more detailed overview of how rtsock messages are used by the routing daemons to reconstruct the kernel view can be found in D22974. Differential Revision: https://reviews.freebsd.org/D24519	2020-04-23 08:04:20 +00:00
Alexander V. Chernikov	d98351e13c	Fix lookup key generation in fib6_check_urpf(). The version introduced in r359823 assumed D23051 had been in tree already. As this is not the case yet, revert to sockaddr.	2020-04-19 07:27:12 +00:00
Jonathan T. Looney	5d6e356cb0	Avoid calling protocol drain routines more than once per reclamation event. mb_reclaim() calls the protocol drain routines for each protocol in each domain. Some protocols exist in more than one domain and share drain routines. In the case of SCTP, it also uses the same drain routine for its SOCK_SEQPACKET and SOCK_STREAM entries in the same domain. On systems with INET, INET6, and SCTP all defined, mb_reclaim() calls sctp_drain() four times. On systems with INET and INET6 defined, mb_reclaim() calls tcp_drain() twice. mb_reclaim() is the only in-tree caller of the pr_drain protocol entry. Eliminate this duplication by ensuring that each pr_drain routine is only specified for one protocol entry in one domain. Reviewed by: tuexen MFC after: 2 weeks Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D24418	2020-04-16 20:17:24 +00:00
Alexander V. Chernikov	539642a29d	Add nhop parameter to rti_filter callback. One of the goals of the new routing KPI defined in r359823 is to entirely hide `struct rtentry` from the consumers. It will allow to improve routing subsystem internals and deliver more features much faster. This change is one of the ongoing changes to eliminate direct struct rtentry field accesses. Additionally, with the followup multipath changes, single rtentry can point to multiple nexthops. With that in mind, convert rti_filter callback used when traversing the routing table to accept pair (rt, nhop) instead of nexthop. Reviewed by: ae Differential Revision: https://reviews.freebsd.org/D24440	2020-04-16 17:20:18 +00:00
Alexander V. Chernikov	53a4886d5d	Convert ip6_forward() to the new routing KPI. Update ip6_forward() internals to use deembedded IPv6 addresses to simplify calls to the new KPI and prepare for the future scope-embedding cleanup. Add in6_get_unicast_scopeid() and in6_set_unicast_scopeid() scopeid operation functions tailored for unicast processing. Reviewed by: ae Differential Revision: https://reviews.freebsd.org/D24334	2020-04-15 12:56:05 +00:00
Alexander V. Chernikov	9ac7c6cfed	Convert IP/IPv6 forwarding, ICMP processing and IP PCB laddr selection to the new routing KPI. Reviewed by: ae Differential Revision: https://reviews.freebsd.org/D24245	2020-04-14 23:06:25 +00:00
Alexander V. Chernikov	dd4776f0cc	Reorganise nd6 notification code to avoid direct rtentry field access. One of the goals of the new routing KPI defined in r359823 is to entirely hide `struct rtentry` from the consumers. Doing so will allow to improve routing subsystem internals and deliver features more easily. This change is one of the ongoing changes to eliminate direct struct rtentry field accesses. It introduces rtfree_func() wrapper around RTFREE() and reorganises nd6 notification code to avoid accessing most of the rtentry fields. Reviewed by: ae Differential Revision: https://reviews.freebsd.org/D24404	2020-04-14 22:48:33 +00:00
Andrew Gallatin	23feb56348	KTLS: Re-work unmapped mbufs to carry ext_pgs in the mbuf itself. While the original implementation of unmapped mbufs was a large step forward in terms of reducing cache misses by enabling mbufs to carry more than a single page for sendfile, they are rather cache unfriendly when accessing the ext_pgs metadata and data. This is because the ext_pgs part of the mbuf is allocated separately, and almost guaranteed to be cold in cache. This change takes advantage of the fact that unmapped mbufs are never used at the same time as pkthdr mbufs. Given this fact, we can overlap the ext_pgs metadata with the mbuf pkthdr, and carry the ext_pgs meta directly in the mbuf itself. Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array of pages) directly after the existing m_ext. In order to be able to carry 5 pages (which is the minimum required for a 16K TLS record which is not perfectly aligned) on LP64, I've had to steal ext_arg2. The only user of this in the xmit path is sendfile, and I've adjusted it to use arg1 when using unmapped mbufs. This change is almost entirely mechanical, except that we change mb_alloc_ext_pgs() to no longer allow allocating pkthdrs, the change to avoid ext_arg2 as mentioned above, and the removal of the ext_pgs zone. This change saves roughly 2% "raw" CPU (~59% -> 57%), or over 3% "scaled" CPU on a Netflix 100% software kTLS workload at 90+ Gb/s on Broadwell Xeons. In a follow-on commit, I plan to remove some hacks to avoid accessing ext_pgs fields of mbufs, since they will now be in cache. Many thanks to glebius for helping to make this better in the Netflix tree. Reviewed by: hselasky, jhb, rrs, glebius (early version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D24213	2020-04-14 14:46:06 +00:00
Alexander V. Chernikov	6722086045	Plug netmask NULL check during route addition causing kernel panic. This bug was introduced by the r359823. Reported by: hselasky	2020-04-14 13:12:22 +00:00
Alexander V. Chernikov	3133002560	Remove tcp_rtlookup6() function signature. The function itself was removed in r122922 16 years ago.	2020-04-13 08:26:11 +00:00
Alexander V. Chernikov	a666325282	Introduce nexthop objects and new routing KPI. This is the foundational change for the routing subsystem rearchitecture. More details and goals are available in https://reviews.freebsd.org/D24141 . This patch introduces concept of nexthop objects and new nexthop-based routing KPI. Nexthops are objects, containing all necessary information for performing the packet output decision. Output interface, mtu, flags, gw address goes there. For most of the cases, these objects will serve the same role as the struct rtentry is currently serving. Typically there will be low tens of such objects for the router even with multiple BGP full-views, as these objects will be shared between routing entries. This allows to store more information in the nexthop. New KPI: struct nhop_object *fib4_lookup(uint32_t fibnum, struct in_addr dst, uint32_t scopeid, uint32_t flags, uint32_t flowid); struct nhop_object *fib6_lookup(uint32_t fibnum, const struct in6_addr *dst6, uint32_t scopeid, uint32_t flags, uint32_t flowid); These 2 functions are intended to replace all flavours of <in_|in6_>rtalloc[1]<_ign><_fib>, mpath functions and the previous fib[46]-generation functions. Upon successful lookup, they return nexthop object which is guaranteed to exist within current NET_EPOCH. If longer lifetime is desired, one can specify NHR_REF as a flag and get a referenced version of the nexthop. Reference semantic closely resembles rtentry one, allowing sed-style conversion. Additionally, another 2 functions are introduced to support uRPF functionality inside variety of our firewalls.
Their primary goal is to hide the multipath implementation details inside the routing subsystem, greatly simplifying firewalls implementation: int fib4_lookup_urpf(uint32_t fibnum, struct in_addr dst, uint32_t scopeid, uint32_t flags, const struct ifnet *src_if); int fib6_lookup_urpf(uint32_t fibnum, const struct in6_addr *dst6, uint32_t scopeid, uint32_t flags, const struct ifnet *src_if); All functions have a separate scopeid argument, paving way to eliminating IPv6 scope embedding and allowing to support IPv4 link-locals in the future. Structure changes: * rtentry gets new 'rt_nhop' pointer, slightly growing the overall size. * rib_head gets new 'rnh_preadd' callback pointer, slightly growing overall sz. Old KPI: During the transition state old and new KPI will coexist. As there are another 4-5 decent-sized conversion patches, it will probably take a couple of weeks. To support both KPIs, fields not required by the new KPI (most of rtentry) have to be kept, resulting in the temporary size increase. Once conversion is finished, rtentry will notably shrink. More details: * architectural overview: https://reviews.freebsd.org/D24141 * list of the next changes: https://reviews.freebsd.org/D24232 Reviewed by: ae,glebius(initial version) Differential Revision: https://reviews.freebsd.org/D24232	2020-04-12 14:30:00 +00:00
Alexander V. Chernikov	c80b717f71	Remove RADIX_MPATH headers, they were unused since r293159. MFC after: 2 weeks	2020-04-11 07:56:11 +00:00
Alexander V. Chernikov	4684d3cbcb	Remove per-AF radix_mpath initialization functions. Split their functionality by moving random seed allocation to SYSINIT and calling (new) generic multipath function from standard IPv4/IPv6 RIB init handlers. Differential Revision: https://reviews.freebsd.org/D24356	2020-04-11 07:37:08 +00:00
Andrey V. Elsukov	cfad769689	Ignore ND6 neighbor advertisement received for static link-layer entries. Previously such NA could override manually created LLE. Reported by: Martin Beran <martin at mber cz> Reviewed by: melifaro MFC after: 10 days	2020-04-01 02:13:01 +00:00
Mark Johnston	431f2b8712	Use a dedicated taskqueue thread for in6m_release_task(). Interfaces may be detached from a taskqueue_thread task, for example by prison_complete(), so after r359438, when draining the queue we may end up deadlocking. Reported by: Jenkins via lwhsu MFC with: r359438	2020-03-31 02:25:53 +00:00
Mark Johnston	9b1d850be8	Remove the "config" taskqgroup and its KPIs. Equivalent functionality is already provided by taskqueue(9), just use that instead. MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2020-03-30 14:24:03 +00:00
Mark Johnston	e02582d1ae	Fix synchronization in the IPV6_2292PKTOPTIONS set handler. The inpcb needs to be locked when we update output packet options. Otherwise it is possible for the IPV6_2292PKTOPTIONS handler to free packet option structures while another thread is reading or updating them. Note that the option handler is still kind of broken. For instance it frees all options before performing privilege checks for individual options. However, this can be fixed separately. Reported by: syzbot+52eb0fd4ddc119787f9d@syzkaller.appspotmail.com Reviewed by: bz, tuexen MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D24125	2020-03-19 21:38:52 +00:00
Bjoern A. Zeeb	8483fce695	ip6: retire in6_selectroute_fib() as promised 8 years ago In r231852 I added in6_selectroute_fib() as a compat function with the fibnum as an extra argument compared to in6_selectroute() to keep the KPI stable. Way too late retire this function again and add the fib to in6_selectroute() which also only has a single consumer now and was an orphan function before.	2020-03-03 13:48:12 +00:00
Bjoern A. Zeeb	000c42faf3	ip6_output: use new routing KPI when not passed a cached route Implement the equivalent of r347375 (IPv4) for the IPv6 output path. In IPv6 we get passed a cached route (and inp) by udp6_output() depending on whether we acquired a write lock on the INP. In case we neither bind nor connect a first UDP packet would come in with a cached route (wlocked) and all further packets would not. In case we bind and do not connect we never write-lock the inp. When we do not pass in a cached route, rather than providing the storage for a route locally and pass it over the old lookup code and down the stack, use the new route lookup KPI and acquire all details we need to send the packet. Compared to the IPv4 code the IPv6 code has a couple of possible complications: given an option with a routing hdr/caching route there, and path mtu (ro_pmtu) case which now equally has to deal with the possibility of having a route which is NULL passed in, and the fwd_tag in case a firewall changes the next hop (something to factor out in the future). Sponsored by: Netflix Reviewed by: glebius MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D23886	2020-03-03 11:32:47 +00:00
Bjoern A. Zeeb	5f3e375ed8	in6_fib: return nh_ia in the ext interface as we do for IPv4 Like for IPv4 add nh_ia to the ext interface and return rt_ifa in order to be used for, e.g., packet/octets accounting purposes. Reviewed by: melifaro MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D23873	2020-03-03 09:50:33 +00:00
Bjoern A. Zeeb	f6428cdb1f	fib6_rte_to_nh_: return a link-local gw address with scope embedded In fib6_rte_to_nh_ when returning a link-local gateway address currently we do clear the scope. That could be recovered using the ifp returned as well, but the code in general seems to expect a link-local address with scope embedded as otherwise the "dst" (gw) passed to the output routines will not include scope and not send the packet out (the right interface). Do not clear the scope when returning a link-local address and allow packets to go out (the right interface). Remove the (now) extra scope recovery in the IPv6 fast-fwd code. Sponsored by: Netflix Reviewed by: melifaro, ae MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D23872	2020-03-03 09:45:16 +00:00
Bjoern A. Zeeb	f1db666a61	mld6: initialize oifp to avoid bogus results/panics in edge cases In certain cases (probably not during normal operation but observed in the lab during development) ip6_output() could return without error and ifpp (&oifp) not updated. Given oifp was never initialized we would take the later branch as oifp was not NULL, and when calling icmp6_ifstat_inc() we would panic dereferencing a garbage pointer. For code stability initialize oifp to NULL before first use to always have a deterministic value and not rely on a called function to behave and always and for ever do the work for us as we hope for. MFC after: 3 days Sponsored by: Netflix	2020-02-28 11:16:41 +00:00
Pawel Biernacki	7029da5c36	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT Approved by: kib (mentor, blanket) Commented by: kib, gallatin, melifaro Differential Revision: https://reviews.freebsd.org/D23718	2020-02-26 14:26:36 +00:00
Bjoern A. Zeeb	3db6053160	ip6_output: fix regression introduced in r358167 for ipv6 fragmentation When moving the calculations for the optlen into the if (opt) block which deals with possible extension headers I failed to initialise unfragpartlen to the ipv6 header length if there were no extension headers present. Correct that mistake to make IPv6 fragment length calculations work again. Reported by: hselasky, kp OKed by: hselasky, kp MFC after: 3 days X-MFC with: r358167 PR: 244393	2020-02-25 15:03:41 +00:00
Bjoern A. Zeeb	3459050c9a	Fix IPv6 checksums when exthdrs are present. In two places in ip6_output we are doing (delayed) checksum calculations. The initial logic came from SCTP in r205075,205104 and later I copied and adjusted it for the TCP|UDP case in r235958. The problem was that the original SCTP offsets were already wrong for any case with extension headers present given IPv6 extension headers are not part of the pseudo checksum calculations. The later changes do not help in case there is checksum offloading as for extension headers (incl. fragments) we do currently never offload as we have no infrastructure to know whether the NIC can handle these cases. Correct the offsets for delayed checksum calculations and properly handle mbuf flags. In addition harmonize the almost identical duplicate code. While here eliminate the now unneeded variable hlen and add an always missing mtod() call in the 1-b and 3 cases after the introduction of the mb_unmapped_to_ext() calls. Reported by: Francis Dupont (fdupont isc.org) PR: 243675 MFC after: 6 days Reviewed by: markj (earlier version), gallatin Differential Revision: https://reviews.freebsd.org/D23760	2020-02-24 19:12:20 +00:00
Pawel Biernacki	295a18d184	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (14 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Approved by: kib (mentor, blanket) Differential Revision: https://reviews.freebsd.org/D23639	2020-02-24 10:47:18 +00:00
Bjoern A. Zeeb	a1a6c01e41	ip6_output: improve extension header handling Move IPv6 source address checks from after extension header handling to the top of the function. If we do not pass these checks there is no reason to do a lot of work upfront. Fold extension header preparations and length calculations together into a single branch and macro rather than doing them sequentially. Likewise move extension header concatenation into a single branch block only doing it if we recorded any extension header length. Reviewed by: melifaro (earlier version), markj, gallatin Sponsored by: Netflix (partially, originally) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D23740	2020-02-20 10:56:12 +00:00
Michael Tuexen	868b51f234	Epochify SCTP.	2020-02-18 21:25:17 +00:00
Bjoern A. Zeeb	7c1daefe2c	ip6_output: update comments. Clear up some comments and improve panic messages. No functional changes. MFC after: 3 days	2020-02-18 11:28:00 +00:00
Hans Petter Selasky	bacb11c9ed	Fix kernel panic while trying to read multicast stream. When VIMAGE is enabled make sure the "m_pkthdr.rcvif" pointer is set for all mbufs being input by the IGMP/MLD6 code. Else there will be a NULL-pointer dereference in the netisr code when trying to set the VNET based on the incoming mbuf. Add an assert to catch this when queueing mbufs on a netisr to make debugging of similar cases easier. Found by: Vladislav V. Prodan PR: 244002 Reviewed by: bz@ MFC after: 1 week Sponsored by: Mellanox Technologies	2020-02-17 09:46:32 +00:00
Navdeep Parhar	c53c867eb3	Fix NOINET builds.	2020-01-31 02:23:48 +00:00
Gleb Smirnoff	e617b21d2f	Enter network epoch when calling in_pcbconnect() for IPv6 mapped to IPv4 UDP sockets. This was missed in r356983. Reported by: https://syzkaller.appspot.com/bug?id=73c7a2e3f0783f9947459065e5c2f25fe8f82f54	2020-01-22 17:06:55 +00:00
Alexander V. Chernikov	34a5582c47	Bring back redirect route expiration. Redirect (and temporal) route expiration was broken a while ago. This change brings route expiration back, with unified IPv4/IPv6 handling code. It introduces net.inet.icmp.redirtimeout sysctl, allowing to set an expiration time for redirected routes. It defaults to 10 minutes, analogous to net.inet6.icmp6.redirtimeout. Implementation uses separate file, route_temporal.c, as route.c is already bloated with tons of different functions. Internally, expiration is implemented as a per-rnh callout scheduled when route with non-zero rt_expire time is added or rt_expire is changed. It does not add any overhead when no temporal routes are present. Callout traverses entire routing tree under wlock, scheduling expired routes for deletion and calculating the next time it needs to be run. The rationale for such implementation is the following: typically workloads requiring large amount of routes have redirects turned off already, while the systems with small amount of routes will not incur large overhead during tree traversal. This change also fixes netstat -rn display of route expiration time, which has been broken since the conversion from kread() to sysctl. Reviewed by: bz MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D23075	2020-01-22 13:53:18 +00:00
Gleb Smirnoff	b955545386	Make ip6_output() and ip_output() require network epoch. All callers that previously may have called into these functions without network epoch now must enter it.	2020-01-22 05:51:22 +00:00
Gleb Smirnoff	bab98355f9	Add some documenting NET_EPOCH_ASSERTs.	2020-01-22 02:37:47 +00:00
Gleb Smirnoff	f6a2a6b163	Unroll macro that is used just once. Not a functional change.	2020-01-22 02:35:39 +00:00
Alexander V. Chernikov	16c2f24169	Document requirements for the 'struct route' variations. MFC after: 2 weeks	2020-01-21 12:00:34 +00:00
Gleb Smirnoff	2a4bd982d0	Introduce NET_EPOCH_CALL() macro and use it everywhere where we free data based on the network epoch. The macro reverses the argument order of epoch_call(9) - first function, then its argument. NFC	2020-01-15 06:05:20 +00:00
Gleb Smirnoff	97168be809	Mechanically substitute assertion of in_epoch(net_epoch_preempt) to NET_EPOCH_ASSERT(). NFC	2020-01-15 05:45:27 +00:00
Michael Tuexen	fe1274ee39	Fix race when accepting TCP connections. When expanding a SYN-cache entry to a socket/inp a two step approach was taken: 1) The local address was filled in, then the inp was added to the hash table. 2) The remote address was filled in and the inp was relocated in the hash table. Before the epoch changes, a write lock was held when this happens and the code looking up entries was holding a corresponding read lock. Since the read lock is gone away after the introduction of the epochs, the half populated inp was found during lookup. This resulted in processing TCP segments in the context of the wrong TCP connection. This patch changes the above procedure in a way that the inp is fully populated before inserted into the hash table. Thanks to Paul <devgs@ukr.net> for reporting the issue on the net@ mailing list and for testing the patch! Reviewed by: rrs@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D22971	2020-01-12 17:52:32 +00:00
Bjoern A. Zeeb	c6feea3b89	nd6_rtr: consistently use __func__ for nd6log() Over time one or two hard coded function names did not match the actual function anymore. Consistently use __func__ for nd6log() calls and re-wrap/re-format some messages for consistency. MFC after: 2 weeks	2020-01-12 17:41:09 +00:00
Bjoern A. Zeeb	25ebfe3350	nd6_rtr: make nd6_prefix_onlink() static nd6_prefix_onlink() is not used anywhere outside nd6_rtr.c. Stop exporting it and make it file local static.	2020-01-12 16:58:21 +00:00
Bjoern A. Zeeb	e1891232fc	in6_mcast: make in6_joingroup_locked() static in6_joingroup_locked() is only used file-local. No need to export it, hence make it static.	2020-01-11 18:55:12 +00:00
Alexander V. Chernikov	ead85fe415	Add fibnum, family and vnet pointer to each rib head. Having metadata such as fibnum or vnet in the struct rib_head is handy as it eases building functionality in the routing space. This change is required to properly bring back route redirect support. Reviewed by: bz MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D23047	2020-01-09 17:21:00 +00:00
Bjoern A. Zeeb	334fc5822b	vnet: virtualise more network stack sysctls. Virtualise tcp_always_keepalive, TCP and UDP log_in_vain. All three are set in the netoptions startup script, which we would love to run for VNETs as well [1]. While virtualising the log_in_vain sysctls seems pointless at first for as long as the kernel message buffer is not virtualised, it at least allows an administrator to debug the base system or an individual jail if needed without turning the logging on for all jails running on a system. PR: 243193 [1] MFC after: 2 weeks	2020-01-08 23:30:26 +00:00
Alexander V. Chernikov	e02d3fe70c	Fix rtsock route message generation for interface addresses. Reviewed by: olivier MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D22974	2020-01-07 21:16:30 +00:00
Gleb Smirnoff	e00ee1a9f4	In r343631 error code for a packet blocked by a firewall was changed from EACCES to EPERM. This change was not intentional, so fix that. Return EACCES if a firewall forbids sending. Noticed by: ae	2020-01-01 17:32:20 +00:00
Alexander V. Chernikov	bdb214a4a4	Remove useless code from in6_rmx.c The code in questions walks IPv6 tree every 60 seconds and looks into the routes with non-zero expiration time (typically, redirected routes). For each such route it sets RTF_PROBEMTU flag at the expiration time. No other part of the kernel checks for RTF_PROBEMTU flag. RTF_PROBEMTU was defined 21 years ago, 30 Jun 1999, as RTF_PROTO1. RTF_PROTO1 is a de-facto standard indication of a route installed by a routing daemon for a last decade. Reviewed by: bz, ae MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D22865	2019-12-18 22:10:56 +00:00
Hans Petter Selasky	a4c5668d12	Leave multicast group before reaping and committing state for both IPv4 and IPv6. This fixes a regression issue after r349369. When trying to exit a multicast group before closing the socket, a multicast leave packet should be sent. Differential Revision: https://reviews.freebsd.org/D22848 PR: 242677 Reviewed by: bz (network) Tested by: Aleksandr Fedorov <aleksandr.fedorov@itglobal.com> MFC after: 1 week Sponsored by: Mellanox Technologies	2019-12-18 12:06:34 +00:00
Bjoern A. Zeeb	74ff87cd16	Update comment. Update the comment related to SIIT and v4mapped addresses being rejected by us when coming from the wire given we have supported IPv6-only kernels for a few years now. See also draft-itojun-v6ops-v4mapped-harmful. Suggested by: melifaro MFC after: 2 weeks	2019-12-06 16:53:42 +00:00
Bjoern A. Zeeb	b745e7623c	ip6_input: remove redundant v4mapped check In ip6_input() we apply the same v4mapped address check twice. The only case which skips the first one is M_FASTFWD_OURS which should have passed the check on the first input pass and passed the firewall. Remove the 2nd redundant check. Reviewed by: kp, melifaro MFC after: 2 weeks Sponsored by: Netflix (originally) Differential Revision: https://reviews.freebsd.org/D22462	2019-12-06 16:42:58 +00:00
Kristof Provost	200424235e	Remove useless NULL check Coverity points out that we've already dereferenced m by the time we check, so there's no reason to keep the check. Moreover, it's safe to pass NULL to m_freem() anyway. CID: 1019092	2019-12-05 16:50:54 +00:00
Bjoern A. Zeeb	0700d2c3f0	Make icmp6_reflect() static. icmp6_reflect() is not used anywhere outside icmp6.c, no reason to export it. Sponsored by: Netflix	2019-12-03 14:46:38 +00:00
Hans Petter Selasky	5b64b824b9	Use refcount from "in_joingroup_locked()" when joining multicast groups. Do not acquire additional references. This makes the IPv4 IGMP code in line with the IPv6 MLD code. Background: The IPv4 multicast code puts an extra reference on the in_multi struct when joining groups. This becomes visible when using daemons like igmpproxy from ports, that multicast entries do not disappear from the output of ifmcstat(8) when multicast streams are disconnected. This fixes a regression issue after r349762. While at it factor the ip_mfilter_insert() and ip6_mfilter_insert() calls to avoid repeated "is_new" check. Differential Revision: https://reviews.freebsd.org/D22595 Tested by: Guido van Rooij <guido@gvr.org> Reviewed by: rgrimes (network) MFC after: 1 week Sponsored by: Mellanox Technologies	2019-12-03 08:46:59 +00:00
Michael Tuexen	e25b0dab9a	Update the hostcache also for PTB messages received for SCTP/IPv6. The corresponding code for SCTP/IPv4 was introduced in https://svnweb.freebsd.org/base?view=revision&revision=317597 Submitted by: Julius Flohr MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D22605	2019-12-01 16:14:44 +00:00
Bjoern A. Zeeb	a4adf6cc65	Fix m_pullup() problem after removing PULLDOWN_TESTs and KAME EXT_*macros. r354748-354750 replaced the KAME macros with m_pulldown() calls. Contrary to the rest of the network stack m_len checks before m_pulldown() were not put in place (see r354748). Put these m_len checks in place for now (to go along with the style of the network stack since the initial commits). These are not put in for performance but to avoid an error scenario (even though it also will help performance at the moment as it avoids allocating an extra mbuf; not because of the unconditional function call). The observed error case went like this: (1) an mbuf with M_EXT arrives and we call m_pullup() unconditionally on it. (2) m_pullup() will call m_get() unless the requested length is larger than MHLEN (in which case it'll m_freem() the perfectly fine mbuf) and migrate the requested length of data and pkthdr into the new mbuf. (3) If m_get() succeeds, a further m_pullup() call going over MHLEN will fail. This was observed with failing auto-configuration as an RA packet of 200 bytes exceeded MHLEN and the m_pullup() called from nd6_ra_input() dropped the mbuf. (Re-)adding the m_len checks before m_pullup() calls avoids this problem with mbufs using external storage for now. MFC after: 3 weeks Sponsored by: Netflix	2019-12-01 00:22:04 +00:00
Ryan Libby	6afe56f9c3	in6_joingroup_locked: need if_addr_lock around in6m_disconnect_locked It looks like the call that requires the lock was introduced in r337866. Reviewed by: hselasky Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D20739	2019-11-25 22:25:10 +00:00
Bjoern A. Zeeb	f8d4f9bce9	in6: move include Move the include for sysctl.h out of the middle of the file to the includes at the beginning. This will make it easier to add new sysctls. No functional changes. MFC after: 3 weeks Sponsored by: Netflix	2019-11-19 21:14:15 +00:00
Bjoern A. Zeeb	3c5018ca10	nd6: sysctl Move the SYSCTL_DECL to the top of the file. Move the sysctl function before SYSCTL_PROC so that we don't need an extra function declaration in the middle of the file. No functional changes. MFC after: 3 weeks Sponsored by: Netflix	2019-11-19 21:08:18 +00:00
Bjoern A. Zeeb	6db6527385	nd6: make nd6_timer_ch static nd6_timer_ch is only used in file local context. There is no need to export it, so make it static. MFC after: 3 weeks Sponsored by: Netflix	2019-11-19 20:54:17 +00:00
Bjoern A. Zeeb	f77a6dbd1e	nd6_rtr: re-sort functions Resort functions within file in a way that they depend on each other as that makes it easier to rework various things. Also allows us to remove file local function declarations. No functional changes. MFC after: 3 weeks Sponsored by: Netflix	2019-11-19 20:34:33 +00:00
Bjoern A. Zeeb	b2b7a4b2ca	mld: fix epoch assertion in6ifa_ifpforlinklocal() asserts the net epoch. The test case from r354832 revealed code paths where we call into the function without having acquired the net epoch first and consequently we hit the assert. This happens in certain MLD states during VNET shutdown and most people normally do not notice this. For correctness acquire the net epoch around calls to mld_v1_transmit_report() in all cases to avoid the assertion firing. MFC after: 2 weeks Sponsored by: Netflix	2019-11-19 14:53:13 +00:00
Bjoern A. Zeeb	32af08ecad	icmpv6: Fix mbuf change in mld After r354748 mld_input() can change the mbuf. The new pointer is never returned to icmp6_input() and when passed to icmp6_rip6_input() the mbuf may no longer be valid, leading to a panic. Pass a pointer to the mbuf to mld_input() so we can return an updated version in the non-error case. Add a test case sending an MLD packet which will trigger this bug. Pointyhat to: bz Reported by: gallatin, thj MFC After: 2 weeks X-MFC with: r354748 Sponsored by: Netflix	2019-11-18 21:59:47 +00:00
Bjoern A. Zeeb	808c432f62	nd6: retire defrouter_select(), use _fib() variant. Burn bridges and replace the last two calls of defrouter_select() with defrouter_select_fib(). That allows us to retire defrouter_select() and make it more clear in the calling code that it applies to all FIBs. Sponsored by: Netflix	2019-11-16 00:17:35 +00:00
Bjoern A. Zeeb	f592d0c377	nd6_rtr: Pull in the TAILQ_HEAD() as it is not needed outside nd6_rtr.c. Rename the TAILQ_HEAD() struct and the nd_defrouter variable from "nd_" to "nd6_" as they are not part of the RFC 3542 API which uses "ND_". Ideally I'd like to also rename the struct nd_defrouter {} to "nd6_*" but given that is used externally there is more work to do. No functional changes. MFC after: 3 weeks Sponsored by: Netflix	2019-11-16 00:02:36 +00:00
Bjoern A. Zeeb	63abacc204	netinet*: replace IP6_EXTHDR_GET() In a few places we have IP6_EXTHDR_GET() left in upper layer protocols. The IP6_EXTHDR_GET() macro might perform an m_pulldown() in case the data fragment is not contiguous. Convert these last remaining instances into m_pullup()s instead. In CARP, for example, we will a few lines later call m_pullup() anyway, the IPsec code coming from OpenBSD would otherwise have done the m_pullup() and are copying the data a bit later anyway, so pulling it in seems no better or worse. Note: this leaves very few m_pulldown() cases behind in the tree and we might want to consider removing them as well to make mbuf management easier again on a path to variable size mbufs, especially given m_pulldown() still has an issue not re-checking M_WRITEABLE(). Reviewed by: gallatin MFC after: 8 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D22335	2019-11-15 21:44:17 +00:00
Bjoern A. Zeeb	a61b5cfbbf	netinet6: Remove PULLDOWN_TESTs. Remove the KAME introduced PULLDOWN_TESTs which did not even have a compile-time option in sys/conf to turn them on for a custom kernel build. They made the code a lot harder to read or more complicated in a few cases. Convert the IP6_EXTHDR_CHECK() calls into FreeBSD looking code. Rather than throwing the packet away if it would not fit the KAME mbuf expectations, convert the macros to m_pullup() calls. Do not do any extra manual conditional checks upfront as to whether the m_len would suffice (), simply let m_pullup() do its work (incl. an early check). Remove extra m_pullup() calls where earlier in the function or the only caller has already done the pullup. Discussed with: rwatson () Reviewed by: ae MFC after: 8 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D22334	2019-11-15 21:40:40 +00:00
Bjoern A. Zeeb	e20b5bc485	nd6: simplify code We are taking the same actions in both cases of the branch inside the block. Simplify that code as the extra branch is not needed. MFC after: 3 weeks Sponsored by: Netflix	2019-11-15 13:45:38 +00:00
Bjoern A. Zeeb	b3a25d2993	nd6: remove unused structs and defines Remove a collections of unused structs and #defines to make it easier to understand what is actually in use. Sponsored by: Netflix	2019-11-13 14:28:07 +00:00
Bjoern A. Zeeb	d64df9a2b2	nd6: make nd6_alloc() file static nd6_alloc() is a function used only locally. Make it static and no longer export it. Keeps the KPI smaller. Sponsored by: Netflix	2019-11-13 13:53:17 +00:00
Bjoern A. Zeeb	ad675b3279	nd6 defrouter: consolidate nd_defrouter manipulations in nd6_rtr.c Move the nd_defrouter along with the sysctl handler from nd6.c to nd6_rtr.c and make the variable file static. Provide (temporary) new accessor functions for code manipulating nd_defrouter from nd6.c, and stop exporting functions no longer needed outside nd6_rtr.c. This also shuffles a few functions around in nd6_rtr.c without functional changes. Given all nd_defrouter logic is now in one place we can tidy up the code, locking, and other open items. MFC after: 3 weeks X-MFC: keep exporting the functions Sponsored by: Netflix	2019-11-13 12:05:48 +00:00
Bjoern A. Zeeb	a8fe77d877	netinet: update mp to pass the proper value back In ip6_[direct_]input() we are looping over the extension headers to deal with the next header. We pass a pointer to an mbuf pointer to the handling functions. In certain cases the mbuf can be updated there and we need to pass the new one back. That was missing in dest6_input() and route6_input(). In tcp6_input() we should also update it before we call tcp_input(). In addition to that mark the mbuf NULL every time we return that we are done with handling the packet and no next header should be checked (IPPROTO_DONE). This will eventually allow us to assert proper behaviour and catch the above kind of errors more easily, expecting *mp to always be set. This change is extracted from a larger patch and not an exhaustive change across the entire stack yet. PR: 240135 Reported by: prabhakar.lakhera gmail.com MFC after: 3 weeks Sponsored by: Netflix	2019-11-12 15:46:28 +00:00
Gleb Smirnoff	c17cd08f53	It is unclear why in6_pcblookup_local() would require write access to the PCB hash. The function doesn't modify the hash. It always asserted write lock historically, but with epoch conversion this fails in some special cases. Reviewed by: rwatson, bz Reported-by: syzbot+0b0488ca537e20cb2429@syzkaller.appspotmail.com	2019-11-11 06:28:25 +00:00
Bjoern A. Zeeb	c1131de6f1	frag6: properly handle atomic fragments according to RFCs. RFC 8200 says: "If the fragment is a whole datagram (that is, both the Fragment Offset field and the M flag are zero), then it does not need any further reassembly and should be processed as a fully reassembled packet (i.e., updating Next Header, adjust Payload Length, removing the Fragment header, etc.). .." That means we should remove the fragment header and make all the adjustments rather than just skipping over the fragment header. The difference should be noticeable in that a properly handled atomic fragment triggering an ICMPv6 message at an upper layer (e.g. dest unreach, unreachable port) will not include the fragment header. Update the test cases to also test for an unfragmentable part. That is needed so that the next header is properly updated (not just lengths). MFC after: 3 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D22155	2019-11-08 14:36:44 +00:00
Gleb Smirnoff	2435e507de	Now with epoch synchronized PCB lookup tables we can greatly simplify locking in udp_output() and udp6_output(). First, we select if we need read or write lock in PCB itself, we take the lock and enter network epoch. Then, we proceed for the rest of the function. In case if we need to modify PCB hash, we would take write lock on it for a short piece of code. We could exit the epoch before allocating an mbuf, but with this patch we are keeping it all the way into ip_output()/ip6_output(). Today this creates an epoch recursion, since ip_output() enters epoch itself. However, once all protocols are reviewed, ip_output() and ip6_output() would require epoch instead of entering it. Note: I'm not 100% sure that in udp6_output() the epoch is required. We don't do PCB hash lookup for a bound socket. And all branches of in6_select_src() don't require epoch, at least they lack assertions. Today inet6 address list is protected by rmlock, although it is CKLIST. AFAIU, the future plan is to protect it by network epoch. That would require epoch in in6_select_src(). Anyway, in future ip6_output() would require epoch, udp6_output() would need to enter it.	2019-11-07 21:01:36 +00:00
Gleb Smirnoff	d797164a86	Since r353292 on input path we are always in network epoch, when we lookup PCBs. Thus, do not enter epoch recursively in in_pcblookup_hash() and in6_pcblookup_hash(). Same applies to tcp_ctlinput() and tcp6_ctlinput(). This leaves several sysctl(9) handlers that return PCB credentials unprotected. Add epoch enter/exit to all of them. Differential Revision: https://reviews.freebsd.org/D22197	2019-11-07 20:49:56 +00:00
Gleb Smirnoff	cf377af6e2	Remove unnecessary recursive epoch enter via INP_INFO_RLOCK macro in icmp6_rip6_input(). It shall always run in the network epoch.	2019-11-07 20:43:12 +00:00
Gleb Smirnoff	f42347c39a	Remove unnecessary recursive epoch enter via INP_INFO_RLOCK macro in raw input functions for IPv4 and IPv6. They shall always run in the network epoch.	2019-11-07 20:40:44 +00:00
Gleb Smirnoff	8d28524a90	Remove unnecessary recursive epoch enter via INP_INFO_RLOCK macro in udp6_input(). It shall always run in the network epoch.	2019-11-07 20:38:53 +00:00
Bjoern A. Zeeb	503f4e4736	netinet*: variable cleanup In preparation for another change factor out various variable cleanups. These mainly include: (1) do not assign values to variables during declaration: this makes the code more readable and does allow for better grouping of variable declarations, (2) do not assign values to variables before need; e.g., if a variable is only used in the 2nd half of a function and we have multiple return paths before that, then do not set it before it is needed, and (3) try to avoid assigning the same value multiple times. MFC after: 3 weeks Sponsored by: Netflix	2019-11-07 18:29:51 +00:00
Gleb Smirnoff	751d8d156a	Widen network epoch coverage in nd6_prefix_onlink() as in6ifa_ifpforlinklocal() requires the epoch. Reported by: bz Reviewed by: bz	2019-11-07 17:00:20 +00:00
Gleb Smirnoff	d6dbfed81e	In nd6_timer() enter the network epoch earlier. The defrouter_del() may call into leaf functions that require epoch. Since the function is already run in non-sleepable context, it should be safe to cover it whole with epoch. Reported by: syzcaller	2019-11-04 17:35:37 +00:00
Bjoern A. Zeeb	6e6b5143f5	Properly set VNET when nuking recvif from fragment queues. In theory the eventhandler invoke should be in the same VNET as the the current interface. We however cannot guarantee that for all cases in the future. So before checking if the fragmentation handling for this VNET is active, switch the VNET to the VNET of the interface to always get the one we want. Reviewed by: hselasky MFC after: 3 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D22153	2019-10-25 18:54:06 +00:00

2191 Commits