freebsd-dev

Author	SHA1	Message	Date
Andrey V. Elsukov	978f2d1728	Add "tcpmss" opcode to match the TCP MSS value. With this opcode it is possible to match TCP packets whose MSS option value corresponds to the value configured in the opcode. It is allowed to specify a single value, a range of values, or an array of specific values or ranges. E.g. # ipfw add deny log tcp from any to any tcpmss 0-500 Reviewed by: melifaro, bcr Obtained from: Yandex LLC MFC after: 1 week Sponsored by: Yandex LLC	2019-06-21 10:54:51 +00:00
Xin LI	f89d207279	Separate kernel crc32() implementation to its own header (gsb_crc32.h) and rename the source to gsb_crc32.c. This is a prerequisite of unifying kernel zlib instances. PR: 229763 Submitted by: Yoshihiro Ota <ota at j.email.ne.jp> Differential Revision: https://reviews.freebsd.org/D20193	2019-06-17 19:49:08 +00:00
Andrey V. Elsukov	efdadaa2d8	Initialize V_nat64out methods explicitly. It looks like initialization of static variable doesn't work for VIMAGE and this leads to panic. Reported by: olivier MFC after: 1 week	2019-06-05 09:25:40 +00:00
Li-Wen Hsu	d086d41363	Remove an unneeded indentation introduced in r223637 to silence gcc warning MFC after: 3 days Sponsored by: The FreeBSD Foundation	2019-05-25 23:58:09 +00:00
John Baldwin	fb3bc59600	Restructure mbuf send tags to provide stronger guarantees. - Perform ifp mismatch checks (to determine if a send tag is allocated for a different ifp than the one the packet is being output on), in ip_output() and ip6_output(). This avoids sending packets with send tags to ifnet drivers that don't support send tags. Since we are now checking for ifp mismatches before invoking if_output, we can now try to allocate a new tag before invoking if_output sending the original packet on the new tag if allocation succeeds. To avoid code duplication for the fragment and unfragmented cases, add ip_output_send() and ip6_output_send() as wrappers around if_output and nd6_output_ifp, respectively. All of the logic for setting send tags and dealing with send tag-related errors is done in these wrapper functions. For pseudo interfaces that wrap other network interfaces (vlan and lagg), wrapper send tags are now allocated so that ip*_output see the wrapper ifp as the ifp in the send tag. The if_transmit routines rewrite the send tags after performing an ifp mismatch check. If an ifp mismatch is detected, the transmit routines fail with EAGAIN. - To provide clearer life cycle management of send tags, especially in the presence of vlan and lagg wrapper tags, add a reference count to send tags managed via m_snd_tag_ref() and m_snd_tag_rele(). Provide a helper function (m_snd_tag_init()) for use by drivers supporting send tags. m_snd_tag_init() takes care of the if_ref on the ifp meaning that code allocating send tags via if_snd_tag_alloc no longer has to manage that manually. Similarly, m_snd_tag_rele drops the refcount on the ifp after invoking if_snd_tag_free when the last reference to a send tag is dropped. This also closes use after free races if there are pending packets in driver tx rings after the socket is closed (e.g. from tcpdrop). In order for m_free to work reliably, add a new CSUM_SND_TAG flag in csum_flags to indicate 'snd_tag' is set (rather than 'rcvif'). Drivers now also check this flag instead of checking snd_tag against NULL. This avoids false positive matches when a forwarded packet has a non-NULL rcvif that was treated as a send tag. - cxgbe was relying on snd_tag_free being called when the inp was detached so that it could kick the firmware to flush any pending work on the flow. This is because the driver doesn't require ACK messages from the firmware for every request, but instead does a kind of manual interrupt coalescing by only setting a flag to request a completion on a subset of requests. If all of the in-flight requests don't have the flag when the tag is detached from the inp, the flow might never return the credits. The current snd_tag_free command issues a flush command to force the credits to return. However, the credit return is what also frees the mbufs, and since those mbufs now hold references on the tag, this meant that snd_tag_free would never be called. To fix, explicitly drop the mbuf's reference on the snd tag when the mbuf is queued in the firmware work queue. This means that once the inp's reference on the tag goes away and all in-flight mbufs have been queued to the firmware, tag's refcount will drop to zero and snd_tag_free will kick in and send the flush request. Note that we need to avoid doing this in the middle of ethofld_tx(), so the driver grabs a temporary reference on the tag around that loop to defer the free to the end of the function in case it sends the last mbuf to the queue after the inp has dropped its reference on the tag. - mlx5 preallocates send tags and was using the ifp pointer even when the send tag wasn't in use. Explicitly use the ifp from other data structures instead. - Sprinkle some assertions in various places to assert that received packets don't have a send tag, and that other places that overwrite rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer. Reviewed by: gallatin, hselasky, rgrimes, ae Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20117	2019-05-24 22:30:40 +00:00
Andrey V. Elsukov	90ecb41fba	Add IPv6 support for O_IPLEN opcode. Obtained from: Yandex LLC MFC after: 1 week Sponsored by: Yandex LLC	2019-04-29 09:33:16 +00:00
Kristof Provost	1c75b9d2cd	pf: No need to M_NOWAIT in DIOCRSETTFLAGS Now that we don't hold a lock during DIOCRSETTFLAGS memory allocation we can use M_WAITOK. MFC after: 1 week Event: Aberdeen hackathon 2019 Pointed out by: glebius@	2019-04-18 11:37:44 +00:00
Kristof Provost	f5e0d9fcb4	pf: Fix panic on invalid DIOCRSETTFLAGS If during DIOCRSETTFLAGS pfrio_buffer is NULL copyin() will fault, which we're not allowed to do with a lock held. We must count the number of entries in the table and release the lock during copyin(). Only then can we re-acquire the lock. Note that this is safe, because pfr_set_tflags() will check if the table and entries exist. This was discovered by a local syzkaller instance. MFC after: 1 week Event: Aberdeen hackathon 2019	2019-04-17 16:42:54 +00:00
Rodney W. Grimes	6c1c6ae537	Use IN_foo() macros from sys/netinet/in.h in place of handcrafted code There are a few places that use handcrafted versions of the macros from sys/netinet/in.h making it difficult to actually alter the values in use by these macros. Correct that by replacing handcrafted code with proper macro usage. Reviewed by: karels, kristof Approved by: bde (mentor) MFC after: 3 weeks Sponsored by: John Gilmore Differential Revision: https://reviews.freebsd.org/D19317	2019-04-04 19:01:13 +00:00
Conrad Meyer	a8a16c7128	Replace read_random(9) with more appropriate arc4rand(9) KPIs Reviewed by: ae, delphij Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D19760	2019-04-04 01:02:50 +00:00
Ed Maste	a342f5772f	pf: use UID_ROOT and GID_WHEEL named constants in make_dev No functional change but improves consistency and greppability of make_dev calls. Discussed with: kp	2019-03-26 21:20:42 +00:00
Gleb Smirnoff	97245d4074	Always create ipfw(4) hooks as long as module is loaded. Now enabling ipfw(4) with sysctls controls only linkage of hooks to default heads. When module is loaded fetch sysctls as tunables, to make it possible to boot with ipfw(4) in kernel, but not linked to any pfil(9) hooks.	2019-03-21 16:15:29 +00:00
Kristof Provost	64af73aade	pf: Ensure that IP addresses match in ICMP error packets States in pf(4) let ICMP and ICMP6 packets pass if they have a packet in their payload that matches an existing connection. It was not checked whether the outer ICMP packet has the same destination IP as the source IP of the inner protocol packet. Enforce that these addresses match, to prevent ICMP packets that do not make sense. Reported by: Nicolas Collignon, Corentin Bayet, Eloi Vanderbeken, Luca Moro at Synacktiv Obtained from: OpenBSD Security: CVE-2019-5598	2019-03-21 08:09:52 +00:00
Andrey V. Elsukov	b8c431f9c0	Do not enter epoch section recursively. A pfil hook is already invoked in NET_EPOCH section.	2019-03-20 10:11:21 +00:00
Andrey V. Elsukov	e0b7b6d465	Use NET_EPOCH instead of allocating separate one. MFC after: 1 month	2019-03-20 10:06:44 +00:00
Andrey V. Elsukov	d18c1f26a4	Reapply r345274 with build fixes for 32-bit architectures. Update NAT64LSN implementation: o most of data structures and relations were modified to be able to support large number of translation states. Now each supported protocol can use full ports range. Ports groups now belong to IPv4 alias addresses, not hosts. Each ports group can keep several states chunks. This is controlled with new `states_chunks` config option. States chunks allow to have several translation states for single alias address and port, but for different destination addresses. o by default all hash tables now use jenkins hash. o ConcurrencyKit and epoch(9) are used to make NAT64LSN lockless on fast path. o one NAT64LSN instance now can be used to handle several IPv6 prefixes, special prefix "::" value should be used for this purpose when instance is created. o due to modified internal data structures relations, the socket opcode that does states listing was changed. Obtained from: Yandex LLC MFC after: 1 month Sponsored by: Yandex LLC	2019-03-19 10:57:03 +00:00
Andrey V. Elsukov	d6369c2d18	Revert r345274. It appears that not all 32-bit architectures have necessary CK primitives.	2019-03-18 14:00:19 +00:00
Andrey V. Elsukov	d7a1cf06f3	Update NAT64LSN implementation: o most of data structures and relations were modified to be able to support large number of translation states. Now each supported protocol can use full ports range. Ports groups now belong to IPv4 alias addresses, not hosts. Each ports group can keep several states chunks. This is controlled with new `states_chunks` config option. States chunks allow to have several translation states for single alias address and port, but for different destination addresses. o by default all hash tables now use jenkins hash. o ConcurrencyKit and epoch(9) are used to make NAT64LSN lockless on fast path. o one NAT64LSN instance now can be used to handle several IPv6 prefixes, special prefix "::" value should be used for this purpose when instance is created. o due to modified internal data structures relations, the socket opcode that does states listing was changed. Obtained from: Yandex LLC MFC after: 1 month Sponsored by: Yandex LLC	2019-03-18 12:59:08 +00:00
Andrey V. Elsukov	5c04f73e07	Add NAT64 CLAT implementation as defined in RFC6877. CLAT is customer-side translator that algorithmically translates 1:1 private IPv4 addresses to global IPv6 addresses, and vice versa. It is implemented as part of ipfw_nat64 kernel module. When module is loaded or compiled into the kernel, it registers "nat64clat" external action. External action named instance can be created using `create` command and then used in ipfw rules. The create command accepts two IPv6 prefixes `plat_prefix` and `clat_prefix`. If plat_prefix is omitted, IPv6 NAT64 Well-Known prefix 64:ff9b::/96 will be used. # ipfw nat64clat CLAT create clat_prefix SRC_PFX plat_prefix DST_PFX # ipfw add nat64clat CLAT ip4 from IPv4_PFX to any out # ipfw add nat64clat CLAT ip6 from DST_PFX to SRC_PFX in Obtained from: Yandex LLC Submitted by: Boris N. Lytochkin MFC after: 1 month Relnotes: yes Sponsored by: Yandex LLC	2019-03-18 11:44:53 +00:00
Andrey V. Elsukov	002cae78da	Add SPDX-License-Identifier and update year in copyright. MFC after: 1 month	2019-03-18 10:50:32 +00:00
Andrey V. Elsukov	b11efc1eb6	Modify struct nat64_config. Add second IPv6 prefix to generic config structure and rename other fields to conform to RFC6877. Now it contains two prefixes and length: PLAT is provider-side translator that translates N:1 global IPv6 addresses to global IPv4 addresses. CLAT is customer-side translator (XLAT) that algorithmically translates 1:1 IPv4 addresses to global IPv6 addresses. Use PLAT prefix in stateless (nat64stl) and stateful (nat64lsn) translators. Modify nat64_extract_ip4() and nat64_embed_ip4() functions to accept prefix length and use plat_plen to specify prefix length. Retire net.inet.ip.fw.nat64_allow_private sysctl variable. Add NAT64_ALLOW_PRIVATE flag and use "allow_private" config option to configure this ability separately for each NAT64 instance. Obtained from: Yandex LLC MFC after: 1 month Sponsored by: Yandex LLC	2019-03-18 10:39:14 +00:00
Kristof Provost	812483c46e	pf: Rename pfsync bucket lock Previously the main pfsync lock and the bucket locks shared the same name. This led to spurious warnings from WITNESS like this: acquiring duplicate lock of same type: "pfsync" 1st pfsync @ /usr/src/sys/netpfil/pf/if_pfsync.c:1402 2nd pfsync @ /usr/src/sys/netpfil/pf/if_pfsync.c:1429 It's perfectly okay to grab both the main pfsync lock and a bucket lock at the same time. We don't need different names for each bucket lock, because we should always only acquire a single one of those at a time. MFC after: 1 week	2019-03-16 10:14:03 +00:00
Kristof Provost	5904868691	pf: Use counter(9) in pf tables. The counters of pf tables are updated outside the rule lock. That means state updates might overwrite each other. Furthermore allocation and freeing of counters happens outside the lock as well. Use counter(9) for the counters, and always allocate the counter table element, so that the race condition cannot happen any more. PR: 230619 Submitted by: Kajetan Staszkiewicz <vegeta@tuxpowered.net> Reviewed by: glebius MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D19558	2019-03-15 11:08:44 +00:00
Gleb Smirnoff	f355cb3e6f	PFIL_MEMPTR for ipfw link level hook With new pfil(9) KPI it is possible to pass a void pointer with length instead of mbuf pointer to a packet filter. Until this commit no filters supported that, so pfil ran through a shim function pfil_fake_mbuf(). Now the ipfw(4) hook named "default-link", that is instantiated when net.link.ether.ipfw sysctl is on, supports processing pointer/length packets natively. - ip_fw_args now has union for either mbuf or void *, and if flags have non-zero length, then we use the void *. - through ipfw_chk() we handle mem/mbuf cases differently. - ether_header goes away from args. It is ipfw_chk() responsibility to do parsing of Ethernet header. - ipfw_log() now uses different bpf APIs to log packets. Although ipfw_chk() is now capable of processing pointer/length packets, this commit adds support for the link level hook only, see ipfw_check_frame(). Potentially the IP processing hook ipfw_check_packet() can be improved too, but that requires more changes since the hook supports more complex actions: NAT, divert, etc. Reviewed by: ae Differential Revision: https://reviews.freebsd.org/D19357	2019-03-14 22:52:16 +00:00
Gleb Smirnoff	dc0fa4f712	Remove 'dir' argument from dummynet_io(). This makes it possible to make dn_dir flags private to dummynet. There is still some room for improvement.	2019-03-14 22:32:50 +00:00
Gleb Smirnoff	b00b7e03fd	Reduce argument list to ipfw_divert(), as args holds the rule ref and the direction. While here make 'tee' a bool.	2019-03-14 22:31:12 +00:00
Gleb Smirnoff	cef9f220cd	Remove 'dir' argument in ng_ipfw_input, since ip_fw_args now has this info. While here make 'tee' boolean.	2019-03-14 22:30:05 +00:00
Gleb Smirnoff	b7795b6746	- Add more flags to ip_fw_args. At this changeset only IPFW_ARGS_IN and IPFW_ARGS_OUT are utilized. They are intended to substitute the "dir" parameter that is often passed together with args. - Rename ip_fw_args.oif to ifp and now it is set to either input or output interface, depending on IPFW_ARGS_IN/OUT bit set.	2019-03-14 22:28:50 +00:00
Gleb Smirnoff	1830dae3d3	Make second argument of ip_divert(), that specifies packet direction a bool. This allows pf(4) to avoid including ipfw(4) private files.	2019-03-14 22:23:09 +00:00
Gleb Smirnoff	2d0232783c	Simplify ipfw_bpf_mtap2(). No functional change.	2019-03-14 22:20:48 +00:00
Andrey V. Elsukov	ca0f03e808	Add IP_FW_NAT64 to codes that ipfw_chk() can return. It will be used by upcoming NAT64 changes. We use separate code to avoid propagating EACCES error code to user level applications when NAT64 consumes a packet. Obtained from: Yandex LLC MFC after: 1 week Sponsored by: Yandex LLC	2019-03-11 10:42:09 +00:00
Andrey V. Elsukov	d76227959a	Add NULL pointer check to nat64_output(). It is possible, that a processed packet was originated by local host, in this case m->m_pkthdr.rcvif is NULL. Check and set it to V_loif to avoid NULL pointer dereference in IP input code, since it is expected that packet has valid receiving interface when netisr processes it. Obtained from: Yandex LLC MFC after: 1 week Sponsored by: Yandex LLC	2019-03-11 10:33:32 +00:00
Kristof Provost	f8e7fe32a4	pf: Fix DIOCGETSRCNODES r343295 broke DIOCGETSRCNODES by failing to reset 'nr' after counting the number of source tracking nodes. This meant that we never copied the information to userspace, leading to '? -> ?' output from pfctl. PR: 236368 MFC after: 1 week	2019-03-08 09:33:16 +00:00
Andrey V. Elsukov	83354acf5a	Fix the problem with O_LIMIT states introduced in r344018. dyn_install_state() uses `rule` pointer when it creates state. For O_LIMIT states this pointer actually is not struct ip_fw, it is pointer to O_LIMIT_PARENT state, that keeps actual pointer to ip_fw parent rule. Thus we need to cache rule id and number before calling dyn_get_parent_state(), so we can use them later when the `rule` pointer is overridden. PR: 236292 MFC after: 3 days	2019-03-07 04:40:44 +00:00
Kristof Provost	6f4909de5f	pf: IPv6 fragments with malformed extension headers could be erroneously passed by pf or cause a panic We mistakenly used the extoff value from the last packet to patch the next_header field. If a malicious host sends a chain of fragmented packets where the first packet and the final packet have different lengths or number of extension headers we'd patch the next_header at the wrong offset. This can potentially lead to panics or rule bypasses. Security: CVE-2019-5597 Obtained from: OpenBSD Reported by: Corentin Bayet, Nicolas Collignon, Luca Moro at Synacktiv	2019-03-01 07:37:45 +00:00
Kristof Provost	22c58991e3	pf: Small performance tweak Because fetching a counter is a rather expensive function we should use counter_u64_fetch() in pf_state_expires() only when necessary. A "rdr pass" rule should not cause more effort than separate "rdr" and "pass" rules. For rules with adaptive timeout values the call of counter_u64_fetch() should be accepted, but otherwise not. From the man page: The adaptive timeout values can be defined both globally and for each rule. When used on a per-rule basis, the values relate to the number of states created by the rule, otherwise to the total number of states. This handling of adaptive timeouts is done in pf_state_expires(). The calculation needs three values: start, end and states. 1. Normal rules "pass .." without adaptive setting meaning "start = 0" runs in the else-section and therefore takes "start" and "end" from the global default settings and sets "states" to pf_status.states (= total number of states). 2. Special rules like "pass .. keep state (adaptive.start 500 adaptive.end 1000)" have start != 0, run in the if-section and take "start" and "end" from the rule and set "states" to the number of states created by their rule using counter_u64_fetch(). That's all ok, but there is a third case without special handling in the above code snippet: 3. All "rdr/nat pass .." statements use together the pf_default_rule. Therefore we have "start != 0" in this case and we run the if-section but we better should run the else-section in this case and do not fetch the counter of the pf_default_rule but take the total number of states. Submitted by: Andreas Longwitz <longwitz@incore.de> MFC after: 2 weeks	2019-02-24 17:23:55 +00:00
Andrey V. Elsukov	804a6541db	Remove `set' field from state structure and use set from parent rule. Initially it was introduced because parent rule pointer could be freed, and rule's information could become inaccessible. In r341471 this was changed. And now we don't need this information, and also it can become stale. E.g. rule can be moved from one set to another. This can lead to a mismatch between the parent's set and the state's set. In this case it is possible that static rule will be freed, but dynamic state will not. This can happen when `ipfw delete set N` command is used to delete rules, that were moved to another set. To fix the problem we will use the set number from parent rule. Obtained from: Yandex LLC MFC after: 1 week Sponsored by: Yandex LLC	2019-02-11 18:10:55 +00:00
Patrick Kelsey	d178fee632	Place pf_altq_get_nth_active() under the ALTQ ifdef MFC after: 1 week	2019-02-11 05:39:38 +00:00
Patrick Kelsey	8f2ac65690	Reduce the time it takes the kernel to install a new PF config containing a large number of queues In general, the time savings come from separating the active and inactive queues lists into separate interface and non-interface queue lists, and changing the rule and queue tag management from list-based to hash-based. In HFSC, a linear scan of the class table during each queue destroy was also eliminated. There are now two new tunables to control the hash size used for each tag set (default for each is 128): net.pf.queue_tag_hashsize net.pf.rule_tag_hashsize Reviewed by: kp MFC after: 1 week Sponsored by: RG Nets Differential Revision: https://reviews.freebsd.org/D19131	2019-02-11 05:17:31 +00:00
Gleb Smirnoff	d38ca3297c	Return PFIL_CONSUMED if packet was consumed. While here gather all the identical endings of pf_check_*() into single function. PR: 235411	2019-02-02 05:49:05 +00:00
Gleb Smirnoff	2790ca97d9	Fix build without INET6.	2019-02-01 00:33:17 +00:00
Gleb Smirnoff	b252313f0b	New pfil(9) KPI together with newborn pfil API and control utility. The KPI have been reviewed and cleansed of features that were planned back 20 years ago and never implemented. The pfil(9) internals have been made opaque to protocols with only returned types and function declarations exposed. The KPI is made more strict, but at the same time more extensible, as kernel uses same command structures that userland ioctl uses. In a nutshell [KA]PI is about declaring filtering points, declaring filters and linking and unlinking them together. New [KA]PI makes it possible to reconfigure pfil(9) configuration: change order of hooks, rehook filter from one filtering point to a different one, disconnect a hook on output leaving it on input only, prepend/append a filter to existing list of filters. Now it is possible for a single packet filter to provide multiple rulesets that may be linked to different points. Think of per-interface ACLs in Cisco or Juniper. None of existing packet filters yet support that, however limited usage is already possible, e.g. default ruleset can be moved to single interface, as soon as interface would provide their filtering points. Another future feature is possibility to create pfil heads, that provide not an mbuf pointer but just a memory pointer with length. That would allow filtering at very early stages of a packet lifecycle, e.g. when packet has just been received by a NIC and no mbuf was yet allocated. Differential Revision: https://reviews.freebsd.org/D18951	2019-01-31 23:01:03 +00:00
Gleb Smirnoff	f712b16127	Revert r316461: Remove "IPFW static rules" rmlock, and use pfil's global lock. The pfil(9) system is about to be converted to epoch(9) synchronization, so we need to [temporarily] go back to ipfw internal locking. Discussed with: ae	2019-01-31 21:04:50 +00:00
Andrey V. Elsukov	7664b71b62	Fix the bug introduced in r342908, that causes problems with dynamic handling for protocols without port numbers. Since port numbers were uninitialized for protocols like ICMP/ICMPv6, ipfw_chk() used some non-zero values to create dynamic states, and due to this it failed to match replies with created states. Reported by: Oliver Hartmann, Boris Lytochkin Obtained from: Yandex LLC X-MFC after: r342908	2019-01-29 11:18:41 +00:00
Patrick Kelsey	59099cd385	Don't re-evaluate ALTQ kernel configuration due to events on non-ALTQ interfaces Re-evaluating the ALTQ kernel configuration can be expensive, particularly when there are a large number (hundreds or thousands) of queues, and is wholly unnecessary in response to events on interfaces that do not support ALTQ as such interfaces cannot be part of an ALTQ configuration. Reviewed by: kp MFC after: 1 week Sponsored by: RG Nets Differential Revision: https://reviews.freebsd.org/D18918	2019-01-28 20:26:09 +00:00
Kristof Provost	d9d146e67b	pf: Fix use-after-free of counters When cleaning up a vnet we free the counters in V_pf_default_rule and V_pf_status from shutdown_pf(), but we can still use them later, for example through pf_purge_expired_src_nodes(). Free them as the very last operation, as they rely on nothing else themselves. PR: 235097 MFC after: 1 week	2019-01-25 01:06:06 +00:00
Kristof Provost	180b0dcbbb	pf: Validate psn_len in DIOCGETSRCNODES psn_len is controlled by user space, but we allocated memory based on it. Check how much memory we might need at most (i.e. how many source nodes we have) and limit the allocation to that. Reported by: markj MFC after: 1 week	2019-01-22 02:13:33 +00:00
Kristof Provost	6a8ee0f715	pf: fix pfsync breaking carp Fix missing initialisation of sc_flags into a valid sync state on clone which breaks carp in pfsync. This regression was introduced by r342051. PR: 235005 Submitted by: smh@FreeBSD.org Pointy hat to: kp MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D18882	2019-01-18 08:19:54 +00:00
Kristof Provost	032dff662c	pf: silence a runtime warning Sometimes, for negated tables, pf can log 'pfr_update_stats: assertion failed'. This warning does not clarify anything for users, so silence it, just as OpenBSD has. PR: 234874 MFC after: 1 week	2019-01-15 08:59:51 +00:00
Andrey V. Elsukov	48266154de	Relax requirement to packet size of CARP protocol and remove version check. CARP shares protocol number 112 with VRRP (RFC 5798). And the size of VRRP packet may be smaller than CARP. ipfw_chk() does m_pullup() to at least sizeof(struct carp_header) and can fail when packet is VRRP. This leads to packet drop and message about failed pullup attempt. Also, RFC 5798 defines version 3 of VRRP protocol, this version number is also unsupported by CARP and such check leads to packet drop. carp_input() does its own checks for protocol version and packet size, so we can remove these checks to be able to pass VRRP packets. PR: 234207 MFC after: 1 week	2019-01-11 01:54:15 +00:00
Andrey V. Elsukov	3b1522c229	Fix the build with INVARIANTS. MFC after: 1 month	2019-01-10 02:01:20 +00:00
Andrey V. Elsukov	1cdf23bc03	Reduce the size of struct ip_fw_args from 240 to 128 bytes on amd64. And refactor the code to avoid unneeded initialization to reduce overhead of per-packet processing. ipfw(4) can be invoked by pfil(9) framework for each packet several times. Each call uses on-stack variable of type struct ip_fw_args to keep the state of ipfw(4) processing. Currently this variable has 240 bytes size on amd64. Each time ipfw(4) does bzero() on it, and then it initializes some fields. glebius@ has reported that they at Netflix discovered, that initialization of this variable produces significant overhead on packet processing. After patching I managed to increase performance of packet processing on simple routing with ipfw(4) firewalling by about 11%, from 9.8Mpps up to 11Mpps (Xeon E5-2660 v4@ + Mellanox 100G card). Introduced new field flags, it is used to keep track of which fields were initialized. Some fields were moved into the anonymous union, to reduce the size. They all are mutually exclusive. dummypar field was unused, and therefore it is removed. The hopstore6 field type was changed from sockaddr_in6 to a bit smaller struct ip_fw_nh6. And now the size of struct ip_fw_args is 128 bytes. ipfw_chk() was modified to properly handle ip_fw_args.flags instead of relying on checking for NULL pointers. Reviewed by: gallatin Obtained from: Yandex LLC MFC after: 1 month Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D18690	2019-01-10 01:47:57 +00:00
Gleb Smirnoff	a68cc38879	Mechanical cleanup of epoch(9) usage in network stack. - Remove macros that covertly create epoch_tracker on thread stack. Such macros are quite unsafe, e.g. will produce a buggy code if same macro is used in embedded scopes. Explicitly declare epoch_tracker always. - Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read locking macros to what they actually are - the net_epoch. Keeping them as is is very misleading. They all are named FOO_RLOCK(), while they no longer have lock semantics. Now they allow recursion and what's more important they now no longer guarantee protection against their companion WLOCK macros. Note: INP_HASH_RLOCK() has same problems, but not touched by this commit. This is non functional mechanical change. The only functionally changed functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter epoch recursively. Discussed with: jtl, gallatin	2019-01-09 01:11:19 +00:00
Kristof Provost	336683f24f	pf: Fix endless loop on NAT exhaustion with sticky-address When we try to find a source port in pf_get_sport() it's possible that all available source ports will be in use. In that case we call pf_map_addr() to try to find a new source IP to try from. If there are no more available source IPs pf_map_addr() will return 1 and we stop trying. However, if sticky-address is set we'll always return the same IP address, even if we've already tried that one. We need to check the supplied address, because if that's the one we'd set it means pf_get_sport() has already tried it, and we should error out rather than keep trying. PR: 233867 MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D18483	2018-12-12 20:15:06 +00:00
Kristof Provost	5b551954ab	pf: Prevent integer overflow in PF when calculating the adaptive timeout. Mainly states of established TCP connections would be affected resulting in immediate state removal once the number of states is bigger than adaptive.start. Disabling adaptive timeouts is a workaround to avoid this bug. Issue found and initial diff by Mathieu Blanc (mathieu.blanc at cea dot fr) Reported by: Andreas Longwitz <longwitz AT incore.de> Obtained from: OpenBSD MFC after: 2 weeks	2018-12-11 21:44:39 +00:00
Kristof Provost	4fc65bcbe3	pfsync: Performance improvement pfsync code is called for every new state, state update and state deletion in pf. While pf itself can operate on multiple states at the same time (on different cores, assuming the states hash to a different hashrow), pfsync only had a single lock. This greatly reduced throughput on multicore systems. Address this by splitting the pfsync queues into buckets, based on the state id. This ensures that updates for a given connection always end up in the same bucket, which allows pfsync to still collapse multiple updates into one, while allowing multiple cores to proceed at the same time. The number of buckets is tunable, but defaults to 2 x number of cpus. Benchmarking has shown improvement, depending on hardware and setup, from ~30% to ~100%. MFC after: 1 week Sponsored by: Orange Business Services Differential Revision: https://reviews.freebsd.org/D18373	2018-12-06 19:27:15 +00:00
Kristof Provost	2b0a4ffadb	pf: add a comment describing why do we call pf_map_addr again if port selection process fails Obtained from: OpenBSD	2018-12-06 18:58:54 +00:00
Andrey V. Elsukov	d66f9c86fa	Add ability to request listing and deleting only for dynamic states. This can be useful, when net.inet.ip.fw.dyn_keep_states is enabled, but after rules reloading some state must be deleted. Added new flag '-D' for such purpose. Retire '-e' flag, since there can not be expired states in the meaning that this flag historically had. Also add "verbose" mode for listing of dynamic states, it can be enabled with '-v' flag and adds additional information to states list. This can be useful for debugging. Obtained from: Yandex LLC MFC after: 2 months Sponsored by: Yandex LLC	2018-12-04 16:12:43 +00:00
Andrey V. Elsukov	cefe3d67e2	Reimplement how net.inet.ip.fw.dyn_keep_states works. Turning on of this feature allows to keep dynamic states when parent rule is deleted. But it works only when the default rule is "allow from any to any". Now when rule with dynamic opcode is going to be deleted, and net.inet.ip.fw.dyn_keep_states is enabled, existing states will reference named objects corresponding to this rule, and also reference the rule. And when ipfw_dyn_lookup_state() will find state for deleted parent rule, it will return the pointer to the deleted rule, that is still valid. This implementation doesn't support O_LIMIT_PARENT rules. The refcnt field was added to struct ip_fw to keep reference, also next pointer added to be able to iterate rules and not damage the content when deleted rules are chained. Named objects are referenced only when states are going to be deleted to be able to reuse kidx of named objects when new parent rules will be installed. ipfw_dyn_get_count() function was modified and now it also looks into dynamic states and constructs maps of existing named objects. This is needed to correctly export orphaned states into userland. ipfw_free_rule() was changed to be global, since now dynamic state can free rule, when it is expired and references counters becomes 1. External actions subsystem also modified, since external actions can be deregistered and instances can be destroyed. In these cases deleted rules, that are referenced by orphaned states, must be modified to prevent access to freed memory. ipfw_dyn_reset_eaction(), ipfw_reset_eaction_instance() functions added for these purposes. Obtained from: Yandex LLC MFC after: 2 months Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D17532	2018-12-04 16:01:25 +00:00
Andrey V. Elsukov	0df76496a6	Add assertion to check that named object has correct type. Obtained from: Yandex LLC MFC after: 1 week	2018-12-04 15:12:28 +00:00
Kristof Provost	b2e0b24f76	pf: Fix panic on overlapping interface names In rare situations[1] it's possible for two different interfaces to have the same name. This confuses pf, because kifs are indexed by name (which is assumed to be unique). As a result we can end up trying to if_rele(NULL), which panics. Explicitly checking the ifp pointer before if_rele() prevents the panic. Note pf will likely behave in unexpected ways on the overlapping interfaces. [1] Insert an interface in a vnet jail. Rename it to an interface which exists on the host. Remove the jail. There are now two interfaces with the same name in the host.	2018-12-01 09:58:21 +00:00
Andrey V. Elsukov	2636ba4d03	Do not limit the mbuf queue length for keepalive packets. It was unlimited before overhaul, and one user reported that this limit can be reached easily. PR: 233562 MFC after: 1 week	2018-11-27 16:51:01 +00:00
Andrey V. Elsukov	b2b5660688	Add ability to use dynamic external prefix in ipfw_nptv6 module. Now an interface name can be specified for nptv6 instance instead of ext_prefix. The module will track if_addr_ext events and when suitable IPv6 address will be added to specified interface, it will be configured as external prefix. When address disappears instance becomes unusable, i.e. it doesn't match any packets. Reviewed by: 0mp (manpages) Tested by: Dries Michiels <driesm dot michiels gmail com> MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D17765	2018-11-12 11:20:59 +00:00
Kristof Provost	87e4ca37d5	pf: Prevent tables referenced by rules in anchors from getting disabled. PR: 183198 Obtained from: OpenBSD MFC after: 2 weeks	2018-11-08 21:54:40 +00:00
Kristof Provost	58ef854f8b	pf: Fix build if INVARIANTS is not set r340061 included a number of assertions in pf_frent_remove(), but these assertions were the only use of the 'prev' variable. As a result builds without INVARIANTS had an unused variable, and failed. Reported by: vangyzen@	2018-11-02 19:23:50 +00:00
Kristof Provost	14624ab582	pf: Keep a reference to struct ifnets we're using Ensure that the struct ifnet we use can't go away until we're done with it.	2018-11-02 17:05:40 +00:00
Kristof Provost	dde6e1fecb	pfsync: Add missing unlock If we fail to set up the multicast entry for pfsync and return an error we must release the pfsync lock first. MFC after: 2 weeks Sponsored by: Orange Business Services Differential Revision: https://reviews.freebsd.org/D17506	2018-11-02 17:03:53 +00:00
Kristof Provost	04fe85f068	pfsync: Allow module to be unloaded MFC after: 2 weeks Sponsored by: Orange Business Services Differential Revision: https://reviews.freebsd.org/D17505	2018-11-02 17:01:18 +00:00
Kristof Provost	fbbf436d56	pfsync: Handle syncdev going away If the syncdev is removed we no longer need to clean up the multicast entry we've got set up for that device. Pass the ifnet detach event through pf to pfsync, and remove our multicast handle, and mark us as no longer having a syncdev. Note that this callback is always installed, even if the pfsync interface is disabled (and thus it's not a per-vnet callback pointer). MFC after: 2 weeks Sponsored by: Orange Business Services Differential Revision: https://reviews.freebsd.org/D17502	2018-11-02 16:57:23 +00:00
Kristof Provost	26549dfcad	pfsync: Ensure uninit is done before pf pfsync touches pf memory (for pf_state and the pfsync callback pointers), not the other way around. We need to ensure that pfsync is torn down before pf. MFC after: 2 weeks Sponsored by: Orange Business Services Differential Revision: https://reviews.freebsd.org/D17501	2018-11-02 16:53:15 +00:00
Kristof Provost	5f6cf24e2d	pfsync: Make pfsync callbacks per-vnet The callbacks are installed and removed depending on the state of the pfsync device, which is per-vnet. The callbacks must also be per-vnet. MFC after: 2 weeks Sponsored by: Orange Business Services Differential Revision: https://reviews.freebsd.org/D17499	2018-11-02 16:47:07 +00:00
Kristof Provost	790194cd47	pf: Limit the fragment entry queue length to 64 per bucket. So we have a global limit of 1024 fragments, but it is fine grained to the region of the packet. Smaller packets may have fewer fragments. This costs another 16 bytes of memory per reassembly and divides the worst case for searching by 8. Obtained from: OpenBSD Differential Revision: https://reviews.freebsd.org/D17734	2018-11-02 15:32:04 +00:00
Kristof Provost	fd2ea405e6	pf: Split the fragment reassembly queue into smaller parts Remember 16 entry points based on the fragment offset. Instead of a worst case of 8196 list traversals we now check a maximum of 512 list entries or 16 array elements. Obtained from: OpenBSD Differential Revision: https://reviews.freebsd.org/D17733	2018-11-02 15:26:51 +00:00
Kristof Provost	2b1c354ee6	pf: Count holes rather than fragments for reassembly Avoid traversing the list of fragment entries to check whether the pf(4) reassembly is complete. Instead count the holes that are created when inserting a fragment. If there are no holes left, the fragments are continuous. Obtained from: OpenBSD Differential Revision: https://reviews.freebsd.org/D17732	2018-11-02 15:23:57 +00:00
Kristof Provost	19a22ae313	Revert "pf: Limit the maximum number of fragments per packet" This reverts commit r337969. We'll handle this the OpenBSD way, in upcoming commits.	2018-11-02 15:01:59 +00:00
Kristof Provost	99eb00558a	pf: Make ':0' ignore link-local v6 addresses too When users mark an interface to not use aliases they likely also don't want to use the link-local v6 address there. PR: 201695 Submitted by: Russell Yount <Russell.Yount AT gmail.com> Differential Revision: https://reviews.freebsd.org/D17633	2018-10-28 05:32:50 +00:00
Eugene Grosbein	5310c19174	ipfw: implement ngtee/netgraph actions for layer-2 frames. Kernel part of ipfw does not support and ignores rules other than "pass", "deny" and dummynet-related for layer-2 (ethernet frames). Others are processed as "pass". Make it support ngtee/netgraph rules just like they are supported for IP packets. For example, this allows us to mirror some frames selectively to another interface for delivery to remote network analyzer over RSPAN vlan. Assuming ng_ipfw(4) netgraph node has a hook named "900" attached to "lower" hook of vlan900's ng_ether(4) node, that would be as simple as: ipfw add ngtee 900 ip from any to 8.8.8.8 layer2 out xmit igb0 PR: 213452 MFC after: 1 month Tested-by: Fyodor Ustinov <ufm@ufm.su>	2018-10-27 07:32:26 +00:00
Kristof Provost	13d640d376	pf: Fix copy/paste error in IPv6 address rewriting We checked the destination address, but replaced the source address. This was fixed in OpenBSD as part of their NAT rework, which we don't want to import right now. CID: 1009561 MFC after: 3 weeks	2018-10-24 00:19:44 +00:00
Kristof Provost	73c9014569	pf: ifp can never be NULL in pfi_ifaddr_event() There's no point in the NULL check for ifp, because we'll already have dereferenced it by then. Moreover, the event will always have a valid ifp. Replace the late check with an early assertion. CID: 1357338	2018-10-23 23:15:44 +00:00
Andrey V. Elsukov	ab108c4b07	Do not decrement RST life time if keep_alive is not turned on. This allows the use of different values configured by the user for the sysctl variable net.inet.ip.fw.dyn_rst_lifetime. Obtained from: Yandex LLC MFC after: 3 weeks Sponsored by: Yandex LLC	2018-10-21 16:44:57 +00:00
Andrey V. Elsukov	2ffadd56f5	Call inet_ntop() only when its result is needed. Obtained from: Yandex LLC MFC after: 3 weeks Sponsored by: Yandex LLC	2018-10-21 16:37:53 +00:00
Andrey V. Elsukov	aa2715612c	Retire IPFIREWALL_NAT64_DIRECT_OUTPUT kernel option. And add ability to switch the output method in run-time. Also document some sysctl variables that can be changed for NAT64 module. NAT64 had compile time option IPFIREWALL_NAT64_DIRECT_OUTPUT to use if_output directly from nat64 module. The netisr based output method is used by default. Now both methods can be used, but they require different handling by rules. Obtained from: Yandex LLC MFC after: 3 weeks Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D16647	2018-10-21 16:29:12 +00:00
Kristof Provost	1563a27e1f	pf synproxy will do the 3WHS on behalf of the target machine, and once the 3WHS is completed, establish the backend connection. The trigger for "3WHS completed" is the reception of the first ACK. However, we should not proceed if that ACK also has RST or FIN set. PR: 197484 Obtained from: OpenBSD MFC after: 2 weeks	2018-10-20 18:37:21 +00:00
Andrey V. Elsukov	986368d85d	Add extra parentheses to fix "versrcreach" opcode, (oif != NULL) should not be used as condition for ternary operator. Submitted by: Tatsuki Makino <tatsuki_makino at hotmail dot com> Approved by: re (kib) MFC after: 1 week	2018-10-15 10:25:34 +00:00
John-Mark Gurney	032d3aaa96	Significantly improve pf purge cpu usage by only taking locks when there is work to do. This reduces CPU consumption to one third on systems. This will help keep the thread CPU usage under control now that the default hash size has increased. Reviewed by: kp Approved by: re (kib) Differential Revision: https://reviews.freebsd.org/D17097	2018-09-16 00:44:23 +00:00
Patrick Kelsey	249cc75fd1	Extended pf(4) ioctl interface and pfctl(8) to allow bandwidths of 2^32 bps or greater to be used. Prior to this, bandwidth parameters would simply wrap at the 2^32 boundary. The computations in the HFSC scheduler and token bucket regulator have been modified to operate correctly up to at least 100 Gbps. No other algorithms have been examined or modified for correct operation above 2^32 bps (some may have existing computation resolution or overflow issues at rates below that threshold). pfctl(8) will now limit non-HFSC bandwidth parameters to 2^32 - 1 before passing them to the kernel. The extensions to the pf(4) ioctl interface have been made in a backwards-compatible way by versioning affected data structures, supporting all versions in the kernel, and implementing macros that will cause existing code that consumes that interface to use version 0 without source modifications. If version 0 consumers of the interface are used against a new kernel that has had bandwidth parameters of 2^32 or greater configured by updated tools, such bandwidth parameters will be reported as 2^32 - 1 bps by those old consumers. All in-tree consumers of the pf(4) interface have been updated. To update out-of-tree consumers to the latest version of the interface, define PFIOC_USE_LATEST ahead of any includes and use the code of pfctl(8) as a guide for the ioctls of interest. PR: 211730 Reviewed by: jmallett, kp, loos MFC after: 2 weeks Relnotes: yes Sponsored by: RG Nets Differential Revision: https://reviews.freebsd.org/D16782	2018-08-22 19:38:48 +00:00
Kristof Provost	d47023236c	pf: Limit the maximum number of fragments per packet Similar to the network stack issue fixed in r337782 pf did not limit the number of fragments per packet, which could be exploited to generate high CPU loads with a crafted series of packets. Limit each packet to no more than 64 fragments. This should be sufficient on typical networks to allow maximum-sized IP frames. This addresses the issue for both IPv4 and IPv6. MFC after: 3 days Security: CVE-2018-5391 Sponsored by: Klara Systems	2018-08-17 15:00:10 +00:00
Luiz Otavio O Souza	a0376d4d29	Fix a typo in comment. MFC after: 3 days X-MFC with: r321316 Sponsored by: Rubicon Communications, LLC (Netgate)	2018-08-15 16:36:29 +00:00
Kristof Provost	e9ddca4a40	pf: Take the IF_ADDR_RLOCK() when iterating over the group list We did do this elsewhere in pf, but the lock was missing here. Sponsored by: Essen Hackathon	2018-08-11 16:37:55 +00:00
Kristof Provost	33b242b533	pf: Fix 'set skip on' for groups The pfi_skip_if() function sometimes caused skipping of groups to work, if the members of the group used the groupname as a name prefix. This is often the case, e.g. group lo usually contains lo0, lo1, ..., but not always. Rather than relying on the name explicitly check for group memberships. Obtained from: OpenBSD (pf_if.c,v 1.62, pf_if.c,v 1.63) Sponsored by: Essen Hackathon	2018-08-11 16:34:30 +00:00
Andrey V. Elsukov	5c4aca8218	Use host byte order when comparing mss values. This fixes tcp-setmss action on little endian machines. PR: 225536 Submitted by: John Zielinski	2018-08-08 17:32:02 +00:00
Andrew Turner	5f901c92a8	Use the new VNET_DEFINE_STATIC macro when we are defining static VNET variables. Reviewed by: bz Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D16147	2018-07-24 16:35:52 +00:00
Kristof Provost	32ece669c2	pf: Fix synproxy Synproxy was accidentally broken by r335569. The 'return (action)' must be executed for every non-PF_PASS result, but the error packet (TCP RST or ICMP error) should only be sent if the packet was dropped (i.e. PF_DROP) and the return flag is set. PR: 229477 Submitted by: Andre Albsmeier <mail AT fbsd.e4m.org> MFC after: 1 week	2018-07-14 10:14:59 +00:00
Kristof Provost	3e603d1ffa	pf: Fix panic on vnet jail shutdown with synproxy When shutting down a vnet jail pf_shutdown() clears the remaining states, which through pf_clear_states() calls pf_unlink_state(). For synproxy states pf_unlink_state() will send a TCP RST, which eventually tries to schedule the pf swi in pf_send(). This means we can't remove the software interrupt until after pf_shutdown(). MFC after: 1 week	2018-07-14 09:11:32 +00:00
Andrey V. Elsukov	0a2c13d333	Use correct size when we are allocating array for skipto index. Also, there is no need to use M_ZERO for idxmap_back. It will be re-filled just after allocation in update_skipto_cache(). PR: 229665 MFC after: 1 week	2018-07-12 11:38:18 +00:00
Andrey V. Elsukov	f7c4fdee1a	Add "record-state", "set-limit" and "defer-action" rule options to ipfw. "record-state" is similar to "keep-state", but it doesn't produce implicit O_PROBE_STATE opcode in a rule. "set-limit" is like "limit", but it has the same feature as "record-state", it is single opcode without implicit O_PROBE_STATE opcode. "defer-action" is targeted to be used with dynamic states. When rule with this opcode is matched, the rule's action will not be executed, instead dynamic state will be created. And when this state will be matched by "check-state", then rule action will be executed. This allows creating more complicated rulesets. Submitted by: lev MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D1776	2018-07-09 11:35:18 +00:00
Andrew Turner	2bf9501287	Create a new macro for static DPCPU data. On arm64 (and possibly other architectures) we are unable to use static DPCPU data in kernel modules. This is because the compiler will generate PC-relative accesses, however the runtime-linker expects to be able to relocate these. In preparation to fix this create two macros depending on if the data is global or static. Reviewed by: bz, emaste, markj Sponsored by: ABT Systems Ltd Differential Revision: https://reviews.freebsd.org/D16140	2018-07-05 17:13:37 +00:00
Will Andrews	cc535c95ca	Revert r335833. Several third-parties use at least some of these ioctls. While it would be better for regression testing if they were used in base (or at least in the test suite), it's currently not worth the trouble to push through removal. Submitted by: antoine, markj	2018-07-04 03:36:46 +00:00
Will Andrews	c1887e9f09	pf: remove unused ioctls. Several ioctls are unused in pf, in the sense that no base utility references them. Additionally, a cursory review of pf-based ports indicates they're not used elsewhere either. Some of them have been unused since the original import. As far as I can tell, they're also unused in OpenBSD. Finally, removing this code removes the need for future pf work to take them into account. Reviewed by: kp Differential Revision: https://reviews.freebsd.org/D16076	2018-07-01 01:16:03 +00:00
Kristof Provost	de210decd1	pfsync: Fix state sync during initial bulk update States learned via pfsync from a peer with the same ruleset checksum were not getting assigned to rules like they should because pfsync_in_upd() wasn't passing the PFSYNC_SI_CKSUM flag along to pfsync_state_import. PR: 229092 Submitted by: Kajetan Staszkiewicz <vegeta tuxpowered.net> Obtained from: OpenBSD MFC after: 1 week Sponsored by: InnoGames GmbH	2018-06-30 12:51:08 +00:00

709 Commits