Now enabling ipfw(4) with sysctls controls only linkage of hooks to default
heads. When module is loaded fetch sysctls as tunables, to make it possible
to boot with ipfw(4) in kernel, but not linked to any pfil(9) hooks.
Update NAT64LSN implementation:
o most of data structures and relations were modified to be able support
large number of translation states. Now each supported protocol can
use full ports range. Ports groups now are belongs to IPv4 alias
addresses, not hosts. Each ports group can keep several states chunks.
This is controlled with new `states_chunks` config option. States
chunks allow to have several translation states for single alias address
and port, but for different destination addresses.
o by default all hash tables now use jenkins hash.
o ConcurrencyKit and epoch(9) is used to make NAT64LSN lockless on fast path.
o one NAT64LSN instance now can be used to handle several IPv6 prefixes,
special prefix "::" value should be used for this purpose when instance
is created.
o due to modified internal data structures relations, the socket opcode
that does states listing was changed.
Obtained from: Yandex LLC
MFC after: 1 month
Sponsored by: Yandex LLC
o most of data structures and relations were modified to be able support
large number of translation states. Now each supported protocol can
use full ports range. Ports groups now are belongs to IPv4 alias
addresses, not hosts. Each ports group can keep several states chunks.
This is controlled with new `states_chunks` config option. States
chunks allow to have several translation states for single alias address
and port, but for different destination addresses.
o by default all hash tables now use jenkins hash.
o ConcurrencyKit and epoch(9) is used to make NAT64LSN lockless on fast path.
o one NAT64LSN instance now can be used to handle several IPv6 prefixes,
special prefix "::" value should be used for this purpose when instance
is created.
o due to modified internal data structures relations, the socket opcode
that does states listing was changed.
Obtained from: Yandex LLC
MFC after: 1 month
Sponsored by: Yandex LLC
CLAT is customer-side translator that algorithmically translates 1:1
private IPv4 addresses to global IPv6 addresses, and vice versa.
It is implemented as part of ipfw_nat64 kernel module. When module
is loaded or compiled into the kernel, it registers "nat64clat" external
action. External action named instance can be created using `create`
command and then used in ipfw rules. The create command accepts two
IPv6 prefixes `plat_prefix` and `clat_prefix`. If plat_prefix is ommitted,
IPv6 NAT64 Well-Known prefix 64:ff9b::/96 will be used.
# ipfw nat64clat CLAT create clat_prefix SRC_PFX plat_prefix DST_PFX
# ipfw add nat64clat CLAT ip4 from IPv4_PFX to any out
# ipfw add nat64clat CLAT ip6 from DST_PFX to SRC_PFX in
Obtained from: Yandex LLC
Submitted by: Boris N. Lytochkin
MFC after: 1 month
Relnotes: yes
Sponsored by: Yandex LLC
Add second IPv6 prefix to generic config structure and rename another
fields to conform to RFC6877. Now it contains two prefixes and length:
PLAT is provider-side translator that translates N:1 global IPv6 addresses
to global IPv4 addresses. CLAT is customer-side translator (XLAT) that
algorithmically translates 1:1 IPv4 addresses to global IPv6 addresses.
Use PLAT prefix in stateless (nat64stl) and stateful (nat64lsn)
translators.
Modify nat64_extract_ip4() and nat64_embed_ip4() functions to accept
prefix length and use plat_plen to specify prefix length.
Retire net.inet.ip.fw.nat64_allow_private sysctl variable.
Add NAT64_ALLOW_PRIVATE flag and use "allow_private" config option to
configure this ability separately for each NAT64 instance.
Obtained from: Yandex LLC
MFC after: 1 month
Sponsored by: Yandex LLC
With new pfil(9) KPI it is possible to pass a void pointer with length
instead of mbuf pointer to a packet filter. Until this commit no filters
supported that, so pfil run through a shim function pfil_fake_mbuf().
Now the ipfw(4) hook named "default-link", that is instantiated when
net.link.ether.ipfw sysctl is on, supports processing pointer/length
packets natively.
- ip_fw_args now has union for either mbuf or void *, and if flags have
non-zero length, then we use the void *.
- through ipfw_chk() we handle mem/mbuf cases differently.
- ether_header goes away from args. It is ipfw_chk() responsibility
to do parsing of Ethernet header.
- ipfw_log() now uses different bpf APIs to log packets.
Although ipfw_chk() is now capable to process pointer/length packets,
this commit adds support for the link level hook only, see
ipfw_check_frame(). Potentially the IP processing hook ipfw_check_packet()
can be improved too, but that requires more changes since the hook
supports more complex actions: NAT, divert, etc.
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D19357
IPFW_ARGS_OUT are utilized. They are intented to substitute the "dir"
parameter that is often passes together with args.
- Rename ip_fw_args.oif to ifp and now it is set to either input or
output interface, depending on IPFW_ARGS_IN/OUT bit set.
It will be used by upcoming NAT64 changes. We use separate code
to avoid propogating EACCES error code to user level applications
when NAT64 consumes a packet.
Obtained from: Yandex LLC
MFC after: 1 week
Sponsored by: Yandex LLC
It is possible, that a processed packet was originated by local host,
in this case m->m_pkthdr.rcvif is NULL. Check and set it to V_loif to
avoid NULL pointer dereference in IP input code, since it is expected
that packet has valid receiving interface when netisr processes it.
Obtained from: Yandex LLC
MFC after: 1 week
Sponsored by: Yandex LLC
dyn_install_state() uses `rule` pointer when it creates state.
For O_LIMIT states this pointer actually is not struct ip_fw,
it is pointer to O_LIMIT_PARENT state, that keeps actual pointer
to ip_fw parent rule. Thus we need to cache rule id and number
before calling dyn_get_parent_state(), so we can use them later
when the `rule` pointer is overrided.
PR: 236292
MFC after: 3 days
Initially it was introduced because parent rule pointer could be freed,
and rule's information could become inaccessible. In r341471 this was
changed. And now we don't need this information, and also it can become
stale. E.g. rule can be moved from one set to another. This can lead
to parent's set and state's set will not match. In this case it is
possible that static rule will be freed, but dynamic state will not.
This can happen when `ipfw delete set N` command is used to delete
rules, that were moved to another set.
To fix the problem we will use the set number from parent rule.
Obtained from: Yandex LLC
MFC after: 1 week
Sponsored by: Yandex LLC
The KPI have been reviewed and cleansed of features that were planned
back 20 years ago and never implemented. The pfil(9) internals have
been made opaque to protocols with only returned types and function
declarations exposed. The KPI is made more strict, but at the same time
more extensible, as kernel uses same command structures that userland
ioctl uses.
In nutshell [KA]PI is about declaring filtering points, declaring
filters and linking and unlinking them together.
New [KA]PI makes it possible to reconfigure pfil(9) configuration:
change order of hooks, rehook filter from one filtering point to a
different one, disconnect a hook on output leaving it on input only,
prepend/append a filter to existing list of filters.
Now it possible for a single packet filter to provide multiple rulesets
that may be linked to different points. Think of per-interface ACLs in
Cisco or Juniper. None of existing packet filters yet support that,
however limited usage is already possible, e.g. default ruleset can
be moved to single interface, as soon as interface would pride their
filtering points.
Another future feature is possiblity to create pfil heads, that provide
not an mbuf pointer but just a memory pointer with length. That would
allow filtering at very early stages of a packet lifecycle, e.g. when
packet has just been received by a NIC and no mbuf was yet allocated.
Differential Revision: https://reviews.freebsd.org/D18951
The pfil(9) system is about to be converted to epoch(9) synchronization, so
we need [temporarily] go back with ipfw internal locking.
Discussed with: ae
handling for protocols without ports numbers.
Since port numbers were uninitialized for protocols like ICMP/ICMPv6,
ipfw_chk() used some non-zero values to create dynamic states, and due
this it failed to match replies with created states.
Reported by: Oliver Hartmann, Boris Lytochkin
Obtained from: Yandex LLC
X-MFC after: r342908
CARP shares protocol number 112 with VRRP (RFC 5798). And the size of
VRRP packet may be smaller than CARP. ipfw_chk() does m_pullup() to at
least sizeof(struct carp_header) and can fail when packet is VRRP. This
leads to packet drop and message about failed pullup attempt.
Also, RFC 5798 defines version 3 of VRRP protocol, this version number
also unsupported by CARP and such check leads to packet drop.
carp_input() does its own checks for protocol version and packet size,
so we can remove these checks to be able pass VRRP packets.
PR: 234207
MFC after: 1 week
And refactor the code to avoid unneeded initialization to reduce overhead
of per-packet processing.
ipfw(4) can be invoked by pfil(9) framework for each packet several times.
Each call uses on-stack variable of type struct ip_fw_args to keep the
state of ipfw(4) processing. Currently this variable has 240 bytes size
on amd64. Each time ipfw(4) does bzero() on it, and then it initializes
some fields.
glebius@ has reported that they at Netflix discovered, that initialization
of this variable produces significant overhead on packet processing.
After patching I managed to increase performance of packet processing on
simple routing with ipfw(4) firewalling to about 11% from 9.8Mpps up to
11Mpps (Xeon E5-2660 v4@ + Mellanox 100G card).
Introduced new field flags, it is used to keep track of what fields was
initialized. Some fields were moved into the anonymous union, to reduce
the size. They all are mutually exclusive. dummypar field was unused, and
therefore it is removed. The hopstore6 field type was changed from
sockaddr_in6 to a bit smaller struct ip_fw_nh6. And now the size of struct
ip_fw_args is 128 bytes.
ipfw_chk() was modified to properly handle ip_fw_args.flags instead of
rely on checking for NULL pointers.
Reviewed by: gallatin
Obtained from: Yandex LLC
MFC after: 1 month
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D18690
This can be useful, when net.inet.ip.fw.dyn_keep_states is enabled, but
after rules reloading some state must be deleted. Added new flag '-D'
for such purpose.
Retire '-e' flag, since there can not be expired states in the meaning
that this flag historically had.
Also add "verbose" mode for listing of dynamic states, it can be enabled
with '-v' flag and adds additional information to states list. This can
be useful for debugging.
Obtained from: Yandex LLC
MFC after: 2 months
Sponsored by: Yandex LLC
Turning on of this feature allows to keep dynamic states when parent
rule is deleted. But it works only when the default rule is
"allow from any to any".
Now when rule with dynamic opcode is going to be deleted, and
net.inet.ip.fw.dyn_keep_states is enabled, existing states will reference
named objects corresponding to this rule, and also reference the rule.
And when ipfw_dyn_lookup_state() will find state for deleted parent rule,
it will return the pointer to the deleted rule, that is still valid.
This implementation doesn't support O_LIMIT_PARENT rules.
The refcnt field was added to struct ip_fw to keep reference, also
next pointer added to be able iterate rules and not damage the content
when deleted rules are chained.
Named objects are referenced only when states are going to be deleted to
be able reuse kidx of named objects when new parent rules will be
installed.
ipfw_dyn_get_count() function was modified and now it also looks into
dynamic states and constructs maps of existing named objects. This is
needed to correctly export orphaned states into userland.
ipfw_free_rule() was changed to be global, since now dynamic state can
free rule, when it is expired and references counters becomes 1.
External actions subsystem also modified, since external actions can be
deregisterd and instances can be destroyed. In these cases deleted rules,
that are referenced by orphaned states, must be modified to prevent access
to freed memory. ipfw_dyn_reset_eaction(), ipfw_reset_eaction_instance()
functions added for these purposes.
Obtained from: Yandex LLC
MFC after: 2 months
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D17532
Now an interface name can be specified for nptv6 instance instead of
ext_prefix. The module will track if_addr_ext events and when suitable
IPv6 address will be added to specified interface, it will be configured
as external prefix. When address disappears instance becomes unusable,
i.e. it doesn't match any packets.
Reviewed by: 0mp (manpages)
Tested by: Dries Michiels <driesm dot michiels gmail com>
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D17765
Kernel part of ipfw does not support and ignores rules other than
"pass", "deny" and dummynet-related for layer-2 (ethernet frames).
Others are processed as "pass".
Make it support ngtee/netgraph rules just like they are supported
for IP packets. For example, this allows us to mirror some frames
selectively to another interface for delivery to remote network analyzer
over RSPAN vlan. Assuming ng_ipfw(4) netgraph node has a hook named "900"
attached to "lower" hook of vlan900's ng_ether(4) node, that would be
as simple as:
ipfw add ngtee 900 ip from any to 8.8.8.8 layer2 out xmit igb0
PR: 213452
MFC after: 1 month
Tested-by: Fyodor Ustinov <ufm@ufm.su>
This allows use differen values configured by user for sysctl variable
net.inet.ip.fw.dyn_rst_lifetime.
Obtained from: Yandex LLC
MFC after: 3 weeks
Sponsored by: Yandex LLC
to switch the output method in run-time. Also document some sysctl
variables that can by changed for NAT64 module.
NAT64 had compile time option IPFIREWALL_NAT64_DIRECT_OUTPUT to use
if_output directly from nat64 module. By default is used netisr based
output method. Now both methods can be used, but they require different
handling by rules.
Obtained from: Yandex LLC
MFC after: 3 weeks
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D16647
not be used as condition for ternary operator.
Submitted by: Tatsuki Makino <tatsuki_makino at hotmail dot com>
Approved by: re (kib)
MFC after: 1 week
Also, there is no need to use M_ZERO for idxmap_back. It will be
re-filled just after allocation in update_skipto_cache().
PR: 229665
MFC after: 1 week
"record-state" is similar to "keep-state", but it doesn't produce implicit
O_PROBE_STATE opcode in a rule. "set-limit" is like "limit", but it has the
same feature as "record-state", it is single opcode without implicit
O_PROBE_STATE opcode. "defer-action" is targeted to be used with dynamic
states. When rule with this opcode is matched, the rule's action will
not be executed, instead dynamic state will be created. And when this
state will be matched by "check-state", then rule action will be executed.
This allows create a more complicated rulesets.
Submitted by: lev
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D1776
On arm64 (and possible other architectures) we are unable to use static
DPCPU data in kernel modules. This is because the compiler will generate
PC-relative accesses, however the runtime-linker expects to be able to
relocate these.
In preparation to fix this create two macros depending on if the data is
global or static.
Reviewed by: bz, emaste, markj
Sponsored by: ABT Systems Ltd
Differential Revision: https://reviews.freebsd.org/D16140
Using of rwlock with multiqueue NICs for IP forwarding on high pps
produces high lock contention and inefficient. Rmlock fits better for
such workloads.
Reviewed by: melifaro, olivier
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D15789
Per-cpu zone allocations are very rarely done compared to regular zones.
The intent is to avoid pessimizing the latter case with per-cpu specific
code.
In particular contrary to the claim in r334824, M_ZERO is sometimes being
used for such zones. But the zeroing method is completely different and
braching on it in the fast path for regular zones is a waste of time.
This feature is disabled by default and was removed when dynamic states
implementation changed to be lockless. Now it is reimplemented with small
differences - when dyn_keep_states sysctl variable is enabled,
dyn_match_ipv[46]_state() function doesn't match child states of deleted
rule. And thus they are keept alive until expired. ipfw_dyn_lookup_state()
function does check that state was not orphaned, and if so, it returns
pointer to default_rule and its position in the rules map. The main visible
difference is that orphaned states still have the same rule number that
they have before parent rule deleted, because now a state has many fields
related to rule and changing them all atomically to point to default_rule
seems hard enough.
Reported by: <lantw44 at gmail.com>
MFC after: 2 days
dyn_lookup_ipv[46]_state_locked(). These checks are remnants of not
ready to be committed code, and they are there by accident.
Due to the race these checks can lead to creating of duplicate states
when concurrent threads in the same time will try to add state for two
packets of the same flow, but in reverse directions and matched by
different parent rules.
Reported by: lev
MFC after: 3 days
o Modify ipfw(8) to be able set any prefix6 not just Well-Known,
and also show configured prefix6;
o relocate some definitions and macros into proper place;
o convert nat64_debug and nat64_allow_private variables to be
VNET-compatible;
o add struct nat64_config that keeps generic configuration needed
to NAT64 code;
o add nat64_check_prefix6() function to check validness of specified
by user IPv6 prefix according to RFC6052;
o use nat64_check_private_ip4() and nat64_embed_ip4() functions
instead of nat64_get_ip4() and nat64_set_ip4() macros. This allows
to use any configured IPv6 prefixes that are allowed by RFC6052;
o introduce NAT64_WKPFX flag, that is set when IPv6 prefix is
Well-Known IPv6 prefix. It is used to reduce overhead to check this;
o modify nat64lsn_cfg and nat64stl_cfg structures to use nat64_config
structure. And respectivelly modify the rest of code;
o remove now unused ro argument from nat64_output() function;
o remove __FreeBSD_version ifdef, NAT64 was not merged to older versions;
o add commented -DIPFIREWALL_NAT64_DIRECT_OUTPUT flag to module's Makefile
as example.
Obtained from: Yandex LLC
MFC after: 1 month
Sponsored by: Yandex LLC