Now enabling ipfw(4) with sysctls controls only linkage of hooks to default
heads. When module is loaded fetch sysctls as tunables, to make it possible
to boot with ipfw(4) in kernel, but not linked to any pfil(9) hooks.
With new pfil(9) KPI it is possible to pass a void pointer with length
instead of mbuf pointer to a packet filter. Until this commit no filters
supported that, so pfil run through a shim function pfil_fake_mbuf().
Now the ipfw(4) hook named "default-link", that is instantiated when
net.link.ether.ipfw sysctl is on, supports processing pointer/length
packets natively.
- ip_fw_args now has union for either mbuf or void *, and if flags have
non-zero length, then we use the void *.
- through ipfw_chk() we handle mem/mbuf cases differently.
- ether_header goes away from args. It is ipfw_chk() responsibility
to do parsing of Ethernet header.
- ipfw_log() now uses different bpf APIs to log packets.
Although ipfw_chk() is now capable to process pointer/length packets,
this commit adds support for the link level hook only, see
ipfw_check_frame(). Potentially the IP processing hook ipfw_check_packet()
can be improved too, but that requires more changes since the hook
supports more complex actions: NAT, divert, etc.
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D19357
IPFW_ARGS_OUT are utilized. They are intented to substitute the "dir"
parameter that is often passes together with args.
- Rename ip_fw_args.oif to ifp and now it is set to either input or
output interface, depending on IPFW_ARGS_IN/OUT bit set.
The pfil(9) system is about to be converted to epoch(9) synchronization, so
we need [temporarily] go back with ipfw internal locking.
Discussed with: ae
handling for protocols without ports numbers.
Since port numbers were uninitialized for protocols like ICMP/ICMPv6,
ipfw_chk() used some non-zero values to create dynamic states, and due
this it failed to match replies with created states.
Reported by: Oliver Hartmann, Boris Lytochkin
Obtained from: Yandex LLC
X-MFC after: r342908
CARP shares protocol number 112 with VRRP (RFC 5798). And the size of
VRRP packet may be smaller than CARP. ipfw_chk() does m_pullup() to at
least sizeof(struct carp_header) and can fail when packet is VRRP. This
leads to packet drop and message about failed pullup attempt.
Also, RFC 5798 defines version 3 of VRRP protocol, this version number
also unsupported by CARP and such check leads to packet drop.
carp_input() does its own checks for protocol version and packet size,
so we can remove these checks to be able pass VRRP packets.
PR: 234207
MFC after: 1 week
And refactor the code to avoid unneeded initialization to reduce overhead
of per-packet processing.
ipfw(4) can be invoked by pfil(9) framework for each packet several times.
Each call uses on-stack variable of type struct ip_fw_args to keep the
state of ipfw(4) processing. Currently this variable has 240 bytes size
on amd64. Each time ipfw(4) does bzero() on it, and then it initializes
some fields.
glebius@ has reported that they at Netflix discovered, that initialization
of this variable produces significant overhead on packet processing.
After patching I managed to increase performance of packet processing on
simple routing with ipfw(4) firewalling to about 11% from 9.8Mpps up to
11Mpps (Xeon E5-2660 v4@ + Mellanox 100G card).
Introduced new field flags, it is used to keep track of what fields was
initialized. Some fields were moved into the anonymous union, to reduce
the size. They all are mutually exclusive. dummypar field was unused, and
therefore it is removed. The hopstore6 field type was changed from
sockaddr_in6 to a bit smaller struct ip_fw_nh6. And now the size of struct
ip_fw_args is 128 bytes.
ipfw_chk() was modified to properly handle ip_fw_args.flags instead of
rely on checking for NULL pointers.
Reviewed by: gallatin
Obtained from: Yandex LLC
MFC after: 1 month
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D18690
not be used as condition for ternary operator.
Submitted by: Tatsuki Makino <tatsuki_makino at hotmail dot com>
Approved by: re (kib)
MFC after: 1 week
"record-state" is similar to "keep-state", but it doesn't produce implicit
O_PROBE_STATE opcode in a rule. "set-limit" is like "limit", but it has the
same feature as "record-state", it is single opcode without implicit
O_PROBE_STATE opcode. "defer-action" is targeted to be used with dynamic
states. When rule with this opcode is matched, the rule's action will
not be executed, instead dynamic state will be created. And when this
state will be matched by "check-state", then rule action will be executed.
This allows create a more complicated rulesets.
Submitted by: lev
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D1776
ip_reass() expects IPv4 packet and will just corrupt any IPv6 packets
that it gets. Until proper IPv6 fragments handling function will be
implemented, pass IPv6 packets to next rule.
PR: 170604
MFC after: 1 week
o added struct ipfw_dyn_info that keeps all needed for ipfw_chk and
for dynamic states implementation information;
o added DYN_LOOKUP_NEEDED() macro that can be used to determine the
need of new lookup of dynamic states;
o ipfw_dyn_rule now becomes obsolete. Currently it used to pass
information from kernel to userland only.
o IPv4 and IPv6 states now described by different structures
dyn_ipv4_state and dyn_ipv6_state;
o IPv6 scope zones support is added;
o ipfw(4) now depends from Concurrency Kit;
o states are linked with "entry" field using CK_SLIST. This allows
lockless lookup and protected by mutex modifications.
o the "expired" SLIST field is used for states expiring.
o struct dyn_data is used to keep generic information for both IPv4
and IPv6;
o struct dyn_parent is used to keep O_LIMIT_PARENT information;
o IPv4 and IPv6 states are stored in different hash tables;
o O_LIMIT_PARENT states now are kept separately from O_LIMIT and
O_KEEP_STATE states;
o per-cpu dyn_hp pointers are used to implement hazard pointers and they
prevent freeing states that are locklessly used by lookup threads;
o mutexes to protect modification of lists in hash tables now kept in
separate arrays. 65535 limit to maximum number of hash buckets now
removed.
o Separate lookup and install functions added for IPv4 and IPv6 states
and for parent states.
o By default now is used Jenkinks hash function.
Obtained from: Yandex LLC
MFC after: 42 days
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D12685
specified in the arg1 into ICMPv6 destination unreachable code according
to RFC7915.
Obtained from: Yandex LLC
MFC after: 2 weeks
Sponsored by: Yandex LLC
Now it is possible to use UDPLite's port numbers in rules,
create dynamic states for UDPLite packets and see "UDPLite" for matched
packets in log.
Obtained from: Yandex LLC
MFC after: 2 weeks
Sponsored by: Yandex LLC
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
No functional change intended.
This is similar to the TCP case. where a TCP RST segment can be sent.
There is one limitation: When sending an ABORT in response to an incoming
packet, it should be tested if there is no ABORT chunk in the received
packet. Currently, it is only checked if the first chunk is an ABORT
chunk to avoid parsing the whole packet, which could result in a DOS attack.
Thanks to Timo Voelker for helping me to test this patch.
Reviewed by: bcr@ (man page part), ae@ (generic, non-SCTP part)
Differential Revision: https://reviews.freebsd.org/D13239
Hide the locking logic used in the dynamic states implementation from
generic code. Rename ipfw_install_state() and ipfw_lookup_dyn_rule()
function to have similar names: ipfw_dyn_install_state() and
ipfw_dyn_lookup_state(). Move dynamic rule counters updating to the
ipfw_dyn_lookup_state() function. Now this function return NULL when
there is no state and pointer to the parent rule when state is found.
Thus now there is no need to return pointer to dynamic rule, and no need
to hold bucket lock for this state. Remove ipfw_dyn_unlock() function.
Obtained from: Yandex LLC
MFC after: 1 week
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D11657
Do not invoke IPv4 NAT handler for non IPv4 packets. Libalias expects
a packet is IPv4. And in case when it is IPv6, it just translates them
as IPv4. This leads to corruption and in some cases to panics.
In particular a panic can happen when value of ip6_plen modified to
something that leads to IP fragmentation, but actual packet length does
not match the IP length.
Packets that are not IPv4 will be dropped by NAT rule.
Reported by: Viktor Dukhovni <freebsd at dukhovni dot org>
MFC after: 1 week
IPsec support can be loaded as kernel module, thus do not depend from
kernel option IPSEC and always build O_IPSEC opcode implementation as
enabled.
Obtained from: Yandex LLC
MFC after: 1 week
Sponsored by: Yandex LLC
To properly handle 'fwd tablearg,port' opcode, copy sin_port value from
sockaddr_in structure stored in the opcode into corresponding hopstore
field.
PR: 222953
MFC after: 1 week
Previously, GRE packets in IPv6 tunnels would be dropped by IPFW (unless
net.inet6.ip6.fw.deny_unknown_exthdrs was unset).
PR: 220640
Submitted by: Kun Xie <kxie@xiplink.com>
MFC after: 1 week
Make PFIL's lock global and use it for this purpose.
This reduces the number of locks needed to acquire for each packet.
Obtained from: Yandex LLC
MFC after: 2 weeks
Sponsored by: Yandex LLC
No objection from: #network
Differential Revision: https://reviews.freebsd.org/D10154
external action is completed, but the rule search is continued.
External action handler can change the content of @args argument,
that is used for dynamic state lookup. Enforce the new lookup to be able
install new state, when the search is continued.
Obtained from: Yandex LLC
MFC after: 1 week
Sponsored by: Yandex LLC
o check the size of O_IP_SRC_LOOKUP opcode, it can not exceed the size of
ipfw_insn_u32;
o rename ipfw_lookup_table_extended() function into ipfw_lookup_table() and
remove old ipfw_lookup_table();
o use args->f_id.flow_id6 that is in host byte order to get DSCP value;
o add SCTP ports support to 'lookup src/dst-port' opcode;
o add IPv6 support to 'lookup src/dst-ip' opcode.
PR: 217292
Reviewed by: melifaro
MFC after: 2 weeks
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D9873
in valuestate array.
When opcode has size equal to ipfw_insn_u32, this means that it should
additionally match value specified in d[0] with table entry value.
ipfw_table_lookup() returns table value index, use TARG_VAL() macro to
convert it to its value. The actual 32-bit value stored in the tag field
of table_value structure, where all unspecified u32 values are kept.
PR: 217262
Reviewed by: melifaro
MFC after: 1 week
Sponsored by: Yandex LLC
Consider the rule matching when both @done and @retval values
returned from ipfw_run_eaction() are zero. And modify ipfw_nptv6()
to return IP_FW_DENY and @done=0 when addresses do not match.
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
* make interface cloner VNET-aware;
* simplify cloner code and use if_clone_simple();
* migrate LOGIF_LOCK() to rmlock;
* add ipfw_bpf_mtap2() function to pass mbuf to BPF;
* introduce new additional ipfwlog0 pseudo interface. It differs from
ipfw0 by DLT type used in bpfattach. This interface is intended to
used by ipfw modules to dump packets with additional info attached.
Currently pflog format is used. ipfw_bpf_mtap2() function uses second
argument to determine which interface use for dumping. If dlen is equal
to ETHER_HDR_LEN it uses old ipfw0 interface, if dlen is equal to
PFLOG_HDRLEN - ipfwlog0 will be used.
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
Now zero value of arg1 used to specify "tablearg", use the old "tablearg"
value for "nat global". Introduce new macro IP_FW_NAT44_GLOBAL to replace
hardcoded magic number to specify "nat global". Also replace 65535 magic
number with corresponding macro. Fix typo in comments.
PR: 211256
Tested by: Victor Chernov
MFC after: 3 days
The keep-state, limit and check-state now will have additional argument
flowname. This flowname will be assigned to dynamic rule by keep-state
or limit opcode. And then can be matched by check-state opcode or
O_PROBE_STATE internal opcode. To reduce possible breakage and to maximize
compatibility with old rulesets default flowname introduced.
It will be assigned to the rules when user has omitted state name in
keep-state and check-state opcodes. Also if name is ambiguous (can be
evaluated as rule opcode) it will be replaced to default.
Reviewed by: julian
Obtained from: Yandex LLC
MFC after: 1 month
Relnotes: yes
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D6674
per-VNET initialisation and virtualise the interface cloning to
allow a dedicated ipfw log interface per VNET.
Approved by: re (gjb)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
than removing the network interfaces first. This change is rather larger
and convoluted as the ordering requirements cannot be separated.
Move the pfil(9) framework to SI_SUB_PROTO_PFIL, move Firewalls and
related modules to their own SI_SUB_PROTO_FIREWALL.
Move initialization of "physical" interfaces to SI_SUB_DRIVERS,
move virtual (cloned) interfaces to SI_SUB_PSEUDO.
Move Multicast to SI_SUB_PROTO_MC.
Re-work parts of multicast initialisation and teardown, not taking the
huge amount of memory into account if used as a module yet.
For interface teardown we try to do as many of them as we can on
SI_SUB_INIT_IF, but for some this makes no sense, e.g., when tunnelling
over a higher layer protocol such as IP. In that case the interface
has to go along (or before) the higher layer protocol is shutdown.
Kernel hhooks need to go last on teardown as they may be used at various
higher layers and we cannot remove them before we cleaned up the higher
layers.
For interface teardown there are multiple paths:
(a) a cloned interface is destroyed (inside a VIMAGE or in the base system),
(b) any interface is moved from a virtual network stack to a different
network stack ("vmove"), or (c) a virtual network stack is being shut down.
All code paths go through if_detach_internal() where we, depending on the
vmove flag or the vnet state, make a decision on how much to shut down;
in case we are destroying a VNET the individual protocol layers will
cleanup their own parts thus we cannot do so again for each interface as
we end up with, e.g., double-frees, destroying locks twice or acquiring
already destroyed locks.
When calling into protocol cleanups we equally have to tell them
whether they need to detach upper layer protocols ("ulp") or not
(e.g., in6_ifdetach()).
Provide or enahnce helper functions to do proper cleanup at a protocol
rather than at an interface level.
Approved by: re (hrs)
Obtained from: projects/vnet
Reviewed by: gnn, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D6747
It allows implementing loadable kernel modules with new actions and
without needing to modify kernel headers and ipfw(8). The module
registers its action handler and keyword string, that will be used
as action name. Using generic syntax user can add rules with this
action. Also ipfw(8) can be easily modified to extend basic syntax
for external actions, that become a part base system.
Sample modules will coming soon.
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
panics when unloading the dummynet and IPFW modules:
- The callout drain function can sleep and should not be called having
a non-sleepable lock locked. Remove locks around "ipfw_dyn_uninit(0)".
- Add a new "dn_gone" variable to prevent asynchronous restart of
dummynet callouts when unloading the dummynet kernel module.
- Call "dn_reschedule()" locked so that "dn_gone" can be set and
checked atomically with regard to starting a new callout.
Reviewed by: hiren
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D3855
Vast majority of rtalloc(9) users require only basic info from
route table (e.g. "does the rtentry interface match with the interface
I have?". "what is the MTU?", "Give me the IPv4 source address to use",
etc..).
Instead of hand-rolling lookups, checking if rtentry is up, valid,
dealing with IPv6 mtu, finding "address" ifp (almost never done right),
provide easy-to-use API hiding all the complexity and returning the
needed info into small on-stack structure.
This change also helps hiding route subsystem internals (locking, direct
rtentry accesses).
Additionaly, using this API improves lookup performance since rtentry is not
locked.
(This is safe, since all the rtentry changes happens under both radix WLOCK
and rtentry WLOCK).
Sponsored by: Yandex LLC
Skip checks for IPv6 multicast addresses.
Use in6_localip() for global unicast.
And for IPv6 link-local addresses do search in the IPv6 addresses list.
Since LLA are stored in the kernel internal form, use
IN6_ARE_MASKED_ADDR_EQUAL() macro with lla_mask for addresses comparison.
lla_mask has zero bits in the second word, where we keep sin6_scope_id.
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
Currently we have tables identified by their names in userland
with internal kernel-assigned indices. This works the following way:
When userland wishes to communicate with kernel to add or change rule(s),
it makes indexed sorted array of table names
(internally ipfw_obj_ntlv entries), and refer to indices in that
array in rule manipulation.
Prior to committing new rule to the ruleset kernel
a) finds all referenced tables, bump their refcounts and change
values inside the opcodes to be real kernel indices
b) auto-creates all referenced but not existing tables and then
do a) for them.
Kernel does almost the same when exporting rules to userland:
prepares array of used tables in all rules in range, and
prepends it before the actual ruleset retaining actual in-kernel
indexes for that.
There is also special translation layer for legacy clients which is
able to provide 'real' indices for table names (basically doing atoi()).
While it is arguable that every subsystem really needs names instead of
numbers, there are several things that should be noted:
1) every non-singleton subsystem needs to store its runtime state
somewhere inside ipfw chain (and be able to get it fast)
2) we can't assume object numbers provided by humans will be dense.
Existing nat implementation (O(n) access and LIST inside chain) is a
good example.
Hence the following:
* Convert table-centric rewrite code to be more generic, callback-based
* Move most of the code from ip_fw_table.c to ip_fw_sockopt.c
* Provide abstract API to permit subsystems convert their objects
between userland string identifier and in-kernel index.
(See struct opcode_obj_rewrite) for more details
* Create another per-chain index (in next commit) shared among all subsystems
* Convert current NAT44 implementation to use new API, O(1) lookups,
shared index and names instead of numbers (in next commit).
Sponsored by: Yandex LLC