Improve caching behaviour by using counter_u64 rather than variables
shared between cores.
The result of converting all counters to counter(9) (i.e. this full
patch series) is a significant improvement in throughput. As tested by
olivier@, on Intel Xeon E5-2697Av4 (16Cores, 32 threads) hardware with
Mellanox ConnectX-4 MCX416A-CCAT (100GBase-SR4) nics we see:
x FreeBSD 20201223: inet packets-per-second
+ FreeBSD 20201223 with pf patches: inet packets-per-second
+--------------------------------------------------------------------------+
| + |
| xx + |
|xxx +++|
||A| |
| |A||
+--------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 5 9216962 9526356 9343902 9371057.6 116720.36
+ 5 19427190 19698400 19502922 19546509 109084.92
Difference at 95.0% confidence
1.01755e+07 +/- 164756
108.584% +/- 2.9359%
(Student's t, pooled s = 112967)
Reviewed by: philip
MFC after: 2 weeks
Sponsored by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D27763
Factor out allocating and freeing pfi_kkif structures. This will be
useful when we change the counters to be counter_u64, so we don't have
to deal with that complexity in the multiple locations where we allocate
pfi_kkif structures.
No functional change.
MFC after: 2 weeks
Sponsored by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D27762
This improves the cache behaviour of pf and results in improved
throughput.
MFC after: 2 weeks
Sponsored by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D27760
As part of the split between user and kernel mode structures we're
moving all user space usable definitions into pf.h.
No functional change intended.
MFC after: 2 weeks
Sponsored by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D27757
Introduce a kernel version of struct pf_src_node (pf_ksrc_node).
This will allow us to improve the in-kernel data structure without
breaking userspace compatibility.
Reviewed by: philip
MFC after: 2 weeks
Sponsored by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D27707
This improves cache behaviour by not writing to the same variable from
multiple cores simultaneously.
pf_state is only used in the kernel, so can be safely modified.
Reviewed by: Lutz Donnerhacke, philip
MFC after: 1 week
Sponsed by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D27661
The algorithm we use to update checksums only works correctly if the
updated data is aligned on 16-bit boundaries (relative to the start of
the packet).
Import the OpenBSD fix for this issue.
PR: 240416
Obtained from: OpenBSD
MFC after: 1 week
Reviewed by: tuexen (previous version)
Differential Revision: https://reviews.freebsd.org/D27696
Mark request_maxcount as RWTUN so we can set it both at runtime and from
loader.conf. This avoids usings getting caught out by the change from tunable
to run time configuration.
Suggested by: Franco Fichtner
MFC after: 3 days
The hme (Happy Meal Ethernet) driver was the onboard NIC in most
supported sparc64 platforms. A few PCI NICs do exist, but we have seen
no evidence of use on non-sparc systems.
Reviewed by: imp, emaste, bcr
Sponsored by: DARPA
When updating a table, pf will keep existing table entry structures
corresponding to addresses that are in both of the old and new tables.
However, the update may also enable or disable per-entry counters which
are allocated separately. Thus when toggling PFR_TFLAG_COUNTERS, the
entries may be missing counters or may have unused counters allocated.
Fix the problem by modifying pfr_ina_commit() to transfer counters
from or to entries in the shadow table.
PR: 251414
Reported by: sigsys@gmail.com
Reviewed by: kp
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27440
tagname2tag() hashes the tag name before truncating it to 63 characters.
tag_unref() removes the tag from the name hash by computing the hash
over the truncated name. Ensure that both operations compute the same
hash for a given tag.
The larger issue is a lack of string validation in pf(4) ioctl handlers.
This is intended to be fixed with some future work, but an extra safety
belt in tagname2hashindex() is worthwhile regardless.
Reported by: syzbot+a0988828aafb00de7d68@syzkaller.appspotmail.com
Reviewed by: kp
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27346
We never set PFRULE_RULESRCTRACK when calling pf_insert_src_node(). We do set
PFRULE_SRCTRACK, so update the assertion to match.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27254
Even if a kif doesn't have an ifp or if_group pointer we still can't delete it
if it's referenced by a rule. In other words: we must check rulerefs as well.
While we're here also teach pfi_kif_unref() not to remove kifs with flags.
Reported-by: syzbot+b31d1d7e12c5d4d42f28@syzkaller.appspotmail.com
MFC after: 2 weeks
If userspace tries to set flags (e.g. 'set skip on <ifspec>') and <ifspec>
doesn't exist we should create a kif so that we apply the flags when the
<ifspec> does turn up.
Otherwise we'd end up in surprising situations where the rules say the
interface should be skipped, but it's not until the rules get re-applied.
Reviewed by: Lutz Donnerhacke <lutz_donnerhacke.de>
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D26742
This is in preparation for enabling a loadable SCTP stack. Analogous to
IPSEC/IPSEC_SUPPORT, the SCTP_SUPPORT kernel option must be configured
in order to support a loadable SCTP implementation.
Discussed with: tuexen
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Right now we optionally allocate 8 counters per table entry, so in
addition to memory consumed by counters, we require 8 pointers worth of
space in each entry even when counters are not allocated (the default).
Instead, define a UMA zone that returns contiguous per-CPU counter
arrays for use in table entries. On amd64 this reduces sizeof(struct
pfr_kentry) from 216 to 160. The smaller size also results in better
slab efficiency, so memory usage for large tables is reduced by about
28%.
Reviewed by: kp
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D24843
pf by default does not do per-table address accounting unless the
"counters" keyword is specified in the corresponding pf.conf table
definition. Yet, we always allocate 12 per-CPU counters per table. For
large tables this carries a lot of overhead, so only allocate counters
when they will actually be used.
A further enhancement might be to use a dedicated UMA zone to allocate
counter arrays for table entries, since close to half of the structure
size comes from counter pointers. A related issue is the cost of
zeroing counters, since counter_u64_zero() calls smp_rendezvous() on
some architectures.
Reported by: loos, Jim Pingle <jimp@netgate.com>
Reviewed by: kp
MFC after: 2 weeks
Sponsored by: Rubicon Communications, LLC (Netgate)
Differential Revision: https://reviews.freebsd.org/D24803
The pf_frag_mtx mutex protects the fragments queue. The fragments queue
is virtualised already (i.e. per-vnet) so it makes no sense to block
jail A from accessing its fragments queue while jail B is accessing its
own fragments queue.
Virtualise the lock for improved concurrency.
Differential Revision: https://reviews.freebsd.org/D24504
If we pass an anchor name which doesn't exist pfr_table_count() returns
-1, which leads to an overflow in mallocarray() and thus a panic.
Explicitly check that pfr_table_count() does not return an error.
Reported-by: syzbot+bd09d55d897d63d5f4f4@syzkaller.appspotmail.com
Reviewed by: melifaro
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D24539
Both DIOCCHANGEADDR and DIOCADDADDR take a struct pf_pooladdr from
userspace. They failed to validate the dyn pointer contained in its
struct pf_addr_wrap member structure.
This triggered assertion failures under fuzz testing in
pfi_dynaddr_setup(). Happily the dyn variable was overruled there, but
we should verify that it's set to NULL anyway.
Reported-by: syzbot+93e93150bc29f9b4b85f@syzkaller.appspotmail.com
Reviewed by: emaste
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D24431
Switch uRPF to use specific fib(9)-provided uRPF.
Switch MSS calculation to the latest fib(9) kpi.
Reviewed by: kp
Differential Revision: https://reviews.freebsd.org/D24386
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
Mark all nodes in pf, pfsync and carp as MPSAFE.
Reviewed by: kp
Approved by: kib (mentor, blanket)
Differential Revision: https://reviews.freebsd.org/D23634
If we have a 'set skip on <ifgroup>' rule this flag it set on the group
kif, but must also be set on all members. pfctl does this when the rules
are set, but if groups are added afterwards we must also apply the flags
to the new member. If not, new group members will not be skipped until
the rules are reloaded.
Reported by: dvl@
Reviewed by: glebius@
Differential Revision: https://reviews.freebsd.org/D23254
As of r356974 calls to ip_output() require us to be in the network epoch.
That wasn't the case for the calls done from pfsyncintr() and
pfsync_defer_tmo().
There's no reason for this to be a tunable. It's perfectly safe to
change this at runtime.
Reviewed by: Lutz Donnerhacke
Differential Revision: https://reviews.freebsd.org/D22737
icmp_reflect(), called through icmp_error() requires us to be in NET_EPOCH.
Failure to hold it leads to the following panic (with INVARIANTS):
panic: Assertion in_epoch(net_epoch_preempt) failed at /usr/src/sys/netinet/ip_icmp.c:742
cpuid = 2
time = 1571233273
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00e0977920
vpanic() at vpanic+0x17e/frame 0xfffffe00e0977980
panic() at panic+0x43/frame 0xfffffe00e09779e0
icmp_reflect() at icmp_reflect+0x625/frame 0xfffffe00e0977aa0
icmp_error() at icmp_error+0x720/frame 0xfffffe00e0977b10
pf_intr() at pf_intr+0xd5/frame 0xfffffe00e0977b50
ithread_loop() at ithread_loop+0x1c6/frame 0xfffffe00e0977bb0
fork_exit() at fork_exit+0x80/frame 0xfffffe00e0977bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00e0977bf0
Note that we now enter NET_EPOCH twice if we enter ip_output() from pf_intr(),
but ip_output() will soon be converted to a function that requires epoch, so
entering NET_EPOCH directly from pf_intr() makes more sense.
Discussed with: glebius@
When epoch(9) was introduced to network stack, it was basically
dropped in place of existing locking, which was mutexes and
rwlocks. For the sake of performance mutex covered areas were
as small as possible, so became epoch covered areas.
However, epoch doesn't introduce any contention, it just delays
memory reclaim. So, there is no point to minimise epoch covered
areas in sense of performance. Meanwhile entering/exiting epoch
also has non-zero CPU usage, so doing this less often is a win.
Not the least is also code maintainability. In the new paradigm
we can assume that at any stage of processing a packet, we are
inside network epoch. This makes coding both input and output
path way easier.
On output path we already enter epoch quite early - in the
ip_output(), in the ip6_output().
This patch does the same for the input path. All ISR processing,
network related callouts, other ways of packet injection to the
network stack shall be performed in net_epoch. Any leaf function
that walks network configuration now asserts epoch.
Tricky part is configuration code paths - ioctls, sysctls. They
also call into leaf functions, so some need to be changed.
This patch would introduce more epoch recursions (see EPOCH_TRACE)
than we had before. They will be cleaned up separately, as several
of them aren't trivial. Note, that unlike a lock recursion the
epoch recursion is safe and just wastes a bit of resources.
Reviewed by: gallatin, hselasky, cy, adrian, kristof
Differential Revision: https://reviews.freebsd.org/D19111
Avoid potential structure padding leak. r350294 identified a leak via
static analysis; although there's no report of a leak with the
DIOCGETSRCNODES ioctl it's a good practice to zero the memory.
Suggested by: kp
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Remove our (very partial) support for RFC2675 Jumbograms. They're not
used, not actually supported and not a good idea.
Reviewed by: thj@
Differential Revision: https://reviews.freebsd.org/D21086
instead of a linear array.
The multicast memberships for the inpcb structure are protected by a
non-sleepable lock, INP_WLOCK(), which needs to be dropped when
calling the underlying possibly sleeping if_ioctl() method. When using
a linear array to keep track of multicast memberships, the computed
memory location of the multicast filter may suddenly change, due to
concurrent insertion or removal of elements in the linear array. This
in turn leads to various invalid memory access issues and kernel
panics.
To avoid this problem, put all multicast memberships on a STAILQ based
list. Then the memory location of the IPv4 and IPv6 multicast filters
become fixed during their lifetime and use after free and memory leak
issues are easier to track, for example by: vmstat -m | grep multi
All list manipulation has been factored into inline functions
including some macros, to easily allow for a future hash-list
implementation, if needed.
This patch has been tested by pho@ .
Differential Revision: https://reviews.freebsd.org/D20080
Reviewed by: markj @
MFC after: 1 week
Sponsored by: Mellanox Technologies
rename the source to gsb_crc32.c.
This is a prerequisite of unifying kernel zlib instances.
PR: 229763
Submitted by: Yoshihiro Ota <ota at j.email.ne.jp>
Differential Revision: https://reviews.freebsd.org/D20193
Now that we don't hold a lock during DIOCRSETTFLAGS memory allocation we can
use M_WAITOK.
MFC after: 1 week
Event: Aberdeen hackathon 2019
Pointed out by: glebius@
If during DIOCRSETTFLAGS pfrio_buffer is NULL copyin() will fault, which we're
not allowed to do with a lock held.
We must count the number of entries in the table and release the lock during
copyin(). Only then can we re-acquire the lock. Note that this is safe, because
pfr_set_tflags() will check if the table and entries exist.
This was discovered by a local syzcaller instance.
MFC after: 1 week
Event: Aberdeen hackathon 2019
There are a few places that use hand crafted versions of the macros
from sys/netinet/in.h making it difficult to actually alter the
values in use by these macros. Correct that by replacing handcrafted
code with proper macro usage.
Reviewed by: karels, kristof
Approved by: bde (mentor)
MFC after: 3 weeks
Sponsored by: John Gilmore
Differential Revision: https://reviews.freebsd.org/D19317
States in pf(4) let ICMP and ICMP6 packets pass if they have a
packet in their payload that matches an exiting connection. It was
not checked whether the outer ICMP packet has the same destination
IP as the source IP of the inner protocol packet. Enforce that
these addresses match, to prevent ICMP packets that do not make
sense.
Reported by: Nicolas Collignon, Corentin Bayet, Eloi Vanderbeken, Luca Moro at Synacktiv
Obtained from: OpenBSD
Security: CVE-2019-5598
Previously the main pfsync lock and the bucket locks shared the same name.
This lead to spurious warnings from WITNESS like this:
acquiring duplicate lock of same type: "pfsync"
1st pfsync @ /usr/src/sys/netpfil/pf/if_pfsync.c:1402
2nd pfsync @ /usr/src/sys/netpfil/pf/if_pfsync.c:1429
It's perfectly okay to grab both the main pfsync lock and a bucket lock at the
same time.
We don't need different names for each bucket lock, because we should always
only acquire a single one of those at a time.
MFC after: 1 week
The counters of pf tables are updated outside the rule lock. That means state
updates might overwrite each other. Furthermore allocation and
freeing of counters happens outside the lock as well.
Use counter(9) for the counters, and always allocate the counter table
element, so that the race condition cannot happen any more.
PR: 230619
Submitted by: Kajetan Staszkiewicz <vegeta@tuxpowered.net>
Reviewed by: glebius
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19558
r343295 broke DIOCGETSRCNODES by failing to reset 'nr' after counting the
number of source tracking nodes.
This meant that we never copied the information to userspace, leading to '? ->
?' output from pfctl.
PR: 236368
MFC after: 1 week
We mistakenly used the extoff value from the last packet to patch the
next_header field. If a malicious host sends a chain of fragmented packets
where the first packet and the final packet have different lengths or number of
extension headers we'd patch the next_header at the wrong offset.
This can potentially lead to panics or rule bypasses.
Security: CVE-2019-5597
Obtained from: OpenBSD
Reported by: Corentin Bayet, Nicolas Collignon, Luca Moro at Synacktiv
Because fetching a counter is a rather expansive function we should use
counter_u64_fetch() in pf_state_expires() only when necessary. A "rdr
pass" rule should not cause more effort than separate "rdr" and "pass"
rules. For rules with adaptive timeout values the call of
counter_u64_fetch() should be accepted, but otherwise not.
From the man page:
The adaptive timeout values can be defined both globally and for
each rule. When used on a per-rule basis, the values relate to the
number of states created by the rule, otherwise to the total number
of states.
This handling of adaptive timeouts is done in pf_state_expires(). The
calculation needs three values: start, end and states.
1. Normal rules "pass .." without adaptive setting meaning "start = 0"
runs in the else-section and therefore takes "start" and "end" from
the global default settings and sets "states" to pf_status.states
(= total number of states).
2. Special rules like
"pass .. keep state (adaptive.start 500 adaptive.end 1000)"
have start != 0, run in the if-section and take "start" and "end"
from the rule and set "states" to the number of states created by
their rule using counter_u64_fetch().
Thats all ok, but there is a third case without special handling in the
above code snippet:
3. All "rdr/nat pass .." statements use together the pf_default_rule.
Therefore we have "start != 0" in this case and we run the
if-section but we better should run the else-section in this case and
do not fetch the counter of the pf_default_rule but take the total
number of states.
Submitted by: Andreas Longwitz <longwitz@incore.de>
MFC after: 2 weeks
In general, the time savings come from separating the active and
inactive queues lists into separate interface and non-interface queue
lists, and changing the rule and queue tag management from list-based
to hash-bashed.
In HFSC, a linear scan of the class table during each queue destroy
was also eliminated.
There are now two new tunables to control the hash size used for each
tag set (default for each is 128):
net.pf.queue_tag_hashsize
net.pf.rule_tag_hashsize
Reviewed by: kp
MFC after: 1 week
Sponsored by: RG Nets
Differential Revision: https://reviews.freebsd.org/D19131
The KPI have been reviewed and cleansed of features that were planned
back 20 years ago and never implemented. The pfil(9) internals have
been made opaque to protocols with only returned types and function
declarations exposed. The KPI is made more strict, but at the same time
more extensible, as kernel uses same command structures that userland
ioctl uses.
In nutshell [KA]PI is about declaring filtering points, declaring
filters and linking and unlinking them together.
New [KA]PI makes it possible to reconfigure pfil(9) configuration:
change order of hooks, rehook filter from one filtering point to a
different one, disconnect a hook on output leaving it on input only,
prepend/append a filter to existing list of filters.
Now it possible for a single packet filter to provide multiple rulesets
that may be linked to different points. Think of per-interface ACLs in
Cisco or Juniper. None of existing packet filters yet support that,
however limited usage is already possible, e.g. default ruleset can
be moved to single interface, as soon as interface would pride their
filtering points.
Another future feature is possiblity to create pfil heads, that provide
not an mbuf pointer but just a memory pointer with length. That would
allow filtering at very early stages of a packet lifecycle, e.g. when
packet has just been received by a NIC and no mbuf was yet allocated.
Differential Revision: https://reviews.freebsd.org/D18951
Re-evaluating the ALTQ kernel configuration can be expensive,
particularly when there are a large number (hundreds or thousands) of
queues, and is wholly unnecessary in response to events on interfaces
that do not support ALTQ as such interfaces cannot be part of an ALTQ
configuration.
Reviewed by: kp
MFC after: 1 week
Sponsored by: RG Nets
Differential Revision: https://reviews.freebsd.org/D18918
When cleaning up a vnet we free the counters in V_pf_default_rule and
V_pf_status from shutdown_pf(), but we can still use them later, for example
through pf_purge_expired_src_nodes().
Free them as the very last operation, as they rely on nothing else themselves.
PR: 235097
MFC after: 1 week
psn_len is controlled by user space, but we allocated memory based on it.
Check how much memory we might need at most (i.e. how many source nodes we
have) and limit the allocation to that.
Reported by: markj
MFC after: 1 week
Fix missing initialisation of sc_flags into a valid sync state on clone which
breaks carp in pfsync.
This regression was introduce by r342051.
PR: 235005
Submitted by: smh@FreeBSD.org
Pointy hat to: kp
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D18882
Sometimes, for negated tables, pf can log 'pfr_update_stats: assertion failed'.
This warning does not clarify anything for users, so silence it, just as
OpenBSD has.
PR: 234874
MFC after: 1 week
- Remove macros that covertly create epoch_tracker on thread stack. Such
macros a quite unsafe, e.g. will produce a buggy code if same macro is
used in embedded scopes. Explicitly declare epoch_tracker always.
- Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list
IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read
locking macros to what they actually are - the net_epoch.
Keeping them as is is very misleading. They all are named FOO_RLOCK(),
while they no longer have lock semantics. Now they allow recursion and
what's more important they now no longer guarantee protection against
their companion WLOCK macros.
Note: INP_HASH_RLOCK() has same problems, but not touched by this commit.
This is non functional mechanical change. The only functionally changed
functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter
epoch recursively.
Discussed with: jtl, gallatin
When we try to find a source port in pf_get_sport() it's possible that
all available source ports will be in use. In that case we call
pf_map_addr() to try to find a new source IP to try from. If there are
no more available source IPs pf_map_addr() will return 1 and we stop
trying.
However, if sticky-address is set we'll always return the same IP
address, even if we've already tried that one.
We need to check the supplied address, because if that's the one we'd
set it means pf_get_sport() has already tried it, and we should error
out rather than keep trying.
PR: 233867
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D18483
Mainly states of established TCP connections would be affected resulting
in immediate state removal once the number of states is bigger than
adaptive.start. Disabling adaptive timeouts is a workaround to avoid this bug.
Issue found and initial diff by Mathieu Blanc (mathieu.blanc at cea dot fr)
Reported by: Andreas Longwitz <longwitz AT incore.de>
Obtained from: OpenBSD
MFC after: 2 weeks
pfsync code is called for every new state, state update and state
deletion in pf. While pf itself can operate on multiple states at the
same time (on different cores, assuming the states hash to a different
hashrow), pfsync only had a single lock.
This greatly reduced throughput on multicore systems.
Address this by splitting the pfsync queues into buckets, based on the
state id. This ensures that updates for a given connection always end up
in the same bucket, which allows pfsync to still collapse multiple
updates into one, while allowing multiple cores to proceed at the same
time.
The number of buckets is tunable, but defaults to 2 x number of cpus.
Benchmarking has shown improvement, depending on hardware and setup, from ~30%
to ~100%.
MFC after: 1 week
Sponsored by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D18373
In rare situations[*] it's possible for two different interfaces to have
the same name. This confuses pf, because kifs are indexed by name (which
is assumed to be unique). As a result we can end up trying to
if_rele(NULL), which panics.
Explicitly checking the ifp pointer before if_rele() prevents the panic.
Note pf will likely behave in unexpected ways on the the overlapping
interfaces.
[*] Insert an interface in a vnet jail. Rename it to an interface which
exists on the host. Remove the jail. There are now two interfaces with
the same name in the host.
r340061 included a number of assertions pf_frent_remove(), but these assertions
were the only use of the 'prev' variable. As a result builds without
INVARIANTS had an unused variable, and failed.
Reported by: vangyzen@
If we fail to set up the multicast entry for pfsync and return an error
we must release the pfsync lock first.
MFC after: 2 weeks
Sponsored by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D17506
If the syncdev is removed we no longer need to clean up the multicast
entry we've got set up for that device.
Pass the ifnet detach event through pf to pfsync, and remove our
multicast handle, and mark us as no longer having a syncdev.
Note that this callback is always installed, even if the pfsync
interface is disabled (and thus it's not a per-vnet callback pointer).
MFC after: 2 weeks
Sponsored by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D17502
pfsync touches pf memory (for pf_state and the pfsync callback
pointers), not the other way around. We need to ensure that pfsync is
torn down before pf.
MFC after: 2 weeks
Sponsored by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D17501
The callbacks are installed and removed depending on the state of the
pfsync device, which is per-vnet. The callbacks must also be per-vnet.
MFC after: 2 weeks
Sponsored by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D17499
So we have a global limit of 1024 fragments, but it is fine grained to
the region of the packet. Smaller packets may have less fragments.
This costs another 16 bytes of memory per reassembly and devides the
worst case for searching by 8.
Obtained from: OpenBSD
Differential Revision: https://reviews.freebsd.org/D17734
Remember 16 entry points based on the fragment offset. Instead of
a worst case of 8196 list traversals we now check a maximum of 512
list entries or 16 array elements.
Obtained from: OpenBSD
Differential Revision: https://reviews.freebsd.org/D17733
Avoid traversing the list of fragment entris to check whether the
pf(4) reassembly is complete. Instead count the holes that are
created when inserting a fragment. If there are no holes left, the
fragments are continuous.
Obtained from: OpenBSD
Differential Revision: https://reviews.freebsd.org/D17732
When users mark an interface to not use aliases they likely also don't
want to use the link-local v6 address there.
PR: 201695
Submitted by: Russell Yount <Russell.Yount AT gmail.com>
Differential Revision: https://reviews.freebsd.org/D17633
We checked the destination address, but replaced the source address. This was
fixed in OpenBSD as part of their NAT rework, which we don't want to import
right now.
CID: 1009561
MFC after: 3 weeks
There's no point in the NULL check for ifp, because we'll already have
dereferenced it by then. Moreover, the event will always have a valid ifp.
Replace the late check with an early assertion.
CID: 1357338
the 3WHS is completed, establish the backend connection. The trigger
for "3WHS completed" is the reception of the first ACK. However, we
should not proceed if that ACK also has RST or FIN set.
PR: 197484
Obtained from: OpenBSD
MFC after: 2 weeks
when there is work to do. This reduces CPU consumption to one
third on systems. This will help keep the thread CPU usage under
control now that the default hash size has increased.
Reviewed by: kp
Approved by: re (kib)
Differential Revision: https://reviews.freebsd.org/D17097
2^32 bps or greater to be used. Prior to this, bandwidth parameters
would simply wrap at the 2^32 boundary. The computations in the HFSC
scheduler and token bucket regulator have been modified to operate
correctly up to at least 100 Gbps. No other algorithms have been
examined or modified for correct operation above 2^32 bps (some may
have existing computation resolution or overflow issues at rates below
that threshold). pfctl(8) will now limit non-HFSC bandwidth
parameters to 2^32 - 1 before passing them to the kernel.
The extensions to the pf(4) ioctl interface have been made in a
backwards-compatible way by versioning affected data structures,
supporting all versions in the kernel, and implementing macros that
will cause existing code that consumes that interface to use version 0
without source modifications. If version 0 consumers of the interface
are used against a new kernel that has had bandwidth parameters of
2^32 or greater configured by updated tools, such bandwidth parameters
will be reported as 2^32 - 1 bps by those old consumers.
All in-tree consumers of the pf(4) interface have been updated. To
update out-of-tree consumers to the latest version of the interface,
define PFIOC_USE_LATEST ahead of any includes and use the code of
pfctl(8) as a guide for the ioctls of interest.
PR: 211730
Reviewed by: jmallett, kp, loos
MFC after: 2 weeks
Relnotes: yes
Sponsored by: RG Nets
Differential Revision: https://reviews.freebsd.org/D16782
Similar to the network stack issue fixed in r337782 pf did not limit the number
of fragments per packet, which could be exploited to generate high CPU loads
with a crafted series of packets.
Limit each packet to no more than 64 fragments. This should be sufficient on
typical networks to allow maximum-sized IP frames.
This addresses the issue for both IPv4 and IPv6.
MFC after: 3 days
Security: CVE-2018-5391
Sponsored by: Klara Systems
The pfi_skip_if() function sometimes caused skipping of groups to work,
if the members of the group used the groupname as a name prefix.
This is often the case, e.g. group lo usually contains lo0, lo1, ...,
but not always.
Rather than relying on the name explicitly check for group memberships.
Obtained from: OpenBSD (pf_if.c,v 1.62, pf_if.c,v 1.63)
Sponsored by: Essen Hackathon
Synproxy was accidentally broken by r335569. The 'return (action)' must be
executed for every non-PF_PASS result, but the error packet (TCP RST or ICMP
error) should only be sent if the packet was dropped (i.e. PF_DROP) and the
return flag is set.
PR: 229477
Submitted by: Andre Albsmeier <mail AT fbsd.e4m.org>
MFC after: 1 week
When shutting down a vnet jail pf_shutdown() clears the remaining states, which
through pf_clear_states() calls pf_unlink_state().
For synproxy states pf_unlink_state() will send a TCP RST, which eventually
tries to schedule the pf swi in pf_send(). This means we can't remove the
software interrupt until after pf_shutdown().
MFC after: 1 week
Several third-parties use at least some of these ioctls. While it would be
better for regression testing if they were used in base (or at least in the
test suite), it's currently not worth the trouble to push through removal.
Submitted by: antoine, markj
Several ioctls are unused in pf, in the sense that no base utility
references them. Additionally, a cursory review of pf-based ports
indicates they're not used elsewhere either. Some of them have been
unused since the original import. As far as I can tell, they're also
unused in OpenBSD. Finally, removing this code removes the need for
future pf work to take them into account.
Reviewed by: kp
Differential Revision: https://reviews.freebsd.org/D16076
States learned via pfsync from a peer with the same ruleset checksum were not
getting assigned to rules like they should because pfsync_in_upd() wasn't
passing the PFSYNC_SI_CKSUM flag along to pfsync_state_import.
PR: 229092
Submitted by: Kajetan Staszkiewicz <vegeta tuxpowered.net>
Obtained from: OpenBSD
MFC after: 1 week
Sponsored by: InnoGames GmbH
Normally pf rules are expected to do one of two things: pass the traffic or
block it. Blocking can be silent - "drop", or loud - "return", "return-rst",
"return-icmp". Yet there is a 3rd category of traffic passing through pf:
Packets matching a "pass" rule but when applying the rule fails. This happens
when redirection table is empty or when src node or state creation fails. Such
rules always fail silently without notifying the sender.
Allow users to configure this behaviour too, so that pf returns an error packet
in these cases.
PR: 226850
Submitted by: Kajetan Staszkiewicz <vegeta tuxpowered.net>
MFC after: 1 week
Sponsored by: InnoGames GmbH
If a locally generated packet is routed (with route-to/reply-to/dup-to) out of
a different interface it's passed through the firewall again. This meant we
lost the inp pointer and if we required the pointer (e.g. for user ID matching)
we'd deadlock trying to acquire an inp lock we've already got.
Pass the inp pointer along with pf_route()/pf_route6().
PR: 228782
MFC after: 1 week
Given that PF_RULES_LOCK is a mostly read lock, replace the rwlock with rmlock.
This change improves packet processing rate in high pps environments.
Benchmarking by olivier@ shows a 65% improvement in pps.
While here, also eliminate all appearances of "sys/rwlock.h" includes since it
is not used anymore.
Submitted by: farrokhi@
Differential Revision: https://reviews.freebsd.org/D15502