Commit Graph

782 Commits

Author SHA1 Message Date
Andrey V. Elsukov
1ae74c1359 Make TCP options parsing stricter.
Rework tcpopts_parse() to be more strict. Use const pointer. Add length
checks for specific TCP options. The main purpose of the change is
avoiding of possible out of mbuf's data access.

Reported by:	Maxime Villard
Reviewed by:	melifaro, emaste
MFC after:	1 week
2019-12-13 11:47:58 +00:00
Andrey V. Elsukov
2873980947 Follow RFC 4443 p2.2 and always use own addresses for reflected ICMPv6
datagrams.

Previously destination address from original datagram was used. That
looked confusing, especially in the traceroute6 output.
Also honor IPSTEALTH kernel option and do TTL/HLIM decrementing only
when stealth mode is disabled.

Reported by:	Marco van Tol <marco at tols org>
Reviewed by:	melifaro
MFC after:	2 weeks
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D22631
2019-12-12 13:28:46 +00:00
Andrey V. Elsukov
ca0ac0a6c1 Avoid access to stale ip pointer and call UPDATE_POINTERS() after
PULLUP_LEN_LOCKED().

PULLUP_LEN_LOCKED() could update mbuf and thus we need to update related
pointers that can be used in next opcodes.

Reported by:	Maxime Villard <max at m00nbsd net>
MFC after:	1 week
2019-12-10 10:35:32 +00:00
Kristof Provost
492f3a312a pf: Add endline to all DPFPRINTF()
DPFPRINTF() doesn't automatically add an endline, so be consistent and
always add it.
2019-11-24 13:53:36 +00:00
Kristof Provost
a0d571cbef pf: Must be in NET_EPOCH to call icmp_error
icmp_reflect(), called through icmp_error() requires us to be in NET_EPOCH.
Failure to hold it leads to the following panic (with INVARIANTS):

  panic: Assertion in_epoch(net_epoch_preempt) failed at /usr/src/sys/netinet/ip_icmp.c:742
  cpuid = 2
  time = 1571233273
  KDB: stack backtrace:
  db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00e0977920
  vpanic() at vpanic+0x17e/frame 0xfffffe00e0977980
  panic() at panic+0x43/frame 0xfffffe00e09779e0
  icmp_reflect() at icmp_reflect+0x625/frame 0xfffffe00e0977aa0
  icmp_error() at icmp_error+0x720/frame 0xfffffe00e0977b10
  pf_intr() at pf_intr+0xd5/frame 0xfffffe00e0977b50
  ithread_loop() at ithread_loop+0x1c6/frame 0xfffffe00e0977bb0
  fork_exit() at fork_exit+0x80/frame 0xfffffe00e0977bf0
  fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00e0977bf0

Note that we now enter NET_EPOCH twice if we enter ip_output() from pf_intr(),
but ip_output() will soon be converted to a function that requires epoch, so
entering NET_EPOCH directly from pf_intr() makes more sense.

Discussed with:	glebius@
2019-10-18 03:36:26 +00:00
Gleb Smirnoff
16a72f53e2 Use epoch(9) directly instead of obsoleted KPI. 2019-10-14 16:37:41 +00:00
Gleb Smirnoff
961c033ef1 ipfw(4) rule matching always happens in network epoch. 2019-10-14 16:37:00 +00:00
Mark Johnston
bff630d1dc Fix the build after r353458.
MFC with:	r353458
Sponsored by:	The FreeBSD Foundation
2019-10-13 00:08:17 +00:00
Mark Johnston
6cc9ab8610 Add a missing include of opt_sctp.h.
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2019-10-12 23:01:16 +00:00
Gleb Smirnoff
b8a6e03fac Widen NET_EPOCH coverage.
When epoch(9) was introduced to network stack, it was basically
dropped in place of existing locking, which was mutexes and
rwlocks. For the sake of performance mutex covered areas were
as small as possible, so became epoch covered areas.

However, epoch doesn't introduce any contention, it just delays
memory reclaim. So, there is no point to minimise epoch covered
areas in sense of performance. Meanwhile entering/exiting epoch
also has non-zero CPU usage, so doing this less often is a win.

Not the least is also code maintainability. In the new paradigm
we can assume that at any stage of processing a packet, we are
inside network epoch. This makes coding both input and output
path way easier.

On output path we already enter epoch quite early - in the
ip_output(), in the ip6_output().

This patch does the same for the input path. All ISR processing,
network related callouts, other ways of packet injection to the
network stack shall be performed in net_epoch. Any leaf function
that walks network configuration now asserts epoch.

Tricky part is configuration code paths - ioctls, sysctls. They
also call into leaf functions, so some need to be changed.

This patch would introduce more epoch recursions (see EPOCH_TRACE)
than we had before. They will be cleaned up separately, as several
of them aren't trivial. Note, that unlike a lock recursion the
epoch recursion is safe and just wastes a bit of resources.

Reviewed by:	gallatin, hselasky, cy, adrian, kristof
Differential Revision:	https://reviews.freebsd.org/D19111
2019-10-07 22:40:05 +00:00
Gleb Smirnoff
f8b45306c6 Drivers may pass runt packets to filter. This is okay.
Reviewed by:	gallatin
2019-09-13 22:36:04 +00:00
Andrey V. Elsukov
773a7e2224 Fix rule truncation on external action module unloading.
Obtained from:	Yandex LLC
MFC after:	1 week
Sponsored by:	Yandex LLC
2019-08-15 13:44:33 +00:00
Ed Maste
c54ee572e5 pf: zero (another) output buffer in pfioctl
Avoid potential structure padding leak.  r350294 identified a leak via
static analysis; although there's no report of a leak with the
DIOCGETSRCNODES ioctl it's a good practice to zero the memory.

Suggested by:	kp
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2019-07-31 16:58:09 +00:00
Andrey V. Elsukov
e758846c09 dd ipfw_get_action() function to get the pointer to action opcode.
ACTION_PTR() returns pointer to the start of rule action section,
but rule can keep several rule modifiers like O_LOG, O_TAG and O_ALTQ,
and only then real action opcode is stored.

ipfw_get_action() function inspects the rule action section, skips
all modifiers and returns action opcode.

Use this function in ipfw_reset_eaction() and flush_nat_ptrs().

MFC after:	1 week
Sponsored by:	Yandex LLC
2019-07-29 15:09:12 +00:00
Kristof Provost
f287767d4f pf: Remove partial RFC2675 support
Remove our (very partial) support for RFC2675 Jumbograms. They're not
used, not actually supported and not a good idea.

Reviewed by:	thj@
Differential Revision:	https://reviews.freebsd.org/D21086
2019-07-29 13:21:31 +00:00
Andrey V. Elsukov
d4e6a52959 Avoid possible lock leaking.
After r343619 ipfw uses own locking for packets flow. PULLUP_LEN() macro
is used in ipfw_chk() to make m_pullup(). When m_pullup() fails, it just
returns via `goto pullup_failed`. There are two places where PULLUP_LEN()
is called with IPFW_PF_RLOCK() held.

Add PULLUP_LEN_LOCKED() macro to use in these places to be able release
the lock, when m_pullup() fails.

Sponsored by:	Yandex LLC
2019-07-29 12:55:48 +00:00
Ed Maste
532bc58628 pf: zero output buffer in pfioctl
Avoid potential structure padding leak.

Reported by:	Vlad Tsyrklevich <vlad@tsyrklevich.net>
Reviewed by:	kp
MFC after:	3 days
Security:	Potential kernel memory disclosure
Sponsored by:	The FreeBSD Foundation
2019-07-24 16:51:14 +00:00
Andrey V. Elsukov
455d2ecb71 Eliminate rmlock from ipfw's BPF code.
After r343631 pfil hooks are invoked in net_epoch_preempt section,
this allows to avoid extra locking. Add NET_EPOCH_ASSER() assertion
to each ipfw_bpf_*tap*() call to require to be called from inside
epoch section.

Use NET_EPOCH_WAIT() in ipfw_clone_destroy() to wait until it becomes
safe to free() ifnet. And use on-stack ifnet pointer in each
ipfw_bpf_*tap*() call to avoid NULL pointer dereference in case when
V_*log_if global variable will become NULL during ipfw_bpf_*tap*() call.

Sponsored by:	Yandex LLC
2019-07-23 12:52:36 +00:00
Andrey V. Elsukov
2dab0de635 Do not modify cmd pointer if it is already last opcode in the rule.
MFC after:	1 week
2019-07-12 09:59:21 +00:00
Andrey V. Elsukov
4ee2f4c180 Correctly truncate the rule in case when it has several action opcodes.
It is possible, that opcode at the ACTION_PTR() location is not real
action, but action modificator like "log", "tag" etc. In this case we
need to check for each opcode in the loop to find O_EXTERNAL_ACTION.

Obtained from:	Yandex LLC
MFC after:	1 week
Sponsored by:	Yandex LLC
2019-07-12 09:48:42 +00:00
Hans Petter Selasky
59854ecf55 Convert all IPv4 and IPv6 multicast memberships into using a STAILQ
instead of a linear array.

The multicast memberships for the inpcb structure are protected by a
non-sleepable lock, INP_WLOCK(), which needs to be dropped when
calling the underlying possibly sleeping if_ioctl() method. When using
a linear array to keep track of multicast memberships, the computed
memory location of the multicast filter may suddenly change, due to
concurrent insertion or removal of elements in the linear array. This
in turn leads to various invalid memory access issues and kernel
panics.

To avoid this problem, put all multicast memberships on a STAILQ based
list. Then the memory location of the IPv4 and IPv6 multicast filters
become fixed during their lifetime and use after free and memory leak
issues are easier to track, for example by: vmstat -m | grep multi

All list manipulation has been factored into inline functions
including some macros, to easily allow for a future hash-list
implementation, if needed.

This patch has been tested by pho@ .

Differential Revision: https://reviews.freebsd.org/D20080
Reviewed by:	markj @
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2019-06-25 11:54:41 +00:00
Andrey V. Elsukov
019c8c9330 Follow the RFC 3128 and drop short TCP fragments with offset = 1.
Reported by:	emaste
MFC after:	1 week
2019-06-25 11:40:37 +00:00
Andrey V. Elsukov
7d4b2d5244 Mark default rule with IPFW_RULE_NOOPT flag, so it can be showed in
compact form.

MFC after:	1 week
2019-06-25 09:11:22 +00:00
Andrey V. Elsukov
978f2d1728 Add "tcpmss" opcode to match the TCP MSS value.
With this opcode it is possible to match TCP packets with specified
MSS option, whose value corresponds to configured in opcode value.
It is allowed to specify single value, range of values, or array of
specific values or ranges. E.g.

 # ipfw add deny log tcp from any to any tcpmss 0-500

Reviewed by:	melifaro,bcr
Obtained from:	Yandex LLC
MFC after:	1 week
Sponsored by:	Yandex LLC
2019-06-21 10:54:51 +00:00
Xin LI
f89d207279 Separate kernel crc32() implementation to its own header (gsb_crc32.h) and
rename the source to gsb_crc32.c.

This is a prerequisite of unifying kernel zlib instances.

PR:		229763
Submitted by:	Yoshihiro Ota <ota at j.email.ne.jp>
Differential Revision:	https://reviews.freebsd.org/D20193
2019-06-17 19:49:08 +00:00
Andrey V. Elsukov
efdadaa2d8 Initialize V_nat64out methods explicitly.
It looks like initialization of static variable doesn't work for
VIMAGE and this leads to panic.

Reported by:	olivier
MFC after:	1 week
2019-06-05 09:25:40 +00:00
Li-Wen Hsu
d086d41363 Remove an uneeded indentation introduced in r223637 to silence gcc warnging
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2019-05-25 23:58:09 +00:00
John Baldwin
fb3bc59600 Restructure mbuf send tags to provide stronger guarantees.
- Perform ifp mismatch checks (to determine if a send tag is allocated
  for a different ifp than the one the packet is being output on), in
  ip_output() and ip6_output().  This avoids sending packets with send
  tags to ifnet drivers that don't support send tags.

  Since we are now checking for ifp mismatches before invoking
  if_output, we can now try to allocate a new tag before invoking
  if_output sending the original packet on the new tag if allocation
  succeeds.

  To avoid code duplication for the fragment and unfragmented cases,
  add ip_output_send() and ip6_output_send() as wrappers around
  if_output and nd6_output_ifp, respectively.  All of the logic for
  setting send tags and dealing with send tag-related errors is done
  in these wrapper functions.

  For pseudo interfaces that wrap other network interfaces (vlan and
  lagg), wrapper send tags are now allocated so that ip*_output see
  the wrapper ifp as the ifp in the send tag.  The if_transmit
  routines rewrite the send tags after performing an ifp mismatch
  check.  If an ifp mismatch is detected, the transmit routines fail
  with EAGAIN.

- To provide clearer life cycle management of send tags, especially
  in the presence of vlan and lagg wrapper tags, add a reference count
  to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
  Provide a helper function (m_snd_tag_init()) for use by drivers
  supporting send tags.  m_snd_tag_init() takes care of the if_ref
  on the ifp meaning that code alloating send tags via if_snd_tag_alloc
  no longer has to manage that manually.  Similarly, m_snd_tag_rele
  drops the refcount on the ifp after invoking if_snd_tag_free when
  the last reference to a send tag is dropped.

  This also closes use after free races if there are pending packets in
  driver tx rings after the socket is closed (e.g. from tcpdrop).

  In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
  csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
  Drivers now also check this flag instead of checking snd_tag against
  NULL.  This avoids false positive matches when a forwarded packet
  has a non-NULL rcvif that was treated as a send tag.

- cxgbe was relying on snd_tag_free being called when the inp was
  detached so that it could kick the firmware to flush any pending
  work on the flow.  This is because the driver doesn't require ACK
  messages from the firmware for every request, but instead does a
  kind of manual interrupt coalescing by only setting a flag to
  request a completion on a subset of requests.  If all of the
  in-flight requests don't have the flag when the tag is detached from
  the inp, the flow might never return the credits.  The current
  snd_tag_free command issues a flush command to force the credits to
  return.  However, the credit return is what also frees the mbufs,
  and since those mbufs now hold references on the tag, this meant
  that snd_tag_free would never be called.

  To fix, explicitly drop the mbuf's reference on the snd tag when the
  mbuf is queued in the firmware work queue.  This means that once the
  inp's reference on the tag goes away and all in-flight mbufs have
  been queued to the firmware, tag's refcount will drop to zero and
  snd_tag_free will kick in and send the flush request.  Note that we
  need to avoid doing this in the middle of ethofld_tx(), so the
  driver grabs a temporary reference on the tag around that loop to
  defer the free to the end of the function in case it sends the last
  mbuf to the queue after the inp has dropped its reference on the
  tag.

- mlx5 preallocates send tags and was using the ifp pointer even when
  the send tag wasn't in use.  Explicitly use the ifp from other data
  structures instead.

- Sprinkle some assertions in various places to assert that received
  packets don't have a send tag, and that other places that overwrite
  rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.

Reviewed by:	gallatin, hselasky, rgrimes, ae
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20117
2019-05-24 22:30:40 +00:00
Andrey V. Elsukov
90ecb41fba Add IPv6 support for O_IPLEN opcode.
Obtained from:	Yandex LLC
MFC after:	1 week
Sponsored by:	Yandex LLC
2019-04-29 09:33:16 +00:00
Kristof Provost
1c75b9d2cd pf: No need to M_NOWAIT in DIOCRSETTFLAGS
Now that we don't hold a lock during DIOCRSETTFLAGS memory allocation we can
use M_WAITOK.

MFC after:	1 week
Event:		Aberdeen hackathon 2019
Pointed out by:	glebius@
2019-04-18 11:37:44 +00:00
Kristof Provost
f5e0d9fcb4 pf: Fix panic on invalid DIOCRSETTFLAGS
If during DIOCRSETTFLAGS pfrio_buffer is NULL copyin() will fault, which we're
not allowed to do with a lock held.
We must count the number of entries in the table and release the lock during
copyin(). Only then can we re-acquire the lock. Note that this is safe, because
pfr_set_tflags() will check if the table and entries exist.

This was discovered by a local syzcaller instance.

MFC after:	1 week
Event:		Aberdeen hackathon 2019
2019-04-17 16:42:54 +00:00
Rodney W. Grimes
6c1c6ae537 Use IN_foo() macros from sys/netinet/in.h inplace of handcrafted code
There are a few places that use hand crafted versions of the macros
from sys/netinet/in.h making it difficult to actually alter the
values in use by these macros.  Correct that by replacing handcrafted
code with proper macro usage.

Reviewed by:		karels, kristof
Approved by:		bde (mentor)
MFC after:		3 weeks
Sponsored by:		John Gilmore
Differential Revision:	https://reviews.freebsd.org/D19317
2019-04-04 19:01:13 +00:00
Conrad Meyer
a8a16c7128 Replace read_random(9) with more appropriate arc4rand(9) KPIs
Reviewed by:	ae, delphij
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D19760
2019-04-04 01:02:50 +00:00
Ed Maste
a342f5772f pf: use UID_ROOT and GID_WHEEL named constants in make_dev
No functional change but improves consistency and greppability of
make_dev calls.

Discussed with: kp
2019-03-26 21:20:42 +00:00
Gleb Smirnoff
97245d4074 Always create ipfw(4) hooks as long as module is loaded.
Now enabling ipfw(4) with sysctls controls only linkage of hooks to default
heads. When module is loaded fetch sysctls as tunables, to make it possible
to boot with ipfw(4) in kernel, but not linked to any pfil(9) hooks.
2019-03-21 16:15:29 +00:00
Kristof Provost
64af73aade pf: Ensure that IP addresses match in ICMP error packets
States in pf(4) let ICMP and ICMP6 packets pass if they have a
packet in their payload that matches an exiting connection.  It was
not checked whether the outer ICMP packet has the same destination
IP as the source IP of the inner protocol packet.  Enforce that
these addresses match, to prevent ICMP packets that do not make
sense.

Reported by:	Nicolas Collignon, Corentin Bayet, Eloi Vanderbeken, Luca Moro at Synacktiv
Obtained from:	OpenBSD
Security:	CVE-2019-5598
2019-03-21 08:09:52 +00:00
Andrey V. Elsukov
b8c431f9c0 Do not enter epoch section recursively.
A pfil hook is already invoked in NET_EPOCH section.
2019-03-20 10:11:21 +00:00
Andrey V. Elsukov
e0b7b6d465 Use NET_EPOCH instead of allocating separate one.
MFC after:	1 month
2019-03-20 10:06:44 +00:00
Andrey V. Elsukov
d18c1f26a4 Reapply r345274 with build fixes for 32-bit architectures.
Update NAT64LSN implementation:

  o most of data structures and relations were modified to be able support
    large number of translation states. Now each supported protocol can
    use full ports range. Ports groups now are belongs to IPv4 alias
    addresses, not hosts. Each ports group can keep several states chunks.
    This is controlled with new `states_chunks` config option. States
    chunks allow to have several translation states for single alias address
    and port, but for different destination addresses.
  o by default all hash tables now use jenkins hash.
  o ConcurrencyKit and epoch(9) is used to make NAT64LSN lockless on fast path.
  o one NAT64LSN instance now can be used to handle several IPv6 prefixes,
    special prefix "::" value should be used for this purpose when instance
    is created.
  o due to modified internal data structures relations, the socket opcode
    that does states listing was changed.

Obtained from:	Yandex LLC
MFC after:	1 month
Sponsored by:	Yandex LLC
2019-03-19 10:57:03 +00:00
Andrey V. Elsukov
d6369c2d18 Revert r345274. It appears that not all 32-bit architectures have
necessary CK primitives.
2019-03-18 14:00:19 +00:00
Andrey V. Elsukov
d7a1cf06f3 Update NAT64LSN implementation:
o most of data structures and relations were modified to be able support
  large number of translation states. Now each supported protocol can
  use full ports range. Ports groups now are belongs to IPv4 alias
  addresses, not hosts. Each ports group can keep several states chunks.
  This is controlled with new `states_chunks` config option. States
  chunks allow to have several translation states for single alias address
  and port, but for different destination addresses.
o by default all hash tables now use jenkins hash.
o ConcurrencyKit and epoch(9) is used to make NAT64LSN lockless on fast path.
o one NAT64LSN instance now can be used to handle several IPv6 prefixes,
  special prefix "::" value should be used for this purpose when instance
  is created.
o due to modified internal data structures relations, the socket opcode
  that does states listing was changed.

Obtained from:	Yandex LLC
MFC after:	1 month
Sponsored by:	Yandex LLC
2019-03-18 12:59:08 +00:00
Andrey V. Elsukov
5c04f73e07 Add NAT64 CLAT implementation as defined in RFC6877.
CLAT is customer-side translator that algorithmically translates 1:1
private IPv4 addresses to global IPv6 addresses, and vice versa.
It is implemented as part of ipfw_nat64 kernel module. When module
is loaded or compiled into the kernel, it registers "nat64clat" external
action. External action named instance can be created using `create`
command and then used in ipfw rules. The create command accepts two
IPv6 prefixes `plat_prefix` and `clat_prefix`. If plat_prefix is ommitted,
IPv6 NAT64 Well-Known prefix 64:ff9b::/96 will be used.

  # ipfw nat64clat CLAT create clat_prefix SRC_PFX plat_prefix DST_PFX
  # ipfw add nat64clat CLAT ip4 from IPv4_PFX to any out
  # ipfw add nat64clat CLAT ip6 from DST_PFX to SRC_PFX in

Obtained from:	Yandex LLC
Submitted by:	Boris N. Lytochkin
MFC after:	1 month
Relnotes:	yes
Sponsored by:	Yandex LLC
2019-03-18 11:44:53 +00:00
Andrey V. Elsukov
002cae78da Add SPDX-License-Identifier and update year in copyright.
MFC after:	1 month
2019-03-18 10:50:32 +00:00
Andrey V. Elsukov
b11efc1eb6 Modify struct nat64_config.
Add second IPv6 prefix to generic config structure and rename another
fields to conform to RFC6877. Now it contains two prefixes and length:
PLAT is provider-side translator that translates N:1 global IPv6 addresses
to global IPv4 addresses. CLAT is customer-side translator (XLAT) that
algorithmically translates 1:1 IPv4 addresses to global IPv6 addresses.
Use PLAT prefix in stateless (nat64stl) and stateful (nat64lsn)
translators.

Modify nat64_extract_ip4() and nat64_embed_ip4() functions to accept
prefix length and use plat_plen to specify prefix length.

Retire net.inet.ip.fw.nat64_allow_private sysctl variable.
Add NAT64_ALLOW_PRIVATE flag and use "allow_private" config option to
configure this ability separately for each NAT64 instance.

Obtained from:	Yandex LLC
MFC after:	1 month
Sponsored by:	Yandex LLC
2019-03-18 10:39:14 +00:00
Kristof Provost
812483c46e pf: Rename pfsync bucket lock
Previously the main pfsync lock and the bucket locks shared the same name.
This lead to spurious warnings from WITNESS like this:

    acquiring duplicate lock of same type: "pfsync"
     1st pfsync @ /usr/src/sys/netpfil/pf/if_pfsync.c:1402
     2nd pfsync @ /usr/src/sys/netpfil/pf/if_pfsync.c:1429

It's perfectly okay to grab both the main pfsync lock and a bucket lock at the
same time.

We don't need different names for each bucket lock, because we should always
only acquire a single one of those at a time.

MFC after:	1 week
2019-03-16 10:14:03 +00:00
Kristof Provost
5904868691 pf :Use counter(9) in pf tables.
The counters of pf tables are updated outside the rule lock. That means state
updates might overwrite each other. Furthermore allocation and
freeing of counters happens outside the lock as well.

Use counter(9) for the counters, and always allocate the counter table
element, so that the race condition cannot happen any more.

PR:		230619
Submitted by:	Kajetan Staszkiewicz <vegeta@tuxpowered.net>
Reviewed by:	glebius
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D19558
2019-03-15 11:08:44 +00:00
Gleb Smirnoff
f355cb3e6f PFIL_MEMPTR for ipfw link level hook
With new pfil(9) KPI it is possible to pass a void pointer with length
instead of mbuf pointer to a packet filter. Until this commit no filters
supported that, so pfil run through a shim function pfil_fake_mbuf().

Now the ipfw(4) hook named "default-link", that is instantiated when
net.link.ether.ipfw sysctl is on, supports processing pointer/length
packets natively.

- ip_fw_args now has union for either mbuf or void *, and if flags have
  non-zero length, then we use the void *.
- through ipfw_chk() we handle mem/mbuf cases differently.
- ether_header goes away from args. It is ipfw_chk() responsibility
  to do parsing of Ethernet header.
- ipfw_log() now uses different bpf APIs to log packets.

Although ipfw_chk() is now capable to process pointer/length packets,
this commit adds support for the link level hook only, see
ipfw_check_frame(). Potentially the IP processing hook ipfw_check_packet()
can be improved too, but that requires more changes since the hook
supports more complex actions: NAT, divert, etc.

Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D19357
2019-03-14 22:52:16 +00:00
Gleb Smirnoff
dc0fa4f712 Remove 'dir' argument from dummynet_io(). This makes it possible to make
dn_dir flags private to dummynet. There is still some room for improvement.
2019-03-14 22:32:50 +00:00
Gleb Smirnoff
b00b7e03fd Reduce argument list to ipfw_divert(), as args holds the rule ref and
the direction. While here make 'tee' a bool.
2019-03-14 22:31:12 +00:00
Gleb Smirnoff
cef9f220cd Remove 'dir' argument in ng_ipfw_input, since ip_fw_args now has this info.
While here make 'tee' boolean.
2019-03-14 22:30:05 +00:00
Gleb Smirnoff
b7795b6746 - Add more flags to ip_fw_args. At this changeset only IPFW_ARGS_IN and
IPFW_ARGS_OUT are utilized. They are intented to substitute the "dir"
  parameter that is often passes together with args.
- Rename ip_fw_args.oif to ifp and now it is set to either input or
  output interface, depending on IPFW_ARGS_IN/OUT bit set.
2019-03-14 22:28:50 +00:00
Gleb Smirnoff
1830dae3d3 Make second argument of ip_divert(), that specifies packet direction a bool.
This allows pf(4) to avoid including ipfw(4) private files.
2019-03-14 22:23:09 +00:00
Gleb Smirnoff
2d0232783c Simplify ipfw_bpf_mtap2(). No functional change. 2019-03-14 22:20:48 +00:00
Andrey V. Elsukov
ca0f03e808 Add IP_FW_NAT64 to codes that ipfw_chk() can return.
It will be used by upcoming NAT64 changes. We use separate code
to avoid propogating EACCES error code to user level applications
when NAT64 consumes a packet.

Obtained from:	Yandex LLC
MFC after:	1 week
Sponsored by:	Yandex LLC
2019-03-11 10:42:09 +00:00
Andrey V. Elsukov
d76227959a Add NULL pointer check to nat64_output().
It is possible, that a processed packet was originated by local host,
in this case m->m_pkthdr.rcvif is NULL. Check and set it to V_loif to
avoid NULL pointer dereference in IP input code, since it is expected
that packet has valid receiving interface when netisr processes it.

Obtained from:	Yandex LLC
MFC after:	1 week
Sponsored by:	Yandex LLC
2019-03-11 10:33:32 +00:00
Kristof Provost
f8e7fe32a4 pf: Fix DIOCGETSRCNODES
r343295 broke DIOCGETSRCNODES by failing to reset 'nr' after counting the
number of source tracking nodes.
This meant that we never copied the information to userspace, leading to '? ->
?' output from pfctl.

PR:		236368
MFC after:	1 week
2019-03-08 09:33:16 +00:00
Andrey V. Elsukov
83354acf5a Fix the problem with O_LIMIT states introduced in r344018.
dyn_install_state() uses `rule` pointer when it creates state.
For O_LIMIT states this pointer actually is not struct ip_fw,
it is pointer to O_LIMIT_PARENT state, that keeps actual pointer
to ip_fw parent rule. Thus we need to cache rule id and number
before calling dyn_get_parent_state(), so we can use them later
when the `rule` pointer is overrided.

PR:		236292
MFC after:	3 days
2019-03-07 04:40:44 +00:00
Kristof Provost
6f4909de5f pf: IPv6 fragments with malformed extension headers could be erroneously passed by pf or cause a panic
We mistakenly used the extoff value from the last packet to patch the
next_header field. If a malicious host sends a chain of fragmented packets
where the first packet and the final packet have different lengths or number of
extension headers we'd patch the next_header at the wrong offset.
This can potentially lead to panics or rule bypasses.

Security:       CVE-2019-5597
Obtained from:  OpenBSD
Reported by:    Corentin Bayet, Nicolas Collignon, Luca Moro at Synacktiv
2019-03-01 07:37:45 +00:00
Kristof Provost
22c58991e3 pf: Small performance tweak
Because fetching a counter is a rather expansive function we should use
counter_u64_fetch() in pf_state_expires() only when necessary. A "rdr
pass" rule should not cause more effort than separate "rdr" and "pass"
rules. For rules with adaptive timeout values the call of
counter_u64_fetch() should be accepted, but otherwise not.

From the man page:
    The adaptive timeout values can be defined both globally and for
    each rule.  When used on a per-rule basis, the values relate to the
    number of states created by the rule, otherwise to the total number
    of states.

This handling of adaptive timeouts is done in pf_state_expires().  The
calculation needs three values: start, end and states.

1. Normal rules "pass .." without adaptive setting meaning "start = 0"
   runs in the else-section and therefore takes "start" and "end" from
   the global default settings and sets "states" to pf_status.states
   (= total number of states).

2. Special rules like
   "pass .. keep state (adaptive.start 500 adaptive.end 1000)"
   have start != 0, run in the if-section and take "start" and "end"
   from the rule and set "states" to the number of states created by
   their rule using counter_u64_fetch().

Thats all ok, but there is a third case without special handling in the
above code snippet:

3. All "rdr/nat pass .." statements use together the pf_default_rule.
   Therefore we have "start != 0" in this case and we run the
   if-section but we better should run the else-section in this case and
   do not fetch the counter of the pf_default_rule but take the total
   number of states.

Submitted by:	Andreas Longwitz <longwitz@incore.de>
MFC after:	2 weeks
2019-02-24 17:23:55 +00:00
Andrey V. Elsukov
804a6541db Remove `set' field from state structure and use set from parent rule.
Initially it was introduced because parent rule pointer could be freed,
and rule's information could become inaccessible. In r341471 this was
changed. And now we don't need this information, and also it can become
stale. E.g. rule can be moved from one set to another. This can lead
to parent's set and state's set will not match. In this case it is
possible that static rule will be freed, but dynamic state will not.
This can happen when `ipfw delete set N` command is used to delete
rules, that were moved to another set.
To fix the problem we will use the set number from parent rule.

Obtained from:	Yandex LLC
MFC after:	1 week
Sponsored by:	Yandex LLC
2019-02-11 18:10:55 +00:00
Patrick Kelsey
d178fee632 Place pf_altq_get_nth_active() under the ALTQ ifdef
MFC after:	1 week
2019-02-11 05:39:38 +00:00
Patrick Kelsey
8f2ac65690 Reduce the time it takes the kernel to install a new PF config containing a large number of queues
In general, the time savings come from separating the active and
inactive queues lists into separate interface and non-interface queue
lists, and changing the rule and queue tag management from list-based
to hash-bashed.

In HFSC, a linear scan of the class table during each queue destroy
was also eliminated.

There are now two new tunables to control the hash size used for each
tag set (default for each is 128):

net.pf.queue_tag_hashsize
net.pf.rule_tag_hashsize

Reviewed by:	kp
MFC after:	1 week
Sponsored by:	RG Nets
Differential Revision:	https://reviews.freebsd.org/D19131
2019-02-11 05:17:31 +00:00
Gleb Smirnoff
d38ca3297c Return PFIL_CONSUMED if packet was consumed. While here gather all
the identical endings of pf_check_*() into single function.

PR:		235411
2019-02-02 05:49:05 +00:00
Gleb Smirnoff
2790ca97d9 Fix build without INET6. 2019-02-01 00:33:17 +00:00
Gleb Smirnoff
b252313f0b New pfil(9) KPI together with newborn pfil API and control utility.
The KPI have been reviewed and cleansed of features that were planned
back 20 years ago and never implemented.  The pfil(9) internals have
been made opaque to protocols with only returned types and function
declarations exposed. The KPI is made more strict, but at the same time
more extensible, as kernel uses same command structures that userland
ioctl uses.

In nutshell [KA]PI is about declaring filtering points, declaring
filters and linking and unlinking them together.

New [KA]PI makes it possible to reconfigure pfil(9) configuration:
change order of hooks, rehook filter from one filtering point to a
different one, disconnect a hook on output leaving it on input only,
prepend/append a filter to existing list of filters.

Now it possible for a single packet filter to provide multiple rulesets
that may be linked to different points. Think of per-interface ACLs in
Cisco or Juniper. None of existing packet filters yet support that,
however limited usage is already possible, e.g. default ruleset can
be moved to single interface, as soon as interface would pride their
filtering points.

Another future feature is possiblity to create pfil heads, that provide
not an mbuf pointer but just a memory pointer with length. That would
allow filtering at very early stages of a packet lifecycle, e.g. when
packet has just been received by a NIC and no mbuf was yet allocated.

Differential Revision:	https://reviews.freebsd.org/D18951
2019-01-31 23:01:03 +00:00
Gleb Smirnoff
f712b16127 Revert r316461: Remove "IPFW static rules" rmlock, and use pfil's global lock.
The pfil(9) system is about to be converted to epoch(9) synchronization, so
we need [temporarily] go back with ipfw internal locking.

Discussed with:	ae
2019-01-31 21:04:50 +00:00
Andrey V. Elsukov
7664b71b62 Fix the bug introduced in r342908, that causes problems with dynamic
handling for protocols without ports numbers.

Since port numbers were uninitialized for protocols like ICMP/ICMPv6,
ipfw_chk() used some non-zero values to create dynamic states, and due
this it failed to match replies with created states.

Reported by:	Oliver Hartmann, Boris Lytochkin
Obtained from:	Yandex LLC
X-MFC after:	r342908
2019-01-29 11:18:41 +00:00
Patrick Kelsey
59099cd385 Don't re-evaluate ALTQ kernel configuration due to events on non-ALTQ interfaces
Re-evaluating the ALTQ kernel configuration can be expensive,
particularly when there are a large number (hundreds or thousands) of
queues, and is wholly unnecessary in response to events on interfaces
that do not support ALTQ as such interfaces cannot be part of an ALTQ
configuration.

Reviewed by:	kp
MFC after:	1 week
Sponsored by:	RG Nets
Differential Revision:	https://reviews.freebsd.org/D18918
2019-01-28 20:26:09 +00:00
Kristof Provost
d9d146e67b pf: Fix use-after-free of counters
When cleaning up a vnet we free the counters in V_pf_default_rule and
V_pf_status from shutdown_pf(), but we can still use them later, for example
through pf_purge_expired_src_nodes().

Free them as the very last operation, as they rely on nothing else themselves.

PR:		235097
MFC after:	1 week
2019-01-25 01:06:06 +00:00
Kristof Provost
180b0dcbbb pf: Validate psn_len in DIOCGETSRCNODES
psn_len is controlled by user space, but we allocated memory based on it.
Check how much memory we might need at most (i.e. how many source nodes we
have) and limit the allocation to that.

Reported by:	markj
MFC after:	1 week
2019-01-22 02:13:33 +00:00
Kristof Provost
6a8ee0f715 pf: fix pfsync breaking carp
Fix missing initialisation of sc_flags into a valid sync state on clone which
breaks carp in pfsync.

This regression was introduce by r342051.

PR:		235005
Submitted by:	smh@FreeBSD.org
Pointy hat to:	kp
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D18882
2019-01-18 08:19:54 +00:00
Kristof Provost
032dff662c pf: silence a runtime warning
Sometimes, for negated tables, pf can log 'pfr_update_stats: assertion failed'.
This warning does not clarify anything for users, so silence it, just as
OpenBSD has.

PR:		234874
MFC after:	1 week
2019-01-15 08:59:51 +00:00
Andrey V. Elsukov
48266154de Relax requirement to packet size of CARP protocol and remove version check.
CARP shares protocol number 112 with VRRP (RFC 5798). And the size of
VRRP packet may be smaller than CARP. ipfw_chk() does m_pullup() to at
least sizeof(struct carp_header) and can fail when packet is VRRP. This
leads to packet drop and message about failed pullup attempt.
Also, RFC 5798 defines version 3 of VRRP protocol, this version number
also unsupported by CARP and such check leads to packet drop.

carp_input() does its own checks for protocol version and packet size,
so we can remove these checks to be able pass VRRP packets.

PR:		234207
MFC after:	1 week
2019-01-11 01:54:15 +00:00
Andrey V. Elsukov
3b1522c229 Fix the build with INVARIANTS.
MFC after:	1 month
2019-01-10 02:01:20 +00:00
Andrey V. Elsukov
1cdf23bc03 Reduce the size of struct ip_fw_args from 240 to 128 bytes on amd64.
And refactor the code to avoid unneeded initialization to reduce overhead
of per-packet processing.

ipfw(4) can be invoked by pfil(9) framework for each packet several times.
Each call uses on-stack variable of type struct ip_fw_args to keep the
state of ipfw(4) processing. Currently this variable has 240 bytes size
on amd64.  Each time ipfw(4) does bzero() on it, and then it initializes
some fields.

glebius@ has reported that they at Netflix discovered, that initialization
of this variable produces significant overhead on packet processing.
After patching I managed to increase performance of packet processing on
simple routing with ipfw(4) firewalling to about 11% from 9.8Mpps up to
11Mpps (Xeon E5-2660 v4@ + Mellanox 100G card).

Introduced new field flags, it is used to keep track of what fields was
initialized. Some fields were moved into the anonymous union, to reduce
the size. They all are mutually exclusive. dummypar field was unused, and
therefore it is removed.  The hopstore6 field type was changed from
sockaddr_in6 to a bit smaller struct ip_fw_nh6. And now the size of struct
ip_fw_args is 128 bytes.

ipfw_chk() was modified to properly handle ip_fw_args.flags instead of
rely on checking for NULL pointers.

Reviewed by:	gallatin
Obtained from:	Yandex LLC
MFC after:	1 month
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D18690
2019-01-10 01:47:57 +00:00
Gleb Smirnoff
a68cc38879 Mechanical cleanup of epoch(9) usage in network stack.
- Remove macros that covertly create epoch_tracker on thread stack. Such
  macros a quite unsafe, e.g. will produce a buggy code if same macro is
  used in embedded scopes. Explicitly declare epoch_tracker always.

- Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list
  IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read
  locking macros to what they actually are - the net_epoch.
  Keeping them as is is very misleading. They all are named FOO_RLOCK(),
  while they no longer have lock semantics. Now they allow recursion and
  what's more important they now no longer guarantee protection against
  their companion WLOCK macros.
  Note: INP_HASH_RLOCK() has same problems, but not touched by this commit.

This is non functional mechanical change. The only functionally changed
functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter
epoch recursively.

Discussed with:	jtl, gallatin
2019-01-09 01:11:19 +00:00
Kristof Provost
336683f24f pf: Fix endless loop on NAT exhaustion with sticky-address
When we try to find a source port in pf_get_sport() it's possible that
all available source ports will be in use. In that case we call
pf_map_addr() to try to find a new source IP to try from. If there are
no more available source IPs pf_map_addr() will return 1 and we stop
trying.

However, if sticky-address is set we'll always return the same IP
address, even if we've already tried that one.
We need to check the supplied address, because if that's the one we'd
set it means pf_get_sport() has already tried it, and we should error
out rather than keep trying.

PR:		233867
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D18483
2018-12-12 20:15:06 +00:00
Kristof Provost
5b551954ab pf: Prevent integer overflow in PF when calculating the adaptive timeout.
Mainly states of established TCP connections would be affected resulting
in immediate state removal once the number of states is bigger than
adaptive.start.  Disabling adaptive timeouts is a workaround to avoid this bug.
Issue found and initial diff by Mathieu Blanc (mathieu.blanc at cea dot fr)

Reported by: Andreas Longwitz <longwitz AT incore.de>
Obtained from:  OpenBSD
MFC after:	2 weeks
2018-12-11 21:44:39 +00:00
Kristof Provost
4fc65bcbe3 pfsync: Performance improvement
pfsync code is called for every new state, state update and state
deletion in pf. While pf itself can operate on multiple states at the
same time (on different cores, assuming the states hash to a different
hashrow), pfsync only had a single lock.
This greatly reduced throughput on multicore systems.

Address this by splitting the pfsync queues into buckets, based on the
state id. This ensures that updates for a given connection always end up
in the same bucket, which allows pfsync to still collapse multiple
updates into one, while allowing multiple cores to proceed at the same
time.

The number of buckets is tunable, but defaults to 2 x number of cpus.
Benchmarking has shown improvement, depending on hardware and setup, from ~30%
to ~100%.

MFC after:	1 week
Sponsored by:	Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D18373
2018-12-06 19:27:15 +00:00
Kristof Provost
2b0a4ffadb pf: add a comment describing why do we call pf_map_addr again if port
selection process fails

Obtained from:	OpenBSD
2018-12-06 18:58:54 +00:00
Andrey V. Elsukov
d66f9c86fa Add ability to request listing and deleting only for dynamic states.
This can be useful, when net.inet.ip.fw.dyn_keep_states is enabled, but
after rules reloading some state must be deleted. Added new flag '-D'
for such purpose.

Retire '-e' flag, since there can not be expired states in the meaning
that this flag historically had.

Also add "verbose" mode for listing of dynamic states, it can be enabled
with '-v' flag and adds additional information to states list. This can
be useful for debugging.

Obtained from:	Yandex LLC
MFC after:	2 months
Sponsored by:	Yandex LLC
2018-12-04 16:12:43 +00:00
Andrey V. Elsukov
cefe3d67e2 Reimplement how net.inet.ip.fw.dyn_keep_states works.
Turning on of this feature allows to keep dynamic states when parent
rule is deleted. But it works only when the default rule is
"allow from any to any".

Now when rule with dynamic opcode is going to be deleted, and
net.inet.ip.fw.dyn_keep_states is enabled, existing states will reference
named objects corresponding to this rule, and also reference the rule.
And when ipfw_dyn_lookup_state() will find state for deleted parent rule,
it will return the pointer to the deleted rule, that is still valid.
This implementation doesn't support O_LIMIT_PARENT rules.

The refcnt field was added to struct ip_fw to keep reference, also
next pointer added to be able iterate rules and not damage the content
when deleted rules are chained.

Named objects are referenced only when states are going to be deleted to
be able reuse kidx of named objects when new parent rules will be
installed.

ipfw_dyn_get_count() function was modified and now it also looks into
dynamic states and constructs maps of existing named objects. This is
needed to correctly export orphaned states into userland.

ipfw_free_rule() was changed to be global, since now dynamic state can
free rule, when it is expired and references counters becomes 1.

External actions subsystem also modified, since external actions can be
deregisterd and instances can be destroyed. In these cases deleted rules,
that are referenced by orphaned states, must be modified to prevent access
to freed memory. ipfw_dyn_reset_eaction(), ipfw_reset_eaction_instance()
functions added for these purposes.

Obtained from:	Yandex LLC
MFC after:	2 months
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D17532
2018-12-04 16:01:25 +00:00
Andrey V. Elsukov
0df76496a6 Add assertion to check that named object has correct type.
Obtained from:	Yandex LLC
MFC after:	1 week
2018-12-04 15:12:28 +00:00
Kristof Provost
b2e0b24f76 pf: Fix panic on overlapping interface names
In rare situations[*] it's possible for two different interfaces to have
the same name. This confuses pf, because kifs are indexed by name (which
is assumed to be unique). As a result we can end up trying to
if_rele(NULL), which panics.

Explicitly checking the ifp pointer before if_rele() prevents the panic.
Note pf will likely behave in unexpected ways on the the overlapping
interfaces.

[*] Insert an interface in a vnet jail. Rename it to an interface which
exists on the host. Remove the jail. There are now two interfaces with
the same name in the host.
2018-12-01 09:58:21 +00:00
Andrey V. Elsukov
2636ba4d03 Do not limit the mbuf queue length for keepalive packets.
It was unlimited before overhaul, and one user reported that this limit
can be reached easily.

PR:		233562
MFC after:	1 week
2018-11-27 16:51:01 +00:00
Andrey V. Elsukov
b2b5660688 Add ability to use dynamic external prefix in ipfw_nptv6 module.
Now an interface name can be specified for nptv6 instance instead of
ext_prefix. The module will track if_addr_ext events and when suitable
IPv6 address will be added to specified interface, it will be configured
as external prefix. When address disappears instance becomes unusable,
i.e. it doesn't match any packets.

Reviewed by:	0mp (manpages)
Tested by:	Dries Michiels <driesm dot michiels gmail com>
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D17765
2018-11-12 11:20:59 +00:00
Kristof Provost
87e4ca37d5 pf: Prevent tables referenced by rules in anchors from getting disabled.
PR:		183198
Obtained from:	OpenBSD
MFC after:	2 weeks
2018-11-08 21:54:40 +00:00
Kristof Provost
58ef854f8b pf: Fix build if INVARIANTS is not set
r340061 included a number of assertions pf_frent_remove(), but these assertions
were the only use of the 'prev' variable. As a result builds without
INVARIANTS had an unused variable, and failed.

Reported by:	vangyzen@
2018-11-02 19:23:50 +00:00
Kristof Provost
14624ab582 pf: Keep a reference to struct ifnets we're using
Ensure that the struct ifnet we use can't go away until we're done with
it.
2018-11-02 17:05:40 +00:00
Kristof Provost
dde6e1fecb pfsync: Add missing unlock
If we fail to set up the multicast entry for pfsync and return an error
we must release the pfsync lock first.

MFC after:	2 weeks
Sponsored by:	Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D17506
2018-11-02 17:03:53 +00:00
Kristof Provost
04fe85f068 pfsync: Allow module to be unloaded
MFC after:	2 weeks
Sponsored by:	Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D17505
2018-11-02 17:01:18 +00:00
Kristof Provost
fbbf436d56 pfsync: Handle syncdev going away
If the syncdev is removed we no longer need to clean up the multicast
entry we've got set up for that device.

Pass the ifnet detach event through pf to pfsync, and remove our
multicast handle, and mark us as no longer having a syncdev.

Note that this callback is always installed, even if the pfsync
interface is disabled (and thus it's not a per-vnet callback pointer).

MFC after:	2 weeks
Sponsored by:	Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D17502
2018-11-02 16:57:23 +00:00
Kristof Provost
26549dfcad pfsync: Ensure uninit is done before pf
pfsync touches pf memory (for pf_state and the pfsync callback
pointers), not the other way around. We need to ensure that pfsync is
torn down before pf.

MFC after:	2 weeks
Sponsored by:	Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D17501
2018-11-02 16:53:15 +00:00
Kristof Provost
5f6cf24e2d pfsync: Make pfsync callbacks per-vnet
The callbacks are installed and removed depending on the state of the
pfsync device, which is per-vnet. The callbacks must also be per-vnet.

MFC after:	2 weeks
Sponsored by:	Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D17499
2018-11-02 16:47:07 +00:00
Kristof Provost
790194cd47 pf: Limit the fragment entry queue length to 64 per bucket.
So we have a global limit of 1024 fragments, but it is fine grained to
the region of the packet.  Smaller packets may have less fragments.
This costs another 16 bytes of memory per reassembly and devides the
worst case for searching by 8.

Obtained from:	OpenBSD
Differential Revision:	https://reviews.freebsd.org/D17734
2018-11-02 15:32:04 +00:00
Kristof Provost
fd2ea405e6 pf: Split the fragment reassembly queue into smaller parts
Remember 16 entry points based on the fragment offset.  Instead of
a worst case of 8196 list traversals we now check a maximum of 512
list entries or 16 array elements.

Obtained from:	OpenBSD
Differential Revision:	https://reviews.freebsd.org/D17733
2018-11-02 15:26:51 +00:00
Kristof Provost
2b1c354ee6 pf: Count holes rather than fragments for reassembly
Avoid traversing the list of fragment entris to check whether the
pf(4) reassembly is complete.  Instead count the holes that are
created when inserting a fragment.  If there are no holes left, the
fragments are continuous.

Obtained from:	OpenBSD
Differential Revision:	https://reviews.freebsd.org/D17732
2018-11-02 15:23:57 +00:00
Kristof Provost
19a22ae313 Revert "pf: Limit the maximum number of fragments per packet"
This reverts commit r337969.
We'll handle this the OpenBSD way, in upcoming commits.
2018-11-02 15:01:59 +00:00
Kristof Provost
99eb00558a pf: Make ':0' ignore link-local v6 addresses too
When users mark an interface to not use aliases they likely also don't
want to use the link-local v6 address there.

PR:		201695
Submitted by:	Russell Yount <Russell.Yount AT gmail.com>
Differential Revision:	https://reviews.freebsd.org/D17633
2018-10-28 05:32:50 +00:00
Eugene Grosbein
5310c19174 ipfw: implement ngtee/netgraph actions for layer-2 frames.
Kernel part of ipfw does not support and ignores rules other than
"pass", "deny" and dummynet-related for layer-2 (ethernet frames).
Others are processed as "pass".

Make it support ngtee/netgraph rules just like they are supported
for IP packets. For example, this allows us to mirror some frames
selectively to another interface for delivery to remote network analyzer
over RSPAN vlan. Assuming ng_ipfw(4) netgraph node has a hook named "900"
attached to "lower" hook of vlan900's ng_ether(4) node, that would be
as simple as:

ipfw add ngtee 900 ip from any to 8.8.8.8 layer2 out xmit igb0

PR:		213452
MFC after:	1 month
Tested-by:	Fyodor Ustinov <ufm@ufm.su>
2018-10-27 07:32:26 +00:00
Kristof Provost
13d640d376 pf: Fix copy/paste error in IPv6 address rewriting
We checked the destination address, but replaced the source address. This was
fixed in OpenBSD as part of their NAT rework, which we don't want to import
right now.

CID:		1009561
MFC after:	3 weeks
2018-10-24 00:19:44 +00:00
Kristof Provost
73c9014569 pf: ifp can never be NULL in pfi_ifaddr_event()
There's no point in the NULL check for ifp, because we'll already have
dereferenced it by then. Moreover, the event will always have a valid ifp.

Replace the late check with an early assertion.

CID:		1357338
2018-10-23 23:15:44 +00:00
Andrey V. Elsukov
ab108c4b07 Do not decrement RST life time if keep_alive is not turned on.
This allows use differen values configured by user for sysctl variable
net.inet.ip.fw.dyn_rst_lifetime.

Obtained from:	Yandex LLC
MFC after:	3 weeks
Sponsored by:	Yandex LLC
2018-10-21 16:44:57 +00:00
Andrey V. Elsukov
2ffadd56f5 Call inet_ntop() only when its result is needed.
Obtained from:	Yandex LLC
MFC after:	3 weeks
Sponsored by:	Yandex LLC
2018-10-21 16:37:53 +00:00
Andrey V. Elsukov
aa2715612c Retire IPFIREWALL_NAT64_DIRECT_OUTPUT kernel option. And add ability
to switch the output method in run-time. Also document some sysctl
variables that can by changed for NAT64 module.

NAT64 had compile time option IPFIREWALL_NAT64_DIRECT_OUTPUT to use
if_output directly from nat64 module. By default is used netisr based
output method. Now both methods can be used, but they require different
handling by rules.

Obtained from:	Yandex LLC
MFC after:	3 weeks
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D16647
2018-10-21 16:29:12 +00:00
Kristof Provost
1563a27e1f pf synproxy will do the 3WHS on behalf of the target machine, and once
the 3WHS is completed, establish the backend connection. The trigger
for "3WHS completed" is the reception of the first ACK. However, we
should not proceed if that ACK also has RST or FIN set.

PR:		197484
Obtained from:	OpenBSD
MFC after:	2 weeks
2018-10-20 18:37:21 +00:00
Andrey V. Elsukov
986368d85d Add extra parentheses to fix "versrcreach" opcode, (oif != NULL) should
not be used as condition for ternary operator.

Submitted by:	Tatsuki Makino <tatsuki_makino at hotmail dot com>
Approved by:	re (kib)
MFC after:	1 week
2018-10-15 10:25:34 +00:00
John-Mark Gurney
032d3aaa96 Significantly improve pf purge cpu usage by only taking locks
when there is work to do.  This reduces CPU consumption to one
third on systems.  This will help keep the thread CPU usage under
control now that the default hash size has increased.

Reviewed by:	kp
Approved by:	re (kib)
Differential Revision:	https://reviews.freebsd.org/D17097
2018-09-16 00:44:23 +00:00
Patrick Kelsey
249cc75fd1 Extended pf(4) ioctl interface and pfctl(8) to allow bandwidths of
2^32 bps or greater to be used.  Prior to this, bandwidth parameters
would simply wrap at the 2^32 boundary.  The computations in the HFSC
scheduler and token bucket regulator have been modified to operate
correctly up to at least 100 Gbps.  No other algorithms have been
examined or modified for correct operation above 2^32 bps (some may
have existing computation resolution or overflow issues at rates below
that threshold).  pfctl(8) will now limit non-HFSC bandwidth
parameters to 2^32 - 1 before passing them to the kernel.

The extensions to the pf(4) ioctl interface have been made in a
backwards-compatible way by versioning affected data structures,
supporting all versions in the kernel, and implementing macros that
will cause existing code that consumes that interface to use version 0
without source modifications.  If version 0 consumers of the interface
are used against a new kernel that has had bandwidth parameters of
2^32 or greater configured by updated tools, such bandwidth parameters
will be reported as 2^32 - 1 bps by those old consumers.

All in-tree consumers of the pf(4) interface have been updated.  To
update out-of-tree consumers to the latest version of the interface,
define PFIOC_USE_LATEST ahead of any includes and use the code of
pfctl(8) as a guide for the ioctls of interest.

PR:	211730
Reviewed by:	jmallett, kp, loos
MFC after:	2 weeks
Relnotes:	yes
Sponsored by:	RG Nets
Differential Revision:	https://reviews.freebsd.org/D16782
2018-08-22 19:38:48 +00:00
Kristof Provost
d47023236c pf: Limit the maximum number of fragments per packet
Similar to the network stack issue fixed in r337782 pf did not limit the number
of fragments per packet, which could be exploited to generate high CPU loads
with a crafted series of packets.

Limit each packet to no more than 64 fragments. This should be sufficient on
typical networks to allow maximum-sized IP frames.

This addresses the issue for both IPv4 and IPv6.

MFC after:	3 days
Security:	CVE-2018-5391
Sponsored by:	Klara Systems
2018-08-17 15:00:10 +00:00
Luiz Otavio O Souza
a0376d4d29 Fix a typo in comment.
MFC after:	3 days
X-MFC with:	r321316
Sponsored by:	Rubicon Communications, LLC (Netgate)
2018-08-15 16:36:29 +00:00
Kristof Provost
e9ddca4a40 pf: Take the IF_ADDR_RLOCK() when iterating over the group list
We did do this elsewhere in pf, but the lock was missing here.

Sponsored by:	Essen Hackathon
2018-08-11 16:37:55 +00:00
Kristof Provost
33b242b533 pf: Fix 'set skip on' for groups
The pfi_skip_if() function sometimes caused skipping of groups to work,
if the members of the group used the groupname as a name prefix.
This is often the case, e.g. group lo usually contains lo0, lo1, ...,
but not always.

Rather than relying on the name explicitly check for group memberships.

Obtained from:	OpenBSD (pf_if.c,v 1.62, pf_if.c,v 1.63)
Sponsored by:	Essen Hackathon
2018-08-11 16:34:30 +00:00
Andrey V. Elsukov
5c4aca8218 Use host byte order when comparing mss values.
This fixes tcp-setmss action on little endian machines.

PR:		225536
Submitted by:	John Zielinski
2018-08-08 17:32:02 +00:00
Andrew Turner
5f901c92a8 Use the new VNET_DEFINE_STATIC macro when we are defining static VNET
variables.

Reviewed by:	bz
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D16147
2018-07-24 16:35:52 +00:00
Kristof Provost
32ece669c2 pf: Fix synproxy
Synproxy was accidentally broken by r335569. The 'return (action)' must be
executed for every non-PF_PASS result, but the error packet (TCP RST or ICMP
error) should only be sent if the packet was dropped (i.e. PF_DROP) and the
return flag is set.

PR:		229477
Submitted by:	Andre Albsmeier <mail AT fbsd.e4m.org>
MFC after:	1 week
2018-07-14 10:14:59 +00:00
Kristof Provost
3e603d1ffa pf: Fix panic on vnet jail shutdown with synproxy
When shutting down a vnet jail pf_shutdown() clears the remaining states, which
through pf_clear_states() calls pf_unlink_state().
For synproxy states pf_unlink_state() will send a TCP RST, which eventually
tries to schedule the pf swi in pf_send(). This means we can't remove the
software interrupt until after pf_shutdown().

MFC after:	1 week
2018-07-14 09:11:32 +00:00
Andrey V. Elsukov
0a2c13d333 Use correct size when we are allocating array for skipto index.
Also, there is no need to use M_ZERO for idxmap_back. It will be
re-filled just after allocation in update_skipto_cache().

PR:		229665
MFC after:	1 week
2018-07-12 11:38:18 +00:00
Andrey V. Elsukov
f7c4fdee1a Add "record-state", "set-limit" and "defer-action" rule options to ipfw.
"record-state" is similar to "keep-state", but it doesn't produce implicit
O_PROBE_STATE opcode in a rule. "set-limit" is like "limit", but it has the
same feature as "record-state", it is single opcode without implicit
O_PROBE_STATE opcode. "defer-action" is targeted to be used with dynamic
states. When rule with this opcode is matched, the rule's action will
not be executed, instead dynamic state will be created. And when this
state will be matched by "check-state", then rule action will be executed.
This allows create a more complicated rulesets.

Submitted by:	lev
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D1776
2018-07-09 11:35:18 +00:00
Andrew Turner
2bf9501287 Create a new macro for static DPCPU data.
On arm64 (and possible other architectures) we are unable to use static
DPCPU data in kernel modules. This is because the compiler will generate
PC-relative accesses, however the runtime-linker expects to be able to
relocate these.

In preparation to fix this create two macros depending on if the data is
global or static.

Reviewed by:	bz, emaste, markj
Sponsored by:	ABT Systems Ltd
Differential Revision:	https://reviews.freebsd.org/D16140
2018-07-05 17:13:37 +00:00
Will Andrews
cc535c95ca Revert r335833.
Several third-parties use at least some of these ioctls.  While it would be
better for regression testing if they were used in base (or at least in the
test suite), it's currently not worth the trouble to push through removal.

Submitted by:	antoine, markj
2018-07-04 03:36:46 +00:00
Will Andrews
c1887e9f09 pf: remove unused ioctls.
Several ioctls are unused in pf, in the sense that no base utility
references them.  Additionally, a cursory review of pf-based ports
indicates they're not used elsewhere either.  Some of them have been
unused since the original import.  As far as I can tell, they're also
unused in OpenBSD.  Finally, removing this code removes the need for
future pf work to take them into account.

Reviewed by:		kp
Differential Revision:	https://reviews.freebsd.org/D16076
2018-07-01 01:16:03 +00:00
Kristof Provost
de210decd1 pfsync: Fix state sync during initial bulk update
States learned via pfsync from a peer with the same ruleset checksum were not
getting assigned to rules like they should because pfsync_in_upd() wasn't
passing the PFSYNC_SI_CKSUM flag along to pfsync_state_import.

PR:		229092
Submitted by:	Kajetan Staszkiewicz <vegeta tuxpowered.net>
Obtained from:	OpenBSD
MFC after:	1 week
Sponsored by:	InnoGames GmbH
2018-06-30 12:51:08 +00:00
Kristof Provost
150182e309 pf: Support "return" statements in passing rules when they fail.
Normally pf rules are expected to do one of two things: pass the traffic or
block it. Blocking can be silent - "drop", or loud - "return", "return-rst",
"return-icmp". Yet there is a 3rd category of traffic passing through pf:
Packets matching a "pass" rule but when applying the rule fails. This happens
when redirection table is empty or when src node or state creation fails. Such
rules always fail silently without notifying the sender.

Allow users to configure this behaviour too, so that pf returns an error packet
in these cases.

PR:		226850
Submitted by:	Kajetan Staszkiewicz <vegeta tuxpowered.net>
MFC after:	1 week
Sponsored by:	InnoGames GmbH
2018-06-22 21:59:30 +00:00
Andrey V. Elsukov
20efcfc602 Switch RIB and RADIX_NODE_HEAD lock from rwlock(9) to rmlock(9).
Using of rwlock with multiqueue NICs for IP forwarding on high pps
produces high lock contention and inefficient. Rmlock fits better for
such workloads.

Reviewed by:	melifaro, olivier
Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D15789
2018-06-16 08:26:23 +00:00
Kristof Provost
0b799353d8 pf: Fix deadlock with route-to
If a locally generated packet is routed (with route-to/reply-to/dup-to) out of
a different interface it's passed through the firewall again. This meant we
lost the inp pointer and if we required the pointer (e.g. for user ID matching)
we'd deadlock trying to acquire an inp lock we've already got.

Pass the inp pointer along with pf_route()/pf_route6().

PR:		228782
MFC after:	1 week
2018-06-09 14:17:06 +00:00
Mateusz Guzik
4e180881ae uma: implement provisional api for per-cpu zones
Per-cpu zone allocations are very rarely done compared to regular zones.
The intent is to avoid pessimizing the latter case with per-cpu specific
code.

In particular contrary to the claim in r334824, M_ZERO is sometimes being
used for such zones. But the zeroing method is completely different and
braching on it in the fast path for regular zones is a waste of time.
2018-06-08 21:40:03 +00:00
Kristof Provost
455969d305 pf: Replace rwlock on PF_RULES_LOCK with rmlock
Given that PF_RULES_LOCK is a mostly read lock, replace the rwlock with rmlock.
This change improves packet processing rate in high pps environments.
Benchmarking by olivier@ shows a 65% improvement in pps.

While here, also eliminate all appearances of "sys/rwlock.h" includes since it
is not used anymore.

Submitted by:	farrokhi@
Differential Revision:	https://reviews.freebsd.org/D15502
2018-05-30 07:11:33 +00:00
Matt Macy
4f6c66cc9c UDP: further performance improvements on tx
Cumulative throughput while running 64
  netperf -H $DUT -t UDP_STREAM -- -m 1
on a 2x8x2 SKL went from 1.1Mpps to 2.5Mpps

Single stream throughput increases from 910kpps to 1.18Mpps

Baseline:
https://people.freebsd.org/~mmacy/2018.05.11/udpsender2.svg

- Protect read access to global ifnet list with epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender3.svg

- Protect short lived ifaddr references with epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender4.svg

- Convert if_afdata read lock path to epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender5.svg

A fix for the inpcbhash contention is pending sufficient time
on a canary at LLNW.

Reviewed by:	gallatin
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15409
2018-05-23 21:02:14 +00:00
Andrey V. Elsukov
67ad3c0bf9 Restore the ability to keep states after parent rule deletion.
This feature is disabled by default and was removed when dynamic states
implementation changed to be lockless. Now it is reimplemented with small
differences - when dyn_keep_states sysctl variable is enabled,
dyn_match_ipv[46]_state() function doesn't match child states of deleted
rule. And thus they are keept alive until expired. ipfw_dyn_lookup_state()
function does check that state was not orphaned, and if so, it returns
pointer to default_rule and its position in the rules map. The main visible
difference is that orphaned states still have the same rule number that
they have before parent rule deleted, because now a state has many fields
related to rule and changing them all atomically to point to default_rule
seems hard enough.

Reported by:	<lantw44 at gmail.com>
MFC after:	2 days
2018-05-22 13:28:05 +00:00
Andrey V. Elsukov
4bb8a5b0c9 Remove check for matching the rulenum, ruleid and rule pointer from
dyn_lookup_ipv[46]_state_locked(). These checks are remnants of not
ready to be committed code, and they are there by accident.
Due to the race these checks can lead to creating of duplicate states
when concurrent threads in the same time will try to add state for two
packets of the same flow, but in reverse directions and matched by
different parent rules.

Reported by:	lev
MFC after:	3 days
2018-05-21 16:19:00 +00:00
Matt Macy
d7c5a620e2 ifnet: Replace if_addr_lock rwlock with epoch + mutex
Run on LLNW canaries and tested by pho@

gallatin:
Using a 14-core, 28-HTT single socket E5-2697 v3 with a 40GbE MLX5
based ConnectX 4-LX NIC, I see an almost 12% improvement in received
packet rate, and a larger improvement in bytes delivered all the way
to userspace.

When the host receiving 64 streams of netperf -H $DUT -t UDP_STREAM -- -m 1,
I see, using nstat -I mce0 1 before the patch:

InMpps OMpps  InGbs  OGbs err TCP Est %CPU syscalls csw     irq GBfree
4.98   0.00   4.42   0.00 4235592     33   83.80 4720653 2149771   1235 247.32
4.73   0.00   4.20   0.00 4025260     33   82.99 4724900 2139833   1204 247.32
4.72   0.00   4.20   0.00 4035252     33   82.14 4719162 2132023   1264 247.32
4.71   0.00   4.21   0.00 4073206     33   83.68 4744973 2123317   1347 247.32
4.72   0.00   4.21   0.00 4061118     33   80.82 4713615 2188091   1490 247.32
4.72   0.00   4.21   0.00 4051675     33   85.29 4727399 2109011   1205 247.32
4.73   0.00   4.21   0.00 4039056     33   84.65 4724735 2102603   1053 247.32

After the patch

InMpps OMpps  InGbs  OGbs err TCP Est %CPU syscalls csw     irq GBfree
5.43   0.00   4.20   0.00 3313143     33   84.96 5434214 1900162   2656 245.51
5.43   0.00   4.20   0.00 3308527     33   85.24 5439695 1809382   2521 245.51
5.42   0.00   4.19   0.00 3316778     33   87.54 5416028 1805835   2256 245.51
5.42   0.00   4.19   0.00 3317673     33   90.44 5426044 1763056   2332 245.51
5.42   0.00   4.19   0.00 3314839     33   88.11 5435732 1792218   2499 245.52
5.44   0.00   4.19   0.00 3293228     33   91.84 5426301 1668597   2121 245.52

Similarly, netperf reports 230Mb/s before the patch, and 270Mb/s after the patch

Reviewed by:	gallatin
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15366
2018-05-18 20:13:34 +00:00
Andrey V. Elsukov
782360dec3 Bring in some last changes in NAT64 implementation:
o Modify ipfw(8) to be able set any prefix6 not just Well-Known,
  and also show configured prefix6;
o relocate some definitions and macros into proper place;
o convert nat64_debug and nat64_allow_private variables to be
  VNET-compatible;
o add struct nat64_config that keeps generic configuration needed
  to NAT64 code;
o add nat64_check_prefix6() function to check validness of specified
  by user IPv6 prefix according to RFC6052;
o use nat64_check_private_ip4() and nat64_embed_ip4() functions
  instead of nat64_get_ip4() and nat64_set_ip4() macros. This allows
  to use any configured IPv6 prefixes that are allowed by RFC6052;
o introduce NAT64_WKPFX flag, that is set when IPv6 prefix is
  Well-Known IPv6 prefix. It is used to reduce overhead to check this;
o modify nat64lsn_cfg and nat64stl_cfg structures to use nat64_config
  structure. And respectivelly modify the rest of code;
o remove now unused ro argument from nat64_output() function;
o remove __FreeBSD_version ifdef, NAT64 was not merged to older versions;
o add commented -DIPFIREWALL_NAT64_DIRECT_OUTPUT flag to module's Makefile
  as example.

Obtained from:	Yandex LLC
MFC after:	1 month
Sponsored by:	Yandex LLC
2018-05-09 11:59:24 +00:00
Sean Bruno
2695c9c109 Retire ixgb(4)
This driver was for an early and uncommon legacy PCI 10GbE for a single
ASIC, Intel 82597EX. Intel quickly shifted to the long lived ixgbe family.

Submitted by:	kbowling
Reviewed by:	brooks imp jeffrey.e.pieper@intel.com
Relnotes:	yes
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15234
2018-05-02 15:59:15 +00:00
Andrey V. Elsukov
5f69d0a4ff To avoid possible deadlock do not acquire JQUEUE_LOCK before callout_drain.
Obtained from:	Yandex LLC
MFC after:	1 week
Sponsored by:	Yandex LLC
2018-04-13 10:03:30 +00:00
Andrey V. Elsukov
2d8fcffb99 Fix integer types mismatch for flags field in nat64stl_cfg structure.
Also preserve internal flags on NAT64STL reconfiguration.

Obtained from:	Yandex LLC
MFC after:	1 week
Sponsored by:	Yandex LLC
2018-04-12 21:29:40 +00:00
Andrey V. Elsukov
eed302572a Use cfg->nomatch_verdict as return value from NAT64LSN handler when
given mbuf is considered as not matched.

If mbuf was consumed or freed during handling, we must return
IP_FW_DENY, since ipfw's pfil handler ipfw_check_packet() expects
IP_FW_DENY when mbuf pointer is NULL. This fixes KASSERT panics
when NAT64 is used with INVARIANTS. Also remove unused nomatch_final
field from struct nat64lsn_cfg.

Reported by:	Justin Holcomb <justin at justinholcomb dot me>
Obtained from:	Yandex LLC
MFC after:	1 week
Sponsored by:	Yandex LLC
2018-04-12 21:13:30 +00:00
Andrey V. Elsukov
c570565f12 Migrate NAT64 to FIB KPI.
Obtained from:	Yandex LLC
MFC after:	1 week
2018-04-12 21:05:20 +00:00
Kristof Provost
c41420d5dc pf: limit ioctl to a reasonable and tuneable number of elements
pf ioctls frequently take a variable number of elements as argument. This can
potentially allow users to request very large allocations.  These will fail,
but even a failing M_NOWAIT might tie up resources and result in concurrent
M_WAITOK allocations entering vm_wait and inducing reclamation of caches.

Limit these ioctls to what should be a reasonable value, but allow users to
tune it should they need to.

Differential Revision:	https://reviews.freebsd.org/D15018
2018-04-11 11:43:12 +00:00
Oleg Bulyzhin
3995ad1768 Fix ipfw table creation when net.inet.ip.fw.tables_sets = 0 and non zero set
specified on table creation. This fixes following:

# sysctl net.inet.ip.fw.tables_sets
net.inet.ip.fw.tables_sets: 0
# ipfw table all info
# ipfw set 1 table 1 create type addr
# ipfw set 1 table 1 create type addr
# ipfw add 10 set 1 count ip from table\(1\) to any
00010 count ip from table(1) to any
# ipfw add 10 set 1 count ip from table\(1\) to any
00010 count ip from table(1) to any
# ipfw table all info
--- table(1), set(1) ---
 kindex: 4, type: addr
 references: 1, valtype: legacy
 algorithm: addr:radix
 items: 0, size: 296
--- table(1), set(1) ---
 kindex: 3, type: addr
 references: 1, valtype: legacy
 algorithm: addr:radix
 items: 0, size: 296
--- table(1), set(1) ---
 kindex: 2, type: addr
 references: 0, valtype: legacy
 algorithm: addr:radix
 items: 0, size: 296
--- table(1), set(1) ---
 kindex: 1, type: addr
 references: 0, valtype: legacy
 algorithm: addr:radix
 items: 0, size: 296
#

MFC after:	1 week
2018-04-11 11:12:20 +00:00
Kristof Provost
1a125a2f7f pf: Improve ioctl validation
Ensure that multiplications for memory allocations cannot overflow, and
that we'll not try to allocate M_WAITOK for potentially overly large
allocations.

MFC after:	1 week
2018-04-06 19:36:35 +00:00
Kristof Provost
02214ac854 pf: Improve ioctl validation for DIOCIGETIFACES and DIOCXCOMMIT
These ioctls can process a number of items at a time, which puts us at
risk of overflow in mallocarray() and of impossibly large allocations
even if we don't overflow.

There's no obvious limit to the request size for these, so we limit the
requests to something which won't overflow. Change the memory allocation
to M_NOWAIT so excessive requests will fail rather than stall forever.

MFC after:	1 week
2018-04-06 19:20:45 +00:00
Kristof Provost
adfe2f6aff pf: Improve ioctl validation for DIOCRGETTABLES, DIOCRGETTSTATS, DIOCRCLRTSTATS and DIOCRSETTFLAGS
These ioctls can process a number of items at a time, which puts us at
risk of overflow in mallocarray() and of impossibly large allocations
even if we don't overflow.

Limit the allocation to required size (or the user allocation, if that's
smaller). That does mean we need to do the allocation with the rules
lock held (so the number doesn't change while we're doing this), so it
can't M_WAITOK.

MFC after:	1 week
2018-04-06 15:54:30 +00:00
Kristof Provost
8748b499c1 pf: Improve ioctl validation for DIOCRADDTABLES and DIOCRDELTABLES
The DIOCRADDTABLES and DIOCRDELTABLES ioctls can process a number of
tables at a time, and as such try to allocate <number of tables> *
sizeof(struct pfr_table). This multiplication can overflow. Thanks to
mallocarray() this is not exploitable, but an overflow does panic the
system.

Arbitrarily limit this to 65535 tables. pfctl only ever processes one
table at a time, so it presents no issues there.

MFC after:	1 week
2018-04-06 15:01:45 +00:00
Brooks Davis
541d96aaaf Use an accessor function to access ifr_data.
This fixes 32-bit compat (no ioctl command defintions are required
as struct ifreq is the same size).  This is believed to be sufficent to
fully support ifconfig on 32-bit systems.

Reviewed by:	kib
Obtained from:	CheriBSD
MFC after:	1 week
Relnotes:	yes
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14900
2018-03-30 18:50:13 +00:00
Kristof Provost
effaab8861 netpfil: Introduce PFIL_FWD flag
Forwarded packets passed through PFIL_OUT, which made it difficult for
firewalls to figure out if they were forwarding or producing packets. This in
turn is an issue for pf for IPv6 fragment handling: it needs to call
ip6_output() or ip6_forward() to handle the fragments. Figuring out which was
difficult (and until now, incorrect).
Having pfil distinguish the two removes an ugly piece of code from pf.

Introduce a new variant of the netpfil callbacks with a flags variable, which
has PFIL_FWD set for forwarded packets. This allows pf to reliably work out if
a packet is forwarded.

Reviewed by:	ae, kevans
Differential Revision:	https://reviews.freebsd.org/D13715
2018-03-23 16:56:44 +00:00
Kristof Provost
b4b8fa3387 pf: Fix memory leak in DIOCRADDTABLES
If a user attempts to add two tables with the same name the duplicate table
will not be added, but we forgot to free the duplicate table, leaking memory.
Ensure we free the duplicate table in the error path.

Reported by:	Coverity
CID:		1382111
MFC after:	3 weeks
2018-03-19 21:13:25 +00:00
Andrey V. Elsukov
12c080e613 Do not try to reassemble IPv6 fragments in "reass" rule.
ip_reass() expects IPv4 packet and will just corrupt any IPv6 packets
that it gets. Until proper IPv6 fragments handling function will be
implemented, pass IPv6 packets to next rule.

PR:		170604
MFC after:	1 week
2018-03-12 09:40:46 +00:00
Kristof Provost
bf56a3fe47 pf: Cope with overly large net.pf.states_hashsize
If the user configures a states_hashsize or source_nodes_hashsize value we may
not have enough memory to allocate this. This used to lock up pf, because these
allocations used M_WAITOK.

Cope with this by attempting the allocation with M_NOWAIT and falling back to
the default sizes (with M_WAITOK) if these fail.

PR:		209475
Submitted by:	Fehmi Noyan Isi <fnoyanisi AT yahoo.com>
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D14367
2018-02-25 08:56:44 +00:00
Andrey V. Elsukov
99493f5a4a Remove duplicate #include <netinet/ip_var.h>. 2018-02-07 19:12:05 +00:00