Commit Graph

709 Commits

Author SHA1 Message Date
Kristof Provost
150182e309 pf: Support "return" statements in passing rules when they fail.
Normally pf rules are expected to do one of two things: pass the traffic or
block it. Blocking can be silent - "drop", or loud - "return", "return-rst",
"return-icmp". Yet there is a 3rd category of traffic passing through pf:
Packets matching a "pass" rule but when applying the rule fails. This happens
when redirection table is empty or when src node or state creation fails. Such
rules always fail silently without notifying the sender.

Allow users to configure this behaviour too, so that pf returns an error packet
in these cases.

PR:		226850
Submitted by:	Kajetan Staszkiewicz <vegeta tuxpowered.net>
MFC after:	1 week
Sponsored by:	InnoGames GmbH
2018-06-22 21:59:30 +00:00
Andrey V. Elsukov
20efcfc602 Switch RIB and RADIX_NODE_HEAD lock from rwlock(9) to rmlock(9).
Using of rwlock with multiqueue NICs for IP forwarding on high pps
produces high lock contention and inefficient. Rmlock fits better for
such workloads.

Reviewed by:	melifaro, olivier
Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D15789
2018-06-16 08:26:23 +00:00
Kristof Provost
0b799353d8 pf: Fix deadlock with route-to
If a locally generated packet is routed (with route-to/reply-to/dup-to) out of
a different interface it's passed through the firewall again. This meant we
lost the inp pointer and if we required the pointer (e.g. for user ID matching)
we'd deadlock trying to acquire an inp lock we've already got.

Pass the inp pointer along with pf_route()/pf_route6().

PR:		228782
MFC after:	1 week
2018-06-09 14:17:06 +00:00
Mateusz Guzik
4e180881ae uma: implement provisional api for per-cpu zones
Per-cpu zone allocations are very rarely done compared to regular zones.
The intent is to avoid pessimizing the latter case with per-cpu specific
code.

In particular contrary to the claim in r334824, M_ZERO is sometimes being
used for such zones. But the zeroing method is completely different and
braching on it in the fast path for regular zones is a waste of time.
2018-06-08 21:40:03 +00:00
Kristof Provost
455969d305 pf: Replace rwlock on PF_RULES_LOCK with rmlock
Given that PF_RULES_LOCK is a mostly read lock, replace the rwlock with rmlock.
This change improves packet processing rate in high pps environments.
Benchmarking by olivier@ shows a 65% improvement in pps.

While here, also eliminate all appearances of "sys/rwlock.h" includes since it
is not used anymore.

Submitted by:	farrokhi@
Differential Revision:	https://reviews.freebsd.org/D15502
2018-05-30 07:11:33 +00:00
Matt Macy
4f6c66cc9c UDP: further performance improvements on tx
Cumulative throughput while running 64
  netperf -H $DUT -t UDP_STREAM -- -m 1
on a 2x8x2 SKL went from 1.1Mpps to 2.5Mpps

Single stream throughput increases from 910kpps to 1.18Mpps

Baseline:
https://people.freebsd.org/~mmacy/2018.05.11/udpsender2.svg

- Protect read access to global ifnet list with epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender3.svg

- Protect short lived ifaddr references with epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender4.svg

- Convert if_afdata read lock path to epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender5.svg

A fix for the inpcbhash contention is pending sufficient time
on a canary at LLNW.

Reviewed by:	gallatin
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15409
2018-05-23 21:02:14 +00:00
Andrey V. Elsukov
67ad3c0bf9 Restore the ability to keep states after parent rule deletion.
This feature is disabled by default and was removed when dynamic states
implementation changed to be lockless. Now it is reimplemented with small
differences - when dyn_keep_states sysctl variable is enabled,
dyn_match_ipv[46]_state() function doesn't match child states of deleted
rule. And thus they are keept alive until expired. ipfw_dyn_lookup_state()
function does check that state was not orphaned, and if so, it returns
pointer to default_rule and its position in the rules map. The main visible
difference is that orphaned states still have the same rule number that
they have before parent rule deleted, because now a state has many fields
related to rule and changing them all atomically to point to default_rule
seems hard enough.

Reported by:	<lantw44 at gmail.com>
MFC after:	2 days
2018-05-22 13:28:05 +00:00
Andrey V. Elsukov
4bb8a5b0c9 Remove check for matching the rulenum, ruleid and rule pointer from
dyn_lookup_ipv[46]_state_locked(). These checks are remnants of not
ready to be committed code, and they are there by accident.
Due to the race these checks can lead to creating of duplicate states
when concurrent threads in the same time will try to add state for two
packets of the same flow, but in reverse directions and matched by
different parent rules.

Reported by:	lev
MFC after:	3 days
2018-05-21 16:19:00 +00:00
Matt Macy
d7c5a620e2 ifnet: Replace if_addr_lock rwlock with epoch + mutex
Run on LLNW canaries and tested by pho@

gallatin:
Using a 14-core, 28-HTT single socket E5-2697 v3 with a 40GbE MLX5
based ConnectX 4-LX NIC, I see an almost 12% improvement in received
packet rate, and a larger improvement in bytes delivered all the way
to userspace.

When the host receiving 64 streams of netperf -H $DUT -t UDP_STREAM -- -m 1,
I see, using nstat -I mce0 1 before the patch:

InMpps OMpps  InGbs  OGbs err TCP Est %CPU syscalls csw     irq GBfree
4.98   0.00   4.42   0.00 4235592     33   83.80 4720653 2149771   1235 247.32
4.73   0.00   4.20   0.00 4025260     33   82.99 4724900 2139833   1204 247.32
4.72   0.00   4.20   0.00 4035252     33   82.14 4719162 2132023   1264 247.32
4.71   0.00   4.21   0.00 4073206     33   83.68 4744973 2123317   1347 247.32
4.72   0.00   4.21   0.00 4061118     33   80.82 4713615 2188091   1490 247.32
4.72   0.00   4.21   0.00 4051675     33   85.29 4727399 2109011   1205 247.32
4.73   0.00   4.21   0.00 4039056     33   84.65 4724735 2102603   1053 247.32

After the patch

InMpps OMpps  InGbs  OGbs err TCP Est %CPU syscalls csw     irq GBfree
5.43   0.00   4.20   0.00 3313143     33   84.96 5434214 1900162   2656 245.51
5.43   0.00   4.20   0.00 3308527     33   85.24 5439695 1809382   2521 245.51
5.42   0.00   4.19   0.00 3316778     33   87.54 5416028 1805835   2256 245.51
5.42   0.00   4.19   0.00 3317673     33   90.44 5426044 1763056   2332 245.51
5.42   0.00   4.19   0.00 3314839     33   88.11 5435732 1792218   2499 245.52
5.44   0.00   4.19   0.00 3293228     33   91.84 5426301 1668597   2121 245.52

Similarly, netperf reports 230Mb/s before the patch, and 270Mb/s after the patch

Reviewed by:	gallatin
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15366
2018-05-18 20:13:34 +00:00
Andrey V. Elsukov
782360dec3 Bring in some last changes in NAT64 implementation:
o Modify ipfw(8) to be able set any prefix6 not just Well-Known,
  and also show configured prefix6;
o relocate some definitions and macros into proper place;
o convert nat64_debug and nat64_allow_private variables to be
  VNET-compatible;
o add struct nat64_config that keeps generic configuration needed
  to NAT64 code;
o add nat64_check_prefix6() function to check validness of specified
  by user IPv6 prefix according to RFC6052;
o use nat64_check_private_ip4() and nat64_embed_ip4() functions
  instead of nat64_get_ip4() and nat64_set_ip4() macros. This allows
  to use any configured IPv6 prefixes that are allowed by RFC6052;
o introduce NAT64_WKPFX flag, that is set when IPv6 prefix is
  Well-Known IPv6 prefix. It is used to reduce overhead to check this;
o modify nat64lsn_cfg and nat64stl_cfg structures to use nat64_config
  structure. And respectivelly modify the rest of code;
o remove now unused ro argument from nat64_output() function;
o remove __FreeBSD_version ifdef, NAT64 was not merged to older versions;
o add commented -DIPFIREWALL_NAT64_DIRECT_OUTPUT flag to module's Makefile
  as example.

Obtained from:	Yandex LLC
MFC after:	1 month
Sponsored by:	Yandex LLC
2018-05-09 11:59:24 +00:00
Sean Bruno
2695c9c109 Retire ixgb(4)
This driver was for an early and uncommon legacy PCI 10GbE for a single
ASIC, Intel 82597EX. Intel quickly shifted to the long lived ixgbe family.

Submitted by:	kbowling
Reviewed by:	brooks imp jeffrey.e.pieper@intel.com
Relnotes:	yes
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15234
2018-05-02 15:59:15 +00:00
Andrey V. Elsukov
5f69d0a4ff To avoid possible deadlock do not acquire JQUEUE_LOCK before callout_drain.
Obtained from:	Yandex LLC
MFC after:	1 week
Sponsored by:	Yandex LLC
2018-04-13 10:03:30 +00:00
Andrey V. Elsukov
2d8fcffb99 Fix integer types mismatch for flags field in nat64stl_cfg structure.
Also preserve internal flags on NAT64STL reconfiguration.

Obtained from:	Yandex LLC
MFC after:	1 week
Sponsored by:	Yandex LLC
2018-04-12 21:29:40 +00:00
Andrey V. Elsukov
eed302572a Use cfg->nomatch_verdict as return value from NAT64LSN handler when
given mbuf is considered as not matched.

If mbuf was consumed or freed during handling, we must return
IP_FW_DENY, since ipfw's pfil handler ipfw_check_packet() expects
IP_FW_DENY when mbuf pointer is NULL. This fixes KASSERT panics
when NAT64 is used with INVARIANTS. Also remove unused nomatch_final
field from struct nat64lsn_cfg.

Reported by:	Justin Holcomb <justin at justinholcomb dot me>
Obtained from:	Yandex LLC
MFC after:	1 week
Sponsored by:	Yandex LLC
2018-04-12 21:13:30 +00:00
Andrey V. Elsukov
c570565f12 Migrate NAT64 to FIB KPI.
Obtained from:	Yandex LLC
MFC after:	1 week
2018-04-12 21:05:20 +00:00
Kristof Provost
c41420d5dc pf: limit ioctl to a reasonable and tuneable number of elements
pf ioctls frequently take a variable number of elements as argument. This can
potentially allow users to request very large allocations.  These will fail,
but even a failing M_NOWAIT might tie up resources and result in concurrent
M_WAITOK allocations entering vm_wait and inducing reclamation of caches.

Limit these ioctls to what should be a reasonable value, but allow users to
tune it should they need to.

Differential Revision:	https://reviews.freebsd.org/D15018
2018-04-11 11:43:12 +00:00
Oleg Bulyzhin
3995ad1768 Fix ipfw table creation when net.inet.ip.fw.tables_sets = 0 and non zero set
specified on table creation. This fixes following:

# sysctl net.inet.ip.fw.tables_sets
net.inet.ip.fw.tables_sets: 0
# ipfw table all info
# ipfw set 1 table 1 create type addr
# ipfw set 1 table 1 create type addr
# ipfw add 10 set 1 count ip from table\(1\) to any
00010 count ip from table(1) to any
# ipfw add 10 set 1 count ip from table\(1\) to any
00010 count ip from table(1) to any
# ipfw table all info
--- table(1), set(1) ---
 kindex: 4, type: addr
 references: 1, valtype: legacy
 algorithm: addr:radix
 items: 0, size: 296
--- table(1), set(1) ---
 kindex: 3, type: addr
 references: 1, valtype: legacy
 algorithm: addr:radix
 items: 0, size: 296
--- table(1), set(1) ---
 kindex: 2, type: addr
 references: 0, valtype: legacy
 algorithm: addr:radix
 items: 0, size: 296
--- table(1), set(1) ---
 kindex: 1, type: addr
 references: 0, valtype: legacy
 algorithm: addr:radix
 items: 0, size: 296
#

MFC after:	1 week
2018-04-11 11:12:20 +00:00
Kristof Provost
1a125a2f7f pf: Improve ioctl validation
Ensure that multiplications for memory allocations cannot overflow, and
that we'll not try to allocate M_WAITOK for potentially overly large
allocations.

MFC after:	1 week
2018-04-06 19:36:35 +00:00
Kristof Provost
02214ac854 pf: Improve ioctl validation for DIOCIGETIFACES and DIOCXCOMMIT
These ioctls can process a number of items at a time, which puts us at
risk of overflow in mallocarray() and of impossibly large allocations
even if we don't overflow.

There's no obvious limit to the request size for these, so we limit the
requests to something which won't overflow. Change the memory allocation
to M_NOWAIT so excessive requests will fail rather than stall forever.

MFC after:	1 week
2018-04-06 19:20:45 +00:00
Kristof Provost
adfe2f6aff pf: Improve ioctl validation for DIOCRGETTABLES, DIOCRGETTSTATS, DIOCRCLRTSTATS and DIOCRSETTFLAGS
These ioctls can process a number of items at a time, which puts us at
risk of overflow in mallocarray() and of impossibly large allocations
even if we don't overflow.

Limit the allocation to required size (or the user allocation, if that's
smaller). That does mean we need to do the allocation with the rules
lock held (so the number doesn't change while we're doing this), so it
can't M_WAITOK.

MFC after:	1 week
2018-04-06 15:54:30 +00:00
Kristof Provost
8748b499c1 pf: Improve ioctl validation for DIOCRADDTABLES and DIOCRDELTABLES
The DIOCRADDTABLES and DIOCRDELTABLES ioctls can process a number of
tables at a time, and as such try to allocate <number of tables> *
sizeof(struct pfr_table). This multiplication can overflow. Thanks to
mallocarray() this is not exploitable, but an overflow does panic the
system.

Arbitrarily limit this to 65535 tables. pfctl only ever processes one
table at a time, so it presents no issues there.

MFC after:	1 week
2018-04-06 15:01:45 +00:00
Brooks Davis
541d96aaaf Use an accessor function to access ifr_data.
This fixes 32-bit compat (no ioctl command defintions are required
as struct ifreq is the same size).  This is believed to be sufficent to
fully support ifconfig on 32-bit systems.

Reviewed by:	kib
Obtained from:	CheriBSD
MFC after:	1 week
Relnotes:	yes
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14900
2018-03-30 18:50:13 +00:00
Kristof Provost
effaab8861 netpfil: Introduce PFIL_FWD flag
Forwarded packets passed through PFIL_OUT, which made it difficult for
firewalls to figure out if they were forwarding or producing packets. This in
turn is an issue for pf for IPv6 fragment handling: it needs to call
ip6_output() or ip6_forward() to handle the fragments. Figuring out which was
difficult (and until now, incorrect).
Having pfil distinguish the two removes an ugly piece of code from pf.

Introduce a new variant of the netpfil callbacks with a flags variable, which
has PFIL_FWD set for forwarded packets. This allows pf to reliably work out if
a packet is forwarded.

Reviewed by:	ae, kevans
Differential Revision:	https://reviews.freebsd.org/D13715
2018-03-23 16:56:44 +00:00
Kristof Provost
b4b8fa3387 pf: Fix memory leak in DIOCRADDTABLES
If a user attempts to add two tables with the same name the duplicate table
will not be added, but we forgot to free the duplicate table, leaking memory.
Ensure we free the duplicate table in the error path.

Reported by:	Coverity
CID:		1382111
MFC after:	3 weeks
2018-03-19 21:13:25 +00:00
Andrey V. Elsukov
12c080e613 Do not try to reassemble IPv6 fragments in "reass" rule.
ip_reass() expects IPv4 packet and will just corrupt any IPv6 packets
that it gets. Until proper IPv6 fragments handling function will be
implemented, pass IPv6 packets to next rule.

PR:		170604
MFC after:	1 week
2018-03-12 09:40:46 +00:00
Kristof Provost
bf56a3fe47 pf: Cope with overly large net.pf.states_hashsize
If the user configures a states_hashsize or source_nodes_hashsize value we may
not have enough memory to allocate this. This used to lock up pf, because these
allocations used M_WAITOK.

Cope with this by attempting the allocation with M_NOWAIT and falling back to
the default sizes (with M_WAITOK) if these fail.

PR:		209475
Submitted by:	Fehmi Noyan Isi <fnoyanisi AT yahoo.com>
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D14367
2018-02-25 08:56:44 +00:00
Andrey V. Elsukov
99493f5a4a Remove duplicate #include <netinet/ip_var.h>. 2018-02-07 19:12:05 +00:00
Andrey V. Elsukov
b99a682320 Rework ipfw dynamic states implementation to be lockless on fast path.
o added struct ipfw_dyn_info that keeps all needed for ipfw_chk and
  for dynamic states implementation information;
o added DYN_LOOKUP_NEEDED() macro that can be used to determine the
  need of new lookup of dynamic states;
o ipfw_dyn_rule now becomes obsolete. Currently it used to pass
  information from kernel to userland only.
o IPv4 and IPv6 states now described by different structures
  dyn_ipv4_state and dyn_ipv6_state;
o IPv6 scope zones support is added;
o ipfw(4) now depends from Concurrency Kit;
o states are linked with "entry" field using CK_SLIST. This allows
  lockless lookup and protected by mutex modifications.
o the "expired" SLIST field is used for states expiring.
o struct dyn_data is used to keep generic information for both IPv4
  and IPv6;
o struct dyn_parent is used to keep O_LIMIT_PARENT information;
o IPv4 and IPv6 states are stored in different hash tables;
o O_LIMIT_PARENT states now are kept separately from O_LIMIT and
  O_KEEP_STATE states;
o per-cpu dyn_hp pointers are used to implement hazard pointers and they
  prevent freeing states that are locklessly used by lookup threads;
o mutexes to protect modification of lists in hash tables now kept in
  separate arrays. 65535 limit to maximum number of hash buckets now
  removed.
o Separate lookup and install functions added for IPv4 and IPv6 states
  and for parent states.
o By default now is used Jenkinks hash function.

Obtained from:	Yandex LLC
MFC after:	42 days
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D12685
2018-02-07 18:59:54 +00:00
Kristof Provost
c201b5644d pf: Avoid warning without INVARIANTS
When INVARIANTS is not set the 'last' variable is not used, which can generate
compiler warnings.
If this invariant is ever violated it'd result in a KASSERT failure in
refcount_release(), so this one is not strictly required.
2018-02-01 07:52:06 +00:00
Andrey V. Elsukov
14a6bab1da When IPv6 packet is handled by O_REJECT opcode, convert ICMP code
specified in the arg1 into ICMPv6 destination unreachable code according
to RFC7915.

Obtained from:	Yandex LLC
MFC after:	2 weeks
Sponsored by:	Yandex LLC
2018-01-24 12:40:28 +00:00
Kristof Provost
6701c43213 pf: States have at least two references
pf_unlink_state() releases a reference to the state without checking if
this is the last reference. It can't be, because pf_state_insert()
initialises it to two. KASSERT() that this is always the case.

CID:	1347140
2018-01-24 04:29:16 +00:00
Pedro F. Giffuni
d821d36419 Unsign some values related to allocation.
When allocating memory through malloc(9), we always expect the amount of
memory requested to be unsigned as a negative value would either stand for
an error or an overflow.
Unsign some values, found when considering the use of mallocarray(9), to
avoid unnecessary casting. Also consider that indexes should be of
at least the same size/type as the upper limit they pretend to index.

MFC after:	3 weeks
2018-01-22 02:08:10 +00:00
Andrey V. Elsukov
d38344208e Add UDPLite support to ipfw(4).
Now it is possible to use UDPLite's port numbers in rules,
create dynamic states for UDPLite packets and see "UDPLite" for matched
packets in log.

Obtained from:	Yandex LLC
MFC after:	2 weeks
Sponsored by:	Yandex LLC
2018-01-19 12:50:03 +00:00
Jeff Roberson
3f289c3fcf Implement 'domainset', a cpuset based NUMA policy mechanism. This allows
userspace to control NUMA policy administratively and programmatically.

Implement domainset based iterators in the page layer.

Remove the now legacy numa_* syscalls.

Cleanup some header polution created by having seq.h in proc.h.

Reviewed by:	markj, kib
Discussed with:	alc
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13403
2018-01-12 22:48:23 +00:00
Pedro F. Giffuni
454529cd0b netpfil/ipfw: Make some use of mallocarray(9).
Reviewed by:	kp, ae
Differential Revision: https://reviews.freebsd.org/D13834
2018-01-11 15:29:29 +00:00
Kristof Provost
6273ba66f2 pf: Avoid integer overflow issues by using mallocarray() iso. malloc()
pfioctl() handles several ioctl that takes variable length input, these
include:
- DIOCRADDTABLES
- DIOCRDELTABLES
- DIOCRGETTABLES
- DIOCRGETTSTATS
- DIOCRCLRTSTATS
- DIOCRSETTFLAGS

All of them take a pfioc_table struct as input from userland. One of
its elements (pfrio_size) is used in a buffer length calculation.
The calculation contains an integer overflow which if triggered can lead
to out of bound reads and writes later on.

Reported by:	Ilja Van Sprundel <ivansprundel@ioactive.com>
2018-01-07 13:35:15 +00:00
Kristof Provost
9d671fee3a pf: Allow the module to be unloaded
pf can now be safely unloaded. Most of this code is exercised on vnet
jail shutdown.

Don't block unloading.
2017-12-31 16:18:13 +00:00
Kristof Provost
5d0020d6d7 pf: Clean all fragments on shutdown
When pf is unloaded, or a vnet jail using pf is stopped we need to
ensure we clean up all fragments, not just the expired ones.
2017-12-31 10:01:31 +00:00
Pedro F. Giffuni
6e778a7efd SPDX: license IDs for some ISC-related files. 2017-12-08 15:57:29 +00:00
Pedro F. Giffuni
8820ecc040 SPDX: Fix some cases wrongly attributed to MIT.
In the cases of BSD-style license variants without clauses, use 0BSD for
the time being in lack of a better description.
2017-11-30 15:10:11 +00:00
Pedro F. Giffuni
fe267a5590 sys: general adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

No functional change intended.
2017-11-27 15:23:17 +00:00
Michael Tuexen
665c8a2ee5 Add to ipfw support for sending an SCTP packet containing an ABORT chunk.
This is similar to the TCP case. where a TCP RST segment can be sent.

There is one limitation: When sending an ABORT in response to an incoming
packet, it should be tested if there is no ABORT chunk in the received
packet. Currently, it is only checked if the first chunk is an ABORT
chunk to avoid parsing the whole packet, which could result in a DOS attack.

Thanks to Timo Voelker for helping me to test this patch.
Reviewed by: bcr@ (man page part), ae@ (generic, non-SCTP part)
Differential Revision:	https://reviews.freebsd.org/D13239
2017-11-26 18:19:01 +00:00
Andrey V. Elsukov
1719df1bb4 Modify ipfw's dynamic states KPI.
Hide the locking logic used in the dynamic states implementation from
generic code. Rename ipfw_install_state() and ipfw_lookup_dyn_rule()
function to have similar names: ipfw_dyn_install_state() and
ipfw_dyn_lookup_state(). Move dynamic rule counters updating to the
ipfw_dyn_lookup_state() function. Now this function return NULL when
there is no state and pointer to the parent rule when state is found.
Thus now there is no need to return pointer to dynamic rule, and no need
to hold bucket lock for this state. Remove ipfw_dyn_unlock() function.

Obtained from:	Yandex LLC
MFC after:	1 week
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D11657
2017-11-23 08:02:02 +00:00
Andrey V. Elsukov
9d15540022 Check that address family of state matches address family of packet.
If it is not matched avoid comparing other state fields.

Obtained from:	Yandex LLC
MFC after:	1 week
Sponsored by:	Yandex LLC
2017-11-23 07:05:25 +00:00
Andrey V. Elsukov
30df59d581 Move ipfw_send_pkt() from ip_fw_dynamic.c into ip_fw2.c.
It is not specific for dynamic states function and called also from
generic code.

Obtained from:	Yandex LLC
MFC after:	1 week
Sponsored by:	Yandex LLC
2017-11-23 06:04:57 +00:00
Andrey V. Elsukov
288bf455bb Rework rule ranges matching. Use comparison rule id with UINT32_MAX to
match all rules with the same rule number.

Obtained from:	Yandex LLC
MFC after:	1 week
Sponsored by:	Yandex LLC
2017-11-23 05:55:53 +00:00
Andrey V. Elsukov
7143bb7626 Add ipfw_add_protected_rule() function that creates rule with 65535
number in the reserved set 31. Use this function to create default rule.

Obtained from:	Yandex LLC
MFC after:	1 week
Sponsored by:	Yandex LLC
2017-11-22 05:49:21 +00:00
Pedro F. Giffuni
51369649b0 sys: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
2017-11-20 19:43:44 +00:00
Andrey V. Elsukov
66f84fabb3 Add comment for accidentally committed unrelated change in r325960.
Do not invoke IPv4 NAT handler for non IPv4 packets. Libalias expects
a packet is IPv4. And in case when it is IPv6, it just translates them
as IPv4. This leads to corruption and in some cases to panics.
In particular a panic can happen when value of ip6_plen modified to
something that leads to IP fragmentation, but actual packet length does
not match the IP length.

Packets that are not IPv4 will be dropped by NAT rule.

Reported by:	Viktor Dukhovni <freebsd at dukhovni dot org>
MFC after:	1 week
2017-11-17 23:25:06 +00:00
Andrey V. Elsukov
e11f0a0c4c Unconditionally enable support for O_IPSEC opcode.
IPsec support can be loaded as kernel module, thus do not depend from
kernel option IPSEC and always build O_IPSEC opcode implementation as
enabled.

Obtained from:	Yandex LLC
MFC after:	1 week
Sponsored by:	Yandex LLC
2017-11-17 22:40:02 +00:00
Don Lewis
4001fcbe0a Fix Dummynet AQM packet marking function ecn_mark() and fq_codel /
fq_pie schedulers packet classification functions in layer2 (bridge mode).

Dummynet AQM packet marking function ecn_mark() and fq_codel/fq_pie
schedulers packet classification functions (fq_codel_classify_flow()
and fq_pie_classify_flow()) assume mbuf is pointing at L3 (IP)
packet. However, this assumption is incorrect if ipfw/dummynet is
used to manage layer2 traffic (bridge mode) since mbuf will point
at L2 frame.  This patch solves this problem by identifying the
source of the frame/packet (L2 or L3) and adding ETHER_HDR_LEN
offset when converting an mbuf pointer to ip pointer if the traffic
is from layer2.  More specifically, in dummynet packet tagging
function, tag_mbuf(), iphdr_off is set to ETHER_HDR_LEN if the
traffic is from layer2 and set to zero otherwise. Whenever an access
to IP header is required, mtodo(m, dn_tag_get(m)->iphdr_off) is
used instead of mtod(m, struct ip *) to correctly convert mbuf
pointer to ip pointer in both L2 and L3 traffic.

Submitted by:	lstewart
MFC after:	2 weeks
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D12506
2017-10-26 10:11:35 +00:00
Andrey V. Elsukov
5c70ebfa57 Add IPv6 support for O_TCPDATALEN opcode.
PR:		222746
MFC after:	1 week
2017-10-24 08:39:05 +00:00
Andrey V. Elsukov
ff0a137952 Fix regression in handling O_FORWARD_IP opcode after r279948.
To properly handle 'fwd tablearg,port' opcode, copy sin_port value from
sockaddr_in structure stored in the opcode into corresponding hopstore
field.

PR:		222953
MFC after:	1 week
2017-10-13 11:11:53 +00:00
Michael Tuexen
945906384d Fix a bug which avoided that rules for matching port numbers for SCTP
packets where actually matched.
While there, make clean in the man-page that SCTP port numbers are
supported in rules.

MFC after:	1 month
2017-10-02 18:25:30 +00:00
Andrey V. Elsukov
5df8171da3 Use in_localip() function instead of unlocked access to addresses hash
to determine that an address is our local.

PR:		220078
MFC after:	1 week
2017-09-20 22:35:28 +00:00
Andrey V. Elsukov
369bc48dc5 Do not acquire IPFW_WLOCK when a named object is created and destroyed.
Acquiring of IPFW_WLOCK is requried for cases when we are going to
change some data that can be accessed during processing of packets flow.
When we create new named object, there are not yet any rules, that
references it, thus holding IPFW_UH_WLOCK is enough to safely update
needed structures. When we destroy an object, we do this only when its
reference counter becomes zero. And it is safe to not acquire IPFW_WLOCK,
because noone references it. The another case is when we failed to finish
some action and thus we are doing rollback and destroying an object, in
this case it is still not referenced by rules and no need to acquire
IPFW_WLOCK.

This also fixes panic with INVARIANTS due to recursive IPFW_WLOCK acquiring.

MFC after:	1 week
Sponsored by:	Yandex LLC
2017-09-20 22:00:06 +00:00
Kristof Provost
7f3ad01804 pf_get_sport(): Prevent possible endless loop when searching for an unused nat port
This is an import of Alexander Bluhm's OpenBSD commit r1.60,
the first chunk had to be modified because on OpenBSD the
'cut' declaration is located elsewhere.

Upstream report by Jingmin Zhou:
https://marc.info/?l=openbsd-pf&m=150020133510896&w=2

OpenBSD commit message:
 Use a 32 bit variable to detect integer overflow when searching for
 an unused nat port.  Prevents a possible endless loop if high port
 is 65535 or low port is 0.
 report and analysis Jingmin Zhou; OK sashan@ visa@
Quoted from: https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/net/pf_lb.c

PR:		221201
Submitted by:	Fabian Keil <fk@fabiankeil.de>
Obtained from:  OpenBSD via ElectroBSD
MFC after:	1 week
2017-08-08 21:09:26 +00:00
Luiz Otavio O Souza
9ffd0f54a7 Fix a couple of typos in a comment.
MFC after:	1 week
Sponsored by:	Rubicon Communications, LLC (Netgate)
2017-07-21 03:04:55 +00:00
Philip Paeps
b0e1660d53 Fix GRE over IPv6 tunnels with IPFW
Previously, GRE packets in IPv6 tunnels would be dropped by IPFW (unless
net.inet6.ip6.fw.deny_unknown_exthdrs was unset).

PR:		220640
Submitted by:	Kun Xie <kxie@xiplink.com>
MFC after:	1 week
2017-07-13 09:01:22 +00:00
Kristof Provost
b7ae43552b pf: Fix vnet purging
pf_purge_thread() breaks up the work of iterating all states (in
pf_purge_expired_states()) and tracks progress in the idx variable.

If multiple vnets exist this results in pf_purge_thread() only calling
pf_purge_expired_states() for part of the states (the first part of the
first vnet, second part of the second vnet and so on).
Combined with the mark-and-sweep approach to cleaning up old rules (in
V_pf_unlinked_rules) that resulted in pf freeing rules that were still
referenced by states. This in turn caused panics when pf_state_expires()
encounters that state and attempts to access the rule.

We need to track the progress per vnet, not globally, so idx is moved
into a per-vnet V_pf_purge_idx.

PR:		219251
Sponsored by:	Hackathon Essen 2017
2017-07-09 17:56:39 +00:00
Andrey V. Elsukov
785c0d4d97 Fix IPv6 extension header parsing. The length field doesn't include the
first 8 octets.

Obtained from:	Yandex LLC
MFC after:	3 days
2017-06-29 19:06:43 +00:00
Don Lewis
d196c9ee16 Fix the queue delay estimation in PIE/FQ-PIE when the timestamp
(TS) method is used.  When packet timestamp is used, the "current_qdelay"
keeps storing the last queue delay value calculated in the dequeue
function.  Therefore, when a burst of packets arrives followed by
a pause, the "current_qdelay" will store a high value caused by the
burst and stick to that value during the pause because the queue
delay measurement is done inside the dequeue function.  This causes
the drop probability calculation function to calculate high drop
probability value instead of zero and prevents the burst allowance
mechanism from working properly.  Fix this problem by resetting
"current_qdelay" inside the drop probability calculation function
when the queue length is zero and TS option is used.

Submitted by:	Rasool Al-Saadi <ralsaadi@swin.edu.au>
MFC after:	1 week
2017-05-19 08:38:03 +00:00
Don Lewis
36fb8be630 The result of right shifting a negative signed value is implementation
defined.  On machines without arithmetic shift instructions, zero bits
may be shifted in from the left, giving a large positive result instead
of the desired divide-by power-of-2.  Fix this by operating on the
absolute value and compensating for the possible negation later.

Reverse the order of the underflow/overflow tests and the exponential
decay calculation to avoid the possibility of an erroneous overflow
detection if p is a sufficiently small non-negative value.  Also
check for negative values of prob before doing the exponential decay
to avoid another instance of of right shifting a negative value.

Tested by:	Rasool Al-Saadi <ralsaadi@swin.edu.au>
MFC after:	1 week
2017-05-19 01:23:06 +00:00
Kristof Provost
468cefa22e pf: Fix vnet initialisation
When running the vnet init code (pf_load_vnet()) we used to iterate over
all vnets, marking them as unhooked.
This is incorrect and leads to panics if pf is unloaded, as the unload
code does not unregister the pfil hooks (because the vnet is marked as
unhooked).

There's no need or reason to touch other vnets during initialisation.
Their pf_load_vnet() function will be triggered, which handles all
required initialisation.

Reviewed by:	zec, gnn
Differential Revision:	https://reviews.freebsd.org/D10592
2017-05-07 14:33:58 +00:00
Kristof Provost
64c79ee733 pf: Fix panic on unload
vnet_pf_uninit() is called through vnet_deregister_sysuninit() and
linker_file_unload() when the pf module is unloaded. This is executed
after pf_unload() so we end up trying to take locks which have been
destroyed already.

Move pf_unload() to a separate SYSUNINIT() to ensure it's called after
all the vnet_pf_uninit() calls.

Differential Revision:	https://reviews.freebsd.org/D10025
2017-05-03 20:56:54 +00:00
Marko Zec
1e9e374199 Fix VNET leakages in PF by V_irtualizing pfr_ktables and friends.
Apparently this resolves a PF-triggered panic when destroying VNET jails.

Submitted by:	Peter Blok <peter.blok@bsd4all.org>
Reviewed by:	kp
2017-04-25 08:34:39 +00:00
Marko Zec
3a36ee404f Since curvnet is already properly set on entry to event handlers,
there's no need to override it, particularly not unconditionally with
vnet0.

Submitted by:	Peter Blok <peter.blok@bsd4all.org>
Reviewed by:	kp
2017-04-25 08:30:28 +00:00
Kristof Provost
00eab743ab pf: Fix possible incorrect IPv6 fragmentation
When forwarding pf tracks the size of the largest fragment in a fragmented
packet, and refragments based on this size.
It failed to ensure that this size was a multiple of 8 (as is required for all
but the last fragment), so it could end up generating incorrect fragments.

For example, if we received an 8 byte and 12 byte fragment pf would emit a first
fragment with 12 bytes of payload and the final fragment would claim to be at
offset 8 (not 12).

We now assert that the fragment size is a multiple of 8 in ip6_fragment(), so
other users won't make the same mistake.

Reported by:	Antonios Atlasis <aatlasis at secfu net>
MFC after:	3 days
2017-04-20 09:05:53 +00:00
Kristof Provost
4e261006a1 pf: Also clear limit counters
The "pfctl -F info" command didn't clear the limit counters ( as shown in the
"pfctl -vsi" output).

Submitted by:	Max <maximos@als.nnov.ru>
2017-04-18 20:07:21 +00:00
Andrey V. Elsukov
da62ffd9cd Avoid undefined behavior.
The 'pktid' variable is modified while being used twice between
sequence points, probably due to htonl() is macro.

Reported by:	PVS-Studio
MFC after:	1 week
2017-04-14 11:58:41 +00:00
Andrey V. Elsukov
ba3e1361b0 Use address of specific union member instead of whole union address to
fix PVS-Studio warnings.

MFC after:	1 week
2017-04-14 11:41:09 +00:00
Andrey V. Elsukov
1ca7c3b815 The rule field in the ipfw_dyn_rule structure is used as storage
to pass rule number and rule set to userland. In r272840 the kernel
internal rule representation was changed and the rulenum field of
struct ip_fw_rule got the type uint32_t, but userlevel representation
still have the type uint16_t. To not overflow the size of pointer
on the systems with 32-bit pointer size use separate variable to
copy rulenum and set.

Reported by:	PVS-Studio
MFC after:	1 week
2017-04-14 11:19:09 +00:00
Gleb Smirnoff
9f5efe718f Fix potential NULL deref.
Found by:	PVS Studio
2017-04-14 01:56:15 +00:00
Maxim Konovalov
f91eb6adad o Redundant assignments removed.
Found by:	PVS-Stdio, V519
Reviewed by:	ae
2017-04-13 18:13:10 +00:00
Conrad Meyer
bcd8d3b805 dummynet: Use strlcpy to appease static checkers
Some dummynet modules used strcpy() to copy from a larger buffer
(dn_aqm->name) to a smaller buffer (dn_extra_parms->name).  It happens that
the lengths of the strings in the dn_aqm buffers were always hardcoded to be
smaller than the dn_extra_parms buffer ("CODEL", "PIE").

Use strlcpy() instead, to appease static checkers.  No functional change.

Reported by:	Coverity
CIDs:		1356163, 1356165
Sponsored by:	Dell EMC Isilon
2017-04-13 17:47:44 +00:00
Andrey V. Elsukov
88d950a650 Remove "IPFW static rules" rmlock.
Make PFIL's lock global and use it for this purpose.
This reduces the number of locks needed to acquire for each packet.

Obtained from:	Yandex LLC
MFC after:	2 weeks
Sponsored by:	Yandex LLC
No objection from: #network
Differential Revision:	https://reviews.freebsd.org/D10154
2017-04-03 13:35:04 +00:00
Andrey V. Elsukov
aac74aeac7 Add ipfw_pmod kernel module.
The module is designed for modification of a packets of any protocols.
For now it implements only TCP MSS modification. It adds the external
action handler for "tcp-setmss" action.

A rule with tcp-setmss action does additional check for protocol and
TCP flags. If SYN flag is present, it parses TCP options and modifies
MSS option if its value is greater than configured value in the rule.
Then it adjustes TCP checksum if needed. After handling the search
continues with the next rule.

Obtained from:	Yandex LLC
MFC after:	2 weeks
Relnotes:	yes
Sponsored by:	Yandex LLC
No objection from: #network
Differential Revision:	https://reviews.freebsd.org/D10150
2017-04-03 03:07:48 +00:00
Andrey V. Elsukov
11c56650f0 Add O_EXTERNAL_DATA opcode support.
This opcode can be used to attach some data to external action opcode.
And unlike to O_EXTERNAL_INSTANCE opcode, this opcode does not require
creating of named instance to pass configuration arguments to external
action handler. The data is coming just next to O_EXTERNAL_ACTION opcode.

The userlevel part currenly supports formatting for opcode with ipfw_insn
size, by default it expects u16 numeric value in the arg1.

Obtained from:	Yandex LLC
MFC after:	2 weeks
Sponsored by:	Yandex LLC
2017-04-03 02:44:40 +00:00
Andrey V. Elsukov
399ad57874 Add the log formatting for an external action opcode.
Obtained from:	Yandex LLC
MFC after:	2 weeks
Sponsored by:	Yandex LLC
2017-04-03 02:26:30 +00:00
Kristof Provost
3601d25181 pf: Fix leak of pf_state_keys
If we hit the state limit we returned from pf_create_state() without cleaning
up.

PR:		217997
Submitted by:	Max <maximos@als.nnov.ru>
MFC after:	1 week
2017-04-01 12:22:34 +00:00
Andrey V. Elsukov
788e62864f Reset the cached state of last lookup in the dynamic states when an
external action is completed, but the rule search is continued.

External action handler can change the content of @args argument,
that is used for dynamic state lookup. Enforce the new lookup to be able
install new state, when the search is continued.

Obtained from:	Yandex LLC
MFC after:	1 week
Sponsored by:	Yandex LLC
2017-03-31 09:26:08 +00:00
Kristof Provost
2f8fb3a868 pf: Fix possible shutdown race
Prevent possible races in the pf_unload() / pf_purge_thread() shutdown
code. Lock the pf_purge_thread() with the new pf_end_lock to prevent
these races.

Use a shared/exclusive lock, as we need to also acquire another sx lock
(VNET_LIST_RLOCK). It's fine for both pf_purge_thread() and pf_unload()
to sleep,

Pointed out by: eri, glebius, jhb
Differential Revision:	https://reviews.freebsd.org/D10026
2017-03-22 21:18:18 +00:00
Kristof Provost
08ef4ddb0f pf: Fix rule evaluation after inet6 route-to
In pf_route6() we re-run the ruleset with PF_FWD if the packet goes out
of a different interface. pf_test6() needs to know that the packet was
forwarded (in case it needs to refragment so it knows whether to call
ip6_output() or ip6_forward()).

This lead pf_test6() to try to evaluate rules against the PF_FWD
direction, which isn't supported, so it needs to treat PF_FWD as PF_OUT.
Once fwdir is set correctly the correct output/forward function will be
called.

PR:		217883
Submitted by:	Kajetan Staszkiewicz
MFC after:	1 week
Sponsored by:	InnoGames GmbH
2017-03-19 03:06:09 +00:00
Don Lewis
46c8aadb6f Change several constants used by the PIE algorithm from unsigned to signed.
- PIE_MAX_PROB is compared to variable of int64_t and the type promotion
   rules can cause the value of that variable to be treated as unsigned.
   If the value is actually negative, then the result of the comparsion
   is incorrect, causing the algorithm to perform poorly in some
   situations.  Changing the constant to be signed cause the comparision
   to work correctly.

 - PIE_SCALE is also compared to signed values.  Fortunately they are
   also compared to zero and negative values are discarded so this is
   more of a cosmetic fix.

 - PIE_DQ_THRESHOLD is only compared to unsigned values, but it is small
   enough that the automatic promotion to unsigned is harmless.

Submitted by:	Rasool Al-Saadi <ralsaadi@swin.edu.au>
MFC after:	1 week
2017-03-18 23:00:13 +00:00
Kristof Provost
5c172e7059 pf: Fix memory leak on vnet shutdown or unload
Rules are unlinked in shutdown_pf(), so we must call
pf_unload_vnet_purge(), which frees unlinked rules, after that, not
before.

Reviewed by:	eri, bz
Differential Revision:	https://reviews.freebsd.org/D10040
2017-03-18 01:37:20 +00:00
Andrey V. Elsukov
3667f39ea3 Use memset with structure size. 2017-03-14 07:57:33 +00:00
Conrad Meyer
49b6a5d60a nat64lsn: Use memset() with structure, not pointer, size
PR:		217738
Submitted by:	Svyatoslav <razmyslov at viva64.com>
Sponsored by:	Viva64 (PVS-Studio)
2017-03-13 17:53:46 +00:00
Kristof Provost
2a57d24bd1 pf: Fix incorrect rw_sleep() in pf_unload()
When we unload we don't hold the pf_rules_lock, so we cannot call rw_sleep()
with it, because it would release a lock we do not hold. There's no need for the
lock either, so we can just tsleep().

While here also make the same change in pf_purge_thread(), because it explicitly
takes the lock before rw_sleep() and then immediately releases it afterwards.
2017-03-12 05:42:57 +00:00
Kristof Provost
f618201314 pf: Do not lose the VNET lock when ending the purge thread
When the pf_purge_thread() exits it must make sure to release the
VNET_LIST_RLOCK it still holds.
kproc_exit() does not return.
2017-03-12 05:00:04 +00:00
Maxim Konovalov
f621c2cd39 o Typo in the comment fixed.
PR:		217617
Submitted by:	lutz
2017-03-09 09:54:23 +00:00
Kristof Provost
98a9874f7b pf: Fix a crash in low-memory situations
If the call to pf_state_key_clone() in pf_get_translation() fails (i.e. there's
no more memory for it) it frees skp. This is wrong, because skp is a
pf_state_key **, so we need to free *skp, as is done later in the function.
Getting it wrong means we try to free a stack variable of the calling
pf_test_rule() function, and we panic.
2017-03-06 23:41:23 +00:00
Andrey V. Elsukov
53de37f8ca Fix the build. Use new ipfw_lookup_table() in the nat64 too.
Reported by:	cy
MFC after:	2 weeks
2017-03-06 00:41:59 +00:00
Andrey V. Elsukov
54e5669d8c Add IPv6 support to O_IP_DST_LOOKUP opcode.
o check the size of O_IP_SRC_LOOKUP opcode, it can not exceed the size of
  ipfw_insn_u32;
o rename ipfw_lookup_table_extended() function into ipfw_lookup_table() and
  remove old ipfw_lookup_table();
o use args->f_id.flow_id6 that is in host byte order to get DSCP value;
o add SCTP ports support to 'lookup src/dst-port' opcode;
o add IPv6 support to 'lookup src/dst-ip' opcode.

PR:		217292
Reviewed by:	melifaro
MFC after:	2 weeks
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D9873
2017-03-05 23:48:24 +00:00
Andrey V. Elsukov
c750a56914 Reject invalid object types that can not be used with specific opcodes.
When we doing reference counting of named objects in the new rule,
for existing objects check that opcode references to correct object,
otherwise return EINVAL.

PR:		217391
MFC after:	1 week
Sponsored by:	Yandex LLC
2017-03-05 22:19:43 +00:00
Andrey V. Elsukov
43b294a4db Fix matching table entry value. Use real table value instead of its index
in valuestate array.

When opcode has size equal to ipfw_insn_u32, this means that it should
additionally match value specified in d[0] with table entry value.
ipfw_table_lookup() returns table value index, use TARG_VAL() macro to
convert it to its value. The actual 32-bit value stored in the tag field
of table_value structure, where all unspecified u32 values are kept.

PR:		217262
Reviewed by:	melifaro
MFC after:	1 week
Sponsored by:	Yandex LLC
2017-03-03 20:22:42 +00:00
Andrey V. Elsukov
576429f04b Fix NPTv6 rule counters when one_pass is not enabled.
Consider the rule matching when both @done and @retval values
returned from ipfw_run_eaction() are zero. And modify ipfw_nptv6()
to return IP_FW_DENY and @done=0 when addresses do not match.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2017-03-01 20:00:19 +00:00
Pedro F. Giffuni
e099b90b80 sys: Replace zero with NULL for pointers.
Found with:	devel/coccinelle
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D9694
2017-02-22 02:35:59 +00:00
Eric van Gyzen
8144690af4 Use inet_ntoa_r() instead of inet_ntoa() throughout the kernel
inet_ntoa() cannot be used safely in a multithreaded environment
because it uses a static local buffer. Instead, use inet_ntoa_r()
with a buffer on the caller's stack.

Suggested by:	glebius, emaste
Reviewed by:	gnn
MFC after:	2 weeks
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D9625
2017-02-16 20:47:41 +00:00
Eric van Gyzen
643faabe0d pf: use inet_ntoa_r() instead of inet_ntoa(); maybe fix IPv6 OS fingerprinting
inet_ntoa() cannot be used safely in a multithreaded environment
because it uses a static local buffer. Instead, use inet_ntoa_r()
with a buffer on the caller's stack.

This code had an INET6 conditional before this commit, but opt_inet6.h
was not included, so INET6 was never defined.  Apparently, pf's OS
fingerprinting hasn't worked with IPv6 for quite some time.
This commit might fix it, but I didn't test that.

Reviewed by:	gnn, kp
MFC after:	2 weeks
Relnotes:	yes (if I/someone can test pf OS fingerprinting with IPv6)
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D9625
2017-02-16 20:44:44 +00:00
Enji Cooper
bc64f428ad Fix typos in comments (returing -> returning)
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
2017-02-07 00:09:48 +00:00
Gleb Smirnoff
164aa3ce5e Fix indentantion in pf_purge_thread(). No functional change. 2017-01-30 22:47:48 +00:00
Luiz Otavio O Souza
a5c1a50a26 Do not run the pf purge thread while the VNET variables are not
initialized, this can cause a divide by zero (if the VNET initialization
takes to long to complete).

Obtained from:	pfSense
MFC after:	2 weeks
Sponsored by:	Rubicon Communications, LLC (Netgate)
2017-01-29 02:17:52 +00:00
Andrey V. Elsukov
ce3a6cf06a Initialize IPFW static rules rmlock with RM_RECURSE flag.
This lock was replaced from rwlock in r272840. But unlike rwlock, rmlock
doesn't allow recursion on rm_rlock(), so at this time fix this with
RM_RECURSE flag. Later we need to change ipfw to avoid such recursions.

PR:		216171
Reported by:	Eugene Grosbein
MFC after:	1 week
2017-01-17 10:50:28 +00:00
Marius Strobl
0ac43d9728 In dummynet(4), random chunks of memory are casted to struct dn_*,
potentially leading to fatal unaligned accesses on architectures with
strict alignment requirements. This change fixes dummynet(4) as far
as accesses to 64-bit members of struct dn_* are concerned, tripping
up on sparc64 with accesses to 32-bit members happening to be correctly
aligned there. In other words, this only fixes the tip of the iceberg;
larger parts of dummynet(4) still need to be rewritten in order to
properly work on all of !x86.
In principle, considering the amount of code in dummynet(4) that needs
this erroneous pattern corrected, an acceptable workaround would be to
declare all struct dn_* packed, forcing compilers to do byte-accesses
as a side-effect. However, given that the structs in question aren't
laid out well either, this would break ABI/KBI.
While at it, replace all existing bcopy(9) calls with memcpy(9) for
performance reasons, as there is no need to check for overlap in these
cases.

PR:		189219
MFC after:	5 days
2017-01-09 20:51:51 +00:00
Marcel Moolenaar
aa8c6a6dca Improve upon r309394
Instead of taking an extra reference to deal with pfsync_q_ins()
and pfsync_q_del() taken and dropping a reference (resp,) make
it optional of those functions to take or drop a reference by
passing an extra argument.

Submitted by:	glebius@
2016-12-10 03:31:38 +00:00
Gleb Smirnoff
296d65b7a9 Backout accidentially leaked in r309746 not yet reviewed patch :( 2016-12-09 18:00:45 +00:00
Gleb Smirnoff
3cbee8caa1 Use counter_ratecheck() in the ICMP rate limiting.
Together with:	rrs, jtl
2016-12-09 17:59:15 +00:00
Andrey V. Elsukov
02784f106e Convert result of hash_packet6() into host byte order.
For IPv4 similar function uses addresses and ports in host byte order,
but for IPv6 it used network byte order. This led to very bad hash
distribution for IPv6 flows. Now the result looks similar to IPv4.

Reported by:	olivier
MFC after:	1 week
Sponsored by:	Yandex LLC
2016-12-06 23:52:56 +00:00
Kristof Provost
c3e14afc18 pflog: Correctly initialise subrulenr
subrulenr is considered unset if it's set to -1, not if it's set to 1.
See contrib/tcpdump/print-pflog.c pflog_print() for a user.

This caused incorrect pflog output (tcpdump -n -e -ttt -i pflog0):
  rule 0..16777216(match)
instead of the correct output of
  rule 0/0(match)

PR:		214832
Submitted by:	andywhite@gmail.com
2016-12-05 21:52:10 +00:00
Marcel Moolenaar
d6d35f1561 Fix use-after-free bugs in pfsync(4)
Use after free happens for state that is deleted. The reference
count is what prevents the state from being freed. When the
state is dequeued, the reference count is dropped and the memory
freed. We can't dereference the next pointer or re-queue the
state.

MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D8671
2016-12-02 06:15:59 +00:00
Andrey V. Elsukov
c5f2dbb625 Fix ICMPv6 Time Exceeded error message translation.
Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2016-11-26 10:04:05 +00:00
Luiz Otavio O Souza
e40145851b Remove the mbuf tag after use (for reinjected packets).
Fixes the packet processing in dummynet l2 rules.

Obtained from:	pfSense
MFC after:	2 weeks
Sponsored by:	Rubicon Communications, LLC (Netgate)
2016-11-03 00:26:58 +00:00
Luiz Otavio O Souza
3e80a649fb Stop abusing from struct ifnet presence to determine the packet direction
for dummynet, use the correct argument for that, remove the false coment
about the presence of struct ifnet.

Fixes the input match of dummynet l2 rules.

Obtained from:	pfSense
MFC after:	2 weeks
Sponsored by:	Rubicon Communications, LLC (Netgate)
2016-11-01 18:42:44 +00:00
Andrey V. Elsukov
308f2c6d56 Fix ipfw table lookup handler to return entry value, but not its index.
Submitted by:	loos
MFC after:	1 week
2016-10-19 11:51:17 +00:00
Kristof Provost
1f4955785d pf: port extended DSCP support from OpenBSD
Ignore the ECN bits on 'tos' and 'set-tos' and allow to use
DCSP names instead of having to embed their TOS equivalents
as plain numbers.

Obtained from:	OpenBSD
Sponsored by:	OPNsense
Differential Revision:	https://reviews.freebsd.org/D8165
2016-10-13 20:34:44 +00:00
Kristof Provost
813196a11a pf: remove fastroute tag
The tag fastroute came from ipf and was removed in OpenBSD in 2011. The code
allows to skip the in pfil hooks and completely removes the out pfil invoke,
albeit looking up a route that the IP stack will likely find on its own.
The code between IPv4 and IPv6 is also inconsistent and marked as "XXX"
for years.

Submitted by:	Franco Fichtner <franco@opnsense.org>
Differential Revision:	https://reviews.freebsd.org/D8058
2016-10-04 19:35:14 +00:00
Kevin Lo
c7641cd18d Remove ifa_list, use ifa_link (structure field) instead.
While here, prefer if_addrhead (FreeBSD) to if_addrlist (BSD compat) naming
for the interface address list in sctp_bsd_addr.c

Reviewed by:	tuexen
Differential Revision:	https://reviews.freebsd.org/D8051
2016-09-28 13:29:11 +00:00
Andrey V. Elsukov
0d9cbb874c Move opcode rewriter init and destroy handlers into non-VENT code.
PR:		212576,212649,212077
Submitted by:	John Zielinski
MFC after:	1 week
2016-09-18 17:35:17 +00:00
Andrey V. Elsukov
70c1466dad Fix swap tables between sets when this functional is enabled.
We have 6 opcode rewriters for table opcodes. When `set swap' command
invoked, it is called for each rewriter, so at the end we get the same
result, because opcode rewriter uses ETLV type to match opcode. And all
tables opcodes have the same ETLV type. To solve this problem, use
separate sets handler for one opcode rewriter. Use it to handle TEST_ALL,
SWAP_ALL and MOVE_ALL commands.

PR:		212630
MFC after:	1 week
2016-09-13 18:16:15 +00:00
Bjoern A. Zeeb
db68f7839f Try to fix gcc compilation errors (which are right).
nat64_getlasthdr() returns an int, which can be -1 in case of error,
storing the result in an uint8_t and then comparing to < 0 is not
helpful.  Do what is done in the rest of the code and make proto an
int here as well.
2016-08-18 10:26:15 +00:00
Oleg Bulyzhin
e7560c836f Fix command: ipfw set (enable|disable) N (where N > 4).
enable_sets() expects set bitmasks, not set numbers.

MFC after:	3 days
2016-08-15 13:06:29 +00:00
Kristof Provost
0df377cbb8 pf: Add missing byte-order swap to pf_match_addr_range
Without this, rules using address ranges (e.g. "10.1.1.1 - 10.1.1.5") did not
match addresses correctly on little-endian systems.

PR:		211796
Obtained from:	OpenBSD (sthen)
MFC after:	3 days
2016-08-15 12:13:14 +00:00
Andrey V. Elsukov
ecd3637584 Use %ju to print unsigned 64-bit value.
Reported by:	kib
2016-08-13 22:14:16 +00:00
Andrey V. Elsukov
57fb3b7a78 Add stats reset command implementation to NPTv6 module
to be able reset statistics counters.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2016-08-13 16:45:14 +00:00
Andrey V. Elsukov
c402a01b03 Replace __noinline with special debug macro NAT64NOINLINE. 2016-08-13 16:26:15 +00:00
Andrey V. Elsukov
d8caf56e9e Add ipfw_nat64 module that implements stateless and stateful NAT64.
The module works together with ipfw(4) and implemented as its external
action module.

Stateless NAT64 registers external action with name nat64stl. This
keyword should be used to create NAT64 instance and to address this
instance in rules. Stateless NAT64 uses two lookup tables with mapped
IPv4->IPv6 and IPv6->IPv4 addresses to perform translation.

A configuration of instance should looks like this:
 1. Create lookup tables:
 # ipfw table T46 create type addr valtype ipv6
 # ipfw table T64 create type addr valtype ipv4
 2. Fill T46 and T64 tables.
 3. Add rule to allow neighbor solicitation and advertisement:
 # ipfw add allow icmp6 from any to any icmp6types 135,136
 4. Create NAT64 instance:
 # ipfw nat64stl NAT create table4 T46 table6 T64
 5. Add rules that matches the traffic:
 # ipfw add nat64stl NAT ip from any to table(T46)
 # ipfw add nat64stl NAT ip from table(T64) to 64:ff9b::/96
 6. Configure DNS64 for IPv6 clients and add route to 64:ff9b::/96
    via NAT64 host.

Stateful NAT64 registers external action with name nat64lsn. The only
one option required to create nat64lsn instance - prefix4. It defines
the pool of IPv4 addresses used for translation.

A configuration of instance should looks like this:
 1. Add rule to allow neighbor solicitation and advertisement:
 # ipfw add allow icmp6 from any to any icmp6types 135,136
 2. Create NAT64 instance:
 # ipfw nat64lsn NAT create prefix4 A.B.C.D/28
 3. Add rules that matches the traffic:
 # ipfw add nat64lsn NAT ip from any to A.B.C.D/28
 # ipfw add nat64lsn NAT ip6 from any to 64:ff9b::/96
 4. Configure DNS64 for IPv6 clients and add route to 64:ff9b::/96
    via NAT64 host.

Obtained from:	Yandex LLC
Relnotes:	yes
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D6434
2016-08-13 16:09:49 +00:00
Andrey V. Elsukov
6951cecf71 Add three helper function to manage tables from external modules.
ipfw_objhash_lookup_table_kidx does lookup kernel index of table;
ipfw_ref_table/ipfw_unref_table takes and releases reference to table.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2016-08-13 15:48:56 +00:00
Andrey V. Elsukov
56132dcc0d Move logging via BPF support into separate file.
* make interface cloner VNET-aware;
* simplify cloner code and use if_clone_simple();
* migrate LOGIF_LOCK() to rmlock;
* add ipfw_bpf_mtap2() function to pass mbuf to BPF;
* introduce new additional ipfwlog0 pseudo interface. It differs from
  ipfw0 by DLT type used in bpfattach. This interface is intended to
  used by ipfw modules to dump packets with additional info attached.
  Currently pflog format is used. ipfw_bpf_mtap2() function uses second
  argument to determine which interface use for dumping. If dlen is equal
  to ETHER_HDR_LEN it uses old ipfw0 interface, if dlen is equal to
  PFLOG_HDRLEN - ipfwlog0 will be used.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2016-08-13 15:41:04 +00:00
Andrey V. Elsukov
d6eb9b0249 Restore "nat global" support.
Now zero value of arg1 used to specify "tablearg", use the old "tablearg"
value for "nat global". Introduce new macro IP_FW_NAT44_GLOBAL to replace
hardcoded magic number to specify "nat global". Also replace 65535 magic
number with corresponding macro. Fix typo in comments.

PR:		211256
Tested by:	Victor Chernov
MFC after:	3 days
2016-08-11 10:10:10 +00:00
Konstantin Belousov
584b675ed6 Hide the boottime and bootimebin globals, provide the getboottime(9)
and getboottimebin(9) KPI. Change consumers of boottime to use the
KPI.  The variables were renamed to avoid shadowing issues with local
variables of the same name.

Issue is that boottime* should be adjusted from tc_windup(), which
requires them to be members of the timehands structure.  As a
preparation, this commit only introduces the interface.

Some uses of boottime were found doubtful, e.g. NLM uses boottime to
identify the system boot instance.  Arguably the identity should not
change on the leap second adjustment, but the commit is about the
timekeeping code and the consumers were kept bug-to-bug compatible.

Tested by:	pho (as part of the bigger patch)
Reviewed by:	jhb (same)
Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
X-Differential revision:	https://reviews.freebsd.org/D7302
2016-07-27 11:08:59 +00:00
Andrey V. Elsukov
ed22e564b8 Add named dynamic states support to ipfw(4).
The keep-state, limit and check-state now will have additional argument
flowname. This flowname will be assigned to dynamic rule by keep-state
or limit opcode. And then can be matched by check-state opcode or
O_PROBE_STATE internal opcode. To reduce possible breakage and to maximize
compatibility with old rulesets default flowname introduced.
It will be assigned to the rules when user has omitted state name in
keep-state and check-state opcodes. Also if name is ambiguous (can be
evaluated as rule opcode) it will be replaced to default.

Reviewed by:	julian
Obtained from:	Yandex LLC
MFC after:	1 month
Relnotes:	yes
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D6674
2016-07-19 04:56:59 +00:00
Andrey V. Elsukov
b867e84e95 Add ipfw_nptv6 module that implements Network Prefix Translation for IPv6
as defined in RFC 6296. The module works together with ipfw(4) and
implemented as its external action module. When it is loaded, it registers
as eaction and can be used in rules. The usage pattern is similar to
ipfw_nat(4). All matched by rule traffic goes to the NPT module.

Reviewed by:	hrs
Obtained from:	Yandex LLC
MFC after:	1 month
Relnotes:	yes
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D6420
2016-07-18 19:46:31 +00:00
Don Lewis
98e82c02e5 Fix problems in the FQ-PIE AQM cleanup code that could leak memory or
cause a crash.

Because dummynet calls pie_cleanup() while holding a mutex, pie_cleanup()
is not able to use callout_drain() to make sure that all callouts are
finished before it returns, and callout_stop() is not sufficient to make
that guarantee.  After pie_cleanup() returns, dummynet will free a
structure that any remaining callouts will want to access.

Fix these problems by allocating a separate structure to contain the
data used by the callouts.  In pie_cleanup(), call callout_reset_sbt()
to replace the normal callout with a cleanup callout that does the cleanup
work for each sub-queue.  The instance of the cleanup callout that
destroys the last flow will also free the extra allocated block of memory.
Protect the reference count manipulation in the cleanup callout with
DN_BH_WLOCK() to be consistent with all of the other usage of the reference
count where this lock is held by the dummynet code.

Submitted by:	Rasool Al-Saadi <ralsaadi@swin.edu.au>
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D7174
2016-07-12 17:32:40 +00:00
Kristof Provost
aa7cac58c6 pf: Map hook returns onto the correct error values
pf returns PF_PASS, PF_DROP, ... in the netpfil hooks, but the hook callers
expect to get E<foo> error codes.
Map the returns values. A pass is 0 (everything is OK), anything else means
pf ate the packet, so return EACCES, which tells the stack not to emit an ICMP
error message.

PR:	207598
2016-07-09 12:17:01 +00:00
Don Lewis
12be18c7d5 Fix a race condition between the main thread in aqm_pie_cleanup() and the
callout thread that can cause a kernel panic.  Always do the final cleanup
in the callout thread by passing a separate callout function for that task
to callout_reset_sbt().

Protect the ref_count decrement in the callout with DN_BH_WLOCK().  All
other ref_count manipulation is protected with this lock.

There is still a tiny window between ref_count reaching zero and the end
of the callout function where it is unsafe to unload the module.  Fixing
this would require the use of callout_drain(), but this can't be done
because dummynet holds a mutex and callout_drain() might sleep.

Remove the callout_pending(), callout_active(), and callout_deactivate()
calls from calculate_drop_prob().  They are not needed because this callout
uses callout_init_mtx().

Submitted by:	Rasool Al-Saadi <ralsaadi@swin.edu.au>
Approved by:	re (gjb)
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D6928
2016-07-05 00:53:01 +00:00
Bjoern A. Zeeb
9ac51e7911 In case of the global eventhandler make sure the current VNET
is still operational before doing any work;  otherwise we might
run into, e.g., destroyed locks.

PR:		210724
Reported by:	olevole olevole.ru
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Obtained from:	projects/vnet
Approved by:	re (gjb)
2016-06-30 19:32:45 +00:00
Bjoern A. Zeeb
31fe4e62fa Move the ipfw_log_bpf() calls from global module initialisation to
per-VNET initialisation and virtualise the interface cloning to
allow a dedicated ipfw log interface per VNET.

Approved by:		re (gjb)
MFC after:		2 weeks
Sponsored by:		The FreeBSD Foundation
2016-06-30 01:33:14 +00:00
Bjoern A. Zeeb
a8fc1b786d The void isn't void.
Unbreak sparc64 and powerpc builds.

Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
MFC after:	12 days
2016-06-24 11:53:12 +00:00
Bjoern A. Zeeb
66c00e9efb Proerply virtualize pfsync for bringup after pf is initialized and
teardown of VNETs once pf(4) has been shut down.
Properly split resources into VNET_SYS(UN)INITs and one time module
loading.
While here cover the INET parts in the uninit callpath with proper
#ifdefs.

Approved by:	re (gjb)
Obtained from:  projects/vnet
MFC after:      2 weeks
Sponsored by:   The FreeBSD Foundation
2016-06-23 22:31:44 +00:00
Bjoern A. Zeeb
7d7751a071 Make sure pflog is attached after pf is initializaed so we can
borrow pf's lock, and also make sure pflog goes after pf is gone
in order to avoid callouts in VNETs to an already freed instance.

Reported by:    Ivan Klymenko, Johan Hendriks  on current@ today
Obtained from:  projects/vnet
Sponsored by:   The FreeBSD Foundation
MFC after:      13 days
Approved by:	re (gjb)
2016-06-23 22:31:10 +00:00
Bjoern A. Zeeb
a8e8c57443 PFSTATE_NOSYNC goes onto state_flags, not sync_state;
this prevents: panic: pfsync_delete_state: unexpected sync state 8

Reviewed by:		kp
Approved by:		re (gjb)
MFC after:		2 weeks
Sponsored by:		The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D6942
2016-06-23 21:42:43 +00:00
Bjoern A. Zeeb
a0429b5459 Update pf(4) and pflog(4) to survive basic VNET testing, which includes
proper virtualisation, teardown, avoiding use-after-free, race conditions,
no longer creating a thread per VNET (which could easily be a couple of
thousand threads), gracefully ignoring global events (e.g., eventhandlers)
on teardown, clearing various globally cached pointers and checking
them before use.

Reviewed by:		kp
Approved by:		re (gjb)
Sponsored by:		The FreeBSD Foundation
MFC after:		2 weeks
Differential Revision:	https://reviews.freebsd.org/D6924
2016-06-23 21:34:38 +00:00
Bjoern A. Zeeb
8147948e19 Import a fix for and old security issue (CVE-2010-3830) in pf which
was not relevant to FreeBSD as only root could open /dev/pf by default.
With VIMAGE this is will longer be the case.  As pf(4) starts to
be supported with VNETs 3rd party users may open /dev/pf inside the
virtual jail instance; thus we need to address this issue after all.
While OpenBSD largely rewrote code parts for the fix [1], and it's
unclear what Apple [3] did, import the minimal fix from NetBSD [2].

[1] http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/net/pf_ioctl.c.diff?r1=1.235&r2=1.236
[2] http://mail-index.netbsd.org/source-changes/2011/01/19/msg017518.html
[3] https://support.apple.com/en-gb/HT202154

Obtained from:		http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/dist/pf/net/pf_ioctl.c.diff?r1=1.42&r2=1.43&only_with_tag=MAIN
MFC After:		2 weeks
Approved by:		re (gjb)
Sponsored by:		The FreeBSD Foundation
Security:		CVE-2010-3830
2016-06-23 05:41:46 +00:00
Bjoern A. Zeeb
89856f7e2d Get closer to a VIMAGE network stack teardown from top to bottom rather
than removing the network interfaces first. This change is rather larger
and convoluted as the ordering requirements cannot be separated.

Move the pfil(9) framework to SI_SUB_PROTO_PFIL, move Firewalls and
related modules to their own SI_SUB_PROTO_FIREWALL.
Move initialization of "physical" interfaces to SI_SUB_DRIVERS,
move virtual (cloned) interfaces to SI_SUB_PSEUDO.
Move Multicast to SI_SUB_PROTO_MC.

Re-work parts of multicast initialisation and teardown, not taking the
huge amount of memory into account if used as a module yet.

For interface teardown we try to do as many of them as we can on
SI_SUB_INIT_IF, but for some this makes no sense, e.g., when tunnelling
over a higher layer protocol such as IP. In that case the interface
has to go along (or before) the higher layer protocol is shutdown.

Kernel hhooks need to go last on teardown as they may be used at various
higher layers and we cannot remove them before we cleaned up the higher
layers.

For interface teardown there are multiple paths:
(a) a cloned interface is destroyed (inside a VIMAGE or in the base system),
(b) any interface is moved from a virtual network stack to a different
network stack ("vmove"), or (c) a virtual network stack is being shut down.
All code paths go through if_detach_internal() where we, depending on the
vmove flag or the vnet state, make a decision on how much to shut down;
in case we are destroying a VNET the individual protocol layers will
cleanup their own parts thus we cannot do so again for each interface as
we end up with, e.g., double-frees, destroying locks twice or acquiring
already destroyed locks.
When calling into protocol cleanups we equally have to tell them
whether they need to detach upper layer protocols ("ulp") or not
(e.g., in6_ifdetach()).

Provide or enahnce helper functions to do proper cleanup at a protocol
rather than at an interface level.

Approved by:		re (hrs)
Obtained from:		projects/vnet
Reviewed by:		gnn, jhb
Sponsored by:		The FreeBSD Foundation
MFC after:		2 weeks
Differential Revision:	https://reviews.freebsd.org/D6747
2016-06-21 13:48:49 +00:00
Kristof Provost
3e248e0fb4 pf: Filter on and set vlan PCP values
Adopt the OpenBSD syntax for setting and filtering on VLAN PCP values. This
introduces two new keywords: 'set prio' to set the PCP value, and 'prio' to
filter on it.

Reviewed by:    allanjude, araujo
Approved by:	re (gjb)
Obtained from:  OpenBSD (mostly)
Differential Revision:  https://reviews.freebsd.org/D6786
2016-06-17 18:21:55 +00:00
Alexander V. Chernikov
37aefa2ad1 Fix 4-byte overflow in ipv6_writemask.
This bug could cause some IPv6 table prefix delete requests to fail.

Obtained from:	Yandex LLC
2016-06-05 10:33:53 +00:00
Don Lewis
d673654796 Replace constant expressions that contain multiplications by
fractional floating point values with integer divides.  This will
eliminate any chance that the compiler will generate code to evaluate
the expression using floating point at runtime.

Suggested by:	bde
Submitted by:	Rasool Al-Saadi <ralsaadi@swin.edu.au>
MFC after:	8 days (with r300779 and r300949)
2016-06-01 20:04:24 +00:00
Don Lewis
fe4b5f6659 Cast some expressions that multiply a long long constant by a
floating point constant to int64_t.  This avoids the runtime
conversion of the the other operand in a set of comparisons from
int64_t to floating point and doing the comparisions in floating
point.

Suggested by:	lidl
Submitted by:	Rasool Al-Saadi <ralsaadi@swin.edu.au>
MFC after:	2 weeks (with r300779)
2016-05-29 07:23:56 +00:00
Don Lewis
248c72bfb8 Correct a typo in a comment.
MFC after:	2 weeks (with r300779)
2016-05-26 22:03:28 +00:00
Don Lewis
4e59799e1b Modify BOUND_VAR() macro to wrap all of its arguments in () and tweak
its expression to work on powerpc and sparc64 (gcc compatibility).

Correct a typo in a nearby comment.

MFC after:	2 weeks (with r300779)
2016-05-26 21:44:52 +00:00
Don Lewis
91336b403a Import Dummynet AQM version 0.2.1 (CoDel, FQ-CoDel, PIE and FQ-PIE).
Centre for Advanced Internet Architectures

Implementing AQM in FreeBSD

* Overview <http://caia.swin.edu.au/freebsd/aqm/index.html>

* Articles, Papers and Presentations
  <http://caia.swin.edu.au/freebsd/aqm/papers.html>

* Patches and Tools <http://caia.swin.edu.au/freebsd/aqm/downloads.html>

Overview

Recent years have seen a resurgence of interest in better managing
the depth of bottleneck queues in routers, switches and other places
that get congested. Solutions include transport protocol enhancements
at the end-hosts (such as delay-based or hybrid congestion control
schemes) and active queue management (AQM) schemes applied within
bottleneck queues.

The notion of AQM has been around since at least the late 1990s
(e.g. RFC 2309). In recent years the proliferation of oversized
buffers in all sorts of network devices (aka bufferbloat) has
stimulated keen community interest in four new AQM schemes -- CoDel,
FQ-CoDel, PIE and FQ-PIE.

The IETF AQM working group is looking to document these schemes,
and independent implementations are a corner-stone of the IETF's
process for confirming the clarity of publicly available protocol
descriptions. While significant development work on all three schemes
has occured in the Linux kernel, there is very little in FreeBSD.

Project Goals

This project began in late 2015, and aims to design and implement
functionally-correct versions of CoDel, FQ-CoDel, PIE and FQ_PIE
in FreeBSD (with code BSD-licensed as much as practical). We have
chosen to do this as extensions to FreeBSD's ipfw/dummynet firewall
and traffic shaper. Implementation of these AQM schemes in FreeBSD
will:
* Demonstrate whether the publicly available documentation is
  sufficient to enable independent, functionally equivalent implementations

* Provide a broader suite of AQM options for sections the networking
  community that rely on FreeBSD platforms

Program Members:

* Rasool Al Saadi (developer)

* Grenville Armitage (project lead)

Acknowledgements:

This project has been made possible in part by a gift from the
Comcast Innovation Fund.

Submitted by:	Rasool Al-Saadi <ralsaadi@swin.edu.au>
X-No objection:	core
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D6388
2016-05-26 21:40:13 +00:00
Kristof Provost
b599e8dc59 pf: Fix more ICMP mistranslation
In the default case fix the substitution of the destination address.

PR:		201519
Submitted by:	Max <maximos@als.nnov.ru>
MFC after:	1 week
2016-05-23 13:59:48 +00:00
Kristof Provost
c0c82715b8 pf: Fix ICMP translation
Fix ICMP source address rewriting in rdr scenarios.

PR:		201519
Submitted by:	Max <maximos@als.nnov.ru>
MFC after:	1 week
2016-05-23 12:41:29 +00:00
Kristof Provost
d9f4fce5a7 pf: Fix fragment timeout
We were inconsistent about the use of time_second vs. time_uptime.
Always use time_uptime so the value can be meaningfully compared.

Submitted by:	"Max" <maximos@als.nnov.ru>
MFC after:	4 days
2016-05-20 15:41:05 +00:00
Andrey V. Elsukov
d16f495cad Fix the regression introduced in r300143.
When we are creating new dynamic state use MATCH_FORWARD direction to
correctly initialize protocol's state.
2016-05-20 15:00:12 +00:00
Andrey V. Elsukov
96e84c57e1 Move protocol state handling code from lookup_dyn_rule_locked() function
into dyn_update_proto_state(). This allows eliminate the second state
lookup in the ipfw_install_state().
Also remove MATCH_* macros, they are defined in ip_fw_private.h as enum.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2016-05-18 12:53:21 +00:00
Andrey V. Elsukov
2685841b38 Make named objects set-aware. Now it is possible to create named
objects with the same name in different sets.

Add optional manage_sets() callback to objects rewriting framework.
It is intended to implement handler for moving and swapping named
object's sets. Add ipfw_obj_manage_sets() function that implements
generic sets handler. Use new callback to implement sets support for
lookup tables.
External actions objects are global and they don't support sets.
Modify eaction_findbyname() to reflect this.
ipfw(8) now may fail to move rules or sets, because some named objects
in target set may have conflicting names.
Note that ipfw_obj_ntlv type was changed, but since lookup tables
actually didn't support sets, this change is harmless.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2016-05-17 07:47:23 +00:00
Andrey V. Elsukov
9f2e5ed3cc Fix memory leak possible in error case.
Use free_rule() instead of free(), it will also release memory allocated
for rule counters.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2016-05-11 10:04:32 +00:00
Andrey V. Elsukov
b309f085e0 Change the type of objhash_cb_t callback function to be able return an
error code. Use it to interrupt the loop in ipfw_objhash_foreach().

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2016-05-06 03:18:51 +00:00
Andrey V. Elsukov
2df1a11ffa Rename find_name_tlv_type() to ipfw_find_name_tlv_type() and make it
global. Use it in ip_fw_table.c instead of find_name_tlv() to reduce
duplicated code.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2016-05-05 20:15:46 +00:00
Pedro F. Giffuni
a4641f4eaa sys/net*: minor spelling fixes.
No functional change.
2016-05-03 18:05:43 +00:00
Andrey V. Elsukov
9a5be809ab Make create_object callback optional and return EOPNOTSUPP when it isn't
defined. Remove eaction_create_compat() and use designated initializers to
initialize eaction_opcodes structure.

Obtained from:	Yandex LLC
2016-04-27 15:28:25 +00:00
Pedro F. Giffuni
7a6ab8f19e netpfil: for pointers replace 0 with NULL.
These are mostly cosmetical, no functional change.

Found with devel/coccinelle.

Reviewed by:	ae
2016-04-15 12:24:01 +00:00
Andrey V. Elsukov
2acdf79f53 Add External Actions KPI to ipfw(9).
It allows implementing loadable kernel modules with new actions and
without needing to modify kernel headers and ipfw(8). The module
registers its action handler and keyword string, that will be used
as action name. Using generic syntax user can add rules with this
action. Also ipfw(8) can be easily modified to extend basic syntax
for external actions, that become a part base system.
Sample modules will coming soon.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2016-04-14 22:51:23 +00:00
Andrey V. Elsukov
4bd916567e Change the type of 'etlv' field in struct named_object to uint16_t.
It should match with the type field in struct ipfw_obj_tlv.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2016-04-14 21:52:31 +00:00
Andrey V. Elsukov
f8e26ca319 Adjust some comments and make ref_opcode_object() static. 2016-04-14 21:45:18 +00:00
Andrey V. Elsukov
b2df1f7ea1 o Teach opcode rewriting framework handle several rewriters for
the same opcode.

o Reduce number of times classifier callback is called. It is
  redundant to call it just after find_op_rw(), since the last
  does call it already and can have all results.

o Do immediately opcode rewrite in the ref_opcode_object().
  This eliminates additional classifier lookup later on bulk update.
  For unresolved opcodes the behavior still the same, we save information
  from classifier callback in the obj_idx array, then perform automatic
  objects creation, then perform rewriting for opcodes using indeces
  from created objects.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2016-04-14 21:31:16 +00:00
Andrey V. Elsukov
f976a4edc0 Move several functions related to opcode rewriting framework from
ip_fw_table.c into ip_fw_sockopt.c and make them static.

Obtained from:	Yandex LLC
2016-04-14 20:49:27 +00:00
Pedro F. Giffuni
74b8d63dcc Cleanup unnecessary semicolons from the kernel.
Found with devel/coccinelle.
2016-04-10 23:07:00 +00:00
Kristof Provost
0d8c93313e pf: Improve forwarding detection
When we guess the nature of the outbound packet (output vs. forwarding) we need
to take bridges into account. When bridging the input interface does not match
the output interface, but we're not forwarding. Similarly, it's possible for the
interface to actually be the bridge interface itself (and not a member interface).

PR:		202351
MFC after:	2 weeks
2016-03-16 06:42:15 +00:00
Andrey V. Elsukov
657592fd65 Use correct size for malloc.
Obtained from:	Yandex LLC
MFC after:	1 week
2016-03-03 13:07:59 +00:00
John Baldwin
cbc4d2db75 Remove taskqueue_enqueue_fast().
taskqueue_enqueue() was changed to support both fast and non-fast
taskqueues 10 years ago in r154167.  It has been a compat shim ever
since.  It's time for the compat shim to go.

Submitted by:	Howard Su <howard0su@gmail.com>
Reviewed by:	sephe
Differential Revision:	https://reviews.freebsd.org/D5131
2016-03-01 17:47:32 +00:00
Kristof Provost
14b5e85b18 pf: Fix possible out-of-bounds write
In the DIOCRSETADDRS ioctl() handler we allocate a table for struct pfr_addrs,
which is processed in pfr_set_addrs(). At the users request we also provide
feedback on the deleted addresses, by storing them after the new list
('bcopy(&ad, addr + size + i, sizeof(ad));' in pfr_set_addrs()).

This means we write outside the bounds of the buffer we've just allocated.
We need to look at pfrio_size2 instead (i.e. the size the user reserved for our
feedback). That'd allow a malicious user to specify a smaller pfrio_size2 than
pfrio_size though, in which case we'd still read outside of the allocated
buffer. Instead we allocate the largest of the two values.

Reported By:	Paul J Murphy <paul@inetstat.net>
PR:		207463
MFC after:	5 days
Differential Revision:	https://reviews.freebsd.org/D5426
2016-02-25 07:33:59 +00:00
Andrey V. Elsukov
23a6c7330c Fix bug in filling and handling ipfw's O_DSCP opcode.
Due to integer overflow CS4 token was handled as BE.

PR:		207459
MFC after:	1 week
2016-02-24 13:16:03 +00:00
Kristof Provost
c90369f880 in pf_print_state_parts, do not use skw->proto to print the protocol but our
local copy proto that we very carefully set beforehands. skw being NULL is
perfectly valid there.

Obtained from:	OpenBSD (henning)
2016-02-20 12:53:53 +00:00
Gleb Smirnoff
cd82d21b2e Fix obvious typo, that lead to incorrect sorting.
Found by:	PVS-Studio
2016-02-18 19:05:30 +00:00
Gleb Smirnoff
8ec07310fa These files were getting sys/malloc.h and vm/uma.h with header pollution
via sys/mbuf.h
2016-02-01 17:41:21 +00:00
Luigi Rizzo
1cdc5f0b87 cleanup and document in some detail the internals of the testing code
for dummynet schedulers
2016-01-27 02:22:31 +00:00
Luigi Rizzo
ff8d60ab4d the _Static_assert was not supposed to be in the commit. 2016-01-27 02:14:08 +00:00
Luigi Rizzo
788c0c66ab bugfix: the scheduler template (dn_schk) for the round robin scheduler
is followed by another structure (rr_schk) whose size must be set
in the schk_datalen field of the descriptor.
Not allocating the memory may cause other memory to be overwritten
(though dn_schk is 192 bytes and rr_schk only 12 so we may be lucky
and end up in the padding after the dn_schk).

This is a merge candidate for stable and 10.3

MFC after:	3 days
2016-01-27 02:08:30 +00:00
Luigi Rizzo
10d72ffc7d fix various warnings to compile the test code with -Wextra 2016-01-26 23:37:07 +00:00
Luigi Rizzo
fa57c83c70 fix various warnings (signed/unsigned, printf types, unused arguments) 2016-01-26 23:36:18 +00:00
Luigi Rizzo
f6a5c66400 prevent warnings for signed/unsigned comparisons and unused arguments.
Add checks for parameters overflowing 32 bit.
2016-01-26 22:46:58 +00:00
Luigi Rizzo
e72cd9a70d prevent warning for unused argument 2016-01-26 22:45:45 +00:00
Luigi Rizzo
4d85bfeb07 avoid warnings for signed/unsigned comparison and unused arguments 2016-01-26 22:45:05 +00:00
Luigi Rizzo
f51b072d4c Revert one chunk from commit 285362, which introduced an off-by-one error
in computing a shift index. The error was due to the use of mixed
fls() / __fls() functions in another implementation of qfq.
To avoid that the problem occurs again, properly document which
incarnation of the function we need.
Note that the bug only affects QFQ in FreeBSD head from last july, as
the patch was not merged to other versions.
2016-01-26 04:48:24 +00:00
Alexander V. Chernikov
61eee0e202 MFP r287070,r287073: split radix implementation and route table structure.
There are number of radix consumers in kernel land (pf,ipfw,nfs,route)
  with different requirements. In fact, first 3 don't have _any_ requirements
  and first 2 does not use radix locking. On the other hand, routing
  structure do have these requirements (rnh_gen, multipath, custom
  to-be-added control plane functions, different locking).
Additionally, radix should not known anything about its consumers internals.

So, radix code now uses tiny 'struct radix_head' structure along with
  internal 'struct radix_mask_head' instead of 'struct radix_node_head'.
  Existing consumers still uses the same 'struct radix_node_head' with
  slight modifications: they need to pass pointer to (embedded)
  'struct radix_head' to all radix callbacks.

Routing code now uses new 'struct rib_head' with different locking macro:
  RADIX_NODE_HEAD prefix was renamed to RIB_ (which stands for routing
  information base).

New net/route_var.h header was added to hold routing subsystem internal
  data. 'struct rib_head' was placed there. 'struct rtentry' will also
  be moved there soon.
2016-01-25 06:33:15 +00:00
Alexander V. Chernikov
fa7c058bf8 Fix panic on table/table entry delete. The panic could have happened
if more than 64 distinct values had been used.

Table value code uses internal objhash API which requires unique key
  for each object. For value code, pointer to the actual value data
  is used. The actual problem arises from the fact that 'actual' e.g.
  runtime data is stored in array and that array is auto-growing. There is
  special hook (update_tvalue() function) which is used to update the pointers
  after the change. For some reason, object 'key' was not updated.
  Fix this by adding update code to the update_tvalue().

Sponsored by:	Yandex LLC
2016-01-21 18:20:40 +00:00
Alexander V. Chernikov
89fc126add Initialize error value ta_lookup_kfib() by default to please compiler. 2016-01-10 08:37:00 +00:00
Bjoern A. Zeeb
60c274aaf8 Initialize error after r293626 in case neither INET nor INET6 is
compiled into the kernel.  Ideally lots more code would just not
be called (or compiled in) in that case but that requires a lot
more surgery.  For now try to make IP-less kernels compile again.
2016-01-10 08:14:25 +00:00
Alexander V. Chernikov
004d3e30a7 Make ipfw addr:kfib lookup algo use new routing KPI. 2016-01-10 06:43:43 +00:00
Alexander V. Chernikov
3673828490 Use already pre-calculated number of entries instead of tc->count. 2016-01-10 00:28:44 +00:00
Alexander V. Chernikov
ea8d14925c Remove sys/eventhandler.h from net/route.h
Reviewed by:	ae
2016-01-09 09:34:39 +00:00
Alexander V. Chernikov
460a5b502f Convert pf(4) to the new routing API.
Differential Revision:	https://reviews.freebsd.org/D4763
2016-01-07 10:20:03 +00:00
Hans Petter Selasky
c8cfbc066f Properly drain callouts in the IPFW subsystem to avoid use after free
panics when unloading the dummynet and IPFW modules:

- The callout drain function can sleep and should not be called having
a non-sleepable lock locked. Remove locks around "ipfw_dyn_uninit(0)".

- Add a new "dn_gone" variable to prevent asynchronous restart of
dummynet callouts when unloading the dummynet kernel module.

- Call "dn_reschedule()" locked so that "dn_gone" can be set and
checked atomically with regard to starting a new callout.

Reviewed by:	hiren
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D3855
2015-12-15 09:02:05 +00:00
Alexander V. Chernikov
65ff3638df Merge helper fib* functions used for basic lookups.
Vast majority of rtalloc(9) users require only basic info from
route table (e.g. "does the rtentry interface match with the interface
  I have?". "what is the MTU?", "Give me the IPv4 source address to use",
  etc..).
Instead of hand-rolling lookups, checking if rtentry is up, valid,
  dealing with IPv6 mtu, finding "address" ifp (almost never done right),
  provide easy-to-use API hiding all the complexity and returning the
  needed info into small on-stack structure.

This change also helps hiding route subsystem internals (locking, direct
  rtentry accesses).
Additionaly, using this API improves lookup performance since rtentry is not
  locked.
(This is safe, since all the rtentry changes happens under both radix WLOCK
  and rtentry WLOCK).

Sponsored by:	Yandex LLC
2015-12-08 10:50:03 +00:00
Andrey V. Elsukov
1cf09efe5d Add destroy_object callback to object rewriting framework.
It is called when last reference to named object is going to be released
and allows to do additional cleanup for implementation of named objects.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2015-11-23 22:06:55 +00:00
Bryan Drewery
7143303723 Fix dynamic IPv6 rules showing junk for non-specified address masks.
For example:
  00002      0         0 (19s) PARENT 1 tcp 10.10.0.5 0 <-> 0.0.0.0 0
  00002      4       412 (1s) LIMIT tcp 10.10.0.5 25848 <-> 10.10.0.7 22
  00002     10       777 (1s) LIMIT tcp 2001:894:5a24:653::503:1 52023 <-> 2001:894:5a24:653:ca0a:a9ff:fe04:3978 22
  00002      0         0 (17s) PARENT 1 tcp 2001:894:5a24:653::503:1 0 <-> 80f3:70d:23fe:ffff:1005:: 0

Fix this by zeroing the unused address, as is done for IPv4:
  00002     0         0 (18s) PARENT 1 tcp 10.10.0.5 0 <-> 0.0.0.0 0
  00002    36     14952 (1s) LIMIT tcp 10.10.0.5 25848 <-> 10.10.0.7 22
  00002     0         0 (0s) PARENT 1 tcp 2001:894:5a24:653::503:1 0 <-> :: 0
  00002     4       345 (274s) LIMIT tcp 2001:894:5a24:653::503:1 34131 <-> 2001:470:1f11:262:ca0a:a9ff:fe04:3978 22

MFC after:	2 weeks
2015-11-17 20:42:08 +00:00
Alexander V. Chernikov
637670e77e Bring back the ability of passing cached route via nd6_output_ifp(). 2015-11-15 16:02:22 +00:00
Randall Stewart
7c4676ddee This fixes several places where callout_stops return is examined. The
new return codes of -1 were mistakenly being considered "true". Callout_stop
now returns -1 to indicate the callout had either already completed or
was not running and 0 to indicate it could not be stopped.  Also update
the manual page to make it more consistent no non-zero in the callout_stop
or callout_reset descriptions.

MFC after:	1 Month with associated callout change.
2015-11-13 22:51:35 +00:00