Commit Graph

278 Commits

Author SHA1 Message Date
Alexander V. Chernikov
4c0c07a552 * Permit limiting number of items in table.
Kernel changes:
* Add TEI_FLAGS_DONTADD entry flag to indicate that insert is not possible
* Support given flag in all algorithms
* Add "limit" field to ipfw_xtable_info
* Add actual limiting code into add_table_entry()

Userland changes:
* Add "limit" option as "create" table sub-option. Limit modification
  is currently impossible.
* Print human-readable errors in table enry addition/deletion code.
2014-08-01 15:17:46 +00:00
Alexander V. Chernikov
95c3c1e25f Do not perform memset() on ta_buf in algo callbacks:
it is already zeroed by base code.
2014-08-01 08:39:47 +00:00
Alexander V. Chernikov
2e324d2931 Simplify radix operations: use unified tei_to_sockaddr_ent() to generate
keys for add/delete calls.
2014-08-01 08:28:18 +00:00
Alexander V. Chernikov
57a1cf954e * Use TA_FLAG_DEFAULT for default algorithm selection instead of
exporting algorithm structures directly.

* Pass needed state buffer size in algo structures as preparation
  for tables add/del requests batching.
2014-08-01 07:35:17 +00:00
Alexander V. Chernikov
914bffb6ab * Add new "flow" table type to support N=1..5-tuple lookups
* Add "flow:hash" algorithm

Kernel changes:
* Add O_IP_FLOW_LOOKUP opcode to support "flow" lookups
* Add IPFW_TABLE_FLOW table type
* Add "struct tflow_entry" as strage for 6-tuple flows
* Add "flow:hash" algorithm. Basically it is auto-growing chained hash table.
  Additionally, we store mask of fields we need to compare in each instance/

* Increase ipfw_obj_tentry size by adding struct tflow_entry
* Add per-algorithm stat (ifpw_ta_tinfo) to ipfw_xtable_info
* Increase algoname length: 32 -> 64 (algo options passed there as string)
* Assume every table type can be customized by flags, use u8 to store "tflags" field.
* Simplify ipfw_find_table_entry() by providing @tentry directly to algo callback.
* Fix bug in cidr:chash resize procedure.

Userland changes:
* add "flow table(NAME)" syntax to support n-tuple checking tables.
* make fill_flags() separate function to ease working with _s_x arrays
* change "table info" output to reflect longer "type" fields

Syntax:
ipfw table fl2 create type flow:[src-ip][,proto][,src-port][,dst-ip][dst-port] [algo flow:hash]

Examples:

0:02 [2] zfscurr0# ipfw table fl2 create type flow:src-ip,proto,dst-port algo flow:hash
0:02 [2] zfscurr0# ipfw table fl2 info
+++ table(fl2), set(0) +++
 kindex: 0, type: flow:src-ip,proto,dst-port
 valtype: number, references: 0
 algorithm: flow:hash
 items: 0, size: 280
0:02 [2] zfscurr0# ipfw table fl2 add 2a02:6b8::333,tcp,443 45000
0:02 [2] zfscurr0# ipfw table fl2 add 10.0.0.92,tcp,80 22000
0:02 [2] zfscurr0# ipfw table fl2 list
+++ table(fl2), set(0) +++
2a02:6b8::333,6,443 45000
10.0.0.92,6,80 22000
0:02 [2] zfscurr0# ipfw add 200 count tcp from me to 78.46.89.105 80 flow 'table(fl2)'
00200 count tcp from me to 78.46.89.105 dst-port 80 flow table(fl2)
0:03 [2] zfscurr0# ipfw show
00200   0     0 count tcp from me to 78.46.89.105 dst-port 80 flow table(fl2)
65535 617 59416 allow ip from any to any
0:03 [2] zfscurr0# telnet -s 10.0.0.92 78.46.89.105 80
Trying 78.46.89.105...
..
0:04 [2] zfscurr0# ipfw show
00200   5   272 count tcp from me to 78.46.89.105 dst-port 80 flow table(fl2)
65535 682 66733 allow ip from any to any
2014-07-31 20:08:19 +00:00
Alexander V. Chernikov
b23d5de9b6 * Add number:array algorithm lookup method.
Kernel changes:
* s/IPFW_TABLE_U32/IPFW_TABLE_NUMBER/
* Force "lookup <port|uid|gid|jid>" to be IPFW_TABLE_NUMBER
* Support "lookup" method for number tables
* Add number:array algorihm (i32 as key, auto-growing).

Userland changes:
* Support named tables in "lookup <tag> Table"
* Fix handling of "table(NAME,val)" case
* Support printing "number" table data.
2014-07-30 14:52:26 +00:00
Alexander V. Chernikov
ce2817b51c * Add "lookup" method for cidr:hash algorithm type.
* Add auoto-grow ability to cidr:hash type.
* Fix some bugs / simplify implementation for cidr:hash.
2014-07-30 12:39:49 +00:00
Alexander V. Chernikov
daabb523cc Fix "flush" cmd for algorithms wih non-default parameters. 2014-07-30 09:17:40 +00:00
Alexander V. Chernikov
b429d43c36 * Introduce ipfw_ctl3() handler and move all IP_FW3 opcodes there.
The long-term goal is to switch remaining opcodes to IP_FW3 versions
 and use ipfw_ctl3() as default handler simplifying ipfw(4) interaction
 with external world.
2014-07-29 23:06:06 +00:00
Alexander V. Chernikov
9d099b4f38 * Dump available table algorithms via "ipfw talist" cmd.
Kernel changes:
* Add type/refcount fields to table algo instances.
* Add IP_FW_TABLES_ALIST opcode to export available algorihms to userland.

Userland changes:
* Fix cores on empty input inside "ipfw table" handler.
* Add "ipfw talist" cmd to print availabled kernel algorithms.
* Change "table info" output to reflect long algorithm config lines.
2014-07-29 22:44:26 +00:00
Alexander V. Chernikov
0b565ac0e6 * Copy ta structures to stable storage to ease future extension.
* Remove algo .lookup field since table lookup function is set by algo code.
2014-07-29 21:38:06 +00:00
Alexander V. Chernikov
74b941f042 * Add new ipfw cidr algorihm: hash table.
Algorithm works with both IPv4 and IPv6 prefixes, /32 and /128
ranges are assumed by default.
It works the following way: input IP address is masked to specified
mask, hashed and searched inside hash bucket.

Current implementation does not support "lookup" method and hash auto-resize.
This will be changed soon.

some examples:

ipfw table mi_test2 create type cidr algo cidr:hash
ipfw table mi_test create type cidr algo "cidr:hash masks=/30,/64"

ipfw table mi_test2 info
+++ table(mi_test2), set(0) +++
 type: cidr, kindex: 7
 valtype: number, references: 0
 algorithm: cidr:hash
 items: 0, size: 220

ipfw table mi_test info
+++ table(mi_test), set(0) +++
 type: cidr, kindex: 6
 valtype: number, references: 0
 algorithm: cidr:hash masks=/30,/64
 items: 0, size: 220

ipfw table mi_test add 10.0.0.5/30
ipfw table mi_test add 10.0.0.8/30
ipfw table mi_test add 2a02:6b8:b010::1/64 25

ipfw table mi_test list
+++ table(mi_test), set(0) +++
10.0.0.4/30 0
10.0.0.8/30 0
2a02:6b8:b010::/64 25
2014-07-29 19:49:38 +00:00
Alexander V. Chernikov
adea620132 * Change algorthm names to "type:algo" (e.g. "iface:array", "cidr:radix") format.
* Pass number of items changed in add/del hooks to permit adding/deleting
  multiple values at once.
2014-07-29 08:00:13 +00:00
Alexander V. Chernikov
68394ec88e * Add generic ipfw interface tracking API
* Rewrite interface tables to use interface indexes

Kernel changes:
* Add generic interface tracking API:
 - ipfw_iface_ref (must call unlocked, performs lazy init if needed, allocates
  state & bumps ref)
 - ipfw_iface_add_ntfy(UH_WLOCK+WLOCK, links comsumer & runs its callback to
  update ifindex)
 - ipfw_iface_del_ntfy(UH_WLOCK+WLOCK, unlinks consumer)
 - ipfw_iface_unref(unlocked, drops reference)
Additionally, consumer callbacks are called in interface withdrawal/departure.

* Rewrite interface tables to use iface tracking API. Currently tables are
  implemented the following way:
  runtime data is stored as sorted array of {ifidx, val} for existing interfaces
  full data is stored inside namedobj instance (chained hashed table).

* Add IP_FW_XIFLIST opcode to dump status of tracked interfaces

* Pass @chain ptr to most non-locked algorithm callbacks:
  (prepare_add, prepare_del, flush_entry ..). This may be needed for better
  interaction of given algorithm an other ipfw subsystems

* Add optional "change_ti" algorithm handler to permit updating of
  cached table_info pointer (happens in case of table_max resize)

* Fix small bug in ipfw_list_tables()
* Add badd (insert into sorted array) and bdel (remove from sorted array) funcs

Userland changes:
* Add "iflist" cmd to print status of currently tracked interface
* Add stringnum_cmp for better interface/table names sorting
2014-07-28 19:01:25 +00:00
Alexander V. Chernikov
db785d3199 * Require explicit table creation before use on kernel side.
* Add resize callbacks for upcoming table-based algorithms.

Kernel changes:
* s/ipfw_modify_table/ipfw_manage_table_ent/
* Simplify add_table_entry(): make table creation a separate piece of code.
  Do not perform creation if not in "compat" mode.
* Add ability to perform modification of algorithm state (like table resize).
  The following callbacks were added:
 - prepare_mod (allocate new state, without locks)
 - fill_mod (UH_WLOCK, copy old state to new one)
 - modify (UH_WLOCK + WLOCK, switch state)
 - flush_mod (no locks, flushes allocated data)
 Given callbacks are called if table modification has been requested by add or
   delete callbacks. Additional u64 tc->'flags' field was added to pass these
   requests.
* Change add/del table ent format: permit adding/removing multiple entries
   at once (only 1 supported at the moment).

Userland changes:
* Auto-create tables with warning
2014-07-26 13:37:25 +00:00
Gleb Smirnoff
8ff2bd98d6 On machines with strict alignment copy pfsync_state_key from packet
on stack to avoid unaligned access.

PR:		187381
Submitted by:	Lytochkin Boris <lytboris gmail.com>
2014-07-10 12:41:58 +00:00
Alexander V. Chernikov
e0a8b9ee29 * Reduce size of ipfw table entries for cidr/iface:
Since old structures had _value as the last field,
every table match required 3 cache lines instead of 2.
Fix this by
- using the fact that supplied masks are suplicated inside radix
- using lightweigth sa_in6 structure as key for IPv6

Before (amd64):
  sizeof(table_entry): 136
  sizeof(table_xentry): 160
After (amd64):
  sizeof(radix_cidr_entry): 120
  sizeof(radix_cidr_xentry): 128
  sizeof(radix_iface): 128

* Fix memory leak for table entry update
* Do some more sanity checks while deleting entry
* Do not store masks for host routes

Sponsored by:	Yandex LLC
2014-07-09 18:52:12 +00:00
Alexander V. Chernikov
7e767c791f * Use different rule structures in kernel/userland.
* Switch kernel to use per-cpu counters for rules.
* Keep ABI/API.

Kernel changes:
* Each rules is now exported as TLV with optional extenable
  counter block (ip_fW_bcounter for base one) and
  ip_fw_rule for rule&cmd data.
* Counters needs to be explicitly requested by IPFW_CFG_GET_COUNTERS flag.
* Separate counters from rules in kernel and clean up ip_fw a bit.
* Pack each rule in IPFW_TLV_RULE_ENT tlv to ease parsing.
* Introduce versioning in container TLV (may be needed in future).
* Fix ipfw_cfg_lheader broken u64 alignment.

Userland changes:
* Use set_mask from cfg header when requesting config
* Fix incorrect read accouting in ipfw_show_config()
* Use IPFW_RULE_NOOPT flag instead of playing with _pad
* Fix "ipfw -d list": do not print counters for dynamic states
* Some small fixes
2014-07-08 23:11:15 +00:00
Alexander V. Chernikov
6447bae661 * Prepare to pass other dynamic states via ipfw_dump_config()
Kernel changes:
* Change dump format for dynamic states:
  each state is now stored inside ipfw_obj_dyntlv
  last dynamic state is indicated by IPFW_DF_LAST flag
* Do not perform sooptcopyout() for !SOPT_GET requests.

Userland changes:
* Introduce foreach_state() function handler to ease work
  with different states passed by ipfw_dump_config().
2014-07-06 23:26:34 +00:00
Alexander V. Chernikov
81d3153d61 * Add "lookup" table functionality to permit userland entry lookups.
* Bump table dump format preserving old ABI.

Kernel size:
* Add IP_FW_TABLE_XFIND to handle "lookup" request from userland.
* Add ta_find_tentry() algorithm callbacks/handlers to support lookups.
* Fully switch to ipfw_obj_tentry for various table dumps:
  algorithms are now required to support the latest (ipfw_obj_tentry) entry
    dump format, the rest is handled by generic dump code.
  IP_FW_TABLE_XLIST opcode version bumped (0 -> 1).
* Eliminate legacy ta_dump_entry algo handler:
  dump_table_entry() converts data from current to legacy format.

Userland side:
* Add "lookup" table parameter.
* Change the way table type is guessed: call table_get_info() first,
  and check value for IPv4/IPv6 type IFF table does not exist.
* Fix table_get_list(): do more tries if supplied buffer is not enough.
* Sparate table_show_entry() from table_show_list().
2014-07-06 18:16:04 +00:00
Alexander V. Chernikov
1832a7b303 * Issue warning while requesting ruleset with new tables via legacy binary.
Convert each unresolved table as table 65535 (which cannot be used normally).
* Perform s/^ipfw_// for add_table_entry, del_table_entry and flush_table since
  these are internal functions exported to keep legacy interface.
* Remove macro TABLE_SET. Operations with tables can be done in any set, the only
  thing net.inet.ip.fw.tables_sets affects is the set in which tables are looked
  up while binding them to the rule.
2014-07-04 07:02:11 +00:00
Alexander V. Chernikov
ac35ff1784 Fully switch to named tables:
Kernel changes:
* Introduce ipfw_obj_tentry table entry structure to force u64 alignment.
* Support "update-on-existing-key" "add" bahavior (TEI_FLAGS_UPDATED).
* Use "subtype" field to distingush between IPv4 and IPv6 table records
  instead of previous hack.
* Add value type (vtype) field for kernel tables. Current types are
  number,ip and dscp
* Fix sets mask retrieval for old binaries
* Fix crash while using interface tables

Userland changes:
* Switch ipfw_table_handler() to use named-only tables.
* Add "table NAME create [type {cidr|iface|u32} [valtype {number|ip|dscp}] ..."
* Switch ipfw_table_handler to match_token()-based parser.
* Switch ipfw_sets_handler to use new ipfw_get_config() for mask  retrieval.
* Allow ipfw set X table ... syntax to permit using per-set table namespaces.
2014-07-03 22:25:59 +00:00
Alexander V. Chernikov
6c2997ffec * Add new IP_FW_XADD opcode which permits to
a) specify table ids as names
  b) add multiple rules at once.
Partially convert current code for atomic addition of multiple rules.
2014-06-29 22:35:47 +00:00
Alexander V. Chernikov
2aa75134b7 Enable kernel-side rule filtering based on user request.
Make do_get3() function return real error.
2014-06-29 09:29:27 +00:00
Alexander V. Chernikov
563b5ab132 Suppord showing named tables in ipfw(8) rule listing.
Kernel changes:
* change base TLV header to be u64 (so size can be u32).
* Introduce ipfw_obj_ctlv generc container TLV.
* Add IP_FW_XGET opcode which is now used for atomic configuration
  retrieval. One can specify needed configuration pieces to retrieve
  via flags field. Currently supported are
  IPFW_CFG_GET_STATIC (static rules) and
  IPFW_CFG_GET_STATES (dynamic states).
  Other configuration pieces (tables, pipes, etc..) support is planned.

Userland changes:
* Switch ipfw(8) to use new IP_FW_XGET for rule listing.
* Split rule listing code get and show pieces.
* Make several steps forward towards libipfw:
  permit printing states and rules(paritally) to supplied buffer.
  do not die on malloc/kernel failure inside given printing functions.
  stop assuming cmdline_opts is global symbol.
2014-06-28 23:20:24 +00:00
Hans Petter Selasky
af3b2549c4 Pull in r267961 and r267973 again. Fix for issues reported will follow. 2014-06-28 03:56:17 +00:00
Glen Barber
37a107a407 Revert r267961, r267973:
These changes prevent sysctl(8) from returning proper output,
such as:

 1) no output from sysctl(8)
 2) erroneously returning ENOMEM with tools like truss(1)
    or uname(1)
 truss: can not get etype: Cannot allocate memory
2014-06-27 22:05:21 +00:00
Hans Petter Selasky
3da1cf1e88 Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
Alexander V. Chernikov
2d99a3497d Use different approach for filling large datasets to userspace:
Instead of trying to allocate bing contiguous chunk of memory,
use intermediate-sized (page size) buffer as sliding window
reducing number of sooptcopyout() calls to perform.

This reduces dump functions complexity and provides additional
layer of abstraction.

User-visible api consists of 2 functions:
ipfw_get_sopt_space() - gets contigious amount of storage (or NULL)
and
ipfw_get_sopt_header() - the same, but zeroes the rest of the buffer.
2014-06-27 10:07:00 +00:00
Alexander V. Chernikov
9490a62716 * Add IP_FW_TABLE_XCREATE / IP_FW_TABLE_XMODIFY opcodes.
* Add 'algoname' string to ipfw_xtable_info permitting to specify lookup
algoritm with parameters.
* Rework part of ipfw_rewrite_table_uidx()

Sponsored by:	Yandex LLC
2014-06-16 13:05:07 +00:00
Alexander V. Chernikov
9c3c43aa77 Remove unused ipfw_dump_xtable(). 2014-06-15 13:43:44 +00:00
Alexander V. Chernikov
d3a4f9249c Simplify opcode handling.
* Use one u16 from op3 header to implement opcode versioning.
* IP_FW_TABLE_XLIST has now 2 handlers, for ver.0 (old) and ver.1 (current).
* Every getsockopt request is now handled in ip_fw_table.c
* Rename new opcodes:
IP_FW_OBJ_DEL -> IP_FW_TABLE_XDESTROY
IP_FW_OBJ_LISTSIZE -> IP_FW_TABLES_XGETSIZE
IP_FW_OBJ_LIST -> IP_FW_TABLES_XLIST
IP_FW_OBJ_INFO -> IP_FW_TABLE_XINFO
IP_FW_OBJ_INFO -> IP_FW_TABLE_XFLUSH

* Add some docs about using given opcodes.
* Group some legacy opcode/handlers.
2014-06-15 13:40:27 +00:00
Alexander V. Chernikov
f1220db8d7 Move further to eliminate next pieces of number-assuming code inside tables.
Kernel changes:
* Add IP_FW_OBJ_FLUSH opcode (flush table based on its name/set)
* Add IP_FW_OBJ_DUMP opcode (dumps table data based on its names/set)
* Add IP_FW_OBJ_LISTSIZE / IP_FW_OBJ_LIST opcodes (get list of kernel tables)

Userland changes:
* move tables code to separate tables.c file
* get rid of tables_max
* switch "all"/list handling to new opcodes
2014-06-14 22:47:25 +00:00
Alexander V. Chernikov
ea761a5dc4 Move most of external table structures/functions to separate ip_fw_table.h 2014-06-14 11:13:02 +00:00
Alexander V. Chernikov
9f7d47b025 Add API to ease adding new algorithms/new tabletypes to ipfw.
Kernel-side changelog:
* Split general tables code and algorithm-specific table data.
  Current algorithms (IPv4/IPv6 radix and interface tables radix) moved to
  new ip_fw_table_algo.c file.
  Tables code now supports any algorithm implementing the following callbacks:
+struct table_algo {
+       char            name[64];
+       int             idx;
+       ta_init         *init;
+       ta_destroy      *destroy;
+       table_lookup_t  *lookup;
+       ta_prepare_add  *prepare_add;
+       ta_prepare_del  *prepare_del;
+       ta_add          *add;
+       ta_del          *del;
+       ta_flush_entry  *flush_entry;
+       ta_foreach      *foreach;
+       ta_dump_entry   *dump_entry;
+       ta_dump_xentry  *dump_xentry;
+};

* Change ->state, ->xstate, ->tabletype fields of ip_fw_chain to
   ->tablestate pointer (array of 32 bytes structures necessary for
   runtime lookups (can be probably shrinked to 16 bytes later):

   +struct table_info {
   +       table_lookup_t  *lookup;        /* Lookup function */
   +       void            *state;         /* Lookup radix/other structure */
   +       void            *xstate;        /* eXtended state */
   +       u_long          data;           /* Hints for given func */
   +};

* Add count method for namedobj instance to ease size calculations
* Bump ip_fw3 buffer in ipfw_clt 128->256 bytes.
* Improve bitmask resizing on tables_max change.
* Remove table numbers checking from most places.
* Fix wrong nesting in ipfw_rewrite_table_uidx().

* Add IP_FW_OBJ_LIST opcode (list all objects of given type, currently
    implemented for IPFW_OBJTYPE_TABLE).
* Add IP_FW_OBJ_LISTSIZE (get buffer size to hold IP_FW_OBJ_LIST data,
    currenly implemented for IPFW_OBJTYPE_TABLE).
* Add IP_FW_OBJ_INFO (requests info for one object of given type).

Some name changes:
s/ipfw_xtable_tlv/ipfw_obj_tlv/ (no table specifics)
s/ipfw_xtable_ntlv/ipfw_obj_ntlv/ (no table specifics)

Userland changes:
* Add do_set3() cmd to ipfw2 to ease dealing with op3-embeded opcodes.
* Add/improve support for destroy/info cmds.
2014-06-14 10:58:39 +00:00
Alexander V. Chernikov
b074b7bbce Make ipfw tables use names as used-level identifier internally:
* Add namedobject set-aware api capable of searching/allocation objects by their name/idx.
* Switch tables code to use string ids for configuration tasks.
* Change locking model: most configuration changes are protected with UH lock, runtime-visible are protected with both locks.
* Reduce number of arguments passed to ipfw_table_add/del by using separate structure.
* Add internal V_fw_tables_sets tunable (set to 0) to prepare for set-aware tables (requires opcodes/client support)
* Implement typed table referencing (and tables are implicitly allocated with all state like radix ptrs on reference)
* Add "destroy" ipfw(8) using new IP_FW_DELOBJ opcode

Namedobj more detailed:
* Blackbox api providing methods to add/del/search/enumerate objects
* Statically-sized hashes for names/indexes
* Per-set bitmask to indicate free indexes
* Separate methods for index alloc/delete/resize

Basically, there should not be any user-visible changes except the following:
* reducing table_max is not supported
* flush & add change table type won't work if table is referenced

Sponsored by:	Yandex LLC
2014-06-12 09:59:11 +00:00
Hiren Panchasara
046faba0eb DNOLD_IS_ECN introduced by r266941 is not required.
DNOLD_* flags are for compat with old binaries.

Suggested by: luigi
2014-06-01 20:19:17 +00:00
Hiren Panchasara
fc5e1956d9 ECN marking implenetation for dummynet.
Changes include both DCTCP and RFC 3168 ECN marking methodology.

DCTCP draft: http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00

Submitted by:	Midori Kato (aoimidori27@gmail.com)
Worked with:	Lars Eggert (lars@netapp.com)
Reviewed by:	luigi, hiren
2014-06-01 07:28:24 +00:00
John Baldwin
b437b06c79 Fix pf(4) to build with MAXCPU set to 256. MAXCPU is actually a count,
not a maximum ID value (so it is a cap on mp_ncpus, not mp_maxid).
2014-05-29 19:17:10 +00:00
Andrey V. Elsukov
3a5db2d492 Since ipfw nat configures all options in one step, we should set all bits
in the mask when calling LibAliasSetMode() to properly clear unneeded
options.

PR:		189655
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-05-18 14:25:19 +00:00
Alexander V. Chernikov
c3015737f3 Fix wrong formatting of 0.0.0.0/X table records in ipfw(8).
Add `flags` u16 field to the hole in ipfw_table_xentry structure.
Kernel has been guessing address family for supplied record based
on xent length size.
Userland, however, has been getting fixed-size ipfw_table_xentry structures
guessing address family by checking address by IN6_IS_ADDR_V4COMPAT().

Fix this behavior by providing specific IPFW_TCF_INET flag for IPv4 records.

PR:		bin/189471
Submitted by:	Dennis Yusupoff <dyr@smartspb.net>
MFC after:	2 weeks
2014-05-17 13:45:03 +00:00
Gleb Smirnoff
0e4f18aa68 o In pf_normalize_ip() we don't need mtag in
!(PFRULE_FRAGCROP|PFRULE_FRAGDROP) case.
o In the (PFRULE_FRAGCROP|PFRULE_FRAGDROP) case we should allocate mtag
  if we don't find any.

Tested by:	Ian FREISLICH <ianf cloudseed.co.za>
2014-05-17 12:30:27 +00:00
Mikolaj Golub
8a477d48e0 Define startup order the same way as it is in dummynet. 2014-04-26 08:05:16 +00:00
Gleb Smirnoff
53f4b0cf9b The current API for adding rules with pool addresses is the following:
- DIOCADDADDR adds addresses and puts them into V_pf_pabuf
- DIOCADDRULE takes all addresses from V_pf_pabuf and links
  them into rule.

The ugly part is that if address is a table, then it is initialized
in DIOCADDRULE, because we need ruleset, and DIOCADDADDR doesn't
supply ruleset. But if address is a dynaddr, we need address family,
and address family could be different for different addresses in one
rule, so dynaddr is initialized in DIOCADDADDR.

This leads to the entangled state of addresses on V_pf_pabuf. Some are
initialized, and some not. That's why running pf_empty_pool(&V_pf_pabuf)
can lead to a panic on a NULL table address.

Since proper fix requires API/ABI change, for now simply plug the panic
in pf_empty_pool().

Reported by:	danger
2014-04-25 11:36:11 +00:00
Martin Matuska
ecb47cf9c5 Backport from projects/pf r263908:
De-virtualize UMA zone pf_mtag_z and move to global initialization part.

The m_tag struct does not know about vnet context and the pf_mtag_free()
callback is called unaware of current vnet. This causes a panic.

MFC after:	1 week
2014-04-20 09:17:48 +00:00
Andrey V. Elsukov
69155b132d Set oif only for outgoing packets.
PR:		188543
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-04-16 14:37:11 +00:00
Gleb Smirnoff
79bde95f51 Backout r257223,r257224,r257225,r257246,r257710. The changes caused
some regressions in ICMP handling, and right now me and Baptiste
are out of time on analyzing them.

PR:		188253
2014-04-16 09:25:20 +00:00
Christian Brueffer
7946c49fd8 Free resources and error cases; re-indent a curly brace while here.
CID:		1199366
Found with:	Coverity Prevent(tm)
MFC after:	1 week
2014-04-13 21:13:33 +00:00
Martin Matuska
42311ccc0b Merge from projects/pf r264198:
Execute pf_overload_task() in vnet context. Fixes a vnet kernel panic.

Reviewed by:	trociny
MFC after:	1 week
2014-04-07 07:06:13 +00:00
Martin Matuska
0a7c583acc Execute pf_overload_task() in vnet context. Fixes a vnet kernel panic.
Reviewed by:	trociny
2014-04-06 19:19:25 +00:00
Martin Matuska
7e92ce7380 De-virtualize UMA zone pf_mtag_z and move to global initialization part.
The m_tag struct does not know about vnet context and the pf_mtag_free()
callback is called unaware of current vnet. This causes a panic.

Reviewed by:	Nikos Vassiliadis, trociny@
2014-03-29 09:05:25 +00:00
Martin Matuska
1709ccf9d3 Merge head up to r263906. 2014-03-29 08:39:53 +00:00
Martin Matuska
d318d97fb5 Merge from projects/pf r251993 (glebius@):
De-vnet hash sizes and hash masks.

Submitted by:	Nikos Vassiliadis <nvass gmx.com>
Reviewed by:	trociny

MFC after:	1 month
2014-03-25 06:55:53 +00:00
Gleb Smirnoff
620ee5d386 Fix breakage in ipfw+VIMAGE after r261590.
PR:		kern/187665
Sponsored by:	Nginx, Inc.
2014-03-21 17:07:18 +00:00
Gleb Smirnoff
e3a7aa6f56 - Remove rt_metrics_lite and simply put its members into rtentry.
- Use counter(9) for rt_pksent (former rt_rmx.rmx_pksent). This
  removes another cache trashing ++ from packet forwarding path.
- Create zini/fini methods for the rtentry UMA zone. Via initialize
  mutex and counter in them.
- Fix reporting of rmx_pksent to routing socket.
- Fix netstat(1) to report "Use" both in kvm(3) and sysctl(3) mode.

The change is mostly targeted for stable/10 merge. For head,
rt_pksent is expected to just disappear.

Discussed with:		melifaro
Sponsored by:		Netflix
Sponsored by:		Nginx, Inc.
2014-03-05 01:17:47 +00:00
Gleb Smirnoff
fb3541ad15 Instead of playing games with casts simply add 3 more members to the
structure pf_rule, that are used when the structure is passed via
ioctl().

PR:		187074
2014-03-05 00:40:03 +00:00
Martin Matuska
5748b897da Merge head up to r262222 (last merge was incomplete). 2014-02-19 22:02:15 +00:00
Martin Matuska
dc64d6b7e1 Revert r262196
I am going to split this into two individual patches and test it with
the projects/pf branch that may get merged later.
2014-02-19 17:06:04 +00:00
Martin Matuska
a93b9a64fe De-virtualize pf_mtag_z [1]
Process V_pf_overloadqueue in vnet context [2]

This fixes two VIMAGE kernel panics and allows to simultaneously run host-pf
and vnet jails. pf inside jails remains broken.

PR:		kern/182964
Submitted by:	glebius@FreeBSD.org [2], myself [1]
Tested by:	rodrigc@FreeBSD.org, myself
MFC after:	2 weeks
2014-02-18 22:17:12 +00:00
George V. Neville-Neil
f850d57232 Summary: Two quick edits to the implementation notes as they're no
longer stored in netinet but in netpfil.
2014-02-15 18:36:31 +00:00
Dimitry Andric
a3043eeef2 Under sys/netpfil/ipfw, surround two IPv6-specific static functions with
#ifdef INET6, since they are unused when INET6 is disabled.

MFC after:	3 days
2014-02-15 12:25:01 +00:00
Gleb Smirnoff
48278b8846 Once pf became not covered by a single mutex, many counters in it became
race prone. Some just gather statistics, but some are later used in
different calculations.

A real problem was the race provoked underflow of the states_cur counter
on a rule. Once it goes below zero, it wraps to UINT32_MAX. Later this
value is used in pf_state_expires() and any state created by this rule
is immediately expired.

Thus, make fields states_cur, states_tot and src_nodes of struct
pf_rule be counter(9)s.

Thanks to Dennis for providing me shell access to problematic box and
his help with reproducing, debugging and investigating the problem.

Thanks to:		Dennis Yusupoff <dyr smartspb.net>
Also reported by:	dumbbell, pgj, Rambler
Sponsored by:		Nginx, Inc.
2014-02-14 10:05:21 +00:00
Alexander V. Chernikov
5fa3fdd3d9 Reorder struct ip_fw_chain:
* move rarely-used fields down
* move uh_lock to different cacheline
* remove some usused fields

Sponsored by:	Yandex LLC
2014-01-24 09:13:30 +00:00
Gleb Smirnoff
be3d21a2cf Remove NULL pointer dereference.
CID:	1009118
2014-01-22 15:58:43 +00:00
Gleb Smirnoff
d26bbeb948 Fix resource leak and simplify code for DIOCCHANGEADDR.
CID:	1007035
2014-01-22 15:44:38 +00:00
Alexander V. Chernikov
c254a2091c Revert r260548. We really should not use IPFW_WLOCK() here
but this requires some more playing with IPFW_UH_WLOCK(). Leave till later.
2014-01-11 18:27:34 +00:00
Alexander V. Chernikov
ac5863eed8 We don't need chain write lock since we're not modifying its contents.
LibAliasSetAddress() uses its own mutex to serialize changes.

While here, convert ifp->if_xname access to if_name() function.

MFC after:	2 weeks
Sponsored by:	Yandex LLC
2014-01-11 16:50:41 +00:00
Gleb Smirnoff
a830c4524d When pf_get_translation() fails, it should leave *sn pointer pristine,
otherwise we will panic in pf_test_rule().

PR:		182557
2014-01-06 19:05:04 +00:00
Alexander V. Chernikov
d28d2aa46d Use rnh_matchaddr instead of rnh_lookup for longest-prefix match.
rnh_lookup is effectively the same as rnh_matchaddr if called with
empy network mask.

MFC after:	2 weeks
2014-01-03 23:11:26 +00:00
Dimitry Andric
85838e48bd Fix incorrect header guard define in sys/netpfil/pf/pf.h, which snuck in
in r257186.  Found by clang 3.4.
2013-12-22 19:47:22 +00:00
Gleb Smirnoff
0b5d46ce4d Fix fallout from r258479: in pf_free_src_node() the node must already
be unlinked.

Reported by:	Konstantin Kukushkin <dark rambler-co.ru>
Sponsored by:	Nginx, Inc.
2013-12-22 12:10:36 +00:00
Alexander V. Chernikov
fb2b51fab1 Add net.inet.ip.fw.dyn_keep_states sysctl which
re-links dynamic states to default rule instead of
flushing on rule deletion.
This can be useful while performing ruleset reload
(think about `atomic` reload via changing sets).
Currently it is turned off by default.

MFC after:	2 weeks
Sponsored by:	Yandex LLC
2013-12-18 20:17:05 +00:00
Alexander V. Chernikov
a19b3f74af Simplify O_NAT opcode handling.
MFC after:	2 weeks
Sponsored by:	Yandex LLC
2013-11-28 15:28:51 +00:00
Alexander V. Chernikov
1058f17749 Check ipfw table numbers in both user and kernel space before rule addition.
Found by:	Saychik Pavel <umka@localka.net>
MFC after:	2 weeks
Sponsored by:	Yandex LLC
2013-11-28 10:28:28 +00:00
Craig Rodrigues
d04dc94cac In sys/netpfil/ipfw/ip_fw_nat.c:vnet_ipfw_nat_uninit() we call "IPFW_WLOCK(chain);".
This lock gets deleted in sys/netpfil/ipfw/ip_fw2.c:vnet_ipfw_uninit().

Therefore, vnet_ipfw_nat_uninit() *must* be called before vnet_ipfw_uninit(),
but this doesn't always happen, because the VNET_SYSINIT order is the same for both functions.
In sys/net/netpfil/ipfw/ip_fw2.c and sys/net/netpfil/ipfw/ip_fw_nat.c,
IPFW_SI_SUB_FIREWALL == IPFW_NAT_SI_SUB_FIREWALL == SI_SUB_PROTO_IFATTACHDOMAIN
and
IPFW_MODULE_ORDER == IPFW_NAT_MODULE_ORDER

Consequently, if VIMAGE is enabled, and jails are created and destroyed,
the system sometimes crashes, because we are trying to use a deleted lock.

To reproduce the problem:
  (1)  Take a GENERIC kernel config, and add options for: VIMAGE, WITNESS,
       INVARIANTS.
  (2)  Run this command in a loop:
       jail -l -u root -c path=/ name=foo persist vnet && jexec foo ifconfig lo0 127.0.0.1/8 && jail -r foo

       (see http://lists.freebsd.org/pipermail/freebsd-current/2010-November/021280.html )

Fix the problem by increasing the value of IPFW_NAT_SI_SUB_FIREWALL,
so that vnet_ipfw_nat_uninit() runs after vnet_ipfw_uninit().
2013-11-25 20:20:34 +00:00
Gleb Smirnoff
19acaecac3 The DIOCKILLSRCNODES operation was implemented with O(m*n) complexity,
where "m" is number of source nodes and "n" is number of states. Thus,
on heavy loaded router its processing consumed a lot of CPU time.

Reimplement it with O(m+n) complexity. We first scan through source
nodes and disconnect matching ones, putting them on the freelist and
marking with a cookie value in their expire field. Then we scan through
the states, detecting references to source nodes with a cookie, and
disconnect them as well. Then the freelist is passed to pf_free_src_nodes().

In collaboration with:	Kajetan Staszkiewicz <kajetan.staszkiewicz innogames.de>
PR:		kern/176763
Sponsored by:	InnoGames GmbH
Sponsored by:	Nginx, Inc.
2013-11-22 19:22:26 +00:00
Gleb Smirnoff
d77c1b3269 To support upcoming changes change internal API for source node handling:
- Removed pf_remove_src_node().
- Introduce pf_unlink_src_node() and pf_unlink_src_node_locked().
  These function do not proceed with freeing of a node, just disconnect
  it from storage.
- New function pf_free_src_nodes() works on a list of previously
  disconnected nodes and frees them.
- Utilize new API in pf_purge_expired_src_nodes().

In collaboration with:	Kajetan Staszkiewicz <kajetan.staszkiewicz innogames.de>

Sponsored by:	InnoGames GmbH
Sponsored by:	Nginx, Inc.
2013-11-22 19:16:34 +00:00
Gleb Smirnoff
1320f8c0d5 Fix off by ones when scanning source nodes hash.
Sponsored by:	Nginx, Inc.
2013-11-22 18:57:27 +00:00
Gleb Smirnoff
4280d14d2b Style: don't compare unsigned <= 0.
Sponsored by:	Nginx, Inc.
2013-11-22 18:54:06 +00:00
Luigi Rizzo
41d10f903d add a counter on the struct mq (a queue of mbufs),
and add a block for userspace compiling.
2013-11-22 05:02:37 +00:00
Luigi Rizzo
f783a35ced disable some ipfw match options when compiling in userspace 2013-11-22 05:01:38 +00:00
Luigi Rizzo
d0f65d47ec make this code compile in userspace on OSX 2013-11-22 05:00:18 +00:00
Luigi Rizzo
413c8aaa87 more support for userspace compiling of this code:
emulate the uma_zone for dynamic rules.
2013-11-22 04:59:17 +00:00
Luigi Rizzo
77024cbc3e make ipfw_check_packet() and ipfw_check_frame() public,
so they can be used in the userspace version of ipfw/dummynet
(normally using netmap for the I/O path).

This is the first of a few commits to ease compiling the
ipfw kernel code in userspace.
2013-11-22 04:57:50 +00:00
Gleb Smirnoff
654957c2c8 Merge head up to r258343. 2013-11-19 12:21:47 +00:00
Gleb Smirnoff
f053058cee - Split functions that initialize various pf parts into their vimage
parts and global parts.
- Since global parts appeared to be only mutex initializations, just
  abandon them and use MTX_SYSINIT() instead.
- Kill my incorrect VNET_FOREACH() iterator and instead use correct
  approach with VNET_SYSINIT().

Submitted by:	Nikos Vassiliadis <nvass gmx.com>
Reviewed by:	trociny
2013-11-18 22:18:07 +00:00
Gleb Smirnoff
e4e01d9cec Some fixups to pf_get_sport after r257223:
- Do not return blindly if proto isn't ICMP.
- The dport is in network order, so fix comparisons.
- Remove ridiculous htonl(arc4random()).
- Push local variable to a narrower block.
2013-11-14 14:20:35 +00:00
Gleb Smirnoff
50d3286d9d Merge head r232040 through r258006. 2013-11-11 20:33:25 +00:00
Gleb Smirnoff
6c71335c62 Fix fallout from r257223. Since pf_test_state_icmp() can call
pf_icmp_state_lookup() twice, we need to unlock previously found state.

Reported & tested by:	gavin
2013-11-05 16:54:25 +00:00
Gleb Smirnoff
b1b9dcae46 Remove net.link.ether.inet.useloopback sysctl tunable. It was always on by
default from the very beginning. It was placed in wrong namespace
net.link.ether, originally it had been at another wrong namespace. It was
incorrectly documented at incorrect manual page arp(8). Since new-ARP commit,
the tunable have been consulted only on route addition, and ignored on route
deletion. Behaviour of a system with tunable turned off is not fully correct,
and has no advantages comparing to normal behavior.
2013-11-05 07:32:09 +00:00
Gleb Smirnoff
e1b58d2cff Code logic of handling PFTM_PURGE into pf_find_state(). 2013-11-04 08:20:06 +00:00
Gleb Smirnoff
7710f9f14a Remove unused PFTM_UNTIL_PACKET const. 2013-11-04 08:15:59 +00:00
Gleb Smirnoff
f9b2a21c9e Merge head r232040 through r257457.
M    usr.sbin/portsnap/portsnap/portsnap.8
M    usr.sbin/portsnap/portsnap/portsnap.sh
M    usr.sbin/tcpdump/tcpdump/Makefile
2013-10-31 17:33:29 +00:00
Gleb Smirnoff
1ce5620d32 - Fix VIMAGE build.
- Fix build with gcc.
2013-10-28 10:12:19 +00:00
Gleb Smirnoff
c3322cb91c Include necessary headers that now are available due to pollution
via if_var.h.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-28 07:29:16 +00:00
Baptiste Daroussin
0664b03c16 Import pf.c 1.638 from OpenBSD
Original log:
Some ICMP types that also have icmp_id, pointed out by markus@

Obtained from:	OpenBSD
2013-10-27 20:56:23 +00:00
Baptiste Daroussin
5fff3f1010 Improt pf.c 1.636 from OpenBSD
Original log:
Make sure pd2 has a pointer to the icmp header in the payload; fixes
panic seen with some some icmp types in icmp error message payloads.

Obtained from:	OpenBSD
2013-10-27 20:52:09 +00:00
Baptiste Daroussin
44df0d9356 Import pf.c 1.635 and pf_lb.c 1.4 from OpenBSD
Stricter state checking for ICMP and ICMPv6 packets: include the ICMP type

in one port of the state key, using the type to determine which
side should be the id, and which should be the type. Also:
- Handle ICMP6 messages which are typically sent to multicast
  addresses but recieve unicast replies, by doing fallthrough lookups
  against the correct multicast address.  - Clear up some mistaken
  assumptions in the PF code:
- Not all ICMP packets have an icmp_id, so simulate
  one based on other data if we can, otherwise set it to 0.
  - Don't modify the icmp id field in NAT unless it's echo
  - Use the full range of possible id's when NATing icmp6 echoy

Difference with OpenBSD version:
- C99ify the new code
- WITHOUT_INET6 safe

Reviewed by:	glebius
Obtained from:	OpenBSD
2013-10-27 20:44:42 +00:00
Gleb Smirnoff
75bf2db380 Move new pf includes to the pf directory. The pfvar.h remain
in net, to avoid compatibility breakage for no sake.

The future plan is to split most of non-kernel parts of
pfvar.h into pf.h, and then make pfvar.h a kernel only
include breaking compatibility.

Discussed with:		bz
2013-10-27 16:25:57 +00:00
Gleb Smirnoff
eedc7fd9e8 Provide includes that are needed in these files, and before were read
in implicitly via if.h -> if_var.h pollution.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-26 18:18:50 +00:00
Gleb Smirnoff
76039bc84f The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-26 17:58:36 +00:00
Gleb Smirnoff
0bfd163f52 Merge head r233826 through r256722. 2013-10-18 09:32:02 +00:00
Philip Paeps
b49bf73f75 Use the correct EtherType for logging IPv6 packets.
Reviewed by:	melifaro
Approved by:	re (kib, glebius)
MFC after:	3 days
2013-09-28 15:49:36 +00:00
Gleb Smirnoff
8fc6e19c2c Merge 1.12 of pf_lb.c from OpenBSD, with some changes. Original commit:
date: 2010/02/04 14:10:12;  author: sthen;  state: Exp;  lines: +24 -19;
  pf_get_sport() picks a random port from the port range specified in a
  nat rule. It should check to see if it's in-use (i.e. matches an existing
  PF state), if it is, it cycles sequentially through other ports until
  it finds a free one. However the check was being done with the state
  keys the wrong way round so it was never actually finding the state
  to be in-use.

  - switch the keys to correct this, avoiding random state collisions
  with nat. Fixes PR 6300 and problems reported by robert@ and viq.

  - check pf_get_sport() return code in pf_test(); if port allocation
  fails the packet should be dropped rather than sent out untranslated.

  Help/ok claudio@.

Some additional changes to 1.12:

- We also need to bzero() the key to zero padding, otherwise key
  won't match.
- Collapse two if blocks into one with ||, since both conditions
  lead to the same processing.
- Only naddr changes in the cycle, so move initialization of other
  fields above the cycle.
- s/u_intXX_t/uintXX_t/g

PR:		kern/181690
Submitted by:	Olivier Cochard-Labbé <olivier cochard.me>
Sponsored by:	Nginx, Inc.
2013-09-02 10:14:25 +00:00
Alexander Motin
5f4fc3dbcb Make dummynet use new direct callout(9) execution mechanism. Since the only
thing done by the dummynet handler is taskqueue_enqueue() call, it doesn't
need extra switch to the clock SWI context.

On idle system this change in half reduces number of active CPU cycles and
wakes up only one CPU from sleep instead of two.

I was going to make this change much earlier as part of calloutng project,
but waited for better solution with skipping idle ticks to be implemented.
Unfortunately with 10.0 release coming it is better get at least this.
2013-08-24 13:34:36 +00:00
Mikolaj Golub
8856400bcb Make ipfw nat init/unint work correctly for VIMAGE:
* Do per vnet instance cleanup (previously it was only for vnet0 on
  module unload, and led to libalias leaks and possible panics due to
  stale pointer dereferences).

* Instead of protecting ipfw hooks registering/deregistering by only
  vnet0 lock (which does not prevent pointers access from another
  vnets), introduce per vnet ipfw_nat_loaded variable. The variable is
  set after hooks are registered and unset before they are deregistered.

* Devirtualize ifaddr_event_tag as we run only one event handler for
  all vnets.

* It is supposed that ifaddr_change event handler is called in the
  interface vnet context, so add an assertion.

Reviewed by:	zec
MFC after:	2 weeks
2013-08-24 11:59:51 +00:00
Andre Oppermann
86bd049144 Add m_clrprotoflags() to clear protocol specific mbuf flags at up and
downwards layer crossings.

Consistently use it within IP, IPv6 and ethernet protocols.

Discussed with:	trociny, glebius
2013-08-19 13:27:32 +00:00
Andrey V. Elsukov
415077bad9 Fix a possible NULL-pointer dereference on the pfsync(4) reconfiguration.
Reported by:	Eugene M. Zheganin
2013-07-29 13:17:18 +00:00
Gleb Smirnoff
6828cc99e1 De-vnet hash sizes and hash masks.
Submitted by:	Nikos Vassiliadis <nvass gmx.com>
Reviewed by:	trociny
2013-06-19 13:37:29 +00:00
Gleb Smirnoff
93ecffe50b Improve locking strategy between keys hash and ID hash.
Before this change state creating sequence was:

1) lock wire key hash
2) link state's wire key
3) unlock wire key hash
4) lock stack key hash
5) link state's stack key
6) unlock stack key hash
7) lock ID hash
8) link into ID hash
9) unlock ID hash

What could happen here is that other thread finds the state via key
hash lookup after 6), locks ID hash and does some processing of the
state. When the thread creating state unblocks, it finds the state
it was inserting already non-virgin.

Now we perform proper interlocking between key hash locks and ID hash
lock:

1) lock wire & stack hashes
2) link state's keys
3) lock ID hash
4) unlock wire & stack hashes
5) link into ID hash
6) unlock ID hash

To achieve that, the following hacking was performed in pf_state_key_attach():

- Key hash mutex is marked with MTX_DUPOK.
- To avoid deadlock on 2 key hash mutexes, we lock them in order determined
  by their address value.
- pf_state_key_attach() had a magic to reuse a > FIN_WAIT_2 state. It unlinked
  the conflicting state synchronously. In theory this could require locking
  a third key hash, which we can't do now.
  Now we do not remove the state immediately, instead we leave this task to
  the purge thread. To avoid conflicts in a short period before state is
  purged, we push to the very end of the TAILQ.
- On success, before dropping key hash locks, pf_state_key_attach() locks
  ID hash and returns.

Tested by:	Ian FREISLICH <ianf clue.co.za>
2013-06-13 06:07:19 +00:00
Gleb Smirnoff
5af77b3ebd Return meaningful error code from pf_state_key_attach() and
pf_state_insert().
2013-05-11 18:06:51 +00:00
Gleb Smirnoff
03911dec5b Better debug message. 2013-05-11 18:03:36 +00:00
Gleb Smirnoff
048c95417d Fix DIOCADDSTATE operation. 2013-05-11 17:58:26 +00:00
Gleb Smirnoff
b69d74e834 Invalid creatorid is always EINVAL, not only when we are in verbose mode. 2013-05-11 17:57:52 +00:00
Gleb Smirnoff
f8aa444783 Improve KASSERT() message. 2013-05-06 21:44:06 +00:00
Gleb Smirnoff
7a954bbbce Simplify printf(). 2013-05-06 21:43:15 +00:00
Alexander V. Chernikov
454189c130 Use unified method for accessing / updating cached rule pointers.
MFC after:	2 weeks
2013-05-04 18:24:30 +00:00
Eitan Adler
578acad37e Correct a few sizeof()s
Submitted by:	swildner@DragonFlyBSD.org
Reviewed by:	alfred
2013-05-01 04:37:34 +00:00
Gleb Smirnoff
a642ae6825 Remove useless ifdef KLD_MODULE from dummynet module unload path. This
fixes panic on unload.

Reported by:	pho
2013-04-29 06:11:19 +00:00
Gleb Smirnoff
47e8d432d5 Add const qualifier to the dst parameter of the ifnet if_output method. 2013-04-26 12:50:32 +00:00
Alexander V. Chernikov
4037b82802 Fix ipfw rule validation partially broken by r248552.
Pointed by:	avg
MFC with:	r248552
2013-04-01 11:28:52 +00:00
Andrey V. Elsukov
5b4661289d When we are removing a specific set, call ipfw_expire_dyn_rules only once.
Obtained from:	Yandex LLC
MFC after:	1 week
2013-03-25 07:43:46 +00:00
Alexander V. Chernikov
ae01d73c04 Add ipfw support for setting/matching DiffServ codepoints (DSCP).
Setting DSCP support is done via O_SETDSCP which works for both
IPv4 and IPv6 packets. Fast checksum recalculation (RFC 1624) is done for IPv4.
Dscp can be specified by name (AFXY, CSX, BE, EF), by value
(0..63) or via tablearg.

Matching DSCP is done via another opcode (O_DSCP) which accepts several
classes at once (af11,af22,be). Classes are stored in bitmask (2 u32 words).

Many people made their variants of this patch, the ones I'm aware of are
(in alphabetic order):

Dmitrii Tejblum
Marcelo Araujo
Roman Bogorodskiy (novel)
Sergey Matveichuk (sem)
Sergey Ryabin

PR:		kern/102471, kern/121122
MFC after:	2 weeks
2013-03-20 10:35:33 +00:00
Andrey V. Elsukov
93bb4f9ed5 Separate the locking macros that are used in the packet flow path
from others. This helps easy switch to use pfil(4) lock.
2013-03-19 06:04:17 +00:00
Gleb Smirnoff
dc4ad05ecd Use m_get/m_gethdr instead of compat macros.
Sponsored by:	Nginx, Inc.
2013-03-15 12:55:30 +00:00
Gleb Smirnoff
41a7572b26 Functions m_getm2() and m_get2() have different order of arguments,
and that can drive someone crazy. While m_get2() is young and not
documented yet, change its order of arguments to match m_getm2().

Sorry for churn, but better now than later.
2013-03-12 13:42:47 +00:00
Alexander V. Chernikov
39bddcde96 Fix callout expiring dynamic rules.
PR:		kern/175530
Submitted by:	Vladimir Spiridenkov <vs@gtn.ru>
MFC after:	2 weeks
2013-03-02 14:47:10 +00:00
Gleb Smirnoff
e2a55a0021 Finish the r244185. This fixes ever growing counter of pfsync bad
length packets, which was actually harmless.

Note that peers with different version of head/ may grow this
counter, but it is harmless - all pfsync data is processed.

Reported & tested by:	Anton Yuzhaninov <citrin citrin.ru>
Sponsored by:		Nginx, Inc
2013-02-15 09:03:56 +00:00
Gleb Smirnoff
d8aa10cc35 In netpfil/pf:
- Add my copyright to files I've touched a lot this year.
  - Add dash in front of all copyright notices according to style(9).
  - Move $OpenBSD$ down below copyright notices.
  - Remove extra line between cdefs.h and __FBSDID.
2012-12-28 09:19:49 +00:00
Alexander V. Chernikov
3abd4586a4 Add parentheses to IP_FW_ARG_TABLEARG() definition.
Suggested by:	glebius
MFC with:	r244633
2012-12-23 18:35:42 +00:00
Alexander V. Chernikov
f37de965cc Use unified IP_FW_ARG_TABLEARG() macro for most tablearg checks.
Log real value instead of IP_FW_TABLEARG (65535) in ipfw_log().

Noticed by:	Vitaliy Tokarenko <rphone@ukr.net>
MFC after:	2 weeks
2012-12-23 16:28:18 +00:00
Pawel Jakub Dawidek
f5002be657 Warn about reaching various PF limits.
Reviewed by:	glebius
Obtained from:	WHEEL Systems
2012-12-17 10:10:13 +00:00
Mikolaj Golub
bf1e95a21c In pfioctl, if the permission checks failed we returned with vnet context
set.

As the checks don't require vnet context, this is fixed by setting
vnet after the checks.

PR:		kern/160541
Submitted by:	Nikos Vassiliadis (slightly different approach)
2012-12-15 17:19:36 +00:00
Gleb Smirnoff
f094f811fb Fix error in r235991. No-sleep version of IFNET_RLOCK() should
be used here, since we may hold the main pf rulesets rwlock.

Reported by:	Fleuriot Damien <ml my.gd>
2012-12-14 13:01:16 +00:00
Gleb Smirnoff
4c794f5c06 Fix VIMAGE build broken in r244185.
Submitted by:	Nikolai Lifanov <lifanov mail.lifanov.com>
2012-12-14 08:02:35 +00:00
Gleb Smirnoff
9ff7e6e922 Merge rev. 1.119 from OpenBSD:
date: 2009/03/31 01:21:29;  author: dlg;  state: Exp;  lines: +9 -16
  ...

  this also firms up some of the input parsing so it handles short frames a
  bit better.

This actually fixes reading beyond mbuf data area in pfsync_input(), that
may happen at certain pfsync datagrams.
2012-12-13 12:51:22 +00:00
Gleb Smirnoff
feaa4dd2d0 Initialize state id prior to attaching state to key hash. Otherwise a
race can happen, when pf_find_state() finds state via key hash, and locks
id hash slot 0 instead of appropriate to state id slot.
2012-12-13 12:48:57 +00:00
Gleb Smirnoff
fed7635002 Merge 1.127 from OpenBSD, that closes a regression from 1.125 (merged
as r242694):
  do better detection of when we have a better version of the tcp sequence
  windows than our peer.

  this resolves the last of the pfsync traffic storm issues ive been able to
  produce, and therefore makes it possible to do usable active-active
  statuful firewalls with pf.
2012-12-11 08:37:08 +00:00
Gleb Smirnoff
59cc9fde4f Rule memory garbage collecting in new pf scans only states that are on
id hash. If a state has been disconnected from id hash, its rule pointers
can no longer be dereferenced, and referenced memory can't be modified.
Thus, move rule statistics from pf_free_rule() to pf_unlink_rule() and
update them prior to releasing id hash slot lock.

Reported by:	Ian FREISLICH <ianf cloudseed.co.za>
2012-12-06 08:38:14 +00:00
Gleb Smirnoff
38cc0bfa26 Close possible races between state deletion and sent being sent out
from pfsync:
- Call into pfsync_delete_state() holding the state lock.
- Set the state timeout to PFTM_UNLINKED after state has been moved
  to the PFSYNC_S_DEL queue in pfsync.

Reported by:	Ian FREISLICH <ianf cloudseed.co.za>
2012-12-06 08:32:28 +00:00
Gleb Smirnoff
8db7e13f1d Remove extra PFSYNC_LOCK() in pfsync_bulk_update() which lead to lock
recursion.

Reported by:	Ian FREISLICH <ianf cloudseed.co.za>
2012-12-06 08:22:08 +00:00
Gleb Smirnoff
5da39c565b Revert erroneous r242693. A state may have PFTM_UNLINKED being on the
PFSYNC_S_DEL queue of pfsync.
2012-12-06 08:15:06 +00:00
Gleb Smirnoff
eb1b1807af Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually
2012-12-05 08:04:20 +00:00
Alexander V. Chernikov
c187c1fbf8 Use common macros for working with rule/dynamic counters.
This is done as preparation to introduce per-cpu ipfw counters.

MFC after:	3 weeks
2012-11-30 19:36:55 +00:00
Alexander V. Chernikov
2e089d5c04 Make ipfw dynamic states operations SMP-ready.
* Global IPFW_DYN_LOCK() is changed to per-bucket mutex.
* State expiration is done in ipfw_tick every second.
* No expiration is done on forwarding path.
* hash table resize is done automatically and does not flush all states.
* Dynamic UMA zone is now allocated per each VNET
* State limiting is now done via UMA(9) api.

Discussed with:	ipfw
MFC after:	3 weeks
Sponsored by:	Yandex LLC
2012-11-30 16:33:22 +00:00
Alexander V. Chernikov
73332e7c82 Simplify sending keepalives.
Prepare ipfw_tick() to be used by other consumers.

Reviewed by:	ae(basically)
MFC after:	2 weeks
2012-11-09 18:23:38 +00:00
Gleb Smirnoff
f18ab0ffa3 Merge rev. 1.125 from OpenBSD:
date: 2009/06/12 02:03:51;  author: dlg;  state: Exp;  lines: +59 -69
  rewrite the way states from pfsync are merged into the local state tree
  and the conditions on which pfsync will notify its peers on a stale update.

  each side (ie, the sending and receiving side) of the state update is
  compared separately. any side that is further along than the local state
  tree is merged. if any side is further along in the local state table, an
  update is sent out telling the peers about it.
2012-11-07 07:35:05 +00:00
Gleb Smirnoff
d75efebeab It may happen that pfsync holds the last reference on a state. In this
case keys had already been freed. If encountering such state, then
just release last reference.

Not sure this can happen as a runtime race, but can be reproduced by
the following scenario:

- enable pfsync
- disable pfsync
- wait some time
- enable pfsync
2012-11-07 07:30:40 +00:00
Alexander V. Chernikov
5d0cd92651 Add assertion to enforce 'nat global' locking requierements changed by r241908.
Suggested by:	adrian, glebius
MFC after:	3 days
2012-11-05 22:54:00 +00:00
Alexander V. Chernikov
a730ff05c1 Use unified print_dyn_rule_flags() function for debugging messages
instead of hand-made printfs in every place.

MFC after:	1 week
2012-11-05 22:30:56 +00:00