Commit Graph

168 Commits

Author SHA1 Message Date
Alexander V. Chernikov
28ea4fa355 Remove IP_FW_TABLES_XGETSIZE opcode.
It is superseded by IP_FW_TABLES_XLIST.
2014-08-08 06:36:26 +00:00
Alexander V. Chernikov
a73d728d31 Kernel changes:
* Implement proper checks for switching between global and set-aware tables
* Split IP_FW_DEL mess into the following opcodes:
  * IP_FW_XDEL (del rules matching pattern)
  * IP_FW_XMOVE (move rules matching pattern to another set)
  * IP_FW_SET_SWAP (swap between 2 sets)
  * IP_FW_SET_MOVE (move one set to another one)
  * IP_FW_SET_ENABLE (enable/disable sets)
* Add IP_FW_XZERO / IP_FW_XRESETLOG to finish IP_FW3 migration.
* Use unified ipfw_range_tlv as range description for all of the above.
* Check dynamic states IFF there was non-zero number of deleted dyn rules,
* Del relevant dynamic states with singe traversal instead of per-rule one.

Userland changes:
* Switch ipfw(8) to use new opcodes.
2014-08-07 21:37:31 +00:00
Alexander V. Chernikov
46d5200874 Implement atomic ipfw table swap.
Kernel changes:
* Add opcode IP_FW_TABLE_XSWAP
* Add support for swapping 2 tables with the same type/ftype/vtype.
* Make skipto cache init after ipfw locks init.

Userland changes:
* Add "table X swap Y" command.
2014-08-03 21:37:12 +00:00
Alexander V. Chernikov
5f379342d2 Show algorithm-specific data in "table info" output. 2014-08-03 12:19:45 +00:00
Alexander V. Chernikov
4c0c07a552 * Permit limiting number of items in table.
Kernel changes:
* Add TEI_FLAGS_DONTADD entry flag to indicate that insert is not possible
* Support given flag in all algorithms
* Add "limit" field to ipfw_xtable_info
* Add actual limiting code into add_table_entry()

Userland changes:
* Add "limit" option as "create" table sub-option. Limit modification
  is currently impossible.
* Print human-readable errors in table enry addition/deletion code.
2014-08-01 15:17:46 +00:00
Alexander V. Chernikov
914bffb6ab * Add new "flow" table type to support N=1..5-tuple lookups
* Add "flow:hash" algorithm

Kernel changes:
* Add O_IP_FLOW_LOOKUP opcode to support "flow" lookups
* Add IPFW_TABLE_FLOW table type
* Add "struct tflow_entry" as strage for 6-tuple flows
* Add "flow:hash" algorithm. Basically it is auto-growing chained hash table.
  Additionally, we store mask of fields we need to compare in each instance/

* Increase ipfw_obj_tentry size by adding struct tflow_entry
* Add per-algorithm stat (ifpw_ta_tinfo) to ipfw_xtable_info
* Increase algoname length: 32 -> 64 (algo options passed there as string)
* Assume every table type can be customized by flags, use u8 to store "tflags" field.
* Simplify ipfw_find_table_entry() by providing @tentry directly to algo callback.
* Fix bug in cidr:chash resize procedure.

Userland changes:
* add "flow table(NAME)" syntax to support n-tuple checking tables.
* make fill_flags() separate function to ease working with _s_x arrays
* change "table info" output to reflect longer "type" fields

Syntax:
ipfw table fl2 create type flow:[src-ip][,proto][,src-port][,dst-ip][dst-port] [algo flow:hash]

Examples:

0:02 [2] zfscurr0# ipfw table fl2 create type flow:src-ip,proto,dst-port algo flow:hash
0:02 [2] zfscurr0# ipfw table fl2 info
+++ table(fl2), set(0) +++
 kindex: 0, type: flow:src-ip,proto,dst-port
 valtype: number, references: 0
 algorithm: flow:hash
 items: 0, size: 280
0:02 [2] zfscurr0# ipfw table fl2 add 2a02:6b8::333,tcp,443 45000
0:02 [2] zfscurr0# ipfw table fl2 add 10.0.0.92,tcp,80 22000
0:02 [2] zfscurr0# ipfw table fl2 list
+++ table(fl2), set(0) +++
2a02:6b8::333,6,443 45000
10.0.0.92,6,80 22000
0:02 [2] zfscurr0# ipfw add 200 count tcp from me to 78.46.89.105 80 flow 'table(fl2)'
00200 count tcp from me to 78.46.89.105 dst-port 80 flow table(fl2)
0:03 [2] zfscurr0# ipfw show
00200   0     0 count tcp from me to 78.46.89.105 dst-port 80 flow table(fl2)
65535 617 59416 allow ip from any to any
0:03 [2] zfscurr0# telnet -s 10.0.0.92 78.46.89.105 80
Trying 78.46.89.105...
..
0:04 [2] zfscurr0# ipfw show
00200   5   272 count tcp from me to 78.46.89.105 dst-port 80 flow table(fl2)
65535 682 66733 allow ip from any to any
2014-07-31 20:08:19 +00:00
Alexander V. Chernikov
b23d5de9b6 * Add number:array algorithm lookup method.
Kernel changes:
* s/IPFW_TABLE_U32/IPFW_TABLE_NUMBER/
* Force "lookup <port|uid|gid|jid>" to be IPFW_TABLE_NUMBER
* Support "lookup" method for number tables
* Add number:array algorihm (i32 as key, auto-growing).

Userland changes:
* Support named tables in "lookup <tag> Table"
* Fix handling of "table(NAME,val)" case
* Support printing "number" table data.
2014-07-30 14:52:26 +00:00
Alexander V. Chernikov
9d099b4f38 * Dump available table algorithms via "ipfw talist" cmd.
Kernel changes:
* Add type/refcount fields to table algo instances.
* Add IP_FW_TABLES_ALIST opcode to export available algorihms to userland.

Userland changes:
* Fix cores on empty input inside "ipfw table" handler.
* Add "ipfw talist" cmd to print availabled kernel algorithms.
* Change "table info" output to reflect long algorithm config lines.
2014-07-29 22:44:26 +00:00
Alexander V. Chernikov
68394ec88e * Add generic ipfw interface tracking API
* Rewrite interface tables to use interface indexes

Kernel changes:
* Add generic interface tracking API:
 - ipfw_iface_ref (must call unlocked, performs lazy init if needed, allocates
  state & bumps ref)
 - ipfw_iface_add_ntfy(UH_WLOCK+WLOCK, links comsumer & runs its callback to
  update ifindex)
 - ipfw_iface_del_ntfy(UH_WLOCK+WLOCK, unlinks consumer)
 - ipfw_iface_unref(unlocked, drops reference)
Additionally, consumer callbacks are called in interface withdrawal/departure.

* Rewrite interface tables to use iface tracking API. Currently tables are
  implemented the following way:
  runtime data is stored as sorted array of {ifidx, val} for existing interfaces
  full data is stored inside namedobj instance (chained hashed table).

* Add IP_FW_XIFLIST opcode to dump status of tracked interfaces

* Pass @chain ptr to most non-locked algorithm callbacks:
  (prepare_add, prepare_del, flush_entry ..). This may be needed for better
  interaction of given algorithm an other ipfw subsystems

* Add optional "change_ti" algorithm handler to permit updating of
  cached table_info pointer (happens in case of table_max resize)

* Fix small bug in ipfw_list_tables()
* Add badd (insert into sorted array) and bdel (remove from sorted array) funcs

Userland changes:
* Add "iflist" cmd to print status of currently tracked interface
* Add stringnum_cmp for better interface/table names sorting
2014-07-28 19:01:25 +00:00
Alexander V. Chernikov
7e767c791f * Use different rule structures in kernel/userland.
* Switch kernel to use per-cpu counters for rules.
* Keep ABI/API.

Kernel changes:
* Each rules is now exported as TLV with optional extenable
  counter block (ip_fW_bcounter for base one) and
  ip_fw_rule for rule&cmd data.
* Counters needs to be explicitly requested by IPFW_CFG_GET_COUNTERS flag.
* Separate counters from rules in kernel and clean up ip_fw a bit.
* Pack each rule in IPFW_TLV_RULE_ENT tlv to ease parsing.
* Introduce versioning in container TLV (may be needed in future).
* Fix ipfw_cfg_lheader broken u64 alignment.

Userland changes:
* Use set_mask from cfg header when requesting config
* Fix incorrect read accouting in ipfw_show_config()
* Use IPFW_RULE_NOOPT flag instead of playing with _pad
* Fix "ipfw -d list": do not print counters for dynamic states
* Some small fixes
2014-07-08 23:11:15 +00:00
Alexander V. Chernikov
6447bae661 * Prepare to pass other dynamic states via ipfw_dump_config()
Kernel changes:
* Change dump format for dynamic states:
  each state is now stored inside ipfw_obj_dyntlv
  last dynamic state is indicated by IPFW_DF_LAST flag
* Do not perform sooptcopyout() for !SOPT_GET requests.

Userland changes:
* Introduce foreach_state() function handler to ease work
  with different states passed by ipfw_dump_config().
2014-07-06 23:26:34 +00:00
Alexander V. Chernikov
81d3153d61 * Add "lookup" table functionality to permit userland entry lookups.
* Bump table dump format preserving old ABI.

Kernel size:
* Add IP_FW_TABLE_XFIND to handle "lookup" request from userland.
* Add ta_find_tentry() algorithm callbacks/handlers to support lookups.
* Fully switch to ipfw_obj_tentry for various table dumps:
  algorithms are now required to support the latest (ipfw_obj_tentry) entry
    dump format, the rest is handled by generic dump code.
  IP_FW_TABLE_XLIST opcode version bumped (0 -> 1).
* Eliminate legacy ta_dump_entry algo handler:
  dump_table_entry() converts data from current to legacy format.

Userland side:
* Add "lookup" table parameter.
* Change the way table type is guessed: call table_get_info() first,
  and check value for IPv4/IPv6 type IFF table does not exist.
* Fix table_get_list(): do more tries if supplied buffer is not enough.
* Sparate table_show_entry() from table_show_list().
2014-07-06 18:16:04 +00:00
Alexander V. Chernikov
ac35ff1784 Fully switch to named tables:
Kernel changes:
* Introduce ipfw_obj_tentry table entry structure to force u64 alignment.
* Support "update-on-existing-key" "add" bahavior (TEI_FLAGS_UPDATED).
* Use "subtype" field to distingush between IPv4 and IPv6 table records
  instead of previous hack.
* Add value type (vtype) field for kernel tables. Current types are
  number,ip and dscp
* Fix sets mask retrieval for old binaries
* Fix crash while using interface tables

Userland changes:
* Switch ipfw_table_handler() to use named-only tables.
* Add "table NAME create [type {cidr|iface|u32} [valtype {number|ip|dscp}] ..."
* Switch ipfw_table_handler to match_token()-based parser.
* Switch ipfw_sets_handler to use new ipfw_get_config() for mask  retrieval.
* Allow ipfw set X table ... syntax to permit using per-set table namespaces.
2014-07-03 22:25:59 +00:00
Alexander V. Chernikov
6c2997ffec * Add new IP_FW_XADD opcode which permits to
a) specify table ids as names
  b) add multiple rules at once.
Partially convert current code for atomic addition of multiple rules.
2014-06-29 22:35:47 +00:00
Alexander V. Chernikov
563b5ab132 Suppord showing named tables in ipfw(8) rule listing.
Kernel changes:
* change base TLV header to be u64 (so size can be u32).
* Introduce ipfw_obj_ctlv generc container TLV.
* Add IP_FW_XGET opcode which is now used for atomic configuration
  retrieval. One can specify needed configuration pieces to retrieve
  via flags field. Currently supported are
  IPFW_CFG_GET_STATIC (static rules) and
  IPFW_CFG_GET_STATES (dynamic states).
  Other configuration pieces (tables, pipes, etc..) support is planned.

Userland changes:
* Switch ipfw(8) to use new IP_FW_XGET for rule listing.
* Split rule listing code get and show pieces.
* Make several steps forward towards libipfw:
  permit printing states and rules(paritally) to supplied buffer.
  do not die on malloc/kernel failure inside given printing functions.
  stop assuming cmdline_opts is global symbol.
2014-06-28 23:20:24 +00:00
Alexander V. Chernikov
9490a62716 * Add IP_FW_TABLE_XCREATE / IP_FW_TABLE_XMODIFY opcodes.
* Add 'algoname' string to ipfw_xtable_info permitting to specify lookup
algoritm with parameters.
* Rework part of ipfw_rewrite_table_uidx()

Sponsored by:	Yandex LLC
2014-06-16 13:05:07 +00:00
Alexander V. Chernikov
d3a4f9249c Simplify opcode handling.
* Use one u16 from op3 header to implement opcode versioning.
* IP_FW_TABLE_XLIST has now 2 handlers, for ver.0 (old) and ver.1 (current).
* Every getsockopt request is now handled in ip_fw_table.c
* Rename new opcodes:
IP_FW_OBJ_DEL -> IP_FW_TABLE_XDESTROY
IP_FW_OBJ_LISTSIZE -> IP_FW_TABLES_XGETSIZE
IP_FW_OBJ_LIST -> IP_FW_TABLES_XLIST
IP_FW_OBJ_INFO -> IP_FW_TABLE_XINFO
IP_FW_OBJ_INFO -> IP_FW_TABLE_XFLUSH

* Add some docs about using given opcodes.
* Group some legacy opcode/handlers.
2014-06-15 13:40:27 +00:00
Alexander V. Chernikov
f1220db8d7 Move further to eliminate next pieces of number-assuming code inside tables.
Kernel changes:
* Add IP_FW_OBJ_FLUSH opcode (flush table based on its name/set)
* Add IP_FW_OBJ_DUMP opcode (dumps table data based on its names/set)
* Add IP_FW_OBJ_LISTSIZE / IP_FW_OBJ_LIST opcodes (get list of kernel tables)

Userland changes:
* move tables code to separate tables.c file
* get rid of tables_max
* switch "all"/list handling to new opcodes
2014-06-14 22:47:25 +00:00
Alexander V. Chernikov
9f7d47b025 Add API to ease adding new algorithms/new tabletypes to ipfw.
Kernel-side changelog:
* Split general tables code and algorithm-specific table data.
  Current algorithms (IPv4/IPv6 radix and interface tables radix) moved to
  new ip_fw_table_algo.c file.
  Tables code now supports any algorithm implementing the following callbacks:
+struct table_algo {
+       char            name[64];
+       int             idx;
+       ta_init         *init;
+       ta_destroy      *destroy;
+       table_lookup_t  *lookup;
+       ta_prepare_add  *prepare_add;
+       ta_prepare_del  *prepare_del;
+       ta_add          *add;
+       ta_del          *del;
+       ta_flush_entry  *flush_entry;
+       ta_foreach      *foreach;
+       ta_dump_entry   *dump_entry;
+       ta_dump_xentry  *dump_xentry;
+};

* Change ->state, ->xstate, ->tabletype fields of ip_fw_chain to
   ->tablestate pointer (array of 32 bytes structures necessary for
   runtime lookups (can be probably shrinked to 16 bytes later):

   +struct table_info {
   +       table_lookup_t  *lookup;        /* Lookup function */
   +       void            *state;         /* Lookup radix/other structure */
   +       void            *xstate;        /* eXtended state */
   +       u_long          data;           /* Hints for given func */
   +};

* Add count method for namedobj instance to ease size calculations
* Bump ip_fw3 buffer in ipfw_clt 128->256 bytes.
* Improve bitmask resizing on tables_max change.
* Remove table numbers checking from most places.
* Fix wrong nesting in ipfw_rewrite_table_uidx().

* Add IP_FW_OBJ_LIST opcode (list all objects of given type, currently
    implemented for IPFW_OBJTYPE_TABLE).
* Add IP_FW_OBJ_LISTSIZE (get buffer size to hold IP_FW_OBJ_LIST data,
    currenly implemented for IPFW_OBJTYPE_TABLE).
* Add IP_FW_OBJ_INFO (requests info for one object of given type).

Some name changes:
s/ipfw_xtable_tlv/ipfw_obj_tlv/ (no table specifics)
s/ipfw_xtable_ntlv/ipfw_obj_ntlv/ (no table specifics)

Userland changes:
* Add do_set3() cmd to ipfw2 to ease dealing with op3-embeded opcodes.
* Add/improve support for destroy/info cmds.
2014-06-14 10:58:39 +00:00
Alexander V. Chernikov
b074b7bbce Make ipfw tables use names as used-level identifier internally:
* Add namedobject set-aware api capable of searching/allocation objects by their name/idx.
* Switch tables code to use string ids for configuration tasks.
* Change locking model: most configuration changes are protected with UH lock, runtime-visible are protected with both locks.
* Reduce number of arguments passed to ipfw_table_add/del by using separate structure.
* Add internal V_fw_tables_sets tunable (set to 0) to prepare for set-aware tables (requires opcodes/client support)
* Implement typed table referencing (and tables are implicitly allocated with all state like radix ptrs on reference)
* Add "destroy" ipfw(8) using new IP_FW_DELOBJ opcode

Namedobj more detailed:
* Blackbox api providing methods to add/del/search/enumerate objects
* Statically-sized hashes for names/indexes
* Per-set bitmask to indicate free indexes
* Separate methods for index alloc/delete/resize

Basically, there should not be any user-visible changes except the following:
* reducing table_max is not supported
* flush & add change table type won't work if table is referenced

Sponsored by:	Yandex LLC
2014-06-12 09:59:11 +00:00
Alexander V. Chernikov
c3015737f3 Fix wrong formatting of 0.0.0.0/X table records in ipfw(8).
Add `flags` u16 field to the hole in ipfw_table_xentry structure.
Kernel has been guessing address family for supplied record based
on xent length size.
Userland, however, has been getting fixed-size ipfw_table_xentry structures
guessing address family by checking address by IN6_IS_ADDR_V4COMPAT().

Fix this behavior by providing specific IPFW_TCF_INET flag for IPv4 records.

PR:		bin/189471
Submitted by:	Dennis Yusupoff <dyr@smartspb.net>
MFC after:	2 weeks
2014-05-17 13:45:03 +00:00
Alexander V. Chernikov
ae01d73c04 Add ipfw support for setting/matching DiffServ codepoints (DSCP).
Setting DSCP support is done via O_SETDSCP which works for both
IPv4 and IPv6 packets. Fast checksum recalculation (RFC 1624) is done for IPv4.
Dscp can be specified by name (AFXY, CSX, BE, EF), by value
(0..63) or via tablearg.

Matching DSCP is done via another opcode (O_DSCP) which accepts several
classes at once (af11,af22,be). Classes are stored in bitmask (2 u32 words).

Many people made their variants of this patch, the ones I'm aware of are
(in alphabetic order):

Dmitrii Tejblum
Marcelo Araujo
Roman Bogorodskiy (novel)
Sergey Matveichuk (sem)
Sergey Ryabin

PR:		kern/102471, kern/121122
MFC after:	2 weeks
2013-03-20 10:35:33 +00:00
Alexander V. Chernikov
bdf942c3f0 Revert r234834 per luigi@ request.
Cleaner solution (e.g. adding another header) should be done here.

Original log:
  Move several enums and structures required for L2 filtering from ip_fw_private.h to ip_fw.h.
  Remove ipfw/ip_fw_private.h header from non-ipfw code.

Requested by:      luigi
Approved by:       kib(mentor)
2012-05-03 08:56:43 +00:00
Alexander V. Chernikov
7bd5e9b143 Move several enums and structures required for L2 filtering from ip_fw_private.h to ip_fw.h.
Remove ipfw/ip_fw_private.h header from non-ipfw code.

Approved by:        ae(mentor)
MFC after:          2 weeks
2012-04-30 10:22:23 +00:00
Alexander V. Chernikov
732d27b32d - Permit number of ipfw tables to be changed in runtime.
net.inet.ip.fw.tables_max is now read-write.

- Bump IPFW_TABLES_MAX to 65535
Default number of tables is still 128

- Remove IPFW_TABLES_MAX from ipfw(8) code.

Sponsored by Yandex LLC

Approved by:    kib(mentor)

MFC after:      2 weeks
2012-03-25 20:37:59 +00:00
Alexander V. Chernikov
f8bee51a69 - Add ipfw eXtended tables permitting radix to be used for any kind of keys.
- Add support for IPv6 and interface extended tables
- Make number of tables to be loader tunable in range 0..65534.
- Use IP_FW3 opcode for all new extended table cmds

No ABI changes are introduced. Old userland will see valid tables for
IPv4 tables and no entries otherwise. Flush works for any table.

IP_FW3 socket option is used to encapsulate all new opcodes:
 /* IP_FW3 header/opcodes */
 typedef struct _ip_fw3_opheader {
        uint16_t opcode;        /* Operation opcode */
        uint16_t reserved[3];   /* Align to 64-bit boundary */
 } ip_fw3_opheader;

New opcodes added:
 IP_FW_TABLE_XADD, IP_FW_TABLE_XDEL, IP_FW_TABLE_XGETSIZE, IP_FW_TABLE_XLIST

ipfw(8) table argument parsing behavior is changed:
 'ipfw table 999 add host' now assumes 'host' to be interface name instead of
 hostname.

New tunable:
 net.inet.ip.fw.tables_max controls number of table supported by ipfw in given
 VNET instance. 128 is still the default value.

New syntax:
ipfw add skipto tablearg ip from any to any via table(42) in
ipfw add skipto tablearg ip from any to any via table(4242) out

This is a bit hackish, special interface name '\1' is used to signal interface
table number is passed in p.glob field.

Sponsored by Yandex LLC

Reviewed by:    ae
Approved by:    ae (mentor)

MFC after:      4 weeks
2012-03-12 14:07:57 +00:00
Bjoern A. Zeeb
8a006adb24 Add support for IPv6 to ipfw fwd:
Distinguish IPv4 and IPv6 addresses and optional port numbers in
user space to set the option for the correct protocol family.
Add support in the kernel for carrying the new IPv6 destination
address and port.
Add support to TCP and UDP for IPv6 and fix UDP IPv4 to not change
the address in the IP header.
Add support for IPv6 forwarding to a non-local destination.
Add a regession test uitilizing VIMAGE to check all 20 possible
combinations I could think of.

Obtained from:	David Dolson at Sandvine Incorporated
		(original version for ipfw fwd IPv6 support)
Sponsored by:	Sandvine Incorporated
PR:		bin/117214
MFC after:	4 weeks
Approved by:	re (kib)
2011-08-20 17:05:11 +00:00
Andrey V. Elsukov
9527ec6e52 Add new rule actions "call" and "return" to ipfw. They make
possible to organize subroutines with rules.

The "call" action saves the current rule number in the internal
stack and rules processing continues from the first rule with
specified number (similar to skipto action). If later a rule with
"return" action is encountered, the processing returns to the first
rule with number of "call" rule saved in the stack plus one or higher.

Submitted by:	Vadim Goncharov
Discussed by:	ipfw@, luigi@
2011-06-29 10:06:58 +00:00
Gleb Smirnoff
9d0a2ddf69 - Rewrite functions that copyin/out NAT configuration, so that they
calculate required memory size dynamically.
- Fix races on chain re-lock.
- Introduce new field to ip_fw_chain - generation count. Now utilized
  only in the NAT configuration, but can be utilized wider in ipfw.
- Get rid of NAT_BUF_LEN in ip_fw.h

PR:		kern/143653
2011-04-19 15:06:33 +00:00
Luigi Rizzo
ae99fd0e07 The first customer of the SO_USER_COOKIE option:
the "sockarg" ipfw option matches packets associated to
a local socket and with a non-zero so_user_cookie value.
The value is made available as tablearg, so it can be used
as a skipto target or pipe number in ipfw/dummynet rules.

Code by Paul Joe, manpage by me.

Submitted by:	Paul Joe
MFC after:	1 week
2010-11-12 13:05:17 +00:00
Luigi Rizzo
f9f7bde3bc + implement (two lines) the kernel side of 'lookup dscp N' to use the
dscp as a search key in table lookups;

+ (re)implement a sysctl variable to control the expire frequency of
  pipes and queues when they become empty;

+ add 'queue number' as optional part of the flow_id. This can be
  enabled with the command

        queue X config mask queue ...

  and makes it possible to support priority-based schedulers, where
  packets should be grouped according to the priority and not some
  fields in the 5-tuple.
  This is implemented as follows:
  - redefine a field in the ipfw_flow_id (in sys/netinet/ip_fw.h) but
    without changing the size or shape of the structure, so there are
    no ABI changes. On passing, also document how other fields are
    used, and remove some useless assignments in ip_fw2.c

  - implement small changes in the userland code to set/read the field;

  - revise the functions in ip_dummynet.c to manipulate masks so they
    also handle the additional field;

There are no ABI changes in this commit.
2010-03-15 17:14:27 +00:00
Luigi Rizzo
cc4d3c30ea Bring in the most recent version of ipfw and dummynet, developed
and tested over the past two months in the ipfw3-head branch.  This
also happens to be the same code available in the Linux and Windows
ports of ipfw and dummynet.

The major enhancement is a completely restructured version of
dummynet, with support for different packet scheduling algorithms
(loadable at runtime), faster queue/pipe lookup, and a much cleaner
internal architecture and kernel/userland ABI which simplifies
future extensions.

In addition to the existing schedulers (FIFO and WF2Q+), we include
a Deficit Round Robin (DRR or RR for brevity) scheduler, and a new,
very fast version of WF2Q+ called QFQ.

Some test code is also present (in sys/netinet/ipfw/test) that
lets you build and test schedulers in userland.

Also, we have added a compatibility layer that understands requests
from the RELENG_7 and RELENG_8 versions of the /sbin/ipfw binaries,
and replies correctly (at least, it does its best; sometimes you
just cannot tell who sent the request and how to answer).
The compatibility layer should make it possible to MFC this code in a
relatively short time.

Some minor glitches (e.g. handling of ipfw set enable/disable,
and a workaround for a bug in RELENG_7's /sbin/ipfw) will be
fixed with separate commits.

CREDITS:
This work has been partly supported by the ONELAB2 project, and
mostly developed by Riccardo Panicucci and myself.
The code for the qfq scheduler is mostly from Fabio Checconi,
and Marta Carbone and Francesco Magno have helped with testing,
debugging and some bug fixes.
2010-03-02 17:40:48 +00:00
Luigi Rizzo
de240d1013 merge code from ipfw3-head to reduce contention on the ipfw lock
and remove all O(N) sequences from kernel critical sections in ipfw.

In detail:

 1. introduce a IPFW_UH_LOCK to arbitrate requests from
     the upper half of the kernel. Some things, such as 'ipfw show',
     can be done holding this lock in read mode, whereas insert and
     delete require IPFW_UH_WLOCK.

  2. introduce a mapping structure to keep rules together. This replaces
     the 'next' chain currently used in ipfw rules. At the moment
     the map is a simple array (sorted by rule number and then rule_id),
     so we can find a rule quickly instead of having to scan the list.
     This reduces many expensive lookups from O(N) to O(log N).

  3. when an expensive operation (such as insert or delete) is done
     by userland, we grab IPFW_UH_WLOCK, create a new copy of the map
     without blocking the bottom half of the kernel, then acquire
     IPFW_WLOCK and quickly update pointers to the map and related info.
     After dropping IPFW_LOCK we can then continue the cleanup protected
     by IPFW_UH_LOCK. So userland still costs O(N) but the kernel side
     is only blocked for O(1).

  4. do not pass pointers to rules through dummynet, netgraph, divert etc,
     but rather pass a <slot, chain_id, rulenum, rule_id> tuple.
     We validate the slot index (in the array of #2) with chain_id,
     and if successful do a O(1) dereference; otherwise, we can find
     the rule in O(log N) through <rulenum, rule_id>

All the above does not change the userland/kernel ABI, though there
are some disgusting casts between pointers and uint32_t

Operation costs now are as follows:

  Function				Old	Now	  Planned
-------------------------------------------------------------------
  + skipto X, non cached		O(N)	O(log N)
  + skipto X, cached			O(1)	O(1)
XXX dynamic rule lookup			O(1)	O(log N)  O(1)
  + skipto tablearg			O(N)	O(1)
  + reinject, non cached		O(N)	O(log N)
  + reinject, cached			O(1)	O(1)
  + kernel blocked during setsockopt()	O(N)	O(1)
-------------------------------------------------------------------

The only (very small) regression is on dynamic rule lookup and this will
be fixed in a day or two, without changing the userland/kernel ABI

Supported by: Valeria Paoli
MFC after:	1 month
2009-12-22 19:01:47 +00:00
Luigi Rizzo
70228fb346 Start splitting ip_fw2.c and ip_fw.h into smaller components.
At this time we pull out from ip_fw2.c the logging functions, and
support for dynamic rules, and move kernel-only stuff into
netinet/ipfw/ip_fw_private.h

No ABI change involved in this commit, unless I made some mistake.
ip_fw.h has changed, though not in the userland-visible part.

Files touched by this commit:

conf/files
	now references the two new source files

netinet/ip_fw.h
	remove kernel-only definitions gone into netinet/ipfw/ip_fw_private.h.

netinet/ipfw/ip_fw_private.h
	new file with kernel-specific ipfw definitions

netinet/ipfw/ip_fw_log.c
	ipfw_log and related functions

netinet/ipfw/ip_fw_dynamic.c
	code related to dynamic rules

netinet/ipfw/ip_fw2.c
	removed the pieces that goes in the new files

netinet/ipfw/ip_fw_nat.c
	minor rearrangement to remove LOOKUP_NAT from the
	main headers. This require a new function pointer.

A bunch of other kernel files that included netinet/ip_fw.h now
require netinet/ipfw/ip_fw_private.h as well.
Not 100% sure i caught all of them.

MFC after:	1 month
2009-12-15 16:15:14 +00:00
Luigi Rizzo
9565806f16 change the type of the opcode from enum *:8 to u_int8_t
so the size and alignment of the ipfw_insn is not compiler dependent.
No changes in the code generated by gcc.

There was only one instance of this kind in our entire source tree,
so i suspect the old definition was a poor choice (which i made).

MFC after:	3 days
2009-12-02 08:52:06 +00:00
Julian Elischer
c4b21cbe4a Fix ipfw's initialization functions to get the correct order of evaluation
to allow vnet and non vnet operation. Move some functions from ip_fw_pfil.c
to ip_fw2.c and mode to mostly using the SYSINIT and VNET_SYSINIT handlers
instead of the modevent handler. Correct some spelling errors in comments
in the affected code. Note this bug fixes a crash in NON VIMAGE kernels when
ipfw is unloaded.

This patch is a minimal patch for 8.0
I have a much larger patch that actually fixes the underlying problems
that will be applied after 8.0

Reviewed by:	zec@, rwatson@, bz@(earlier version)
Approved by:	re (rwatson)
MFC after:	Immediatly
2009-08-21 11:20:10 +00:00
Robert Watson
1e77c1056a Remove unused VNET_SET() and related macros; only VNET_GET() is
ever actually used.  Rename VNET_GET() to VNET() to shorten
variable references.

Discussed with:	bz, julian
Reviewed by:	bz
Approved by:	re (kensmith, kib)
2009-07-16 21:13:04 +00:00
Robert Watson
eddfbb763d Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator.  Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...).  This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack.  Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory.  Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy.  Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address.  When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by:  bz
Reviewed by:            bz, zec
Discussed with:         gnn, jamie, jeff, jhb, julian, sam
Suggested by:           peter
Approved by:            re (kensmith)
2009-07-14 22:48:30 +00:00
Oleg Bulyzhin
dda10d624c Close long existed race with net.inet.ip.fw.one_pass = 0:
If packet leaves ipfw to other kernel subsystem (dummynet, netgraph, etc)
it carries pointer to matching ipfw rule. If this packet then reinjected back
to ipfw, ruleset processing starts from that rule. If rule was deleted
meanwhile, due to existed race condition panic was possible (as well as
other odd effects like parsing rules in 'reap list').

P.S. this commit changes ABI so userland ipfw related binaries should be
recompiled.

MFC after:	1 month
Tested by:	Mikolaj Golub
2009-06-09 21:27:11 +00:00
Luigi Rizzo
b87ce5545b Several ipfw options and actions use a 16-bit argument to indicate
pipes, queues, tags, rule numbers and so on.
These are all different namespaces, and the only thing they have in
common is the fact they use a 16-bit slot to represent the argument.

There is some confusion in the code, mostly for historical reasons,
on how the values 0 and 65535 should be used. At the moment, 0 is
forbidden almost everywhere, while 65535 is used to represent a
'tablearg' argument, i.e. the result of the most recent table() lookup.

For now, try to use explicit constants for the min and max allowed
values, and do not overload the default rule number for that.

Also, make the MTAG_IPFW declaration only visible to the kernel.

NOTE: I think the issue needs to be revisited before 8.0 is out:
the 2^16 namespace limit for rule numbers and pipe/queue is
annoying, and we can easily bump the limit to 2^32 which gives
a lot more flexibility in partitioning the namespace.

MFC after:	5 days
2009-06-05 16:16:07 +00:00
Luigi Rizzo
115a40c7bf More cleanup in preparation of ipfw relocation (no actual code change):
+ move ipfw and dummynet hooks declarations to raw_ip.c (definitions
  in ip_var.h) same as for most other global variables.
  This removes some dependencies from ip_input.c;

+ remove the IPFW_LOADED macro, just test ip_fw_chk_ptr directly;

+ remove the DUMMYNET_LOADED macro, just test ip_dn_io_ptr directly;

+ move ip_dn_ruledel_ptr to ip_fw2.c which is the only file using it;

To be merged together with rev 193497

MFC after:	5 days
2009-06-05 13:44:30 +00:00
Marko Zec
5f416f8e84 Make indentation more uniform accross vnet container structs.
This is a purely cosmetic / NOP change.

Reviewed by:	bz
Approved by:	julian (mentor)
Verified by:	svn diff -x -w producing no output
2009-05-02 08:16:26 +00:00
Marko Zec
f6dfe47a14 Permit buiding kernels with options VIMAGE, restricted to only a single
active network stack instance.  Turning on options VIMAGE at compile
time yields the following changes relative to default kernel build:

1) V_ accessor macros for virtualized variables resolve to structure
fields via base pointers, instead of being resolved as fields in global
structs or plain global variables.  As an example, V_ifnet becomes:

    options VIMAGE:          ((struct vnet_net *) vnet_net)->_ifnet
    default build:           vnet_net_0._ifnet
    options VIMAGE_GLOBALS:  ifnet

2) INIT_VNET_* macros will declare and set up base pointers to be used
by V_ accessor macros, instead of resolving to whitespace:

    INIT_VNET_NET(ifp->if_vnet); becomes

    struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET];

3) Memory for vnet modules registered via vnet_mod_register() is now
allocated at run time in sys/kern/kern_vimage.c, instead of per vnet
module structs being declared as globals.  If required, vnet modules
can now request the framework to provide them with allocated bzeroed
memory by filling in the vmi_size field in their vmi_modinfo structures.

4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are
extended to hold a pointer to the parent vnet.  options VIMAGE builds
will fill in those fields as required.

5) curvnet is introduced as a new global variable in options VIMAGE
builds, always pointing to the default and only struct vnet.

6) struct sysctl_oid has been extended with additional two fields to
store major and minor virtualization module identifiers, oid_v_subs and
oid_v_mod.  SYSCTL_V_* family of macros will fill in those fields
accordingly, and store the offset in the appropriate vnet container
struct in oid_arg1.
In sysctl handlers dealing with virtualized sysctls, the
SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target
variable and make it available in arg1 variable for further processing.

Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have
been deleted.

Reviewed by:	bz, rwatson
Approved by:	julian (mentor)
2009-04-30 13:36:26 +00:00
Marko Zec
1ed81b739e First pass at separating per-vnet initializer functions
from existing functions for initializing global state.

        At this stage, the new per-vnet initializer functions are
	directly called from the existing global initialization code,
	which should in most cases result in compiler inlining those
	new functions, hence yielding a near-zero functional change.

        Modify the existing initializer functions which are invoked via
        protosw, like ip_init() et. al., to allow them to be invoked
	multiple times, i.e. per each vnet.  Global state, if any,
	is initialized only if such functions are called within the
	context of vnet0, which will be determined via the
	IS_DEFAULT_VNET(curvnet) check (currently always true).

        While here, V_irtualize a few remaining global UMA zones
        used by net/netinet/netipsec networking code.  While it is
        not yet clear to me or anybody else whether this is the right
        thing to do, at this stage this makes the code more readable,
        and makes it easier to track uncollected UMA-zone-backed
        objects on vnet removal.  In the long run, it's quite possible
        that some form of shared use of UMA zone pools among multiple
        vnets should be considered.

	Bump __FreeBSD_version due to changes in layout of structs
	vnet_ipfw, vnet_inet and vnet_net.

Approved by:	julian (mentor)
2009-04-06 22:29:41 +00:00
Paolo Pisati
eb2e411915 Implement an ipfw action to reassemble ip packets: reass. 2009-04-01 20:23:47 +00:00
Luigi Rizzo
0906f40fd8 fw_debug has been unused for ages, so remove it from the list
of sysctl_variables.
I would also remove it from the VNET record but I am unsure if
there is any ABI issue -- so for the time being just mark it as
unused in ip_fw.h, and then we will collect the garbage at some
appropriate time in the future.

MFC after:	3 days
2009-03-02 22:11:48 +00:00
Luigi Rizzo
35b78b7520 remove dependency on eventhandler.h, we only need a forward declaration 2009-02-16 15:08:41 +00:00
Bjoern A. Zeeb
1b193af610 Second round of putting global variables, which were virtualized
but formerly missed under VIMAGE_GLOBAL.

Put the extern declarations of the  virtualized globals
under VIMAGE_GLOBAL as the globals themsevles are already.
This will help by the time when we are going to remove the globals
entirely.

Sponsored by:	The FreeBSD Foundation
2008-12-13 19:13:03 +00:00
Marko Zec
385195c062 Conditionally compile out V_ globals while instantiating the appropriate
container structures, depending on VIMAGE_GLOBALS compile time option.

Make VIMAGE_GLOBALS a new compile-time option, which by default will not
be defined, resulting in instatiations of global variables selected for
V_irtualization (enclosed in #ifdef VIMAGE_GLOBALS blocks) to be
effectively compiled out.  Instantiate new global container structures
to hold V_irtualized variables: vnet_net_0, vnet_inet_0, vnet_inet6_0,
vnet_ipsec_0, vnet_netgraph_0, and vnet_gif_0.

Update the VSYM() macro so that depending on VIMAGE_GLOBALS the V_
macros resolve either to the original globals, or to fields inside
container structures, i.e. effectively

#ifdef VIMAGE_GLOBALS
#define V_rt_tables rt_tables
#else
#define V_rt_tables vnet_net_0._rt_tables
#endif

Update SYSCTL_V_*() macros to operate either on globals or on fields
inside container structs.

Extend the internal kldsym() lookups with the ability to resolve
selected fields inside the virtualization container structs.  This
applies only to the fields which are explicitly registered for kldsym()
visibility via VNET_MOD_DECLARE() and vnet_mod_register(), currently
this is done only in sys/net/if.c.

Fix a few broken instances of MODULE_GLOBAL() macro use in SCTP code,
and modify the MODULE_GLOBAL() macro to resolve to V_ macros, which in
turn result in proper code being generated depending on VIMAGE_GLOBALS.

De-virtualize local static variables in sys/contrib/pf/net/pf_subr.c
which were prematurely V_irtualized by automated V_ prepending scripts
during earlier merging steps.  PF virtualization will be done
separately, most probably after next PF import.

Convert a few variable initializations at instantiation to
initialization in init functions, most notably in ipfw.  Also convert
TUNABLE_INT() initializers for V_ variables to TUNABLE_FETCH_INT() in
initializer functions.

Discussed at:	devsummit Strassburg
Reviewed by:	bz, julian
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-12-10 23:12:39 +00:00
Robert Watson
1f6ef666b5 Fix content and spelling of comment on _ipfw_insn.len -- a count of
32-bit words, not 32-byte words.

MFC after:	3 days
2008-10-10 14:33:47 +00:00