This adds 512K (2 * sizeof(u32) * 65k) bytes to the memory footprint.
This feature is optionaly and may be turned on in any time
(however it starts immediately in this commit. This will be changed.)
* Add "flow:hash" algorithm
Kernel changes:
* Add O_IP_FLOW_LOOKUP opcode to support "flow" lookups
* Add IPFW_TABLE_FLOW table type
* Add "struct tflow_entry" as strage for 6-tuple flows
* Add "flow:hash" algorithm. Basically it is auto-growing chained hash table.
Additionally, we store mask of fields we need to compare in each instance/
* Increase ipfw_obj_tentry size by adding struct tflow_entry
* Add per-algorithm stat (ifpw_ta_tinfo) to ipfw_xtable_info
* Increase algoname length: 32 -> 64 (algo options passed there as string)
* Assume every table type can be customized by flags, use u8 to store "tflags" field.
* Simplify ipfw_find_table_entry() by providing @tentry directly to algo callback.
* Fix bug in cidr:chash resize procedure.
Userland changes:
* add "flow table(NAME)" syntax to support n-tuple checking tables.
* make fill_flags() separate function to ease working with _s_x arrays
* change "table info" output to reflect longer "type" fields
Syntax:
ipfw table fl2 create type flow:[src-ip][,proto][,src-port][,dst-ip][dst-port] [algo flow:hash]
Examples:
0:02 [2] zfscurr0# ipfw table fl2 create type flow:src-ip,proto,dst-port algo flow:hash
0:02 [2] zfscurr0# ipfw table fl2 info
+++ table(fl2), set(0) +++
kindex: 0, type: flow:src-ip,proto,dst-port
valtype: number, references: 0
algorithm: flow:hash
items: 0, size: 280
0:02 [2] zfscurr0# ipfw table fl2 add 2a02:6b8::333,tcp,443 45000
0:02 [2] zfscurr0# ipfw table fl2 add 10.0.0.92,tcp,80 22000
0:02 [2] zfscurr0# ipfw table fl2 list
+++ table(fl2), set(0) +++
2a02:6b8::333,6,443 45000
10.0.0.92,6,80 22000
0:02 [2] zfscurr0# ipfw add 200 count tcp from me to 78.46.89.105 80 flow 'table(fl2)'
00200 count tcp from me to 78.46.89.105 dst-port 80 flow table(fl2)
0:03 [2] zfscurr0# ipfw show
00200 0 0 count tcp from me to 78.46.89.105 dst-port 80 flow table(fl2)
65535 617 59416 allow ip from any to any
0:03 [2] zfscurr0# telnet -s 10.0.0.92 78.46.89.105 80
Trying 78.46.89.105...
..
0:04 [2] zfscurr0# ipfw show
00200 5 272 count tcp from me to 78.46.89.105 dst-port 80 flow table(fl2)
65535 682 66733 allow ip from any to any
Kernel changes:
* s/IPFW_TABLE_U32/IPFW_TABLE_NUMBER/
* Force "lookup <port|uid|gid|jid>" to be IPFW_TABLE_NUMBER
* Support "lookup" method for number tables
* Add number:array algorihm (i32 as key, auto-growing).
Userland changes:
* Support named tables in "lookup <tag> Table"
* Fix handling of "table(NAME,val)" case
* Support printing "number" table data.
* Rewrite interface tables to use interface indexes
Kernel changes:
* Add generic interface tracking API:
- ipfw_iface_ref (must call unlocked, performs lazy init if needed, allocates
state & bumps ref)
- ipfw_iface_add_ntfy(UH_WLOCK+WLOCK, links comsumer & runs its callback to
update ifindex)
- ipfw_iface_del_ntfy(UH_WLOCK+WLOCK, unlinks consumer)
- ipfw_iface_unref(unlocked, drops reference)
Additionally, consumer callbacks are called in interface withdrawal/departure.
* Rewrite interface tables to use iface tracking API. Currently tables are
implemented the following way:
runtime data is stored as sorted array of {ifidx, val} for existing interfaces
full data is stored inside namedobj instance (chained hashed table).
* Add IP_FW_XIFLIST opcode to dump status of tracked interfaces
* Pass @chain ptr to most non-locked algorithm callbacks:
(prepare_add, prepare_del, flush_entry ..). This may be needed for better
interaction of given algorithm an other ipfw subsystems
* Add optional "change_ti" algorithm handler to permit updating of
cached table_info pointer (happens in case of table_max resize)
* Fix small bug in ipfw_list_tables()
* Add badd (insert into sorted array) and bdel (remove from sorted array) funcs
Userland changes:
* Add "iflist" cmd to print status of currently tracked interface
* Add stringnum_cmp for better interface/table names sorting
* Switch kernel to use per-cpu counters for rules.
* Keep ABI/API.
Kernel changes:
* Each rules is now exported as TLV with optional extenable
counter block (ip_fW_bcounter for base one) and
ip_fw_rule for rule&cmd data.
* Counters needs to be explicitly requested by IPFW_CFG_GET_COUNTERS flag.
* Separate counters from rules in kernel and clean up ip_fw a bit.
* Pack each rule in IPFW_TLV_RULE_ENT tlv to ease parsing.
* Introduce versioning in container TLV (may be needed in future).
* Fix ipfw_cfg_lheader broken u64 alignment.
Userland changes:
* Use set_mask from cfg header when requesting config
* Fix incorrect read accouting in ipfw_show_config()
* Use IPFW_RULE_NOOPT flag instead of playing with _pad
* Fix "ipfw -d list": do not print counters for dynamic states
* Some small fixes
Kernel changes:
* Introduce ipfw_obj_tentry table entry structure to force u64 alignment.
* Support "update-on-existing-key" "add" bahavior (TEI_FLAGS_UPDATED).
* Use "subtype" field to distingush between IPv4 and IPv6 table records
instead of previous hack.
* Add value type (vtype) field for kernel tables. Current types are
number,ip and dscp
* Fix sets mask retrieval for old binaries
* Fix crash while using interface tables
Userland changes:
* Switch ipfw_table_handler() to use named-only tables.
* Add "table NAME create [type {cidr|iface|u32} [valtype {number|ip|dscp}] ..."
* Switch ipfw_table_handler to match_token()-based parser.
* Switch ipfw_sets_handler to use new ipfw_get_config() for mask retrieval.
* Allow ipfw set X table ... syntax to permit using per-set table namespaces.
Kernel-side changelog:
* Split general tables code and algorithm-specific table data.
Current algorithms (IPv4/IPv6 radix and interface tables radix) moved to
new ip_fw_table_algo.c file.
Tables code now supports any algorithm implementing the following callbacks:
+struct table_algo {
+ char name[64];
+ int idx;
+ ta_init *init;
+ ta_destroy *destroy;
+ table_lookup_t *lookup;
+ ta_prepare_add *prepare_add;
+ ta_prepare_del *prepare_del;
+ ta_add *add;
+ ta_del *del;
+ ta_flush_entry *flush_entry;
+ ta_foreach *foreach;
+ ta_dump_entry *dump_entry;
+ ta_dump_xentry *dump_xentry;
+};
* Change ->state, ->xstate, ->tabletype fields of ip_fw_chain to
->tablestate pointer (array of 32 bytes structures necessary for
runtime lookups (can be probably shrinked to 16 bytes later):
+struct table_info {
+ table_lookup_t *lookup; /* Lookup function */
+ void *state; /* Lookup radix/other structure */
+ void *xstate; /* eXtended state */
+ u_long data; /* Hints for given func */
+};
* Add count method for namedobj instance to ease size calculations
* Bump ip_fw3 buffer in ipfw_clt 128->256 bytes.
* Improve bitmask resizing on tables_max change.
* Remove table numbers checking from most places.
* Fix wrong nesting in ipfw_rewrite_table_uidx().
* Add IP_FW_OBJ_LIST opcode (list all objects of given type, currently
implemented for IPFW_OBJTYPE_TABLE).
* Add IP_FW_OBJ_LISTSIZE (get buffer size to hold IP_FW_OBJ_LIST data,
currenly implemented for IPFW_OBJTYPE_TABLE).
* Add IP_FW_OBJ_INFO (requests info for one object of given type).
Some name changes:
s/ipfw_xtable_tlv/ipfw_obj_tlv/ (no table specifics)
s/ipfw_xtable_ntlv/ipfw_obj_ntlv/ (no table specifics)
Userland changes:
* Add do_set3() cmd to ipfw2 to ease dealing with op3-embeded opcodes.
* Add/improve support for destroy/info cmds.
* Add namedobject set-aware api capable of searching/allocation objects by their name/idx.
* Switch tables code to use string ids for configuration tasks.
* Change locking model: most configuration changes are protected with UH lock, runtime-visible are protected with both locks.
* Reduce number of arguments passed to ipfw_table_add/del by using separate structure.
* Add internal V_fw_tables_sets tunable (set to 0) to prepare for set-aware tables (requires opcodes/client support)
* Implement typed table referencing (and tables are implicitly allocated with all state like radix ptrs on reference)
* Add "destroy" ipfw(8) using new IP_FW_DELOBJ opcode
Namedobj more detailed:
* Blackbox api providing methods to add/del/search/enumerate objects
* Statically-sized hashes for names/indexes
* Per-set bitmask to indicate free indexes
* Separate methods for index alloc/delete/resize
Basically, there should not be any user-visible changes except the following:
* reducing table_max is not supported
* flush & add change table type won't work if table is referenced
Sponsored by: Yandex LLC
default from the very beginning. It was placed in wrong namespace
net.link.ether, originally it had been at another wrong namespace. It was
incorrectly documented at incorrect manual page arp(8). Since new-ARP commit,
the tunable have been consulted only on route addition, and ignored on route
deletion. Behaviour of a system with tunable turned off is not fully correct,
and has no advantages comparing to normal behavior.
in net, to avoid compatibility breakage for no sake.
The future plan is to split most of non-kernel parts of
pfvar.h into pf.h, and then make pfvar.h a kernel only
include breaking compatibility.
Discussed with: bz
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
* Do per vnet instance cleanup (previously it was only for vnet0 on
module unload, and led to libalias leaks and possible panics due to
stale pointer dereferences).
* Instead of protecting ipfw hooks registering/deregistering by only
vnet0 lock (which does not prevent pointers access from another
vnets), introduce per vnet ipfw_nat_loaded variable. The variable is
set after hooks are registered and unset before they are deregistered.
* Devirtualize ifaddr_event_tag as we run only one event handler for
all vnets.
* It is supposed that ifaddr_change event handler is called in the
interface vnet context, so add an assertion.
Reviewed by: zec
MFC after: 2 weeks
Setting DSCP support is done via O_SETDSCP which works for both
IPv4 and IPv6 packets. Fast checksum recalculation (RFC 1624) is done for IPv4.
Dscp can be specified by name (AFXY, CSX, BE, EF), by value
(0..63) or via tablearg.
Matching DSCP is done via another opcode (O_DSCP) which accepts several
classes at once (af11,af22,be). Classes are stored in bitmask (2 u32 words).
Many people made their variants of this patch, the ones I'm aware of are
(in alphabetic order):
Dmitrii Tejblum
Marcelo Araujo
Roman Bogorodskiy (novel)
Sergey Matveichuk (sem)
Sergey Ryabin
PR: kern/102471, kern/121122
MFC after: 2 weeks
* Global IPFW_DYN_LOCK() is changed to per-bucket mutex.
* State expiration is done in ipfw_tick every second.
* No expiration is done on forwarding path.
* hash table resize is done automatically and does not flush all states.
* Dynamic UMA zone is now allocated per each VNET
* State limiting is now done via UMA(9) api.
Discussed with: ipfw
MFC after: 3 weeks
Sponsored by: Yandex LLC
Instead, add protocol specific mbuf flags M_IP_NEXTHOP and
M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain
contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup
only when this flag is set.
Suggested by: andre
on the related functionality in the runtime via the sysctl variable
net.pfil.forward. It is turned off by default.
Sponsored by: Yandex LLC
Discussed with: net@
MFC after: 2 weeks
in network byte order. Any host byte order processing is
done in local variables and host byte order values are
never[1] written to a packet.
After this change a packet processed by the stack isn't
modified at all[2] except for TTL.
After this change a network stack hacker doesn't need to
scratch his head trying to figure out what is the byte order
at the given place in the stack.
[1] One exception still remains. The raw sockets convert host
byte order before pass a packet to an application. Probably
this would remain for ages for compatibility.
[2] The ip_input() still subtructs header len from ip->ip_len,
but this is planned to be fixed soon.
Reviewed by: luigi, Maxim Dounin <mdounin mdounin.ru>
Tested by: ray, Olivier Cochard-Labbe <olivier cochard.me>
reside, and move there ipfw(4) and pf(4).
o Move most modified parts of pf out of contrib.
Actual movements:
sys/contrib/pf/net/*.c -> sys/netpfil/pf/
sys/contrib/pf/net/*.h -> sys/net/
contrib/pf/pfctl/*.c -> sbin/pfctl
contrib/pf/pfctl/*.h -> sbin/pfctl
contrib/pf/pfctl/pfctl.8 -> sbin/pfctl
contrib/pf/pfctl/*.4 -> share/man/man4
contrib/pf/pfctl/*.5 -> share/man/man5
sys/netinet/ipfw -> sys/netpfil/ipfw
The arguable movement is pf/net/*.h -> sys/net. There are
future plans to refactor pf includes, so I decided not to
break things twice.
Not modified bits of pf left in contrib: authpf, ftp-proxy,
tftp-proxy, pflogd.
The ipfw(4) movement is planned to be merged to stable/9,
to make head and stable match.
Discussed with: bz, luigi