* Rewrite interface tables to use interface indexes
Kernel changes:
* Add generic interface tracking API:
- ipfw_iface_ref (must call unlocked, performs lazy init if needed, allocates
state & bumps ref)
- ipfw_iface_add_ntfy(UH_WLOCK+WLOCK, links comsumer & runs its callback to
update ifindex)
- ipfw_iface_del_ntfy(UH_WLOCK+WLOCK, unlinks consumer)
- ipfw_iface_unref(unlocked, drops reference)
Additionally, consumer callbacks are called in interface withdrawal/departure.
* Rewrite interface tables to use iface tracking API. Currently tables are
implemented the following way:
runtime data is stored as sorted array of {ifidx, val} for existing interfaces
full data is stored inside namedobj instance (chained hashed table).
* Add IP_FW_XIFLIST opcode to dump status of tracked interfaces
* Pass @chain ptr to most non-locked algorithm callbacks:
(prepare_add, prepare_del, flush_entry ..). This may be needed for better
interaction of given algorithm an other ipfw subsystems
* Add optional "change_ti" algorithm handler to permit updating of
cached table_info pointer (happens in case of table_max resize)
* Fix small bug in ipfw_list_tables()
* Add badd (insert into sorted array) and bdel (remove from sorted array) funcs
Userland changes:
* Add "iflist" cmd to print status of currently tracked interface
* Add stringnum_cmp for better interface/table names sorting
so it really should not be under "optional inet". The fact that uipc_accf.c
lives under kern/ lends some weight to making it a "standard" file.
Moving kern/uipc_accf.c from "optional inet" to "standard" eliminates the
need for #ifdef INET in kern/uipc_socket.c.
Also, this meant the net.inet.accf.unloadable sysctl needed to move, as
net.inet does not exist without networking compiled in (as it lives in
netinet/in_proto.c.) The new sysctl has been named net.accf.unloadable.
In order to support existing accept filter sysctls, the net.inet.accf node
has been added netinet/in_proto.c.
Submitted by: Steve Kiernan <stevek@juniper.net>
Obtained from: Juniper Networks, Inc.
PF_LINK, and multicast/broadcast flag should always be dropped because
the outer protocol uses unicast even when the inner address is not for
unicast. It had been broken since r236951 when gif_output() started to
use IFQ_HANDOFF().
by the stack.
Right now the stack isn't really setup for RSS with 4-tuple UDP hashing
for either IPv4 and IPv6.
The specifics:
* The UDP init path udp_init() and udplite_init() specify the hash as
2-tuple, so the PCBGROUPS code only tries a 2-tuple check;
* The PCBGROUPS and RSS code doesn't know about the UDP hash types
just yet, so they're never treated as valid hashes.
* For correctness, 4-tuple can't be enabled in the general case because
UDP datagrams can be more fragmented than IP datagrams may be.
Strictly speaking, TCP datagrams may also be fragmented and this could
cause issues with PCBGROUPS/RSS until the IP defragment path grows some
code to re-calculate the RSS hash.
I'll follow this commit up with awareness of the UDP 4-tuple for those
who wish to configure it, but for now it'll stay disabled.
No drivers (yet) know to use this function when RSS is enabled.
markedly better distribution of IPv6 address/ports than the previous key.
The previous key would hash large swaths of the port space for a given
source/destination IP address to the same low handful of bits, effectively
mapping them to the same queue. This made testing very .. special.
a source address was selected and cached, but it was not
stored that is was cached. This resulted in selecting
different source addresses for the INIT-ACK and COOKIE-ACK
when possible.
Thanks to Niu Zhixiong for reporting the issue.
MFC after: 1 week
awareness.
* Introduce IP_BINDMULTI - indicating that it's okay to bind multiple
sockets on the same bind details.
Although the PCB code has been taught about this (see below) this patch
doesn't introduce the rest of the PCB changes necessary to distribute
lookups among multiple PCB entries in the global wildcard table.
* Introduce IP_RSS_LISTEN_BUCKET - placing an listen socket into the
given RSS bucket (and thus a single PCBGROUP hash.)
* Modify the PCB add path to be aware of IP_BINDMULTI:
+ Only allow further PCB entries to be added if the owner credentials
and IP_BINDMULTI has been specified. Ie, only allow further
IP_BINDMULTI sockets to appear if the first bind() was IP_BINDMULTI.
* Teach the PCBGROUP code about IP_RSS_LISTE_BUCKET marked PCB entries.
Instead of using the wildcard logic and hashing, these sockets are
simply placed into the PCBGROUP and _not_ in the wildcard hash.
* When doing a PCBGROUP lookup, also do a wildcard match as well.
This allows for an RSS bucket PCB entry to appear in a PCBGROUP
rather than having to exist in the wildcard list.
Tested:
* TCP IPv4 server testing with igb(4)
* TCP IPv4 server testing with ix(4)
TODO:
* The pcbgroup lookup code duplicated the wildcard and wildcard-PCB
logic. This could be refactored into a single function.
* This doesn't yet work for IPv6 (The PCBGROUP code in netinet6/ doesn't
yet know about this); nor does it yet fully work for UDP.
* Switch kernel to use per-cpu counters for rules.
* Keep ABI/API.
Kernel changes:
* Each rules is now exported as TLV with optional extenable
counter block (ip_fW_bcounter for base one) and
ip_fw_rule for rule&cmd data.
* Counters needs to be explicitly requested by IPFW_CFG_GET_COUNTERS flag.
* Separate counters from rules in kernel and clean up ip_fw a bit.
* Pack each rule in IPFW_TLV_RULE_ENT tlv to ease parsing.
* Introduce versioning in container TLV (may be needed in future).
* Fix ipfw_cfg_lheader broken u64 alignment.
Userland changes:
* Use set_mask from cfg header when requesting config
* Fix incorrect read accouting in ipfw_show_config()
* Use IPFW_RULE_NOOPT flag instead of playing with _pad
* Fix "ipfw -d list": do not print counters for dynamic states
* Some small fixes
Kernel changes:
* Change dump format for dynamic states:
each state is now stored inside ipfw_obj_dyntlv
last dynamic state is indicated by IPFW_DF_LAST flag
* Do not perform sooptcopyout() for !SOPT_GET requests.
Userland changes:
* Introduce foreach_state() function handler to ease work
with different states passed by ipfw_dump_config().
* Bump table dump format preserving old ABI.
Kernel size:
* Add IP_FW_TABLE_XFIND to handle "lookup" request from userland.
* Add ta_find_tentry() algorithm callbacks/handlers to support lookups.
* Fully switch to ipfw_obj_tentry for various table dumps:
algorithms are now required to support the latest (ipfw_obj_tentry) entry
dump format, the rest is handled by generic dump code.
IP_FW_TABLE_XLIST opcode version bumped (0 -> 1).
* Eliminate legacy ta_dump_entry algo handler:
dump_table_entry() converts data from current to legacy format.
Userland side:
* Add "lookup" table parameter.
* Change the way table type is guessed: call table_get_info() first,
and check value for IPv4/IPv6 type IFF table does not exist.
* Fix table_get_list(): do more tries if supplied buffer is not enough.
* Sparate table_show_entry() from table_show_list().
Kernel changes:
* Introduce ipfw_obj_tentry table entry structure to force u64 alignment.
* Support "update-on-existing-key" "add" bahavior (TEI_FLAGS_UPDATED).
* Use "subtype" field to distingush between IPv4 and IPv6 table records
instead of previous hack.
* Add value type (vtype) field for kernel tables. Current types are
number,ip and dscp
* Fix sets mask retrieval for old binaries
* Fix crash while using interface tables
Userland changes:
* Switch ipfw_table_handler() to use named-only tables.
* Add "table NAME create [type {cidr|iface|u32} [valtype {number|ip|dscp}] ..."
* Switch ipfw_table_handler to match_token()-based parser.
* Switch ipfw_sets_handler to use new ipfw_get_config() for mask retrieval.
* Allow ipfw set X table ... syntax to permit using per-set table namespaces.
Kernel changes:
* change base TLV header to be u64 (so size can be u32).
* Introduce ipfw_obj_ctlv generc container TLV.
* Add IP_FW_XGET opcode which is now used for atomic configuration
retrieval. One can specify needed configuration pieces to retrieve
via flags field. Currently supported are
IPFW_CFG_GET_STATIC (static rules) and
IPFW_CFG_GET_STATES (dynamic states).
Other configuration pieces (tables, pipes, etc..) support is planned.
Userland changes:
* Switch ipfw(8) to use new IP_FW_XGET for rule listing.
* Split rule listing code get and show pieces.
* Make several steps forward towards libipfw:
permit printing states and rules(paritally) to supplied buffer.
do not die on malloc/kernel failure inside given printing functions.
stop assuming cmdline_opts is global symbol.
These changes prevent sysctl(8) from returning proper output,
such as:
1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.
Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
map the bucket to an RSS queue, then map the queue to a CPU ID.
This way the bucket->queue and queue->CPU mapping can change
over time.
Introduce IP_RSSBUCKETID - which instead looks up the RSS bucket.
User applications can then map the RSS bucket to a CPU.
There's 128 indirection table entries which correspond to the
low 7 bits of the 32 bit RSS hash. Each value will correspond
to an RSS bucket. (Then each RSS bucket currently will map
to a CPU.)
This is a more explicit way of figuring out which RSS bucket
is in each RSS indirection slot. It can be inferred by the other
methods but I'd rather drivers use something more simplified and
explicit.
reporting IP-addresses to the peer during the handshake, adding
addresses to the host, reporting the addresses via the sysctl
interface (used by netstat, for example) and reporting the
addresses to the application via socket options.
This issue was reported by Bernd Walter.
MFC after: 3 days
* Add 'algoname' string to ipfw_xtable_info permitting to specify lookup
algoritm with parameters.
* Rework part of ipfw_rewrite_table_uidx()
Sponsored by: Yandex LLC
* Use one u16 from op3 header to implement opcode versioning.
* IP_FW_TABLE_XLIST has now 2 handlers, for ver.0 (old) and ver.1 (current).
* Every getsockopt request is now handled in ip_fw_table.c
* Rename new opcodes:
IP_FW_OBJ_DEL -> IP_FW_TABLE_XDESTROY
IP_FW_OBJ_LISTSIZE -> IP_FW_TABLES_XGETSIZE
IP_FW_OBJ_LIST -> IP_FW_TABLES_XLIST
IP_FW_OBJ_INFO -> IP_FW_TABLE_XINFO
IP_FW_OBJ_INFO -> IP_FW_TABLE_XFLUSH
* Add some docs about using given opcodes.
* Group some legacy opcode/handlers.
Kernel changes:
* Add IP_FW_OBJ_FLUSH opcode (flush table based on its name/set)
* Add IP_FW_OBJ_DUMP opcode (dumps table data based on its names/set)
* Add IP_FW_OBJ_LISTSIZE / IP_FW_OBJ_LIST opcodes (get list of kernel tables)
Userland changes:
* move tables code to separate tables.c file
* get rid of tables_max
* switch "all"/list handling to new opcodes
Kernel-side changelog:
* Split general tables code and algorithm-specific table data.
Current algorithms (IPv4/IPv6 radix and interface tables radix) moved to
new ip_fw_table_algo.c file.
Tables code now supports any algorithm implementing the following callbacks:
+struct table_algo {
+ char name[64];
+ int idx;
+ ta_init *init;
+ ta_destroy *destroy;
+ table_lookup_t *lookup;
+ ta_prepare_add *prepare_add;
+ ta_prepare_del *prepare_del;
+ ta_add *add;
+ ta_del *del;
+ ta_flush_entry *flush_entry;
+ ta_foreach *foreach;
+ ta_dump_entry *dump_entry;
+ ta_dump_xentry *dump_xentry;
+};
* Change ->state, ->xstate, ->tabletype fields of ip_fw_chain to
->tablestate pointer (array of 32 bytes structures necessary for
runtime lookups (can be probably shrinked to 16 bytes later):
+struct table_info {
+ table_lookup_t *lookup; /* Lookup function */
+ void *state; /* Lookup radix/other structure */
+ void *xstate; /* eXtended state */
+ u_long data; /* Hints for given func */
+};
* Add count method for namedobj instance to ease size calculations
* Bump ip_fw3 buffer in ipfw_clt 128->256 bytes.
* Improve bitmask resizing on tables_max change.
* Remove table numbers checking from most places.
* Fix wrong nesting in ipfw_rewrite_table_uidx().
* Add IP_FW_OBJ_LIST opcode (list all objects of given type, currently
implemented for IPFW_OBJTYPE_TABLE).
* Add IP_FW_OBJ_LISTSIZE (get buffer size to hold IP_FW_OBJ_LIST data,
currenly implemented for IPFW_OBJTYPE_TABLE).
* Add IP_FW_OBJ_INFO (requests info for one object of given type).
Some name changes:
s/ipfw_xtable_tlv/ipfw_obj_tlv/ (no table specifics)
s/ipfw_xtable_ntlv/ipfw_obj_ntlv/ (no table specifics)
Userland changes:
* Add do_set3() cmd to ipfw2 to ease dealing with op3-embeded opcodes.
* Add/improve support for destroy/info cmds.
* Add namedobject set-aware api capable of searching/allocation objects by their name/idx.
* Switch tables code to use string ids for configuration tasks.
* Change locking model: most configuration changes are protected with UH lock, runtime-visible are protected with both locks.
* Reduce number of arguments passed to ipfw_table_add/del by using separate structure.
* Add internal V_fw_tables_sets tunable (set to 0) to prepare for set-aware tables (requires opcodes/client support)
* Implement typed table referencing (and tables are implicitly allocated with all state like radix ptrs on reference)
* Add "destroy" ipfw(8) using new IP_FW_DELOBJ opcode
Namedobj more detailed:
* Blackbox api providing methods to add/del/search/enumerate objects
* Statically-sized hashes for names/indexes
* Per-set bitmask to indicate free indexes
* Separate methods for index alloc/delete/resize
Basically, there should not be any user-visible changes except the following:
* reducing table_max is not supported
* flush & add change table type won't work if table is referenced
Sponsored by: Yandex LLC
ifa_ifwithnet() and ifa_ifwithdstaddr() The legacy functions will call the
_fib() versions with RT_ALL_FIBS, preserving legacy behavior.
sys/net/if_var.h
sys/net/if.c
Add legacy-compatible functions as described above. Ensure legacy
behavior when RT_ALL_FIBS is passed as fibnum.
sys/netinet/in_pcb.c
sys/netinet/ip_output.c
sys/netinet/ip_options.c
sys/net/route.c
sys/net/rtsock.c
sys/netinet6/nd6.c
Call with _fib() functions if we must use a specific fib, or the
legacy functions otherwise.
tests/sys/netinet/fibs_test.sh
tests/sys/netinet/udp_dontroute.c
Improve the udp_dontroute test. The bug that this test exercises is
that ifa_ifwithnet() will return the wrong address, if multiple
interfaces have addresses on the same subnet but with different
fibs. The previous version of the test only considered one possible
failure mode: that ifa_ifwithnet_fib() might fail to find any
suitable address at all. The new version also checks whether
ifa_ifwithnet_fib() finds the correct address by checking where the
ARP request goes.
Reported by: bz, hrs
Reviewed by: hrs
MFC after: 1 week
X-MFC-with: 264905
Sponsored by: Spectra Logic
mode.
Put the htonl(), htons(), ntohl() and ntohs() declarations under
__POSIX_VISIBLE >= 200112. POSIX.1-2001 and newer require these to be
exposed from <netinet/in.h> (as well as <arpa/inet.h>).
Note that it may be unnecessary to check __POSIX_VISIBLE >= 200112 because
older versions of POSIX and the C standard do not define this header.
However, other places in the same file already perform the check.
PR: 188316
Submitted by: Christian Neukirchen