freebsd-skq

Author	SHA1	Message	Date
ae	1576b695b6	Add scope zone id to the in_endpoints and hc_metrics structures. A non-global IPv6 address can be used in more than one zone of the same scope. This zone index is used to identify to which zone a non-global address belongs. Also we can have many foreign hosts with equal non-global addresses, but from different zones. So, they can have different metrics in the host cache. Obtained from: Yandex LLC Sponsored by: Yandex LLC	2014-09-10 16:26:18 +00:00
ae	9186ea6a59	Make in6_pcblookup_hash_locked and in6_pcbladdr static. Obtained from: Yandex LLC Sponsored by: Yandex LLC	2014-09-10 13:17:35 +00:00
ae	82d0b71937	Introduce INP6_PCBHASHKEY macro. Replace usage of hardcoded part of IPv6 address as hash key in all places. Obtained from: Yandex LLC	2014-09-10 12:35:42 +00:00
adrian	63bc177a08	Calculate the RSS hash for outbound UDPv4 frames. Differential Revision: https://reviews.freebsd.org/D527 Reviewed by: grehan	2014-09-09 04:19:36 +00:00
adrian	b73995e058	Update the IPv4 input path to handle reassembled frames and incoming frames with no RSS hash. When doing RSS: * Create a new IPv4 netisr which expects the frames to have been verified; it just directly dispatches to the IPv4 input path. * Once IPv4 reassembly is done, re-calculate the RSS hash with the new IP and L3 header; then reinject it as appropriate. * Update the IPv4 netisr to be a CPU affinity netisr with the RSS hash function (rss_soft_m2cpuid) - this will do a software hash if the hardware doesn't provide one. NICs that don't implement hardware RSS hashing will now benefit from RSS distribution - it'll inject into the correct destination netisr. Note: the netisr distribution doesn't work out of the box - netisr doesn't query RSS for how many CPUs and the affinity setup. Yes, netisr likely shouldn't really be doing CPU stuff anymore and should be "some kind of 'thing' that is a workqueue that may or may not have any CPU affinity"; that's for a later commit. Differential Revision: https://reviews.freebsd.org/D527 Reviewed by: grehan	2014-09-09 04:18:20 +00:00
adrian	e5ddfb30ab	Implement IPv4 RSS software hash functions to use during packet ingress and egress. * rss_mbuf_software_hash_v4 - look at the IPv4 mbuf to fetch the IPv4 details + direction to calculate a hash. * rss_proto_software_hash_v4 - hash the given source/destination IPv4 address, port and direction. * rss_soft_m2cpuid - map the given mbuf to an RSS CPU ("bucket" for now) These functions are intended to be used by the stack to support the following: * Not all NICs do RSS hashing, so we should support some way of doing a hash in software; * The NIC / driver may not hash frames the way we want (eg UDP 4-tuple hashing when the stack is only doing 2-tuple hashing for UDP); so we may need to re-hash frames; * .. same with IPv4 fragments - they will need to be re-hashed after reassembly; * .. and same with things like IP tunneling and such; * The transmit path for things like UDP, RAW and ICMP don't currently have any RSS information attached to them - so they'll need an RSS calculation performed before transmit. TODO: * Counters! Everywhere! * Add a debug mode that software hashes received frames and compares them to the hardware hash provided by the hardware to ensure they match. The IPv6 part of this is missing - I'm going to do some re-juggling of where various parts of the RSS framework live before I add the IPv6 code (read: the IPv6 code is going to go into netinet6/in6_rss.[ch], rather than living here.) Note: This API is still fluid. Please keep that in mind. Differential Revision: https://reviews.freebsd.org/D527 Reviewed by: grehan	2014-09-09 03:10:21 +00:00
adrian	e623d51cd5	Add support for receiving and setting flowtype, flowid and RSS bucket information as part of recvmsg(). This is primarily used for debugging/verification of the various processing paths in the IP, PCB and driver layers. Unfortunately the current implementation of the control message path results in a ~10% or so drop in UDP frame throughput when it's used. Differential Revision: https://reviews.freebsd.org/D527 Reviewed by: grehan	2014-09-09 01:45:39 +00:00
adrian	aab402c146	Add a flag to ip_output() - IP_NODEFAULTFLOWID - which prevents it from overriding an existing flowid/flowtype field in the outbound mbuf with the inp_flowid/inp_flowtype details. The upcoming RSS UDP support calculates a valid RSS value for outbound mbufs and since it may change per send, it doesn't cache it in the inpcb. So overriding it here would be wrong. Differential Revision: https://reviews.freebsd.org/D527 Reviewed by: grehan	2014-09-09 00:19:02 +00:00
melifaro	f7e6823045	Make ipfw_nat module use IP_FW3 codes. Kernel changes: * Split kernel/userland nat structures eliminating IPFW_INTERNAL hack. * Add IP_FW_NAT44_* codes resemblin old ones. * Assume that instances can be named (no kernel support currently). * Use both UH+WLOCK locks for all configuration changes. * Provide full ABI support for old sockopts. Userland changes: * Use IP_FW_NAT44_* codes for nat operations. * Remove undocumented ability to show ranges of nat "log" entries.	2014-09-07 18:30:29 +00:00
tuexen	53f889f7e3	Address warnings generated by the clang analyzer. MFC after: 1 week	2014-09-07 18:05:37 +00:00
tuexen	c7b009940d	Address another warnings reported by Patrick Laimbock when compiling in userspace. While there, improve consistency. MFC after: 1 week	2014-09-07 17:07:19 +00:00
tuexen	a20e3eb506	Use union sctp_sockstore instead of struct sockaddr_storage. This eliminiates some warnings when building in userland. Thanks to Patrick Laimbock for reporting this issue. Remove also some unnecessary casts. There should be no functional change. MFC after: 1 week	2014-09-07 09:06:26 +00:00
tuexen	863eb35e7d	Use SYSCTL_PROC instead of SYSCTL_VNET_PROC. Suggested by: glebius@ MFC after: 1 week	2014-09-07 07:49:49 +00:00
tuexen	65ac57de62	Fix a leak of an address, if the address is scheduled for removal and the stack is torn down. Thanks to Peter Bostroem and Jiayang Liu from Google for reporting the issue. MFC after: 1 week	2014-09-06 20:03:24 +00:00
tuexen	b9adc6630b	Fix the handling of sysctl variables when used with VIMAGE. While there do some cleanup of the code. MFC after: 1 week	2014-09-06 19:12:14 +00:00
melifaro	21fa37c8e5	Sync to HEAD@r271160.	2014-09-05 13:52:39 +00:00
glebius	ca003e260e	Satisfy assertion in m_demote(). Sponsored by: Nginx, Inc.	2014-09-04 19:28:02 +00:00
jhb	01ecfa68a7	In tcp_input(), don't acquire the pcbinfo global write lock for SYN packets targeting a listening socket. Permit to reduce TCP input processing starvation in context of high SYN load (e.g. short-lived TCP connections or SYN flood). Submitted by: Julien Charbon <jcharbon@verisign.com> Reviewed by: adrian, hiren, jhb, Mike Bentkofsky	2014-09-04 19:09:08 +00:00
glebius	9e0133cbe0	Fixes for tcp_respond() comment.	2014-09-04 17:05:57 +00:00
glebius	ae21082d79	Improve r265338. When inserting mbufs into TCP reassembly queue, try to collapse adjacent pieces using m_catpkt(). In best case scenario it copies data and frees mbufs, making mbuf exhaustion attack harder. Suggested by: Jonathan Looney <jonlooney gmail.com> Security: Hardens against remote mbuf exhaustion attack. Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-09-04 09:15:44 +00:00
glebius	2e01608625	Clean up unused CSUM_FRAGMENT. Sponsored by: Nginx, Inc.	2014-09-03 08:30:18 +00:00
glebius	2dd39cca12	Make SOCK_RAW sockets to be truly raw, not modifying received and sent packets at all. Swapping byte order on SOCK_RAW was actually a bug, an artifact from the BSD network stack, that used to convert a packet to native byte order once it is received by kernel. Other operating systems didn't follow this, and later other BSD descendants fixed this, leaving us alone with the bug. Now it is clear that we should fix the bug. In collaboration with: Olivier Cochard-Labbé <olivier cochard.me> See also: https://wiki.freebsd.org/SOCK_RAW Sponsored by: Nginx, Inc.	2014-09-01 14:04:51 +00:00
melifaro	a1eca3cc0c	Add support for multi-field values inside ipfw tables. This is the last major change in given branch. Kernel changes: * Use 64-bytes structures to hold multi-value variables. * Use shared array to hold values from all tables (assume each table algo is capable of holding 32-byte variables). * Add some placeholders to support per-table value arrays in future. * Use simple eventhandler-style API to ease the process of adding new table items. Currently table addition may required multiple UH drops/ acquires which is quite tricky due to atomic table modificatio/swap support, shared array resize, etc. Deal with it by calling special notifier capable of rolling back state before actually performing swap/resize operations. Original operation then restarts itself after acquiring UH lock. * Bump all objhash users default values to at least 64 * Fix custom hashing inside objhash. Userland changes: * Add support for dumping shared value array via "vlist" internal cmd. * Some small print/fill_flags dixes to support u32 values. * valtype is now bitmask of <skipto\|pipe\|fib\|nat\|dscp\|tag\|divert\|netgraph\|limit\|ipv4\|ipv6>. New values can hold distinct values for each of this types. * Provide special "legacy" type which assumes all values are the same. * More helpers/docs following.. Some examples: 3:41 [1] zfscurr0# ipfw table mimimi create valtype skipto,limit,ipv4,ipv6 3:41 [1] zfscurr0# ipfw table mimimi info +++ table(mimimi), set(0) +++ kindex: 2, type: addr references: 0, valtype: skipto,limit,ipv4,ipv6 algorithm: addr:radix items: 0, size: 296 3:42 [1] zfscurr0# ipfw table mimimi add 10.0.0.5 3000,10,10.0.0.1,2a02:978:2::1 added: 10.0.0.5/32 3000,10,10.0.0.1,2a02:978:2::1 3:42 [1] zfscurr0# ipfw table mimimi list +++ table(mimimi), set(0) +++ 10.0.0.5/32 3000,0,10.0.0.1,2a02:978:2::1	2014-08-31 23:51:09 +00:00
glebius	6047680797	Use macros instead of referencing struct if_data that resides in ifnet. Sponsored by: Nginx, Inc.	2014-08-31 06:30:50 +00:00
tuexen	368ac164f9	Announce SCTP support in the kern.features sysctl variables. MFC after: 3 days	2014-08-26 21:15:34 +00:00
melifaro	cf94663e69	Sync to HEAD@r270409.	2014-08-23 14:58:31 +00:00
delphij	3fc68a2c42	Restore historical behavior of in_control, which, when no matching address is found, the first usable address is returned for legacy ioctls like SIOCGIFBRDADDR, SIOCGIFDSTADDR, SIOCGIFNETMASK and SIOCGIFADDR. While there also fix a subtle issue that a caller from a jail asking for INADDR_ANY may get the first IP of the host that do not belong to the jail. Submitted by: glebius Differential Revision: https://reviews.freebsd.org/D667	2014-08-22 19:08:12 +00:00
lstewart	b3bf7524ad	Destroy the "qdiffsample_zone" UMA zone on unload to avoid a use-after-unload panic easily triggered by running "sysctl -a" after unload. Reported and tested by: Grenville Armitage <garmitage@swin.edu.au> MFC after: 1 week	2014-08-19 02:19:53 +00:00
melifaro	b921074dbb	Make room for multi-type values in struct tentry.	2014-08-15 12:58:32 +00:00
kevlo	dd40fa7e62	Change pr_output's prototype to avoid the need for explicit casts. This is a follow up to r269699. Phabric: D564 Reviewed by: jhb	2014-08-15 02:43:02 +00:00
melifaro	6f8397b648	Replace "cidr" table type with "addr" type. Suggested by: luigi	2014-08-14 21:43:20 +00:00
melifaro	ef7f079c1d	* Fix displaying dynamic rules for large rulesets. * Clean up some comments.	2014-08-14 08:21:22 +00:00
melifaro	03e33c1ac5	Sync to HEAD@r269943.	2014-08-13 16:20:41 +00:00
tuexen	4feb6f37e3	Add support for the SCTP_PR_STREAM_STATUS and SCTP_PR_ASSOC_STATUS socket options. This includes managing the correspoing stat counters. Add the SCTP_DETAILED_STR_STATS kernel option to control per policy counters on every stream. The default is off and only an aggregated counter is available. This is sufficient for the RTCWeb usecase. MFC after: 1 week	2014-08-13 15:50:16 +00:00
melifaro	20eb17aed6	Change tablearg value to be 0 (try #2 ). Most of the tablearg-supported opcodes does not accept 0 as valid value: O_TAG, O_TAGGED, O_PIPE, O_QUEUE, O_DIVERT, O_TEE, O_SKIPTO, O_CALLRET, O_NETGRAPH, O_NGTEE, O_NAT treats 0 as invalid input. The rest are O_SETDSCP and O_SETFIB. 'Fix' them by adding high-order bit (0x8000) set for non-tablearg values. Do translation in kernel for old clients (import_rule0 / export_rule0), teach current ipfw(8) binary to add/remove given bit. This change does not affect handling SETDSCP values, but limit O_SETFIB values to 32767 instead of 65k. Since currently we have either old (16) or new (2^32) max fibs, this should not be a big deal: we're definitely OK for former and have to add another opcode to deal with latter, regardless of tablearg value.	2014-08-12 15:51:48 +00:00
tuexen	e167d96026	Change SCTP sysctl from auth_disable to auth_enable. This is consistent with other similar sysctl variable used in SCTP.	2014-08-12 13:13:11 +00:00
tuexen	b57b7cb252	Add support for the SCTP_AUTH_SUPPORTED and SCTP_ASCONF_SUPPORTED socket options. Add also a sysctl to control the support of ASCONF. MFC after: 1 week	2014-08-12 11:30:16 +00:00
melifaro	25473f8f4a	* Add the abilify to lock/unlock given table from changes. Example: # ipfw table si lock # ipfw table si info +++ table(si), set(0) +++ kindex: 0, type: cidr, locked valtype: number, references: 0 algorithm: cidr:radix items: 0, size: 288 # ipfw table si add 4.5.6.7 ignored: 4.5.6.7/32 0 ipfw: Adding record failed: table is locked # ipfw table si unlock # ipfw table si add 4.5.6.7 added: 4.5.6.7/32 0 # ipfw table si lock # ipfw table si delete 4.5.6.7 ignored: 4.5.6.7/32 0 ipfw: Deleting record failed: table is locked # ipfw table si unlock # ipfw table si delete 4.5.6.7 deleted: 4.5.6.7/32 0	2014-08-11 18:09:37 +00:00
melifaro	377bb9d131	* Add support for batched add/delete for ipfw tables * Add support for atomic batches add (all or none). * Fix panic on deleting non-existing entry in radix algo. Examples: # si is empty # ipfw table si add 1.1.1.1/32 1111 2.2.2.2/32 2222 added: 1.1.1.1/32 1111 added: 2.2.2.2/32 2222 # ipfw table si add 2.2.2.2/32 2200 4.4.4.4/32 4444 exists: 2.2.2.2/32 2200 added: 4.4.4.4/32 4444 ipfw: Adding record failed: record already exists ^^^^^ Returns error but keeps inserted items # ipfw table si list +++ table(si), set(0) +++ 1.1.1.1/32 1111 2.2.2.2/32 2222 4.4.4.4/32 4444 # ipfw table si atomic add 3.3.3.3/32 3333 4.4.4.4/32 4400 5.5.5.5/32 5555 added(reverted): 3.3.3.3/32 3333 exists: 4.4.4.4/32 4400 ignored: 5.5.5.5/32 5555 ipfw: Adding record failed: record already exists ^^^^^ Returns error and reverts added records # ipfw table si list +++ table(si), set(0) +++ 1.1.1.1/32 1111 2.2.2.2/32 2222 4.4.4.4/32 4444	2014-08-11 17:34:25 +00:00
hselasky	e253e43882	Fix string length argument passed to "sysctl_handle_string()" so that the complete string is returned by the function and not just only one byte. PR: 192544 MFC after: 2 weeks	2014-08-10 07:51:55 +00:00
hiren	a1e149db26	Improve comments by listing a criteria for automatic increment of receive socket buffer. Reviewed by: jmg	2014-08-09 21:01:24 +00:00
tuexen	9f6eff7a40	Small modification of the sctp_input() cleanup to avoid having code between declariations.	2014-08-09 14:33:44 +00:00
kib	aebb08ee0c	Fix one more compiler warning, m is not initialized.	2014-08-08 15:50:02 +00:00
melifaro	deeb40d882	Partially revert previous commit: "0" value is perfectly valid for O_SETFIB and O_SETDSCP, so tablearg remains to be 655535 for now.	2014-08-08 15:33:26 +00:00
melifaro	bc102dcade	* Switch tablearg value from 65535 to 0. * Use u16 table kidx instead of integer on for iface opcode. * Provide compability layer for old clients.	2014-08-08 14:23:20 +00:00
melifaro	2a5da00f23	* Add IP_FW_TABLE_XMODIFY opcode * Since there seems to be lack of consensus on strict value typing, remove non-default value types. Use userland-only "value format type" to print values. Kernel changes: * Add IP_FW_XMODIFY to permit table run-time modifications. Currently we support changing limit and value format type. Userland changes: * Support IP_FW_XMODIFY opcode. * Support specifying value format type (ftype) in tablble create/modify req * Fine-print value type/value format type.	2014-08-08 09:27:49 +00:00
bz	7704d53e2e	Fix argument to KTR after r269699 to unbreak LINT builds.	2014-08-08 09:17:02 +00:00
melifaro	3ad34df447	Remove IP_FW_TABLES_XGETSIZE opcode. It is superseded by IP_FW_TABLES_XLIST.	2014-08-08 06:36:26 +00:00
kevlo	7727a3c215	Merge 'struct ip6protosw' and 'struct protosw' into one. Now we have only one protocol switch structure that is shared between ipv4 and ipv6. Phabric: D476 Reviewed by: jhb	2014-08-08 01:57:15 +00:00
melifaro	61bb76b813	Kernel changes: * Implement proper checks for switching between global and set-aware tables * Split IP_FW_DEL mess into the following opcodes: * IP_FW_XDEL (del rules matching pattern) * IP_FW_XMOVE (move rules matching pattern to another set) * IP_FW_SET_SWAP (swap between 2 sets) * IP_FW_SET_MOVE (move one set to another one) * IP_FW_SET_ENABLE (enable/disable sets) * Add IP_FW_XZERO / IP_FW_XRESETLOG to finish IP_FW3 migration. * Use unified ipfw_range_tlv as range description for all of the above. * Check dynamic states IFF there was non-zero number of deleted dyn rules, * Del relevant dynamic states with singe traversal instead of per-rule one. Userland changes: * Switch ipfw(8) to use new opcodes.	2014-08-07 21:37:31 +00:00
tuexen	e7d5338a8e	Add support for the SCTP_RECONFIG_SUPPORTED and the corresponding sysctl controlling the negotiation of the RE-CONFIG extension. MFC after: 3 days	2014-08-04 20:07:35 +00:00
hiren	203dc5e795	Add a comment for easier code understanding.	2014-08-04 19:42:48 +00:00
melifaro	42eca8abfb	Implement atomic ipfw table swap. Kernel changes: * Add opcode IP_FW_TABLE_XSWAP * Add support for swapping 2 tables with the same type/ftype/vtype. * Make skipto cache init after ipfw locks init. Userland changes: * Add "table X swap Y" command.	2014-08-03 21:37:12 +00:00
tuexen	ff18393ff0	Add support for the SCTP_PKTDROP_SUPPORTED socket option and the corresponding sysctl variable. The default is off, since the specification is not an RFC yet. MFC after: 1 week	2014-08-03 18:12:55 +00:00
tuexen	62b88726fd	Use consistent names for SCTP sysctls. Rename nr_sack_on_off to nrsack_enable. Please note that this extension is off by default since it is not specified in an RFC (yet).	2014-08-03 15:09:13 +00:00
tuexen	fb7bbef5e1	Add SCTP socket option SCTP_NRSACK_SUPPORTED to control the NRSACK extension. The default will still be off, since it it not an RFC (yet). Changing the sysctl name will be in a separate commit. MFC after: 1 week	2014-08-03 14:10:10 +00:00
melifaro	6e882e1221	Show algorithm-specific data in "table info" output.	2014-08-03 12:19:45 +00:00
tuexen	31e0173d95	Add support for the SCTP_PR_SUPPORTED socket option as specified in http://tools.ietf.org/html/draft-ietf-tsvwg-sctp-prpolicies Add also a sysctl controlling the default of the end-points. MFC after: 1 week	2014-08-02 21:36:40 +00:00
tuexen	6dd06cb981	Fix a copy and paste error. X-MFC with: 269436	2014-08-02 20:37:02 +00:00
tuexen	9ad96316d8	Cleanup the ECN configuration handling and provide an SCTP socket option for controlling ECN on future associations and get the status on current associations. A simialar pattern will be used for controlling SCTP extensions in upcoming commits.	2014-08-02 17:35:13 +00:00
tuexen	4a0d502636	Remove the asconf_auth_nochk sysctl. This was off by default and only existed to be able to test with non-compliant peers a long time ago.	2014-08-01 20:49:27 +00:00
grehan	8cf28f6b1b	Fix byte ordering in default RSS key. The rss_key[] array in netinet/in_rss.c has the bytes in incorrect order. This results in the RSS test vectors in the Microsft RSS spec and Intel NIC specs giving incorrect results, and making it difficult to verify correct hash operation when RSS functionality is added to new NICs. CR: https://phabric.freebsd.org/D516 Reviewed by: adrian	2014-08-01 18:36:40 +00:00
melifaro	178311d9d4	* Permit limiting number of items in table. Kernel changes: * Add TEI_FLAGS_DONTADD entry flag to indicate that insert is not possible * Support given flag in all algorithms * Add "limit" field to ipfw_xtable_info * Add actual limiting code into add_table_entry() Userland changes: * Add "limit" option as "create" table sub-option. Limit modification is currently impossible. * Print human-readable errors in table enry addition/deletion code.	2014-08-01 15:17:46 +00:00
tuexen	418772ad46	Cleanup sctp_send_initiate() and sctp_send_initiate_ack() to be in sync as much as possible. This simplifies upcoming changes.	2014-08-01 12:42:37 +00:00
melifaro	58e70e361d	* Add new "flow" table type to support N=1..5-tuple lookups * Add "flow:hash" algorithm Kernel changes: * Add O_IP_FLOW_LOOKUP opcode to support "flow" lookups * Add IPFW_TABLE_FLOW table type * Add "struct tflow_entry" as strage for 6-tuple flows * Add "flow:hash" algorithm. Basically it is auto-growing chained hash table. Additionally, we store mask of fields we need to compare in each instance/ * Increase ipfw_obj_tentry size by adding struct tflow_entry * Add per-algorithm stat (ifpw_ta_tinfo) to ipfw_xtable_info * Increase algoname length: 32 -> 64 (algo options passed there as string) * Assume every table type can be customized by flags, use u8 to store "tflags" field. * Simplify ipfw_find_table_entry() by providing @tentry directly to algo callback. * Fix bug in cidr:chash resize procedure. Userland changes: * add "flow table(NAME)" syntax to support n-tuple checking tables. * make fill_flags() separate function to ease working with _s_x arrays * change "table info" output to reflect longer "type" fields Syntax: ipfw table fl2 create type flow:[src-ip][,proto][,src-port][,dst-ip][dst-port] [algo flow:hash] Examples: 0:02 [2] zfscurr0# ipfw table fl2 create type flow:src-ip,proto,dst-port algo flow:hash 0:02 [2] zfscurr0# ipfw table fl2 info +++ table(fl2), set(0) +++ kindex: 0, type: flow:src-ip,proto,dst-port valtype: number, references: 0 algorithm: flow:hash items: 0, size: 280 0:02 [2] zfscurr0# ipfw table fl2 add 2a02:6b8::333,tcp,443 45000 0:02 [2] zfscurr0# ipfw table fl2 add 10.0.0.92,tcp,80 22000 0:02 [2] zfscurr0# ipfw table fl2 list +++ table(fl2), set(0) +++ 2a02:6b8::333,6,443 45000 10.0.0.92,6,80 22000 0:02 [2] zfscurr0# ipfw add 200 count tcp from me to 78.46.89.105 80 flow 'table(fl2)' 00200 count tcp from me to 78.46.89.105 dst-port 80 flow table(fl2) 0:03 [2] zfscurr0# ipfw show 00200 0 0 count tcp from me to 78.46.89.105 dst-port 80 flow table(fl2) 65535 617 59416 allow ip from any to any 0:03 [2] zfscurr0# telnet -s 10.0.0.92 78.46.89.105 80 Trying 78.46.89.105... .. 0:04 [2] zfscurr0# ipfw show 00200 5 272 count tcp from me to 78.46.89.105 dst-port 80 flow table(fl2) 65535 682 66733 allow ip from any to any	2014-07-31 20:08:19 +00:00
smh	3ae4520699	Ensure that IP's added to CARP always use the CARP MAC Previously there was a race condition between the address addition and associating it with the CARP which resulted in the interface MAC, instead of the CARP MAC, being used for a brief amount of time. This caused "is using my IP address" warnings as well as data being sent to the wrong machine due to incorrect ARP entries being recorded by other devices on the network.	2014-07-31 16:43:56 +00:00
smh	8d283ef60a	Only check error if one could have been generated	2014-07-31 09:18:29 +00:00
melifaro	4419c812fe	* Add number:array algorithm lookup method. Kernel changes: * s/IPFW_TABLE_U32/IPFW_TABLE_NUMBER/ * Force "lookup <port\|uid\|gid\|jid>" to be IPFW_TABLE_NUMBER * Support "lookup" method for number tables * Add number:array algorihm (i32 as key, auto-growing). Userland changes: * Support named tables in "lookup <tag> Table" * Fix handling of "table(NAME,val)" case * Support printing "number" table data.	2014-07-30 14:52:26 +00:00
hiren	f19c3089af	Add a comment and while there, fix trailing whitespace.	2014-07-29 23:42:51 +00:00
melifaro	bf787a59a7	* Dump available table algorithms via "ipfw talist" cmd. Kernel changes: * Add type/refcount fields to table algo instances. * Add IP_FW_TABLES_ALIST opcode to export available algorihms to userland. Userland changes: * Fix cores on empty input inside "ipfw table" handler. * Add "ipfw talist" cmd to print availabled kernel algorithms. * Change "table info" output to reflect long algorithm config lines.	2014-07-29 22:44:26 +00:00
glebius	d32e428cc3	Garbage collect couple of unused fields from struct ifaddr: - ifa_claim_addr() unused since removal of NetAtalk - ifa_metric seems to be never utilized, always a copy of if_metric	2014-07-29 15:01:29 +00:00
melifaro	fa3f38a6a0	* Add generic ipfw interface tracking API * Rewrite interface tables to use interface indexes Kernel changes: * Add generic interface tracking API: - ipfw_iface_ref (must call unlocked, performs lazy init if needed, allocates state & bumps ref) - ipfw_iface_add_ntfy(UH_WLOCK+WLOCK, links comsumer & runs its callback to update ifindex) - ipfw_iface_del_ntfy(UH_WLOCK+WLOCK, unlinks consumer) - ipfw_iface_unref(unlocked, drops reference) Additionally, consumer callbacks are called in interface withdrawal/departure. * Rewrite interface tables to use iface tracking API. Currently tables are implemented the following way: runtime data is stored as sorted array of {ifidx, val} for existing interfaces full data is stored inside namedobj instance (chained hashed table). * Add IP_FW_XIFLIST opcode to dump status of tracked interfaces * Pass @chain ptr to most non-locked algorithm callbacks: (prepare_add, prepare_del, flush_entry ..). This may be needed for better interaction of given algorithm an other ipfw subsystems * Add optional "change_ti" algorithm handler to permit updating of cached table_info pointer (happens in case of table_max resize) * Fix small bug in ipfw_list_tables() * Add badd (insert into sorted array) and bdel (remove from sorted array) funcs Userland changes: * Add "iflist" cmd to print status of currently tracked interface * Add stringnum_cmp for better interface/table names sorting	2014-07-28 19:01:25 +00:00
marcel	c6a7dd8e59	The accept filter code is not specific to the FreeBSD IPv4 network stack, so it really should not be under "optional inet". The fact that uipc_accf.c lives under kern/ lends some weight to making it a "standard" file. Moving kern/uipc_accf.c from "optional inet" to "standard" eliminates the need for #ifdef INET in kern/uipc_socket.c. Also, this meant the net.inet.accf.unloadable sysctl needed to move, as net.inet does not exist without networking compiled in (as it lives in netinet/in_proto.c.) The new sysctl has been named net.accf.unloadable. In order to support existing accept filter sysctls, the net.inet.accf node has been added netinet/in_proto.c. Submitted by: Steve Kiernan <stevek@juniper.net> Obtained from: Juniper Networks, Inc.	2014-07-26 19:27:34 +00:00
tuexen	da26b011a7	Initialize notification strucuture. This was missed in an earlier commit MFC after: 3 days	2014-07-24 18:06:18 +00:00
hrs	45044bb8e3	Fix EtherIP. TOS field must be initialized when the inner protocol is PF_LINK, and multicast/broadcast flag should always be dropped because the outer protocol uses unicast even when the inner address is not for unicast. It had been broken since r236951 when gif_output() started to use IFQ_HANDOFF().	2014-07-24 10:42:47 +00:00
tuexen	8a530385ea	Cleanup the definition of two structures which are exposed to userland. Therefore no MFC.	2014-07-22 19:54:22 +00:00
adrian	c1ee0e57a1	Make the PCBGROUPS code aware of IPv4 UDP 4-tuple.	2014-07-20 07:38:38 +00:00
adrian	10af48ebe2	Add hash awareness of the IPv4 and IPv6 UDP 4-tuple. Note: it would be nice if the supported hash check would be used here!	2014-07-20 07:37:47 +00:00
adrian	8a3ce3eb52	Implement rss_gethashconfig() - return the currently supported hash methods by the stack. Right now the stack isn't really setup for RSS with 4-tuple UDP hashing for either IPv4 and IPv6. The specifics: * The UDP init path udp_init() and udplite_init() specify the hash as 2-tuple, so the PCBGROUPS code only tries a 2-tuple check; * The PCBGROUPS and RSS code doesn't know about the UDP hash types just yet, so they're never treated as valid hashes. * For correctness, 4-tuple can't be enabled in the general case because UDP datagrams can be more fragmented than IP datagrams may be. Strictly speaking, TCP datagrams may also be fragmented and this could cause issues with PCBGROUPS/RSS until the IP defragment path grows some code to re-calculate the RSS hash. I'll follow this commit up with awareness of the UDP 4-tuple for those who wish to configure it, but for now it'll stay disabled. No drivers (yet) know to use this function when RSS is enabled.	2014-07-20 07:36:59 +00:00
adrian	5956b735dc	Update the comment to be more concise.	2014-07-20 07:31:55 +00:00
adrian	036f4e1340	Update the default RSS hash to the Chelsio T5 firmware one - it provides markedly better distribution of IPv6 address/ports than the previous key. The previous key would hash large swaths of the port space for a given source/destination IP address to the same low handful of bits, effectively mapping them to the same queue. This made testing very .. special.	2014-07-18 08:22:13 +00:00
adrian	0e45eb31ff	Oops - somehow I missed the IP option numbers clashing with the multicast numbers below. Move them to a new set of non-clashing numbers.	2014-07-17 05:45:54 +00:00
adrian	59e42691b9	Add RSS hashing awareness for IPv6 and TCP IPv6 hash types.	2014-07-12 05:43:43 +00:00
adrian	01670bf027	Expose in_pcbbind_check_bindmulti() so the upcoming IPv6 RSS changes can be made to use it.	2014-07-12 05:40:13 +00:00
tuexen	264ec2645a	Whitespace changes. MFC after: 1 week	2014-07-11 21:15:40 +00:00
tuexen	d1d1e8d008	Bugfix: When a remote address was added to an endpoint, a source address was selected and cached, but it was not stored that is was cached. This resulted in selecting different source addresses for the INIT-ACK and COOKIE-ACK when possible. Thanks to Niu Zhixiong for reporting the issue. MFC after: 1 week	2014-07-11 17:31:40 +00:00
glebius	75acbf068a	Fix style bug: rename the refcount field of m_ext to ext_cnt, to match other members. Sponsored by: Nginx, Inc.	2014-07-11 14:34:29 +00:00
tuexen	635f385383	Integrate upstream changes. MFC after: 1 week	2014-07-11 06:52:48 +00:00
adrian	627c6869c3	Implement the first stage of multi-bind listen sockets and RSS socket awareness. * Introduce IP_BINDMULTI - indicating that it's okay to bind multiple sockets on the same bind details. Although the PCB code has been taught about this (see below) this patch doesn't introduce the rest of the PCB changes necessary to distribute lookups among multiple PCB entries in the global wildcard table. * Introduce IP_RSS_LISTEN_BUCKET - placing an listen socket into the given RSS bucket (and thus a single PCBGROUP hash.) * Modify the PCB add path to be aware of IP_BINDMULTI: + Only allow further PCB entries to be added if the owner credentials and IP_BINDMULTI has been specified. Ie, only allow further IP_BINDMULTI sockets to appear if the first bind() was IP_BINDMULTI. * Teach the PCBGROUP code about IP_RSS_LISTE_BUCKET marked PCB entries. Instead of using the wildcard logic and hashing, these sockets are simply placed into the PCBGROUP and _not_ in the wildcard hash. * When doing a PCBGROUP lookup, also do a wildcard match as well. This allows for an RSS bucket PCB entry to appear in a PCBGROUP rather than having to exist in the wildcard list. Tested: * TCP IPv4 server testing with igb(4) * TCP IPv4 server testing with ix(4) TODO: * The pcbgroup lookup code duplicated the wildcard and wildcard-PCB logic. This could be refactored into a single function. * This doesn't yet work for IPv6 (The PCBGROUP code in netinet6/ doesn't yet know about this); nor does it yet fully work for UDP.	2014-07-10 03:10:56 +00:00
glebius	ab0c569422	In several cases in ip_output() we obtain reference on ifa. Do not leak it. Together with: asomers, np Sponsored by: Nginx, Inc.	2014-07-09 07:48:05 +00:00
melifaro	3f7d90b385	* Use different rule structures in kernel/userland. * Switch kernel to use per-cpu counters for rules. * Keep ABI/API. Kernel changes: * Each rules is now exported as TLV with optional extenable counter block (ip_fW_bcounter for base one) and ip_fw_rule for rule&cmd data. * Counters needs to be explicitly requested by IPFW_CFG_GET_COUNTERS flag. * Separate counters from rules in kernel and clean up ip_fw a bit. * Pack each rule in IPFW_TLV_RULE_ENT tlv to ease parsing. * Introduce versioning in container TLV (may be needed in future). * Fix ipfw_cfg_lheader broken u64 alignment. Userland changes: * Use set_mask from cfg header when requesting config * Fix incorrect read accouting in ipfw_show_config() * Use IPFW_RULE_NOOPT flag instead of playing with _pad * Fix "ipfw -d list": do not print counters for dynamic states * Some small fixes	2014-07-08 23:11:15 +00:00
delphij	b74f97a13b	Initialize SCTP cmsg's and notification's buffer before copying out to userland. Submitted by: tuexen Security: CVE-2014-3953 Security: FreeBSD-SA-14:17.kmem	2014-07-08 21:54:27 +00:00
melifaro	7189aec01e	* Prepare to pass other dynamic states via ipfw_dump_config() Kernel changes: * Change dump format for dynamic states: each state is now stored inside ipfw_obj_dyntlv last dynamic state is indicated by IPFW_DF_LAST flag * Do not perform sooptcopyout() for !SOPT_GET requests. Userland changes: * Introduce foreach_state() function handler to ease work with different states passed by ipfw_dump_config().	2014-07-06 23:26:34 +00:00
melifaro	0eba52a18e	* Add "lookup" table functionality to permit userland entry lookups. * Bump table dump format preserving old ABI. Kernel size: * Add IP_FW_TABLE_XFIND to handle "lookup" request from userland. * Add ta_find_tentry() algorithm callbacks/handlers to support lookups. * Fully switch to ipfw_obj_tentry for various table dumps: algorithms are now required to support the latest (ipfw_obj_tentry) entry dump format, the rest is handled by generic dump code. IP_FW_TABLE_XLIST opcode version bumped (0 -> 1). * Eliminate legacy ta_dump_entry algo handler: dump_table_entry() converts data from current to legacy format. Userland side: * Add "lookup" table parameter. * Change the way table type is guessed: call table_get_info() first, and check value for IPv4/IPv6 type IFF table does not exist. * Fix table_get_list(): do more tries if supplied buffer is not enough. * Sparate table_show_entry() from table_show_list().	2014-07-06 18:16:04 +00:00
hiren	955dea0122	Fix a typo.	2014-07-03 23:12:43 +00:00
melifaro	99023231d3	Fully switch to named tables: Kernel changes: * Introduce ipfw_obj_tentry table entry structure to force u64 alignment. * Support "update-on-existing-key" "add" bahavior (TEI_FLAGS_UPDATED). * Use "subtype" field to distingush between IPv4 and IPv6 table records instead of previous hack. * Add value type (vtype) field for kernel tables. Current types are number,ip and dscp * Fix sets mask retrieval for old binaries * Fix crash while using interface tables Userland changes: * Switch ipfw_table_handler() to use named-only tables. * Add "table NAME create [type {cidr\|iface\|u32} [valtype {number\|ip\|dscp}] ..." * Switch ipfw_table_handler to match_token()-based parser. * Switch ipfw_sets_handler to use new ipfw_get_config() for mask retrieval. * Allow ipfw set X table ... syntax to permit using per-set table namespaces.	2014-07-03 22:25:59 +00:00
hiren	93cfc1a55a		2014-07-02 22:04:14 +00:00
adrian	dc6db53254	Remove old reference to IP_RSSCPUID. Submitted by: Eggert, Lars <lars@netapp.com>	2014-07-01 17:27:48 +00:00
adrian	14ce8d2190	If we're doing RSS then ensure the TCP timer selection uses the multi-CPU callwheel setup, rather than just dumping all the timers on swi0.	2014-06-30 04:26:29 +00:00
melifaro	75913dd997	* Add new IP_FW_XADD opcode which permits to a) specify table ids as names b) add multiple rules at once. Partially convert current code for atomic addition of multiple rules.	2014-06-29 22:35:47 +00:00
melifaro	5d627fdb8b	Suppord showing named tables in ipfw(8) rule listing. Kernel changes: * change base TLV header to be u64 (so size can be u32). * Introduce ipfw_obj_ctlv generc container TLV. * Add IP_FW_XGET opcode which is now used for atomic configuration retrieval. One can specify needed configuration pieces to retrieve via flags field. Currently supported are IPFW_CFG_GET_STATIC (static rules) and IPFW_CFG_GET_STATES (dynamic states). Other configuration pieces (tables, pipes, etc..) support is planned. Userland changes: * Switch ipfw(8) to use new IP_FW_XGET for rule listing. * Split rule listing code get and show pieces. * Make several steps forward towards libipfw: permit printing states and rules(paritally) to supplied buffer. do not die on malloc/kernel failure inside given printing functions. stop assuming cmdline_opts is global symbol.	2014-06-28 23:20:24 +00:00
hselasky	35b126e324	Pull in r267961 and r267973 again. Fix for issues reported will follow.	2014-06-28 03:56:17 +00:00
gjb	fc21f40567	Revert r267961, r267973: These changes prevent sysctl(8) from returning proper output, such as: 1) no output from sysctl(8) 2) erroneously returning ENOMEM with tools like truss(1) or uname(1) truss: can not get etype: Cannot allocate memory	2014-06-27 22:05:21 +00:00
adrian	9fbdf6a6f5	Add missing variable declarations when using RSS. Reported by: bryanv@	2014-06-27 19:07:00 +00:00
hselasky	bd1ed65f0f	Extend the meaning of the CTLFLAG_TUN flag to automatically check if there is an environment variable which shall initialize the SYSCTL during early boot. This works for all SYSCTL types both statically and dynamically created ones, except for the SYSCTL NODE type and SYSCTLs which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to be used in the case a tunable sysctl has a custom initialisation function allowing the sysctl to still be marked as a tunable. The kernel SYSCTL API is mostly the same, with a few exceptions for some special operations like iterating childrens of a static/extern SYSCTL node. This operation should probably be made into a factored out common macro, hence some device drivers use this. The reason for changing the SYSCTL API was the need for a SYSCTL parent OID pointer and not only the SYSCTL parent OID list pointer in order to quickly generate the sysctl path. The motivation behind this patch is to avoid parameter loading cludges inside the OFED driver subsystem. Instead of adding special code to the OFED driver subsystem to post-load tunables into dynamically created sysctls, we generalize this in the kernel. Other changes: - Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask" to "hw.pcic.intr_mask". - Removed redundant TUNABLE statements throughout the kernel. - Some minor code rewrites in connection to removing not needed TUNABLE statements. - Added a missing SYSCTL_DECL(). - Wrapped two very long lines. - Avoid malloc()/free() inside sysctl string handling, in case it is called to initialize a sysctl from a tunable, hence malloc()/free() is not ready when sysctls from the sysctl dataset are registered. - Bumped FreeBSD version to indicate SYSCTL API change. MFC after: 2 weeks Sponsored by: Mellanox Technologies	2014-06-27 16:33:43 +00:00
adrian	d4fe515519	Retire IP_RSSCPUID ; the right thing to do is query the RSS bucket; map the bucket to an RSS queue, then map the queue to a CPU ID. This way the bucket->queue and queue->CPU mapping can change over time. Introduce IP_RSSBUCKETID - which instead looks up the RSS bucket. User applications can then map the RSS bucket to a CPU.	2014-06-26 04:12:41 +00:00
adrian	90965b8d6e	Add another RSS method to query the indirection table entries. There's 128 indirection table entries which correspond to the low 7 bits of the 32 bit RSS hash. Each value will correspond to an RSS bucket. (Then each RSS bucket currently will map to a CPU.) This is a more explicit way of figuring out which RSS bucket is in each RSS indirection slot. It can be inferred by the other methods but I'd rather drivers use something more simplified and explicit.	2014-06-26 02:49:51 +00:00
tuexen	af1469109b	Fix a bug which incorrectly allowed two listening SCTP sockets on the same port bound to the wildcard address. MFC after: 3 days	2014-06-20 20:17:39 +00:00
tuexen	16f87e69c8	Fix a bug in the setsockopt()-handling of the SCTP specific option SCTP_PEER_ADDR_THLDS: Use the provided address as intended. MFC after: 3 days	2014-06-20 17:45:00 +00:00
tuexen	d9071b7221	Honor jails for unbound SCTP sockets when selecting source addresses, reporting IP-addresses to the peer during the handshake, adding addresses to the host, reporting the addresses via the sysctl interface (used by netstat, for example) and reporting the addresses to the application via socket options. This issue was reported by Bernd Walter. MFC after: 3 days	2014-06-20 13:26:49 +00:00
melifaro	8bc233982f	* Add IP_FW_TABLE_XCREATE / IP_FW_TABLE_XMODIFY opcodes. * Add 'algoname' string to ipfw_xtable_info permitting to specify lookup algoritm with parameters. * Rework part of ipfw_rewrite_table_uidx() Sponsored by: Yandex LLC	2014-06-16 13:05:07 +00:00
melifaro	b06860b3e2	Simplify opcode handling. * Use one u16 from op3 header to implement opcode versioning. * IP_FW_TABLE_XLIST has now 2 handlers, for ver.0 (old) and ver.1 (current). * Every getsockopt request is now handled in ip_fw_table.c * Rename new opcodes: IP_FW_OBJ_DEL -> IP_FW_TABLE_XDESTROY IP_FW_OBJ_LISTSIZE -> IP_FW_TABLES_XGETSIZE IP_FW_OBJ_LIST -> IP_FW_TABLES_XLIST IP_FW_OBJ_INFO -> IP_FW_TABLE_XINFO IP_FW_OBJ_INFO -> IP_FW_TABLE_XFLUSH * Add some docs about using given opcodes. * Group some legacy opcode/handlers.	2014-06-15 13:40:27 +00:00
melifaro	fe9646e6ff	Move further to eliminate next pieces of number-assuming code inside tables. Kernel changes: * Add IP_FW_OBJ_FLUSH opcode (flush table based on its name/set) * Add IP_FW_OBJ_DUMP opcode (dumps table data based on its names/set) * Add IP_FW_OBJ_LISTSIZE / IP_FW_OBJ_LIST opcodes (get list of kernel tables) Userland changes: * move tables code to separate tables.c file * get rid of tables_max * switch "all"/list handling to new opcodes	2014-06-14 22:47:25 +00:00
melifaro	f9fb63fe8c	Add API to ease adding new algorithms/new tabletypes to ipfw. Kernel-side changelog: * Split general tables code and algorithm-specific table data. Current algorithms (IPv4/IPv6 radix and interface tables radix) moved to new ip_fw_table_algo.c file. Tables code now supports any algorithm implementing the following callbacks: +struct table_algo { + char name[64]; + int idx; + ta_init init; + ta_destroy destroy; + table_lookup_t lookup; + ta_prepare_add prepare_add; + ta_prepare_del prepare_del; + ta_add add; + ta_del del; + ta_flush_entry flush_entry; + ta_foreach foreach; + ta_dump_entry dump_entry; + ta_dump_xentry dump_xentry; +}; Change ->state, ->xstate, ->tabletype fields of ip_fw_chain to ->tablestate pointer (array of 32 bytes structures necessary for runtime lookups (can be probably shrinked to 16 bytes later): +struct table_info { + table_lookup_t lookup; / Lookup function / + void state; /* Lookup radix/other structure / + void xstate; /* eXtended state / + u_long data; / Hints for given func / +}; Add count method for namedobj instance to ease size calculations * Bump ip_fw3 buffer in ipfw_clt 128->256 bytes. * Improve bitmask resizing on tables_max change. * Remove table numbers checking from most places. * Fix wrong nesting in ipfw_rewrite_table_uidx(). * Add IP_FW_OBJ_LIST opcode (list all objects of given type, currently implemented for IPFW_OBJTYPE_TABLE). * Add IP_FW_OBJ_LISTSIZE (get buffer size to hold IP_FW_OBJ_LIST data, currenly implemented for IPFW_OBJTYPE_TABLE). * Add IP_FW_OBJ_INFO (requests info for one object of given type). Some name changes: s/ipfw_xtable_tlv/ipfw_obj_tlv/ (no table specifics) s/ipfw_xtable_ntlv/ipfw_obj_ntlv/ (no table specifics) Userland changes: * Add do_set3() cmd to ipfw2 to ease dealing with op3-embeded opcodes. * Add/improve support for destroy/info cmds.	2014-06-14 10:58:39 +00:00
melifaro	01ec53e019	Make ipfw tables use names as used-level identifier internally: * Add namedobject set-aware api capable of searching/allocation objects by their name/idx. * Switch tables code to use string ids for configuration tasks. * Change locking model: most configuration changes are protected with UH lock, runtime-visible are protected with both locks. * Reduce number of arguments passed to ipfw_table_add/del by using separate structure. * Add internal V_fw_tables_sets tunable (set to 0) to prepare for set-aware tables (requires opcodes/client support) * Implement typed table referencing (and tables are implicitly allocated with all state like radix ptrs on reference) * Add "destroy" ipfw(8) using new IP_FW_DELOBJ opcode Namedobj more detailed: * Blackbox api providing methods to add/del/search/enumerate objects * Statically-sized hashes for names/indexes * Per-set bitmask to indicate free indexes * Separate methods for index alloc/delete/resize Basically, there should not be any user-visible changes except the following: * reducing table_max is not supported * flush & add change table type won't work if table is referenced Sponsored by: Yandex LLC	2014-06-12 09:59:11 +00:00
tuexen	bde098f08a	Use ENOBUFS instead of ENOMEM in error situations related to m_uiotombuf(). This was suggested by kevlo@. MFC after: 3 days	2014-06-05 12:51:12 +00:00
kevlo	22c7fedfe4	Fix build UDP-Lite with VIMAGE enabled when building with gcc. Reported and tested by: Jason Hellenthal	2014-06-03 01:30:32 +00:00
hiren	cc47b6d947	ECN marking implenetation for dummynet. Changes include both DCTCP and RFC 3168 ECN marking methodology. DCTCP draft: http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00 Submitted by: Midori Kato (aoimidori27@gmail.com) Worked with: Lars Eggert (lars@netapp.com) Reviewed by: luigi, hiren	2014-06-01 07:28:24 +00:00
bz	df1284e013	While PAWS is disabled, there are no consumers for the tcp options argument to tcp_twcheck(); thus mark it __unused. MFC after: 2 weeks	2014-05-30 22:34:06 +00:00
asomers	7ca8bf0f2c	Fix unintended KBI change from r264905. Add _fib versions of ifa_ifwithnet() and ifa_ifwithdstaddr() The legacy functions will call the _fib() versions with RT_ALL_FIBS, preserving legacy behavior. sys/net/if_var.h sys/net/if.c Add legacy-compatible functions as described above. Ensure legacy behavior when RT_ALL_FIBS is passed as fibnum. sys/netinet/in_pcb.c sys/netinet/ip_output.c sys/netinet/ip_options.c sys/net/route.c sys/net/rtsock.c sys/netinet6/nd6.c Call with _fib() functions if we must use a specific fib, or the legacy functions otherwise. tests/sys/netinet/fibs_test.sh tests/sys/netinet/udp_dontroute.c Improve the udp_dontroute test. The bug that this test exercises is that ifa_ifwithnet() will return the wrong address, if multiple interfaces have addresses on the same subnet but with different fibs. The previous version of the test only considered one possible failure mode: that ifa_ifwithnet_fib() might fail to find any suitable address at all. The new version also checks whether ifa_ifwithnet_fib() finds the correct address by checking where the ARP request goes. Reported by: bz, hrs Reviewed by: hrs MFC after: 1 week X-MFC-with: 264905 Sponsored by: Spectra Logic	2014-05-29 21:03:49 +00:00
jilles	1effdc7fef	netinet/in.h: Expose htonl(), htons(), ntohl() and ntohs() in strict POSIX mode. Put the htonl(), htons(), ntohl() and ntohs() declarations under __POSIX_VISIBLE >= 200112. POSIX.1-2001 and newer require these to be exposed from <netinet/in.h> (as well as <arpa/inet.h>). Note that it may be unnecessary to check __POSIX_VISIBLE >= 200112 because older versions of POSIX and the C standard do not define this header. However, other places in the same file already perform the check. PR: 188316 Submitted by: Christian Neukirchen	2014-05-29 15:23:37 +00:00
adrian	9cac8fd2f5	The users of RSS shouldn't be directly concerned about hash -> CPU ID mappings. Instead, they should be first mapping to an RSS bucket and then querying the RSS bucket -> CPU ID mapping to figure out the target CPU. When (if?) RSS rebalancing is implemented or some other (non round-robin) distribution of work from buckets to CPU IDs, various bits of code - both userland and kernel - will need to know how this mapping works. So, to support this: * Add a new function rss_m2bucket() - this maps an mbuf to a given bucket. Anything which is currently doing hash -> CPU work may instead wish to do hash -> bucket, and then query the bucket->cpuid map for which CPU it belongs on. Or, map it to a bucket, then re-pin that bucket -> CPU during a rebalance operation. * For userland applications which wish to exploit affinity to RSS buckets, the bucket -> CPU ID mapping is now available via a sysctl. net.inet.rss.bucket_mapping lists the bucket to CPU ID mapping via a list of bucket:cpu pairs.	2014-05-27 08:06:20 +00:00
bz	606b94139a	Remove the prototpye for the static inline function tcp_signature_verify_input(). The function is defined before first use already. MFC after: 2 weeks	2014-05-24 15:31:40 +00:00
bz	cb0049082b	syncache_lookup() is a file local function. Make it static and take it out of the public KPI; seems it was never used elsewhere. MFC after: 2 weeks	2014-05-24 15:03:36 +00:00
bz	618ca8446a	Make tcp_twrespond() file local private; this removes it from the public KPI; it is not used anywhere else and seems it never was. MFC after: 2 weeks	2014-05-24 14:01:18 +00:00
bz	b513252b3a	Remove the prototypes for things that are no longer file local but were moved to the header file. Pointy hat to: clang \|\| bz MFC after: 2 weeks X-MFC with: r266596 Reported by: gcc build of sparc64	2014-05-23 21:12:33 +00:00
bz	c4e312930b	Move the tcp_fields_to_host() and tcp_fields_to_net() (inline) functions to the tcp_var.h header file in order to avoid further duplication with upcoming commits. Reviewed by: np MFC after: 2 weeks	2014-05-23 20:15:01 +00:00
adrian	22b3cc8e05	Use CPU_FIRST() / CPU_NEXT() to iterate over the valid CPU IDs.	2014-05-22 07:25:36 +00:00
adrian	7bdec225ef	When RSS is enabled and per cpu TCP timers are enabled, do an RSS lookup for the inp flowid/flowtype to destination CPU. This only modifies the case where RSS is enabled and the per-cpu tcp timer option is enabled. Otherwise the behaviour should be the same as before.	2014-05-18 22:39:01 +00:00
adrian	429607ed6b	* When copying the flowid from inp -> outbound mbuf, also assign the hashtype to to the outbound mbuf as well as the flowid. * Add in socket options to fetch the hashid, the hashtype and RSS CPU ID for a given socket.	2014-05-18 22:37:31 +00:00
adrian	ce4c5bcf33	Ensure that the flowid hashtype is assigned to the inp if the flowid is also assigned.	2014-05-18 22:34:06 +00:00
adrian	6cb1f60ec9	Add a new function to do a CPU ID lookup based on RSS hash information. This is intended to be used by various places that wish to hash some information about a TCP/UDP/IP flow but don't necessarily have a live mbuf to do it with. Refactor rss_m2cpuid() to use the refactored function.	2014-05-18 22:32:04 +00:00
adrian	f91e4baca7	Add the flowtype to the inpcb. The flowid isn't enough to use as part of any RSS related CPU affinity lookups - the RSS code would like to know what kind of hash it is.	2014-05-18 22:30:12 +00:00
melifaro	f4783a05e9	Fix wrong formatting of 0.0.0.0/X table records in ipfw(8). Add `flags` u16 field to the hole in ipfw_table_xentry structure. Kernel has been guessing address family for supplied record based on xent length size. Userland, however, has been getting fixed-size ipfw_table_xentry structures guessing address family by checking address by IN6_IS_ADDR_V4COMPAT(). Fix this behavior by providing specific IPFW_TCF_INET flag for IPv4 records. PR: bin/189471 Submitted by: Dennis Yusupoff <dyr@smartspb.net> MFC after: 2 weeks	2014-05-17 13:45:03 +00:00
glebius	e21a2ccdb6	Provide compatibility #define after r265408. Suggested by: truckman	2014-05-17 12:33:27 +00:00
adrian	784a0cfd01	Reserve IP_FLOWID, IP_FLOWTYPE, IP_RSSCPUID socket option IDs for near-term future use. These are intended to fetch the current flow id, flow hash type (M_HASHTYPE_* from the sys/mbuf.h) and if RSS is enabled, the RSS destined CPU ID for the receive path.	2014-05-17 00:09:12 +00:00
silby	278abd3220	Remove the function tcp_twrecycleable; it has been #if 0'd for eight years. The original concept was to improve the corner case where you run out of ephemeral ports, but it was causing performance problems and the mechanism of limiting the number of time_wait sockets serves the same purpose in the end. Reviewed by: bz	2014-05-16 01:38:38 +00:00
yongari	40228299cd	Fix checksum computation. Previously it didn't include carry. Reviewed by: tuexen	2014-05-13 05:07:03 +00:00
tuexen	eff58a975e	Disable TX checksum offload for UDP-Lite completely. It wasn't used for partial checksum coverage, but even for full checksum coverage it doesn't work. This was discussed with Kevin Lo (kevlo@).	2014-05-12 09:46:48 +00:00
tuexen	285d10d30f	Whitespace change.	2014-05-10 08:48:04 +00:00
tuexen	7bb91ee942	Fix a logic bug which prevented the sending of UDP packet with 0 checksum. This bug was introduced in r264212 and should be X-MFCed with that revision, if UDP-Lite support if MFCed.	2014-05-09 14:15:48 +00:00
tuexen	b6b80cf495	Use KASSERTs as suggested by glebius@ MFC after: 3 days X-MFC with: 265691	2014-05-08 20:47:54 +00:00
tuexen	c296f5d2af	For some UDP packets (for example with 200 byte payload) and IP options, the IP header and the UDP header are not in the same mbuf. Add code to in_delayed_cksum() to deal with this case. MFC after: 3 days	2014-05-08 17:27:46 +00:00
tuexen	ebf0ea80e4	Remove unused code. This is triggered by the bugreport of Sylvestre Ledru which deal with useless code in the user land stack: https://bugzilla.mozilla.org/show_bug.cgi?id=1003929 MFC after: 3 days	2014-05-06 16:51:07 +00:00
glebius	68ca190323	- Remove net.inet.tcp.reass.overflows sysctl. It counts exactly same events that tcpstat's tcps_rcvmemdrop counter counts. - Rename tcps_rcvmemdrop to tcps_rcvreassfull and improve its description in netstat(1) output. Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-05-06 00:00:07 +00:00
glebius	9f8ba75695	The tcp_log_addrs() uses th pointer, which points into the mbuf, thus we can not free the mbuf before tcp_log_addrs(). Sponsored by: Nginx, Inc. Sponsored by: Netflix	2014-05-05 21:33:20 +00:00
glebius	d2bbb6646b	The FreeBSD-SA-14:08.tcp was a lesson on not doing acrobatics with mixing on stack memory and UMA memory in one linked list. Thus, rewrite TCP reassembly code in terms of memory usage. The algorithm remains unchanged. We actually do not need extra memory to build a reassembly queue. Arriving mbufs are always packet header mbufs. So we got the length of data as pkthdr.len. We got m_nextpkt for linkage. And we need only one pointer to point at the tcphdr, use PH_loc for that. In tcpcb the t_segq fields becomes mbuf pointer. The t_segqlen field now counts not packets, but bytes in the queue. This gives us more precision when comparing to socket buffer limits. Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-05-04 23:25:32 +00:00
melifaro	876586a2b1	Fix panic on IPv4 address removal introduced in r265279. Reported by: Trond Endrestøl MFC with: r265279	2014-05-03 20:22:13 +00:00
melifaro	a4407f98c0	Pass radix head ptr along with rte to rtexpunge(). Rename rtexpunge to rt_expunge().	2014-05-03 16:28:54 +00:00
delphij	4007fe7eae	Fix TCP reassembly vulnerability. Patch done by: glebius Security: FreeBSD-SA-14:08.tcp Security: CVE-2014-3000	2014-04-30 04:02:57 +00:00
asomers	130691029e	Fix a panic when removing an IP address from an interface, if the same address exists on another interface. The panic was introduced by change 264887, which changed the fibnum parameter in the call to rtalloc1_fib() in ifa_switch_loopback_route() from RT_DEFAULT_FIB to RT_ALL_FIBS. The solution is to use the interface fib in that call. For the majority of users, that will be equivalent to the legacy behavior. PR: kern/189089 Reported by: neel Reviewed by: neel MFC after: 3 weeks X-MFC with: 264887 Sponsored by: Spectra Logic	2014-04-29 14:46:45 +00:00
asomers	f8a34b6f49	Fix subnet and default routes on different FIBs on the same subnet. These two bugs are closely related. The root cause is that ifa_ifwithnet does not consider FIBs when searching for an interface address. sys/net/if_var.h sys/net/if.c Add a fib argument to ifa_ifwithnet and ifa_ifwithdstadddr. Those functions will only return an address whose interface fib equals the argument. sys/net/route.c Update calls to ifa_ifwithnet and ifa_ifwithdstaddr with fib arguments. sys/netinet/in.c Update in_addprefix to consider the interface fib when adding prefixes. This will prevent it from not adding a subnet route when one already exists on a different fib. sys/net/rtsock.c sys/netinet/in_pcb.c sys/netinet/ip_output.c sys/netinet/ip_options.c sys/netinet6/nd6.c Add RT_DEFAULT_FIB arguments to ifa_ifwithdstaddr and ifa_ifwithnet. In some cases it there wasn't a clear specific fib number to use. In others, I was unable to test those functions so I chose RT_DEFAULT_FIB to minimize divergence from current behavior. I will fix some of the latter changes along with PR kern/187553. tests/sys/netinet/fibs_test.sh tests/sys/netinet/udp_dontroute.c tests/sys/netinet/Makefile Revert r263738. The udp_dontroute test was right all along. However, bugs kern/187550 and kern/187553 cancelled each other out when it came to this test. Because of kern/187553, ifa_ifwithnet searched the default fib instead of the requested one, but because of kern/187550, there was an applicable subnet route on the default fib. The new test added in r263738 doesn't work right, however. I can verify with dtrace that ifa_ifwithnet returned the wrong address before I applied this commit, but route(8) miraculously found the correct interface to use anyway. I don't know how. Clear expected failure messages for kern/187550 and kern/187552. PR: kern/187550 PR: kern/187552 Reviewed by: melifaro MFC after: 3 weeks Sponsored by: Spectra Logic	2014-04-24 23:56:56 +00:00
asomers	6e7494c7e1	Fix host and network routes for new interfaces when net.add_addr_allfibs=0 sys/net/route.c In rtinit1, use the interface fib instead of the process fib. The latter wasn't very useful because ifconfig(8) is usually invoked with the default process fib. Changing ifconfig(8) to use setfib(2) would be redundant, because it already sets the interface fib. tests/sys/netinet/fibs_test.sh Clear the expected ATF failure sys/net/if.c Pass the interface fib in calls to rtrequest1_fib and rtalloc1_fib sys/netinet/in.c sys/net/if_var.h Add a fibnum argument to ifa_switch_loopback_route, a subroutine of in_scrubprefix. Pass it the interface fib. PR: kern/187549 Reviewed by: melifaro MFC after: 3 weeks Sponsored by: Spectra Logic Corporation	2014-04-24 17:23:16 +00:00
smh	977eea5c3a	Fix jailed raw sockets not setting the correct source address by calling in_pcbladdr instead of prison_get_ip4 MFC after: 1 month	2014-04-24 12:52:31 +00:00
tuexen	17495be0b7	Don't free an mbuf twice. This only happens in very rare error cases where the peer sends illegal sequencing information in DATA chunks for an existing association. MFC after: 3 days.	2014-04-23 21:20:55 +00:00
rmacklem	87348af494	Add {} braces so that the code conforms to the indentation. Fortunately, I don't think doing the assignment of cap->tsomax unconditionally causes any problem. Reviewed by: glebius MFC after: 2 weeks	2014-04-21 19:17:19 +00:00
tuexen	0d6ccd58e0	Add consistency checks to ensure that fragments of a user message have the same U-bit. MFC after: 3 days	2014-04-20 21:11:39 +00:00
tuexen	eaf7d5a955	Send also a packet containing an ABORT chunk in response to an OOTB packet containing a COOKIE-ECHO chunk. MFC after: 3 days	2014-04-20 18:15:23 +00:00
tuexen	03865d88ca	Use consistently debug output instead of an unconditional printf. MFC after: 3 days	2014-04-19 20:55:51 +00:00
tuexen	156da197a9	Send the correct error cause, when a DATA chunk with no user data is received. This bug was reported by Irene Ruengeler. MFC after: 3 days	2014-04-19 19:21:06 +00:00
jhb	4a20a393db	Some whitespace and style fixes. Submitted by: bde	2014-04-11 21:00:59 +00:00
jhb	3f6d278d38	The tw_pcbrele() function does not need the global timewait lock. Submitted by: Julien Charbon Suggested by: glebius	2014-04-11 19:17:45 +00:00
jhb	3bdbc6b29a	Don't leak the TCP pcbinfo lock if a time wait connection is closed in between grabbing a reference on the connection structure and obtaining the pcbinfo lock. Reviewed by: Julien Charbon	2014-04-11 13:11:43 +00:00
jhb	7d16e10f89	Currently, the TCP slow timer can starve TCP input processing while it walks the list of connections in TIME_WAIT closing expired connections due to contention on the global TCP pcbinfo lock. To remediate, introduce a new global lock to protect the list of connections in TIME_WAIT. Only acquire the TCP pcbinfo lock when closing an expired connection. This limits the window of time when TCP input processing is stopped to the amount of time needed to close a single connection. Submitted by: Julien Charbon <jcharbon@verisign.com> Reviewed by: rwatson, rrs, adrian MFC after: 2 months	2014-04-10 18:15:35 +00:00
kevlo	2302f326b9	Remove a bogus re-assignment.	2014-04-08 01:54:50 +00:00
kevlo	bd0aef7d51	Minor style cleanups.	2014-04-07 01:55:53 +00:00
kevlo	45fcb795ff	Add support for UDP-Lite protocol (RFC 3828) to IPv4 and IPv6 stacks. Tested with vlc and a test suite [1]. [1] http://www.erg.abdn.ac.uk/~gerrit/udp-lite/files/udplite_linux.tar.gz Reviewed by: jhb, glebius, adrian	2014-04-07 01:53:03 +00:00
hiren	51495517fa	Improve readability of comments for DELAY_ACK() macro.	2014-04-03 01:46:03 +00:00
tuexen	5dda26e88d	Increment the SSN only after processing the last fragment of an ordered user message. MFC after: 3 days	2014-04-01 18:38:04 +00:00
ae	67faa89c61	Don't copy the MF flag from original IP header to ICMP error message. PR: 188092 MFC after: 1 week Sponsored by: Yandex LLC	2014-03-31 13:00:49 +00:00
tuexen	a92bbbfa13	Handle an edge case of address management similar to TCP. This needs to be reconsidered when the address handling will be reimplemented. The patch is from rrs@. MFC after: 3 days	2014-03-29 21:26:45 +00:00
tuexen	e3da2350c9	Use SCTP_OVER_UDP_TUNNELING_PORT more consistently. MFC after: 3 days	2014-03-29 20:21:36 +00:00
asomers	0b79334339	Correct ARP update handling when the routes for network interfaces are restricted to a single FIB in a multifib system. Restricting an interface's routes to the FIB to which it is assigned (by setting net.add_addr_allfibs=0) causes ARP updates to fail with "arpresolve: can't allocate llinfo for x.x.x.x". This is due to the ARP update code hard coding it's lookup for existing routing entries to FIB 0. sys/netinet/in.c: When dealing with RTM_ADD (add route) requests for an interface, use the interface's assigned FIB instead of the default (FIB 0). sys/netinet/if_ether.c: In arpresolve(), enhance error message generated when an lla_lookup() fails so that the interface causing the error is visible in logs. tests/sys/netinet/fibs_test.sh Clear ATF expected error. PR: kern/167947 Submitted by: Nikolay Denev <ndenev@gmail.com> (previous version) Reviewed by: melifaro MFC after: 3 weeks Sponsored by: Spectra Logic Corporation	2014-03-26 22:46:03 +00:00
hiren	d452acf0c9	Correct the comments as support for RFC 1644 has been removed for a long time.	2014-03-25 21:57:50 +00:00
tuexen	529b98b943	* Provide information in error causes in ASCII instead of proprietary binary format. * Add support for a diagnostic information error cause. The code is sysctlable and the default is 0, which means it is not sent. This is joint work with rrs@. MFC after: 1 week	2014-03-16 12:32:16 +00:00
rwatson	f411704afc	Several years after initial development, merge prototype support for linking NIC Receive Side Scaling (RSS) to the network stack's connection-group implementation. This prototype (and derived patches) are in use at Juniper and several other FreeBSD-using companies, so despite some reservations about its maturity, merge the patch to the base tree so that it can be iteratively refined in collaboration rather than maintained as a set of gradually diverging patch sets. (1) Merge a software implementation of the Toeplitz hash specified in RSS implemented by David Malone. This is used to allow suitable pcbgroup placement of connections before the first packet is received from the NIC. Software hashing is generally avoided, however, due to high cost of the hash on general-purpose CPUs. (2) In in_rss.c, maintain authoritative versions of RSS state intended to be pushed to each NIC, including keying material, hash algorithm/ configuration, and buckets. Provide software-facing interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both the RSS standardised Toeplitz and a 'naive' variation with a hash efficient in software but with poor distribution properties. Implement rss_m2cpuid()to be used by netisr and other load balancing code to look up the CPU on which an mbuf should be processed. (3) In the Ethernet link layer, allow netisr distribution using RSS as a source of policy as an alternative to source ordering; continue to default to direct dispatch (i.e., don't try and requeue packets for processing on the 'right' CPU if they arrive in a directly dispatchable context). (4) Allow RSS to control tuning of connection groups in order to align groups with RSS buckets. If a packet arrives on a protocol using connection groups, and contains a suitable hardware-generated hash, use that hash value to select the connection group for pcb lookup for both IPv4 and IPv6. If no hardware-generated Toeplitz hash is available, we fall back on regular PCB lookup risking contention rather than pay the cost of Toeplitz in software -- this is a less scalable but, at my last measurement, faster approach. As core counts go up, we may want to revise this strategy despite CPU overhead. Where device drivers suitably configure NICs, and connection groups / RSS are enabled, this should avoid both lock and line contention during connection lookup for TCP. This commit does not modify any device drivers to tune device RSS configuration to the global RSS configuration; patches are in circulation to do this for at least Chelsio T3 and Intel 1G/10G drivers. Currently, the KPI for device drivers is not particularly robust, nor aware of more advanced features such as runtime reconfiguration/rebalancing. This will hopefully prove a useful starting point for refinement. No MFC is scheduled as we will first want to nail down a more mature and maintainable KPI/KBI for device drivers. Sponsored by: Juniper Networks (original work) Sponsored by: EMC/Isilon (patch update and merge)	2014-03-15 00:57:50 +00:00
glebius	80e85e32a5	Remove AppleTalk support. AppleTalk was a network transport protocol for Apple Macintosh devices in 80s and then 90s. Starting with Mac OS X in 2000 the AppleTalk was a legacy protocol and primary networking protocol is TCP/IP. The last Mac OS X release to support AppleTalk happened in 2009. The same year routing equipment vendors (namely Cisco) end their support. Thus, AppleTalk won't be supported in FreeBSD 11.0-RELEASE.	2014-03-14 06:29:43 +00:00
glebius	d494babace	Remove IPX support. IPX was a network transport protocol in Novell's NetWare network operating system from late 80s and then 90s. The NetWare itself switched to TCP/IP as default transport in 1998. Later, in this century the Novell Open Enterprise Server became successor of Novell NetWare. The last release that claimed to still support IPX was OES 2 in 2007. Routing equipment vendors (e.g. Cisco) discontinued support for IPX in 2011. Thus, IPX won't be supported in FreeBSD 11.0-RELEASE.	2014-03-14 02:58:48 +00:00
tuexen	2720ac2544	Put the offset of the CRC32C in csum_data instead of 0. The virtio driver needs the offset to be stored in csum_data, like in the case for UDP and TCP. The virtio problem was reported by Niu Zhixiong <kaiaixi@gmail.com>, who helped in debugging and testing the patch. MFC after: 3 days	2014-03-12 17:18:15 +00:00
tuexen	4f721683c9	SCTP uses CRC32C and not Adler anymore. While there change the reference to RFC 4960. This does not change any code, just comments. MFC after: 3 days	2014-03-12 15:30:40 +00:00
glebius	d734bed796	Since both netinet/ and netinet6/ call into netipsec/ and netpfil/, the protocol specific mbuf flags are shared between them. - Move all M_FOO definitions into a single place: netinet/in6.h, to avoid future clashes. - Resolve clash between M_DECRYPTED and M_SKIP_FIREWALL which resulted in a failure of operation of IPSEC and packet filters. Thanks to Nicolas and Georgios for all the hard work on bisecting, testing and finally finding the root of the problem. PR: kern/186755 PR: kern/185876 In collaboration with: Georgios Amanakis <gamanakis gmail.com> In collaboration with: Nicolas DEFFAYET <nicolas-ml deffayet.com> Sponsored by: Nginx, Inc.	2014-03-12 14:29:08 +00:00
glebius	8a3e4bbebb	- Remove rt_metrics_lite and simply put its members into rtentry. - Use counter(9) for rt_pksent (former rt_rmx.rmx_pksent). This removes another cache trashing ++ from packet forwarding path. - Create zini/fini methods for the rtentry UMA zone. Via initialize mutex and counter in them. - Fix reporting of rmx_pksent to routing socket. - Fix netstat(1) to report "Use" both in kvm(3) and sysctl(3) mode. The change is mostly targeted for stable/10 merge. For head, rt_pksent is expected to just disappear. Discussed with: melifaro Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-03-05 01:17:47 +00:00
glebius	2c50988dca	Remove ifa_ref()/ifa_free(), which are atomic(9), from ip_output(). The ifaddr is already referenced by the rtentry, and we are holding reference on the rtentry throughout the function execution. Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-03-04 19:49:41 +00:00
jhb	d9d6b88f18	Remove more constants related to static sysctl nodes. The MAXID constants were primarily used to size the sysctl name list macros that were removed in r254295. A few other constants either did not have an associated sysctl node, or the associated node used OID_AUTO instead. PR: ports/184525 (exp-run)	2014-02-25 18:44:33 +00:00
glebius	0380a02726	Improve logging of send errors, reporting error code and interface. Reduce code duplication between INET and INET6. Tested by: Lytochkin Boris <lytboris gmail.com>	2014-02-22 19:20:40 +00:00
tuexen	a1eb8eb32c	Remove redundant code and fix a style error. MFC after: 3 days	2014-02-20 20:14:43 +00:00
glebius	f62415c467	o Remove at compile time the HASH_ALL code, that was never tested and is unfinished. However, I've tested my version, it works okay. As before it is unfinished: timeout aren't driven by TCP session state. To enable the HASH_ALL mode, one needs in kernel config: options FLOWTABLE_HASH_ALL o Reduce the alignment on flentry to 64 bytes. Without the FLOWTABLE_HASH_ALL option, twice less memory would be consumed by flows. o API to ip_output()/ip6_output() got even more thin: 1 liner. o Remove unused unions. Simply use fle->f_key[]. o Merge all IPv4 code into flowtable_lookup_ipv4(), and do same flowtable_lookup_ipv6(). Stop copying data to on stack sockaddr structures, simply use key[] on stack. o Move code from flowtable_lookup_common() that actually works on insertion into flowtable_insert(). Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-02-17 11:50:56 +00:00
trociny	42b6b03337	Fixup for r261590 (vnet sysctl handlers cleanup). Reviewed by: glebius	2014-02-09 08:13:17 +00:00
glebius	9d7706f9f4	o Revamp API between flowtable and netinet, netinet6. - ip_output() and ip_output6() simply call flowtable_lookup(), passing mbuf and address family. That's the only code under #ifdef FLOWTABLE in the protocols code now. o Revamp statistics gathering and export. - Remove hand made pcpu stats, and utilize counter(9). - Snapshot of statistics is available via 'netstat -rs'. - All sysctls are moved into net.flowtable namespace, since spreading them over net.inet isn't correct. o Properly separate at compile time INET and INET6 parts. o General cleanup. - Remove chain of multiple flowtables. We simply have one for IPv4 and one for IPv6. - Flowtables are allocated in flowtable.c, symbols are static. - With proper argument to SYSINIT() we no longer need flowtable_ready. - Hash salt doesn't need to be per-VNET. - Removed rudimentary debugging, which use quite useless in dtrace era. The runtime behavior of flowtable shouldn't be changed by this commit. Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-02-07 15:18:23 +00:00
glebius	21845e0d75	Utilize SYSCTL_UMA_CUR() to export usage of syncache and tcp reassembly zones. Sponsored by: Nginx, Inc.	2014-02-07 14:31:51 +00:00
glebius	5e6abd57c8	Catch up on r261590.	2014-02-07 14:26:33 +00:00
peter	e7e77ae2a8	Adjust r239672 from rrs and r258821 from eadler. By definition, the very first FIN is not a duplicate. Process it normally and don't feed it to congestion control as though it were a dupe. Don't prevent CC from seeing later dupe acks while in a half close state.	2014-01-28 21:13:15 +00:00
gnn	5c066f6412	Decrease lock contention within the TCP accept case by removing the INP_INFO lock from tcp_usr_accept. As the PR/patch states this was following the advice already in the code. See the PR below for a full disucssion of this change and its measured effects. PR: 183659 Submitted by: Julian Charbon Reviewed by: jhb	2014-01-28 20:28:32 +00:00
glebius	29bf6ea675	Fix fallout from r241923. Calculate length of payload in pim_input() properly. While here, remove extra variable and incorrect condition before m_pullup(). Reported by: Olivier Cochard-Labbé <olivier cochard.me> Sponsored by: Nginx, Inc.	2014-01-22 10:57:42 +00:00
melifaro	6ee59753ec	Further rework netinet6 address handling code: * Set ia address/mask values BEFORE attaching to address lists. Inet6 address assignment is not atomic, so the simplest way to do this atomically is to fill in ia before attach. * Validate irfa->ia_addr field before use (we permit ANY sockaddr in old code). * Do some renamings: in6_ifinit -> in6_notify_ifa (interaction with other subsystems is here) in6_setup_ifa -> in6_broadcast_ifa (LLE/Multicast/DaD code) in6_ifaddloop -> nd6_add_ifa_lle in6_ifremloop -> nd6_rem_ifa_lle * Split working with LLE and route announce code for last two. Add temporary in6_newaddrmsg() function to mimic current rtsock behaviour. * Call device SIOCSIFADDR handler IFF we're adding first address. In IPv4 we have to call it on every address change since ARP record is installed by arp_ifinit() which is called by given handler. IPv6 stack, on the opposite is responsible to call nd6_add_ifa_lle() so there is no reason to call SIOCSIFADDR often.	2014-01-19 16:07:27 +00:00
adrian	ea15a84683	If the flowid is available for the mbuf that finalised the creation of a syncache connection, copy it into the inp_flowid field. Without this, an incoming TCP connection won't have an inp_flowid marked until some data comes in, and this means that things like the per-CPU TCP timer option will choose a different CPU for the timer work. (It also means that if one grabbed the flowid via an ioctl from userland, it won't be available until some data has been received.) Sponsored by: Netflix, Inc.	2014-01-18 23:48:20 +00:00
gnn	61b1174206	Fix various places where we don't properly release a lock PR: 185043 Submitted by: Michael Bentkofsky MFC after: 2 weeks	2014-01-16 22:14:54 +00:00
glebius	abd19c8039	Cleanup comments and whitespace. No functional changes.	2014-01-16 12:58:03 +00:00
melifaro	0092f95bcc	Fix refcount leak on netinet ifa. Reviewed by: glebius MFC after: 2 weeks Sponsored by: Yandex LLC	2014-01-16 12:35:18 +00:00
melifaro	479797d59e	Fix ipfw fwd for IPv4 traffic broken by r249894. Problem case: Original lookup returns route with GW set, so gw points to rte->rt_gateway. After that we're changing dst and performing lookup another time. Since fwd host is most probably directly reachable, resulting rte does not contain rt_gateway, so gw is not set. Finally, we end with packet transmitted to proper interface but wrong link-layer address. Found by: lstewart Discussed with: ae,lstewart MFC after: 2 weeks Sponsored by: Yandex LLC	2014-01-16 11:50:00 +00:00

... 2 3 4 5 6 ...

5075 Commits