freebsd-skq

Author	SHA1	Message	Date
Michael Tuexen	f3ba71bee4	Don't check twice that inp is not NULL. Reported by: Coverity CID: 748671 MFC after: 3 days	2014-12-21 13:58:53 +00:00
Warner Losh	61f26cae7d	Where appropriate, use the modern terms for the one true time base (UTC) rather than the archaic (GMT) in comments. Except where the comments are making fun of people doing this (and pedants who insist on the new terms).	2014-12-21 05:07:11 +00:00
Michael Tuexen	b03b5d729a	Fix and harmonize the validation of PR-SCTP policies. Reported by: Coverity CID: 1232044 MFC after: 3 days	2014-12-20 21:17:28 +00:00
Michael Tuexen	ca10a8d944	Cleanup the code. Reported by: Coverity CID: 1232003	2014-12-20 13:47:38 +00:00
Michael Tuexen	142a4d9e86	Add a missing break. Reported by: Coverity CID: 1232014 MFC after: 3 days	2014-12-17 20:34:38 +00:00
Andrey V. Elsukov	44eb8bbe7b	Do not count security policy violation twice. ipsec*_in_reject() do this by their own. Obtained from: Yandex LLC Sponsored by: Yandex LLC	2014-12-11 19:20:13 +00:00
Andrey V. Elsukov	0332a55f0f	Use ipsec4_in_reject() to simplify ip_ipsec_fwd() and ip_ipsec_input(). ipsec4_in_reject() does the same things, also it counts policy violation errors. Obtained from: Yandex LLC Sponsored by: Yandex LLC	2014-12-11 18:55:54 +00:00
Andrey V. Elsukov	0275b2e369	Remove flag/flags argument from the following functions: ipsec_getpolicybyaddr() ipsec4_checkpolicy() ip_ipsec_output() ip6_ipsec_output() The only flag used here was IP_FORWARDING. Obtained from: Yandex LLC Sponsored by: Yandex LLC	2014-12-11 18:35:34 +00:00
Andrey V. Elsukov	619764beab	Remove flags and tunalready arguments from ipsec4_process_packet() and make its prototype similar to ipsec6_process_packet. The flags argument isn't used here, tunalready is always zero. Obtained from: Yandex LLC Sponsored by: Yandex LLC	2014-12-11 17:34:49 +00:00
Andrey V. Elsukov	8922ddbe40	Move ip_ipsec_fwd() from ip_input() into ip_forward(). Remove check for presence PACKET_TAG_IPSEC_IN_DONE mbuf tag from ip_ipsec_fwd(). PACKET_TAG_IPSEC_IN_DONE tag means that packet is already handled by IPSEC code. This means that before IPSEC processing it was destined to our address and security policy was checked in the ip_ipsec_input(). After IPSEC processing packet has new IP addresses and destination address isn't our own. So, anyway we can't check security policy from the mbuf tag, because it corresponds to different addresses. We should check security policy that corresponds to packet attributes in both cases - when it has a mbuf tag and when it has not. Obtained from: Yandex LLC Sponsored by: Yandex LLC	2014-12-11 16:53:29 +00:00
Andrey V. Elsukov	e58320f127	Remove PACKET_TAG_IPSEC_IN_DONE mbuf tag lookup and usage of its security policy. The changed block of code in ip*_ipsec_input() is called when packet has ESP/AH header. Presence of PACKET_TAG_IPSEC_IN_DONE mbuf tag at the same time means that packet was already handled by IPSEC and reinjected in the netisr, and it has another ESP/AH headers (encrypted twice?). Since it was already processed by IPSEC code, the AH/ESP headers were already stripped (and probably outer IP header was stripped too) and security policy from the tdb_ident was applied to those headers. It is incorrect to apply this security policy to current headers. Also make ip_ipsec_input() prototype similar to ip6_ipsec_input(). Obtained from: Yandex LLC Sponsored by: Yandex LLC	2014-12-11 14:58:55 +00:00
Andrey V. Elsukov	dd9cd45b44	Remove check for presence of PACKET_TAG_IPSEC_PENDING_TDB and PACKET_TAG_IPSEC_OUT_CRYPTO_NEEDED mbuf tags. They aren't used in FreeBSD. Instead check presence of PACKET_TAG_IPSEC_OUT_DONE mbuf tag. If it is found, bypass security policy lookup as described in the comment. PACKET_TAG_IPSEC_OUT_DONE tag added to mbuf when IPSEC code finishes ESP/AH processing. Since it was already finished, this means the security policy placed in the tdb_ident was already checked. And there is no reason to check it again here. Obtained from: Yandex LLC Sponsored by: Yandex LLC	2014-12-11 14:43:44 +00:00
Michael Tuexen	39cbb549cc	Include the received chunk padding when reporting an unknown chunk. MFC after: 1 week	2014-12-06 22:57:19 +00:00
Michael Tuexen	d59107f700	Fix the support of mapped IPv4 addresses. Thanks to Mark Bonnekessel and Markus Boese for making me aware of the problems. MFC after: 1 week	2014-12-06 20:00:08 +00:00
Craig Rodrigues	a8da5dd658	MFp4: @181627 Allow UMA allocated memory to be freed when VNET jails are torn down. Differential Revision: D1201 Submitted by: bz Reviewed by: rwatson, gnn	2014-12-06 02:59:59 +00:00
Michael Tuexen	457b4b8836	This is the SCTP specific companion of https://svnweb.freebsd.org/changeset/base/275358 which was provided by Hans Petter Selasky.	2014-12-04 21:17:50 +00:00
Michael Tuexen	4e88d37a2a	Do the renaming of sb_cc to sb_ccc in a way with less code changes by using a macro. This is an alternate approach to https://svnweb.freebsd.org/changeset/base/275326 which is easier to handle upstream. Discussed with: rrs, glebius	2014-12-02 20:29:29 +00:00
Andrey V. Elsukov	2d957916ef	Remove route caching support from ipsec code. It hasn't been used for some time. * remove sa_route_union declaration and route_cache member from struct secashead; * remove key_sa_routechange() call from ICMP and ICMPv6 code; * simplify ip_ipsec_mtu(); * remove #include <net/route.h>; Sponsored by: Yandex LLC	2014-12-02 04:20:50 +00:00
Hans Petter Selasky	c25290420e	Start process of removing the use of the deprecated "M_FLOWID" flag from the FreeBSD network code. The flag is still kept around in the "sys/mbuf.h" header file, but no longer has any users. Instead the "m_pkthdr.rsstype" field in the mbuf structure is now used to decide the meaning of the "m_pkthdr.flowid" field. To modify the "m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX" macros as defined in the "sys/mbuf.h" header file. This patch introduces new behaviour in the transmit direction. Previously network drivers checked if "M_FLOWID" was set in "m_flags" before using the "m_pkthdr.flowid" field. This check has now been replaced by checking if "M_HASHTYPE_GET(m)" is different from "M_HASHTYPE_NONE". In the future more hashtypes will be added, for example hashtypes for hardware dedicated flows. "M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is valid and has no particular type. This change removes the need for an "if" statement in TCP transmit code checking for the presence of a valid flowid value. The "if" statement mentioned above is now a direct variable assignment which is then later checked by the respective network drivers like before. Additional notes: - The SCTP code changes will be committed as a separate patch. - Removal of the "M_FLOWID" flag will also be done separately. - The FreeBSD version has been bumped. MFC after: 1 month Sponsored by: Mellanox Technologies	2014-12-01 11:45:24 +00:00
Gleb Smirnoff	2cbcd3c198	Merge from projects/sendfile: - Provide pru_ready function for TCP. - Don't call tcp_output() from tcp_usr_send() if no ready data was put into the socket buffer. - In case of dropped connection don't try to m_freem() not ready data. Sponsored by: Nginx, Inc. Sponsored by: Netflix	2014-11-30 13:43:52 +00:00
Gleb Smirnoff	651e4e6a30	Merge from projects/sendfile: extend protocols API to support sending not ready data: o Add new flag to pru_send() flags - PRUS_NOTREADY. o Add new protocol method pru_ready(). Sponsored by: Nginx, Inc. Sponsored by: Netflix	2014-11-30 13:24:21 +00:00
Gleb Smirnoff	0f9d0a73a4	Merge from projects/sendfile: o Introduce a notion of "not ready" mbufs in socket buffers. These mbufs are now being populated by some I/O in background and are referenced outside. This forces following implications: - An mbuf which is "not ready" can't be taken out of the buffer. - An mbuf that is behind a "not ready" in the queue neither. - If socket buffer is flushed, then "not ready" mbufs shouldn't be freed. o In struct sockbuf the sb_cc field is split into sb_ccc and sb_acc. The sb_ccc stands for "claimed character count", or "committed character count". And the sb_acc is "available character count". Consumers of socket buffer API shouldn't already access them directly, but use sbused() and sbavail() respectively. o Not ready mbufs are marked with M_NOTREADY, and ready but blocked ones with M_BLOCKED. o New field sb_fnrdy points to the first not ready mbuf, to avoid linear search. o New function sbready() is provided to activate certain amount of mbufs in a socket buffer. A special note on SCTP: SCTP has its own sockbufs. Unfortunately, FreeBSD stack doesn't yet allow protocol specific sockbufs. Thus, SCTP does some hacks to make itself compatible with FreeBSD: it manages sockbufs on its own, but keeps sb_cc updated to inform the stack of amount of data in them. The new notion of "not ready" data isn't supported by SCTP. Instead, only a mechanical substitute is done: s/sb_cc/sb_ccc/. A proper solution would be to take away struct sockbuf from struct socket and allow protocols to implement their own socket buffers, like SCTP already does. This was discussed with rrs@. Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-11-30 12:52:33 +00:00
Gleb Smirnoff	300fa232ee	Missed in r274421: use sbavail() instead of bare access to sb_cc.	2014-11-30 12:11:01 +00:00
Alexander V. Chernikov	74860d4f7c	Do not return unlocked/unreferenced lle in arpresolve/nd6_storelladdr - return lle flags IFF needed. Do not pass rte to arpresolve - pass is_gateway flag instead.	2014-11-27 23:06:25 +00:00
Julien Charbon	71da715374	Re-introduce padding fields removed with r264321 to keep struct tcptw ABI unchanged. Suggested by: jhb Approved by: jhb (mentor) MFC after: 1 day X-MFC-With: r264321	2014-11-17 14:56:02 +00:00
Alexander V. Chernikov	7f948f12f6	Finish r274175: do control plane MTU tracking. Update route MTU in case of ifnet MTU change. Add new RTF_FIXEDMTU to track explicitly specified MTU. Old behavior: ifconfig em0 mtu 1500->9000 -> all routes traversing em0 do not change MTU. User has to manually update all routes. ifconfig em0 mtu 9000->1500 -> all routes traversing em0 do not change MTU. However, if ip[6]_output finds route with rt_mtu > interface mtu, rt_mtu gets updated. New behavior: ifconfig em0 mtu 1500->9000 -> all interface routes in all fibs gets updated with new MTU unless RTF_FIXEDMTU flag set on them. ifconfig em0 mtu 9000->1500 -> all routes in all fibs gets updated with new MTU unless RTF_FIXEDMTU flag set on them AND rt_mtu is less than ifp mtu. route add ... -mtu XXX automatically sets RTF_FIXEDMTU flag. route change .. -mtu 0 automatically removes RTF_FIXEDMTU flag. PR: 194238 MFC after: 1 month CR: D1125	2014-11-17 01:05:29 +00:00
Gleb Smirnoff	cfa6009e36	In preparation of merging projects/sendfile, transform bare access to sb_cc member of struct sockbuf to a couple of inline functions: sbavail() and sbused() Right now they are equal, but once notion of "not ready socket buffer data", will be checked in, they are going to be different. Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-11-12 09:57:15 +00:00
Hans Petter Selasky	3c7c188c16	Fix some minor TSO issues: - Improve description of TSO limits. - Remove a not needed KASSERT() - Remove some not needed variable casts. Sponsored by: Mellanox Technologies Discussed with: lstewart @ MFC after: 1 week	2014-11-11 12:05:59 +00:00
Alexander V. Chernikov	670e8b3b8c	Kill custom in_matroute() radix matching function removing one rte mutex lock. Initially in_matroute() and in_clsroute() in their current state were introduced by r4105 20 years ago. Instead of deleting inactive routes immediately, we kept them in route table, setting RTPRF_OURS flag and some expire time. After that, either GC came or RTPRF_OURS got removed on first-packet. It was a good solution in those days (and probably another decade after that) to keep TCP metrics. However, after moving metrics to TCP hostcache in r122922, most of in_rmx functionality became unused. It might have been used for flushing icmp-originated routes before rte mutexes/refcounting, but I'm not sure about that. So it looks like this is nearly impossible to make GC do its work nowadays: in_rtkill() ignores non-RTPRF_OURS routes. route can only become RTPRF_OURS after dropping last reference via rtfree() which calls in_clsroute(), which, in turn, ignores UP and non-RTF_DYNAMIC routes. Dynamic routes can still be installed via received redirect, but they have default lifetime (no specific rt_expire) and no one has another trie walker to call RTFREE() on them. So, the changelist: * remove custom rnh_match / rnh_close matching function. * remove all GC functions * partially revert r256695 (proto3 is no more used inside kernel, it is not possible to use rt_expire from user point of view, proto3 support is not complete) * Finish r241884 (similar to this commit) and remove remaining IPv6 parts MFC after: 1 month	2014-11-11 02:52:40 +00:00
Alexander V. Chernikov	d1f79a3bfc	Remove kernel handling of ICMP_SOURCEQUENCH. It hasn't been used for a very long time. Additionally, it was deprecated by RFC 6633.	2014-11-10 23:10:01 +00:00
Alexander V. Chernikov	603eaf792b	Remove faith(4) and faithd(8) from base. It looks like the industry has chosen different (and more traditional) stateless/stateful NAT64 as translation mechanism. Last non-trivial commits to both faith(4) and faithd(8) happened more than 12 years ago, so I assume it is time to drop RFC3142 in FreeBSD. No objections from: net@	2014-11-09 21:33:01 +00:00
Andrey V. Elsukov	3e88eb903b	Remove ip6_getdstifaddr() and all functions to work with auxiliary data. It isn't safe to keep unreferenced ifaddrs. Use in6ifa_ifwithaddr() to determine ifaddr corresponding to destination address. Since currently we keep addresses with embedded scope zone, in6ifa_ifwithaddr is called with zero zoneid and marked with XXX. Also remove route and lle lookups from ip6_input. Use in6ifa_ifwithaddr() instead. Sponsored by: Yandex LLC	2014-11-08 19:38:34 +00:00
Andrey V. Elsukov	f325335caf	Overhaul if_gre(4). Split it into two modules: if_gre(4) for GRE encapsulation and if_me(4) for minimal encapsulation within IP. gre(4) changes: * convert to if_transmit; * rework locking: protect access to softc with rmlock, protect from concurrent ioctls with sx lock; * correct interface accounting for outgoing datagrams (count only payload size); * implement generic support for using IPv6 as delivery header; * make implementation conform to the RFC 2784 and partially to RFC 2890; * add support for GRE checksums - calculate for outgoing datagrams and check for incoming datagrams; * add support for sending sequence number in GRE header; * remove support of cached routes. This fixes problem, when gre(4) doesn't work at system startup. But this also removes support for having tunnels with the same addresses for inner and outer header. * deprecate support for various GREXXX ioctls, that aren't used in FreeBSD. Use our standard ioctls for tunnels. me(4): * implementation conform to RFC 2004; * use if_transmit; * use the same locking model as gre(4); PR: 164475 Differential Revision: D1023 No objections from: net@ Relnotes: yes Sponsored by: Yandex LLC	2014-11-07 19:13:19 +00:00
Gleb Smirnoff	6df8a71067	Remove SYSCTL_VNET_* macros, and simply put CTLFLAG_VNET where needed. Sponsored by: Nginx, Inc.	2014-11-07 09:39:05 +00:00
Alexander V. Chernikov	146a181f28	Finish r274118: remove useless fields from struct domain. Sponsored by: Yandex LLC	2014-11-06 14:39:04 +00:00
Alexander V. Chernikov	1a75e3b20f	Make checks for rt_mtu generic: Some virtual if drivers have (ab)used ifa ifa_rtrequest hook to enforce route MTU to be not bigger than interface MTU. While ifa_rtrequest hooking might be an option in some situation, it is not feasible to do MTU checks there: generic (or per-domain) routing code is perfectly capable of doing this. We currently have 3 places where MTU is altered: 1) route addition. In this case domain overrides radix _addroute callback (in[6]_addroute) and all necessary checks/fixes are/can be done there. 2) route change (especially, GW change). In this case, there are no explicit per-domain calls, but one can override rte by setting ifa_rtrequest hook to domain handler (inet6 does this). 3) ifconfig ifaceX mtu YYYY In this case, we have no callbacks, but ip[6]_output performs runtime checks and decreases rt_mtu if necessary. Generally, the goals are to be able to handle all MTU changes in control plane, not in runtime part, and properly deal with increased interface MTU. This commit changes the following: * removes hooks setting MTU from drivers side * adds proper per-domain MTU checks for case 1) * adds generic MTU check for case 2) * The latter is done by using new dom_ifmtu callback since if_mtu denotes L3 interface MTU, e.g. maximum transmitted _packet_ size. However, IPv6 mtu might be different from if_mtu one (e.g. default 1280) for some cases, so we need an abstract way to know maximum MTU size for given interface and domain. * moves rt_setmetrics() before MTU/ifa_rtrequest hooks since it copies user-supplied data which must be checked. * removes RT_LOCK_ASSERT() from other ifa_rtrequest hooks to be able to use these functions on new non-inserted rte. More changes will follow soon. MFC after: 1 month Sponsored by: Yandex LLC	2014-11-06 13:13:09 +00:00
Alexander V. Chernikov	9f25cbe45e	Remove old hack abusing domattach from NFS code. According to IANA RPC uaddr registry, there are no AFs except IPv4 and IPv6, so it's not worth being too abstract here. Remove ne_rtable[AF_MAX+1] and use explicit per-AF radix tries. Use own initialization without relying on domattach code. While I admit that this was one of the rare places in kernel networking code which really was capable of doing multi-AF without any AF-dependent code, it is not possible anymore to rely on dom* code. While here, change terrifying "Invalid radix node head, rn:" message, to different non-understandable "netcred already exists for given addr/mask", but less terrifying. Since we know that rn_addaddr() returns NULL if the same record already exists, we should provide more friendly error. MFC after: 1 month	2014-11-05 00:58:01 +00:00
Hans Petter Selasky	4952ad427a	Restore spares used in "struct tcpcb" and bump "__FreeBSD_version" to indicate need for kernel module re-compilation. Sponsored by: Mellanox Technologies	2014-11-03 13:01:58 +00:00
Michael Tuexen	f885296d70	Don't zero the stats before they are read out. MFC after: 3 days	2014-11-01 10:35:45 +00:00
Andrey V. Elsukov	1d904a55c8	Remove the check for packets with broadcast source from if_gif's encapcheck. The check was recommended in the draft-ietf-ngtrans-mech-05.txt. But it isn't clear, should it compare the source with all direct broadcast addresses in the system or not. RFC 4213 says it is enough to verify that the source address is the address of the encapsulator, as configured on the decapsulator. And this verification can be extended by administrator with any other forms of IPv4 ingress filtering. Discussed with: glebius, melifaro Sponsored by: Yandex LLC	2014-10-31 15:23:24 +00:00
Andrey V. Elsukov	7e4217558c	Fix typo.	2014-10-31 11:40:49 +00:00
Julien Charbon	cea40c4888	Fix a race condition in TCP timewait between tcp_tw_2msl_reuse() and tcp_tw_2msl_scan(). This race condition drives unplanned timewait timeout cancellation. Also simplify implementation by holding inpcb reference and removing tcptw reference counting. Differential Revision: https://reviews.freebsd.org/D826 Submitted by: Marc De la Gueronniere <mdelagueronniere@verisign.com> Submitted by: jch Reviewed By: jhb (mentor), adrian, rwatson Sponsored by: Verisign, Inc. MFC after: 2 weeks X-MFC-With: r264321	2014-10-30 08:53:56 +00:00
Hans Petter Selasky	0e1152fcc2	The SYSCTL data pointers can come from userspace and must not be directly accessed. Although this will work on some platforms, it can throw an exception if the pointer is invalid and then panic the kernel. Add a missing SYSCTL_IN() of "SCTP_BASE_STATS" structure. MFC after: 3 days Sponsored by: Mellanox Technologies	2014-10-28 12:00:39 +00:00
Hans Petter Selasky	614b50ae8b	Preserve limitation of "TCP_CA_NAME_MAX" when matching the algorithm name. MFC after: 3 days Suggested by: gnn @	2014-10-27 16:08:41 +00:00
Hans Petter Selasky	60a945f95d	Make assignments to "net.inet.tcp.cc.algorithm" work by fixing a bad string comparison. MFC after: 3 days Reported by: Jukka Ukkonen <jau789@gmail.com> Sponsored by: Mellanox Technologies	2014-10-27 11:21:47 +00:00
Mateusz Guzik	e015b1ab0a	Avoid dynamic syscall overhead for statically compiled modules. The kernel tracks syscall users so that modules can safely unregister them. But if the module is not unloadable or was compiled into the kernel, there is no need to do this. Achieve this by adding SY_THR_STATIC_KLD macro which expands to SY_THR_STATIC during kernel build and 0 otherwise. Reviewed by: kib (previous version) MFC after: 2 weeks	2014-10-26 19:42:44 +00:00
Michael Tuexen	b3817112b4	Fix a use of an uninitialized variable by making sure that sctp_med_chunk_output() always initializes the reason_code instead of relying on the caller. The variable is only used for debugging purpose. This issue was reported by Peter Bostroem from Google. MFC after: 3 days	2014-10-25 09:25:29 +00:00
Andrey V. Elsukov	a663aa4ce8	Remove redundant check and m_pullup() call.	2014-10-24 13:34:22 +00:00
Hans Petter Selasky	f0188618f2	Fix multiple incorrect SYSCTL arguments in the kernel: - Wrong integer type was specified. - Wrong or missing "access" specifier. The "access" specifier sometimes included the SYSCTL type, which it should not, except for procedural SYSCTL nodes. - Logical OR where binary OR was expected. - Properly assert the "access" argument passed to all SYSCTL macros, using the CTASSERT macro. This applies to both static- and dynamically created SYSCTLs. - Properly assert the data type for both static and dynamic SYSCTLs. In the case of static SYSCTLs we only assert that the data pointed to by the SYSCTL data pointer has the correct size, hence there is no easy way to assert types in the C language outside a C-function. - Rewrote some code which doesn't pass a constant "access" specifier when creating dynamic SYSCTL nodes, which is now a requirement. - Updated "EXAMPLES" section in SYSCTL manual page. MFC after: 3 days Sponsored by: Mellanox Technologies	2014-10-21 07:31:21 +00:00
Michael Tuexen	84f3b49ac9	Fix the reported streams in a SCTP_STREAM_RESET_EVENT, if a sent incoming stream reset request was responded with failed or denied. Thanks to Peter Bostroem from Google for reporting the issue. MFC after: 3 days	2014-10-16 15:36:04 +00:00
Andrey V. Elsukov	0b9f5f8a5f	Overhaul if_gif(4): o convert to if_transmit; o use rmlock to protect access to gif_softc; o use sx lock to protect from concurrent ioctls; o remove a lot of unneeded and duplicated code; o remove cached route support (it won't work with concurrent io); o style fixes. Reviewed by: melifaro Obtained from: Yandex LLC MFC after: 1 month Sponsored by: Yandex LLC	2014-10-14 13:31:47 +00:00
Sean Bruno	882ac53ed7	Handle small file case with regards to plpmtud blackhole detection. Submitted by: Mikhail <mp@lenta.ru> MFC after: 2 weeks Relnotes: yes	2014-10-13 21:06:21 +00:00
Sean Bruno	0f3e3bc526	Catch ipv6 case when attempting to do PLPMTUD blackhole detection. Submitted by: Mikhail <mp@lenta.ru> MFC after: 2 weeks Relnotes: yes	2014-10-13 21:05:29 +00:00
Alexander V. Chernikov	2930362fb1	Fix matching default rule on clear/show commands. Found by: Oleg Ginzburg	2014-10-13 13:49:28 +00:00
Julien Charbon	489dcc9262	A connection in TIME_WAIT state before calling close() actually did not receive any RST packet. Do not set error to ECONNRESET in this case. Differential Revision: https://reviews.freebsd.org/D879 Reviewed by: rpaulo, adrian Approved by: jhb (mentor) Sponsored by: Verisign, Inc.	2014-10-12 23:01:25 +00:00
Robert Watson	f0cace5d94	When deciding whether to call m_pullup() even though there is adequate data in an mbuf, use M_WRITABLE() instead of a direct test of M_EXT; the latter both unnecessarily exposes mbuf-allocator internals in the protocol stack and is also insufficient to catch all cases of non-writability. (NB: m_pullup() does not actually guarantee that a writable mbuf is returned, so further refinement of all of these code paths continues to be required.) Reviewed by: bz MFC after: 3 days Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D900	2014-10-12 15:49:52 +00:00
John Baldwin	585a4290ab	Update ip_divert.ko to depend on version 3 of ipfw.	2014-10-11 16:08:54 +00:00
Bryan Venteicher	81d3ec1763	Add context pointer and source address to the UDP tunnel callback These are needed for the forthcoming vxlan implementation. The context pointer means we do not have to use a spare pointer field in the inpcb, and the source address is required to populate vxlan's forwarding table. While I highly doubt there is an out of tree consumer of the UDP tunneling callback, this change may be difficult to eventually MFC. Phabricator: https://reviews.freebsd.org/D383 Reviewed by: gnn	2014-10-10 06:08:59 +00:00
Bryan Venteicher	a0a9e1b57c	Add missing UDP multicast receive dtrace probes Phabricator: https://reviews.freebsd.org/D924 Reviewed by: rpaulo markj MFC after: 1 month	2014-10-09 22:36:21 +00:00
Michael Tuexen	e03159ea69	Ensure that the flags field of sctp_tmit_chunks is initialized. Thanks to Peter Bostroem from Google for reporting the issue. MFC after: 3 days	2014-10-09 20:08:12 +00:00
Alexander V. Chernikov	779b53d008	Sync to HEAD@r272825.	2014-10-09 15:35:28 +00:00
Marcel Moolenaar	80b47aefa1	Move the SCTP syscalls to netinet with the rest of the SCTP code. The syscalls themselves are tightly coupled with the network stack and therefore should not be in the generic socket code. The following four syscalls have been marked as NOSTD so they can be dynamically registered in sctp_syscalls_init() function: sys_sctp_peeloff sys_sctp_generic_sendmsg sys_sctp_generic_sendmsg_iov sys_sctp_generic_recvmsg The syscalls are also set up to be dynamically registered when COMPAT32 option is configured. As a side effect of moving the SCTP syscalls, getsock_cap needs to be made available outside of the uipc_syscalls.c source file. A proper prototype has been added to the sys/socketvar.h header file. API tests from the SCTP reference implementation have been run to ensure compatibility. (http://code.google.com/p/sctp-refimpl/source/checkout) Submitted by: Steve Kiernan <stevek@juniper.net> Reviewed by: tuexen, rrs Obtained from: Juniper Networks, Inc.	2014-10-09 15:16:52 +00:00
Bryan Venteicher	c19f98eb74	Check for mbuf copy failure when there are multiple multicast sockets This particular case is the only path where the mbuf could be NULL. udp_append() checked for a NULL mbuf only after invoking the tunneling callback. Our only in tree tunneling callback - SCTP - assumed a non NULL mbuf, and it is a bit odd to make the callbacks responsible for checking this condition. This also reduces the differences between the IPv4 and IPv6 code. MFC after: 1 month	2014-10-09 05:17:47 +00:00
Andrey V. Elsukov	5b7a43f546	When tunneling interface is going to insert mbuf into netisr queue after stripping outer header, consider it as new packet and clear the protocols flags. This fixes problems when IPSEC traffic goes through various tunnels and router doesn't send ICMP/ICMPv6 errors. PR: 174602 Obtained from: Yandex LLC MFC after: 2 weeks Sponsored by: Yandex LLC	2014-10-08 21:23:34 +00:00
Michael Tuexen	9ba6106020	Ensure that the list of streams sent in a stream reset parameter fits in an mbuf-cluster. Thanks to Peter Bostroem for drawing my attention to this part of the code.	2014-10-08 15:30:59 +00:00
Michael Tuexen	e29127de2e	Ensure that the number of stream reported in srs_number_streams is consistent with the amount of data provided in the SCTP_RESET_STREAMS socket option. Thanks to Peter Bostroem from Google for drawing my attention to this part of the code.	2014-10-08 15:29:49 +00:00
Alexander V. Chernikov	be8bc45790	Add IP_FW_DUMP_SOPTCODES sopt to be able to determine which opcodes are currently available in kernel.	2014-10-08 11:12:14 +00:00
Sean Bruno	f6f6703f27	Implement PLPMTUD blackhole detection (RFC 4821), inspired by code from xnu sources. If we encounter a network where ICMP is blocked the Needs Frag indicator may not propagate back to us. Attempt to downshift the mss once to a preconfigured value. Default this feature to off for now while we do not have a full PLPMTUD implementation in our stack. Adds the following new sysctl's for control: net.inet.tcp.pmtud_blackhole_detection -- turns on/off this feature net.inet.tcp.pmtud_blackhole_mss -- mss to try for ipv4 net.inet.tcp.v6pmtud_blackhole_mss -- mss to try for ipv6 Adds the following new sysctl's for monitoring: -- Number of times the code was activated to attempt a mss downshift net.inet.tcp.pmtud_blackhole_activated -- Number of times the blackhole mss was used in an attempt to downshift net.inet.tcp.pmtud_blackhole_min_activated -- Number of times that we failed to connect after we downshifted the mss net.inet.tcp.pmtud_blackhole_failed Phabricator: https://reviews.freebsd.org/D506 Reviewed by: rpaulo bz MFC after: 2 weeks Relnotes: yes Sponsored by: Limelight Networks	2014-10-07 21:50:28 +00:00
Alexander V. Chernikov	a5fedf11fc	Sync to HEAD@r272609.	2014-10-06 11:29:50 +00:00
Hans Petter Selasky	b228e6bf57	Minor code styling. Suggested by: glebius @	2014-10-06 06:19:54 +00:00
Michael Tuexen	041353aba4	Remove unused MC_ALIGN macro as suggested by Robert. MFC after: 1 week	2014-10-05 20:30:49 +00:00
Robert Watson	6c572040c6	Eliminate use of M_EXT in IP6_EXTHDR_CHECK() by trimming a redundant 'if'/'else' case: it matches the simple 'else' case that follows. This reduces awareness of external-storage mechanics outside of the mbuf allocator. Reviewed by: bz MFC after: 3 days Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D900	2014-10-05 06:28:53 +00:00
Alexander V. Chernikov	1ce4b35740	Sync to HEAD@r272516.	2014-10-04 12:42:37 +00:00
Hiroki Sato	9c57a5b630	Add an additional routing table lookup when m->m_pkthdr.fibnum is changed at a PFIL hook in ip{,6}_output(). IPFW setfib rule did not perform a routing table lookup when the destination address was not changed. CR: D805	2014-10-02 00:25:57 +00:00
Mark Johnston	00cb6bef99	Add a sysctl, net.inet.icmp.tstamprepl, which can be used to disable replies to ICMP Timestamp packets. PR: 193689 Submitted by: Anthony Cornehl <accornehl@gmail.com> MFC after: 3 weeks Sponsored by: EMC / Isilon Storage Division	2014-10-01 18:07:34 +00:00
Alexander V. Chernikov	31f0d081d8	Remove lock init from radix.c. Radix has never managed its locking itself. The only consumer using radix with embedded rwlock is system routing table. Move per-AF lock inits there.	2014-10-01 14:39:06 +00:00
Michael Tuexen	83e95fb30b	The default for UDPLITE_RECV_CSCOV is zero. RFC 3828 recommends that this means full checksum coverage for received packets. If an application is willing to accept packets with partial coverage, it is expected to use the socket option and provide the minimum coverage it accepts. Reviewed by: kevlo MFC after: 3 days	2014-10-01 05:43:29 +00:00
Michael Tuexen	c6d81a3445	UDPLite requires a checksum. Therefore, discard a received packet if the checksum is 0. MFC after: 3 days	2014-09-30 20:29:58 +00:00
Michael Tuexen	0f4a03663b	If the checksum coverage field in the UDPLITE header is the length of the complete UDPLITE packet, the packet has full checksum coverage. So fix the condition. Reviewed by: kevlo MFC after: 3 days	2014-09-30 18:17:28 +00:00
John Baldwin	a9456c081a	Only define the full inm_print() if KTR_IGMPV3 is enabled at compile time.	2014-09-30 17:26:34 +00:00
Michael Tuexen	03f90784bf	Checksum coverage values larger than 65535 for UDPLite are invalid. Check for this when the user calls setsockopt using UDPLITE_{SEND,RECV}CSCOV. Reviewed by: kevlo MFC after: 3 days	2014-09-28 17:22:45 +00:00
Alexander V. Chernikov	29c47f18da	* Split tcp_signature_compute() into 2 pieces: - tcp_get_sav() - SADB key lookup - tcp_signature_do_compute() - actual computation * Fix TCP signature case for listening socket: do not assume EVERY connection coming to socket with TCP_SIGNATURE set to be md5 signed regardless of SADB key existence for particular address. This fixes the case for routing software having _some_ BGP sessions secured by md5. * Simplify TCP_SIGNATURE handling in tcp_input() MFC after: 2 weeks	2014-09-27 07:04:12 +00:00
Adrian Chadd	3aac064c2f	Remove an un-needed bit of pre-processor work - it all lives inside #ifdef RSS.	2014-09-27 05:14:02 +00:00
John-Mark Gurney	469c4e0465	drop unnecessary ifdef IPSEC's. This file is only compiled when IPSEC is defined... Differential Revision: D839 Reviewed by: bz, glebius, gnn Sponsored by: EuroBSDCon DevSummit	2014-09-26 12:48:54 +00:00
Navdeep Parhar	5acf7269da	Catch up with r271119.	2014-09-24 20:12:40 +00:00
Hans Petter Selasky	9fd573c39d	Improve transmit sending offload, TSO, algorithm in general. The current TSO limitation feature only takes the total number of bytes in an mbuf chain into account and does not limit by the number of mbufs in a chain. Some kinds of hardware are limited by two factors. One is the fragment length and the second is the fragment count. Both of these limits need to be taken into account when doing TSO. Else some kinds of hardware might have to drop completely valid mbuf chains because they cannot be loaded into the given hardware's DMA engine. The new way of doing TSO limitation has been made backwards compatible as input from other FreeBSD developers and will use defaults for values not set. Reviewed by: adrian, rmacklem Sponsored by: Mellanox Technologies MFC after: 1 week	2014-09-22 08:27:27 +00:00
Hiroki Sato	cc45ae406d	Add a change missing in r271916.	2014-09-21 04:38:50 +00:00
Hiroki Sato	89c58b73e0	- Virtualize interface cloner for gre(4). This fixes a panic when destroying a vnet jail which has a gre(4) interface. - Make net.link.gre.max_nesting vnet-local.	2014-09-21 03:56:06 +00:00
Gleb Smirnoff	32c7c51c2a	Mechanically convert to if_inc_counter().	2014-09-19 10:19:51 +00:00
Gleb Smirnoff	22bfa4f5b1	Remove disabled code, that is very unlikely to be ever enabled again, as well as the comment that explains why is it disabled.	2014-09-19 05:23:47 +00:00
Alan Somers	58a39d8c5b	Fix source address selection on unbound sockets in the presence of multiple fibs. Use the mbuf's or the socket's fib instead of RT_ALL_FIBS. Fixes PR 187553. Also fixes netperf's UDP_STREAM test on a nondefault fib. sys/netinet/ip_output.c In ip_output, lookup the source address using the mbuf's fib instead of RT_ALL_FIBS. sys/netinet/in_pcb.c in in_pcbladdr, lookup the source address using the socket's fib, because we don't seem to have the mbuf fib. They should be the same, though. tests/sys/net/fibs_test.sh Clear the expected failure on udp_dontroute. PR: 187553 CR: https://reviews.freebsd.org/D772 MFC after: 3 weeks Sponsored by: Spectra Logic	2014-09-16 15:28:19 +00:00
Michael Tuexen	b60b0fe6fd	Add an explicit cast to silence a warning when building the userland stack on Windows. This issue was reported by Peter Kasting from Google. MFC after: 3 days	2014-09-16 14:39:24 +00:00
Michael Tuexen	47b80412cd	Use a consistent type for the number of HMAC algorithms. This fixes a bug which resulted in a warning on the userland stack, when compiled on Windows. Thanks to Peter Kasting from Google for reporting the issue and providing a potential fix. MFC after: 3 days	2014-09-16 14:20:33 +00:00
Michael Tuexen	667eb48763	Small cleanup which addresses a warning regarding the truncation of a 64-bit entity to a 32-bit entity. This issue was reported by Peter Kasting from Google. MFC after: 3 days	2014-09-16 13:48:46 +00:00
Gleb Smirnoff	3220a2121c	FreeBSD-SA-14:19.tcp raised attention to the state of our stack towards blind SYN/RST spoofed attack. Originally our stack used in-window checks for incoming SYN/RST as proposed by RFC793. Later, circa 2003 the RST attack was mitigated using the technique described in P. Watson "Slipping in the window" paper [1]. After that, the checks were only relaxed for the sake of compatibility with some buggy TCP stacks. First, r192912 introduced the vulnerability, just fixed by aforementioned SA. Second, r167310 had slightly relaxed the default RST checks, instead of utilizing net.inet.tcp.insecure_rst sysctl. In 2010 a new technique for mitigation of these attacks was proposed in RFC5961 [2]. The idea is to send a "challenge ACK" packet to the peer, to verify that packet arrived isn't spoofed. If peer receives challenge ACK it should regenerate its RST or SYN with correct sequence number. This should not only protect against attacks, but also improve communication with broken stacks, so authors of reverted r167310 and r192912 won't be disappointed. [1] http://bandwidthco.com/whitepapers/netforensics/tcpip/TCP Reset Attacks.pdf [2] http://www.rfc-editor.org/rfc/rfc5961.txt Changes made: o Revert r167310. o Implement "challenge ACK" protection as specified in RFC5961 against RST attack. On by default. - Carefully preserve r138098, which handles empty window edge case, not described by the RFC. - Update net.inet.tcp.insecure_rst description. o Implement "challenge ACK" protection as specified in RFC5961 against SYN attack. On by default. - Provide net.inet.tcp.insecure_syn sysctl, to turn off RFC5961 protection. The changes were tested at Netflix. The tested box didn't show any anomalies compared to control box, except slightly increased number of TCP connections in LAST_ACK state. Reviewed by: rrs Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-09-16 11:07:25 +00:00
Michael Tuexen	8a0834ec28	Make a type conversion explicit. When compiling this code on Windows as part of the SCTP userland stack, this fixes a warning reported by Peter Kasting from Google. MFC after: 3 days	2014-09-16 10:57:55 +00:00
Xin LI	831ad37ef2	Fix Denial of Service in TCP packet processing. Submitted by: glebius Security: FreeBSD-SA-14:19.tcp	2014-09-16 09:48:24 +00:00
Michael Tuexen	43f9f175c5	The MTU is handled as a 32-bit entity within the SCTP stack. This was reported by Peter Kasting from Google. MFC after: 3 days	2014-09-16 09:22:43 +00:00
Adrian Chadd	f4659f4c27	Ensure the correct software IPv4 hash is done based on the configured RSS parameters, rather than assuming we're hashing IPv4+UDP and IPv4+TCP.	2014-09-16 03:26:42 +00:00
Michael Tuexen	aa7e5af86f	Chunk IDs are 8 bit entities, not 16 bit. Thanks to Peter Kasting from Google for drawing my attention to it. MFC after: 3 days	2014-09-15 19:38:34 +00:00
Hiroki Sato	9bc11d7bd7	Use generic SYSCTL_* macro instead of deprecated SYSCTL_VNET_*. Suggested by: glebius	2014-09-15 14:43:58 +00:00
Hiroki Sato	348aae2398	Make net.inet.ip.sourceroute, net.inet.ip.accept_sourceroute, and net.inet.ip.process_options vnet-aware. Revert changes in r271545. Suggested by: bz	2014-09-15 07:20:40 +00:00
Hans Petter Selasky	72f3100047	Revert r271504. A new patch to solve this issue will be made. Suggested by: adrian @	2014-09-13 20:52:01 +00:00
Hans Petter Selasky	eb93b77ae4	Improve transmit sending offload, TSO, algorithm in general. The current TSO limitation feature only takes the total number of bytes in an mbuf chain into account and does not limit by the number of mbufs in a chain. Some kinds of hardware are limited by two factors. One is the fragment length and the second is the fragment count. Both of these limits need to be taken into account when doing TSO. Else some kinds of hardware might have to drop completely valid mbuf chains because they cannot be loaded into the given hardware's DMA engine. The new way of doing TSO limitation has been made backwards compatible as input from other FreeBSD developers and will use defaults for values not set. MFC after: 1 week Sponsored by: Mellanox Technologies	2014-09-13 08:26:09 +00:00
Alan Somers	4f8585e021	Revisions 264905 and 266860 added a "int fib" argument to ifa_ifwithnet and ifa_ifwithdstaddr. For the sake of backwards compatibility, the new arguments were added to new functions named ifa_ifwithnet_fib and ifa_ifwithdstaddr_fib, while the old functions became wrappers around the new ones that passed RT_ALL_FIBS for the fib argument. However, the backwards compatibility is not desired for FreeBSD 11, because there are numerous other incompatible changes to the ifnet(9) API. We therefore decided to remove it from head but leave it in place for stable/9 and stable/10. In addition, this commit adds the fib argument to ifa_ifwithbroadaddr for consistency's sake. sys/sys/param.h Increment __FreeBSD_version sys/net/if.c sys/net/if_var.h sys/net/route.c Add fibnum argument to ifa_ifwithbroadaddr, and remove the _fib versions of ifa_ifwithdstaddr, ifa_ifwithnet, and ifa_ifwithroute. sys/net/route.c sys/net/rtsock.c sys/netinet/in_pcb.c sys/netinet/ip_options.c sys/netinet/ip_output.c sys/netinet6/nd6.c Fixup calls of modified functions. share/man/man9/ifnet.9 Document changed API. CR: https://reviews.freebsd.org/D458 MFC after: Never Sponsored by: Spectra Logic	2014-09-11 20:21:03 +00:00
Andrey V. Elsukov	028bdf289d	Add scope zone id to the in_endpoints and hc_metrics structures. A non-global IPv6 address can be used in more than one zone of the same scope. This zone index is used to identify to which zone a non-global address belongs. Also we can have many foreign hosts with equal non-global addresses, but from different zones. So, they can have different metrics in the host cache. Obtained from: Yandex LLC Sponsored by: Yandex LLC	2014-09-10 16:26:18 +00:00
Andrey V. Elsukov	a7e201bbac	Make in6_pcblookup_hash_locked and in6_pcbladdr static. Obtained from: Yandex LLC Sponsored by: Yandex LLC	2014-09-10 13:17:35 +00:00
Andrey V. Elsukov	1b44e5ffe3	Introduce INP6_PCBHASHKEY macro. Replace usage of hardcoded part of IPv6 address as hash key in all places. Obtained from: Yandex LLC	2014-09-10 12:35:42 +00:00
Adrian Chadd	8ad1a83b48	Calculate the RSS hash for outbound UDPv4 frames. Differential Revision: https://reviews.freebsd.org/D527 Reviewed by: grehan	2014-09-09 04:19:36 +00:00
Adrian Chadd	b8bc95cd49	Update the IPv4 input path to handle reassembled frames and incoming frames with no RSS hash. When doing RSS: * Create a new IPv4 netisr which expects the frames to have been verified; it just directly dispatches to the IPv4 input path. * Once IPv4 reassembly is done, re-calculate the RSS hash with the new IP and L3 header; then reinject it as appropriate. * Update the IPv4 netisr to be a CPU affinity netisr with the RSS hash function (rss_soft_m2cpuid) - this will do a software hash if the hardware doesn't provide one. NICs that don't implement hardware RSS hashing will now benefit from RSS distribution - it'll inject into the correct destination netisr. Note: the netisr distribution doesn't work out of the box - netisr doesn't query RSS for how many CPUs and the affinity setup. Yes, netisr likely shouldn't really be doing CPU stuff anymore and should be "some kind of 'thing' that is a workqueue that may or may not have any CPU affinity"; that's for a later commit. Differential Revision: https://reviews.freebsd.org/D527 Reviewed by: grehan	2014-09-09 04:18:20 +00:00
Adrian Chadd	72d33245f5	Implement IPv4 RSS software hash functions to use during packet ingress and egress. * rss_mbuf_software_hash_v4 - look at the IPv4 mbuf to fetch the IPv4 details + direction to calculate a hash. * rss_proto_software_hash_v4 - hash the given source/destination IPv4 address, port and direction. * rss_soft_m2cpuid - map the given mbuf to an RSS CPU ("bucket" for now) These functions are intended to be used by the stack to support the following: * Not all NICs do RSS hashing, so we should support some way of doing a hash in software; * The NIC / driver may not hash frames the way we want (eg UDP 4-tuple hashing when the stack is only doing 2-tuple hashing for UDP); so we may need to re-hash frames; * .. same with IPv4 fragments - they will need to be re-hashed after reassembly; * .. and same with things like IP tunneling and such; * The transmit path for things like UDP, RAW and ICMP don't currently have any RSS information attached to them - so they'll need an RSS calculation performed before transmit. TODO: * Counters! Everywhere! * Add a debug mode that software hashes received frames and compares them to the hardware hash provided by the hardware to ensure they match. The IPv6 part of this is missing - I'm going to do some re-juggling of where various parts of the RSS framework live before I add the IPv6 code (read: the IPv6 code is going to go into netinet6/in6_rss.[ch], rather than living here.) Note: This API is still fluid. Please keep that in mind. Differential Revision: https://reviews.freebsd.org/D527 Reviewed by: grehan	2014-09-09 03:10:21 +00:00
Adrian Chadd	9d3ddf4384	Add support for receiving and setting flowtype, flowid and RSS bucket information as part of recvmsg(). This is primarily used for debugging/verification of the various processing paths in the IP, PCB and driver layers. Unfortunately the current implementation of the control message path results in a ~10% or so drop in UDP frame throughput when it's used. Differential Revision: https://reviews.freebsd.org/D527 Reviewed by: grehan	2014-09-09 01:45:39 +00:00
Adrian Chadd	061a4b4c36	Add a flag to ip_output() - IP_NODEFAULTFLOWID - which prevents it from overriding an existing flowid/flowtype field in the outbound mbuf with the inp_flowid/inp_flowtype details. The upcoming RSS UDP support calculates a valid RSS value for outbound mbufs and since it may change per send, it doesn't cache it in the inpcb. So overriding it here would be wrong. Differential Revision: https://reviews.freebsd.org/D527 Reviewed by: grehan	2014-09-09 00:19:02 +00:00
Alexander V. Chernikov	d6164b77f8	Make ipfw_nat module use IP_FW3 codes. Kernel changes: * Split kernel/userland nat structures eliminating IPFW_INTERNAL hack. * Add IP_FW_NAT44_* codes resembling old ones. * Assume that instances can be named (no kernel support currently). * Use both UH+WLOCK locks for all configuration changes. * Provide full ABI support for old sockopts. Userland changes: * Use IP_FW_NAT44_* codes for nat operations. * Remove undocumented ability to show ranges of nat "log" entries.	2014-09-07 18:30:29 +00:00
Michael Tuexen	ad234e3c3d	Address warnings generated by the clang analyzer. MFC after: 1 week	2014-09-07 18:05:37 +00:00
Michael Tuexen	23602b60fb	Address another warnings reported by Patrick Laimbock when compiling in userspace. While there, improve consistency. MFC after: 1 week	2014-09-07 17:07:19 +00:00
Michael Tuexen	24aaac8d59	Use union sctp_sockstore instead of struct sockaddr_storage. This eliminates some warnings when building in userland. Thanks to Patrick Laimbock for reporting this issue. Remove also some unnecessary casts. There should be no functional change. MFC after: 1 week	2014-09-07 09:06:26 +00:00
Michael Tuexen	95e550801c	Use SYSCTL_PROC instead of SYSCTL_VNET_PROC. Suggested by: glebius@ MFC after: 1 week	2014-09-07 07:49:49 +00:00
Michael Tuexen	24110da033	Fix a leak of an address, if the address is scheduled for removal and the stack is torn down. Thanks to Peter Bostroem and Jiayang Liu from Google for reporting the issue. MFC after: 1 week	2014-09-06 20:03:24 +00:00
Michael Tuexen	f47f328dc5	Fix the handling of sysctl variables when used with VIMAGE. While there do some cleanup of the code. MFC after: 1 week	2014-09-06 19:12:14 +00:00
Alexander V. Chernikov	c9daea0b86	Sync to HEAD@r271160.	2014-09-05 13:52:39 +00:00
Gleb Smirnoff	770aa6cb25	Satisfy assertion in m_demote(). Sponsored by: Nginx, Inc.	2014-09-04 19:28:02 +00:00
John Baldwin	a7c7f2a7e2	In tcp_input(), don't acquire the pcbinfo global write lock for SYN packets targeting a listening socket. Permit to reduce TCP input processing starvation in context of high SYN load (e.g. short-lived TCP connections or SYN flood). Submitted by: Julien Charbon <jcharbon@verisign.com> Reviewed by: adrian, hiren, jhb, Mike Bentkofsky	2014-09-04 19:09:08 +00:00
Gleb Smirnoff	07e845a3f4	Fixes for tcp_respond() comment.	2014-09-04 17:05:57 +00:00
Gleb Smirnoff	ba32fcfff9	Improve r265338. When inserting mbufs into TCP reassembly queue, try to collapse adjacent pieces using m_catpkt(). In best case scenario it copies data and frees mbufs, making mbuf exhaustion attack harder. Suggested by: Jonathan Looney <jonlooney gmail.com> Security: Hardens against remote mbuf exhaustion attack. Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-09-04 09:15:44 +00:00
Gleb Smirnoff	bf7dcda366	Clean up unused CSUM_FRAGMENT. Sponsored by: Nginx, Inc.	2014-09-03 08:30:18 +00:00
Gleb Smirnoff	c26544aa7f	Make SOCK_RAW sockets to be truly raw, not modifying received and sent packets at all. Swapping byte order on SOCK_RAW was actually a bug, an artifact from the BSD network stack, that used to convert a packet to native byte order once it is received by kernel. Other operating systems didn't follow this, and later other BSD descendants fixed this, leaving us alone with the bug. Now it is clear that we should fix the bug. In collaboration with: Olivier Cochard-Labbé <olivier cochard.me> See also: https://wiki.freebsd.org/SOCK_RAW Sponsored by: Nginx, Inc.	2014-09-01 14:04:51 +00:00
Alexander V. Chernikov	0cba2b2802	Add support for multi-field values inside ipfw tables. This is the last major change in given branch. Kernel changes: * Use 64-bytes structures to hold multi-value variables. * Use shared array to hold values from all tables (assume each table algo is capable of holding 32-byte variables). * Add some placeholders to support per-table value arrays in future. * Use simple eventhandler-style API to ease the process of adding new table items. Currently table addition may require multiple UH drops/acquires which is quite tricky due to atomic table modification/swap support, shared array resize, etc. Deal with it by calling special notifier capable of rolling back state before actually performing swap/resize operations. Original operation then restarts itself after acquiring UH lock. * Bump all objhash users default values to at least 64 * Fix custom hashing inside objhash. Userland changes: * Add support for dumping shared value array via "vlist" internal cmd. * Some small print/fill_flags fixes to support u32 values. * valtype is now bitmask of <skipto\|pipe\|fib\|nat\|dscp\|tag\|divert\|netgraph\|limit\|ipv4\|ipv6>. New values can hold distinct values for each of these types. * Provide special "legacy" type which assumes all values are the same. * More helpers/docs following.. Some examples: 3:41 [1] zfscurr0# ipfw table mimimi create valtype skipto,limit,ipv4,ipv6 3:41 [1] zfscurr0# ipfw table mimimi info +++ table(mimimi), set(0) +++ kindex: 2, type: addr references: 0, valtype: skipto,limit,ipv4,ipv6 algorithm: addr:radix items: 0, size: 296 3:42 [1] zfscurr0# ipfw table mimimi add 10.0.0.5 3000,10,10.0.0.1,2a02:978:2::1 added: 10.0.0.5/32 3000,10,10.0.0.1,2a02:978:2::1 3:42 [1] zfscurr0# ipfw table mimimi list +++ table(mimimi), set(0) +++ 10.0.0.5/32 3000,0,10.0.0.1,2a02:978:2::1	2014-08-31 23:51:09 +00:00
Gleb Smirnoff	546451a2e5	Use macros instead of referencing struct if_data that resides in ifnet. Sponsored by: Nginx, Inc.	2014-08-31 06:30:50 +00:00
Michael Tuexen	76031b19ef	Announce SCTP support in the kern.features sysctl variables. MFC after: 3 days	2014-08-26 21:15:34 +00:00
Alexander V. Chernikov	832fd78087	Sync to HEAD@r270409.	2014-08-23 14:58:31 +00:00
Xin LI	a7f77a3950	Restore historical behavior of in_control, which, when no matching address is found, the first usable address is returned for legacy ioctls like SIOCGIFBRDADDR, SIOCGIFDSTADDR, SIOCGIFNETMASK and SIOCGIFADDR. While there also fix a subtle issue that a caller from a jail asking for INADDR_ANY may get the first IP of the host that does not belong to the jail. Submitted by: glebius Differential Revision: https://reviews.freebsd.org/D667	2014-08-22 19:08:12 +00:00
Lawrence Stewart	8b0fe327e8	Destroy the "qdiffsample_zone" UMA zone on unload to avoid a use-after-unload panic easily triggered by running "sysctl -a" after unload. Reported and tested by: Grenville Armitage <garmitage@swin.edu.au> MFC after: 1 week	2014-08-19 02:19:53 +00:00
Alexander V. Chernikov	4bbd15771b	Make room for multi-type values in struct tentry.	2014-08-15 12:58:32 +00:00
Kevin Lo	73d76e77b6	Change pr_output's prototype to avoid the need for explicit casts. This is a follow up to r269699. Phabric: D564 Reviewed by: jhb	2014-08-15 02:43:02 +00:00
Alexander V. Chernikov	c21034b744	Replace "cidr" table type with "addr" type. Suggested by: luigi	2014-08-14 21:43:20 +00:00
Alexander V. Chernikov	18ad419788	* Fix displaying dynamic rules for large rulesets. * Clean up some comments.	2014-08-14 08:21:22 +00:00
Alexander V. Chernikov	1b833d535b	Sync to HEAD@r269943.	2014-08-13 16:20:41 +00:00
Michael Tuexen	f0396ad15e	Add support for the SCTP_PR_STREAM_STATUS and SCTP_PR_ASSOC_STATUS socket options. This includes managing the corresponding stat counters. Add the SCTP_DETAILED_STR_STATS kernel option to control per policy counters on every stream. The default is off and only an aggregated counter is available. This is sufficient for the RTCWeb usecase. MFC after: 1 week	2014-08-13 15:50:16 +00:00
Alexander V. Chernikov	1940fa7727	Change tablearg value to be 0 (try #2 ). Most of the tablearg-supported opcodes do not accept 0 as valid value: O_TAG, O_TAGGED, O_PIPE, O_QUEUE, O_DIVERT, O_TEE, O_SKIPTO, O_CALLRET, O_NETGRAPH, O_NGTEE, O_NAT treat 0 as invalid input. The rest are O_SETDSCP and O_SETFIB. 'Fix' them by adding high-order bit (0x8000) set for non-tablearg values. Do translation in kernel for old clients (import_rule0 / export_rule0), teach current ipfw(8) binary to add/remove given bit. This change does not affect handling SETDSCP values, but limit O_SETFIB values to 32767 instead of 65k. Since currently we have either old (16) or new (2^32) max fibs, this should not be a big deal: we're definitely OK for former and have to add another opcode to deal with latter, regardless of tablearg value.	2014-08-12 15:51:48 +00:00
Michael Tuexen	97a0ca5b3e	Change SCTP sysctl from auth_disable to auth_enable. This is consistent with other similar sysctl variable used in SCTP.	2014-08-12 13:13:11 +00:00
Michael Tuexen	c79bec9c75	Add support for the SCTP_AUTH_SUPPORTED and SCTP_ASCONF_SUPPORTED socket options. Add also a sysctl to control the support of ASCONF. MFC after: 1 week	2014-08-12 11:30:16 +00:00
Alexander V. Chernikov	4f43138ade	* Add the ability to lock/unlock given table from changes. Example: # ipfw table si lock # ipfw table si info +++ table(si), set(0) +++ kindex: 0, type: cidr, locked valtype: number, references: 0 algorithm: cidr:radix items: 0, size: 288 # ipfw table si add 4.5.6.7 ignored: 4.5.6.7/32 0 ipfw: Adding record failed: table is locked # ipfw table si unlock # ipfw table si add 4.5.6.7 added: 4.5.6.7/32 0 # ipfw table si lock # ipfw table si delete 4.5.6.7 ignored: 4.5.6.7/32 0 ipfw: Deleting record failed: table is locked # ipfw table si unlock # ipfw table si delete 4.5.6.7 deleted: 4.5.6.7/32 0	2014-08-11 18:09:37 +00:00
Alexander V. Chernikov	3a845e1076	* Add support for batched add/delete for ipfw tables * Add support for atomic batches add (all or none). * Fix panic on deleting non-existing entry in radix algo. Examples: # si is empty # ipfw table si add 1.1.1.1/32 1111 2.2.2.2/32 2222 added: 1.1.1.1/32 1111 added: 2.2.2.2/32 2222 # ipfw table si add 2.2.2.2/32 2200 4.4.4.4/32 4444 exists: 2.2.2.2/32 2200 added: 4.4.4.4/32 4444 ipfw: Adding record failed: record already exists ^^^^^ Returns error but keeps inserted items # ipfw table si list +++ table(si), set(0) +++ 1.1.1.1/32 1111 2.2.2.2/32 2222 4.4.4.4/32 4444 # ipfw table si atomic add 3.3.3.3/32 3333 4.4.4.4/32 4400 5.5.5.5/32 5555 added(reverted): 3.3.3.3/32 3333 exists: 4.4.4.4/32 4400 ignored: 5.5.5.5/32 5555 ipfw: Adding record failed: record already exists ^^^^^ Returns error and reverts added records # ipfw table si list +++ table(si), set(0) +++ 1.1.1.1/32 1111 2.2.2.2/32 2222 4.4.4.4/32 4444	2014-08-11 17:34:25 +00:00
Hans Petter Selasky	e167cb89a2	Fix string length argument passed to "sysctl_handle_string()" so that the complete string is returned by the function and not just only one byte. PR: 192544 MFC after: 2 weeks	2014-08-10 07:51:55 +00:00
Hiren Panchasara	f7469d3e52	Improve comments by listing the criteria for automatic increment of receive socket buffer. Reviewed by: jmg	2014-08-09 21:01:24 +00:00
Michael Tuexen	82eaf95e8d	Small modification of the sctp_input() cleanup to avoid having code between declarations.	2014-08-09 14:33:44 +00:00
Konstantin Belousov	1216eb3320	Fix one more compiler warning, m is not initialized.	2014-08-08 15:50:02 +00:00
Alexander V. Chernikov	8bd1921248	Partially revert previous commit: "0" value is perfectly valid for O_SETFIB and O_SETDSCP, so tablearg remains to be 65535 for now.	2014-08-08 15:33:26 +00:00
Alexander V. Chernikov	2c452b20dd	* Switch tablearg value from 65535 to 0. * Use u16 table kidx instead of integer on for iface opcode. * Provide compatibility layer for old clients.	2014-08-08 14:23:20 +00:00
Alexander V. Chernikov	adf3b2b9d8	* Add IP_FW_TABLE_XMODIFY opcode * Since there seems to be lack of consensus on strict value typing, remove non-default value types. Use userland-only "value format type" to print values. Kernel changes: * Add IP_FW_XMODIFY to permit table run-time modifications. Currently we support changing limit and value format type. Userland changes: * Support IP_FW_XMODIFY opcode. * Support specifying value format type (ftype) in table create/modify req * Fine-print value type/value format type.	2014-08-08 09:27:49 +00:00
Bjoern A. Zeeb	eb5eb08820	Fix argument to KTR after r269699 to unbreak LINT builds.	2014-08-08 09:17:02 +00:00
Alexander V. Chernikov	28ea4fa355	Remove IP_FW_TABLES_XGETSIZE opcode. It is superseded by IP_FW_TABLES_XLIST.	2014-08-08 06:36:26 +00:00
Kevin Lo	8f5a8818f5	Merge 'struct ip6protosw' and 'struct protosw' into one. Now we have only one protocol switch structure that is shared between ipv4 and ipv6. Phabric: D476 Reviewed by: jhb	2014-08-08 01:57:15 +00:00
Alexander V. Chernikov	a73d728d31	Kernel changes: * Implement proper checks for switching between global and set-aware tables * Split IP_FW_DEL mess into the following opcodes: * IP_FW_XDEL (del rules matching pattern) * IP_FW_XMOVE (move rules matching pattern to another set) * IP_FW_SET_SWAP (swap between 2 sets) * IP_FW_SET_MOVE (move one set to another one) * IP_FW_SET_ENABLE (enable/disable sets) * Add IP_FW_XZERO / IP_FW_XRESETLOG to finish IP_FW3 migration. * Use unified ipfw_range_tlv as range description for all of the above. * Check dynamic states IFF there was non-zero number of deleted dyn rules, * Del relevant dynamic states with single traversal instead of per-rule one. Userland changes: * Switch ipfw(8) to use new opcodes.	2014-08-07 21:37:31 +00:00
Michael Tuexen	317e00ef86	Add support for the SCTP_RECONFIG_SUPPORTED and the corresponding sysctl controlling the negotiation of the RE-CONFIG extension. MFC after: 3 days	2014-08-04 20:07:35 +00:00
Hiren Panchasara	76504ce978	Add a comment for easier code understanding.	2014-08-04 19:42:48 +00:00
Alexander V. Chernikov	46d5200874	Implement atomic ipfw table swap. Kernel changes: * Add opcode IP_FW_TABLE_XSWAP * Add support for swapping 2 tables with the same type/ftype/vtype. * Make skipto cache init after ipfw locks init. Userland changes: * Add "table X swap Y" command.	2014-08-03 21:37:12 +00:00
Michael Tuexen	cb9b8e6f7d	Add support for the SCTP_PKTDROP_SUPPORTED socket option and the corresponding sysctl variable. The default is off, since the specification is not an RFC yet. MFC after: 1 week	2014-08-03 18:12:55 +00:00
Michael Tuexen	2fdf7a7a35	Use consistent names for SCTP sysctls. Rename nr_sack_on_off to nrsack_enable. Please note that this extension is off by default since it is not specified in an RFC (yet).	2014-08-03 15:09:13 +00:00
Michael Tuexen	caea98793f	Add SCTP socket option SCTP_NRSACK_SUPPORTED to control the NRSACK extension. The default will still be off, since it is not an RFC (yet). Changing the sysctl name will be in a separate commit. MFC after: 1 week	2014-08-03 14:10:10 +00:00
Alexander V. Chernikov	5f379342d2	Show algorithm-specific data in "table info" output.	2014-08-03 12:19:45 +00:00
Michael Tuexen	dd973b0e15	Add support for the SCTP_PR_SUPPORTED socket option as specified in http://tools.ietf.org/html/draft-ietf-tsvwg-sctp-prpolicies Add also a sysctl controlling the default of the end-points. MFC after: 1 week	2014-08-02 21:36:40 +00:00
Michael Tuexen	59a86c85bb	Fix a copy and paste error. X-MFC with: 269436	2014-08-02 20:37:02 +00:00
Michael Tuexen	f342355a0e	Cleanup the ECN configuration handling and provide an SCTP socket option for controlling ECN on future associations and get the status on current associations. A similar pattern will be used for controlling SCTP extensions in upcoming commits.	2014-08-02 17:35:13 +00:00
Michael Tuexen	47aac6fa4b	Remove the asconf_auth_nochk sysctl. This was off by default and only existed to be able to test with non-compliant peers a long time ago.	2014-08-01 20:49:27 +00:00
Peter Grehan	07b4e38313	Fix byte ordering in default RSS key. The rss_key[] array in netinet/in_rss.c has the bytes in incorrect order. This results in the RSS test vectors in the Microsoft RSS spec and Intel NIC specs giving incorrect results, and making it difficult to verify correct hash operation when RSS functionality is added to new NICs. CR: https://phabric.freebsd.org/D516 Reviewed by: adrian	2014-08-01 18:36:40 +00:00
Alexander V. Chernikov	4c0c07a552	* Permit limiting number of items in table. Kernel changes: * Add TEI_FLAGS_DONTADD entry flag to indicate that insert is not possible * Support given flag in all algorithms * Add "limit" field to ipfw_xtable_info * Add actual limiting code into add_table_entry() Userland changes: * Add "limit" option as "create" table sub-option. Limit modification is currently impossible. * Print human-readable errors in table entry addition/deletion code.	2014-08-01 15:17:46 +00:00
Michael Tuexen	ce11b8429b	Cleanup sctp_send_initiate() and sctp_send_initiate_ack() to be in sync as much as possible. This simplifies upcoming changes.	2014-08-01 12:42:37 +00:00
Alexander V. Chernikov	914bffb6ab	* Add new "flow" table type to support N=1..5-tuple lookups * Add "flow:hash" algorithm Kernel changes: * Add O_IP_FLOW_LOOKUP opcode to support "flow" lookups * Add IPFW_TABLE_FLOW table type * Add "struct tflow_entry" as storage for 6-tuple flows * Add "flow:hash" algorithm. Basically it is auto-growing chained hash table. Additionally, we store mask of fields we need to compare in each instance/ * Increase ipfw_obj_tentry size by adding struct tflow_entry * Add per-algorithm stat (ifpw_ta_tinfo) to ipfw_xtable_info * Increase algoname length: 32 -> 64 (algo options passed there as string) * Assume every table type can be customized by flags, use u8 to store "tflags" field. * Simplify ipfw_find_table_entry() by providing @tentry directly to algo callback. * Fix bug in cidr:chash resize procedure. Userland changes: * add "flow table(NAME)" syntax to support n-tuple checking tables. * make fill_flags() separate function to ease working with _s_x arrays * change "table info" output to reflect longer "type" fields Syntax: ipfw table fl2 create type flow:[src-ip][,proto][,src-port][,dst-ip][dst-port] [algo flow:hash] Examples: 0:02 [2] zfscurr0# ipfw table fl2 create type flow:src-ip,proto,dst-port algo flow:hash 0:02 [2] zfscurr0# ipfw table fl2 info +++ table(fl2), set(0) +++ kindex: 0, type: flow:src-ip,proto,dst-port valtype: number, references: 0 algorithm: flow:hash items: 0, size: 280 0:02 [2] zfscurr0# ipfw table fl2 add 2a02:6b8::333,tcp,443 45000 0:02 [2] zfscurr0# ipfw table fl2 add 10.0.0.92,tcp,80 22000 0:02 [2] zfscurr0# ipfw table fl2 list +++ table(fl2), set(0) +++ 2a02:6b8::333,6,443 45000 10.0.0.92,6,80 22000 0:02 [2] zfscurr0# ipfw add 200 count tcp from me to 78.46.89.105 80 flow 'table(fl2)' 00200 count tcp from me to 78.46.89.105 dst-port 80 flow table(fl2) 0:03 [2] zfscurr0# ipfw show 00200 0 0 count tcp from me to 78.46.89.105 dst-port 80 flow table(fl2) 65535 617 59416 allow ip from any to any 0:03 [2] zfscurr0# telnet -s 10.0.0.92 78.46.89.105 80 Trying 78.46.89.105... .. 0:04 [2] zfscurr0# ipfw show 00200 5 272 count tcp from me to 78.46.89.105 dst-port 80 flow table(fl2) 65535 682 66733 allow ip from any to any	2014-07-31 20:08:19 +00:00
Steven Hartland	5af464bbe0	Ensure that IP's added to CARP always use the CARP MAC Previously there was a race condition between the address addition and associating it with the CARP which resulted in the interface MAC, instead of the CARP MAC, being used for a brief amount of time. This caused "is using my IP address" warnings as well as data being sent to the wrong machine due to incorrect ARP entries being recorded by other devices on the network.	2014-07-31 16:43:56 +00:00
Steven Hartland	d34165f759	Only check error if one could have been generated	2014-07-31 09:18:29 +00:00
Alexander V. Chernikov	b23d5de9b6	* Add number:array algorithm lookup method. Kernel changes: * s/IPFW_TABLE_U32/IPFW_TABLE_NUMBER/ * Force "lookup <port|uid|gid|jid>" to be IPFW_TABLE_NUMBER * Support "lookup" method for number tables * Add number:array algorithm (i32 as key, auto-growing). Userland changes: * Support named tables in "lookup <tag> Table" * Fix handling of "table(NAME,val)" case * Support printing "number" table data.	2014-07-30 14:52:26 +00:00
Hiren Panchasara	39c8c62ec4	Add a comment and while there, fix trailing whitespace.	2014-07-29 23:42:51 +00:00
Alexander V. Chernikov	9d099b4f38	* Dump available table algorithms via "ipfw talist" cmd. Kernel changes: * Add type/refcount fields to table algo instances. * Add IP_FW_TABLES_ALIST opcode to export available algorithms to userland. Userland changes: * Fix cores on empty input inside "ipfw table" handler. * Add "ipfw talist" cmd to print available kernel algorithms. * Change "table info" output to reflect long algorithm config lines.	2014-07-29 22:44:26 +00:00
Gleb Smirnoff	9753faf553	Garbage collect couple of unused fields from struct ifaddr: - ifa_claim_addr() unused since removal of NetAtalk - ifa_metric seems to be never utilized, always a copy of if_metric	2014-07-29 15:01:29 +00:00
Alexander V. Chernikov	68394ec88e	* Add generic ipfw interface tracking API * Rewrite interface tables to use interface indexes Kernel changes: * Add generic interface tracking API: - ipfw_iface_ref (must call unlocked, performs lazy init if needed, allocates state & bumps ref) - ipfw_iface_add_ntfy(UH_WLOCK+WLOCK, links consumer & runs its callback to update ifindex) - ipfw_iface_del_ntfy(UH_WLOCK+WLOCK, unlinks consumer) - ipfw_iface_unref(unlocked, drops reference) Additionally, consumer callbacks are called in interface withdrawal/departure. * Rewrite interface tables to use iface tracking API. Currently tables are implemented the following way: runtime data is stored as sorted array of {ifidx, val} for existing interfaces full data is stored inside namedobj instance (chained hashed table). * Add IP_FW_XIFLIST opcode to dump status of tracked interfaces * Pass @chain ptr to most non-locked algorithm callbacks: (prepare_add, prepare_del, flush_entry ..). This may be needed for better interaction of given algorithm and other ipfw subsystems * Add optional "change_ti" algorithm handler to permit updating of cached table_info pointer (happens in case of table_max resize) * Fix small bug in ipfw_list_tables() * Add badd (insert into sorted array) and bdel (remove from sorted array) funcs Userland changes: * Add "iflist" cmd to print status of currently tracked interface * Add stringnum_cmp for better interface/table names sorting	2014-07-28 19:01:25 +00:00
Marcel Moolenaar	1e0a021e3d	The accept filter code is not specific to the FreeBSD IPv4 network stack, so it really should not be under "optional inet". The fact that uipc_accf.c lives under kern/ lends some weight to making it a "standard" file. Moving kern/uipc_accf.c from "optional inet" to "standard" eliminates the need for #ifdef INET in kern/uipc_socket.c. Also, this meant the net.inet.accf.unloadable sysctl needed to move, as net.inet does not exist without networking compiled in (as it lives in netinet/in_proto.c.) The new sysctl has been named net.accf.unloadable. In order to support existing accept filter sysctls, the net.inet.accf node has been added to netinet/in_proto.c. Submitted by: Steve Kiernan <stevek@juniper.net> Obtained from: Juniper Networks, Inc.	2014-07-26 19:27:34 +00:00
Michael Tuexen	56711f9433	Initialize notification structure. This was missed in an earlier commit. MFC after: 3 days	2014-07-24 18:06:18 +00:00
Hiroki Sato	9be09a6e43	Fix EtherIP. TOS field must be initialized when the inner protocol is PF_LINK, and multicast/broadcast flag should always be dropped because the outer protocol uses unicast even when the inner address is not for unicast. It had been broken since r236951 when gif_output() started to use IFQ_HANDOFF().	2014-07-24 10:42:47 +00:00
Michael Tuexen	e710ed26a3	Cleanup the definition of two structures which are exposed to userland. Therefore no MFC.	2014-07-22 19:54:22 +00:00
Adrian Chadd	58ef629f00	Make the PCBGROUPS code aware of IPv4 UDP 4-tuple.	2014-07-20 07:38:38 +00:00
Adrian Chadd	9870806c93	Add hash awareness of the IPv4 and IPv6 UDP 4-tuple. Note: it would be nice if the supported hash check would be used here!	2014-07-20 07:37:47 +00:00
Adrian Chadd	40c753e3da	Implement rss_gethashconfig() - return the currently supported hash methods by the stack. Right now the stack isn't really set up for RSS with 4-tuple UDP hashing for either IPv4 or IPv6. The specifics: * The UDP init path udp_init() and udplite_init() specify the hash as 2-tuple, so the PCBGROUPS code only tries a 2-tuple check; * The PCBGROUPS and RSS code doesn't know about the UDP hash types just yet, so they're never treated as valid hashes. * For correctness, 4-tuple can't be enabled in the general case because UDP datagrams can be more fragmented than IP datagrams may be. Strictly speaking, TCP datagrams may also be fragmented and this could cause issues with PCBGROUPS/RSS until the IP defragment path grows some code to re-calculate the RSS hash. I'll follow this commit up with awareness of the UDP 4-tuple for those who wish to configure it, but for now it'll stay disabled. No drivers (yet) know to use this function when RSS is enabled.	2014-07-20 07:36:59 +00:00
Adrian Chadd	85415b47c8	Update the comment to be more concise.	2014-07-20 07:31:55 +00:00
Adrian Chadd	5f15473b37	Update the default RSS hash to the Chelsio T5 firmware one - it provides markedly better distribution of IPv6 address/ports than the previous key. The previous key would hash large swaths of the port space for a given source/destination IP address to the same low handful of bits, effectively mapping them to the same queue. This made testing very .. special.	2014-07-18 08:22:13 +00:00
Adrian Chadd	8496de3825	Oops - somehow I missed the IP option numbers clashing with the multicast numbers below. Move them to a new set of non-clashing numbers.	2014-07-17 05:45:54 +00:00
Adrian Chadd	e989b65f79	Add RSS hashing awareness for IPv6 and TCP IPv6 hash types.	2014-07-12 05:43:43 +00:00
Adrian Chadd	d5bb8bd315	Expose in_pcbbind_check_bindmulti() so the upcoming IPv6 RSS changes can be made to use it.	2014-07-12 05:40:13 +00:00
Michael Tuexen	0c8682e8ad	Whitespace changes. MFC after: 1 week	2014-07-11 21:15:40 +00:00
Michael Tuexen	f64a0b069a	Bugfix: When a remote address was added to an endpoint, a source address was selected and cached, but it was not stored that it was cached. This resulted in selecting different source addresses for the INIT-ACK and COOKIE-ACK when possible. Thanks to Niu Zhixiong for reporting the issue. MFC after: 1 week	2014-07-11 17:31:40 +00:00
Gleb Smirnoff	fcc34a238c	Fix style bug: rename the refcount field of m_ext to ext_cnt, to match other members. Sponsored by: Nginx, Inc.	2014-07-11 14:34:29 +00:00
Michael Tuexen	4474d71a7b	Integrate upstream changes. MFC after: 1 week	2014-07-11 06:52:48 +00:00
Adrian Chadd	0a100a6f1e	Implement the first stage of multi-bind listen sockets and RSS socket awareness. * Introduce IP_BINDMULTI - indicating that it's okay to bind multiple sockets on the same bind details. Although the PCB code has been taught about this (see below) this patch doesn't introduce the rest of the PCB changes necessary to distribute lookups among multiple PCB entries in the global wildcard table. * Introduce IP_RSS_LISTEN_BUCKET - placing a listen socket into the given RSS bucket (and thus a single PCBGROUP hash.) * Modify the PCB add path to be aware of IP_BINDMULTI: + Only allow further PCB entries to be added if the owner credentials and IP_BINDMULTI has been specified. Ie, only allow further IP_BINDMULTI sockets to appear if the first bind() was IP_BINDMULTI. * Teach the PCBGROUP code about IP_RSS_LISTEN_BUCKET marked PCB entries. Instead of using the wildcard logic and hashing, these sockets are simply placed into the PCBGROUP and _not_ in the wildcard hash. * When doing a PCBGROUP lookup, also do a wildcard match as well. This allows for an RSS bucket PCB entry to appear in a PCBGROUP rather than having to exist in the wildcard list. Tested: * TCP IPv4 server testing with igb(4) * TCP IPv4 server testing with ix(4) TODO: * The pcbgroup lookup code duplicated the wildcard and wildcard-PCB logic. This could be refactored into a single function. * This doesn't yet work for IPv6 (The PCBGROUP code in netinet6/ doesn't yet know about this); nor does it yet fully work for UDP.	2014-07-10 03:10:56 +00:00
Gleb Smirnoff	fe82cbe85c	In several cases in ip_output() we obtain reference on ifa. Do not leak it. Together with: asomers, np Sponsored by: Nginx, Inc.	2014-07-09 07:48:05 +00:00
Alexander V. Chernikov	7e767c791f	* Use different rule structures in kernel/userland. * Switch kernel to use per-cpu counters for rules. * Keep ABI/API. Kernel changes: * Each rule is now exported as TLV with optional extensible counter block (ip_fw_bcounter for base one) and ip_fw_rule for rule&cmd data. * Counters need to be explicitly requested by IPFW_CFG_GET_COUNTERS flag. * Separate counters from rules in kernel and clean up ip_fw a bit. * Pack each rule in IPFW_TLV_RULE_ENT tlv to ease parsing. * Introduce versioning in container TLV (may be needed in future). * Fix ipfw_cfg_lheader broken u64 alignment. Userland changes: * Use set_mask from cfg header when requesting config * Fix incorrect read accounting in ipfw_show_config() * Use IPFW_RULE_NOOPT flag instead of playing with _pad * Fix "ipfw -d list": do not print counters for dynamic states * Some small fixes	2014-07-08 23:11:15 +00:00
Xin LI	e432298ade	Initialize SCTP cmsg's and notification's buffer before copying out to userland. Submitted by: tuexen Security: CVE-2014-3953 Security: FreeBSD-SA-14:17.kmem	2014-07-08 21:54:27 +00:00
Alexander V. Chernikov	6447bae661	* Prepare to pass other dynamic states via ipfw_dump_config() Kernel changes: * Change dump format for dynamic states: each state is now stored inside ipfw_obj_dyntlv last dynamic state is indicated by IPFW_DF_LAST flag * Do not perform sooptcopyout() for !SOPT_GET requests. Userland changes: * Introduce foreach_state() function handler to ease work with different states passed by ipfw_dump_config().	2014-07-06 23:26:34 +00:00
Alexander V. Chernikov	81d3153d61	* Add "lookup" table functionality to permit userland entry lookups. * Bump table dump format preserving old ABI. Kernel side: * Add IP_FW_TABLE_XFIND to handle "lookup" request from userland. * Add ta_find_tentry() algorithm callbacks/handlers to support lookups. * Fully switch to ipfw_obj_tentry for various table dumps: algorithms are now required to support the latest (ipfw_obj_tentry) entry dump format, the rest is handled by generic dump code. IP_FW_TABLE_XLIST opcode version bumped (0 -> 1). * Eliminate legacy ta_dump_entry algo handler: dump_table_entry() converts data from current to legacy format. Userland side: * Add "lookup" table parameter. * Change the way table type is guessed: call table_get_info() first, and check value for IPv4/IPv6 type IFF table does not exist. * Fix table_get_list(): do more tries if supplied buffer is not enough. * Separate table_show_entry() from table_show_list().	2014-07-06 18:16:04 +00:00
Hiren Panchasara	43630e625a	Fix a typo.	2014-07-03 23:12:43 +00:00

5178 Commits