freebsd-skq

Author	SHA1	Message	Date
Adrian Chadd	dc847eb656	Add missing variable declarations when using RSS. Reported by: bryanv@	2014-06-27 19:07:00 +00:00
Hans Petter Selasky	3da1cf1e88	Extend the meaning of the CTLFLAG_TUN flag to automatically check if there is an environment variable which shall initialize the SYSCTL during early boot. This works for all SYSCTL types both statically and dynamically created ones, except for the SYSCTL NODE type and SYSCTLs which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to be used in the case a tunable sysctl has a custom initialisation function allowing the sysctl to still be marked as a tunable. The kernel SYSCTL API is mostly the same, with a few exceptions for some special operations like iterating childrens of a static/extern SYSCTL node. This operation should probably be made into a factored out common macro, hence some device drivers use this. The reason for changing the SYSCTL API was the need for a SYSCTL parent OID pointer and not only the SYSCTL parent OID list pointer in order to quickly generate the sysctl path. The motivation behind this patch is to avoid parameter loading cludges inside the OFED driver subsystem. Instead of adding special code to the OFED driver subsystem to post-load tunables into dynamically created sysctls, we generalize this in the kernel. Other changes: - Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask" to "hw.pcic.intr_mask". - Removed redundant TUNABLE statements throughout the kernel. - Some minor code rewrites in connection to removing not needed TUNABLE statements. - Added a missing SYSCTL_DECL(). - Wrapped two very long lines. - Avoid malloc()/free() inside sysctl string handling, in case it is called to initialize a sysctl from a tunable, hence malloc()/free() is not ready when sysctls from the sysctl dataset are registered. - Bumped FreeBSD version to indicate SYSCTL API change. MFC after: 2 weeks Sponsored by: Mellanox Technologies	2014-06-27 16:33:43 +00:00
Adrian Chadd	7847796a93	Retire IP_RSSCPUID ; the right thing to do is query the RSS bucket; map the bucket to an RSS queue, then map the queue to a CPU ID. This way the bucket->queue and queue->CPU mapping can change over time. Introduce IP_RSSBUCKETID - which instead looks up the RSS bucket. User applications can then map the RSS bucket to a CPU.	2014-06-26 04:12:41 +00:00
Adrian Chadd	a6c88ec4fb	Add another RSS method to query the indirection table entries. There's 128 indirection table entries which correspond to the low 7 bits of the 32 bit RSS hash. Each value will correspond to an RSS bucket. (Then each RSS bucket currently will map to a CPU.) This is a more explicit way of figuring out which RSS bucket is in each RSS indirection slot. It can be inferred by the other methods but I'd rather drivers use something more simplified and explicit.	2014-06-26 02:49:51 +00:00
Michael Tuexen	2f4c57fbe9	Fix a bug which incorrectly allowed two listening SCTP sockets on the same port bound to the wildcard address. MFC after: 3 days	2014-06-20 20:17:39 +00:00
Michael Tuexen	8a794ba826	Fix a bug in the setsockopt()-handling of the SCTP specific option SCTP_PEER_ADDR_THLDS: Use the provided address as intended. MFC after: 3 days	2014-06-20 17:45:00 +00:00
Michael Tuexen	6ba22f19ca	Honor jails for unbound SCTP sockets when selecting source addresses, reporting IP-addresses to the peer during the handshake, adding addresses to the host, reporting the addresses via the sysctl interface (used by netstat, for example) and reporting the addresses to the application via socket options. This issue was reported by Bernd Walter. MFC after: 3 days	2014-06-20 13:26:49 +00:00
Michael Tuexen	dfa9c0b787	Use ENOBUFS instead of ENOMEM in error situations related to m_uiotombuf(). This was suggested by kevlo@. MFC after: 3 days	2014-06-05 12:51:12 +00:00
Kevin Lo	71c92ff80a	Fix build UDP-Lite with VIMAGE enabled when building with gcc. Reported and tested by: Jason Hellenthal	2014-06-03 01:30:32 +00:00
Hiren Panchasara	fc5e1956d9	ECN marking implenetation for dummynet. Changes include both DCTCP and RFC 3168 ECN marking methodology. DCTCP draft: http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00 Submitted by: Midori Kato (aoimidori27@gmail.com) Worked with: Lars Eggert (lars@netapp.com) Reviewed by: luigi, hiren	2014-06-01 07:28:24 +00:00
Bjoern A. Zeeb	700515aa62	While PAWS is disabled, there are no consumers for the tcp options argument to tcp_twcheck(); thus mark it __unused. MFC after: 2 weeks	2014-05-30 22:34:06 +00:00
Alan Somers	2f308a343f	Fix unintended KBI change from r264905. Add _fib versions of ifa_ifwithnet() and ifa_ifwithdstaddr() The legacy functions will call the _fib() versions with RT_ALL_FIBS, preserving legacy behavior. sys/net/if_var.h sys/net/if.c Add legacy-compatible functions as described above. Ensure legacy behavior when RT_ALL_FIBS is passed as fibnum. sys/netinet/in_pcb.c sys/netinet/ip_output.c sys/netinet/ip_options.c sys/net/route.c sys/net/rtsock.c sys/netinet6/nd6.c Call with _fib() functions if we must use a specific fib, or the legacy functions otherwise. tests/sys/netinet/fibs_test.sh tests/sys/netinet/udp_dontroute.c Improve the udp_dontroute test. The bug that this test exercises is that ifa_ifwithnet() will return the wrong address, if multiple interfaces have addresses on the same subnet but with different fibs. The previous version of the test only considered one possible failure mode: that ifa_ifwithnet_fib() might fail to find any suitable address at all. The new version also checks whether ifa_ifwithnet_fib() finds the correct address by checking where the ARP request goes. Reported by: bz, hrs Reviewed by: hrs MFC after: 1 week X-MFC-with: 264905 Sponsored by: Spectra Logic	2014-05-29 21:03:49 +00:00
Jilles Tjoelker	ced01b33f4	netinet/in.h: Expose htonl(), htons(), ntohl() and ntohs() in strict POSIX mode. Put the htonl(), htons(), ntohl() and ntohs() declarations under __POSIX_VISIBLE >= 200112. POSIX.1-2001 and newer require these to be exposed from <netinet/in.h> (as well as <arpa/inet.h>). Note that it may be unnecessary to check __POSIX_VISIBLE >= 200112 because older versions of POSIX and the C standard do not define this header. However, other places in the same file already perform the check. PR: 188316 Submitted by: Christian Neukirchen	2014-05-29 15:23:37 +00:00
Adrian Chadd	8bde802a2b	The users of RSS shouldn't be directly concerned about hash -> CPU ID mappings. Instead, they should be first mapping to an RSS bucket and then querying the RSS bucket -> CPU ID mapping to figure out the target CPU. When (if?) RSS rebalancing is implemented or some other (non round-robin) distribution of work from buckets to CPU IDs, various bits of code - both userland and kernel - will need to know how this mapping works. So, to support this: * Add a new function rss_m2bucket() - this maps an mbuf to a given bucket. Anything which is currently doing hash -> CPU work may instead wish to do hash -> bucket, and then query the bucket->cpuid map for which CPU it belongs on. Or, map it to a bucket, then re-pin that bucket -> CPU during a rebalance operation. * For userland applications which wish to exploit affinity to RSS buckets, the bucket -> CPU ID mapping is now available via a sysctl. net.inet.rss.bucket_mapping lists the bucket to CPU ID mapping via a list of bucket:cpu pairs.	2014-05-27 08:06:20 +00:00
Bjoern A. Zeeb	3150f357ff	Remove the prototpye for the static inline function tcp_signature_verify_input(). The function is defined before first use already. MFC after: 2 weeks	2014-05-24 15:31:40 +00:00
Bjoern A. Zeeb	ad494fa898	syncache_lookup() is a file local function. Make it static and take it out of the public KPI; seems it was never used elsewhere. MFC after: 2 weeks	2014-05-24 15:03:36 +00:00
Bjoern A. Zeeb	4fd2b4eb53	Make tcp_twrespond() file local private; this removes it from the public KPI; it is not used anywhere else and seems it never was. MFC after: 2 weeks	2014-05-24 14:01:18 +00:00
Bjoern A. Zeeb	5688fa661b	Remove the prototypes for things that are no longer file local but were moved to the header file. Pointy hat to: clang \|\| bz MFC after: 2 weeks X-MFC with: r266596 Reported by: gcc build of sparc64	2014-05-23 21:12:33 +00:00
Bjoern A. Zeeb	255cd9fd58	Move the tcp_fields_to_host() and tcp_fields_to_net() (inline) functions to the tcp_var.h header file in order to avoid further duplication with upcoming commits. Reviewed by: np MFC after: 2 weeks	2014-05-23 20:15:01 +00:00
Adrian Chadd	bad008ce85	Use CPU_FIRST() / CPU_NEXT() to iterate over the valid CPU IDs.	2014-05-22 07:25:36 +00:00
Adrian Chadd	883831c675	When RSS is enabled and per cpu TCP timers are enabled, do an RSS lookup for the inp flowid/flowtype to destination CPU. This only modifies the case where RSS is enabled and the per-cpu tcp timer option is enabled. Otherwise the behaviour should be the same as before.	2014-05-18 22:39:01 +00:00
Adrian Chadd	9c42397277	* When copying the flowid from inp -> outbound mbuf, also assign the hashtype to to the outbound mbuf as well as the flowid. * Add in socket options to fetch the hashid, the hashtype and RSS CPU ID for a given socket.	2014-05-18 22:37:31 +00:00
Adrian Chadd	2f71993288	Ensure that the flowid hashtype is assigned to the inp if the flowid is also assigned.	2014-05-18 22:34:06 +00:00
Adrian Chadd	cc6c187794	Add a new function to do a CPU ID lookup based on RSS hash information. This is intended to be used by various places that wish to hash some information about a TCP/UDP/IP flow but don't necessarily have a live mbuf to do it with. Refactor rss_m2cpuid() to use the refactored function.	2014-05-18 22:32:04 +00:00
Adrian Chadd	34e3dcedec	Add the flowtype to the inpcb. The flowid isn't enough to use as part of any RSS related CPU affinity lookups - the RSS code would like to know what kind of hash it is.	2014-05-18 22:30:12 +00:00
Alexander V. Chernikov	c3015737f3	Fix wrong formatting of 0.0.0.0/X table records in ipfw(8). Add `flags` u16 field to the hole in ipfw_table_xentry structure. Kernel has been guessing address family for supplied record based on xent length size. Userland, however, has been getting fixed-size ipfw_table_xentry structures guessing address family by checking address by IN6_IS_ADDR_V4COMPAT(). Fix this behavior by providing specific IPFW_TCF_INET flag for IPv4 records. PR: bin/189471 Submitted by: Dennis Yusupoff <dyr@smartspb.net> MFC after: 2 weeks	2014-05-17 13:45:03 +00:00
Gleb Smirnoff	b1a4156614	Provide compatibility #define after r265408. Suggested by: truckman	2014-05-17 12:33:27 +00:00
Adrian Chadd	d804a08f3e	Reserve IP_FLOWID, IP_FLOWTYPE, IP_RSSCPUID socket option IDs for near-term future use. These are intended to fetch the current flow id, flow hash type (M_HASHTYPE_* from the sys/mbuf.h) and if RSS is enabled, the RSS destined CPU ID for the receive path.	2014-05-17 00:09:12 +00:00
Mike Silbersack	f1395664e5	Remove the function tcp_twrecycleable; it has been #if 0'd for eight years. The original concept was to improve the corner case where you run out of ephemeral ports, but it was causing performance problems and the mechanism of limiting the number of time_wait sockets serves the same purpose in the end. Reviewed by: bz	2014-05-16 01:38:38 +00:00
Pyun YongHyeon	c732cd1af1	Fix checksum computation. Previously it didn't include carry. Reviewed by: tuexen	2014-05-13 05:07:03 +00:00
Michael Tuexen	a485f139c3	Disable TX checksum offload for UDP-Lite completely. It wasn't used for partial checksum coverage, but even for full checksum coverage it doesn't work. This was discussed with Kevin Lo (kevlo@).	2014-05-12 09:46:48 +00:00
Michael Tuexen	6c19260269	Whitespace change.	2014-05-10 08:48:04 +00:00
Michael Tuexen	d58c15339b	Fix a logic bug which prevented the sending of UDP packet with 0 checksum. This bug was introduced in r264212 and should be X-MFCed with that revision, if UDP-Lite support if MFCed.	2014-05-09 14:15:48 +00:00
Michael Tuexen	26461454fc	Use KASSERTs as suggested by glebius@ MFC after: 3 days X-MFC with: 265691	2014-05-08 20:47:54 +00:00
Michael Tuexen	8e1d0a568a	For some UDP packets (for example with 200 byte payload) and IP options, the IP header and the UDP header are not in the same mbuf. Add code to in_delayed_cksum() to deal with this case. MFC after: 3 days	2014-05-08 17:27:46 +00:00
Michael Tuexen	4aa74d8b65	Remove unused code. This is triggered by the bugreport of Sylvestre Ledru which deal with useless code in the user land stack: https://bugzilla.mozilla.org/show_bug.cgi?id=1003929 MFC after: 3 days	2014-05-06 16:51:07 +00:00
Gleb Smirnoff	c669105d17	- Remove net.inet.tcp.reass.overflows sysctl. It counts exactly same events that tcpstat's tcps_rcvmemdrop counter counts. - Rename tcps_rcvmemdrop to tcps_rcvreassfull and improve its description in netstat(1) output. Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-05-06 00:00:07 +00:00
Gleb Smirnoff	6c42c8a93f	The tcp_log_addrs() uses th pointer, which points into the mbuf, thus we can not free the mbuf before tcp_log_addrs(). Sponsored by: Nginx, Inc. Sponsored by: Netflix	2014-05-05 21:33:20 +00:00
Gleb Smirnoff	e407b67be4	The FreeBSD-SA-14:08.tcp was a lesson on not doing acrobatics with mixing on stack memory and UMA memory in one linked list. Thus, rewrite TCP reassembly code in terms of memory usage. The algorithm remains unchanged. We actually do not need extra memory to build a reassembly queue. Arriving mbufs are always packet header mbufs. So we got the length of data as pkthdr.len. We got m_nextpkt for linkage. And we need only one pointer to point at the tcphdr, use PH_loc for that. In tcpcb the t_segq fields becomes mbuf pointer. The t_segqlen field now counts not packets, but bytes in the queue. This gives us more precision when comparing to socket buffer limits. Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-05-04 23:25:32 +00:00
Alexander V. Chernikov	a32603a55a	Fix panic on IPv4 address removal introduced in r265279. Reported by: Trond Endrestøl MFC with: r265279	2014-05-03 20:22:13 +00:00
Alexander V. Chernikov	b980262e63	Pass radix head ptr along with rte to rtexpunge(). Rename rtexpunge to rt_expunge().	2014-05-03 16:28:54 +00:00
Xin LI	c6f70658c3	Fix TCP reassembly vulnerability. Patch done by: glebius Security: FreeBSD-SA-14:08.tcp Security: CVE-2014-3000	2014-04-30 04:02:57 +00:00
Alan Somers	7278b62aee	Fix a panic when removing an IP address from an interface, if the same address exists on another interface. The panic was introduced by change 264887, which changed the fibnum parameter in the call to rtalloc1_fib() in ifa_switch_loopback_route() from RT_DEFAULT_FIB to RT_ALL_FIBS. The solution is to use the interface fib in that call. For the majority of users, that will be equivalent to the legacy behavior. PR: kern/189089 Reported by: neel Reviewed by: neel MFC after: 3 weeks X-MFC with: 264887 Sponsored by: Spectra Logic	2014-04-29 14:46:45 +00:00
Alan Somers	0cfee0c223	Fix subnet and default routes on different FIBs on the same subnet. These two bugs are closely related. The root cause is that ifa_ifwithnet does not consider FIBs when searching for an interface address. sys/net/if_var.h sys/net/if.c Add a fib argument to ifa_ifwithnet and ifa_ifwithdstadddr. Those functions will only return an address whose interface fib equals the argument. sys/net/route.c Update calls to ifa_ifwithnet and ifa_ifwithdstaddr with fib arguments. sys/netinet/in.c Update in_addprefix to consider the interface fib when adding prefixes. This will prevent it from not adding a subnet route when one already exists on a different fib. sys/net/rtsock.c sys/netinet/in_pcb.c sys/netinet/ip_output.c sys/netinet/ip_options.c sys/netinet6/nd6.c Add RT_DEFAULT_FIB arguments to ifa_ifwithdstaddr and ifa_ifwithnet. In some cases it there wasn't a clear specific fib number to use. In others, I was unable to test those functions so I chose RT_DEFAULT_FIB to minimize divergence from current behavior. I will fix some of the latter changes along with PR kern/187553. tests/sys/netinet/fibs_test.sh tests/sys/netinet/udp_dontroute.c tests/sys/netinet/Makefile Revert r263738. The udp_dontroute test was right all along. However, bugs kern/187550 and kern/187553 cancelled each other out when it came to this test. Because of kern/187553, ifa_ifwithnet searched the default fib instead of the requested one, but because of kern/187550, there was an applicable subnet route on the default fib. The new test added in r263738 doesn't work right, however. I can verify with dtrace that ifa_ifwithnet returned the wrong address before I applied this commit, but route(8) miraculously found the correct interface to use anyway. I don't know how. Clear expected failure messages for kern/187550 and kern/187552. PR: kern/187550 PR: kern/187552 Reviewed by: melifaro MFC after: 3 weeks Sponsored by: Spectra Logic	2014-04-24 23:56:56 +00:00
Alan Somers	0489b8916e	Fix host and network routes for new interfaces when net.add_addr_allfibs=0 sys/net/route.c In rtinit1, use the interface fib instead of the process fib. The latter wasn't very useful because ifconfig(8) is usually invoked with the default process fib. Changing ifconfig(8) to use setfib(2) would be redundant, because it already sets the interface fib. tests/sys/netinet/fibs_test.sh Clear the expected ATF failure sys/net/if.c Pass the interface fib in calls to rtrequest1_fib and rtalloc1_fib sys/netinet/in.c sys/net/if_var.h Add a fibnum argument to ifa_switch_loopback_route, a subroutine of in_scrubprefix. Pass it the interface fib. PR: kern/187549 Reviewed by: melifaro MFC after: 3 weeks Sponsored by: Spectra Logic Corporation	2014-04-24 17:23:16 +00:00
Steven Hartland	ae19083248	Fix jailed raw sockets not setting the correct source address by calling in_pcbladdr instead of prison_get_ip4 MFC after: 1 month	2014-04-24 12:52:31 +00:00
Michael Tuexen	8be0fd55dc	Don't free an mbuf twice. This only happens in very rare error cases where the peer sends illegal sequencing information in DATA chunks for an existing association. MFC after: 3 days.	2014-04-23 21:20:55 +00:00
Rick Macklem	2aa76dba07	Add {} braces so that the code conforms to the indentation. Fortunately, I don't think doing the assignment of cap->tsomax unconditionally causes any problem. Reviewed by: glebius MFC after: 2 weeks	2014-04-21 19:17:19 +00:00
Michael Tuexen	eb67ee5fc6	Add consistency checks to ensure that fragments of a user message have the same U-bit. MFC after: 3 days	2014-04-20 21:11:39 +00:00
Michael Tuexen	273351d497	Send also a packet containing an ABORT chunk in response to an OOTB packet containing a COOKIE-ECHO chunk. MFC after: 3 days	2014-04-20 18:15:23 +00:00
Michael Tuexen	2dec1efc5a	Use consistently debug output instead of an unconditional printf. MFC after: 3 days	2014-04-19 20:55:51 +00:00
Michael Tuexen	32451da416	Send the correct error cause, when a DATA chunk with no user data is received. This bug was reported by Irene Ruengeler. MFC after: 3 days	2014-04-19 19:21:06 +00:00
John Baldwin	b8c8c8c3c7	Some whitespace and style fixes. Submitted by: bde	2014-04-11 21:00:59 +00:00
John Baldwin	2ffb755cec	The tw_pcbrele() function does not need the global timewait lock. Submitted by: Julien Charbon Suggested by: glebius	2014-04-11 19:17:45 +00:00
John Baldwin	9941de49ad	Don't leak the TCP pcbinfo lock if a time wait connection is closed in between grabbing a reference on the connection structure and obtaining the pcbinfo lock. Reviewed by: Julien Charbon	2014-04-11 13:11:43 +00:00
John Baldwin	66eefb1eae	Currently, the TCP slow timer can starve TCP input processing while it walks the list of connections in TIME_WAIT closing expired connections due to contention on the global TCP pcbinfo lock. To remediate, introduce a new global lock to protect the list of connections in TIME_WAIT. Only acquire the TCP pcbinfo lock when closing an expired connection. This limits the window of time when TCP input processing is stopped to the amount of time needed to close a single connection. Submitted by: Julien Charbon <jcharbon@verisign.com> Reviewed by: rwatson, rrs, adrian MFC after: 2 months	2014-04-10 18:15:35 +00:00
Kevin Lo	cfac59ecb1	Remove a bogus re-assignment.	2014-04-08 01:54:50 +00:00
Kevin Lo	d1b18731d9	Minor style cleanups.	2014-04-07 01:55:53 +00:00
Kevin Lo	e06e816f67	Add support for UDP-Lite protocol (RFC 3828) to IPv4 and IPv6 stacks. Tested with vlc and a test suite [1]. [1] http://www.erg.abdn.ac.uk/~gerrit/udp-lite/files/udplite_linux.tar.gz Reviewed by: jhb, glebius, adrian	2014-04-07 01:53:03 +00:00
Hiren Panchasara	855363811f	Improve readability of comments for DELAY_ACK() macro.	2014-04-03 01:46:03 +00:00
Michael Tuexen	6bbfa13f80	Increment the SSN only after processing the last fragment of an ordered user message. MFC after: 3 days	2014-04-01 18:38:04 +00:00
Andrey V. Elsukov	41ea685c32	Don't copy the MF flag from original IP header to ICMP error message. PR: 188092 MFC after: 1 week Sponsored by: Yandex LLC	2014-03-31 13:00:49 +00:00
Michael Tuexen	9ba5b6b730	Handle an edge case of address management similar to TCP. This needs to be reconsidered when the address handling will be reimplemented. The patch is from rrs@. MFC after: 3 days	2014-03-29 21:26:45 +00:00
Michael Tuexen	fe96e2852e	Use SCTP_OVER_UDP_TUNNELING_PORT more consistently. MFC after: 3 days	2014-03-29 20:21:36 +00:00
Alan Somers	743c072a09	Correct ARP update handling when the routes for network interfaces are restricted to a single FIB in a multifib system. Restricting an interface's routes to the FIB to which it is assigned (by setting net.add_addr_allfibs=0) causes ARP updates to fail with "arpresolve: can't allocate llinfo for x.x.x.x". This is due to the ARP update code hard coding it's lookup for existing routing entries to FIB 0. sys/netinet/in.c: When dealing with RTM_ADD (add route) requests for an interface, use the interface's assigned FIB instead of the default (FIB 0). sys/netinet/if_ether.c: In arpresolve(), enhance error message generated when an lla_lookup() fails so that the interface causing the error is visible in logs. tests/sys/netinet/fibs_test.sh Clear ATF expected error. PR: kern/167947 Submitted by: Nikolay Denev <ndenev@gmail.com> (previous version) Reviewed by: melifaro MFC after: 3 weeks Sponsored by: Spectra Logic Corporation	2014-03-26 22:46:03 +00:00
Hiren Panchasara	153edc50d7	Correct the comments as support for RFC 1644 has been removed for a long time.	2014-03-25 21:57:50 +00:00
Michael Tuexen	ff1ffd7499	* Provide information in error causes in ASCII instead of proprietary binary format. * Add support for a diagnostic information error cause. The code is sysctlable and the default is 0, which means it is not sent. This is joint work with rrs@. MFC after: 1 week	2014-03-16 12:32:16 +00:00
Robert Watson	7527624efa	Several years after initial development, merge prototype support for linking NIC Receive Side Scaling (RSS) to the network stack's connection-group implementation. This prototype (and derived patches) are in use at Juniper and several other FreeBSD-using companies, so despite some reservations about its maturity, merge the patch to the base tree so that it can be iteratively refined in collaboration rather than maintained as a set of gradually diverging patch sets. (1) Merge a software implementation of the Toeplitz hash specified in RSS implemented by David Malone. This is used to allow suitable pcbgroup placement of connections before the first packet is received from the NIC. Software hashing is generally avoided, however, due to high cost of the hash on general-purpose CPUs. (2) In in_rss.c, maintain authoritative versions of RSS state intended to be pushed to each NIC, including keying material, hash algorithm/ configuration, and buckets. Provide software-facing interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both the RSS standardised Toeplitz and a 'naive' variation with a hash efficient in software but with poor distribution properties. Implement rss_m2cpuid()to be used by netisr and other load balancing code to look up the CPU on which an mbuf should be processed. (3) In the Ethernet link layer, allow netisr distribution using RSS as a source of policy as an alternative to source ordering; continue to default to direct dispatch (i.e., don't try and requeue packets for processing on the 'right' CPU if they arrive in a directly dispatchable context). (4) Allow RSS to control tuning of connection groups in order to align groups with RSS buckets. If a packet arrives on a protocol using connection groups, and contains a suitable hardware-generated hash, use that hash value to select the connection group for pcb lookup for both IPv4 and IPv6. If no hardware-generated Toeplitz hash is available, we fall back on regular PCB lookup risking contention rather than pay the cost of Toeplitz in software -- this is a less scalable but, at my last measurement, faster approach. As core counts go up, we may want to revise this strategy despite CPU overhead. Where device drivers suitably configure NICs, and connection groups / RSS are enabled, this should avoid both lock and line contention during connection lookup for TCP. This commit does not modify any device drivers to tune device RSS configuration to the global RSS configuration; patches are in circulation to do this for at least Chelsio T3 and Intel 1G/10G drivers. Currently, the KPI for device drivers is not particularly robust, nor aware of more advanced features such as runtime reconfiguration/rebalancing. This will hopefully prove a useful starting point for refinement. No MFC is scheduled as we will first want to nail down a more mature and maintainable KPI/KBI for device drivers. Sponsored by: Juniper Networks (original work) Sponsored by: EMC/Isilon (patch update and merge)	2014-03-15 00:57:50 +00:00
Gleb Smirnoff	45c203fce2	Remove AppleTalk support. AppleTalk was a network transport protocol for Apple Macintosh devices in 80s and then 90s. Starting with Mac OS X in 2000 the AppleTalk was a legacy protocol and primary networking protocol is TCP/IP. The last Mac OS X release to support AppleTalk happened in 2009. The same year routing equipment vendors (namely Cisco) end their support. Thus, AppleTalk won't be supported in FreeBSD 11.0-RELEASE.	2014-03-14 06:29:43 +00:00
Gleb Smirnoff	2c284d9395	Remove IPX support. IPX was a network transport protocol in Novell's NetWare network operating system from late 80s and then 90s. The NetWare itself switched to TCP/IP as default transport in 1998. Later, in this century the Novell Open Enterprise Server became successor of Novell NetWare. The last release that claimed to still support IPX was OES 2 in 2007. Routing equipment vendors (e.g. Cisco) discontinued support for IPX in 2011. Thus, IPX won't be supported in FreeBSD 11.0-RELEASE.	2014-03-14 02:58:48 +00:00
Michael Tuexen	3d31c75401	Put the offset of the CRC32C in csum_data instead of 0. The virtio driver needs the offset to be stored in csum_data, like in the case for UDP and TCP. The virtio problem was reported by Niu Zhixiong <kaiaixi@gmail.com>, who helped in debugging and testing the patch. MFC after: 3 days	2014-03-12 17:18:15 +00:00
Michael Tuexen	031925742a	SCTP uses CRC32C and not Adler anymore. While there change the reference to RFC 4960. This does not change any code, just comments. MFC after: 3 days	2014-03-12 15:30:40 +00:00
Gleb Smirnoff	aa69c61235	Since both netinet/ and netinet6/ call into netipsec/ and netpfil/, the protocol specific mbuf flags are shared between them. - Move all M_FOO definitions into a single place: netinet/in6.h, to avoid future clashes. - Resolve clash between M_DECRYPTED and M_SKIP_FIREWALL which resulted in a failure of operation of IPSEC and packet filters. Thanks to Nicolas and Georgios for all the hard work on bisecting, testing and finally finding the root of the problem. PR: kern/186755 PR: kern/185876 In collaboration with: Georgios Amanakis <gamanakis gmail.com> In collaboration with: Nicolas DEFFAYET <nicolas-ml deffayet.com> Sponsored by: Nginx, Inc.	2014-03-12 14:29:08 +00:00
Gleb Smirnoff	e3a7aa6f56	- Remove rt_metrics_lite and simply put its members into rtentry. - Use counter(9) for rt_pksent (former rt_rmx.rmx_pksent). This removes another cache trashing ++ from packet forwarding path. - Create zini/fini methods for the rtentry UMA zone. Via initialize mutex and counter in them. - Fix reporting of rmx_pksent to routing socket. - Fix netstat(1) to report "Use" both in kvm(3) and sysctl(3) mode. The change is mostly targeted for stable/10 merge. For head, rt_pksent is expected to just disappear. Discussed with: melifaro Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-03-05 01:17:47 +00:00
Gleb Smirnoff	2a7da7299d	Remove ifa_ref()/ifa_free(), which are atomic(9), from ip_output(). The ifaddr is already referenced by the rtentry, and we are holding reference on the rtentry throughout the function execution. Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-03-04 19:49:41 +00:00
John Baldwin	5b26ea5df3	Remove more constants related to static sysctl nodes. The MAXID constants were primarily used to size the sysctl name list macros that were removed in r254295. A few other constants either did not have an associated sysctl node, or the associated node used OID_AUTO instead. PR: ports/184525 (exp-run)	2014-02-25 18:44:33 +00:00
Gleb Smirnoff	4a2dd8d4fb	Improve logging of send errors, reporting error code and interface. Reduce code duplication between INET and INET6. Tested by: Lytochkin Boris <lytboris gmail.com>	2014-02-22 19:20:40 +00:00
Michael Tuexen	1213f0e749	Remove redundant code and fix a style error. MFC after: 3 days	2014-02-20 20:14:43 +00:00
Gleb Smirnoff	0ff96b4f55	o Remove at compile time the HASH_ALL code, that was never tested and is unfinished. However, I've tested my version, it works okay. As before it is unfinished: timeout aren't driven by TCP session state. To enable the HASH_ALL mode, one needs in kernel config: options FLOWTABLE_HASH_ALL o Reduce the alignment on flentry to 64 bytes. Without the FLOWTABLE_HASH_ALL option, twice less memory would be consumed by flows. o API to ip_output()/ip6_output() got even more thin: 1 liner. o Remove unused unions. Simply use fle->f_key[]. o Merge all IPv4 code into flowtable_lookup_ipv4(), and do same flowtable_lookup_ipv6(). Stop copying data to on stack sockaddr structures, simply use key[] on stack. o Move code from flowtable_lookup_common() that actually works on insertion into flowtable_insert(). Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-02-17 11:50:56 +00:00
Mikolaj Golub	db2f5a2461	Fixup for r261590 (vnet sysctl handlers cleanup). Reviewed by: glebius	2014-02-09 08:13:17 +00:00
Gleb Smirnoff	5d6d7e756b	o Revamp API between flowtable and netinet, netinet6. - ip_output() and ip_output6() simply call flowtable_lookup(), passing mbuf and address family. That's the only code under #ifdef FLOWTABLE in the protocols code now. o Revamp statistics gathering and export. - Remove hand made pcpu stats, and utilize counter(9). - Snapshot of statistics is available via 'netstat -rs'. - All sysctls are moved into net.flowtable namespace, since spreading them over net.inet isn't correct. o Properly separate at compile time INET and INET6 parts. o General cleanup. - Remove chain of multiple flowtables. We simply have one for IPv4 and one for IPv6. - Flowtables are allocated in flowtable.c, symbols are static. - With proper argument to SYSINIT() we no longer need flowtable_ready. - Hash salt doesn't need to be per-VNET. - Removed rudimentary debugging, which use quite useless in dtrace era. The runtime behavior of flowtable shouldn't be changed by this commit. Sponsored by: Netflix Sponsored by: Nginx, Inc.	2014-02-07 15:18:23 +00:00
Gleb Smirnoff	92f8975ff4	Utilize SYSCTL_UMA_CUR() to export usage of syncache and tcp reassembly zones. Sponsored by: Nginx, Inc.	2014-02-07 14:31:51 +00:00
Gleb Smirnoff	96d111245f	Catch up on r261590.	2014-02-07 14:26:33 +00:00
Peter Wemm	62b90589e4	Adjust r239672 from rrs and r258821 from eadler. By definition, the very first FIN is not a duplicate. Process it normally and don't feed it to congestion control as though it were a dupe. Don't prevent CC from seeing later dupe acks while in a half close state.	2014-01-28 21:13:15 +00:00
George V. Neville-Neil	6f3caa6d81	Decrease lock contention within the TCP accept case by removing the INP_INFO lock from tcp_usr_accept. As the PR/patch states this was following the advice already in the code. See the PR below for a full disucssion of this change and its measured effects. PR: 183659 Submitted by: Julian Charbon Reviewed by: jhb	2014-01-28 20:28:32 +00:00
Gleb Smirnoff	547246a373	Fix fallout from r241923. Calculate length of payload in pim_input() properly. While here, remove extra variable and incorrect condition before m_pullup(). Reported by: Olivier Cochard-Labbé <olivier cochard.me> Sponsored by: Nginx, Inc.	2014-01-22 10:57:42 +00:00
Alexander V. Chernikov	f6b84910bb	Further rework netinet6 address handling code: * Set ia address/mask values BEFORE attaching to address lists. Inet6 address assignment is not atomic, so the simplest way to do this atomically is to fill in ia before attach. * Validate irfa->ia_addr field before use (we permit ANY sockaddr in old code). * Do some renamings: in6_ifinit -> in6_notify_ifa (interaction with other subsystems is here) in6_setup_ifa -> in6_broadcast_ifa (LLE/Multicast/DaD code) in6_ifaddloop -> nd6_add_ifa_lle in6_ifremloop -> nd6_rem_ifa_lle * Split working with LLE and route announce code for last two. Add temporary in6_newaddrmsg() function to mimic current rtsock behaviour. * Call device SIOCSIFADDR handler IFF we're adding first address. In IPv4 we have to call it on every address change since ARP record is installed by arp_ifinit() which is called by given handler. IPv6 stack, on the opposite is responsible to call nd6_add_ifa_lle() so there is no reason to call SIOCSIFADDR often.	2014-01-19 16:07:27 +00:00
Adrian Chadd	9db69902c6	If the flowid is available for the mbuf that finalised the creation of a syncache connection, copy it into the inp_flowid field. Without this, an incoming TCP connection won't have an inp_flowid marked until some data comes in, and this means that things like the per-CPU TCP timer option will choose a different CPU for the timer work. (It also means that if one grabbed the flowid via an ioctl from userland, it won't be available until some data has been received.) Sponsored by: Netflix, Inc.	2014-01-18 23:48:20 +00:00
George V. Neville-Neil	d9e1bc4f0d	Fix various places where we don't properly release a lock PR: 185043 Submitted by: Michael Bentkofsky MFC after: 2 weeks	2014-01-16 22:14:54 +00:00
Gleb Smirnoff	3c065f2f3e	Cleanup comments and whitespace. No functional changes.	2014-01-16 12:58:03 +00:00
Alexander V. Chernikov	a49b317c41	Fix refcount leak on netinet ifa. Reviewed by: glebius MFC after: 2 weeks Sponsored by: Yandex LLC	2014-01-16 12:35:18 +00:00
Alexander V. Chernikov	054692a4bd	Fix ipfw fwd for IPv4 traffic broken by r249894. Problem case: Original lookup returns route with GW set, so gw points to rte->rt_gateway. After that we're changing dst and performing lookup another time. Since fwd host is most probably directly reachable, resulting rte does not contain rt_gateway, so gw is not set. Finally, we end with packet transmitted to proper interface but wrong link-layer address. Found by: lstewart Discussed with: ae,lstewart MFC after: 2 weeks Sponsored by: Yandex LLC	2014-01-16 11:50:00 +00:00
Alexander V. Chernikov	d375edc9b5	Simplify inet alias handling code: if we're adding/removing alias which has the same prefix as some other alias on the same interface, use newly-added rt_addrmsg() instead of hand-rolled in_addralias_rtmsg(). This eliminates the following rtsock messages: Pinned RTM_ADD for prefix (for alias addition). Pinned RTM_DELETE for prefix (for alias withdrawal). Example (got 10.0.0.1/24 on vlan4, playing with 10.0.0.2/24): before commit, addition: got message of size 116 on Fri Jan 10 14:13:15 2014 RTM_NEWADDR: address being added to iface: len 116, metric 0, flags: sockaddrs: <NETMASK,IFP,IFA,BRD> 255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255 got message of size 192 on Fri Jan 10 14:13:15 2014 RTM_ADD: Add Route: len 192, pid: 0, seq 0, errno 0, flags:<UP,PINNED> locks: inits: sockaddrs: <DST,GATEWAY,NETMASK> 10.0.0.0 10.0.0.2 (255) ffff ffff ff after commit, addition: got message of size 116 on Fri Jan 10 13:56:26 2014 RTM_NEWADDR: address being added to iface: len 116, metric 0, flags: sockaddrs: <NETMASK,IFP,IFA,BRD> 255.255.255.0 vlan4:8.0.27.c5.29.d4 14.0.0.2 14.0.0.255 before commit, wihdrawal: got message of size 192 on Fri Jan 10 13:58:59 2014 RTM_DELETE: Delete Route: len 192, pid: 0, seq 0, errno 0, flags:<UP,PINNED> locks: inits: sockaddrs: <DST,GATEWAY,NETMASK> 10.0.0.0 10.0.0.2 (255) ffff ffff ff got message of size 116 on Fri Jan 10 13:58:59 2014 RTM_DELADDR: address being removed from iface: len 116, metric 0, flags: sockaddrs: <NETMASK,IFP,IFA,BRD> 255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255 adter commit, withdrawal: got message of size 116 on Fri Jan 10 14:14:11 2014 RTM_DELADDR: address being removed from iface: len 116, metric 0, flags: sockaddrs: <NETMASK,IFP,IFA,BRD> 255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255 Sending both RTM_ADD/RTM_DELETE messages to rtsock is completely wrong (and requires some hacks to keep prefix in route table on RTM_DELETE). I've tested this change with quagga (no change) and bird (). bird alias handling is already broken in BSD sysdep code, so nothing changes here, too. I'm going to MFC this change if there will be no complains about behavior change. While here, fix some style(9) bugs introduced by r260488 (pointed by glebius and bde). Sponsored by: Yandex LLC MFC after: 4 weeks	2014-01-10 12:13:55 +00:00
Gleb Smirnoff	0cc726f25a	Make failure of ifpromisc() a non-fatal error. This makes it possible to run carp(4) on vtnet(4). Sponsored by: Nginx, Inc.	2014-01-03 11:03:12 +00:00
Andrey V. Elsukov	e2d14d9317	Add IF_AFDATA_WLOCK_ASSERT() in case lla_lookup() is called with LLE_CREATE flag. MFC after: 1 week	2014-01-03 02:32:05 +00:00
Gleb Smirnoff	183e1c8634	Fix regression from r249894. Now we pass "gw" as argument to if_output method, thus for multicast case we need it to point at "dst". PR: 185395 Submitted by: ae	2014-01-02 10:18:39 +00:00
Andrey V. Elsukov	ea0c377602	lla_lookup() does modification only when LLE_CREATE is specified. Thus we can use IF_AFDATA_RLOCK() instead of IF_AFDATA_LOCK() when doing lla_lookup() without LLE_CREATE flag. Reviewed by: glebius, adrian MFC after: 1 week Sponsored by: Yandex LLC	2014-01-02 08:40:37 +00:00
Gleb Smirnoff	9706c950a2	Fix couple of bugs from r257692 related to scan of address list on an interface: - in in_control() skip over not AF_INET addresses. - in in_aifaddr_ioctl() and in_difaddr_ioctl() do correct check of address family, w/o accessing memory beyond struct ifaddr. Sponsored by: Nginx, Inc.	2013-12-29 22:20:06 +00:00
Michael Tuexen	04aab884d7	Address some warnings which showed up on the userland version. MFC after: 1 week	2013-12-27 13:07:00 +00:00
Sergey Kandaurov	b8b4cfcdf6	Draft-ietf-tcpm-initcwnd-05 became RFC6928. MFC after: 1 week	2013-12-26 04:24:08 +00:00

1 2 3 4 5 ...

4865 Commits