Commit Graph

4897 Commits

Author SHA1 Message Date
Bjoern A. Zeeb
255cd9fd58 Move the tcp_fields_to_host() and tcp_fields_to_net() (inline)
functions to the tcp_var.h header file in order to avoid further
duplication with upcoming commits.

Reviewed by:	np
MFC after:	2 weeks
2014-05-23 20:15:01 +00:00
Adrian Chadd
bad008ce85 Use CPU_FIRST() / CPU_NEXT() to iterate over the valid CPU IDs. 2014-05-22 07:25:36 +00:00
Adrian Chadd
883831c675 When RSS is enabled and per cpu TCP timers are enabled, do an RSS
lookup for the inp flowid/flowtype to destination CPU.

This only modifies the case where RSS is enabled and the per-cpu tcp
timer option is enabled.  Otherwise the behaviour should be the same
as before.
2014-05-18 22:39:01 +00:00
Adrian Chadd
9c42397277 * When copying the flowid from inp -> outbound mbuf, also assign the
  hashtype to the outbound mbuf as well as the flowid.

* Add in socket options to fetch the hashid, the hashtype and RSS CPU
  ID for a given socket.
2014-05-18 22:37:31 +00:00
Adrian Chadd
2f71993288 Ensure that the flowid hashtype is assigned to the inp if the flowid
is also assigned.
2014-05-18 22:34:06 +00:00
Adrian Chadd
cc6c187794 Add a new function to do a CPU ID lookup based on RSS hash information.
This is intended to be used by various places that wish to hash some
information about a TCP/UDP/IP flow but don't necessarily have a
live mbuf to do it with.

Refactor rss_m2cpuid() to use the refactored function.
2014-05-18 22:32:04 +00:00
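
The lookup described in cc6c187794 can be pictured as two steps: reduce the hash to a bucket, then map the bucket to a CPU through an indirection table. The following is a minimal sketch under assumptions. The table contents, the bucket count, the `sketch_` names and the two-value hash type enum are all hypothetical and do not reproduce the kernel's RSS code.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NBUCKETS 8u	/* assumed to be a power of two */

/* Hypothetical indirection table: bucket index -> CPU ID. */
static const int sketch_bucket_to_cpu[SKETCH_NBUCKETS] = {
	0, 1, 2, 3, 0, 1, 2, 3
};

/* Hypothetical hash types; the real ones are the M_HASHTYPE_* values. */
enum sketch_hashtype {
	SKETCH_HASH_NONE,
	SKETCH_HASH_RSS_TCP_IPV4
};

/*
 * Map a flow hash to a CPU ID without needing an mbuf.
 * Returns -1 when the hash type cannot be used for RSS placement.
 */
static int
sketch_hash2cpuid(uint32_t hash, enum sketch_hashtype type)
{
	uint32_t bucket;

	if (type == SKETCH_HASH_NONE)
		return (-1);
	bucket = hash & (SKETCH_NBUCKETS - 1);
	assert(bucket < SKETCH_NBUCKETS);
	return (sketch_bucket_to_cpu[bucket]);
}
```

Because the function takes only the hash and its type, a caller that has stored those two values (for example in a connection structure) can repeat the lookup later, which is the situation the commit message describes.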
Adrian Chadd
34e3dcedec Add the flowtype to the inpcb.
The flowid isn't enough to use as part of any RSS related CPU affinity
lookups - the RSS code would like to know what kind of hash it is.
2014-05-18 22:30:12 +00:00
Alexander V. Chernikov
c3015737f3 Fix wrong formatting of 0.0.0.0/X table records in ipfw(8).
Add `flags` u16 field to the hole in ipfw_table_xentry structure.
Kernel has been guessing address family for supplied record based
on xent length size.
Userland, however, has been getting fixed-size ipfw_table_xentry structures
guessing address family by checking address by IN6_IS_ADDR_V4COMPAT().

Fix this behavior by providing specific IPFW_TCF_INET flag for IPv4 records.

PR:		bin/189471
Submitted by:	Dennis Yusupoff <dyr@smartspb.net>
MFC after:	2 weeks
2014-05-17 13:45:03 +00:00
Gleb Smirnoff
b1a4156614 Provide compatibility #define after r265408.
Suggested by:	truckman
2014-05-17 12:33:27 +00:00
Adrian Chadd
d804a08f3e Reserve IP_FLOWID, IP_FLOWTYPE, IP_RSSCPUID socket option IDs for
near-term future use.

These are intended to fetch the current flow id, flow hash type
(M_HASHTYPE_* from the sys/mbuf.h) and if RSS is enabled, the
RSS destined CPU ID for the receive path.
2014-05-17 00:09:12 +00:00
Mike Silbersack
f1395664e5 Remove the function tcp_twrecycleable; it has been #if 0'd for
eight years.  The original concept was to improve the
corner case where you run out of ephemeral ports, but it
was causing performance problems and the mechanism
of limiting the number of time_wait sockets serves
the same purpose in the end.

Reviewed by:	bz
2014-05-16 01:38:38 +00:00
Pyun YongHyeon
c732cd1af1 Fix checksum computation. Previously it didn't include carry.
Reviewed by:	tuexen
2014-05-13 05:07:03 +00:00
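
The commit message for c732cd1af1 does not say which routine was fixed, so the following is only a generic illustration of why the carry matters in a 16-bit ones' complement Internet checksum. `sketch_cksum()` is a hypothetical userland helper, not the kernel code: it sums 16-bit big-endian words in a wider accumulator and then folds any bits above bit 15 back into the low 16 bits before complementing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Ones' complement checksum over buf[0..len), treating the data as
 * big-endian 16-bit words.  An odd trailing byte is padded with zero.
 */
static uint16_t
sketch_cksum(const uint8_t *buf, size_t len)
{
	uint64_t sum = 0;
	size_t i;

	assert(buf != NULL || len == 0);
	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (i < len)
		sum += (uint32_t)buf[i] << 8;

	/* Fold the carries back in; skipping this step loses them. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return ((uint16_t)~sum);
}
```

With the two words 0xffff and 0x0001 the raw sum is 0x10000. Folding the carry gives 0x0001 and a checksum of 0xfffe, whereas truncating to 16 bits without the fold would give 0xffff.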
Michael Tuexen
a485f139c3 Disable TX checksum offload for UDP-Lite completely. It wasn't used for
partial checksum coverage, but even for full checksum coverage it doesn't
work.
This was discussed with Kevin Lo (kevlo@).
2014-05-12 09:46:48 +00:00
Michael Tuexen
6c19260269 Whitespace change. 2014-05-10 08:48:04 +00:00
Michael Tuexen
d58c15339b Fix a logic bug which prevented the sending of UDP packet with 0 checksum.
This bug was introduced in r264212 and should be X-MFCed with that
revision, if UDP-Lite support is MFCed.
2014-05-09 14:15:48 +00:00
Michael Tuexen
26461454fc Use KASSERTs as suggested by glebius@
MFC after: 3 days
X-MFC with: 265691
2014-05-08 20:47:54 +00:00
Michael Tuexen
8e1d0a568a For some UDP packets (for example with 200 byte payload) and IP options,
the IP header and the UDP header are not in the same mbuf.
Add code to in_delayed_cksum() to deal with this case.

MFC after: 3 days
2014-05-08 17:27:46 +00:00
Michael Tuexen
4aa74d8b65 Remove unused code. This is triggered by the bug report of Sylvestre Ledru
which deals with useless code in the userland stack:
https://bugzilla.mozilla.org/show_bug.cgi?id=1003929

MFC after: 3 days
2014-05-06 16:51:07 +00:00
Gleb Smirnoff
c669105d17 - Remove net.inet.tcp.reass.overflows sysctl. It counts exactly
  the same events that tcpstat's tcps_rcvmemdrop counter counts.
- Rename tcps_rcvmemdrop to tcps_rcvreassfull and improve its
  description in netstat(1) output.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-05-06 00:00:07 +00:00
Gleb Smirnoff
6c42c8a93f The tcp_log_addrs() uses th pointer, which points into the mbuf, thus we
can not free the mbuf before tcp_log_addrs().

Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2014-05-05 21:33:20 +00:00
Gleb Smirnoff
e407b67be4 The FreeBSD-SA-14:08.tcp was a lesson on not doing acrobatics with
mixing on stack memory and UMA memory in one linked list.

Thus, rewrite TCP reassembly code in terms of memory usage. The
algorithm remains unchanged.

We actually do not need extra memory to build a reassembly queue.
Arriving mbufs are always packet header mbufs. So we got the length
of data as pkthdr.len. We got m_nextpkt for linkage. And we need
only one pointer to point at the tcphdr, use PH_loc for that.

In tcpcb the t_segq field becomes an mbuf pointer. The t_segqlen
field now counts not packets, but bytes in the queue. This gives
us more precision when comparing to socket buffer limits.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-05-04 23:25:32 +00:00
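
The idea in e407b67be4 of building the queue out of fields the packets already carry can be sketched in userland as follows. This is a simplified model under assumptions: `struct sketch_pkt` is a hypothetical stand-in for a packet header mbuf (its members play the roles of m_nextpkt, pkthdr.len and the tcphdr reached through PH_loc), and the sketch ignores overlapping segments and queue limits, which real reassembly code has to handle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a packet header mbuf. */
struct sketch_pkt {
	struct sketch_pkt *nextpkt;	/* plays the role of m_nextpkt */
	uint32_t seq;			/* sequence number from the TCP header */
	uint32_t len;			/* plays the role of pkthdr.len */
};

/* Hypothetical stand-in for the two tcpcb fields named in the message. */
struct sketch_reassq {
	struct sketch_pkt *head;	/* t_segq: now just a packet pointer */
	uint32_t bytes;			/* t_segqlen: bytes, not packets */
};

/*
 * Insert p in sequence order.  No separate queue entry is allocated;
 * the packet itself carries the linkage.
 */
static void
sketch_reass_insert(struct sketch_reassq *q, struct sketch_pkt *p)
{
	struct sketch_pkt **pp;

	assert(q != NULL && p != NULL);
	for (pp = &q->head; *pp != NULL; pp = &(*pp)->nextpkt) {
		/* Signed difference keeps the comparison wrap-safe. */
		if ((int32_t)(p->seq - (*pp)->seq) < 0)
			break;
	}
	p->nextpkt = *pp;
	*pp = p;
	q->bytes += p->len;
}
```

Since nothing is allocated on insert, there is no second memory source to mix into the list, and counting bytes rather than packets lets the queue size be compared directly with a socket buffer limit.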
Alexander V. Chernikov
a32603a55a Fix panic on IPv4 address removal introduced in r265279.
Reported by:	Trond Endrestøl
MFC with:	r265279
2014-05-03 20:22:13 +00:00
Alexander V. Chernikov
b980262e63 Pass radix head ptr along with rte to rtexpunge().
Rename rtexpunge to rt_expunge().
2014-05-03 16:28:54 +00:00
Xin LI
c6f70658c3 Fix TCP reassembly vulnerability.
Patch done by:	glebius
Security:	FreeBSD-SA-14:08.tcp
Security:	CVE-2014-3000
2014-04-30 04:02:57 +00:00
Alan Somers
7278b62aee Fix a panic when removing an IP address from an interface, if the same address
exists on another interface.  The panic was introduced by change 264887, which
changed the fibnum parameter in the call to rtalloc1_fib() in
ifa_switch_loopback_route() from RT_DEFAULT_FIB to RT_ALL_FIBS.  The solution
is to use the interface fib in that call.  For the majority of users, that will
be equivalent to the legacy behavior.

PR:		kern/189089
Reported by:	neel
Reviewed by:	neel
MFC after:	3 weeks
X-MFC with:	264887
Sponsored by:	Spectra Logic
2014-04-29 14:46:45 +00:00
Alan Somers
0cfee0c223 Fix subnet and default routes on different FIBs on the same subnet.
These two bugs are closely related.  The root cause is that ifa_ifwithnet
does not consider FIBs when searching for an interface address.

sys/net/if_var.h
sys/net/if.c
	Add a fib argument to ifa_ifwithnet and ifa_ifwithdstaddr.  Those
	functions will only return an address whose interface fib equals the
	argument.

sys/net/route.c
	Update calls to ifa_ifwithnet and ifa_ifwithdstaddr with fib
	arguments.

sys/netinet/in.c
	Update in_addprefix to consider the interface fib when adding
	prefixes.  This will prevent it from not adding a subnet route when
	one already exists on a different fib.

sys/net/rtsock.c
sys/netinet/in_pcb.c
sys/netinet/ip_output.c
sys/netinet/ip_options.c
sys/netinet6/nd6.c
	Add RT_DEFAULT_FIB arguments to ifa_ifwithdstaddr and ifa_ifwithnet.
	In some cases there wasn't a clear specific fib number to use.
	In others, I was unable to test those functions so I chose
	RT_DEFAULT_FIB to minimize divergence from current behavior.  I will
	fix some of the latter changes along with PR kern/187553.

tests/sys/netinet/fibs_test.sh
tests/sys/netinet/udp_dontroute.c
tests/sys/netinet/Makefile
	Revert r263738.  The udp_dontroute test was right all along.
	However, bugs kern/187550 and kern/187553 cancelled each other out
	when it came to this test.  Because of kern/187553, ifa_ifwithnet
	searched the default fib instead of the requested one, but because
	of kern/187550, there was an applicable subnet route on the default
	fib.  The new test added in r263738 doesn't work right, however.  I
	can verify with dtrace that ifa_ifwithnet returned the wrong address
	before I applied this commit, but route(8) miraculously found the
	correct interface to use anyway.  I don't know how.

	Clear expected failure messages for kern/187550 and kern/187552.

PR:		kern/187550
PR:		kern/187552
Reviewed by:	melifaro
MFC after:	3 weeks
Sponsored by:	Spectra Logic
2014-04-24 23:56:56 +00:00
Alan Somers
0489b8916e Fix host and network routes for new interfaces when net.add_addr_allfibs=0
sys/net/route.c
	In rtinit1, use the interface fib instead of the process fib.  The
	latter wasn't very useful because ifconfig(8) is usually invoked
	with the default process fib.  Changing ifconfig(8) to use setfib(2)
	would be redundant, because it already sets the interface fib.

tests/sys/netinet/fibs_test.sh
	Clear the expected ATF failure

sys/net/if.c
	Pass the interface fib in calls to rtrequest1_fib and rtalloc1_fib

sys/netinet/in.c
sys/net/if_var.h
	Add a fibnum argument to ifa_switch_loopback_route, a subroutine of
	in_scrubprefix.  Pass it the interface fib.

PR:		kern/187549
Reviewed by:	melifaro
MFC after:	3 weeks
Sponsored by:	Spectra Logic Corporation
2014-04-24 17:23:16 +00:00
Steven Hartland
ae19083248 Fix jailed raw sockets not setting the correct source address by
calling in_pcbladdr instead of prison_get_ip4

MFC after:	1 month
2014-04-24 12:52:31 +00:00
Michael Tuexen
8be0fd55dc Don't free an mbuf twice. This only happens in very rare error
cases where the peer sends illegal sequencing information in
DATA chunks for an existing association.

MFC after: 3 days.
2014-04-23 21:20:55 +00:00
Rick Macklem
2aa76dba07 Add {} braces so that the code conforms to the indentation.
Fortunately, I don't think doing the assignment of cap->tsomax
unconditionally causes any problem.

Reviewed by:	glebius
MFC after:	2 weeks
2014-04-21 19:17:19 +00:00
Michael Tuexen
eb67ee5fc6 Add consistency checks to ensure that fragments of a user message
have the same U-bit.

MFC after: 3 days
2014-04-20 21:11:39 +00:00
Michael Tuexen
273351d497 Send also a packet containing an ABORT chunk in response to an OOTB packet
containing a COOKIE-ECHO chunk.

MFC after: 3 days
2014-04-20 18:15:23 +00:00
Michael Tuexen
2dec1efc5a Use consistently debug output instead of an unconditional printf.
MFC after: 3 days
2014-04-19 20:55:51 +00:00
Michael Tuexen
32451da416 Send the correct error cause, when a DATA chunk with no user data
is received. This bug was reported by Irene Ruengeler.

MFC after: 3 days
2014-04-19 19:21:06 +00:00
John Baldwin
b8c8c8c3c7 Some whitespace and style fixes.
Submitted by:	bde
2014-04-11 21:00:59 +00:00
John Baldwin
2ffb755cec The tw_pcbrele() function does not need the global timewait lock.
Submitted by:	Julien Charbon
Suggested by:	glebius
2014-04-11 19:17:45 +00:00
John Baldwin
9941de49ad Don't leak the TCP pcbinfo lock if a time wait connection is closed
in between grabbing a reference on the connection structure and obtaining
the pcbinfo lock.

Reviewed by:	Julien Charbon
2014-04-11 13:11:43 +00:00
John Baldwin
66eefb1eae Currently, the TCP slow timer can starve TCP input processing while it
walks the list of connections in TIME_WAIT closing expired connections
due to contention on the global TCP pcbinfo lock.

To remediate, introduce a new global lock to protect the list of
connections in TIME_WAIT.  Only acquire the TCP pcbinfo lock when
closing an expired connection.  This limits the window of time when
TCP input processing is stopped to the amount of time needed to close
a single connection.

Submitted by:	Julien Charbon <jcharbon@verisign.com>
Reviewed by:	rwatson, rrs, adrian
MFC after:	2 months
2014-04-10 18:15:35 +00:00
Kevin Lo
cfac59ecb1 Remove a bogus re-assignment. 2014-04-08 01:54:50 +00:00
Kevin Lo
d1b18731d9 Minor style cleanups. 2014-04-07 01:55:53 +00:00
Kevin Lo
e06e816f67 Add support for UDP-Lite protocol (RFC 3828) to IPv4 and IPv6 stacks.
Tested with vlc and a test suite [1].

[1] http://www.erg.abdn.ac.uk/~gerrit/udp-lite/files/udplite_linux.tar.gz

Reviewed by:	jhb, glebius, adrian
2014-04-07 01:53:03 +00:00
Hiren Panchasara
855363811f Improve readability of comments for DELAY_ACK() macro. 2014-04-03 01:46:03 +00:00
Michael Tuexen
6bbfa13f80 Increment the SSN only after processing the last fragment of an
ordered user message.

MFC after: 3 days
2014-04-01 18:38:04 +00:00
Andrey V. Elsukov
41ea685c32 Don't copy the MF flag from original IP header to ICMP error message.
PR:		188092
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-03-31 13:00:49 +00:00
Michael Tuexen
9ba5b6b730 Handle an edge case of address management similar to TCP.
This needs to be reconsidered when the address handling
will be reimplemented.
The patch is from rrs@.

MFC after: 3 days
2014-03-29 21:26:45 +00:00
Michael Tuexen
fe96e2852e Use SCTP_OVER_UDP_TUNNELING_PORT more consistently.
MFC after: 3 days
2014-03-29 20:21:36 +00:00
Alan Somers
743c072a09 Correct ARP update handling when the routes for network interfaces are
restricted to a single FIB in a multifib system.

Restricting an interface's routes to the FIB to which it is assigned (by
setting net.add_addr_allfibs=0) causes ARP updates to fail with "arpresolve:
can't allocate llinfo for x.x.x.x".  This is due to the ARP update code hard
coding its lookup for existing routing entries to FIB 0.

sys/netinet/in.c:
	When dealing with RTM_ADD (add route) requests for an interface, use
	the interface's assigned FIB instead of the default (FIB 0).

sys/netinet/if_ether.c:
	In arpresolve(), enhance error message generated when an
	lla_lookup() fails so that the interface causing the error is
	visible in logs.

tests/sys/netinet/fibs_test.sh
	Clear ATF expected error.

PR:		kern/167947
Submitted by:	Nikolay Denev <ndenev@gmail.com> (previous version)
Reviewed by:	melifaro
MFC after:	3 weeks
Sponsored by:	Spectra Logic Corporation
2014-03-26 22:46:03 +00:00
Hiren Panchasara
153edc50d7 Correct the comments as support for RFC 1644 has been removed for a long time. 2014-03-25 21:57:50 +00:00
Michael Tuexen
ff1ffd7499 * Provide information in error causes in ASCII instead of
  proprietary binary format.
* Add support for a diagnostic information error cause.
  The code is sysctlable and the default is 0, which
  means it is not sent.

This is joint work with rrs@.

MFC after: 1 week
2014-03-16 12:32:16 +00:00
Robert Watson
7527624efa Several years after initial development, merge prototype support for
linking NIC Receive Side Scaling (RSS) to the network stack's
connection-group implementation.  This prototype (and derived patches)
are in use at Juniper and several other FreeBSD-using companies, so
despite some reservations about its maturity, merge the patch to the
base tree so that it can be iteratively refined in collaboration rather
than maintained as a set of gradually diverging patch sets.

(1) Merge a software implementation of the Toeplitz hash specified in
    RSS implemented by David Malone.  This is used to allow suitable
    pcbgroup placement of connections before the first packet is
    received from the NIC.  Software hashing is generally avoided,
    however, due to high cost of the hash on general-purpose CPUs.

(2) In in_rss.c, maintain authoritative versions of RSS state intended
    to be pushed to each NIC, including keying material, hash
    algorithm/ configuration, and buckets.  Provide software-facing
    interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both
    the RSS standardised Toeplitz and a 'naive' variation with a hash
    efficient in software but with poor distribution properties.
    Implement rss_m2cpuid() to be used by netisr and other load
    balancing code to look up the CPU on which an mbuf should be
    processed.

(3) In the Ethernet link layer, allow netisr distribution using RSS as
    a source of policy as an alternative to source ordering; continue
    to default to direct dispatch (i.e., don't try and requeue packets
    for processing on the 'right' CPU if they arrive in a directly
    dispatchable context).

(4) Allow RSS to control tuning of connection groups in order to align
    groups with RSS buckets.  If a packet arrives on a protocol using
    connection groups, and contains a suitable hardware-generated
    hash, use that hash value to select the connection group for pcb
    lookup for both IPv4 and IPv6.  If no hardware-generated Toeplitz
    hash is available, we fall back on regular PCB lookup risking
    contention rather than pay the cost of Toeplitz in software --
    this is a less scalable but, at my last measurement, faster
    approach.  As core counts go up, we may want to revise this
    strategy despite CPU overhead.

Where device drivers suitably configure NICs, and connection groups /
RSS are enabled, this should avoid both lock and line contention during
connection lookup for TCP.  This commit does not modify any device
drivers to tune device RSS configuration to the global RSS
configuration; patches are in circulation to do this for at least
Chelsio T3 and Intel 1G/10G drivers.  Currently, the KPI for device
drivers is not particularly robust, nor aware of more advanced features
such as runtime reconfiguration/rebalancing.  This will hopefully prove
a useful starting point for refinement.

No MFC is scheduled as we will first want to nail down a more mature
and maintainable KPI/KBI for device drivers.

Sponsored by:   Juniper Networks (original work)
Sponsored by:   EMC/Isilon (patch update and merge)
2014-03-15 00:57:50 +00:00
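
For readers unfamiliar with the Toeplitz hash mentioned in item (1) of 7527624efa, here is a minimal bit-at-a-time sketch in userland C. It is not the implementation merged by that commit. The function name is hypothetical, and the sketch assumes the key is at least four bytes longer than the input, which holds for the common 40-byte RSS key and a 12-byte IPv4 4-tuple. For every set bit of the input, taken most significant bit first, the current 32-bit window of the key is XORed into the result, and the window then slides one bit along the key.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Software Toeplitz hash.  Requires keylen >= datalen + 4 so that the
 * 32-bit key window can slide across every input bit.
 */
static uint32_t
sketch_toeplitz(const uint8_t *key, size_t keylen,
    const uint8_t *data, size_t datalen)
{
	uint32_t hash = 0, window;
	size_t i;
	int b;

	assert(key != NULL && data != NULL);
	assert(keylen >= datalen + 4);

	/* Start with the leftmost 32 bits of the key. */
	window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
	    ((uint32_t)key[2] << 8) | key[3];

	for (i = 0; i < datalen; i++) {
		for (b = 7; b >= 0; b--) {
			if (data[i] & (1u << b))
				hash ^= window;
			/* Slide the window: shift in the next key bit. */
			window <<= 1;
			if (key[i + 4] & (1u << b))
				window |= 1;
		}
	}
	return (hash);
}
```

The inner loop runs once per input bit, which gives a feel for why the commit message calls the hash costly on general-purpose CPUs and prefers a hardware-generated value when one is available. For a TCP/IPv4 flow the input would be the source address, destination address, source port and destination port, concatenated in network byte order.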
Gleb Smirnoff
45c203fce2 Remove AppleTalk support.
AppleTalk was a network transport protocol for Apple Macintosh devices
in 80s and then 90s. Starting with Mac OS X in 2000 the AppleTalk was
a legacy protocol and primary networking protocol is TCP/IP. The last
Mac OS X release to support AppleTalk happened in 2009. The same year
routing equipment vendors (namely Cisco) ended their support.

Thus, AppleTalk won't be supported in FreeBSD 11.0-RELEASE.
2014-03-14 06:29:43 +00:00
Gleb Smirnoff
2c284d9395 Remove IPX support.
IPX was a network transport protocol in Novell's NetWare network operating
system from late 80s and then 90s. The NetWare itself switched to TCP/IP
as default transport in 1998. Later, in this century the Novell Open
Enterprise Server became successor of Novell NetWare. The last release
that claimed to still support IPX was OES 2 in 2007. Routing equipment
vendors (e.g. Cisco) discontinued support for IPX in 2011.

Thus, IPX won't be supported in FreeBSD 11.0-RELEASE.
2014-03-14 02:58:48 +00:00
Michael Tuexen
3d31c75401 Put the offset of the CRC32C in csum_data instead of 0.
The virtio driver needs the offset to be stored in csum_data,
like in the case for UDP and TCP.

The virtio problem was reported by
Niu Zhixiong <kaiaixi@gmail.com>, who helped in debugging
and testing the patch.

MFC after: 3 days
2014-03-12 17:18:15 +00:00
Michael Tuexen
031925742a SCTP uses CRC32C and not Adler anymore. While there change the reference
to RFC 4960.
This does not change any code, just comments.

MFC after: 3 days
2014-03-12 15:30:40 +00:00
Gleb Smirnoff
aa69c61235 Since both netinet/ and netinet6/ call into netipsec/ and netpfil/,
the protocol specific mbuf flags are shared between them.

- Move all M_FOO definitions into a single place: netinet/in6.h, to
  avoid future clashes.
- Resolve clash between M_DECRYPTED and M_SKIP_FIREWALL which resulted
  in a failure of operation of IPSEC and packet filters.

Thanks to Nicolas and Georgios for all the hard work on bisecting,
testing and finally finding the root of the problem.

PR:			kern/186755
PR:			kern/185876
In collaboration with:	Georgios Amanakis <gamanakis gmail.com>
In collaboration with:	Nicolas DEFFAYET <nicolas-ml deffayet.com>
Sponsored by:		Nginx, Inc.
2014-03-12 14:29:08 +00:00
Gleb Smirnoff
e3a7aa6f56 - Remove rt_metrics_lite and simply put its members into rtentry.
- Use counter(9) for rt_pksent (former rt_rmx.rmx_pksent). This
  removes another cache thrashing ++ from packet forwarding path.
- Create zini/fini methods for the rtentry UMA zone and initialize
  the mutex and counter in them.
- Fix reporting of rmx_pksent to routing socket.
- Fix netstat(1) to report "Use" both in kvm(3) and sysctl(3) mode.

The change is mostly targeted for stable/10 merge. For head,
rt_pksent is expected to just disappear.

Discussed with:		melifaro
Sponsored by:		Netflix
Sponsored by:		Nginx, Inc.
2014-03-05 01:17:47 +00:00
Gleb Smirnoff
2a7da7299d Remove ifa_ref()/ifa_free(), which are atomic(9), from ip_output().
The ifaddr is already referenced by the rtentry, and we are holding
reference on the rtentry throughout the function execution.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-03-04 19:49:41 +00:00
John Baldwin
5b26ea5df3 Remove more constants related to static sysctl nodes. The MAXID constants
were primarily used to size the sysctl name list macros that were removed
in r254295.  A few other constants either did not have an associated
sysctl node, or the associated node used OID_AUTO instead.

PR:		ports/184525 (exp-run)
2014-02-25 18:44:33 +00:00
Gleb Smirnoff
4a2dd8d4fb Improve logging of send errors, reporting error code and interface.
Reduce code duplication between INET and INET6.

Tested by:	Lytochkin Boris <lytboris gmail.com>
2014-02-22 19:20:40 +00:00
Michael Tuexen
1213f0e749 Remove redundant code and fix a style error.
MFC after: 3 days
2014-02-20 20:14:43 +00:00
Gleb Smirnoff
0ff96b4f55 o Remove at compile time the HASH_ALL code, that was never
  tested and is unfinished. However, I've tested my version,
  it works okay. As before it is unfinished: timeouts aren't
  driven by TCP session state. To enable the HASH_ALL mode,
  one needs in kernel config:

	options FLOWTABLE_HASH_ALL

o Reduce the alignment on flentry to 64 bytes. Without
  the FLOWTABLE_HASH_ALL option, twice less memory would
  be consumed by flows.
o API to ip_output()/ip6_output() got even more thin: 1 liner.
o Remove unused unions. Simply use fle->f_key[].
o Merge all IPv4 code into flowtable_lookup_ipv4(), and do the same for
  flowtable_lookup_ipv6(). Stop copying data to on stack
  sockaddr structures, simply use key[] on stack.
o Move code from flowtable_lookup_common() that actually works
  on insertion into flowtable_insert().

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-02-17 11:50:56 +00:00
Mikolaj Golub
db2f5a2461 Fixup for r261590 (vnet sysctl handlers cleanup).
Reviewed by:	glebius
2014-02-09 08:13:17 +00:00
Gleb Smirnoff
5d6d7e756b o Revamp API between flowtable and netinet, netinet6.
  - ip_output() and ip_output6() simply call flowtable_lookup(),
    passing mbuf and address family. That's the only code under
    #ifdef FLOWTABLE in the protocols code now.
o Revamp statistics gathering and export.
  - Remove hand made pcpu stats, and utilize counter(9).
  - Snapshot of statistics is available via 'netstat -rs'.
  - All sysctls are moved into net.flowtable namespace, since
    spreading them over net.inet isn't correct.
o Properly separate at compile time INET and INET6 parts.
o General cleanup.
  - Remove chain of multiple flowtables. We simply have one for
    IPv4 and one for IPv6.
  - Flowtables are allocated in flowtable.c, symbols are static.
  - With proper argument to SYSINIT() we no longer need flowtable_ready.
  - Hash salt doesn't need to be per-VNET.
  - Removed rudimentary debugging, which is quite useless in dtrace era.

The runtime behavior of flowtable shouldn't be changed by this commit.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-02-07 15:18:23 +00:00
Gleb Smirnoff
92f8975ff4 Utilize SYSCTL_UMA_CUR() to export usage of syncache and
tcp reassembly zones.

Sponsored by:	Nginx, Inc.
2014-02-07 14:31:51 +00:00
Gleb Smirnoff
96d111245f Catch up on r261590. 2014-02-07 14:26:33 +00:00
Peter Wemm
62b90589e4 Adjust r239672 from rrs and r258821 from eadler.
By definition, the very first FIN is not a duplicate. Process it normally
and don't feed it to congestion control as though it were a dupe.  Don't
prevent CC from seeing later dupe acks while in a half close state.
2014-01-28 21:13:15 +00:00
George V. Neville-Neil
6f3caa6d81 Decrease lock contention within the TCP accept case by removing
the INP_INFO lock from tcp_usr_accept.  As the PR/patch states
this was following the advice already in the code.
See the PR below for a full discussion of this change and its
measured effects.

PR:		183659
Submitted by:	Julien Charbon
Reviewed by:	jhb
2014-01-28 20:28:32 +00:00
Gleb Smirnoff
547246a373 Fix fallout from r241923. Calculate length of payload in
pim_input() properly.

While here, remove extra variable and incorrect condition
before m_pullup().

Reported by:	Olivier Cochard-Labbé <olivier cochard.me>
Sponsored by:	Nginx, Inc.
2014-01-22 10:57:42 +00:00
Alexander V. Chernikov
f6b84910bb Further rework netinet6 address handling code:
* Set ia address/mask values BEFORE attaching to address lists.
Inet6 address assignment is not atomic, so the simplest way to
do this atomically is to fill in ia before attach.
* Validate irfa->ia_addr field before use (we permit ANY sockaddr in old code).
* Do some renamings:
  in6_ifinit -> in6_notify_ifa (interaction with other subsystems is here)
  in6_setup_ifa -> in6_broadcast_ifa (LLE/Multicast/DaD code)
  in6_ifaddloop -> nd6_add_ifa_lle
  in6_ifremloop -> nd6_rem_ifa_lle
* Split working with LLE and route announce code for last two.
Add temporary in6_newaddrmsg() function to mimic current rtsock behaviour.
* Call device SIOCSIFADDR handler IFF we're adding first address.
In IPv4 we have to call it on every address change since ARP record
is installed by arp_ifinit() which is called by given handler.
The IPv6 stack, by contrast, is responsible for calling nd6_add_ifa_lle(), so
there is no reason to call SIOCSIFADDR often.
2014-01-19 16:07:27 +00:00
Adrian Chadd
9db69902c6 If the flowid is available for the mbuf that finalised the creation
of a syncache connection, copy it into the inp_flowid field.

Without this, an incoming TCP connection won't have an inp_flowid marked
until some data comes in, and this means that things like the per-CPU
TCP timer option will choose a different CPU for the timer work.
(It also means that if one grabbed the flowid via an ioctl from userland,
it won't be available until some data has been received.)

Sponsored by:	Netflix, Inc.
2014-01-18 23:48:20 +00:00
George V. Neville-Neil
d9e1bc4f0d Fix various places where we don't properly release a lock
PR:		185043
Submitted by:	Michael Bentkofsky
MFC after:	2 weeks
2014-01-16 22:14:54 +00:00
Gleb Smirnoff
3c065f2f3e Cleanup comments and whitespace. No functional changes. 2014-01-16 12:58:03 +00:00
Alexander V. Chernikov
a49b317c41 Fix refcount leak on netinet ifa.
Reviewed by:	glebius
MFC after:	2 weeks
Sponsored by:	Yandex LLC
2014-01-16 12:35:18 +00:00
Alexander V. Chernikov
054692a4bd Fix ipfw fwd for IPv4 traffic broken by r249894.
Problem case:
Original lookup returns route with GW set, so gw points to
rte->rt_gateway.
After that we're changing dst and performing lookup another time.
Since fwd host is most probably directly reachable, resulting
rte does not contain rt_gateway, so gw is not set. Finally, we
end with packet transmitted to proper interface but wrong
link-layer address.

Found by:	lstewart
Discussed with:	ae,lstewart
MFC after:	2 weeks
Sponsored by:	Yandex LLC
2014-01-16 11:50:00 +00:00
Alexander V. Chernikov
d375edc9b5 Simplify inet alias handling code: if we're adding/removing alias which
has the same prefix as some other alias on the same interface, use
newly-added rt_addrmsg() instead of hand-rolled in_addralias_rtmsg().

This eliminates the following rtsock messages:

Pinned RTM_ADD for prefix (for alias addition).
Pinned RTM_DELETE for prefix (for alias withdrawal).

Example (got 10.0.0.1/24 on vlan4, playing with 10.0.0.2/24):

before commit, addition:

  got message of size 116 on Fri Jan 10 14:13:15 2014
  RTM_NEWADDR: address being added to iface: len 116, metric 0, flags:
  sockaddrs: <NETMASK,IFP,IFA,BRD>
   255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255

  got message of size 192 on Fri Jan 10 14:13:15 2014
  RTM_ADD: Add Route: len 192, pid: 0, seq 0, errno 0, flags:<UP,PINNED>
  locks:  inits:
  sockaddrs: <DST,GATEWAY,NETMASK>
   10.0.0.0 10.0.0.2 (255) ffff ffff ff

after commit, addition:

  got message of size 116 on Fri Jan 10 13:56:26 2014
  RTM_NEWADDR: address being added to iface: len 116, metric 0, flags:
  sockaddrs: <NETMASK,IFP,IFA,BRD>
   255.255.255.0 vlan4:8.0.27.c5.29.d4 14.0.0.2 14.0.0.255

before commit, withdrawal:

  got message of size 192 on Fri Jan 10 13:58:59 2014
  RTM_DELETE: Delete Route: len 192, pid: 0, seq 0, errno 0, flags:<UP,PINNED>
  locks:  inits:
  sockaddrs: <DST,GATEWAY,NETMASK>
   10.0.0.0 10.0.0.2 (255) ffff ffff ff

  got message of size 116 on Fri Jan 10 13:58:59 2014
  RTM_DELADDR: address being removed from iface: len 116, metric 0, flags:
  sockaddrs: <NETMASK,IFP,IFA,BRD>
   255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255

after commit, withdrawal:

  got message of size 116 on Fri Jan 10 14:14:11 2014
  RTM_DELADDR: address being removed from iface: len 116, metric 0, flags:
  sockaddrs: <NETMASK,IFP,IFA,BRD>
   255.255.255.0 vlan4:8.0.27.c5.29.d4 10.0.0.2 10.0.0.255

Sending both RTM_ADD/RTM_DELETE messages to rtsock is completely wrong
(and requires some hacks to keep prefix in route table on RTM_DELETE).

I've tested this change with quagga (no change) and bird (*).

bird alias handling is already broken in *BSD sysdep code, so nothing
changes here, too.

I'm going to MFC this change if there are no complaints about the behavior
change.

While here, fix some style(9) bugs introduced by r260488
(pointed by glebius and bde).

Sponsored by:	Yandex LLC
MFC after:	4 weeks
2014-01-10 12:13:55 +00:00
Gleb Smirnoff
0cc726f25a Make failure of ifpromisc() a non-fatal error. This makes it possible to
run carp(4) on vtnet(4).

Sponsored by:	Nginx, Inc.
2014-01-03 11:03:12 +00:00
Andrey V. Elsukov
e2d14d9317 Add IF_AFDATA_WLOCK_ASSERT() in case lla_lookup() is called with
LLE_CREATE flag.

MFC after:	1 week
2014-01-03 02:32:05 +00:00
Gleb Smirnoff
183e1c8634 Fix regression from r249894. Now we pass "gw" as argument to if_output
method, thus for multicast case we need it to point at "dst".

PR:		185395
Submitted by:	ae
2014-01-02 10:18:39 +00:00
Andrey V. Elsukov
ea0c377602 lla_lookup() does modification only when LLE_CREATE is specified.
Thus we can use IF_AFDATA_RLOCK() instead of IF_AFDATA_LOCK() when doing
lla_lookup() without LLE_CREATE flag.

Reviewed by:	glebius, adrian
MFC after:	1 week
Sponsored by:	Yandex LLC
2014-01-02 08:40:37 +00:00
Gleb Smirnoff
9706c950a2 Fix couple of bugs from r257692 related to scan of address list on
an interface:
- in in_control() skip over not AF_INET addresses.
- in in_aifaddr_ioctl() and in_difaddr_ioctl() do correct check
  of address family, w/o accessing memory beyond struct ifaddr.

Sponsored by:	Nginx, Inc.
2013-12-29 22:20:06 +00:00
Michael Tuexen
04aab884d7 Address some warnings which showed up on the userland version.
MFC after: 1 week
2013-12-27 13:07:00 +00:00
Sergey Kandaurov
b8b4cfcdf6 Draft-ietf-tcpm-initcwnd-05 became RFC6928.
MFC after:	1 week
2013-12-26 04:24:08 +00:00
Bjoern A. Zeeb
415167d52b Add more (IPv6) related Internet Protocols:
- Host Identity Protocol (RFC5201)
- Shim6 Protocol (RFC5533)
- 2x experimentation and testing (RFC3692, RFC4727)

This does not indicate interest to implement/support these protocols,
but they are part of the "IPv6 Extension Header Types" [1] based on RFC7045
and might thus be needed by filtering and next header parsing
implementations.

References:	[1] http://www.iana.org/assignments/ipv6-parameters
Obtained from:	http://www.iana.org/assignments/protocol-numbers
MFC after:	1 week
2013-12-25 20:26:49 +00:00
Gleb Smirnoff
ec5df3a7b1 It'll be okay to use LibAliasDetachHandlers() here, relying
on the fact that all handlers come from modules' bss and are
followed by NODIR handler.
2013-12-25 09:43:51 +00:00
Gleb Smirnoff
535e0a0981 Cleanup alias module handler register/unregister.
- Remove locking, since all module(9) events are running under &Giant.
- Use TAILQ for protocol handlers and fix a bug which led to
  infinite cycle. Bug found in VirtualBox [1]
- Simplify code everywhere.
- Fix documentation.

[1]  https://www.virtualbox.org/pipermail/vbox-dev/2013-November/011936.html

PR:		183792 [1]
Submitted by:	Valery Ushakov <uwe NetBSD.org> [1]
Sponsored by:	Nginx, Inc.
2013-12-25 03:24:20 +00:00
Gleb Smirnoff
2fb87f0892 Kill space at eols. 2013-12-25 02:06:57 +00:00
Gleb Smirnoff
1019f603d5 Remove from kernel the "dll" code. 2013-12-25 01:58:19 +00:00
Gleb Smirnoff
22d3fb1917 Whitespace cleanup. 2013-12-25 01:52:55 +00:00
Dimitry Andric
36f54f0aaa In sys/netinet/in_mcast.c, inm_is_ifp_detached() is only used whenever
KTR is defined, so put it between #ifdef KTR guards.  This avoids a
warning about an unused function if KTR is not enabled.

MFC after:	 3 days
2013-12-24 20:25:18 +00:00
Adrian Chadd
ac7e121247 Disable the now unpredictably bogus check for whether we have
enough queue space before queuing a bunch of IP fragments.

As the comment in the committed change says, in the post-if_transmit(),
post-SMP, post-preemption world, there are just too many overlapping
concurrent code paths and different approaches to driver transmit
queue management for this code to be even remotely effective.

The only specific place it could be useful is if ALTQ is enabled
but again it doesn't at all promise that all the fragments will be
transmitted anyway.

The main reason for committing this change is to disable a parallel
place where the drops counter is incremented.  This is a side effect
of an upcoming change to ixgbe/cxgbe to handle the queue drops
counter slightly better.

Sponsored by:	Netflix, Inc.
2013-12-20 07:41:03 +00:00
Eitan Adler
5f30ec9b63 In a situation where:
- The remote host sends a FIN
	- in an ACK for a sequence number for which an ACK has already
	  been received
	- There is still unacked data en route to the remote host
	- The packet does not contain a window update

The packet may be dropped without processing the FIN flag.

PR:		kern/99188
Submitted by:	Staffan Ulfberg <staffan@ulfberg.se>
Discussed with:	andre
MFC after:	never
2013-12-02 03:11:25 +00:00
Michael Tuexen
c302aeb123 In
http://svnweb.freebsd.org/changeset/base/258221
I introduced a bug which initialized global locks
whenever the SCTP stack initialized. This was fixed in
http://svnweb.freebsd.org/changeset/base/258574
by rodrigc@. He just initialized the locks for
the default vnet. This fix reverts to the old
behaviour before r258221, which explicitly makes
sure it is only called once, because this works also on
other platforms.
MFC after: 3 days
X-MFC with: r258574.
2013-11-30 12:51:19 +00:00
Andriy Gapon
d9fae5ab88 dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE
In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).

Reviewed by:	markj
MFC after:	3 weeks
2013-11-26 08:46:27 +00:00
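
The naming convention described in d9fae5ab88 can be illustrated with a small userland sketch. It is not the kernel's SDT code: the helper name `probe_name_from_ident` is hypothetical, and it only shows how an identifier containing `__` maps to a probe name containing `-`.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of the kernel SDT implementation:
 * copy the C identifier `src` into `dst`, turning every pair of
 * consecutive underscores into a single dash.
 * Returns 0 on success, -1 if `dst` is too small.
 */
int
probe_name_from_ident(const char *src, char *dst, size_t dstlen)
{
	size_t o = 0;

	for (size_t i = 0; src[i] != '\0'; i++) {
		/* Keep room for this character and the terminating NUL. */
		if (o + 1 >= dstlen)
			return (-1);
		if (src[i] == '_' && src[i + 1] == '_') {
			dst[o++] = '-';
			i++;		/* skip the second underscore */
		} else {
			dst[o++] = src[i];
		}
	}
	if (o >= dstlen)
		return (-1);
	dst[o] = '\0';
	return (0);
}
```

With this mapping an identifier such as `accept__established` (used here only as an illustration) would be presented as `accept-established`, while single underscores are left untouched.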
Adrian Chadd
fa22ce1570 Convert over the TCP probes to use mtod() rather than directly
dereferencing m->m_data.

Sponsored by:	Netflix, Inc.
2013-11-25 22:55:06 +00:00
Craig Rodrigues
c0c61281b4 Only initialize some mutexes for the default VNET.
In r208160, sctp_it_ctl was made a global variable, across all VNETs.
However, sctp_init() is called for every VNET that is created.  This results
in the same global mutexes which are part of sctp_it_ctl being initialized.  This can result
in crashes if many jails are created.

To reproduce the problem:
  (1)  Take a GENERIC kernel config, and add options for: VIMAGE, WITNESS,
       INVARIANTS.
  (2)  Run this command in a loop:
       jail -l -u root -c path=/ name=foo persist vnet && jexec foo ifconfig lo0 127.0.0.1/8 && jail -r foo

       (see http://lists.freebsd.org/pipermail/freebsd-current/2010-November/021280.html )

Witness will warn about the same mutex being initialized.

Fix the problem by only initializing these mutexes in the default VNET.
2013-11-25 18:49:37 +00:00
Attilio Rao
54366c0bd7 - For kernels compiled only with KDTRACE_HOOKS and not any lock debugging
  option, unbreak the lock tracing release semantic by embedding
  calls to LOCKSTAT_PROFILE_RELEASE_LOCK() directly in the inlined
  version of the releasing functions for mutex, rwlock and sxlock.
  Failing to do so skips the lockstat_probe_func invocation for
  unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
  kernels compiled without lock debugging options, potentially every
  consumer must be compiled including opt_kdtrace.h.
  Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
  dependency on opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
  is linked there and it is only used as a compile-time stub [0].

[0] immediately exposes a new bug: DTRACE-derived debug support in sfxge
is broken and was never really tested.  Because sfxge did not include
opt_kdtrace.h correctly before, the support was never enabled and stayed
broken for a while.  Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by:	EMC / Isilon storage division
Discussed with:	rstone
[0] Reported by:	rstone
[1] Discussed with:	philip
2013-11-25 07:38:45 +00:00
Gleb Smirnoff
c1f7c3f500 In r257692 I intentionally deleted code that handled P2P interfaces
with equal addresses on both sides. It turned out that OpenVPN uses
such configurations.

Submitted by:	trociny
2013-11-17 15:14:07 +00:00
Mikolaj Golub
a3985bdd12 Deregister helper hooks on vnet destroy. 2013-11-17 15:09:39 +00:00
Michael Tuexen
2a44dbf682 Use SCTP_PR_SCTP_TTL when the user provides a positive
timetolive in sctp_sendmsg().

MFC after: 3 days
2013-11-16 19:57:56 +00:00
Michael Tuexen
04194e4f7d Remove a stray write operation.
MFC after: 3 days
2013-11-16 16:09:09 +00:00
Michael Tuexen
dcb3fc4cd6 When determining if an address belongs to an stcb, take the address family
into account for wildcard bound endpoints.

MFC after: 3 days
2013-11-16 15:34:14 +00:00
Michael Tuexen
f4f34bde23 Cleanups which result in fixes which have been made upstream
and were partially suggested by Andrew Galante.
There is no functional change in FreeBSD.

MFC after: 3 days
2013-11-16 15:04:49 +00:00
Gleb Smirnoff
555036b5f6 Remove never used ioctls that originate from KAME. The proof
of their zero usage was exp-run from misc/183538.
2013-11-11 05:39:42 +00:00
Gleb Smirnoff
2f3eb7f4d8 Make TCP_KEEP* socket options readable. At least PostgreSQL wants
to read the values.

Reported by:	sobomax
2013-11-08 13:04:14 +00:00
Michael Tuexen
de72f4e54b Get rid of the artificial limitation enforced by
SCTP_AUTH_RANDOM_SIZE_MAX.
This was suggested by Andrew Galante.

MFC after: 3 days
2013-11-07 18:50:11 +00:00
Michael Tuexen
a9d94d290b Make sure that we don't try to build an ASCONF-ACK chunk
larger than what fits in the mbuf cluster.
This issue was reported by Andrew Galante.

MFC after: 3 days
2013-11-07 17:08:09 +00:00
Michael Tuexen
c9eb4473b4 Use htons()/ntohs() appropriately.
These issues were reported by Andrew Galante.

MFC after: 3 days
2013-11-07 16:37:12 +00:00
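
As a reminder of what using htons()/ntohs() appropriately means, here is a minimal userland sketch. It is not the SCTP code touched by c9eb4473b4; `struct wire_hdr` and both helper functions are hypothetical and only illustrate converting at the boundary between host values and on-the-wire fields.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Hypothetical on-the-wire header: `port` is stored in network byte order. */
struct wire_hdr {
	uint16_t port;
};

/* Store a host-order port number into the wire header. */
void
wire_set_port(struct wire_hdr *h, uint16_t host_port)
{
	h->port = htons(host_port);
}

/* Read the port back as a host-order value. */
uint16_t
wire_get_port(const struct wire_hdr *h)
{
	return (ntohs(h->port));
}
```

Keeping the conversion inside accessors like these means the rest of the code never compares or does arithmetic on a network-order value by accident.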
Gleb Smirnoff
77b89ad837 Provide compat layer for OSIOCAIFADDR. 2013-11-06 19:46:20 +00:00
Gleb Smirnoff
821b5caf7a Fix my braino in r257692. For SIOCG*ADDR we don't need exact match on
specified address, actually in most cases the address isn't specified.

Reported by:	peter
2013-11-06 08:36:08 +00:00
Nathan Whitehorn
6224cd89c0 Fix build on GCC. 2013-11-06 01:14:00 +00:00
Gleb Smirnoff
fe9bfbcf5a netinet code no longer uses IFA_RTSELF. 2013-11-05 07:45:20 +00:00
Gleb Smirnoff
f7a39160c2 Rewrite in_control(), so that it is comprehensible without getting mad.
o Provide separate functions for SIOCAIFADDR and for SIOCDIFADDR, with
  clear code flow from beginning to the end. After that the rest of
  in_control() gets very small and clear.
o Provide sx(9) lock to protect against parallel ioctl() invocations.
o Reimplement logic from r201282, that tried to keep localhost route in
  table when multiple P2P interfaces with same local address are created
  and deleted.

Discussed with:		pluknet, melifaro
Sponsored by:		Netflix
Sponsored by:		Nginx, Inc.
2013-11-05 07:44:15 +00:00
Gleb Smirnoff
b1b9dcae46 Remove net.link.ether.inet.useloopback sysctl tunable. It was always on by
default from the very beginning. It was placed in wrong namespace
net.link.ether, originally it had been at another wrong namespace. It was
incorrectly documented at incorrect manual page arp(8). Since new-ARP commit,
the tunable has been consulted only on route addition, and ignored on route
deletion. Behaviour of a system with tunable turned off is not fully correct,
and has no advantages compared to normal behavior.
2013-11-05 07:32:09 +00:00
Michael Tuexen
3b3d05d769 Unlock the lock before destroying it.
This issue was reported by Andrew Galante.

MFC after: 3 days
2013-11-03 14:00:17 +00:00
Michael Tuexen
b54ddf225f Changes from upstream to improve compilation when INET or INET6
or none of them is defined.

MFC after: 3 days
2013-11-02 20:12:19 +00:00
Gleb Smirnoff
586904c22e in_ifadown() can be void. 2013-11-01 10:29:10 +00:00
Gleb Smirnoff
237bf7f773 Cleanup in_ifscrub(), which is just an entry to in_scrubprefix(). 2013-11-01 10:18:41 +00:00
Michael Tuexen
6ed728108a Terminate a debug output with a \n. 2013-10-29 20:04:50 +00:00
Gleb Smirnoff
8d7cf9b5d4 Uninline inm_lookup_locked(). Now in_var.h doesn't dereference
fields of struct ifnet.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-29 11:21:31 +00:00
Michael Tuexen
92dfa76cbc Fix the value of *optlen when calling getsockopt() for
SCTP_REMOTE_UDP_ENCAPS_PORT.
This issue was reported by Andrew Galante.
MFC after: 3 days
2013-10-28 20:45:19 +00:00
Michael Tuexen
daac3e7db6 Fix compilation if SCTP_DONT_DO_PRIVADDR_SCOPE is defined.
The issue was reported by Andrew Galante.

MFC after: 3 days
2013-10-28 20:32:37 +00:00
Gleb Smirnoff
c3322cb91c Include necessary headers that now are available due to pollution
via if_var.h.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-28 07:29:16 +00:00
Gleb Smirnoff
eedc7fd9e8 Provide includes that are needed in these files, and before were read
in implicitly via if.h -> if_var.h pollution.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-26 18:18:50 +00:00
Gleb Smirnoff
76039bc84f The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
for this event by adding if_var.h to files that do need it. Also, explicitly
include all headers that are currently included only through implicit
pollution via if_var.h.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-26 17:58:36 +00:00
John Baldwin
3380883230 Finish r254925 and remove the last remaining sysctl name list macro. The
one port that used it has been fixed to use the more portable
getprotoent(3) instead.
2013-10-23 13:22:50 +00:00
Andre Oppermann
c1e5a6e5e8 The TCP delayed ACK logic isn't aware of LRO passing up large aggregated
segments, thinking it received only one segment. This causes it to
delay the ACK for 100ms to wait for another segment which may never
come because all the data was received already.

Doing delayed ACK for LRO segments is bogus for two reasons: a) it pushes
us further away from acking every other packet; b) it introduces additional
delay in responding to the sender.  The latter is especially bad because it
is in the nature of LRO to aggregate all segments of a burst with no more
coming until an ACK is sent back.

Change the delayed ACK logic to detect LRO segments by being larger than
the MSS for this connection and issuing an immediate ACK for them to keep
the ACK clock ticking without interruption.

Reported by:	julian, cperciva
Tested by:	cperciva
Reviewed by:	lstewart
MFC after:	3 days
2013-10-22 18:24:34 +00:00
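
The detection rule described in c1e5a6e5e8 can be sketched as a single predicate. This is not the actual tcp_input.c change: the function name `ack_immediately` and its parameters are hypothetical, and the real code has further conditions that are not modeled here.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical predicate, not the kernel implementation: a received
 * segment longer than the connection's MSS can only come from LRO
 * aggregation, so it should be ACKed at once instead of being delayed.
 */
bool
ack_immediately(uint32_t seg_len, uint32_t mss)
{
	return (seg_len > mss);
}
```

A segment of exactly one MSS still follows the normal delayed-ACK path in this sketch; only aggregated segments bypass the delay.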
Kevin Lo
9768475eb0 - Add parentheses to all internet addresses
- All the casts to uint32_t should be to in_addr_t

Suggested by:	bde
Reviewed by:	bde
2013-10-19 18:13:32 +00:00
Michael Tuexen
77dabf96d9 Remove a buggy comparison when manually setting the path MTU.
After fixing, the comparison would have become redundant.
Thanks to Andrew Galante for reporting the issue.

MFC after:	3 days
2013-10-15 20:21:27 +00:00
Gleb Smirnoff
7caf4ab7ac - Utilize counter(9) to accumulate statistics on interface addresses. Add
  four counters to struct ifaddr. This kills '+=' on variables shared
  between processors for every packet.
- Nuke struct if_data from struct ifaddr.
- In ip_input() do not put a reference on ifaddr, instead update statistics
  right now in place and do IN_IFADDR_RUNLOCK(). This removes atomic(9)
  for every packet. [1]
- To properly support NET_RT_IFLISTL sysctl used by getifaddrs(3), in
  rtsock.c fill if_data fields using counter_u64_fetch().
- Accidentally fix bug in COMPAT_32 version of NET_RT_IFLISTL, which
  took if_data not from the ifaddr, but from ifaddr's ifnet. [2]

Submitted by:	melifaro [1], pluknet[2]
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-15 11:37:57 +00:00
Gleb Smirnoff
4675896098 Remove ifa_init() and provide ifa_alloc() that will allocate and setup
struct ifaddr internally.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-15 10:31:42 +00:00
Gleb Smirnoff
6ed910fabe Hide 'struct ifaddr' definition from userland. Two tools left that use it,
namely ipftest(1) and ifmcstat(1). These sniff structure definition using
_WANT_IFADDR define.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-15 10:19:24 +00:00
Kevin Lo
b298a86678 Treat INADDR_NONE as uint32_t.
Reviewed by:	glebius
2013-10-15 07:35:39 +00:00
Gleb Smirnoff
c11a15bf8d When processing ACK in tcp_do_segment, use sbcut_locked() instead of
sbdrop_locked() to cut acked mbufs from the socket buffer. Free this
chain in a batch manner after the socket buffer lock is dropped.

This measurably reduces contention on socket buffer.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
Approved by:	re (marius)
2013-10-09 12:00:38 +00:00
Mark Johnston
8298c17c6c Add a separate translator for headers passed to the TCP probes in the
input path. These probes get some of the fields in host order, whereas the
output probes get them in network order, so a single translator isn't
enough. This workaround ensures that the problem is essentially invisible
to users: none of the probe arguments or their fields have changed.

Approved by:	re (hrs)
2013-10-02 17:14:12 +00:00
Bjoern A. Zeeb
a5f44cd7a1 Introduce spares in the TCP syncache and timewait structures
so that fixed TCP_SIGNATURE handling can later be merged.

This is derived from follow-up work to SVN r183001 posted to
net@ on Sep 13 2008.

Approved by:	re (gjb)
2013-09-21 10:01:51 +00:00
Mikolaj Golub
4d3dfd450a Unregister inet/inet6 pfil hooks on vnet destroy.
Discussed with:	andre
Approved by:	re (rodrigc)
2013-09-13 18:45:10 +00:00
Michael Tuexen
5dc80df9c5 Fix the aborting of association with the iterator using an empty
user initiated error cause (using SCTP_ABORT|SCTP_SENDALL).

Approved by: re (delphij)
MFC after: 1 week
2013-09-09 21:40:07 +00:00
Mikolaj Golub
1f6addd92c Release the interface last.
Reviewed by:	glebius
Approved by:	re (kib)
2013-09-08 18:19:40 +00:00
Michael Tuexen
d4d23375d3 When computing the partial delivery point, take the
receiver socket buffer size correctly into account.

MFC after: 1 week
2013-09-07 00:45:24 +00:00
John Baldwin
86d93a15ff Use LIST_FOREACH_SAFE() instead of doing it by hand. 2013-09-05 14:26:37 +00:00
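
For readers unfamiliar with the macro named in 86d93a15ff, the following userland sketch shows why the _SAFE variant matters when elements are removed during traversal. It is not the code changed by that commit: `struct node`, `remove_even()` and the fallback macro definition (for systems whose `<sys/queue.h>` lacks it) are assumptions made for illustration.

```c
#include <stdlib.h>
#include <sys/queue.h>

/* Fallback for systems whose <sys/queue.h> does not provide the macro. */
#ifndef LIST_FOREACH_SAFE
#define	LIST_FOREACH_SAFE(var, head, field, tvar)			\
	for ((var) = LIST_FIRST((head));				\
	    (var) != NULL && ((tvar) = LIST_NEXT((var), field), 1);	\
	    (var) = (tvar))
#endif

/* Hypothetical list element used only for this example. */
struct node {
	int value;
	LIST_ENTRY(node) link;
};
LIST_HEAD(node_list, node);

/* Remove and free every node holding an even value; return the count. */
int
remove_even(struct node_list *head)
{
	struct node *n, *tmp;
	int removed = 0;

	/* `tmp` caches the successor, so freeing `n` is safe. */
	LIST_FOREACH_SAFE(n, head, link, tmp) {
		if (n->value % 2 == 0) {
			LIST_REMOVE(n, link);
			free(n);
			removed++;
		}
	}
	return (removed);
}
```

Doing the same with plain LIST_FOREACH() would read the link field of a node that has already been freed, which is the hand-written bookkeeping the macro replaces.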
John Baldwin
fa302f207f Use an unsigned long when indexing into mfchashtbl[] and mf6ctable[]. This
matches the types used when computing hash indices and the type of the
maximum size of mfchashtbl[].

PR:		kern/181821
Submitted by:	Sven-Thorsten Dietrich <sven@vyatta.com> (IPv4)
MFC after:	1 week
2013-09-05 14:16:37 +00:00
Andrey V. Elsukov
d983befd2f Remove unused code and sort variables declarations.
PR:		kern/181822
MFC after:	1 week
2013-09-05 08:12:36 +00:00
Michael Tuexen
0ddb429900 Remove redundant field pr_sctp_on.
MFC after: 1 week
2013-09-03 19:31:59 +00:00
Michael Tuexen
a28c9ff0b7 Use uint16_t instead of in_port_t for consistency with the SCTP code.
MFC after: 1 week
2013-09-02 23:27:53 +00:00
Michael Tuexen
e6b2b4b65b All changes affect only SCTP-AUTH:
* Remove non-working code related to SHA224.
* Remove support for non-standardised HMAC-IDs using SHA384 and SHA512.
* Prefer SHA256 over SHA1.
* Minor cleanup.

MFC after: 2 weeks
2013-09-02 22:48:41 +00:00
Navdeep Parhar
7127e6acf0 Merge r254336 from user/np/cxl_tuning.
Add a last-modified timestamp to each LRO entry and provide an interface
to flush all inactive entries.  Drivers decide when to flush and what
the inactivity threshold should be.

Network drivers that process an rx queue to completion can enter a
livelock type situation when the rate at which packets are received
reaches equilibrium with the rate at which the rx thread is processing
them.  When this happens the final LRO flush (normally when the rx
routine is done) does not occur.  Pure ACKs and segments with total
payload < 64K can get stuck in an LRO entry.  Symptoms are that TCP
tx-mostly connections' performance falls off a cliff during heavy,
unrelated rx on the interface.

Flushing only inactive LRO entries works better than any of these
alternatives that I tried:
- don't LRO pure ACKs
- flush _all_ LRO entries periodically (every 'x' microseconds or every
  'y' descriptors)
- stop rx processing in the driver periodically and schedule remaining
  work for later.

Reviewed by:	andre
2013-08-28 23:00:34 +00:00
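
The idea in 7127e6acf0 of timestamping each LRO entry and flushing only the inactive ones can be sketched in userland C. This is not the tcp_lro.c interface: the structure, field names, function name and microsecond time base are all hypothetical, and "flushing" is reduced to marking the slot free.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an LRO entry. */
struct lro_entry_sketch {
	bool     in_use;		/* slot holds an aggregated segment */
	uint64_t last_modified_us;	/* time of the last append */
};

/*
 * Flush every in-use entry that has been idle for at least
 * `threshold_us`.  The caller (the driver, in the commit's wording)
 * chooses when to call this and what the threshold is.
 * Returns the number of entries flushed.
 */
size_t
lro_flush_inactive_sketch(struct lro_entry_sketch *tbl, size_t n,
    uint64_t now_us, uint64_t threshold_us)
{
	size_t flushed = 0;

	for (size_t i = 0; i < n; i++) {
		if (!tbl[i].in_use)
			continue;
		/* Guard against a timestamp that is ahead of `now_us`. */
		if (now_us >= tbl[i].last_modified_us &&
		    now_us - tbl[i].last_modified_us >= threshold_us) {
			/* Stand-in for passing the segment up the stack. */
			tbl[i].in_use = false;
			flushed++;
		}
	}
	return (flushed);
}
```

Entries that were modified recently keep aggregating, which is the difference from the "flush _all_ LRO entries periodically" alternative listed in the commit message.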
John Baldwin
fd77bbb967 Remove most of the remaining sysctl name list macros. They were only
ever intended for use in sysctl(8) and it has not used them for many
years.

Reviewed by:	bde
Tested by:	exp-run by bdrewery
2013-08-26 18:16:05 +00:00
Mark Johnston
1ad19fb657 The second last argument of udp:::receive is supposed to contain the
connection state, not the IP header.

X-MFC with:	r254889
2013-08-26 00:28:57 +00:00
Mark Johnston
57f6086735 Implement the ip, tcp, and udp DTrace providers. The probe definitions use
dynamic translation so that their arguments match the definitions for
these providers in Solaris and illumos. Thus, existing scripts for these
providers should work unmodified on FreeBSD.

Tested by:	gnn, hiren
MFC after:	1 month
2013-08-25 21:54:41 +00:00
Michael Tuexen
1a94cdbea7 Provide human readable debug output. 2013-08-25 12:44:03 +00:00