Commit Graph

5855 Commits

Author SHA1 Message Date
Julien Charbon
f1ee30ccd6 Remove an extraneous call to soisconnected() in syncache_socket(),
introduced with r261242.  The useful and expected soisconnected()
call is done in tcp_do_segment().

Has been found as part of unrelated PR:212920 investigation.

Improve slightly (~2%) the maximum number of TCP accept per second.

Tested by:		kevin.bowling_kev009.com, jch
Approved by:		gnn, hiren
MFC after:		1 week
Sponsored by:		Verisign, Inc
Differential Revision:	https://reviews.freebsd.org/D8072
2016-10-26 15:19:18 +00:00
Hiren Panchasara
4e7f755377 FreeBSD tcp stack used to inform respective congestion control module about the
loss event but not use or obay the recommendations i.e. values set by it in some
cases.

Here is an attempt to solve that confusion by following relevant RFCs/drafts.
Stack only sets congestion window/slow start threshold values when there is no
CC module availalbe to take that action. All CC modules are inspected and
updated when needed to take appropriate action on loss.

tcp_stacks/fastpath module has been updated to adapt these changes.

Note: Probably, the most significant change would be to not bring congestion
window down to 1MSS on a loss signaled by 3-duplicate acks and letting
respective CC decide that value.

In collaboration with:	Matt Macy <mmacy at nextbsd dot org>
Discussed on:		transport@ mailing list
Reviewed by:		jtl
MFC after:		1 month
Sponsored by:		Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D8225
2016-10-25 05:45:47 +00:00
Hiren Panchasara
dd13b7d387 Undo r307899. It needs a bit more work and proper commit log. 2016-10-25 05:07:51 +00:00
Hiren Panchasara
95d8236011 In Collaboration with: Matt Macy <mmacy at nextbsd dot com>
Reviewed by:		    jtl
Sponsored by:		    Limelight Networks
Differential Revision:	    https://reviews.freebsd.org/D8225
2016-10-25 05:03:33 +00:00
Ryan Stone
6c1bd55875 Fix ip_output() on point-to-point links
In r304435, ip_output() was changed to use the result of the route
lookup to decide whether the outgoing packet was a broadcast or
not.  This introduced a regression on interfaces where
IFF_BROADCAST was not set (e.g. point-to-point links), as the
algorithm could incorrectly treat the destination address as a
broadcast address, and ip_output() would subsequently drop the
packet as broadcasting on a non-IFF_BROADCAST interface is not
allowed.

Differential Revision:	https://reviews.freebsd.org/D8303
Reviewed by:	jtl
Reported by:	ambrisko
MFC after:	2 weeks
X-MFC-With:	r304435
Sponsored by:	Dell EMC Isilon
2016-10-24 22:11:33 +00:00
Michael Tuexen
38d3251c3d No functional changes, mostly getting the whitespace changes resulting
from an updated formatting tool chain.

MFC after: 1 month
2016-10-22 17:21:21 +00:00
Michael Tuexen
3e1465754f Make ICMPv6 hard error handling for TCP consistent with the ICMPv4
handling. Ensure that:
* Protocol unreachable errors are handled by indicating ECONNREFUSED
  to the TCP user for both IPv4 and IPv6. These were ignored for IPv6.
* Communication prohibited errors are handled by indicating ECONNREFUSED
  to the TCP user for both IPv4 and IPv6. These were ignored for IPv6.
* Hop Limited exceeded errors are handled by indicating EHOSTUNREACH
  to the TCP user for both IPv4 and IPv6.
  For IPv6 the TCP connected was dropped but errno wasn't set.

Reviewed by: gallatin, rrs
MFC after: 1 month
Sponsored by: Netflix
Differential Revision: 7904
2016-10-21 10:32:57 +00:00
Julien Charbon
f5cf1e5f5a Fix a double-free when an inp transitions to INP_TIMEWAIT state
after having been dropped.

This fixes enforces in_pcbdrop() logic in tcp_input():

"in_pcbdrop() is used by TCP to mark an inpcb as unused and avoid future packet
delivery or event notification when a socket remains open but TCP has closed."

PR:			203175
Reported by:		Palle Girgensohn, Slawa Olhovchenkov
Tested by:		Slawa Olhovchenkov
Reviewed by:		Slawa Olhovchenkov
Approved by:		gnn, Slawa Olhovchenkov
Differential Revision:	https://reviews.freebsd.org/D8211
MFC after:		1 week
Sponsored by:		Verisign, inc
2016-10-18 07:16:49 +00:00
Hiren Panchasara
784ce8fad2 Make sure tcp_mss() has the same check as tcp_mss_update() to have t_maxseg set
to at least 64.

This is still just a coverup to avoid kernel panic and not an actual fix.

PR:			213232
Reviewed by:		glebius
MFC after:		1 week
Sponsored by:		Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D8272
2016-10-18 02:40:25 +00:00
Patrick Kelsey
09c305eb65 Fix cases where the TFO pending counter would leak references, and eventually, memory.
Also renamed some tfo labels and added/reworked comments for clarity.

Based on an initial patch from jtl.

PR: 213424
Reviewed by:	jtl
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D8235
2016-10-15 01:41:28 +00:00
Jonathan T. Looney
82676a28eb r307082 added the TCP_HHOOK kernel option and made some existing code only
compile when that option is configured. In tcp_destroy(), the error
variable is now only used in code enclosed in an '#ifdef TCP_HHOOK' block.
This broke the build for VNET images.

Enclose the error variable itself in an #ifdef block.

Submitted by:	Shawn Webb <shawn.webb at hardenedbsd.org>
Reported by:	Shawn Webb <shawn.webb at hardenedbsd.org>
PointyHat to:	jtl
2016-10-15 00:29:15 +00:00
Jonathan T. Looney
6d172f58a2 The code currently resets the keepalive timer each time a packet is
received on a TCP session that has entered the ESTABLISHED state. This
results in a lot of calls to reset the keepalive timer.

This patch changes the behavior so we set the keepalive timer for the
keepalive idle time (TP_KEEPIDLE). When the keepalive timer fires, it will
first check to see if the session has been idle for TP_KEEPIDLE ticks. If
not, it will reschedule the keepalive timer for the time the session will
have been idle for TP_KEEPIDLE ticks.

For a session with regular communication, the keepalive timer should fire
approximately once every TP_KEEPIDLE ticks. For sessions with irregular
communication, the keepalive timer might fire more often. But, the
disruption from a periodic keepalive timer should be less than the regular
cost of resetting the keepalive timer on every packet.

(FWIW, this change saved approximately 1.73% of the busy CPU cycles on a
particular test system with a heavy TCP output load. Of course, the
actual impact is very specific to the particular hardware and workload.)

Reviewed by:	gallatin, rrs
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D8243
2016-10-14 14:57:43 +00:00
Gleb Smirnoff
cc94f0c2d7 - Revert r300854, r303657 which tried to fix regression from r297225.
- Fix the regression proper way using RO_RTFREE().

Submitted by:	ae
2016-10-13 20:15:47 +00:00
Gleb Smirnoff
ec7bbf1f79 With build without TCP_HHOOK and with INVARIANTS. Before mutex.h came
via sys/hhook.h -> sys/rmlock.h -> sys/mutex.h.
2016-10-13 18:02:29 +00:00
Michael Tuexen
859422cc12 Mark the socket as un-writable when it is 1-to-1 and the SCTP association
is freed.

MFC after:	1 month
2016-10-13 13:53:01 +00:00
Michael Tuexen
4c7fb0cf6e Whitespace changes.
MFC after: 1 month
2016-10-13 13:38:14 +00:00
Jonathan T. Looney
68bd7ed102 The TFO server-side code contains some changes that are not conditioned on
the TCP_RFC7413 kernel option. This change removes those few instructions
from the packet processing path.

While not strictly necessary, for the sake of consistency, I applied the
new IS_FASTOPEN macro to all places in the packet processing path that
used the (t_flags & TF_FASTOPEN) check.

Reviewed by:	hiren
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D8219
2016-10-12 19:06:50 +00:00
Jonathan T. Looney
4527476029 Currently, when tcp_input() receives a packet on a session that matches a
TCPCB, it checks (so->so_options & SO_ACCEPTCONN) to determine whether or
not the socket is a listening socket. However, this causes the code to
access a different cacheline. If we first check if the socket is in the
LISTEN state, we can avoid accessing so->so_options when processing packets
received for ESTABLISHED sessions.

If INVARIANTS is defined, the code still needs to access both variables to
check that so->so_options is consistent with the state.

Reviewed by:	gallatin
MFC after:	1 week
Sponsored by:	Netflix
2016-10-12 02:30:33 +00:00
Jonathan T. Looney
bd79708dbf In the TCP stack, the hhook(9) framework provides hooks for kernel modules
to add actions that run when a TCP frame is sent or received on a TCP
session in the ESTABLISHED state. In the base tree, this functionality is
only used for the h_ertt module, which is used by the cc_cdg, cc_chd, cc_hd,
and cc_vegas congestion control modules.

Presently, we incur overhead to check for hooks each time a TCP frame is
sent or received on an ESTABLISHED TCP session.

This change adds a new compile-time option (TCP_HHOOK) to determine whether
to include the hhook(9) framework for TCP. To retain backwards
compatibility, I added the TCP_HHOOK option to every configuration file that
already defined "options INET". (Therefore, this patch introduces no
functional change. In order to see a functional difference, you need to
compile a custom kernel without the TCP_HHOOK option.) This change will
allow users to easily exclude this functionality from their kernel, should
they wish to do so.

Note that any users who use a custom kernel configuration and use one of the
congestion control modules listed above will need to add the TCP_HHOOK
option to their kernel configuration.

Reviewed by:	rrs, lstewart, hiren (previous version), sjg (makefiles only)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D8185
2016-10-12 02:16:42 +00:00
Mark Johnston
d748f7efcd Lock the ND prefix list and add refcounting for prefixes.
This change extends the nd6 lock to protect the ND prefix list as well
as the list of advertising routers associated with each prefix. To handle
cases where the nd6 lock must be dropped while iterating over either the
prefix or default router lists, a generation counter is used to track
modifications to the lists. Additionally, a new mutex is used to serialize
prefix on-link/off-link transitions. This mutex must be acquired before
the nd6 lock and is held while updating the routing table in
nd6_prefix_onlink() and nd6_prefix_offlink().

Reviewed by:	ae, tuexen (SCTP bits)
Tested by:	Jason Wolfe <jason@llnw.com>,
		Larry Rosenman <ler@lerctr.org>
MFC after:	2 months
Differential Revision:	https://reviews.freebsd.org/D8125
2016-10-07 21:10:53 +00:00
Jonathan T. Looney
3ac125068a Remove "long" variables from the TCP stack (not including the modular
congestion control framework).

Reviewed by:	gnn, lstewart (partial)
Sponsored by:	Juniper Networks, Netflix
Differential Revision:	(multiple)
Tested by:	Limelight, Netflix
2016-10-06 16:28:34 +00:00
Jonathan T. Looney
0dda76b82b If the new window size is less than the old window size, skip the
calculations to check if we should advertise a larger window.

Reviewed by:	gnn
MFC after:	2 weeks
Sponsored by:	Juniper Networks, Netflix
Differential Revision:	https://reviews.freebsd.org/D7076
Tested by:	Limelight, Netflix
2016-10-06 16:09:45 +00:00
Jonathan T. Looney
15c825712e Correctly calculate snd_max in persist case.
In the persist case, take the SYN and FIN flags into account when updating
the sequence space sent.

Reviewed by:	gnn
MFC after:	2 weeks
Sponsored by:	Juniper Networks, Netflix
Differential Revision:	https://reviews.freebsd.org/D7075
Tested by:	Limelight, Netflix
2016-10-06 16:00:48 +00:00
Jonathan T. Looney
55a429a6dc Remove declaration of un-defined function tcp_seq_subtract().
Reviewed by:	gnn
MFC after:	1 week
Sponsored by:	Juniper Networks, Netflix
Differential Revision:	https://reviews.freebsd.org/D7055
2016-10-06 15:57:15 +00:00
Kevin Lo
c2b5ba7661 Remove an alias if_list, use if_link consistently.
Reviewed by:	tuexen
Differential Revision:	https://reviews.freebsd.org/D8075
2016-10-06 00:51:27 +00:00
Eric van Gyzen
2d9db0bc63 Add GARP retransmit capability
A single gratuitous ARP (GARP) is always transmitted when an IPv4
address is added to an interface, and that is usually sufficient.
However, in some circumstances, such as when a shared address is
passed between cluster nodes, this single GARP may occasionally be
dropped or lost.  This can lead to neighbors on the network link
working with a stale ARP cache and sending packets destined for
that address to the node that previously owned the address, which
may not respond.

To avoid this situation, GARP retransmissions can be enabled by setting
the net.link.ether.inet.garp_rexmit_count sysctl to a value greater
than zero.  The setting represents the maximum number of retransmissions.
The interval between retransmissions is calculated using an exponential
backoff algorithm, doubling each time, so the retransmission intervals
are: {1, 2, 4, 8, 16, ...} (seconds).

Due to the exponential backoff algorithm used for the interval
between GARP retransmissions, the maximum number of retransmissions
is limited to 16 for sanity.  This limit corresponds to a maximum
interval between retransmissions of 2^16 seconds ~= 18 hours.
Increasing this limit is possible, but sending out GARPs spaced
days apart would be of little use.

Submitted by:	David A. Bright <david.a.bright@dell.com>
MFC after:	1 month
Relnotes:	yes
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D7695
2016-10-02 01:42:45 +00:00
Rick Macklem
00b460ffc5 r297225 broke udp_output() for the case where the "addr" argument
is NULL and the function jumps to the "release:" label.
For this case, the "inp" was write locked, but the code attempted to
read unlock it. This patch fixes the problem.
This case could occur for NFS over UDP mounts, where the server was
down for a few minutes under certain circumstances.

Reported by:	bde
Tested by:	bde
Reviewed by:	gnn
MFC after:	2 weeks
2016-10-01 19:39:09 +00:00
Hiren Panchasara
8a56c64533 This adds a sysctl which allows you to disable the TCP hostcache. This is handy
during testing of network related changes where cached entries may pollute your
results, or during known congestion events where you don't want to unfairly
penalize hosts.

Prior to r232346 this would have meant you would break any connection with a sub
1500 MTU, as the hostcache was authoritative. All entries as they stand today
should simply be used to pre populate values for efficiency.

Submitted by:	Jason Wolfe (j at nitrology dot com)
Reviewed by:	rwatson, sbruno, rrs , bz (earlier version)
MFC after:	2 weeks
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D6198
2016-09-30 00:10:57 +00:00
Kurt Lidl
1d7ee746e6 Properly preserve ip_tos bits for IPv4 packets
Restructure code slightly to save ip_tos bits earlier.  Fix the bug
where the ip_tos field is zeroed out before assigning to the iptos
variable. Restore the ip_tos and ip_ver fields only if they have
been zeroed during the pseudo-header checksum calculation.

Reviewed by:	cem, gnn, hiren
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D8053
2016-09-29 19:45:24 +00:00
Julien Charbon
c1b19923a3 Fix an issue with accept_filter introduced with r261242:
As a side effect of r261242 when using accept_filter the
first call to soisconnected() is done earlier in tcp_input()
instead of tcp_do_segment() context.  Restore the expected behaviour.

Note:  This call to soisconnected() seems to be extraneous in all
cases (with or without accept_filter).  Will be addressed in a
separate commit.

PR:			212920
Reported by:		Alexey
Tested by:              Alexey, jch
Sponsored by:           Verisign, Inc.
MFC after:		1 week
2016-09-29 11:18:48 +00:00
Kevin Lo
c7641cd18d Remove ifa_list, use ifa_link (structure field) instead.
While here, prefer if_addrhead (FreeBSD) to if_addrlist (BSD compat) naming
for the interface address list in sctp_bsd_addr.c

Reviewed by:	tuexen
Differential Revision:	https://reviews.freebsd.org/D8051
2016-09-28 13:29:11 +00:00
Mariusz Zaborski
85b0f9de11 capsicum: propagate rights on accept(2)
Descriptor returned by accept(2) should inherits capabilities rights from
the listening socket.

PR:		201052
Reviewed by:	emaste, jonathan
Discussed with:	many
Differential Revision:	https://reviews.freebsd.org/D7724
2016-09-22 09:58:46 +00:00
Michael Tuexen
5cb9165556 Fix the handling of unordered fragmented user messages using DATA chunks.
There were two bugs:
* There was an accounting bug resulting in reporting a too small a_rwnd.
* There are a bug when abandoning messages in the reassembly queue.

MFC after:	4 weeks
2016-09-21 08:28:18 +00:00
Kevin Lo
c3bef61e58 Remove the 4.3BSD compatible macro m_copy(), use m_copym() instead.
Reviewed by:	gnn
Differential Revision:	https://reviews.freebsd.org/D7878
2016-09-15 07:41:48 +00:00
Michael Tuexen
5a17b6ad98 Ensure that the IPPROTO_TCP level socket options
* TCP_KEEPINIT
* TCP_KEEPINTVL
* TCP_KEEPIDLE
* TCP_KEEPCNT
always always report the values currently used when getsockopt()
is used. This wasn't the case when the sysctl-inherited default
values where used.
Ensure that the IPPROTO_TCP level socket option TCP_INFO has the
TCPI_OPT_ECN flag set in the tcpi_options field when ECN support
has been negotiated successfully.

Reviewed by:	rrs, jtl, hiren
MFC after:	1 month
Differential Revision:	7833
2016-09-14 14:48:00 +00:00
Dimitry Andric
6c01c0e0c6 With clang 3.9.0, compiling sys/netinet/igmp.c results in the following
warning:

sys/netinet/igmp.c:546:21: error: implicit conversion from 'int' to 'char' changes value from 148 to -108 [-Werror,-Wconstant-conversion]
        p->ipopt_list[0] = IPOPT_RA;    /* Router Alert Option */
                         ~ ^~~~~~~~
sys/netinet/ip.h:153:19: note: expanded from macro 'IPOPT_RA'
#define IPOPT_RA                148             /* router alert */
                                ^~~

This is because ipopt_list is an array of char, so IPOPT_RA is wrapped
to a negative value.  It would be nice to change ipopt_list to an array
of u_char, but it changes the signature of the public struct ipoption,
so add an explicit cast to suppress the warning.

Reviewed by:	imp
MFC after:	3 days
Differential Revision: https://reviews.freebsd.org/D7777
2016-09-04 17:23:10 +00:00
Hiren Panchasara
06b99bd826 Adjust TCP module fastpath after r304803's cc_ack_received() changes.
Reported by:		hiren, bz, np
Reviewed by:		rrs
Sponsored by:		Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D7664
2016-08-26 19:23:17 +00:00
Hiren Panchasara
e7106d6be2 Update TCPS_HAVERCVDFIN() macro to correctly include all states a connection
can be in after receiving a FIN.

FWIW, NetBSD has this change for quite some time.

This has been tested at Netflix and Limelight in production traffic.

Reported by:	Sam Kumar <samkumar99 at gmail.com> on transport@
Reviewed by:	rrs
MFC after:	4 weeks
Sponsored by:	Limelight Networks
Differential Revision:	 https://reviews.freebsd.org/D7475
2016-08-26 17:48:54 +00:00
Michael Tuexen
91843cf34e Fix a bug, where no SACK is sent when receiving a FORWARD-TSN or
I-FORWARD-TSN chunk before any DATA or I-DATA chunk.

Thanks to Julian Cordes for finding this problem and prividing
packetdrill scripts to reporduce the issue.

MFC after: 3 days
2016-08-26 07:49:23 +00:00
Lawrence Stewart
4b7b743c16 Pass the number of segments coalesced by LRO up the stack by repurposing the
tso_segsz pkthdr field during RX processing, and use the information in TCP for
more correct accounting and as a congestion control input. This is only a start,
and an audit of other uses for the data is left as future work.

Reviewed by:	gallatin, rrs
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D7564
2016-08-25 13:33:32 +00:00
Michael Tuexen
884d8c53e6 When aborting an association, send the ABORT before notifying the upper
layer. For the kernel this doesn't matter, for the userland stack, it does.
While there, silence a clang warning when compiling it in userland.
2016-08-24 06:22:53 +00:00
Ryan Stone
23424a2021 Temporarily disable the optimization from r304436
r304436 attempted to optimize the handling of incoming UDP packet by only
making an expensive call to in_broadcast() if the mbuf was marked as an
broadcast packet.  Unfortunately, this cannot work in the case of point-to-
point L2 protocols like PPP, which have no notion of "broadcast".

Discussions on how to properly fix r304436 are ongoing, but in the meantime
disable the optimization to ensure that no existing network setups are broken.

Reported by:	bms
2016-08-22 15:27:37 +00:00
Michael Tuexen
7fcbd928f8 Improve the locking when sending user messages.
First, keep a ref count on the stcb after looking it up, as
done in the other lookup cases.
Second, before looking again at sp, ensure that it is not
freed, because the assoc is about to be freed.

MFC after: 3 days
2016-08-22 01:45:29 +00:00
Michael Tuexen
26a5d52f03 Remove duplicate code, which is not protected by the appropriate locks.
MFC after: 3 days
2016-08-22 00:40:45 +00:00
Bjoern A. Zeeb
77ecef378a Remove the kernel optoion for IPSEC_FILTERTUNNEL, which was deprecated
more than 7 years ago in favour of a sysctl in r192648.
2016-08-21 18:55:30 +00:00
Marko Zec
9da85a912d Permit disabling net.inet.udp.require_l2_bcast in VIMAGE kernels.
The default value of the tunable introduced in r304436 couldn't be
effectively overrided on VIMAGE kernels, because instead of being
accessed via the appropriate VNET() accessor macro, it was accessed
via the VNET_NAME() macro, which resolves to the (should-be) read-only
master template of initial values of per-VNET data.  Hence, while the
value of udp_require_l2_bcast could be altered on per-VNET basis, the
code in udp_input() would ignore it as it would always read the default
value (one) from the VNET master template.

Silence from: rstone
2016-08-20 22:12:26 +00:00
Michael Tuexen
e19497672b Unbreak sctp_connectx().
MFC after: 3 days
2016-08-20 20:15:36 +00:00
Ryan Stone
11f2a7cd67 Fix unlocked access to ifnet address list
in_broadcast() was iterating over the ifnet address list without
first taking an IF_ADDR_RLOCK.  This could cause a panic if a
concurrent operation modified the list.

Reviewed by: bz
MFC after: 2 months
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D7227
2016-08-18 22:59:10 +00:00
Ryan Stone
41029db13f Don't check for broadcast IPs on non-bcast pkts
in_broadcast() can be quite expensive, so skip calling it if the
incoming mbuf wasn't sent to a broadcast L2 address in the first
place.

Reviewed by: gnn
MFC after: 2 months
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D7309
2016-08-18 22:59:05 +00:00
Ryan Stone
90cc51a1ab Don't iterate over the ifnet addr list in ip_output()
For almost every packet that is transmitted through ip_output(),
a call to in_broadcast() was made to decide if the destination
IP was a broadcast address.  in_broadcast() iterates over the
ifnet's address to find a source IP matching the subnet of the
destination IP, and then checks if the IP is a broadcast in that
subnet.

This is completely redundant as we have already performed the
route lookup, so the source IP is already known.  Just use that
address to directly check whether the destination IP is a
broadcast address or not.

MFC after:	2 months
Sponsored By:	EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D7266
2016-08-18 22:59:00 +00:00
Randall Stewart
eadd00f81a A few more wording tweaks as suggested (with some modifications
as well) by Ravi Pokala. Thanks for the comments :-)
Sponsored by: Netflix Inc.
2016-08-16 15:17:36 +00:00
Randall Stewart
587d67c008 Here we update the modular tcp to be able to switch to an
alternate TCP stack in other then the closed state (pre-listen/connect).
The idea is that *if* that is supported by the alternate stack, it
is asked if its ok to switch. If it approves the "handoff" then we
allow the switch to happen. Also the fini() function now gets a flag
to tell if you are switching away *or* the tcb is destroyed. The
init() call into the alternate stack is moved to the end so the
tcb is more fully formed before the init transpires.

Sponsored by:	Netflix Inc.
Differential Revision:	D6790
2016-08-16 15:11:46 +00:00
Randall Stewart
0fa047b98c Comments describing how to properly use the new lock_add functions
and its respective companion.

Sponsored by:	Netflix Inc.
2016-08-16 13:08:03 +00:00
Randall Stewart
b07fef500b This cleans up the timer code in TCP and also makes it so we do not
take the INFO lock *unless* we are really going to delete the TCB.

Differential Revision:	D7136
2016-08-16 12:40:56 +00:00
Sepherosa Ziehau
8452c1b345 tcp/lro: Make # of LRO entries tunable
Reviewed by:	hps, gallatin
Obtained from:	rrs, gallatin
MFC after:	2 weeks
Sponsored by:	Netflix (rrs, gallatin), Microsoft (sephe)
Differential Revision:	https://reviews.freebsd.org/D7499
2016-08-16 06:40:27 +00:00
Michael Tuexen
dcb436c936 Ensure that sctp_it_ctl.cur_it does not point to a free object (during
a small time window).
Thanks to Byron Campen for reporting the issue and
suggesting a fix.

MFC after: 3 days
2016-08-15 10:16:08 +00:00
Andrey V. Elsukov
57fb3b7a78 Add stats reset command implementation to NPTv6 module
to be able reset statistics counters.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2016-08-13 16:45:14 +00:00
Andrey V. Elsukov
d8caf56e9e Add ipfw_nat64 module that implements stateless and stateful NAT64.
The module works together with ipfw(4) and implemented as its external
action module.

Stateless NAT64 registers external action with name nat64stl. This
keyword should be used to create NAT64 instance and to address this
instance in rules. Stateless NAT64 uses two lookup tables with mapped
IPv4->IPv6 and IPv6->IPv4 addresses to perform translation.

A configuration of instance should looks like this:
 1. Create lookup tables:
 # ipfw table T46 create type addr valtype ipv6
 # ipfw table T64 create type addr valtype ipv4
 2. Fill T46 and T64 tables.
 3. Add rule to allow neighbor solicitation and advertisement:
 # ipfw add allow icmp6 from any to any icmp6types 135,136
 4. Create NAT64 instance:
 # ipfw nat64stl NAT create table4 T46 table6 T64
 5. Add rules that matches the traffic:
 # ipfw add nat64stl NAT ip from any to table(T46)
 # ipfw add nat64stl NAT ip from table(T64) to 64:ff9b::/96
 6. Configure DNS64 for IPv6 clients and add route to 64:ff9b::/96
    via NAT64 host.

Stateful NAT64 registers external action with name nat64lsn. The only
one option required to create nat64lsn instance - prefix4. It defines
the pool of IPv4 addresses used for translation.

A configuration of instance should looks like this:
 1. Add rule to allow neighbor solicitation and advertisement:
 # ipfw add allow icmp6 from any to any icmp6types 135,136
 2. Create NAT64 instance:
 # ipfw nat64lsn NAT create prefix4 A.B.C.D/28
 3. Add rules that matches the traffic:
 # ipfw add nat64lsn NAT ip from any to A.B.C.D/28
 # ipfw add nat64lsn NAT ip6 from any to 64:ff9b::/96
 4. Configure DNS64 for IPv6 clients and add route to 64:ff9b::/96
    via NAT64 host.

Obtained from:	Yandex LLC
Relnotes:	yes
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D6434
2016-08-13 16:09:49 +00:00
Mike Karels
bca0155f64 Fix kernel build with TCP_RFC7413 option
The current in_pcb.h includes route.h, which includes sockaddr structures.
Including <sys/socketvar.h> should require <sys/socket.h>; add it in
the appropriate place.

PR: 211385
Submitted by: Sergey Kandaurov and iron at mail.ua
Reviewed by: gnn
Approved by: gnn (mentor)
MFC after: 1 day
2016-08-11 23:52:24 +00:00
Andrey V. Elsukov
d6eb9b0249 Restore "nat global" support.
Now zero value of arg1 used to specify "tablearg", use the old "tablearg"
value for "nat global". Introduce new macro IP_FW_NAT44_GLOBAL to replace
hardcoded magic number to specify "nat global". Also replace 65535 magic
number with corresponding macro. Fix typo in comments.

PR:		211256
Tested by:	Victor Chernov
MFC after:	3 days
2016-08-11 10:10:10 +00:00
Michael Tuexen
be46a7c54d Improve a consistency check to not detect valid cases for
unordered user messages using DATA chunks as invalid ones.
While there, ensure that error causes are provided when
sending ABORT chunks in case of reassembly problems detected.
Thanks to Taylor Brandstetter for making me aware of this problem.
MFC after:	3 days
2016-08-10 17:19:33 +00:00
Stephen J. Kiernan
0ce1624d0e Move IPv4-specific jail functions to new file netinet/in_jail.c
_prison_check_ip4 renamed to prison_check_ip4_locked

Move IPv6-specific jail functions to new file netinet6/in6_jail.c
_prison_check_ip6 renamed to prison_check_ip6_locked

Add appropriate prototypes to sys/sys/jail.h

Adjust kern_jail.c to call prison_check_ip4_locked and
prison_check_ip6_locked accordingly.

Add netinet/in_jail.c and netinet6/in6_jail.c to the list of files that
need to be built when INET and INET6, respectively, are configured in the
kernel configuration file.

Reviewed by:	jtl
Approved by:	sjg (mentor)
Sponsored by:	Juniper Networks, Inc.
Differential Revision:	https://reviews.freebsd.org/D6799
2016-08-09 02:16:21 +00:00
Michael Tuexen
d6e73fa13d Fix the sending of FORWARD-TSN and I-FORWARD-TSN chunks. The
last SID/SSN pair wasn't filled in.
Thanks to Julian Cordes for providing a packetdrill script
triggering the issue and making me aware of the bug.

MFC after:	3 days
2016-08-08 13:52:18 +00:00
Michael Tuexen
9c5ca6f247 Fix a locking issue found by stress testing with tsctp.
The inp read lock neeeds to be held when considering control->do_not_ref_stcb.
MFC after:	3 days
2016-08-08 08:20:10 +00:00
Michael Tuexen
124d851acf Consistently check for unsent data on the stream queues.
MFC after:	3 days
2016-08-07 23:04:46 +00:00
Michael Tuexen
4d58b0c3a9 Remove stream queue entry consistently from wheel.
While there, improve the handling of drain.

MFC after:	3 days
2016-08-07 12:51:13 +00:00
Michael Tuexen
cf46cace5c Don't modify a structure without holding a reference count on it.
MFC after:	3 days
2016-08-06 15:29:46 +00:00
Michael Tuexen
bfe7e9328c Mark an unused parameter as such.
MFC after:	3 days
2016-08-06 12:51:07 +00:00
Michael Tuexen
d1ea5fa9c2 Fix various bugs in relation to the I-DATA chunk support
This is joint work with rrs.

MFC after:	3 days
2016-08-06 12:33:15 +00:00
Sepherosa Ziehau
b9ec6f0b02 tcp/lro: If timestamps mismatch or it's a FIN, force flush.
This keeps the segments/ACK/FIN delivery order.

Before this patch, it was observed: if A sent FIN immediately after
an ACK, B would deliver FIN first to the TCP stack, then the ACK.
This out-of-order delivery causes one unnecessary ACK sent from B.

Reviewed by:	gallatin, hps
Obtained from:  rrs, gallatin
Sponsored by:	Netflix (rrs, gallatin), Microsoft (sephe)
Differential Revision:	https://reviews.freebsd.org/D7415
2016-08-05 09:08:00 +00:00
Sepherosa Ziehau
05cde7efa6 tcp/lro: Implement hash table for LRO entries.
This significantly improves HTTP workload performance and reduces
HTTP workload latency.

Reviewed by:	rrs, gallatin, hps
Obtained from:	rrs, gallatin
Sponsored by:	Netflix (rrs, gallatin) , Microsoft (sephe)
Differential Revision:	https://reviews.freebsd.org/D6689
2016-08-02 06:36:47 +00:00
Andrew Gallatin
d4c22202e6 Rework IPV6 TCP path MTU discovery to match IPv4
- Re-write tcp_ctlinput6() to closely mimic the IPv4 tcp_ctlinput()

- Now that tcp_ctlinput6() updates t_maxseg, we can allow ip6_output()
  to send TCP packets without looking at the tcp host cache for every
  single transmit.

- Make the icmp6 code mimic the IPv4 code & avoid returning
  PRC_HOSTDEAD because it is so expensive.

Without these changes in place, every TCP6 pmtu discovery or host
unreachable ICMP resulted in a call to in6_pcbnotify() which walks the
tcbinfo table with the write lock held.  Because the tcbinfo table is
shared between IPv4 and IPv6, this causes huge scalabilty issues on
servers with lots of (~100K) TCP connections, to the point where even
a small percent of IPv6 traffic had a disproportionate impact on
overall throughput.

Reviewed by:	bz, rrs, ae (all earlier versions), lstewart (in Netflix's tree)
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D7272
2016-08-01 17:02:21 +00:00
Andrew Gallatin
0e3b891988 Call tcp_notify() directly to shoot down routes, rather than
calling in_pcbnotifyall().

This avoids lock contention on tcbinfo due to in_pcbnotifyall()
holding the tcbinfo write lock while walking all connections.

Reviewed by:	rrs, karels
MFC after:	2 weeks
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D7251
2016-07-28 19:32:25 +00:00
Stephen J. Kiernan
f11ec79842 Remove BSD and USL copyright and update license block in in_prot.c, as the
code in this file was written by Robert N. M. Waston.

Move cr_can* prototypes from sys/systm.h to sys/proc.h

Reported by:	rwatson
Reviewed by:	rwatson
Approved by:	sjg (mentor)
Differential Revision:	https://reviews.freebsd.org/D7345
2016-07-28 18:39:30 +00:00
Stephen J. Kiernan
4ac21b4f09 Prepare for network stack as a module
- Move cr_canseeinpcb to sys/netinet/in_prot.c in order to separate the
   INET and INET6-specific code from the rest of the prot code (It is only
   used by the network stack, so it makes sense for it to live with the
   other network stack code.)
 - Move cr_canseeinpcb prototype from sys/systm.h to netinet/in_systm.h
 - Rename cr_seeotheruids to cr_canseeotheruids and cr_seeothergids to
   cr_canseeothergids, make them non-static, and add prototypes (so they
   can be seen/called by in_prot.c functions.)
 - Remove sw_csum variable from ip6_forward in ip6_forward.c, as it is an
   unused variable.

Reviewed by:	gnn, jtl
Approved by:	sjg (mentor)
Sponsored by:	Juniper Networks, Inc.
Differential Revision:	https://reviews.freebsd.org/D2901
2016-07-27 20:34:09 +00:00
Brad Davis
439e76ec22 Fix the case for some sysctl descriptions.
Reviewed by:	gnn
2016-07-26 20:20:09 +00:00
Mike Karels
ea17754c5a Fix per-connection L2 caching in fast path
r301217 re-added per-connection L2 caching from a previous change,
but it omitted caching in the fast path.  Add it.

Reviewed By: gallatin
Approved by: gnn (mentor)
Differential Revision: https://reviews.freebsd.org/D7239
2016-07-22 02:11:49 +00:00
Michael Tuexen
0ee5c319f2 Fix a bug in deferred stream reset processing which results
in using a length field before it is set.

Thanks to Taylor Brandstetter for reporting the issue and
providing a fix.

MFC after:	3 days
2016-07-20 06:29:26 +00:00
Michael Tuexen
0fa7377a03 Use correct order of conditions to avoid NULL deref.
MFC after:	3 days
X-MFC with:	r302935
2016-07-19 11:16:44 +00:00
Michael Tuexen
0ba4c8abe0 netstat and sockstat expect the IPv6 link local addresses to
have an embedded scope. So don't recover.

MFC after:	3 days
2016-07-19 09:48:08 +00:00
Andrey V. Elsukov
ed22e564b8 Add named dynamic states support to ipfw(4).
The keep-state, limit and check-state now will have additional argument
flowname. This flowname will be assigned to dynamic rule by keep-state
or limit opcode. And then can be matched by check-state opcode or
O_PROBE_STATE internal opcode. To reduce possible breakage and to maximize
compatibility with old rulesets default flowname introduced.
It will be assigned to the rules when user has omitted state name in
keep-state and check-state opcodes. Also if name is ambiguous (can be
evaluated as rule opcode) it will be replaced to default.

Reviewed by:	julian
Obtained from:	Yandex LLC
MFC after:	1 month
Relnotes:	yes
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D6674
2016-07-19 04:56:59 +00:00
Andrey V. Elsukov
b867e84e95 Add ipfw_nptv6 module that implements Network Prefix Translation for IPv6
as defined in RFC 6296. The module works together with ipfw(4) and
implemented as its external action module. When it is loaded, it registers
as eaction and can be used in rules. The usage pattern is similar to
ipfw_nat(4). All matched by rule traffic goes to the NPT module.

Reviewed by:	hrs
Obtained from:	Yandex LLC
MFC after:	1 month
Relnotes:	yes
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D6420
2016-07-18 19:46:31 +00:00
Michael Tuexen
487e2e6507 Add a constant required by RFC 7496.
MFC after:	3 days
2016-07-17 13:33:35 +00:00
Michael Tuexen
8e1b295f09 Fix the PR-SCTP behaviour.
This is done by rrs@.

MFC after:	3 days
2016-07-17 13:14:51 +00:00
Michael Tuexen
26f0250adf Add missing sctps_reasmusrmsgs counter.
Joint work with rrs@.
MFC after:	3 days
2016-07-17 08:31:21 +00:00
Michael Tuexen
643fd575da Deal with a portential memory allocation failure, which was reported
by the clang static code analyzer.
Joint work with rrs@.

MFC after:	3 days
2016-07-16 12:25:37 +00:00
Michael Tuexen
b4bf72213a Don't free a data chunk twice.
Found by the clang static code analyzer running for the userland stack.

MFC after:	3 days
2016-07-16 08:11:43 +00:00
Michael Tuexen
56d2f7d8e5 Address a potential memory leak found a the clang static code analyzer
running on the userland stack.

MFC after:	3 days
2016-07-16 07:48:01 +00:00
Jonathan T. Looney
24b9bb5614 The TCPPCAP debugging feature caches recently-used mbufs for use in
debugging TCP connections. This commit provides a mechanism to free those
mbufs when the system is under memory pressure.

Because this will result in lost debugging information, the behavior is
controllable by a sysctl. The default setting is to free the mbufs.

Reviewed by:	gnn
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D6931
Input from:	novice_techie.com
2016-07-06 16:17:13 +00:00
Nathan Whitehorn
96c85efb4b Replace a number of conflations of mp_ncpus and mp_maxid with either
mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in
the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try,
for example, to run tasks on CPUs that did not exist or to allocate too
few buffers on systems with sparse CPU IDs in which there are holes in the
range and mp_maxid > mp_ncpus. Such circumstances generally occur on
systems with SMT, but on which SMT is disabled. This patch restores system
operation at least on POWER8 systems configured in this way.

There are a number of other places in the kernel with potential problems
in these situations, but where sparse CPU IDs are not currently known
to occur, mostly in the ARM machine-dependent code. These will be fixed
in a follow-up commit after the stable/11 branch.

PR:		kern/210106
Reviewed by:	jhb
Approved by:	re (glebius)
2016-07-06 14:09:49 +00:00
Michael Tuexen
e75f31c1d0 This patch fixes two bugs related to the setting of the I-Bit
for SCTP DATA and I-DATA chunks.
* For fragmented user messages, set the I-Bit only on the last
  fragment.
* When using explicit EOR mode, set the I-Bit on the last
  fragment, whenever SCTP_SACK_IMMEDIATELY was set in snd_flags
  for any of the send() calls.

Approved by:	re (hrs)
MFC after:	1 week
2016-06-30 06:06:35 +00:00
Michael Tuexen
ab3373140d This patch fixes two bugs related to the SCTP message recovery
for messages which have been put on the send queue:
* Do not report any DATA or I-DATA chunk padding.
* Correctly deal with the I-DATA chunk header instead of the DATA
  chunk header when the I-DATA extension is used.

Approved by:	re (kib)
MFC after:	1 week
2016-06-26 16:38:42 +00:00
Michael Tuexen
d1b52c6a01 This patch fixes a locking bug when a send() call blocks
on an SCTP socket and the association is aborted by the
peer.

Approved by:	re (kib)
MFC after:	1 week
2016-06-26 12:41:02 +00:00
Bjoern A. Zeeb
42644eb88e Try to avoid a 2nd conditional by re-writing the loop, pause, and
escape clause another time.

Submitted by:	jhb
Approved by:	re (gjb)
MFC after:	12 days
2016-06-23 21:32:52 +00:00
Navdeep Parhar
f22bfc72f8 Add spares to struct ifnet and socket for packet pacing and/or general
use.  Update comments regarding the spare fields in struct inpcb.

Bump __FreeBSD_version for the changes to the size of the structures.

Reviewed by:	gnn@
Approved by:	re@ (gjb@)
Sponsored by:	Chelsio Communications
2016-06-23 21:07:15 +00:00
Michael Tuexen
9de217ce56 Fix a bug in the handling of non-blocking SCTP 1-to-1 sockets. When using
this code in the userland stack, it could result in a loop. This happened on iOS.
However, I was not able to reproduce this when using the code in the kernel.
Thanks to Eugen-Andrei Gavriloaie for reporting the issue and proving detailed
information to find the root of the problem.

Approved by:	re (gjb)
MFC after:	1 week
2016-06-23 19:27:29 +00:00
Bjoern A. Zeeb
361279a4fb In VNET TCP teardown Do not sleep unconditionally but only if we
have any TCP connections left.

Submitted by:		zec
Approved by:		re (hrs)
MFC after:		13 days
2016-06-23 11:55:15 +00:00
Michael Tuexen
55b8cd93ef Don't consider the socket when processing an incoming ICMP/ICMP6 packet,
which was triggered by an SCTP packet. Whether a socket exists, is just
not relevant.

Approved by: re (kib)
MFC after: 1 week
2016-06-23 09:13:15 +00:00
Bjoern A. Zeeb
b54e08e11a Check the V_tcbinfo.ipi_count to hit 0 before doing the full TCP cleanup.
That way timers can finish cleanly and we do not gamble with a DELAY().

Reviewed by:		gnn, jtl
Approved by:		re (gjb)
Obtained from:		projects/vnet
MFC after:		2 weeks
Sponsored by:		The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D6923
2016-06-23 00:34:03 +00:00
Bjoern A. Zeeb
252a46f324 No longer mark TCP TW zone NO_FREE.
Timewait code does a proper cleanup after itself.

Reviewed by:		gnn
Approved by:		re (gjb)
Obtained from:		projects/vnet
MFC after:		2 weeks
Sponsored by:		The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D6922
2016-06-23 00:32:58 +00:00
Bjoern A. Zeeb
89856f7e2d Get closer to a VIMAGE network stack teardown from top to bottom rather
than removing the network interfaces first. This change is rather larger
and convoluted as the ordering requirements cannot be separated.

Move the pfil(9) framework to SI_SUB_PROTO_PFIL, move Firewalls and
related modules to their own SI_SUB_PROTO_FIREWALL.
Move initialization of "physical" interfaces to SI_SUB_DRIVERS,
move virtual (cloned) interfaces to SI_SUB_PSEUDO.
Move Multicast to SI_SUB_PROTO_MC.

Re-work parts of multicast initialisation and teardown, not taking the
huge amount of memory into account if used as a module yet.

For interface teardown we try to do as many of them as we can on
SI_SUB_INIT_IF, but for some this makes no sense, e.g., when tunnelling
over a higher layer protocol such as IP. In that case the interface
has to go along (or before) the higher layer protocol is shutdown.

Kernel hhooks need to go last on teardown as they may be used at various
higher layers and we cannot remove them before we cleaned up the higher
layers.

For interface teardown there are multiple paths:
(a) a cloned interface is destroyed (inside a VIMAGE or in the base system),
(b) any interface is moved from a virtual network stack to a different
network stack ("vmove"), or (c) a virtual network stack is being shut down.
All code paths go through if_detach_internal() where we, depending on the
vmove flag or the vnet state, make a decision on how much to shut down;
in case we are destroying a VNET the individual protocol layers will
cleanup their own parts thus we cannot do so again for each interface as
we end up with, e.g., double-frees, destroying locks twice or acquiring
already destroyed locks.
When calling into protocol cleanups we equally have to tell them
whether they need to detach upper layer protocols ("ulp") or not
(e.g., in6_ifdetach()).

Provide or enahnce helper functions to do proper cleanup at a protocol
rather than at an interface level.

Approved by:		re (hrs)
Obtained from:		projects/vnet
Reviewed by:		gnn, jhb
Sponsored by:		The FreeBSD Foundation
MFC after:		2 weeks
Differential Revision:	https://reviews.freebsd.org/D6747
2016-06-21 13:48:49 +00:00
Andrey V. Elsukov
4c10540274 Cleanup unneded include "opt_ipfw.h".
It was used for conditional build IPFIREWALL_FORWARD support.
But IPFIREWALL_FORWARD option was removed a long time ago.
2016-06-09 05:48:34 +00:00
Michael Tuexen
63d5b56815 Use a separate MID counter for ordered und unordered messages for each
outgoing stream.

Thanks to Jens Hoelscher for reporting the issue.

MFC after: 1 week
2016-06-08 17:57:42 +00:00
Sepherosa Ziehau
36ad8372d4 net: Use M_HASHTYPE_OPAQUE_HASH if the mbuf flowid has hash properties
Reviewed by:	hps, erj, tuexen
Sponsored by:	Microsoft OSTC
Differential Revision:	https://reviews.freebsd.org/D6688
2016-06-07 04:51:50 +00:00
Bjoern A. Zeeb
b941cb8d60 Add a show igi_list command to DDB to debug IGMP state.
Obtained from:	projects/vnet
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2016-06-06 22:26:18 +00:00
Bjoern A. Zeeb
a6c96fc2f0 Destroy the mutex last. In this case it should not matter, but
generally cleanup code might still acquire it thus try to be
consistent destroying locks late.

Obtained from:	projects/vnet
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2016-06-06 13:04:22 +00:00
George V. Neville-Neil
8b2b91eee0 Add missing constants from RFCs 4443 and 6550 2016-06-06 00:35:45 +00:00
Bjoern A. Zeeb
484149def8 Introduce a per-VNET flag to enable/disable netisr prcessing on that VNET.
Add accessor functions to toggle the state per VNET.
The base system (vnet0) will always enable itself with the normal
registration. We will share the registered protocol handlers in all
VNETs minimising duplication and management.
Upon disabling netisr processing for a VNET drain the netisr queue from
packets for that VNET.

Update netisr consumers to (de)register on a per-VNET start/teardown using
VNET_SYS(UN)INIT functionality.

The change should be transparent for non-VIMAGE kernels.

Reviewed by:	gnn (, hiren)
Obtained from:	projects/vnet
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D6691
2016-06-03 13:57:10 +00:00
Hans Petter Selasky
ec6689059d Use insertion sort instead of bubble sort in TCP LRO.
Replacing the bubble sort with insertion sort gives an 80% reduction
in runtime on average, with randomized keys, for small partitions.

If the keys are pre-sorted, insertion sort runs in linear time, and
even if the keys are reversed, insertion sort is faster than bubble
sort, although not by much.

Update comment describing "tcp_lro_sort()" while at it.

Differential Revision:	https://reviews.freebsd.org/D6619
Sponsored by:	Mellanox Technologies
Tested by:	Netflix
Suggested by:	Pieter de Goeje <pieter@degoeje.nl>
Reviewed by:	ed, gallatin, gnn, transport
2016-06-03 08:35:07 +00:00
Michael Tuexen
273e31638f Get struct sctp_net_route in-sync with struct route again. 2016-06-03 07:43:04 +00:00
Michael Tuexen
565cccce37 Store the peers vtag in host byte order in the cookie, since all
consumers expect it that way.
This fixes the vtag when sending en ERROR chunk.

MFC after:	1 week
2016-06-03 07:24:41 +00:00
George V. Neville-Neil
6d76822688 This change re-adds L2 caching for TCP and UDP, as originally added in D4306
but removed due to other changes in the system. Restore the llentry pointer
to the "struct route", and use it to cache the L2 lookup (ARP or ND6) as
appropriate.

Submitted by:	Mike Karels
Differential Revision:	https://reviews.freebsd.org/D6262
2016-06-02 17:51:29 +00:00
Bjoern A. Zeeb
3f58662dd9 The pr_destroy field does not allow us to run the teardown code in a
specific order.  VNET_SYSUNINITs however are doing exactly that.
Thus remove the VIMAGE conditional field from the domain(9) protosw
structure and replace it with VNET_SYSUNINITs.
This also allows us to change some order and to make the teardown functions
file local static.
Also convert divert(4) as it uses the same mechanism ip(4) and ip6(4) use
internally.

Slightly reshuffle the SI_SUB_* fields in kernel.h and add a new ones, e.g.,
for pfil consumers (firewalls), partially for this commit and for others
to come.

Reviewed by:		gnn, tuexen (sctp), jhb (kernel.h)
Obtained from:		projects/vnet
MFC after:		2 weeks
X-MFC:			do not remove pr_destroy
Sponsored by:		The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D6652
2016-06-01 10:14:04 +00:00
Michael Tuexen
3d48d25be7 Add PR_CONNREQUIRED for SOCK_STREAM sockets using SCTP.
This is required to signal connetion setup on non-blocking sockets
via becoming writable. This still allows for implicit connection
setup.

MFC after:	1 week
2016-05-30 18:24:23 +00:00
Michael Tuexen
7b7f31e6cf Fix a byte order issue for the scope stored in the SCTP cookie.
MFC after:	1 week
2016-05-30 11:18:39 +00:00
Sepherosa Ziehau
425b763928 tcp: Don't prematurely drop receiving-only connections
If the connection was persistent and receiving-only, several (12)
sporadic device insufficient buffers would cause the connection be
dropped prematurely:
Upon ENOBUFS in tcp_output() for an ACK, retransmission timer is
started.  No one will stop this retransmission timer for receiving-
only connection, so the retransmission timer promises to expire and
t_rxtshift is promised to be increased.  And t_rxtshift will not be
reset to 0, since no RTT measurement will be done for receiving-only
connection.  If this receiving-only connection lived long enough
(e.g. >350sec, given the RTO starts from 200ms), and it suffered 12
sporadic device insufficient buffers, i.e. t_rxtshift >= 12, this
receiving-only connection would be dropped prematurely by the
retransmission timer.

We now assert that for data segments, SYNs or FINs either rexmit or
persist timer was wired upon ENOBUFS.  And don't set rexmit timer
for other cases, i.e. ENOBUFS upon ACKs.

Discussed with:	lstewart, hiren, jtl, Mike Karels
MFC after:	3 weeks
Sponsored by:	Microsoft OSTC
Differential Revision:	https://reviews.freebsd.org/D5872
2016-05-30 03:31:37 +00:00
Gleb Smirnoff
6351b3857b Plug route reference underleak that happens with FLOWTABLE after r297225.
Submitted by:	Mike Karels <mike karels.net>
2016-05-27 17:31:02 +00:00
Don Lewis
91336b403a Import Dummynet AQM version 0.2.1 (CoDel, FQ-CoDel, PIE and FQ-PIE).
Centre for Advanced Internet Architectures

Implementing AQM in FreeBSD

* Overview <http://caia.swin.edu.au/freebsd/aqm/index.html>

* Articles, Papers and Presentations
  <http://caia.swin.edu.au/freebsd/aqm/papers.html>

* Patches and Tools <http://caia.swin.edu.au/freebsd/aqm/downloads.html>

Overview

Recent years have seen a resurgence of interest in better managing
the depth of bottleneck queues in routers, switches and other places
that get congested. Solutions include transport protocol enhancements
at the end-hosts (such as delay-based or hybrid congestion control
schemes) and active queue management (AQM) schemes applied within
bottleneck queues.

The notion of AQM has been around since at least the late 1990s
(e.g. RFC 2309). In recent years the proliferation of oversized
buffers in all sorts of network devices (aka bufferbloat) has
stimulated keen community interest in four new AQM schemes -- CoDel,
FQ-CoDel, PIE and FQ-PIE.

The IETF AQM working group is looking to document these schemes,
and independent implementations are a corner-stone of the IETF's
process for confirming the clarity of publicly available protocol
descriptions. While significant development work on all three schemes
has occured in the Linux kernel, there is very little in FreeBSD.

Project Goals

This project began in late 2015, and aims to design and implement
functionally-correct versions of CoDel, FQ-CoDel, PIE and FQ_PIE
in FreeBSD (with code BSD-licensed as much as practical). We have
chosen to do this as extensions to FreeBSD's ipfw/dummynet firewall
and traffic shaper. Implementation of these AQM schemes in FreeBSD
will:
* Demonstrate whether the publicly available documentation is
  sufficient to enable independent, functionally equivalent implementations

* Provide a broader suite of AQM options for sections the networking
  community that rely on FreeBSD platforms

Program Members:

* Rasool Al Saadi (developer)

* Grenville Armitage (project lead)

Acknowledgements:

This project has been made possible in part by a gift from the
Comcast Innovation Fund.

Submitted by:	Rasool Al-Saadi <ralsaadi@swin.edu.au>
X-No objection:	core
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D6388
2016-05-26 21:40:13 +00:00
John Baldwin
052a5418e8 Don't reuse the source mbuf in tcp_respond() if it is not writable.
Not all mbufs passed up from device drivers are M_WRITABLE().  In
particular, the Chelsio T4/T5 driver uses a feature called "buffer packing"
to receive multiple frames in a single receive buffer.  The mbufs for
these frames all share the same external storage so are treated as
read-only by the rest of the stack when multiple frames are in flight.
Previously tcp_respond() would blindly overwrite read-only mbufs when
INVARIANTS was disabled or panic with an assertion failure if INVARIANTS
was enabled.  Note that the new case is a bit of a mix of the two other
cases in tcp_respond().  The TCP and IP headers must be copied explicitly
into the new mbuf instead of being inherited (similar to the m == NULL
case), but the addresses and ports must be swapped in the reply (similar
to the m != NULL case).

Reviewed by:	glebius
2016-05-26 18:35:37 +00:00
Michael Tuexen
4f3b84b524 Make struct sctp_paddrthlds compliant to RFC 7829. 2016-05-26 11:38:26 +00:00
Hans Petter Selasky
fc271df341 Use optimised complexity safe sorting routine instead of the kernel's
"qsort()".

The kernel's "qsort()" routine can in worst case spend O(N*N) amount of
comparisons before the input array is sorted. It can also recurse a
significant amount of times using up the kernel's interrupt thread
stack.

The custom sorting routine takes advantage of that the sorting key is
only 64 bits. Based on set and cleared bits in the sorting key it
partitions the array until it is sorted. This process has a recursion
limit of 64 times, due to the number of set and cleared bits which can
occur. Compiled with -O2 the sorting routine was measured to use
64-bytes of stack. Multiplying this by 64 gives a maximum stack
consumption of 4096 bytes for AMD64. The same applies to the execution
time, that the array to be sorted will not be traversed more than 64
times.

When serving roughly 80Gb/s with 80K TCP connections, the old method
consisting of "qsort()" and "tcp_lro_mbuf_compare_header()" used 1.4%
CPU, while the new "tcp_lro_sort()" used 1.1% for LRO related sorting
as measured by Intel Vtune. The testing was done using a sysctl to
toggle between "qsort()" and "tcp_lro_sort()".

Differential Revision:	https://reviews.freebsd.org/D6472
Sponsored by:	Mellanox Technologies
Tested by:	Netflix
Reviewed by:	gallatin, rrs, sephe, transport
2016-05-26 11:10:31 +00:00
Michael Tuexen
f88d0cfe7a When sending in ICMP response to an SCTP packet,
* include the SCTP common header, if possible
* include the first 8 bytes of the INIT chunk, if possible
This provides the necesary information for the receiver of the ICMP
packet to process it.

MFC after:	1 week
2016-05-25 22:16:11 +00:00
Michael Tuexen
6d7270a580 Send an ICMP packet indicating destination unreachable/protocol
unreachable if we don't handle the packet in the kernel and not
in userspace.

MFC after:	1 week
2016-05-25 15:54:21 +00:00
Michael Tuexen
ad2cbb09ef Count packets as not being delivered only if they are neither
processed by a kernel handler nor by a raw socket.

MFC after:	1 week
2016-05-25 13:48:26 +00:00
Don Lewis
883054b4c3 Change net.inet.tcp.ecn.enable sysctl mib from a binary off/on
control to a three way setting.
  0 - Totally disable ECN. (no change)
  1 - Enable ECN if incoming connections request it.  Outgoing
      connections will request ECN.  (no change from present != 0 setting)
  2 - Enable ECN if incoming connections request it.  Outgoing
      conections will not request ECN.

Change the default value of net.inet.tcp.ecn.enable from 0 to 2.

Linux version 2.4.20 and newer, Solaris, and Mac OS X 10.5 and newer have
similar capabilities.  The actual values above match Linux, and the default
matches the current Linux default.

Reviewed by:	eadler
MFC after:	1 month
MFH:		yes
Sponsored by:	https://reviews.freebsd.org/D6386
2016-05-19 22:20:35 +00:00
Gleb Smirnoff
f59d975e10 Tiny refactor of r294869/r296881: use defines to mask the VNET() macro.
Suggested by:	bz
2016-05-17 23:14:17 +00:00
Randall Stewart
5105a92c49 This small change adopts the excellent suggestion for using named
structures in the add of a new tcp-stack that came in late to me
via email after the last commit. It also makes it so that a new
stack may optionally get a callback during a retransmit
timeout. This allows the new stack to clear specific state (think
sack scoreboards or other such structures).

Sponsored by:	Netflix Inc.
Differential Revision:	http://reviews.freebsd.org/D6303
2016-05-17 09:53:22 +00:00
Andrey V. Elsukov
2685841b38 Make named objects set-aware. Now it is possible to create named
objects with the same name in different sets.

Add optional manage_sets() callback to objects rewriting framework.
It is intended to implement handler for moving and swapping named
object's sets. Add ipfw_obj_manage_sets() function that implements
generic sets handler. Use new callback to implement sets support for
lookup tables.
External actions objects are global and they don't support sets.
Modify eaction_findbyname() to reflect this.
ipfw(8) now may fail to move rules or sets, because some named objects
in target set may have conflicting names.
Note that ipfw_obj_ntlv type was changed, but since lookup tables
actually didn't support sets, this change is harmless.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2016-05-17 07:47:23 +00:00
Mark Johnston
565e7fd3bc opt_kdtrace.h is not needed for SDT probes as of r258541. 2016-05-15 20:04:43 +00:00
Mark Johnston
e8800f3c2f Fix a few style issues in the ICMP sysctl descriptions.
MFC after:	1 week
2016-05-15 03:19:53 +00:00
Michael Tuexen
574679afe9 Fix a locking bug which only shows up on Mac OS X.
MFC after: 1 week
2016-05-14 13:44:49 +00:00
Michael Tuexen
5f05199c19 Fix a bug introduced by the implementation of I-DATA support.
There was the requirement that two structures are in sync,
which is not valid anymore. Therefore don't rely on this
in the code anymore.
Thanks to Radek Malcic for reporting the issue. He found this
when using the userland stack.

MFC after: 1 week
2016-05-13 09:11:41 +00:00
Michael Tuexen
fd60718d17 Retire net.inet.sctp.strict_sacks and net.inet.sctp.strict_data_order
sysctl's, since they where only there to interop with non-conformant
implementations. This should not be a problem anymore.
2016-05-12 16:34:59 +00:00
Michael Tuexen
d88a626a1d Enable SACK Immediately per default.
This has been tested for a long time and implements covered by RFC 7053.

MFC after: 1 week
2016-05-12 15:48:08 +00:00
Michael Tuexen
e7f232a0db Use a format string in snprintf() for consistency.
This was reported by Radek Malcic when using the userland stack in
combination with MinGW.

MFC after: 1 week
2016-05-12 14:41:53 +00:00
Sepherosa Ziehau
e6ec45f869 tcp/syncache: Add comment for syncache_respond
Suggested by:	hiren, hps
Reviewed by:	sbruno
Sponsored by:	Microsoft OSTC
Differential Revision:	https://reviews.freebsd.org/D6148
2016-05-10 04:59:04 +00:00
Hiren Panchasara
7c375daa61 Add an option to use rfc6675 based pipe/inflight bytes calculation in htcp.
Submitted by:	Kevin Bowling <kevin.bowling@kev009.com>
MFC after:	1 week
Sponsored by:	Limelight Networks
2016-05-09 19:19:03 +00:00
Michael Tuexen
a807fe2d83 Cleanup a comment.
MFC after: 1 week
2016-05-09 16:35:05 +00:00
Pedro F. Giffuni
a4641f4eaa sys/net*: minor spelling fixes.
No functional change.
2016-05-03 18:05:43 +00:00
Sepherosa Ziehau
51e3c20d36 tcp/lro: Refactor the active list operation.
Ease more work concerning active list, e.g. hash table etc.

Reviewed by:	gallatin, rrs (earlier version)
Sponsored by:	Microsoft OSTC
Differential Revision:	https://reviews.freebsd.org/D6137
2016-05-03 08:13:25 +00:00
Michael Tuexen
afd6748258 Undo a spell fix introduced in r298942, which breaks compilation. 2016-05-02 21:23:05 +00:00
Pedro F. Giffuni
cd0a4ff6a5 netinet/sctp*: minor spelling fixes in comments.
No functional change.

Reviewed by:	tuexen
2016-05-02 20:56:11 +00:00
Michael Tuexen
ec70917ffa When a client uses UDP encapsulation and lists IP addresses in the INIT
chunk, enable UDP encapsulation for all those addresses.
This helps clients using a userland stack to support multihoming if
they are not behind a NAT.

MFC after: 1 week
2016-05-01 21:48:55 +00:00
Michael Tuexen
7154bf4a41 Add the UDP encaps port as a parameter to sctp_add_remote_addr().
This is currently only a code change without any functional
change. But this allows to set the remote encapsulation port
in a more detailed way, which will be provided in a follow-up
commit.

MFC after: 1 week
2016-04-30 14:25:00 +00:00
Michael Tuexen
3c3f9e2a46 Don't assign, just compare... 2016-04-29 20:33:20 +00:00
Michael Tuexen
fd7af143e2 Add support for handling ICMP and ICMP6 messages sent in response
to SCTP/UDP/IP and SCTP/UDP/IPv6 packets.
2016-04-29 20:22:01 +00:00
Sepherosa Ziehau
9340a8d5b9 tcp/syncache: Set flowid and hash type properly for SYN|ACK
So the underlying drivers can use it to select the sending queue
properly for SYN|ACK instead of rolling their own hash.

Sponsored by:	Microsoft OSTC
Differential Revision:	https://reviews.freebsd.org/D6120
2016-04-29 07:23:08 +00:00
Randall Stewart
abb901c5d7 Complete the UDP tunneling of ICMP msgs to those protocols
interested in having tunneled UDP and finding out about the
ICMP (tested by Michael Tuexen with SCTP.. soon to be using
this feature).

Differential Revision:	http://reviews.freebsd.org/D5875
2016-04-28 15:53:10 +00:00
Randall Stewart
e5ad64562a This cleans up the timers code in TCP to start using the new
async_drain functionality. This as been tested in NF as well as
by Verisign. Still to do in here is to remove all the old flags. They
are currently left being maintained but probably are no longer needed.

Sponsored by:	Netflix Inc.
Differential Revision:	http://reviews.freebsd.org/D5924
2016-04-28 13:27:12 +00:00
Pedro F. Giffuni
a5b50fbc20 ipdivert: Remove unnecessary and incorrectly typed variable.
In principle n is only used to carry a copy of ipi_count, which is
unsigned, in the non-VIMAGE case, however ipi_count can be used
directly so it is not needed at all. Removing it makes things look
cleaner.
2016-04-28 02:46:08 +00:00
Sepherosa Ziehau
9b436b180c tcp/lro: Fix more typo
Noticed by:	hiren
MFC after:	1 week
Sponsored by:	Microsoft OSTC
2016-04-28 01:43:18 +00:00
Michael Tuexen
c09a15342a Don't use the control argument after calling sctp_add_to_readq().
This breaks the userland stack. There should be no functional change
for the FreeBSD kernel stack.
While there, use consistent variable nameing.
2016-04-27 18:58:47 +00:00
Sepherosa Ziehau
9e3db01282 tcp/lro: Fix typo.
MFC after:	1 week
Sponsored by:	Microsoft OSTC
2016-04-27 09:40:55 +00:00
Conrad Meyer
2769d06203 in_lltable_alloc and in6 copy: Don't leak LLE in error path
Fix a memory leak in error conditions introduced in r292978.

Reported by:	Coverity
CIDs:		1347009, 1347010
Sponsored by:	EMC / Isilon Storage Division
2016-04-26 23:13:48 +00:00
Conrad Meyer
bac5bedf44 tcp_usrreq: Free allocated buffer in relock case
The disgusting macro INP_WLOCK_RECHECK may early-return.  In
tcp_default_ctloutput() the TCP_CCALGOOPT case allocates memory before invoking
this macro, which may leak memory.

Add a _CLEANUP variant that takes a code argument to perform variable cleanup
in the early return path.  Use it to free the 'pbuf' allocated in
tcp_default_ctloutput().

I am not especially happy with this macro, but I reckon it's not any worse than
INP_WLOCK_RECHECK already was.

Reported by:	Coverity
CID:		1350286
Sponsored by:	EMC / Isilon Storage Division
2016-04-26 23:02:18 +00:00
Michael Tuexen
7e372b1a40 Remove a function, which is not used anymore. 2016-04-23 09:15:58 +00:00
Jonathan T. Looney
b8c2cd15e9 Prevent underflows in tp->snd_wnd if the remote side ACKs more than
tp->snd_wnd. This can happen, for example, when the remote side responds to
a window probe by ACKing the one byte it contains.

Differential Revision:	https://reviews.freebsd.org/D5625
Reviewed by:	hiren
Obtained from:	Juniper Networks (earlier version)
MFC after:	2 weeks
Sponsored by:	Juniper Networks
2016-04-21 15:06:53 +00:00
Pedro F. Giffuni
63b6b7a74a Indentation issues.
Contract some lines leftover from r298310.

Mea culpa.
2016-04-20 16:19:44 +00:00
Pedro F. Giffuni
02abd40029 kernel: use our nitems() macro when it is available through param.h.
No functional change, only trivial cases are done in this sweep,

Discussed in:	freebsd-current
2016-04-19 23:48:27 +00:00
Michael Tuexen
b1deed45e6 Address issues found by the XCode code analyzer. 2016-04-18 20:16:41 +00:00
Michael Tuexen
f8ee69bf81 Fix signed/unsigned warnings. 2016-04-18 11:39:41 +00:00
Michael Tuexen
a39ddef038 Fix a warning about an unused variable. 2016-04-18 09:39:46 +00:00
Michael Tuexen
98d5fd976b Put panic() calls under INVARIANTS. 2016-04-18 09:29:14 +00:00
Michael Tuexen
f2ea2a2d5f Cleanup debug output. 2016-04-18 06:58:07 +00:00
Michael Tuexen
e187bac213 Don't use anonymous unions. 2016-04-18 06:38:53 +00:00
Michael Tuexen
24a9e1b53b Remove a left-over debug printf(). 2016-04-18 06:32:24 +00:00
Michael Tuexen
b9dd6a90b6 Fix the ICMP6 handling for SCTP.
Keep the IPv4 code in sync.

MFC after:	1 week
2016-04-16 21:34:49 +00:00
Pedro F. Giffuni
99d628d577 netinet: for pointers replace 0 with NULL.
These are mostly cosmetical, no functional change.

Found with devel/coccinelle.

Reviewed by:	ae. tuexen
2016-04-15 15:46:41 +00:00
Andrey V. Elsukov
2acdf79f53 Add External Actions KPI to ipfw(9).
It allows implementing loadable kernel modules with new actions and
without needing to modify kernel headers and ipfw(8). The module
registers its action handler and keyword string, that will be used
as action name. Using generic syntax user can add rules with this
action. Also ipfw(8) can be easily modified to extend basic syntax
for external actions, that become a part base system.
Sample modules will coming soon.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2016-04-14 22:51:23 +00:00
Michael Tuexen
4d6b853ad6 Allow the handling of ICMP messages sent in response to SCTP packets
containing an INIT chunk. These need to be handled in case the peer
does not support SCTP and returns an ICMP messages indicating destination
unreachable, protocol unreachable.

MFC after:	1 week
2016-04-14 19:59:21 +00:00
Michael Tuexen
f77b842746 When delivering an ICMP packet to the ctlinput function, ensure that
the outer IP header, the ICMP header, the inner IP header and the
first n bytes are stored in contgous memory. The ctlinput functions
currently rely on this for n = 8. This fixes a bug in case the inner IP
header had options.
While there, remove the options from the outer header and provide a
way to increase n to allow improved ICMP handling for SCTP. This will
be added in another commit.

MFC after:	1 week
2016-04-14 19:51:29 +00:00
Luiz Otavio O Souza
de89d74b70 Do not overwrite the dchg variable.
It does not cause any real issues because the variable is overwritten
only when the packet is forwarded (and the variable is not used anymore).

Obtained from:	pfSense
MFC after:	2 weeks
Sponsored by:	Rubicon Communications (Netgate)
2016-04-14 18:57:30 +00:00
Michael Tuexen
08b9595770 Refactor the handling of ICMP/IPv4 packets for SCTP/IPv4.
This cleansup the code and prepares upcoming handling of ICMP/IPv4 packets
for SCTP/UDP/IPv4 packets. IPv6 changes will follow...

MFC after:	3 days
2016-04-12 21:40:54 +00:00
Michael Tuexen
cf4476eb39 When processing an ICMP packet containing an SCTP packet, it
is required to check the verification tag. However, this
requires the verification tag to be not 0. Enforce this.
For packets with a verification tag of 0, we need to
check it it contains an INIT chunk and use the initiate
tag for the validation. This will be a separate commit,
since it touches also other code.

MFC after: 1 week
2016-04-12 11:48:54 +00:00
Bjoern A. Zeeb
806929d514 Mfp: r296310,r296343
It looks like as with the safety belt of DELAY() fastened (*) we can
completely tear down and free all memory for TCP (after r281599).

(*) in theory a few ticks should be good enough to make sure the timers
are all really gone. Could we use a better matric here and check a
tcbcb count as an optimization?

PR:		164763
Reviewed by:	gnn, emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D5734
2016-04-09 12:05:23 +00:00
Bjoern A. Zeeb
8586a9635f Mfp: r296260
The tcp_inpcb (pcbinfo) zone should be safe to destroy.

PR:		164763
Reviewed by:	gnn
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D5732
2016-04-09 11:27:47 +00:00
Bjoern A. Zeeb
f254aeda60 Mfp: r296259
We attach the "counter" to the tcpcbs. Thus don't free the
TCP Fastopen zone before the tcpcbs are gone, as otherwise
the zone won't be empty.
With that it should be safe to destroy the "tfo" zone without
leaking the memory.

PR:		164763
Reviewed by:	gnn
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D5731
2016-04-09 10:58:08 +00:00
Bjoern A. Zeeb
dc95d65555 Mfp: r296309
While there is no dependency interaction, stopping the timer before
freeing the rest of the resources seems more natural and avoids it
being scheduled an extra time when it is no longer needed.

Reviewed by:	gnn, emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D5733
2016-04-09 10:51:07 +00:00
Bjoern A. Zeeb
e18b26d377 Mfp: r296345
No need to keep type stability on raw sockets zone.
We've also been running with a KASSERT since r222488 to make sure the
ipi_count is 0 on destroy.

PR:		164763
Reviewed by:	gnn
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D5735
2016-04-09 10:44:57 +00:00
Bjoern A. Zeeb
4c86b2bc13 Mfp: r296346
No reason identified to keep UMA_ZONE_NOFREE here.

Reviewed by:	gnn
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D5736
2016-04-09 10:39:54 +00:00
Randall Stewart
9d18771f69 A couple of minor changes that I missed that Michael had done, most noted
in these is the change to non-strict ordering for incoming data (this will
make pkt-drill test 14 fail but its expected).
2016-04-07 09:34:41 +00:00
Randall Stewart
44249214d3 This is work done by Michael Tuexen and myself at the IETF. This
adds the new I-Data (Interleaved Data) message. This allows a user
to be able to have complete freedom from Head Of Line blocking that
was previously there due to the in-ability to send multiple large
messages without the TSN's being in sequence. The code as been
tested with Michaels various packet drill scripts as well as
inter-networking between the IETF's location in Argentina and Germany.
2016-04-07 09:10:34 +00:00
Michael Tuexen
e2823e8570 Set the chunk id for ERROR chunks.
This is work with rrs@.
MFC after:	1 week
2016-04-01 20:38:15 +00:00
Sepherosa Ziehau
1ea448225c tcp/lro: Change SLIST to LIST, so that removing an entry is O(1)
This is kinda critical to the performance when the CPU is slow and
network bandwidth is high, e.g. in the hypervisor.

Reviewed by:	rrs, gallatin, Dexuan Cui <decui microsoft com>
Sponsored by:	Microsoft OSTC
Differential Revision:	https://reviews.freebsd.org/D5765
2016-04-01 06:43:05 +00:00
Sepherosa Ziehau
6dd38b8716 tcp/lro: Use tcp_lro_flush_all in device drivers to avoid code duplication
And factor out tcp_lro_rx_done, which deduplicates the same logic with
netinet/tcp_lro.c

Reviewed by:	gallatin (1st version), hps, zbb, np, Dexuan Cui <decui microsoft com>
Sponsored by:	Microsoft OSTC
Differential Revision:	https://reviews.freebsd.org/D5725
2016-04-01 06:28:33 +00:00
George V. Neville-Neil
ce223fb715 Unbreak the RSS/PCBGROUp build. 2016-03-31 00:53:23 +00:00
Edward Tomasz Napierala
35030a5dd4 Remove some NULL checks for M_WAITOK allocations.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-03-29 13:56:59 +00:00
Michael Tuexen
a08b29253d Don't allow the user to set a peer primary which is restricted
and not pending.

MFC after: 1 week
2016-03-28 19:32:13 +00:00
Michael Tuexen
76f8482a93 Restrict local addresses until they are acked by the peer.
MFC after: 1 week
2016-03-28 19:31:10 +00:00
Michael Tuexen
5114dccbd4 Trigger sending of queued ASCONF chunks if outstanding ones are ACKED.
MFC after:	1 week
2016-03-28 11:32:20 +00:00
Michael Tuexen
9a8e308861 Improve compilation on windows 64-bit (for the userland stack).
MFC after:	1 week
2016-03-27 10:04:25 +00:00
Sepherosa Ziehau
489f0c3c17 tcp/lro: Return TCP_LRO_NO_ENTRIES if we are short of LRO entries.
So that callers could react accordingly.

Reviewed by:	gallatin (no objection)
MFC after:	1 week
Sponsored by:	Microsoft OSTC
Differential Revision:	https://reviews.freebsd.org/D5695
2016-03-25 02:54:13 +00:00
Bjoern A. Zeeb
4f321dbd1c Fix compile errors after r297225:
- properly V_irtualise variable access unbreaking VIMAGE kernels.
- remove the volatile from the function return type to make architecture
  using gcc happy [-Wreturn-type]
  "type qualifiers ignored on function return type"
  I am not entirely happy with this solution putting the u_int there
  but it will do for now.
2016-03-24 11:40:10 +00:00
George V. Neville-Neil
84cc0778d0 FreeBSD previously provided route caching for TCP (and UDP). Re-add
route caching for TCP, with some improvements. In particular, invalidate
the route cache if a new route is added, which might be a better match.
The cache is automatically invalidated if the old route is deleted.

Submitted by:	Mike Karels
Reviewed by:	gnn
Differential Revision:	https://reviews.freebsd.org/D4306
2016-03-24 07:54:56 +00:00
Michael Tuexen
ed65436366 Add const to several constants. Thanks to Nicholas Nethercote for
providing the patch via
https://bugzilla.mozilla.org/show_bug.cgi?id=1255655

MFC after:	1 week
2016-03-23 13:28:04 +00:00
Jonathan T. Looney
5d20f97461 to_flags is currently a 64-bit integer; however, we only use 7 bits.
Furthermore, there is no reason this needs to be a 64-bit integer
for the forseeable future.

Also, there is an inconsistency between to_flags and the mask in
tcp_addoptions(). Before r195654, to_flags was a u_long and the mask in
tcp_addoptions() was a u_int. r195654 changed to_flags to be a u_int64_t
but left the mask in tcp_addoptions() as a u_int, meaning that these
variables will only be the same width on platforms with 64-bit integers.

Convert both to_flags and the mask in tcp_addoptions() to be explicitly
32-bit variables. This may save a few cycles on 32-bit platforms, and
avoids unnecessarily mixing types.

Differential Revision:	https://reviews.freebsd.org/D5584
Reviewed by:	hiren
MFC after:	2 weeks
Sponsored by:	Juniper Networks
2016-03-22 15:55:17 +00:00
Hans Petter Selasky
d4d32b9fec Fix kernel build after adding new sysctl asserts in r296933. 2016-03-16 10:42:24 +00:00
Gleb Smirnoff
bf840a1707 Redo r294869. The array of counters for TCP states doesn't belong to
struct tcpstat, because the structure can be zeroed out by netstat(1) -z,
and of course running connection counts shouldn't be touched.

Place running connection counts into separate array, and provide
separate read-only sysctl oid for it.
2016-03-15 00:15:10 +00:00
Gleb Smirnoff
2f06d2ab91 Comment fix: statistics are not read-only. 2016-03-14 18:06:59 +00:00
Bjoern A. Zeeb
19edab1711 Remove duplicate external declaration of tcprexmtthresh making
gcc compiles barf.
2016-03-13 21:26:18 +00:00
John Baldwin
47cedcbd72 Use SI_SUB_LAST instead of SI_SUB_SMP as the "catch-all" subsystem.
Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D5515
2016-03-11 23:18:06 +00:00
Michael Tuexen
1fabc43e9f Actually send a asconf chunk, not only queue one.
MFC after: 3 days
2016-03-10 00:27:10 +00:00
Randall Stewart
ec64c84ddc Fix a sneaky bug where we were missing an extern
to get the rxt threshold.. and thus created our own defaulted to 0 :-(

Sponsored by:	Netflix Inc
2016-03-08 00:16:34 +00:00
Jonathan T. Looney
737d4f6c93 As reported on the transport@ and current@ mailing lists, the FreeBSD TCP
stack is not compliant with RFC 7323, which requires that TCP stacks send
a timestamp option on all packets (except, optionally, RSTs) after the
session is established.

This patch adds that support. It also adds a TCP signature option to the
packet, if appropriate.

PR:		206047
Differential Revision:	https://reviews.freebsd.org/D4808
Reviewed by:	hiren
MFC after:	2 weeks
Sponsored by:	Juniper Networks
2016-03-07 15:00:34 +00:00
Jonathan T. Looney
9cbade8feb Some cleanup in tcp_respond() in preparation for another change:
- Reorder variables by size
- Move initializer closer to where it is used
- Remove unneeded variable

Differential Revision:	https://reviews.freebsd.org/D4808
Reviewed by:	hiren
MFC after:	2 weeks
Sponsored by:	Juniper Networks
2016-03-07 14:59:49 +00:00
George V. Neville-Neil
e79cb051d5 Fix dtrace probes (introduced in 287759): debug__input was used
for output and drop; connect didn't always fire a user probe
some probes were missing in fastpath

Submitted by:	Hannes Mehnert
Sponsored by:	REMS, EPSRC
Differential Revision:	https://reviews.freebsd.org/D5525
2016-03-03 17:46:38 +00:00
Bryan Drewery
6971a63795 Fix build after r29592. 2016-02-23 21:21:47 +00:00
Randall Stewart
6e0efc6a39 This fixes the fastpath code to have a better module initialization sequence when
included in loader.conf. It also fixes it so that no matter if some one incorrectly
specifies a load order, the lists and such will be initialized on demand at that
time so no one can make that mistake.

Reviewed by:	hiren
Differential Revision:	D5189
2016-02-23 17:53:39 +00:00
Michael Tuexen
64a3a6304e Use the SCTP level pointer, not the interface level.
MFC after:	3 days
2016-02-19 11:25:18 +00:00
Michael Tuexen
861f6d1196 Add protection code.
MFC after:	3 days
CID:		748858
2016-02-18 21:33:10 +00:00
Michael Tuexen
fdc4c9d067 Add some protection code.
CID:		1331893
MFC after:	3 days
2016-02-18 21:21:45 +00:00
Sepherosa Ziehau
7ae3d4bf54 tcp/lro: Allow drivers to set the TCP ACK/data segment aggregation limit
ACK aggregation limit is append count based, while the TCP data segment
aggregation limit is length based.  Unless the network driver sets these
two limits, it's an NO-OP.

Reviewed by:	adrian, gallatin (previous version), hselasky (previous version)
Approved by:	adrian (mentor)
MFC after:	1 week
Sponsored by:	Microsoft OSTC
Differential Revision:	https://reviews.freebsd.org/D5185
2016-02-18 04:58:34 +00:00
Michael Tuexen
828318e155 Add protection code for issues reported by PVS / D5245.
MFC after:	3 days
2016-02-17 18:12:38 +00:00
Michael Tuexen
815f806b82 Code cleanup which will silence a warning in PVS / D5245. 2016-02-17 18:04:22 +00:00
Michael Tuexen
7b0fd8f2af Address a warning reported by D5245 / PVS.
MFC after:	3 days
2016-02-17 17:52:46 +00:00
Michael Tuexen
467f0d55b4 Whitespace changes. 2016-02-16 20:33:18 +00:00
Michael Tuexen
2b1c7de4d8 Improve the teardown of the SCTP stack.
Obtained from:	bz@
MFC after: 1 week
2016-02-16 19:36:25 +00:00
Michael Tuexen
e51963a7bb Loopback addresses are 127.0.0.0/8, not 127.0.0.1/32.
MFC after: 1 week
2016-02-11 22:29:39 +00:00
Michael Tuexen
b028cf319e Use 4 spaces instead of a tab. 2016-02-11 18:35:46 +00:00
Devin Teske
41c0ec9a16 Merge SVN r295220 (bz) from projects/vnet/
Fix a panic that occurs when a vnet interface is unavailable at the time the
vnet jail referencing said interface is stopped.

Sponsored by:	FIS Global, Inc.
2016-02-11 17:07:19 +00:00
Hans Petter Selasky
3e9470b721 Use a pair of ifs when comparing the 32-bit flowid integers so that
the sign bit doesn't cause an overflow. The overflow manifests itself
as a sorting index wrap around in the middle of the sorted array,
which is not a problem for the LRO code, but might be a problem for
the logic inside qsort().

Reviewed by:		gnn @
Sponsored by:		Mellanox Technologies
Differential Revision:	https://reviews.freebsd.org/D5239
2016-02-11 10:03:50 +00:00
Gleb Smirnoff
b4b12e52fb Garbage collect unused arguments of m_init(). 2016-02-10 18:54:18 +00:00
Bjoern A. Zeeb
a5243af262 Code duplication but rib_head is special. Not found an easy way to go
back and harmize the use cases among RIB, IPFW, PF yet but it's also not
the scope of this work.   Prevents instant panics on teardown and frees
the FIB bits again.

Sponsored by:	The FreeBSD Foundation
2016-02-03 21:56:51 +00:00
Bjoern A. Zeeb
2414e86439 MfH @r295202
Expect to see panics in routing code at least now.
2016-02-03 11:49:51 +00:00
Alfred Perlstein
7325dfbb59 Increase max allowed backlog for listen sockets
from short to int.

PR: 203922
Submitted by: White Knight <white_knight@2ch.net>
MFC After: 4 weeks
2016-02-02 05:57:59 +00:00
Gleb Smirnoff
8ec07310fa These files were getting sys/malloc.h and vm/uma.h with header pollution
via sys/mbuf.h
2016-02-01 17:41:21 +00:00
Michael Tuexen
5322a0968e Add missing parentheses. This was reported by ccaughie via GitHub
for the userland stack.

MFC after: 3 days
2016-01-30 17:32:46 +00:00
Michael Tuexen
3cf729a920 Update the path mtu when turning on/off UDP encapsulation for SCTP.
MFC after: 3 days
2016-01-30 16:56:39 +00:00
Michael Tuexen
ca83f93c09 Don't allow a remote encapsulation port change during the
SCTP restart procedure.

MFC after: 3 days
2016-01-30 12:58:38 +00:00
Michael Tuexen
4edd31fc71 Don't change the remote UDP encapsulation port for SCTP packets
containing an INIT chunk.

MFC after: 3 days
2016-01-30 11:10:22 +00:00
Michael Tuexen
843d04a89e Ignore peer addresses in a consistent way also when checking for
new addresses during restart. If this is not done, restart doesn't
work when the local socket is IPv4 only and the peer uses
IPv4 and IPv6 addresses.

MFC after: 3 days.
2016-01-30 10:39:05 +00:00
Michael Tuexen
a4cab32319 Remove debug output which was committed by accident.
Thanks to Oliver Pinter for reporting.

MFC after: 3 days
X-MFC with: r294995
2016-01-28 23:12:12 +00:00
Michael Tuexen
79b67faaf6 Always look in the TCP pool.
This fixes issues with a restarting peer when the listening
1-to-1 style socket is closed.

MFC after: 3 days
2016-01-28 16:05:46 +00:00
Gleb Smirnoff
4644fda3f7 Rename netinet/tcp_cc.h to netinet/cc/cc.h.
Discussed with:	lstewart
2016-01-27 17:59:39 +00:00
Gleb Smirnoff
af6fef3abb Fix issues with TCP_CONGESTION handling after r294540:
o Return back the buf[TCP_CA_NAME_MAX] for TCP_CONGESTION,
  for TCP_CCALGOOPT use dynamically allocated *pbuf.
o For SOPT_SET TCP_CONGESTION do NULL terminating of string
  taking from userland.
o For SOPT_SET TCP_CONGESTION do the search for the algorithm
  keeping the inpcb lock.
o For SOPT_GET TCP_CONGESTION first strlcpy() the name
  holding the inpcb lock into temporary buffer, then copyout.

Together with:	lstewart
2016-01-27 07:34:00 +00:00
Gleb Smirnoff
75dd79d937 Grab a snap amount of TCP connections in syncache from tcpstat. 2016-01-27 00:48:05 +00:00
Gleb Smirnoff
57a78e3bae Augment struct tcpstat with tcps_states[], which is used for book-keeping
the amount of TCP connections by state.  Provides a cheap way to get
connection count without traversing the whole pcb list.

Sponsored by:	Netflix
2016-01-27 00:45:46 +00:00
Gleb Smirnoff
d17d4c6b2a Provide TCPSTAT_DEC() and TCPSTAT_FETCH() macros. 2016-01-27 00:20:07 +00:00
Hiren Panchasara
0645c6049d Persist timers TCPTV_PERSMIN and TCPTV_PERSMAX are hardcoded with 5 seconds and
60 seconds, respectively. Turn them into sysctls that can be tuned live. The
default values of 5 seconds and 60 seconds have been retained.

Submitted by:		Jason Wolfe (j at nitrology dot com)
Reviewed by:		gnn, rrs, hiren, bz
MFC after:		1 week
Sponsored by:		Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D5024
2016-01-26 16:33:38 +00:00
Alexander V. Chernikov
0d6a516eb8 Convert TCP mtu checks to the new routing KPI. 2016-01-25 10:06:49 +00:00
Alexander V. Chernikov
61eee0e202 MFP r287070,r287073: split radix implementation and route table structure.
There are number of radix consumers in kernel land (pf,ipfw,nfs,route)
  with different requirements. In fact, first 3 don't have _any_ requirements
  and first 2 does not use radix locking. On the other hand, routing
  structure do have these requirements (rnh_gen, multipath, custom
  to-be-added control plane functions, different locking).
Additionally, radix should not known anything about its consumers internals.

So, radix code now uses tiny 'struct radix_head' structure along with
  internal 'struct radix_mask_head' instead of 'struct radix_node_head'.
  Existing consumers still uses the same 'struct radix_node_head' with
  slight modifications: they need to pass pointer to (embedded)
  'struct radix_head' to all radix callbacks.

Routing code now uses new 'struct rib_head' with different locking macro:
  RADIX_NODE_HEAD prefix was renamed to RIB_ (which stands for routing
  information base).

New net/route_var.h header was added to hold routing subsystem internal
  data. 'struct rib_head' was placed there. 'struct rtentry' will also
  be moved there soon.
2016-01-25 06:33:15 +00:00
Bjoern A. Zeeb
70a0984741 sctp_asconf_iterator_end() has an unused second argument; compiles
better if you add it.

Sponsored by:	The FreeBSD Foundation
2016-01-23 12:56:28 +00:00
Bjoern A. Zeeb
d30c4f99ed Noisy comments (not sure if the static would be valid for all SCTP
implementations).

Reorder some cleanup just to match the general order we normally use.

Sponsored by:	The FreeBSD Foundation
2016-01-23 12:52:08 +00:00
Bjoern A. Zeeb
765cf0b825 Try to prevent an address (assoc) leak in one way or another when
sctp_initiate_iterator() fails.

Sponsored by:	The FreeBSD Foundation
2016-01-23 12:51:12 +00:00
Bjoern A. Zeeb
ce1d6b0efa Use sctp_asconf_iterator_end() rather than doing the cleanup manually.
Sponsored by:	The FreeBSD Foundation
2016-01-23 12:50:02 +00:00
Bjoern A. Zeeb
27a01c6c0c Try to catch a couple of SCTP teardown race conditions.
Saw all the printfs already.

Note: not sure the atomics are needed but without them, the condition
would never trigger, and we'd still see panics (which could have been
due to the insert race).  Will work my way backwards in case this stays
stable.

Sponsored by:	The FreeBSD Foundation
2016-01-23 11:05:13 +00:00
Bjoern A. Zeeb
eef5775f02 Fix build and avoid a double-free in the VIMAGE case.
Sponsored by:	The FreeBSD Foundation
2016-01-22 19:43:26 +00:00
Bjoern A. Zeeb
bb84e3d77d Correct function arguments for SYSUNINITs.
Sponsored by:	The FreeBSD Foundation
2016-01-22 18:39:23 +00:00
Bjoern A. Zeeb
1bbe967cc4 Correct function arguments for SYSUNINITs.
Obtained from:	p4 @180834
Sponsored by:	The FreeBSD Foundation
2016-01-22 18:37:17 +00:00
Bjoern A. Zeeb
4ce8702050 Correct function arguments for SYSUNINITs.
Add #ifdef VIMAGE, as in other cases it's dead code.

Obtained from:	p4 @180832
Sponsored by:	The FreeBSD Foundation
2016-01-22 18:35:11 +00:00