5677 Commits

Author SHA1 Message Date
Michael Tuexen
2048d80aa3 Consistent handling of errors reported from the lower layer.
MFC after:	3 days
2016-12-27 22:14:41 +00:00
Michael Tuexen
b7b84c0e02 Whitespace changes.
The toolchain for processing the sources has been updated. No functional
change.

MFC after:	3 days
2016-12-26 11:06:41 +00:00
Michael Tuexen
d6194c562f Remove a KASSERT which is not always true.
In case of the empty queue tp->snd_holes and tcp_sackhole_insert()
failing due to memory shortage, tp->snd_holes will be empty.
This problem was hit when stress tests were performed by pho.

PR:		215513
Reported by:	pho
Tested by:	pho
Sponsored by:	Netflix, Inc.
2016-12-25 17:37:18 +00:00
Gleb Smirnoff
030b9c2f69 Remove assigned only variable.
2016-12-21 22:47:10 +00:00
Andrey V. Elsukov
ad9f4d6ab6 ip[6]_tryforward does inbound and outbound packet firewall processing.
This can lead to change of mbuf pointer (packet filter could do m_pullup(),
NAT, etc). Also in case of change of destination address, tryforward can
decide that packet should be handled by local system. In this case modified
mbuf can be returned to the ip[6]_input(). To handle this correctly, check
M_FASTFWD_OURS flag after return from ip[6]_tryforward. If it is present,
update the variables that depend on the mbuf pointer and skip the additional
inbound firewall processing.

No objection from:	#network
MFC after:	3 weeks
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D8764
2016-12-19 11:02:49 +00:00
Michael Tuexen
3d6fe5d84c Fix the handling of buffered messages in stream reset deferred handling.
Thanks to Eugen-Andrei Gavriloaie for reporting the issue and providing
substantial help in nailing down the issue.

MFC after:	1 week
2016-12-17 22:31:30 +00:00
Hiren Panchasara
b6ff672460 We currently don't do TSO if ip options are present. In case of IPv6, we look at
in6p_options to check that. That is incorrect as we carry ip options in
in6p_outputopts. Also, just checking for in6p_outputopts being NULL won't
suffice as we combine ip options and ip header fields both in that one field.
The commit fixes this by using ip6_optlen() which correctly calculates length
of only ip options for IPv6.

Reviewed by:	    ae, bz
MFC after:	    3 weeks
Sponsored by:	    Limelight Networks
2016-12-11 23:14:47 +00:00
Michael Tuexen
8b9c95f4a9 Ensure that the reported ppid and tsn are taken from the first fragment.
This fixes a bug where the wrong ppid was reported, if
* I-DATA was used and the first fragment was not received first
* DATA was used and different ppids were used.

Thanks to Julian Cordes for making me aware of the issue.

MFC after:	1 week
2016-12-11 13:26:35 +00:00
Gleb Smirnoff
8c70a35334 Fix build for 32-bit machines.
Submitted by:	tuexen
2016-12-09 20:50:35 +00:00
Gleb Smirnoff
3cbee8caa1 Use counter_ratecheck() in the ICMP rate limiting.
Together with:	rrs, jtl
2016-12-09 17:59:15 +00:00
Michael Tuexen
ebecdad811 Don't bundle a SACK chunk with a SHUTDOWN chunk if it is not required.
MFC after:	1 week
2016-12-09 17:58:07 +00:00
Michael Tuexen
8d0a31e19c Don't send multiple SHUTDOWN chunks in a single packet.
Thanks to Felix Weinrank for making me aware of this issue.

MFC after:	1 week
2016-12-09 17:57:17 +00:00
Michael Tuexen
b594081bdf Silence a warning produced by newer versions of gcc.
MFC after:	1 week
2016-12-07 22:01:09 +00:00
Michael Tuexen
49656eefc8 Cleanup the names of SSN, SID, TSN, FSN, PPID and MID.
This made a couple of bugs visible in handling SSN wrap-arounds
when using DATA chunks. Now bulk transfer seems to work fine...
This fixes the issue reported in
https://github.com/sctplab/usrsctp/issues/111

MFC after:	1 week
2016-12-07 19:30:59 +00:00
Michael Tuexen
5b495f17a5 Whitespace changes.
The tools used to generate the sources have been updated and produce
different whitespace. Commit this separately to avoid intermixing
these with real code changes.

MFC after:	3 days
2016-12-06 10:21:25 +00:00
Michael Tuexen
4ddd5aadea Fix the handling of TCP FIN-segments in the CLOSED state
When a TCP segment with the FIN bit set was received in the CLOSED state,
a TCP RST-ACK-segment is sent. When computing SEG.ACK for this, the
FIN counts as one byte. This accounting was missing and is fixed by this
patch.

Reviewed by:		hiren
MFC after:		1 month
Sponsored by:		Netflix, Inc.
Differential Revision:	https://svn.freebsd.org/base/head
2016-12-02 08:02:31 +00:00
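A minimal sketch of the accounting described in 4ddd5aadea, not the kernel code itself: the acknowledgment number of the RST-ACK reply is SEG.SEQ plus the segment length, where SYN and FIN each occupy one sequence number. The struct and function names are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of a received segment; not a kernel structure. */
struct seg_info {
    uint32_t seq;      /* SEG.SEQ of the received segment */
    uint32_t data_len; /* number of payload bytes */
    bool     syn;      /* SYN bit set */
    bool     fin;      /* FIN bit set */
};

/*
 * Acknowledgment number for the RST-ACK sent in reply.
 * SYN and FIN each consume one sequence number, so they are added
 * to the payload length. Unsigned arithmetic wraps modulo 2^32,
 * which matches TCP sequence number arithmetic.
 */
static uint32_t rst_ack_number(const struct seg_info *s)
{
    uint32_t len = s->data_len;

    if (s->syn)
        len++;
    if (s->fin)
        len++;
    return s->seq + len;
}
```

For a bare FIN with SEG.SEQ 1000 and no payload, this yields 1001; without the FIN accounting it would have been 1000.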
Andrey V. Elsukov
dc9d21f8b0 Rework ip_tryforward() to use FIB4 KPI.
Tested by:	olivier
Obtained from:	Yandex LLC
MFC after:	1 month
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D8526
2016-11-28 17:55:32 +00:00
Hiren Panchasara
2806b2933b For RTT calculations mid-session, we explicitly ignore ACKs with tsecr of 0 as
many broken middle-boxes tend to do that. But during 3whs, in syncache_expand(),
we don't do that which causes us to send a RST to such a client. Relax this
constraint by only using tsecr to compare against timestamp that we sent when it
is not 0. As a result, we'd now accept the final ACK of 3whs with tsecr of 0.

Reviewed by:	    jtl, gnn
Sponsored by:	    Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D8552
2016-11-21 20:53:11 +00:00
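The relaxed check from 2806b2933b can be sketched roughly as follows. This is an illustration under assumptions, not the actual syncache_expand() logic: the function name and parameters are hypothetical, and only the "ignore a tsecr of 0" rule is taken from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if the final ACK of the three-way handshake passes the
 * timestamp-echo check. sent_ts is the timestamp we sent earlier;
 * tsecr is the value echoed by the peer.
 */
static bool handshake_tsecr_ok(uint32_t tsecr, uint32_t sent_ts)
{
    /* A zero echo is tolerated: broken middle-boxes produce it. */
    if (tsecr == 0)
        return true;
    /* Otherwise the echo has to match what we sent. */
    return tsecr == sent_ts;
}
```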
Michael Tuexen
35dfb8cb68 Ensure that TCP state changes to state-closing are reported via dtrace.
This does not cover state changes from TIME-WAIT.

Reviewed by:		gnn
MFC after:		3 weeks
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D8443
2016-11-19 14:45:08 +00:00
Michael Tuexen
6779a1a101 Notify the user by setting errno when a TCP RST segment is received
in either the CLOSING or the LAST-ACK state.

Reviewed by:		hiren
MFC after:		3 weeks
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D8371
2016-11-17 08:15:02 +00:00
Andrey V. Elsukov
8432fa5fd9 Initialize ip6 pointer before use.
PR:		214169
MFC after:	1 week
2016-11-06 02:33:04 +00:00
Hiren Panchasara
e04310d59b Set slow start threshold more accurately on loss to be flightsize/2 instead of
cwnd/2 as recommended by RFC5681. (spotted by mmacy at nextbsd dot org)

Restore pre-r307901 behavior of aligning ssthresh/cwnd on mss boundary. (spotted
by slawa at zxy dot spb dot ru)

Tested by:	    dim, Slawa <slawa at zxy dot spb dot ru>
MFC after:	    1 month
X-MFC with:	    r307901
Sponsored by:	    Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D8349
2016-11-01 21:08:37 +00:00
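A sketch of the calculation e04310d59b describes, assuming the RFC 5681 rule ssthresh = max(FlightSize / 2, 2 * SMSS) and assuming that "aligning on mss boundary" means rounding down to a whole number of segments. The function name is hypothetical and the kernel code may differ in detail.

```c
#include <stdint.h>

/*
 * Slow start threshold after a loss, in bytes.
 * Works in whole segments so the result is a multiple of mss,
 * and never goes below two segments. mss must be non-zero.
 */
static uint32_t ssthresh_on_loss(uint32_t flightsize, uint32_t mss)
{
    uint32_t segs = (flightsize / 2) / mss;

    if (segs < 2)
        segs = 2;
    return segs * mss;
}
```

With 10000 bytes in flight and an MSS of 1460, half the flight size is 5000 bytes, which rounds down to 3 segments, or 4380 bytes.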
Julien Charbon
f1ee30ccd6 Remove an extraneous call to soisconnected() in syncache_socket(),
introduced with r261242.  The useful and expected soisconnected()
call is done in tcp_do_segment().

Has been found as part of unrelated PR:212920 investigation.

Slightly improves (~2%) the maximum number of TCP accepts per second.

Tested by:		kevin.bowling_kev009.com, jch
Approved by:		gnn, hiren
MFC after:		1 week
Sponsored by:		Verisign, Inc
Differential Revision:	https://reviews.freebsd.org/D8072
2016-10-26 15:19:18 +00:00
Hiren Panchasara
4e7f755377 The FreeBSD TCP stack used to inform the respective congestion control
module about a loss event but, in some cases, did not use or obey the
recommendations, i.e. the values set by it.

Here is an attempt to solve that confusion by following relevant RFCs/drafts.
The stack only sets congestion window/slow start threshold values when there is
no CC module available to take that action. All CC modules are inspected and
updated when needed to take appropriate action on loss.

The tcp_stacks/fastpath module has been updated to adapt to these changes.

Note: Probably the most significant change is to not bring the congestion
window down to 1 MSS on a loss signaled by 3 duplicate ACKs and to let the
respective CC module decide that value.

In collaboration with:	Matt Macy <mmacy at nextbsd dot org>
Discussed on:		transport@ mailing list
Reviewed by:		jtl
MFC after:		1 month
Sponsored by:		Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D8225
2016-10-25 05:45:47 +00:00
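The division of labour described in 4e7f755377 can be illustrated with a small sketch: the stack defers to the congestion control module's loss handler when one exists and only falls back to its own values otherwise. All names here (conn, cc_module, on_loss, gentle_on_loss) are hypothetical, and the fallback and the 30% reduction are illustrative choices, not the values FreeBSD uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-connection state, in bytes. */
struct conn {
    uint32_t cwnd;
    uint32_t ssthresh;
    uint32_t mss;
    uint32_t flightsize;
};

/* Hypothetical CC module interface; the callback may be NULL. */
struct cc_module {
    void (*on_loss)(struct conn *c);
};

/* Loss signaled by duplicate ACKs: let the CC module decide if it can. */
static void stack_on_loss(struct conn *c, const struct cc_module *cc)
{
    if (cc != NULL && cc->on_loss != NULL) {
        cc->on_loss(c); /* the module's values are used as-is */
        return;
    }
    /* Fallback when no module acts: max(flightsize / 2, 2 * mss). */
    uint32_t half = c->flightsize / 2;
    uint32_t floor = 2 * c->mss;

    c->ssthresh = half > floor ? half : floor;
    c->cwnd = c->ssthresh;
}

/* Example module that backs off by 30% (illustrative number only). */
static void gentle_on_loss(struct conn *c)
{
    c->ssthresh = c->cwnd - c->cwnd * 3 / 10;
    c->cwnd = c->ssthresh;
}
```

The point of the design is that the stack no longer overrides the module afterwards, for example by forcing cwnd down to one MSS.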
Hiren Panchasara
dd13b7d387 Undo r307899. It needs a bit more work and proper commit log.
2016-10-25 05:07:51 +00:00
Hiren Panchasara
95d8236011 In Collaboration with: Matt Macy <mmacy at nextbsd dot com>
Reviewed by:		    jtl
Sponsored by:		    Limelight Networks
Differential Revision:	    https://reviews.freebsd.org/D8225
2016-10-25 05:03:33 +00:00
Ryan Stone
6c1bd55875 Fix ip_output() on point-to-point links
In r304435, ip_output() was changed to use the result of the route
lookup to decide whether the outgoing packet was a broadcast or
not.  This introduced a regression on interfaces where
IFF_BROADCAST was not set (e.g. point-to-point links), as the
algorithm could incorrectly treat the destination address as a
broadcast address, and ip_output() would subsequently drop the
packet as broadcasting on a non-IFF_BROADCAST interface is not
allowed.

Differential Revision:	https://reviews.freebsd.org/D8303
Reviewed by:	jtl
Reported by:	ambrisko
MFC after:	2 weeks
X-MFC-With:	r304435
Sponsored by:	Dell EMC Isilon
2016-10-24 22:11:33 +00:00
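One way to express the guard implied by 6c1bd55875, as a sketch only: a destination is treated as a broadcast address only when the outgoing interface supports broadcast at all. The flag constants and the function are hypothetical stand-ins, and the actual change to ip_output() may be structured differently.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the interface flags named in the commit. */
#define EX_IFF_BROADCAST   0x02u
#define EX_IFF_POINTOPOINT 0x10u

/*
 * Decide whether an outgoing packet is a broadcast.
 * route_says_broadcast is the result derived from the route lookup;
 * it is only trusted on interfaces that have the broadcast flag set.
 */
static bool treat_as_broadcast(unsigned if_flags, bool route_says_broadcast)
{
    if ((if_flags & EX_IFF_BROADCAST) == 0)
        return false; /* e.g. point-to-point links */
    return route_says_broadcast;
}
```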
Michael Tuexen
38d3251c3d No functional changes, mostly getting the whitespace changes resulting
from an updated formatting tool chain.

MFC after: 1 month
2016-10-22 17:21:21 +00:00
Michael Tuexen
3e1465754f Make ICMPv6 hard error handling for TCP consistent with the ICMPv4
handling. Ensure that:
* Protocol unreachable errors are handled by indicating ECONNREFUSED
  to the TCP user for both IPv4 and IPv6. These were ignored for IPv6.
* Communication prohibited errors are handled by indicating ECONNREFUSED
  to the TCP user for both IPv4 and IPv6. These were ignored for IPv6.
* Hop limit exceeded errors are handled by indicating EHOSTUNREACH
  to the TCP user for both IPv4 and IPv6.
  For IPv6 the TCP connection was dropped but errno wasn't set.

Reviewed by: gallatin, rrs
MFC after: 1 month
Sponsored by: Netflix
Differential Revision: 7904
2016-10-21 10:32:57 +00:00
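The mapping listed in 3e1465754f can be written down as a small table. This is a sketch: the enum and function are hypothetical, and only the three error-to-errno pairs come from the commit message.

```c
#include <errno.h>

/* Hypothetical, protocol-neutral names for the hard errors listed above. */
enum hard_error {
    ERR_PROTO_UNREACH,
    ERR_COMM_PROHIBITED,
    ERR_HOP_LIMIT_EXCEEDED,
    ERR_OTHER
};

/* errno reported to the TCP user, identical for IPv4 and IPv6. */
static int hard_error_to_errno(enum hard_error e)
{
    switch (e) {
    case ERR_PROTO_UNREACH:
    case ERR_COMM_PROHIBITED:
        return ECONNREFUSED;
    case ERR_HOP_LIMIT_EXCEEDED:
        return EHOSTUNREACH;
    default:
        return 0; /* not covered by this sketch */
    }
}
```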
Julien Charbon
f5cf1e5f5a Fix a double-free when an inp transitions to INP_TIMEWAIT state
after having been dropped.

This fix enforces in_pcbdrop() logic in tcp_input():

"in_pcbdrop() is used by TCP to mark an inpcb as unused and avoid future packet
delivery or event notification when a socket remains open but TCP has closed."

PR:			203175
Reported by:		Palle Girgensohn, Slawa Olhovchenkov
Tested by:		Slawa Olhovchenkov
Reviewed by:		Slawa Olhovchenkov
Approved by:		gnn, Slawa Olhovchenkov
Differential Revision:	https://reviews.freebsd.org/D8211
MFC after:		1 week
Sponsored by:		Verisign, inc
2016-10-18 07:16:49 +00:00
Hiren Panchasara
784ce8fad2 Make sure tcp_mss() has the same check as tcp_mss_update() to have t_maxseg set
to at least 64.

This is still just a coverup to avoid kernel panic and not an actual fix.

PR:			213232
Reviewed by:		glebius
MFC after:		1 week
Sponsored by:		Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D8272
2016-10-18 02:40:25 +00:00
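The lower bound mentioned in 784ce8fad2 amounts to a simple clamp. A minimal sketch, with a hypothetical function name and constant; only the value 64 comes from the commit message.

```c
#include <stdint.h>

#define EX_MIN_MAXSEG 64u /* floor stated in the commit message */

/* Never let the maximum segment size drop below the floor. */
static uint32_t clamp_maxseg(uint32_t mss)
{
    return mss < EX_MIN_MAXSEG ? EX_MIN_MAXSEG : mss;
}
```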
Patrick Kelsey
09c305eb65 Fix cases where the TFO pending counter would leak references, and eventually, memory.
Also renamed some tfo labels and added/reworked comments for clarity.

Based on an initial patch from jtl.

PR: 213424
Reviewed by:	jtl
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D8235
2016-10-15 01:41:28 +00:00
Jonathan T. Looney
82676a28eb r307082 added the TCP_HHOOK kernel option and made some existing code only
compile when that option is configured. In tcp_destroy(), the error
variable is now only used in code enclosed in an '#ifdef TCP_HHOOK' block.
This broke the build for VNET images.

Enclose the error variable itself in an #ifdef block.

Submitted by:	Shawn Webb <shawn.webb at hardenedbsd.org>
Reported by:	Shawn Webb <shawn.webb at hardenedbsd.org>
PointyHat to:	jtl
2016-10-15 00:29:15 +00:00
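The pattern behind the fix in 82676a28eb, sketched in isolation: when a variable is only used inside an #ifdef block, its declaration goes under the same condition, so builds without the option do not see an unused variable. The function names below are hypothetical, and the reason the original build failed (presumably an unused-variable warning treated as an error) is an assumption.

```c
#include <stdio.h>

/* Uncomment to compile the optional hook path (stand-in for the kernel option). */
/* #define TCP_HHOOK */

#ifdef TCP_HHOOK
/* Hypothetical deregistration routine; always succeeds in this sketch. */
static int hook_deregister(void)
{
    return 0;
}
#endif

/* Returns 1 once teardown has completed. */
static int destroy_example(void)
{
#ifdef TCP_HHOOK
    int error; /* declared only when the code that uses it is compiled */
#endif
    int cleaned = 0;

#ifdef TCP_HHOOK
    error = hook_deregister();
    if (error != 0)
        printf("failed to deregister hook: %d\n", error);
#endif
    cleaned = 1;
    return cleaned;
}
```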
Jonathan T. Looney
6d172f58a2 The code currently resets the keepalive timer each time a packet is
received on a TCP session that has entered the ESTABLISHED state. This
results in a lot of calls to reset the keepalive timer.

This patch changes the behavior so we set the keepalive timer for the
keepalive idle time (TP_KEEPIDLE). When the keepalive timer fires, it will
first check to see if the session has been idle for TP_KEEPIDLE ticks. If
not, it will reschedule the keepalive timer for the time the session will
have been idle for TP_KEEPIDLE ticks.

For a session with regular communication, the keepalive timer should fire
approximately once every TP_KEEPIDLE ticks. For sessions with irregular
communication, the keepalive timer might fire more often. But, the
disruption from a periodic keepalive timer should be less than the regular
cost of resetting the keepalive timer on every packet.

(FWIW, this change saved approximately 1.73% of the busy CPU cycles on a
particular test system with a heavy TCP output load. Of course, the
actual impact is very specific to the particular hardware and workload.)

Reviewed by:	gallatin, rrs
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D8243
2016-10-14 14:57:43 +00:00
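The lazy rescheduling described in 6d172f58a2 can be sketched as a single decision made when the timer fires. This is not the kernel implementation: the struct, field, and function names are hypothetical, and time is modeled as a plain tick counter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical keepalive state, in ticks. */
struct ka_state {
    uint32_t last_rcv; /* tick at which the last packet was received */
    uint32_t keepidle; /* required idle time before probing */
};

/*
 * Called when the keepalive timer fires at tick `now`.
 * Returns true if the session has been idle long enough to probe.
 * Otherwise returns false and stores in *rearm_in the number of ticks
 * until the session will have been idle for keepidle ticks.
 * Packet reception only updates last_rcv; it never touches the timer.
 */
static bool keepalive_fired(const struct ka_state *s, uint32_t now,
    uint32_t *rearm_in)
{
    uint32_t idle = now - s->last_rcv; /* unsigned difference handles wrap */

    if (idle >= s->keepidle) {
        *rearm_in = 0;
        return true;
    }
    *rearm_in = s->keepidle - idle;
    return false;
}
```

The trade-off is the one the commit message names: the timer may fire without sending a probe, but the per-packet cost of resetting it disappears.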
Gleb Smirnoff
cc94f0c2d7 - Revert r300854, r303657 which tried to fix regression from r297225.
- Fix the regression the proper way using RO_RTFREE().

Submitted by:	ae
2016-10-13 20:15:47 +00:00
Gleb Smirnoff
ec7bbf1f79 Fix build without TCP_HHOOK and with INVARIANTS. Before, mutex.h came
via sys/hhook.h -> sys/rmlock.h -> sys/mutex.h.
2016-10-13 18:02:29 +00:00
Michael Tuexen
859422cc12 Mark the socket as un-writable when it is 1-to-1 and the SCTP association
is freed.

MFC after:	1 month
2016-10-13 13:53:01 +00:00
Michael Tuexen
4c7fb0cf6e Whitespace changes.
MFC after: 1 month
2016-10-13 13:38:14 +00:00
Jonathan T. Looney
68bd7ed102 The TFO server-side code contains some changes that are not conditioned on
the TCP_RFC7413 kernel option. This change removes those few instructions
from the packet processing path.

While not strictly necessary, for the sake of consistency, I applied the
new IS_FASTOPEN macro to all places in the packet processing path that
used the (t_flags & TF_FASTOPEN) check.

Reviewed by:	hiren
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D8219
2016-10-12 19:06:50 +00:00
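A plausible shape for the macro mentioned in 68bd7ed102, shown as a sketch: when the option is not configured, the check collapses to a constant so the compiler can drop the branch. The flag value and the helper function are hypothetical, and the real definition in the tree may differ.

```c
#include <stdbool.h>

#define EX_TF_FASTOPEN 0x02000000u /* illustrative flag value */

/* Uncomment to model a kernel built with the option. */
/* #define TCP_RFC7413 */

#ifdef TCP_RFC7413
#define IS_FASTOPEN(t_flags) (((t_flags) & EX_TF_FASTOPEN) != 0)
#else
#define IS_FASTOPEN(t_flags) (false)
#endif

/* Count how many of the n flag words have fast open set. */
static int count_fastopen(const unsigned *flags, int n)
{
    int count = 0;

    for (int i = 0; i < n; i++) {
        if (IS_FASTOPEN(flags[i]))
            count++;
    }
    return count;
}
```

With the option left undefined, as above, count_fastopen() always returns 0 regardless of the flag words passed in.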
Jonathan T. Looney
4527476029 Currently, when tcp_input() receives a packet on a session that matches a
TCPCB, it checks (so->so_options & SO_ACCEPTCONN) to determine whether or
not the socket is a listening socket. However, this causes the code to
access a different cacheline. If we first check if the socket is in the
LISTEN state, we can avoid accessing so->so_options when processing packets
received for ESTABLISHED sessions.

If INVARIANTS is defined, the code still needs to access both variables to
check that so->so_options is consistent with the state.

Reviewed by:	gallatin
MFC after:	1 week
Sponsored by:	Netflix
2016-10-12 02:30:33 +00:00
Jonathan T. Looney
bd79708dbf In the TCP stack, the hhook(9) framework provides hooks for kernel modules
to add actions that run when a TCP frame is sent or received on a TCP
session in the ESTABLISHED state. In the base tree, this functionality is
only used for the h_ertt module, which is used by the cc_cdg, cc_chd, cc_hd,
and cc_vegas congestion control modules.

Presently, we incur overhead to check for hooks each time a TCP frame is
sent or received on an ESTABLISHED TCP session.

This change adds a new compile-time option (TCP_HHOOK) to determine whether
to include the hhook(9) framework for TCP. To retain backwards
compatibility, I added the TCP_HHOOK option to every configuration file that
already defined "options INET". (Therefore, this patch introduces no
functional change. In order to see a functional difference, you need to
compile a custom kernel without the TCP_HHOOK option.) This change will
allow users to easily exclude this functionality from their kernel, should
they wish to do so.

Note that any users who use a custom kernel configuration and use one of the
congestion control modules listed above will need to add the TCP_HHOOK
option to their kernel configuration.

Reviewed by:	rrs, lstewart, hiren (previous version), sjg (makefiles only)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D8185
2016-10-12 02:16:42 +00:00
Mark Johnston
d748f7efcd Lock the ND prefix list and add refcounting for prefixes.
This change extends the nd6 lock to protect the ND prefix list as well
as the list of advertising routers associated with each prefix. To handle
cases where the nd6 lock must be dropped while iterating over either the
prefix or default router lists, a generation counter is used to track
modifications to the lists. Additionally, a new mutex is used to serialize
prefix on-link/off-link transitions. This mutex must be acquired before
the nd6 lock and is held while updating the routing table in
nd6_prefix_onlink() and nd6_prefix_offlink().

Reviewed by:	ae, tuexen (SCTP bits)
Tested by:	Jason Wolfe <jason@llnw.com>,
		Larry Rosenman <ler@lerctr.org>
MFC after:	2 months
Differential Revision:	https://reviews.freebsd.org/D8125
2016-10-07 21:10:53 +00:00
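The generation-counter technique mentioned in d748f7efcd can be illustrated with a single-threaded sketch. Locking is elided; comments mark where the lock would be dropped and reacquired. All types and functions are hypothetical and much simpler than the nd6 code.

```c
#include <stdbool.h>
#include <stddef.h>

struct node {
    int value;
    bool done; /* already processed by the iterator */
    struct node *next;
};

struct guarded_list {
    struct node *head;
    unsigned gen; /* bumped on every modification */
};

static void list_insert(struct guarded_list *l, struct node *n)
{
    n->next = l->head;
    l->head = n;
    l->gen++;
}

/*
 * Process every node once. `work` may modify the list, which stands in
 * for changes made by others while the lock is dropped. If the
 * generation changed, the saved position cannot be trusted, so the
 * walk restarts from the head. Returns the number of restarts.
 */
static int process_all(struct guarded_list *l,
    void (*work)(struct guarded_list *, struct node *))
{
    int restarts = 0;
    struct node *n;

restart:
    for (n = l->head; n != NULL; n = n->next) {
        if (n->done)
            continue;
        unsigned seen = l->gen;
        n->done = true;
        work(l, n); /* lock would be dropped before and retaken after */
        if (l->gen != seen) {
            restarts++;
            goto restart;
        }
    }
    return restarts;
}

/* Example callback: inserts one extra node the first time it runs. */
static struct node extra_node = { 99, false, NULL };
static bool extra_inserted = false;

static void insert_once(struct guarded_list *l, struct node *n)
{
    (void)n;
    if (!extra_inserted) {
        extra_inserted = true;
        list_insert(l, &extra_node);
    }
}
```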
Jonathan T. Looney
3ac125068a Remove "long" variables from the TCP stack (not including the modular
congestion control framework).

Reviewed by:	gnn, lstewart (partial)
Sponsored by:	Juniper Networks, Netflix
Differential Revision:	(multiple)
Tested by:	Limelight, Netflix
2016-10-06 16:28:34 +00:00
Jonathan T. Looney
0dda76b82b If the new window size is less than the old window size, skip the
calculations to check if we should advertise a larger window.

Reviewed by:	gnn
MFC after:	2 weeks
Sponsored by:	Juniper Networks, Netflix
Differential Revision:	https://reviews.freebsd.org/D7076
Tested by:	Limelight, Netflix
2016-10-06 16:09:45 +00:00
Jonathan T. Looney
15c825712e Correctly calculate snd_max in persist case.
In the persist case, take the SYN and FIN flags into account when updating
the sequence space sent.

Reviewed by:	gnn
MFC after:	2 weeks
Sponsored by:	Juniper Networks, Netflix
Differential Revision:	https://reviews.freebsd.org/D7075
Tested by:	Limelight, Netflix
2016-10-06 16:00:48 +00:00
Jonathan T. Looney
55a429a6dc Remove declaration of un-defined function tcp_seq_subtract().
Reviewed by:	gnn
MFC after:	1 week
Sponsored by:	Juniper Networks, Netflix
Differential Revision:	https://reviews.freebsd.org/D7055
2016-10-06 15:57:15 +00:00
Kevin Lo
c2b5ba7661 Remove an alias if_list, use if_link consistently.
Reviewed by:	tuexen
Differential Revision:	https://reviews.freebsd.org/D8075
2016-10-06 00:51:27 +00:00
Eric van Gyzen
2d9db0bc63 Add GARP retransmit capability
A single gratuitous ARP (GARP) is always transmitted when an IPv4
address is added to an interface, and that is usually sufficient.
However, in some circumstances, such as when a shared address is
passed between cluster nodes, this single GARP may occasionally be
dropped or lost.  This can lead to neighbors on the network link
working with a stale ARP cache and sending packets destined for
that address to the node that previously owned the address, which
may not respond.

To avoid this situation, GARP retransmissions can be enabled by setting
the net.link.ether.inet.garp_rexmit_count sysctl to a value greater
than zero.  The setting represents the maximum number of retransmissions.
The interval between retransmissions is calculated using an exponential
backoff algorithm, doubling each time, so the retransmission intervals
are: {1, 2, 4, 8, 16, ...} (seconds).

Due to the exponential backoff algorithm used for the interval
between GARP retransmissions, the maximum number of retransmissions
is limited to 16 for sanity.  This limit corresponds to a maximum
interval between retransmissions of 2^16 seconds ~= 18 hours.
Increasing this limit is possible, but sending out GARPs spaced
days apart would be of little use.

Submitted by:	David A. Bright <david.a.bright@dell.com>
MFC after:	1 month
Relnotes:	yes
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D7695
2016-10-02 01:42:45 +00:00
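The backoff schedule described in 2d9db0bc63 can be sketched as follows. The function and constant names are hypothetical; only the doubling sequence {1, 2, 4, 8, 16, ...} and the cap of 16 retransmissions are taken from the commit message, and the kernel's timer handling is not modeled.

```c
#include <stddef.h>

#define EX_GARP_REXMIT_MAX 16 /* cap stated in the commit message */

/*
 * Fill out[] with the delay in seconds before each GARP retransmission
 * for a configured retransmission count. Returns the number of entries
 * written, which is at most EX_GARP_REXMIT_MAX and at most out_len.
 */
static size_t garp_intervals(int rexmit_count, unsigned long *out,
    size_t out_len)
{
    size_t n = 0;

    if (rexmit_count > EX_GARP_REXMIT_MAX)
        rexmit_count = EX_GARP_REXMIT_MAX;
    for (int i = 0; i < rexmit_count && n < out_len; i++)
        out[n++] = 1UL << i; /* doubles each time: 1, 2, 4, ... */
    return n;
}
```

A count of 0 produces no entries, which corresponds to retransmission being disabled.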
Rick Macklem
00b460ffc5 r297225 broke udp_output() for the case where the "addr" argument
is NULL and the function jumps to the "release:" label.
For this case, the "inp" was write locked, but the code attempted to
read unlock it. This patch fixes the problem.
This case could occur for NFS over UDP mounts, where the server was
down for a few minutes under certain circumstances.

Reported by:	bde
Tested by:	bde
Reviewed by:	gnn
MFC after:	2 weeks
2016-10-01 19:39:09 +00:00
Hiren Panchasara
8a56c64533 This adds a sysctl which allows you to disable the TCP hostcache. This is handy
during testing of network related changes where cached entries may pollute your
results, or during known congestion events where you don't want to unfairly
penalize hosts.

Prior to r232346 this would have meant you would break any connection with a
sub-1500 MTU, as the hostcache was authoritative. All entries as they stand
today should simply be used to pre-populate values for efficiency.

Submitted by:	Jason Wolfe (j at nitrology dot com)
Reviewed by:	rwatson, sbruno, rrs , bz (earlier version)
MFC after:	2 weeks
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D6198
2016-09-30 00:10:57 +00:00