Commit Graph

5268 Commits

Author SHA1 Message Date
gnn
a09f127051 Set the proper direction to check for policies in this one case.
Pointed out by: eri
Sponsored by:	Rubicon Communications (Netgate)
2015-10-29 21:26:32 +00:00
hiren
d3c29dd43f Calculate the correct amount of bytes that are in-flight for a connection as
suggested by RFC 6675.

Currently differnt places in the stack tries to guess this in suboptimal ways.
The main problem is that current calculations don't take sacked bytes into
account. Sacked bytes are the bytes receiver acked via SACK option. This is
suboptimal because it assumes that network has more outstanding (unacked) bytes
than the actual value and thus sends less data by setting congestion window
lower than what's possible which in turn may cause slower recovery from losses.

As an example, one of the current calculations looks something like this:
snd_nxt - snd_fack + sackhint.sack_bytes_rexmit
New proposal from RFC 6675 is:
snd_max - snd_una - sackhint.sacked_bytes + sackhint.sack_bytes_rexmit
which takes sacked bytes into account which is a new addition to the sackhint
struct. Only thing we are missing from RFC 6675 is isLost() i.e. segment being
considered lost and thus adjusting pipe based on that which makes this
calculation a bit on conservative side.

The approach is very simple. We already process each ack with sack info in
tcp_sack_doack() and extract sack blocks/holes out of it. We'd now also track
this new variable sacked_bytes which keeps track of total sacked bytes reported.

One downside to this approach is that we may get incorrect count of sacked_bytes
if the other end decides to drop sack info in the ack because of memory pressure
or some other reasons. But in this (not very likely) case also the pipe
calculation would be conservative which is okay as opposed to being aggressive
in sending packets into the network.

Next step is to use this more accurate pipe estimation to drive congestion
window adjustments.

In collaboration with:	rrs
Reviewed by:		jason_eggnet dot com, rrs
MFC after:		2 weeks
Sponsored by:		Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D3971
2015-10-28 22:57:51 +00:00
hiren
9d3e0245b7 Add sysctl tunable net.inet.tcp.initcwnd_segments to specify initial congestion
window in number of segments on fly. It is set to 10 segments by default.

Remove net.inet.tcp.experimental.initcwnd10 which is now redundant. Also remove
the parent node net.inet.tcp.experimental as it's not needed anymore and also
because it was not well thought out.

Differential Revision:	https://reviews.freebsd.org/D3858
In collaboration with:	lstewart
Reviewed by:		gnn (prev version), rwatson, allanjude, wblock (man page)
MFC after:		2 weeks
Relnotes:		yes
Sponsored by:		Limelight Networks
2015-10-27 09:43:05 +00:00
gnn
99f73cc3ee Turning on IPSEC used to introduce a slight amount of performance
degradation (7%) for host host TCP connections over 10Gbps links,
even when there were no secuirty policies in place. There is no
change in performance on 1Gbps network links. Testing GENERIC vs.
GENERIC-NOIPSEC vs. GENERIC with this change shows that the new
code removes any overhead introduced by having IPSEC always in the
kernel.

Differential Revision:	D3993
MFC after:	1 month
Sponsored by:	Rubicon Communications (Netgate)
2015-10-27 00:42:15 +00:00
tuexen
9e0a6fa539 When processing a cookie, any mismatch in port numbers or the vtag results
in failing the check.
This fixes https://github.com/nplab/ETSI-SCTP-Conformance-Testsuite/blob/master/sctp-imh-tests/sctp-imh-i-3-3.pkt

MFC after: 1 week
2015-10-26 21:19:49 +00:00
tuexen
d8710b5322 Use __func__ instead of __FUNCTION__.
This allows to compile the userland stack without errors using gcc5.
Thanks to saghul for makeing me aware and providing the patch.

MFC after: 1 week
2015-10-19 11:17:54 +00:00
melifaro
b26720e1df Fix deletion of ifaddr lle entries when deleting prefix from interface in
down state.

Regression appeared in r287789, where the "prefix has no corresponding
  installed route" case was forgotten. Additionally, lltable_delete_addr()
  was called with incorrect byte order (default is network for lltable code).
While here, improve comments on given cases and byte order.

PR:		203573
Submitted by:	phk
2015-10-18 12:26:25 +00:00
melifaro
45d403ed23 Remove several compat functions from pre-fib era. 2015-10-17 17:26:44 +00:00
bz
75679462b6 Hopefully also unbreak VIMAGE kernels replacing the &V_... with
&VNET_NAME(...).
Everything else is just a whitespace wrapping change.
2015-10-15 01:44:32 +00:00
bz
857b8f754e Properly define functions withut argument and wrap for { for style purposes
as followed in the rest of the file.  This will hopefully make gcc more happy.
2015-10-14 18:30:04 +00:00
hiren
c9534bc93d Fix an unnecessarily aggressive behavior where mtu clamping begins on first
retransmission timeout (rto) when blackhole detection is enabled.  Make
sure it only happens when the second attempt to send the same segment also fails
with rto.

Also make sure that each mtu probing stage (usually 1448 -> 1188 -> 524) follows
the same pattern and gets 2 chances (rto) before further clamping down.

Note: RFC4821 doesn't specify implementation details on how this situation
should be handled.

Differential Revision:	https://reviews.freebsd.org/D3434
Reviewed by:	sbruno, gnn (previous version)
MFC after:	2 weeks
Sponsored by:	Limelight Networks
2015-10-14 06:57:28 +00:00
hiren
0d12306188 There are times when it would be really nice to have a record of the last few
packets and/or state transitions from each TCP socket. That would help with
narrowing down certain problems we see in the field that are hard to reproduce
without understanding the history of how we got into a certain state. This
change provides just that.

It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is
destroyed, the list is freed. I thought this was likely to be more
performance-friendly than saving copies of the tcpcb. Plus, with the packets,
you should be able to reverse-engineer what happened to the tcpcb.

To enable the feature, you will need to compile a kernel with the TCPPCAP
option. Even then, the feature defaults to being deactivated. You can activate
it by setting a positive value for the number of captured packets. You can do
that on either a global basis or on a per-socket basis (via a setsockopt call).

There is no way to get the packets out of the kernel other than using kmem or
getting a coredump. I thought that would help some of the legal/privacy concerns
regarding such a feature. However, it should be possible to add a future effort
to export them in PCAP format.

I tested this at low scale, and found that there were no mbuf leaks and the peak
mbuf usage appeared to be unchanged with and without the feature.

The main performance concern I can envision is the number of mbufs that would be
used on systems with a large number of sockets. If you save five packets per
direction per socket and have 3,000 sockets, that will consume at least 30,000
mbufs just to keep these packets. I tried to reduce the concerns associated with
this by limiting the number of clusters (not mbufs) that could be used for this
feature. Again, in my testing, that appears to work correctly.

Differential Revision:	D3100
Submitted by:		Jonathan Looney <jlooney at juniper dot net>
Reviewed by:		gnn, hiren
2015-10-14 00:35:37 +00:00
tuexen
b61741da4b Fix the timeout for INIT retransmissions in the case where RTO_MIN is
smaller than RTO_INITIAL.

MFC after:	1 week
2015-10-13 18:27:55 +00:00
glebius
46f350c8ac Fix regression from r287779, that bite me. If we call m_pullup()
unconditionally, we end up with an mbuf chain of two mbufs, which
later in in_arpreply() is rewritten from ARP request to ARP reply
and is sent out. Looks like igb(4) (at least mine, and at least
at my network) fails on such mbuf chain, so ARP reply doesn't go
out wire. Thus, make the m_pullup() call conditional, as it is
everywhere. Of course, the bug in igb(?) should be investigated,
but better first fix the head. And unconditional m_pullup() was
suboptimal, anyway.
2015-10-07 13:10:26 +00:00
hiren
9fffe2524d Add a comment specifying how we implement rfc3042.
Differential Revision:	D3746
MFC after:	    1 week
Sponsored by:	    Limelight Networks
2015-10-06 07:46:19 +00:00
ae
c3f8d46dc4 Take extra reference to security policy before calling crypto_dispatch().
Currently we perform crypto requests for IPSEC synchronous for most of
crypto providers (software, aesni) and only VIA padlock calls crypto
callback asynchronous. In synchronous mode it is possible, that security
policy will be removed during the processing crypto request. And crypto
callback will release the last reference to SP. Then upon return into
ipsec[46]_process_packet() IPSECREQUEST_UNLOCK() will be called to already
freed request. To prevent this we will take extra reference to SP.

PR:		201876
Sponsored by:	Yandex LLC
2015-09-30 08:16:33 +00:00
glebius
45adeac7f3 When processing ICMP need frag message, ignore the suggested MTU unless it
is smaller than the current one for this connection. This is behavior
specified by RFC 1191, and this is how original BSD stack behaved, but this
was unintentionally regressed in r182851.

Reported & tested by:	Richard Russo <russor whatsapp.com>
Differential Revision:	D3567
Sponsored by:		Nginx, Inc.
2015-09-30 03:37:37 +00:00
melifaro
91b3356875 Eliminate nd6_nud_hint() and its TCP bindings.
Initially function was introduced in r53541 (KAME initial commit) to
  "provide hints from upper layer protocols that indicate a connection
  is making "forward progress"" (quote from RFC 2461 7.3.1 Reachability
  Confirmation).
However, it was converted to do nothing (e.g. just return) in r122922
  (tcp_hostcache implementation) back in 2003. Some defines were moved
  to tcp_var.h in r169541. Then, it was broken (for non-corner cases)
  by r186119 (L2<>L3 split) in 2008 (NULL ifp in nd6_lookup). So,
  right now this code is broken and has no "real" base users.

Differential Revision:	https://reviews.freebsd.org/D3699
2015-09-27 05:29:34 +00:00
melifaro
4fed811000 rtsock requests for deleting interface address lles started to return EPERM
instead of old "ignore-and-return 0" in r287789. This broke arp -da /
  ndp -cn behavior (they exit on rtsock command failure). Fix this by
  translating LLE_IFADDR to RTM_PINNED flag, passing it to userland and
  making arp/ndp ignore these entries in batched delete.

MFC after:	2 weeks
2015-09-27 04:54:29 +00:00
melifaro
7e64f01f85 Replace toe_nd6_resolve() with nd6_resolve().
Reviewed by:	np
2015-09-22 19:05:44 +00:00
melifaro
536c20752f Unify nd6 state switching by using newly-created nd6_llinfo_setstate()
function. The change is mostly mechanical with the following exception:
Last piece of nd6_resolve_slow() was refactored: ND6_LLINFO_PERMANENT
  condition was removed as always-true, explicit ND6_LLINFO_NOSTATE ->
  ND6_LLINFO_INCOMPLETE state transition was removed as duplicate.

Reviewed by:	ae
Sponsored by:	Yandex LLC
2015-09-21 11:19:53 +00:00
glebius
8c2720775c Use proper byteswap macro. This isn't a functional change. 2015-09-17 17:27:49 +00:00
glebius
c2b26bf37d In tcp_ctlinput() separate the (ip == NULL) block from the rest of the
function to reduce so many levels of indentation.  Style the lines that
got now indentation reduced.  No functional change.

Checked with:	md5
2015-09-16 21:42:33 +00:00
melifaro
44be0cdaa8 Unify loopback route switching:
* prepare gateway before insertion
* use RTM_CHANGE instead of explicit find/change route
* Remove fib argument from ifa_switch_loopback_route added in r264887:
  if old ifp fib differes from new one, that the caller
  is doing something wrong
* Make ifa_*_loopback_route call single ifa_maintain_loopback_route().
2015-09-16 06:23:15 +00:00
brd
32f63fe59a Remove redundant 'man page'
Reviewed by:	allanjude
2015-09-15 21:16:45 +00:00
hiren
85848b2b16 Remove unnecessary tcp state transition call.
Differential Revision:	D3451
Reviewed by:		markj
MFC after:		2 weeks
Sponsored by:		Limelight Networks
2015-09-15 20:04:30 +00:00
melifaro
d0c2460548 * Improve logging invalid arp messages
* Remove redundant check in ip_arpinput

Suggested by:	glebius
MFC after:	2 weeks
2015-09-15 08:50:44 +00:00
melifaro
b42c7af3b5 * Require explicitl lle unlink prior to calling llentry_delete().
This one slightly decreases time of holding afdata wlock.
* While here, make nd6_free() return void. No one has used its return value
  since r186119.
2015-09-15 06:48:19 +00:00
melifaro
5ad1f2444d * Do more fine-grained locking: call eventhandlers/free_entry
without holding afdata wlock
* convert per-af delete_address callback to global lltable_delete_entry() and
  more low-level "delete this lle" per-af callback
* fix some bugs/inconsistencies in IPv4/IPv6 ifscrub procedures

Sponsored by:		Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D3573
2015-09-14 16:48:19 +00:00
melifaro
e64a8234e3 * Improve error checking for arp messages.
* Clean stale headers from if_ether.c.

Reported by:	rozhuk.im at gmail.com
Reviewed by:	ae
MFC after:	2 weeks
2015-09-14 10:28:47 +00:00
hselasky
ac0c211c77 Update TSO limits to include all headers.
To make driver programming easier the TSO limits are changed to
reflect the values used in the BUSDMA tag a network adapter driver is
using. The TCP/IP network stack will subtract space for all linklevel
and protocol level headers and ensure that the full mbuf chain passed
to the network adapter fits within the given limits.

Implementation notes:

If a network adapter driver needs to fixup the first mbuf in order to
support VLAN tag insertion, the size of the VLAN tag should be
subtracted from the TSO limit. Else not.

Network adapters which typically inline the complete header mbuf could
technically transmit one more segment. This patch does not implement a
mechanism to recover the last segment for data transmission. It is
believed when sufficiently large mbuf clusters are used, the segment
limit will not be reached and recovering the last segment will not
have any effect.

The current TSO algorithm tries to send MTU-sized packets, where the
MTU typically is 1500 bytes, which gives 1448 bytes of TCP data
payload per packet for IPv4. That means if the TSO length limitiation
is set to 65536 bytes, there will be a data payload remainder of
(65536 - 1500) mod 1448 bytes which is equal to 324 bytes. Trying to
recover total TSO length due to inlining mbuf header data will not
have any effect, because adding or removing the ETH/IP/TCP headers
to or from 324 bytes will not cause more or less TCP payload to be
TSO'ed.

Existing network adapter limits will be updated separately.

Differential Revision:	https://reviews.freebsd.org/D3458
Reviewed by:		rmacklem
MFC after:		2 weeks
2015-09-14 08:36:22 +00:00
gnn
e39dbc6166 dd DTrace probe points, translators and a corresponding script
to provide the TCPDEBUG functionality with pure DTrace.

Reviewed by:	rwatson
MFC after:	2 weeks
Sponsored by:	Limelight Networks
Differential Revision:	D3530
2015-09-13 15:50:55 +00:00
tuexen
63946e657e Fix compilation issue introduced in r287717.
Thanks to bz@ for making me aware of it.

MFC after:	1 week
2015-09-12 21:23:24 +00:00
tuexen
779fa4b9f9 Address a compile warning.
MFC after:	1 week
2015-09-12 18:00:06 +00:00
tuexen
5429764526 Cleanup the handling of error causes for ERROR chunks. This fixes
an inconsistency of the padding handling. The final padding is
now considered to be a chunk padding.

MFC after:	1 week
2015-09-12 17:08:51 +00:00
tuexen
1947f60716 Ensure that ERROR chunks are always padded by implementing this
in the routine, which queues an ERROR chunk, instead on relyinh
on the callers to do so. Since one caller missed this, this actially
fixes a bug.

MFC after:	1 week
2015-09-11 13:54:33 +00:00
tuexen
8a1adc38eb RFC 4960 requires that packets containing an INIT chunk bundled with
another chunk are silently discarded. Do so, instead of sending an
ABORT.

MFC after:	1 week
2015-09-07 14:00:38 +00:00
allanjude
1b61937107 missed file that should have been included in r287528
PR:		184110
Submitted by:	Marie Helene Kvello-Aune <marieheleneka@gmail.com>
Approved by:	wblock (mentor)
2015-09-07 02:00:05 +00:00
adrian
d8aa14f523 Replace rss_m2cpuid with rss_soft_m2cpuid_v4 for ip_direct_nh.nh_m2cpuid,
because the RSS hash may need to be recalculated.

Submitted by:	Tiwei Bie <btw@mail.ustc.edu.cn>
Differential Revision:	https://reviews.freebsd.org/D3564
2015-09-06 20:20:48 +00:00
melifaro
e31cc5ffc6 Do not pass lle to nd6_ns_output(). Use newly-added
nd6_llinfo_get_holdsrc() to extract desired IPv6 source
  from holdchain and pass it to the nd6_ns_output().
2015-09-05 14:14:03 +00:00
glebius
790dc6f94a Use Jenkins hash for TCP syncache.
o Unlike xor, in Jenkins hash every bit of input affects virtually
  every bit of output, thus salting the hash actually works. With
  xor salting only provides a false sense of security, since if
  hash(x) collides with hash(y), then of course, hash(x) ^ salt
  would also collide with hash(y) ^ salt. [1]
o Jenkins provides much better distribution than xor, very close to
  ideal.

TCP connection setup/teardown benchmark has shown a 10% increase
with default hash size, and with bigger hashes that still provide
possibility for collisions. With enormous hash size, when dataset is
by an order of magnitude smaller than hash size, the benchmark has
shown 4% decrease in performance decrease, which is expected and
acceptable.

Noticed by:	Jeffrey Knockel <jeffk cs.unm.edu> [1]
Benchmarks by:	jch
Reviewed by:	jch, pkelsey, delphij
Security:	strengthens protection against hash collision DoS
Sponsored by:	Nginx, Inc.
2015-09-05 10:15:19 +00:00
glebius
f3c8a935a4 Make tcp_mtudisc() static and void. No functional changes.
Sponsored by:	Nginx, Inc.
2015-09-04 12:02:12 +00:00
tuexen
760f2d0d08 Don't leak memory in an error case.
MFC after:	1 week
2015-09-04 09:24:07 +00:00
tuexen
a66dd7d374 Add a NULL pointer check to silence the clang code analyzer.
MFC after:	1 week
2015-09-04 09:22:16 +00:00
tuexen
056def9261 Fix a bug where two SHUTDOWN_ACK chunks were sent if a SHUTDOWN chunk was
received acking all outstanding data.
2015-09-03 22:15:56 +00:00
jch
c4ab4117ed Put r284245 back in place: If at first this fix was seen as a temporary
workaround for a callout(9) issue, it turns out it is instead the right
way to use callout in mpsafe mode without using callout_drain().

r284245 commit message:

Fix a callout race condition introduced in TCP timers callouts with r281599.
In TCP timer context, it is not enough to check callout_stop() return value
to decide if a callout is still running or not, previous callout_reset()
return values have also to be checked.

Differential Revision:	https://reviews.freebsd.org/D2763
2015-08-30 13:44:39 +00:00
tuexen
646fffa685 Use 5 times RTO.Max as the default for the shutdown guard timer
as required by RFC 4960. The sysctl variable can be used to
overwrite this.

Discussed with:	rrs
MFC after:	1 week
2015-08-29 17:26:29 +00:00
tuexen
b2ac8e86d2 Fix the exporting of SCTP association states to userland. Without this,
associations in SHUTDOWN-PENDING were never reported correctly.

MFC after:	3 weeks
2015-08-29 09:14:32 +00:00
adrian
4672b4c15b Rename rss_soft_m2cpuid() -> rss_soft_m2cpuid_v4() in preparation for
an IPv6 version to show up.

Submitted by:	Tiwei Bie <btw@mail.ustc.edu.cn>
Differential Revision:	https://reviews.freebsd.org/D3504
2015-08-29 06:58:30 +00:00
adrian
2d6b12b499 Replace the printf()s with optional rate limited debugging for RSS.
Submitted by:	Tiwei Bie <btw@mail.ustc.edu.cn>
Differential Revision:	https://reviews.freebsd.org/D3471
2015-08-28 05:58:16 +00:00