Commit Graph

5405 Commits

Author SHA1 Message Date
Hiren Panchasara
f81bc34eac Add an option to use rfc6675 based pipe/inflight bytes calculation in cubic.
Reviewed by:		gnn
MFC after:		3 weeks
Sponsored by:		Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D4205
2015-12-09 07:56:40 +00:00
Hiren Panchasara
021eaf7996 One of the ways to detect loss is to count duplicate acks coming back from the
other end till it reaches predetermined threshold which is 3 for us right now.
Once that happens, we trigger fast-retransmit to do loss recovery.

Main problem with the current implementation is that we don't honor SACK
information well to detect whether an incoming ack is a dupack or not. RFC6675
has latest recommendations for that. According to it, dupack is a segment that
arrives carrying a SACK block that identifies previously unknown information
between snd_una and snd_max even if it carries new data, changes the advertised
window, or moves the cumulative acknowledgment point.

With the prevalence of Selective ACK (SACK) these days, improper handling can
lead to delayed loss recovery.

With the fix, new behavior looks like following:

0) th_ack < snd_una --> ignore
Old acks are ignored.
1) th_ack == snd_una, !sack_changed --> ignore
Acks with SACK enabled but without any new SACK info in them are ignored.
2) th_ack == snd_una, window == old_window --> increment
Increment on a good dupack.
3) th_ack == snd_una, window != old_window, sack_changed --> increment
When SACK enabled, it's okay to have advertized window changed if the ack has
new SACK info.
4) th_ack > snd_una --> reset to 0
Reset to 0 when left edge moves.
5) th_ack > snd_una, sack_changed --> increment
Increment if left edge moves but there is new SACK info.

Here, sack_changed is the indicator that incoming ack has previously unknown
SACK info in it.

Note: This fix is not fully compliant to RFC6675. That may require a few
changes to current implementation in order to keep per-sackhole dupack counter
and change to the way we mark/handle sack holes.

PR:			203663
Reviewed by:		jtl
MFC after:		3 weeks
Sponsored by:		Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D4225
2015-12-08 21:21:48 +00:00
Alexander V. Chernikov
65ff3638df Merge helper fib* functions used for basic lookups.
Vast majority of rtalloc(9) users require only basic info from
route table (e.g. "does the rtentry interface match with the interface
  I have?". "what is the MTU?", "Give me the IPv4 source address to use",
  etc..).
Instead of hand-rolling lookups, checking if rtentry is up, valid,
  dealing with IPv6 mtu, finding "address" ifp (almost never done right),
  provide easy-to-use API hiding all the complexity and returning the
  needed info into small on-stack structure.

This change also helps hiding route subsystem internals (locking, direct
  rtentry accesses).
Additionaly, using this API improves lookup performance since rtentry is not
  locked.
(This is safe, since all the rtentry changes happens under both radix WLOCK
  and rtentry WLOCK).

Sponsored by:	Yandex LLC
2015-12-08 10:50:03 +00:00
Michael Tuexen
c979034b18 Fix the allocation of outgoing streams:
* When processing a cookie, use the number of
  streams announced in the INIT-ACK.
* When sending an INIT-ACK for an existing
  association, use the value from the association,
  not from the end-point.

MFC after:	1 week
2015-12-06 16:17:57 +00:00
Alexander V. Chernikov
f8aee88f0b Remove LLE read lock from IPv4 fast path.
LLE structure is mostly unchanged during its lifecycle.
To be more specific, there are 2 things relevant for fast path
  lookup code:
1) link-level address change. Since r286722, these updates are performed
  under AFDATA WLOCK.
2) Some sort of feedback indicating that this particular entry is used so
  we re-send arp request to perform reachability verification instead of
  expiring entry. The only signal that is needed from fast path is something
  like binary yes/no.

The latter is solved by the following changes:
1) introduce special r_skip_req field which is read lockless by fast path,
  but updated under (new) req_mutex mutex. If this field is non-zero, then
  fast path will acquire lock and set it back to 0.
2) introduce simple state machine: incomplete->reachable<->verify->deleted.
  Before that we implicitely had incomplete->reachable->deleted state machine,
  with V_arpt_keep between "reachable" and "deleted". Verification was performed
  in runtime 5 seconds before V_arpt_keep expire.
  This is changed to "change state to verify 5 seconds before V_arpt_keep,
  set r_skip_req to non-zero value and check it every second". If the value
  is zero - then send arp verification probe.
These changes do not introduce any signifficant control plane overhead:
  typically lle callout timer would fire 1 time more each V_arpt_keep (1200s)
  for used lles and up to arp_maxtries (5) for dead lles.

As a result, all packets towards "reachable" lle are handled by fast path without
acquiring lle read lock.

Additional "req_mutex" is needed because callout / arpresolve_slow() or eventhandler
  might keep LLE lock for signifficant amount of time, which might not be feasible
  for fast path locking (e.g. having rmlock as ether AFDATA or lltable own lock).

Differential Revision:	https://reviews.freebsd.org/D3688
2015-12-05 09:50:37 +00:00
Michael Tuexen
a4889f2dd0 Fix a bug where a stream reset request wasn't retranmitted when the
peer indicated "In progress".

MFC after:	1 week
2015-12-04 08:49:27 +00:00
Michael Tuexen
d96bef9c77 Ensure that outgoing streams get reset when they run dry.
MFC after:	1 week
2015-12-03 15:19:29 +00:00
Michael Tuexen
4821b41e21 Minor cleanup. No functional change.
MFC after:	1 week
2015-12-02 22:44:42 +00:00
Michael Tuexen
60862d8e48 Adjust the MTU when accepting an SCTP association using
UDP encapsulation.

MFC after:	1 week
2015-12-02 16:29:36 +00:00
Andrey V. Elsukov
b4e63e2d15 In the same way fix the problem described in r291578 for IGMPv3.
In case when router has a lot of multicast groups, the reply can take
several packets due to MTU limitation.
Also we have a limit IGMP_MAX_RESPONSE_BURST == 4, that limits the number
of packets we send in one shot. Then we recalculate the timer value and
schedule the remaining packets for sending.
The problem is that when we call igmp_v3_dispatch_general_query() to send
remaining packets, we queue new reply in the same mbuf queue. And when
number of packets is bigger than IGMP_MAX_RESPONSE_BURST, we get endless
reply of IGMPv3 reports.
To fix this, add the check for remaining packets in the queue.

MFC after:	1 week
Sponsored by:	Yandex LLC
2015-12-01 11:24:30 +00:00
Alexander V. Chernikov
c00c4e46e3 Remove in_setifarnh definition. 2015-11-30 06:02:35 +00:00
Alexander V. Chernikov
e8b0643eee Add new rt_foreach_fib_walk_del() function for deleting route entries
by filter function instead of picking into routing table details in
  each consumer.
Remove now-unused rt_expunge() (eliminating last external RTF_RNH_LOCKED
 user).
This simplifies future nexthops/mulitipath changes and rtrequest1_fib()
  locking refactoring.

Actual changes:
Add "rt_chain" field to permit rte grouping while doing batched delete
  from routing table (thus growing rte 200->208 on amd64).
Add "rti_filter" /  "rti_filterdata" / "rti_spare" fields to rt_addrinfo
  to pass filter function to various routing subsystems in standard way.
Convert all rt_expunge() customers to new rt_addinfo-based api and eliminate
  rt_expunge().
2015-11-30 05:51:14 +00:00
Michael Tuexen
c6d2bd4812 Take also the send queue and sent queue into account when triggering
the sending of outgoing stream reset requests.

MFC after:	3 days
2015-11-27 22:11:46 +00:00
Michael Tuexen
f0067f2251 When the sending of an SCTP outgoing stream reset request fails,
don't report it to the user since all stream have been marked
as pending.

MFC after:	1 week
2015-11-26 23:12:41 +00:00
Michael Tuexen
52f175be70 When receiving an SCTP/UDP packet and the interface performed
the UDP checksum computation and signals that it was OK,
clear this bit when passing the packet to SCTP. Since the
bits indicating a valid UDP checksum and a valid SCTP
checksum are the same, the SCTP stack would assume
that also an SCTP checksum check has been performed.

MFC after: 1 week
2015-11-26 09:25:20 +00:00
Fabien Thomas
edd0e0b098 The r241129 description was wrong that the scenario is possible
only for read locks on pcbs. The same race can happen with write
lock semantics as well.

The race scenario:

- Two threads (1 and 2) locate pcb with writer semantics (INPLOOKUP_WLOCKPCB)
 and do in_pcbref() on it.
- 1 and 2 both drop the inp hash lock.
- Another thread (3) grabs the inp hash lock. Then it runs in_pcbfree(),
 which wlocks the pcb. They must happen faster than 1 or 2 come INP_WLOCK()!
- 1 and 2 congest in INP_WLOCK().
- 3 does in_pcbremlists(), drops hash lock, and runs in_pcbrele_wlocked(),
 which doesn't free the pcb due to two references on it.
 Then it unlocks the pcb.
- 1 (or 2) gets wlock on the pcb, runs in_pcbrele_wlocked(), which doesn't
 report inp as freed, due to 2 (or 1) still helding extra reference on it.
 The thread tries to do smth with a disconnected pcb and crashes.

Submitted by:	emeric.poupon@stormshield.eu
Reviewed by:	gleb@
MFC after:	1 week
Sponsored by: Stormshield
Tested by: Cassiano Peixoto, Stormshield
2015-11-25 14:45:43 +00:00
Andrey V. Elsukov
ef91a9765d Overhaul if_enc(4) and make it loadable in run-time.
Use hhook(9) framework to achieve ability of loading and unloading
if_enc(4) kernel module. INET and INET6 code on initialization registers
two helper hooks points in the kernel. if_enc(4) module uses these helper
hook points and registers its hooks. IPSEC code uses these hhook points
to call helper hooks implemented in if_enc(4).
2015-11-25 07:31:59 +00:00
Michael Tuexen
3bf2363dca Fix the handling of IPSec policies in the SCTP stack. At least
make sure they are not leaked...

MFC after: 	1 week
2015-11-21 18:21:16 +00:00
Michael Tuexen
e5d23883bf Revert part of r291137 which seems correct, bit does not fix the
resource problem I'm currently hunting down.

MFC after:	1 week
X-MFC with:	291137
2015-11-21 16:46:59 +00:00
Michael Tuexen
722281c070 Clear the so_pcb pointer in case of ipsec_init_policy() fails.
MFC after:	1 week
2015-11-21 16:32:14 +00:00
Michael Tuexen
8ca16419bb Don't send SHUTDOWN chunk when the association is in a front state
and the applications calls shutdown(..., SHUT_WR) or
shutdown(..., SHUT_RDWR).

MFC after:	1 week.
2015-11-21 16:25:09 +00:00
Michael Tuexen
bd5b567213 Fix a bug where an SCTP association was moved back to SHUTDOWN_SENT
state when the user issued a shutdown() call.

MFC after: 	1 week
2015-11-19 16:46:00 +00:00
Conrad Meyer
3fbd30d495 in_getmulti: Fix recursion on if_addr_lock on malloc failure
When the M_NOWAIT allocation fails, we recurse the if_addr_lock trying
to clean up.  Reorder the cleanup after dropping the if_addr_lock.  The
obvious race is already possible between if_addmulti and IF_ADDR_WLOCK
above, so it must be ok.

Submitted by:	Ryan Libby <rlibby@gmail.com>
Reviewed by:	jhb
Found with:	M_NOWAIT failure injection testing
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D4138
2015-11-18 23:53:13 +00:00
Steven Hartland
9925ac11da Revert r290403
CARP rework invalidated this change.
2015-11-13 23:14:39 +00:00
Randall Stewart
7c4676ddee This fixes several places where callout_stops return is examined. The
new return codes of -1 were mistakenly being considered "true". Callout_stop
now returns -1 to indicate the callout had either already completed or
was not running and 0 to indicate it could not be stopped.  Also update
the manual page to make it more consistent no non-zero in the callout_stop
or callout_reset descriptions.

MFC after:	1 Month with associated callout change.
2015-11-13 22:51:35 +00:00
Alexander V. Chernikov
1c302b58da Decompose arp_ifinit() into arp_add_ifa_lle() and arp_announce_ifaddr().
Rename arp_ifinit2() into arp_announce_ifaddr().

Eliminate zeroing ifa_rtrequest: it was used for calling arp_rtrequest()
which was responsible for handling route cloning requests. It became
obsolete since r186119 (L2/L3 split).
2015-11-09 10:35:33 +00:00
Alexander V. Chernikov
b13c5b5db2 Use lladdr_event to propagate gratiotus arp.
Differential Revision:	https://reviews.freebsd.org/D4019
2015-11-09 10:11:14 +00:00
Alexander V. Chernikov
ddd208f7ad Unify setting lladdr for AF_INET[6]. 2015-11-07 11:12:00 +00:00
Michael Tuexen
0bfc52bea5 Use the correct length. The wrong one was too large.
MFC after: 3 days
2015-11-06 22:08:05 +00:00
Michael Tuexen
179f731bb0 The field sinfo_timetolive should have been sinfo_pr_value.
Thanks to Jens Hoelscher for making me aware of the bug.

MFC after: 1 week
2015-11-06 14:00:26 +00:00
Michael Tuexen
b70b526d17 Fix typos in field names of struct sctp_extrcvinfo.
Provide defines to allow applications to compile.
Thanks to Jens Hoelscher for making me aware of the typos.

MFC after: 1 week
2015-11-06 13:08:16 +00:00
Steven Hartland
ac19560a34 Add MTU support to carp interfaces
MFC after:	2 weeks
Sponsored by:	Multiplay
2015-11-05 17:23:02 +00:00
George V. Neville-Neil
33872124a5 Replace the fastforward path with tryforward which does not require a
sysctl and will always be on. The former split between default and
fast forwarding is removed by this commit while preserving the ability
to use all network stack features.

Differential Revision:	https://reviews.freebsd.org/D4042
Reviewed by:	ae, melifaro, olivier, rwatson
MFC after:	1 month
Sponsored by:	Rubicon Communications (Netgate)
2015-11-05 07:26:32 +00:00
Hiren Panchasara
054d38e38c Improve the sysctl node name.
X-MFC with:	r290122
Sponsored by:	Limelight Networks
2015-11-05 02:09:48 +00:00
Andrey V. Elsukov
5dc5a0e0aa Implement ipfw internal olist command to list named objects.
Reviewed by:	melifaro
Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2015-11-03 10:21:53 +00:00
George V. Neville-Neil
02b90dbf45 Set the proper direction to check for policies in this one case.
Pointed out by: eri
Sponsored by:	Rubicon Communications (Netgate)
2015-10-29 21:26:32 +00:00
Hiren Panchasara
12eeb81fc1 Calculate the correct amount of bytes that are in-flight for a connection as
suggested by RFC 6675.

Currently differnt places in the stack tries to guess this in suboptimal ways.
The main problem is that current calculations don't take sacked bytes into
account. Sacked bytes are the bytes receiver acked via SACK option. This is
suboptimal because it assumes that network has more outstanding (unacked) bytes
than the actual value and thus sends less data by setting congestion window
lower than what's possible which in turn may cause slower recovery from losses.

As an example, one of the current calculations looks something like this:
snd_nxt - snd_fack + sackhint.sack_bytes_rexmit
New proposal from RFC 6675 is:
snd_max - snd_una - sackhint.sacked_bytes + sackhint.sack_bytes_rexmit
which takes sacked bytes into account which is a new addition to the sackhint
struct. Only thing we are missing from RFC 6675 is isLost() i.e. segment being
considered lost and thus adjusting pipe based on that which makes this
calculation a bit on conservative side.

The approach is very simple. We already process each ack with sack info in
tcp_sack_doack() and extract sack blocks/holes out of it. We'd now also track
this new variable sacked_bytes which keeps track of total sacked bytes reported.

One downside to this approach is that we may get incorrect count of sacked_bytes
if the other end decides to drop sack info in the ack because of memory pressure
or some other reasons. But in this (not very likely) case also the pipe
calculation would be conservative which is okay as opposed to being aggressive
in sending packets into the network.

Next step is to use this more accurate pipe estimation to drive congestion
window adjustments.

In collaboration with:	rrs
Reviewed by:		jason_eggnet dot com, rrs
MFC after:		2 weeks
Sponsored by:		Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D3971
2015-10-28 22:57:51 +00:00
Hiren Panchasara
356c7958a4 Add sysctl tunable net.inet.tcp.initcwnd_segments to specify initial congestion
window in number of segments on fly. It is set to 10 segments by default.

Remove net.inet.tcp.experimental.initcwnd10 which is now redundant. Also remove
the parent node net.inet.tcp.experimental as it's not needed anymore and also
because it was not well thought out.

Differential Revision:	https://reviews.freebsd.org/D3858
In collaboration with:	lstewart
Reviewed by:		gnn (prev version), rwatson, allanjude, wblock (man page)
MFC after:		2 weeks
Relnotes:		yes
Sponsored by:		Limelight Networks
2015-10-27 09:43:05 +00:00
George V. Neville-Neil
26882b4239 Turning on IPSEC used to introduce a slight amount of performance
degradation (7%) for host host TCP connections over 10Gbps links,
even when there were no secuirty policies in place. There is no
change in performance on 1Gbps network links. Testing GENERIC vs.
GENERIC-NOIPSEC vs. GENERIC with this change shows that the new
code removes any overhead introduced by having IPSEC always in the
kernel.

Differential Revision:	D3993
MFC after:	1 month
Sponsored by:	Rubicon Communications (Netgate)
2015-10-27 00:42:15 +00:00
Michael Tuexen
3db4ea954e When processing a cookie, any mismatch in port numbers or the vtag results
in failing the check.
This fixes https://github.com/nplab/ETSI-SCTP-Conformance-Testsuite/blob/master/sctp-imh-tests/sctp-imh-i-3-3.pkt

MFC after: 1 week
2015-10-26 21:19:49 +00:00
Michael Tuexen
6e9c45e0ee Use __func__ instead of __FUNCTION__.
This allows to compile the userland stack without errors using gcc5.
Thanks to saghul for makeing me aware and providing the patch.

MFC after: 1 week
2015-10-19 11:17:54 +00:00
Alexander V. Chernikov
26a6057525 Fix deletion of ifaddr lle entries when deleting prefix from interface in
down state.

Regression appeared in r287789, where the "prefix has no corresponding
  installed route" case was forgotten. Additionally, lltable_delete_addr()
  was called with incorrect byte order (default is network for lltable code).
While here, improve comments on given cases and byte order.

PR:		203573
Submitted by:	phk
2015-10-18 12:26:25 +00:00
Alexander V. Chernikov
f221bcaa06 Remove several compat functions from pre-fib era. 2015-10-17 17:26:44 +00:00
Bjoern A. Zeeb
962d02b00b Hopefully also unbreak VIMAGE kernels replacing the &V_... with
&VNET_NAME(...).
Everything else is just a whitespace wrapping change.
2015-10-15 01:44:32 +00:00
Bjoern A. Zeeb
f87ec781ef Properly define functions withut argument and wrap for { for style purposes
as followed in the rest of the file.  This will hopefully make gcc more happy.
2015-10-14 18:30:04 +00:00
Hiren Panchasara
adf43a9279 Fix an unnecessarily aggressive behavior where mtu clamping begins on first
retransmission timeout (rto) when blackhole detection is enabled.  Make
sure it only happens when the second attempt to send the same segment also fails
with rto.

Also make sure that each mtu probing stage (usually 1448 -> 1188 -> 524) follows
the same pattern and gets 2 chances (rto) before further clamping down.

Note: RFC4821 doesn't specify implementation details on how this situation
should be handled.

Differential Revision:	https://reviews.freebsd.org/D3434
Reviewed by:	sbruno, gnn (previous version)
MFC after:	2 weeks
Sponsored by:	Limelight Networks
2015-10-14 06:57:28 +00:00
Hiren Panchasara
86a996e6bd There are times when it would be really nice to have a record of the last few
packets and/or state transitions from each TCP socket. That would help with
narrowing down certain problems we see in the field that are hard to reproduce
without understanding the history of how we got into a certain state. This
change provides just that.

It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is
destroyed, the list is freed. I thought this was likely to be more
performance-friendly than saving copies of the tcpcb. Plus, with the packets,
you should be able to reverse-engineer what happened to the tcpcb.

To enable the feature, you will need to compile a kernel with the TCPPCAP
option. Even then, the feature defaults to being deactivated. You can activate
it by setting a positive value for the number of captured packets. You can do
that on either a global basis or on a per-socket basis (via a setsockopt call).

There is no way to get the packets out of the kernel other than using kmem or
getting a coredump. I thought that would help some of the legal/privacy concerns
regarding such a feature. However, it should be possible to add a future effort
to export them in PCAP format.

I tested this at low scale, and found that there were no mbuf leaks and the peak
mbuf usage appeared to be unchanged with and without the feature.

The main performance concern I can envision is the number of mbufs that would be
used on systems with a large number of sockets. If you save five packets per
direction per socket and have 3,000 sockets, that will consume at least 30,000
mbufs just to keep these packets. I tried to reduce the concerns associated with
this by limiting the number of clusters (not mbufs) that could be used for this
feature. Again, in my testing, that appears to work correctly.

Differential Revision:	D3100
Submitted by:		Jonathan Looney <jlooney at juniper dot net>
Reviewed by:		gnn, hiren
2015-10-14 00:35:37 +00:00
Michael Tuexen
9372530827 Fix the timeout for INIT retransmissions in the case where RTO_MIN is
smaller than RTO_INITIAL.

MFC after:	1 week
2015-10-13 18:27:55 +00:00
Gleb Smirnoff
89bc042679 Fix regression from r287779, that bite me. If we call m_pullup()
unconditionally, we end up with an mbuf chain of two mbufs, which
later in in_arpreply() is rewritten from ARP request to ARP reply
and is sent out. Looks like igb(4) (at least mine, and at least
at my network) fails on such mbuf chain, so ARP reply doesn't go
out wire. Thus, make the m_pullup() call conditional, as it is
everywhere. Of course, the bug in igb(?) should be investigated,
but better first fix the head. And unconditional m_pullup() was
suboptimal, anyway.
2015-10-07 13:10:26 +00:00
Hiren Panchasara
62d4443f00 Add a comment specifying how we implement rfc3042.
Differential Revision:	D3746
MFC after:	    1 week
Sponsored by:	    Limelight Networks
2015-10-06 07:46:19 +00:00
Andrey V. Elsukov
f367798498 Take extra reference to security policy before calling crypto_dispatch().
Currently we perform crypto requests for IPSEC synchronous for most of
crypto providers (software, aesni) and only VIA padlock calls crypto
callback asynchronous. In synchronous mode it is possible, that security
policy will be removed during the processing crypto request. And crypto
callback will release the last reference to SP. Then upon return into
ipsec[46]_process_packet() IPSECREQUEST_UNLOCK() will be called to already
freed request. To prevent this we will take extra reference to SP.

PR:		201876
Sponsored by:	Yandex LLC
2015-09-30 08:16:33 +00:00
Gleb Smirnoff
794ac42374 When processing ICMP need frag message, ignore the suggested MTU unless it
is smaller than the current one for this connection. This is behavior
specified by RFC 1191, and this is how original BSD stack behaved, but this
was unintentionally regressed in r182851.

Reported & tested by:	Richard Russo <russor whatsapp.com>
Differential Revision:	D3567
Sponsored by:		Nginx, Inc.
2015-09-30 03:37:37 +00:00
Alexander V. Chernikov
1558cb2448 Eliminate nd6_nud_hint() and its TCP bindings.
Initially function was introduced in r53541 (KAME initial commit) to
  "provide hints from upper layer protocols that indicate a connection
  is making "forward progress"" (quote from RFC 2461 7.3.1 Reachability
  Confirmation).
However, it was converted to do nothing (e.g. just return) in r122922
  (tcp_hostcache implementation) back in 2003. Some defines were moved
  to tcp_var.h in r169541. Then, it was broken (for non-corner cases)
  by r186119 (L2<>L3 split) in 2008 (NULL ifp in nd6_lookup). So,
  right now this code is broken and has no "real" base users.

Differential Revision:	https://reviews.freebsd.org/D3699
2015-09-27 05:29:34 +00:00
Alexander V. Chernikov
4a336ef40c rtsock requests for deleting interface address lles started to return EPERM
instead of old "ignore-and-return 0" in r287789. This broke arp -da /
  ndp -cn behavior (they exit on rtsock command failure). Fix this by
  translating LLE_IFADDR to RTM_PINNED flag, passing it to userland and
  making arp/ndp ignore these entries in batched delete.

MFC after:	2 weeks
2015-09-27 04:54:29 +00:00
Alexander V. Chernikov
8e5aadb617 Replace toe_nd6_resolve() with nd6_resolve().
Reviewed by:	np
2015-09-22 19:05:44 +00:00
Alexander V. Chernikov
aa5f023eaf Unify nd6 state switching by using newly-created nd6_llinfo_setstate()
function. The change is mostly mechanical with the following exception:
Last piece of nd6_resolve_slow() was refactored: ND6_LLINFO_PERMANENT
  condition was removed as always-true, explicit ND6_LLINFO_NOSTATE ->
  ND6_LLINFO_INCOMPLETE state transition was removed as duplicate.

Reviewed by:	ae
Sponsored by:	Yandex LLC
2015-09-21 11:19:53 +00:00
Gleb Smirnoff
399fbd0ec0 Use proper byteswap macro. This isn't a functional change. 2015-09-17 17:27:49 +00:00
Gleb Smirnoff
db642c8e6e In tcp_ctlinput() separate the (ip == NULL) block from the rest of the
function to reduce so many levels of indentation.  Style the lines that
got now indentation reduced.  No functional change.

Checked with:	md5
2015-09-16 21:42:33 +00:00
Alexander V. Chernikov
59c180c35c Unify loopback route switching:
* prepare gateway before insertion
* use RTM_CHANGE instead of explicit find/change route
* Remove fib argument from ifa_switch_loopback_route added in r264887:
  if old ifp fib differes from new one, that the caller
  is doing something wrong
* Make ifa_*_loopback_route call single ifa_maintain_loopback_route().
2015-09-16 06:23:15 +00:00
Brad Davis
e5fe11011a Remove redundant 'man page'
Reviewed by:	allanjude
2015-09-15 21:16:45 +00:00
Hiren Panchasara
550e9d4235 Remove unnecessary tcp state transition call.
Differential Revision:	D3451
Reviewed by:		markj
MFC after:		2 weeks
Sponsored by:		Limelight Networks
2015-09-15 20:04:30 +00:00
Alexander V. Chernikov
eec33ea052 * Improve logging invalid arp messages
* Remove redundant check in ip_arpinput

Suggested by:	glebius
MFC after:	2 weeks
2015-09-15 08:50:44 +00:00
Alexander V. Chernikov
d3cdb71655 * Require explicitl lle unlink prior to calling llentry_delete().
This one slightly decreases time of holding afdata wlock.
* While here, make nd6_free() return void. No one has used its return value
  since r186119.
2015-09-15 06:48:19 +00:00
Alexander V. Chernikov
3e7a2321e3 * Do more fine-grained locking: call eventhandlers/free_entry
without holding afdata wlock
* convert per-af delete_address callback to global lltable_delete_entry() and
  more low-level "delete this lle" per-af callback
* fix some bugs/inconsistencies in IPv4/IPv6 ifscrub procedures

Sponsored by:		Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D3573
2015-09-14 16:48:19 +00:00
Alexander V. Chernikov
deb6bda6e3 * Improve error checking for arp messages.
* Clean stale headers from if_ether.c.

Reported by:	rozhuk.im at gmail.com
Reviewed by:	ae
MFC after:	2 weeks
2015-09-14 10:28:47 +00:00
Hans Petter Selasky
d76d40126e Update TSO limits to include all headers.
To make driver programming easier the TSO limits are changed to
reflect the values used in the BUSDMA tag a network adapter driver is
using. The TCP/IP network stack will subtract space for all linklevel
and protocol level headers and ensure that the full mbuf chain passed
to the network adapter fits within the given limits.

Implementation notes:

If a network adapter driver needs to fixup the first mbuf in order to
support VLAN tag insertion, the size of the VLAN tag should be
subtracted from the TSO limit. Else not.

Network adapters which typically inline the complete header mbuf could
technically transmit one more segment. This patch does not implement a
mechanism to recover the last segment for data transmission. It is
believed when sufficiently large mbuf clusters are used, the segment
limit will not be reached and recovering the last segment will not
have any effect.

The current TSO algorithm tries to send MTU-sized packets, where the
MTU typically is 1500 bytes, which gives 1448 bytes of TCP data
payload per packet for IPv4. That means if the TSO length limitiation
is set to 65536 bytes, there will be a data payload remainder of
(65536 - 1500) mod 1448 bytes which is equal to 324 bytes. Trying to
recover total TSO length due to inlining mbuf header data will not
have any effect, because adding or removing the ETH/IP/TCP headers
to or from 324 bytes will not cause more or less TCP payload to be
TSO'ed.

Existing network adapter limits will be updated separately.

Differential Revision:	https://reviews.freebsd.org/D3458
Reviewed by:		rmacklem
MFC after:		2 weeks
2015-09-14 08:36:22 +00:00
George V. Neville-Neil
5d06879adb dd DTrace probe points, translators and a corresponding script
to provide the TCPDEBUG functionality with pure DTrace.

Reviewed by:	rwatson
MFC after:	2 weeks
Sponsored by:	Limelight Networks
Differential Revision:	D3530
2015-09-13 15:50:55 +00:00
Michael Tuexen
30811e70d9 Fix compilation issue introduced in r287717.
Thanks to bz@ for making me aware of it.

MFC after:	1 week
2015-09-12 21:23:24 +00:00
Michael Tuexen
6802b0904f Address a compile warning.
MFC after:	1 week
2015-09-12 18:00:06 +00:00
Michael Tuexen
86eda749af Cleanup the handling of error causes for ERROR chunks. This fixes
an inconsistency of the padding handling. The final padding is
now considered to be a chunk padding.

MFC after:	1 week
2015-09-12 17:08:51 +00:00
Michael Tuexen
e629b9fc56 Ensure that ERROR chunks are always padded by implementing this
in the routine, which queues an ERROR chunk, instead on relyinh
on the callers to do so. Since one caller missed this, this actially
fixes a bug.

MFC after:	1 week
2015-09-11 13:54:33 +00:00
Michael Tuexen
0941640f34 RFC 4960 requires that packets containing an INIT chunk bundled with
another chunk are silently discarded. Do so, instead of sending an
ABORT.

MFC after:	1 week
2015-09-07 14:00:38 +00:00
Allan Jude
32d321fa4a missed file that should have been included in r287528
PR:		184110
Submitted by:	Marie Helene Kvello-Aune <marieheleneka@gmail.com>
Approved by:	wblock (mentor)
2015-09-07 02:00:05 +00:00
Adrian Chadd
499baf0aa7 Replace rss_m2cpuid with rss_soft_m2cpuid_v4 for ip_direct_nh.nh_m2cpuid,
because the RSS hash may need to be recalculated.

Submitted by:	Tiwei Bie <btw@mail.ustc.edu.cn>
Differential Revision:	https://reviews.freebsd.org/D3564
2015-09-06 20:20:48 +00:00
Alexander V. Chernikov
26deb8826c Do not pass lle to nd6_ns_output(). Use newly-added
nd6_llinfo_get_holdsrc() to extract desired IPv6 source
  from holdchain and pass it to the nd6_ns_output().
2015-09-05 14:14:03 +00:00
Gleb Smirnoff
388909a12a Use Jenkins hash for TCP syncache.
o Unlike xor, in Jenkins hash every bit of input affects virtually
  every bit of output, thus salting the hash actually works. With
  xor salting only provides a false sense of security, since if
  hash(x) collides with hash(y), then of course, hash(x) ^ salt
  would also collide with hash(y) ^ salt. [1]
o Jenkins provides much better distribution than xor, very close to
  ideal.

TCP connection setup/teardown benchmark has shown a 10% increase
with default hash size, and with bigger hashes that still provide
possibility for collisions. With enormous hash size, when dataset is
by an order of magnitude smaller than hash size, the benchmark has
shown 4% decrease in performance decrease, which is expected and
acceptable.

Noticed by:	Jeffrey Knockel <jeffk cs.unm.edu> [1]
Benchmarks by:	jch
Reviewed by:	jch, pkelsey, delphij
Security:	strengthens protection against hash collision DoS
Sponsored by:	Nginx, Inc.
2015-09-05 10:15:19 +00:00
Gleb Smirnoff
24067db8ca Make tcp_mtudisc() static and void. No functional changes.
Sponsored by:	Nginx, Inc.
2015-09-04 12:02:12 +00:00
Michael Tuexen
6fb9db98b3 Don't leak memory in an error case.
MFC after:	1 week
2015-09-04 09:24:07 +00:00
Michael Tuexen
59713bbf27 Add a NULL pointer check to silence the clang code analyzer.
MFC after:	1 week
2015-09-04 09:22:16 +00:00
Michael Tuexen
aa1cfca969 Fix a bug where two SHUTDOWN_ACK chunks were sent if a SHUTDOWN chunk was
received acking all outstanding data.
2015-09-03 22:15:56 +00:00
Julien Charbon
d6de19ac2f Put r284245 back in place: If at first this fix was seen as a temporary
workaround for a callout(9) issue, it turns out it is instead the right
way to use callout in mpsafe mode without using callout_drain().

r284245 commit message:

Fix a callout race condition introduced in TCP timers callouts with r281599.
In TCP timer context, it is not enough to check callout_stop() return value
to decide if a callout is still running or not, previous callout_reset()
return values have also to be checked.

Differential Revision:	https://reviews.freebsd.org/D2763
2015-08-30 13:44:39 +00:00
Michael Tuexen
2e2d67945a Use 5 times RTO.Max as the default for the shutdown guard timer
as required by RFC 4960. The sysctl variable can be used to
overwrite this.

Discussed with:	rrs
MFC after:	1 week
2015-08-29 17:26:29 +00:00
Michael Tuexen
e92c2a8d6a Fix the exporting of SCTP association states to userland. Without this,
associations in SHUTDOWN-PENDING were never reported correctly.

MFC after:	3 weeks
2015-08-29 09:14:32 +00:00
Adrian Chadd
2527ccad2d Rename rss_soft_m2cpuid() -> rss_soft_m2cpuid_v4() in preparation for
an IPv6 version to show up.

Submitted by:	Tiwei Bie <btw@mail.ustc.edu.cn>
Differential Revision:	https://reviews.freebsd.org/D3504
2015-08-29 06:58:30 +00:00
Adrian Chadd
e5562eb934 Replace the printf()s with optional rate limited debugging for RSS.
Submitted by:	Tiwei Bie <btw@mail.ustc.edu.cn>
Differential Revision:	https://reviews.freebsd.org/D3471
2015-08-28 05:58:16 +00:00
Bjoern A. Zeeb
a86e5c96af get_inpcbinfo() and get_pcblist() are UDP local functions and
do not do what one would expect by name. Prefix them with "udp_"
to at least obviously limit the scope.

This is a non-functional change.

Reviewed by:		gnn, rwatson
MFC after:		2 weeks
Differential Revision:	https://reviews.freebsd.org/D3505
2015-08-27 15:27:41 +00:00
Julien Charbon
bcf9b91395 Revert r284245: "Fix a callout race condition introduced in TCP
timers callouts with r281599."

r281599 fixed a TCP timer race condition, but due a callout(9) bug
it also introduced another race condition workaround-ed with r284245.
The callout(9) bug being fixed with r286880, we can now revert the
workaround (r284245).

Differential Revision:	https://reviews.freebsd.org/D2079 (Initial change)
Differential Revision:	https://reviews.freebsd.org/D2763 (Workaround)
Differential Revision:	https://reviews.freebsd.org/D3078 (Fix)
Sponsored by:		Verisign, Inc.
MFC after:		2 weeks
2015-08-24 09:30:27 +00:00
Alexander V. Chernikov
5a2555160f * Split allocation and table linking for lle's.
Before that, the logic besides lle_create() was the following:
  return existing if found, create if not. This behaviour was error-prone
  since we had to deal with 'sudden' static<>dynamic lle changes.
  This commit fixes bunch of different issues like:
  - refcount leak when lle is converted to static.
    Simple check case:
    console 1:
    while true;
      do for i in `arp -an|awk '$4~/incomp/{print$2}'|tr -d '()'`;
        do arp -s $i 00:22:44:66:88:00 ; arp -d $i;
      done;
    done
   console 2:
    ping -f any-dead-host-in-L2
   console 3:
    # watch for memory consumption:
    vmstat -m | awk '$1~/lltable/{print$2}'
  - possible problems in arptimer() / nd6_timer() when dropping/reacquiring
   lock.
  New logic explicitly handles use-or-create cases in every lla_create
  user. Basically, most of the changes are purely mechanical. However,
  we explicitly avoid using existing lle's for interface/static LLE records.
* While here, call lle_event handlers on all real table lle change.
* Create lltable_free_entry() calling existing per-lltable
  lle_free_t callback for entry deletion
2015-08-20 12:05:17 +00:00
Alexander V. Chernikov
a4141c63c5 Check value return from lle_create() for NULL.
This bug sneaked unnoticed in r286722.

Reported by:	adrian
2015-08-19 21:08:42 +00:00
Julien Charbon
31a7749d4b Make clear that TIME_WAIT timeout expiration is managed solely by
tcp_tw_2msl_scan().

Sponsored by:	Verisign, Inc.
2015-08-18 08:27:26 +00:00
Alexander V. Chernikov
0c4210f984 Fix panic when handling non-inet arp message introduced in r286825.
Submitted by:	delphij
2015-08-18 06:16:19 +00:00
Alexander V. Chernikov
512e30ef9f Split arpresolve() into fast/slow path.
This change isolates the most common case (e.g. successful lookup)
  from more complicates scenarios. It also (tries to) make code
  more simple by avoiding retry: cycle.

The actual goal is to prepare code to the upcoming change that will
  allow LL address retrieval without acquiring LLE lock at all.

Reviewed by:		ae
Differential Revision:	https://reviews.freebsd.org/D3383
2015-08-16 12:23:58 +00:00
Michael Tuexen
faadc1b492 Allow the path MTU to grow up to the outgoing interface MTU.
MFC after: 3 days
2015-08-14 14:26:13 +00:00
Alexander V. Chernikov
f3bfa7d1cf Move lle update code from from gigantic ip_arpinput() to
separate bunch of functions. The goal is to isolate actual lle
updates to permit more fine-grained locking.

Do all lle link-level update under AFDATA wlock.

Sponsored by:	Yandex LLC
2015-08-13 13:38:09 +00:00
Hiren Panchasara
ad389a8c3b Remove unused TCPTV_SRTTDFLT. We initialize srtt with TCPTV_SRTTBASE when we
don't have any rtt estimate.

Differential Revision:	D3334
Sponsored by:		Limelight Networks
2015-08-12 16:08:37 +00:00
Alexander V. Chernikov
0447c1367a Use single 'lle_timer' callout in lltable instead of
two different names of the same timer.
2015-08-11 12:38:54 +00:00
Alexander V. Chernikov
314294de5c Store addresses instead of sockaddrs inside llentry.
This permits us having all (not fully true yet) all the info
needed in lookup process in first 64 bytes of 'struct llentry'.

struct llentry layout:
BEFORE:
[rwlock .. state .. state .. MAC ] (lle+1) [sockaddr_in[6]]
AFTER
[ in[6]_addr MAC .. state .. rwlock ]

Currently, address part of struct llentry has only 16 bytes for the key.
However, lltable does not restrict any custom lltable consumers with long
keys use the previous approach (store key at (lle+1)).

Sponsored by:	Yandex LLC
2015-08-11 09:26:11 +00:00
Alexander V. Chernikov
41cb42a633 MFP r276712.
* Split lltable_init() into lltable_allocate_htbl() (alloc
  hash table with default callbacks) and lltable_link() (
  links any lltable to the list).
* Switch from LLTBL_HASHTBL_SIZE to per-lltable hash size field.
* Move lltable setup to separate functions in in[6]_domifattach.
2015-08-11 05:51:00 +00:00
Alexander V. Chernikov
2caee4be35 Rename rt_foreach_fib() to rt_foreach_fib_walk().
Suggested by:	julian
2015-08-10 20:50:31 +00:00
Alexander V. Chernikov
11cdad9873 Partially merge r274887,r275334,r275577,r275578,r275586 to minimize
differences between projects/routing and HEAD.

This commit tries to keep code logic the same while changing underlying
code to use unified callbacks.

* Add llt_foreach_entry method to traverse all entries in given llt
* Add llt_dump_entry method to export particular lle entry in sysctl/rtsock
  format (code is not indented properly to minimize diff). Will be fixed
  in the next commits.
* Add llt_link_entry/llt_unlink_entry methods to link/unlink particular lle.
* Add llt_fill_sa_entry method to export address in the lle to sockaddr
  format.
* Add llt_hash method to use in generic hash table support code.
* Add llt_free_entry method which is used in llt_prefix_free code.

* Prepare for fine-grained locking by separating lle unlink and deletion in
  lltable_free() and lltable_prefix_free().

* Provide lltable_get<ifp|af>() functions to reduce direct 'struct lltable'
 access by external callers.

* Remove @llt agrument from lle_free() lle callback since it was unused.
* Temporarily add L3_CADDR() macro for 'const' sockaddr typecasting.
* Switch to per-af hashing code.
* Rename LLE_FREE_LOCKED() callback from in[6]_lltable_free() to
  in_[6]lltable_destroy() to avoid clashing with llt_free_entry() method.
  Update description from these functions.
* Use unified lltable_free_entry() function instead of per-af one.

Reviewed by:	ae
2015-08-10 12:03:59 +00:00