Commit Graph

5585 Commits

Author SHA1 Message Date
Sepherosa Ziehau
05cde7efa6 tcp/lro: Implement hash table for LRO entries.
This significantly improves HTTP workload performance and reduces
HTTP workload latency.

Reviewed by:	rrs, gallatin, hps
Obtained from:	rrs, gallatin
Sponsored by:	Netflix (rrs, gallatin) , Microsoft (sephe)
Differential Revision:	https://reviews.freebsd.org/D6689
2016-08-02 06:36:47 +00:00
Andrew Gallatin
d4c22202e6 Rework IPV6 TCP path MTU discovery to match IPv4
- Re-write tcp_ctlinput6() to closely mimic the IPv4 tcp_ctlinput()

- Now that tcp_ctlinput6() updates t_maxseg, we can allow ip6_output()
  to send TCP packets without looking at the tcp host cache for every
  single transmit.

- Make the icmp6 code mimic the IPv4 code & avoid returning
  PRC_HOSTDEAD because it is so expensive.

Without these changes in place, every TCP6 pmtu discovery or host
unreachable ICMP resulted in a call to in6_pcbnotify() which walks the
tcbinfo table with the write lock held.  Because the tcbinfo table is
shared between IPv4 and IPv6, this causes huge scalabilty issues on
servers with lots of (~100K) TCP connections, to the point where even
a small percent of IPv6 traffic had a disproportionate impact on
overall throughput.

Reviewed by:	bz, rrs, ae (all earlier versions), lstewart (in Netflix's tree)
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D7272
2016-08-01 17:02:21 +00:00
Andrew Gallatin
0e3b891988 Call tcp_notify() directly to shoot down routes, rather than
calling in_pcbnotifyall().

This avoids lock contention on tcbinfo due to in_pcbnotifyall()
holding the tcbinfo write lock while walking all connections.

Reviewed by:	rrs, karels
MFC after:	2 weeks
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D7251
2016-07-28 19:32:25 +00:00
Stephen J. Kiernan
f11ec79842 Remove BSD and USL copyright and update license block in in_prot.c, as the
code in this file was written by Robert N. M. Waston.

Move cr_can* prototypes from sys/systm.h to sys/proc.h

Reported by:	rwatson
Reviewed by:	rwatson
Approved by:	sjg (mentor)
Differential Revision:	https://reviews.freebsd.org/D7345
2016-07-28 18:39:30 +00:00
Stephen J. Kiernan
4ac21b4f09 Prepare for network stack as a module
- Move cr_canseeinpcb to sys/netinet/in_prot.c in order to separate the
   INET and INET6-specific code from the rest of the prot code (It is only
   used by the network stack, so it makes sense for it to live with the
   other network stack code.)
 - Move cr_canseeinpcb prototype from sys/systm.h to netinet/in_systm.h
 - Rename cr_seeotheruids to cr_canseeotheruids and cr_seeothergids to
   cr_canseeothergids, make them non-static, and add prototypes (so they
   can be seen/called by in_prot.c functions.)
 - Remove sw_csum variable from ip6_forward in ip6_forward.c, as it is an
   unused variable.

Reviewed by:	gnn, jtl
Approved by:	sjg (mentor)
Sponsored by:	Juniper Networks, Inc.
Differential Revision:	https://reviews.freebsd.org/D2901
2016-07-27 20:34:09 +00:00
Brad Davis
439e76ec22 Fix the case for some sysctl descriptions.
Reviewed by:	gnn
2016-07-26 20:20:09 +00:00
Mike Karels
ea17754c5a Fix per-connection L2 caching in fast path
r301217 re-added per-connection L2 caching from a previous change,
but it omitted caching in the fast path.  Add it.

Reviewed By: gallatin
Approved by: gnn (mentor)
Differential Revision: https://reviews.freebsd.org/D7239
2016-07-22 02:11:49 +00:00
Michael Tuexen
0ee5c319f2 Fix a bug in deferred stream reset processing which results
in using a length field before it is set.

Thanks to Taylor Brandstetter for reporting the issue and
providing a fix.

MFC after:	3 days
2016-07-20 06:29:26 +00:00
Michael Tuexen
0fa7377a03 Use correct order of conditions to avoid NULL deref.
MFC after:	3 days
X-MFC with:	r302935
2016-07-19 11:16:44 +00:00
Michael Tuexen
0ba4c8abe0 netstat and sockstat expect the IPv6 link local addresses to
have an embedded scope. So don't recover.

MFC after:	3 days
2016-07-19 09:48:08 +00:00
Andrey V. Elsukov
ed22e564b8 Add named dynamic states support to ipfw(4).
The keep-state, limit and check-state now will have additional argument
flowname. This flowname will be assigned to dynamic rule by keep-state
or limit opcode. And then can be matched by check-state opcode or
O_PROBE_STATE internal opcode. To reduce possible breakage and to maximize
compatibility with old rulesets default flowname introduced.
It will be assigned to the rules when user has omitted state name in
keep-state and check-state opcodes. Also if name is ambiguous (can be
evaluated as rule opcode) it will be replaced to default.

Reviewed by:	julian
Obtained from:	Yandex LLC
MFC after:	1 month
Relnotes:	yes
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D6674
2016-07-19 04:56:59 +00:00
Andrey V. Elsukov
b867e84e95 Add ipfw_nptv6 module that implements Network Prefix Translation for IPv6
as defined in RFC 6296. The module works together with ipfw(4) and
implemented as its external action module. When it is loaded, it registers
as eaction and can be used in rules. The usage pattern is similar to
ipfw_nat(4). All matched by rule traffic goes to the NPT module.

Reviewed by:	hrs
Obtained from:	Yandex LLC
MFC after:	1 month
Relnotes:	yes
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D6420
2016-07-18 19:46:31 +00:00
Michael Tuexen
487e2e6507 Add a constant required by RFC 7496.
MFC after:	3 days
2016-07-17 13:33:35 +00:00
Michael Tuexen
8e1b295f09 Fix the PR-SCTP behaviour.
This is done by rrs@.

MFC after:	3 days
2016-07-17 13:14:51 +00:00
Michael Tuexen
26f0250adf Add missing sctps_reasmusrmsgs counter.
Joint work with rrs@.
MFC after:	3 days
2016-07-17 08:31:21 +00:00
Michael Tuexen
643fd575da Deal with a portential memory allocation failure, which was reported
by the clang static code analyzer.
Joint work with rrs@.

MFC after:	3 days
2016-07-16 12:25:37 +00:00
Michael Tuexen
b4bf72213a Don't free a data chunk twice.
Found by the clang static code analyzer running for the userland stack.

MFC after:	3 days
2016-07-16 08:11:43 +00:00
Michael Tuexen
56d2f7d8e5 Address a potential memory leak found a the clang static code analyzer
running on the userland stack.

MFC after:	3 days
2016-07-16 07:48:01 +00:00
Jonathan T. Looney
24b9bb5614 The TCPPCAP debugging feature caches recently-used mbufs for use in
debugging TCP connections. This commit provides a mechanism to free those
mbufs when the system is under memory pressure.

Because this will result in lost debugging information, the behavior is
controllable by a sysctl. The default setting is to free the mbufs.

Reviewed by:	gnn
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D6931
Input from:	novice_techie.com
2016-07-06 16:17:13 +00:00
Nathan Whitehorn
96c85efb4b Replace a number of conflations of mp_ncpus and mp_maxid with either
mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in
the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try,
for example, to run tasks on CPUs that did not exist or to allocate too
few buffers on systems with sparse CPU IDs in which there are holes in the
range and mp_maxid > mp_ncpus. Such circumstances generally occur on
systems with SMT, but on which SMT is disabled. This patch restores system
operation at least on POWER8 systems configured in this way.

There are a number of other places in the kernel with potential problems
in these situations, but where sparse CPU IDs are not currently known
to occur, mostly in the ARM machine-dependent code. These will be fixed
in a follow-up commit after the stable/11 branch.

PR:		kern/210106
Reviewed by:	jhb
Approved by:	re (glebius)
2016-07-06 14:09:49 +00:00
Michael Tuexen
e75f31c1d0 This patch fixes two bugs related to the setting of the I-Bit
for SCTP DATA and I-DATA chunks.
* For fragmented user messages, set the I-Bit only on the last
  fragment.
* When using explicit EOR mode, set the I-Bit on the last
  fragment, whenever SCTP_SACK_IMMEDIATELY was set in snd_flags
  for any of the send() calls.

Approved by:	re (hrs)
MFC after:	1 week
2016-06-30 06:06:35 +00:00
Michael Tuexen
ab3373140d This patch fixes two bugs related to the SCTP message recovery
for messages which have been put on the send queue:
* Do not report any DATA or I-DATA chunk padding.
* Correctly deal with the I-DATA chunk header instead of the DATA
  chunk header when the I-DATA extension is used.

Approved by:	re (kib)
MFC after:	1 week
2016-06-26 16:38:42 +00:00
Michael Tuexen
d1b52c6a01 This patch fixes a locking bug when a send() call blocks
on an SCTP socket and the association is aborted by the
peer.

Approved by:	re (kib)
MFC after:	1 week
2016-06-26 12:41:02 +00:00
Bjoern A. Zeeb
42644eb88e Try to avoid a 2nd conditional by re-writing the loop, pause, and
escape clause another time.

Submitted by:	jhb
Approved by:	re (gjb)
MFC after:	12 days
2016-06-23 21:32:52 +00:00
Navdeep Parhar
f22bfc72f8 Add spares to struct ifnet and socket for packet pacing and/or general
use.  Update comments regarding the spare fields in struct inpcb.

Bump __FreeBSD_version for the changes to the size of the structures.

Reviewed by:	gnn@
Approved by:	re@ (gjb@)
Sponsored by:	Chelsio Communications
2016-06-23 21:07:15 +00:00
Michael Tuexen
9de217ce56 Fix a bug in the handling of non-blocking SCTP 1-to-1 sockets. When using
this code in the userland stack, it could result in a loop. This happened on iOS.
However, I was not able to reproduce this when using the code in the kernel.
Thanks to Eugen-Andrei Gavriloaie for reporting the issue and proving detailed
information to find the root of the problem.

Approved by:	re (gjb)
MFC after:	1 week
2016-06-23 19:27:29 +00:00
Bjoern A. Zeeb
361279a4fb In VNET TCP teardown Do not sleep unconditionally but only if we
have any TCP connections left.

Submitted by:		zec
Approved by:		re (hrs)
MFC after:		13 days
2016-06-23 11:55:15 +00:00
Michael Tuexen
55b8cd93ef Don't consider the socket when processing an incoming ICMP/ICMP6 packet,
which was triggered by an SCTP packet. Whether a socket exists, is just
not relevant.

Approved by: re (kib)
MFC after: 1 week
2016-06-23 09:13:15 +00:00
Bjoern A. Zeeb
b54e08e11a Check the V_tcbinfo.ipi_count to hit 0 before doing the full TCP cleanup.
That way timers can finish cleanly and we do not gamble with a DELAY().

Reviewed by:		gnn, jtl
Approved by:		re (gjb)
Obtained from:		projects/vnet
MFC after:		2 weeks
Sponsored by:		The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D6923
2016-06-23 00:34:03 +00:00
Bjoern A. Zeeb
252a46f324 No longer mark TCP TW zone NO_FREE.
Timewait code does a proper cleanup after itself.

Reviewed by:		gnn
Approved by:		re (gjb)
Obtained from:		projects/vnet
MFC after:		2 weeks
Sponsored by:		The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D6922
2016-06-23 00:32:58 +00:00
Bjoern A. Zeeb
89856f7e2d Get closer to a VIMAGE network stack teardown from top to bottom rather
than removing the network interfaces first. This change is rather larger
and convoluted as the ordering requirements cannot be separated.

Move the pfil(9) framework to SI_SUB_PROTO_PFIL, move Firewalls and
related modules to their own SI_SUB_PROTO_FIREWALL.
Move initialization of "physical" interfaces to SI_SUB_DRIVERS,
move virtual (cloned) interfaces to SI_SUB_PSEUDO.
Move Multicast to SI_SUB_PROTO_MC.

Re-work parts of multicast initialisation and teardown, not taking the
huge amount of memory into account if used as a module yet.

For interface teardown we try to do as many of them as we can on
SI_SUB_INIT_IF, but for some this makes no sense, e.g., when tunnelling
over a higher layer protocol such as IP. In that case the interface
has to go along (or before) the higher layer protocol is shutdown.

Kernel hhooks need to go last on teardown as they may be used at various
higher layers and we cannot remove them before we cleaned up the higher
layers.

For interface teardown there are multiple paths:
(a) a cloned interface is destroyed (inside a VIMAGE or in the base system),
(b) any interface is moved from a virtual network stack to a different
network stack ("vmove"), or (c) a virtual network stack is being shut down.
All code paths go through if_detach_internal() where we, depending on the
vmove flag or the vnet state, make a decision on how much to shut down;
in case we are destroying a VNET the individual protocol layers will
cleanup their own parts thus we cannot do so again for each interface as
we end up with, e.g., double-frees, destroying locks twice or acquiring
already destroyed locks.
When calling into protocol cleanups we equally have to tell them
whether they need to detach upper layer protocols ("ulp") or not
(e.g., in6_ifdetach()).

Provide or enahnce helper functions to do proper cleanup at a protocol
rather than at an interface level.

Approved by:		re (hrs)
Obtained from:		projects/vnet
Reviewed by:		gnn, jhb
Sponsored by:		The FreeBSD Foundation
MFC after:		2 weeks
Differential Revision:	https://reviews.freebsd.org/D6747
2016-06-21 13:48:49 +00:00
Andrey V. Elsukov
4c10540274 Cleanup unneded include "opt_ipfw.h".
It was used for conditional build IPFIREWALL_FORWARD support.
But IPFIREWALL_FORWARD option was removed a long time ago.
2016-06-09 05:48:34 +00:00
Michael Tuexen
63d5b56815 Use a separate MID counter for ordered und unordered messages for each
outgoing stream.

Thanks to Jens Hoelscher for reporting the issue.

MFC after: 1 week
2016-06-08 17:57:42 +00:00
Sepherosa Ziehau
36ad8372d4 net: Use M_HASHTYPE_OPAQUE_HASH if the mbuf flowid has hash properties
Reviewed by:	hps, erj, tuexen
Sponsored by:	Microsoft OSTC
Differential Revision:	https://reviews.freebsd.org/D6688
2016-06-07 04:51:50 +00:00
Bjoern A. Zeeb
b941cb8d60 Add a show igi_list command to DDB to debug IGMP state.
Obtained from:	projects/vnet
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2016-06-06 22:26:18 +00:00
Bjoern A. Zeeb
a6c96fc2f0 Destroy the mutex last. In this case it should not matter, but
generally cleanup code might still acquire it thus try to be
consistent destroying locks late.

Obtained from:	projects/vnet
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2016-06-06 13:04:22 +00:00
George V. Neville-Neil
8b2b91eee0 Add missing constants from RFCs 4443 and 6550 2016-06-06 00:35:45 +00:00
Bjoern A. Zeeb
484149def8 Introduce a per-VNET flag to enable/disable netisr prcessing on that VNET.
Add accessor functions to toggle the state per VNET.
The base system (vnet0) will always enable itself with the normal
registration. We will share the registered protocol handlers in all
VNETs minimising duplication and management.
Upon disabling netisr processing for a VNET drain the netisr queue from
packets for that VNET.

Update netisr consumers to (de)register on a per-VNET start/teardown using
VNET_SYS(UN)INIT functionality.

The change should be transparent for non-VIMAGE kernels.

Reviewed by:	gnn (, hiren)
Obtained from:	projects/vnet
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D6691
2016-06-03 13:57:10 +00:00
Hans Petter Selasky
ec6689059d Use insertion sort instead of bubble sort in TCP LRO.
Replacing the bubble sort with insertion sort gives an 80% reduction
in runtime on average, with randomized keys, for small partitions.

If the keys are pre-sorted, insertion sort runs in linear time, and
even if the keys are reversed, insertion sort is faster than bubble
sort, although not by much.

Update comment describing "tcp_lro_sort()" while at it.

Differential Revision:	https://reviews.freebsd.org/D6619
Sponsored by:	Mellanox Technologies
Tested by:	Netflix
Suggested by:	Pieter de Goeje <pieter@degoeje.nl>
Reviewed by:	ed, gallatin, gnn, transport
2016-06-03 08:35:07 +00:00
Michael Tuexen
273e31638f Get struct sctp_net_route in-sync with struct route again. 2016-06-03 07:43:04 +00:00
Michael Tuexen
565cccce37 Store the peers vtag in host byte order in the cookie, since all
consumers expect it that way.
This fixes the vtag when sending en ERROR chunk.

MFC after:	1 week
2016-06-03 07:24:41 +00:00
George V. Neville-Neil
6d76822688 This change re-adds L2 caching for TCP and UDP, as originally added in D4306
but removed due to other changes in the system. Restore the llentry pointer
to the "struct route", and use it to cache the L2 lookup (ARP or ND6) as
appropriate.

Submitted by:	Mike Karels
Differential Revision:	https://reviews.freebsd.org/D6262
2016-06-02 17:51:29 +00:00
Bjoern A. Zeeb
3f58662dd9 The pr_destroy field does not allow us to run the teardown code in a
specific order.  VNET_SYSUNINITs however are doing exactly that.
Thus remove the VIMAGE conditional field from the domain(9) protosw
structure and replace it with VNET_SYSUNINITs.
This also allows us to change some order and to make the teardown functions
file local static.
Also convert divert(4) as it uses the same mechanism ip(4) and ip6(4) use
internally.

Slightly reshuffle the SI_SUB_* fields in kernel.h and add a new ones, e.g.,
for pfil consumers (firewalls), partially for this commit and for others
to come.

Reviewed by:		gnn, tuexen (sctp), jhb (kernel.h)
Obtained from:		projects/vnet
MFC after:		2 weeks
X-MFC:			do not remove pr_destroy
Sponsored by:		The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D6652
2016-06-01 10:14:04 +00:00
Michael Tuexen
3d48d25be7 Add PR_CONNREQUIRED for SOCK_STREAM sockets using SCTP.
This is required to signal connetion setup on non-blocking sockets
via becoming writable. This still allows for implicit connection
setup.

MFC after:	1 week
2016-05-30 18:24:23 +00:00
Michael Tuexen
7b7f31e6cf Fix a byte order issue for the scope stored in the SCTP cookie.
MFC after:	1 week
2016-05-30 11:18:39 +00:00
Sepherosa Ziehau
425b763928 tcp: Don't prematurely drop receiving-only connections
If the connection was persistent and receiving-only, several (12)
sporadic device insufficient buffers would cause the connection be
dropped prematurely:
Upon ENOBUFS in tcp_output() for an ACK, retransmission timer is
started.  No one will stop this retransmission timer for receiving-
only connection, so the retransmission timer promises to expire and
t_rxtshift is promised to be increased.  And t_rxtshift will not be
reset to 0, since no RTT measurement will be done for receiving-only
connection.  If this receiving-only connection lived long enough
(e.g. >350sec, given the RTO starts from 200ms), and it suffered 12
sporadic device insufficient buffers, i.e. t_rxtshift >= 12, this
receiving-only connection would be dropped prematurely by the
retransmission timer.

We now assert that for data segments, SYNs or FINs either rexmit or
persist timer was wired upon ENOBUFS.  And don't set rexmit timer
for other cases, i.e. ENOBUFS upon ACKs.

Discussed with:	lstewart, hiren, jtl, Mike Karels
MFC after:	3 weeks
Sponsored by:	Microsoft OSTC
Differential Revision:	https://reviews.freebsd.org/D5872
2016-05-30 03:31:37 +00:00
Gleb Smirnoff
6351b3857b Plug route reference underleak that happens with FLOWTABLE after r297225.
Submitted by:	Mike Karels <mike karels.net>
2016-05-27 17:31:02 +00:00
Don Lewis
91336b403a Import Dummynet AQM version 0.2.1 (CoDel, FQ-CoDel, PIE and FQ-PIE).
Centre for Advanced Internet Architectures

Implementing AQM in FreeBSD

* Overview <http://caia.swin.edu.au/freebsd/aqm/index.html>

* Articles, Papers and Presentations
  <http://caia.swin.edu.au/freebsd/aqm/papers.html>

* Patches and Tools <http://caia.swin.edu.au/freebsd/aqm/downloads.html>

Overview

Recent years have seen a resurgence of interest in better managing
the depth of bottleneck queues in routers, switches and other places
that get congested. Solutions include transport protocol enhancements
at the end-hosts (such as delay-based or hybrid congestion control
schemes) and active queue management (AQM) schemes applied within
bottleneck queues.

The notion of AQM has been around since at least the late 1990s
(e.g. RFC 2309). In recent years the proliferation of oversized
buffers in all sorts of network devices (aka bufferbloat) has
stimulated keen community interest in four new AQM schemes -- CoDel,
FQ-CoDel, PIE and FQ-PIE.

The IETF AQM working group is looking to document these schemes,
and independent implementations are a corner-stone of the IETF's
process for confirming the clarity of publicly available protocol
descriptions. While significant development work on all three schemes
has occured in the Linux kernel, there is very little in FreeBSD.

Project Goals

This project began in late 2015, and aims to design and implement
functionally-correct versions of CoDel, FQ-CoDel, PIE and FQ_PIE
in FreeBSD (with code BSD-licensed as much as practical). We have
chosen to do this as extensions to FreeBSD's ipfw/dummynet firewall
and traffic shaper. Implementation of these AQM schemes in FreeBSD
will:
* Demonstrate whether the publicly available documentation is
  sufficient to enable independent, functionally equivalent implementations

* Provide a broader suite of AQM options for sections the networking
  community that rely on FreeBSD platforms

Program Members:

* Rasool Al Saadi (developer)

* Grenville Armitage (project lead)

Acknowledgements:

This project has been made possible in part by a gift from the
Comcast Innovation Fund.

Submitted by:	Rasool Al-Saadi <ralsaadi@swin.edu.au>
X-No objection:	core
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D6388
2016-05-26 21:40:13 +00:00
John Baldwin
052a5418e8 Don't reuse the source mbuf in tcp_respond() if it is not writable.
Not all mbufs passed up from device drivers are M_WRITABLE().  In
particular, the Chelsio T4/T5 driver uses a feature called "buffer packing"
to receive multiple frames in a single receive buffer.  The mbufs for
these frames all share the same external storage so are treated as
read-only by the rest of the stack when multiple frames are in flight.
Previously tcp_respond() would blindly overwrite read-only mbufs when
INVARIANTS was disabled or panic with an assertion failure if INVARIANTS
was enabled.  Note that the new case is a bit of a mix of the two other
cases in tcp_respond().  The TCP and IP headers must be copied explicitly
into the new mbuf instead of being inherited (similar to the m == NULL
case), but the addresses and ports must be swapped in the reply (similar
to the m != NULL case).

Reviewed by:	glebius
2016-05-26 18:35:37 +00:00
Michael Tuexen
4f3b84b524 Make struct sctp_paddrthlds compliant to RFC 7829. 2016-05-26 11:38:26 +00:00