Commit Graph

5575 Commits

Author SHA1 Message Date
Andrey V. Elsukov
ed22e564b8 Add named dynamic states support to ipfw(4).
The keep-state, limit and check-state now will have additional argument
flowname. This flowname will be assigned to dynamic rule by keep-state
or limit opcode. And then can be matched by check-state opcode or
O_PROBE_STATE internal opcode. To reduce possible breakage and to maximize
compatibility with old rulesets default flowname introduced.
It will be assigned to the rules when user has omitted state name in
keep-state and check-state opcodes. Also if name is ambiguous (can be
evaluated as rule opcode) it will be replaced to default.

Reviewed by:	julian
Obtained from:	Yandex LLC
MFC after:	1 month
Relnotes:	yes
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D6674
2016-07-19 04:56:59 +00:00
Andrey V. Elsukov
b867e84e95 Add ipfw_nptv6 module that implements Network Prefix Translation for IPv6
as defined in RFC 6296. The module works together with ipfw(4) and
implemented as its external action module. When it is loaded, it registers
as eaction and can be used in rules. The usage pattern is similar to
ipfw_nat(4). All matched by rule traffic goes to the NPT module.

Reviewed by:	hrs
Obtained from:	Yandex LLC
MFC after:	1 month
Relnotes:	yes
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D6420
2016-07-18 19:46:31 +00:00
Michael Tuexen
487e2e6507 Add a constant required by RFC 7496.
MFC after:	3 days
2016-07-17 13:33:35 +00:00
Michael Tuexen
8e1b295f09 Fix the PR-SCTP behaviour.
This is done by rrs@.

MFC after:	3 days
2016-07-17 13:14:51 +00:00
Michael Tuexen
26f0250adf Add missing sctps_reasmusrmsgs counter.
Joint work with rrs@.
MFC after:	3 days
2016-07-17 08:31:21 +00:00
Michael Tuexen
643fd575da Deal with a portential memory allocation failure, which was reported
by the clang static code analyzer.
Joint work with rrs@.

MFC after:	3 days
2016-07-16 12:25:37 +00:00
Michael Tuexen
b4bf72213a Don't free a data chunk twice.
Found by the clang static code analyzer running for the userland stack.

MFC after:	3 days
2016-07-16 08:11:43 +00:00
Michael Tuexen
56d2f7d8e5 Address a potential memory leak found a the clang static code analyzer
running on the userland stack.

MFC after:	3 days
2016-07-16 07:48:01 +00:00
Jonathan T. Looney
24b9bb5614 The TCPPCAP debugging feature caches recently-used mbufs for use in
debugging TCP connections. This commit provides a mechanism to free those
mbufs when the system is under memory pressure.

Because this will result in lost debugging information, the behavior is
controllable by a sysctl. The default setting is to free the mbufs.

Reviewed by:	gnn
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D6931
Input from:	novice_techie.com
2016-07-06 16:17:13 +00:00
Nathan Whitehorn
96c85efb4b Replace a number of conflations of mp_ncpus and mp_maxid with either
mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in
the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try,
for example, to run tasks on CPUs that did not exist or to allocate too
few buffers on systems with sparse CPU IDs in which there are holes in the
range and mp_maxid > mp_ncpus. Such circumstances generally occur on
systems with SMT, but on which SMT is disabled. This patch restores system
operation at least on POWER8 systems configured in this way.

There are a number of other places in the kernel with potential problems
in these situations, but where sparse CPU IDs are not currently known
to occur, mostly in the ARM machine-dependent code. These will be fixed
in a follow-up commit after the stable/11 branch.

PR:		kern/210106
Reviewed by:	jhb
Approved by:	re (glebius)
2016-07-06 14:09:49 +00:00
Michael Tuexen
e75f31c1d0 This patch fixes two bugs related to the setting of the I-Bit
for SCTP DATA and I-DATA chunks.
* For fragmented user messages, set the I-Bit only on the last
  fragment.
* When using explicit EOR mode, set the I-Bit on the last
  fragment, whenever SCTP_SACK_IMMEDIATELY was set in snd_flags
  for any of the send() calls.

Approved by:	re (hrs)
MFC after:	1 week
2016-06-30 06:06:35 +00:00
Michael Tuexen
ab3373140d This patch fixes two bugs related to the SCTP message recovery
for messages which have been put on the send queue:
* Do not report any DATA or I-DATA chunk padding.
* Correctly deal with the I-DATA chunk header instead of the DATA
  chunk header when the I-DATA extension is used.

Approved by:	re (kib)
MFC after:	1 week
2016-06-26 16:38:42 +00:00
Michael Tuexen
d1b52c6a01 This patch fixes a locking bug when a send() call blocks
on an SCTP socket and the association is aborted by the
peer.

Approved by:	re (kib)
MFC after:	1 week
2016-06-26 12:41:02 +00:00
Bjoern A. Zeeb
42644eb88e Try to avoid a 2nd conditional by re-writing the loop, pause, and
escape clause another time.

Submitted by:	jhb
Approved by:	re (gjb)
MFC after:	12 days
2016-06-23 21:32:52 +00:00
Navdeep Parhar
f22bfc72f8 Add spares to struct ifnet and socket for packet pacing and/or general
use.  Update comments regarding the spare fields in struct inpcb.

Bump __FreeBSD_version for the changes to the size of the structures.

Reviewed by:	gnn@
Approved by:	re@ (gjb@)
Sponsored by:	Chelsio Communications
2016-06-23 21:07:15 +00:00
Michael Tuexen
9de217ce56 Fix a bug in the handling of non-blocking SCTP 1-to-1 sockets. When using
this code in the userland stack, it could result in a loop. This happened on iOS.
However, I was not able to reproduce this when using the code in the kernel.
Thanks to Eugen-Andrei Gavriloaie for reporting the issue and proving detailed
information to find the root of the problem.

Approved by:	re (gjb)
MFC after:	1 week
2016-06-23 19:27:29 +00:00
Bjoern A. Zeeb
361279a4fb In VNET TCP teardown Do not sleep unconditionally but only if we
have any TCP connections left.

Submitted by:		zec
Approved by:		re (hrs)
MFC after:		13 days
2016-06-23 11:55:15 +00:00
Michael Tuexen
55b8cd93ef Don't consider the socket when processing an incoming ICMP/ICMP6 packet,
which was triggered by an SCTP packet. Whether a socket exists, is just
not relevant.

Approved by: re (kib)
MFC after: 1 week
2016-06-23 09:13:15 +00:00
Bjoern A. Zeeb
b54e08e11a Check the V_tcbinfo.ipi_count to hit 0 before doing the full TCP cleanup.
That way timers can finish cleanly and we do not gamble with a DELAY().

Reviewed by:		gnn, jtl
Approved by:		re (gjb)
Obtained from:		projects/vnet
MFC after:		2 weeks
Sponsored by:		The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D6923
2016-06-23 00:34:03 +00:00
Bjoern A. Zeeb
252a46f324 No longer mark TCP TW zone NO_FREE.
Timewait code does a proper cleanup after itself.

Reviewed by:		gnn
Approved by:		re (gjb)
Obtained from:		projects/vnet
MFC after:		2 weeks
Sponsored by:		The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D6922
2016-06-23 00:32:58 +00:00
Bjoern A. Zeeb
89856f7e2d Get closer to a VIMAGE network stack teardown from top to bottom rather
than removing the network interfaces first. This change is rather larger
and convoluted as the ordering requirements cannot be separated.

Move the pfil(9) framework to SI_SUB_PROTO_PFIL, move Firewalls and
related modules to their own SI_SUB_PROTO_FIREWALL.
Move initialization of "physical" interfaces to SI_SUB_DRIVERS,
move virtual (cloned) interfaces to SI_SUB_PSEUDO.
Move Multicast to SI_SUB_PROTO_MC.

Re-work parts of multicast initialisation and teardown, not taking the
huge amount of memory into account if used as a module yet.

For interface teardown we try to do as many of them as we can on
SI_SUB_INIT_IF, but for some this makes no sense, e.g., when tunnelling
over a higher layer protocol such as IP. In that case the interface
has to go along (or before) the higher layer protocol is shutdown.

Kernel hhooks need to go last on teardown as they may be used at various
higher layers and we cannot remove them before we cleaned up the higher
layers.

For interface teardown there are multiple paths:
(a) a cloned interface is destroyed (inside a VIMAGE or in the base system),
(b) any interface is moved from a virtual network stack to a different
network stack ("vmove"), or (c) a virtual network stack is being shut down.
All code paths go through if_detach_internal() where we, depending on the
vmove flag or the vnet state, make a decision on how much to shut down;
in case we are destroying a VNET the individual protocol layers will
cleanup their own parts thus we cannot do so again for each interface as
we end up with, e.g., double-frees, destroying locks twice or acquiring
already destroyed locks.
When calling into protocol cleanups we equally have to tell them
whether they need to detach upper layer protocols ("ulp") or not
(e.g., in6_ifdetach()).

Provide or enahnce helper functions to do proper cleanup at a protocol
rather than at an interface level.

Approved by:		re (hrs)
Obtained from:		projects/vnet
Reviewed by:		gnn, jhb
Sponsored by:		The FreeBSD Foundation
MFC after:		2 weeks
Differential Revision:	https://reviews.freebsd.org/D6747
2016-06-21 13:48:49 +00:00
Andrey V. Elsukov
4c10540274 Cleanup unneded include "opt_ipfw.h".
It was used for conditional build IPFIREWALL_FORWARD support.
But IPFIREWALL_FORWARD option was removed a long time ago.
2016-06-09 05:48:34 +00:00
Michael Tuexen
63d5b56815 Use a separate MID counter for ordered und unordered messages for each
outgoing stream.

Thanks to Jens Hoelscher for reporting the issue.

MFC after: 1 week
2016-06-08 17:57:42 +00:00
Sepherosa Ziehau
36ad8372d4 net: Use M_HASHTYPE_OPAQUE_HASH if the mbuf flowid has hash properties
Reviewed by:	hps, erj, tuexen
Sponsored by:	Microsoft OSTC
Differential Revision:	https://reviews.freebsd.org/D6688
2016-06-07 04:51:50 +00:00
Bjoern A. Zeeb
b941cb8d60 Add a show igi_list command to DDB to debug IGMP state.
Obtained from:	projects/vnet
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2016-06-06 22:26:18 +00:00
Bjoern A. Zeeb
a6c96fc2f0 Destroy the mutex last. In this case it should not matter, but
generally cleanup code might still acquire it thus try to be
consistent destroying locks late.

Obtained from:	projects/vnet
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2016-06-06 13:04:22 +00:00
George V. Neville-Neil
8b2b91eee0 Add missing constants from RFCs 4443 and 6550 2016-06-06 00:35:45 +00:00
Bjoern A. Zeeb
484149def8 Introduce a per-VNET flag to enable/disable netisr prcessing on that VNET.
Add accessor functions to toggle the state per VNET.
The base system (vnet0) will always enable itself with the normal
registration. We will share the registered protocol handlers in all
VNETs minimising duplication and management.
Upon disabling netisr processing for a VNET drain the netisr queue from
packets for that VNET.

Update netisr consumers to (de)register on a per-VNET start/teardown using
VNET_SYS(UN)INIT functionality.

The change should be transparent for non-VIMAGE kernels.

Reviewed by:	gnn (, hiren)
Obtained from:	projects/vnet
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D6691
2016-06-03 13:57:10 +00:00
Hans Petter Selasky
ec6689059d Use insertion sort instead of bubble sort in TCP LRO.
Replacing the bubble sort with insertion sort gives an 80% reduction
in runtime on average, with randomized keys, for small partitions.

If the keys are pre-sorted, insertion sort runs in linear time, and
even if the keys are reversed, insertion sort is faster than bubble
sort, although not by much.

Update comment describing "tcp_lro_sort()" while at it.

Differential Revision:	https://reviews.freebsd.org/D6619
Sponsored by:	Mellanox Technologies
Tested by:	Netflix
Suggested by:	Pieter de Goeje <pieter@degoeje.nl>
Reviewed by:	ed, gallatin, gnn, transport
2016-06-03 08:35:07 +00:00
Michael Tuexen
273e31638f Get struct sctp_net_route in-sync with struct route again. 2016-06-03 07:43:04 +00:00
Michael Tuexen
565cccce37 Store the peers vtag in host byte order in the cookie, since all
consumers expect it that way.
This fixes the vtag when sending en ERROR chunk.

MFC after:	1 week
2016-06-03 07:24:41 +00:00
George V. Neville-Neil
6d76822688 This change re-adds L2 caching for TCP and UDP, as originally added in D4306
but removed due to other changes in the system. Restore the llentry pointer
to the "struct route", and use it to cache the L2 lookup (ARP or ND6) as
appropriate.

Submitted by:	Mike Karels
Differential Revision:	https://reviews.freebsd.org/D6262
2016-06-02 17:51:29 +00:00
Bjoern A. Zeeb
3f58662dd9 The pr_destroy field does not allow us to run the teardown code in a
specific order.  VNET_SYSUNINITs however are doing exactly that.
Thus remove the VIMAGE conditional field from the domain(9) protosw
structure and replace it with VNET_SYSUNINITs.
This also allows us to change some order and to make the teardown functions
file local static.
Also convert divert(4) as it uses the same mechanism ip(4) and ip6(4) use
internally.

Slightly reshuffle the SI_SUB_* fields in kernel.h and add a new ones, e.g.,
for pfil consumers (firewalls), partially for this commit and for others
to come.

Reviewed by:		gnn, tuexen (sctp), jhb (kernel.h)
Obtained from:		projects/vnet
MFC after:		2 weeks
X-MFC:			do not remove pr_destroy
Sponsored by:		The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D6652
2016-06-01 10:14:04 +00:00
Michael Tuexen
3d48d25be7 Add PR_CONNREQUIRED for SOCK_STREAM sockets using SCTP.
This is required to signal connetion setup on non-blocking sockets
via becoming writable. This still allows for implicit connection
setup.

MFC after:	1 week
2016-05-30 18:24:23 +00:00
Michael Tuexen
7b7f31e6cf Fix a byte order issue for the scope stored in the SCTP cookie.
MFC after:	1 week
2016-05-30 11:18:39 +00:00
Sepherosa Ziehau
425b763928 tcp: Don't prematurely drop receiving-only connections
If the connection was persistent and receiving-only, several (12)
sporadic device insufficient buffers would cause the connection be
dropped prematurely:
Upon ENOBUFS in tcp_output() for an ACK, retransmission timer is
started.  No one will stop this retransmission timer for receiving-
only connection, so the retransmission timer promises to expire and
t_rxtshift is promised to be increased.  And t_rxtshift will not be
reset to 0, since no RTT measurement will be done for receiving-only
connection.  If this receiving-only connection lived long enough
(e.g. >350sec, given the RTO starts from 200ms), and it suffered 12
sporadic device insufficient buffers, i.e. t_rxtshift >= 12, this
receiving-only connection would be dropped prematurely by the
retransmission timer.

We now assert that for data segments, SYNs or FINs either rexmit or
persist timer was wired upon ENOBUFS.  And don't set rexmit timer
for other cases, i.e. ENOBUFS upon ACKs.

Discussed with:	lstewart, hiren, jtl, Mike Karels
MFC after:	3 weeks
Sponsored by:	Microsoft OSTC
Differential Revision:	https://reviews.freebsd.org/D5872
2016-05-30 03:31:37 +00:00
Gleb Smirnoff
6351b3857b Plug route reference underleak that happens with FLOWTABLE after r297225.
Submitted by:	Mike Karels <mike karels.net>
2016-05-27 17:31:02 +00:00
Don Lewis
91336b403a Import Dummynet AQM version 0.2.1 (CoDel, FQ-CoDel, PIE and FQ-PIE).
Centre for Advanced Internet Architectures

Implementing AQM in FreeBSD

* Overview <http://caia.swin.edu.au/freebsd/aqm/index.html>

* Articles, Papers and Presentations
  <http://caia.swin.edu.au/freebsd/aqm/papers.html>

* Patches and Tools <http://caia.swin.edu.au/freebsd/aqm/downloads.html>

Overview

Recent years have seen a resurgence of interest in better managing
the depth of bottleneck queues in routers, switches and other places
that get congested. Solutions include transport protocol enhancements
at the end-hosts (such as delay-based or hybrid congestion control
schemes) and active queue management (AQM) schemes applied within
bottleneck queues.

The notion of AQM has been around since at least the late 1990s
(e.g. RFC 2309). In recent years the proliferation of oversized
buffers in all sorts of network devices (aka bufferbloat) has
stimulated keen community interest in four new AQM schemes -- CoDel,
FQ-CoDel, PIE and FQ-PIE.

The IETF AQM working group is looking to document these schemes,
and independent implementations are a corner-stone of the IETF's
process for confirming the clarity of publicly available protocol
descriptions. While significant development work on all three schemes
has occured in the Linux kernel, there is very little in FreeBSD.

Project Goals

This project began in late 2015, and aims to design and implement
functionally-correct versions of CoDel, FQ-CoDel, PIE and FQ_PIE
in FreeBSD (with code BSD-licensed as much as practical). We have
chosen to do this as extensions to FreeBSD's ipfw/dummynet firewall
and traffic shaper. Implementation of these AQM schemes in FreeBSD
will:
* Demonstrate whether the publicly available documentation is
  sufficient to enable independent, functionally equivalent implementations

* Provide a broader suite of AQM options for sections the networking
  community that rely on FreeBSD platforms

Program Members:

* Rasool Al Saadi (developer)

* Grenville Armitage (project lead)

Acknowledgements:

This project has been made possible in part by a gift from the
Comcast Innovation Fund.

Submitted by:	Rasool Al-Saadi <ralsaadi@swin.edu.au>
X-No objection:	core
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D6388
2016-05-26 21:40:13 +00:00
John Baldwin
052a5418e8 Don't reuse the source mbuf in tcp_respond() if it is not writable.
Not all mbufs passed up from device drivers are M_WRITABLE().  In
particular, the Chelsio T4/T5 driver uses a feature called "buffer packing"
to receive multiple frames in a single receive buffer.  The mbufs for
these frames all share the same external storage so are treated as
read-only by the rest of the stack when multiple frames are in flight.
Previously tcp_respond() would blindly overwrite read-only mbufs when
INVARIANTS was disabled or panic with an assertion failure if INVARIANTS
was enabled.  Note that the new case is a bit of a mix of the two other
cases in tcp_respond().  The TCP and IP headers must be copied explicitly
into the new mbuf instead of being inherited (similar to the m == NULL
case), but the addresses and ports must be swapped in the reply (similar
to the m != NULL case).

Reviewed by:	glebius
2016-05-26 18:35:37 +00:00
Michael Tuexen
4f3b84b524 Make struct sctp_paddrthlds compliant to RFC 7829. 2016-05-26 11:38:26 +00:00
Hans Petter Selasky
fc271df341 Use optimised complexity safe sorting routine instead of the kernel's
"qsort()".

The kernel's "qsort()" routine can in worst case spend O(N*N) amount of
comparisons before the input array is sorted. It can also recurse a
significant amount of times using up the kernel's interrupt thread
stack.

The custom sorting routine takes advantage of that the sorting key is
only 64 bits. Based on set and cleared bits in the sorting key it
partitions the array until it is sorted. This process has a recursion
limit of 64 times, due to the number of set and cleared bits which can
occur. Compiled with -O2 the sorting routine was measured to use
64-bytes of stack. Multiplying this by 64 gives a maximum stack
consumption of 4096 bytes for AMD64. The same applies to the execution
time, that the array to be sorted will not be traversed more than 64
times.

When serving roughly 80Gb/s with 80K TCP connections, the old method
consisting of "qsort()" and "tcp_lro_mbuf_compare_header()" used 1.4%
CPU, while the new "tcp_lro_sort()" used 1.1% for LRO related sorting
as measured by Intel Vtune. The testing was done using a sysctl to
toggle between "qsort()" and "tcp_lro_sort()".

Differential Revision:	https://reviews.freebsd.org/D6472
Sponsored by:	Mellanox Technologies
Tested by:	Netflix
Reviewed by:	gallatin, rrs, sephe, transport
2016-05-26 11:10:31 +00:00
Michael Tuexen
f88d0cfe7a When sending in ICMP response to an SCTP packet,
* include the SCTP common header, if possible
* include the first 8 bytes of the INIT chunk, if possible
This provides the necesary information for the receiver of the ICMP
packet to process it.

MFC after:	1 week
2016-05-25 22:16:11 +00:00
Michael Tuexen
6d7270a580 Send an ICMP packet indicating destination unreachable/protocol
unreachable if we don't handle the packet in the kernel and not
in userspace.

MFC after:	1 week
2016-05-25 15:54:21 +00:00
Michael Tuexen
ad2cbb09ef Count packets as not being delivered only if they are neither
processed by a kernel handler nor by a raw socket.

MFC after:	1 week
2016-05-25 13:48:26 +00:00
Don Lewis
883054b4c3 Change net.inet.tcp.ecn.enable sysctl mib from a binary off/on
control to a three way setting.
  0 - Totally disable ECN. (no change)
  1 - Enable ECN if incoming connections request it.  Outgoing
      connections will request ECN.  (no change from present != 0 setting)
  2 - Enable ECN if incoming connections request it.  Outgoing
      conections will not request ECN.

Change the default value of net.inet.tcp.ecn.enable from 0 to 2.

Linux version 2.4.20 and newer, Solaris, and Mac OS X 10.5 and newer have
similar capabilities.  The actual values above match Linux, and the default
matches the current Linux default.

Reviewed by:	eadler
MFC after:	1 month
MFH:		yes
Sponsored by:	https://reviews.freebsd.org/D6386
2016-05-19 22:20:35 +00:00
Gleb Smirnoff
f59d975e10 Tiny refactor of r294869/r296881: use defines to mask the VNET() macro.
Suggested by:	bz
2016-05-17 23:14:17 +00:00
Randall Stewart
5105a92c49 This small change adopts the excellent suggestion for using named
structures in the add of a new tcp-stack that came in late to me
via email after the last commit. It also makes it so that a new
stack may optionally get a callback during a retransmit
timeout. This allows the new stack to clear specific state (think
sack scoreboards or other such structures).

Sponsored by:	Netflix Inc.
Differential Revision:	http://reviews.freebsd.org/D6303
2016-05-17 09:53:22 +00:00
Andrey V. Elsukov
2685841b38 Make named objects set-aware. Now it is possible to create named
objects with the same name in different sets.

Add optional manage_sets() callback to objects rewriting framework.
It is intended to implement handler for moving and swapping named
object's sets. Add ipfw_obj_manage_sets() function that implements
generic sets handler. Use new callback to implement sets support for
lookup tables.
External actions objects are global and they don't support sets.
Modify eaction_findbyname() to reflect this.
ipfw(8) now may fail to move rules or sets, because some named objects
in target set may have conflicting names.
Note that ipfw_obj_ntlv type was changed, but since lookup tables
actually didn't support sets, this change is harmless.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2016-05-17 07:47:23 +00:00
Mark Johnston
565e7fd3bc opt_kdtrace.h is not needed for SDT probes as of r258541. 2016-05-15 20:04:43 +00:00
Mark Johnston
e8800f3c2f Fix a few style issues in the ICMP sysctl descriptions.
MFC after:	1 week
2016-05-15 03:19:53 +00:00