Commit Graph

5799 Commits

Author SHA1 Message Date
cem
d76c6282b3 alias_proxy.c: Fix accidental error quashing
This was introduced on accident in r165243, when return sites were unified
to add a lock around LibAliasProxyRule().

PR:		217749
Submitted by:	Svyatoslav <razmyslov at viva64.com>
Sponsored by:	Viva64 (PVS-Studio)
2017-03-13 18:05:31 +00:00
ae
662d71cd1c Fix the L2 address printed in the "arp: %s moved from %*D" message.
In the r292978 struct llentry was changed and the ll_addr field become
the pointer.

PR:		217667
MFC after:	1 week
2017-03-11 04:57:52 +00:00
glebius
65907cccb6 Make inp_lock_assert() depend on INVARIANT_SUPPORT, not INVARIANTS.
This will make INVARIANT-enabled modules, that use this function to load
successfully on a kernel that has INVARIANT_SUPPORT only.
2017-03-09 00:55:19 +00:00
eri
f2a480c25c The patch provides the same socket option as Linux IP_ORIGDSTADDR.
Unfortunately they will have different integer value due to Linux value being already assigned in FreeBSD.

The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application.

This allows/improves implementation of transparent proxies on UDP sockets due to having the whole information on forwarded packets.

Reviewed by:	adrian, aw
Approved by:	ae (mentor)
Sponsored by:	rsync.net
Differential Revision:	D9235
2017-03-06 04:01:58 +00:00
imp
7e6cabd06e Renumber copyright clause 4
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by:	Jan Schaumann <jschauma@stevens.edu>
Pull Request:	https://github.com/freebsd/freebsd/pull/96
2017-02-28 23:42:47 +00:00
tuexen
19883d1af3 TCP window updates are only sent if the window can be increased by at
least 2 * MSS. However, if the receive buffer size is small, this might
be impossible. Add back a criterion to send a TCP window update if
the window can be increased by at least half of the receive buffer size.
This condition was removed in r242252. This patch simply brings it back.
PR:			211003
Reviewed by:		gnn
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D9475
2017-02-23 18:14:36 +00:00
vangyzen
f36100ef2c Remove inet_ntoa() from the kernel
inet_ntoa() cannot be used safely in a multithreaded environment
because it uses a static local buffer.  Remove it from the kernel.

Suggested by:	glebius, emaste
Reviewed by:	gnn
MFC after:	never
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D9625
2017-02-16 20:50:01 +00:00
vangyzen
c7c348accc Use inet_ntoa_r() instead of inet_ntoa() throughout the kernel
inet_ntoa() cannot be used safely in a multithreaded environment
because it uses a static local buffer. Instead, use inet_ntoa_r()
with a buffer on the caller's stack.

Suggested by:	glebius, emaste
Reviewed by:	gnn
MFC after:	2 weeks
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D9625
2017-02-16 20:47:41 +00:00
ae
ec763bfc18 Add missing check to fix the build with IPSEC_SUPPORT and without MAC.
Submitted by:	netchild
2017-02-14 21:33:10 +00:00
ae
5a443cfa6c Remove IPsec related PCB code from SCTP.
The inpcb structure has inp_sp pointer that is initialized by
ipsec_init_pcbpolicy() function. This pointer keeps strorage for IPsec
security policies associated with a specific socket.
An application can use IP_IPSEC_POLICY and IPV6_IPSEC_POLICY socket
options to configure these security policies. Then ip[6]_output()
uses inpcb pointer to specify that an outgoing packet is associated
with some socket. And IPSEC_OUTPUT() method can use a security policy
stored in the inp_sp. For inbound packet the protocol-specific input
routine uses IPSEC_CHECK_POLICY() method to check that a packet conforms
to inbound security policy configured in the inpcb.

SCTP protocol doesn't specify inpcb for ip[6]_output() when it sends
packets. Thus IPSEC_OUTPUT() method does not consider such packets as
associated with some socket and can not apply security policies
from inpcb, even if they are configured. Since IPSEC_CHECK_POLICY()
method is called from protocol-specific input routine, it can specify
inpcb pointer and associated with socket inbound policy will be
checked. But there are two problems:
1. Such check is asymmetric, becasue we can not apply security policy
from inpcb for outgoing packet.
2. IPSEC_CHECK_POLICY() expects that caller holds INPCB lock and
access to inp_sp is protected. But for SCTP this is not correct,
becasue SCTP uses own locks to protect inpcb.

To fix these problems remove IPsec related PCB code from SCTP.
This imply that IP_IPSEC_POLICY and IPV6_IPSEC_POLICY socket options
will be not applicable to SCTP sockets. To be able correctly check
inbound security policies for SCTP, mark its protocol header with
the PR_LASTHDR flag.

Reported by:	tuexen
Reviewed by:	tuexen
Differential Revision:	https://reviews.freebsd.org/D9538
2017-02-13 11:37:52 +00:00
eri
bb348ba959 Committed without approval from mentor.
Reported by:	gnn
2017-02-12 06:56:33 +00:00
rstone
cb359893cf Don't zero out srtt after excess retransmits
If the TCP stack has retransmitted more than 1/4 of the total
number of retransmits before a connection drop, it decides that
its current RTT estimate is hopelessly out of date and decides
to recalculate it from scratch starting with the next ACK.

Unfortunately, it implements this by zeroing out the current RTT
estimate.  Drop this hack entirely, as it makes it significantly more
difficult to debug connection issues.  Instead check for excessive
retransmits at the point where srtt is updated from an ACK being
received.  If we've exceeded 1/4 of the maximum retransmits,
discard the previous srtt estimate and replace it with the latest
rtt measurement.

Differential Revision:	https://reviews.freebsd.org/D9519
Reviewed by:	gnn
Sponsored by:	Dell EMC Isilon
2017-02-11 17:05:08 +00:00
glebius
c77bfbb1ee Move tcp_fields_to_net() static inline into tcp_var.h, just below its
friend tcp_fields_to_host(). There is third party code that also uses
this inline.

Reviewed by:	ae
2017-02-10 17:46:26 +00:00
eri
2530138e32 Fix build after r313524
Reported-by: ohartmann@walstatt.org
2017-02-10 06:01:47 +00:00
eri
6898c4334b Revert r313527
Heh svn is not git
2017-02-10 05:58:16 +00:00
eri
b429db62bc Correct missed variable name.
Reported-by: ohartmann@walstatt.org
2017-02-10 05:51:39 +00:00
eri
ed45b31494 The patch provides the same socket option as Linux IP_ORIGDSTADDR.
Unfortunately they will have different integer value due to Linux value being already assigned in FreeBSD.

The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application.

This allows/improves implementation of transparent proxies on UDP sockets due to having the whole information on forwarded packets.

Sponsored-by: rsync.net
Differential Revision: D9235
Reviewed-by: adrian
2017-02-10 05:16:14 +00:00
vangyzen
a55a27b7c6 Fix garbage IP addresses in UDP log_in_vain messages
If multiple threads emit a UDP log_in_vain message concurrently,
the IP addresses could be garbage due to concurrent usage of a
single string buffer inside inet_ntoa().  Use inet_ntoa_r() with
two stack buffers instead.

Reported by:	Mark Martinec <Mark.Martinec+freebsd@ijs.si>
MFC after:	3 days
Relnotes:	yes
Sponsored by:	Dell EMC
2017-02-07 18:57:57 +00:00
ae
0fb6ad528e Merge projects/ipsec into head/.
Small summary
 -------------

o Almost all IPsec releated code was moved into sys/netipsec.
o New kernel modules added: ipsec.ko and tcpmd5.ko. New kernel
  option IPSEC_SUPPORT added. It enables support for loading
  and unloading of ipsec.ko and tcpmd5.ko kernel modules.
o IPSEC_NAT_T option was removed. Now NAT-T support is enabled by
  default. The UDP_ENCAP_ESPINUDP_NON_IKE encapsulation type
  support was removed. Added TCP/UDP checksum handling for
  inbound packets that were decapsulated by transport mode SAs.
  setkey(8) modified to show run-time NAT-T configuration of SA.
o New network pseudo interface if_ipsec(4) added. For now it is
  build as part of ipsec.ko module (or with IPSEC kernel).
  It implements IPsec virtual tunnels to create route-based VPNs.
o The network stack now invokes IPsec functions using special
  methods. The only one header file <netipsec/ipsec_support.h>
  should be included to declare all the needed things to work
  with IPsec.
o All IPsec protocols handlers (ESP/AH/IPCOMP protosw) were removed.
  Now these protocols are handled directly via IPsec methods.
o TCP_SIGNATURE support was reworked to be more close to RFC.
o PF_KEY SADB was reworked:
  - now all security associations stored in the single SPI namespace,
    and all SAs MUST have unique SPI.
  - several hash tables added to speed up lookups in SADB.
  - SADB now uses rmlock to protect access, and concurrent threads
    can do SA lookups in the same time.
  - many PF_KEY message handlers were reworked to reflect changes
    in SADB.
  - SADB_UPDATE message was extended to support new PF_KEY headers:
    SADB_X_EXT_NEW_ADDRESS_SRC and SADB_X_EXT_NEW_ADDRESS_DST. They
    can be used by IKE daemon to change SA addresses.
o ipsecrequest and secpolicy structures were cardinally changed to
  avoid locking protection for ipsecrequest. Now we support
  only limited number (4) of bundled SAs, but they are supported
  for both INET and INET6.
o INPCB security policy cache was introduced. Each PCB now caches
  used security policies to avoid SP lookup for each packet.
o For inbound security policies added the mode, when the kernel does
  check for full history of applied IPsec transforms.
o References counting rules for security policies and security
  associations were changed. The proper SA locking added into xform
  code.
o xform code was also changed. Now it is possible to unregister xforms.
  tdb_xxx structures were changed and renamed to reflect changes in
  SADB/SPDB, and changed rules for locking and refcounting.

Reviewed by:	gnn, wblock
Obtained from:	Yandex LLC
Relnotes:	yes
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D9352
2017-02-06 08:49:57 +00:00
pkelsey
6c91473476 Fix VIMAGE-related bugs in TFO. The autokey callout vnet context was
not being initialized, and the per-vnet fastopen context was only
being initialized for the default vnet.

PR:		216613
Reported by:	Alex Deiter <alex dot deiter at gmail dot com>
MFC after:	1 week
2017-02-03 17:02:57 +00:00
gnn
5354226f97 Add an mbuf to ipinfo_t translator to finish cleanup of mbuf passing to TCP probes.
Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9401
2017-02-01 19:33:00 +00:00
tuexen
432d67eb8e Ensure that the variable bail is always initialized before used.
MFC after:	1 week
2017-02-01 00:10:29 +00:00
tuexen
695d1ccda1 Take the SCTP common header into account when computing the
space available for chunks. This unbreaks the handling of
ICMPV6 packets indicating "packet too big". It just worked
for IPv4 since we are overbooking for IPv4.

MFC after:	1 week
2017-01-31 23:36:31 +00:00
tuexen
b4b6fb9832 Remove a duplicate debug statement.
MFC after:	1 week
2017-01-31 23:34:02 +00:00
cy
230ac6bb88 Correct comment grammar and make it easier to understand.
MFC after:	1 week
2017-01-30 04:51:18 +00:00
hiren
845c485521 Add a knob to change default behavior of inheriting listen socket's tcp stack
regardless of what the default stack for the system is set to.

With current/default behavior, after changing the default tcp stack, the
application needs to be restarted to pick up that change. Setting this new knob
net.inet.tcp.functions_inherit_listen_socket_stack to '0' would change that
behavior and make any new connection use the newly selected default tcp stack.

Reviewed by:		rrs
MFC after:		2 weeks
Sponsored by:		Limelight Networks
2017-01-27 23:10:46 +00:00
loos
5bd3158a3b After the in_control() changes in r257692, an existing address is
(intentionally) deleted first and then completely added again (so all the
events, announces and hooks are given a chance to run).

This cause an issue with CARP where the existing CARP data structure is
removed together with the last address for a given VHID, which will cause
a subsequent fail when the address is later re-added.

This change fixes this issue by adding a new flag to keep the CARP data
structure when an address is not being removed.

There was an additional issue with IPv6 CARP addresses, where the CARP data
structure would never be removed after a change and lead to VHIDs which
cannot be destroyed.

Reviewed by:	glebius
Obtained from:	pfSense
MFC after:	2 weeks
Sponsored by:	Rubicon Communications, LLC (Netgate)
2017-01-25 19:04:08 +00:00
tuexen
bd0169975f Fix a bug where the overhead of the I-DATA chunk was not considered.
MFC after: 1 week
2017-01-24 21:30:36 +00:00
hselasky
efa6326974 Implement kernel support for hardware rate limited sockets.
- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.

- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API support rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.

- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().

- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.

- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.

- How rate limiting works:

1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.

2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the rate
limited traffic will be rate limited.

3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN which causes the ip_output() to release and clear the send
tag. Upon next ip_output() a new "snd_tag" will be tried allocated.

4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.

Reviewed by:		wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision:	https://reviews.freebsd.org/D3687
Sponsored by:		Mellanox Technologies
MFC after:		3 months
2017-01-18 13:31:17 +00:00
sobomax
701697521c Add a new socket option SO_TS_CLOCK to pick from several different clock
sources to return timestamps when SO_TIMESTAMP is enabled. Two additional
clock sources are:

o nanosecond resolution realtime clock (equivalent of CLOCK_REALTIME);
o nanosecond resolution monotonic clock (equivalent of CLOCK_MONOTONIC).

In addition to this, this option provides unified interface to get bintime
(equivalent of using SO_BINTIME), except it also supported with IPv6 where
SO_BINTIME has never been supported. The long term plan is to depreciate
SO_BINTIME and move everything to using SO_TS_CLOCK.

Idea for this enhancement has been briefly discussed on the Net session
during dev summit in Ottawa last June and the general input was positive.

This change is believed to benefit network benchmarks/profiling as well
as other scenarios where precise time of arrival measurement is necessary.

There are two regression test cases as part of this commit: one extends unix
domain test code (unix_cmsg) to test new SCM_XXX types and another one
implementis totally new test case which exchanges UDP packets between two
processes using both conventional methods (i.e. calling clock_gettime(2)
before recv(2) and after send(2)), as well as using setsockopt()+recv() in
receive path. The resulting delays are checked for sanity for all supported
clock types.

Reviewed by:    adrian, gnn
Differential Revision:  https://reviews.freebsd.org/D9171
2017-01-16 17:46:38 +00:00
cem
6d240aee91 Fix a variety of cosmetic typos and misspellings
No functional change.

PR:		216096, 216097, 216098, 216101, 216102, 216106, 216109, 216110
Reported by:	Bulat <bltsrc at mail.ru>
Sponsored by:	Dell EMC Isilon
2017-01-15 18:00:45 +00:00
glebius
23ed1207c8 Use getsock_cap() instead of deprecated fgetsock().
Reviewed by:	tuexen
2017-01-13 16:54:44 +00:00
tuexen
36cef24e89 Ensure that the buffer length and the length provided in the IPv4
header match when using a raw socket to send IPv4 packets and
providing the header. If they don't match, let send return -1
and set errno to EINVAL.

Before this patch is was only enforced that the length in the header
is not larger then the buffer length.

PR:			212283
Reviewed by:		ae, gnn
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D9161
2017-01-13 10:55:26 +00:00
sobomax
d54f5e8994 Fix slight type mismatch between so_options defined in sys/socketvar.h
and tw_so_options defined here which is supposed to be a copy of the
former (short vs u_short respectively).

Switch tw_so_options to be "signed short" to match the type of the field
it's inherited from.
2017-01-12 10:14:54 +00:00
hiren
9a5ee970a5 sysctl net.inet.tcp.hostcache.list in a jail can see connections from other
jails and the host. This commit fixes it.

PR:		200361
Submitted by:	bz (original version), hiren (minor corrections)
Reported by:	Marcus Reid <marcus at blazingdot dot com>
Reviewed by:	bz, gnn
Tested by:	Lohith Bellad <lohithbsd at gmail dot com>
MFC after:	1 week
Sponsored by:	Limelight Networks (minor corrections)
2017-01-05 17:22:09 +00:00
gnn
611bbdc42f Followup to mtod removal in main stack (r311225). Continued removal
of mtod() calls from TCP_PROBE macros.

MFC after:	1 week
Sponsored by:	Limelight Networks
2017-01-04 04:00:28 +00:00
gnn
6125ea060a Fix DTrace TCP tracepoints to not use mtod() as it is both unnecessary and
dangerous.  Those wanting data from an mbuf should use DTrace itself to get
the data.

PR:	203409
Reviewed by:	hiren
MFC after:	1 week
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D9035
2017-01-04 02:19:13 +00:00
ngie
99d3bb7bf7 Unbreak ip_carp with WITHOUT_INET6 enabled by conditionalizing all IPv6
structs under the INET6 #ifdef. Similarly (even though it doesn't seem
to affect the build), conditionalize all IPv4 structs under the INET
#ifdef

This also unbreaks the LINT-NOINET6 tinderbox target on amd64; I have not
verified other MACHINE/TARGET pairs (e.g. armv6/arm).

MFC after:	2 weeks
X-MFC with:	r310847
Pointyhat to:	jpaetzel
Reported by:	O. Hartmann <o.hartmann@walstatt.org>
2016-12-30 21:33:01 +00:00
jpaetzel
733f3711d8 Harden CARP against network loops.
If there is a loop in the network a CARP that is in MASTER state will see it's
own broadcasts, which will then cause it to assume BACKUP state.  When it
assumes BACKUP it will stop sending advertisements.  In that state it will no
longer see advertisements and will assume MASTER...

We can't catch all the cases where we are seeing our own CARP broadcast, but
we can catch the obvious case.

Submitted by:	torek
Obtained from:	FreeNAS
MFC after:	2 weeks
Sponsored by:	iXsystems
2016-12-30 18:46:21 +00:00
ae
d3f6c412e7 When we are sending IP fragments, update ip pointers in IP_PROBE() for
each fragment.

MFC after:	1 week
2016-12-29 19:57:46 +00:00
tuexen
8ac0611687 Consistent handling of errors reported from the lower layer.
MFC after:	3 days
2016-12-27 22:14:41 +00:00
tuexen
6266aedc70 Whitespace changes.
The toolchain for processing the sources has been updated. No functional
change.

MFC after:	3 days
2016-12-26 11:06:41 +00:00
tuexen
834d9ee203 Remove a KASSERT which is not always true.
In case of the empty queue tp->snd_holes and tcp_sackhole_insert()
failing due to memory shortage, tp->snd_holes will be empty.
This problem was hit when stress tests where performed by pho.

PR:		215513
Reported by:	pho
Tested by:	pho
Sponsored by:	Netflix, Inc.
2016-12-25 17:37:18 +00:00
glebius
5b31201618 Remove assigned only variable. 2016-12-21 22:47:10 +00:00
ae
2c8179e485 ip[6]_tryforward does inbound and outbound packet firewall processing.
This can lead to change of mbuf pointer (packet filter could do m_pullup(),
NAT, etc). Also in case of change of destination address, tryforward can
decide that packet should be handled by local system. In this case modified
mbuf can be returned to the ip[6]_input(). To handle this correctly, check
M_FASTFWD_OURS flag after return from ip[6]_tryforward. And if it is present,
update variables that depend from mbuf pointer and skip another inbound
firewall processing.

No objection from:	#network
MFC after:	3 weeks
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D8764
2016-12-19 11:02:49 +00:00
tuexen
37db8550fe Fix the handling of buffered messages in stream reset deferred handling.
Thanks to Eugen-Andrei Gavriloaie for reporting the issue and providing
substantial help in nailing down the issue.

MFC after:	1 week
2016-12-17 22:31:30 +00:00
hiren
719d6b9d9d We currently don't do TSO if ip options are present. In case of IPv6, we look at
in6p_options to check that. That is incorrect as we carry ip options in
in6p_outputopts. Also, just checking for in6p_outputopts being NULL won't
suffice as we combine ip options and ip header fields both in that one field.
The commit fixes this by using ip6_optlen() which correctly calculates length
of only ip options for IPv6.

Reviewed by:	    ae, bz
MFC after:	    3 weeks
Sponsored by:	    Limelight Networks
2016-12-11 23:14:47 +00:00
tuexen
5b3872f772 Ensure that the reported ppid and tsn are taken from the first fragment.
This fixes a bug where the wrong ppid was reported, if
* I-DATA was used on the first fragement was not received first
* DATA was used and different ppids where used.

Thanks to Julian Cordes for making me aware of the issue.

MFC after:	1 week
2016-12-11 13:26:35 +00:00
glebius
678917bbda Fix build for 32-bit machines.
Submitted by:	tuexen
2016-12-09 20:50:35 +00:00
glebius
4f37ca1641 Use counter_ratecheck() in the ICMP rate limiting.
Together with:	rrs, jtl
2016-12-09 17:59:15 +00:00
tuexen
07971cc856 Don't bundle a SACK chunk with a SHUTDOWN chunk if it is not required.
MFC after:	1 week
2016-12-09 17:58:07 +00:00
tuexen
de802bfc55 Don't send multiple SHUTDOWN chunks in a single packet.
Thanks to Felix Weinrank for making me aware of this issue.

MFC after:	1 week
2016-12-09 17:57:17 +00:00
tuexen
11a1caf3e4 Silence a warning produced by newer versions of gcc.
MFC after:	1 week
2016-12-07 22:01:09 +00:00
tuexen
ec64726461 Cleanup the names of SSN, SID, TSN, FSN, PPID and MID.
This made a couple of bugs visible in handling SSN wrap-arounds
when using DATA chunks. Now bulk transfer seems to work fine...
This fixes the issue reported in
https://github.com/sctplab/usrsctp/issues/111

MFC after:	1 week
2016-12-07 19:30:59 +00:00
tuexen
ae1856036a Whitespace changes.
The tools using to generate the sources has been updated and produces
different whitespaces. Commit this seperately to avoid intermixing
these with real code changes.

MFC after:	3 days
2016-12-06 10:21:25 +00:00
tuexen
c25f2a0572 Fix the handling of TCP FIN-segments in the CLOSED state
When a TCP segment with the FIN bit set was received in the CLOSED state,
a TCP RST-ACK-segment is sent. When computing SEG.ACK for this, the
FIN counts as one byte. This accounting was missing and is fixed by this
patch.

Reviewed by:		hiren
MFC after:		1 month
Sponsored by:		Netflix, Inc.
Differential Revision:	https://svn.freebsd.org/base/head
2016-12-02 08:02:31 +00:00
ae
3c54b1473b Rework ip_tryforward() to use FIB4 KPI.
Tested by:	olivier
Obtained from:	Yandex LLC
MFC after:	1 month
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D8526
2016-11-28 17:55:32 +00:00
hiren
af48af4877 For RTT calculations mid-session, we explicitly ignore ACKs with tsecr of 0 as
many borken middle-boxes tend to do that. But during 3whs, in syncache_expand(),
we don't do that which causes us to send a RST to such a client. Relax this
constraint by only using tsecr to compare against timestamp that we sent when it
is not 0. As a result, we'd now accept the final ACK of 3whs with tsecr of 0.

Reviewed by:	    jtl, gnn
Sponsored by:	    Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D8552
2016-11-21 20:53:11 +00:00
tuexen
0a8c611cbe Ensure that TCP state changes to state-closing are reported via dtrace.
This does not cover state changes from TIME-WAIT.

Reviewed by:		gnn
MFC after:		3 weeks
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D8443
2016-11-19 14:45:08 +00:00
tuexen
d16951e37f Notify the use via setting errno when a TCP RST segment is received
either in the CLOSING or LAST-ACK state.

Reviewed by:		hiren
MFC after:		3 weeks
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D8371
2016-11-17 08:15:02 +00:00
ae
77bbf6ec97 Initialize ip6 pointer before use.
PR:		214169
MFC after:	1 week
2016-11-06 02:33:04 +00:00
hiren
97041f7818 Set slow start threshold more accurately on loss to be flightsize/2 instead of
cwnd/2 as recommended by RFC5681. (spotted by mmacy at nextbsd dot org)

Restore pre-r307901 behavior of aligning ssthresh/cwnd on mss boundary. (spotted
by slawa at zxy dot spb dot ru)

Tested by:	    dim, Slawa <slawa at zxy dot spb dot ru>
MFC after:	    1 month
X-MFC with:	    r307901
Sponsored by:	    Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D8349
2016-11-01 21:08:37 +00:00
jch
0b9cdc127b Remove an extraneous call to soisconnected() in syncache_socket(),
introduced with r261242.  The useful and expected soisconnected()
call is done in tcp_do_segment().

Has been found as part of unrelated PR:212920 investigation.

Improve slightly (~2%) the maximum number of TCP accept per second.

Tested by:		kevin.bowling_kev009.com, jch
Approved by:		gnn, hiren
MFC after:		1 week
Sponsored by:		Verisign, Inc
Differential Revision:	https://reviews.freebsd.org/D8072
2016-10-26 15:19:18 +00:00
hiren
f084a95815 FreeBSD tcp stack used to inform respective congestion control module about the
loss event but not use or obay the recommendations i.e. values set by it in some
cases.

Here is an attempt to solve that confusion by following relevant RFCs/drafts.
Stack only sets congestion window/slow start threshold values when there is no
CC module availalbe to take that action. All CC modules are inspected and
updated when needed to take appropriate action on loss.

tcp_stacks/fastpath module has been updated to adapt these changes.

Note: Probably, the most significant change would be to not bring congestion
window down to 1MSS on a loss signaled by 3-duplicate acks and letting
respective CC decide that value.

In collaboration with:	Matt Macy <mmacy at nextbsd dot org>
Discussed on:		transport@ mailing list
Reviewed by:		jtl
MFC after:		1 month
Sponsored by:		Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D8225
2016-10-25 05:45:47 +00:00
hiren
f904bc1c31 Undo r307899. It needs a bit more work and proper commit log. 2016-10-25 05:07:51 +00:00
hiren
301cc1629e In Collaboration with: Matt Macy <mmacy at nextbsd dot com>
Reviewed by:		    jtl
Sponsored by:		    Limelight Networks
Differential Revision:	    https://reviews.freebsd.org/D8225
2016-10-25 05:03:33 +00:00
rstone
66911849cb Fix ip_output() on point-to-point links
In r304435, ip_output() was changed to use the result of the route
lookup to decide whether the outgoing packet was a broadcast or
not.  This introduced a regression on interfaces where
IFF_BROADCAST was not set (e.g. point-to-point links), as the
algorithm could incorrectly treat the destination address as a
broadcast address, and ip_output() would subsequently drop the
packet as broadcasting on a non-IFF_BROADCAST interface is not
allowed.

Differential Revision:	https://reviews.freebsd.org/D8303
Reviewed by:	jtl
Reported by:	ambrisko
MFC after:	2 weeks
X-MFC-With:	r304435
Sponsored by:	Dell EMC Isilon
2016-10-24 22:11:33 +00:00
tuexen
3f1a425503 No functional changes, mostly getting the whitespace changes resulting
from an updated formatting tool chain.

MFC after: 1 month
2016-10-22 17:21:21 +00:00
tuexen
219874f57e Make ICMPv6 hard error handling for TCP consistent with the ICMPv4
handling. Ensure that:
* Protocol unreachable errors are handled by indicating ECONNREFUSED
  to the TCP user for both IPv4 and IPv6. These were ignored for IPv6.
* Communication prohibited errors are handled by indicating ECONNREFUSED
  to the TCP user for both IPv4 and IPv6. These were ignored for IPv6.
* Hop Limited exceeded errors are handled by indicating EHOSTUNREACH
  to the TCP user for both IPv4 and IPv6.
  For IPv6 the TCP connected was dropped but errno wasn't set.

Reviewed by: gallatin, rrs
MFC after: 1 month
Sponsored by: Netflix
Differential Revision: 7904
2016-10-21 10:32:57 +00:00
jch
9e24370a69 Fix a double-free when an inp transitions to INP_TIMEWAIT state
after having been dropped.

This fixes enforces in_pcbdrop() logic in tcp_input():

"in_pcbdrop() is used by TCP to mark an inpcb as unused and avoid future packet
delivery or event notification when a socket remains open but TCP has closed."

PR:			203175
Reported by:		Palle Girgensohn, Slawa Olhovchenkov
Tested by:		Slawa Olhovchenkov
Reviewed by:		Slawa Olhovchenkov
Approved by:		gnn, Slawa Olhovchenkov
Differential Revision:	https://reviews.freebsd.org/D8211
MFC after:		1 week
Sponsored by:		Verisign, inc
2016-10-18 07:16:49 +00:00
hiren
d1e2d64193 Make sure tcp_mss() has the same check as tcp_mss_update() to have t_maxseg set
to at least 64.

This is still just a coverup to avoid kernel panic and not an actual fix.

PR:			213232
Reviewed by:		glebius
MFC after:		1 week
Sponsored by:		Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D8272
2016-10-18 02:40:25 +00:00
pkelsey
c8c04b1d34 Fix cases where the TFO pending counter would leak references, and eventually, memory.
Also renamed some tfo labels and added/reworked comments for clarity.

Based on an initial patch from jtl.

PR: 213424
Reviewed by:	jtl
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D8235
2016-10-15 01:41:28 +00:00
jtl
1a0d134638 r307082 added the TCP_HHOOK kernel option and made some existing code only
compile when that option is configured. In tcp_destroy(), the error
variable is now only used in code enclosed in an '#ifdef TCP_HHOOK' block.
This broke the build for VNET images.

Enclose the error variable itself in an #ifdef block.

Submitted by:	Shawn Webb <shawn.webb at hardenedbsd.org>
Reported by:	Shawn Webb <shawn.webb at hardenedbsd.org>
PointyHat to:	jtl
2016-10-15 00:29:15 +00:00
jtl
0ba4ad5d78 The code currently resets the keepalive timer each time a packet is
received on a TCP session that has entered the ESTABLISHED state. This
results in a lot of calls to reset the keepalive timer.

This patch changes the behavior so we set the keepalive timer for the
keepalive idle time (TP_KEEPIDLE). When the keepalive timer fires, it will
first check to see if the session has been idle for TP_KEEPIDLE ticks. If
not, it will reschedule the keepalive timer for the time the session will
have been idle for TP_KEEPIDLE ticks.

For a session with regular communication, the keepalive timer should fire
approximately once every TP_KEEPIDLE ticks. For sessions with irregular
communication, the keepalive timer might fire more often. But, the
disruption from a periodic keepalive timer should be less than the regular
cost of resetting the keepalive timer on every packet.

(FWIW, this change saved approximately 1.73% of the busy CPU cycles on a
particular test system with a heavy TCP output load. Of course, the
actual impact is very specific to the particular hardware and workload.)

Reviewed by:	gallatin, rrs
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D8243
2016-10-14 14:57:43 +00:00
glebius
444f6777c7 - Revert r300854, r303657 which tried to fix regression from r297225.
- Fix the regression proper way using RO_RTFREE().

Submitted by:	ae
2016-10-13 20:15:47 +00:00
glebius
9289759c94 With build without TCP_HHOOK and with INVARIANTS. Before mutex.h came
via sys/hhook.h -> sys/rmlock.h -> sys/mutex.h.
2016-10-13 18:02:29 +00:00
tuexen
ad7fe17e53 Mark the socket as un-writable when it is 1-to-1 and the SCTP association
is freed.

MFC after:	1 month
2016-10-13 13:53:01 +00:00
tuexen
25333eeed8 Whitespace changes.
MFC after: 1 month
2016-10-13 13:38:14 +00:00
jtl
d6d2c8e016 The TFO server-side code contains some changes that are not conditioned on
the TCP_RFC7413 kernel option. This change removes those few instructions
from the packet processing path.

While not strictly necessary, for the sake of consistency, I applied the
new IS_FASTOPEN macro to all places in the packet processing path that
used the (t_flags & TF_FASTOPEN) check.

Reviewed by:	hiren
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D8219
2016-10-12 19:06:50 +00:00
jtl
c8e3a1fa40 Currently, when tcp_input() receives a packet on a session that matches a
TCPCB, it checks (so->so_options & SO_ACCEPTCONN) to determine whether or
not the socket is a listening socket. However, this causes the code to
access a different cacheline. If we first check if the socket is in the
LISTEN state, we can avoid accessing so->so_options when processing packets
received for ESTABLISHED sessions.

If INVARIANTS is defined, the code still needs to access both variables to
check that so->so_options is consistent with the state.

Reviewed by:	gallatin
MFC after:	1 week
Sponsored by:	Netflix
2016-10-12 02:30:33 +00:00
jtl
62030781cd In the TCP stack, the hhook(9) framework provides hooks for kernel modules
to add actions that run when a TCP frame is sent or received on a TCP
session in the ESTABLISHED state. In the base tree, this functionality is
only used for the h_ertt module, which is used by the cc_cdg, cc_chd, cc_hd,
and cc_vegas congestion control modules.

Presently, we incur overhead to check for hooks each time a TCP frame is
sent or received on an ESTABLISHED TCP session.

This change adds a new compile-time option (TCP_HHOOK) to determine whether
to include the hhook(9) framework for TCP. To retain backwards
compatibility, I added the TCP_HHOOK option to every configuration file that
already defined "options INET". (Therefore, this patch introduces no
functional change. In order to see a functional difference, you need to
compile a custom kernel without the TCP_HHOOK option.) This change will
allow users to easily exclude this functionality from their kernel, should
they wish to do so.

Note that any users who use a custom kernel configuration and use one of the
congestion control modules listed above will need to add the TCP_HHOOK
option to their kernel configuration.

Reviewed by:	rrs, lstewart, hiren (previous version), sjg (makefiles only)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D8185
2016-10-12 02:16:42 +00:00
markj
0e9bbe2171 Lock the ND prefix list and add refcounting for prefixes.
This change extends the nd6 lock to protect the ND prefix list as well
as the list of advertising routers associated with each prefix. To handle
cases where the nd6 lock must be dropped while iterating over either the
prefix or default router lists, a generation counter is used to track
modifications to the lists. Additionally, a new mutex is used to serialize
prefix on-link/off-link transitions. This mutex must be acquired before
the nd6 lock and is held while updating the routing table in
nd6_prefix_onlink() and nd6_prefix_offlink().

Reviewed by:	ae, tuexen (SCTP bits)
Tested by:	Jason Wolfe <jason@llnw.com>,
		Larry Rosenman <ler@lerctr.org>
MFC after:	2 months
Differential Revision:	https://reviews.freebsd.org/D8125
2016-10-07 21:10:53 +00:00
jtl
69bf0a6bcd Remove "long" variables from the TCP stack (not including the modular
congestion control framework).

Reviewed by:	gnn, lstewart (partial)
Sponsored by:	Juniper Networks, Netflix
Differential Revision:	(multiple)
Tested by:	Limelight, Netflix
2016-10-06 16:28:34 +00:00
jtl
68fb25b684 If the new window size is less than the old window size, skip the
calculations to check if we should advertise a larger window.

Reviewed by:	gnn
MFC after:	2 weeks
Sponsored by:	Juniper Networks, Netflix
Differential Revision:	https://reviews.freebsd.org/D7076
Tested by:	Limelight, Netflix
2016-10-06 16:09:45 +00:00
jtl
96fb403827 Correctly calculate snd_max in persist case.
In the persist case, take the SYN and FIN flags into account when updating
the sequence space sent.

Reviewed by:	gnn
MFC after:	2 weeks
Sponsored by:	Juniper Networks, Netflix
Differential Revision:	https://reviews.freebsd.org/D7075
Tested by:	Limelight, Netflix
2016-10-06 16:00:48 +00:00
jtl
0af56930ed Remove declaration of un-defined function tcp_seq_subtract().
Reviewed by:	gnn
MFC after:	1 week
Sponsored by:	Juniper Networks, Netflix
Differential Revision:	https://reviews.freebsd.org/D7055
2016-10-06 15:57:15 +00:00
kevlo
2d24f35373 Remove an alias if_list, use if_link consistently.
Reviewed by:	tuexen
Differential Revision:	https://reviews.freebsd.org/D8075
2016-10-06 00:51:27 +00:00
vangyzen
e993e06c3b Add GARP retransmit capability
A single gratuitous ARP (GARP) is always transmitted when an IPv4
address is added to an interface, and that is usually sufficient.
However, in some circumstances, such as when a shared address is
passed between cluster nodes, this single GARP may occasionally be
dropped or lost.  This can lead to neighbors on the network link
working with a stale ARP cache and sending packets destined for
that address to the node that previously owned the address, which
may not respond.

To avoid this situation, GARP retransmissions can be enabled by setting
the net.link.ether.inet.garp_rexmit_count sysctl to a value greater
than zero.  The setting represents the maximum number of retransmissions.
The interval between retransmissions is calculated using an exponential
backoff algorithm, doubling each time, so the retransmission intervals
are: {1, 2, 4, 8, 16, ...} (seconds).

Due to the exponential backoff algorithm used for the interval
between GARP retransmissions, the maximum number of retransmissions
is limited to 16 for sanity.  This limit corresponds to a maximum
interval between retransmissions of 2^16 seconds ~= 18 hours.
Increasing this limit is possible, but sending out GARPs spaced
days apart would be of little use.

Submitted by:	David A. Bright <david.a.bright@dell.com>
MFC after:	1 month
Relnotes:	yes
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D7695
2016-10-02 01:42:45 +00:00
rmacklem
1ed6d7ee67 r297225 broke udp_output() for the case where the "addr" argument
is NULL and the function jumps to the "release:" label.
For this case, the "inp" was write locked, but the code attempted to
read unlock it. This patch fixes the problem.
This case could occur for NFS over UDP mounts, where the server was
down for a few minutes under certain circumstances.

Reported by:	bde
Tested by:	bde
Reviewed by:	gnn
MFC after:	2 weeks
2016-10-01 19:39:09 +00:00
hiren
fa7cad5718 This adds a sysctl which allows you to disable the TCP hostcache. This is handy
during testing of network related changes where cached entries may pollute your
results, or during known congestion events where you don't want to unfairly
penalize hosts.

Prior to r232346 this would have meant you would break any connection with a sub
1500 MTU, as the hostcache was authoritative. All entries as they stand today
should simply be used to pre populate values for efficiency.

Submitted by:	Jason Wolfe (j at nitrology dot com)
Reviewed by:	rwatson, sbruno, rrs , bz (earlier version)
MFC after:	2 weeks
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D6198
2016-09-30 00:10:57 +00:00
lidl
afb49c8e8c Properly preserve ip_tos bits for IPv4 packets
Restructure code slightly to save ip_tos bits earlier.  Fix the bug
where the ip_tos field is zeroed out before assigning to the iptos
variable. Restore the ip_tos and ip_ver fields only if they have
been zeroed during the pseudo-header checksum calculation.

Reviewed by:	cem, gnn, hiren
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D8053
2016-09-29 19:45:24 +00:00
jch
20981a0a76 Fix an issue with accept_filter introduced with r261242:
As a side effect of r261242 when using accept_filter the
first call to soisconnected() is done earlier in tcp_input()
instead of tcp_do_segment() context.  Restore the expected behaviour.

Note:  This call to soisconnected() seems to be extraneous in all
cases (with or without accept_filter).  Will be addressed in a
separate commit.

PR:			212920
Reported by:		Alexey
Tested by:              Alexey, jch
Sponsored by:           Verisign, Inc.
MFC after:		1 week
2016-09-29 11:18:48 +00:00
kevlo
6f20b23b0e Remove ifa_list, use ifa_link (structure field) instead.
While here, prefer if_addrhead (FreeBSD) to if_addrlist (BSD compat) naming
for the interface address list in sctp_bsd_addr.c

Reviewed by:	tuexen
Differential Revision:	https://reviews.freebsd.org/D8051
2016-09-28 13:29:11 +00:00
oshogbo
41574d7110 capsicum: propagate rights on accept(2)
Descriptor returned by accept(2) should inherits capabilities rights from
the listening socket.

PR:		201052
Reviewed by:	emaste, jonathan
Discussed with:	many
Differential Revision:	https://reviews.freebsd.org/D7724
2016-09-22 09:58:46 +00:00
tuexen
ca10f858be Fix the handling of unordered fragmented user messages using DATA chunks.
There were two bugs:
* There was an accounting bug resulting in reporting a too small a_rwnd.
* There are a bug when abandoning messages in the reassembly queue.

MFC after:	4 weeks
2016-09-21 08:28:18 +00:00
kevlo
518bc28463 Remove the 4.3BSD compatible macro m_copy(), use m_copym() instead.
Reviewed by:	gnn
Differential Revision:	https://reviews.freebsd.org/D7878
2016-09-15 07:41:48 +00:00
tuexen
f914d1f5aa Ensure that the IPPROTO_TCP level socket options
* TCP_KEEPINIT
* TCP_KEEPINTVL
* TCP_KEEPIDLE
* TCP_KEEPCNT
always always report the values currently used when getsockopt()
is used. This wasn't the case when the sysctl-inherited default
values where used.
Ensure that the IPPROTO_TCP level socket option TCP_INFO has the
TCPI_OPT_ECN flag set in the tcpi_options field when ECN support
has been negotiated successfully.

Reviewed by:	rrs, jtl, hiren
MFC after:	1 month
Differential Revision:	7833
2016-09-14 14:48:00 +00:00
dim
e0d1234535 With clang 3.9.0, compiling sys/netinet/igmp.c results in the following
warning:

sys/netinet/igmp.c:546:21: error: implicit conversion from 'int' to 'char' changes value from 148 to -108 [-Werror,-Wconstant-conversion]
        p->ipopt_list[0] = IPOPT_RA;    /* Router Alert Option */
                         ~ ^~~~~~~~
sys/netinet/ip.h:153:19: note: expanded from macro 'IPOPT_RA'
#define IPOPT_RA                148             /* router alert */
                                ^~~

This is because ipopt_list is an array of char, so IPOPT_RA is wrapped
to a negative value.  It would be nice to change ipopt_list to an array
of u_char, but it changes the signature of the public struct ipoption,
so add an explicit cast to suppress the warning.

Reviewed by:	imp
MFC after:	3 days
Differential Revision: https://reviews.freebsd.org/D7777
2016-09-04 17:23:10 +00:00
hiren
757d14e668 Adjust TCP module fastpath after r304803's cc_ack_received() changes.
Reported by:		hiren, bz, np
Reviewed by:		rrs
Sponsored by:		Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D7664
2016-08-26 19:23:17 +00:00
hiren
4f8ef932b3 Update TCPS_HAVERCVDFIN() macro to correctly include all states a connection
can be in after receiving a FIN.

FWIW, NetBSD has this change for quite some time.

This has been tested at Netflix and Limelight in production traffic.

Reported by:	Sam Kumar <samkumar99 at gmail.com> on transport@
Reviewed by:	rrs
MFC after:	4 weeks
Sponsored by:	Limelight Networks
Differential Revision:	 https://reviews.freebsd.org/D7475
2016-08-26 17:48:54 +00:00
tuexen
98a41f1034 Fix a bug, where no SACK is sent when receiving a FORWARD-TSN or
I-FORWARD-TSN chunk before any DATA or I-DATA chunk.

Thanks to Julian Cordes for finding this problem and prividing
packetdrill scripts to reporduce the issue.

MFC after: 3 days
2016-08-26 07:49:23 +00:00
lstewart
b01e24747b Pass the number of segments coalesced by LRO up the stack by repurposing the
tso_segsz pkthdr field during RX processing, and use the information in TCP for
more correct accounting and as a congestion control input. This is only a start,
and an audit of other uses for the data is left as future work.

Reviewed by:	gallatin, rrs
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D7564
2016-08-25 13:33:32 +00:00
tuexen
7a9338fdc9 When aborting an association, send the ABORT before notifying the upper
layer. For the kernel this doesn't matter, for the userland stack, it does.
While there, silence a clang warning when compiling it in userland.
2016-08-24 06:22:53 +00:00
rstone
28461d00d9 Temporarily disable the optimization from r304436
r304436 attempted to optimize the handling of incoming UDP packet by only
making an expensive call to in_broadcast() if the mbuf was marked as an
broadcast packet.  Unfortunately, this cannot work in the case of point-to-
point L2 protocols like PPP, which have no notion of "broadcast".

Discussions on how to properly fix r304436 are ongoing, but in the meantime
disable the optimization to ensure that no existing network setups are broken.

Reported by:	bms
2016-08-22 15:27:37 +00:00
tuexen
527830f803 Improve the locking when sending user messages.
First, keep a ref count on the stcb after looking it up, as
done in the other lookup cases.
Second, before looking again at sp, ensure that it is not
freed, because the assoc is about to be freed.

MFC after: 3 days
2016-08-22 01:45:29 +00:00
tuexen
0bd1594238 Remove duplicate code, which is not protected by the appropriate locks.
MFC after: 3 days
2016-08-22 00:40:45 +00:00
bz
55cbdc7ad3 Remove the kernel optoion for IPSEC_FILTERTUNNEL, which was deprecated
more than 7 years ago in favour of a sysctl in r192648.
2016-08-21 18:55:30 +00:00
zec
383d1ec924 Permit disabling net.inet.udp.require_l2_bcast in VIMAGE kernels.
The default value of the tunable introduced in r304436 couldn't be
effectively overrided on VIMAGE kernels, because instead of being
accessed via the appropriate VNET() accessor macro, it was accessed
via the VNET_NAME() macro, which resolves to the (should-be) read-only
master template of initial values of per-VNET data.  Hence, while the
value of udp_require_l2_bcast could be altered on per-VNET basis, the
code in udp_input() would ignore it as it would always read the default
value (one) from the VNET master template.

Silence from: rstone
2016-08-20 22:12:26 +00:00
tuexen
cb0025aac6 Unbreak sctp_connectx().
MFC after: 3 days
2016-08-20 20:15:36 +00:00
rstone
babdcd8925 Fix unlocked access to ifnet address list
in_broadcast() was iterating over the ifnet address list without
first taking an IF_ADDR_RLOCK.  This could cause a panic if a
concurrent operation modified the list.

Reviewed by: bz
MFC after: 2 months
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D7227
2016-08-18 22:59:10 +00:00
rstone
1d0c796721 Don't check for broadcast IPs on non-bcast pkts
in_broadcast() can be quite expensive, so skip calling it if the
incoming mbuf wasn't sent to a broadcast L2 address in the first
place.

Reviewed by: gnn
MFC after: 2 months
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D7309
2016-08-18 22:59:05 +00:00
rstone
217cfdcc89 Don't iterate over the ifnet addr list in ip_output()
For almost every packet that is transmitted through ip_output(),
a call to in_broadcast() was made to decide if the destination
IP was a broadcast address.  in_broadcast() iterates over the
ifnet's address to find a source IP matching the subnet of the
destination IP, and then checks if the IP is a broadcast in that
subnet.

This is completely redundant as we have already performed the
route lookup, so the source IP is already known.  Just use that
address to directly check whether the destination IP is a
broadcast address or not.

MFC after:	2 months
Sponsored By:	EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D7266
2016-08-18 22:59:00 +00:00
rrs
c9af4f2541 A few more wording tweaks as suggested (with some modifications
as well) by Ravi Pokala. Thanks for the comments :-)
Sponsored by: Netflix Inc.
2016-08-16 15:17:36 +00:00
rrs
4d7e0cd8cd Here we update the modular tcp to be able to switch to an
alternate TCP stack in other then the closed state (pre-listen/connect).
The idea is that *if* that is supported by the alternate stack, it
is asked if its ok to switch. If it approves the "handoff" then we
allow the switch to happen. Also the fini() function now gets a flag
to tell if you are switching away *or* the tcb is destroyed. The
init() call into the alternate stack is moved to the end so the
tcb is more fully formed before the init transpires.

Sponsored by:	Netflix Inc.
Differential Revision:	D6790
2016-08-16 15:11:46 +00:00
rrs
30758bfe03 Comments describing how to properly use the new lock_add functions
and its respective companion.

Sponsored by:	Netflix Inc.
2016-08-16 13:08:03 +00:00
rrs
fae35efb7f This cleans up the timer code in TCP and also makes it so we do not
take the INFO lock *unless* we are really going to delete the TCB.

Differential Revision:	D7136
2016-08-16 12:40:56 +00:00
sephe
779c070ec2 tcp/lro: Make # of LRO entries tunable
Reviewed by:	hps, gallatin
Obtained from:	rrs, gallatin
MFC after:	2 weeks
Sponsored by:	Netflix (rrs, gallatin), Microsoft (sephe)
Differential Revision:	https://reviews.freebsd.org/D7499
2016-08-16 06:40:27 +00:00
tuexen
9fe3ea7bab Ensure that sctp_it_ctl.cur_it does not point to a free object (during
a small time window).
Thanks to Byron Campen for reporting the issue and
suggesting a fix.

MFC after: 3 days
2016-08-15 10:16:08 +00:00
ae
fbd6330956 Add stats reset command implementation to NPTv6 module
to be able reset statistics counters.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2016-08-13 16:45:14 +00:00
ae
8c03d2551f Add ipfw_nat64 module that implements stateless and stateful NAT64.
The module works together with ipfw(4) and implemented as its external
action module.

Stateless NAT64 registers external action with name nat64stl. This
keyword should be used to create NAT64 instance and to address this
instance in rules. Stateless NAT64 uses two lookup tables with mapped
IPv4->IPv6 and IPv6->IPv4 addresses to perform translation.

A configuration of instance should looks like this:
 1. Create lookup tables:
 # ipfw table T46 create type addr valtype ipv6
 # ipfw table T64 create type addr valtype ipv4
 2. Fill T46 and T64 tables.
 3. Add rule to allow neighbor solicitation and advertisement:
 # ipfw add allow icmp6 from any to any icmp6types 135,136
 4. Create NAT64 instance:
 # ipfw nat64stl NAT create table4 T46 table6 T64
 5. Add rules that matches the traffic:
 # ipfw add nat64stl NAT ip from any to table(T46)
 # ipfw add nat64stl NAT ip from table(T64) to 64:ff9b::/96
 6. Configure DNS64 for IPv6 clients and add route to 64:ff9b::/96
    via NAT64 host.

Stateful NAT64 registers external action with name nat64lsn. The only
one option required to create nat64lsn instance - prefix4. It defines
the pool of IPv4 addresses used for translation.

A configuration of instance should looks like this:
 1. Add rule to allow neighbor solicitation and advertisement:
 # ipfw add allow icmp6 from any to any icmp6types 135,136
 2. Create NAT64 instance:
 # ipfw nat64lsn NAT create prefix4 A.B.C.D/28
 3. Add rules that matches the traffic:
 # ipfw add nat64lsn NAT ip from any to A.B.C.D/28
 # ipfw add nat64lsn NAT ip6 from any to 64:ff9b::/96
 4. Configure DNS64 for IPv6 clients and add route to 64:ff9b::/96
    via NAT64 host.

Obtained from:	Yandex LLC
Relnotes:	yes
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D6434
2016-08-13 16:09:49 +00:00
karels
cf6f0553ef Fix kernel build with TCP_RFC7413 option
The current in_pcb.h includes route.h, which includes sockaddr structures.
Including <sys/socketvar.h> should require <sys/socket.h>; add it in
the appropriate place.

PR: 211385
Submitted by: Sergey Kandaurov and iron at mail.ua
Reviewed by: gnn
Approved by: gnn (mentor)
MFC after: 1 day
2016-08-11 23:52:24 +00:00
ae
4500e11f0a Restore "nat global" support.
Now zero value of arg1 used to specify "tablearg", use the old "tablearg"
value for "nat global". Introduce new macro IP_FW_NAT44_GLOBAL to replace
hardcoded magic number to specify "nat global". Also replace 65535 magic
number with corresponding macro. Fix typo in comments.

PR:		211256
Tested by:	Victor Chernov
MFC after:	3 days
2016-08-11 10:10:10 +00:00
tuexen
899f803567 Improve a consistency check to not detect valid cases for
unordered user messages using DATA chunks as invalid ones.
While there, ensure that error causes are provided when
sending ABORT chunks in case of reassembly problems detected.
Thanks to Taylor Brandstetter for making me aware of this problem.
MFC after:	3 days
2016-08-10 17:19:33 +00:00
stevek
12aa5d6087 Move IPv4-specific jail functions to new file netinet/in_jail.c
_prison_check_ip4 renamed to prison_check_ip4_locked

Move IPv6-specific jail functions to new file netinet6/in6_jail.c
_prison_check_ip6 renamed to prison_check_ip6_locked

Add appropriate prototypes to sys/sys/jail.h

Adjust kern_jail.c to call prison_check_ip4_locked and
prison_check_ip6_locked accordingly.

Add netinet/in_jail.c and netinet6/in6_jail.c to the list of files that
need to be built when INET and INET6, respectively, are configured in the
kernel configuration file.

Reviewed by:	jtl
Approved by:	sjg (mentor)
Sponsored by:	Juniper Networks, Inc.
Differential Revision:	https://reviews.freebsd.org/D6799
2016-08-09 02:16:21 +00:00
tuexen
aad07a8b35 Fix the sending of FORWARD-TSN and I-FORWARD-TSN chunks. The
last SID/SSN pair wasn't filled in.
Thanks to Julian Cordes for providing a packetdrill script
triggering the issue and making me aware of the bug.

MFC after:	3 days
2016-08-08 13:52:18 +00:00
tuexen
8159e4a89f Fix a locking issue found by stress testing with tsctp.
The inp read lock neeeds to be held when considering control->do_not_ref_stcb.
MFC after:	3 days
2016-08-08 08:20:10 +00:00
tuexen
7e13d9800b Consistently check for unsent data on the stream queues.
MFC after:	3 days
2016-08-07 23:04:46 +00:00
tuexen
0a5a343096 Remove stream queue entry consistently from wheel.
While there, improve the handling of drain.

MFC after:	3 days
2016-08-07 12:51:13 +00:00
tuexen
c4511dcda8 Don't modify a structure without holding a reference count on it.
MFC after:	3 days
2016-08-06 15:29:46 +00:00
tuexen
5570003d40 Mark an unused parameter as such.
MFC after:	3 days
2016-08-06 12:51:07 +00:00
tuexen
1f7e02ab3c Fix various bugs in relation to the I-DATA chunk support
This is joint work with rrs.

MFC after:	3 days
2016-08-06 12:33:15 +00:00
sephe
7679792d26 tcp/lro: If timestamps mismatch or it's a FIN, force flush.
This keeps the segments/ACK/FIN delivery order.

Before this patch, it was observed: if A sent FIN immediately after
an ACK, B would deliver FIN first to the TCP stack, then the ACK.
This out-of-order delivery causes one unnecessary ACK sent from B.

Reviewed by:	gallatin, hps
Obtained from:  rrs, gallatin
Sponsored by:	Netflix (rrs, gallatin), Microsoft (sephe)
Differential Revision:	https://reviews.freebsd.org/D7415
2016-08-05 09:08:00 +00:00
sephe
2f53001d8c tcp/lro: Implement hash table for LRO entries.
This significantly improves HTTP workload performance and reduces
HTTP workload latency.

Reviewed by:	rrs, gallatin, hps
Obtained from:	rrs, gallatin
Sponsored by:	Netflix (rrs, gallatin) , Microsoft (sephe)
Differential Revision:	https://reviews.freebsd.org/D6689
2016-08-02 06:36:47 +00:00
gallatin
11f6fcfd28 Rework IPV6 TCP path MTU discovery to match IPv4
- Re-write tcp_ctlinput6() to closely mimic the IPv4 tcp_ctlinput()

- Now that tcp_ctlinput6() updates t_maxseg, we can allow ip6_output()
  to send TCP packets without looking at the tcp host cache for every
  single transmit.

- Make the icmp6 code mimic the IPv4 code & avoid returning
  PRC_HOSTDEAD because it is so expensive.

Without these changes in place, every TCP6 pmtu discovery or host
unreachable ICMP resulted in a call to in6_pcbnotify() which walks the
tcbinfo table with the write lock held.  Because the tcbinfo table is
shared between IPv4 and IPv6, this causes huge scalabilty issues on
servers with lots of (~100K) TCP connections, to the point where even
a small percent of IPv6 traffic had a disproportionate impact on
overall throughput.

Reviewed by:	bz, rrs, ae (all earlier versions), lstewart (in Netflix's tree)
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D7272
2016-08-01 17:02:21 +00:00
gallatin
5eb719bd4e Call tcp_notify() directly to shoot down routes, rather than
calling in_pcbnotifyall().

This avoids lock contention on tcbinfo due to in_pcbnotifyall()
holding the tcbinfo write lock while walking all connections.

Reviewed by:	rrs, karels
MFC after:	2 weeks
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D7251
2016-07-28 19:32:25 +00:00
stevek
b6dd955145 Remove BSD and USL copyright and update license block in in_prot.c, as the
code in this file was written by Robert N. M. Waston.

Move cr_can* prototypes from sys/systm.h to sys/proc.h

Reported by:	rwatson
Reviewed by:	rwatson
Approved by:	sjg (mentor)
Differential Revision:	https://reviews.freebsd.org/D7345
2016-07-28 18:39:30 +00:00
stevek
3acd4a25e6 Prepare for network stack as a module
- Move cr_canseeinpcb to sys/netinet/in_prot.c in order to separate the
   INET and INET6-specific code from the rest of the prot code (It is only
   used by the network stack, so it makes sense for it to live with the
   other network stack code.)
 - Move cr_canseeinpcb prototype from sys/systm.h to netinet/in_systm.h
 - Rename cr_seeotheruids to cr_canseeotheruids and cr_seeothergids to
   cr_canseeothergids, make them non-static, and add prototypes (so they
   can be seen/called by in_prot.c functions.)
 - Remove sw_csum variable from ip6_forward in ip6_forward.c, as it is an
   unused variable.

Reviewed by:	gnn, jtl
Approved by:	sjg (mentor)
Sponsored by:	Juniper Networks, Inc.
Differential Revision:	https://reviews.freebsd.org/D2901
2016-07-27 20:34:09 +00:00
brd
b3e395ca9c Fix the case for some sysctl descriptions.
Reviewed by:	gnn
2016-07-26 20:20:09 +00:00
karels
e31cf93024 Fix per-connection L2 caching in fast path
r301217 re-added per-connection L2 caching from a previous change,
but it omitted caching in the fast path.  Add it.

Reviewed By: gallatin
Approved by: gnn (mentor)
Differential Revision: https://reviews.freebsd.org/D7239
2016-07-22 02:11:49 +00:00
tuexen
1aab952d50 Fix a bug in deferred stream reset processing which results
in using a length field before it is set.

Thanks to Taylor Brandstetter for reporting the issue and
providing a fix.

MFC after:	3 days
2016-07-20 06:29:26 +00:00
tuexen
eeeff2b433 Use correct order of conditions to avoid NULL deref.
MFC after:	3 days
X-MFC with:	r302935
2016-07-19 11:16:44 +00:00
tuexen
4b30ef6b84 netstat and sockstat expect the IPv6 link local addresses to
have an embedded scope. So don't recover.

MFC after:	3 days
2016-07-19 09:48:08 +00:00
ae
e679279326 Add named dynamic states support to ipfw(4).
The keep-state, limit and check-state now will have additional argument
flowname. This flowname will be assigned to dynamic rule by keep-state
or limit opcode. And then can be matched by check-state opcode or
O_PROBE_STATE internal opcode. To reduce possible breakage and to maximize
compatibility with old rulesets default flowname introduced.
It will be assigned to the rules when user has omitted state name in
keep-state and check-state opcodes. Also if name is ambiguous (can be
evaluated as rule opcode) it will be replaced to default.

Reviewed by:	julian
Obtained from:	Yandex LLC
MFC after:	1 month
Relnotes:	yes
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D6674
2016-07-19 04:56:59 +00:00
ae
2c47439b3f Add ipfw_nptv6 module that implements Network Prefix Translation for IPv6
as defined in RFC 6296. The module works together with ipfw(4) and
implemented as its external action module. When it is loaded, it registers
as eaction and can be used in rules. The usage pattern is similar to
ipfw_nat(4). All matched by rule traffic goes to the NPT module.

Reviewed by:	hrs
Obtained from:	Yandex LLC
MFC after:	1 month
Relnotes:	yes
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D6420
2016-07-18 19:46:31 +00:00
tuexen
99394049ec Add a constant required by RFC 7496.
MFC after:	3 days
2016-07-17 13:33:35 +00:00
tuexen
0ad05f701a Fix the PR-SCTP behaviour.
This is done by rrs@.

MFC after:	3 days
2016-07-17 13:14:51 +00:00
tuexen
56cb57f2f0 Add missing sctps_reasmusrmsgs counter.
Joint work with rrs@.
MFC after:	3 days
2016-07-17 08:31:21 +00:00
tuexen
fd44b2ea42 Deal with a portential memory allocation failure, which was reported
by the clang static code analyzer.
Joint work with rrs@.

MFC after:	3 days
2016-07-16 12:25:37 +00:00
tuexen
4575027dc3 Don't free a data chunk twice.
Found by the clang static code analyzer running for the userland stack.

MFC after:	3 days
2016-07-16 08:11:43 +00:00
tuexen
be58faa50a Address a potential memory leak found a the clang static code analyzer
running on the userland stack.

MFC after:	3 days
2016-07-16 07:48:01 +00:00
jtl
be1c33dfcc The TCPPCAP debugging feature caches recently-used mbufs for use in
debugging TCP connections. This commit provides a mechanism to free those
mbufs when the system is under memory pressure.

Because this will result in lost debugging information, the behavior is
controllable by a sysctl. The default setting is to free the mbufs.

Reviewed by:	gnn
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D6931
Input from:	novice_techie.com
2016-07-06 16:17:13 +00:00
nwhitehorn
89d01c24d1 Replace a number of conflations of mp_ncpus and mp_maxid with either
mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in
the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try,
for example, to run tasks on CPUs that did not exist or to allocate too
few buffers on systems with sparse CPU IDs in which there are holes in the
range and mp_maxid > mp_ncpus. Such circumstances generally occur on
systems with SMT, but on which SMT is disabled. This patch restores system
operation at least on POWER8 systems configured in this way.

There are a number of other places in the kernel with potential problems
in these situations, but where sparse CPU IDs are not currently known
to occur, mostly in the ARM machine-dependent code. These will be fixed
in a follow-up commit after the stable/11 branch.

PR:		kern/210106
Reviewed by:	jhb
Approved by:	re (glebius)
2016-07-06 14:09:49 +00:00
tuexen
557adfd043 This patch fixes two bugs related to the setting of the I-Bit
for SCTP DATA and I-DATA chunks.
* For fragmented user messages, set the I-Bit only on the last
  fragment.
* When using explicit EOR mode, set the I-Bit on the last
  fragment, whenever SCTP_SACK_IMMEDIATELY was set in snd_flags
  for any of the send() calls.

Approved by:	re (hrs)
MFC after:	1 week
2016-06-30 06:06:35 +00:00
tuexen
5c0864eed0 This patch fixes two bugs related to the SCTP message recovery
for messages which have been put on the send queue:
* Do not report any DATA or I-DATA chunk padding.
* Correctly deal with the I-DATA chunk header instead of the DATA
  chunk header when the I-DATA extension is used.

Approved by:	re (kib)
MFC after:	1 week
2016-06-26 16:38:42 +00:00
tuexen
37b02e3df9 This patch fixes a locking bug when a send() call blocks
on an SCTP socket and the association is aborted by the
peer.

Approved by:	re (kib)
MFC after:	1 week
2016-06-26 12:41:02 +00:00
bz
179026d7fd Try to avoid a 2nd conditional by re-writing the loop, pause, and
escape clause another time.

Submitted by:	jhb
Approved by:	re (gjb)
MFC after:	12 days
2016-06-23 21:32:52 +00:00
np
af533198e3 Add spares to struct ifnet and socket for packet pacing and/or general
use.  Update comments regarding the spare fields in struct inpcb.

Bump __FreeBSD_version for the changes to the size of the structures.

Reviewed by:	gnn@
Approved by:	re@ (gjb@)
Sponsored by:	Chelsio Communications
2016-06-23 21:07:15 +00:00
tuexen
ab637d3a17 Fix a bug in the handling of non-blocking SCTP 1-to-1 sockets. When using
this code in the userland stack, it could result in a loop. This happened on iOS.
However, I was not able to reproduce this when using the code in the kernel.
Thanks to Eugen-Andrei Gavriloaie for reporting the issue and proving detailed
information to find the root of the problem.

Approved by:	re (gjb)
MFC after:	1 week
2016-06-23 19:27:29 +00:00
bz
4a8148b86d In VNET TCP teardown Do not sleep unconditionally but only if we
have any TCP connections left.

Submitted by:		zec
Approved by:		re (hrs)
MFC after:		13 days
2016-06-23 11:55:15 +00:00
tuexen
c2c8b26056 Don't consider the socket when processing an incoming ICMP/ICMP6 packet,
which was triggered by an SCTP packet. Whether a socket exists, is just
not relevant.

Approved by: re (kib)
MFC after: 1 week
2016-06-23 09:13:15 +00:00
bz
82f8e32710 Check the V_tcbinfo.ipi_count to hit 0 before doing the full TCP cleanup.
That way timers can finish cleanly and we do not gamble with a DELAY().

Reviewed by:		gnn, jtl
Approved by:		re (gjb)
Obtained from:		projects/vnet
MFC after:		2 weeks
Sponsored by:		The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D6923
2016-06-23 00:34:03 +00:00
bz
cd2b4aba5e No longer mark TCP TW zone NO_FREE.
Timewait code does a proper cleanup after itself.

Reviewed by:		gnn
Approved by:		re (gjb)
Obtained from:		projects/vnet
MFC after:		2 weeks
Sponsored by:		The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D6922
2016-06-23 00:32:58 +00:00
bz
7a1c0b1ad1 Get closer to a VIMAGE network stack teardown from top to bottom rather
than removing the network interfaces first. This change is rather larger
and convoluted as the ordering requirements cannot be separated.

Move the pfil(9) framework to SI_SUB_PROTO_PFIL, move Firewalls and
related modules to their own SI_SUB_PROTO_FIREWALL.
Move initialization of "physical" interfaces to SI_SUB_DRIVERS,
move virtual (cloned) interfaces to SI_SUB_PSEUDO.
Move Multicast to SI_SUB_PROTO_MC.

Re-work parts of multicast initialisation and teardown, not taking the
huge amount of memory into account if used as a module yet.

For interface teardown we try to do as many of them as we can on
SI_SUB_INIT_IF, but for some this makes no sense, e.g., when tunnelling
over a higher layer protocol such as IP. In that case the interface
has to go along (or before) the higher layer protocol is shutdown.

Kernel hhooks need to go last on teardown as they may be used at various
higher layers and we cannot remove them before we cleaned up the higher
layers.

For interface teardown there are multiple paths:
(a) a cloned interface is destroyed (inside a VIMAGE or in the base system),
(b) any interface is moved from a virtual network stack to a different
network stack ("vmove"), or (c) a virtual network stack is being shut down.
All code paths go through if_detach_internal() where we, depending on the
vmove flag or the vnet state, make a decision on how much to shut down;
in case we are destroying a VNET the individual protocol layers will
cleanup their own parts thus we cannot do so again for each interface as
we end up with, e.g., double-frees, destroying locks twice or acquiring
already destroyed locks.
When calling into protocol cleanups we equally have to tell them
whether they need to detach upper layer protocols ("ulp") or not
(e.g., in6_ifdetach()).

Provide or enahnce helper functions to do proper cleanup at a protocol
rather than at an interface level.

Approved by:		re (hrs)
Obtained from:		projects/vnet
Reviewed by:		gnn, jhb
Sponsored by:		The FreeBSD Foundation
MFC after:		2 weeks
Differential Revision:	https://reviews.freebsd.org/D6747
2016-06-21 13:48:49 +00:00
ae
48b268cd67 Cleanup unneded include "opt_ipfw.h".
It was used for conditional build IPFIREWALL_FORWARD support.
But IPFIREWALL_FORWARD option was removed a long time ago.
2016-06-09 05:48:34 +00:00
tuexen
e091a23d8b Use a separate MID counter for ordered und unordered messages for each
outgoing stream.

Thanks to Jens Hoelscher for reporting the issue.

MFC after: 1 week
2016-06-08 17:57:42 +00:00
sephe
7acd138965 net: Use M_HASHTYPE_OPAQUE_HASH if the mbuf flowid has hash properties
Reviewed by:	hps, erj, tuexen
Sponsored by:	Microsoft OSTC
Differential Revision:	https://reviews.freebsd.org/D6688
2016-06-07 04:51:50 +00:00
bz
cbabf9b05e Add a show igi_list command to DDB to debug IGMP state.
Obtained from:	projects/vnet
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2016-06-06 22:26:18 +00:00
bz
25cd3a2011 Destroy the mutex last. In this case it should not matter, but
generally cleanup code might still acquire it thus try to be
consistent destroying locks late.

Obtained from:	projects/vnet
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2016-06-06 13:04:22 +00:00
gnn
c1b72d1e48 Add missing constants from RFCs 4443 and 6550 2016-06-06 00:35:45 +00:00
bz
69cdb2137c Introduce a per-VNET flag to enable/disable netisr prcessing on that VNET.
Add accessor functions to toggle the state per VNET.
The base system (vnet0) will always enable itself with the normal
registration. We will share the registered protocol handlers in all
VNETs minimising duplication and management.
Upon disabling netisr processing for a VNET drain the netisr queue from
packets for that VNET.

Update netisr consumers to (de)register on a per-VNET start/teardown using
VNET_SYS(UN)INIT functionality.

The change should be transparent for non-VIMAGE kernels.

Reviewed by:	gnn (, hiren)
Obtained from:	projects/vnet
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D6691
2016-06-03 13:57:10 +00:00
hselasky
2665e75b58 Use insertion sort instead of bubble sort in TCP LRO.
Replacing the bubble sort with insertion sort gives an 80% reduction
in runtime on average, with randomized keys, for small partitions.

If the keys are pre-sorted, insertion sort runs in linear time, and
even if the keys are reversed, insertion sort is faster than bubble
sort, although not by much.

Update comment describing "tcp_lro_sort()" while at it.

Differential Revision:	https://reviews.freebsd.org/D6619
Sponsored by:	Mellanox Technologies
Tested by:	Netflix
Suggested by:	Pieter de Goeje <pieter@degoeje.nl>
Reviewed by:	ed, gallatin, gnn, transport
2016-06-03 08:35:07 +00:00
tuexen
2471e9df43 Get struct sctp_net_route in-sync with struct route again. 2016-06-03 07:43:04 +00:00
tuexen
060176cb0c Store the peers vtag in host byte order in the cookie, since all
consumers expect it that way.
This fixes the vtag when sending en ERROR chunk.

MFC after:	1 week
2016-06-03 07:24:41 +00:00
gnn
d75e0c471e This change re-adds L2 caching for TCP and UDP, as originally added in D4306
but removed due to other changes in the system. Restore the llentry pointer
to the "struct route", and use it to cache the L2 lookup (ARP or ND6) as
appropriate.

Submitted by:	Mike Karels
Differential Revision:	https://reviews.freebsd.org/D6262
2016-06-02 17:51:29 +00:00
bz
fac944a70a The pr_destroy field does not allow us to run the teardown code in a
specific order.  VNET_SYSUNINITs however are doing exactly that.
Thus remove the VIMAGE conditional field from the domain(9) protosw
structure and replace it with VNET_SYSUNINITs.
This also allows us to change some order and to make the teardown functions
file local static.
Also convert divert(4) as it uses the same mechanism ip(4) and ip6(4) use
internally.

Slightly reshuffle the SI_SUB_* fields in kernel.h and add a new ones, e.g.,
for pfil consumers (firewalls), partially for this commit and for others
to come.

Reviewed by:		gnn, tuexen (sctp), jhb (kernel.h)
Obtained from:		projects/vnet
MFC after:		2 weeks
X-MFC:			do not remove pr_destroy
Sponsored by:		The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D6652
2016-06-01 10:14:04 +00:00
tuexen
525f16040f Add PR_CONNREQUIRED for SOCK_STREAM sockets using SCTP.
This is required to signal connetion setup on non-blocking sockets
via becoming writable. This still allows for implicit connection
setup.

MFC after:	1 week
2016-05-30 18:24:23 +00:00
tuexen
9265336a79 Fix a byte order issue for the scope stored in the SCTP cookie.
MFC after:	1 week
2016-05-30 11:18:39 +00:00
sephe
8e57647564 tcp: Don't prematurely drop receiving-only connections
If the connection was persistent and receiving-only, several (12)
sporadic device insufficient buffers would cause the connection be
dropped prematurely:
Upon ENOBUFS in tcp_output() for an ACK, retransmission timer is
started.  No one will stop this retransmission timer for receiving-
only connection, so the retransmission timer promises to expire and
t_rxtshift is promised to be increased.  And t_rxtshift will not be
reset to 0, since no RTT measurement will be done for receiving-only
connection.  If this receiving-only connection lived long enough
(e.g. >350sec, given the RTO starts from 200ms), and it suffered 12
sporadic device insufficient buffers, i.e. t_rxtshift >= 12, this
receiving-only connection would be dropped prematurely by the
retransmission timer.

We now assert that for data segments, SYNs or FINs either rexmit or
persist timer was wired upon ENOBUFS.  And don't set rexmit timer
for other cases, i.e. ENOBUFS upon ACKs.

Discussed with:	lstewart, hiren, jtl, Mike Karels
MFC after:	3 weeks
Sponsored by:	Microsoft OSTC
Differential Revision:	https://reviews.freebsd.org/D5872
2016-05-30 03:31:37 +00:00
glebius
bbfa6d2853 Plug route reference underleak that happens with FLOWTABLE after r297225.
Submitted by:	Mike Karels <mike karels.net>
2016-05-27 17:31:02 +00:00
truckman
2a78edb668 Import Dummynet AQM version 0.2.1 (CoDel, FQ-CoDel, PIE and FQ-PIE).
Centre for Advanced Internet Architectures

Implementing AQM in FreeBSD

* Overview <http://caia.swin.edu.au/freebsd/aqm/index.html>

* Articles, Papers and Presentations
  <http://caia.swin.edu.au/freebsd/aqm/papers.html>

* Patches and Tools <http://caia.swin.edu.au/freebsd/aqm/downloads.html>

Overview

Recent years have seen a resurgence of interest in better managing
the depth of bottleneck queues in routers, switches and other places
that get congested. Solutions include transport protocol enhancements
at the end-hosts (such as delay-based or hybrid congestion control
schemes) and active queue management (AQM) schemes applied within
bottleneck queues.

The notion of AQM has been around since at least the late 1990s
(e.g. RFC 2309). In recent years the proliferation of oversized
buffers in all sorts of network devices (aka bufferbloat) has
stimulated keen community interest in four new AQM schemes -- CoDel,
FQ-CoDel, PIE and FQ-PIE.

The IETF AQM working group is looking to document these schemes,
and independent implementations are a corner-stone of the IETF's
process for confirming the clarity of publicly available protocol
descriptions. While significant development work on all three schemes
has occured in the Linux kernel, there is very little in FreeBSD.

Project Goals

This project began in late 2015, and aims to design and implement
functionally-correct versions of CoDel, FQ-CoDel, PIE and FQ_PIE
in FreeBSD (with code BSD-licensed as much as practical). We have
chosen to do this as extensions to FreeBSD's ipfw/dummynet firewall
and traffic shaper. Implementation of these AQM schemes in FreeBSD
will:
* Demonstrate whether the publicly available documentation is
  sufficient to enable independent, functionally equivalent implementations

* Provide a broader suite of AQM options for sections the networking
  community that rely on FreeBSD platforms

Program Members:

* Rasool Al Saadi (developer)

* Grenville Armitage (project lead)

Acknowledgements:

This project has been made possible in part by a gift from the
Comcast Innovation Fund.

Submitted by:	Rasool Al-Saadi <ralsaadi@swin.edu.au>
X-No objection:	core
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D6388
2016-05-26 21:40:13 +00:00
jhb
7a26d36370 Don't reuse the source mbuf in tcp_respond() if it is not writable.
Not all mbufs passed up from device drivers are M_WRITABLE().  In
particular, the Chelsio T4/T5 driver uses a feature called "buffer packing"
to receive multiple frames in a single receive buffer.  The mbufs for
these frames all share the same external storage so are treated as
read-only by the rest of the stack when multiple frames are in flight.
Previously tcp_respond() would blindly overwrite read-only mbufs when
INVARIANTS was disabled or panic with an assertion failure if INVARIANTS
was enabled.  Note that the new case is a bit of a mix of the two other
cases in tcp_respond().  The TCP and IP headers must be copied explicitly
into the new mbuf instead of being inherited (similar to the m == NULL
case), but the addresses and ports must be swapped in the reply (similar
to the m != NULL case).

Reviewed by:	glebius
2016-05-26 18:35:37 +00:00
tuexen
a38ffc8229 Make struct sctp_paddrthlds compliant to RFC 7829. 2016-05-26 11:38:26 +00:00
hselasky
8578dfb4fe Use optimised complexity safe sorting routine instead of the kernel's
"qsort()".

The kernel's "qsort()" routine can in worst case spend O(N*N) amount of
comparisons before the input array is sorted. It can also recurse a
significant amount of times using up the kernel's interrupt thread
stack.

The custom sorting routine takes advantage of that the sorting key is
only 64 bits. Based on set and cleared bits in the sorting key it
partitions the array until it is sorted. This process has a recursion
limit of 64 times, due to the number of set and cleared bits which can
occur. Compiled with -O2 the sorting routine was measured to use
64-bytes of stack. Multiplying this by 64 gives a maximum stack
consumption of 4096 bytes for AMD64. The same applies to the execution
time, that the array to be sorted will not be traversed more than 64
times.

When serving roughly 80Gb/s with 80K TCP connections, the old method
consisting of "qsort()" and "tcp_lro_mbuf_compare_header()" used 1.4%
CPU, while the new "tcp_lro_sort()" used 1.1% for LRO related sorting
as measured by Intel Vtune. The testing was done using a sysctl to
toggle between "qsort()" and "tcp_lro_sort()".

Differential Revision:	https://reviews.freebsd.org/D6472
Sponsored by:	Mellanox Technologies
Tested by:	Netflix
Reviewed by:	gallatin, rrs, sephe, transport
2016-05-26 11:10:31 +00:00
tuexen
dad0f4b1ee When sending in ICMP response to an SCTP packet,
* include the SCTP common header, if possible
* include the first 8 bytes of the INIT chunk, if possible
This provides the necesary information for the receiver of the ICMP
packet to process it.

MFC after:	1 week
2016-05-25 22:16:11 +00:00
tuexen
0bb4927e0c Send an ICMP packet indicating destination unreachable/protocol
unreachable if we don't handle the packet in the kernel and not
in userspace.

MFC after:	1 week
2016-05-25 15:54:21 +00:00
tuexen
34285663b0 Count packets as not being delivered only if they are neither
processed by a kernel handler nor by a raw socket.

MFC after:	1 week
2016-05-25 13:48:26 +00:00
truckman
c8e8113f34 Change net.inet.tcp.ecn.enable sysctl mib from a binary off/on
control to a three way setting.
  0 - Totally disable ECN. (no change)
  1 - Enable ECN if incoming connections request it.  Outgoing
      connections will request ECN.  (no change from present != 0 setting)
  2 - Enable ECN if incoming connections request it.  Outgoing
      conections will not request ECN.

Change the default value of net.inet.tcp.ecn.enable from 0 to 2.

Linux version 2.4.20 and newer, Solaris, and Mac OS X 10.5 and newer have
similar capabilities.  The actual values above match Linux, and the default
matches the current Linux default.

Reviewed by:	eadler
MFC after:	1 month
MFH:		yes
Sponsored by:	https://reviews.freebsd.org/D6386
2016-05-19 22:20:35 +00:00
glebius
60e4daddc3 Tiny refactor of r294869/r296881: use defines to mask the VNET() macro.
Suggested by:	bz
2016-05-17 23:14:17 +00:00
rrs
afc75dd440 This small change adopts the excellent suggestion for using named
structures in the add of a new tcp-stack that came in late to me
via email after the last commit. It also makes it so that a new
stack may optionally get a callback during a retransmit
timeout. This allows the new stack to clear specific state (think
sack scoreboards or other such structures).

Sponsored by:	Netflix Inc.
Differential Revision:	http://reviews.freebsd.org/D6303
2016-05-17 09:53:22 +00:00
ae
f79f8e9de8 Make named objects set-aware. Now it is possible to create named
objects with the same name in different sets.

Add optional manage_sets() callback to objects rewriting framework.
It is intended to implement handler for moving and swapping named
object's sets. Add ipfw_obj_manage_sets() function that implements
generic sets handler. Use new callback to implement sets support for
lookup tables.
External actions objects are global and they don't support sets.
Modify eaction_findbyname() to reflect this.
ipfw(8) now may fail to move rules or sets, because some named objects
in target set may have conflicting names.
Note that ipfw_obj_ntlv type was changed, but since lookup tables
actually didn't support sets, this change is harmless.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2016-05-17 07:47:23 +00:00
markj
fc8a4ae06d opt_kdtrace.h is not needed for SDT probes as of r258541. 2016-05-15 20:04:43 +00:00
markj
eac165ebf3 Fix a few style issues in the ICMP sysctl descriptions.
MFC after:	1 week
2016-05-15 03:19:53 +00:00
tuexen
f9a4fa815c Fix a locking bug which only shows up on Mac OS X.
MFC after: 1 week
2016-05-14 13:44:49 +00:00
tuexen
962273e9ec Fix a bug introduced by the implementation of I-DATA support.
There was the requirement that two structures are in sync,
which is not valid anymore. Therefore don't rely on this
in the code anymore.
Thanks to Radek Malcic for reporting the issue. He found this
when using the userland stack.

MFC after: 1 week
2016-05-13 09:11:41 +00:00
tuexen
6d95624949 Retire net.inet.sctp.strict_sacks and net.inet.sctp.strict_data_order
sysctl's, since they where only there to interop with non-conformant
implementations. This should not be a problem anymore.
2016-05-12 16:34:59 +00:00
tuexen
3c2e038873 Enable SACK Immediately per default.
This has been tested for a long time and implements covered by RFC 7053.

MFC after: 1 week
2016-05-12 15:48:08 +00:00
tuexen
766ed501bc Use a format string in snprintf() for consistency.
This was reported by Radek Malcic when using the userland stack in
combination with MinGW.

MFC after: 1 week
2016-05-12 14:41:53 +00:00
sephe
431b35bba5 tcp/syncache: Add comment for syncache_respond
Suggested by:	hiren, hps
Reviewed by:	sbruno
Sponsored by:	Microsoft OSTC
Differential Revision:	https://reviews.freebsd.org/D6148
2016-05-10 04:59:04 +00:00
hiren
7033f894b1 Add an option to use rfc6675 based pipe/inflight bytes calculation in htcp.
Submitted by:	Kevin Bowling <kevin.bowling@kev009.com>
MFC after:	1 week
Sponsored by:	Limelight Networks
2016-05-09 19:19:03 +00:00
tuexen
c299ac9344 Cleanup a comment.
MFC after: 1 week
2016-05-09 16:35:05 +00:00