Commit Graph

510 Commits

Author SHA1 Message Date
Hiren Panchasara
64807b300f DCTCP (Data Center TCP) implementation.
DCTCP congestion control algorithm aims to maximise throughput and minimise
latency in data center networks by utilising the proportion of Explicit
Congestion Notification (ECN) marked packets received from capable hardware as a
congestion signal.

Highlights:
Implemented as a mod_cc(4) module.
ECN (Explicit congestion notification) processing is done differently from
RFC3168.
Takes one-sided DCTCP into consideration where only one of the sides is using
DCTCP and other is using standard ECN.

IETF draft: http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00
Thesis report by Midori Kato: https://eggert.org/students/kato-thesis.pdf

Submitted by:	Midori Kato <katoon@sfc.wide.ad.jp> and
		Lars Eggert <lars@netapp.com>
		with help and modifications from
		hiren
Differential Revision:	https://reviews.freebsd.org/D604
Reviewed by:	gnn
2015-01-12 08:33:04 +00:00
Andrey V. Elsukov
44eb8bbe7b Do not count security policy violation twice.
ipsec*_in_reject() do this by their own.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-12-11 19:20:13 +00:00
Hans Petter Selasky
c25290420e Start process of removing the use of the deprecated "M_FLOWID" flag
from the FreeBSD network code. The flag is still kept around in the
"sys/mbuf.h" header file, but does no longer have any users. Instead
the "m_pkthdr.rsstype" field in the mbuf structure is now used to
decide the meaning of the "m_pkthdr.flowid" field. To modify the
"m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX"
macros as defined in the "sys/mbuf.h" header file.

This patch introduces new behaviour in the transmit direction.
Previously network drivers checked if "M_FLOWID" was set in "m_flags"
before using the "m_pkthdr.flowid" field. This check has now now been
replaced by checking if "M_HASHTYPE_GET(m)" is different from
"M_HASHTYPE_NONE". In the future more hashtypes will be added, for
example hashtypes for hardware dedicated flows.

"M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is
valid and has no particular type. This change removes the need for an
"if" statement in TCP transmit code checking for the presence of a
valid flowid value. The "if" statement mentioned above is now a direct
variable assignment which is then later checked by the respective
network drivers like before.

Additional notes:
- The SCTP code changes will be committed as a separate patch.
- Removal of the "M_FLOWID" flag will also be done separately.
- The FreeBSD version has been bumped.

MFC after:	1 month
Sponsored by:	Mellanox Technologies
2014-12-01 11:45:24 +00:00
Gleb Smirnoff
651e4e6a30 Merge from projects/sendfile: extend protocols API to support
sending not ready data:
o Add new flag to pru_send() flags - PRUS_NOTREADY.
o Add new protocol method pru_ready().

Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2014-11-30 13:24:21 +00:00
Gleb Smirnoff
cfa6009e36 In preparation of merging projects/sendfile, transform bare access to
sb_cc member of struct sockbuf to a couple of inline functions:

sbavail() and sbused()

Right now they are equal, but once notion of "not ready socket buffer data",
will be checked in, they are going to be different.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-11-12 09:57:15 +00:00
Andrey V. Elsukov
3e88eb903b Remove ip6_getdstifaddr() and all functions to work with auxiliary data.
It isn't safe to keep unreferenced ifaddrs. Use in6ifa_ifwithaddr() to
determine ifaddr corresponding to destination address. Since currently
we keep addresses with embedded scope zone, in6ifa_ifwithaddr is called
with zero zoneid and marked with XXX.

Also remove route and lle lookups from ip6_input. Use in6ifa_ifwithaddr()
instead.

Sponsored by:	Yandex LLC
2014-11-08 19:38:34 +00:00
Gleb Smirnoff
6df8a71067 Remove SYSCTL_VNET_* macros, and simply put CTLFLAG_VNET where needed.
Sponsored by:	Nginx, Inc.
2014-11-07 09:39:05 +00:00
Hans Petter Selasky
9fd573c39d Improve transmit sending offload, TSO, algorithm in general.
The current TSO limitation feature only takes the total number of
bytes in an mbuf chain into account and does not limit by the number
of mbufs in a chain. Some kinds of hardware is limited by two
factors. One is the fragment length and the second is the fragment
count. Both of these limits need to be taken into account when doing
TSO. Else some kinds of hardware might have to drop completely valid
mbuf chains because they cannot loaded into the given hardware's DMA
engine. The new way of doing TSO limitation has been made backwards
compatible as input from other FreeBSD developers and will use
defaults for values not set.

Reviewed by:	adrian, rmacklem
Sponsored by:	Mellanox Technologies
MFC after:	1 week
2014-09-22 08:27:27 +00:00
Gleb Smirnoff
3220a2121c FreeBSD-SA-14:19.tcp raised attention to the state of our stack
towards blind SYN/RST spoofed attack.

Originally our stack used in-window checks for incoming SYN/RST
as proposed by RFC793. Later, circa 2003 the RST attack was
mitigated using the technique described in P. Watson
"Slipping in the window" paper [1].

After that, the checks were only relaxed for the sake of
compatibility with some buggy TCP stacks. First, r192912
introduced the vulnerability, just fixed by aforementioned SA.
Second, r167310 had slightly relaxed the default RST checks,
instead of utilizing net.inet.tcp.insecure_rst sysctl.

In 2010 a new technique for mitigation of these attacks was
proposed in RFC5961 [2]. The idea is to send a "challenge ACK"
packet to the peer, to verify that packet arrived isn't spoofed.
If peer receives challenge ACK it should regenerate its RST or
SYN with correct sequence number. This should not only protect
against attacks, but also improve communication with broken
stacks, so authors of reverted r167310 and r192912 won't be
disappointed.

[1] http://bandwidthco.com/whitepapers/netforensics/tcpip/TCP Reset Attacks.pdf
[2] http://www.rfc-editor.org/rfc/rfc5961.txt

Changes made:

o Revert r167310.
o Implement "challenge ACK" protection as specificed in RFC5961
  against RST attack. On by default.
  - Carefully preserve r138098, which handles empty window edge
    case, not described by the RFC.
  - Update net.inet.tcp.insecure_rst description.
o Implement "challenge ACK" protection as specificed in RFC5961
  against SYN attack. On by default.
  - Provide net.inet.tcp.insecure_syn sysctl, to turn off
    RFC5961 protection.

The changes were tested at Netflix. The tested box didn't show
any anomalies compared to control box, except slightly increased
number of TCP connection in LAST_ACK state.

Reviewed by:	rrs
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-09-16 11:07:25 +00:00
Xin LI
831ad37ef2 Fix Denial of Service in TCP packet processing.
Submitted by:	glebius
Security:	FreeBSD-SA-14:19.tcp
2014-09-16 09:48:24 +00:00
John Baldwin
a7c7f2a7e2 In tcp_input(), don't acquire the pcbinfo global write lock for SYN
packets targeting a listening socket.  Permit to reduce TCP input
processing starvation in context of high SYN load (e.g. short-lived TCP
connections or SYN flood).

Submitted by:	Julien Charbon <jcharbon@verisign.com>
Reviewed by:	adrian, hiren, jhb, Mike Bentkofsky
2014-09-04 19:09:08 +00:00
Hiren Panchasara
f7469d3e52 Improve comments by listing a criteria for automatic increment of receive socket
buffer.

Reviewed by:	jmg
2014-08-09 21:01:24 +00:00
Kevin Lo
8f5a8818f5 Merge 'struct ip6protosw' and 'struct protosw' into one. Now we have
only one protocol switch structure that is shared between ipv4 and ipv6.

Phabric:	D476
Reviewed by:	jhb
2014-08-08 01:57:15 +00:00
Hiren Panchasara
cc412412db 2014-07-02 22:04:14 +00:00
Bjoern A. Zeeb
3150f357ff Remove the prototpye for the static inline function
tcp_signature_verify_input().
The function is defined before first use already.

MFC after:	2 weeks
2014-05-24 15:31:40 +00:00
Bjoern A. Zeeb
5688fa661b Remove the prototypes for things that are no longer file local but were
moved to the header file.

Pointy hat to:	clang || bz
MFC after:	2 weeks
X-MFC with:	r266596
Reported by:	gcc build of sparc64
2014-05-23 21:12:33 +00:00
Bjoern A. Zeeb
255cd9fd58 Move the tcp_fields_to_host() and tcp_fields_to_net() (inline)
functions to the tcp_var.h header file in order to avoid further
duplication with upcoming commits.

Reviewed by:	np
MFC after:	2 weeks
2014-05-23 20:15:01 +00:00
Adrian Chadd
2f71993288 Ensure that the flowid hashtype is assigned to the inp if the flowid
is also assigned.
2014-05-18 22:34:06 +00:00
Gleb Smirnoff
e407b67be4 The FreeBSD-SA-14:08.tcp was a lesson on not doing acrobatics with
mixing on stack memory and UMA memory in one linked list.

Thus, rewrite TCP reassembly code in terms of memory usage. The
algorithm remains unchanged.

We actually do not need extra memory to build a reassembly queue.
Arriving mbufs are always packet header mbufs. So we got the length
of data as pkthdr.len. We got m_nextpkt for linkage. And we need
only one pointer to point at the tcphdr, use PH_loc for that.

In tcpcb the t_segq fields becomes mbuf pointer. The t_segqlen
field now counts not packets, but bytes in the queue. This gives
us more precision when comparing to socket buffer limits.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-05-04 23:25:32 +00:00
Hiren Panchasara
855363811f Improve readability of comments for DELAY_ACK() macro. 2014-04-03 01:46:03 +00:00
Hiren Panchasara
153edc50d7 Correct the comments as support for RFC 1644 has been removed for a long time. 2014-03-25 21:57:50 +00:00
Peter Wemm
62b90589e4 Adjust r239672 from rrs and r258821 from eadler.
By definition, the very first FIN is not a duplicate. Process it normally
and don't feed it to congestion control as though it were a dupe.  Don't
prevent CC from seeing later dupe acks while in a half close state.
2014-01-28 21:13:15 +00:00
Sergey Kandaurov
b8b4cfcdf6 Draft-ietf-tcpm-initcwnd-05 became RFC6928.
MFC after:	1 week
2013-12-26 04:24:08 +00:00
Eitan Adler
5f30ec9b63 In a situation where:
- The remote host sends a FIN
	- in an ACK for a sequence number for which an ACK has already
	  been received
	- There is still unacked data on route to the remote host
	- The packet does not contain a window update

The packet may be dropped without processing the FIN flag.

PR:		kern/99188
Submitted by:	Staffan Ulfberg <staffan@ulfberg.se>
Discussed with:	andre
MFC after:	never
2013-12-02 03:11:25 +00:00
Andriy Gapon
d9fae5ab88 dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE
In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).

Reviewed by:	markj
MFC after:	3 weeks
2013-11-26 08:46:27 +00:00
Adrian Chadd
fa22ce1570 Convert over the TCP probes to use mtod() rather than directly
dereferencing m->m_data.

Sponsored by:	Netflix, Inc.
2013-11-25 22:55:06 +00:00
Attilio Rao
54366c0bd7 - For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
  calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
  version of the releasing functions for mutex, rwlock and sxlock.
  Failing to do so skips the lockstat_probe_func invokation for
  unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
  kernel compiled without lock debugging options, potentially every
  consumer must be compiled including opt_kdtrace.h.
  Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
  dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
  is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested.  As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while.  Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by:	EMC / Isilon storage division
Discussed with:	rstone
[0] Reported by:	rstone
[1] Discussed with:	philip
2013-11-25 07:38:45 +00:00
Gleb Smirnoff
76039bc84f The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-26 17:58:36 +00:00
Andre Oppermann
c1e5a6e5e8 The TCP delayed ACK logic isn't aware of LRO passing up large aggregated
segments thinking it received only one segment. This causes it to enable
the delay the ACK for 100ms to wait for another segment which may never
come because all the data was received already.

Doing delayed ACK for LRO segments is bogus for two reasons: a) it pushes
us further away from acking every other packet; b) it introduces additional
delay in responding to the sender.  The latter is especially bad because it
is in the nature of LRO to aggregated all segments of a burst with no more
coming until an ACK is sent back.

Change the delayed ACK logic to detect LRO segments by being larger than
the MSS for this connection and issuing an immediate ACK for them to keep
the ACK clock ticking without interruption.

Reported by:	julian, cperciva
Tested by:	cperciva
Reviewed by:	lstewart
MFC after:	3 days
2013-10-22 18:24:34 +00:00
Gleb Smirnoff
c11a15bf8d When processing ACK in tcp_do_segment, use sbcut_locked() instead of
sbdrop_locked() to cut acked mbufs from the socket buffer. Free this
chain a batch manner after the socket buffer lock is dropped.

This measurably reduces contention on socket buffer.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
Approved by:	re (marius)
2013-10-09 12:00:38 +00:00
Mark Johnston
57f6086735 Implement the ip, tcp, and udp DTrace providers. The probe definitions use
dynamic translation so that their arguments match the definitions for
these providers in Solaris and illumos. Thus, existing scripts for these
providers should work unmodified on FreeBSD.

Tested by:	gnn, hiren
MFC after:	1 month
2013-08-25 21:54:41 +00:00
Andrey V. Elsukov
6794f46021 Remove the large part of struct ipsecstat. Only few fields of this
structure is used, but they already have equal fields in the struct
newipsecstat, that was introduced with FAST_IPSEC and then was merged
together with old ipsecstat structure.

This fixes kernel stack overflow on some architectures after migration
ipsecstat to PCPU counters.

Reported by:	Taku YAMAMOTO, Maciej Milewski
2013-07-23 14:14:24 +00:00
Andre Oppermann
07dacf031e Extend debug logging of TCP timestamp related specification
violations.

Update related comments and style.
2013-07-10 12:06:01 +00:00
Andrey V. Elsukov
5da0521fce Use new macros to implement ipstat and tcpstat using PCPU counters.
Change interface of kread_counters() similar ot kread() in the netstat(1).
2013-07-09 09:43:03 +00:00
Gleb Smirnoff
42a253e6a1 Fix kmod_*stat_inc() after r249276. The incorrect code actually
increased the pointer, not the memory it points to.

In collaboration with:	kib
Reported & tested by:	Ian FREISLICH <ianf clue.co.za>
Sponsored by:		Nginx, Inc.
2013-06-21 06:36:26 +00:00
Andrey V. Elsukov
6659296cb0 Use IPSECSTAT_INC() and IPSEC6STAT_INC() macros for ipsec statistics
accounting.

MFC after:	2 weeks
2013-06-20 09:55:53 +00:00
Andre Oppermann
3c914c547e Allow drivers to specify a maximum TSO length in bytes if they are
limited in the amount of data they can handle at once.

Drivers can set ifp->if_hw_tsomax before calling ether_ifattach() to
change the limit.

The lowest allowable size is IP_MAXPACKET / 8 (8192 bytes) as anything
less wouldn't be very useful anymore.  The upper limit is still at
IP_MAXPACKET (65536 bytes).  Raising it requires further auditing of
the IPv4/v6 code path's as the length field in the IP header would
overflow leading to confusion in firewalls and others packet handler on
the real size of the packet.

The placement into "struct ifnet" is a bit hackish but the best place
that was found.  When the stack/driver boundary is updated it should
be handled in a better way.

Submitted by:	cperciva (earlier version)
Reviewed by:	cperciva
Tested by:	cperciva
MFC after:	1 week (using spare struct members to preserve ABI)
2013-06-03 12:55:13 +00:00
Andre Oppermann
5628dd0893 When doing RFC3042 limited transmit on the first on second
duplicate ACK make sure we actually have new data to send.
This prevents us from sending unneccessary pure ACKs.

Reported by:	Matt Miller <matt@matthewjmiller.net>
Tested by:	Matt Miller <matt@matthewjmiller.net>
MFC after:	2 weeks
2013-04-23 14:06:32 +00:00
Andre Oppermann
982c1675ff Fix a race condition on tcp listen socket teardown with pending
connections in the accept queue and contiguous new incoming SYNs.

Compared to the original submitters patch I've moved the test
next to the SYN handling to have it together in a logical unit
and reworded the comment explaining the issue.

Submitted by:	Matt Miller <matt@matthewjmiller.net>
Submitted by:	Juan Mojica <jmojica@gmail.com>
Reviewed by:	Matt Miller (changes)
Tested by:	pho
MFC after:	1 week
2013-04-09 20:52:26 +00:00
Gleb Smirnoff
4a21e86ec1 Fix VIMAGE build. 2013-04-09 09:15:26 +00:00
Gleb Smirnoff
5923c29332 Merge from projects/counters: TCP/IP stats.
Convert 'struct ipstat' and 'struct tcpstat' to counter(9).

  This speeds up IP forwarding at extreme packet rates, and
makes accounting more precise.

Sponsored by:	Nginx, Inc.
2013-04-08 19:57:21 +00:00
Ed Maste
ce7ad6640c Keep fwd_tag around for subsequent pcb lookups
For TIMEWAIT handling tcp_input may have to jump back for an additional
pass through pcblookup.  Prior to this change the fwd_tag had been
discarded after the first lookup, so a new connection attempt delivered
locally via 'ipfw fwd' would fail to find a match.

As of r248886 the tag will be detached and freed when passed to the
socket buffer.
2013-03-29 20:51:44 +00:00
Lawrence Stewart
5b648e797b Simplify and fix a bug in cc_ack_received()'s "are we congestion window limited"
logic (refer to [1] for associated discussion). snd_cwnd and snd_wnd are
unsigned long and on 64 bit hosts, min() will truncate them to 32 bits and could
therefore potentially corrupt the result (although under normal operation,
neither variable should legitmately exceed 32 bits).

[1] http://lists.freebsd.org/pipermail/freebsd-net/2013-January/034297.html

Submitted by:	jhb
MFC after:	1 week
2013-01-22 09:44:21 +00:00
Gleb Smirnoff
b8056fae06 Fix !INET6 build after r244365. 2012-12-18 08:14:16 +00:00
Gleb Smirnoff
dd029d52fa Clear correct flag in INET6 case. 2012-12-18 08:09:44 +00:00
Andrey V. Elsukov
f491274582 Since we use different flags to detect tcp forwarding, and we share the
same code for IPv4 and IPv6 in tcp_input, we should check both
M_IP_NEXTHOP and M_IP6_NEXTHOP flags.

MFC after:	3 days
2012-12-17 20:55:33 +00:00
Gleb Smirnoff
78a7880f64 Fix a crash in tcp_input(), that happens when mbuf has a fwd_tag on it,
but later after processing and freeing the tag, we need to jump back again
to the findpcb label. Since the fwd_tag pointer wasn't NULL we tried to
process and free the tag for second time.

Reported & tested by:	Pawel Tyll <ptyll nitronet.pl>
MFC after:		3 days
2012-12-12 17:41:21 +00:00
Andre Oppermann
60ee3bb213 Back out r242262. The simplified window change/update logic wasn't
complete and ready for production use.

PR:	kern/173309
2012-11-05 09:13:06 +00:00
Andrey V. Elsukov
ffdbf9da3b Remove the recently added sysctl variable net.pfil.forward.
Instead, add protocol specific mbuf flags M_IP_NEXTHOP and
M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain
contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup
only when this flag is set.

Suggested by:	andre
2012-11-02 01:20:55 +00:00
Andre Oppermann
09440655fe Increase the initial CWND to 10 segments as defined in IETF TCPM
draft-ietf-tcpm-initcwnd-05. It explains why the increased initial
window improves the overall performance of many web services without
risking congestion collapse.

As long as it remains a draft it is placed under a sysctl marking it
as experimental:
 net.inet.tcp.experimental.initcwnd10 = 1
When it becomes an official RFC soon the sysctl will be changed to
the RFC number and moved to net.inet.tcp.

This implementation differs from the RFC draft in that it is a bit
more conservative in the case of packet loss on SYN or SYN|ACK because
we haven't reduced the default RTO to 1 second yet.  Also the restart
window isn't yet increased as allowed.  Both will be adjusted with
upcoming changes.

Is is enabled by default.  In Linux it is enabled since kernel 3.0.

MFC after:	2 weeks
2012-10-28 19:47:46 +00:00