Commit Graph

5619 Commits

Author SHA1 Message Date
Hiren Panchasara
06b99bd826 Adjust TCP module fastpath after r304803's cc_ack_received() changes.
Reported by:		hiren, bz, np
Reviewed by:		rrs
Sponsored by:		Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D7664
2016-08-26 19:23:17 +00:00
Hiren Panchasara
e7106d6be2 Update TCPS_HAVERCVDFIN() macro to correctly include all states a connection
can be in after receiving a FIN.

FWIW, NetBSD has this change for quite some time.

This has been tested at Netflix and Limelight in production traffic.

Reported by:	Sam Kumar <samkumar99 at gmail.com> on transport@
Reviewed by:	rrs
MFC after:	4 weeks
Sponsored by:	Limelight Networks
Differential Revision:	 https://reviews.freebsd.org/D7475
2016-08-26 17:48:54 +00:00
Michael Tuexen
91843cf34e Fix a bug, where no SACK is sent when receiving a FORWARD-TSN or
I-FORWARD-TSN chunk before any DATA or I-DATA chunk.

Thanks to Julian Cordes for finding this problem and prividing
packetdrill scripts to reporduce the issue.

MFC after: 3 days
2016-08-26 07:49:23 +00:00
Lawrence Stewart
4b7b743c16 Pass the number of segments coalesced by LRO up the stack by repurposing the
tso_segsz pkthdr field during RX processing, and use the information in TCP for
more correct accounting and as a congestion control input. This is only a start,
and an audit of other uses for the data is left as future work.

Reviewed by:	gallatin, rrs
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D7564
2016-08-25 13:33:32 +00:00
Michael Tuexen
884d8c53e6 When aborting an association, send the ABORT before notifying the upper
layer. For the kernel this doesn't matter, for the userland stack, it does.
While there, silence a clang warning when compiling it in userland.
2016-08-24 06:22:53 +00:00
Ryan Stone
23424a2021 Temporarily disable the optimization from r304436
r304436 attempted to optimize the handling of incoming UDP packet by only
making an expensive call to in_broadcast() if the mbuf was marked as an
broadcast packet.  Unfortunately, this cannot work in the case of point-to-
point L2 protocols like PPP, which have no notion of "broadcast".

Discussions on how to properly fix r304436 are ongoing, but in the meantime
disable the optimization to ensure that no existing network setups are broken.

Reported by:	bms
2016-08-22 15:27:37 +00:00
Michael Tuexen
7fcbd928f8 Improve the locking when sending user messages.
First, keep a ref count on the stcb after looking it up, as
done in the other lookup cases.
Second, before looking again at sp, ensure that it is not
freed, because the assoc is about to be freed.

MFC after: 3 days
2016-08-22 01:45:29 +00:00
Michael Tuexen
26a5d52f03 Remove duplicate code, which is not protected by the appropriate locks.
MFC after: 3 days
2016-08-22 00:40:45 +00:00
Bjoern A. Zeeb
77ecef378a Remove the kernel optoion for IPSEC_FILTERTUNNEL, which was deprecated
more than 7 years ago in favour of a sysctl in r192648.
2016-08-21 18:55:30 +00:00
Marko Zec
9da85a912d Permit disabling net.inet.udp.require_l2_bcast in VIMAGE kernels.
The default value of the tunable introduced in r304436 couldn't be
effectively overrided on VIMAGE kernels, because instead of being
accessed via the appropriate VNET() accessor macro, it was accessed
via the VNET_NAME() macro, which resolves to the (should-be) read-only
master template of initial values of per-VNET data.  Hence, while the
value of udp_require_l2_bcast could be altered on per-VNET basis, the
code in udp_input() would ignore it as it would always read the default
value (one) from the VNET master template.

Silence from: rstone
2016-08-20 22:12:26 +00:00
Michael Tuexen
e19497672b Unbreak sctp_connectx().
MFC after: 3 days
2016-08-20 20:15:36 +00:00
Ryan Stone
11f2a7cd67 Fix unlocked access to ifnet address list
in_broadcast() was iterating over the ifnet address list without
first taking an IF_ADDR_RLOCK.  This could cause a panic if a
concurrent operation modified the list.

Reviewed by: bz
MFC after: 2 months
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D7227
2016-08-18 22:59:10 +00:00
Ryan Stone
41029db13f Don't check for broadcast IPs on non-bcast pkts
in_broadcast() can be quite expensive, so skip calling it if the
incoming mbuf wasn't sent to a broadcast L2 address in the first
place.

Reviewed by: gnn
MFC after: 2 months
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D7309
2016-08-18 22:59:05 +00:00
Ryan Stone
90cc51a1ab Don't iterate over the ifnet addr list in ip_output()
For almost every packet that is transmitted through ip_output(),
a call to in_broadcast() was made to decide if the destination
IP was a broadcast address.  in_broadcast() iterates over the
ifnet's address to find a source IP matching the subnet of the
destination IP, and then checks if the IP is a broadcast in that
subnet.

This is completely redundant as we have already performed the
route lookup, so the source IP is already known.  Just use that
address to directly check whether the destination IP is a
broadcast address or not.

MFC after:	2 months
Sponsored By:	EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D7266
2016-08-18 22:59:00 +00:00
Randall Stewart
eadd00f81a A few more wording tweaks as suggested (with some modifications
as well) by Ravi Pokala. Thanks for the comments :-)
Sponsored by: Netflix Inc.
2016-08-16 15:17:36 +00:00
Randall Stewart
587d67c008 Here we update the modular tcp to be able to switch to an
alternate TCP stack in other then the closed state (pre-listen/connect).
The idea is that *if* that is supported by the alternate stack, it
is asked if its ok to switch. If it approves the "handoff" then we
allow the switch to happen. Also the fini() function now gets a flag
to tell if you are switching away *or* the tcb is destroyed. The
init() call into the alternate stack is moved to the end so the
tcb is more fully formed before the init transpires.

Sponsored by:	Netflix Inc.
Differential Revision:	D6790
2016-08-16 15:11:46 +00:00
Randall Stewart
0fa047b98c Comments describing how to properly use the new lock_add functions
and its respective companion.

Sponsored by:	Netflix Inc.
2016-08-16 13:08:03 +00:00
Randall Stewart
b07fef500b This cleans up the timer code in TCP and also makes it so we do not
take the INFO lock *unless* we are really going to delete the TCB.

Differential Revision:	D7136
2016-08-16 12:40:56 +00:00
Sepherosa Ziehau
8452c1b345 tcp/lro: Make # of LRO entries tunable
Reviewed by:	hps, gallatin
Obtained from:	rrs, gallatin
MFC after:	2 weeks
Sponsored by:	Netflix (rrs, gallatin), Microsoft (sephe)
Differential Revision:	https://reviews.freebsd.org/D7499
2016-08-16 06:40:27 +00:00
Michael Tuexen
dcb436c936 Ensure that sctp_it_ctl.cur_it does not point to a free object (during
a small time window).
Thanks to Byron Campen for reporting the issue and
suggesting a fix.

MFC after: 3 days
2016-08-15 10:16:08 +00:00
Andrey V. Elsukov
57fb3b7a78 Add stats reset command implementation to NPTv6 module
to be able reset statistics counters.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2016-08-13 16:45:14 +00:00
Andrey V. Elsukov
d8caf56e9e Add ipfw_nat64 module that implements stateless and stateful NAT64.
The module works together with ipfw(4) and implemented as its external
action module.

Stateless NAT64 registers external action with name nat64stl. This
keyword should be used to create NAT64 instance and to address this
instance in rules. Stateless NAT64 uses two lookup tables with mapped
IPv4->IPv6 and IPv6->IPv4 addresses to perform translation.

A configuration of instance should looks like this:
 1. Create lookup tables:
 # ipfw table T46 create type addr valtype ipv6
 # ipfw table T64 create type addr valtype ipv4
 2. Fill T46 and T64 tables.
 3. Add rule to allow neighbor solicitation and advertisement:
 # ipfw add allow icmp6 from any to any icmp6types 135,136
 4. Create NAT64 instance:
 # ipfw nat64stl NAT create table4 T46 table6 T64
 5. Add rules that matches the traffic:
 # ipfw add nat64stl NAT ip from any to table(T46)
 # ipfw add nat64stl NAT ip from table(T64) to 64:ff9b::/96
 6. Configure DNS64 for IPv6 clients and add route to 64:ff9b::/96
    via NAT64 host.

Stateful NAT64 registers external action with name nat64lsn. The only
one option required to create nat64lsn instance - prefix4. It defines
the pool of IPv4 addresses used for translation.

A configuration of instance should looks like this:
 1. Add rule to allow neighbor solicitation and advertisement:
 # ipfw add allow icmp6 from any to any icmp6types 135,136
 2. Create NAT64 instance:
 # ipfw nat64lsn NAT create prefix4 A.B.C.D/28
 3. Add rules that matches the traffic:
 # ipfw add nat64lsn NAT ip from any to A.B.C.D/28
 # ipfw add nat64lsn NAT ip6 from any to 64:ff9b::/96
 4. Configure DNS64 for IPv6 clients and add route to 64:ff9b::/96
    via NAT64 host.

Obtained from:	Yandex LLC
Relnotes:	yes
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D6434
2016-08-13 16:09:49 +00:00
Mike Karels
bca0155f64 Fix kernel build with TCP_RFC7413 option
The current in_pcb.h includes route.h, which includes sockaddr structures.
Including <sys/socketvar.h> should require <sys/socket.h>; add it in
the appropriate place.

PR: 211385
Submitted by: Sergey Kandaurov and iron at mail.ua
Reviewed by: gnn
Approved by: gnn (mentor)
MFC after: 1 day
2016-08-11 23:52:24 +00:00
Andrey V. Elsukov
d6eb9b0249 Restore "nat global" support.
Now zero value of arg1 used to specify "tablearg", use the old "tablearg"
value for "nat global". Introduce new macro IP_FW_NAT44_GLOBAL to replace
hardcoded magic number to specify "nat global". Also replace 65535 magic
number with corresponding macro. Fix typo in comments.

PR:		211256
Tested by:	Victor Chernov
MFC after:	3 days
2016-08-11 10:10:10 +00:00
Michael Tuexen
be46a7c54d Improve a consistency check to not detect valid cases for
unordered user messages using DATA chunks as invalid ones.
While there, ensure that error causes are provided when
sending ABORT chunks in case of reassembly problems detected.
Thanks to Taylor Brandstetter for making me aware of this problem.
MFC after:	3 days
2016-08-10 17:19:33 +00:00
Stephen J. Kiernan
0ce1624d0e Move IPv4-specific jail functions to new file netinet/in_jail.c
_prison_check_ip4 renamed to prison_check_ip4_locked

Move IPv6-specific jail functions to new file netinet6/in6_jail.c
_prison_check_ip6 renamed to prison_check_ip6_locked

Add appropriate prototypes to sys/sys/jail.h

Adjust kern_jail.c to call prison_check_ip4_locked and
prison_check_ip6_locked accordingly.

Add netinet/in_jail.c and netinet6/in6_jail.c to the list of files that
need to be built when INET and INET6, respectively, are configured in the
kernel configuration file.

Reviewed by:	jtl
Approved by:	sjg (mentor)
Sponsored by:	Juniper Networks, Inc.
Differential Revision:	https://reviews.freebsd.org/D6799
2016-08-09 02:16:21 +00:00
Michael Tuexen
d6e73fa13d Fix the sending of FORWARD-TSN and I-FORWARD-TSN chunks. The
last SID/SSN pair wasn't filled in.
Thanks to Julian Cordes for providing a packetdrill script
triggering the issue and making me aware of the bug.

MFC after:	3 days
2016-08-08 13:52:18 +00:00
Michael Tuexen
9c5ca6f247 Fix a locking issue found by stress testing with tsctp.
The inp read lock neeeds to be held when considering control->do_not_ref_stcb.
MFC after:	3 days
2016-08-08 08:20:10 +00:00
Michael Tuexen
124d851acf Consistently check for unsent data on the stream queues.
MFC after:	3 days
2016-08-07 23:04:46 +00:00
Michael Tuexen
4d58b0c3a9 Remove stream queue entry consistently from wheel.
While there, improve the handling of drain.

MFC after:	3 days
2016-08-07 12:51:13 +00:00
Michael Tuexen
cf46cace5c Don't modify a structure without holding a reference count on it.
MFC after:	3 days
2016-08-06 15:29:46 +00:00
Michael Tuexen
bfe7e9328c Mark an unused parameter as such.
MFC after:	3 days
2016-08-06 12:51:07 +00:00
Michael Tuexen
d1ea5fa9c2 Fix various bugs in relation to the I-DATA chunk support
This is joint work with rrs.

MFC after:	3 days
2016-08-06 12:33:15 +00:00
Sepherosa Ziehau
b9ec6f0b02 tcp/lro: If timestamps mismatch or it's a FIN, force flush.
This keeps the segments/ACK/FIN delivery order.

Before this patch, it was observed: if A sent FIN immediately after
an ACK, B would deliver FIN first to the TCP stack, then the ACK.
This out-of-order delivery causes one unnecessary ACK sent from B.

Reviewed by:	gallatin, hps
Obtained from:  rrs, gallatin
Sponsored by:	Netflix (rrs, gallatin), Microsoft (sephe)
Differential Revision:	https://reviews.freebsd.org/D7415
2016-08-05 09:08:00 +00:00
Sepherosa Ziehau
05cde7efa6 tcp/lro: Implement hash table for LRO entries.
This significantly improves HTTP workload performance and reduces
HTTP workload latency.

Reviewed by:	rrs, gallatin, hps
Obtained from:	rrs, gallatin
Sponsored by:	Netflix (rrs, gallatin) , Microsoft (sephe)
Differential Revision:	https://reviews.freebsd.org/D6689
2016-08-02 06:36:47 +00:00
Andrew Gallatin
d4c22202e6 Rework IPV6 TCP path MTU discovery to match IPv4
- Re-write tcp_ctlinput6() to closely mimic the IPv4 tcp_ctlinput()

- Now that tcp_ctlinput6() updates t_maxseg, we can allow ip6_output()
  to send TCP packets without looking at the tcp host cache for every
  single transmit.

- Make the icmp6 code mimic the IPv4 code & avoid returning
  PRC_HOSTDEAD because it is so expensive.

Without these changes in place, every TCP6 pmtu discovery or host
unreachable ICMP resulted in a call to in6_pcbnotify() which walks the
tcbinfo table with the write lock held.  Because the tcbinfo table is
shared between IPv4 and IPv6, this causes huge scalabilty issues on
servers with lots of (~100K) TCP connections, to the point where even
a small percent of IPv6 traffic had a disproportionate impact on
overall throughput.

Reviewed by:	bz, rrs, ae (all earlier versions), lstewart (in Netflix's tree)
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D7272
2016-08-01 17:02:21 +00:00
Andrew Gallatin
0e3b891988 Call tcp_notify() directly to shoot down routes, rather than
calling in_pcbnotifyall().

This avoids lock contention on tcbinfo due to in_pcbnotifyall()
holding the tcbinfo write lock while walking all connections.

Reviewed by:	rrs, karels
MFC after:	2 weeks
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D7251
2016-07-28 19:32:25 +00:00
Stephen J. Kiernan
f11ec79842 Remove BSD and USL copyright and update license block in in_prot.c, as the
code in this file was written by Robert N. M. Waston.

Move cr_can* prototypes from sys/systm.h to sys/proc.h

Reported by:	rwatson
Reviewed by:	rwatson
Approved by:	sjg (mentor)
Differential Revision:	https://reviews.freebsd.org/D7345
2016-07-28 18:39:30 +00:00
Stephen J. Kiernan
4ac21b4f09 Prepare for network stack as a module
- Move cr_canseeinpcb to sys/netinet/in_prot.c in order to separate the
   INET and INET6-specific code from the rest of the prot code (It is only
   used by the network stack, so it makes sense for it to live with the
   other network stack code.)
 - Move cr_canseeinpcb prototype from sys/systm.h to netinet/in_systm.h
 - Rename cr_seeotheruids to cr_canseeotheruids and cr_seeothergids to
   cr_canseeothergids, make them non-static, and add prototypes (so they
   can be seen/called by in_prot.c functions.)
 - Remove sw_csum variable from ip6_forward in ip6_forward.c, as it is an
   unused variable.

Reviewed by:	gnn, jtl
Approved by:	sjg (mentor)
Sponsored by:	Juniper Networks, Inc.
Differential Revision:	https://reviews.freebsd.org/D2901
2016-07-27 20:34:09 +00:00
Brad Davis
439e76ec22 Fix the case for some sysctl descriptions.
Reviewed by:	gnn
2016-07-26 20:20:09 +00:00
Mike Karels
ea17754c5a Fix per-connection L2 caching in fast path
r301217 re-added per-connection L2 caching from a previous change,
but it omitted caching in the fast path.  Add it.

Reviewed By: gallatin
Approved by: gnn (mentor)
Differential Revision: https://reviews.freebsd.org/D7239
2016-07-22 02:11:49 +00:00
Michael Tuexen
0ee5c319f2 Fix a bug in deferred stream reset processing which results
in using a length field before it is set.

Thanks to Taylor Brandstetter for reporting the issue and
providing a fix.

MFC after:	3 days
2016-07-20 06:29:26 +00:00
Michael Tuexen
0fa7377a03 Use correct order of conditions to avoid NULL deref.
MFC after:	3 days
X-MFC with:	r302935
2016-07-19 11:16:44 +00:00
Michael Tuexen
0ba4c8abe0 netstat and sockstat expect the IPv6 link local addresses to
have an embedded scope. So don't recover.

MFC after:	3 days
2016-07-19 09:48:08 +00:00
Andrey V. Elsukov
ed22e564b8 Add named dynamic states support to ipfw(4).
The keep-state, limit and check-state now will have additional argument
flowname. This flowname will be assigned to dynamic rule by keep-state
or limit opcode. And then can be matched by check-state opcode or
O_PROBE_STATE internal opcode. To reduce possible breakage and to maximize
compatibility with old rulesets default flowname introduced.
It will be assigned to the rules when user has omitted state name in
keep-state and check-state opcodes. Also if name is ambiguous (can be
evaluated as rule opcode) it will be replaced to default.

Reviewed by:	julian
Obtained from:	Yandex LLC
MFC after:	1 month
Relnotes:	yes
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D6674
2016-07-19 04:56:59 +00:00
Andrey V. Elsukov
b867e84e95 Add ipfw_nptv6 module that implements Network Prefix Translation for IPv6
as defined in RFC 6296. The module works together with ipfw(4) and
implemented as its external action module. When it is loaded, it registers
as eaction and can be used in rules. The usage pattern is similar to
ipfw_nat(4). All matched by rule traffic goes to the NPT module.

Reviewed by:	hrs
Obtained from:	Yandex LLC
MFC after:	1 month
Relnotes:	yes
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D6420
2016-07-18 19:46:31 +00:00
Michael Tuexen
487e2e6507 Add a constant required by RFC 7496.
MFC after:	3 days
2016-07-17 13:33:35 +00:00
Michael Tuexen
8e1b295f09 Fix the PR-SCTP behaviour.
This is done by rrs@.

MFC after:	3 days
2016-07-17 13:14:51 +00:00
Michael Tuexen
26f0250adf Add missing sctps_reasmusrmsgs counter.
Joint work with rrs@.
MFC after:	3 days
2016-07-17 08:31:21 +00:00
Michael Tuexen
643fd575da Deal with a portential memory allocation failure, which was reported
by the clang static code analyzer.
Joint work with rrs@.

MFC after:	3 days
2016-07-16 12:25:37 +00:00