Commit Graph

5630 Commits

Author SHA1 Message Date
Eric van Gyzen
2d9db0bc63 Add GARP retransmit capability
A single gratuitous ARP (GARP) is always transmitted when an IPv4
address is added to an interface, and that is usually sufficient.
However, in some circumstances, such as when a shared address is
passed between cluster nodes, this single GARP may occasionally be
dropped or lost.  This can lead to neighbors on the network link
working with a stale ARP cache and sending packets destined for
that address to the node that previously owned the address, which
may not respond.

To avoid this situation, GARP retransmissions can be enabled by setting
the net.link.ether.inet.garp_rexmit_count sysctl to a value greater
than zero.  The setting represents the maximum number of retransmissions.
The interval between retransmissions is calculated using an exponential
backoff algorithm, doubling each time, so the retransmission intervals
are: {1, 2, 4, 8, 16, ...} (seconds).

Due to the exponential backoff algorithm used for the interval
between GARP retransmissions, the maximum number of retransmissions
is limited to 16 for sanity.  This limit corresponds to a maximum
interval between retransmissions of 2^16 seconds ~= 18 hours.
Increasing this limit is possible, but sending out GARPs spaced
days apart would be of little use.

Submitted by:	David A. Bright <david.a.bright@dell.com>
MFC after:	1 month
Relnotes:	yes
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D7695
2016-10-02 01:42:45 +00:00
Rick Macklem
00b460ffc5 r297225 broke udp_output() for the case where the "addr" argument
is NULL and the function jumps to the "release:" label.
For this case, the "inp" was write locked, but the code attempted to
read unlock it. This patch fixes the problem.
This case could occur for NFS over UDP mounts, where the server was
down for a few minutes under certain circumstances.

Reported by:	bde
Tested by:	bde
Reviewed by:	gnn
MFC after:	2 weeks
2016-10-01 19:39:09 +00:00
Hiren Panchasara
8a56c64533 This adds a sysctl which allows you to disable the TCP hostcache. This is handy
during testing of network related changes where cached entries may pollute your
results, or during known congestion events where you don't want to unfairly
penalize hosts.

Prior to r232346 this would have meant you would break any connection with a sub
1500 MTU, as the hostcache was authoritative. All entries as they stand today
should simply be used to pre populate values for efficiency.

Submitted by:	Jason Wolfe (j at nitrology dot com)
Reviewed by:	rwatson, sbruno, rrs , bz (earlier version)
MFC after:	2 weeks
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D6198
2016-09-30 00:10:57 +00:00
Kurt Lidl
1d7ee746e6 Properly preserve ip_tos bits for IPv4 packets
Restructure code slightly to save ip_tos bits earlier.  Fix the bug
where the ip_tos field is zeroed out before assigning to the iptos
variable. Restore the ip_tos and ip_ver fields only if they have
been zeroed during the pseudo-header checksum calculation.

Reviewed by:	cem, gnn, hiren
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D8053
2016-09-29 19:45:24 +00:00
Julien Charbon
c1b19923a3 Fix an issue with accept_filter introduced with r261242:
As a side effect of r261242 when using accept_filter the
first call to soisconnected() is done earlier in tcp_input()
instead of tcp_do_segment() context.  Restore the expected behaviour.

Note:  This call to soisconnected() seems to be extraneous in all
cases (with or without accept_filter).  Will be addressed in a
separate commit.

PR:			212920
Reported by:		Alexey
Tested by:              Alexey, jch
Sponsored by:           Verisign, Inc.
MFC after:		1 week
2016-09-29 11:18:48 +00:00
Kevin Lo
c7641cd18d Remove ifa_list, use ifa_link (structure field) instead.
While here, prefer if_addrhead (FreeBSD) to if_addrlist (BSD compat) naming
for the interface address list in sctp_bsd_addr.c

Reviewed by:	tuexen
Differential Revision:	https://reviews.freebsd.org/D8051
2016-09-28 13:29:11 +00:00
Mariusz Zaborski
85b0f9de11 capsicum: propagate rights on accept(2)
Descriptor returned by accept(2) should inherits capabilities rights from
the listening socket.

PR:		201052
Reviewed by:	emaste, jonathan
Discussed with:	many
Differential Revision:	https://reviews.freebsd.org/D7724
2016-09-22 09:58:46 +00:00
Michael Tuexen
5cb9165556 Fix the handling of unordered fragmented user messages using DATA chunks.
There were two bugs:
* There was an accounting bug resulting in reporting a too small a_rwnd.
* There are a bug when abandoning messages in the reassembly queue.

MFC after:	4 weeks
2016-09-21 08:28:18 +00:00
Kevin Lo
c3bef61e58 Remove the 4.3BSD compatible macro m_copy(), use m_copym() instead.
Reviewed by:	gnn
Differential Revision:	https://reviews.freebsd.org/D7878
2016-09-15 07:41:48 +00:00
Michael Tuexen
5a17b6ad98 Ensure that the IPPROTO_TCP level socket options
* TCP_KEEPINIT
* TCP_KEEPINTVL
* TCP_KEEPIDLE
* TCP_KEEPCNT
always always report the values currently used when getsockopt()
is used. This wasn't the case when the sysctl-inherited default
values where used.
Ensure that the IPPROTO_TCP level socket option TCP_INFO has the
TCPI_OPT_ECN flag set in the tcpi_options field when ECN support
has been negotiated successfully.

Reviewed by:	rrs, jtl, hiren
MFC after:	1 month
Differential Revision:	7833
2016-09-14 14:48:00 +00:00
Dimitry Andric
6c01c0e0c6 With clang 3.9.0, compiling sys/netinet/igmp.c results in the following
warning:

sys/netinet/igmp.c:546:21: error: implicit conversion from 'int' to 'char' changes value from 148 to -108 [-Werror,-Wconstant-conversion]
        p->ipopt_list[0] = IPOPT_RA;    /* Router Alert Option */
                         ~ ^~~~~~~~
sys/netinet/ip.h:153:19: note: expanded from macro 'IPOPT_RA'
#define IPOPT_RA                148             /* router alert */
                                ^~~

This is because ipopt_list is an array of char, so IPOPT_RA is wrapped
to a negative value.  It would be nice to change ipopt_list to an array
of u_char, but it changes the signature of the public struct ipoption,
so add an explicit cast to suppress the warning.

Reviewed by:	imp
MFC after:	3 days
Differential Revision: https://reviews.freebsd.org/D7777
2016-09-04 17:23:10 +00:00
Hiren Panchasara
06b99bd826 Adjust TCP module fastpath after r304803's cc_ack_received() changes.
Reported by:		hiren, bz, np
Reviewed by:		rrs
Sponsored by:		Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D7664
2016-08-26 19:23:17 +00:00
Hiren Panchasara
e7106d6be2 Update TCPS_HAVERCVDFIN() macro to correctly include all states a connection
can be in after receiving a FIN.

FWIW, NetBSD has this change for quite some time.

This has been tested at Netflix and Limelight in production traffic.

Reported by:	Sam Kumar <samkumar99 at gmail.com> on transport@
Reviewed by:	rrs
MFC after:	4 weeks
Sponsored by:	Limelight Networks
Differential Revision:	 https://reviews.freebsd.org/D7475
2016-08-26 17:48:54 +00:00
Michael Tuexen
91843cf34e Fix a bug, where no SACK is sent when receiving a FORWARD-TSN or
I-FORWARD-TSN chunk before any DATA or I-DATA chunk.

Thanks to Julian Cordes for finding this problem and prividing
packetdrill scripts to reporduce the issue.

MFC after: 3 days
2016-08-26 07:49:23 +00:00
Lawrence Stewart
4b7b743c16 Pass the number of segments coalesced by LRO up the stack by repurposing the
tso_segsz pkthdr field during RX processing, and use the information in TCP for
more correct accounting and as a congestion control input. This is only a start,
and an audit of other uses for the data is left as future work.

Reviewed by:	gallatin, rrs
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D7564
2016-08-25 13:33:32 +00:00
Michael Tuexen
884d8c53e6 When aborting an association, send the ABORT before notifying the upper
layer. For the kernel this doesn't matter, for the userland stack, it does.
While there, silence a clang warning when compiling it in userland.
2016-08-24 06:22:53 +00:00
Ryan Stone
23424a2021 Temporarily disable the optimization from r304436
r304436 attempted to optimize the handling of incoming UDP packet by only
making an expensive call to in_broadcast() if the mbuf was marked as an
broadcast packet.  Unfortunately, this cannot work in the case of point-to-
point L2 protocols like PPP, which have no notion of "broadcast".

Discussions on how to properly fix r304436 are ongoing, but in the meantime
disable the optimization to ensure that no existing network setups are broken.

Reported by:	bms
2016-08-22 15:27:37 +00:00
Michael Tuexen
7fcbd928f8 Improve the locking when sending user messages.
First, keep a ref count on the stcb after looking it up, as
done in the other lookup cases.
Second, before looking again at sp, ensure that it is not
freed, because the assoc is about to be freed.

MFC after: 3 days
2016-08-22 01:45:29 +00:00
Michael Tuexen
26a5d52f03 Remove duplicate code, which is not protected by the appropriate locks.
MFC after: 3 days
2016-08-22 00:40:45 +00:00
Bjoern A. Zeeb
77ecef378a Remove the kernel optoion for IPSEC_FILTERTUNNEL, which was deprecated
more than 7 years ago in favour of a sysctl in r192648.
2016-08-21 18:55:30 +00:00
Marko Zec
9da85a912d Permit disabling net.inet.udp.require_l2_bcast in VIMAGE kernels.
The default value of the tunable introduced in r304436 couldn't be
effectively overrided on VIMAGE kernels, because instead of being
accessed via the appropriate VNET() accessor macro, it was accessed
via the VNET_NAME() macro, which resolves to the (should-be) read-only
master template of initial values of per-VNET data.  Hence, while the
value of udp_require_l2_bcast could be altered on per-VNET basis, the
code in udp_input() would ignore it as it would always read the default
value (one) from the VNET master template.

Silence from: rstone
2016-08-20 22:12:26 +00:00
Michael Tuexen
e19497672b Unbreak sctp_connectx().
MFC after: 3 days
2016-08-20 20:15:36 +00:00
Ryan Stone
11f2a7cd67 Fix unlocked access to ifnet address list
in_broadcast() was iterating over the ifnet address list without
first taking an IF_ADDR_RLOCK.  This could cause a panic if a
concurrent operation modified the list.

Reviewed by: bz
MFC after: 2 months
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D7227
2016-08-18 22:59:10 +00:00
Ryan Stone
41029db13f Don't check for broadcast IPs on non-bcast pkts
in_broadcast() can be quite expensive, so skip calling it if the
incoming mbuf wasn't sent to a broadcast L2 address in the first
place.

Reviewed by: gnn
MFC after: 2 months
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D7309
2016-08-18 22:59:05 +00:00
Ryan Stone
90cc51a1ab Don't iterate over the ifnet addr list in ip_output()
For almost every packet that is transmitted through ip_output(),
a call to in_broadcast() was made to decide if the destination
IP was a broadcast address.  in_broadcast() iterates over the
ifnet's address to find a source IP matching the subnet of the
destination IP, and then checks if the IP is a broadcast in that
subnet.

This is completely redundant as we have already performed the
route lookup, so the source IP is already known.  Just use that
address to directly check whether the destination IP is a
broadcast address or not.

MFC after:	2 months
Sponsored By:	EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D7266
2016-08-18 22:59:00 +00:00
Randall Stewart
eadd00f81a A few more wording tweaks as suggested (with some modifications
as well) by Ravi Pokala. Thanks for the comments :-)
Sponsored by: Netflix Inc.
2016-08-16 15:17:36 +00:00
Randall Stewart
587d67c008 Here we update the modular tcp to be able to switch to an
alternate TCP stack in other then the closed state (pre-listen/connect).
The idea is that *if* that is supported by the alternate stack, it
is asked if its ok to switch. If it approves the "handoff" then we
allow the switch to happen. Also the fini() function now gets a flag
to tell if you are switching away *or* the tcb is destroyed. The
init() call into the alternate stack is moved to the end so the
tcb is more fully formed before the init transpires.

Sponsored by:	Netflix Inc.
Differential Revision:	D6790
2016-08-16 15:11:46 +00:00
Randall Stewart
0fa047b98c Comments describing how to properly use the new lock_add functions
and its respective companion.

Sponsored by:	Netflix Inc.
2016-08-16 13:08:03 +00:00
Randall Stewart
b07fef500b This cleans up the timer code in TCP and also makes it so we do not
take the INFO lock *unless* we are really going to delete the TCB.

Differential Revision:	D7136
2016-08-16 12:40:56 +00:00
Sepherosa Ziehau
8452c1b345 tcp/lro: Make # of LRO entries tunable
Reviewed by:	hps, gallatin
Obtained from:	rrs, gallatin
MFC after:	2 weeks
Sponsored by:	Netflix (rrs, gallatin), Microsoft (sephe)
Differential Revision:	https://reviews.freebsd.org/D7499
2016-08-16 06:40:27 +00:00
Michael Tuexen
dcb436c936 Ensure that sctp_it_ctl.cur_it does not point to a free object (during
a small time window).
Thanks to Byron Campen for reporting the issue and
suggesting a fix.

MFC after: 3 days
2016-08-15 10:16:08 +00:00
Andrey V. Elsukov
57fb3b7a78 Add stats reset command implementation to NPTv6 module
to be able reset statistics counters.

Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2016-08-13 16:45:14 +00:00
Andrey V. Elsukov
d8caf56e9e Add ipfw_nat64 module that implements stateless and stateful NAT64.
The module works together with ipfw(4) and implemented as its external
action module.

Stateless NAT64 registers external action with name nat64stl. This
keyword should be used to create NAT64 instance and to address this
instance in rules. Stateless NAT64 uses two lookup tables with mapped
IPv4->IPv6 and IPv6->IPv4 addresses to perform translation.

A configuration of instance should looks like this:
 1. Create lookup tables:
 # ipfw table T46 create type addr valtype ipv6
 # ipfw table T64 create type addr valtype ipv4
 2. Fill T46 and T64 tables.
 3. Add rule to allow neighbor solicitation and advertisement:
 # ipfw add allow icmp6 from any to any icmp6types 135,136
 4. Create NAT64 instance:
 # ipfw nat64stl NAT create table4 T46 table6 T64
 5. Add rules that matches the traffic:
 # ipfw add nat64stl NAT ip from any to table(T46)
 # ipfw add nat64stl NAT ip from table(T64) to 64:ff9b::/96
 6. Configure DNS64 for IPv6 clients and add route to 64:ff9b::/96
    via NAT64 host.

Stateful NAT64 registers external action with name nat64lsn. The only
one option required to create nat64lsn instance - prefix4. It defines
the pool of IPv4 addresses used for translation.

A configuration of instance should looks like this:
 1. Add rule to allow neighbor solicitation and advertisement:
 # ipfw add allow icmp6 from any to any icmp6types 135,136
 2. Create NAT64 instance:
 # ipfw nat64lsn NAT create prefix4 A.B.C.D/28
 3. Add rules that matches the traffic:
 # ipfw add nat64lsn NAT ip from any to A.B.C.D/28
 # ipfw add nat64lsn NAT ip6 from any to 64:ff9b::/96
 4. Configure DNS64 for IPv6 clients and add route to 64:ff9b::/96
    via NAT64 host.

Obtained from:	Yandex LLC
Relnotes:	yes
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D6434
2016-08-13 16:09:49 +00:00
Mike Karels
bca0155f64 Fix kernel build with TCP_RFC7413 option
The current in_pcb.h includes route.h, which includes sockaddr structures.
Including <sys/socketvar.h> should require <sys/socket.h>; add it in
the appropriate place.

PR: 211385
Submitted by: Sergey Kandaurov and iron at mail.ua
Reviewed by: gnn
Approved by: gnn (mentor)
MFC after: 1 day
2016-08-11 23:52:24 +00:00
Andrey V. Elsukov
d6eb9b0249 Restore "nat global" support.
Now zero value of arg1 used to specify "tablearg", use the old "tablearg"
value for "nat global". Introduce new macro IP_FW_NAT44_GLOBAL to replace
hardcoded magic number to specify "nat global". Also replace 65535 magic
number with corresponding macro. Fix typo in comments.

PR:		211256
Tested by:	Victor Chernov
MFC after:	3 days
2016-08-11 10:10:10 +00:00
Michael Tuexen
be46a7c54d Improve a consistency check to not detect valid cases for
unordered user messages using DATA chunks as invalid ones.
While there, ensure that error causes are provided when
sending ABORT chunks in case of reassembly problems detected.
Thanks to Taylor Brandstetter for making me aware of this problem.
MFC after:	3 days
2016-08-10 17:19:33 +00:00
Stephen J. Kiernan
0ce1624d0e Move IPv4-specific jail functions to new file netinet/in_jail.c
_prison_check_ip4 renamed to prison_check_ip4_locked

Move IPv6-specific jail functions to new file netinet6/in6_jail.c
_prison_check_ip6 renamed to prison_check_ip6_locked

Add appropriate prototypes to sys/sys/jail.h

Adjust kern_jail.c to call prison_check_ip4_locked and
prison_check_ip6_locked accordingly.

Add netinet/in_jail.c and netinet6/in6_jail.c to the list of files that
need to be built when INET and INET6, respectively, are configured in the
kernel configuration file.

Reviewed by:	jtl
Approved by:	sjg (mentor)
Sponsored by:	Juniper Networks, Inc.
Differential Revision:	https://reviews.freebsd.org/D6799
2016-08-09 02:16:21 +00:00
Michael Tuexen
d6e73fa13d Fix the sending of FORWARD-TSN and I-FORWARD-TSN chunks. The
last SID/SSN pair wasn't filled in.
Thanks to Julian Cordes for providing a packetdrill script
triggering the issue and making me aware of the bug.

MFC after:	3 days
2016-08-08 13:52:18 +00:00
Michael Tuexen
9c5ca6f247 Fix a locking issue found by stress testing with tsctp.
The inp read lock neeeds to be held when considering control->do_not_ref_stcb.
MFC after:	3 days
2016-08-08 08:20:10 +00:00
Michael Tuexen
124d851acf Consistently check for unsent data on the stream queues.
MFC after:	3 days
2016-08-07 23:04:46 +00:00
Michael Tuexen
4d58b0c3a9 Remove stream queue entry consistently from wheel.
While there, improve the handling of drain.

MFC after:	3 days
2016-08-07 12:51:13 +00:00
Michael Tuexen
cf46cace5c Don't modify a structure without holding a reference count on it.
MFC after:	3 days
2016-08-06 15:29:46 +00:00
Michael Tuexen
bfe7e9328c Mark an unused parameter as such.
MFC after:	3 days
2016-08-06 12:51:07 +00:00
Michael Tuexen
d1ea5fa9c2 Fix various bugs in relation to the I-DATA chunk support
This is joint work with rrs.

MFC after:	3 days
2016-08-06 12:33:15 +00:00
Sepherosa Ziehau
b9ec6f0b02 tcp/lro: If timestamps mismatch or it's a FIN, force flush.
This keeps the segments/ACK/FIN delivery order.

Before this patch, it was observed: if A sent FIN immediately after
an ACK, B would deliver FIN first to the TCP stack, then the ACK.
This out-of-order delivery causes one unnecessary ACK sent from B.

Reviewed by:	gallatin, hps
Obtained from:  rrs, gallatin
Sponsored by:	Netflix (rrs, gallatin), Microsoft (sephe)
Differential Revision:	https://reviews.freebsd.org/D7415
2016-08-05 09:08:00 +00:00
Sepherosa Ziehau
05cde7efa6 tcp/lro: Implement hash table for LRO entries.
This significantly improves HTTP workload performance and reduces
HTTP workload latency.

Reviewed by:	rrs, gallatin, hps
Obtained from:	rrs, gallatin
Sponsored by:	Netflix (rrs, gallatin) , Microsoft (sephe)
Differential Revision:	https://reviews.freebsd.org/D6689
2016-08-02 06:36:47 +00:00
Andrew Gallatin
d4c22202e6 Rework IPV6 TCP path MTU discovery to match IPv4
- Re-write tcp_ctlinput6() to closely mimic the IPv4 tcp_ctlinput()

- Now that tcp_ctlinput6() updates t_maxseg, we can allow ip6_output()
  to send TCP packets without looking at the tcp host cache for every
  single transmit.

- Make the icmp6 code mimic the IPv4 code & avoid returning
  PRC_HOSTDEAD because it is so expensive.

Without these changes in place, every TCP6 pmtu discovery or host
unreachable ICMP resulted in a call to in6_pcbnotify() which walks the
tcbinfo table with the write lock held.  Because the tcbinfo table is
shared between IPv4 and IPv6, this causes huge scalabilty issues on
servers with lots of (~100K) TCP connections, to the point where even
a small percent of IPv6 traffic had a disproportionate impact on
overall throughput.

Reviewed by:	bz, rrs, ae (all earlier versions), lstewart (in Netflix's tree)
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D7272
2016-08-01 17:02:21 +00:00
Andrew Gallatin
0e3b891988 Call tcp_notify() directly to shoot down routes, rather than
calling in_pcbnotifyall().

This avoids lock contention on tcbinfo due to in_pcbnotifyall()
holding the tcbinfo write lock while walking all connections.

Reviewed by:	rrs, karels
MFC after:	2 weeks
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D7251
2016-07-28 19:32:25 +00:00
Stephen J. Kiernan
f11ec79842 Remove BSD and USL copyright and update license block in in_prot.c, as the
code in this file was written by Robert N. M. Waston.

Move cr_can* prototypes from sys/systm.h to sys/proc.h

Reported by:	rwatson
Reviewed by:	rwatson
Approved by:	sjg (mentor)
Differential Revision:	https://reviews.freebsd.org/D7345
2016-07-28 18:39:30 +00:00
Stephen J. Kiernan
4ac21b4f09 Prepare for network stack as a module
- Move cr_canseeinpcb to sys/netinet/in_prot.c in order to separate the
   INET and INET6-specific code from the rest of the prot code (It is only
   used by the network stack, so it makes sense for it to live with the
   other network stack code.)
 - Move cr_canseeinpcb prototype from sys/systm.h to netinet/in_systm.h
 - Rename cr_seeotheruids to cr_canseeotheruids and cr_seeothergids to
   cr_canseeothergids, make them non-static, and add prototypes (so they
   can be seen/called by in_prot.c functions.)
 - Remove sw_csum variable from ip6_forward in ip6_forward.c, as it is an
   unused variable.

Reviewed by:	gnn, jtl
Approved by:	sjg (mentor)
Sponsored by:	Juniper Networks, Inc.
Differential Revision:	https://reviews.freebsd.org/D2901
2016-07-27 20:34:09 +00:00