Commit Graph

6027 Commits

Author SHA1 Message Date
Jonathan T. Looney
7fb2986ff6 If the INP lock is uncontested, avoid taking a reference and jumping
through the lock-switching hoops.

A few of the INP lookup operations that lock INPs after the lookup do
so using this mechanism (to maintain lock ordering):

1. Lock lookup structure.
2. Find INP.
3. Acquire reference on INP.
4. Drop lock on lookup structure.
5. Acquire INP lock.
6. Drop reference on INP.

This change provides a slightly shorter path for cases where the INP
lock is uncontested:

1. Lock lookup structure.
2. Find INP.
3. Try to acquire the INP lock.
4. If successful, drop lock on lookup structure.

Of course, if the INP lock is contested, the functions will need to
revert to the previous way of switching locks safely.

This saves a few atomic operations when the INP lock is uncontested.

Discussed with:	gallatin, rrs, rwatson
MFC after:	2 weeks
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D12911
2018-03-21 15:54:46 +00:00
Lawrence Stewart
370efe5ac8 Add support for the experimental Internet-Draft "TCP Alternative Backoff with
ECN (ABE)" proposal to the New Reno congestion control algorithm module.
ABE reduces the amount of congestion window reduction in response to
ECN-signalled congestion relative to the loss-inferred congestion response.

More details about ABE can be found in the Internet-Draft:
https://tools.ietf.org/html/draft-ietf-tcpm-alternativebackoff-ecn

The implementation introduces four new sysctls:

- net.inet.tcp.cc.abe defaults to 0 (disabled) and can be set to non-zero to
  enable ABE for ECN-enabled TCP connections.

- net.inet.tcp.cc.newreno.beta and net.inet.tcp.cc.newreno.beta_ecn set the
  multiplicative window decrease factor, specified as a percentage, applied to
  the congestion window in response to a loss-based or ECN-based congestion
  signal respectively. They default to the values specified in the draft i.e.
  beta=50 and beta_ecn=80.

- net.inet.tcp.cc.abe_frlossreduce defaults to 0 (disabled) and can be set to
  non-zero to enable the use of standard beta (50% by default) when repairing
  loss during an ECN-signalled congestion recovery episode. It enables a more
  conservative congestion response and is provided for the purposes of
  experimentation as a result of some discussion at IETF 100 in Singapore.

The values of beta and beta_ecn can also be set per-connection by way of the
TCP_CCALGOOPT TCP-level socket option and the new CC_NEWRENO_BETA or
CC_NEWRENO_BETA_ECN CC algo sub-options.

Submitted by:	Tom Jones <tj@enoti.me>
Tested by:	Tom Jones <tj@enoti.me>, Grenville Armitage <garmitage@swin.edu.au>
Relnotes:	Yes
Differential Revision:	https://reviews.freebsd.org/D11616
2018-03-19 16:37:47 +00:00
Alexander V. Chernikov
1435dcd94f Fix outgoing TCP/UDP packet drop on arp/ndp entry expiration.
Current arp/nd code relies on the feedback from the datapath indicating
 that the entry is still used. This mechanism is incorporated into the
 arpresolve()/nd6_resolve() routines. After the inpcb route cache
 introduction, the packet path for the locally-originated packets changed,
 passing cached lle pointer to the ether_output() directly. This resulted
 in the arp/ndp entry expire each time exactly after the configured max_age
 interval. During the small window between the ARP/NDP request and reply
 from the router, most of the packets got lost.

Fix this behaviour by plugging datapath notification code to the packet
 path used by route cache. Unify the notification code by using single
 inlined function with the per-AF callbacks.

Reported by:	sthaug at nethelp.no
Reviewed by:	ae
MFC after:	2 weeks
2018-03-17 17:05:48 +00:00
Michael Tuexen
1574b1e41e Set the inp_vflag consistently for accepted TCP/IPv6 connections when
net.inet6.ip6.v6only=0.

Without this patch, the inp_vflag would have INP_IPV4 and the
INP_IPV6 flags for accepted TCP/IPv6 connections if the sysctl
variable net.inet6.ip6.v6only is 0. This resulted in netstat
to report the source and destination addresses as IPv4 addresses,
even they are IPv6 addresses.

PR:			226421
Reviewed by:		bz, hiren, kib
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D13514
2018-03-16 15:26:07 +00:00
Sean Bruno
d7fb35d13a Update tcp_lro with tested bugfixes from Netflix and LLNW:
rrs - Lets make the LRO code look for true dup-acks and window update acks
          fly on through and combine.
    rrs - Make the LRO engine a bit more aware of ack-only seq space. Lets not
          have it incorrectly wipe out newer acks for older acks when we have
          out-of-order acks (common in wifi environments).
    jeggleston - LRO eating window updates

Based on all of the above I think we are RFC compliant doing it this way:

https://tools.ietf.org/html/rfc1122

section 4.2.2.16

"Note that TCP has a heuristic to select the latest window update despite
possible datagram reordering; as a result, it may ignore a window update with
a smaller window than previously offered if neither the sequence number nor the
acknowledgment number is increased."

Submitted by:	Kevin Bowling <kevin.bowling@kev009.com>
Reviewed by:	rstone gallatin
Sponsored by:	NetFlix and Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D14540
2018-03-09 00:08:43 +00:00
Michael Tuexen
1c714531e8 When checking the TCP fast cookie length, conststently also check
for the minimum length.

This fixes a bug where cookies of length 2 bytes (which is smaller
than the minimum length of 4) is provided by the server.

Sponsored by:	Netflix, Inc.
2018-02-27 22:12:38 +00:00
Patrick Kelsey
1f13c23f3d Ensure signed comparison to avoid false trip of assert during VNET teardown.
Reported by:	lwhsu
MFC after:	1 month
2018-02-26 20:31:16 +00:00
Patrick Kelsey
18a7530938 Greatly reduce the number of #ifdefs supporting the TCP_RFC7413 kernel option.
The conditional compilation support is now centralized in
tcp_fastopen.h and tcp_var.h. This doesn't provide the minimum
theoretical code/data footprint when TCP_RFC7413 is disabled, but
nearly all the TFO code should wind up being removed by the optimizer,
the additional footprint in the syncache entries is a single pointer,
and the additional overhead in the tcpcb is at the end of the
structure.

This enables the TCP_RFC7413 kernel option by default in amd64 and
arm64 GENERIC.

Reviewed by:	hiren
MFC after:	1 month
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D14048
2018-02-26 03:03:41 +00:00
Patrick Kelsey
c560df6f12 This is an implementation of the client side of TCP Fast Open (TFO)
[RFC7413]. It also includes a pre-shared key mode of operation in
which the server requires the client to be in possession of a shared
secret in order to successfully open TFO connections with that server.

The names of some existing fastopen sysctls have changed (e.g.,
net.inet.tcp.fastopen.enabled -> net.inet.tcp.fastopen.server_enable).

Reviewed by:	tuexen
MFC after:	1 month
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D14047
2018-02-26 02:53:22 +00:00
Patrick Kelsey
798caa2ee5 Fix harmless locking bug in tfp_fastopen_check_cookie().
The keylist lock was not being acquired early enough.  The only side
effect of this bug is that the effective add time of a new key could
be slightly later than it would have been otherwise, as seen by a TFO
client.

Reviewed by:	tuexen
MFC after:	1 month
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D14046
2018-02-26 02:43:26 +00:00
Andrey V. Elsukov
b2fe54bc8b Reinitialize IP header length after checksum calculation. It is used
later by TCP-MD5 code.

This fixes the problem with broken TCP-MD5 over IPv4 when NIC has
disabled TCP checksum offloading.

PR:		223835
MFC after:	1 week
2018-02-10 10:13:17 +00:00
Andrey V. Elsukov
b99a682320 Rework ipfw dynamic states implementation to be lockless on fast path.
o added struct ipfw_dyn_info that keeps all needed for ipfw_chk and
  for dynamic states implementation information;
o added DYN_LOOKUP_NEEDED() macro that can be used to determine the
  need of new lookup of dynamic states;
o ipfw_dyn_rule now becomes obsolete. Currently it used to pass
  information from kernel to userland only.
o IPv4 and IPv6 states now described by different structures
  dyn_ipv4_state and dyn_ipv6_state;
o IPv6 scope zones support is added;
o ipfw(4) now depends from Concurrency Kit;
o states are linked with "entry" field using CK_SLIST. This allows
  lockless lookup and protected by mutex modifications.
o the "expired" SLIST field is used for states expiring.
o struct dyn_data is used to keep generic information for both IPv4
  and IPv6;
o struct dyn_parent is used to keep O_LIMIT_PARENT information;
o IPv4 and IPv6 states are stored in different hash tables;
o O_LIMIT_PARENT states now are kept separately from O_LIMIT and
  O_KEEP_STATE states;
o per-cpu dyn_hp pointers are used to implement hazard pointers and they
  prevent freeing states that are locklessly used by lookup threads;
o mutexes to protect modification of lists in hash tables now kept in
  separate arrays. 65535 limit to maximum number of hash buckets now
  removed.
o Separate lookup and install functions added for IPv4 and IPv6 states
  and for parent states.
o By default now is used Jenkinks hash function.

Obtained from:	Yandex LLC
MFC after:	42 days
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D12685
2018-02-07 18:59:54 +00:00
John Baldwin
f17985319d Export tcp_always_keepalive for use by the Chelsio TOM module.
This used to work by accident with ld.bfd even though always_keepalive
was marked as static. LLD honors static more correctly, so export this
variable properly (including moving it into the tcp_* namespace).

Reviewed by:	bz, emaste
MFC after:	2 weeks
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D14129
2018-01-30 23:01:37 +00:00
Michael Tuexen
cf6339882e Add constant for the PAD chunk as defined in RFC 4820.
This will be used by traceroute and traceroute6 soon.

MFC after:	1 week
2018-01-27 13:46:55 +00:00
Michael Tuexen
07e75d0a37 Update references in comments, since the IDs have become an RFC long
time ago. Also cleanup whitespaces. No functional change.

MFC after:	1 week
2018-01-27 13:43:03 +00:00
Conrad Meyer
222daa421f style: Remove remaining deprecated MALLOC/FREE macros
Mechanically replace uses of MALLOC/FREE with appropriate invocations of
malloc(9) / free(9) (a series of sed expressions).  Something like:

* MALLOC(a, b, ... -> a = malloc(...
* FREE( -> free(
* free((caddr_t) -> free(

No functional change.

For now, punt on modifying contrib ipfilter code, leaving a definition of
the macro in its KMALLOC().

Reported by:	jhb
Reviewed by:	cy, imp, markj, rmacklem
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14035
2018-01-25 22:25:13 +00:00
Mark Johnston
5557762b72 Use tcpinfoh_t for TCP headers in the tcp:::debug-{drop,input} probes.
The header passed to these probes has some fields converted to host
order by tcp_fields_to_host(), so the tcpinfo_t translator doesn't do
what we want.

Submitted by:	Hannes Mehnert
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D12647
2018-01-25 15:35:34 +00:00
Navdeep Parhar
09b0b8c058 Do not generate illegal mbuf chains during IP fragment reassembly. Only
the first mbuf of the reassembled datagram should have a pkthdr.

This was discovered with cxgbe(4) + IPSEC + ping with payload more than
interface MTU.  cxgbe can generate !M_WRITEABLE mbufs and this results
in m_unshare being called on the reassembled datagram, and it complains:

panic: m_unshare: m0 0xfffff80020f82600, m 0xfffff8005d054100 has M_PKTHDR

PR:		224922
Reviewed by:	ae@
MFC after:	1 week
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D14009
2018-01-24 05:09:21 +00:00
Ryan Stone
fc21c53f63 Reduce code duplication for inpcb route caching
Add a new macro to clear both the L3 and L2 route caches, to
hopefully prevent future instances where only the L3 cache was
cleared when both should have been.

MFC after:	1 week
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13989
Reviewed by:	karels
2018-01-23 03:15:39 +00:00
Michael Tuexen
e9a3a1b13c Fix a bug related to fast retransmissions.
When processing a SACK advancing the cumtsn-ack in fast recovery,
increment the miss-indications for all TSN's reported as missing.

Thanks to Fabian Ising for finding the bug and to Timo Voelker
for provinding a fix.

This fix moves also CMT related initialisation of some variables
to a more appropriate place.

MFC after:	1 week
2018-01-16 21:58:38 +00:00
Michael Tuexen
46bf534caf Don't provide a (meaningless) cmsg when proving a notification
in a recvmsg() call.

MFC after:	1 week
2018-01-15 21:59:20 +00:00
Pedro F. Giffuni
84f210952c libalias: small memory allocation cleanups.
Make the calloc wrappers behave as expected by using mallocarray.
It is rather weird that the malloc wrappers also zeroes the memory: update
a comment to reflect at least two cases where it is expected.

Reviewed by:	tuexen
2018-01-12 23:12:30 +00:00
Cy Schubert
f813e520c2 Correct the comment describing badrs which is bad router solicitiation,
not bad router advertisement.

MFC after:	3 days
2017-12-29 07:23:18 +00:00
Michael Tuexen
fa5867cbd6 White cleanups. 2017-12-26 16:33:55 +00:00
Michael Tuexen
f34a628e7e Clearify CID 1008197.
MFC after:	3 days
2017-12-26 16:12:04 +00:00
Michael Tuexen
0460135495 Clearify issue reported in CID 1008198.
MFC after:	3 days
2017-12-26 16:06:11 +00:00
Michael Tuexen
f6ea123171 Fix CID 1008428.
MFC after:	1 week
2017-12-26 15:29:11 +00:00
Michael Tuexen
4830aee72f Fix CID 1008936. 2017-12-26 15:24:42 +00:00
Michael Tuexen
c9256941d0 Allow the first (and second) argument of sn_calloc() be a sum.
This fixes a bug reported in
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=224103
PR:		224103
2017-12-26 14:37:47 +00:00
Michael Tuexen
cd90150413 When adding support for sending SCTP packets containing an ABORT chunk
to ipfw in https://svnweb.freebsd.org/changeset/base/326233,
a dependency on the SCTP stack was added to ipfw by accident.

This was noted by Kevel Bowling in https://reviews.freebsd.org/D13594
where also a solution was suggested. This patch is based on Kevin's
suggestion, but implements the required SCTP checksum computation
without any dependency on other SCTP sources.

While there, do some cleanups and improve comments.

Thanks to Kevin Kevin Browling for reporting the issue and suggesting
a fix.
2017-12-26 12:35:02 +00:00
Alexander Kabaev
151ba7933a Do pass removing some write-only variables from the kernel.
This reduces noise when kernel is compiled by newer GCC versions,
such as one used by external toolchain ports.

Reviewed by: kib, andrew(sys/arm and sys/arm64), emaste(partial), erj(partial)
Reviewed by: jhb (sys/dev/pci/* sys/kern/vfs_aio.c and sys/kern/kern_synch.c)
Differential Revision: https://reviews.freebsd.org/D10385
2017-12-25 04:48:39 +00:00
Andrey V. Elsukov
2aad62408b Fix mbuf leak when TCPMD5_OUTPUT() method returns error.
PR:		223817
MFC after:	1 week
2017-12-14 12:54:20 +00:00
Michael Tuexen
cd6340caf7 Cleaup, no functional change. 2017-12-13 17:11:57 +00:00
Gleb Smirnoff
66492fea49 Separate out send buffer autoscaling code into function, so that
alternative TCP stacks may reuse it instead of pasting.

Obtained from:	Netflix
2017-12-07 22:36:58 +00:00
Michael Tuexen
9f0abda051 Retire SCTP_WITH_NO_CSUM option.
This option was used in the early days to allow performance measurements
extrapolating the use of SCTP checksum offloading. Since this feature
is now available, get rid of this option.
This also un-breaks the LINT kernel. Thanks to markj@ for making me
aware of the problem.
2017-12-07 22:19:08 +00:00
Pedro F. Giffuni
fe267a5590 sys: general adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

No functional change intended.
2017-11-27 15:23:17 +00:00
Michael Tuexen
665c8a2ee5 Add to ipfw support for sending an SCTP packet containing an ABORT chunk.
This is similar to the TCP case. where a TCP RST segment can be sent.

There is one limitation: When sending an ABORT in response to an incoming
packet, it should be tested if there is no ABORT chunk in the received
packet. Currently, it is only checked if the first chunk is an ABORT
chunk to avoid parsing the whole packet, which could result in a DOS attack.

Thanks to Timo Voelker for helping me to test this patch.
Reviewed by: bcr@ (man page part), ae@ (generic, non-SCTP part)
Differential Revision:	https://reviews.freebsd.org/D13239
2017-11-26 18:19:01 +00:00
Michael Tuexen
18442f0a5b Fix SPDX line as suggested by pfg 2017-11-24 19:38:59 +00:00
Michael Tuexen
ad15e1548f Unbreak compilation when using SCTP_DETAILED_STR_STATS option.
MFC after:	1 week
2017-11-24 12:18:48 +00:00
Michael Tuexen
b7d2b5d5b1 Add SPDX line. 2017-11-24 11:25:53 +00:00
Mark Johnston
7a5c730561 Use the right variable for the IP header parameter to tcp:::send.
This addresses a regression from r311225.

MFC after:	1 week
2017-11-22 14:13:40 +00:00
Pedro F. Giffuni
51369649b0 sys: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
2017-11-20 19:43:44 +00:00
Michael Tuexen
3e87bccde3 Fix the handling of ERROR chunks which a lot of error causes.
While there, clean up the code.
Thanks to Felix Weinrank who found the bug by using fuzz-testing
the SCTP userland stack.

MFC after:	1 week
2017-11-15 22:13:10 +00:00
Michael Tuexen
d0f6ab7920 Simply the code and use the full buffer for contigous chunk representation.
MFC after:	1 week
2017-11-14 02:30:21 +00:00
Gleb Smirnoff
3e21cbc802 Style r320614: don't initialize at declaration, new line after declarations,
shorten variable name to avoid extra long lines.
No functional changes.
2017-11-13 22:16:47 +00:00
Michael Tuexen
469a65d1f8 Cleanup the handling of control chunks. While there fix some minor
bug related to clearing the assoc retransmit counter and the dup TSN
handling of NR-SACK chunks.

MFC after:	3 days
2017-11-12 21:43:33 +00:00
Konstantin Belousov
06193f0be0 Use hardware timestamps to report packet timestamps for SO_TIMESTAMP
and other similar socket options.

Provide new control message SCM_TIME_INFO to supply information about
timestamp.  Currently it indicates that the timestamp was
hardware-assisted and high-precision, for software timestamps the
message is not returned.  Reserved fields are added to ABI to report
additional info about it, it is expected that raw hardware clock value
might be useful for some applications.

Reviewed by:	gallatin (previous version), hselasky
Sponsored by:	Mellanox Technologies
MFC after:	2 weeks
X-Differential revision:	https://reviews.freebsd.org/D12638
2017-11-07 09:46:26 +00:00
Michael Tuexen
253a63b817 Fix an accounting bug where data was counted twice if on the read
queue and on the ordered or unordered queue.
While there, improve the checking in INVARIANTs when computing the
a_rwnd.

MFC after:	3 days
2017-11-05 11:59:33 +00:00
Michael Tuexen
28a6adde1d Allow the setting of the MTU for future paths using an SCTP socket option.
This functionality was missing.

MFC after:	1 week
2017-11-03 20:46:12 +00:00
Michael Tuexen
ba5fc4cf78 Fix the reporting of the MTU for SCTP sockets when using IPv6.
MFC after:	1 week
2017-11-01 16:32:11 +00:00
Michael Tuexen
966dfbf910 Fix parsing error when processing cmsg in SCTP send calls. Thei bug is
related to a signed/unsigned mismatch.
This should most likely fix the issue in sctp_sosend reported by
Dmitry Vyukov on the freebsd-hackers mailing list and found by
running syzkaller.
2017-10-27 19:27:05 +00:00
Michael Tuexen
8d9b040dd4 Fix a bug reported by Felix Weinrank using the libfuzzer on the
userland stack.

MFC after:	3 days
2017-10-25 09:12:22 +00:00
Michael Tuexen
701492a5f6 Fix a bug in handling special ABORT chunks.
Thanks to Felix Weinrank for finding this issue using libfuzzer with
the userland stack.

MFC after:	3 days
2017-10-24 16:24:12 +00:00
Michael Tuexen
adc59f7f46 Fix a locking issue found by running AFL on the userland stack.
Thanks to Felix Weinrank for reporting the issue.

MFC after:	3 days
2017-10-24 14:28:56 +00:00
Alexander Motin
81098a018e Relax per-ifnet cif_vrs list double locking in carp(4).
In all cases where cif_vrs list is modified, two locks are held: per-ifnet
CIF_LOCK and global carp_sx.  It means to read that list only one of them
is enough to be held, so we can skip CIF_LOCK when we already have carp_sx.

This fixes kernel panic, caused by attempts of copyout() to sleep while
holding non-sleepable CIF_LOCK mutex.

Discussed with:	glebius
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2017-10-19 09:01:15 +00:00
Michael Tuexen
af03054c8a Fix a signed/unsigned warning.
MFC after:	1 week
2017-10-18 21:08:35 +00:00
Michael Tuexen
7f75695a3e Abort an SCTP association, when a DATA chunk is followed by an unknown
chunk with a length smaller than the minimum length.

Thanks to Felix Weinrank for making me aware of the problem.
MFC after:	3 days
2017-10-18 20:17:44 +00:00
Michael Tuexen
0d5af38ceb Revert change which got in accidently. 2017-10-18 18:59:35 +00:00
Michael Tuexen
3ed8d364a7 Fix a bug introduced in r324638.
Thanks to Felix Weinrank for making me aware of this.

MFC after:	3 days
2017-10-18 18:56:56 +00:00
Michael Tuexen
80a2d1406f Fix the handling of parital and too short chunks.
Ensure that the current behaviour is consistent: stop processing
of the chunk, but finish the processing of the previous chunks.

This behaviour might be changed in a later commit to ABORT the
assoication due to a protocol violation, but changing this
is a separate issue.

MFC after:	3 days
2017-10-15 19:33:30 +00:00
Michael Tuexen
8c8e10b763 Code cleanup, not functional change.
This avoids taking a pointer of a packed structure which allows simpler
compilation of the userland stack.

MFC after:	1 week
2017-10-14 10:02:59 +00:00
Gleb Smirnoff
3bdf4c4274 Declare more TCP globals in tcp_var.h, so that alternative TCP stacks
can use them.  Gather all TCP tunables in tcp_var.h in one place and
alphabetically sort them, to ease maintainance of the list.

Don't copy and paste declarations in tcp_stacks/fastpath.c.
2017-10-11 20:36:09 +00:00
Gleb Smirnoff
e29c55e4bb Declare pmtud_blackhole global variables in tcp_timer.h, so that
alternative TCP stacks can legally use them.
2017-10-06 20:33:40 +00:00
Michael Tuexen
ff76c8c9fd Ensure that the accept ABORT chunks with the T-bit set only the
a non-zero matching peer tag is provided.

MFC after:	1 week
2017-10-05 13:29:54 +00:00
Julien Charbon
5fcd2d9bfc Forgotten bits in r324179: Include sys/syslog.h if INVARIANTS is not defined
MFC after:	1 week
X-MFC with:	r324179
Pointy hat to:	jch
2017-10-02 09:45:17 +00:00
Patrick Kelsey
3f43239f21 The soisconnected() call removed from syncache_socket() in r307966 was
not extraneous in the TCP Fast Open (TFO) passive-open case.  In the
TFO passive-open case, syncache_socket() is being called during
processing of a TFO SYN bearing a valid cookie, and a call to
soisconnected() is required in order to allow the application to
immediately consume any data delivered in the SYN and to have a chance
to generate response data to accompany the SYN-ACK.  The removal of
this call to soisconnected() effectively converted all TFO passive
opens to having the same RTT cost as a standard 3WHS.

This commit adds a call to soisconnected() to syncache_tfo_expand() so
that it is only in the TFO passive-open path, thereby restoring TFO
passve-open RTT performance and preserving the non-TFO connection-rate
performance gains realized by r307966.

MFC after:	1 week
Sponsored by:	Limelight Networks
2017-10-01 23:37:17 +00:00
Julien Charbon
dfa1f80ce9 Fix an infinite loop in tcp_tw_2msl_scan() when an INP_TIMEWAIT inp has
been destroyed before its tcptw with INVARIANTS undefined.

This is a symmetric change of r307551:

A INP_TIMEWAIT inp should not be destroyed before its tcptw, and INVARIANTS
will catch this case.  If INVARIANTS is undefined it will emit a log(LOG_ERR)
and avoid a hard to debug infinite loop in tcp_tw_2msl_scan().

Reported by:		Ben Rubson, hselasky
Submitted by:		hselasky
Tested by:		Ben Rubson, jch
MFC after:		1 week
Sponsored by:		Verisign, inc
Differential Revision:	https://reviews.freebsd.org/D12267
2017-10-01 21:20:28 +00:00
Andrey V. Elsukov
f415d666c3 Some mbuf related fixes in icmp_error()
* check mbuf length before doing mtod() and accessing to IP header;
* update oip pointer and all depending pointers after m_pullup();
* remove extra checks and extra parentheses, wrap long lines;

PR:		222670
Reported by:	Prabhakar Lakhera
MFC after:	1 week
2017-09-29 06:24:45 +00:00
Michael Tuexen
09c53cb6cc Remove unused function.
MFC after:	1 week
2017-09-27 13:05:23 +00:00
Sepherosa Ziehau
fc572e261f tcp: Don't "negotiate" MSS.
_NO_ OSes actually "negotiate" MSS.

RFC 879:
"... This Maximum Segment Size (MSS) announcement (often mistakenly
called a negotiation) ..."

This negotiation behaviour was introduced 11 years ago by r159955
without any explaination about why FreeBSD had to "negotiate" MSS:

    In syncache_respond() do not reply with a MSS that is larger than what
    the peer announced to us but make it at least tcp_minmss in size.

    Sponsored by:   TCP/IP Optimization Fundraise 2005

The tcp_minmss behaviour is still kept.

Syncookie fix was prodded by tuexen, who also helped to test this
patch w/ packetdrill.

Reviewed by:	tuexen, karels, bz (previous version)
MFC after:	2 week
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D12430
2017-09-27 05:52:37 +00:00
Michael Tuexen
d28a3a393b Add missing locking. Found by Coverity while scanning the usrsctp
library.

MFC after:	1 week
2017-09-22 06:33:01 +00:00
Michael Tuexen
afb908dada Add missing socket lock.
MFC after:	1 week
2017-09-22 06:07:47 +00:00
Michael Tuexen
cdd2d7d4a5 Code cleanup, no functional change.
MFC after:	1 week
2017-09-21 11:56:31 +00:00
Michael Tuexen
53999485e0 Free the control structure after using is, not before.
Found by Coverity while scanning the usrsctp library.
MFC after:	1 week
2017-09-21 09:47:56 +00:00
Michael Tuexen
d0d8c7de19 No need to wakeup, since sctp_add_to_readq() does it.
MFC after:	1 week
2017-09-21 09:18:05 +00:00
Michael Tuexen
2c62ba7377 Protect the address workqueue timer by a mutex.
MFC after:	1 week
2017-09-20 21:29:54 +00:00
Michael Tuexen
3ec509bcd3 Fix a warning.
MFC after:	1 week
2017-09-19 20:24:13 +00:00
Michael Tuexen
564a95f485 Avoid an overflow when computing the staleness.
This issue was found by running libfuzz on the userland stack.

MFC after:	1 week
2017-09-19 20:09:58 +00:00
Michael Tuexen
ad608f06ed Remove a no longer used variable.
Reported by:	Felix Weinrank
MFC after:	1 week
2017-09-19 15:00:19 +00:00
Michael Tuexen
72e23aba22 Fix an accounting bug and use sctp_timer_start to start a timer.
MFC after:	1 week
2017-09-17 09:27:27 +00:00
Michael Tuexen
fe40f49bb3 Remove code not used on any platform currently supported.
MFC after:	1 week
2017-09-16 21:26:06 +00:00
Michael Tuexen
292efb1bc0 Export the UDP encapsualation port and the path state. 2017-09-12 21:08:50 +00:00
Michael Tuexen
e5cccc35c3 Add support to print the TCP stack being used.
Sponsored by:	Netflix, Inc.
2017-09-12 13:34:43 +00:00
Michael Tuexen
7b5f06fbcc Fix MTU computation. Coverity scanning usrsctp pointed to this code...
MFC after:	3 days
2017-09-09 21:03:40 +00:00
Michael Tuexen
f55c326691 Fix locking issues found by Coverity scanning the usrsctp library.
MFC after:	3 days
2017-09-09 20:44:56 +00:00
Michael Tuexen
0c4622dab2 Silence a Coverity warning from scanning the usrsctp library.
MFC after:	3 days
2017-09-09 20:08:26 +00:00
Michael Tuexen
6c2cfc0419 Savely remove a chunk from the control queue.
This bug was found by Coverity scanning the usrsctp library.

MFC after:	3 days
2017-09-09 19:49:50 +00:00
Hans Petter Selasky
95ed5015ec Add support for generic backpressure indicator for ratelimited
transmit queues aswell as non-ratelimited ones.

Add the required structure bits in order to support a backpressure
indication with ratelimited connections aswell as non-ratelimited
ones. The backpressure indicator is a value between zero and 65535
inclusivly, indicating if the destination transmit queue is empty or
full respectivly. Applications can use this value as a decision point
for when to stop transmitting data to avoid endless ENOBUFS error
codes upon transmitting an mbuf. This indicator is also useful to
reduce the latency for ratelimited queues.

Reviewed by:		gallatin, kib, gnn
Differential Revision:	https://reviews.freebsd.org/D11518
Sponsored by:		Mellanox Technologies
2017-09-06 13:56:18 +00:00
Michael Tuexen
3d5af7a127 Fix blackhole detection.
There were two bugs related to the blackhole detection:
* The smalles size was tried more than two times.
* The restored MSS was not the original one, but the second
  candidate.

MFC after:	1 week
Sponsored by:	Netflix, Inc.
2017-08-28 11:41:18 +00:00
Sean Bruno
32a04bb81d Use counter(9) for PLPMTUD counters.
Remove unused PLPMTUD sysctl counters.

Bump UPDATING and FreeBSD Version to indicate a rebuild is required.

Submitted by:	kevin.bowling@kev009.com
Reviewed by:	jtl
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D12003
2017-08-25 19:41:38 +00:00
Michael Tuexen
e8aba2eb21 Avoid TCP log messages which are false positives.
This is https://svnweb.freebsd.org/changeset/base/322812, just for
alternate TCP stacks.

XMFC with: 	322812
2017-08-23 15:08:51 +00:00
Michael Tuexen
4da9052a05 Avoid TCP log messages which are false positives.
The check for timestamps are too early to handle SYN-ACK correctly.
So move it down after the corresponing processing has been done.

PR:		216832
Obtained from:	antonfb@hesiod.org
MFC after:	1 week
2017-08-23 14:50:08 +00:00
Michael Tuexen
63ec505a4f Ensure inp_vflag is consistently set for TCP endpoints.
Make sure that the flags INP_IPV4 and INP_IPV6 are consistently set
for inpcbs used for TCP sockets, no matter if the setting is derived
from the net.inet6.ip6.v6only sysctl or the IPV6_V6ONLY socket option.
For UDP this was already done right.

PR:		221385
MFC after:	1 week
2017-08-18 07:27:15 +00:00
Oleg Bulyzhin
ff21796d25 Fix comment typo. 2017-08-09 10:46:34 +00:00
Dag-Erling Smørgrav
d80fbbee5f Correct sysctl names. 2017-08-09 07:24:58 +00:00
Bjoern A. Zeeb
ae69ad884d After inpcb route caching was put back in place there is no need for
flowtable anymore (as flowtable was never considered to be useful in
the forwarding path).

Reviewed by:		np
Differential Revision:	https://reviews.freebsd.org/D11448
2017-07-27 13:03:36 +00:00
Ed Maste
1edbb54fe9 cc_cubic: restore braces around if-condition block
r307901 was reverted in r321480, restoring an incorrect block
delimitation bug present in the original cc_cubic commit. Restore
only the bugfix (brace addition) from r307901.

CID:		1090182
Approved by:	sbruno
2017-07-26 21:23:09 +00:00
Sean Bruno
43053c125a Revert r307901 - Inform CC modules about loss events.
This was discussed between various transport@ members and it was
requested to be reverted and discussed.

Submitted by:	Kevin Bowling <kevin.bowling@kev009.com>
Reported by:	lawrence
Reviewed by:	hiren
Sponsored by:	Limelight Networks
2017-07-25 15:08:52 +00:00
Sean Bruno
5d53981a18 Revert r308180 - Set slow start threshold more accurrately on loss ...
This was discussed between various transport@ members and it was
requested to be reverted and discussed.

Submitted by:	kevin
Reported by:	lawerence
Reviewed by:	hiren
2017-07-25 15:03:05 +00:00
Michael Tuexen
e5a9c519bc Remove duplicate statement. 2017-07-25 11:05:53 +00:00
Michael Tuexen
9dd6ca9602 Deal with listening socket correctly. 2017-07-20 14:50:13 +00:00
Michael Tuexen
bbc9dfbc08 Fix the explicit EOR mode. If the final messages is not complete, send
an ABORT.
Joint work with rrs@
MFC after:	1 week
2017-07-20 11:09:33 +00:00
Michael Tuexen
1f76872c36 Avoid shadowed variables.
MFC after:	1 week
2017-07-19 15:12:23 +00:00
Michael Tuexen
5ba7f91f9d Use memset/memcpy instead of bzero/bcopy.
Just use one variant instead of both. Use the memset/memcpy
ones since they cause less problems in crossplatform deployment.

MFC after:	1 week
2017-07-19 14:28:58 +00:00
Michael Tuexen
28cd0699b6 Fix the accounting and add code to detect errors in accounting.
Joint work with rrs@
MFC after:	1 week
2017-07-19 12:27:40 +00:00
Michael Tuexen
d32ed2c735 Fix the handling of Explicit EOR mode.
While there, appropriately handle the overhead depending on
the usage of DATA or I-DATA chunks. Take the overhead only
into account, when required.

Joint work with rrs@
MFC after:	1 week
2017-07-15 19:54:03 +00:00
Konstantin Belousov
5cead59181 Correct sysent flags for dynamically loaded syscalls.
Using the https://github.com/google/capsicum-test/ suite, the
PosixMqueue.CapModeForked test was failing due to an ECAPMODE after
calling kmq_notify(). On further inspection, the dynamically
loaded syscall entry was initialized with sy_flags zeroed out, since
SYSCALL_INIT_HELPER() left sysent.sy_flags with the default value.

Add a new helper SYSCALL{,32}_INIT_HELPER_F() which takes an
additional argument to specify the sy_flags value.

Submitted by:	Siva Mahadevan <smahadevan@freebsdfoundation.org>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D11576
2017-07-14 09:34:44 +00:00
Jonathan T. Looney
cb503ae22d Don't overpromote values when calculating len in tcp_output().
sbavail() returns u_int and sendwin is a uint32_t. Therefore, min() (which
operates on two u_int values) is able to correctly calculate the minimum
of these two arguments.

Reported by:	rrs
MFC after:	1 week
Sponsored by:	Netflix
2017-07-05 16:10:30 +00:00
Michael Tuexen
1698cbd919 Move to open state after plausibility checks.
When doing this too early, the MIB counters go wrong.

MFC after:	1 week
2017-07-04 18:24:50 +00:00
Michael Tuexen
afffa1a9ad Don't hold if refcount on an stcb when it is not needed.
This improves the consistency with other parts of the code.
2017-07-04 18:04:44 +00:00
Sean Bruno
ac952dd274 Add a sysctl to toggle the use of the sockets LOWAT when calculating auto window growth
Submitted by:	j@nitrology.com (Jason Wolfe)
Reviewed by:	gnn hiren
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D11016
2017-07-03 19:39:58 +00:00
Michael Tuexen
f4358911bf Handle sctp_get_next_param() in a consistent way.
This addresses an issue found by Felix Weinrank using libfuzz.
While there, use also consistent nameing.

MFC after:	3 days
2017-06-23 21:01:57 +00:00
Michael Tuexen
d44b45df2c Check the length of a COOKIE chunk before accessing fields in it.
Thanks to Felix Weinrank for reporting the issue he found by using
libFuzzer.

MFC after:	3 days
2017-06-23 10:09:49 +00:00
Michael Tuexen
1a7abbb3be Use a longer buffer for messages in ERROR chunks.
This allows them to be sent in a non truncated way and addresses a warning
given by newver versions of gcc.
Thanks to Anselm Jonas Scholl for reporting it and providing a patch.
2017-06-23 09:27:31 +00:00
Michael Tuexen
94f66d603a Honor the backlog field. 2017-06-23 08:35:54 +00:00
Michael Tuexen
3017b21bb6 Improve compilation on platforms different from FreeBSD. 2017-06-23 08:34:01 +00:00
Gleb Smirnoff
779f106aa1 Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
  fields that belong to normal dataflow, and unionize them.  This
  shrinks the structure a bit.
  - Take out selinfo's from the socket buffers into the socket. The
    first reason is to support braindamaged scenario when a socket is
    added to kevent(2) and then listen(2) is cast on it. The second
    reason is that there is future plan to make socket buffers pluggable,
    so that for a dataflow socket a socket buffer can be changed, and
    in this case we also want to keep same selinfos through the lifetime
    of a socket.
  - Remove struct struct so_accf. Since now listening stuff no longer
    affects struct socket size, just move its fields into listening part
    of the union.
  - Provide sol_upcall field and enforce that so_upcall_set() may be called
    only on a dataflow socket, which has buffers, and for listening sockets
    provide solisten_upcall_set().

o Remove ACCEPT_LOCK() global.
  - Add a mutex to socket, to be used instead of socket buffer lock to lock
    fields of struct socket that don't belong to a socket buffer.
  - Allow to acquire two socket locks, but the first one must belong to a
    listening socket.
  - Make soref()/sorele() to use atomic(9).  This allows in some situations
    to do soref() without owning socket lock.  There is place for improvement
    here, it is possible to make sorele() also to lock optionally.
  - Most protocols aren't touched by this change, except UNIX local sockets.
    See below for more information.

o Reduce copy-and-paste in kernel modules that accept connections from
  listening sockets: provide function solisten_dequeue(), and use it in
  the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
  infiniband, rpc.

o UNIX local sockets.
  - Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
    local sockets.  Most races exist around spawning a new socket, when we
    are connecting to a local listening socket.  To cover them, we need to
    hold locks on both PCBs when spawning a third one.  This means holding
    them across sonewconn().  This creates a LOR between pcb locks and
    unp_list_lock.
  - To fix the new LOR, abandon the global unp_list_lock in favor of global
    unp_link_lock.  Indeed, separating these two locks didn't provide us any
    extra parralelism in the UNIX sockets.
  - Now call into uipc_attach() may happen with unp_link_lock hold if, we
    are accepting, or without unp_link_lock in case if we are just creating
    a socket.
  - Another problem in UNIX sockets is that uipc_close() basicly did nothing
    for a listening socket.  The vnode remained opened for connections.  This
    is fixed by removing vnode in uipc_close().  Maybe the right way would be
    to do it for all sockets (not only listening), simply move the vnode
    teardown from uipc_detach() to uipc_close()?

Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
Jonathan T. Looney
dc6a41b936 Add the infrastructure to support loading multiple versions of TCP
stack modules.

It adds support for mangling symbols exported by a module by prepending
a string to them. (This avoids overlapping symbols in the kernel linker.)

It allows the use of a macro as the module name in the DECLARE_MACRO()
and MACRO_VERSION() macros.

It allows the code to register stack aliases (e.g. both a generic name
["default"] and version-specific name ["default_10_3p1"]).

With these changes, it is trivial to compile TCP stack modules with
the name defined in the Makefile and to load multiple versions of the
same stack simultaneously. This functionality can be used to enable
side-by-side testing of an old and new version of the same TCP stack.
It also could support upgrading the TCP stack without a reboot.

Reviewed by:	gnn, sjg (makefiles only)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D11086
2017-06-08 20:41:28 +00:00
Gleb Smirnoff
3acfe1e1b0 This code was missing socket unlock and socket buffer lock, but it
worked since right now these two locks are the same.
2017-06-08 06:37:11 +00:00
Gleb Smirnoff
12d8a8e7a3 The desired lock here is socket buffer, not socket.
Right now they match, but won't in future.
2017-06-08 06:34:09 +00:00
Michael Tuexen
8cb5a8e90a Fix the ICMP6 handling for TCP.
The ICMP6 packets might not be contained in a single mbuf. So don't
assume this. Keep the IPv4 and IPv6 code in sync and make explicit
that the syncache code only need the TCP sequence number, not the
complete TCP header.

MFC after:		3 days
Sponsored by:		Netflix, Inc.
2017-06-03 21:53:58 +00:00
Michael Tuexen
98732609d5 Improve comments to describe what the code does.
Reported by:		jtl
Sponsored by:		Netflix, Inc.
2017-06-01 15:11:18 +00:00
Jonathan T. Looney
382a6bbcf1 Enforce the limit on ICMP messages before doing work to formulate the
response.

Delete an unneeded rate limit for UDP under IPv6. Because ICMP6
messages have their own rate limit, it is unnecessary to apply a
second rate limit to UDP messages.

Reviewed by:	glebius
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D10387
2017-05-30 14:32:44 +00:00
Michael Tuexen
5d08768a2b Use the SCTP_PCB_FLAGS_ACCEPTING flags to check for listeners.
While there, use a macro for checking the listen state to allow for
easier changes if required.

This done to help glebius@ with his listen changes.
2017-05-26 16:29:00 +00:00
Gleb Smirnoff
6a6cefac3d o Rearrange struct inpcb fields to optimize the TCP output code path
considering cache line hits and misses.  Put the lock and hash list
  glue into the first cache line, put inp_refcount inp_flags inp_socket
  into the second cache line.
o On allocation zero out entire structure except the lock and list entries,
  including inp_route inp_lle inp_gencnt.  When inp_route and inp_lle were
  introduced, they were added below inp_zero_size, resulting on not being
  cleared after free/alloc.  This definitely was a source of bugs with route
  caching.  Could be that r315956 has just fixed one of them.
  The inp_gencnt is reinitialized on every alloc, so it is safe to clear it.

This has been proved to improve TCP performance at Netflix.

Obtained from:		rrs
Differential Revision:	D10686
2017-05-24 17:47:16 +00:00
Michael Tuexen
5dba6ada91 The connect() system call should return -1 and set errno to EAFNOSUPPORT
if it is called on a TCP socket
 * with an IPv6 address and the socket is bound to an
    IPv4-mapped IPv6 address.
 * with an IPv4-mapped IPv6 address and the socket is bound to an
   IPv6 address.
Thanks to Jonathan T. Leighton for reporting this issue.

Reviewed by:		bz gnn
MFC after:		3 days
Differential Revision:	https://reviews.freebsd.org/D9163
2017-05-22 15:29:10 +00:00
Andrey V. Elsukov
38cc96a887 Set M_BCAST and M_MCAST flags on mbuf sent via divert socket.
r290383 has changed how mbufs sent by divert socket are handled.
Previously they are always handled by slow path processing in ip_input().
Now ip_tryforward() is invoked from ip_input() before in_broadcast() check.
Since diverted packet lost all mbuf flags, it passes the broadcast check
in ip_tryforward() due to missing M_BCAST flag. In the result the broadcast
packet is forwarded to the wire instead of be consumed by network stack.

Add in_broadcast() check to the div_output() function. And restore the
M_BCAST flag if destination address is broadcast for the given network
interface.

PR:		209491
MFC after:	1 week
2017-05-17 09:04:09 +00:00
Ed Maste
3e85b721d6 Remove register keyword from sys/ and ANSIfy prototypes
A long long time ago the register keyword told the compiler to store
the corresponding variable in a CPU register, but it is not relevant
for any compiler used in the FreeBSD world today.

ANSIfy related prototypes while here.

Reviewed by:	cem, jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D10193
2017-05-17 00:34:34 +00:00
Gleb Smirnoff
cc487c1697 Reduce in_pcbinfo_init() by two params. No users supply any flags to this
function (they used to say UMA_ZONE_NOFREE), so flag parameter goes away.
The zone_fini parameter also goes away.  Previously no protocols (except
divert) supplied zone_fini function, so inpcb locks were leaked with slabs.
This was okay while zones were allocated with UMA_ZONE_NOFREE flag, but now
this is a leak.  Fix that by suppling inpcb_fini() function as fini method
for all inpcb zones.
2017-05-15 21:58:36 +00:00
Enji Cooper
bd7459366e Add missing braces around MCAST_EXCLUDE check when KTR support is
compiled into the kernel

This ensures that .iss_asm (the number of ASM listeners) isn't incorrectly
decremented for MLD-layer source datagrams when inspecting im*s_st[1]
(the second state in the structure).

MFC after:	2 months
PR:		217509 [1]
Reported by:	Coverity (Isilon)
Reviewed by:	ae ("This patch looks correct to me." [1])
Submitted by:	Miles Ohlrich <miles.ohlrich@isilon.com>
Sponsored by:	Dell EMC Isilon
2017-05-13 18:41:24 +00:00
Gleb Smirnoff
7637c57ee1 There is no good reason for TCP reassembly zone to be UMA_ZONE_NOFREE.
It has strong locking model, doesn't have any timers associated with
entries.  The entries theirselves are referenced only from the tcpcb zone,
which itself is a normal zone, without the UMA_ZONE_NOFREE flag.
2017-05-10 23:32:31 +00:00
Eugene Grosbein
1a356b8b90 ipfw nat and natd support multiple aliasing instances with "nat global" feature
that chooses right alias_address for outgoing packets that already have
corresponding state in one of aliasing instances. This feature works just fine
for ICMP, UDP, TCP and SCTP packes but not for others. For example,
outgoing PPtP/GRE packets always get alias_address of latest configured
instance no matter whether such packets have corresponding state or not.

This change unbreaks translation of transit PPtP/GRE connections
for "nat global" case fixing a bug in static ProtoAliasOut() function
that ignores its "create" argument and performs translation
regardless of its value. This static function is called only
by LibAliasOutLocked() function and only for packers other than
ICMP, UDP, TCP and SCTP. LibAliasOutLocked() passes its "create"
argument unmodified.

We have only two consumers of LibAliasOutLocked() in the source tree
calling it with "create" unequal to 1: "ipfw nat global" code and similar
natd code having same problem. All other consumers of LibAliasOutLocked()
call it with create = 1 and the patch is "no-op" for such cases.

PR:		218968
Approved by:	ae, vsevolod (mentor)
MFC after:	1 week
2017-05-10 19:41:52 +00:00
Michael Tuexen
10e0318afa Allow SCTP to use the hostcache.
This patch allows the MTU stored in the hostcache to be used as an
initial value for SCTP paths. When an ICMP PTB message is received,
store the MTU in the hostcache.

MFC after:	1 week
2017-04-29 19:20:50 +00:00
Michael Tuexen
4f43a14a85 Don't set the DF-bit on timer based retransmissions.
MFC after:	1 week
2017-04-29 09:57:27 +00:00
Michael Tuexen
b6ecf43450 Set the DF bit for responses to out-of-the-blue packets.
MFC after:	1 week
2017-04-28 15:38:34 +00:00
Michael Tuexen
d274bcc661 Fix an issue with MTU calculation if an ICMP messaeg is received
for an SCTP/UDP packet.

MFC after:	1 week
2017-04-26 20:21:05 +00:00
Michael Tuexen
6ebfa5ee14 Use consistently uint32_t for mtu values.
This does not change functionality, but this cleanup is need for further
improvements of ICMP handling.

MFC after:	1 week
2017-04-26 19:26:40 +00:00
Michael Tuexen
ebfd753408 When a SYN-ACK is received in SYN-SENT state, RFC 793 requires the
validation of SEG.ACK as the first step. If the ACK is not acceptable,
a RST segment should be sent and the segment should be dropped.
Up to now, the segment was partially processed.
This patch moves the check for the SEG.ACK validation up to the front
as required.
Reviewed by:		hiren, gnn
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D10424
2017-04-26 06:20:58 +00:00
Navdeep Parhar
f8acc03ef1 Flush the LRO ctrl as soon as lro_mbufs fills up. There is no need to
wait for the next enqueue from the driver.

Reviewed by:	gnn@, hselasky@, gallatin@
MFC after:	1 week
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D10432
2017-04-24 22:35:00 +00:00
Navdeep Parhar
ea9a92f112 Frames that are not considered for LRO should not be counted in LRO statistics.
Reviewed by:	gnn@, hselasky@, gallatin@
MFC after:	1 week
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D10430
2017-04-24 22:31:56 +00:00
Brooks Davis
a7dc31283a Remove the NATM framework including the en(4), fatm(4), hatm(4), and
patm(4) devices.

Maintaining an address family and framework has real costs when we make
infrastructure improvements.  In the case of NATM we support no devices
manufactured in the last 20 years and some will not even work in modern
motherboards (some newer devices that patm(4) could be updated to
support apparently exist, but we do not currently have support).

With this change, support remains for some netgraph modules that don't
require NATM support code. It is unclear if all these should remain,
though ng_atmllc certainly stands alone.

Note well: FreeBSD 11 supports NATM and will continue to do so until at
least September 30, 2021.  Improvements to the code in FreeBSD 11 are
certainly welcome.

Reviewed by:	philip
Approved by:	harti
2017-04-24 21:21:49 +00:00
Michael Tuexen
75e7a91649 Represent "a syncache overflow hasn't happend yet" by using
-(SYNCOOKIE_LIFETIME + 1) instead of INT64_MIN, since it is
good enough and works when time_t is int32 or int64.
This fixes the issue reported by cy@ on i386.

Reported by:	cy
MFC after:	1 week
Sponsored by:	Netflix, Inc.
2017-04-21 06:05:34 +00:00
Michael Tuexen
190d9abce7 Syncoockies can be used in combination with the syncache. If the cache
overflows, syncookies are used.
This patch restricts the usage of syncookies in this case: accept
syncookies only if there was an overflow of the syncache recently.
This mitigates a problem reported in PR217637, where is syncookie was
accepted without any recent drops.
Thanks to glebius@ for suggesting an improvement.

PR:			217637
Reviewed by:		gnn, glebius
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D10272
2017-04-20 19:19:33 +00:00
Navdeep Parhar
b0ca71f0a0 Free lro_hash unconditionally, just like lro_mbuf_data a few lines
later.  Fix whitespace nit while here.
2017-04-19 23:06:07 +00:00
Navdeep Parhar
a3927369fa Do not leak lro_hash on failure to allocate lro_mbuf_data.
MFC after:	1 week
2017-04-19 22:27:26 +00:00
Navdeep Parhar
3d24e03800 Remove redundant assignment. 2017-04-19 22:20:41 +00:00
Andrey V. Elsukov
c33a231337 Rework r316770 to make it protocol independent and general, like we
do for streaming sockets.

And do more cleanup in the sbappendaddr_locked_internal() to prevent
leak information from existing mbuf to the one, that will be possible
created later by netgraph.

Suggested by:	glebius
Tested by:	Irina Liakh <spell at itl ua>
MFC after:	1 week
2017-04-14 09:00:48 +00:00
Andrey V. Elsukov
8428914909 Clear h/w csum flags on mbuf handled by UDP.
When checksums of received IP and UDP header already checked, UDP uses
sbappendaddr_locked() to pass received data to the socket.
sbappendaddr_locked() uses given mbuf as is, and if NIC supports checksum
offloading, mbuf contains csum_data and csum_flags that were calculated
for already stripped headers. Some NICs support only limited checksums
offloading and do not use CSUM_PSEUDO_HDR flag, and csum_data contains
some value that UDP/TCP should use for pseudo header checksum calculation.

When L2TP is used for tunneling with mpd5, ng_ksocket receives mbuf with
filled csum_flags and csum_data, that were calculated for outer headers.
When L2TP header is stripped, a packet that was tunneled goes to the IP
layer and due to presence of csum_flags (without CSUM_PSEUDO_HDR) and
csum_data, the UDP/TCP checksum check fails for this packet.

Reported by:	Irina Liakh <spell at itl ua>
Tested by:	Irina Liakh <spell at itl ua>
MFC after:	1 week
2017-04-13 17:03:57 +00:00
Michael Tuexen
013f4df643 The sysctl variable net.inet.tcp.drop_synfin is not honored in all states,
for example not in SYN-SENT.
This patch adds code to check the sysctl variable in other states than
LISTEN.
Thanks to ae and gnn for providing comments.
Reviewed by:		gnn
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D9894
2017-04-12 20:27:15 +00:00
Andrey V. Elsukov
7faa0d213b Make sysctl identifiers for direct netisr queue unique.
Introduce IPCTL_INTRDQMAXLEN and IPCTL_INTRDQDROPS macros for this purpose.

Reviewed by:	gnn
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D10358
2017-04-11 19:20:20 +00:00
Steven Hartland
e44c1887fd Use estimated RTT for receive buffer auto resizing instead of timestamps
Switched from using timestamps to RTT estimates when performing TCP receive
buffer auto resizing, as not all hosts support / enable TCP timestamps.

Disabled reset of receive buffer auto scaling when not in bulk receive mode,
which gives an extra 20% performance increase.

Also extracted auto resizing to a common method shared between standard and
fastpath modules.

With this AWS S3 downloads at ~17ms latency on a 1Gbps connection jump from
~3MB/s to ~100MB/s using the default settings.

Reviewed by:    lstewart, gnn
MFC after:      2 weeks
Relnotes:       Yes
Sponsored by:   Multiplay
Differential Revision:  https://reviews.freebsd.org/D9668
2017-04-10 08:19:35 +00:00
Ryan Stone
4af540d197 Revert the optimization from r304436
r304436 attempted to optimize the handling of incoming UDP packet by only
making an expensive call to in_broadcast() if the mbuf was marked as an
broadcast packet.  Unfortunately, this cannot work in the case of point-to-
point L2 protocols like PPP, which have no notion of "broadcast".  The
optimization has been disabled for several months now with no progress
towards fixing it, so it needs to go.
2017-04-05 16:57:13 +00:00
Andrey V. Elsukov
11c56650f0 Add O_EXTERNAL_DATA opcode support.
This opcode can be used to attach some data to external action opcode.
And unlike to O_EXTERNAL_INSTANCE opcode, this opcode does not require
creating of named instance to pass configuration arguments to external
action handler. The data is coming just next to O_EXTERNAL_ACTION opcode.

The userlevel part currenly supports formatting for opcode with ipfw_insn
size, by default it expects u16 numeric value in the arg1.

Obtained from:	Yandex LLC
MFC after:	2 weeks
Sponsored by:	Yandex LLC
2017-04-03 02:44:40 +00:00
Steven Hartland
6ebc1b7b7d Allow explicitly assigned IPv4 loopback address to be used in jails
If a jail has an explicitly assigned loopback address then allow it to be
used instead of remapping requests for the loopback adddress to the first
IPv4 address assigned to the jail.

This fixes issues where applications attempt to detect their bound port
where they requested a loopback address, which was available, but instead
the kernel remapped it to the jails first address.

A example of this is binding nginx to 127.0.0.1 and then running "service
nginx upgrade" which before this change would cause nginx to fail.

Also:
* Correct the description of prison_check_ip4_locked to match the code.

MFC after:	2 weeks
Relnotes:	Yes
Sponsored by:	Multiplay
2017-03-31 00:41:54 +00:00
Mike Karels
4a5c6c6ab0 Enable route and LLE (ndp) caching in TCP/IPv6
tcp_output.c was using a route on the stack for IPv6, which does not
allow route caching or LLE/ndp caching. Switch to using the route
(v6 flavor) in the in_pcb, which was already present, which caches
both L3 and L2 lookups.

Reviewed by:	gnn hiren
MFC after:	2 weeks
2017-03-27 23:48:36 +00:00
Mike Karels
8c1960d506 Fix reference count leak with L2 caching.
ip_forward, TCP/IPv6, and probably SCTP leaked references to L2 cache
entry because they used their own routes on the stack, not in_pcb routes.
The original model for route caching was callers that provided a route
structure to ip{,6}input() would keep the route, and this model was used
for L2 caching as well. Instead, change L2 caching to be done by default
only when using a route structure in the in_pcb; the pcb deallocation
code frees L2 as well as L3 cacches. A separate change will add route
caching to TCP/IPv6.

Another suggestion was to have the transport protocols indicate willingness
to use L2 caching, but this approach keeps the changes in the network
level

Reviewed by:    ae gnn
MFC after:      2 weeks
Differential Revision:  https://reviews.freebsd.org/D10059
2017-03-25 15:06:28 +00:00
Gleb Smirnoff
3ae4b0e7e8 Force same alignment on struct xinpgen as we have on struct xinpcb. This
fixes 32-bit builds.
2017-03-21 16:23:44 +00:00
Gleb Smirnoff
cc65eb4e79 Hide struct inpcb, struct tcpcb from the userland.
This is a painful change, but it is needed.  On the one hand, we avoid
modifying them, and this slows down some ideas, on the other hand we still
eventually modify them and tools like netstat(1) never work on next version of
FreeBSD.  We maintain a ton of spares in them, and we already got some ifdef
hell at the end of tcpcb.

Details:
- Hide struct inpcb, struct tcpcb under _KERNEL || _WANT_FOO.
- Make struct xinpcb, struct xtcpcb pure API structures, not including
  kernel structures inpcb and tcpcb inside.  Export into these structures
  the fields from inpcb and tcpcb that are known to be used, and put there
  a ton of spare space.
- Make kernel and userland utilities compilable after these changes.
- Bump __FreeBSD_version.

Reviewed by:	rrs, gnn
Differential Revision:	D10018
2017-03-21 06:39:49 +00:00
Eric van Gyzen
40769242ed Add some ntohl() love to r315277
inet_ntoa() and inet_ntoa_r() take the address in network
byte-order.  When I removed those calls, I should have
replaced them with ntohl() to make the hex addresses slightly
less unreadable.  Here they are.

See r315277 regarding classic blunders.

vangyzen: you're deep in "no good deed" territory, it seems
    --badger

Reported by:	ian
MFC after:	3 days
MFC when:	I finally get it right
Sponsored by:	Dell EMC
2017-03-14 20:57:54 +00:00
Eric van Gyzen
47d803ea71 KTR: log IPv4 addresses in hex rather than dotted-quad
When I made the changes in r313821, I fell victim to one of the
classic blunders, the most famous of which is: never get involved
in a land war in Asia.  But only slightly less well known is this:
Keep your brain turned on and engaged when making a tedious, sweeping,
mechanical change.  KTR can correctly log the immediate integral values
passed to it, as well as constant strings, but not non-constant strings,
since they might change by the time ktrdump retrieves them.

Reported by:	glebius
MFC after:	3 days
Sponsored by:	Dell EMC
2017-03-14 18:27:48 +00:00
Conrad Meyer
fdb727f4f2 alias_proxy.c: Fix accidental error quashing
This was introduced on accident in r165243, when return sites were unified
to add a lock around LibAliasProxyRule().

PR:		217749
Submitted by:	Svyatoslav <razmyslov at viva64.com>
Sponsored by:	Viva64 (PVS-Studio)
2017-03-13 18:05:31 +00:00
Andrey V. Elsukov
719498102c Fix the L2 address printed in the "arp: %s moved from %*D" message.
In the r292978 struct llentry was changed and the ll_addr field become
the pointer.

PR:		217667
MFC after:	1 week
2017-03-11 04:57:52 +00:00
Gleb Smirnoff
c75e266608 Make inp_lock_assert() depend on INVARIANT_SUPPORT, not INVARIANTS.
This will make INVARIANT-enabled modules, that use this function to load
successfully on a kernel that has INVARIANT_SUPPORT only.
2017-03-09 00:55:19 +00:00
Ermal Luçi
dce33a45c9 The patch provides the same socket option as Linux IP_ORIGDSTADDR.
Unfortunately they will have different integer value due to Linux value being already assigned in FreeBSD.

The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application.

This allows/improves implementation of transparent proxies on UDP sockets due to having the whole information on forwarded packets.

Reviewed by:	adrian, aw
Approved by:	ae (mentor)
Sponsored by:	rsync.net
Differential Revision:	D9235
2017-03-06 04:01:58 +00:00
Warner Losh
fbbd9655e5 Renumber copyright clause 4
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by:	Jan Schaumann <jschauma@stevens.edu>
Pull Request:	https://github.com/freebsd/freebsd/pull/96
2017-02-28 23:42:47 +00:00
Michael Tuexen
8d62aae8df TCP window updates are only sent if the window can be increased by at
least 2 * MSS. However, if the receive buffer size is small, this might
be impossible. Add back a criterion to send a TCP window update if
the window can be increased by at least half of the receive buffer size.
This condition was removed in r242252. This patch simply brings it back.
PR:			211003
Reviewed by:		gnn
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D9475
2017-02-23 18:14:36 +00:00
Eric van Gyzen
922193e7ff Remove inet_ntoa() from the kernel
inet_ntoa() cannot be used safely in a multithreaded environment
because it uses a static local buffer.  Remove it from the kernel.

Suggested by:	glebius, emaste
Reviewed by:	gnn
MFC after:	never
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D9625
2017-02-16 20:50:01 +00:00
Eric van Gyzen
8144690af4 Use inet_ntoa_r() instead of inet_ntoa() throughout the kernel
inet_ntoa() cannot be used safely in a multithreaded environment
because it uses a static local buffer. Instead, use inet_ntoa_r()
with a buffer on the caller's stack.

Suggested by:	glebius, emaste
Reviewed by:	gnn
MFC after:	2 weeks
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D9625
2017-02-16 20:47:41 +00:00
Andrey V. Elsukov
7a60a91011 Add missing check to fix the build with IPSEC_SUPPORT and without MAC.
Submitted by:	netchild
2017-02-14 21:33:10 +00:00
Andrey V. Elsukov
627c036f65 Remove IPsec related PCB code from SCTP.
The inpcb structure has inp_sp pointer that is initialized by
ipsec_init_pcbpolicy() function. This pointer keeps strorage for IPsec
security policies associated with a specific socket.
An application can use IP_IPSEC_POLICY and IPV6_IPSEC_POLICY socket
options to configure these security policies. Then ip[6]_output()
uses inpcb pointer to specify that an outgoing packet is associated
with some socket. And IPSEC_OUTPUT() method can use a security policy
stored in the inp_sp. For inbound packet the protocol-specific input
routine uses IPSEC_CHECK_POLICY() method to check that a packet conforms
to inbound security policy configured in the inpcb.

SCTP protocol doesn't specify inpcb for ip[6]_output() when it sends
packets. Thus IPSEC_OUTPUT() method does not consider such packets as
associated with some socket and can not apply security policies
from inpcb, even if they are configured. Since IPSEC_CHECK_POLICY()
method is called from protocol-specific input routine, it can specify
inpcb pointer and associated with socket inbound policy will be
checked. But there are two problems:
1. Such check is asymmetric, becasue we can not apply security policy
from inpcb for outgoing packet.
2. IPSEC_CHECK_POLICY() expects that caller holds INPCB lock and
access to inp_sp is protected. But for SCTP this is not correct,
becasue SCTP uses own locks to protect inpcb.

To fix these problems remove IPsec related PCB code from SCTP.
This imply that IP_IPSEC_POLICY and IPV6_IPSEC_POLICY socket options
will be not applicable to SCTP sockets. To be able correctly check
inbound security policies for SCTP, mark its protocol header with
the PR_LASTHDR flag.

Reported by:	tuexen
Reviewed by:	tuexen
Differential Revision:	https://reviews.freebsd.org/D9538
2017-02-13 11:37:52 +00:00
Ermal Luçi
c10c5b1eba Committed without approval from mentor.
Reported by:	gnn
2017-02-12 06:56:33 +00:00
Ryan Stone
5ede40dcf2 Don't zero out srtt after excess retransmits
If the TCP stack has retransmitted more than 1/4 of the total
number of retransmits before a connection drop, it decides that
its current RTT estimate is hopelessly out of date and decides
to recalculate it from scratch starting with the next ACK.

Unfortunately, it implements this by zeroing out the current RTT
estimate.  Drop this hack entirely, as it makes it significantly more
difficult to debug connection issues.  Instead check for excessive
retransmits at the point where srtt is updated from an ACK being
received.  If we've exceeded 1/4 of the maximum retransmits,
discard the previous srtt estimate and replace it with the latest
rtt measurement.

Differential Revision:	https://reviews.freebsd.org/D9519
Reviewed by:	gnn
Sponsored by:	Dell EMC Isilon
2017-02-11 17:05:08 +00:00
Gleb Smirnoff
cfff3743cd Move tcp_fields_to_net() static inline into tcp_var.h, just below its
friend tcp_fields_to_host(). There is third party code that also uses
this inline.

Reviewed by:	ae
2017-02-10 17:46:26 +00:00
Ermal Luçi
e97b60264d Fix build after r313524
Reported-by: ohartmann@walstatt.org
2017-02-10 06:01:47 +00:00
Ermal Luçi
4616026faf Revert r313527
Heh svn is not git
2017-02-10 05:58:16 +00:00
Ermal Luçi
c0fadfdbbf Correct missed variable name.
Reported-by: ohartmann@walstatt.org
2017-02-10 05:51:39 +00:00
Ermal Luçi
ed55edceef The patch provides the same socket option as Linux IP_ORIGDSTADDR.
Unfortunately they will have different integer value due to Linux value being already assigned in FreeBSD.

The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application.

This allows/improves implementation of transparent proxies on UDP sockets due to having the whole information on forwarded packets.

Sponsored-by: rsync.net
Differential Revision: D9235
Reviewed-by: adrian
2017-02-10 05:16:14 +00:00
Eric van Gyzen
edf0313b70 Fix garbage IP addresses in UDP log_in_vain messages
If multiple threads emit a UDP log_in_vain message concurrently,
the IP addresses could be garbage due to concurrent usage of a
single string buffer inside inet_ntoa().  Use inet_ntoa_r() with
two stack buffers instead.

Reported by:	Mark Martinec <Mark.Martinec+freebsd@ijs.si>
MFC after:	3 days
Relnotes:	yes
Sponsored by:	Dell EMC
2017-02-07 18:57:57 +00:00
Andrey V. Elsukov
fcf596178b Merge projects/ipsec into head/.
Small summary
 -------------

o Almost all IPsec releated code was moved into sys/netipsec.
o New kernel modules added: ipsec.ko and tcpmd5.ko. New kernel
  option IPSEC_SUPPORT added. It enables support for loading
  and unloading of ipsec.ko and tcpmd5.ko kernel modules.
o IPSEC_NAT_T option was removed. Now NAT-T support is enabled by
  default. The UDP_ENCAP_ESPINUDP_NON_IKE encapsulation type
  support was removed. Added TCP/UDP checksum handling for
  inbound packets that were decapsulated by transport mode SAs.
  setkey(8) modified to show run-time NAT-T configuration of SA.
o New network pseudo interface if_ipsec(4) added. For now it is
  build as part of ipsec.ko module (or with IPSEC kernel).
  It implements IPsec virtual tunnels to create route-based VPNs.
o The network stack now invokes IPsec functions using special
  methods. The only one header file <netipsec/ipsec_support.h>
  should be included to declare all the needed things to work
  with IPsec.
o All IPsec protocols handlers (ESP/AH/IPCOMP protosw) were removed.
  Now these protocols are handled directly via IPsec methods.
o TCP_SIGNATURE support was reworked to be more close to RFC.
o PF_KEY SADB was reworked:
  - now all security associations stored in the single SPI namespace,
    and all SAs MUST have unique SPI.
  - several hash tables added to speed up lookups in SADB.
  - SADB now uses rmlock to protect access, and concurrent threads
    can do SA lookups in the same time.
  - many PF_KEY message handlers were reworked to reflect changes
    in SADB.
  - SADB_UPDATE message was extended to support new PF_KEY headers:
    SADB_X_EXT_NEW_ADDRESS_SRC and SADB_X_EXT_NEW_ADDRESS_DST. They
    can be used by IKE daemon to change SA addresses.
o ipsecrequest and secpolicy structures were cardinally changed to
  avoid locking protection for ipsecrequest. Now we support
  only limited number (4) of bundled SAs, but they are supported
  for both INET and INET6.
o INPCB security policy cache was introduced. Each PCB now caches
  used security policies to avoid SP lookup for each packet.
o For inbound security policies added the mode, when the kernel does
  check for full history of applied IPsec transforms.
o References counting rules for security policies and security
  associations were changed. The proper SA locking added into xform
  code.
o xform code was also changed. Now it is possible to unregister xforms.
  tdb_xxx structures were changed and renamed to reflect changes in
  SADB/SPDB, and changed rules for locking and refcounting.

Reviewed by:	gnn, wblock
Obtained from:	Yandex LLC
Relnotes:	yes
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D9352
2017-02-06 08:49:57 +00:00
Patrick Kelsey
ec93ed8d95 Fix VIMAGE-related bugs in TFO. The autokey callout vnet context was
not being initialized, and the per-vnet fastopen context was only
being initialized for the default vnet.

PR:		216613
Reported by:	Alex Deiter <alex dot deiter at gmail dot com>
MFC after:	1 week
2017-02-03 17:02:57 +00:00
George V. Neville-Neil
82988b50a1 Add an mbuf to ipinfo_t translator to finish cleanup of mbuf passing to TCP probes.
Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9401
2017-02-01 19:33:00 +00:00
Michael Tuexen
c03627fd06 Ensure that the variable bail is always initialized before used.
MFC after:	1 week
2017-02-01 00:10:29 +00:00
Michael Tuexen
2aa116007c Take the SCTP common header into account when computing the
space available for chunks. This unbreaks the handling of
ICMPV6 packets indicating "packet too big". It just worked
for IPv4 since we are overbooking for IPv4.

MFC after:	1 week
2017-01-31 23:36:31 +00:00
Michael Tuexen
7858d7cb8e Remove a duplicate debug statement.
MFC after:	1 week
2017-01-31 23:34:02 +00:00
Cy Schubert
3df96ee68e Correct comment grammar and make it easier to understand.
MFC after:	1 week
2017-01-30 04:51:18 +00:00
Hiren Panchasara
6134aabe38 Add a knob to change default behavior of inheriting listen socket's tcp stack
regardless of what the default stack for the system is set to.

With current/default behavior, after changing the default tcp stack, the
application needs to be restarted to pick up that change. Setting this new knob
net.inet.tcp.functions_inherit_listen_socket_stack to '0' would change that
behavior and make any new connection use the newly selected default tcp stack.

Reviewed by:		rrs
MFC after:		2 weeks
Sponsored by:		Limelight Networks
2017-01-27 23:10:46 +00:00
Luiz Otavio O Souza
338e227ac0 After the in_control() changes in r257692, an existing address is
(intentionally) deleted first and then completely added again (so all the
events, announces and hooks are given a chance to run).

This cause an issue with CARP where the existing CARP data structure is
removed together with the last address for a given VHID, which will cause
a subsequent fail when the address is later re-added.

This change fixes this issue by adding a new flag to keep the CARP data
structure when an address is not being removed.

There was an additional issue with IPv6 CARP addresses, where the CARP data
structure would never be removed after a change and lead to VHIDs which
cannot be destroyed.

Reviewed by:	glebius
Obtained from:	pfSense
MFC after:	2 weeks
Sponsored by:	Rubicon Communications, LLC (Netgate)
2017-01-25 19:04:08 +00:00
Michael Tuexen
bd60638c98 Fix a bug where the overhead of the I-DATA chunk was not considered.
MFC after: 1 week
2017-01-24 21:30:31 +00:00
Hans Petter Selasky
f3e7afe2d7 Implement kernel support for hardware rate limited sockets.
- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.

- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API support rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.

- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().

- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.

- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.

- How rate limiting works:

1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.

2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the rate
limited traffic will be rate limited.

3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN which causes the ip_output() to release and clear the send
tag. Upon next ip_output() a new "snd_tag" will be tried allocated.

4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.

Reviewed by:		wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision:	https://reviews.freebsd.org/D3687
Sponsored by:		Mellanox Technologies
MFC after:		3 months
2017-01-18 13:31:17 +00:00
Maxim Sobolev
339efd75a4 Add a new socket option SO_TS_CLOCK to pick from several different clock
sources to return timestamps when SO_TIMESTAMP is enabled. Two additional
clock sources are:

o nanosecond resolution realtime clock (equivalent of CLOCK_REALTIME);
o nanosecond resolution monotonic clock (equivalent of CLOCK_MONOTONIC).

In addition to this, this option provides unified interface to get bintime
(equivalent of using SO_BINTIME), except it also supported with IPv6 where
SO_BINTIME has never been supported. The long term plan is to depreciate
SO_BINTIME and move everything to using SO_TS_CLOCK.

Idea for this enhancement has been briefly discussed on the Net session
during dev summit in Ottawa last June and the general input was positive.

This change is believed to benefit network benchmarks/profiling as well
as other scenarios where precise time of arrival measurement is necessary.

There are two regression test cases as part of this commit: one extends unix
domain test code (unix_cmsg) to test new SCM_XXX types and another one
implementis totally new test case which exchanges UDP packets between two
processes using both conventional methods (i.e. calling clock_gettime(2)
before recv(2) and after send(2)), as well as using setsockopt()+recv() in
receive path. The resulting delays are checked for sanity for all supported
clock types.

Reviewed by:    adrian, gnn
Differential Revision:  https://reviews.freebsd.org/D9171
2017-01-16 17:46:38 +00:00
Conrad Meyer
1d64db52f3 Fix a variety of cosmetic typos and misspellings
No functional change.

PR:		216096, 216097, 216098, 216101, 216102, 216106, 216109, 216110
Reported by:	Bulat <bltsrc at mail.ru>
Sponsored by:	Dell EMC Isilon
2017-01-15 18:00:45 +00:00
Gleb Smirnoff
0f7ddf91e9 Use getsock_cap() instead of deprecated fgetsock().
Reviewed by:	tuexen
2017-01-13 16:54:44 +00:00
Michael Tuexen
24209f0122 Ensure that the buffer length and the length provided in the IPv4
header match when using a raw socket to send IPv4 packets and
providing the header. If they don't match, let send return -1
and set errno to EINVAL.

Before this patch is was only enforced that the length in the header
is not larger then the buffer length.

PR:			212283
Reviewed by:		ae, gnn
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D9161
2017-01-13 10:55:26 +00:00
Maxim Sobolev
5e946c03c7 Fix slight type mismatch between so_options defined in sys/socketvar.h
and tw_so_options defined here which is supposed to be a copy of the
former (short vs u_short respectively).

Switch tw_so_options to be "signed short" to match the type of the field
it's inherited from.
2017-01-12 10:14:54 +00:00
Hiren Panchasara
b8a2fb91f6 sysctl net.inet.tcp.hostcache.list in a jail can see connections from other
jails and the host. This commit fixes it.

PR:		200361
Submitted by:	bz (original version), hiren (minor corrections)
Reported by:	Marcus Reid <marcus at blazingdot dot com>
Reviewed by:	bz, gnn
Tested by:	Lohith Bellad <lohithbsd at gmail dot com>
MFC after:	1 week
Sponsored by:	Limelight Networks (minor corrections)
2017-01-05 17:22:09 +00:00
George V. Neville-Neil
fad073dd44 Followup to mtod removal in main stack (r311225). Continued removal
of mtod() calls from TCP_PROBE macros.

MFC after:	1 week
Sponsored by:	Limelight Networks
2017-01-04 04:00:28 +00:00
George V. Neville-Neil
2b9c998413 Fix DTrace TCP tracepoints to not use mtod() as it is both unnecessary and
dangerous.  Those wanting data from an mbuf should use DTrace itself to get
the data.

PR:	203409
Reviewed by:	hiren
MFC after:	1 week
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D9035
2017-01-04 02:19:13 +00:00
Enji Cooper
cfff8d3dbd Unbreak ip_carp with WITHOUT_INET6 enabled by conditionalizing all IPv6
structs under the INET6 #ifdef. Similarly (even though it doesn't seem
to affect the build), conditionalize all IPv4 structs under the INET
#ifdef

This also unbreaks the LINT-NOINET6 tinderbox target on amd64; I have not
verified other MACHINE/TARGET pairs (e.g. armv6/arm).

MFC after:	2 weeks
X-MFC with:	r310847
Pointyhat to:	jpaetzel
Reported by:	O. Hartmann <o.hartmann@walstatt.org>
2016-12-30 21:33:01 +00:00
Josh Paetzel
8151740c88 Harden CARP against network loops.
If there is a loop in the network a CARP that is in MASTER state will see it's
own broadcasts, which will then cause it to assume BACKUP state.  When it
assumes BACKUP it will stop sending advertisements.  In that state it will no
longer see advertisements and will assume MASTER...

We can't catch all the cases where we are seeing our own CARP broadcast, but
we can catch the obvious case.

Submitted by:	torek
Obtained from:	FreeNAS
MFC after:	2 weeks
Sponsored by:	iXsystems
2016-12-30 18:46:21 +00:00
Andrey V. Elsukov
2e77d270c1 When we are sending IP fragments, update ip pointers in IP_PROBE() for
each fragment.

MFC after:	1 week
2016-12-29 19:57:46 +00:00