Commit Graph

5932 Commits

Author SHA1 Message Date
Matt Macy
d7c5a620e2 ifnet: Replace if_addr_lock rwlock with epoch + mutex
Run on LLNW canaries and tested by pho@

gallatin:
Using a 14-core, 28-HTT single socket E5-2697 v3 with a 40GbE MLX5
based ConnectX 4-LX NIC, I see an almost 12% improvement in received
packet rate, and a larger improvement in bytes delivered all the way
to userspace.

When the host receiving 64 streams of netperf -H $DUT -t UDP_STREAM -- -m 1,
I see, using nstat -I mce0 1 before the patch:

InMpps OMpps  InGbs  OGbs err TCP Est %CPU syscalls csw     irq GBfree
4.98   0.00   4.42   0.00 4235592     33   83.80 4720653 2149771   1235 247.32
4.73   0.00   4.20   0.00 4025260     33   82.99 4724900 2139833   1204 247.32
4.72   0.00   4.20   0.00 4035252     33   82.14 4719162 2132023   1264 247.32
4.71   0.00   4.21   0.00 4073206     33   83.68 4744973 2123317   1347 247.32
4.72   0.00   4.21   0.00 4061118     33   80.82 4713615 2188091   1490 247.32
4.72   0.00   4.21   0.00 4051675     33   85.29 4727399 2109011   1205 247.32
4.73   0.00   4.21   0.00 4039056     33   84.65 4724735 2102603   1053 247.32

After the patch

InMpps OMpps  InGbs  OGbs err TCP Est %CPU syscalls csw     irq GBfree
5.43   0.00   4.20   0.00 3313143     33   84.96 5434214 1900162   2656 245.51
5.43   0.00   4.20   0.00 3308527     33   85.24 5439695 1809382   2521 245.51
5.42   0.00   4.19   0.00 3316778     33   87.54 5416028 1805835   2256 245.51
5.42   0.00   4.19   0.00 3317673     33   90.44 5426044 1763056   2332 245.51
5.42   0.00   4.19   0.00 3314839     33   88.11 5435732 1792218   2499 245.52
5.44   0.00   4.19   0.00 3293228     33   91.84 5426301 1668597   2121 245.52

Similarly, netperf reports 230Mb/s before the patch, and 270Mb/s after the patch

Reviewed by:	gallatin
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15366
2018-05-18 20:13:34 +00:00
Mark Johnston
b35822d9d0 Fix netdump configuration when VIMAGE is enabled.
We need to set the current vnet before iterating over the global
interface list. Because the dump device may only be set from the host,
only proceed with configuration if the thread belongs to the default
vnet. [1]

Also fix a resource leak that occurs if the priv_check() in set_dumper()
fails.

Reported by:	mmacy, sbruno [1]
Reviewed by:	sbruno
X-MFC with:	r333283
Differential Revision:	https://reviews.freebsd.org/D15449
2018-05-17 04:08:57 +00:00
Lawrence Stewart
9891578a40 Plug a memory leak and potential NULL-pointer dereference introduced in r331214.
Each TCP connection that uses the system default cc_newreno(4) congestion
control algorithm module leaks a "struct newreno" (8 bytes of memory) at
connection initialisation time. The NULL-pointer dereference is only germane
when using the ABE feature, which is disabled by default.

While at it:

- Defer the allocation of memory until it is actually needed given that ABE is
  optional and disabled by default.

- Document the ENOMEM errno in getsockopt(2)/setsockopt(2).

- Document ENOMEM and ENOBUFS in tcp(4) as being synonymous given that they are
  used interchangeably throughout the code.

- Fix a few other nits also accidentally omitted from the original patch.

Reported by:	Harsh Jain on freebsd-net@
Tested by:	tjh@
Differential Revision:	https://reviews.freebsd.org/D15358
2018-05-17 02:46:27 +00:00
Brooks Davis
514ef08caf Unwrap a line that no longer requires wrapping. 2018-05-15 20:14:38 +00:00
Brooks Davis
423349f696 Remove stray tabs from in_lltable_dump_entry(). 2018-05-15 20:13:00 +00:00
Stephen Hurd
f2cf90e264 Check that ifma_protospec != NULL in inm_lookup
If ifma_protospec is NULL when inm_lookup() is called, there
is a dereference in a NULL struct pointer. This ensures that struct is
not NULL before comparing the address.

Reported by:	dumbbell
Reviewed by:	sbruno
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15440
2018-05-15 16:54:41 +00:00
Michael Tuexen
482d420926 sctp_get_mbuf_for_msg() should honor the allinone parameter.
When it is not required that the buffer is not a chain, return
a chain. This is based on a patch provided by Irene Ruengeler.
2018-05-14 15:16:51 +00:00
Michael Tuexen
589c42c2c8 Ensure that the MTU's used are multiple of 4.
The length of SCTP packets is always a multiple of 4. Therefore,
ensure that the MTUs used are also a multiple of 4.

Thanks to Irene Ruengeler for providing an earlier version of this
patch.

MFC after:	1 week
2018-05-14 13:50:17 +00:00
Stephen Hurd
b69888c28f Fix LORs in in6?_leave_group()
r333175 updated the join_group functions, but not the leave_group ones.

Reviewed by:	sbruno
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15393
2018-05-11 21:42:27 +00:00
Warner Losh
33123867af Use the full year, for real this time. 2018-05-09 20:26:37 +00:00
Warner Losh
603bbd0631 Minor style nits
Use full copyright year.
Remove 'All Rights Reserved' from new file (rights holder OK'd)
Minor #ifdef motion and #endif tagging
Remove __FBSDID macro from comments

Sponsored by: Netflix
OK'd by: rrs@
2018-05-09 14:11:35 +00:00
Michael Tuexen
45d41de5e6 Fix two typos reported by N. J. Mann, which were introduced in
https://svnweb.freebsd.org/changeset/base/333382 by me.

MFC after:	3 days
2018-05-08 20:39:35 +00:00
Michael Tuexen
9669e724d1 When reporting ERROR or ABORT chunks, don't use more data
that is guaranteed to be contigous.
Thanks to Felix Weinrank for finding and reporting this bug
by fuzzing the usrsctp stack.

MFC after:	3 days
2018-05-08 18:48:51 +00:00
Matt Macy
10d20c84ed Fix spurious retransmit recovery on low latency networks
TCP's smoothed RTT (SRTT) can be much larger than an actual observed RTT. This can be either because of hz restricting the calculable RTT to 10ms in VMs or 1ms using the default 1000hz or simply because SRTT recently incorporated a larger value.

If an ACK arrives before the calculated badrxtwin (now + SRTT):
tp->t_badrxtwin = ticks + (tp->t_srtt >> (TCP_RTT_SHIFT + 1));

We'll erroneously reset snd_una to snd_max. If multiple segments were dropped and this happens repeatedly the transmit rate will be limited to 1MSS per RTO until we've retransmitted all drops.

Reported by:	rstone
Reviewed by:	hiren, transport
Approved by:	sbruno
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D8556
2018-05-08 02:22:34 +00:00
Alexander Motin
167a34407c Keep CARP state as INIT when net.inet.carp.allow=0.
Currently when net.inet.carp.allow=0 CARP state remains as MASTER, which is
not very useful (if there are other masters -- it can lead to split brain,
if there are none -- it makes no sense).  Having it as INIT makes it clear
that carp packets are disabled.

Submitted by:	wg
MFC after:	1 month
Relnotes:	yes
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D14477
2018-05-07 14:44:55 +00:00
Matt Macy
b6f6f88018 r333175 introduced deferred deletion of multicast addresses in order to permit the driver ioctl
to sleep on commands to the NIC when updating multicast filters. More generally this permitted
driver's to use an sx as a softc lock. Unfortunately this change introduced a race whereby a
a multicast update would still be queued for deletion when ifconfig deleted the interface
thus calling down in to _purgemaddrs and synchronously deleting _all_ of the multicast addresses
on the interface.

Synchronously remove all external references to a multicast address before enqueueing for delete.

Reported by:	lwhsu
Approved by:	sbruno
2018-05-06 20:34:13 +00:00
Michael Tuexen
67e8b08bbe Ensure we are not dereferencing a NULL pointer.
This was found by Coverity scanning the usrsctp stack (CID 203808).

MFC after:	3 days
2018-05-06 14:19:50 +00:00
Mark Johnston
e505460228 Import the netdump client code.
This is a component of a system which lets the kernel dump core to
a remote host after a panic, rather than to a local storage device.
The server component is available in the ports tree. netdump is
particularly useful on diskless systems.

The netdump(4) man page contains some details describing the protocol.
Support for configuring netdump will be added to dumpon(8) in a future
commit. To use netdump, the kernel must have been compiled with the
NETDUMP option.

The initial revision of netdump was written by Darrell Anderson and
was integrated into Sandvine's OS, from which this version was derived.

Reviewed by:	bdrewery, cem (earlier versions), julian, sbruno
MFC after:	1 month
X-MFC note:	use a spare field in struct ifnet
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D15253
2018-05-06 00:38:29 +00:00
Matt Macy
28c001002a Currently in_pcbfree will unconditionally wunlock the pcbinfo lock
to avoid a LOR on the multicast list lock in the freemoptions routines.
As it turns out, tcp_usr_detach can acquire the tcbinfo lock readonly.
Trying to wunlock the pcbinfo lock in that context has caused a number
of reported crashes.

This change unclutters in_pcbfree and moves the handling of wunlock vs
runlock of pcbinfo to the freemoptions routine.

Reported by:	mjg@, bde@, o.hartmann at walstatt.org
Approved by:	sbruno
2018-05-05 22:40:40 +00:00
Andrey V. Elsukov
5ada542398 Immediately propagate EACCES error code to application from tcp_output.
In r309610 and r315514 the behavior of handling EACCES was changed, and
tcp_output() now returns zero when EACCES happens. The reason of this
change was a hesitation that applications that use TCP-MD5 will be
affected by changes in project/ipsec.

TCP-MD5 code returns EACCES when security assocition for given connection
is not configured. But the same error code can return pfil(9), and this
change has affected connections blocked by pfil(9). E.g. application
doesn't return immediately when SYN segment is blocked, instead it waits
when several tries will be failed.

Actually, for TCP-MD5 application it doesn't matter will it get EACCES
after first SYN, or after several tries. Security associtions must be
configured before initiating TCP connection.

I left the EACCES in the switch() to show that it has special handling.

Reported by:	Andreas Longwitz <longwitz at incore dot de>
MFC after:	10 days
2018-05-04 09:28:12 +00:00
Sean Bruno
68ff29affa cc_cubic:
- Update cubic parameters to draft-ietf-tcpm-cubic-04

Submitted by:	Matt Macy <mmacy@mattmacy.io>
Reviewed by:	lstewart
Differential Revision:	https://reviews.freebsd.org/D10556
2018-05-03 15:01:27 +00:00
Michael Tuexen
4c6a10903f SImplify the call to tcp_drop(), since the handling of soft error
is also done in tcp_drop(). No functional change.

Sponsored by:	Netflix, Inc.
2018-05-02 20:04:31 +00:00
Stephen Hurd
f3e1324b41 Separate list manipulation locking from state change in multicast
Multicast incorrectly calls in to drivers with a mutex held causing drivers
to have to go through all manner of contortions to use a non sleepable lock.
Serialize multicast updates instead.

Submitted by:	mmacy <mmacy@mattmacy.io>
Reviewed by:	shurd, sbruno
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D14969
2018-05-02 19:36:29 +00:00
Randall Stewart
803a230592 This change re-arranges the fields within the tcp-pcb so that
they are more in order of cache line use as one passes
through the tcp_input/output paths (non-errors most likely path). This
helps speed up cache line optimization so that the tcp stack runs
a bit more efficently.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D15136
2018-04-26 21:41:16 +00:00
Sean Bruno
7875017ca9 Revert r332894 at the request of the submitter.
Submitted by:	Johannes Lundberg <johalun0_gmail.com>
Sponsored by:	Limelight Networks
2018-04-24 19:55:12 +00:00
Sean Bruno
7b7796eea5 Load balance sockets with new SO_REUSEPORT_LB option
This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.

Most of the code was copied from a similar patch for DragonflyBSD.

However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.

Required changes to structures
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.

Limitations
As DragonflyBSD, a load balance group is limited to 256 pcbs
(256 programs or threads sharing the same socket).

Submitted by:	Johannes Lundberg <johanlun0@gmail.com>
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D11003
2018-04-23 19:51:00 +00:00
Randall Stewart
fd389e7cd5 These two modules need the tcp_hpts.h file for
when the option is enabled (not sure how LINT/build-universe
missed this) opps.

Sponsored by:	Netflix Inc
2018-04-19 15:03:48 +00:00
Randall Stewart
3ee9c3c4eb This commit brings in the TCP high precision timer system (tcp_hpts).
It is the forerunner/foundational work of bringing in both Rack and BBR
which use hpts for pacing out packets. The feature is optional and requires
the TCPHPTS option to be enabled before the feature will be active. TCP
modules that use it must assure that the base component is compile in
the kernel in which they are loaded.

MFC after:	Never
Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D15020
2018-04-19 13:37:59 +00:00
Brooks Davis
3a4fc8a8a1 Remove support for the Arcnet protocol.
While Arcnet has some continued deployment in industrial controls, the
lack of drivers for any of the PCI, USB, or PCIe NICs on the market
suggests such users aren't running FreeBSD.

Evidence in the PR database suggests that the cm(4) driver (our sole
Arcnet NIC) was broken in 5.0 and has not worked since.

PR:		182297
Reviewed by:	jhibbits, vangyzen
Relnotes:	yes
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15057
2018-04-13 21:18:04 +00:00
Brooks Davis
0437c8e3b1 Remove support for FDDI networks.
Defines in net/if_media.h remain in case code copied from ifconfig is in
use elsewere (supporting non-existant media type is harmless).

Reviewed by:	kib, jhb
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15017
2018-04-11 17:28:24 +00:00
Jonathan T. Looney
f897951965 Add missing header change from r332381.
Sponsored by:	Netflix, Inc.
Pointy hat:	jtl
2018-04-10 17:00:37 +00:00
Jonathan T. Looney
8b8718b5b0 Modify the net.inet.tcp.function_ids sysctl introduced in r331347.
Export additional information which may be helpful to userspace
consumers and rename the sysctl to net.inet.tcp.function_info.

Sponsored by:	Netflix, Inc.
2018-04-10 16:59:36 +00:00
Jonathan T. Looney
dcaffbd6fb Move the TCP Blackbox Recorder probe in tcp_output.c to be with the
other tracing/debugging code.

Sponsored by:	Netflix, Inc.
2018-04-10 15:54:29 +00:00
Jonathan T. Looney
4889b58ce8 Clean up some debugging code left in tcp_log_buf.c from r331347.
Sponsored by:	Netflix, Inc.
2018-04-10 15:51:37 +00:00
Michael Tuexen
efcf28ef77 Fix a logical inversion bug.
Thanks to Irene Ruengeler for finding and reporting this bug.

MFC after:	3 days
2018-04-08 12:08:20 +00:00
Michael Tuexen
3bcfd10fe1 Small cleanup, no functional change.
MFC after:	3 days
2018-04-08 11:50:06 +00:00
Michael Tuexen
c3115feb7e Fix a signed/unsigned warning showing up for the userland stack
on some platforms.
Thanks to Felix Weinrank for reporting the issue.

MFC after:i	3 days
2018-04-08 11:37:00 +00:00
Brooks Davis
6469bdcdb6 Move most of the contents of opt_compat.h to opt_global.h.
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c.  A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by:	kib, cem, jhb, jtl
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14941
2018-04-06 17:35:35 +00:00
Jonathan T. Looney
8fa799bd74 If a user closes the socket before we call tcp_usr_abort(), then
tcp_drop() may unlock the INP.  Currently, tcp_usr_abort() does not
check for this case, which results in a panic while trying to unlock
the already-unlocked INP (not to mention, a use-after-free violation).

Make tcp_usr_abort() check the return value of tcp_drop(). In the case
where tcp_drop() returns NULL, tcp_usr_abort() can skip further steps
to abort the connection and simply unlock the INP_INFO lock prior to
returning.

Reviewed by:	glebius
MFC after:	2 weeks
Sponsored by:	Netflix, Inc.
2018-04-06 17:20:37 +00:00
Jonathan T. Looney
45a48d07c3 Check that in_pcbfree() is only called once for each PCB. If that
assumption is violated, "bad things" could follow.

I believe such an assert would have detected some of the problems jch@
was chasing in PR 203175 (see r307551).  We also use it in our internal
TCP development efforts.  And, in case a bug does slip through to
released code, this change silently ignores subsequent calls to
in_pcbfree().

Reviewed by:	rrs
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D14990
2018-04-06 16:48:11 +00:00
Ed Maste
c73b6f4da9 Fix kernel memory disclosure in tcp_ctloutput
strcpy was used to copy a string into a buffer copied to userland, which
left uninitialized data after the terminating 0-byte.  Use the same
approach as in tcp_subr.c: strncpy and explicit '\0'.

admbugs:	765, 822
MFC after:	1 day
Reported by:	Ilja Van Sprundel <ivansprundel@ioactive.com>
Reported by:	Vlad Tsyrklevich
Security:	Kernel memory disclosure
Sponsored by:	The FreeBSD Foundation
2018-04-04 21:12:35 +00:00
Jonathan T. Looney
4b9dc36454 r330675 introduced an extra window check in the LRO code to ensure it
captured and reported the highest window advertisement with the same
SEQ/ACK.  However, the window comparison uses modulo 2**16 math, rather
than directly comparing the absolute values.  Because windows use
absolute values and not modulo 2**16 math (i.e. they don't wrap), we
need to compare the absolute values.

Reviewed by:	gallatin
MFC after:	3 days
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D14937
2018-04-03 13:54:38 +00:00
Navdeep Parhar
a64564109a Add a hook to allow the toedev handling an offloaded connection to
provide accurate TCP_INFO.

Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D14816
2018-04-03 01:08:54 +00:00
Brooks Davis
541d96aaaf Use an accessor function to access ifr_data.
This fixes 32-bit compat (no ioctl command defintions are required
as struct ifreq is the same size).  This is believed to be sufficent to
fully support ifconfig on 32-bit systems.

Reviewed by:	kib
Obtained from:	CheriBSD
MFC after:	1 week
Relnotes:	yes
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14900
2018-03-30 18:50:13 +00:00
Navdeep Parhar
b9a9e8e9bd Fix RSS build (broken in r331309).
Sponsored by:	Chelsio Communications
2018-03-29 19:48:17 +00:00
Brooks Davis
69f0fecbd6 Remove infrastructure for token-ring networks.
Reviewed by:	cem, imp, jhb, jmallett
Relnotes:	yes
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14875
2018-03-28 23:33:26 +00:00
Sean Bruno
e2041bfa7c CC Cubic: fix underflow for cubic_cwnd()
Singed calculations in cubic_cwnd() can result in negative cwnd
value which is then cast to an unsigned value. Values less than
1 mss are generally bad for other parts of the code, also fixed.

Submitted by:	Jason Eggleston <jason@eggnet.com>
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D14141
2018-03-26 19:53:36 +00:00
Jonathan T. Looney
e24e568336 Make the TCP blackbox code committed in r331347 be an optional feature
controlled by the TCP_BLACKBOX option.

Enable this as part of amd64 GENERIC. For now, leave it disabled on
other platforms.

Sponsored by:	Netflix, Inc.
2018-03-24 12:48:10 +00:00
Jonathan T. Looney
9959c8b9e6 Fix compilation for platforms that don't support atomic_fetchadd_64()
after r331347.

Reported by:	avg, br, jhibbits
Sponsored by:	Netflix, Inc.
2018-03-24 12:40:45 +00:00
Sean Bruno
72bfa0bf63 Revert r331379 as the "simple" lock changes have revealed a deeper problem
and need for a rethink.

Submitted by:	Jason Eggleston <jason@eggnet.com>
Sponsored by:	Limelight Networks
2018-03-23 18:34:38 +00:00