Commit Graph

75 Commits

Author SHA1 Message Date
mmacy
ecd6e9d307 UDP: further performance improvements on tx
Cumulative throughput while running 64
  netperf -H $DUT -t UDP_STREAM -- -m 1
on a 2x8x2 SKL went from 1.1Mpps to 2.5Mpps

Single stream throughput increases from 910kpps to 1.18Mpps

Baseline:
https://people.freebsd.org/~mmacy/2018.05.11/udpsender2.svg

- Protect read access to global ifnet list with epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender3.svg

- Protect short lived ifaddr references with epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender4.svg

- Convert if_afdata read lock path to epoch
https://people.freebsd.org/~mmacy/2018.05.11/udpsender5.svg

A fix for the inpcbhash contention is pending sufficient time
on a canary at LLNW.

Reviewed by:	gallatin
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15409
2018-05-23 21:02:14 +00:00
emaste
3de0b917bf Pair CURVNET_SET and CURVNET_RESTORE in a block
Per vnet(9), CURVNET_SET and CURVNET_RESTORE cannot be used as a single
statement for a conditional and CURVNET_RESTORE must be in the same
block as CURVNET_SET (or a subblock).

Reviewed by:	andrew
Sponsored by:	The FreeBSD Foundation
2018-05-21 13:08:44 +00:00
emaste
326c8de9ed Revert r333968, it broke all archs but i386 and amd64 2018-05-21 11:56:07 +00:00
mmacy
649df389bf in(6)_mcast: Expand out vnet set / restore macro so that they work in a conditional block
Reported by:	zec at fer.hr
2018-05-21 08:34:10 +00:00
mmacy
64bb7bd82b ensure that vnet is set when doing in_leavegroup 2018-05-21 07:12:06 +00:00
mmacy
ce91e745ec ip(6)_freemoptions: defer imo destruction to epoch callback task
Avoid the ugly unlock / lock of the inpcbinfo where we need to
figure out what kind of lock we hold by simply deferring the
operation to another context. (Also a small dependency for
converting the pcbinfo read lock to epoch)
2018-05-20 00:22:28 +00:00
mmacy
02971321ff netinet silence warnings 2018-05-19 05:56:21 +00:00
mmacy
7aeac9ef18 ifnet: Replace if_addr_lock rwlock with epoch + mutex
Run on LLNW canaries and tested by pho@

gallatin:
Using a 14-core, 28-HTT single socket E5-2697 v3 with a 40GbE MLX5
based ConnectX 4-LX NIC, I see an almost 12% improvement in received
packet rate, and a larger improvement in bytes delivered all the way
to userspace.

When the host receiving 64 streams of netperf -H $DUT -t UDP_STREAM -- -m 1,
I see, using nstat -I mce0 1 before the patch:

InMpps OMpps  InGbs  OGbs err TCP Est %CPU syscalls csw     irq GBfree
4.98   0.00   4.42   0.00 4235592     33   83.80 4720653 2149771   1235 247.32
4.73   0.00   4.20   0.00 4025260     33   82.99 4724900 2139833   1204 247.32
4.72   0.00   4.20   0.00 4035252     33   82.14 4719162 2132023   1264 247.32
4.71   0.00   4.21   0.00 4073206     33   83.68 4744973 2123317   1347 247.32
4.72   0.00   4.21   0.00 4061118     33   80.82 4713615 2188091   1490 247.32
4.72   0.00   4.21   0.00 4051675     33   85.29 4727399 2109011   1205 247.32
4.73   0.00   4.21   0.00 4039056     33   84.65 4724735 2102603   1053 247.32

After the patch

InMpps OMpps  InGbs  OGbs err TCP Est %CPU syscalls csw     irq GBfree
5.43   0.00   4.20   0.00 3313143     33   84.96 5434214 1900162   2656 245.51
5.43   0.00   4.20   0.00 3308527     33   85.24 5439695 1809382   2521 245.51
5.42   0.00   4.19   0.00 3316778     33   87.54 5416028 1805835   2256 245.51
5.42   0.00   4.19   0.00 3317673     33   90.44 5426044 1763056   2332 245.51
5.42   0.00   4.19   0.00 3314839     33   88.11 5435732 1792218   2499 245.52
5.44   0.00   4.19   0.00 3293228     33   91.84 5426301 1668597   2121 245.52

Similarly, netperf reports 230Mb/s before the patch, and 270Mb/s after the patch

Reviewed by:	gallatin
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15366
2018-05-18 20:13:34 +00:00
shurd
d32470f6c3 Check that ifma_protospec != NULL in inm_lookup
If ifma_protospec is NULL when inm_lookup() is called, there
is a dereference in a NULL struct pointer. This ensures that struct is
not NULL before comparing the address.

Reported by:	dumbbell
Reviewed by:	sbruno
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15440
2018-05-15 16:54:41 +00:00
shurd
210352aeb1 Fix LORs in in6?_leave_group()
r333175 updated the join_group functions, but not the leave_group ones.

Reviewed by:	sbruno
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15393
2018-05-11 21:42:27 +00:00
mmacy
d3f138323c r333175 introduced deferred deletion of multicast addresses in order to permit the driver ioctl
to sleep on commands to the NIC when updating multicast filters. More generally this permitted
driver's to use an sx as a softc lock. Unfortunately this change introduced a race whereby a
a multicast update would still be queued for deletion when ifconfig deleted the interface
thus calling down in to _purgemaddrs and synchronously deleting _all_ of the multicast addresses
on the interface.

Synchronously remove all external references to a multicast address before enqueueing for delete.

Reported by:	lwhsu
Approved by:	sbruno
2018-05-06 20:34:13 +00:00
mmacy
d4a327c789 Currently in_pcbfree will unconditionally wunlock the pcbinfo lock
to avoid a LOR on the multicast list lock in the freemoptions routines.
As it turns out, tcp_usr_detach can acquire the tcbinfo lock readonly.
Trying to wunlock the pcbinfo lock in that context has caused a number
of reported crashes.

This change unclutters in_pcbfree and moves the handling of wunlock vs
runlock of pcbinfo to the freemoptions routine.

Reported by:	mjg@, bde@, o.hartmann at walstatt.org
Approved by:	sbruno
2018-05-05 22:40:40 +00:00
shurd
7d4b8facc7 Separate list manipulation locking from state change in multicast
Multicast incorrectly calls in to drivers with a mutex held causing drivers
to have to go through all manner of contortions to use a non sleepable lock.
Serialize multicast updates instead.

Submitted by:	mmacy <mmacy@mattmacy.io>
Reviewed by:	shurd, sbruno
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D14969
2018-05-02 19:36:29 +00:00
pfg
78a6b08618 sys: general adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

No functional change intended.
2017-11-27 15:23:17 +00:00
ngie
e40523722e Add missing braces around MCAST_EXCLUDE check when KTR support is
compiled into the kernel

This ensures that .iss_asm (the number of ASM listeners) isn't incorrectly
decremented for MLD-layer source datagrams when inspecting im*s_st[1]
(the second state in the structure).

MFC after:	2 months
PR:		217509 [1]
Reported by:	Coverity (Isilon)
Reviewed by:	ae ("This patch looks correct to me." [1])
Submitted by:	Miles Ohlrich <miles.ohlrich@isilon.com>
Sponsored by:	Dell EMC Isilon
2017-05-13 18:41:24 +00:00
vangyzen
d7b79e158f Add some ntohl() love to r315277
inet_ntoa() and inet_ntoa_r() take the address in network
byte-order.  When I removed those calls, I should have
replaced them with ntohl() to make the hex addresses slightly
less unreadable.  Here they are.

See r315277 regarding classic blunders.

vangyzen: you're deep in "no good deed" territory, it seems
    --badger

Reported by:	ian
MFC after:	3 days
MFC when:	I finally get it right
Sponsored by:	Dell EMC
2017-03-14 20:57:54 +00:00
vangyzen
fad1018fc7 KTR: log IPv4 addresses in hex rather than dotted-quad
When I made the changes in r313821, I fell victim to one of the
classic blunders, the most famous of which is: never get involved
in a land war in Asia.  But only slightly less well known is this:
Keep your brain turned on and engaged when making a tedious, sweeping,
mechanical change.  KTR can correctly log the immediate integral values
passed to it, as well as constant strings, but not non-constant strings,
since they might change by the time ktrdump retrieves them.

Reported by:	glebius
MFC after:	3 days
Sponsored by:	Dell EMC
2017-03-14 18:27:48 +00:00
vangyzen
c7c348accc Use inet_ntoa_r() instead of inet_ntoa() throughout the kernel
inet_ntoa() cannot be used safely in a multithreaded environment
because it uses a static local buffer. Instead, use inet_ntoa_r()
with a buffer on the caller's stack.

Suggested by:	glebius, emaste
Reviewed by:	gnn
MFC after:	2 weeks
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D9625
2017-02-16 20:47:41 +00:00
pfg
d9c9113377 sys/net*: minor spelling fixes.
No functional change.
2016-05-03 18:05:43 +00:00
pfg
fe6332dab0 netinet: for pointers replace 0 with NULL.
These are mostly cosmetical, no functional change.

Found with devel/coccinelle.

Reviewed by:	ae. tuexen
2016-04-15 15:46:41 +00:00
melifaro
2bb0e924cc Make in_arpinput(), inp_lookup_mcast_ifp(), icmp_reflect(),
ip_dooptions(), icmp6_redirect_input(), in6_lltable_rtcheck(),
  in6p_lookup_mcast_ifp() and in6_selecthlim() use new routing api.

Eliminate now-unused ip_rtaddr().
Fix lookup key fib6_lookup_nh_basic() which was lost diring merge.
Make fib6_lookup_nh_basic() and fib6_lookup_nh_extended() always
  return IPv6 destination address with embedded scope. Currently
  rw_gateway has it scope embedded, do the same for non-gatewayed
  destinations.

Sponsored by:	Yandex LLC
2015-12-09 11:14:27 +00:00
cem
6ae0b4f9be in_getmulti: Fix recursion on if_addr_lock on malloc failure
When the M_NOWAIT allocation fails, we recurse the if_addr_lock trying
to clean up.  Reorder the cleanup after dropping the if_addr_lock.  The
obvious race is already possible between if_addmulti and IF_ADDR_WLOCK
above, so it must be ok.

Submitted by:	Ryan Libby <rlibby@gmail.com>
Reviewed by:	jhb
Found with:	M_NOWAIT failure injection testing
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D4138
2015-11-18 23:53:13 +00:00
ae
75425458ac Convert in_ifaddr_lock and in6_ifaddr_lock to rmlock.
Both are used to protect access to IP addresses lists and they can be
acquired for reading several times per packet. To reduce lock contention
it is better to use rmlock here.

Reviewed by:	gnn (previous version)
Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D3149
2015-07-29 08:12:05 +00:00
kib
3422f5929d Fix build with KTR after r278978. 2015-02-19 15:41:23 +00:00
glebius
17f48f0414 Use new struct mbufq instead of struct ifqueue to manage packet queues in
IPv4 multicast code.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2015-02-19 01:21:02 +00:00
jhb
9d6dc35e9d Only define the full inm_print() if KTR_IGMPV3 is enabled at compile time. 2014-09-30 17:26:34 +00:00
hselasky
35b126e324 Pull in r267961 and r267973 again. Fix for issues reported will follow. 2014-06-28 03:56:17 +00:00
gjb
fc21f40567 Revert r267961, r267973:
These changes prevent sysctl(8) from returning proper output,
such as:

 1) no output from sysctl(8)
 2) erroneously returning ENOMEM with tools like truss(1)
    or uname(1)
 truss: can not get etype: Cannot allocate memory
2014-06-27 22:05:21 +00:00
hselasky
bd1ed65f0f Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
gnn
61b1174206 Fix various places where we don't properly release a lock
PR:		185043
Submitted by:	Michael Bentkofsky
MFC after:	2 weeks
2014-01-16 22:14:54 +00:00
dim
c0103af5cd In sys/netinet/in_mcast.c, inm_is_ifp_detached() is only used whenever
KTR is defined, so put it between #ifdef KTR guards.  This avoids a
warning about a unused function if KTR is not enabled.

MFC after:	 3 days
2013-12-24 20:25:18 +00:00
glebius
24a0040a5c Uninline inm_lookup_locked(). Now in_var.h doesn't dereference
fields of struct ifnet.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-29 11:21:31 +00:00
glebius
ff6e113f1b The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-26 17:58:36 +00:00
delphij
d76e7522db Fix an integer overflow in computing the size of a temporary buffer
can result in a buffer which is too small for the requested
operation.

Security:	CVE-2013-3077
Security:	FreeBSD-SA-13:09.ip_multicast
2013-08-22 00:51:37 +00:00
hrs
b3bce89084 Fix a panic when leaving MC group in a kernel with VIMAGE enabled.
in_leavegroup() is called from an asynchronous task, and
igmp_change_state() requires that curvnet is set by the caller.
2013-07-02 16:39:12 +00:00
glebius
b7d04855ff Remove unused variable. 2012-01-24 14:27:14 +00:00
jhb
4ef366671a Convert all users of IF_ADDR_LOCK to use new locking macros that specify
either a read lock or write lock.

Reviewed by:	bz
MFC after:	2 weeks
2012-01-05 19:00:36 +00:00
jhb
419867b9b4 Defer the work of freeing IPv4 multicast options from a socket to an
asychronous task.  This avoids tearing down multicast state including
sending IGMP leave messages and reprogramming MAC filters while holding
the per-protocol global pcbinfo lock that is used in the receive path of
packet processing.

Reviewed by:	rwatson
MFC after:	1 month
2011-12-29 20:41:16 +00:00
ed
0c56cf839d Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
2011-11-07 15:43:11 +00:00
bms
a59efb5aef Fix a few issues related to the legacy 4.4 BSD multicast APIs.
IPv4 addresses can and do change during normal operation. Testing by
pfSense developers exposed an issue where OpenOSPFD was using the IPv4
address to leave the OSPF link-scope multicast groups on a dynamic
OpenVPN tun interface, rather than using RFC 3678 with the interface
index, which won't be raced when the interface's addresses change.

In inp_join_group():
 If we are already a member of an ASM group, and IP_ADD_MEMBERSHIP or
 MCAST_JOIN_GROUP ioctls are re-issued, return EADDRINUSE as per the
 legacy 4.4BSD multicast API. This bends RFC 3678 slightly, but does
 not violate POLA for apps using the old API.
 It also stops us falling through to kicking IGMP state transactions
 in what is otherwise a no-op case.
 [This has already been dealt with in HEAD, but make it explicit before
  we MFC the change to 8.]

In inp_leave_group():
 Fix a bogus conditional.
 Move the ifp null check to ioctls MCAST_LEAVE* in the switch..case
 where it actually belongs.
 If an interface was specified, by primary IPv4 address, for ioctl
 IP_DROP_MEMBERSHIP or MCAST_LEAVE_GROUP (an ASM full leave operation),
 then and only then should we look up the ifp from the IPv4 address in
 mreqs.imr_interface.
 If not, we fall through to imo_match_group() as before, but only in
 the IP_DROP_MEMBERSHIP case.

With these changes, the legacy 4.4BSD multicast API idempotence should
be mostly preserved in the SSM enabled IPv4 stack.

Found by:	ermal (with pfSense)
MFC after:	3 days
2010-04-10 12:05:31 +00:00
bms
8c86c2baad Correct a comment.
MFC after:	1 day
2009-11-19 13:21:37 +00:00
bms
178f20a475 Return ENOBUFS consistently if user attempts to exceed
in_mcast_maxsocksrc resource limit.

Submitted by:	syrinx
MFC after:	3 days
2009-09-18 15:12:31 +00:00
bms
9dcdfc8226 Comment some flawed assumptions in inp_join_group() about
mixing SSM full-state and delta-based APIs.

ENOTIME to fix right now.  No functional changes.

MFC after:	5 days
2009-09-12 20:37:44 +00:00
bms
ebe8706174 Don't allow joins w/o source on an existing group.
This is almost always pilot error.

We don't need to check for group filter UNDEFINED state at t1,
because we only ever allocate filters with their groups, so we
unconditionally reject such calls with EINVAL.
Trying to change the active filter mode w/o going through IP_MSFILTER
is also disallowed.

Deals with the case described in PR 137164 upfront, cumulative
with the fix in svn rev 197132 which only calls imo_match_source()
if the source address family was not unspecified.

PR:		137164
MFC after:	5 days
2009-09-12 20:18:23 +00:00
bms
e3b721990b Tighten input checking in inp_join_group():
* Don't try to use the source address, when its family is unspecified.
 * If we get a join without a source, on an existing inclusive
   mode group, this is an error, as it would change the filter mode.

Fix a problem with the handling of in_mfilter for new memberships:
 * Do not rely on imf being NULL; it is explicitly initialized to a
   non-NULL pointer when constructing a membership.
 * Explicitly initialize *imf to EX mode when the source address
   is unspecified.

This fixes a problem with in_mfilter slot recycling in the join path.

PR:		138690
Submitted by:	Stef Walter
MFC after:	5 days
2009-09-12 19:45:55 +00:00
bms
fe0bf7a268 Fix an obvious logic error in the IPv4 multicast leave processing,
where the filter mode vector was not updated correctly after the leave.

PR:		138691
Submitted by:	Stef Walter
MFC after:	5 days
2009-09-12 19:07:03 +00:00
bms
3031a90b58 Fix an API issue in leave processing for IPv4 multicast groups.
* Do not assume that the group lookup performed by imo_match_group()
   is valid when ifp is NULL in this case.
 * Instead, return EADDRNOTAVAIL if the ifp cannot be resolved for the
   membership we are being asked to leave.

Caveat user:
 * The way IPv4 multicast memberships are implemented in the inpcb layer
   at the moment, has the side-effect that struct ip_moptions will
   still hold the membership, under the old ifp, until ip_freemoptions()
   is called for the parent inpcb.
 * The underlying issue is: the inpcb layer does not get notification
   of ifp being detached going away in a thread-safe manner.
   This is non-trivial to fix.

But hey, at least the kernel should't panic when you unplug a card.

PR:		138689
Submitted by:	Stef Walter
MFC after:	5 days
2009-09-12 18:55:15 +00:00
syrinx
805d7d8fea When joining a multicast group, the inp_lookup_mcast_ifp call
does a KASSERT that the group address is multicast, so the
check if this is indeed true and eventually return a EINVAL if not,
should be done before calling inp_lookup_mcast_ifp. This fixes a kernel
crash when calling setsockopt (sock, IPPROTO_IP, IP_ADD_MEMBERSHIP,...)
with invalid group address.

Reviewed by:	bms
Approved by:	bms

MFC after:	3 days
2009-09-07 16:00:33 +00:00
rwatson
fb9ffed650 Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks.  Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-08-01 19:26:27 +00:00
rwatson
57ca4583e7 Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator.  Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...).  This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack.  Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory.  Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy.  Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address.  When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by:  bz
Reviewed by:            bz, zec
Discussed with:         gnn, jamie, jeff, jhb, julian, sam
Suggested by:           peter
Approved by:            re (kensmith)
2009-07-14 22:48:30 +00:00