instead of a linear array.
The multicast memberships for the inpcb structure are protected by a
non-sleepable lock, INP_WLOCK(), which needs to be dropped when
calling the underlying possibly sleeping if_ioctl() method. When using
a linear array to keep track of multicast memberships, the computed
memory location of the multicast filter may suddenly change, due to
concurrent insertion or removal of elements in the linear array. This
in turn leads to various invalid memory access issues and kernel
panics.
To avoid this problem, put all multicast memberships on a STAILQ based
list. Then the memory location of the IPv4 and IPv6 multicast filters
become fixed during their lifetime and use after free and memory leak
issues are easier to track, for example by: vmstat -m | grep multi
All list manipulation has been factored into inline functions
including some macros, to easily allow for a future hash-list
implementation, if needed.
This patch has been tested by pho@ .
Differential Revision: https://reviews.freebsd.org/D20080
Reviewed by: markj @
MFC after: 1 week
Sponsored by: Mellanox Technologies
r333175 converted the global multicast lock to a sleepable sx lock,
so the lock order with respect to the (non-sleepable) inp lock changed.
To handle this, r333175 and r333505 added code to drop the inp lock,
but this opened races that could leave multicast group description
structures in an inconsistent state. This change fixes the problem by
simply acquiring the global lock sooner. Along the way, this fixes
some LORs and bogus error handling introduced in r333175, and commits
some related cleanup.
Reported by: syzbot+ba7c4943547e0604faca@syzkaller.appspotmail.com
Reported by: syzbot+1b803796ab94d11a46f9@syzkaller.appspotmail.com
Reviewed by: ae
MFC after: 3 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20070
When leaving a multicast group, a hole may be created in the inpcb's
source filter and group membership arrays. To remove the hole, the
succeeding array elements are copied over by one entry. The multicast
code expects that a newly allocated array element is initialized, but
the code which shifts a tail of the array was leaving stale data
in the final entry. Fix this by explicitly reinitializing the last
entry following such a copy.
Reported by: syzbot+f8c3c564ee21d650475e@syzkaller.appspotmail.com
Reviewed by: ae
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19872
connectivity.
Looking at past changes in this area like r337866, some refcounting
bugs have been introduced, one by one. For example like calling
in6m_disconnect() and in6m_rele_locked() in mld_v1_process_group_timer()
where previously no disconnect nor refcount decrement was done.
Calling in6m_disconnect() when it shouldn't causes IPv6 solitation to no
longer work, because all the multicast addresses receiving the solitation
messages are now deleted from the network interface.
This patch reverts some recent changes while improving the MLD
refcounting and concurrency model after the MLD code was converted
to using EPOCH(9).
List changes:
- All CK_STAILQ_FOREACH() macros are now properly enclosed into
EPOCH(9) sections. This simplifies assertion of locking inside
in6m_ifmultiaddr_get_inm().
- Corrected bad use of in6m_disconnect() leading to loss of IPv6
connectivity for MLD v1.
- Factored out checks for valid inm structure into
in6m_ifmultiaddr_get_inm().
PR: 233535
Differential Revision: https://reviews.freebsd.org/D18887
Reviewed by: bz (net)
Tested by: ae
MFC after: 1 week
Sponsored by: Mellanox Technologies
because the destructor will access the if_ioctl() callback in the ifnet
pointer which is about to be freed. This prevents use-after-free.
PR: 233535
Differential Revision: https://reviews.freebsd.org/D18887
Reviewed by: bz (net)
Tested by: ae
MFC after: 1 week
Sponsored by: Mellanox Technologies
- Remove macros that covertly create epoch_tracker on thread stack. Such
macros a quite unsafe, e.g. will produce a buggy code if same macro is
used in embedded scopes. Explicitly declare epoch_tracker always.
- Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list
IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read
locking macros to what they actually are - the net_epoch.
Keeping them as is is very misleading. They all are named FOO_RLOCK(),
while they no longer have lock semantics. Now they allow recursion and
what's more important they now no longer guarantee protection against
their companion WLOCK macros.
Note: INP_HASH_RLOCK() has same problems, but not touched by this commit.
This is non functional mechanical change. The only functionally changed
functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter
epoch recursively.
Discussed with: jtl, gallatin
This is actually several different bugs:
- The code is not designed to handle inpcb deletion after interface deletion
- add reference for inpcb membership
- The multicast address has to be removed from interface lists when the refcount
goes to zero OR when the interface goes away
- decouple list disconnect from refcount (v6 only for now)
- ifmultiaddr can exist past being on interface lists
- add flag for tracking whether or not it's enqueued
- deferring freeing moptions makes the incpb cleanup code simpler but opens the
door wider still to races
- call inp_gcmoptions synchronously after dropping the the inpcb lock
Fundamentally multicast needs a rewrite - but keep applying band-aids for now.
Tested by: kp
Reported by: novel, kp, lwhsu
On PowerPC (and possibly other architectures), that doesn't use
EARLY_AP_STARTUP, the config task queue may be used initialized.
This was observed while trying to mount the root fs from NFS, as
reported here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=230168.
This patch has 2 main changes:
1- Perform a basic initialization of qgroup_config, similar to
what is done in taskqgroup_adjust, but simpler.
This makes qgroup_config ready to be used during NFS root mount.
2- When EARLY_AP_STARTUP is not used, call inm_init() and
in6m_init() right before SI_SUB_ROOT_CONF, because bootp needs
to send multicast packages to request an IP.
PR: Bug 230168
Reported by: sbruno
Reviewed by: jhibbits, mmacy, sbruno
Approved by: jhibbits
Differential Revision: D16633
Per vnet(9), CURVNET_SET and CURVNET_RESTORE cannot be used as a single
statement for a conditional and CURVNET_RESTORE must be in the same
block as CURVNET_SET (or a subblock).
Reviewed by: andrew
Sponsored by: The FreeBSD Foundation
Avoid the ugly unlock / lock of the inpcbinfo where we need to
figure out what kind of lock we hold by simply deferring the
operation to another context. (Also a small dependency for
converting the pcbinfo read lock to epoch)
r333175 updated the join_group functions, but not the leave_group ones.
Reviewed by: sbruno
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15393
Use LIST_FOREACH_SAFE in in6m_disconnect() since we're
deleting and freeing item from the membership list
while traversing the list.
Reviewed by: mmacy
Sponsored by: Netflix
to sleep on commands to the NIC when updating multicast filters. More generally this permitted
driver's to use an sx as a softc lock. Unfortunately this change introduced a race whereby a
a multicast update would still be queued for deletion when ifconfig deleted the interface
thus calling down in to _purgemaddrs and synchronously deleting _all_ of the multicast addresses
on the interface.
Synchronously remove all external references to a multicast address before enqueueing for delete.
Reported by: lwhsu
Approved by: sbruno
to avoid a LOR on the multicast list lock in the freemoptions routines.
As it turns out, tcp_usr_detach can acquire the tcbinfo lock readonly.
Trying to wunlock the pcbinfo lock in that context has caused a number
of reported crashes.
This change unclutters in_pcbfree and moves the handling of wunlock vs
runlock of pcbinfo to the freemoptions routine.
Reported by: mjg@, bde@, o.hartmann at walstatt.org
Approved by: sbruno
Multicast incorrectly calls in to drivers with a mutex held causing drivers
to have to go through all manner of contortions to use a non sleepable lock.
Serialize multicast updates instead.
Submitted by: mmacy <mmacy@mattmacy.io>
Reviewed by: shurd, sbruno
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14969
This reduces noise when kernel is compiled by newer GCC versions,
such as one used by external toolchain ports.
Reviewed by: kib, andrew(sys/arm and sys/arm64), emaste(partial), erj(partial)
Reviewed by: jhb (sys/dev/pci/* sys/kern/vfs_aio.c and sys/kern/kern_synch.c)
Differential Revision: https://reviews.freebsd.org/D10385
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
No functional change intended.
compiled into the kernel
This ensures that .iss_asm (the number of ASM listeners) isn't incorrectly
decremented for MLD-layer source datagrams when inspecting im*s_st[1]
(the second state in the structure).
MFC after: 2 months
PR: 217509 [1]
Reported by: Coverity (Isilon)
Reviewed by: ae ("This patch looks correct to me." [1])
Submitted by: Miles Ohlrich <miles.ohlrich@isilon.com>
Sponsored by: Dell EMC Isilon
ip_dooptions(), icmp6_redirect_input(), in6_lltable_rtcheck(),
in6p_lookup_mcast_ifp() and in6_selecthlim() use new routing api.
Eliminate now-unused ip_rtaddr().
Fix lookup key fib6_lookup_nh_basic() which was lost diring merge.
Make fib6_lookup_nh_basic() and fib6_lookup_nh_extended() always
return IPv6 destination address with embedded scope. Currently
rw_gateway has it scope embedded, do the same for non-gatewayed
destinations.
Sponsored by: Yandex LLC
Analogously to r291040, in6_mc_get recurses on if_addr_lock if the
M_NOWAIT allocation fails. The fix is the same.
Suggested by: Andrey V. Elsukov
Reviewed by: jhb (ip4 version)
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D4138 (ip4 version)
interface index is specified as zero, the system should select the
interface to use for outgoing multicast packets. Even the comment
for the in6p_set_multicast_if() function says about index of zero.
But in fact for zero index the function just returns EADDRNOTAVAIL.
I.e. if you first set some interface and then will try reset it
with zero ifindex, you will get EADDRNOTAVAIL.
Reset im6o_multicast_ifp to NULL when interface index specified as
zero. Also return EINVAL in case when ifnet_byindex() returns NULL.
This will be the same behaviour as when ifindex is bigger than
V_if_index. And return EADDRNOTAVAIL only when interface is not
multicast capable.
Reported by: Olivier Cochard-Labbé
MFC after: 2 weeks
Sponsored by: Yandex LLC
When multicast capable interface goes away, it leaves multicast groups,
this leads to generate MLD reports, but MLD code does deffered send and
MLD reports are queued in the in6_multi's in6m_scq ifq. The problem is
that in6_multi structures are freed when interface leaves multicast groups
and thread that does deffered send will not take these queued packets.
PR: 194577
MFC after: 1 week
Sponsored by: Yandex LLC
These changes prevent sysctl(8) from returning proper output,
such as:
1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.
Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
Extend the so far IPv4-only support for multiple routing tables (FIBs)
introduced in r178888 to IPv6 providing feature parity.
This includes an extended rtalloc(9) KPI for IPv6, the necessary
adjustments to the network stack, and user land support as in netstat.
Sponsored by: Cisco Systems, Inc.
Reviewed by: melifaro (basically)
MFC after: 10 days
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
Don't allow joins w/o source on an existing group.
This is almost always pilot error.
We don't need to check for group filter UNDEFINED state at t1,
because we only ever allocate filters with their groups, so we
unconditionally reject such calls with EINVAL.
Trying to change the active filter mode w/o going through IPV6_MSFILTER
is also disallowed.
MFC after: 1 day
Tighten input checking in in6p_join_group():
* Don't try to use the source address, when its family is unspecified.
* If we get a join without a source, on an existing inclusive
mode group, this is an error, as it would change the filter mode.
Fix a problem with the handling of in6_mfilter for new memberships:
* Do not rely on im6f being NULL; it is explicitly initialized to a
non-NULL pointer when constructing a membership.
* Explicitly initialize *im6f to EX mode when the source address
is unspecified.
This fixes a problem with in_mfilter slot recycling in the join path.
MFC after: 1 day