Commit Graph

613 Commits

Author SHA1 Message Date
Kristof Provost
7f883a9b5b net: Revert vnet/epair cleanup race mitigation
Revert the mitigation code for the vnet/epair cleanup race (done in r365457).
r368237 introduced a more reliable fix.

MFC after:	2 weeks
Sponsored by:	Modirum MDPay
2020-12-01 16:34:43 +00:00
Kristof Provost
e133271fc1 if: Fix panic when destroying vnet and epair simultaneously
When destroying a vnet and an epair (with one end in the vnet) we often
panicked. This was the result of the destruction of the epair, which destroys
both ends simultaneously, happening while vnet_if_return() was moving the
struct ifnet to its home vnet. This can result in a freed ifnet being re-added
to the home vnet V_ifnet list. That in turn panics the next time the ifnet is
used.

Prevent this race by ensuring that vnet_if_return() cannot run at the same time
as if_detach() or epair_clone_destroy().

PR:		238870, 234985, 244703, 250870
MFC after:	2 weeks
Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D27378
2020-12-01 16:23:59 +00:00
Konstantin Belousov
cd85379104 Make MAXPHYS tunable. Bump MAXPHYS to 1M.
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.

Make b_pages[] array in struct buf flexible.  Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allow several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*).  Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.

Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except a place which initialize maxphys.  Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight.  Some drivers, which use MAXPHYS to size embeded structures,
get private MAXPHYS-like constant; their convertion is out of scope
for this work.

Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, where either submitted by, or based on changes by mav.

Suggested by: mav (*)
Reviewed by:	imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27225
2020-11-28 12:12:51 +00:00
Kristof Provost
bca0e1d2ac if: Fix non-VIMAGE build
if_link_ifnet() and if_unlink_ifnet() are needed even when VIMAGE is not
enabled.

MFC after:	2 weeks
Sponsored by:	Modirum MDPay
2020-11-25 17:15:24 +00:00
Kristof Provost
a779388f8b if: Protect V_ifnet in vnet_if_return()
When we terminate a vnet (i.e. jail) we move interfaces back to their home
vnet. We need to protect our access to the V_ifnet CK_LIST.

We could enter NET_EPOCH, but if_detach_internal() (called from if_vmove())
waits for net epoch callback completion. That's not possible from NET_EPOCH.
Instead, we take the IFNET_WLOCK, build a list of the interfaces that need to
move and, once we've released the lock, move them back to their home vnet.

We cannot hold the IFNET_WLOCK() during if_vmove(), because that results in a
LOR between ifnet_sx, in_multi_sx and iflib ctx lock.

Separate out moving the ifp into or out of V_ifnet, so we can hold the lock as
we do the list manipulation, but do not hold it as we if_vmove().

Reviewed by:	melifaro
MFC after:	2 weeks
Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D27279
2020-11-25 15:07:22 +00:00
Kristof Provost
a60100fdfc if: Remove ifnet_rwlock
It no longer serves any purpose, as evidenced by the fact that we never take it
without ifnet_sxlock.

Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D27278
2020-11-25 10:56:38 +00:00
Alexander V. Chernikov
bad6b23606 Move all ifaddr route creation business logic to net/route/route_ifaddr.c
Differential Revision:	https://reviews.freebsd.org/D26318
2020-11-08 11:12:00 +00:00
Ed Maste
c1aedfcbd9 add SIOCGIFDATA ioctl
For interfaces that do not support SIOCGIFMEDIA (for which there are
quite a few) the only fallback is to query the interface for
if_data->ifi_link_state.  While it's possible to get at if_data for an
interface via getifaddrs(3) or sysctl, both are heavy weight mechanisms.

SIOCGIFDATA is a simple ioctl to retrieve this fast with very little
resource use in comparison.  This implementation mirrors that of other
similar ioctls in FreeBSD.

Submitted by:	Roy Marples <roy@marples.name>
Reviewed by:	markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D26538
2020-09-28 16:54:39 +00:00
Kristof Provost
a969635b83 net: mitigate vnet / epair cleanup races
There's a race where dying vnets move their interfaces back to their original
vnet, and if_epair cleanup (where deleting one interface also deletes the other
end of the epair). This is commonly triggered by the pf tests, but also by
cleanup of vnet jails.

As we've not yet been able to fix the root cause of the issue work around the
panic by not dereferencing a NULL softc in epair_qflush() and by not
re-attaching DYING interfaces.

This isn't a full fix, but makes a very common panic far less likely.

PR:		244703, 238870
Reviewed by:	lutz_donnerhacke.de
MFC after:	4 days
Differential Revision:	https://reviews.freebsd.org/D26324
2020-09-08 14:54:10 +00:00
Mateusz Guzik
662c13053f net: clean up empty lines in .c and .h files 2020-09-01 21:19:14 +00:00
Mark Johnston
96ad26eefb Remove free_domain() and uma_zfree_domain().
These functions were introduced before UMA started ensuring that freed
memory gets placed in domain-local caches.  They no longer serve any
purpose since UMA now provides their functionality by default.  Remove
them to simplyify the kernel memory allocator interfaces a bit.

Reviewed by:	cem, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25937
2020-08-04 13:58:36 +00:00
Alexander V. Chernikov
e1c05fd290 Transition from rtrequest1_fib() to rib_action().
Remove all variations of rtrequest <rtrequest1_fib, rtrequest_fib,
 in6_rtrequest, rtrequest_fib> and their uses and switch to
 to rib_action(). This is part of the new routing KPI.

Submitted by: Neel Chauhan <neel AT neelc DOT org>
Differential Revision: https://reviews.freebsd.org/D25546
2020-07-21 19:56:13 +00:00
Alexander V. Chernikov
725871230d Temporarly revert r363319 to unbreak the build.
Reported by:	CI
Pointy hat to: melifaro
2020-07-19 10:53:15 +00:00
Alexander V. Chernikov
8cee15d9e4 Transition from rtrequest1_fib() to rib_action().
Remove all variations of rtrequest <rtrequest1_fib, rtrequest_fib,
 in6_rtrequest, rtrequest_fib> and their uses and switch to
to rib_action(). This is part of the new routing KPI.

Submitted by:	Neel Chauhan <neel AT neelc DOT org>
Differential Revision:	https://reviews.freebsd.org/D25546
2020-07-19 09:29:27 +00:00
Alexander V. Chernikov
2bbab0af6d Use epoch(9) for rtentries to simplify control plane operations.
Currently the only reason of refcounting rtentries is the need to report
 the rtable operation details immediately after the execution.
Delaying rtentry reclamation allows to stop refcounting and simplify the code.
Additionally, this change allows to reimplement rib_lookup_info(), which
 is used by some of the customers to get the matching prefix along
 with nexthops, in more efficient way.

The change keeps per-vnet rtzone uma zone. It adds nh_vnet field to
 nhop_priv to be able to reliably set curvnet even during vnet teardown.
Rest of the reference counting code will be removed in the D24867 .

Differential Revision:	https://reviews.freebsd.org/D24866
2020-05-23 10:21:02 +00:00
Brooks Davis
8ad798ae9a Expose ifr_buffer_get_(buffer|length) outside if.c.
This is a preparatory commit for D23933.

Reviewed by:	jhb
2020-03-03 18:05:11 +00:00
Pawel Biernacki
7029da5c36 Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE.  All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by:	kib (mentor, blanket)
Commented by:	kib, gallatin, melifaro
Differential Revision:	https://reviews.freebsd.org/D23718
2020-02-26 14:26:36 +00:00
Gleb Smirnoff
e87c494015 Although most of the NIC drivers are epoch ready, due to peer pressure
switch over to opt-in instead of opt-out for epoch.

Instead of IFF_NEEDSEPOCH, provide IFF_KNOWSEPOCH. If driver marks
itself with IFF_KNOWSEPOCH, then ether_input() would not enter epoch
when processing its packets.

Now this will create recursive entrance in epoch in >90% network
drivers, but will guarantee safeness of the transition.

Mark several tested drivers as IFF_KNOWSEPOCH.

Reviewed by:		hselasky, jeff, bz, gallatin
Differential Revision:	https://reviews.freebsd.org/D23674
2020-02-24 21:07:30 +00:00
Bjoern A. Zeeb
10108cb673 Partially revert VNET change and expand VNET structure.
Revert parts of r353274 replacing vnet_state with a shutdown flag.

Not having the state flag for the current SI_SUB_* makes it harder to debug
kernel or module panics related to VNET bringup or teardown.
Not having the state also does not allow us to check for other dependency
levels between components, e.g. for moving interfaces.

Expand the VNET structure with the new boolean flag indicating that we are
doing a shutdown of a given vnet and update the vnet magic cookie for the
change.

Update libkvm to compile with a bool in the kernel struct.

Bump __FreeBSD_version for (external) module builds to more easily detect
the change.

Reviewed by:	hselasky
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D23097
2020-02-17 11:08:50 +00:00
Jeff Roberson
cd0be8b2ed Temporarily force IFF_NEEDSEPOCH until drivers have been resolved.
Recent network epoch changes have left some drivers unexpectedly broken
and there is not yet a consensus on the correct fix.  This is patch is
a minor performance impact until we can agree on the correct path
forward.

Reviewed by:	core, network, imp, glebius, hselasky
Differential Revision:	https://reviews.freebsd.org/D23515
2020-02-06 20:47:50 +00:00
Eugene Grosbein
2888eb4091 ifa_maintain_loopback_route: adjust debugging output
Correction after r333476:

- write this as LOG_DEBUG again instead of LOG_INFO;
- get back function name into the message;
- error may be ESRCH if an address is removed in process (by carp f.e.),
not only ENOENT;
- expression complexity grows, so try making it more readable.

MFC after:	1 week
2020-01-18 04:48:05 +00:00
Gleb Smirnoff
2a4bd982d0 Introduce NET_EPOCH_CALL() macro and use it everywhere where we free
data based on the network epoch.   The macro reverses the argument
order of epoch_call(9) - first function, then its argument. NFC
2020-01-15 06:05:20 +00:00
Gleb Smirnoff
97168be809 Mechanically substitute assertion of in_epoch(net_epoch_preempt) to
NET_EPOCH_ASSERT(). NFC
2020-01-15 05:45:27 +00:00
Gleb Smirnoff
3264dcadc9 - Move global network epoch definition to epoch.h, as more different
subsystems tend to need to know about it, and including if_var.h is
  huge header pollution for them.  Polluting possible non-network
  users with single symbol seems much lesser evil.
- Remove non-preemptible network epoch.  Not used yet, and unlikely
  to get used in close future.
2020-01-15 03:34:21 +00:00
Kyle Evans
c7bab2a7ca if_vmove: return proper error status
if_vmove can fail if it lost a race and the vnet's already been moved. The
callers (and their callers) can generally cope with this, but right now
success is assumed. Plumb out the ENOENT from if_detach_internal if it
happens so that the error's properly reported to userland.

Reviewed by:	bz, kp
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D22780
2020-01-09 03:52:50 +00:00
Alexander V. Chernikov
5fcb2832e3 Plug loopback idaddr refcount leak.
Reviewed by:	markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D22980
2020-01-02 09:08:45 +00:00
Mark Johnston
3f197b134c Deduplicate code between if_delgroup() and if_delgroups().
Fix some style in if_addgroup().  No functional change intended.

Reviewed by:	hselasky
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22892
2019-12-20 20:15:34 +00:00
Mark Johnston
718ef55ec7 Fix a memory leak in if_delgroups() introduced in r334118.
PR:		242712
Submitted by:	ghuckriede@blackberry.com
MFC after:	3 days
2019-12-20 17:21:57 +00:00
Bjoern A. Zeeb
3232273f42 Allow kernel to compile without BPF.
r297816 added some bpf magic for VIMAGE unconditionally which no longer
allows kernels to compile without bpf (but with other networking).
Add the missing ifdef checks and allow a kernel to compile without bpf
again.

PR:		242136
Reported by:	dave mischler.com
MFC after:	2 weeks
2019-11-24 23:21:47 +00:00
Conrad Meyer
7993a104a1 Add explicit SI_SUB_EPOCH
Add explicit SI_SUB_EPOCH, after SI_SUB_TASKQ and before SI_SUB_SMP
(EARLY_AP_STARTUP).  Rename existing "SI_SUB_TASKQ + 1" to SI_SUB_EPOCH.

epoch(9) consumers cannot epoch_alloc() before SI_SUB_EPOCH:SI_ORDER_SECOND,
but likely should allocate before SI_SUB_SMP.  Prior to this change,
consumers (well, epoch itself, and net/if.c) just open-coded the
SI_SUB_TASKQ + 1 order to match epoch.c, but this was fragile.

Reviewed by:	mmacy
Differential Revision:	https://reviews.freebsd.org/D22503
2019-11-22 23:23:40 +00:00
Gleb Smirnoff
9352fab6ab In if_siocaddmulti() enter VNET.
Reported & tested by:	garga
2019-11-13 16:28:53 +00:00
Gleb Smirnoff
0839aa5c04 There is a long standing problem with multicast programming for NICs
and IPv6.  With IPv6 we may call if_addmulti() in context of processing
of an incoming packet.  Usually this is interrupt context.  While most
of the NIC drivers are able to reprogram multicast filters without
sleeping, some of them can't.  An example is e1000 family of drivers.
With iflib conversion the problem was somewhat hidden.  Iflib processes
packets in private taskqueue, so going to sleep doesn't trigger an
assertion.  However, the sleep would block operation of the driver and
following incoming packets would fill the ring and eventually would
start being dropped.  Enabling epoch for the full time of a packet
processing again started to trigger assertions for e1000.

Fix this problem once and for all using a general taskqueue to call
if_ioctl() method in all cases when if_addmulti() is called in a
non sleeping context.  Note that nobody cares about returned value.

Reviewed by:	hselasky, kib
Differential Revision:	  https://reviews.freebsd.org/D22154
2019-10-29 17:36:06 +00:00
Gleb Smirnoff
19e09f447f Remove obsoleted KPIs that were used to access interface address lists. 2019-10-21 18:17:03 +00:00
Conrad Meyer
7790c8c199 Split out a more generic debugnet(4) from netdump(4)
Debugnet is a simplistic and specialized panic- or debug-time reliable
datagram transport.  It can drive a single connection at a time and is
currently unidirectional (debug/panic machine transmit to remote server
only).

It is mostly a verbatim code lift from netdump(4).  Netdump(4) remains
the only consumer (until the rest of this patch series lands).

The INET-specific logic has been extracted somewhat more thoroughly than
previously in netdump(4), into debugnet_inet.c.  UDP-layer logic and up, as
much as possible as is protocol-independent, remains in debugnet.c.  The
separation is not perfect and future improvement is welcome.  Supporting
INET6 is a long-term goal.

Much of the diff is "gratuitous" renaming from 'netdump_' or 'nd_' to
'debugnet_' or 'dn_' -- sorry.  I thought keeping the netdump name on the
generic module would be more confusing than the refactoring.

The only functional change here is the mbuf allocation / tracking.  Instead
of initiating solely on netdump-configured interface(s) at dumpon(8)
configuration time, we watch for any debugnet-enabled NIC for link
activation and query it for mbuf parameters at that time.  If they exceed
the existing high-water mark allocation, we re-allocate and track the new
high-water mark.  Otherwise, we leave the pre-panic mbuf allocation alone.
In a future patch in this series, this will allow initiating netdump from
panic ddb(4) without pre-panic configuration.

No other functional change intended.

Reviewed by:	markj (earlier version)
Some discussion with:	emaste, jhb
Objection from:	marius
Differential Revision:	https://reviews.freebsd.org/D21421
2019-10-17 16:23:03 +00:00
Gleb Smirnoff
b46d70fd88 do_link_state_change() is executed in taskqueue context and in
general is allowed to sleep.  Don't enter the epoch for the
whole duration.  If some event handlers need the epoch, they
should handle that theirselves.

Discussed with:	hselasky
2019-10-16 16:32:58 +00:00
Hans Petter Selasky
270b83b9d1 The two functions ifnet_byindex() and ifnet_byindex_locked() are exactly the
same after the network stack was epochified. Merge the two into one function
and cleanup all uses of ifnet_byindex_locked().

While at it:
- Add branch prediction macros.
- Make sure the ifnet pointer is only deferred once,
  also when code optimisation is disabled.

Sponsored by:	Mellanox Technologies
2019-10-15 12:08:09 +00:00
Hans Petter Selasky
93cfeb0ed9 Exclude the network link eventhandler from epochification after r353292.
This fixes the following assert when "options RATELIMIT" is used:
panic()
malloc()
sysctl_add_oid()
tcp_rl_ifnet_link()
do_link_state_change()
taskqueue_run_locked()

Sponsored by:	Mellanox Technologies
2019-10-15 11:20:16 +00:00
Gleb Smirnoff
416a1d1e70 if_delmulti() is never called without ifp argument, assert this instead
of doing a useless search through interfaces.
2019-10-14 21:18:37 +00:00
Gleb Smirnoff
fb3fc771f6 Add two extra functions that basically give count of addresses
on interface.  Such function could been implemented on top of
the if_foreach_llm?addr(), but several drivers need counting,
so avoid copy-n-paste inside the drivers.
2019-10-10 23:44:56 +00:00
Gleb Smirnoff
826857c833 Provide new KPI for network drivers to access lists of interface
addresses.  The KPI doesn't reveal neither how addresses are stored,
how the access to them is synchronized, neither reveal struct ifaddr
and struct ifmaddr.

Reviewed by:	gallatin, erj, hselasky, philip, stevek
Differential Revision:	https://reviews.freebsd.org/D21943
2019-10-10 23:42:55 +00:00
Gleb Smirnoff
1e80e4f26c Remove epoch assertion from if_setlladdr(). Originally this function was
protected by IF_ADDR_LOCK(), which was a mutex, so that two simultaneous
if_setlladdr() can't execute. Later it was switched to IF_ADDR_RLOCK(),
likely by a mistake. Later it was switched to NET_EPOCH_ENTER(). Then I
incorrectly added NET_EPOCH_ASSERT() here.

In reality ifp->if_addr never goes away and never changes its length. So,
doing bcopy() in it is always "safe", meaning it won't dereference a wrong
pointer or write into someone's else memory. Of course doing two bcopy() in
parallel would result in a mess of two addresses, but net epoch doesn't
protect against that, neither IF_ADDR_RLOCK() did.

So for now, just remove the assertion and leave for later a proper fix.

Reported by:	markj
2019-10-08 17:55:45 +00:00
Gleb Smirnoff
e9dc46cc30 In DIAGNOSTIC block of if_delmulti_ifma_flags() enter the network epoch.
This quickly plugs the regression from r353292. The locking of multicast
definitely needs a broader review today...

Reported by:	pho, dhw
2019-10-08 16:45:56 +00:00
Gleb Smirnoff
b8a6e03fac Widen NET_EPOCH coverage.
When epoch(9) was introduced to network stack, it was basically
dropped in place of existing locking, which was mutexes and
rwlocks. For the sake of performance mutex covered areas were
as small as possible, so became epoch covered areas.

However, epoch doesn't introduce any contention, it just delays
memory reclaim. So, there is no point to minimise epoch covered
areas in sense of performance. Meanwhile entering/exiting epoch
also has non-zero CPU usage, so doing this less often is a win.

Not the least is also code maintainability. In the new paradigm
we can assume that at any stage of processing a packet, we are
inside network epoch. This makes coding both input and output
path way easier.

On output path we already enter epoch quite early - in the
ip_output(), in the ip6_output().

This patch does the same for the input path. All ISR processing,
network related callouts, other ways of packet injection to the
network stack shall be performed in net_epoch. Any leaf function
that walks network configuration now asserts epoch.

Tricky part is configuration code paths - ioctls, sysctls. They
also call into leaf functions, so some need to be changed.

This patch would introduce more epoch recursions (see EPOCH_TRACE)
than we had before. They will be cleaned up separately, as several
of them aren't trivial. Note, that unlike a lock recursion the
epoch recursion is safe and just wastes a bit of resources.

Reviewed by:	gallatin, hselasky, cy, adrian, kristof
Differential Revision:	https://reviews.freebsd.org/D19111
2019-10-07 22:40:05 +00:00
Hans Petter Selasky
204e2f30d9 Factor out VNET shutdown check into an own vnet structure field.
Remove the now obsolete vnet_state field. This greatly simplifies the
detection of VNET shutdown and avoids code duplication.

Discussed with:	bz@
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2019-10-07 14:15:41 +00:00
Gleb Smirnoff
dd902d015a Add debugging facility EPOCH_TRACE that checks that epochs entered are
properly nested and warns about recursive entrances.  Unlike with locks,
there is nothing fundamentally wrong with such use, the intent of tracer
is to help to review complex epoch-protected code paths, and we mean the
network stack here.

Reviewed by:	hselasky
Sponsored by:	Netflix
Pull Request:	https://reviews.freebsd.org/D21610
2019-09-25 18:26:31 +00:00
Konstantin Belousov
247cf5664e Add SIOCGIFDOWNREASON.
The ioctl(2) is intended to provide more details about the cause of
the down for the link.

Eventually we might define a comprehensive list of codes for the
situations.  But interface also allows the driver to provide free-form
null-terminated ASCII string to provide arbitrary non-formalized
information.  Sample implementation exists for mlx5(4), where the
string is fetched from firmware controlling the port.

Reviewed by:	hselasky, rrs
Sponsored by:	Mellanox Technologies
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D21527
2019-09-17 18:49:13 +00:00
Kyle Evans
40b1c921bd SIOCSIFNAME: Do nothing if we're not actually changing
Instead of throwing EEXIST, just succeed if the name isn't actually
changing. We don't need to trigger departure or any of that because there's
no change from consumers' perspective.

PR:		240539
Reviewed by:	brooks
MFC after:	5 days
Differential Revision:	https://reviews.freebsd.org/D21618
2019-09-12 15:36:48 +00:00
Hans Petter Selasky
0dbdf04125 Need to wait for epoch callbacks to complete before detaching a
network interface.

This particularly manifests itself when an INP has multicast options
attached during a network interface detach. Then the IPv4 and IPv6
leave group call which results from freeing the multicast address, may
access a freed ifnet structure. These are the steps to reproduce:

service mdnsd onestart # installed from ports

ifconfig epair create
ifconfig epair0a 0/24 up
ifconfig epair0a destroy

Tested by:	pho @
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2019-06-28 10:49:04 +00:00
John Baldwin
fb3bc59600 Restructure mbuf send tags to provide stronger guarantees.
- Perform ifp mismatch checks (to determine if a send tag is allocated
  for a different ifp than the one the packet is being output on), in
  ip_output() and ip6_output().  This avoids sending packets with send
  tags to ifnet drivers that don't support send tags.

  Since we are now checking for ifp mismatches before invoking
  if_output, we can now try to allocate a new tag before invoking
  if_output sending the original packet on the new tag if allocation
  succeeds.

  To avoid code duplication for the fragment and unfragmented cases,
  add ip_output_send() and ip6_output_send() as wrappers around
  if_output and nd6_output_ifp, respectively.  All of the logic for
  setting send tags and dealing with send tag-related errors is done
  in these wrapper functions.

  For pseudo interfaces that wrap other network interfaces (vlan and
  lagg), wrapper send tags are now allocated so that ip*_output see
  the wrapper ifp as the ifp in the send tag.  The if_transmit
  routines rewrite the send tags after performing an ifp mismatch
  check.  If an ifp mismatch is detected, the transmit routines fail
  with EAGAIN.

- To provide clearer life cycle management of send tags, especially
  in the presence of vlan and lagg wrapper tags, add a reference count
  to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
  Provide a helper function (m_snd_tag_init()) for use by drivers
  supporting send tags.  m_snd_tag_init() takes care of the if_ref
  on the ifp meaning that code alloating send tags via if_snd_tag_alloc
  no longer has to manage that manually.  Similarly, m_snd_tag_rele
  drops the refcount on the ifp after invoking if_snd_tag_free when
  the last reference to a send tag is dropped.

  This also closes use after free races if there are pending packets in
  driver tx rings after the socket is closed (e.g. from tcpdrop).

  In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
  csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
  Drivers now also check this flag instead of checking snd_tag against
  NULL.  This avoids false positive matches when a forwarded packet
  has a non-NULL rcvif that was treated as a send tag.

- cxgbe was relying on snd_tag_free being called when the inp was
  detached so that it could kick the firmware to flush any pending
  work on the flow.  This is because the driver doesn't require ACK
  messages from the firmware for every request, but instead does a
  kind of manual interrupt coalescing by only setting a flag to
  request a completion on a subset of requests.  If all of the
  in-flight requests don't have the flag when the tag is detached from
  the inp, the flow might never return the credits.  The current
  snd_tag_free command issues a flush command to force the credits to
  return.  However, the credit return is what also frees the mbufs,
  and since those mbufs now hold references on the tag, this meant
  that snd_tag_free would never be called.

  To fix, explicitly drop the mbuf's reference on the snd tag when the
  mbuf is queued in the firmware work queue.  This means that once the
  inp's reference on the tag goes away and all in-flight mbufs have
  been queued to the firmware, tag's refcount will drop to zero and
  snd_tag_free will kick in and send the flush request.  Note that we
  need to avoid doing this in the middle of ethofld_tx(), so the
  driver grabs a temporary reference on the tag around that loop to
  defer the free to the end of the function in case it sends the last
  mbuf to the queue after the inp has dropped its reference on the
  tag.

- mlx5 preallocates send tags and was using the ifp pointer even when
  the send tag wasn't in use.  Explicitly use the ifp from other data
  structures instead.

- Sprinkle some assertions in various places to assert that received
  packets don't have a send tag, and that other places that overwrite
  rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.

Reviewed by:	gallatin, hselasky, rgrimes, ae
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20117
2019-05-24 22:30:40 +00:00
Conrad Meyer
e2e050c8ef Extract eventfilter declarations to sys/_eventfilter.h
This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.

EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).

As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions.  The remainder of the patch addresses
adding appropriate includes to fix those files.

LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).

No functional change (intended).  Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed.  __FreeBSD_version has been bumped.
2019-05-20 00:38:23 +00:00