Commit Graph

187 Commits

Author SHA1 Message Date
Mark Johnston
3616095801 Fix style issues around existing SDT probes.
- Use SDT_PROBE<N>() instead of SDT_PROBE(). This has no functional effect
  at the moment, but will be needed for some future changes.
- Don't hardcode the module component of the probe identifier. This is
  set automatically by the SDT framework.

MFC after:	1 week
2015-12-16 23:39:27 +00:00
Randall Stewart
18b4fd62e0 Add new async_drain to the callout system. This is so-far not used but
should be used by TCP for sure in its cleanup of the IN-PCB (will be coming shortly).

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D4076
2015-11-10 14:49:32 +00:00
Andriy Gapon
2f2f522b5d save some bytes by using more concise SDT_PROBE<n> instead of SDT_PROBE
SDT_PROBE requires 5 parameters whereas SDT_PROBE<n> requires n parameters
where n is typically smaller than 5.

Perhaps SDT_PROBE should be made a private implementation detail.

MFC after:	20 days
2015-09-28 12:14:16 +00:00
Hans Petter Selasky
c55f4c9445 Revert r287780 until more developers have their say.
Differential Revision:	https://reviews.freebsd.org/D3521
Requested by:		gnn
2015-09-22 06:51:55 +00:00
Hans Petter Selasky
9acc0eafd7 Implement callout_drain_async(), inspired by the projects/hps_head
branch.

This function is used to drain a callout via a callback instead of
blocking the caller until the drain is complete. Refer to the
callout_drain_async() manual page for a detailed description.

Limitation: If a lock is used with the callout, the callout can only
be drained asynchronously one time unless the callout_init_mtx()
function is called again. This limitation is not present in
projects/hps_head and will require more invasive changes to the
timeout code, which was not in the scope of this patch.

Differential Revision:	https://reviews.freebsd.org/D3521
Reviewed by:		wblock
MFC after:		1 month
2015-09-14 10:52:26 +00:00
Andriy Gapon
378d5c6c89 callout_reset: fix a reversed check for cc_exec_cancel
The typo was introduced in r278469 / 344ecf88af.

As a result of the bug there was a timing window where callout_reset()
would fail to cancel a concurrent execution of a callout that is about
to start and would schedule the callout again.
The callout would fire more times than it is scheduled.
That would happen even if the callout is initialized with a lock.

For example, the bug triggered the "Stray timeout" assertion in
taskqueue_timeout_func().

MFC after:	5 days
2015-09-01 09:27:14 +00:00
Julien Charbon
2ea3089cb1 Revert r286880: If at first this change made sense, it turns out
it helps only the TCP timers callout(9) usage.  As the benefit for
others callout(9) usages did not reach a consensus the historical
usage should prevail.

Differential Revision:      https://reviews.freebsd.org/D3078
2015-08-30 13:44:46 +00:00
Julien Charbon
cd252ea74d Silent a compilation warning on callout_stop() 2015-08-27 10:43:35 +00:00
Julien Charbon
682d0e15b5 In callout_stop(), do not forget to initialize not_running variable.
Thanks to hselasky for noticing that.

Differential Revision:	https://reviews.freebsd.org/D3078 (Updated)
Submitted by:		hselasky
Pointy hat to:		jch
2015-08-27 08:58:03 +00:00
Julien Charbon
0cfae4b4bc In callout_stop(), if a callout is both pending and currently
being serviced return 0 (fail) but it is applicable only
mpsafe callouts.  Thanks to hselasky for finding this.

Differential Revision:	https://reviews.freebsd.org/D3078 (Updated)
Submitted by:		hselasky
Reviewed by:		jch
2015-08-27 08:15:32 +00:00
Julien Charbon
a1e6f8ff27 callout_stop() should return 0 (fail) when the callout is currently
being serviced and indeed unstoppable.

A scenario to reproduce this case is:

- the callout is being serviced and at same time,
- callout_reset() is called on this callout that sets
  the CALLOUT_PENDING flag and at same time,
- callout_stop() is called on this callout and returns 1 (success)
  even if the callout is indeed currently running and unstoppable.

This issue was caught up while making r284245 (D2763) workaround, and
was discussed at BSDCan 2015.  Once applied the r284245 workaround
is not needed anymore and will be reverted.

Differential Revision:	https://reviews.freebsd.org/D3078
Reviewed by:		jhb
Sponsored by:		Verisign, Inc.
2015-08-18 10:15:09 +00:00
Randall Stewart
b132edb56f Fix my stupid restoral of old code.. must be c_iflags now.
Thanks jhb for catching my stupidity...
MFC after:	3 days
2015-04-14 00:02:39 +00:00
Randall Stewart
07a2df5d83 Restore the two lines accidentally deleted that allow CALLOUT_DIRECT to be
specifed in the flags.

Thanks Mark Johnston for noticing this ;-o

MFC after:	3 days
2015-04-13 23:06:13 +00:00
Randall Stewart
403df7a672 Adopt jhb's suggested changes, updated comments and callout_migration() moving
to kern/kern_timeout.c

This does *not* address his -1 -> NOCPU comment.

Sponsored by:	Netflix Inc.
2015-03-31 00:18:00 +00:00
Bjoern A. Zeeb
a04d412295 Try to unbreak !SMP kernels broken in r280785 by using the proper macros
to access cc_cpu.
2015-03-28 15:07:19 +00:00
Randall Stewart
15b1eb142c Change the callout to supply -1 to indicate we are not changing
CPU, also add protection against invalid CPU's as well as
split c_flags and c_iflags so that if a user plays with the active
flag (the one expected to be played with by callers in MPSAFE) without
a lock, it won't adversely affect the callout system by causing a corrupt
list. This also means that all callers need to use the macros and *not*
play with the falgs directly (like netgraph used to).

Differential Revision: htts://reviews.freebsd.org/D1894
Reviewed by: .. timed out but looked at by jhb, imp, adrian hselasky
             tested by hiren and netflix.
Sponsored by:	Netflix Inc.
2015-03-28 12:50:24 +00:00
Randall Stewart
66525b2d16 This fixes a bug I in-advertantly inserted when I updated the callout
code in my last commit. The cc_exec_next is used to track the next
when a direct call is being made from callout. It is *never* used
in the in-direct method. When macro-izing I made it so that it
would separate out direct/vs/non-direct. This is incorrect and can
cause panics as Peter Holm has found for me (Thanks so much Peter for
all your help in this). What this change does is restore that behavior
but also get rid of the cc_next from the array and instead make it
be part of the base callout structure. This way no one else will get
confused since we will never use it for non-direct.

Reviewed by:	Peter Holm and more importantly tested by him ;-)
MFC after:	3 days.
Sponsored by:	Netflix Inc.
2015-02-12 13:31:08 +00:00
Randall Stewart
d2854fa488 This fixes two conditions that can incur when migration
is being done in the callout code and harmonizes the macro
use.:
1) The callout_active() will lie. Basically if a migration
   is occuring and the callout is about to expire and the
   migration has been deferred, the callout_active will no
   longer return true until after the migration. This confuses
   and breaks callers that are doing callout_init(&c, 1); such
   as TCP.
2) The migration code had a bug in it where when migrating, if
   a two calls to callout_reset came in and they both collided with
   the callout on the wheel about to run, then the second call to
   callout_reset would corrupt the list the callout wheel uses
   putting the callout thread into a endless loop.
3) Per imp, I have fixed all the macro occurance in the code that
   were for the most part being ignored.

Phabricator D1711 and looked at by lstewart and jhb and sbruno.
Reviewed by:	kostikbel, imp, adrian, hselasky
MFC after:	3 days
Sponsored by:	Netflix Inc.
2015-02-09 19:19:44 +00:00
Adrian Chadd
9500dd9f0b Call WITNESS_WARN() in callout_drain() to check whether any locks are
being held before sleeping.

This has bitten me (in ath(4)) once before and I'd like to see this
not bite anyone else.

Differential Revision:	D1638
Reviewed by:	jhb, hselasky
MFC after:	1 week
2015-01-26 04:04:57 +00:00
Hans Petter Selasky
a115fb62ed Revert for r277213:
FreeBSD developers need more time to review patches in the surrounding
areas like the TCP stack which are using MPSAFE callouts to restore
distribution of callouts on multiple CPUs.

Bump the __FreeBSD_version instead of reverting it.

Suggested by:		kmacy, adrian, glebius and kib
Differential Revision:	https://reviews.freebsd.org/D1438
2015-01-22 11:12:42 +00:00
Hans Petter Selasky
1a26c3c047 Major callout subsystem cleanup and rewrite:
- Close a migration race where callout_reset() failed to set the
  CALLOUT_ACTIVE flag.
- Callout callback functions are now allowed to be protected by
  spinlocks.
- Switching the callout CPU number cannot always be done on a
  per-callout basis. See the updated timeout(9) manual page for more
  information.
- The timeout(9) manual page has been updated to reflect how all the
  functions inside the callout API are working. The manual page has
  been made function oriented to make it easier to deduce how each of
  the functions making up the callout API are working without having
  to first read the whole manual page. Group all functions into a
  handful of sections which should give a quick top-level overview
  when the different functions should be used.
- The CALLOUT_SHAREDLOCK flag and its functionality has been removed
  to reduce the complexity in the callout code and to avoid problems
  about atomically stopping callouts via callout_stop(). If someone
  needs it, it can be re-added. From my quick grep there are no
  CALLOUT_SHAREDLOCK clients in the kernel.
- A new callout API function named "callout_drain_async()" has been
  added. See the updated timeout(9) manual page for a complete
  description.
- Update the callout clients in the "kern/" folder to use the callout
  API properly, like cv_timedwait(). Previously there was some custom
  sleepqueue code in the callout subsystem, which has been removed,
  because we now allow callouts to be protected by spinlocks. This
  allows us to tear down the callout like done with regular mutexes,
  and a "td_slpmutex" has been added to "struct thread" to atomically
  teardown the "td_slpcallout". Further the "TDF_TIMOFAIL" and
  "SWT_SLEEPQTIMO" states can now be completely removed. Currently
  they are marked as available and will be cleaned up in a follow up
  commit.
- Bump the __FreeBSD_version to indicate kernel modules need
  recompilation.
- There has been several reports that this patch "seems to squash a
  serious bug leading to a callout timeout and panic".

Kernel build testing:	all architectures were built
MFC after:		2 weeks
Differential Revision:	https://reviews.freebsd.org/D1438
Sponsored by:		Mellanox Technologies
Reviewed by:		jhb, adrian, sbruno and emaste
2015-01-15 15:32:30 +00:00
John Baldwin
232e8b52b0 Add schedgraph traces for callout handlers. Specifically, a callwheel logs
a running event each time it executes a callout function.  The event
includes the function pointer, argument, and whether or not it was run from
hardware interrupt context.  The callwheel is marked idle when each handler
completes.  This effectively logs the duration of each callout routine in
the graph.
2014-10-08 16:22:59 +00:00
Adrian Chadd
c445c3c7f6 If we're doing RSS then ensure that the callwheel swi's are CPU pinned. 2014-06-30 04:25:51 +00:00
Hans Petter Selasky
af3b2549c4 Pull in r267961 and r267973 again. Fix for issues reported will follow. 2014-06-28 03:56:17 +00:00
Glen Barber
37a107a407 Revert r267961, r267973:
These changes prevent sysctl(8) from returning proper output,
such as:

 1) no output from sysctl(8)
 2) erroneously returning ENOMEM with tools like truss(1)
    or uname(1)
 truss: can not get etype: Cannot allocate memory
2014-06-27 22:05:21 +00:00
Hans Petter Selasky
3da1cf1e88 Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
Davide Italiano
e392e44c27 Convert functions to the new-style format.
Submitted by:	Vijay Singh <vijju.singh@gmail.com> via -hackers
2014-06-05 03:46:46 +00:00
Adrian Chadd
ac75ee9fa3 Add in support to optionally pin the swi threads.
Under enough load, the swi's can actually be preempted and migrated
to other currently free cores.  When doing RSS experiments, this lead
to the per-CPU TCP timers not lining up any more with the RX CPU said
flows were ending up on, leading to increased lock contention.

Since there was a little pushback on flipping them on by default,
I've left the default at "don't pin."

The other less obvious problem here is that the default swi
is also the same as the destination swi for CPU #0.  So if one
pins the swi on CPU #0, there's no default floating swi.

A nice future project would be to create a separate swi for
the "default" floating swi, as well as per-CPU swis that are
(optionally) pinned.

Tested:

* parallel TCP tests (2 x 1g unfortunately for now);
  CPU: Intel(R) Xeon(R) CPU E5-2650

Note:

This is based on some initial investigation into RSS/TCP stack lock
contention on FreeBSD-HEAD whilst at Netflix in January 2014.
2014-05-10 00:53:36 +00:00
Davide Italiano
4bc38a5ab0 Hide internal details of sbintime_t implementation wrapping INT64_MAX into
SBT_MAX, to make it more robust in case internal type representation will
change in the future. All the consumers were migrated to SBT_MAX and
every new consumer (if any) should from now use this interface.

Requested by:	bapt, jmg, Ryan Lortie (implictly)
Reviewed by:	mav, bde
2014-04-12 23:29:29 +00:00
Adrian Chadd
f44e2a4c0f Include the CPU id in the per-CPU timer swi thread descriptions.
Original patch by:	jhb
2014-02-14 23:19:51 +00:00
Andriy Gapon
d9fae5ab88 dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE
In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).

Reviewed by:	markj
MFC after:	3 weeks
2013-11-26 08:46:27 +00:00
Attilio Rao
54366c0bd7 - For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
  calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
  version of the releasing functions for mutex, rwlock and sxlock.
  Failing to do so skips the lockstat_probe_func invokation for
  unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
  kernel compiled without lock debugging options, potentially every
  consumer must be compiled including opt_kdtrace.h.
  Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
  dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
  is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested.  As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while.  Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by:	EMC / Isilon storage division
Discussed with:	rstone
[0] Reported by:	rstone
[1] Discussed with:	philip
2013-11-25 07:38:45 +00:00
Davide Italiano
1b0c144fc2 Make the callout arithmetic more robust adding checks for overflow.
Without these, if the timeout value passed is "large enough", the
value of the sum of it and other factors (e.g. current time as
returned by sbinuptime() or 'precision' argument) might result in a
negative number. This negative number is then passed to
eventtimers(4), which causes et_start() routine to load et_min_period
into eventtimer, making the CPU where the thread is stuck forever in
timer interrupt handler routine. This is now avoided rounding to
INT64_MAX the timeout period in case of overflow.

Reported by:	kib, pho
Discussed with:	kib, mav
Tested by:	pho (stress2 suite, kevent7.sh scenario)
Approved by:	re (kib)
2013-09-26 10:06:50 +00:00
Davide Italiano
1f96759fb1 Fix callout_init_rm() in the shared case, allocating storage for 'struct
rm_priotracker' directly in the softclock thread. Now consumers can
pass CALLOUT_SHAREDLOCK flag to callout initialization routine safely.
The choice of the already existing flags  instead of special casing
shared rmlocks is done to prevent consumer footshooting.

Suggested by:	jhb
Reviewed by:	jhb
Approved by:	re (delphij)
2013-09-20 23:16:15 +00:00
Mark Johnston
7b77e1fe0f Specify SDT probe argument types in the probe definition itself rather than
using SDT_PROBE_ARGTYPE(). This will make it easy to extend the SDT(9) API
to allow probes with dynamically-translated types.

There is no functional change.

MFC after:	2 weeks
2013-08-15 04:08:55 +00:00
Davide Italiano
3f321a4eac Cache the callout precision argument as part of the informations required
for migrating callouts to new CPU. This value is passed to
callout_cc_add() in order to update properly precision field in case of
rescheduling/migration.

Reviewed by:	mav
2013-03-25 09:43:50 +00:00
Andre Oppermann
a7aea132cf Bring back the comment on the sizing of the callout array that got
lost in r248031.

Requested by:	alc, alfred
2013-03-10 22:55:35 +00:00
Davide Italiano
c5904471dc Fixup r248032:
Change size requested to malloc(9) now that callwheel buckets are
callout_list and not callout_tailq anymore. This change was already
there but it seems it got lost after code churn in r248032.

Reported by:	alc, kib
2013-03-09 20:03:10 +00:00
Andre Oppermann
15ae0c9af9 Move the callout subsystem initialization to its own SYSINIT()
from being indirectly called via cpu_startup()+vm_ksubmap_init().
The boot order position remains the same at SI_SUB_CPU.

Allocation of the callout array is changed to stardard kernel malloc
from a slightly obscure direct kernel_map allocation.

kern_timeout_callwheel_alloc() is renamed to callout_callwheel_init()
to better describe its purpose.
kern_timeout_callwheel_init() is removed simplifying the per-cpu
initialization.

Reviewed by:	davide
2013-03-08 10:37:17 +00:00
Andre Oppermann
f8ccf82a4c Move the auto-sizing of the callout array from init_param2() to
kern_timeout_callwheel_alloc() where it is actually used.

This is a mechanical move and no tuning parameters are changed.

The pre-allocated callout array is only used for legacy timeout(9)
calls and is only allocated and active on cpu0.  Eventually all
remaining users of timeout(9) should switch to the callout_* API.

Reviewed by:	davide
2013-03-08 10:14:58 +00:00
Davide Italiano
ac42a1726a Complete r247813:
Use true/false instead of TRUE/FALSE.

Reported by:	attilio
Requested by:	jhb
2013-03-04 21:52:12 +00:00
Davide Italiano
a4a3ce9919 Use C99 'bool' rather than Machish 'boolean_t'.
Requested by:	jhb
2013-03-04 21:09:22 +00:00
Davide Italiano
037637812d Fix build with DIAGNOSTIC/CALLOUT_PROFILING options turned on.
Reported by:	kib, David Wolfskill <david at catwhisker dot org>
Pointy-hat to:	davide
2013-03-04 15:03:52 +00:00
Davide Italiano
5b999a6be0 - Make callout(9) tickless, relying on eventtimers(4) as backend for
precise time event generation. This greatly improves granularity of
callouts which are not anymore constrained to wait next tick to be
scheduled.
- Extend the callout KPI introducing a set of callout_reset_sbt* functions,
which take a sbintime_t as timeout argument. The new KPI also offers a
way for consumers to specify precision tolerance they allow, so that
callout can coalesce events and reduce number of interrupts as well as
potentially avoid scheduling a SWI thread.
- Introduce support for dispatching callouts directly from hardware
interrupt context, specifying an additional flag. This feature should be
used carefully, as long as interrupt context has some limitations
(e.g. no sleeping locks can be held).
- Enhance mechanisms to gather informations about callwheel, introducing
a new sysctl to obtain stats.

This change breaks the KBI. struct callout fields has been changed, in
particular 'int ticks' (4 bytes) has been replaced with 'sbintime_t'
(8 bytes) and another 'sbintime_t' field was added for precision.

Together with:	mav
Reviewed by:	attilio, bde, luigi, phk
Sponsored by:	Google Summer of Code 2012, iXsystems inc.
Tested by:	flo (amd64, sparc64), marius (sparc64), ian (arm),
		markj (amd64), mav, Fabian Keil
2013-03-04 11:09:56 +00:00
Davide Italiano
3f555c45eb callwheelmask and callwheelsize are always greater than zero.
Switch their type to u_int.
2013-03-03 15:01:33 +00:00
Davide Italiano
0fb285b716 Remove a couple of unused include. 2013-03-03 14:47:02 +00:00
Alexander Motin
4514d6fa18 MFcalloutng:
Some whitespace fixes.
2013-03-03 09:11:24 +00:00
Davide Italiano
e234a588cb MFcalloutng:
Style fixes.
2013-02-28 16:22:49 +00:00
Attilio Rao
bdf9120c16 Fixup r243901:
- As the comment report, CALLOUT_LOCAL_ALLOC cannot be checked
  directly from the callout flags but might be checked by a cached
  value.  Hence, do so before to actually remove the callout, when
  needed, in softclock_call_cc().
- In softclock_call_cc() also add a comment in the waiting and deferred
  migration case explaining that the dereference should be safe
  because of the migration dereference invariants.

Additively:
- In softclock_call_cc(), for the deferred migration case, move all the
  accesses to callout structure after the comment stating the callout
  must not be destroyed.
- For consistency with this last tweak, use cached c_flags for the
  KASSERT() in the deferred migration case.  It is not strictly necessary
  but this way all the callout accesses happen after the above mentioned
  comment, improving consistency.

Pointy hat to:	me
Sponsored by:	Isilon Systems / EMC Corporation
Reviewed by:	kib
MFC after:	2 weeks
X-MFC:		243901
2012-12-05 22:32:12 +00:00
Konstantin Belousov
eb8a718686 The softclock_call_cc() is executing with the callout already removed
from the callwheel. Calculate the cc->cc_next before removing the
callout, otherwise the code followed the invalid tailq links.  After
this, make softclock_call_cc() return void, since it always return
cc->cc_next, which is immediately available to the softclock()
anyway. This also allows to eliminate a label under #ifdef SMP.

Remove the assignment of cc->cc_next from callout_cc_del(), since the
function is called with the callout already removed from callwheel.

If cancelling the migration, also clear the CALLOUT_DFRMIGRATION flag.

Postpone the free of the timeout(9) allocated callouts after the
migration checks are done.

Add some more strict asserts about the state of the callout in
callout_call_cc().

Reviewed by:	attilio
Reported and tested by:	pho (previous version)
MFC after:	2 weeks
2012-12-05 19:02:22 +00:00