branch.
This function is used to drain a callout via a callback instead of
blocking the caller until the drain is complete. Refer to the
callout_drain_async() manual page for a detailed description.
Limitation: If a lock is used with the callout, the callout can only
be drained asynchronously one time unless the callout_init_mtx()
function is called again. This limitation is not present in
projects/hps_head and will require more invasive changes to the
timeout code, which was not in the scope of this patch.
Differential Revision: https://reviews.freebsd.org/D3521
Reviewed by: wblock
MFC after: 1 month
The typo was introduced in r278469 / 344ecf88af.
As a result of the bug there was a timing window where callout_reset()
would fail to cancel a concurrent execution of a callout that is about
to start and would schedule the callout again.
The callout would fire more times than it is scheduled.
That would happen even if the callout is initialized with a lock.
For example, the bug triggered the "Stray timeout" assertion in
taskqueue_timeout_func().
MFC after: 5 days
it helps only the TCP timers callout(9) usage. As the benefit for
other callout(9) usages did not reach a consensus, the historical
usage should prevail.
Differential Revision: https://reviews.freebsd.org/D3078
being serviced return 0 (fail), but this applies only to
mpsafe callouts. Thanks to hselasky for finding this.
Differential Revision: https://reviews.freebsd.org/D3078 (Updated)
Submitted by: hselasky
Reviewed by: jch
being serviced and indeed unstoppable.
A scenario to reproduce this case is:
- the callout is being serviced and at the same time,
- callout_reset() is called on this callout that sets
the CALLOUT_PENDING flag and at the same time,
- callout_stop() is called on this callout and returns 1 (success)
even if the callout is indeed currently running and unstoppable.
This issue was caught while making the r284245 (D2763) workaround, and
was discussed at BSDCan 2015. Once this change is applied, the r284245
workaround is no longer needed and will be reverted.
Differential Revision: https://reviews.freebsd.org/D3078
Reviewed by: jhb
Sponsored by: Verisign, Inc.
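The scenario above can be modelled in userspace as a minimal sketch. The
struct and the toy_* functions are hypothetical stand-ins, not the kernel
implementation; they only illustrate the return value rule that a callout
which is currently being serviced must not be reported as stopped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy state for one callout; not the kernel's struct callout. */
struct toy_callout {
	bool pending;	/* scheduled to run (again) */
	bool running;	/* handler is executing right now */
};

/* Model of callout_reset(): mark the callout as scheduled. */
static void
toy_callout_reset(struct toy_callout *c)
{
	assert(c != NULL);
	c->pending = true;
}

/*
 * Model of the corrected callout_stop(): cancel a pending run, but
 * report 1 (success) only if no handler is still executing.
 */
static int
toy_callout_stop(struct toy_callout *c)
{
	bool was_pending;

	assert(c != NULL);
	was_pending = c->pending;
	c->pending = false;
	if (c->running)
		return (0);	/* being serviced: cannot be stopped */
	return (was_pending ? 1 : 0);
}
```

In this model, a reset followed by a stop while "running" is set yields 0,
which is the outcome the fix is described as restoring.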
CPU, also add protection against invalid CPUs as well as
split c_flags and c_iflags so that if a user plays with the active
flag (the one expected to be played with by callers in MPSAFE) without
a lock, it won't adversely affect the callout system by causing a corrupt
list. This also means that all callers need to use the macros and *not*
play with the flags directly (like netgraph used to).
Differential Revision: https://reviews.freebsd.org/D1894
Reviewed by: .. timed out but looked at by jhb, imp, adrian, hselasky
tested by hiren and netflix.
Sponsored by: Netflix Inc.
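A minimal userspace sketch of the c_flags / c_iflags split described
above. The flag values, the struct and every toy_* name are assumptions
made for illustration; the real definitions live in the kernel headers.
The point shown is that callers go through accessor macros, so a caller
changing the active bit never touches the word that the callout system
uses for its own bookkeeping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values, for this sketch only. */
#define	TOY_CALLOUT_ACTIVE	0x0002	/* caller-visible state */
#define	TOY_CALLOUT_PENDING	0x0004	/* internal state */

struct toy_callout {
	int c_flags;	/* may be changed by callers, via macros only */
	int c_iflags;	/* owned by the callout system */
};

#define	toy_callout_active(c)	  (((c)->c_flags & TOY_CALLOUT_ACTIVE) != 0)
#define	toy_callout_deactivate(c) ((c)->c_flags &= ~TOY_CALLOUT_ACTIVE)
#define	toy_callout_pending(c)	  (((c)->c_iflags & TOY_CALLOUT_PENDING) != 0)

/* Model of scheduling: the system marks the callout pending and active. */
static void
toy_callout_schedule(struct toy_callout *c)
{
	assert(c != NULL);
	c->c_iflags |= TOY_CALLOUT_PENDING;
	c->c_flags |= TOY_CALLOUT_ACTIVE;
}
```

Deactivating through the macro clears only the bit in c_flags; the
pending bit in c_iflags is left as the system set it.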
code in my last commit. The cc_exec_next is used to track the next
callout when a direct call is being made from callout. It is *never* used
in the indirect method. When macro-izing I made it so that it
would separate out direct/vs/non-direct. This is incorrect and can
cause panics as Peter Holm has found for me (Thanks so much Peter for
all your help in this). What this change does is restore that behavior
but also get rid of the cc_next from the array and instead make it
be part of the base callout structure. This way no one else will get
confused since we will never use it for non-direct.
Reviewed by: Peter Holm and more importantly tested by him ;-)
MFC after: 3 days.
Sponsored by: Netflix Inc.
is being done in the callout code and harmonizes the macro
use.:
1) The callout_active() will lie. Basically if a migration
is occurring and the callout is about to expire and the
migration has been deferred, the callout_active will no
longer return true until after the migration. This confuses
and breaks callers that are doing callout_init(&c, 1); such
as TCP.
2) The migration code had a bug in it where when migrating, if
two calls to callout_reset came in and they both collided with
the callout on the wheel about to run, then the second call to
callout_reset would corrupt the list the callout wheel uses
putting the callout thread into an endless loop.
3) Per imp, I have fixed all the macro occurrences in the code that
were for the most part being ignored.
Phabricator D1711 and looked at by lstewart and jhb and sbruno.
Reviewed by: kostikbel, imp, adrian, hselasky
MFC after: 3 days
Sponsored by: Netflix Inc.
being held before sleeping.
This has bitten me (in ath(4)) once before and I'd like to see this
not bite anyone else.
Differential Revision: D1638
Reviewed by: jhb, hselasky
MFC after: 1 week
FreeBSD developers need more time to review patches in the surrounding
areas like the TCP stack which are using MPSAFE callouts to restore
distribution of callouts on multiple CPUs.
Bump the __FreeBSD_version instead of reverting it.
Suggested by: kmacy, adrian, glebius and kib
Differential Revision: https://reviews.freebsd.org/D1438
- Close a migration race where callout_reset() failed to set the
CALLOUT_ACTIVE flag.
- Callout callback functions are now allowed to be protected by
spinlocks.
- Switching the callout CPU number cannot always be done on a
per-callout basis. See the updated timeout(9) manual page for more
information.
- The timeout(9) manual page has been updated to reflect how all the
functions inside the callout API are working. The manual page has
been made function oriented to make it easier to deduce how each of
the functions making up the callout API are working without having
to first read the whole manual page. Group all functions into a
handful of sections which should give a quick top-level overview
when the different functions should be used.
- The CALLOUT_SHAREDLOCK flag and its functionality has been removed
to reduce the complexity in the callout code and to avoid problems
about atomically stopping callouts via callout_stop(). If someone
needs it, it can be re-added. From my quick grep there are no
CALLOUT_SHAREDLOCK clients in the kernel.
- A new callout API function named "callout_drain_async()" has been
added. See the updated timeout(9) manual page for a complete
description.
- Update the callout clients in the "kern/" folder to use the callout
API properly, like cv_timedwait(). Previously there was some custom
sleepqueue code in the callout subsystem, which has been removed,
because we now allow callouts to be protected by spinlocks. This
allows us to tear down the callout like done with regular mutexes,
and a "td_slpmutex" has been added to "struct thread" to atomically
teardown the "td_slpcallout". Further the "TDF_TIMOFAIL" and
"SWT_SLEEPQTIMO" states can now be completely removed. Currently
they are marked as available and will be cleaned up in a follow up
commit.
- Bump the __FreeBSD_version to indicate kernel modules need
recompilation.
- There have been several reports that this patch "seems to squash a
serious bug leading to a callout timeout and panic".
Kernel build testing: all architectures were built
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D1438
Sponsored by: Mellanox Technologies
Reviewed by: jhb, adrian, sbruno and emaste
a running event each time it executes a callout function. The event
includes the function pointer, argument, and whether or not it was run from
hardware interrupt context. The callwheel is marked idle when each handler
completes. This effectively logs the duration of each callout routine in
the graph.
These changes prevent sysctl(8) from returning proper output,
such as:
1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating children of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, since some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading kludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.
Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, since malloc()/free() are
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
Under enough load, the swi's can actually be preempted and migrated
to other currently free cores. When doing RSS experiments, this led
to the per-CPU TCP timers not lining up any more with the RX CPU said
flows were ending up on, leading to increased lock contention.
Since there was a little pushback on flipping them on by default,
I've left the default at "don't pin."
The other less obvious problem here is that the default swi
is also the same as the destination swi for CPU #0. So if one
pins the swi on CPU #0, there's no default floating swi.
A nice future project would be to create a separate swi for
the "default" floating swi, as well as per-CPU swis that are
(optionally) pinned.
Tested:
* parallel TCP tests (2 x 1g unfortunately for now);
CPU: Intel(R) Xeon(R) CPU E5-2650
Note:
This is based on some initial investigation into RSS/TCP stack lock
contention on FreeBSD-HEAD whilst at Netflix in January 2014.
SBT_MAX, to make it more robust in case internal type representation will
change in the future. All the consumers were migrated to SBT_MAX and
every new consumer (if any) should from now on use this interface.
Requested by: bapt, jmg, Ryan Lortie (implicitly)
Reviewed by: mav, bde
In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).
Reviewed by: markj
MFC after: 3 weeks
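A minimal sketch of the '__' to '-' emulation described above, written
as a plain userspace helper. The function name toy_probe_name() and the
example probe name are illustrative assumptions; this is not the code
the kernel or DTrace uses for the translation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy src into dst, turning every "__" pair into a single '-'.
 * dst must have room for at least strlen(src) + 1 bytes.
 * Returns dst.
 */
static char *
toy_probe_name(char *dst, const char *src)
{
	size_t i = 0, j = 0;

	assert(dst != NULL && src != NULL);
	while (src[i] != '\0') {
		if (src[i] == '_' && src[i + 1] == '_') {
			dst[j++] = '-';		/* two underscores: one dash */
			i += 2;
		} else {
			dst[j++] = src[i++];	/* everything else is copied */
		}
	}
	dst[j] = '\0';
	return (dst);
}
```

With this helper, a C identifier such as "callout__start" maps to the
probe name "callout-start", while a single underscore is left alone.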
option, unbreak the lock tracing release semantic by embedding
calls to LOCKSTAT_PROFILE_RELEASE_LOCK() directly in the inlined
version of the releasing functions for mutex, rwlock and sxlock.
Failing to do so skips the lockstat_probe_func invocation for
unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
kernel compiled without lock debugging options, potentially every
consumer must be compiled including opt_kdtrace.h.
Fix this by moving KDTRACE_HOOKS into opt_global.h and removing the
dependency on opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
is linked there and it is only used as a compile-time stub [0].
[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested. As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while. Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].
Sponsored by: EMC / Isilon storage division
Discussed with: rstone
[0] Reported by: rstone
[1] Discussed with: philip
Without these, if the timeout value passed is "large enough", the
value of the sum of it and other factors (e.g. current time as
returned by sbinuptime() or 'precision' argument) might result in a
negative number. This negative number is then passed to
eventtimers(4), which causes et_start() routine to load et_min_period
into eventtimer, making the CPU where the thread is stuck forever in
timer interrupt handler routine. This is now avoided by clamping the
timeout period to INT64_MAX in case of overflow.
Reported by: kib, pho
Discussed with: kib, mav
Tested by: pho (stress2 suite, kevent7.sh scenario)
Approved by: re (kib)
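The overflow check described above can be sketched in userspace as a
saturating addition. The type toy_sbintime_t, the TOY_SBT_MAX constant
and the function name are assumptions for illustration, and the sketch
assumes both operands are non-negative; it is not the exact check that
was committed.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t toy_sbintime_t;		/* stand-in for sbintime_t */
#define	TOY_SBT_MAX	INT64_MAX

/*
 * Add two non-negative time values.  If the sum would not fit, return
 * the maximum instead of wrapping around to a negative number.
 */
static toy_sbintime_t
toy_sbt_add_sat(toy_sbintime_t a, toy_sbintime_t b)
{
	assert(a >= 0 && b >= 0);
	if (a > TOY_SBT_MAX - b)	/* a + b would overflow */
		return (TOY_SBT_MAX);
	return (a + b);
}
```

The comparison is done before the addition, so the signed overflow
itself never happens.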
rm_priotracker' directly in the softclock thread. Now consumers can
pass CALLOUT_SHAREDLOCK flag to callout initialization routine safely.
The choice of the already existing flags instead of special casing
shared rmlocks is done to prevent consumer footshooting.
Suggested by: jhb
Reviewed by: jhb
Approved by: re (delphij)
using SDT_PROBE_ARGTYPE(). This will make it easy to extend the SDT(9) API
to allow probes with dynamically-translated types.
There is no functional change.
MFC after: 2 weeks
for migrating callouts to new CPU. This value is passed to
callout_cc_add() in order to update properly precision field in case of
rescheduling/migration.
Reviewed by: mav
Change size requested to malloc(9) now that callwheel buckets are
callout_list and not callout_tailq anymore. This change was already
there but it seems it got lost after code churn in r248032.
Reported by: alc, kib
from being indirectly called via cpu_startup()+vm_ksubmap_init().
The boot order position remains the same at SI_SUB_CPU.
Allocation of the callout array is changed to standard kernel malloc
from a slightly obscure direct kernel_map allocation.
kern_timeout_callwheel_alloc() is renamed to callout_callwheel_init()
to better describe its purpose.
kern_timeout_callwheel_init() is removed simplifying the per-cpu
initialization.
Reviewed by: davide
kern_timeout_callwheel_alloc() where it is actually used.
This is a mechanical move and no tuning parameters are changed.
The pre-allocated callout array is only used for legacy timeout(9)
calls and is only allocated and active on cpu0. Eventually all
remaining users of timeout(9) should switch to the callout_* API.
Reviewed by: davide
precise time event generation. This greatly improves granularity of
callouts which are no longer constrained to wait for the next tick to be
scheduled.
- Extend the callout KPI introducing a set of callout_reset_sbt* functions,
which take a sbintime_t as timeout argument. The new KPI also offers a
way for consumers to specify precision tolerance they allow, so that
callout can coalesce events and reduce number of interrupts as well as
potentially avoid scheduling a SWI thread.
- Introduce support for dispatching callouts directly from hardware
interrupt context, specifying an additional flag. This feature should be
used carefully, since interrupt context has some limitations
(e.g. no sleeping locks can be held).
- Enhance mechanisms to gather information about the callwheel, introducing
a new sysctl to obtain stats.
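A minimal userspace sketch of the precision tolerance idea from the list
above. It assumes a 32.32 fixed-point time value and treats each event
as allowed to fire anywhere in [time, time + precision]; two events can
then share one interrupt when those windows overlap. All toy_* names
and the overlap rule are illustrative assumptions, not the aggregation
logic of the callout code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t toy_sbintime_t;		/* assumed 32.32 fixed point */
#define	TOY_SBT_1S	((toy_sbintime_t)1 << 32)
#define	TOY_SBT_1MS	(TOY_SBT_1S / 1000)

struct toy_event {
	toy_sbintime_t time;		/* earliest time to fire */
	toy_sbintime_t precision;	/* how much later is acceptable */
};

/* True if the allowed windows of a and b overlap. */
static bool
toy_can_coalesce(const struct toy_event *a, const struct toy_event *b)
{
	assert(a->precision >= 0 && b->precision >= 0);
	return (a->time <= b->time + b->precision &&
	    b->time <= a->time + a->precision);
}

/* Earliest time that satisfies both events: the later start time. */
static toy_sbintime_t
toy_coalesced_time(const struct toy_event *a, const struct toy_event *b)
{
	return (a->time > b->time ? a->time : b->time);
}
```

For example, an event due at 10 ms with 5 ms of tolerance can be served
together with an exact event due at 12 ms, but not with one due at 20 ms.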
This change breaks the KBI. struct callout fields have been changed, in
particular 'int ticks' (4 bytes) has been replaced with 'sbintime_t'
(8 bytes) and another 'sbintime_t' field was added for precision.
Together with: mav
Reviewed by: attilio, bde, luigi, phk
Sponsored by: Google Summer of Code 2012, iXsystems inc.
Tested by: flo (amd64, sparc64), marius (sparc64), ian (arm),
markj (amd64), mav, Fabian Keil
- As the comment reports, CALLOUT_LOCAL_ALLOC cannot be checked
directly from the callout flags but might be checked by a cached
value. Hence, do so before actually removing the callout, when
needed, in softclock_call_cc().
- In softclock_call_cc() also add a comment in the waiting and deferred
migration case explaining that the dereference should be safe
because of the migration dereference invariants.
Additionally:
- In softclock_call_cc(), for the deferred migration case, move all the
accesses to callout structure after the comment stating the callout
must not be destroyed.
- For consistency with this last tweak, use cached c_flags for the
KASSERT() in the deferred migration case. It is not strictly necessary
but this way all the callout accesses happen after the above mentioned
comment, improving consistency.
Pointy hat to: me
Sponsored by: Isilon Systems / EMC Corporation
Reviewed by: kib
MFC after: 2 weeks
X-MFC: 243901
from the callwheel. Calculate the cc->cc_next before removing the
callout, otherwise the code followed the invalid tailq links. After
this, make softclock_call_cc() return void, since it always returns
cc->cc_next, which is immediately available to the softclock()
anyway. This also allows to eliminate a label under #ifdef SMP.
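The ordering issue described above is the usual "read the next link
before unlinking" pattern. A minimal userspace sketch using the
sys/queue.h TAILQ macros follows; struct toy_entry, struct toy_bucket
and toy_drain_bucket() are hypothetical names and do not correspond to
the callwheel code.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

struct toy_entry {
	int id;
	TAILQ_ENTRY(toy_entry) link;
};
TAILQ_HEAD(toy_bucket, toy_entry);

/*
 * Unlink every entry from the bucket and return how many were visited.
 * The successor is read while the entry is still linked, because the
 * links of a removed entry must not be followed.
 */
static int
toy_drain_bucket(struct toy_bucket *bucket)
{
	struct toy_entry *e, *next;
	int visited = 0;

	assert(bucket != NULL);
	for (e = TAILQ_FIRST(bucket); e != NULL; e = next) {
		next = TAILQ_NEXT(e, link);	/* before the removal */
		TAILQ_REMOVE(bucket, e, link);
		visited++;
	}
	return (visited);
}
```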
Remove the assignment of cc->cc_next from callout_cc_del(), since the
function is called with the callout already removed from callwheel.
If cancelling the migration, also clear the CALLOUT_DFRMIGRATION flag.
Postpone the free of the timeout(9) allocated callouts after the
migration checks are done.
Add some more strict asserts about the state of the callout in
callout_call_cc().
Reviewed by: attilio
Reported and tested by: pho (previous version)
MFC after: 2 weeks
cache line in order to avoid manual frobbing but using
struct mtx_padalign.
The sole exceptions are the nvme and sfxge drivers, where the author
redefined CACHE_LINE_SIZE manually, so they need to be analyzed and
dealt with separately.
Reviewed by: jimharris, alc
sharing especially on the default CPU 0 callout_cpu structure.
This will be followed up by attilio@ with a conversion to the new struct
mtx_padalign but doing this manual conversion first gives an easy MFC
candidate since mtx_padalign is a more extensive system change.
Sponsored by: Intel
Reviewed by: jeff, attilio
MFC after: 1 week