r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT
Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718
In case the implementation ever changes from using a chain of next pointers,
then changing the macro definition will be necessary, but changing all the
files that iterate over vm_map entries will not.
Drop a counter in vm_object.c that would have an effect only if the
vm_map entry count was wrong.
Discussed with: alc
Reviewed by: markj
Tested by: pho (earlier version)
Differential Revision: https://reviews.freebsd.org/D21882
The hwpmc pcpu sample buffer is prone to head of line blocking
when waiting for user process to return to user space and
collect a pending callchain. If more than one tick has elapsed
between the time the sample entry was marked for collection and
the time that the hardclock pmc handler runs to copy the records
to a larger temporary buffer, mark the sample entry as not in
use.
This changes reduces the number of samples marked as not valid
when collecting under load from ~99.5% to 5-20%.
Reported by: mjg@
MFC after: 3 days
Remove malloc_domain(9) and most other _domain KPIs added in r327900.
The new functions allow the caller to specify a general NUMA domain
selection policy, rather than specifically requesting an allocation from
a specific domain. The latter policy tends to interact poorly with
M_WAITOK, resulting in situations where a caller is blocked indefinitely
because the specified domain is depleted. Most existing consumers of
the _domain KPIs are converted to instead use a DOMAINSET_PREF() policy,
in which we fall back to other domains to satisfy the allocation
request.
This change also defines a set of DOMAINSET_FIXED() policies, which
only permit allocations from the specified domain.
Discussed with: gallatin, jeff
Reported and tested by: pho (previous version)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17418
Refactor sample ring buffer ring handling to make it more robust to
long running callchain collection handling
r338112 introduced a (now fixed) regression that exposed a number of race
conditions within the management of the sample buffers. This
simplifies the handling and moves the decision to overwrite a
callchain sample that has taken too long out of the NMI in to the
hardlock handler. With this change the problem no longer shows up as a
ring corruption but as the code spending all of its time in callchain
collection.
- Makes the producer / consumer index incrementing monotonic, making it
easier (for me at least) to reason about.
- Moves the decision to overwrite a sample from NMI context to interrupt
context where we can enforce serialization.
- Puts a time limit on waiting to collect a user callchain - putting a
bound on head-of-line blocking causing samples to be dropped
- Removes the flush routine which was previously needed to purge
dangling references to the pmc from the sample buffers but now is only
a source of a race condition on unload.
Previously one could lock up or crash HEAD by running:
pmcstat -S inst_retired.any_p -T and then hitting ^C
After this change it is no longer possible.
PR: 231793
Reviewed by: markj@
Approved by: re (gjb@)
Differential Revision: https://reviews.freebsd.org/D17011
Not all event descriptions have a sample rate (such as inst_retired.any)
this will restore the legacy behavior of using 65536 in that case. It also
prevents accidental API misuse that could lead to panic.
PR: 230985
Reported by: markj
Reviewed by: markj
Approved by: re (gjb)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16958
- Add tracker argument to preemptible epochs
- Inline epoch read path in kernel and tied modules
- Change in_epoch to take an epoch as argument
- Simplify tfb_tcp_do_segment to not take a ti_locked argument,
there's no longer any benefit to dropping the pcbinfo lock
and trying to do so just adds an error prone branchfest to
these functions
- Remove cases of same function recursion on the epoch as
recursing is no longer free.
- Remove the the TAILQ_ENTRY and epoch_section from struct
thread as the tracker field is now stack or heap allocated
as appropriate.
Tested by: pho and Limelight Networks
Reviewed by: kbowling at llnw dot com
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16066
pmc_process_interrupt takes 5 arguments when only 3 are needed.
cpu is always available in curcpu and inuserspace can always be
derived from the passed trapframe.
While facially a reasonable cleanup this change was motivated
by the need to workaround a compiler bug.
core2_intr(cpu, tf) ->
pmc_process_interrupt(cpu, ring, pmc, tf, inuserspace) ->
pmc_add_sample(cpu, ring, pm, tf, inuserspace)
In the process of optimizing the tail call the tf pointer was getting
clobbered:
(kgdb) up
at /storage/mmacy/devel/freebsd/sys/dev/hwpmc/hwpmc_mod.c:4709
4709 pmc_save_kernel_callchain(ps->ps_pc,
(kgdb) up
1205 error = pmc_process_interrupt(cpu, PMC_HR, pm, tf,
resulting in a crash in pmc_save_kernel_callchain.
- add '-j' options to filter to enable converting native pmc
log format to json lines format to enable the use of scripts
and external tooling
% pmc filter -j pmc.log pmc.jsonl
- Record the tsc value in sampling interrupts as opposed to
recording nanotime when the sample is copied to a global log
in hardclock - potentially many milliseconds later.
- At initialize record the tsc_freq and the time of day to give
us an offset for translating the tsc values in callchain records
By logging all threads and processes 'pmc filter'
can now filter on process or thread name, relieving
the user of the burden of determining which tid or
pid was which when the sample was taken.
% pmc filter -T if_io_tqg -P nginx pmc.log pmc-iflib.log
% pmc filter -x -T idle pmc.log pmc-noidle.log
This adds the -U options to pmcstat which will attribute in-kernel samples
back to the user stack that invoked the system call. It is not the default,
because when looking at kernel profiles it is generally more desirable to
merge all instances of a given system call together.
Although heavily revised, this change is directly derived from D7350 by
Jonathan T. Looney.
Obtained from: jtl
Sponsored by: Juniper Networks, Limelight Networks
There are risks associated with waiting on a preemptible epoch section.
Change the name to make them not be the default and document the issue
under CAVEATS.
Reported by: markj
This implements per-thread counters for PMC sampling. The thread
descriptors are stored in a list attached to the process descriptor.
These thread descriptors can store any per-thread information necessary
for current or future features. For the moment, they just store the counters
for sampling.
The thread descriptors are created when the process descriptor is created.
Additionally, thread descriptors are created or freed when threads
are started or stopped. Because the thread exit function is called in a
critical section, we can't directly free the thread descriptors. Hence,
they are freed to a cache, which is also used as a source of allocations
when needed for new threads.
Approved by: sbruno
Obtained from: jtl
Sponsored by: Juniper Networks, Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15335
Once a pmc owner is added to the pmc_ss_owners list it is
visible for all to see. We don't want this to happen until
setup is complete.
Reported by: mjg
Approved by: sbruno
- fix load/unload race by allocating the per-domain list structure at boot
- fix long extant vm map LOR by replacing pmc_sx sx_slock with global_epoch
to protect the liveness of elements of the pmc_ss_owners list
Reported by: pho
Approved by: sbruno
It appears that domain information is set correctly independent
of whether or not NUMA is defined. However, there is no memory
backing secondary domains leading to allocation failure.
Reported by: pho@, np@
Approved by: sbruno@
On non-trivial SMP systems the contention on the pmc_owner mutex leads
to a substantial number of samples captured being from the pmc process
itself. This change a) makes buffers larger to avoid contention on the
global list b) makes the working sample buffer per cpu.
Run pmcstat in the background (default event rate of 64k):
pmcstat -S UNHALTED_CORE_CYCLES -O /dev/null sleep 600 &
Before:
make -j96 buildkernel -s >&/dev/null 3336.68s user 24684.10s system 7442% cpu 6:16.50 total
After:
make -j96 buildkernel -s >&/dev/null 2697.82s user 1347.35s system 6058% cpu 1:06.77 total
For more realistic overhead measurement set the sample rate for ~2khz
on a 2.1Ghz processor:
pmcstat -n 1050000 -S UNHALTED_CORE_CYCLES -O /dev/null sleep 6000 &
Collecting 10 samples of `make -j96 buildkernel` from each:
x before
+ after
real time:
N Min Max Median Avg Stddev
x 10 76.4 127.62 84.845 88.577 15.100031
+ 10 59.71 60.79 60.135 60.179 0.29957192
Difference at 95.0% confidence
-28.398 +/- 10.0344
-32.0602% +/- 7.69825%
(Student's t, pooled s = 10.6794)
system time:
N Min Max Median Avg Stddev
x 10 2277.96 6948.53 2949.47 3341.492 1385.2677
+ 10 1038.7 1081.06 1070.555 1064.017 15.85404
Difference at 95.0% confidence
-2277.47 +/- 920.425
-68.1574% +/- 8.77623%
(Student's t, pooled s = 979.596)
x no pmc
+ pmc running
real time:
HEAD:
N Min Max Median Avg Stddev
x 10 58.38 59.15 58.86 58.847 0.22504567
+ 10 76.4 127.62 84.845 88.577 15.100031
Difference at 95.0% confidence
29.73 +/- 10.0335
50.5208% +/- 17.0525%
(Student's t, pooled s = 10.6785)
patched:
N Min Max Median Avg Stddev
x 10 58.38 59.15 58.86 58.847 0.22504567
+ 10 59.71 60.79 60.135 60.179 0.29957192
Difference at 95.0% confidence
1.332 +/- 0.248939
2.2635% +/- 0.426506%
(Student's t, pooled s = 0.264942)
system time:
HEAD:
N Min Max Median Avg Stddev
x 10 1010.15 1073.31 1025.465 1031.524 18.135705
+ 10 2277.96 6948.53 2949.47 3341.492 1385.2677
Difference at 95.0% confidence
2309.97 +/- 920.443
223.937% +/- 89.3039%
(Student's t, pooled s = 979.616)
patched:
N Min Max Median Avg Stddev
x 10 1010.15 1073.31 1025.465 1031.524 18.135705
+ 10 1038.7 1081.06 1070.555 1064.017 15.85404
Difference at 95.0% confidence
32.493 +/- 16.0042
3.15% +/- 1.5794%
(Student's t, pooled s = 17.0331)
Reviewed by: jeff@
Approved by: sbruno@
Differential Revision: https://reviews.freebsd.org/D15155
pmcstat request for close will generate a close event.
This event will be in turn received by pmcstat to close the file.
Reviewed by: kib
Tested by: pho
MFC after: 1 week
Sponsored by: Stormshield
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
instead of asserting the condition.
The row index is directly supplied by userspace, the kernel must
handle invalid values.
Submitted by: pho
MFC after: 3 days
The r195005 unlocked pmc_sx before calling into pmclog_configure_log()
to avoid the LOR, but it allows flush or closelog to run in parallel
with the configuration, causing many failure modes.
Revert r195005. Pre-create the logging process, allowing it to run
after the set up succeeded, otherwise the process terminates itself.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D12882
When HWPMC stops sampling, ps_pmc may be freed before samples
are processed. In such situation treat PMC as stopped.
Add "ifdef" to fix build without INVARIANTS code.
Submitted by: Michal Mazur <mkm@semihalf.com>
Obtained from: Semihalf
Sponsored by: Stormshield, Netgate
Differential revision: https://reviews.freebsd.org/D10912
I see the fllowing panic on AMD when exiting pmcstat:
panic: [pmc,1473] pp_pmcval outside of expected range cpu=2 ri=17
pp_pmcval=fffffffffa529f5b pm_reloadcount=10000
It seems that at least on AMD a performance counter keeps counting after
overflowing. When pmcstat exits it sets counters that it used to
PMC_STATE_DELETED and waits until their use count goes to zero.
amd_intr() wouldn't reload a counter in that state and, thus, a counter
would be allowed to overflow. That means that the counter's value would
be allowed to go outside the expected range.
MFC after: 2 weeks
When hwpmc stops sampling it will set the pm_state to something other
than PMC_STATE_RUNNING. This means the following sequence can happen:
CPU 0: Enter the interrupt handler
CPU 0: Set the thread TDP_CALLCHAIN pflag
CPU 1: Stop sampling
CPU 0: Call pmc_process_samples, sampling is stopped so clears ps_nsamples
CPU 0: Finishes interrupt processing with the TDP_CALLCHAIN flag set
CPU 0: Call pmc_capture_user_callchain to capture the user call chain
CPU 0: Find all the pmc sample are free so no call chains need to be captured
CPU 0: KASSERT because of this
This fixes the issue by checking if any of the samples have been stopped
and including this in te KASSERT.
PR: 204273
Reviewed by: bz, gnn
Obtained from: ABT Systems Ltd
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D6581
Currently, Application Processors (non-boot CPUs) are started by
MD code at SI_SUB_CPU, but they are kept waiting in a "pen" until
SI_SUB_SMP at which point they are released to run kernel threads.
SI_SUB_SMP is one of the last SYSINIT levels, so APs don't enter
the scheduler and start running threads until fairly late in the
boot.
This change moves SI_SUB_SMP up to just before software interrupt
threads are created allowing the APs to start executing kernel
threads much sooner (before any devices are probed). This allows
several initialization routines that need to perform initialization
on all CPUs to now perform that initialization in one step rather
than having to defer the AP initialization to a second SYSINIT run
at SI_SUB_SMP. It also permits all CPUs to be available for
handling interrupts before any devices are probed.
This last feature fixes a problem on with interrupt vector exhaustion.
Specifically, in the old model all device interrupts were routed
onto the boot CPU during boot. Later after the APs were released at
SI_SUB_SMP, interrupts were redistributed across all CPUs.
However, several drivers for multiqueue hardware allocate N interrupts
per CPU in the system. In a system with many CPUs, just a few drivers
doing this could exhaust the available pool of interrupt vectors on
the boot CPU as each driver was allocating N * mp_ncpu vectors on the
boot CPU. Now, drivers will allocate interrupts on their desired CPUs
during boot meaning that only N interrupts are allocated from the boot
CPU instead of N * mp_ncpu.
Some other bits of code can also be simplified as smp_started is
now true much earlier and will now always be true for these bits of
code. This removes the need to treat the single-CPU boot environment
as a special case.
As a transition aid, the new behavior is available under a new kernel
option (EARLY_AP_STARTUP). This will allow the option to be turned off
if need be during initial testing. I plan to enable this on x86 by
default in a followup commit in the next few days and to have all
platforms moved over before 11.0. Once the transition is complete,
the option will be removed along with the !EARLY_AP_STARTUP code.
These changes have only been tested on x86. Other platform maintainers
are encouraged to port their architectures over as well. The main
things to check for are any uses of smp_started in MD code that can be
simplified and SI_SUB_SMP SYSINITs in MD code that can be removed in
the EARLY_AP_STARTUP case (e.g. the interrupt shuffling).
PR: kern/199321
Reviewed by: markj, gnn, kib
Sponsored by: Netflix
when its result is immediately ignored, i.e. for kernel processes
forked from the user process. Do not test for non-null before freeing
string.
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
The code tracks a counter which is the number of events until the next
sample. On context switch in, it loads the saved counter. On context
switch out, it tries to calculate a new saved counter.
Problems:
1. The saved counter was shared by all threads in a process. However, this
means that all threads would be initially loaded with the same saved
counter. However, that could result in sampling more often than once every
X number of events.
2. The calculation to determine a new saved counter was backwards. It
added when it should have subtracted, and subtracted when it should have
added. Assume a single-threaded process with a reload count of 1000 events.
Assuming the counter on context switch in was 100 and the counter on context
switch out was 50 (meaning the thread has "consumed" 50 more events), the
code would calculate a new saved counter of 150 (instead of the proper 50).
Fix:
1. As soon as the saved counter is used to initialize a monitor for a
thread on context switch in, set the saved counter to the reload count.
That way, subsequent threads to use the saved counter will get the full
reload count, assuring we sample at least once every X number of events
(across all threads).
2. Change the calculation of the saved counter. Due to the change to the
saved counter in #1, we simply need to add (modulo the reload count) the
remaining counter time we retrieve from the CPU when a thread is context
switched out.
Differential Revision: https://reviews.freebsd.org/D4122
Approved by: gnn (mentor)
MFC after: 1 month
Sponsored by: Juniper Networks
Changes to the code to gather user stacks:
* Delay setting pmc_cpumask until we actually have the stack.
* When recording user stack traces, only walk the portion of the ring
that should have samples for us.
Sponsored by: Juniper Networks
Approved by: gnn (mentor)
MFC after: 1 month
Currently, there is a single pm_stalled flag that tracks whether a
performance monitor was "stalled" due to insufficent ring buffer
space for samples. However, because the same performance monitor
can run on multiple processes or threads at the same time, a single
pm_stalled flag that impacts them all seems insufficient.
In particular, you can hit corner cases where the code fails to stop
performance monitors during a context switch out, because it thinks
the performance monitor is already stopped. However, in reality,
it may be that only the monitor running on a different CPU was stalled.
This patch attempts to fix that behavior by tracking on a per-CPU basis
whether a PM desires to run and whether it is "stalled". This lets the
code make better decisions about when to stop PMs and when to try to
restart them. Ideally, we should avoid the case where the code fails
to stop a PM during a context switch out.
Sponsored by: Juniper Networks
Reviewed by: jhb
Approved by: gnn (mentor)
Differential Revision: https://reviews.freebsd.org/D4124