306 Commits

Author SHA1 Message Date
mmacy
c937b516d8 CK: update consumers to use CK macros across the board
r334189 changed the fields to have names distinct from those in queue.h
in order to expose the oversights as compile time errors
2018-05-24 23:21:23 +00:00
mmacy
3ddb2ade67 hwpmc: set threadid in callchain records - second part of r334108 2018-05-23 17:44:29 +00:00
mmacy
0ff2c24861 pmc: detach free_gtask on unload
Reported by:	pho
2018-05-20 20:34:15 +00:00
mmacy
5f1cf9cd72 pmc: avoid potential race on shutdown
Clear shutdown flag first, conservatively allow 5ms for all hardclock consumers to
see flag before draining
2018-05-20 19:35:24 +00:00
mmacy
a48d80f193 epoch(9): Make epochs non-preemptible by default
There are risks associated with waiting on a preemptible epoch section.
Change the name to make them not be the default and document the issue
under CAVEATS.

Reported by:	markj
2018-05-18 17:29:43 +00:00
mmacy
b4ad383689 hwpmc: Implement per-thread counters for PMC sampling
This implements per-thread counters for PMC sampling. The thread
descriptors are stored in a list attached to the process descriptor.
These thread descriptors can store any per-thread information necessary
for current or future features. For the moment, they just store the counters
for sampling.

The thread descriptors are created when the process descriptor is created.
Additionally, thread descriptors are created or freed when threads
are started or stopped. Because the thread exit function is called in a
critical section, we can't directly free the thread descriptors. Hence,
they are freed to a cache, which is also used as a source of allocations
when needed for new threads.

Approved by:	sbruno
Obtained from:	jtl
Sponsored by:	Juniper Networks, Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15335
2018-05-16 22:29:20 +00:00
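
Note on b4ad383689: the message explains that thread descriptors cannot be freed directly from the thread exit path, so they are parked in a cache that later allocations draw from. A minimal sketch of that free-list pattern follows. The names `thread_desc`, `desc_cache`, `desc_cache_put` and `desc_cache_get` are hypothetical and are not taken from the hwpmc sources, and the locking a real kernel would need is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-thread descriptor; the real hwpmc structure differs. */
struct thread_desc {
    struct thread_desc *next;   /* link used while parked in the cache */
    unsigned long counter;      /* stand-in for per-thread sampling state */
};

/* Singly linked free list acting as the descriptor cache. */
struct desc_cache {
    struct thread_desc *head;
    size_t count;
};

/* Called where freeing is not allowed: park the descriptor instead. */
static void
desc_cache_put(struct desc_cache *c, struct thread_desc *td)
{
    assert(c != NULL && td != NULL);
    td->next = c->head;
    c->head = td;
    c->count++;
}

/* Reuse a parked descriptor if one exists, otherwise allocate a new one. */
static struct thread_desc *
desc_cache_get(struct desc_cache *c)
{
    struct thread_desc *td;

    assert(c != NULL);
    td = c->head;
    if (td != NULL) {
        c->head = td->next;
        c->count--;
    } else {
        td = malloc(sizeof(*td));
        if (td == NULL)
            return (NULL);
    }
    td->next = NULL;
    td->counter = 0;
    return (td);
}
```

A descriptor handed to `desc_cache_put()` is returned by the next `desc_cache_get()`, so the exit path never has to call the allocator's free routine.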
mmacy
4c3dbe627d hwpmc: don't reference domain index with no memory backing it
On multi-socket the domain will be correctly set for a given CPU
regardless of whether or not NUMA is enabled.

Approved by:	sbruno
2018-05-14 06:11:25 +00:00
mmacy
4d2273cd96 pmc: don't add pmc owner to list until setup is complete
Once a pmc owner is added to the pmc_ss_owners list it is
visible for all to see. We don't want this to happen until
setup is complete.

Reported by:	mjg
Approved by:	sbruno
2018-05-14 01:08:47 +00:00
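
Note on 4d2273cd96: the fix is an ordering rule, finish setup first and make the object visible last. The sketch below illustrates only that ordering. The `owner` structure, its fields and `owners_head` are hypothetical stand-ins for the pmc owner and the pmc_ss_owners list, and the synchronization a real kernel list needs is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical owner record; field names are illustrative only. */
struct owner {
    struct owner *next;
    int logfd;
    int ready;
};

static struct owner *owners_head;   /* stand-in for a globally visible list */

/* Finish every initialization step first, then link the object in last. */
static void
owner_setup_and_publish(struct owner *o, int logfd)
{
    assert(o != NULL);
    o->logfd = logfd;
    o->ready = 1;
    /* Publication is the final step, so readers never see a half-built owner. */
    o->next = owners_head;
    owners_head = o;
}
```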
mmacy
9888701947 hwpmc: fix load/unload race and vm map LOR
- fix load/unload race by allocating the per-domain list structure at boot

- fix long extant vm map LOR by replacing pmc_sx sx_slock with global_epoch
  to protect the liveness of elements of the pmc_ss_owners list

Reported by:	pho
Approved by:	sbruno
2018-05-14 00:21:04 +00:00
mmacy
71ab2f70a9 hwpmc/epoch - don't reference domain if NUMA is not set
It appears that domain information is set correctly independent
of whether or not NUMA is defined. However, there is no memory
backing secondary domains leading to allocation failure.

Reported by:	pho@, np@
Approved by:	sbruno@
2018-05-12 20:00:29 +00:00
mmacy
dcb7d046f9 hwpmc(9): clear remaining sample work for hardclock
- fix last minute change in 333509 whereby runcount references
  to a pmc would remain, causing us to loop on pause forever

Approved by:	sbruno
2018-05-12 03:45:30 +00:00
mmacy
2981a3420c hwpmc(9): Make pmclog buffer pcpu and update constants
On non-trivial SMP systems the contention on the pmc_owner mutex leads
to a substantial number of samples captured being from the pmc process
itself. This change a) makes buffers larger to avoid contention on the
global list b) makes the working sample buffer per cpu.

Run pmcstat in the background (default event rate of 64k):
pmcstat -S UNHALTED_CORE_CYCLES -O /dev/null sleep 600 &

Before:
make -j96 buildkernel -s >&/dev/null 3336.68s user 24684.10s system 7442% cpu 6:16.50 total

After:
make -j96 buildkernel -s >&/dev/null 2697.82s user 1347.35s system 6058% cpu 1:06.77 total

For a more realistic overhead measurement, set the sample rate to ~2 kHz
on a 2.1 GHz processor:
pmcstat -n 1050000 -S UNHALTED_CORE_CYCLES -O /dev/null sleep 6000 &

Collecting 10 samples of `make -j96 buildkernel` from each:

x before
+ after

real time:
    N           Min           Max        Median           Avg        Stddev
x  10          76.4        127.62        84.845        88.577     15.100031
+  10         59.71         60.79        60.135        60.179    0.29957192
Difference at 95.0% confidence
        -28.398 +/- 10.0344
        -32.0602% +/- 7.69825%
        (Student's t, pooled s = 10.6794)

system time:
    N           Min           Max        Median           Avg        Stddev
x  10       2277.96       6948.53       2949.47      3341.492     1385.2677
+  10        1038.7       1081.06      1070.555      1064.017      15.85404
Difference at 95.0% confidence
        -2277.47 +/- 920.425
        -68.1574% +/- 8.77623%
        (Student's t, pooled s = 979.596)

x no pmc
+ pmc running
real time:

HEAD:
    N           Min           Max        Median           Avg        Stddev
x  10         58.38         59.15         58.86        58.847    0.22504567
+  10          76.4        127.62        84.845        88.577     15.100031
Difference at 95.0% confidence
        29.73 +/- 10.0335
        50.5208% +/- 17.0525%
        (Student's t, pooled s = 10.6785)

patched:
    N           Min           Max        Median           Avg        Stddev
x  10         58.38         59.15         58.86        58.847    0.22504567
+  10         59.71         60.79        60.135        60.179    0.29957192
Difference at 95.0% confidence
        1.332 +/- 0.248939
        2.2635% +/- 0.426506%
        (Student's t, pooled s = 0.264942)

system time:

HEAD:
    N           Min           Max        Median           Avg        Stddev
x  10       1010.15       1073.31      1025.465      1031.524     18.135705
+  10       2277.96       6948.53       2949.47      3341.492     1385.2677
Difference at 95.0% confidence
        2309.97 +/- 920.443
        223.937% +/- 89.3039%
        (Student's t, pooled s = 979.616)

patched:
    N           Min           Max        Median           Avg        Stddev
x  10       1010.15       1073.31      1025.465      1031.524     18.135705
+  10        1038.7       1081.06      1070.555      1064.017      15.85404
Difference at 95.0% confidence
        32.493 +/- 16.0042
        3.15% +/- 1.5794%
        (Student's t, pooled s = 17.0331)

Reviewed by:	jeff@
Approved by:	sbruno@
Differential Revision:	https://reviews.freebsd.org/D15155
2018-05-12 01:26:34 +00:00
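
Note on 2981a3420c: the figures in the message can be checked with a little arithmetic. The sketch below recomputes the difference of means, the relative difference, and the pooled variance behind the "pooled s" value, plus the sampling frequency implied by a reload count. The `summary` structure and function names are hypothetical, and reading `-n 1050000` as "one sample per 1,050,000 cycles" is an assumption based on the ~2 kHz figure the message itself gives.

```c
#include <assert.h>

/* Summary of one ministat-style sample set. */
struct summary {
    int n;
    double avg;
    double stddev;
};

/* Absolute difference of the means (after minus before). */
static double
mean_diff(struct summary before, struct summary after)
{
    return (after.avg - before.avg);
}

/* Difference relative to the "before" mean, in percent. */
static double
mean_diff_pct(struct summary before, struct summary after)
{
    assert(before.avg != 0.0);
    return (100.0 * (after.avg - before.avg) / before.avg);
}

/* Pooled variance of two sample sets; its square root is the pooled s. */
static double
pooled_variance(struct summary a, struct summary b)
{
    assert(a.n + b.n > 2);
    return (((a.n - 1) * a.stddev * a.stddev +
        (b.n - 1) * b.stddev * b.stddev) / (a.n + b.n - 2));
}

/* Approximate sampling frequency for a cycle event with a given reload count. */
static double
sample_rate_hz(double cpu_hz, double reload_count)
{
    assert(reload_count > 0.0);
    return (cpu_hz / reload_count);
}
```

Fed the real time rows (avg 88.577, stddev 15.100031 before; avg 60.179, stddev 0.29957192 after; N = 10 each), these give a difference of -28.398, about -32.06%, and a pooled variance near 114.05, whose square root is roughly 10.679. That agrees with the quoted output.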
mmacy
a0bd5d3d7f Eliminate the overhead of gratuitous repeated reinitialization of cap_rights
- Add macros to allow preinitialization of cap_rights_t.

- Convert most commonly used code paths to use preinitialized cap_rights_t.
  A 3.6% speedup in fstat was measured with this change.

Reported by:	mjg
Reviewed by:	oshogbo
Approved by:	sbruno
MFC after:	1 month
2018-05-09 18:47:24 +00:00
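
Note on a0bd5d3d7f: the saving comes from building a rights value once at compile time instead of rebuilding it on every call. The sketch below shows the general pattern only. The `rights` structure, `RIGHT_FSTAT` and `RIGHTS_INITIALIZER` are hypothetical and do not reproduce the real cap_rights_t layout or the macros the commit added.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical rights bitmap; the real cap_rights_t layout is different. */
struct rights {
    uint64_t bits[2];
};

#define RIGHT_FSTAT (UINT64_C(1) << 3)   /* illustrative bit value */

/* Compile-time initializer, so no per-call setup work is needed. */
#define RIGHTS_INITIALIZER(r) { { (r), 0 } }

/* Built once at compile time and shared by every caller. */
static const struct rights rights_fstat = RIGHTS_INITIALIZER(RIGHT_FSTAT);

/* The pattern being replaced: rebuild the same value on every call. */
static void
rights_init_each_call(struct rights *out)
{
    assert(out != NULL);
    memset(out, 0, sizeof(*out));
    out->bits[0] = RIGHT_FSTAT;
}
```

Both paths produce the same bytes; the preinitialized constant simply moves the work out of the hot path.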
fabient
326b96a88c Fix pmcstat exit from kernel introduced by r325275.
pmcstat request for close will generate a close event.
This event will be in turn received by pmcstat to close the file.

Reviewed by:	kib
Tested by:	pho
MFC after:	1 week
Sponsored by: Stormshield
2018-01-17 16:41:22 +00:00
pfg
1537078d8f sys/dev: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.
2017-11-27 14:52:40 +00:00
kib
bb9ce4f821 Do not leak PMC_PO_OWNS_LOGFILE on error.
Note that the PMCLOG_RESERVE_WITH_ERROR() macro contains a goto error;
statement and is executed after the flag is set.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-11-13 10:45:31 +00:00
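
Note on bb9ce4f821: when a flag is set before a step that can jump to an error label, the error path has to undo the flag. A minimal sketch of that shape follows. `OWNS_LOGFILE`, `reserve()` and `configure_log()` are hypothetical stand-ins, not the real PMC_PO_OWNS_LOGFILE handling.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define OWNS_LOGFILE 0x1u   /* hypothetical flag bit */

struct owner {
    unsigned int flags;
};

/* Stand-in for a reservation step that can fail. */
static int
reserve(int should_fail)
{
    return (should_fail ? ENOMEM : 0);
}

/* Set the flag, attempt the fallible step, and undo the flag on failure. */
static int
configure_log(struct owner *o, int should_fail)
{
    int error;

    assert(o != NULL);
    o->flags |= OWNS_LOGFILE;
    error = reserve(should_fail);
    if (error != 0)
        goto error;
    return (0);
error:
    /* Without this line the flag would stay set after a failed setup. */
    o->flags &= ~OWNS_LOGFILE;
    return (error);
}
```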
kib
6f8de9c7ee Style bug.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2017-11-13 10:43:31 +00:00
kib
70df4cad97 Check that the pmc index is less than the number of hardware PMCs,
instead of asserting the condition.

The row index is directly supplied by userspace, the kernel must
handle invalid values.

Submitted by:	pho
MFC after:	3 days
2017-11-10 19:10:14 +00:00
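
Note on 70df4cad97: a value that comes straight from userspace has to be rejected with an error, while a value the kernel controls can still be asserted. The sketch below contrasts the two. `check_row_index()` and its parameters are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <errno.h>

/* Reject an out-of-range row index from userspace instead of asserting. */
static int
check_row_index(unsigned int ri, unsigned int npmc)
{
    /* npmc is set by the kernel itself, so an assertion is reasonable here. */
    assert(npmc > 0);
    /* ri is caller-supplied, so a bad value is an error, not a panic. */
    if (ri >= npmc)
        return (EINVAL);
    return (0);
}
```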
kib
3d9fc3cca3 Do not run pmclog_configure_log() without pmc_sx protection.
The r195005 unlocked pmc_sx before calling into pmclog_configure_log()
to avoid the LOR, but it allows flush or closelog to run in parallel
with the configuration, causing many failure modes.

Revert r195005.  Pre-create the logging process, allowing it to run
after the set up succeeded, otherwise the process terminates itself.

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D12882
2017-11-01 11:43:39 +00:00
kib
b7057b25e3 Be protective and check the po_file validity before dropping the ref.
Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D12882
2017-11-01 11:37:45 +00:00
kib
1ae9155255 In hwpmc, do not double-close the logging file.
hwpmc(4) must not voluntarily call fo_close(), doing this causes
double-close of the file.  It seems to almost avoid bad consequences
for pipes, but other types of files demonstrate random memory access.

To fix, remove fo_close() calls, which also do not provide the
declared wake-up of waiters consistently.  Instead, send a signal to
the logger and configure the logger process to not block it.  Since
logger never returns to userspace, the signal only causes termination
of the interruptible sleeps in fo_write().

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D12882
2017-11-01 11:32:52 +00:00
kib
024598205a There is no use for dropping Giant in the pmc syscall.
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D12882
2017-11-01 11:16:18 +00:00
kib
6d15235904 Minor style tweaks.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D12882
2017-11-01 11:05:47 +00:00
kib
964a0edf56 Use designated initializers for pmc sysent and module data.
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D12882
2017-11-01 10:49:41 +00:00
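
Note on 964a0edf56: designated initializers name each field explicitly and leave the rest zeroed, which keeps an initializer correct if the structure gains or reorders members. The sketch below uses a hypothetical `module_info` structure; it does not reproduce the real sysent or moduledata layouts.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical module descriptor; not the real FreeBSD moduledata layout. */
struct module_info {
    const char *name;
    int (*handler)(int event);
    void *priv;
};

static int
demo_handler(int event)
{
    return (event >= 0 ? 0 : -1);
}

/* Fields are named explicitly; anything omitted is zero-initialized. */
static struct module_info demo_module = {
    .name = "demo",
    .handler = demo_handler,
};

/* Dispatch an event through the descriptor's handler. */
static int
module_dispatch(const struct module_info *m, int event)
{
    assert(m != NULL && m->handler != NULL);
    return (m->handler(event));
}
```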
br
06468fac12 o Support for Kabylake CPU PMCs (fall down to PMC_CPU_INTEL_SKYLAKE).
o Fix bugs in events descriptions for Skylake, Skylake Xeon and Haswell.

Reviewed by:	kib
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D12654
2017-10-13 15:02:29 +00:00
cem
da5d7dd0c1 hwpmc(4): Actually use a sufficiently wide type
jhibbits@ points out that left-shifting bits 8-11 by 24 bits won't fit in a
32-bit integer either.

Corrects r324533.

Submitted by:	jhibbits
Sponsored by:	Dell EMC Isilon
2017-10-11 15:13:40 +00:00
cem
32335afadd hwpmc(4): Force sufficiently wide type for left shift
Ordinary input to this macro comes from pe_code, which is uint16_t.  Coverity
points out that shifting such a value discards the result of a 24 bit shift,
which is not what we want.

A follow-up to r324291.

CID:		1381676
Sponsored by:	Dell EMC Isilon
2017-10-11 14:59:04 +00:00
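
Note on da5d7dd0c1 and 32335afadd: a 16-bit event code shifted left by 24 bits can reach bit 39, so bits 8-11 of the code end up in bits 32-35 and are lost if the shift is done in 32 bits. The sketch below shows the difference. The function names are hypothetical and do not correspond to the macro the commits changed.

```c
#include <assert.h>
#include <stdint.h>

/* A 16-bit code shifted by 24 needs up to 40 bits of room. */
static_assert(sizeof(uint64_t) * 8 >= 16 + 24, "64 bits hold the shifted code");

/* Shift performed in 32 bits: bits 8-11 of the code would land in bits 32-35 and are lost. */
static uint32_t
shift_code_32(uint16_t code)
{
    return ((uint32_t)code << 24);
}

/* Widen to 64 bits before shifting so the upper bits survive. */
static uint64_t
shift_code_64(uint16_t code)
{
    return ((uint64_t)code << 24);
}
```

With a code of 0x0F00, the 32-bit version yields 0 while the 64-bit version yields 0xF00000000.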
cem
e57bc68025 hwpmc(4): Add support for extended AMD events
Sponsored by:	Dell EMC Isilon
2017-10-04 23:35:10 +00:00
kib
3e6224e523 Skylake server core PMC support for hwpmc(4).
Reviewed by:	emaste
Sponsored by:	The FreeBSD Foundation
Hardware provided by:	Intel
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D12221
2017-09-06 17:19:48 +00:00
kib
65e37fca9f Fix logic error in the assert, causing the condition to be always true.
Also improve the formatting of the corresponding KASSERT message.

Based on the submission by:	Svyatoslav <razmyslov@viva64.com>
Found by:	PVS-Studio
PR:	217741
Reviewed by:	emaste
Sponsored by:	The FreeBSD Foundation (kib)
MFC after:	1 week
2017-08-08 15:46:29 +00:00
zbb
13c9408e90 Fix INVARIANTS debug code in HWPMC
When HWPMC stops sampling, ps_pmc may be freed before samples
are processed. In such a situation, treat the PMC as stopped.
Add "ifdef" to fix build without INVARIANTS code.

Submitted by: Michal Mazur <mkm@semihalf.com>
Obtained from: Semihalf
Sponsored by: Stormshield, Netgate
Differential revision: https://reviews.freebsd.org/D10912
2017-06-13 18:53:56 +00:00
zbb
86fcc88a77 Fix event table for Cortex A9.
Removed events 0x8 (INSTR_EXECUTED), 0xE (PC_PROC_RETURN) and
0x13-0x1d not supported on Cortex A9.
Add events 0x68 and 0x6E which replaced 0x8 and 0xE.

Submitted by: Michal Mazur <mkm@semihalf.com>
Obtained from: Semihalf
Sponsored by: Stormshield, Netgate
Differential revision: https://reviews.freebsd.org/D10911
2017-06-13 18:52:39 +00:00
zbb
0c419353c5 Fix HWPMC interrupt handling in Counting Mode
Additionally:
 - Fix support for Cycle Counter (evsel == 0xFF)
 - Stop and mask interrupts from all counters on init and finish

Submitted by: Michal Mazur <mkm@semihalf.com>
Obtained from: Semihalf
Sponsored by: Stormshield, Netgate
Differential revision: https://reviews.freebsd.org/D10910
2017-06-13 18:51:23 +00:00
fabient
74bd0be5e9 Fix arm stack frame walking support:
- Adjust stack offset for Clang
- Correctly fill registers for fake stack frame (soft PMC)

MFC after:	1 week
Sponsored by:	Stormshield
Differential Revision:	https://reviews.freebsd.org/D7396
2017-03-14 16:06:57 +00:00
gnn
0a141676a9 Fix PMC architecture check to handle later IPAs including Skylake
Tested with tools/test/hwpmc/pmctest.py

Obtained from:	Oliver Pinter
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D9036
2017-01-04 02:15:03 +00:00
avg
959e19a84a pmc_process_csw_out: ignore deleted counters
I see the following panic on AMD when exiting pmcstat:

panic: [pmc,1473] pp_pmcval outside of expected range cpu=2 ri=17
pp_pmcval=fffffffffa529f5b pm_reloadcount=10000

It seems that at least on AMD a performance counter keeps counting after
overflowing.  When pmcstat exits it sets counters that it used to
PMC_STATE_DELETED and waits until their use count goes to zero.
amd_intr() wouldn't reload a counter in that state and, thus, a counter
would be allowed to overflow.  That means that the counter's value would
be allowed to go outside the expected range.

MFC after:	2 weeks
2016-11-10 11:12:45 +00:00
avg
864edd8840 hwpmc: fix a race between amd_stop_pmc and amd_intr
It is possible that wrmsr in amd_stop_pmc() causes an overflow in a counter
that it disables.  In that case a non-maskable interrupt is generated.  The
interrupt handler code was written in such a way that it would re-enable the
counter.  That would lead to an unexpected interrupt later on.

This problem was easy to reproduce with
$ pmcstat -T -P instructions -t $pid
if the target process is sufficiently busy and there are context switches from
time to time.  There would be a lot of interrupts to "race" with amd_stop_pmc()
called during the context switches.  The problem affected only AMD processors.

While there, trace whether amd_intr() claimed an interrupt.

Reviewed by:	jhb
MFC after:	2 weeks
2016-10-30 09:38:10 +00:00
emaste
2fd125607b hwpmc: remove sys/capability.h backwards compatibility
The Capsicum header is installed as sys/capsicum.h in stable/10 as well.
2016-09-20 12:56:03 +00:00
jhb
ae52b8b4ff Apply the fix from r232612 to fixed function counters.
Reviewed by:	emaste
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D7397
2016-08-03 16:52:00 +00:00
andrew
15f895037c Don't panic in hwpmc when stopping sampling.
When hwpmc stops sampling it will set the pm_state to something other
than PMC_STATE_RUNNING. This means the following sequence can happen:

CPU 0: Enter the interrupt handler
CPU 0: Set the thread TDP_CALLCHAIN pflag
CPU 1: Stop sampling
CPU 0: Call pmc_process_samples, sampling is stopped so clears ps_nsamples
CPU 0: Finishes interrupt processing with the TDP_CALLCHAIN flag set
CPU 0: Call pmc_capture_user_callchain to capture the user call chain
CPU 0: Find all the pmc samples are free so no call chains need to be captured
CPU 0: KASSERT because of this

This fixes the issue by checking if any of the samples have been stopped
and including this in the KASSERT.

PR:		204273
Reviewed by:	bz, gnn
Obtained from:	ABT Systems Ltd
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D6581
2016-05-28 13:05:39 +00:00
jhb
bcc5b0c55d Add an EARLY_AP_STARTUP option to start APs earlier during boot.
Currently, Application Processors (non-boot CPUs) are started by
MD code at SI_SUB_CPU, but they are kept waiting in a "pen" until
SI_SUB_SMP at which point they are released to run kernel threads.
SI_SUB_SMP is one of the last SYSINIT levels, so APs don't enter
the scheduler and start running threads until fairly late in the
boot.

This change moves SI_SUB_SMP up to just before software interrupt
threads are created allowing the APs to start executing kernel
threads much sooner (before any devices are probed).  This allows
several initialization routines that need to perform initialization
on all CPUs to now perform that initialization in one step rather
than having to defer the AP initialization to a second SYSINIT run
at SI_SUB_SMP.  It also permits all CPUs to be available for
handling interrupts before any devices are probed.

This last feature fixes a problem with interrupt vector exhaustion.
Specifically, in the old model all device interrupts were routed
onto the boot CPU during boot.  Later after the APs were released at
SI_SUB_SMP, interrupts were redistributed across all CPUs.

However, several drivers for multiqueue hardware allocate N interrupts
per CPU in the system.  In a system with many CPUs, just a few drivers
doing this could exhaust the available pool of interrupt vectors on
the boot CPU as each driver was allocating N * mp_ncpu vectors on the
boot CPU.  Now, drivers will allocate interrupts on their desired CPUs
during boot meaning that only N interrupts are allocated from the boot
CPU instead of N * mp_ncpu.

Some other bits of code can also be simplified as smp_started is
now true much earlier and will now always be true for these bits of
code.  This removes the need to treat the single-CPU boot environment
as a special case.

As a transition aid, the new behavior is available under a new kernel
option (EARLY_AP_STARTUP).  This will allow the option to be turned off
if need be during initial testing.  I plan to enable this on x86 by
default in a followup commit in the next few days and to have all
platforms moved over before 11.0.  Once the transition is complete,
the option will be removed along with the !EARLY_AP_STARTUP code.

These changes have only been tested on x86.  Other platform maintainers
are encouraged to port their architectures over as well.  The main
things to check for are any uses of smp_started in MD code that can be
simplified and SI_SUB_SMP SYSINITs in MD code that can be removed in
the EARLY_AP_STARTUP case (e.g. the interrupt shuffling).

PR:		kern/199321
Reviewed by:	markj, gnn, kib
Sponsored by:	Netflix
2016-05-14 18:22:52 +00:00
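
Note on bcc5b0c55d: the vector exhaustion argument is a multiplication. If each multiqueue driver allocates N interrupts per CPU and everything is first routed to the boot CPU, the boot CPU carries N * mp_ncpu vectors per driver; with early AP startup it carries only N per driver. The sketch below just spells that out. The function names and the example counts are hypothetical.

```c
#include <assert.h>

/* Vectors taken from the boot CPU when every interrupt is first routed there. */
static unsigned int
boot_cpu_vectors_old(unsigned int drivers, unsigned int n_per_cpu,
    unsigned int ncpu)
{
    assert(ncpu > 0);
    return (drivers * n_per_cpu * ncpu);
}

/* Vectors taken from the boot CPU when each CPU hosts its own share. */
static unsigned int
boot_cpu_vectors_new(unsigned int drivers, unsigned int n_per_cpu)
{
    return (drivers * n_per_cpu);
}
```

For an illustrative system with 4 such drivers, 2 interrupts per CPU and 64 CPUs, the old model asks the boot CPU for 512 vectors and the new one for 8.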
trasz
94bd76e619 Remove misc NULL checks after M_WAITOK allocations.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-05-10 10:26:07 +00:00
pfg
eed4bd22ad sys/dev: minor spelling fixes.
Most affect comments, very few have user-visible effects.
2016-05-03 03:41:25 +00:00
pfg
6ce01c2d90 etc: minor spelling fixes.
Mostly comments but also some user-visible strings.

MFC after:	2 weeks
2016-05-02 16:47:28 +00:00
pfg
42747553f4 sys: use our nitems() macro when param.h is available.
This should cover all the remaining cases in the kernel.

Discussed in:	freebsd-current
2016-04-21 19:40:10 +00:00
pfg
fc65edc1cd Remove slightly used const values that can be replaced with nitems().
Suggested by:	jhb
2016-04-21 15:38:28 +00:00
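
Note on 42747553f4 and fc65edc1cd: nitems() derives an array's element count from the array itself, so a separately maintained size constant cannot drift out of sync. The sketch below defines the macro the way FreeBSD's sys/param.h does, guarded in case it is already visible. The `event_codes` table and `sum_event_codes()` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Same definition as in FreeBSD's sys/param.h; guarded in case it is already visible. */
#ifndef nitems
#define nitems(x) (sizeof((x)) / sizeof((x)[0]))
#endif

static const int event_codes[] = { 0x68, 0x6E, 0xFF };   /* illustrative table */

/* The element count follows the initializer; no extra constant to maintain. */
static_assert(nitems(event_codes) == 3, "count is derived from the initializer");

/* Sum the table without a separately maintained size constant. */
static int
sum_event_codes(void)
{
    int sum = 0;

    for (size_t i = 0; i < nitems(event_codes); i++)
        sum += event_codes[i];
    return (sum);
}
```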
pfg
e3c8c9cbf7 Remove unused e500_event_codes_size.
Found by:	jhb
2016-04-20 20:37:58 +00:00
pfg
b63211eed5 Cleanup unnecessary semicolons from the kernel.
Found with devel/coccinelle.
2016-04-10 23:07:00 +00:00
jhibbits
60a000630b Fix a masking bug for e500 PMC.
No idea how this slipped through my regression testing.  pe_code is the event to
count, pe_cpu is the CPU family mask.
2016-04-09 01:02:17 +00:00
kib
7af72453b7 If full width writes to the performance monitoring counters are
supported, use full-width alias MSRs for writes.  This fixes the
"[pmc,X] negative increment" assertion on the context switch when
clipped counter value is sign-extended.

Add definitions for the MSR IA32_PERF_CAPABILITIES needed to detect
the feature.

PR:	207068
Submitted by:	joss.upton@yahoo.com
MFC after:	2 weeks
2016-02-12 07:27:24 +00:00
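
Note on 7af72453b7: the message attributes the "negative increment" assertion to a clipped counter value being sign-extended. The sketch below models that effect under a stated assumption: a legacy-style write keeps only the low 32 bits and replicates bit 31 upward, while a full-width write stores the value as given. The function names and the 48-bit width used in the example are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Mask covering a counter that is `width` bits wide (33..63 in this sketch). */
static uint64_t
counter_mask(unsigned int width)
{
    assert(width > 32 && width < 64);
    return ((UINT64_C(1) << width) - 1);
}

/* Legacy-style write: only the low 32 bits are used, sign-extended from bit 31. */
static uint64_t
legacy_write(uint64_t value, unsigned int width)
{
    uint64_t low = value & UINT64_C(0xFFFFFFFF);

    if (low & UINT64_C(0x80000000))
        low |= ~UINT64_C(0xFFFFFFFF);   /* replicate bit 31 upward */
    return (low & counter_mask(width));
}

/* Full-width write: the value is stored as given, up to the counter width. */
static uint64_t
full_width_write(uint64_t value, unsigned int width)
{
    return (value & counter_mask(width));
}
```

Writing 0x80000000 to a 48-bit counter through the legacy model leaves 0xFFFF80000000 in the counter, whereas the full-width model leaves 0x80000000, which is why a later read can look like the counter moved backwards.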