Commit Graph

300 Commits

Author SHA1 Message Date
Matt Macy
ae573a91cf pmc: detach free_gtask on unload
Reported by:	pho
2018-05-20 20:34:15 +00:00
Matt Macy
f2daab2c8f pmc: avoid potential race on shutdown
Clear shutdown flag first, conservatively allow 5ms for all hardclock consumers to
see flag before drainining
2018-05-20 19:35:24 +00:00
Matt Macy
70398c2f86 epoch(9): Make epochs non-preemptible by default
There are risks associated with waiting on a preemptible epoch section.
Change the name to make them not be the default and document the issue
under CAVEATS.

Reported by:	markj
2018-05-18 17:29:43 +00:00
Matt Macy
6161b98c99 hwpmc: Implement per-thread counters for PMC sampling
This implements per-thread counters for PMC sampling. The thread
descriptors are stored in a list attached to the process descriptor.
These thread descriptors can store any per-thread information necessary
for current or future features. For the moment, they just store the counters
for sampling.

The thread descriptors are created when the process descriptor is created.
Additionally, thread descriptors are created or freed when threads
are started or stopped. Because the thread exit function is called in a
critical section, we can't directly free the thread descriptors. Hence,
they are freed to a cache, which is also used as a source of allocations
when needed for new threads.

Approved by:	sbruno
Obtained from:	jtl
Sponsored by:	Juniper Networks, Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D15335
2018-05-16 22:29:20 +00:00
Matt Macy
102ccac21c hwpmc: don't reference domain index with no memory backing it
On multi-socket the domain will be correctly set for a given CPU
regardless of whether or not NUMA is enabled.

Approved by:	sbruno
2018-05-14 06:11:25 +00:00
Matt Macy
8fa7df3668 pmc: don't add pmc owner to list until setup is complete
Once a pmc owner is added to the pmc_ss_owners list it is
visible for all to see. We don't want this to happen until
setup is complete.

Reported by:	mjg
Approved by:	sbruno
2018-05-14 01:08:47 +00:00
Matt Macy
0f00315cb3 hwpmc: fix load/unload race and vm map LOR
- fix load/unload race by allocating the per-domain list structure at boot

- fix long extant vm map LOR by replacing pmc_sx sx_slock with global_epoch
  to protect the liveness of elements of the pmc_ss_owners list

Reported by:	pho
Approved by:	sbruno
2018-05-14 00:21:04 +00:00
Matt Macy
f1401123c5 hwpmc/epoch - don't reference domain if NUMA is not set
It appears that domain information is set correctly independent
of whether or not NUMA is defined. However, there is no memory
backing secondary domains leading to allocation failure.

Reported by:	pho@, np@
Approved by:	sbruno@
2018-05-12 20:00:29 +00:00
Matt Macy
d626a614b9 hwpmc(9): clear remaining sample work for hardclock
- fix last minute change in 333509 where by runcount references
  to a pmc would remaining causing us to pause loop forever

Approved by:	sbruno
2018-05-12 03:45:30 +00:00
Matt Macy
e6b475e0af hwpmc(9): Make pmclog buffer pcpu and update constants
On non-trivial SMP systems the contention on the pmc_owner mutex leads
to a substantial number of samples captured being from the pmc process
itself. This change a) makes buffers larger to avoid contention on the
global list b) makes the working sample buffer per cpu.

Run pmcstat in the background (default event rate of 64k):
pmcstat -S UNHALTED_CORE_CYCLES -O /dev/null sleep 600 &

Before:
make -j96 buildkernel -s >&/dev/null 3336.68s user 24684.10s system 7442% cpu 6:16.50 total

After:
make -j96 buildkernel -s >&/dev/null 2697.82s user 1347.35s system 6058% cpu 1:06.77 total

For more realistic overhead measurement set the sample rate for ~2khz
on a 2.1Ghz processor:
pmcstat -n 1050000 -S UNHALTED_CORE_CYCLES -O /dev/null sleep 6000 &

Collecting 10 samples of `make -j96 buildkernel` from each:

x before
+ after

real time:
    N           Min           Max        Median           Avg        Stddev
x  10          76.4        127.62        84.845        88.577     15.100031
+  10         59.71         60.79        60.135        60.179    0.29957192
Difference at 95.0% confidence
        -28.398 +/- 10.0344
        -32.0602% +/- 7.69825%
        (Student's t, pooled s = 10.6794)

system time:
    N           Min           Max        Median           Avg        Stddev
x  10       2277.96       6948.53       2949.47      3341.492     1385.2677
+  10        1038.7       1081.06      1070.555      1064.017      15.85404
Difference at 95.0% confidence
        -2277.47 +/- 920.425
        -68.1574% +/- 8.77623%
        (Student's t, pooled s = 979.596)

x no pmc
+ pmc running
real time:

HEAD:
    N           Min           Max        Median           Avg        Stddev
x  10         58.38         59.15         58.86        58.847    0.22504567
+  10          76.4        127.62        84.845        88.577     15.100031
Difference at 95.0% confidence
        29.73 +/- 10.0335
        50.5208% +/- 17.0525%
        (Student's t, pooled s = 10.6785)

patched:
    N           Min           Max        Median           Avg        Stddev
x  10         58.38         59.15         58.86        58.847    0.22504567
+  10         59.71         60.79        60.135        60.179    0.29957192
Difference at 95.0% confidence
        1.332 +/- 0.248939
        2.2635% +/- 0.426506%
        (Student's t, pooled s = 0.264942)

system time:

HEAD:
    N           Min           Max        Median           Avg        Stddev
x  10       1010.15       1073.31      1025.465      1031.524     18.135705
+  10       2277.96       6948.53       2949.47      3341.492     1385.2677
Difference at 95.0% confidence
        2309.97 +/- 920.443
        223.937% +/- 89.3039%
        (Student's t, pooled s = 979.616)

patched:
    N           Min           Max        Median           Avg        Stddev
x  10       1010.15       1073.31      1025.465      1031.524     18.135705
+  10        1038.7       1081.06      1070.555      1064.017      15.85404
Difference at 95.0% confidence
        32.493 +/- 16.0042
        3.15% +/- 1.5794%
        (Student's t, pooled s = 17.0331)

Reviewed by:	jeff@
Approved by:	sbruno@
Differential Revision:	https://reviews.freebsd.org/D15155
2018-05-12 01:26:34 +00:00
Matt Macy
cbd92ce62e Eliminate the overhead of gratuitous repeated reinitialization of cap_rights
- Add macros to allow preinitialization of cap_rights_t.

- Convert most commonly used code paths to use preinitialized cap_rights_t.
  A 3.6% speedup in fstat was measured with this change.

Reported by:	mjg
Reviewed by:	oshogbo
Approved by:	sbruno
MFC after:	1 month
2018-05-09 18:47:24 +00:00
Fabien Thomas
9f4f1d4d1f Fix pmcstat exit from kernel introduced by r325275.
pmcstat request for close will generate a close event.
This event will be in turn received by pmcstat to close the file.

Reviewed by:	kib
Tested by:	pho
MFC after:	1 week
Sponsored by: Stormshield
2018-01-17 16:41:22 +00:00
Pedro F. Giffuni
718cf2ccb9 sys/dev: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
2017-11-27 14:52:40 +00:00
Konstantin Belousov
f4dd123e15 Do not leak PMC_PO_OWNS_LOGFILE on error.
Note that PMCLOG_RESERVE_WITH_ERROR() macro contains goto error;
statement and executed after the flag is set.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-11-13 10:45:31 +00:00
Konstantin Belousov
c9da263712 Style bug.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2017-11-13 10:43:31 +00:00
Konstantin Belousov
d2cd638852 Check that the pmc index is less than the number of hardware PMCs,
instead of asserting the condition.

The row index is directly supplied by userspace, the kernel must
handle invalid values.

Submitted by:	pho
MFC after:	3 days
2017-11-10 19:10:14 +00:00
Konstantin Belousov
20b555e146 Do not run pmclog_configure_log() without pmc_sx protection.
The r195005 unlocked pmc_sx before calling into pmclog_configure_log()
to avoid the LOR, but it allows flush or closelog to run in parallel
with the configuration, causing many failure modes.

Revert r195005.  Pre-create the logging process, allowing it to run
after the set up succeeded, otherwise the process terminates itself.

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D12882
2017-11-01 11:43:39 +00:00
Konstantin Belousov
1121a37474 Be protective and check the po_file validity before dropping the ref.
Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D12882
2017-11-01 11:37:45 +00:00
Konstantin Belousov
ea4d25f90b In hwpmc, do not double-close the logging file.
hwpmc(4) must not voluntarily call fo_close(), doing this causes
double-close of the file.  It seems to almost avoid bad consequences
for pipes, but other types of files demonstrate random memory access.

To fix, remove fo_close() calls, which also do not provide the
declared wake-up of waiters consistently.  Instead, send a signal to
the logger and configure the logger process to not block it.  Since
logger never returns to userspace, the signal only causes termination
of the interruptible sleeps in fo_write().

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D12882
2017-11-01 11:32:52 +00:00
Konstantin Belousov
bd63e82975 There is no use for dropping Giant in the pmc syscall.
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D12882
2017-11-01 11:16:18 +00:00
Konstantin Belousov
cf9ef80607 Minor style tweaks.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D12882
2017-11-01 11:05:47 +00:00
Konstantin Belousov
1cfbc451b9 Use designated initializers for pmc sysent and module data.
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D12882
2017-11-01 10:49:41 +00:00
Ruslan Bukin
07ff05c2ae o Support for Kabylake CPU PMCs (fall down to PMC_CPU_INTEL_SKYLAKE).
o Fix bugs in events descriptions for Skylake, Skylake Xeon and Haswell.

Reviewed by:	kib
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D12654
2017-10-13 15:02:29 +00:00
Conrad Meyer
1356a2e6fa hwpmc(4): Actually use a sufficiently wide type
jhibbits@ points out that left shifting bits 8-11 24 bits won't fit in a 32-bit
integer either.

Corrects r324533.

Submitted by:	jhibbits
Sponsored by:	Dell EMC Isilon
2017-10-11 15:13:40 +00:00
Conrad Meyer
a7b8be82f0 hwpmc(4): Force sufficiently wide type for left shift
Ordinary input to this macro comes from pe_code, which is uint16_t.  Coverity
points out that shifting such a value discards the result of a 24 bit shift,
which is not what we want.

A follow-up to r324291.

CID:		1381676
Sponsored by:	Dell EMC Isilon
2017-10-11 14:59:04 +00:00
Conrad Meyer
1d3aa3624d hwpmc(4): Add support for extended AMD events
Sponsored by:	Dell EMC Isilon
2017-10-04 23:35:10 +00:00
Konstantin Belousov
b99b705d9c Skylake server core PMC support for hwpmc(4).
Reviewed by:	emaste
Sponsored by:	The FreeBSD Foundation
Hardware provided by:	Intel
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D12221
2017-09-06 17:19:48 +00:00
Konstantin Belousov
16997138d3 Fix logic error in the the assert, causing the condition to be always true.
Also improve the formatting of the corresponding KASSERT message.

Based on the submission by:	Svyatoslav <razmyslov@viva64.com>
Found by:	PVS-Studio
PR:	217741
Reviewed by:	emaste
Sponsored by:	The FreeBSD Foundation (kib)
MFC after:	1 week
2017-08-08 15:46:29 +00:00
Zbigniew Bodek
6cb40391c4 Fix INVARIANTS debug code in HWPMC
When HWPMC stops sampling, ps_pmc may be freed before samples
are processed. In such situation treat PMC as stopped.
Add "ifdef" to fix build without INVARIANTS code.

Submitted by: Michal Mazur <mkm@semihalf.com>
Obtained from: Semihalf
Sponsored by: Stormshield, Netgate
Differential revision: https://reviews.freebsd.org/D10912
2017-06-13 18:53:56 +00:00
Zbigniew Bodek
95ca4f5a0e Fix event table for Cortex A9.
Removed events 0x8 (INSTR_EXECUTED), 0xE (PC_PROC_RETURN) and
0x13-0x1d not supported on Cortex A9.
Add events 0x68 and 0x6E which replaced 0x8 and 0xE.

Submitted by: Michal Mazur <mkm@semihalf.com>
Obtained from: Semihalf
Sponsored by: Stormshield, Netgate
Differential revision: https://reviews.freebsd.org/D10911
2017-06-13 18:52:39 +00:00
Zbigniew Bodek
ab632d9651 Fix HWPMC interrupt handling in Counting Mode
Additionally:
 - Fix support for Cycle Counter (evsel == 0xFF)
 - Stop and mask interrupts from all counters on init and finish

Submitted by: Michal Mazur <mkm@semihalf.com>
Obtained from: Semihalf
Sponsored by: Stormshield, Netgate
Differential revision: https://reviews.freebsd.org/D10910
2017-06-13 18:51:23 +00:00
Fabien Thomas
d42aefee43 Fix arm stack frame walking support:
- Adjust stack offset for Clang
- Correctly fill registers for fake stack frame (soft PMC)

MFC after:	1 week
Sponsored by:	Stormshield
Differential Revision:	https://reviews.freebsd.org/D7396
2017-03-14 16:06:57 +00:00
George V. Neville-Neil
593b0c8420 Fix PMC architecture check to handle later IPAs including Skylake
Tested with tools/test/hwpmc/pmctest.py

Obtained from:	Oliver Pinter
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D9036
2017-01-04 02:15:03 +00:00
Andriy Gapon
593077d613 pmc_process_csw_out: ignore deleted counters
I see the fllowing panic on AMD when exiting pmcstat:

panic: [pmc,1473] pp_pmcval outside of expected range cpu=2 ri=17
pp_pmcval=fffffffffa529f5b pm_reloadcount=10000

It seems that at least on AMD a performance counter keeps counting after
overflowing.  When pmcstat exits it sets counters that it used to
PMC_STATE_DELETED and waits until their use count goes to zero.
amd_intr() wouldn't reload a counter in that state and, thus, a counter
would be allowed to overflow.  That means that the counter's value would
be allowed to go outside the expected range.

MFC after:	2 weeks
2016-11-10 11:12:45 +00:00
Andriy Gapon
3c1f73b18d hwpmc: fix a race between amd_stop_pmc and amd_intr
It is possible that wrmsr in amd_stop_pmc() causes an overflow in a counter
that it disables.  In that case a non-maskable interrupt is generated.  The
interrupt handler code was written in such a way that it would re-enable the
counter.  That would lead to an unexpected interrupt later on.

This problem was easy to reproduce with
$ pmcstat -T -P instructions -t $pid
if the target process is sufficiently busy and there are context switches from
time to time.  There would be a lot of interrupts to "race" with amd_stop_pmc()
called during the context switches.  The problem affected only AMD processors.

While there, trace whether amd_intr() claimed an interrupt.

Reviewed by:	jhb
MFC after:	2 weeks
2016-10-30 09:38:10 +00:00
Ed Maste
cec1957ae1 hwpmc: remove sys/capability.h backwards compatibility
The Capsicum header is installed as sys/capsicum.h in stable/10 as well.
2016-09-20 12:56:03 +00:00
John Baldwin
1f095f7051 Apply the fix from r232612 to fixed function counters.
Reviewed by:	emaste
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D7397
2016-08-03 16:52:00 +00:00
Andrew Turner
7bc7e3cd65 Don't panic in hwpmc when stopping sampling.
When hwpmc stops sampling it will set the pm_state to something other
than PMC_STATE_RUNNING. This means the following sequence can happen:

CPU 0: Enter the interrupt handler
CPU 0: Set the thread TDP_CALLCHAIN pflag
CPU 1: Stop sampling
CPU 0: Call pmc_process_samples, sampling is stopped so clears ps_nsamples
CPU 0: Finishes interrupt processing with the TDP_CALLCHAIN flag set
CPU 0: Call pmc_capture_user_callchain to capture the user call chain
CPU 0: Find all the pmc sample are free so no call chains need to be captured
CPU 0: KASSERT because of this

This fixes the issue by checking if any of the samples have been stopped
and including this in te KASSERT.

PR:		204273
Reviewed by:	bz, gnn
Obtained from:	ABT Systems Ltd
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D6581
2016-05-28 13:05:39 +00:00
John Baldwin
fdce57a042 Add an EARLY_AP_STARTUP option to start APs earlier during boot.
Currently, Application Processors (non-boot CPUs) are started by
MD code at SI_SUB_CPU, but they are kept waiting in a "pen" until
SI_SUB_SMP at which point they are released to run kernel threads.
SI_SUB_SMP is one of the last SYSINIT levels, so APs don't enter
the scheduler and start running threads until fairly late in the
boot.

This change moves SI_SUB_SMP up to just before software interrupt
threads are created allowing the APs to start executing kernel
threads much sooner (before any devices are probed).  This allows
several initialization routines that need to perform initialization
on all CPUs to now perform that initialization in one step rather
than having to defer the AP initialization to a second SYSINIT run
at SI_SUB_SMP.  It also permits all CPUs to be available for
handling interrupts before any devices are probed.

This last feature fixes a problem on with interrupt vector exhaustion.
Specifically, in the old model all device interrupts were routed
onto the boot CPU during boot.  Later after the APs were released at
SI_SUB_SMP, interrupts were redistributed across all CPUs.

However, several drivers for multiqueue hardware allocate N interrupts
per CPU in the system.  In a system with many CPUs, just a few drivers
doing this could exhaust the available pool of interrupt vectors on
the boot CPU as each driver was allocating N * mp_ncpu vectors on the
boot CPU.  Now, drivers will allocate interrupts on their desired CPUs
during boot meaning that only N interrupts are allocated from the boot
CPU instead of N * mp_ncpu.

Some other bits of code can also be simplified as smp_started is
now true much earlier and will now always be true for these bits of
code.  This removes the need to treat the single-CPU boot environment
as a special case.

As a transition aid, the new behavior is available under a new kernel
option (EARLY_AP_STARTUP).  This will allow the option to be turned off
if need be during initial testing.  I plan to enable this on x86 by
default in a followup commit in the next few days and to have all
platforms moved over before 11.0.  Once the transition is complete,
the option will be removed along with the !EARLY_AP_STARTUP code.

These changes have only been tested on x86.  Other platform maintainers
are encouraged to port their architectures over as well.  The main
things to check for are any uses of smp_started in MD code that can be
simplified and SI_SUB_SMP SYSINITs in MD code that can be removed in
the EARLY_AP_STARTUP case (e.g. the interrupt shuffling).

PR:		kern/199321
Reviewed by:	markj, gnn, kib
Sponsored by:	Netflix
2016-05-14 18:22:52 +00:00
Edward Tomasz Napierala
084d207584 Remove misc NULL checks after M_WAITOK allocations.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-05-10 10:26:07 +00:00
Pedro F. Giffuni
453130d9bf sys/dev: minor spelling fixes.
Most affect comments, very few have user-visible effects.
2016-05-03 03:41:25 +00:00
Pedro F. Giffuni
b790c1938d etc: minor spelling fixes.
Mostly comments but also some user-visible strings.

MFC after:	2 weeks
2016-05-02 16:47:28 +00:00
Pedro F. Giffuni
323b076e9c sys: use our nitems() macro when param.h is available.
This should cover all the remaining cases in the kernel.

Discussed in:	freebsd-current
2016-04-21 19:40:10 +00:00
Pedro F. Giffuni
8dfea46460 Remove slightly used const values that can be replaced with nitems().
Suggested by:	jhb
2016-04-21 15:38:28 +00:00
Pedro F. Giffuni
198e7845ee Remove unused e500_event_codes_size.
Found by:	jhb
2016-04-20 20:37:58 +00:00
Pedro F. Giffuni
74b8d63dcc Cleanup unnecessary semicolons from the kernel.
Found with devel/coccinelle.
2016-04-10 23:07:00 +00:00
Justin Hibbits
c4d7f6ab97 Fix a masking bug for e500 PMC.
No idea how this slipped through my regression testing.  pe_code is the event to
count, pe_cpu is the CPU family mask.
2016-04-09 01:02:17 +00:00
Konstantin Belousov
411c83ccd6 If full width writes to the performance monitoring counters are
supported, use full-width aliases MSRs for writes.  This fixes the
"[pmc,X] negative increment" assertion on the context switch when
clipped counter value is sign-extended.

Add definitions for the MSR IA32_PERF_CAPABILITIES needed to detect
the feature.

PR:	207068
Submitted by:	joss.upton@yahoo.com
MFC after:	2 weeks
2016-02-12 07:27:24 +00:00
Konstantin Belousov
0c8cc7b076 Remove tautological cast.
PR:	207068
Submitted by:	joss.upton@yahoo.com
MFC after:	2 weeks
2016-02-12 07:19:59 +00:00
Konstantin Belousov
db57c70a5b Rename P_KTHREAD struct proc p_flag to P_KPROC.
I left as is an apparent bug in ntoskrnl_var.h:AT_PASSIVE_LEVEL()
definition.

Suggested by:	jhb
Sponsored by:	The FreeBSD Foundation
2016-02-09 16:30:16 +00:00