.\" $FreeBSD$
.\" Man page generated from reStructuredText.
.
.TH "LLVM-MCA" "1" "2018-08-02" "7" "LLVM"
.SH NAME
llvm-mca \- LLVM Machine Code Analyzer
.
.nr rst2man-indent-level 0
.
.de1 rstReportMargin
\\$1 \\n[an-margin]
level \\n[rst2man-indent-level]
level margin: \\n[rst2man-indent\\n[rst2man-indent-level]]
-
\\n[rst2man-indent0]
\\n[rst2man-indent1]
\\n[rst2man-indent2]
..
.de1 INDENT
.\" .rstReportMargin pre:
. RS \\$1
. nr rst2man-indent\\n[rst2man-indent-level] \\n[an-margin]
. nr rst2man-indent-level +1
.\" .rstReportMargin post:
..
.de UNINDENT
. RE
.\" indent \\n[an-margin]
.\" old: \\n[rst2man-indent\\n[rst2man-indent-level]]
.nr rst2man-indent-level -1
.\" new: \\n[rst2man-indent\\n[rst2man-indent-level]]
.in \\n[rst2man-indent\\n[rst2man-indent-level]]u
..
.SH SYNOPSIS
.sp
\fBllvm\-mca\fP [\fIoptions\fP] [input]
.SH DESCRIPTION
.sp
\fBllvm\-mca\fP is a performance analysis tool that uses information
available in LLVM (e.g. scheduling models) to statically measure the performance
of machine code in a specific CPU.
.sp
Performance is measured in terms of throughput as well as processor resource
consumption. The tool currently works for processors with an out\-of\-order
backend, for which there is a scheduling model available in LLVM.
.sp
The main goal of this tool is not just to predict the performance of the code
when run on the target, but also to help with diagnosing potential performance
issues.
.sp
Given an assembly code sequence, llvm\-mca estimates the Instructions Per Cycle
(IPC), as well as hardware resource pressure. The analysis and reporting style
were inspired by the IACA tool from Intel.
.sp
\fBllvm\-mca\fP allows the usage of special code comments to mark regions of
the assembly code to be analyzed. A comment starting with substring
\fBLLVM\-MCA\-BEGIN\fP marks the beginning of a code region. A comment starting with
substring \fBLLVM\-MCA\-END\fP marks the end of a code region. For example:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
# LLVM\-MCA\-BEGIN My Code Region
\&...
# LLVM\-MCA\-END
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Multiple regions can be specified provided that they do not overlap. A code
region can have an optional description. If no user\-defined region is specified,
then \fBllvm\-mca\fP assumes a default region which contains every
instruction in the input file. Every region is analyzed in isolation, and the
final performance report is the union of all the reports generated for every
code region.
.sp
Inline assembly directives may be used from source code to annotate the
assembly text:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
int foo(int a, int b) {
  __asm volatile("# LLVM\-MCA\-BEGIN foo");
  a += 42;
  __asm volatile("# LLVM\-MCA\-END");
  a *= b;
  return a;
}
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
So for example, you can compile code with clang, output assembly, and pipe it
directly into llvm\-mca for analysis:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
$ clang foo.c \-O2 \-target x86_64\-unknown\-unknown \-S \-o \- | llvm\-mca \-mcpu=btver2
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Or for Intel syntax:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
$ clang foo.c \-O2 \-target x86_64\-unknown\-unknown \-mllvm \-x86\-asm\-syntax=intel \-S \-o \- | llvm\-mca \-mcpu=btver2
.ft P
.fi
.UNINDENT
.UNINDENT
.SH OPTIONS
.sp
If \fBinput\fP is "\fB\-\fP" or omitted, \fBllvm\-mca\fP reads from standard
input. Otherwise, it will read from the specified filename.
.sp
If the \fB\-o\fP option is omitted, then \fBllvm\-mca\fP will send its output
to standard output if the input is from standard input. If the \fB\-o\fP
option specifies "\fB\-\fP", then the output will also be sent to standard output.
.INDENT 0.0
.TP
.B \-help
Print a summary of command line options.
.UNINDENT
.INDENT 0.0
.TP
.B \-mtriple=<target triple>
Specify a target triple string.
.UNINDENT
.INDENT 0.0
.TP
.B \-march=<arch>
Specify the architecture for which to analyze the code. It defaults to the
host default target.
.UNINDENT
.INDENT 0.0
.TP
.B \-mcpu=<cpuname>
Specify the processor for which to analyze the code. By default, the cpu name
is autodetected from the host.
.UNINDENT
.INDENT 0.0
.TP
.B \-output\-asm\-variant=<variant id>
Specify the output assembly variant for the report generated by the tool.
On x86, possible values are [0, 1]. A value of 0 selects the AT&T assembly
format, while a value of 1 selects the Intel assembly format for the code
printed out by the tool in the analysis report.
.UNINDENT
.INDENT 0.0
.TP
.B \-dispatch=<width>
Specify a different dispatch width for the processor. The dispatch width
defaults to field \(aqIssueWidth\(aq in the processor scheduling model. If width is
zero, then the default dispatch width is used.
.UNINDENT
.INDENT 0.0
.TP
.B \-register\-file\-size=<size>
Specify the size of the register file. When specified, this flag limits how
many physical registers are available for register renaming purposes. A value
of zero for this flag means "unlimited number of physical registers".
.UNINDENT
.INDENT 0.0
.TP
.B \-iterations=<number of iterations>
Specify the number of iterations to run. If this flag is set to 0, then the
tool sets the number of iterations to a default value (i.e. 100).
.UNINDENT
.INDENT 0.0
.TP
.B \-noalias=<bool>
If set, the tool assumes that loads and stores don\(aqt alias. This is the
default behavior.
.UNINDENT
.INDENT 0.0
.TP
.B \-lqueue=<load queue size>
Specify the size of the load queue in the load/store unit emulated by the tool.
By default, the tool assumes an unbound number of entries in the load queue.
A value of zero for this flag is ignored, and the default load queue size is
used instead.
.UNINDENT
.INDENT 0.0
.TP
.B \-squeue=<store queue size>
Specify the size of the store queue in the load/store unit emulated by the
tool. By default, the tool assumes an unbound number of entries in the store
queue. A value of zero for this flag is ignored, and the default store queue
size is used instead.
.UNINDENT
.INDENT 0.0
.TP
.B \-timeline
Enable the timeline view.
.UNINDENT
.INDENT 0.0
.TP
.B \-timeline\-max\-iterations=<iterations>
Limit the number of iterations to print in the timeline view. By default, the
timeline view prints information for up to 10 iterations.
.UNINDENT
.INDENT 0.0
.TP
.B \-timeline\-max\-cycles=<cycles>
Limit the number of cycles in the timeline view. By default, the number of
cycles is set to 80.
.UNINDENT
.INDENT 0.0
.TP
.B \-resource\-pressure
Enable the resource pressure view. This is enabled by default.
.UNINDENT
.INDENT 0.0
.TP
.B \-register\-file\-stats
Enable register file usage statistics.
.UNINDENT
.INDENT 0.0
.TP
.B \-dispatch\-stats
Enable extra dispatch statistics. This view collects and analyzes instruction
dispatch events, as well as static/dynamic dispatch stall events. This view
is disabled by default.
.UNINDENT
.INDENT 0.0
.TP
.B \-scheduler\-stats
Enable extra scheduler statistics. This view collects and analyzes instruction
issue events. This view is disabled by default.
.UNINDENT
.INDENT 0.0
.TP
.B \-retire\-stats
Enable extra retire control unit statistics. This view is disabled by default.
.UNINDENT
.INDENT 0.0
.TP
.B \-instruction\-info
Enable the instruction info view. This is enabled by default.
.UNINDENT
.INDENT 0.0
.TP
.B \-all\-stats
Print all hardware statistics. This enables extra statistics related to the
dispatch logic, the hardware schedulers, the register file(s), and the retire
control unit. This option is disabled by default.
.UNINDENT
.INDENT 0.0
.TP
.B \-all\-views
Enable all the views.
.UNINDENT
.INDENT 0.0
.TP
.B \-instruction\-tables
Prints resource pressure information based on the static information
available from the processor model. This differs from the resource pressure
view because it doesn\(aqt require that the code is simulated. It instead prints
the theoretical uniform distribution of resource pressure for every
instruction in sequence.
.UNINDENT
.SH EXIT STATUS
.sp
\fBllvm\-mca\fP returns 0 on success. Otherwise, an error message is printed
to standard error, and the tool returns 1.
.SH HOW LLVM-MCA WORKS
.sp
\fBllvm\-mca\fP takes assembly code as input. The assembly code is parsed
into a sequence of MCInst with the help of the existing LLVM target assembly
parsers. The parsed sequence of MCInst is then analyzed by a \fBPipeline\fP module
to generate a performance report.
.sp
The Pipeline module simulates the execution of the machine code sequence in a
loop of iterations (default is 100). During this process, the pipeline collects
a number of execution related statistics. At the end of this process, the
pipeline generates and prints a report from the collected statistics.
.sp
Here is an example of a performance report generated by the tool for a
dot\-product of two packed float vectors of four elements. The analysis is
conducted for target x86, cpu btver2. This result can be produced with the
following command, using the example located at
\fBtest/tools/llvm\-mca/X86/BtVer2/dot\-product.s\fP:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
$ llvm\-mca \-mtriple=x86_64\-unknown\-unknown \-mcpu=btver2 \-iterations=300 dot\-product.s
.ft P
.fi
.UNINDENT
.UNINDENT
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
Iterations:        300
Instructions:      900
Total Cycles:      610
Dispatch Width:    2
IPC:               1.48
Block RThroughput: 2.0


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      2     1.00                        vmulps %xmm0, %xmm1, %xmm2
 1      3     1.00                        vhaddps %xmm2, %xmm2, %xmm3
 1      3     1.00                        vhaddps %xmm3, %xmm3, %xmm4


Resources:
[0]   \- JALU0
[1]   \- JALU1
[2]   \- JDiv
[3]   \- JFPA
[4]   \- JFPM
[5]   \- JFPU0
[6]   \- JFPU1
[7]   \- JLAGU
[8]   \- JMul
[9]   \- JSAGU
[10]  \- JSTC
[11]  \- JVALU0
[12]  \- JVALU1
[13]  \- JVIMUL


Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
 \-      \-      \-     2.00   1.00   2.00   1.00    \-      \-      \-      \-      \-      \-      \-

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
 \-      \-      \-      \-     1.00    \-     1.00    \-      \-      \-      \-      \-      \-      \-     vmulps %xmm0, %xmm1, %xmm2
 \-      \-      \-     1.00    \-     1.00    \-      \-      \-      \-      \-      \-      \-      \-     vhaddps %xmm2, %xmm2, %xmm3
 \-      \-      \-     1.00    \-     1.00    \-      \-      \-      \-      \-      \-      \-      \-     vhaddps %xmm3, %xmm3, %xmm4
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
According to this report, the dot\-product kernel has been executed 300 times,
for a total of 900 dynamically executed instructions.
.sp
The report is structured in three main sections. The first section collects a
few performance numbers; the goal of this section is to give a very quick
overview of the performance throughput. In this example, the two important
performance indicators are \fBIPC\fP and \fBBlock RThroughput\fP (Block Reciprocal
Throughput).
.sp
IPC is computed by dividing the total number of simulated instructions by the
total number of cycles. A delta between Dispatch Width and IPC is an indicator
of a performance issue. In the absence of loop\-carried data dependencies, the
observed IPC tends to a theoretical maximum which can be computed by dividing
the number of instructions of a single iteration by the \fIBlock RThroughput\fP\&.
.sp
IPC is bounded from above by the dispatch width. That is because the dispatch
width limits the maximum size of a dispatch group. IPC is also limited by the
amount of hardware parallelism. The availability of hardware resources affects
the resource pressure distribution, and it limits the number of instructions
that can be executed in parallel every cycle. A delta between the Dispatch
Width and the theoretical maximum IPC is an indicator of a performance
bottleneck caused by the lack of hardware resources. In general, the lower the
Block RThroughput, the better.
.sp
In this example, \fBInstructions per iteration/Block RThroughput\fP is 1.50. Since
there are no loop\-carried dependencies, the observed IPC is expected to approach
1.50 when the number of iterations tends to infinity. The delta between the
Dispatch Width (2.00) and the theoretical maximum IPC (1.50) is an indicator of
a performance bottleneck caused by the lack of hardware resources, and the
\fIResource pressure view\fP can help to identify the problematic resource usage.
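The arithmetic behind these two indicators can be reproduced directly from the summary numbers in the report above (a minimal sketch; the 900, 610, and 2.0 figures are taken from the example output, not computed by llvm-mca APIs):

```python
# Summary numbers from the llvm-mca report above.
instructions = 900        # total simulated instructions (300 iterations x 3)
cycles = 610              # total simulated cycles
insns_per_iteration = 3
block_rthroughput = 2.0

# Observed IPC: total instructions divided by total cycles.
ipc = instructions / cycles

# Theoretical maximum IPC, assuming no loop-carried dependencies:
# instructions per iteration divided by the Block RThroughput.
max_ipc = insns_per_iteration / block_rthroughput

print(round(ipc, 2))   # -> 1.48, matching the "IPC:" line of the report
print(max_ipc)         # -> 1.5, the bound the observed IPC approaches
```

The gap between the dispatch width (2) and this 1.5 bound is the resource bottleneck discussed above.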
.sp
The second section of the report shows the latency and reciprocal
throughput of every instruction in the sequence. That section also reports
extra information related to the number of micro opcodes, and opcode properties
(i.e., \(aqMayLoad\(aq, \(aqMayStore\(aq, and \(aqHasSideEffects\(aq).
.sp
The third section is the \fIResource pressure view\fP\&. This view reports
the average number of resource cycles consumed every iteration by instructions
for every processor resource unit available on the target. Information is
structured in two tables. The first table reports the number of resource cycles
spent on average every iteration. The second table correlates the resource
cycles to the machine instruction in the sequence. For example, every iteration
of the instruction vmulps always executes on resource unit [6]
(JFPU1 \- floating point pipeline #1), consuming an average of 1 resource cycle
per iteration. Note that on AMD Jaguar, vector floating\-point multiply can
only be issued to pipeline JFPU1, while horizontal floating\-point additions can
only be issued to pipeline JFPU0.
.sp
The resource pressure view helps with identifying bottlenecks caused by high
usage of specific hardware resources. Situations with resource pressure mainly
concentrated on a few resources should, in general, be avoided. Ideally,
pressure should be uniformly distributed between multiple resources.
.SS Timeline View
.sp
The timeline view produces a detailed report of each instruction\(aqs state
transitions through an instruction pipeline. This view is enabled by the
command line option \fB\-timeline\fP\&. As instructions transition through the
various stages of the pipeline, their states are depicted in the view report.
These states are represented by the following characters:
.INDENT 0.0
.IP \(bu 2
D : Instruction dispatched.
.IP \(bu 2
e : Instruction executing.
.IP \(bu 2
E : Instruction executed.
.IP \(bu 2
R : Instruction retired.
.IP \(bu 2
= : Instruction already dispatched, waiting to be executed.
.IP \(bu 2
\- : Instruction executed, waiting to be retired.
.UNINDENT
.sp
Below is the timeline view for a subset of the dot\-product example located in
\fBtest/tools/llvm\-mca/X86/BtVer2/dot\-product.s\fP and processed by
\fBllvm\-mca\fP using the following command:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
$ llvm\-mca \-mtriple=x86_64\-unknown\-unknown \-mcpu=btver2 \-iterations=3 \-timeline dot\-product.s
.ft P
.fi
.UNINDENT
.UNINDENT
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
Timeline view:
                    012345
Index     0123456789

[0,0]     DeeER.    .    .   vmulps %xmm0, %xmm1, %xmm2
[0,1]     D==eeeER  .    .   vhaddps %xmm2, %xmm2, %xmm3
[0,2]     .D====eeeER    .   vhaddps %xmm3, %xmm3, %xmm4
[1,0]     .DeeE\-\-\-\-\-R    .   vmulps %xmm0, %xmm1, %xmm2
[1,1]     . D=eeeE\-\-\-R   .   vhaddps %xmm2, %xmm2, %xmm3
[1,2]     . D====eeeER   .   vhaddps %xmm3, %xmm3, %xmm4
[2,0]     .  DeeE\-\-\-\-\-R  .   vmulps %xmm0, %xmm1, %xmm2
[2,1]     .  D====eeeER  .   vhaddps %xmm2, %xmm2, %xmm3
[2,2]     .   D======eeeER   vhaddps %xmm3, %xmm3, %xmm4


Average Wait times (based on the timeline view):
[0]: Executions
[1]: Average time spent waiting in a scheduler\(aqs queue
[2]: Average time spent waiting in a scheduler\(aqs queue while ready
[3]: Average time elapsed from WB until retire stage

      [0]    [1]    [2]    [3]
0.     3     1.0    1.0    3.3   vmulps %xmm0, %xmm1, %xmm2
1.     3     3.3    0.7    1.0   vhaddps %xmm2, %xmm2, %xmm3
2.     3     5.7    0.0    0.0   vhaddps %xmm3, %xmm3, %xmm4
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The timeline view is interesting because it shows instruction state changes
during execution. It also gives an idea of how the tool processes instructions
executed on the target, and how their timing information might be calculated.
.sp
The timeline view is structured in two tables. The first table shows
instructions changing state over time (measured in cycles); the second table
(named \fIAverage Wait times\fP) reports useful timing statistics, which should
help diagnose performance bottlenecks caused by long data dependencies and
sub\-optimal usage of hardware resources.
.sp
An instruction in the timeline view is identified by a pair of indices, where
the first index identifies an iteration, and the second index is the
instruction index (i.e., where it appears in the code sequence). Since this
example was generated using 3 iterations (\fB\-iterations=3\fP), the iteration
indices range from 0 to 2 inclusive.
.sp
Excluding the first and last column, the remaining columns are in cycles.
Cycles are numbered sequentially starting from 0.
.sp
From the example output above, we know the following:
.INDENT 0.0
.IP \(bu 2
Instruction [1,0] was dispatched at cycle 1.
.IP \(bu 2
Instruction [1,0] started executing at cycle 2.
.IP \(bu 2
Instruction [1,0] reached the write back stage at cycle 4.
.IP \(bu 2
Instruction [1,0] was retired at cycle 10.
.UNINDENT
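These four events can be read off mechanically from the [1,0] row of the timeline: the column of each state character is the cycle number. A minimal sketch (the row string is taken from the output above; the helper name is made up for illustration):

```python
def timeline_events(row):
    """Extract the dispatch, first-execute, write-back, and retire cycles
    from the cycle columns of one llvm-mca timeline row.

    'D' = dispatched, first 'e' = started executing,
    'E' = executed (write-back), 'R' = retired."""
    return (row.index("D"), row.index("e"), row.index("E"), row.index("R"))

# Row [1,0] from the timeline above, keeping the leading '.' padding.
print(timeline_events(".DeeE-----R"))   # -> (1, 2, 4, 10)
```

This matches the four bullet points: dispatched at cycle 1, executing at cycle 2, write-back at cycle 4, retired at cycle 10.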
.sp
Instruction [1,0] (i.e., vmulps from iteration #1) does not have to wait in the
scheduler\(aqs queue for the operands to become available. By the time vmulps is
dispatched, operands are already available, and pipeline JFPU1 is ready to
serve another instruction. So the instruction can be immediately issued on the
JFPU1 pipeline. That is demonstrated by the fact that the instruction only
spent 1cy in the scheduler\(aqs queue.
.sp
There is a gap of 5 cycles between the write\-back stage and the retire event.
That is because instructions must retire in program order, so [1,0] has to wait
for [0,2] to be retired first (i.e., it has to wait until cycle 10).
.sp
In the example, all instructions are in a RAW (Read After Write) dependency
chain. Register %xmm2 written by vmulps is immediately used by the first
vhaddps, and register %xmm3 written by the first vhaddps is used by the second
vhaddps. Long data dependencies negatively impact the ILP (Instruction Level
Parallelism).
.sp
In the dot\-product example, there are anti\-dependencies introduced by
instructions from different iterations. However, those dependencies can be
removed at register renaming stage (at the cost of allocating register aliases,
and therefore consuming physical registers).
.sp
Table \fIAverage Wait times\fP helps diagnose performance issues that are caused by
the presence of long latency instructions and potentially long data dependencies
which may limit the ILP. Note that \fBllvm\-mca\fP, by default, assumes at
least 1cy between the dispatch event and the issue event.
.sp
When the performance is limited by data dependencies and/or long latency
instructions, the number of cycles spent while in the \fIready\fP state is expected
to be very small when compared with the total number of cycles spent in the
scheduler\(aqs queue. The difference between the two counters is a good indicator
of how large an impact data dependencies had on the execution of the
instructions. When performance is mostly limited by the lack of hardware
resources, the delta between the two counters is small. However, the number of
cycles spent in the queue tends to be larger (i.e., more than 1\-3cy),
especially when compared to other low latency instructions.
.SS Extra Statistics to Further Diagnose Performance Issues
.sp
The \fB\-all\-stats\fP command line option enables extra statistics and performance
counters for the dispatch logic, the reorder buffer, the retire control unit,
and the register file.
.sp
Below is an example of \fB\-all\-stats\fP output generated by MCA for the
dot\-product example discussed in the previous sections.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
Dynamic Dispatch Stall Cycles:
RAT     \- Register unavailable:                      0
RCU     \- Retire tokens unavailable:                 0
SCHEDQ  \- Scheduler full:                            272
LQ      \- Load queue full:                           0
SQ      \- Store queue full:                          0
GROUP   \- Static restrictions on the dispatch group: 0


Dispatch Logic \- number of cycles where we saw N instructions dispatched:
[# dispatched], [# cycles]
 0,              24  (3.9%)
 1,              272  (44.6%)
 2,              314  (51.5%)


Schedulers \- number of cycles where we saw N instructions issued:
[# issued], [# cycles]
 0,          7  (1.1%)
 1,          306  (50.2%)
 2,          297  (48.7%)


Scheduler\(aqs queue usage:
JALU01,  0/20
JFPU01,  18/18
JLSAGU,  0/12


Retire Control Unit \- number of cycles where we saw N instructions retired:
[# retired], [# cycles]
 0,           109  (17.9%)
 1,           102  (16.7%)
 2,           399  (65.4%)


Register File statistics:
Total number of mappings created:    900
Max number of mappings used:         35

*  Register File #1 \-\- JFpuPRF:
   Number of physical registers:     72
   Total number of mappings created: 900
   Max number of mappings used:      35

*  Register File #2 \-\- JIntegerPRF:
   Number of physical registers:     64
   Total number of mappings created: 0
   Max number of mappings used:      0
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
If we look at the \fIDynamic Dispatch Stall Cycles\fP table, we see the counter for
SCHEDQ reports 272 cycles. This counter is incremented every time the dispatch
logic is unable to dispatch a group of two instructions because the scheduler\(aqs
queue is full.
.sp
Looking at the \fIDispatch Logic\fP table, we see that the pipeline was only able
to dispatch two instructions 51.5% of the time. The dispatch group was limited
to one instruction 44.6% of the cycles, which corresponds to 272 cycles. The
dispatch statistics are displayed by using either the command option
\fB\-all\-stats\fP or \fB\-dispatch\-stats\fP\&.
.sp
The next table, \fISchedulers\fP, presents a histogram of how many instructions
were issued per cycle. In this case, of the 610 simulated cycles, a single
instruction was issued 306 times (50.2%), and there were 7 cycles where
no instructions were issued.
.sp
The \fIScheduler\(aqs queue usage\fP table shows the maximum number of buffer
entries (i.e., scheduler queue entries) used at runtime. Resource JFPU01
reached its maximum (18 of 18 queue entries). Note that AMD Jaguar implements
three schedulers:
.INDENT 0.0
.IP \(bu 2
JALU01 \- A scheduler for ALU instructions.
.IP \(bu 2
JFPU01 \- A scheduler for floating point operations.
.IP \(bu 2
JLSAGU \- A scheduler for address generation.
.UNINDENT
.sp
The dot\-product is a kernel of three floating point instructions (a vector
multiply followed by two horizontal adds). That explains why only the floating
point scheduler appears to be used.
.sp
A full scheduler queue is either caused by data dependency chains or by a
sub\-optimal usage of hardware resources. Sometimes, resource pressure can be
mitigated by rewriting the kernel using different instructions that consume
different scheduler resources. Schedulers with a small queue are less resilient
to bottlenecks caused by the presence of long data dependencies.
The scheduler statistics are displayed by
using the command option \fB\-all\-stats\fP or \fB\-scheduler\-stats\fP\&.
.sp
The next table, \fIRetire Control Unit\fP, presents a histogram of how many
instructions were retired per cycle. In this case, of the 610 simulated cycles,
two instructions were retired during the same cycle 399 times (65.4%), and
there were 109 cycles where no instructions were retired. The retire statistics
are displayed by using the command option \fB\-all\-stats\fP or \fB\-retire\-stats\fP\&.
.sp
The last table presented is \fIRegister File statistics\fP\&. Each physical register
file (PRF) used by the pipeline is presented in this table. In the case of AMD
Jaguar, there are two register files, one for floating\-point registers
(JFpuPRF) and one for integer registers (JIntegerPRF). The table shows that of
the 900 instructions processed, there were 900 mappings created. Since this
dot\-product example utilized only floating point registers, the JFpuPRF was
responsible for creating the 900 mappings. However, we see that the pipeline
only used a maximum of 35 of 72 available register slots at any given time. We
can conclude that the floating point PRF was the only register file used for
the example, and that it was never resource constrained. The register file
statistics are displayed by using the command option \fB\-all\-stats\fP or
\fB\-register\-file\-stats\fP\&.
.sp
In this example, we can conclude that the IPC is mostly limited by data
dependencies, and not by resource pressure.
|
|
|
|
|
.SS Instruction Flow
.sp
This section describes the instruction flow through MCA\(aqs default out\-of\-order
pipeline, as well as the functional units involved in the process.
.sp
The default pipeline implements the following sequence of stages used to
process instructions.
.INDENT 0.0
.IP \(bu 2
Dispatch (Instruction is dispatched to the schedulers).
.IP \(bu 2
Issue (Instruction is issued to the processor pipelines).
.IP \(bu 2
Write Back (Instruction is executed, and results are written back).
.IP \(bu 2
Retire (Instruction is retired; writes are architecturally committed).
.UNINDENT
.sp
The default pipeline only models the out\-of\-order portion of a processor.
Therefore, the instruction fetch and decode stages are not modeled. Performance
bottlenecks in the frontend are not diagnosed. MCA assumes that instructions
have all been decoded and placed into a queue. Also, MCA does not model branch
prediction.
.SS Instruction Dispatch
.sp
During the dispatch stage, instructions are picked in program order from a
queue of already decoded instructions, and dispatched in groups to the
simulated hardware schedulers.
.sp
The size of a dispatch group depends on the availability of the simulated
hardware resources. The processor dispatch width defaults to the value
of the \fBIssueWidth\fP in LLVM\(aqs scheduling model.
.sp
An instruction can be dispatched if:
.INDENT 0.0
.IP \(bu 2
The size of the dispatch group is smaller than the processor\(aqs dispatch width.
.IP \(bu 2
There are enough entries in the reorder buffer.
.IP \(bu 2
There are enough physical registers to do register renaming.
.IP \(bu 2
The schedulers are not full.
.UNINDENT
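.sp
The four dispatch conditions above can be sketched as a single predicate. This
is an illustrative simplification, not llvm\-mca\(aqs actual implementation; all
names are hypothetical:

```python
# Hedged sketch of the dispatch-eligibility test described above.
# Names are illustrative and do not mirror llvm-mca's internals.

def can_dispatch(group_size, dispatch_width, free_rob_entries, uops_needed,
                 free_phys_regs, regs_needed, scheduler_full):
    """Return True if one more instruction fits into the dispatch group."""
    return (group_size < dispatch_width           # group below dispatch width
            and free_rob_entries >= uops_needed   # reorder-buffer space left
            and free_phys_regs >= regs_needed     # registers left for renaming
            and not scheduler_full)               # scheduler buffer not full

# Two instructions already grouped, width 4, ample resources: dispatch is OK.
print(can_dispatch(2, 4, 16, 1, 32, 2, False))  # True
# A full scheduler stalls dispatch regardless of the other resources.
print(can_dispatch(2, 4, 16, 1, 32, 2, True))   # False
```

A stall in any one of the four checks blocks the whole dispatch group for that
cycle, which is why llvm\-mca can attribute dispatch stalls to a specific
resource.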
.sp
Scheduling models can optionally specify which register files are available on
the processor. MCA uses that information to initialize register file
descriptors. Users can limit the number of physical registers that are
globally available for register renaming by using the command option
\fB\-register\-file\-size\fP\&. A value of zero for this option means \fIunbounded\fP\&.
By knowing how many registers are available for renaming, MCA can predict
dispatch stalls caused by the lack of registers.
.sp
The number of reorder buffer entries consumed by an instruction depends on the
number of micro\-opcodes specified by the target scheduling model. MCA\(aqs
reorder buffer\(aqs purpose is to track the progress of instructions that are
"in\-flight," and to retire instructions in program order. The number of
entries in the reorder buffer defaults to the \fIMicroOpBufferSize\fP provided by
the target scheduling model.
.sp
Instructions that are dispatched to the schedulers consume scheduler buffer
entries. \fBllvm\-mca\fP queries the scheduling model to determine the set
of buffered resources consumed by an instruction. Buffered resources are
treated like scheduler resources.
.SS Instruction Issue
.sp
Each processor scheduler implements a buffer of instructions. An instruction
has to wait in the scheduler\(aqs buffer until input register operands become
available. Only at that point does the instruction become eligible for
execution and may be issued (potentially out\-of\-order).
Instruction latencies are computed by \fBllvm\-mca\fP with the help of the
scheduling model.
.sp
\fBllvm\-mca\fP\(aqs scheduler is designed to simulate multiple processor
schedulers. The scheduler is responsible for tracking data dependencies, and
dynamically selecting which processor resources are consumed by instructions.
It delegates the management of processor resource units and resource groups to a
resource manager. The resource manager is responsible for selecting resource
units that are consumed by instructions. For example, if an instruction
consumes 1cy of a resource group, the resource manager selects one of the
available units from the group; by default, the resource manager uses a
round\-robin selector to guarantee that resource usage is uniformly distributed
between all units of a group.
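.sp
The round\-robin selection described above can be sketched as follows (a
minimal illustration; the class and the unit names are hypothetical, not
llvm\-mca\(aqs API, and unit availability is ignored):

```python
from itertools import cycle

# Minimal sketch of round-robin unit selection within a resource group.
# Only the uniform rotation is modeled; busy units are not skipped.
class ResourceGroup:
    def __init__(self, units):
        self._next = cycle(units)  # rotate through the group's units forever

    def select(self):
        """Pick the next unit so usage is spread uniformly across the group."""
        return next(self._next)

fp_pipes = ResourceGroup(["FPU0", "FPU1"])  # hypothetical two-unit group
print([fp_pipes.select() for _ in range(4)])  # ['FPU0', 'FPU1', 'FPU0', 'FPU1']
```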
.sp
\fBllvm\-mca\fP\(aqs scheduler implements three instruction queues:
.INDENT 0.0
.IP \(bu 2
WaitQueue: a queue of instructions whose operands are not ready.
.IP \(bu 2
ReadyQueue: a queue of instructions ready to execute.
.IP \(bu 2
IssuedQueue: a queue of instructions executing.
.UNINDENT
.sp
Depending on the operand availability, instructions that are dispatched to the
scheduler are either placed into the WaitQueue or into the ReadyQueue.
.sp
Every cycle, the scheduler checks if instructions can be moved from the
WaitQueue to the ReadyQueue, and if instructions from the ReadyQueue can be
issued to the underlying pipelines. The algorithm prioritizes older instructions
over younger instructions.
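.sp
One such cycle can be sketched as follows (an illustrative model only; the
queue layout and field names are hypothetical):

```python
# Hedged sketch of one scheduler cycle: wake up instructions whose operands
# became ready, then issue from the ReadyQueue, oldest instructions first.

def cycle_step(wait_q, ready_q, issued_q, operands_ready, issue_width):
    # WaitQueue -> ReadyQueue once input operands are available.
    for instr in list(wait_q):
        if operands_ready(instr):
            wait_q.remove(instr)
            ready_q.append(instr)
    # Issue up to issue_width instructions, prioritizing older ones.
    ready_q.sort(key=lambda i: i["age"])
    for _ in range(min(issue_width, len(ready_q))):
        issued_q.append(ready_q.pop(0))

wait, ready, issued = [{"age": 2}], [{"age": 1}, {"age": 0}], []
cycle_step(wait, ready, issued,
           operands_ready=lambda i: i["age"] == 2, issue_width=2)
print([i["age"] for i in issued])  # [0, 1]: older instructions issued first
print([i["age"] for i in ready])   # [2]: woken up, waits for a later cycle
```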
.SS Write\-Back and Retire Stage
.sp
Issued instructions are moved from the ReadyQueue to the IssuedQueue. There,
instructions wait until they reach the write\-back stage. At that point, they
get removed from the queue and the retire control unit is notified.
.sp
When an instruction is executed, the retire control unit flags it as
"ready to retire."
.sp
Instructions are retired in program order. The register file is notified of
the retirement so that it can free the physical registers that were allocated
for the instruction during the register renaming stage.
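.sp
In\-order retirement can be sketched like this (a hypothetical simplification;
the field names are illustrative):

```python
# Hedged sketch of the retire stage: retire from the head of the reorder
# buffer in program order and return renamed physical registers to the pool.

def retire_ready(rob, free_regs):
    """Retire finished instructions from the ROB head; return retired ids."""
    retired = []
    while rob and rob[0]["ready_to_retire"]:
        instr = rob.pop(0)                    # strictly oldest-first
        free_regs.extend(instr["phys_regs"])  # notify the register file
        retired.append(instr["id"])
    return retired

rob = [{"id": 0, "ready_to_retire": True,  "phys_regs": [5]},
       {"id": 1, "ready_to_retire": False, "phys_regs": [6]},
       {"id": 2, "ready_to_retire": True,  "phys_regs": [7]}]
pool = []
print(retire_ready(rob, pool))  # [0]: id 2 is done but must wait behind id 1
print(pool)                     # [5]: only id 0's register was freed
```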
.SS Load/Store Unit and Memory Consistency Model
.sp
To simulate an out\-of\-order execution of memory operations, \fBllvm\-mca\fP
utilizes a load/store unit (LSUnit) that models the speculative execution of
loads and stores.
.sp
Each load (or store) consumes an entry in the load (or store) queue. Users can
specify flags \fB\-lqueue\fP and \fB\-squeue\fP to limit the number of entries in the
load and store queues respectively. The queues are unbounded by default.
.sp
The LSUnit implements a relaxed consistency model for memory loads and stores.
The rules are:
.INDENT 0.0
.IP 1. 3
A younger load is allowed to pass an older load only if there are no
intervening stores or barriers between the two loads.
.IP 2. 3
A younger load is allowed to pass an older store provided that the load does
not alias with the store.
.IP 3. 3
A younger store is not allowed to pass an older store.
.IP 4. 3
A younger store is not allowed to pass an older load.
.UNINDENT
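.sp
The four rules above can be sketched as a predicate answering "may this
younger operation execute before that older one?" (illustrative only; the
parameter names are hypothetical):

```python
# Hedged sketch of the relaxed consistency rules listed above.

def may_pass(younger, older, intervening_store_or_barrier=False,
             may_alias=True):
    """May the younger memory operation execute before the older one?"""
    if younger == "store":
        return False            # rules 3 and 4: stores never pass anything
    if older == "load":
        # Rule 1: a load passes an older load only with no intervening
        # store or barrier between the two loads.
        return not intervening_store_or_barrier
    return not may_alias        # rule 2: a load passes a non-aliasing store

print(may_pass("load", "store", may_alias=False))  # True  (rule 2)
print(may_pass("store", "load"))                   # False (rule 4)
```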
.sp
By default, the LSUnit optimistically assumes that loads do not alias with
store operations (\fI\-noalias=true\fP). Under this assumption, younger loads are
always allowed to pass older stores. Essentially, the LSUnit does not attempt
to run any alias analysis to predict when loads and stores do not alias with
each other.
.sp
Note that, in the case of write\-combining memory, rule 3 could be relaxed to
allow reordering of non\-aliasing store operations. That being said, at the
moment, there is no way to further relax the memory model (\fB\-noalias\fP is the
only option). Essentially, there is no option to specify a different memory
type (e.g., write\-back, write\-combining, write\-through, etc.) and consequently
to weaken, or strengthen, the memory model.
.sp
Other limitations are:
.INDENT 0.0
.IP \(bu 2
The LSUnit does not know when store\-to\-load forwarding may occur.
.IP \(bu 2
The LSUnit does not know anything about cache hierarchy and memory types.
.IP \(bu 2
The LSUnit does not know how to identify serializing operations and memory
fences.
.UNINDENT
.sp
The LSUnit does not attempt to predict if a load or store hits or misses the L1
cache. It only knows if an instruction "MayLoad" and/or "MayStore." For
loads, the scheduling model provides an "optimistic" load\-to\-use latency (which
usually matches the load\-to\-use latency for when there is a hit in the L1D).
.sp
\fBllvm\-mca\fP does not know about serializing operations or memory\-barrier\-like
instructions. The LSUnit conservatively assumes that an instruction which
has both "MayLoad" and unmodeled side effects behaves like a "soft"
load\-barrier. That means it serializes loads without forcing a flush of the
load queue. Similarly, instructions that "MayStore" and have unmodeled side
effects are treated like store barriers. A full memory barrier is a "MayLoad"
and "MayStore" instruction with unmodeled side effects. This is inaccurate, but
it is the best that we can do at the moment with the current information
available in LLVM.
.sp
A load/store barrier consumes one entry of the load/store queue. A load/store
barrier enforces ordering of loads/stores. A younger load cannot pass a load
barrier. Also, a younger store cannot pass a store barrier. A younger load
has to wait for the memory/load barrier to execute. A load/store barrier is
"executed" when it becomes the oldest entry in the load/store queue(s). That
also means, by construction, all of the older loads/stores have been executed.
.sp
In conclusion, the full set of load/store consistency rules is:
.INDENT 0.0
.IP 1. 3
A store may not pass a previous store.
.IP 2. 3
A store may not pass a previous load (regardless of \fB\-noalias\fP).
.IP 3. 3
A store has to wait until an older store barrier is fully executed.
.IP 4. 3
A load may pass a previous load.
.IP 5. 3
A load may not pass a previous store unless \fB\-noalias\fP is set.
.IP 6. 3
A load has to wait until an older load barrier is fully executed.
.UNINDENT
.SH AUTHOR
Maintained by The LLVM Team (http://llvm.org/).
.SH COPYRIGHT
2003-2018, LLVM Project
.\" Generated by docutils manpage writer.
.