==============================
User Guide for AMDGPU Back-end
==============================

Introduction
============

The AMDGPU back-end provides ISA code generation for AMD GPUs, from the R600
family up to the current Volcanic Islands (GCN Gen 3).

Refer to `AMDGPU section in Architecture & Platform Information for Compiler Writers <CompilerWriterInfo.html#amdgpu>`_
for additional documentation.

Conventions
===========

Address Spaces
--------------

The AMDGPU back-end uses the following address space mapping:

================== =================== ==============
LLVM Address Space DWARF Address Space Memory Space
================== =================== ==============
0                  1                   Private
1                  N/A                 Global
2                  N/A                 Constant
3                  2                   Local
4                  N/A                 Generic (Flat)
5                  N/A                 Region
================== =================== ==============

The terminology in the table, aside from the region memory space, is from the
OpenCL standard.

The LLVM address space is used throughout LLVM (for example, in LLVM IR). The
DWARF address space is emitted in DWARF and is used by tools such as debuggers
and profilers.

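The mapping in the table can be expressed as a small lookup structure, for example in a tool that translates between the two numbering schemes. The following is a minimal sketch; the names ``AddressSpace``, ``ADDRESS_SPACES`` and ``dwarf_address_space`` are hypothetical and are not part of LLVM. ``None`` stands for the N/A entries.

```python
from typing import NamedTuple, Optional


class AddressSpace(NamedTuple):
    """One row of the address space table (hypothetical helper type)."""
    llvm: int
    dwarf: Optional[int]  # None where the table says N/A
    memory_space: str


# Keyed by LLVM address space number, values copied from the table above.
ADDRESS_SPACES = {
    0: AddressSpace(0, 1, "Private"),
    1: AddressSpace(1, None, "Global"),
    2: AddressSpace(2, None, "Constant"),
    3: AddressSpace(3, 2, "Local"),
    4: AddressSpace(4, None, "Generic (Flat)"),
    5: AddressSpace(5, None, "Region"),
}


def dwarf_address_space(llvm_as: int) -> Optional[int]:
    """Return the DWARF address space for an LLVM address space, or None."""
    return ADDRESS_SPACES[llvm_as].dwarf
```

For instance, ``dwarf_address_space(3)`` yields the DWARF number for local memory, while global memory has no DWARF address space in this mapping.
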
Trap Handler ABI
----------------

The OS element of the target triple controls the trap handler behavior.

HSA OS
^^^^^^

For code objects generated by the AMDGPU back-end for the HSA OS, the runtime
installs a trap handler that supports the s_trap instruction with the following
usage:

+--------------+-------------+-------------------+----------------------------+
|Usage         |Code Sequence|Trap Handler Inputs|Description                 |
+==============+=============+===================+============================+
|reserved      |s_trap 0x00  |                   |Reserved by hardware.       |
+--------------+-------------+-------------------+----------------------------+
|HSA debugtrap |s_trap 0x01  |SGPR0-1: queue_ptr |Reserved for HSA debugtrap  |
|(arg)         |             |VGPR0: arg         |intrinsic (not implemented).|
+--------------+-------------+-------------------+----------------------------+
|llvm.trap     |s_trap 0x02  |SGPR0-1: queue_ptr |Causes dispatch to be       |
|              |             |                   |terminated and its          |
|              |             |                   |associated queue put into   |
|              |             |                   |the error state.            |
+--------------+-------------+-------------------+----------------------------+
|llvm.debugtrap|s_trap 0x03  |SGPR0-1: queue_ptr |If no debugger is installed,|
|              |             |                   |handled the same as         |
|              |             |                   |llvm.trap.                  |
+--------------+-------------+-------------------+----------------------------+
|debugger      |s_trap 0x07  |                   |Reserved for debugger       |
|breakpoint    |             |                   |breakpoints.                |
+--------------+-------------+-------------------+----------------------------+
|debugger      |s_trap 0x08  |                   |Reserved for debugger.      |
+--------------+-------------+-------------------+----------------------------+
|debugger      |s_trap 0xfe  |                   |Reserved for debugger.      |
+--------------+-------------+-------------------+----------------------------+
|debugger      |s_trap 0xff  |                   |Reserved for debugger.      |
+--------------+-------------+-------------------+----------------------------+

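A disassembler or log decoder might want to label an ``s_trap`` immediate with its usage from the table above. The following is a minimal sketch of such a lookup; ``HSA_TRAP_USAGE`` and ``trap_usage`` are hypothetical names, and IDs that the table does not list are reported as ``None`` rather than given an invented meaning.

```python
from typing import Optional

# s_trap immediate -> usage, copied from the HSA trap handler table.
HSA_TRAP_USAGE = {
    0x00: "reserved",
    0x01: "HSA debugtrap",
    0x02: "llvm.trap",
    0x03: "llvm.debugtrap",
    0x07: "debugger breakpoint",
    0x08: "debugger",
    0xFE: "debugger",
    0xFF: "debugger",
}


def trap_usage(trap_id: int) -> Optional[str]:
    """Return the documented usage of an s_trap ID, or None if unlisted."""
    return HSA_TRAP_USAGE.get(trap_id)
```
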
Non-HSA OS
^^^^^^^^^^

For code objects generated by the AMDGPU back-end for a non-HSA OS, the runtime
does not install a trap handler. The llvm.trap and llvm.debugtrap intrinsics are
handled as follows:

=============== ============= ===============================================
Usage           Code Sequence Description
=============== ============= ===============================================
llvm.trap       s_endpgm      Causes wavefront to be terminated.
llvm.debugtrap  Nothing       Compiler warning generated that there is no
                              trap handler installed.
=============== ============= ===============================================

Assembler
=========

The AMDGPU back-end has an LLVM-MC based assembler, which is currently in
development. It supports the Southern Islands, Sea Islands, and Volcanic
Islands ISAs.

This document describes the general syntax for instructions and operands. For
more information about instructions, their semantics, and supported
combinations of operands, refer to the relevant Instruction Set Architecture
manual.

An instruction has the following syntax (register operands are
normally comma-separated while extra operands are space-separated):

*<opcode> <register_operand0>, ... <extra_operand0> ...*

Operands
--------

The following syntax for register operands is supported:

* SGPR registers: s0, ... or s[0], ...
* VGPR registers: v0, ... or v[0], ...
* TTMP registers: ttmp0, ... or ttmp[0], ...
* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
* Register index expressions: v[2*2], s[1-1:2-1]
* 'off' indicates that an operand is not enabled

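To make the register syntax concrete, the following sketch parses a subset of the forms listed above: single registers (``s0``, ``v[12]``) and ranges (``s[2:3]``, ``ttmp[4:7]``). It is only an illustration and not the assembler's parser. ``parse_register`` is a hypothetical helper, and special registers, register lists, and index expressions such as ``v[2*2]`` are deliberately left out and rejected.

```python
import re

# Prefix -> register file name, for the three indexed register kinds.
_KINDS = {"s": "SGPR", "v": "VGPR", "ttmp": "TTMP"}

# Matches s0 / v[3] / ttmp[4:7]; indices must be plain decimal numbers.
_REG = re.compile(r"^(s|v|ttmp)(?:(\d+)|\[(\d+)(?::(\d+))?\])$")


def parse_register(text):
    """Return (kind, first_index, last_index) for a register operand."""
    match = _REG.match(text.strip())
    if match is None:
        raise ValueError(f"unsupported register operand: {text!r}")
    prefix, plain, first, last = match.groups()
    lo = int(plain if plain is not None else first)
    hi = int(last) if last is not None else lo
    if hi < lo:
        raise ValueError(f"descending register range: {text!r}")
    return _KINDS[prefix], lo, hi
```

A range such as ``s[4:7]`` is returned as the inclusive pair of its first and last index, so its width is ``last - first + 1`` registers.
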
The following extra operands are supported:

* offset, offset0, offset1
* idxen, offen bits
* glc, slc, tfe bits
* waitcnt: integer or combination of counter values
* VOP3 modifiers:

  - abs (\| \|), neg (\-)

* DPP modifiers:

  - row_shl, row_shr, row_ror, row_rol
  - row_mirror, row_half_mirror, row_bcast
  - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
  - row_mask, bank_mask, bound_ctrl

* SDWA modifiers:

  - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
  - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
  - abs, neg, sext

DS Instruction Examples
-----------------------

.. code-block:: nasm

  ds_add_u32 v2, v4 offset:16
  ds_write_src2_b64 v2 offset0:4 offset1:8
  ds_cmpst_f32 v2, v4, v6
  ds_min_rtn_f64 v[8:9], v2, v[4:5]

For a full list of supported instructions, refer to "LDS/GDS instructions" in the ISA Manual.

FLAT Instruction Examples
-------------------------

.. code-block:: nasm

  flat_load_dword v1, v[3:4]
  flat_store_dwordx3 v[3:4], v[5:7]
  flat_atomic_swap v1, v[3:4], v5 glc
  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc

For a full list of supported instructions, refer to "FLAT instructions" in the ISA Manual.

MUBUF Instruction Examples
--------------------------

.. code-block:: nasm

  buffer_load_dword v1, off, s[4:7], s1
  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
  buffer_store_format_xy v[1:2], off, s[4:7], s1
  buffer_wbinvl1
  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc

For a full list of supported instructions, refer to "MUBUF Instructions" in the ISA Manual.

SMRD/SMEM Instruction Examples
------------------------------

.. code-block:: nasm

  s_load_dword s1, s[2:3], 0xfc
  s_load_dwordx8 s[8:15], s[2:3], s4
  s_load_dwordx16 s[88:103], s[2:3], s4
  s_dcache_inv_vol
  s_memtime s[4:5]

For a full list of supported instructions, refer to "Scalar Memory Operations" in the ISA Manual.

SOP1 Instruction Examples
-------------------------

.. code-block:: nasm

  s_mov_b32 s1, s2
  s_mov_b64 s[0:1], 0x80000000
  s_cmov_b32 s1, 200
  s_wqm_b64 s[2:3], s[4:5]
  s_bcnt0_i32_b64 s1, s[2:3]
  s_swappc_b64 s[2:3], s[4:5]
  s_cbranch_join s[4:5]

For a full list of supported instructions, refer to "SOP1 Instructions" in the ISA Manual.

SOP2 Instruction Examples
-------------------------

.. code-block:: nasm

  s_add_u32 s1, s2, s3
  s_and_b64 s[2:3], s[4:5], s[6:7]
  s_cselect_b32 s1, s2, s3
  s_andn2_b32 s2, s4, s6
  s_lshr_b64 s[2:3], s[4:5], s6
  s_ashr_i32 s2, s4, s6
  s_bfm_b64 s[2:3], s4, s6
  s_bfe_i64 s[2:3], s[4:5], s6
  s_cbranch_g_fork s[4:5], s[6:7]

For a full list of supported instructions, refer to "SOP2 Instructions" in the ISA Manual.

SOPC Instruction Examples
-------------------------

.. code-block:: nasm

  s_cmp_eq_i32 s1, s2
  s_bitcmp1_b32 s1, s2
  s_bitcmp0_b64 s[2:3], s4
  s_setvskip s3, s5

For a full list of supported instructions, refer to "SOPC Instructions" in the ISA Manual.

SOPP Instruction Examples
-------------------------

.. code-block:: nasm

  s_barrier
  s_nop 2
  s_endpgm
  s_waitcnt 0 ; Wait for all counters to be 0
  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
  s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
  s_sethalt 9
  s_sleep 10
  s_sendmsg 0x1
  s_sendmsg sendmsg(MSG_INTERRUPT)
  s_trap 1

For a full list of supported instructions, refer to "SOPP Instructions" in the ISA Manual.

Unless otherwise mentioned, little verification is performed on the operands
of SOPP instructions, so it is up to the programmer to be familiar with the
range of acceptable values.

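The ``s_waitcnt`` forms in the examples above all reduce to a single immediate value. The sketch below shows how the named-counter form could be folded into that immediate. It rests on assumptions that this guide does not state and that should be checked against the ISA manual for the target: the field layout (vmcnt in bits 3:0, expcnt in bits 6:4, lgkmcnt in bits 11:8) and the convention that an omitted counter is set to its maximum value, meaning it is not waited on. ``encode_waitcnt`` is a hypothetical helper, not an LLVM API.

```python
# Assumed field layout (shift, width in bits); verify against the ISA manual.
VMCNT = (0, 4)
EXPCNT = (4, 3)
LGKMCNT = (8, 4)


def _field(value, layout):
    """Place one counter value into its bit field, defaulting to the maximum."""
    shift, bits = layout
    max_value = (1 << bits) - 1
    if value is None:
        value = max_value  # assumed meaning: do not wait on this counter
    if not 0 <= value <= max_value:
        raise ValueError(f"counter value {value} does not fit in {bits} bits")
    return value << shift


def encode_waitcnt(vmcnt=None, expcnt=None, lgkmcnt=None):
    """Combine the three counters into one s_waitcnt immediate."""
    return _field(vmcnt, VMCNT) | _field(expcnt, EXPCNT) | _field(lgkmcnt, LGKMCNT)
```

Under these assumptions, ``encode_waitcnt(0, 0, 0)`` gives 0, matching the statement that ``s_waitcnt 0`` and ``s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0)`` are equivalent. Because the assembler performs little verification on SOPP operands, a range check like the one in ``_field`` is left to the programmer.
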
Vector ALU Instruction Examples
-------------------------------

For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler automatically uses the optimal encoding based on the operands.
To force a specific encoding, add a suffix to the opcode of the instruction:

* _e32 for 32-bit VOP1/VOP2/VOPC
* _e64 for 64-bit VOP3
* _dpp for VOP_DPP
* _sdwa for VOP_SDWA

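A code generator that emits assembly text could apply the suffix rule above with a small table. The following is a minimal sketch; ``ENCODING_SUFFIX`` and ``force_encoding`` are hypothetical names, and the helper only builds the mnemonic string without checking that the instruction actually exists in the requested encoding.

```python
# Encoding name -> opcode suffix, copied from the list above.
ENCODING_SUFFIX = {
    "VOP1": "_e32",
    "VOP2": "_e32",
    "VOPC": "_e32",
    "VOP3": "_e64",
    "VOP_DPP": "_dpp",
    "VOP_SDWA": "_sdwa",
}


def force_encoding(opcode, encoding):
    """Return the opcode with the suffix that forces the given encoding."""
    suffix = ENCODING_SUFFIX[encoding]
    # Refuse to stack a second suffix on an already-suffixed opcode.
    if any(opcode.endswith(s) for s in set(ENCODING_SUFFIX.values())):
        raise ValueError(f"opcode already carries an encoding suffix: {opcode}")
    return opcode + suffix
```

For example, ``force_encoding("v_mul_i32_i24", "VOP3")`` produces the ``v_mul_i32_i24_e64`` mnemonic.
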
VOP1/VOP2/VOP3/VOPC examples:

.. code-block:: nasm

  v_mov_b32 v1, v2
  v_mov_b32_e32 v1, v2
  v_nop
  v_cvt_f64_i32_e32 v[1:2], v2
  v_floor_f32_e32 v1, v2
  v_bfrev_b32_e32 v1, v2
  v_add_f32_e32 v1, v2, v3
  v_mul_i32_i24_e64 v1, v2, 3
  v_mul_i32_i24_e32 v1, -3, v3
  v_mul_i32_i24_e32 v1, -100, v3
  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
  v_max_f16_e32 v1, v2, v3

VOP_DPP examples:

.. code-block:: nasm

  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_mov_b32 v0, v0 wave_shl:1
  v_mov_b32 v0, v0 row_mirror
  v_mov_b32 v0, v0 row_bcast:31
  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0

VOP_SDWA examples:

.. code-block:: nasm

  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0

For a full list of supported instructions, refer to "Vector ALU instructions" in the ISA Manual.

HSA Code Object Directives
--------------------------

The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, this data can be specified with assembler directives.

.hsa_code_object_version major, minor
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

*major* and *minor* are integers that specify the version of the HSA code
object that will be generated by the assembler.

.hsa_code_object_isa [major, minor, stepping, vendor, arch]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

*major*, *minor*, and *stepping* are all integers that describe the instruction
set architecture (ISA) version of the assembly program.

*vendor* and *arch* are quoted strings. *vendor* should always be equal to
"AMD" and *arch* should always be equal to "AMDGPU".

By default, the assembler will derive the ISA version, *vendor*, and *arch*
from the value of the -mcpu option that is passed to the assembler.

.amdgpu_hsa_kernel (name)
^^^^^^^^^^^^^^^^^^^^^^^^^

This directive specifies that the symbol with the given name is a kernel entry
point (label) and that the object should contain a corresponding symbol of type
STT_AMDGPU_HSA_KERNEL.

.amd_kernel_code_t
^^^^^^^^^^^^^^^^^^

This directive marks the beginning of a list of key / value pairs that are used
to specify the amd_kernel_code_t object that will be emitted by the assembler.
The list must be terminated by the *.end_amd_kernel_code_t* directive. A
default value is used for any amd_kernel_code_t value that is left unspecified.
The default value for all keys is 0, with the following exceptions:

- *kernel_code_version_major* defaults to 1.
- *machine_kind* defaults to 1.
- *machine_version_major*, *machine_version_minor*, and
  *machine_version_stepping* are derived from the value of the -mcpu option
  that is passed to the assembler.
- *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6.
- *kernarg_segment_alignment*, *group_segment_alignment*, and
  *private_segment_alignment* default to 4. Note that alignments are specified
  as a power of two, so a value of **n** means an alignment of 2^ **n**.

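The power-of-two convention for the alignment keys is easy to get wrong when filling in values by hand. The sketch below converts between the stored exponent and the actual alignment. Both function names are hypothetical, and the unit of the resulting alignment is assumed to be bytes, which this guide does not state explicitly.

```python
def alignment_from_log2(n):
    """Return the alignment 2**n described by a stored value of n."""
    if n < 0:
        raise ValueError("alignment exponent must be non-negative")
    return 1 << n


def log2_from_alignment(alignment):
    """Return the value to store for a given power-of-two alignment."""
    if alignment <= 0 or alignment & (alignment - 1) != 0:
        raise ValueError("alignment must be a positive power of two")
    return alignment.bit_length() - 1
```

With the default value of 4, ``alignment_from_log2(4)`` gives an alignment of 16.
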
The *.amd_kernel_code_t* directive must be placed immediately after the
function label and before any instructions.

For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
the comments in lib/Target/AMDGPU/AmdKernelCodeT.h, and
test/CodeGen/AMDGPU/hsa.s.

Here is an example of a minimal amd_kernel_code_t specification:

.. code-block:: none

  .hsa_code_object_version 1,0
  .hsa_code_object_isa

  .hsatext
  .globl hello_world
  .p2align 8
  .amdgpu_hsa_kernel hello_world

  hello_world:

    .amd_kernel_code_t
      enable_sgpr_kernarg_segment_ptr = 1
      is_ptr64 = 1
      compute_pgm_rsrc1_vgprs = 0
      compute_pgm_rsrc1_sgprs = 0
      compute_pgm_rsrc2_user_sgpr = 2
      kernarg_segment_byte_size = 8
      wavefront_sgpr_count = 2
      workitem_vgpr_count = 3
    .end_amd_kernel_code_t

    s_load_dwordx2 s[0:1], s[0:1] 0x0
    v_mov_b32 v0, 3.14159
    s_waitcnt lgkmcnt(0)
    v_mov_b32 v1, s0
    v_mov_b32 v2, s1
    flat_store_dword v[1:2], v0
    s_endpgm
  .Lfunc_end0:
    .size hello_world, .Lfunc_end0-hello_world