@@ -353,9 +353,9 @@ supported for the ``amdgcn`` target.
353353 (scratch), and group (LDS) memory depending on if the address is within one
354354 of the aperture ranges. Flat access to scratch requires hardware aperture
355355 setup and setup in the kernel prologue (see
356- :ref:`amdgpu-amdhsa-flat-scratch`). Flat access to LDS requires hardware
357- aperture setup and M0 (GFX7-GFX8) register setup (see
358- :ref:`amdgpu-amdhsa-m0`).
356+ :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
357+ hardware aperture setup and M0 (GFX7-GFX8) register setup (see
358+ :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
359359
360360 To convert between a private or group address space address (termed a segment
361361 address) and a flat address the base address of the corresponding aperture
@@ -5954,7 +5954,7 @@ SGPR register initial state is defined in
59545954 must be used to set up FLAT
59555955 SCRATCH for flat addressing
59565956 (see
5957- :ref:`amdgpu-amdhsa-flat-scratch`).
5957+ :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`).
59585958 ========== ========================== ====== ==============================
59595959
59605960The order of the VGPR registers is defined, but the compiler can specify which
@@ -6020,7 +6020,23 @@ following properties:
60206020Kernel Prolog
60216021~~~~~~~~~~~~~
60226022
6023- .. _amdgpu-amdhsa-m0:
6023+ The compiler performs initialization in the kernel prologue depending on the
6024+ target and information about things like stack usage in the kernel and called
6025+ functions. Some of this initialization requires the compiler to request certain
6026+ User and System SGPRs be present in the
6027+ :ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the
6028+ :ref:`amdgpu-amdhsa-kernel-descriptor`.
6029+
6030+ .. _amdgpu-amdhsa-kernel-prolog-cfi:
6031+
6032+ CFI
6033+ +++
6034+
6035+ 1. The CFI return address is undefined.
6036+ 2. The CFI CFA is defined using an expression which evaluates to a memory
6037+ location description for the private segment address ``0``.
6038+
6039+ .. _amdgpu-amdhsa-kernel-prolog-m0:
60246040
60256041M0
60266042++
@@ -6035,15 +6051,35 @@ GFX9-GFX10
60356051 The M0 register is not used for range checking LDS accesses and so does not
60366052 need to be initialized in the prolog.
60376053
6038- .. _amdgpu-amdhsa-flat-scratch:
6054+ .. _amdgpu-amdhsa-kernel-prolog-stack-pointer:
6055+
6056+ Stack Pointer
6057+ +++++++++++++
6058+
6059+ If the kernel has function calls it must set up the ABI stack pointer described
6060+ in :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by
6061+ setting SGPR32 to the unswizzled scratch offset of the address past the
6062+ last local allocation.
6063+
6064+ .. _amdgpu-amdhsa-kernel-prolog-frame-pointer:
6065+
6066+ Frame Pointer
6067+ +++++++++++++
6068+
6069+ If the kernel needs a frame pointer for the reasons defined in
6070+ ``SIFrameLowering`` then SGPR34 is used and is always set to ``0`` in the
6071+ kernel prolog. If a frame pointer is not required then all uses of the frame
6072+ pointer are replaced with immediate ``0`` offsets.
6073+
6074+ .. _amdgpu-amdhsa-kernel-prolog-flat-scratch:
60396075
60406076Flat Scratch
60416077++++++++++++
60426078
6043- If the kernel may use flat operations to access scratch memory, the prolog code
6044- must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
6045- are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch
6046- Wavefront Offset SGPR registers (see
6079+ If the kernel or any function it calls may use flat operations to access
6080+ scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
6081+ (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3). Initialization
6082+ uses Flat Scratch Init and Scratch Wavefront Offset SGPR registers (see
60476083:ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
60486084
60496085GFX6
@@ -6074,6 +6110,52 @@ GFX9-GFX10
60746110 FLAT_SCRATCH pair for use as the flat scratch base in flat memory
60756111 instructions.
60766112
6113+ .. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer:
6114+
6115+ Private Segment Buffer
6116+ ++++++++++++++++++++++
6117+
6118+ A set of four SGPRs beginning at a four-aligned SGPR index is always selected
6119+ to serve as the scratch V# for the kernel as follows:
6120+
6121+ - If it is known during instruction selection that there is stack usage,
6122+ SGPR0-3 is reserved for use as the scratch V#. Stack usage is assumed if
6123+ optimisations are disabled (``-O0``), if stack objects already exist (for
6124+ locals, etc.), or if there are any function calls.
6125+
6126+ - Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index
6127+ are reserved for the tentative scratch V#. These will be used if it is
6128+ determined that spilling is needed.
6129+
6130+ - If no use is made of the tentative scratch V#, then it is unreserved
6131+ and the register count is determined ignoring it.
6132+ - If use is made of the tentative scratch V#, then its register numbers
6133+ are shifted to the first four-aligned SGPR index after the highest one
6134+ allocated by the register allocator, and all uses are updated. The
6135+ register count includes them in the shifted location.
6136+ - In either case, if the processor has the SGPR allocation bug, the
6137+ tentative allocation is not shifted or unreserved in order to ensure
6138+ the register count is higher to work around the bug.
6139+
6140+ .. note::
6141+
6142+ This approach of using a tentative scratch V# and shifting the register
6143+ numbers if used avoids having to perform register allocation a second
6144+ time if the tentative V# is eliminated. This is more efficient and
6145+ avoids the problem that the second register allocation may perform
6146+ spilling which will fail as there is no longer a scratch V#.
6147+
6148+ When the kernel prolog code is being emitted it is known whether the scratch V#
6149+ described above is actually used. If it is, the prolog code must set it up by
6150+ copying the Private Segment Buffer to the scratch V# registers and then adding
6151+ the Private Segment Wavefront Offset to the queue base address in the V#. The
6152+ result is a V# with a base address pointing to the beginning of the wavefront
6153+ scratch backing memory.
6154+
6155+ The Private Segment Buffer is always requested, but the Private Segment
6156+ Wavefront Offset is only requested if it is used (see
6157+ :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
6158+
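The shifting of the tentative scratch V# described in the Private Segment Buffer section above can be sketched as follows. This is a minimal illustration and not the compiler's implementation: both function names are hypothetical, "after the highest one allocated" is read as strictly greater, ``-1`` is an assumed stand-in for "no SGPR allocated", and the SGPR allocation bug case is not modeled.

```python
def shifted_scratch_vsharp_base(highest_allocated_sgpr: int) -> int:
    """First four-aligned SGPR index strictly above the highest allocated SGPR.

    Hypothetical helper: pass -1 if no SGPR was allocated (assumed convention).
    """
    # Round down to the enclosing group of four, then step to the next group.
    return (highest_allocated_sgpr // 4 + 1) * 4


def sgpr_count_with_scratch_vsharp(highest_allocated_sgpr: int) -> int:
    """SGPR count once the four V# registers occupy their shifted location."""
    return shifted_scratch_vsharp_base(highest_allocated_sgpr) + 4
```

For example, if the register allocator's highest SGPR is SGPR9, the sketch places the scratch V# in SGPR12-15 and reports a count of 16 SGPRs.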
60776159.. _amdgpu-amdhsa-memory-model:
60786160
60796161Memory Model
@@ -8514,6 +8596,8 @@ Call Convention
85148596See :ref:`amdgpu-dwarf-address-space-mapping` for information on swizzled
85158597addresses. Unswizzled addresses are normal linear addresses.
85168598
8599+ .. _amdgpu-amdhsa-function-call-convention-kernel-functions:
8600+
85178601Kernel Functions
85188602++++++++++++++++
85198603
@@ -8537,86 +8621,56 @@ how the AMDGPU implements function calls:
85378621 by-value struct?
85388622 - What is ABI for lambda values?
85398623
8540- 2. The CFI return address is undefined.
8541- 3. If the kernel contains no calls then:
8542-
8543- - If using the ``amdhsa`` OS ABI (see :ref:`amdgpu-os-table`), and know
8544- during ISel that there is stack usage SGPR0-3 is reserved for use as the
8545- scratch SRD and SGPR33 reserved for the wave scratch offset. Stack usage
8546- is assumed if ``-O0``, if already aware of stack objects for locals, etc.,
8547- or if there are any function calls.
8548- - Otherwise, five high numbered SGPRs are reserved for the tentative scratch
8549- SRD and wave scratch offset. These will be used if determine need to do
8550- spilling.
8551-
8552- - If no use is made of the tentative scratch SRD or wave scratch offset,
8553- then they are unreserved and the register count is determined ignoring
8554- them.
8555- - If use is made of the tenatative scratch SRD or wave scratch offset,
8556- then the register numbers used are shifted to be after the highest one
8557- allocated by the register allocator, and all uses updated. The register
8558- count will include them in the shifted location. Since register
8559- allocation may introduce spills, this shifting allows them to be
8560- eliminated without having to perform register allocation again.
8561- - In either case, if the processor has the SGPR allocation bug, the
8562- tentative allocation is not shifted or unreserved inorder to ensure the
8563- register count is higher to workaround the bug.
8564-
8565- 4. If the kernel contains function calls:
8566-
8567- - SP is set to the wave scratch offset.
8568-
8569- - Since SP is an unswizzled address relative to the queue scratch base, an
8570- wave scratch offset is an unswizzle offset, this means that if SP is
8571- used to access swizzled scratch memory, it will access the private
8572- segment address 0.
8573-
8574- .. note::
8624+ 4. The kernel performs certain setup in its prolog, as described in
8625+ :ref:`amdgpu-amdhsa-kernel-prolog`.
85758626
8576- This is planned to be changed to be the unswizzled base address of the
8577- wavefront scratch backing memory.
8627+ .. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:
85788628
85798629Non-Kernel Functions
85808630++++++++++++++++++++
85818631
85828632This section describes the call convention ABI for functions other than the
85838633outer kernel function.
85848634
8585- If a kernel has function calls then scratch is always allocated and used for the
8586- call stack which grows from low address to high address using the swizzled
8635+ If a kernel has function calls then scratch is always allocated and used for
8636+ the call stack which grows from low address to high address using the swizzled
85878637scratch address space.
85888638
85898639On entry to a function:
85908640
8591- 1. SGPR0-3 contain a V# with the following properties:
8592-
8593- * Base address of the queue scratch backing memory.
8594-
8595- .. note::
8596-
8597- This is planned to be changed to be the unswizzled base address of the
8598- wavefront scratch backing memory.
8641+ 1. SGPR0-3 contain a V# with the following properties (see
8642+ :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):
85998643
8644+ * Base address pointing to the beginning of the wavefront scratch backing
8645+ memory.
86008646 * Swizzled with dword element size and stride of wavefront size elements.
86018647
860286482. The FLAT_SCRATCH register pair is setup. See
8603- :ref:`amdgpu-amdhsa-flat-scratch`.
8604- 3. GFX6-8: M0 register set to the size of LDS in bytes.
8649+ :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
8650+ 3. GFX6-8: M0 register set to the size of LDS in bytes. See
8651+ :ref:`amdgpu-amdhsa-kernel-prolog-m0`.
860586524. The EXEC register is set to the lanes active on entry to the function.
860686535. MODE register: *TBD*
860786546. VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
86088655 below.
860986567. SGPR30-31 return address (RA). The code address that the function must
86108657 return to when it completes. The value is undefined if the function is *no
86118658 return*.
8612- 8. SGPR32 is used for the stack pointer (SP). It is an unswizzled
8613- scratch offset relative to the beginning of the queue scratch backing
8614- memory.
8659+ 8. SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
8660+ offset relative to the beginning of the wavefront scratch backing memory.
86158661
86168662 The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
86178663 offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
86188664 manner.
86198665
8666+ The unswizzled SP value can be converted into the swizzled SP value by:
8667+
8668+ | swizzled SP = unswizzled SP / wavefront size
8669+
8670+ This may be used to obtain the private address space address of stack
8671+ objects and to convert this address to a flat address by adding the flat
8672+ scratch aperture base address.
8673+
86208674 The swizzled SP value is always 4 bytes aligned for the ``r600``
86218675 architecture and 16 byte aligned for the ``amdgcn`` architecture.
86228676
@@ -8639,41 +8693,14 @@ On entry to a function:
86398693 arguments after the last local allocation and adjust SGPR32 to the address
86408694 after the last local allocation.
86418695
8642- .. note::
8643-
8644- The SP value is planned to be changed to be the unswizzled offset relative
8645- to the wavefront scratch backing memory.
8646-
8647- 9. SGPR33 wavefront scratch base offset. The unswizzled offset from the queue
8648- scratch backing memory base to the base of the wavefront scratch backing
8649- memory.
8650-
8651- It is used to convert the unswizzled SP value to swizzled address in the
8652- private address space by:
8653-
8654- | private address = (unswizzled SP - wavefront scratch base offset) /
8655- wavefront size
8656-
8657- This may be used to obtain the private address of stack objects and to
8658- convert these address to a flat address by adding the flat scratch aperture
8659- base address.
8660-
8661- .. note::
8662-
8663- This is planned to be eliminated when SP is changed to be the unswizzled
8664- offset relative to the wavefront scratch backing memory. The the
8665- conversion simplifies to:
8666-
8667- | private address = unswizzled SP / wavefront size
8668-
8669- 10. All other registers are unspecified.
8670- 11. Any necessary ``waitcnt`` has been performed to ensure memory is available
8696+ 9. All other registers are unspecified.
8697+ 10. Any necessary ``waitcnt`` has been performed to ensure memory is available
86718698 to the function.
86728699
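The stack pointer conversion given in item 8 above can be sketched as follows. This is an illustration under stated assumptions, not code from the compiler: the function names and the aperture base value in the usage note are hypothetical, and the unswizzled SP is assumed to be a multiple of the wavefront size so that the division is exact.

```python
def swizzled_sp(unswizzled_sp: int, wavefront_size: int) -> int:
    """swizzled SP = unswizzled SP / wavefront size (a private segment address)."""
    # Assumption: unswizzled_sp is a multiple of wavefront_size.
    return unswizzled_sp // wavefront_size


def flat_address(private_address: int, scratch_aperture_base: int) -> int:
    """Flat address obtained by adding the flat scratch aperture base address."""
    return scratch_aperture_base + private_address
```

With a wavefront size of 64 and an unswizzled SP of 4096, the swizzled SP is 64, which satisfies the 16 byte alignment stated for ``amdgcn``. Adding a hypothetical aperture base of ``0x100000000`` gives the flat address ``0x100000040``.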
86738700On exit from a function:
86748701
867587021. VGPR0-31 and SGPR4-29 are used to pass function result arguments as
8676- described below. Any registers used are considered clobbered registers,
8703+ described below. Any registers used are considered clobbered registers.
867787042. The following registers are preserved and have the same value as on entry:
86788705
86798706 * FLAT_SCRATCH
@@ -8872,7 +8899,7 @@ describes how the AMDGPU implements function calls:
88728899
887389001. SGPR34 is used as a frame pointer (FP) if necessary. Like the SP it is an
88748901 unswizzled scratch address. It is only needed if runtime sized ``alloca``
8875- are used, or for the reasons defined in ``SiFrameLowering``.
8902+ are used, or for the reasons defined in ``SIFrameLowering``.
887689032. Runtime stack alignment is not currently supported.
88778904
88788905 .. TODO::
@@ -8886,14 +8913,11 @@ describes how the AMDGPU implements function calls:
88868913
88878914 ..note::
88888915
8889- Before CFI is generated, the call convention will be changed so that SP is
8890- an unswizzled address relative to the wave scratch base.
8891-
88928916 CFI will be generated that defines the CFA as the unswizzled address
88938917 relative to the wave scratch base in the unswizzled private address space
88948918 of the lowest address stack allocated local variable.
88958919
8896- ``DW_AT_frame_base`` will be defined as the swizelled address in the
8920+ ``DW_AT_frame_base`` will be defined as the swizzled address in the
88978921 swizzled private address space by dividing the CFA by the wavefront size
88988922 (since CFA is always at least dword aligned which matches the scratch
88998923 swizzle element size).