Commit 60b1967

[AMDGPU] Add Scratch Wave Offset to Scratch Buffer Descriptor in entry functions
Add the scratch wave offset to the scratch buffer descriptor (SRSrc) in the entry function prologue. This allows us to remove the scratch wave offset register from the calling convention ABI.

As part of this change, allow the use of an inline constant zero for the SOffset of MUBUF instructions accessing the stack in entry functions when a frame pointer is not requested/required. Entry functions with calls still need to set up the calling convention ABI stack pointer register, and reference it in order to address arguments of called functions. The ABI stack pointer register remains unswizzled, but is now wave-relative instead of queue-relative.

Non-entry functions also use an inline constant zero SOffset for wave-relative scratch access, but continue to use the stack and frame pointers as before.

When the stack or frame pointer is converted to a swizzled offset it is now scaled directly, as the scratch wave offset no longer needs to be subtracted first.

Update llvm/docs/AMDGPUUsage.rst to reflect these changes to the calling convention.

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D75138
1 parent db099f9 commit 60b1967
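
As a rough illustration of the first paragraph of the commit message, the sketch below models folding the wave offset into the descriptor base once in the prologue. It is not the backend's code: `ScratchDescriptor` and `addWaveOffset` are hypothetical names, and the layout (low 32 bits of the base address in word 0, continuing into the low bits of word 1) is a simplifying assumption that ignores every other descriptor field.

```cpp
#include <array>
#include <cstdint>

// Simplified stand-in for the four-dword scratch buffer descriptor (SRSrc).
// Assumption: word 0 holds the low 32 bits of the base address and the base
// continues in the low bits of word 1; all other fields are ignored here.
using ScratchDescriptor = std::array<std::uint32_t, 4>;

// Hypothetical helper: add the per-wave scratch offset to the descriptor
// base using a 32-bit add, then add the carry into word 1.
inline ScratchDescriptor addWaveOffset(ScratchDescriptor desc,
                                       std::uint32_t waveOffset) {
  const std::uint64_t sum = std::uint64_t{desc[0]} + waveOffset;
  desc[0] = static_cast<std::uint32_t>(sum);        // low half of the base
  desc[1] += static_cast<std::uint32_t>(sum >> 32); // propagate the carry
  return desc;
}
```

Once the base already points at the wavefront's scratch memory, accesses that are not stack-pointer-relative no longer need a register to carry the wave offset, which is what lets the commit drop that register from the ABI.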

109 files changed, +5079 -4980 lines changed

llvm/docs/AMDGPUUsage.rst

Lines changed: 121 additions & 97 deletions
@@ -353,9 +353,9 @@ supported for the ``amdgcn`` target.
353353
(scratch), and group (LDS) memory depending on if the address is within one
354354
of the aperture ranges. Flat access to scratch requires hardware aperture
355355
setup and setup in the kernel prologue (see
356-
:ref:`amdgpu-amdhsa-flat-scratch`). Flat access to LDS requires hardware
357-
aperture setup and M0 (GFX7-GFX8) register setup (see
358-
:ref:`amdgpu-amdhsa-m0`).
356+
:ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
357+
hardware aperture setup and M0 (GFX7-GFX8) register setup (see
358+
:ref:`amdgpu-amdhsa-kernel-prolog-m0`).
359359

360360
To convert between a private or group address space address (termed a segment
361361
address) and a flat address the base address of the corresponding aperture
@@ -5954,7 +5954,7 @@ SGPR register initial state is defined in
59545954
must be used to set up FLAT
59555955
SCRATCH for flat addressing
59565956
(see
5957-
:ref:`amdgpu-amdhsa-flat-scratch`).
5957+
:ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`).
59585958
========== ========================== ====== ==============================
59595959

59605960
The order of the VGPR registers is defined, but the compiler can specify which
@@ -6020,7 +6020,23 @@ following properties:
60206020
Kernel Prolog
60216021
~~~~~~~~~~~~~
60226022

6023-
.. _amdgpu-amdhsa-m0:
6023+
The compiler performs initialization in the kernel prologue depending on the
6024+
target and information about things like stack usage in the kernel and called
6025+
functions. Some of this initialization requires the compiler to request certain
6026+
User and System SGPRs be present in the
6027+
:ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the
6028+
:ref:`amdgpu-amdhsa-kernel-descriptor`.
6029+
6030+
.. _amdgpu-amdhsa-kernel-prolog-cfi:
6031+
6032+
CFI
6033+
+++
6034+
6035+
1. The CFI return address is undefined.
6036+
2. The CFI CFA is defined using an expression which evaluates to a memory
6037+
location description for the private segment address ``0``.
6038+
6039+
.. _amdgpu-amdhsa-kernel-prolog-m0:
60246040

60256041
M0
60266042
++
@@ -6035,15 +6051,35 @@ GFX9-GFX10
60356051
The M0 register is not used for range checking LDS accesses and so does not
60366052
need to be initialized in the prolog.
60376053

6038-
.. _amdgpu-amdhsa-flat-scratch:
6054+
.. _amdgpu-amdhsa-kernel-prolog-stack-pointer:
6055+
6056+
Stack Pointer
6057+
+++++++++++++
6058+
6059+
If the kernel has function calls it must set up the ABI stack pointer described
6060+
in :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by
6061+
setting SGPR32 to the unswizzled scratch offset of the address past the
6062+
last local allocation.
6063+
6064+
.. _amdgpu-amdhsa-kernel-prolog-frame-pointer:
6065+
6066+
Frame Pointer
6067+
+++++++++++++
6068+
6069+
If the kernel needs a frame pointer for the reasons defined in
6070+
``SIFrameLowering`` then SGPR34 is used and is always set to ``0`` in the
6071+
kernel prolog. If a frame pointer is not required then all uses of the frame
6072+
pointer are replaced with immediate ``0`` offsets.
6073+
6074+
.. _amdgpu-amdhsa-kernel-prolog-flat-scratch:
60396075

60406076
Flat Scratch
60416077
++++++++++++
60426078

6043-
If the kernel may use flat operations to access scratch memory, the prolog code
6044-
must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
6045-
are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch
6046-
Wavefront Offset SGPR registers (see
6079+
If the kernel or any function it calls may use flat operations to access
6080+
scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
6081+
(FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3). Initialization
6082+
uses Flat Scratch Init and Scratch Wavefront Offset SGPR registers (see
60476083
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
60486084

60496085
GFX6
@@ -6074,6 +6110,52 @@ GFX9-GFX10
60746110
FLAT_SCRATCH pair for use as the flat scratch base in flat memory
60756111
instructions.
60766112

6113+
.. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer:
6114+
6115+
Private Segment Buffer
6116+
++++++++++++++++++++++
6117+
6118+
A set of four SGPRs beginning at a four-aligned SGPR index are always selected
6119+
to serve as the scratch V# for the kernel as follows:
6120+
6121+
- If it is known during instruction selection that there is stack usage,
6122+
SGPR0-3 is reserved for use as the scratch V#. Stack usage is assumed if
6123+
optimisations are disabled (``-O0``), if stack objects already exist (for
6124+
locals, etc.), or if there are any function calls.
6125+
6126+
- Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index
6127+
are reserved for the tentative scratch V#. These will be used if it is
6128+
determined that spilling is needed.
6129+
6130+
- If no use is made of the tentative scratch V#, then it is unreserved
6131+
and the register count is determined ignoring it.
6132+
- If use is made of the tentative scratch V#, then its register numbers
6133+
are shifted to the first four-aligned SGPR index after the highest one
6134+
allocated by the register allocator, and all uses are updated. The
6135+
register count includes them in the shifted location.
6136+
- In either case, if the processor has the SGPR allocation bug, the
6137+
tentative allocation is not shifted or unreserved in order to ensure
6138+
the register count is higher to work around the bug.
6139+
6140+
.. note::
6141+
6142+
This approach of using a tentative scratch V# and shifting the register
6143+
numbers if used avoids having to perform register allocation a second
6144+
time if the tentative V# is eliminated. This is more efficient and
6145+
avoids the problem that the second register allocation may perform
6146+
spilling which will fail as there is no longer a scratch V#.
6147+
6148+
When the kernel prolog code is being emitted it is known whether the scratch V#
6149+
described above is actually used. If it is, the prolog code must set it up by
6150+
copying the Private Segment Buffer to the scratch V# registers and then adding
6151+
the Private Segment Wavefront Offset to the queue base address in the V#. The
6152+
result is a V# with a base address pointing to the beginning of the wavefront
6153+
scratch backing memory.
6154+
6155+
The Private Segment Buffer is always requested, but the Private Segment
6156+
Wavefront Offset is only requested if it is used (see
6157+
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
6158+
60776159
.. _amdgpu-amdhsa-memory-model:
60786160

60796161
Memory Model
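
The Private Segment Buffer text added in the hunk above says that a used tentative scratch V# is shifted to the first four-aligned SGPR index after the highest one allocated. A minimal sketch of that arithmetic follows, assuming SGPR indices are plain unsigned integers; both helper names are hypothetical, and the SGPR allocation bug case (where no shifting happens) is deliberately left out.

```cpp
// Hypothetical helper: first four-aligned SGPR index strictly after the
// highest SGPR index handed out by the register allocator.
inline unsigned firstAlignedScratchVIndex(unsigned highestAllocatedSGPR) {
  const unsigned next = highestAllocatedSGPR + 1;
  return (next + 3u) & ~3u; // round up to a multiple of four
}

// Hypothetical helper: SGPR count when the shifted scratch V# (four
// registers) is included at its new location.
inline unsigned sgprCountWithScratchV(unsigned highestAllocatedSGPR) {
  return firstAlignedScratchVIndex(highestAllocatedSGPR) + 4u;
}
```

For example, if the allocator's highest SGPR is 5, the V# lands in SGPR8-11 and the count becomes 12 under these assumptions.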
@@ -8514,6 +8596,8 @@ Call Convention
85148596
See :ref:`amdgpu-dwarf-address-space-mapping` for information on swizzled
85158597
addresses. Unswizzled addresses are normal linear addresses.
85168598

8599+
.. _amdgpu-amdhsa-function-call-convention-kernel-functions:
8600+
85178601
Kernel Functions
85188602
++++++++++++++++
85198603

@@ -8537,86 +8621,56 @@ how the AMDGPU implements function calls:
85378621
by-value struct?
85388622
- What is ABI for lambda values?
85398623

8540-
2. The CFI return address is undefined.
8541-
3. If the kernel contains no calls then:
8542-
8543-
- If using the ``amdhsa`` OS ABI (see :ref:`amdgpu-os-table`), and know
8544-
during ISel that there is stack usage SGPR0-3 is reserved for use as the
8545-
scratch SRD and SGPR33 reserved for the wave scratch offset. Stack usage
8546-
is assumed if ``-O0``, if already aware of stack objects for locals, etc.,
8547-
or if there are any function calls.
8548-
- Otherwise, five high numbered SGPRs are reserved for the tentative scratch
8549-
SRD and wave scratch offset. These will be used if determine need to do
8550-
spilling.
8551-
8552-
- If no use is made of the tentative scratch SRD or wave scratch offset,
8553-
then they are unreserved and the register count is determined ignoring
8554-
them.
8555-
- If use is made of the tenatative scratch SRD or wave scratch offset,
8556-
then the register numbers used are shifted to be after the highest one
8557-
allocated by the register allocator, and all uses updated. The register
8558-
count will include them in the shifted location. Since register
8559-
allocation may introduce spills, this shifting allows them to be
8560-
eliminated without having to perform register allocation again.
8561-
- In either case, if the processor has the SGPR allocation bug, the
8562-
tentative allocation is not shifted or unreserved inorder to ensure the
8563-
register count is higher to workaround the bug.
8564-
8565-
4. If the kernel contains function calls:
8566-
8567-
- SP is set to the wave scratch offset.
8568-
8569-
- Since SP is an unswizzled address relative to the queue scratch base, an
8570-
wave scratch offset is an unswizzle offset, this means that if SP is
8571-
used to access swizzled scratch memory, it will access the private
8572-
segment address 0.
8573-
8574-
.. note::
8624+
4. The kernel performs certain setup in its prolog, as described in
8625+
:ref:`amdgpu-amdhsa-kernel-prolog`.
85758626

8576-
This is planned to be changed to be the unswizzled base address of the
8577-
wavefront scratch backing memory.
8627+
.. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:
85788628

85798629
Non-Kernel Functions
85808630
++++++++++++++++++++
85818631

85828632
This section describes the call convention ABI for functions other than the
85838633
outer kernel function.
85848634

8585-
If a kernel has function calls then scratch is always allocated and used for the
8586-
call stack which grows from low address to high address using the swizzled
8635+
If a kernel has function calls then scratch is always allocated and used for
8636+
the call stack which grows from low address to high address using the swizzled
85878637
scratch address space.
85888638

85898639
On entry to a function:
85908640

8591-
1. SGPR0-3 contain a V# with the following properties:
8592-
8593-
* Base address of the queue scratch backing memory.
8594-
8595-
.. note::
8596-
8597-
This is planned to be changed to be the unswizzled base address of the
8598-
wavefront scratch backing memory.
8641+
1. SGPR0-3 contain a V# with the following properties (see
8642+
:ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):
85998643

8644+
* Base address pointing to the beginning of the wavefront scratch backing
8645+
memory.
86008646
* Swizzled with dword element size and stride of wavefront size elements.
86018647

86028648
2. The FLAT_SCRATCH register pair is setup. See
8603-
:ref:`amdgpu-amdhsa-flat-scratch`.
8604-
3. GFX6-8: M0 register set to the size of LDS in bytes.
8649+
:ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
8650+
3. GFX6-8: M0 register set to the size of LDS in bytes. See
8651+
:ref:`amdgpu-amdhsa-kernel-prolog-m0`.
86058652
4. The EXEC register is set to the lanes active on entry to the function.
86068653
5. MODE register: *TBD*
86078654
6. VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
86088655
below.
86098656
7. SGPR30-31 return address (RA). The code address that the function must
86108657
return to when it completes. The value is undefined if the function is *no
86118658
return*.
8612-
8. SGPR32 is used for the stack pointer (SP). It is an unswizzled
8613-
scratch offset relative to the beginning of the queue scratch backing
8614-
memory.
8659+
8. SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
8660+
offset relative to the beginning of the wavefront scratch backing memory.
86158661

86168662
The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
86178663
offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
86188664
manner.
86198665

8666+
The unswizzled SP value can be converted into the swizzled SP value by:
8667+
8668+
| swizzled SP = unswizzled SP / wavefront size
8669+
8670+
This may be used to obtain the private address space address of stack
8671+
objects and to convert this address to a flat address by adding the flat
8672+
scratch aperture base address.
8673+
86208674
The swizzled SP value is always 4 bytes aligned for the ``r600``
86218675
architecture and 16 byte aligned for the ``amdgcn`` architecture.
86228676

@@ -8639,41 +8693,14 @@ On entry to a function:
86398693
arguments after the last local allocation and adjust SGPR32 to the address
86408694
after the last local allocation.
86418695

8642-
.. note::
8643-
8644-
The SP value is planned to be changed to be the unswizzled offset relative
8645-
to the wavefront scratch backing memory.
8646-
8647-
9. SGPR33 wavefront scratch base offset. The unswizzled offset from the queue
8648-
scratch backing memory base to the base of the wavefront scratch backing
8649-
memory.
8650-
8651-
It is used to convert the unswizzled SP value to swizzled address in the
8652-
private address space by:
8653-
8654-
| private address = (unswizzled SP - wavefront scratch base offset) /
8655-
wavefront size
8656-
8657-
This may be used to obtain the private address of stack objects and to
8658-
convert these address to a flat address by adding the flat scratch aperture
8659-
base address.
8660-
8661-
.. note::
8662-
8663-
This is planned to be eliminated when SP is changed to be the unswizzled
8664-
offset relative to the wavefront scratch backing memory. The the
8665-
conversion simplifies to:
8666-
8667-
| private address = unswizzled SP / wavefront size
8668-
8669-
10. All other registers are unspecified.
8670-
11. Any necessary ``waitcnt`` has been performed to ensure memory is available
8696+
9. All other registers are unspecified.
8697+
10. Any necessary ``waitcnt`` has been performed to ensure memory is available
86718698
to the function.
86728699

86738700
On exit from a function:
86748701

86758702
1. VGPR0-31 and SGPR4-29 are used to pass function result arguments as
8676-
described below. Any registers used are considered clobbered registers,
8703+
described below. Any registers used are considered clobbered registers.
86778704
2. The following registers are preserved and have the same value as on entry:
86788705

86798706
* FLAT_SCRATCH
@@ -8872,7 +8899,7 @@ describes how the AMDGPU implements function calls:
88728899

88738900
1. SGPR34 is used as a frame pointer (FP) if necessary. Like the SP it is an
88748901
unswizzled scratch address. It is only needed if runtime sized ``alloca``
8875-
are used, or for the reasons defined in ``SiFrameLowering``.
8902+
are used, or for the reasons defined in ``SIFrameLowering``.
88768903
2. Runtime stack alignment is not currently supported.
88778904

88788905
.. TODO::
@@ -8886,14 +8913,11 @@ describes how the AMDGPU implements function calls:
88868913

88878914
..note::
88888915

8889-
Before CFI is generated, the call convention will be changed so that SP is
8890-
an unswizzled address relative to the wave scratch base.
8891-
88928916
CFI will be generated that defines the CFA as the unswizzled address
88938917
relative to the wave scratch base in the unswizzled private address space
88948918
of the lowest address stack allocated local variable.
88958919

8896-
``DW_AT_frame_base`` will be defined as the swizelled address in the
8920+
``DW_AT_frame_base`` will be defined as the swizzled address in the
88978921
swizzled private address space by dividing the CFA by the wavefront size
88988922
(since CFA is always at least dword aligned which matches the scratch
88998923
swizzle element size).
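
The documentation change above replaces the old SP conversion, which subtracted the wavefront scratch base offset before dividing, with a direct division by the wavefront size. The sketch below places the two formulas side by side. The function names are hypothetical, and plain integer division stands in for whatever instruction sequence the backend actually emits.

```cpp
#include <cassert>
#include <cstdint>

// New convention: SP is already wave-relative, so it is scaled directly.
//   swizzled SP = unswizzled SP / wavefront size
inline std::uint32_t swizzledSP(std::uint32_t unswizzledSP,
                                std::uint32_t wavefrontSize) {
  assert(wavefrontSize != 0);
  return unswizzledSP / wavefrontSize;
}

// Old convention (removed by this commit): SP was queue-relative, so the
// wavefront scratch base offset had to be subtracted first.
inline std::uint32_t swizzledSPOld(std::uint32_t unswizzledSP,
                                   std::uint32_t waveScratchBaseOffset,
                                   std::uint32_t wavefrontSize) {
  assert(wavefrontSize != 0);
  return (unswizzledSP - waveScratchBaseOffset) / wavefrontSize;
}
```

With a wavefront size of 64, a wave-relative SP of 1024 and a queue-relative SP of 4096 + 1024 with a base offset of 4096 both yield the same swizzled value, which is why the subtraction can be dropped once SP itself is wave-relative.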

llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp

Lines changed: 0 additions & 2 deletions
@@ -800,8 +800,6 @@ bool AMDGPUCallLowering::lowerFormalArguments(
     TLI.allocateSystemSGPRs(CCInfo, MF, *Info, CC, IsShader);
   } else {
     CCInfo.AllocateReg(Info->getScratchRSrcReg());
-    CCInfo.AllocateReg(Info->getScratchWaveOffsetReg());
-    CCInfo.AllocateReg(Info->getFrameOffsetReg());
     TLI.allocateSpecialInputSGPRs(CCInfo, MF, *TRI, *Info);
   }

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp

Lines changed: 9 additions & 9 deletions
@@ -1474,6 +1474,7 @@ static bool isStackPtrRelative(const MachinePointerInfo &PtrInfo) {
 }

 std::pair<SDValue, SDValue> AMDGPUDAGToDAGISel::foldFrameIndex(SDValue N) const {
+  SDLoc DL(N);
   const MachineFunction &MF = CurDAG->getMachineFunction();
   const SIMachineFunctionInfo *Info = MF.getInfo<SIMachineFunctionInfo>();

@@ -1488,9 +1489,8 @@ std::pair<SDValue, SDValue> AMDGPUDAGToDAGISel::foldFrameIndex(SDValue N) const
   }

   // If we don't know this private access is a local stack object, it needs to
-  // be relative to the entry point's scratch wave offset register.
-  return std::make_pair(N, CurDAG->getRegister(Info->getScratchWaveOffsetReg(),
-                                               MVT::i32));
+  // be relative to the entry point's scratch wave offset.
+  return std::make_pair(N, CurDAG->getTargetConstant(0, DL, MVT::i32));
 }

 bool AMDGPUDAGToDAGISel::SelectMUBUFScratchOffen(SDNode *Parent,
@@ -1515,10 +1515,10 @@ bool AMDGPUDAGToDAGISel::SelectMUBUFScratchOffen(SDNode *Parent,
   // In a call sequence, stores to the argument stack area are relative to the
   // stack pointer.
   const MachinePointerInfo &PtrInfo = cast<MemSDNode>(Parent)->getPointerInfo();
-  unsigned SOffsetReg = isStackPtrRelative(PtrInfo) ?
-    Info->getStackPtrOffsetReg() : Info->getScratchWaveOffsetReg();

-  SOffset = CurDAG->getRegister(SOffsetReg, MVT::i32);
+  SOffset = isStackPtrRelative(PtrInfo)
+                ? CurDAG->getRegister(Info->getStackPtrOffsetReg(), MVT::i32)
+                : CurDAG->getTargetConstant(0, DL, MVT::i32);
   ImmOffset = CurDAG->getTargetConstant(Imm & 4095, DL, MVT::i16);
   return true;
 }
@@ -1576,12 +1576,12 @@ bool AMDGPUDAGToDAGISel::SelectMUBUFScratchOffset(SDNode *Parent,
   SRsrc = CurDAG->getRegister(Info->getScratchRSrcReg(), MVT::v4i32);

   const MachinePointerInfo &PtrInfo = cast<MemSDNode>(Parent)->getPointerInfo();
-  unsigned SOffsetReg = isStackPtrRelative(PtrInfo) ?
-    Info->getStackPtrOffsetReg() : Info->getScratchWaveOffsetReg();

   // FIXME: Get from MachinePointerInfo? We should only be using the frame
   // offset if we know this is in a call sequence.
-  SOffset = CurDAG->getRegister(SOffsetReg, MVT::i32);
+  SOffset = isStackPtrRelative(PtrInfo)
+                ? CurDAG->getRegister(Info->getStackPtrOffsetReg(), MVT::i32)
+                : CurDAG->getTargetConstant(0, DL, MVT::i32);

   Offset = CurDAG->getTargetConstant(CAddr->getZExtValue(), DL, MVT::i16);
   return true;

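The two SelectMUBUFScratch hunks above apply the same rule when choosing the SOffset operand: stack-pointer-relative accesses keep using the stack pointer register, and everything else now gets an inline constant zero instead of the scratch wave offset register. The standalone model below restates that decision without any LLVM types. `SOffsetOperand`, `selectScratchSOffset` and `kStackPtrSGPR` are hypothetical names, and SGPR32 is taken from the stack pointer description in the documentation rather than from `getStackPtrOffsetReg()`.

```cpp
// Hypothetical stand-in for the SOffset operand of a MUBUF scratch access.
struct SOffsetOperand {
  bool isRegister; // true: read an SGPR; false: inline constant
  unsigned value;  // SGPR number, or the constant itself
};

// Assumption for illustration: SGPR32 is the ABI stack pointer.
constexpr unsigned kStackPtrSGPR = 32;

// Mirrors the selection in the diff: SP register for stack-pointer-relative
// accesses, inline constant zero for all other scratch accesses.
inline SOffsetOperand selectScratchSOffset(bool isStackPtrRelative) {
  if (isStackPtrRelative)
    return SOffsetOperand{true, kStackPtrSGPR};
  return SOffsetOperand{false, 0};
}
```

The zero constant is only valid because the wave offset has already been folded into the scratch buffer descriptor in the entry function prologue.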