3 changes: 0 additions & 3 deletions .github/workflows/release-binaries.yml
@@ -188,9 +188,6 @@ jobs:
with:
ref: ${{ needs.prepare.outputs.ref }}

- name: Install Ninja
uses: llvm/actions/install-ninja@5dd955034a6742a2e21d82bf165fcb1050ae7b49 # main

- name: Set Build Prefix
id: setup-stage
shell: bash
@@ -16,7 +16,9 @@
#include "mlir/Dialect/OpenACC/OpenACC.h"

namespace fir {
class AddrOfOp;
class DeclareOp;
class GlobalOp;
} // namespace fir

namespace hlfir {
@@ -53,6 +55,18 @@ struct PartialEntityAccessModel&lt;hlfir::DeclareOp&gt;
bool isCompleteView(mlir::Operation *op) const;
};

struct AddressOfGlobalModel
: public mlir::acc::AddressOfGlobalOpInterface::ExternalModel<
AddressOfGlobalModel, fir::AddrOfOp> {
mlir::SymbolRefAttr getSymbol(mlir::Operation *op) const;
};

struct GlobalVariableModel
: public mlir::acc::GlobalVariableOpInterface::ExternalModel<
GlobalVariableModel, fir::GlobalOp> {
bool isConstant(mlir::Operation *op) const;
};

} // namespace fir::acc

#endif // FLANG_OPTIMIZER_OPENACC_FIROPENACC_OPS_INTERFACES_H_
@@ -59,4 +59,13 @@ bool PartialEntityAccessModel&lt;hlfir::DeclareOp&gt;::isCompleteView(
return !getBaseEntity(op);
}

mlir::SymbolRefAttr AddressOfGlobalModel::getSymbol(mlir::Operation *op) const {
return mlir::cast<fir::AddrOfOp>(op).getSymbolAttr();
}

bool GlobalVariableModel::isConstant(mlir::Operation *op) const {
auto globalOp = mlir::cast<fir::GlobalOp>(op);
return globalOp.getConstant().has_value();
}

} // namespace fir::acc
@@ -49,6 +49,9 @@ void registerOpenACCExtensions(mlir::DialectRegistry &registry) {
PartialEntityAccessModel<fir::CoordinateOp>>(*ctx);
fir::DeclareOp::attachInterface<PartialEntityAccessModel<fir::DeclareOp>>(
*ctx);

fir::AddrOfOp::attachInterface<AddressOfGlobalModel>(*ctx);
fir::GlobalOp::attachInterface<GlobalVariableModel>(*ctx);
});

// Register HLFIR operation interfaces
106 changes: 106 additions & 0 deletions llvm/docs/NVPTXUsage.rst
@@ -796,6 +796,112 @@ every time. For more information, refer PTX ISA
Membar/Fences
-------------

'``llvm.nvvm.fence.acquire/release.sync_restrict.*``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

declare void @llvm.nvvm.fence.acquire.sync_restrict.space.cluster.scope.cluster()
declare void @llvm.nvvm.fence.release.sync_restrict.space.cta.scope.cluster()

Overview:
"""""""""

The ``nvvm.fence.{semantics}.sync_restrict.*`` intrinsics restrict the class of
memory operations for which the fence instruction provides the memory ordering
guarantees. When ``sync_restrict`` is restricted to ``shared_cta``, the memory
semantics must be ``release`` and the effect of the fence applies only to
operations performed on objects in the ``shared_cta`` space. Likewise, when
``sync_restrict`` is restricted to ``shared_cluster``, the memory semantics must
be ``acquire`` and the effect of the fence applies only to operations performed
on objects in the ``shared_cluster`` memory space. The scope for both operations
is ``cluster``. For more details, please refer to the
`PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-membar>`__.
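
A minimal sketch of the release side, assuming a hypothetical kernel
``@publish`` whose argument points into CTA shared memory:

.. code-block:: llvm

  declare void @llvm.nvvm.fence.release.sync_restrict.space.cta.scope.cluster()

  define void @publish(ptr addrspace(3) %p) {
    store i32 42, ptr addrspace(3) %p
    ; Release fence: orders the prior shared_cta store so that threads in the
    ; cluster that subsequently acquire can observe it.
    call void @llvm.nvvm.fence.release.sync_restrict.space.cta.scope.cluster()
    ret void
  }

A consuming thread in another CTA of the cluster would issue the matching
``acquire`` variant before reading the published data.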

'``llvm.nvvm.fence.mbarrier_init.release.cluster``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

declare void @llvm.nvvm.fence.mbarrier_init.release.cluster()

Overview:
"""""""""

The ``nvvm.fence.mbarrier_init.release.cluster`` intrinsic restricts the class
of memory operations for which the fence instruction provides the memory
ordering guarantees. The ``mbarrier_init`` modifier restricts the synchronizing
effect to the prior ``mbarrier_init`` operation executed by the same thread on
mbarrier objects in the ``shared_cta`` memory space. For more details, please
refer to the `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-membar>`__.
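
A minimal sketch, assuming the ``llvm.nvvm.mbarrier.init.shared`` intrinsic;
the function name and the expected-arrival count are illustrative:

.. code-block:: llvm

  declare void @llvm.nvvm.mbarrier.init.shared(ptr addrspace(3), i32)
  declare void @llvm.nvvm.fence.mbarrier_init.release.cluster()

  define void @init_mbarrier(ptr addrspace(3) %mbar) {
    ; Initialize an mbarrier object in CTA shared memory ...
    call void @llvm.nvvm.mbarrier.init.shared(ptr addrspace(3) %mbar, i32 32)
    ; ... then make the initialization visible at cluster scope.
    call void @llvm.nvvm.fence.mbarrier_init.release.cluster()
    ret void
  }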

'``llvm.nvvm.fence.proxy.async_generic.acquire/release.sync_restrict``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

declare void @llvm.nvvm.fence.proxy.async_generic.acquire.sync_restrict.space.cluster.scope.cluster()
declare void @llvm.nvvm.fence.proxy.async_generic.release.sync_restrict.space.cta.scope.cluster()

Overview:
"""""""""

The ``nvvm.fence.proxy.async_generic.{semantics}.sync_restrict`` intrinsics are
used to establish ordering between a prior memory access performed via the
`async proxy <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#proxies>`__
and a subsequent memory access performed via the generic proxy.
``nvvm.fence.proxy.async_generic.release.sync_restrict`` can form a release
sequence that synchronizes with an acquire sequence containing the
``nvvm.fence.proxy.async_generic.acquire.sync_restrict`` proxy fence. When
``sync_restrict`` is restricted to ``shared_cta``, the memory semantics must be
``release`` and the effect of the fence applies only to operations performed on
objects in the ``shared_cta`` space. Likewise, when ``sync_restrict`` is
restricted to ``shared_cluster``, the memory semantics must be ``acquire`` and
the effect of the fence applies only to operations performed on objects in the
``shared_cluster`` memory space. The scope for both operations is ``cluster``.
For more details, please refer to the
`PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-membar>`__.
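
A minimal sketch of the release side, using a hypothetical ``@publish_async``
function; the async-proxy write itself is elided as a comment:

.. code-block:: llvm

  declare void @llvm.nvvm.fence.proxy.async_generic.release.sync_restrict.space.cta.scope.cluster()

  define void @publish_async(ptr addrspace(3) %p) {
    ; (a prior write to %p performed via the async proxy goes here)
    ; Order that async-proxy write before generic-proxy reads in the cluster.
    call void @llvm.nvvm.fence.proxy.async_generic.release.sync_restrict.space.cta.scope.cluster()
    ret void
  }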

'``llvm.nvvm.fence.proxy.<proxykind>``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

declare void @llvm.nvvm.fence.proxy.alias()
declare void @llvm.nvvm.fence.proxy.async()
declare void @llvm.nvvm.fence.proxy.async.global()
declare void @llvm.nvvm.fence.proxy.async.shared_cluster()
declare void @llvm.nvvm.fence.proxy.async.shared_cta()

Overview:
"""""""""

The ``nvvm.fence.proxy.{proxykind}`` intrinsics represent a fence that
establishes bi-directional proxy ordering between memory accesses performed via
the `generic proxy <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#proxies>`__
and memory accesses performed via the proxy specified by ``proxykind``. A
bi-directional proxy ordering between two proxykinds establishes two
uni-directional proxy orderings: one from the first proxykind to the second,
and the other from the second proxykind to the first.

The ``alias`` proxykind refers to memory accesses performed using virtually
aliased addresses to the same memory location.

The ``async`` proxykind specifies that the memory ordering is established
between the `async proxy` and the `generic proxy`. The memory ordering is
limited to operations performed on objects in the specified state space
(``generic``, ``global``, ``shared_cluster``, or ``shared_cta``). If no state
space is specified, the memory ordering applies to all state spaces. For more
details, please refer to the
`PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-membar>`__.
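
For example, a sketch with the ``alias`` proxykind, assuming hypothetical
pointers ``%a`` and ``%b`` that virtually alias the same location:

.. code-block:: llvm

  declare void @llvm.nvvm.fence.proxy.alias()

  define i32 @read_through_alias(ptr %a, ptr %b) {
    store i32 1, ptr %a
    ; %a and %b are assumed to alias; the alias proxy fence orders the store
    ; above with the load below across the two virtual addresses.
    call void @llvm.nvvm.fence.proxy.alias()
    %v = load i32, ptr %b
    ret i32 %v
  }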

'``llvm.nvvm.fence.proxy.tensormap_generic.*``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

28 changes: 27 additions & 1 deletion llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -123,6 +123,32 @@ struct HardwareLoopInfo {
LLVM_ABI bool canAnalyze(LoopInfo &LI);
};

/// Information for memory intrinsic cost model.
class MemIntrinsicCostAttributes {
/// Vector type of the data to be loaded or stored.
Type *DataTy = nullptr;

/// ID of the memory intrinsic.
Intrinsic::ID IID;

/// Address space of the pointer.
unsigned AddressSpace = 0;

/// Alignment of single element.
Align Alignment;

public:
LLVM_ABI MemIntrinsicCostAttributes(Intrinsic::ID Id, Type *DataTy,
Align Alignment, unsigned AddressSpace)
: DataTy(DataTy), IID(Id), AddressSpace(AddressSpace),
Alignment(Alignment) {}

Intrinsic::ID getID() const { return IID; }
Type *getDataType() const { return DataTy; }
unsigned getAddressSpace() const { return AddressSpace; }
Align getAlignment() const { return Alignment; }
};

class IntrinsicCostAttributes {
const IntrinsicInst *II = nullptr;
Type *RetTy = nullptr;
@@ -1556,7 +1582,7 @@ class TargetTransformInfo {

/// \return The cost of masked Load and Store instructions.
LLVM_ABI InstructionCost getMaskedMemoryOpCost(
unsigned Opcode, Type *Src, Align Alignment, unsigned AddressSpace,
const MemIntrinsicCostAttributes &MICA,
TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput) const;

/// \return The cost of Gather or Scatter operation
3 changes: 1 addition & 2 deletions llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -842,8 +842,7 @@ class TargetTransformInfoImplBase {
}

virtual InstructionCost
getMaskedMemoryOpCost(unsigned Opcode, Type *Src, Align Alignment,
unsigned AddressSpace,
getMaskedMemoryOpCost(const MemIntrinsicCostAttributes &MICA,
TTI::TargetCostKind CostKind) const {
return 1;
}
24 changes: 14 additions & 10 deletions llvm/include/llvm/CodeGen/BasicTTIImpl.h
@@ -1558,9 +1558,13 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase&lt;T&gt; {
}

InstructionCost
getMaskedMemoryOpCost(unsigned Opcode, Type *DataTy, Align Alignment,
unsigned AddressSpace,
getMaskedMemoryOpCost(const MemIntrinsicCostAttributes &MICA,
TTI::TargetCostKind CostKind) const override {
Type *DataTy = MICA.getDataType();
Align Alignment = MICA.getAlignment();
unsigned Opcode = MICA.getID() == Intrinsic::masked_load
? Instruction::Load
: Instruction::Store;
// TODO: Pass on AddressSpace when we have test coverage.
return getCommonMaskedMemoryOpCost(Opcode, DataTy, Alignment, true, false,
CostKind);
@@ -1617,10 +1621,12 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase&lt;T&gt; {

// Firstly, the cost of load/store operation.
InstructionCost Cost;
if (UseMaskForCond || UseMaskForGaps)
Cost = thisT()->getMaskedMemoryOpCost(Opcode, VecTy, Alignment,
AddressSpace, CostKind);
else
if (UseMaskForCond || UseMaskForGaps) {
unsigned IID = Opcode == Instruction::Load ? Intrinsic::masked_load
: Intrinsic::masked_store;
Cost = thisT()->getMaskedMemoryOpCost(
{IID, VecTy, Alignment, AddressSpace}, CostKind);
} else
Cost = thisT()->getMemoryOpCost(Opcode, VecTy, Alignment, AddressSpace,
CostKind);

@@ -2403,14 +2409,12 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase&lt;T&gt; {
case Intrinsic::masked_store: {
Type *Ty = Tys[0];
Align TyAlign = thisT()->DL.getABITypeAlign(Ty);
return thisT()->getMaskedMemoryOpCost(Instruction::Store, Ty, TyAlign, 0,
CostKind);
return thisT()->getMaskedMemoryOpCost({IID, Ty, TyAlign, 0}, CostKind);
}
case Intrinsic::masked_load: {
Type *Ty = RetTy;
Align TyAlign = thisT()->DL.getABITypeAlign(Ty);
return thisT()->getMaskedMemoryOpCost(Instruction::Load, Ty, TyAlign, 0,
CostKind);
return thisT()->getMaskedMemoryOpCost({IID, Ty, TyAlign, 0}, CostKind);
}
case Intrinsic::experimental_vp_strided_store: {
auto *Ty = cast<VectorType>(ICA.getArgTypes()[0]);
58 changes: 45 additions & 13 deletions llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -1746,33 +1746,65 @@ let TargetPrefix = "nvvm" in {
def int_nvvm_barrier_cluster_wait_aligned : Intrinsic<[]>;
}

//
// Membar
//
let IntrProperties = [IntrNoCallback] in {
//
// Membar / Fence
//
let IntrProperties = [IntrNoCallback] in {
def int_nvvm_membar_cta : NVVMBuiltin, Intrinsic<[]>;
def int_nvvm_membar_gl : NVVMBuiltin, Intrinsic<[]>;
def int_nvvm_membar_sys : NVVMBuiltin, Intrinsic<[]>;
def int_nvvm_fence_sc_cluster : Intrinsic<[]>;
}

//
// Proxy fence (uni-directional)
//
// Operation fence
def int_nvvm_fence_mbarrier_init_release_cluster: Intrinsic<[], [], [],
"llvm.nvvm.fence.mbarrier_init.release.cluster">;

// Thread fence
def int_nvvm_fence_acquire_sync_restrict_space_cluster_scope_cluster :
Intrinsic<[], [], [],
"llvm.nvvm.fence.acquire.sync_restrict.space.cluster.scope.cluster">;

def int_nvvm_fence_release_sync_restrict_space_cta_scope_cluster :
Intrinsic<[], [], [],
"llvm.nvvm.fence.release.sync_restrict.space.cta.scope.cluster">;

//
// Proxy fence (uni-directional)
//

def int_nvvm_fence_proxy_async_generic_acquire_sync_restrict_space_cluster_scope_cluster :
Intrinsic<[], [], [],
"llvm.nvvm.fence.proxy.async_generic.acquire.sync_restrict.space.cluster.scope.cluster">;

def int_nvvm_fence_proxy_async_generic_release_sync_restrict_space_cta_scope_cluster :
Intrinsic<[], [], [],
"llvm.nvvm.fence.proxy.async_generic.release.sync_restrict.space.cta.scope.cluster">;

foreach scope = ["cta", "cluster", "gpu", "sys"] in {

def int_nvvm_fence_proxy_tensormap_generic_release_ # scope :
Intrinsic<[], [], [IntrNoCallback],
Intrinsic<[], [], [],
"llvm.nvvm.fence.proxy.tensormap_generic.release." # scope>;

// The imm-arg 'size' can only be 128.
def int_nvvm_fence_proxy_tensormap_generic_acquire_ # scope :
Intrinsic<[], [llvm_ptr_ty, llvm_i32_ty],
[IntrNoCallback, IntrArgMemOnly, ImmArg<ArgIndex<1>>,
Range<ArgIndex<1>, 128, 129>],
"llvm.nvvm.fence.proxy.tensormap_generic.acquire." # scope>;
Intrinsic<[], [llvm_ptr_ty, llvm_i32_ty], [],
"llvm.nvvm.fence.proxy.tensormap_generic.acquire." # scope> {
let IntrProperties = [IntrNoCallback, IntrArgMemOnly,
ImmArg<ArgIndex<1>>, Range<ArgIndex<1>, 128, 129>];
}
}

//
// Proxy fence (bi-directional)
//
foreach proxykind = ["alias", "async", "async.global", "async.shared_cta",
"async.shared_cluster"] in {
defvar Intr = IntrinsicName<"llvm.nvvm.fence.proxy." # proxykind>;
def Intr.record_name: Intrinsic<[], [], [], Intr.intr_name>;
}
}

//
// Async Copy
//