
Conversation

@pasaulais pasaulais commented Oct 12, 2023

On some systems, and when accessing certain memory areas, it is not safe to use atomic xor with AMDGPU. This may be the case when using fine-grained memory allocations (e.g. through hipMallocManaged) and when the atomic operation needs to go through the PCIe bus, which does not support atomic xor (PCIe 3.0 does support other atomic operations such as fetch_and_add and cmpxchg) [1].

The issue has been worked around in DPC++/HIP by prefetching memory before executing kernels [2]; however, this adds an overhead that can outweigh the performance benefit of using atomic xor over a cmpxchg loop. This is why a way to switch between 'native' atomic xor and a cmpxchg-loop implementation is needed.

These changes add -munsafe-int-atomics and -mno-unsafe-int-atomics options to clang that can be used to switch between the two implementations. The default is the current behaviour of generating 'native' atomic xor instructions. When -mno-unsafe-int-atomics is passed to clang, functions are marked with the amdgpu-unsafe-int-atomics attribute (set to false) at the IR level, which tells the backend to expand atomic xor to a cmpxchg loop. The options only affect the global and flat address spaces.
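For reference, here is a minimal sketch, in plain C++ with std::atomic rather than the generated AMDGPU code, of the two strategies the new options switch between:

```cpp
#include <atomic>
#include <cstdint>

// Native path: a single hardware atomic xor instruction.
uint32_t xor_native(std::atomic<uint32_t> &a, uint32_t v) {
  return a.fetch_xor(v);
}

// Expanded path: a cmpxchg loop, usable where atomic xor is not
// supported (e.g. over PCIe 3.0). On failure, 'old' is reloaded
// with the current value and the loop retries.
uint32_t xor_cmpxchg(std::atomic<uint32_t> &a, uint32_t v) {
  uint32_t old = a.load();
  while (!a.compare_exchange_weak(old, old ^ v)) {
  }
  return old;
}
```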

github-actions bot commented Oct 12, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@pasaulais pasaulais force-pushed the amdgpu-unsafe-atomic-xor branch 6 times, most recently from 43a46da to 508e577 on October 16, 2023 15:08
nikic and others added 23 commits December 1, 2023 12:18
If both icmps have the same operands and the RHS is constant, we
would currently go into the isImpliedCondMatchingOperands() code
path, instead of the isImpliedCondCommonOperandWithConstants()
path. Both are correct, but the latter can produce more accurate
results if the implication is dependent on the sign.
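A hedged illustration (invented here, not taken from the patch): with a common operand and a constant RHS, an unsigned bound can imply a signed one, but only because the constant is known to be non-negative, which is the kind of sign-dependent reasoning the constant-aware path can perform.

```cpp
#include <cassert>
#include <cstdint>

// If x <u 5 holds, x lies in [0, 4], so the signed comparison
// (int32_t)x < 5 must also hold; the implication depends on the
// constant 5 being non-negative.
void check(uint32_t x) {
  if (x < 5u)
    assert(static_cast<int32_t>(x) < 5);
}
```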
…3769)

getOperandLatency has the following behavior: it returns -1 as a special
value, negative numbers other than -1 on some target-specific overrides,
or a valid non-negative latency. This behavior can be surprising, as
some callers do arithmetic on these negative values. Change the
interface of getOperandLatency to return a std::optional<unsigned> to
prevent surprises in callers. While at it, change the interface of
getInstrLatency to return unsigned instead of int.

This change was inspired by a refactoring in
TargetSchedModel::computeOperandLatency.
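A minimal sketch of the resulting shape of the interface (parameter lists elided, helper invented for illustration):

```cpp
#include <optional>

// std::nullopt now stands in for the old negative sentinel values,
// so callers cannot accidentally do arithmetic on them.
std::optional<unsigned> getOperandLatency(/* DefMI, DefIdx, UseMI, UseIdx */);

// Callers handle the unknown case explicitly.
unsigned latencyOrDefault(std::optional<unsigned> Latency, unsigned Default) {
  return Latency.value_or(Default);
}
```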
The rendered document is not correctly indented because of this space.
This allows a YAML-based multilib configuration to specify explicitly
that a subset of its library directories are alternatives to each
other, i.e. at most one of that subset should be selected.

So if you have multiple sysroots each including a full set of headers
and libraries, you can mark them as members of the same mutually
exclusive group, and then you'll be sure that only one of them is
selected, even if two or more are compatible with the compile options.

This is particularly important in multilib setups including the libc++
headers, where selecting the include directories from two different
sysroots can cause an actual build failure. This occurs when including
<stdio.h>, for example: libc++'s stdio.h is included first, and will
try to use `#include_next` to fetch the underlying libc's version. But
if there are two include directories from separate multilibs, then
both of their C++ include directories will end up on the include path
first, followed by both the C directories. So the `#include_next` from
the first libc++ stdio.h will include the second libc++ stdio.h, which
will do nothing because it has the same include guard macro, and the
libc header won't ever be included at all.

If more than one of the options in an exclusive group matches the
given flags, the last one wins.

The syntax for specifying this in multilib.yaml is to define a Groups
section in which you specify your group names, and for each one,
declare it to have Type: Exclusive. (This reserves space in the syntax
for maybe adding other group types later, such as a group of mutually
_dependent_ things that you must have all or none of.) Then each
Variant record that's a member of a group has a Group: property giving
that group's name.
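A hedged sketch of the described multilib.yaml syntax (directory names and flags invented for illustration):

```yaml
Groups:
  - Name: sysroots
    Type: Exclusive

Variants:
  - Dir: newlib-sysroot
    Flags: [--target=armv7m-none-eabi]
    Group: sysroots
  - Dir: picolibc-sysroot
    Flags: [--target=armv7m-none-eabi]
    Group: sysroots
```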
…loat.

DenseMapAPIntKeyInfo looks like a redundant definition because it
mirrors the default used by DenseMap when not specified.

Replacing DenseMapAPFloatKeyInfo with a specialisation of
DenseMapInfo allows DenseMap<T> to be more easily used when T is
an aggregate type containing an APFloat.
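A hedged sketch (aggregate type invented) of the use this enables; the composite key's traits delegate to the new DenseMapInfo<APFloat> specialisation member-by-member:

```cpp
#include "llvm/ADT/APFloat.h"
#include "llvm/ADT/DenseMapInfo.h"

// Hypothetical aggregate key containing an APFloat.
struct ConstantKey {
  llvm::APFloat Val;
  unsigned Kind;
};

template <> struct llvm::DenseMapInfo<ConstantKey> {
  static ConstantKey getEmptyKey() {
    return {DenseMapInfo<llvm::APFloat>::getEmptyKey(), 0};
  }
  static ConstantKey getTombstoneKey() {
    return {DenseMapInfo<llvm::APFloat>::getTombstoneKey(), 0};
  }
  static unsigned getHashValue(const ConstantKey &K) {
    return llvm::detail::combineHashValue(
        DenseMapInfo<llvm::APFloat>::getHashValue(K.Val), K.Kind);
  }
  static bool isEqual(const ConstantKey &L, const ConstantKey &R) {
    return DenseMapInfo<llvm::APFloat>::isEqual(L.Val, R.Val) &&
           L.Kind == R.Kind;
  }
};
```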
Libomptarget cannot be built because of the recent refactoring
introduced in patch 148dec9:
[OpenMP][NFC] Separate Envar (environment variable) handling (llvm#73994)

That patch moved environment-variable handling out of the libomptarget
library, so the "llvm::omp::target" namespace is no longer needed when
handling environment variables.
Extract a function and call it with both operand orders, so that
we don't have to explicitly commute every single pattern.
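A hedged sketch (names invented) of the refactoring pattern:

```cpp
#include "llvm/IR/Value.h"

// The fold logic is written once for a fixed operand order...
static llvm::Value *foldImpl(llvm::Value *A, llvm::Value *B);

// ...and tried with both orders, so each pattern no longer needs an
// explicitly commuted twin.
llvm::Value *fold(llvm::Value *A, llvm::Value *B) {
  if (llvm::Value *V = foldImpl(A, B))
    return V;
  return foldImpl(B, A);
}
```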
…rgets (llvm#73834)

Follow-up on llvm#72873
    
When ADR/LDR instructions reference a label in a different section, the
offset is not known until link time; however, the assembler assumes it
can resolve them in some cases.

The previous patch addressed the issue for most LDR instructions,
focusing on little-endian targets.

This patch addresses the remaining work for ADRs and big-endian targets.
)

This patch continues the work started with ea5b1ef. See that commit and its corresponding PR for details.
This folds transpose(shape_cast) into a new shape_cast, when the
transpose just permutes a unit dim from the result of the shape_cast.

Example:

```
%0 = vector.shape_cast %vec : vector<[4]xf32> to vector<[4]x1xf32>
%1 = vector.transpose %0, [1, 0] : vector<[4]x1xf32> to vector<1x[4]xf32>
```

Folds to:
```
%0 = vector.shape_cast %vec : vector<[4]xf32> to vector<1x[4]xf32>
```

This is an (alternate) fix for lowering matmuls to ArmSME.
…NFC)

This is in order to use SQ inside of these. There doesn't seem to be
any strong need for these to be static.
llvm#74082)

…ation

The previous code was technically incorrect in that the type indicated
that the memref only has 1 dimension, while the code below was happily
dereferencing the size array out of bounds. Now, if the compiler doesn't
get too smart about optimizations, this code *might even work*. But, if
the compiler realizes that the array has 1 element it might start doing
silly things. This generates a specialization for each supported rank,
making sure we don't invoke any UB.
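A hedged sketch (layout mirrors the usual MLIR memref descriptor, but invented here) of the per-rank specialization idea: the array extents are part of the static type, so indexing with i < Rank is always in bounds:

```cpp
#include <cstdint>

// One descriptor type per rank; the sizes/strides extents match the
// rank, so loops bounded by Rank never read past the arrays.
template <int Rank> struct MemRefDescriptor {
  void *allocated;
  void *aligned;
  int64_t offset;
  int64_t sizes[Rank];
  int64_t strides[Rank];
};

template <int Rank> int64_t numElements(const MemRefDescriptor<Rank> &D) {
  int64_t N = 1;
  for (int i = 0; i < Rank; ++i)
    N *= D.sizes[i];
  return N;
}
```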
If the lshr operand is non-negative, we can treat it the same
way as an ashr. Ideally we would represent this as "lshr nneg",
but for now just perform the necessary ValueTracking query.

Proof: https://alive2.llvm.org/ce/z/Ahg4ri
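To illustrate the underlying fact with an invented example: for a non-negative value, logical and arithmetic right shifts agree, because the sign bit that would be replicated is zero:

```cpp
#include <cassert>
#include <cstdint>

void demo(int32_t x) {
  if (x >= 0) {
    // lshr-style (unsigned) and ashr-style (signed) shifts match
    // whenever the shifted value is non-negative.
    assert((static_cast<uint32_t>(x) >> 3) ==
           static_cast<uint32_t>(x >> 3));
  }
}
```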
…74038)

This does the same as llvm#72142 for vector.transfer_write. Previously the
pattern would silently drop the mask.
This reverts commit b92693a.

I've made a silly typo in the condition. Will reapply the corrected
version.
jayfoad and others added 26 commits December 4, 2023 13:20
Split off live-in printing to VPlan::printLiveIns and use it to print
live-ins when printing in the DOT format.
…73790)

This patch changes the bounds generation code shared between OpenMP and
OpenACC to always attach an extent to the generated bounds. This is
currently required for OpenMP descriptor lowering but may not
necessarily be required in the case of OpenACC.
…4311)

This patch continues the work started with ea5b1ef. See that commit and its corresponding PR for details.
Add a modernize version of the abseil-string-find-startswith check
using the C++20 `std::string::starts_with` and
`std::string_view::starts_with`. Following up from
llvm#72283.
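For illustration (invented example), the kind of rewrite the check suggests:

```cpp
#include <string>

bool hasPrefix(const std::string &S) {
  // before: return S.find("hello") == 0;
  return S.starts_with("hello");  // C++20
}
```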
llvm#73367)

This PR introduces DIGlobalVariableAttr and
DIGlobalVariableExpressionAttr so that ModuleTranslation can emit the
metadata needed for debug information about global variables.
The translator implementation for debug metadata needed to be refactored
in order to allow translation of nodes based on MDNode
(DIGlobalVariableExpressionAttr and DIExpression) in addition to
DINode-based nodes.

A DIGlobalVariableExpressionAttr can now be passed to the GlobalOp
operation directly and ModuleTranslation will create the respective
DIGlobalVariable and DIGlobalVariableExpression nodes. The compile unit
that DIGlobalVariable is expected to be configured with will be updated
with the created DIGlobalVariableExpression.
…lvm#74095)

In preparation for running clang-format on the whole code base, we are
also removing mentions of the legacy _LIBCPP_INLINE_VISIBILITY macro in
favor of the newer _LIBCPP_HIDE_FROM_ABI.

We're still leaving the definition of _LIBCPP_INLINE_VISIBILITY to avoid
creating needless breakage in case some older patches are checked-in
with mentions of the old macro. After we branch for LLVM 18, we can do
another pass to clean up remaining uses of the macro that might have
gotten introduced by mistake (if any) and remove the macro itself at the
same time. This is just a minor convenience to smooth out the transition
as much as possible.

See
https://discourse.llvm.org/t/rfc-clang-formatting-all-of-libc-once-and-for-all
for the clang-format proposal.
Enable by default for optimization levels higher than 0 (same behavior
as clang).

For simplicity, only forward the flag to the frontend driver when it
contradicts what is implied by the optimization level.

This was first landed in
llvm#73111 but was later reverted
due to a performance regression. That regression was fixed by
llvm#74065.
We can use Intrinsic::getDeclaration() here; we just have to pass
the correct arguments. This function accepts only the mangled types,
not all argument types.
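A hedged sketch (module assumed in scope, intrinsic chosen for illustration) of a call that passes only the overloaded type:

```cpp
#include "llvm/IR/Intrinsics.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/Type.h"

// llvm.umax is overloaded on a single type, so only that one mangled
// type is passed; the types of the individual arguments are not.
llvm::Function *getUMax32(llvm::Module &M) {
  return llvm::Intrinsic::getDeclaration(
      &M, llvm::Intrinsic::umax,
      {llvm::Type::getInt32Ty(M.getContext())});
}
```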
MinBWs contains entries that specify the minimum required bitwidth. In
some cases, the old and new bitwidths can be equal (see test case) and
in those cases no truncations are needed, so skip those cases.

Fixes llvm#74307.
…pass (llvm#74075)

The GPU dialect has `#gpu.address_space<workgroup>` for NVGPU shared
memory (address space 3). However, when IR combines the NVGPU and GPU
dialects, the `nvgpu-to-nvvm` pass fails due to a missing attribute
conversion.

This PR adds `populateGpuMemorySpaceAttributeConversions` to the
nvgpu-to-nvvm lowering, so `#gpu.address_space<workgroup>` can be used
with the `nvgpu-to-nvvm` pass.
This PR introduces the `nvvm.fence.proxy` op for the following cases:

```
nvvm.fence.proxy { kind = #nvvm.proxy_kind<alias>}
nvvm.fence.proxy { kind = #nvvm.proxy_kind<async>}
nvvm.fence.proxy { kind = #nvvm.proxy_kind<async.global>}
nvvm.fence.proxy { kind = #nvvm.proxy_kind<async.shared>, space = #nvvm.shared_space<cta>}
nvvm.fence.proxy { kind = #nvvm.proxy_kind<async.shared>, space = #nvvm.shared_space<cluster>}
```
Clang currently implements a set of vector rotate builtins
(__builtin_s390_verll*) in terms of platform-specific LLVM
intrinsics.  To simplify the IR (and allow for common code
optimizations if applicable), this patch removes those LLVM
intrinsics and implements the builtins in terms of the
platform-independent funnel shift intrinsics instead.

Also, fix the prototype of the __builtin_s390_verll*
builtins for full compatibility with GCC.
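For reference, a plain-C++ illustration (not from the patch) of the equivalence the lowering relies on: a rotate is a funnel shift with both inputs equal, i.e. rotl(x, n) == fshl(x, x, n):

```cpp
#include <cstdint>

// C++ equivalent of llvm.fshl.i32(x, x, n); masking the shift
// amounts keeps both shifts well-defined, including n == 0.
uint32_t rotl32(uint32_t x, uint32_t n) {
  n &= 31;
  return (x << n) | (x >> ((32 - n) & 31));
}
```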
The builtins that expand to the vlrl/vlrlr and vstrl/vstrlr
instructions are currently named inconsistently between GCC
and clang.  Rename the clang versions to match GCC.
The __builtin_s390_vceq* family of builtins currently takes
signed arguments with clang, but unsigned with GCC. Update
clang to match existing GCC precedent.
pasaulais pushed a commit that referenced this pull request Dec 4, 2023
…lvm#73463)

Despite CWG2497 not being resolved, it is reasonable to expect the
following code to compile (it is supported by other compilers):
```cpp
  template<typename T> constexpr T f();
  constexpr int g() { return f<int>(); } // #1
  template<typename T> constexpr T f() { return 123; }
  int k[g()];
  // #2
```

To that end, we eagerly instantiate all referenced specializations of
constexpr functions when they are defined.

We maintain a map of (pattern, [instantiations]) independent of
`PendingInstantiations` to avoid having to iterate that list after each
function definition.

We should apply the same logic to constexpr variables, but I wanted to
keep the PR small.

Fixes llvm#73232
@pasaulais pasaulais force-pushed the amdgpu-unsafe-atomic-xor branch from 508e577 to 24b0e52 on December 4, 2023 18:11
@pasaulais pasaulais force-pushed the amdgpu-unsafe-atomic-xor branch from 24b0e52 to cd33e84 on December 4, 2023 18:12