forked from llvm/llvm-project
[AMDGPU] Add an option to disable unsafe uses of atomic xor #1
Draft: pasaulais wants to merge 5,033 commits into main from amdgpu-unsafe-atomic-xor
Conversation
✅ With the latest revision this PR passed the C/C++ code formatter.
Force-pushed from 43a46da to 508e577
If both icmps have the same operands and the RHS is constant, we would currently go into the isImpliedCondMatchingOperands() code path, instead of the isImpliedCondCommonOperandWithConstants() path. Both are correct, but the latter can produce more accurate results if the implication is dependent on the sign.
…3769) getOperandLatency has the following behavior: it returns -1 as a special value, negative numbers other than -1 on some target-specific overrides, or a valid non-negative latency. This behavior can be surprising, as some callers do arithmetic on these negative values. Change the interface of getOperandLatency to return a std::optional<unsigned> to prevent surprises in callers. While at it, change the interface of getInstrLatency to return unsigned instead of int. This change was inspired by a refactoring in TargetSchedModel::computeOperandLatency.
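As a rough illustration of the change described above (hypothetical names, not the actual TargetInstrInfo interface), this is the sentinel-to-optional pattern being applied:

```cpp
#include <optional>

// Hypothetical before/after sketch: a -1 sentinel forces every caller to
// remember the special case, while std::optional<unsigned> makes "latency
// unknown" explicit and impossible to feed into arithmetic by accident.
int getLatencySentinel(bool Known, unsigned Latency) {
  return Known ? static_cast<int>(Latency) : -1; // -1 means "unknown"
}

std::optional<unsigned> getLatencyOptional(bool Known, unsigned Latency) {
  if (!Known)
    return std::nullopt; // callers must handle the missing value explicitly
  return Latency;
}
```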
The rendered document is not correctly indented because of this space.
This allows a YAML-based multilib configuration to specify explicitly that a subset of its library directories are alternatives to each other, i.e. at most one of that subset should be selected. So if you have multiple sysroots each including a full set of headers and libraries, you can mark them as members of the same mutually exclusive group, and then you'll be sure that only one of them is selected, even if two or more are compatible with the compile options.

This is particularly important in multilib setups including the libc++ headers, where selecting the include directories from two different sysroots can cause an actual build failure. This occurs when including <stdio.h>, for example: libc++'s stdio.h is included first, and will try to use `#include_next` to fetch the underlying libc's version. But if there are two include directories from separate multilibs, then both of their C++ include directories will end up on the include path first, followed by both the C directories. So the `#include_next` from the first libc++ stdio.h will include the second libc++ stdio.h, which will do nothing because it has the same include guard macro, and the libc header won't ever be included at all.

If more than one of the options in an exclusive group matches the given flags, the last one wins. The syntax for specifying this in multilib.yaml is to define a Groups section in which you specify your group names, and for each one, declare it to have Type: Exclusive. (This reserves space in the syntax for maybe adding other group types later, such as a group of mutually _dependent_ things that you must have all or none of.) Then each Variant record that's a member of a group has a Group: property giving that group's name.
…loat. DenseMapAPIntKeyInfo looks like a redundant definition because it mirrors the default used by DenseMap when not specified. Replacing DenseMapAPFloatKeyInfo with a specialisation of DenseMapInfo allows DenseMap<T> to be more easily used when T is an aggregate type containing an APFloat.
Libomptarget cannot be built because of the recent refactoring introduced in patch 148dec9: [OpenMP][NFC] Separate Envar (environment variable) handling (llvm#73994). That patch moved the handling of environment variables out of the libomptarget library, which is why the "llvm::omp::target" namespace is no longer needed when handling environment variables.
…UBLE_IS_FLOAT128 (llvm#74052) To make it consistent with llvm#73948 and llvm#73950
Extract a function and call it with both operand orders, so that we don't have to explicitly commute every single pattern.
…rgets (llvm#73834) Follow-up on llvm#72873. When ADR/LDR instructions reference a label in a different section, the offset is not known until link time; however, the assembler assumes it can resolve them in some cases. The previous patch addressed the issue for most LDR instructions, focusing on little-endian targets. This patch addresses the remaining work for ADRs and big-endian targets.
) This patch continues the work started with ea5b1ef. See that commit and its corresponding PR for details.
This reverts commit 61aef97, which broke the CI on GCC.
This folds transpose(shape_cast) into a new shape_cast, when the transpose just permutes a unit dim from the result of the shape_cast.

Example:
```
%0 = vector.shape_cast %vec : vector<[4]xf32> to vector<[4]x1xf32>
%1 = vector.transpose %0, [1, 0] : vector<[4]x1xf32> to vector<1x[4]xf32>
```

Folds to:
```
%0 = vector.shape_cast %vec : vector<[4]xf32> to vector<1x[4]xf32>
```

This is an (alternate) fix for lowering matmuls to ArmSME.
See ARM-software/acle#217 Patch by: Kerry McLaughlin [email protected]
…NFC) In order to use SQ inside of these. There doesn't seem to be any strong need for these to be static.
llvm#74082) …ation The previous code was technically incorrect in that the type indicated that the memref only has 1 dimension, while the code below was happily dereferencing the size array out of bounds. Now, if the compiler doesn't get too smart about optimizations, this code *might even work*. But if the compiler realizes that the array has 1 element it might start doing silly things. This generates a specialization for each supported rank, making sure we don't do any UB.
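A loose C++ sketch of the "one specialization per rank" idea, using hypothetical names rather than the actual runtime's descriptor types: with the rank encoded in the type, no code can index past the real extent of the sizes array.

```cpp
#include <cstdint>

// Hypothetical rank-parameterized descriptor: the sizes array's length is
// part of the type, so the loop below is bounded by the true rank.
template <int Rank> struct MemRefDescriptor {
  int64_t Sizes[Rank];
};

template <int Rank>
int64_t numElements(const MemRefDescriptor<Rank> &Desc) {
  int64_t N = 1;
  for (int I = 0; I < Rank; ++I) // never reads past Sizes[Rank - 1]
    N *= Desc.Sizes[I];
  return N;
}
```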
If the lshr operand is non-negative, we can treat it the same way as an ashr. Ideally we would represent this as "lshr nneg", but for now just perform the necessary ValueTracking query. Proof: https://alive2.llvm.org/ce/z/Ahg4ri
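A small self-contained check of the equivalence the fold relies on (illustrative only, not the ValueTracking code itself): when the shifted value is known non-negative, its sign bit is zero, so shifting in zeros (lshr) and shifting in sign bits (ashr) give the same result.

```cpp
#include <cassert>
#include <cstdint>

int main() {
  const int32_t X = 12345; // known non-negative, so the sign bit is 0
  for (unsigned S = 0; S < 31; ++S) {
    uint32_t Logical = static_cast<uint32_t>(X) >> S; // lshr: shifts in zeros
    int32_t Arithmetic = X >> S;                      // ashr: shifts in sign bits
    assert(Logical == static_cast<uint32_t>(Arithmetic));
  }
  return 0;
}
```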
…74038) This does the same as llvm#72142 for vector.transfer_write. Previously the pattern would silently drop the mask.
This reverts commit b92693a. I've made a silly typo in the condition. Will reapply the corrected version.
Split off live-in printing to VPlan::printLiveIns and use it to print Live-ins when printing in the DOT format.
…73790) This patch changes the bounds generation code shared between OpenMP and OpenACC to always attach an extent to the bounds generation. This is currently required for OpenMP descriptor lowering but may not necessarily be required in the case of OpenACC.
Make a modernize version of abseil-string-find-startswith using the available C++20 `std::string::starts_with` and `std::string_view::starts_with`. Following up from llvm#72283.
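The kind of rewrite such a check performs, sketched with a generic example (not the check's actual fix-it output): the `find(...) == 0` idiom becomes a direct C++20 `starts_with` call.

```cpp
#include <string>
#include <string_view>

// Before: keeps scanning past the first mismatch for later occurrences.
bool hasPrefixOld(const std::string &S, std::string_view Prefix) {
  return S.find(Prefix) == 0;
}

// After: stops at the first mismatch and states the intent directly.
bool hasPrefixNew(const std::string &S, std::string_view Prefix) {
  return S.starts_with(Prefix);
}
```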
llvm#73367) This PR introduces DIGlobalVariableAttr and DIGlobalVariableExpressionAttr so that ModuleTranslation can emit the metadata needed for debug information about global variables. The translator implementation for debug metadata needed to be refactored in order to allow translation of nodes based on MDNode (DIGlobalVariableExpressionAttr and DIExpression) in addition to DINode-based nodes. A DIGlobalVariableExpressionAttr can now be passed to the GlobalOp operation directly and ModuleTranslation will create the respective DIGlobalVariable and DIGlobalVariableExpression nodes. The compile unit that DIGlobalVariable is expected to be configured with will be updated with the created DIGlobalVariableExpression.
Fix ppc build bot
…lvm#74095) In preparation for running clang-format on the whole code base, we are also removing mentions of the legacy _LIBCPP_INLINE_VISIBILITY macro in favor of the newer _LIBCPP_HIDE_FROM_ABI. We're still leaving the definition of _LIBCPP_INLINE_VISIBILITY to avoid creating needless breakage in case some older patches are checked-in with mentions of the old macro. After we branch for LLVM 18, we can do another pass to clean up remaining uses of the macro that might have gotten introduced by mistake (if any) and remove the macro itself at the same time. This is just a minor convenience to smooth out the transition as much as possible. See https://discourse.llvm.org/t/rfc-clang-formatting-all-of-libc-once-and-for-all for the clang-format proposal.
Enable by default for optimization levels higher than 0 (same behavior as clang). For simplicity, only forward the flag to the frontend driver when it contradicts what is implied by the optimization level. This was first landed in llvm#73111 but was later reverted due to a performance regression. That regression was fixed by llvm#74065.
We can use Intrinsic::getDeclaration() here; we just have to pass the correct arguments. This function accepts only the mangled types, not all argument types.
MinBWs contains entries that specify the minimum required bitwidth. In some cases, the old and new bitwidths can be equal (see test case) and in those cases no truncations are needed, so skip those cases. Fixes llvm#74307.
…pass (llvm#74075) The GPU dialect has `#gpu.address_space<workgroup>` for NVGPU shared memory (address space = 3). However, when IR combines the NVGPU and GPU dialects, the `nvgpu-to-nvvm` pass fails due to a missing attribute conversion. This PR adds `populateGpuMemorySpaceAttributeConversions` to the nvgpu-to-nvvm lowering, so we can use `#gpu.address_space<workgroup>` with the `nvgpu-to-nvvm` pass.
This PR introduces the `nvvm.fence.proxy` op for the following cases:
```
nvvm.fence.proxy { kind = #nvvm.proxy_kind<alias>}
nvvm.fence.proxy { kind = #nvvm.proxy_kind<async>}
nvvm.fence.proxy { kind = #nvvm.proxy_kind<async.global>}
nvvm.fence.proxy { kind = #nvvm.proxy_kind<async.shared>, space = #nvvm.shared_space<cta>}
nvvm.fence.proxy { kind = #nvvm.proxy_kind<async.shared>, space = #nvvm.shared_space<cluster>}
```
Clang currently implements a set of vector rotate builtins (__builtin_s390_verll*) in terms of platform-specific LLVM intrinsics. To simplify the IR (and allow for common code optimizations if applicable), this patch removes those LLVM intrinsics and implements the builtins in terms of the platform-independent funnel shift intrinsics instead. Also, fix the prototype of the __builtin_s390_verll* builtins for full compatibility with GCC.
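The scalar identity behind using funnel shifts for rotates, as a hedged sketch (the actual builtins operate on SystemZ vector elements): a rotate left is a funnel shift whose two inputs are the same value, i.e. rotl(x, n) behaves like fshl(x, x, n).

```cpp
#include <cstdint>

// Portable rotate-left, written the way a funnel shift of (X, X) by N
// behaves; the masking keeps both shift amounts in range for any N.
uint32_t rotl32(uint32_t X, unsigned N) {
  N &= 31;
  return (X << N) | (X >> ((32 - N) & 31));
}
```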
The builtins that expand to the vlrl/vlrlr and vstrl/vstrlr instructions are currently named inconsistently between GCC and clang. Rename the clang versions to match GCC.
The __builtin_s390_vceq* family of builtins currently takes signed arguments with clang, but unsigned with GCC. Update clang to match existing GCC precedent.
pasaulais pushed a commit that referenced this pull request on Dec 4, 2023
…lvm#73463) Despite CWG2497 not being resolved, it is reasonable to expect the following code to compile (and it is supported by other compilers):
```cpp
template<typename T>
constexpr T f();
constexpr int g() { return f<int>(); } // #1
template<typename T>
constexpr T f() { return 123; }
int k[g()]; // #2
```
To that end, we eagerly instantiate all referenced specializations of constexpr functions when they are defined. We maintain a map of (pattern, [instantiations]) independent of `PendingInstantiations` to avoid having to iterate that list after each function definition. We should apply the same logic to constexpr variables, but I wanted to keep the PR small. Fixes llvm#73232
Force-pushed from 508e577 to 24b0e52
Force-pushed from 24b0e52 to cd33e84
On some systems and when accessing certain memory areas it is not safe to use atomic xor with AMDGPU. This may be the case when using fine-grained memory allocations (e.g. through `hipAllocManaged`) and when the atomic operation needs to go through the PCIe bus, which does not support atomic xor (PCIe 3.0 does support other atomic operations like fetch_and_add and cmpxchg) [1]. The issue has been worked around in DPC++/HIP by prefetching memory before executing kernels [2]; however, this adds an overhead that can outweigh the performance benefit of using atomic xor over a cmpxchg loop. This is why a way to switch between 'native' atomic xor and a cmpxchg-loop solution is needed.
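For reference, a minimal C++ sketch (not the backend's actual expansion) of how an atomic xor can be emulated with a compare-exchange loop on memory where native atomic xor is unsafe:

```cpp
#include <atomic>
#include <cstdint>

// Retries the compare-exchange until it succeeds; the winning iteration
// performs the read-modify-write atomically, matching fetch_xor semantics.
uint32_t atomic_xor_via_cmpxchg(std::atomic<uint32_t> &Value, uint32_t Mask) {
  uint32_t Old = Value.load(std::memory_order_relaxed);
  while (!Value.compare_exchange_weak(Old, Old ^ Mask,
                                      std::memory_order_relaxed))
    ; // Old is refreshed with the current value on each failed attempt
  return Old; // value observed immediately before the xor
}
```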
These changes add `-munsafe-int-atomics` and `-mno-unsafe-int-atomics` options to clang that can be used to switch between the two implementations. The default is the current behaviour of generating 'native' atomic xor instructions. When `-mno-unsafe-int-atomics` is passed to clang, functions are marked with the `amdgpu-unsafe-int-atomics` attribute (set to `false`) at the IR level, and this tells the backend to expand atomic xor to a cmpxchg loop. The options only affect the global and flat address spaces.
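A usage sketch based on the description above (a plain C++ stand-in for a device function; a real HIP kernel would carry the usual `__global__`/`__device__` qualifiers): compiling code like this with `-mno-unsafe-int-atomics` would mark the function with the `amdgpu-unsafe-int-atomics` attribute set to `false` in the IR, and the backend would then expand the fetch_xor below into a cmpxchg loop instead of a native atomic xor.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// With the default (-munsafe-int-atomics) this lowers to a native atomic
// xor; with -mno-unsafe-int-atomics the backend expands it to a cmpxchg
// loop, which is safe for fine-grained allocations reached over PCIe.
void toggle_flags(std::atomic<uint32_t> *Flags, std::size_t Index,
                  uint32_t Mask) {
  Flags[Index].fetch_xor(Mask, std::memory_order_relaxed);
}
```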