
Conversation

@pasaulais pasaulais commented Oct 12, 2023

On some systems, and when accessing certain memory areas, it is not safe to use atomic xor with AMDGPU. This may be the case when using fine-grained memory allocations (e.g. through hipMallocManaged) and when the atomic operation needs to go through the PCIe bus, which does not support atomic xor (PCIe 3.0 does support other atomic operations such as fetch_and_add and cmpxchg) [1].

The issue has been worked around in DPC++/HIP by prefetching memory before executing kernels [2]; however, this adds an overhead that can outweigh the performance benefit of using atomic xor over a cmpxchg loop. This is why a way to switch between 'native' atomic xor and a cmpxchg-loop implementation is needed.

These changes add -munsafe-int-atomics and -mno-unsafe-int-atomics options to clang that can be used to switch between the two implementations. The default is the current behaviour of generating 'native' atomic xor instructions. When -mno-unsafe-int-atomics is passed to clang, functions are marked with the amdgpu-unsafe-int-atomics attribute (set to false) at the IR level, which tells the backend to expand atomic xor to a cmpxchg loop. The options only affect the global and flat address spaces.
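For reference, here is a minimal sketch, in plain C++ with std::atomic rather than the generated AMDGPU code, of the two strategies the new options switch between:

```cpp
#include <atomic>
#include <cstdint>

// Native path: a single hardware atomic xor instruction.
uint32_t xor_native(std::atomic<uint32_t> &a, uint32_t v) {
  return a.fetch_xor(v);
}

// Expanded path: a cmpxchg loop, usable where atomic xor is not
// supported (e.g. over PCIe 3.0). On failure, 'old' is reloaded
// with the current value and the loop retries.
uint32_t xor_cmpxchg(std::atomic<uint32_t> &a, uint32_t v) {
  uint32_t old = a.load();
  while (!a.compare_exchange_weak(old, old ^ v)) {
  }
  return old;
}
```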

github-actions bot commented Oct 12, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@pasaulais pasaulais force-pushed the amdgpu-unsafe-atomic-xor branch 6 times, most recently from 43a46da to 508e577 on October 16, 2023 15:08
nikic and others added 23 commits December 1, 2023 12:18
If both icmps have the same operands and the RHS is constant, we
would currently go into the isImpliedCondMatchingOperands() code
path, instead of the isImpliedCondCommonOperandWithConstants()
path. Both are correct, but the latter can produce more accurate
results if the implication is dependent on the sign.
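A hedged illustration (invented here, not taken from the patch): with a common operand and a constant RHS, an unsigned bound can imply a signed one, but only because the constant is known to be non-negative, which is the kind of sign-dependent reasoning the constant-aware path can perform.

```cpp
#include <cassert>
#include <cstdint>

// If x <u 5 holds, x lies in [0, 4], so the signed comparison
// (int32_t)x < 5 must also hold; the implication depends on the
// constant 5 being non-negative.
void check(uint32_t x) {
  if (x < 5u)
    assert(static_cast<int32_t>(x) < 5);
}
```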
…3769)

getOperandLatency has the following behavior: it returns -1 as a special
value, negative numbers other than -1 on some target-specific overrides,
or a valid non-negative latency. This behavior can be surprising, as
some callers do arithmetic on these negative values. Change the
interface of getOperandLatency to return a std::optional<unsigned> to
prevent surprises in callers. While at it, change the interface of
getInstrLatency to return unsigned instead of int.

This change was inspired by a refactoring in
TargetSchedModel::computeOperandLatency.
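A minimal sketch of the resulting shape of the interface (parameter lists elided, helper invented for illustration):

```cpp
#include <optional>

// std::nullopt now stands in for the old negative sentinel values,
// so callers cannot accidentally do arithmetic on them.
std::optional<unsigned> getOperandLatency(/* DefMI, DefIdx, UseMI, UseIdx */);

// Callers handle the unknown case explicitly.
unsigned latencyOrDefault(std::optional<unsigned> Latency, unsigned Default) {
  return Latency.value_or(Default);
}
```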
The rendered document is not correctly indented because of this space.
This allows a YAML-based multilib configuration to specify explicitly
that a subset of its library directories are alternatives to each
other, i.e. at most one of that subset should be selected.

So if you have multiple sysroots each including a full set of headers
and libraries, you can mark them as members of the same mutually
exclusive group, and then you'll be sure that only one of them is
selected, even if two or more are compatible with the compile options.

This is particularly important in multilib setups including the libc++
headers, where selecting the include directories from two different
sysroots can cause an actual build failure. This occurs when including
<stdio.h>, for example: libc++'s stdio.h is included first, and will
try to use `#include_next` to fetch the underlying libc's version. But
if there are two include directories from separate multilibs, then
both of their C++ include directories will end up on the include path
first, followed by both the C directories. So the `#include_next` from
the first libc++ stdio.h will include the second libc++ stdio.h, which
will do nothing because it has the same include guard macro, and the
libc header won't ever be included at all.

If more than one of the options in an exclusive group matches the
given flags, the last one wins.

The syntax for specifying this in multilib.yaml is to define a Groups
section in which you specify your group names, and for each one,
declare it to have Type: Exclusive. (This reserves space in the syntax
for maybe adding other group types later, such as a group of mutually
_dependent_ things that you must have all or none of.) Then each
Variant record that's a member of a group has a Group: property giving
that group's name.
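A hedged sketch of the described multilib.yaml syntax (directory names and flags invented for illustration):

```yaml
Groups:
  - Name: sysroots
    Type: Exclusive

Variants:
  - Dir: newlib-sysroot
    Flags: [--target=armv7m-none-eabi]
    Group: sysroots
  - Dir: picolibc-sysroot
    Flags: [--target=armv7m-none-eabi]
    Group: sysroots
```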
…loat.

DenseMapAPIntKeyInfo looks like a redundant definition because it
mirrors the default used by DenseMap when not specified.

Replacing DenseMapAPFloatKeyInfo with a specialisation of
DenseMapInfo allows DenseMap<T> to be more easily used when T is
an aggregate type containing an APFloat.
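A hedged sketch (aggregate type invented) of the use this enables; the composite key's traits delegate to the new DenseMapInfo<APFloat> specialisation member-by-member:

```cpp
#include "llvm/ADT/APFloat.h"
#include "llvm/ADT/DenseMapInfo.h"

// Hypothetical aggregate key containing an APFloat.
struct ConstantKey {
  llvm::APFloat Val;
  unsigned Kind;
};

template <> struct llvm::DenseMapInfo<ConstantKey> {
  static ConstantKey getEmptyKey() {
    return {DenseMapInfo<llvm::APFloat>::getEmptyKey(), 0};
  }
  static ConstantKey getTombstoneKey() {
    return {DenseMapInfo<llvm::APFloat>::getTombstoneKey(), 0};
  }
  static unsigned getHashValue(const ConstantKey &K) {
    return llvm::detail::combineHashValue(
        DenseMapInfo<llvm::APFloat>::getHashValue(K.Val), K.Kind);
  }
  static bool isEqual(const ConstantKey &L, const ConstantKey &R) {
    return DenseMapInfo<llvm::APFloat>::isEqual(L.Val, R.Val) &&
           L.Kind == R.Kind;
  }
};
```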
Libomptarget cannot be built because of the recent refactoring
introduced in patch 148dec9:
[OpenMP][NFC] Separate Envar (environment variable) handling (llvm#73994)

That patch moved environment-variable handling out of the libomptarget
library, so the "llvm::omp::target" namespace is no longer needed when
handling environment variables.
Extract a function and call it with both operand orders, so that
we don't have to explicitly commute every single pattern.
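A hedged sketch (names invented) of the refactoring pattern:

```cpp
#include "llvm/IR/Value.h"

// The fold logic is written once for a fixed operand order...
static llvm::Value *foldImpl(llvm::Value *A, llvm::Value *B);

// ...and tried with both orders, so each pattern no longer needs an
// explicitly commuted twin.
llvm::Value *fold(llvm::Value *A, llvm::Value *B) {
  if (llvm::Value *V = foldImpl(A, B))
    return V;
  return foldImpl(B, A);
}
```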
…rgets (llvm#73834)

Follow-up on llvm#72873
    
When ADR/LDR instructions reference a label in a different section, the
offset is not known until link time; however, the assembler assumes it
can resolve them in some cases.

The previous patch addressed the issue for most LDR instructions,
focusing on little-endian targets.

This patch addresses the remaining work for ADRs and big-endian targets.
)

This patch continues the work started with ea5b1ef. See that commit and its corresponding PR for details.
This folds transpose(shape_cast) into a new shape_cast, when the
transpose just permutes a unit dim from the result of the shape_cast.

Example:

```
%0 = vector.shape_cast %vec : vector<[4]xf32> to vector<[4]x1xf32>
%1 = vector.transpose %0, [1, 0] : vector<[4]x1xf32> to vector<1x[4]xf32>
```

Folds to:
```
%0 = vector.shape_cast %vec : vector<[4]xf32> to vector<1x[4]xf32>
```

This is an (alternate) fix for lowering matmuls to ArmSME.
…NFC)

This is in order to use SQ inside of these. There doesn't seem to be
any strong need for these to be static.
llvm#74082)

…ation

The previous code was technically incorrect in that the type indicated
that the memref only has 1 dimension, while the code below was happily
dereferencing the size array out of bounds. Now, if the compiler doesn't
get too smart about optimizations, this code *might even work*. But, if
the compiler realizes that the array has 1 element it might start doing
silly things. This generates a specialization for each supported rank,
making sure we don't invoke any UB.
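A hedged sketch (layout mirrors the usual MLIR memref descriptor, but invented here) of the per-rank specialization idea: the array extents are part of the static type, so indexing with i < Rank is always in bounds:

```cpp
#include <cstdint>

// One descriptor type per rank; the sizes/strides extents match the
// rank, so loops bounded by Rank never read past the arrays.
template <int Rank> struct MemRefDescriptor {
  void *allocated;
  void *aligned;
  int64_t offset;
  int64_t sizes[Rank];
  int64_t strides[Rank];
};

template <int Rank> int64_t numElements(const MemRefDescriptor<Rank> &D) {
  int64_t N = 1;
  for (int i = 0; i < Rank; ++i)
    N *= D.sizes[i];
  return N;
}
```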
If the lshr operand is non-negative, we can treat it the same
way as an ashr. Ideally we would represent this as "lshr nneg",
but for now just perform the necessary ValueTracking query.

Proof: https://alive2.llvm.org/ce/z/Ahg4ri
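To illustrate the underlying fact with an invented example: for a non-negative value, logical and arithmetic right shifts agree, because the sign bit that would be replicated is zero:

```cpp
#include <cassert>
#include <cstdint>

void demo(int32_t x) {
  if (x >= 0) {
    // lshr-style (unsigned) and ashr-style (signed) shifts match
    // whenever the shifted value is non-negative.
    assert((static_cast<uint32_t>(x) >> 3) ==
           static_cast<uint32_t>(x >> 3));
  }
}
```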
…74038)

This does the same as llvm#72142 for vector.transfer_write. Previously the
pattern would silently drop the mask.
This reverts commit b92693a.

I've made a silly typo in the condition. Will reapply the corrected
version.
jayfoad and others added 26 commits December 4, 2023 13:20
Split off live-in printing to VPlan::printLiveIns and use it to print
live-ins when printing in the DOT format.
…73790)

This patch changes the bounds generation code shared between OpenMP and
OpenACC to always attach an extent to the generated bounds. This is
currently required for OpenMP descriptor lowering but may not
necessarily be required in the case of OpenACC.
…4311)

This patch continues the work started with ea5b1ef. See that commit and its corresponding PR for details.
Add a modernize version of the abseil-string-find-startswith check
using the C++20 `std::string::starts_with` and
`std::string_view::starts_with`. Following up from
llvm#72283.
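For illustration (invented example), the kind of rewrite the check suggests:

```cpp
#include <string>

bool hasPrefix(const std::string &S) {
  // before: return S.find("hello") == 0;
  return S.starts_with("hello");  // C++20
}
```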
llvm#73367)

This PR introduces DIGlobalVariableAttr and
DIGlobalVariableExpressionAttr so that ModuleTranslation can emit the
metadata needed for debug information about global variables.
The translator implementation for debug metadata needed to be refactored
in order to allow translation of nodes based on MDNode
(DIGlobalVariableExpressionAttr and DIExpression) in addition to
DINode-based nodes.

A DIGlobalVariableExpressionAttr can now be passed to the GlobalOp
operation directly and ModuleTranslation will create the respective
DIGlobalVariable and DIGlobalVariableExpression nodes. The compile unit
that DIGlobalVariable is expected to be configured with will be updated
with the created DIGlobalVariableExpression.
…lvm#74095)

In preparation for running clang-format on the whole code base, we are
also removing mentions of the legacy _LIBCPP_INLINE_VISIBILITY macro in
favor of the newer _LIBCPP_HIDE_FROM_ABI.

We're still leaving the definition of _LIBCPP_INLINE_VISIBILITY to avoid
creating needless breakage in case some older patches are checked-in
with mentions of the old macro. After we branch for LLVM 18, we can do
another pass to clean up remaining uses of the macro that might have
gotten introduced by mistake (if any) and remove the macro itself at the
same time. This is just a minor convenience to smooth out the transition
as much as possible.

See
https://discourse.llvm.org/t/rfc-clang-formatting-all-of-libc-once-and-for-all
for the clang-format proposal.
Enable by default for optimization levels higher than 0 (same behavior
as clang).

For simplicity, only forward the flag to the frontend driver when it
contradicts what is implied by the optimization level.

This was first landed in
llvm#73111 but was later reverted
due to a performance regression. That regression was fixed by
llvm#74065.
We can use Intrinsic::getDeclaration() here; we just have to pass
the correct arguments. This function accepts only the mangled types,
not all argument types.
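A hedged sketch (module assumed in scope, intrinsic chosen for illustration) of a call that passes only the overloaded type:

```cpp
#include "llvm/IR/Intrinsics.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/Type.h"

// llvm.umax is overloaded on a single type, so only that one mangled
// type is passed; the types of the individual arguments are not.
llvm::Function *getUMax32(llvm::Module &M) {
  return llvm::Intrinsic::getDeclaration(
      &M, llvm::Intrinsic::umax,
      {llvm::Type::getInt32Ty(M.getContext())});
}
```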
MinBWs contains entries that specify the minimum required bitwidth. In
some cases, the old and new bitwidths can be equal (see test case) and
in those cases no truncations are needed, so skip those cases.

Fixes llvm#74307.
…pass (llvm#74075)

The GPU dialect has `#gpu.address_space<workgroup>` for NVGPU shared
memory (address space 3). However, when IR combines the NVGPU and GPU
dialects, the `nvgpu-to-nvvm` pass fails due to a missing attribute
conversion.

This PR adds `populateGpuMemorySpaceAttributeConversions` to the
nvgpu-to-nvvm lowering, so `#gpu.address_space<workgroup>` can be used
with the `nvgpu-to-nvvm` pass.
This PR introduces the `nvvm.fence.proxy` op for the following cases:

```
nvvm.fence.proxy { kind = #nvvm.proxy_kind<alias>}
nvvm.fence.proxy { kind = #nvvm.proxy_kind<async>}
nvvm.fence.proxy { kind = #nvvm.proxy_kind<async.global>}
nvvm.fence.proxy { kind = #nvvm.proxy_kind<async.shared>, space = #nvvm.shared_space<cta>}
nvvm.fence.proxy { kind = #nvvm.proxy_kind<async.shared>, space = #nvvm.shared_space<cluster>}
```
Clang currently implements a set of vector rotate builtins
(__builtin_s390_verll*) in terms of platform-specific LLVM
intrinsics.  To simplify the IR (and allow for common code
optimizations if applicable), this patch removes those LLVM
intrinsics and implements the builtins in terms of the
platform-independent funnel shift intrinsics instead.

Also, fix the prototype of the __builtin_s390_verll*
builtins for full compatibility with GCC.
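For reference, a plain-C++ illustration (not from the patch) of the equivalence the lowering relies on: a rotate is a funnel shift with both inputs equal, i.e. rotl(x, n) == fshl(x, x, n):

```cpp
#include <cstdint>

// C++ equivalent of llvm.fshl.i32(x, x, n); masking the shift
// amounts keeps both shifts well-defined, including n == 0.
uint32_t rotl32(uint32_t x, uint32_t n) {
  n &= 31;
  return (x << n) | (x >> ((32 - n) & 31));
}
```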
The builtins that expand to the vlrl/vlrlr and vstrl/vstrlr
instructions are currently named inconsistently between GCC
and clang.  Rename the clang versions to match GCC.
The __builtin_s390_vceq* family of builtins currently takes
signed arguments with clang, but unsigned with GCC. Update
clang to match existing GCC precedent.
pasaulais pushed a commit that referenced this pull request Dec 4, 2023
…lvm#73463)

Despite CWG2497 not being resolved, it is reasonable to expect the
following code to compile (it is supported by other compilers):
```cpp
  template<typename T> constexpr T f();
  constexpr int g() { return f<int>(); } // #1
  template<typename T> constexpr T f() { return 123; }
  int k[g()];
  // #2
```

To that end, we eagerly instantiate all referenced specializations of
constexpr functions when they are defined.

We maintain a map of (pattern, [instantiations]) independent of
`PendingInstantiations` to avoid having to iterate that list after each
function definition.

We should apply the same logic to constexpr variables, but I wanted to
keep the PR small.

Fixes llvm#73232
@pasaulais pasaulais force-pushed the amdgpu-unsafe-atomic-xor branch from 508e577 to 24b0e52 on December 4, 2023 18:11
@pasaulais pasaulais force-pushed the amdgpu-unsafe-atomic-xor branch from 24b0e52 to cd33e84 on December 4, 2023 18:12