
Conversation

MatzeB (Contributor) commented Apr 6, 2024

Depending on the TLS model, many thread-local accesses on x86 can be expressed by adding a %fs: segment-register prefix to an addressing mode. Even if there are multiple users of an llvm.threadlocal.address intrinsic, it is generally not worth sharing the value in a register; it is better to fold the %fs access into each of the addressing modes.

Hence this changes CodeGenPrepare to duplicate the llvm.threadlocal.address intrinsic as necessary.

Introduces a new TargetLowering::addressingModeSupportsTLS callback that allows targets to indicate whether TLS accesses can be part of an addressing mode.

This fixes a performance problem: folding TLS accesses into multiple addressing modes happened naturally before the introduction of the llvm.threadlocal.address intrinsic, but regressed because SelectionDAG keeps values in registers when they are accessed across basic blocks, so CodeGenPrepare needs to duplicate the intrinsic to mitigate this. We see a ~0.5% recovery in a codebase with heavy TLS usage (HHVM).

This fixes #87437
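
For illustration, here is a minimal sketch of what the X86 override of this callback might look like, based on the description above and the LocalExec/InitialExec enablement mentioned later in the thread; the exact signature and conditions are assumptions, not a quote of the merged patch:

  // Hedged sketch, not the merged code: report that a TLS access to this
  // global can be folded into an x86 addressing mode (a %fs-relative
  // access), so CodeGenPrepare may duplicate llvm.threadlocal.address per
  // use instead of keeping the address in a shared register.
  bool X86TargetLowering::addressingModeSupportsTLS(const GlobalValue &GV) const {
    if (!Subtarget.is64Bit() || !Subtarget.isTargetELF())
      return false;
    TLSModel::Model Model = getTargetMachine().getTLSModel(&GV);
    // LocalExec uses %fs:foo@TPOFF directly; InitialExec loads the offset
    // from the GOT and then addresses through %fs:(%reg).
    return Model == TLSModel::LocalExec || Model == TLSModel::InitialExec;
  }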

MatzeB (Contributor, Author) commented Apr 6, 2024

I also submitted #87841 to verify the assumption that arg0 of threadlocal.address is a thread-local GlobalVariable. (Turns out it currently fails 1 test, but it's debatable whether the test is okay).

@MatzeB MatzeB marked this pull request as ready for review April 6, 2024 00:05
llvmbot (Member) commented Apr 6, 2024

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-x86

Author: Matthias Braun (MatzeB)

Changes

The threadlocal_address intrinsic is currently ignored/removed for
instruction selection by the SelectionDAGBuilder (see also
https://reviews.llvm.org/D125291 ).

However, being an Instruction means SelectionDAG will assign a register
to it and share the value across basic blocks. This sharing is
suboptimal in the "LocalExec" TLS model on x86 where it is cheaper to
just recompute the address. We saw a 0.5% regression in a codebase with
a lot of TLS usage (HHVM).

This introduces a new cheapToRecomputeTLSAddress target lowering
callback and removes the threadlocal_address intrinsic in
CodeGenPrepare to restore the efficient behavior from before the
introduction of the threadlocal_address intrinsic.

This fixes #87437


Full diff: https://github.com/llvm/llvm-project/pull/87844.diff

5 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/TargetLowering.h (+7)
  • (modified) llvm/lib/CodeGen/CodeGenPrepare.cpp (+12)
  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+10)
  • (modified) llvm/lib/Target/X86/X86ISelLowering.h (+2)
  • (added) llvm/test/CodeGen/X86/tls-multi-use.ll (+149)
diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index a4dc097446186a..0152e8506cdb75 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -3393,6 +3393,13 @@ class TargetLoweringBase {
     return nullptr;
   }
 
+  /// Returns true if thread local storage (TLS) addresses are so cheap to
+  /// re-compute that it is not worth keeping them in a register between basic
+  /// blocks.
+  virtual bool cheapToRecomputeTLSAddress(const GlobalVariable &) const {
+    return false;
+  }
+
   //===--------------------------------------------------------------------===//
   // Runtime Library hooks
   //
diff --git a/llvm/lib/CodeGen/CodeGenPrepare.cpp b/llvm/lib/CodeGen/CodeGenPrepare.cpp
index e657872c382848..32caf402844213 100644
--- a/llvm/lib/CodeGen/CodeGenPrepare.cpp
+++ b/llvm/lib/CodeGen/CodeGenPrepare.cpp
@@ -2528,6 +2528,18 @@ bool CodeGenPrepare::optimizeCallInst(CallInst *CI, ModifyDT &ModifiedDT) {
       return optimizeGatherScatterInst(II, II->getArgOperand(0));
     case Intrinsic::masked_scatter:
       return optimizeGatherScatterInst(II, II->getArgOperand(1));
+    case Intrinsic::threadlocal_address:
+      // SelectionDAGBuilder currently skips this intrinsic anyway; but removing
+      // it earlier means the addresses will not be kept in registers across
+      // basic blocks but recomputed. This is preferable on architectures where
+      // TLS is part of normal addressing modes.
+      GlobalVariable &GV = cast<GlobalVariable>(*II->getArgOperand(0));
+      if (TLI->cheapToRecomputeTLSAddress(GV)) {
+        replaceAllUsesWith(II, &GV, FreshBBs, IsHugeFunc);
+        II->eraseFromParent();
+        return true;
+      }
+      break;
     }
 
     SmallVector<Value *, 2> PtrOps;
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 6f65344215c020..ba99abcb9899b5 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -18913,6 +18913,16 @@ X86TargetLowering::LowerGlobalTLSAddress(SDValue Op, SelectionDAG &DAG) const {
   llvm_unreachable("TLS not implemented for this target.");
 }
 
+bool X86TargetLowering::cheapToRecomputeTLSAddress(
+    const GlobalVariable &GV) const {
+  if (Subtarget.is64Bit() && Subtarget.isTargetELF()) {
+    const TargetMachine &TM = getTargetMachine();
+    TLSModel::Model Model = TM.getTLSModel(&GV);
+    return Model == TLSModel::LocalExec;
+  }
+  return false;
+}
+
 /// Lower SRA_PARTS and friends, which return two i32 values
 /// and take a 2 x i32 value to shift plus a shift amount.
 /// TODO: Can this be moved to general expansion code?
diff --git a/llvm/lib/Target/X86/X86ISelLowering.h b/llvm/lib/Target/X86/X86ISelLowering.h
index 0a1e8ca4427314..e3480a84bd9581 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.h
+++ b/llvm/lib/Target/X86/X86ISelLowering.h
@@ -1661,6 +1661,8 @@ namespace llvm {
     SDValue LowerGlobalTLSAddress(SDValue Op, SelectionDAG &DAG) const;
     SDValue LowerExternalSymbol(SDValue Op, SelectionDAG &DAG) const;
 
+    bool cheapToRecomputeTLSAddress(const GlobalVariable &GV) const override;
+
     /// Creates target global address or external symbol nodes for calls or
     /// other uses.
     SDValue LowerGlobalOrExternal(SDValue Op, SelectionDAG &DAG,
diff --git a/llvm/test/CodeGen/X86/tls-multi-use.ll b/llvm/test/CodeGen/X86/tls-multi-use.ll
new file mode 100644
index 00000000000000..8a0ac027e0973e
--- /dev/null
+++ b/llvm/test/CodeGen/X86/tls-multi-use.ll
@@ -0,0 +1,149 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
+; RUN: llc -o - %s | FileCheck %s
+; RUN: llc -o - -relocation-model=pic %s | FileCheck --check-prefix=CHECK_PIC %s
+target triple = "x86_64--linux-gnu"
+
+@foo = dso_local thread_local global i32 0, align 4
+@foo_nonlocal = thread_local global i32 0, align 4
+
+declare i32 @rand()
+declare nonnull ptr @llvm.threadlocal.address.p0(ptr nonnull)
+
+define i32 @tls_multi_use(i32 %arg) {
+; CHECK-LABEL: tls_multi_use:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    pushq %rbx
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    .cfi_offset %rbx, -16
+; CHECK-NEXT:    movl %fs:foo@TPOFF, %ebx
+; CHECK-NEXT:    testl %edi, %edi
+; CHECK-NEXT:    movl %ebx, %eax
+; CHECK-NEXT:    jne .LBB0_2
+; CHECK-NEXT:  # %bb.1: # %if.then
+; CHECK-NEXT:    callq rand@PLT
+; CHECK-NEXT:    movl %fs:foo@TPOFF, %eax
+; CHECK-NEXT:  .LBB0_2: # %if.end
+; CHECK-NEXT:    addl %eax, %ebx
+; CHECK-NEXT:    movl %ebx, %eax
+; CHECK-NEXT:    popq %rbx
+; CHECK-NEXT:    .cfi_def_cfa_offset 8
+; CHECK-NEXT:    retq
+;
+; CHECK_PIC-LABEL: tls_multi_use:
+; CHECK_PIC:       # %bb.0: # %entry
+; CHECK_PIC-NEXT:    pushq %rbp
+; CHECK_PIC-NEXT:    .cfi_def_cfa_offset 16
+; CHECK_PIC-NEXT:    pushq %r14
+; CHECK_PIC-NEXT:    .cfi_def_cfa_offset 24
+; CHECK_PIC-NEXT:    pushq %rbx
+; CHECK_PIC-NEXT:    .cfi_def_cfa_offset 32
+; CHECK_PIC-NEXT:    .cfi_offset %rbx, -32
+; CHECK_PIC-NEXT:    .cfi_offset %r14, -24
+; CHECK_PIC-NEXT:    .cfi_offset %rbp, -16
+; CHECK_PIC-NEXT:    movl %edi, %ebp
+; CHECK_PIC-NEXT:    leaq .Lfoo$local@TLSLD(%rip), %rdi
+; CHECK_PIC-NEXT:    callq __tls_get_addr@PLT
+; CHECK_PIC-NEXT:    movl .Lfoo$local@DTPOFF(%rax), %ebx
+; CHECK_PIC-NEXT:    testl %ebp, %ebp
+; CHECK_PIC-NEXT:    movl %ebx, %ecx
+; CHECK_PIC-NEXT:    jne .LBB0_2
+; CHECK_PIC-NEXT:  # %bb.1: # %if.then
+; CHECK_PIC-NEXT:    leaq .Lfoo$local@DTPOFF(%rax), %r14
+; CHECK_PIC-NEXT:    callq rand@PLT
+; CHECK_PIC-NEXT:    movl (%r14), %ecx
+; CHECK_PIC-NEXT:  .LBB0_2: # %if.end
+; CHECK_PIC-NEXT:    addl %ecx, %ebx
+; CHECK_PIC-NEXT:    movl %ebx, %eax
+; CHECK_PIC-NEXT:    popq %rbx
+; CHECK_PIC-NEXT:    .cfi_def_cfa_offset 24
+; CHECK_PIC-NEXT:    popq %r14
+; CHECK_PIC-NEXT:    .cfi_def_cfa_offset 16
+; CHECK_PIC-NEXT:    popq %rbp
+; CHECK_PIC-NEXT:    .cfi_def_cfa_offset 8
+; CHECK_PIC-NEXT:    retq
+entry:
+  %addr = tail call align 4 ptr @llvm.threadlocal.address.p0(ptr align 4 @foo)
+  %load0 = load i32, ptr %addr, align 4
+  %cond = icmp eq i32 %arg, 0
+  br i1 %cond, label %if.then, label %if.end
+
+if.then:
+  tail call i32 @rand()
+  %load1 = load i32, ptr %addr, align 4
+  br label %if.end
+
+if.end:
+  %phi = phi i32 [ %load1, %if.then ], [ %load0, %entry ]
+  %add = add nsw i32 %load0, %phi
+  ret i32 %add
+}
+
+
+define i32 @tls_multi_use_nonlocal(i32 %arg) {
+; CHECK-LABEL: tls_multi_use_nonlocal:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    pushq %rbx
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    .cfi_offset %rbx, -16
+; CHECK-NEXT:    movl %fs:foo@TPOFF, %ebx
+; CHECK-NEXT:    testl %edi, %edi
+; CHECK-NEXT:    movl %ebx, %eax
+; CHECK-NEXT:    jne .LBB1_2
+; CHECK-NEXT:  # %bb.1: # %if.then
+; CHECK-NEXT:    callq rand@PLT
+; CHECK-NEXT:    movl %fs:foo@TPOFF, %eax
+; CHECK-NEXT:  .LBB1_2: # %if.end
+; CHECK-NEXT:    addl %eax, %ebx
+; CHECK-NEXT:    movl %ebx, %eax
+; CHECK-NEXT:    popq %rbx
+; CHECK-NEXT:    .cfi_def_cfa_offset 8
+; CHECK-NEXT:    retq
+;
+; CHECK_PIC-LABEL: tls_multi_use_nonlocal:
+; CHECK_PIC:       # %bb.0: # %entry
+; CHECK_PIC-NEXT:    pushq %rbp
+; CHECK_PIC-NEXT:    .cfi_def_cfa_offset 16
+; CHECK_PIC-NEXT:    pushq %r14
+; CHECK_PIC-NEXT:    .cfi_def_cfa_offset 24
+; CHECK_PIC-NEXT:    pushq %rbx
+; CHECK_PIC-NEXT:    .cfi_def_cfa_offset 32
+; CHECK_PIC-NEXT:    .cfi_offset %rbx, -32
+; CHECK_PIC-NEXT:    .cfi_offset %r14, -24
+; CHECK_PIC-NEXT:    .cfi_offset %rbp, -16
+; CHECK_PIC-NEXT:    movl %edi, %ebp
+; CHECK_PIC-NEXT:    leaq .Lfoo$local@TLSLD(%rip), %rdi
+; CHECK_PIC-NEXT:    callq __tls_get_addr@PLT
+; CHECK_PIC-NEXT:    movl .Lfoo$local@DTPOFF(%rax), %ebx
+; CHECK_PIC-NEXT:    testl %ebp, %ebp
+; CHECK_PIC-NEXT:    movl %ebx, %ecx
+; CHECK_PIC-NEXT:    jne .LBB1_2
+; CHECK_PIC-NEXT:  # %bb.1: # %if.then
+; CHECK_PIC-NEXT:    leaq .Lfoo$local@DTPOFF(%rax), %r14
+; CHECK_PIC-NEXT:    callq rand@PLT
+; CHECK_PIC-NEXT:    movl (%r14), %ecx
+; CHECK_PIC-NEXT:  .LBB1_2: # %if.end
+; CHECK_PIC-NEXT:    addl %ecx, %ebx
+; CHECK_PIC-NEXT:    movl %ebx, %eax
+; CHECK_PIC-NEXT:    popq %rbx
+; CHECK_PIC-NEXT:    .cfi_def_cfa_offset 24
+; CHECK_PIC-NEXT:    popq %r14
+; CHECK_PIC-NEXT:    .cfi_def_cfa_offset 16
+; CHECK_PIC-NEXT:    popq %rbp
+; CHECK_PIC-NEXT:    .cfi_def_cfa_offset 8
+; CHECK_PIC-NEXT:    retq
+entry:
+  %addr = tail call align 4 ptr @llvm.threadlocal.address.p0(ptr align 4 @foo)
+  %load0 = load i32, ptr %addr, align 4
+  %cond = icmp eq i32 %arg, 0
+  br i1 %cond, label %if.then, label %if.end
+
+if.then:
+  tail call i32 @rand()
+  %load1 = load i32, ptr %addr, align 4
+  br label %if.end
+
+if.end:
+  %phi = phi i32 [ %load1, %if.then ], [ %load0, %entry ]
+  %add = add nsw i32 %load0, %phi
+  ret i32 %add
+}

ChuanqiXu9 (Member) left a comment

I am not familiar with the code in the backend, but it looks good at a high level. Please wait a few days in case there are more comments.

@MatzeB MatzeB force-pushed the less_tls_spills branch from fc482ad to c25fd0b Compare April 8, 2024 17:39
jyknight (Member) commented Apr 8, 2024

Hm. I think this is an unfortunate way to fix the issue. Long-term, I'd like to get to the point where it is no longer valid to reference a TLS value directly in IR as if it were a simple address; instead, accesses MUST always go through a call. We aren't in that ideal state now, but this moves further away from it rather than towards it.

MatzeB (Contributor, Author) commented Apr 8, 2024

> Hm. I think this is an unfortunate way to fix the issue. Long-term, I'd like to get to the point where it is no longer valid to reference a TLS value directly in IR as if it were a simple address; instead, accesses MUST always go through a call. We aren't in that ideal state now, but this moves further away from it rather than towards it.

This isn't really changing the IR in the actual transform/optimization phases of LLVM...

I was thinking of this as yet another workaround where massaging the IR in CodeGenPrepare is far easier than doing things in SelectionDAG because of its "basic block at a time" design. As far as I can tell, it is fundamentally baked into SelectionDAG today that instructions read across basic blocks need registers assigned, and only Values not tied to instructions can be seen immediately. I initially tried fixing this in SelectionDAG, but that would have needed more fundamental changes there than seemed reasonable for a small improvement like this...

nikic (Contributor) commented Apr 9, 2024

I agree with @jyknight that we should avoid dropping the threadlocal.address intrinsics in this way. The long-term plan for TLS was to change TLS globals to return a token value, and make threadlocal.address accept that token and return the actual pointer. This change wouldn't be compatible with that, as we'd replace the pointer with the token.

You do need to address this in CGP, but I'd expect it to be in the same way as we do it for address calculations in general: By duplicating/sinking. Possibly this is something that should be directly integrated in the address mode logic (as this is ultimately about address modes), but if nothing else you can duplicate the threadlocal.address intrinsics separately.
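
To make the "duplicate the threadlocal.address intrinsics separately" option concrete, here is a minimal standalone sketch; the helper name is hypothetical, PHI users are skipped for brevity, and the merged patch integrates this with the address-mode sinking logic instead:

  // Give every user its own llvm.threadlocal.address call so SelectionDAG
  // never needs to carry the address in a register across basic blocks;
  // isel can then fold each copy into its single user's addressing mode.
  static bool duplicateThreadLocalAddressPerUse(IntrinsicInst *II) {
    if (II->hasOneUse())
      return false;
    bool Changed = false;
    for (Use &U : llvm::make_early_inc_range(II->uses())) {
      auto *UserInst = dyn_cast<Instruction>(U.getUser());
      // A PHI user would need the clone in the incoming block; omitted here.
      if (!UserInst || isa<PHINode>(UserInst))
        continue;
      Instruction *Clone = II->clone();
      Clone->insertBefore(UserInst);
      U.set(Clone);
      Changed = true;
    }
    if (II->use_empty())
      II->eraseFromParent();
    return Changed;
  }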

MatzeB (Contributor, Author) commented Apr 11, 2024

Reworked this change in terms of CodeGenPrepare's address-mode sinking logic and enabled the duplication of llvm.threadlocal.address for X86 TLSModel::LocalExec and TLSModel::InitialExec, where it improves code quality.
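
For context, a rough sketch of where such a check could sit in CGP's address-mode matcher — purely illustrative; the variable names and surrounding control flow are assumptions about code not shown here:

  // When the address being matched is produced by llvm.threadlocal.address,
  // only allow folding/duplication if the target says TLS accesses fit in
  // an addressing mode (e.g. x86 LocalExec/InitialExec).
  if (auto *II = dyn_cast<IntrinsicInst>(Addr))
    if (II->getIntrinsicID() == Intrinsic::threadlocal_address) {
      auto &GV = cast<GlobalValue>(*II->getArgOperand(0));
      if (!TLI.addressingModeSupportsTLS(GV))
        return false; // not profitable; keep the shared register
    }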

nikic (Contributor) left a comment

LGTM, but please give it a day in case there's more feedback.

@MatzeB MatzeB changed the title CodeGenPrepare: Remove threadlocal_address intrinsic when cheap to recompute. CodeGenPrepare: Add support for llvm.threadlocal.address address-mode sinking Apr 11, 2024
MatzeB added 3 commits April 12, 2024 16:07
@nhaehnle nhaehnle removed their request for review April 17, 2024 10:33
@MatzeB MatzeB merged commit 652bcf6 into llvm:main Apr 17, 2024