
Conversation

MatzeB (Contributor) commented Apr 6, 2024

Depending on the TLS model, many thread-local accesses on x86 can be expressed by adding a %fs: segment-register prefix to an addressing mode. Even if there are multiple users of an llvm.threadlocal.address intrinsic, it is generally not worth sharing the value in a register; it is better to fold the %fs access into each of the addressing modes.

Hence this changes CodeGenPrepare to duplicate the llvm.threadlocal.address intrinsic as necessary.

Introduces a new TargetLowering::addressingModeSupportsTLS callback that allows targets to indicate whether TLS accesses can be part of an addressing mode.

This fixes a performance problem: folding TLS accesses into multiple addressing modes happened naturally before the introduction of the llvm.threadlocal.address intrinsic, but regressed because SelectionDAG keeps values in registers when they are accessed across basic blocks, so CodeGenPrepare needs to duplicate the intrinsic to mitigate this. We see a ~0.5% recovery in a codebase with heavy TLS usage (HHVM).

This fixes #87437
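
For illustration, here is a minimal sketch of what the X86 override of this callback might look like, based on the description above and the LocalExec/InitialExec enablement mentioned later in the thread; the exact signature and conditions are assumptions, not a quote of the merged patch:

  // Hedged sketch, not the merged code: report that a TLS access to this
  // global can be folded into an x86 addressing mode (a %fs-relative
  // access), so CodeGenPrepare may duplicate llvm.threadlocal.address per
  // use instead of keeping the address in a shared register.
  bool X86TargetLowering::addressingModeSupportsTLS(const GlobalValue &GV) const {
    if (!Subtarget.is64Bit() || !Subtarget.isTargetELF())
      return false;
    TLSModel::Model Model = getTargetMachine().getTLSModel(&GV);
    // LocalExec uses %fs:foo@TPOFF directly; InitialExec loads the offset
    // from the GOT and then addresses through %fs:(%reg).
    return Model == TLSModel::LocalExec || Model == TLSModel::InitialExec;
  }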

MatzeB (Contributor, Author) commented Apr 6, 2024

I also submitted #87841 to verify the assumption that arg0 of threadlocal.address is a thread-local GlobalVariable. (Turns out it currently fails 1 test, but it's debatable whether the test is okay).

@MatzeB MatzeB marked this pull request as ready for review April 6, 2024 00:05
llvmbot (Member) commented Apr 6, 2024

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-x86

Author: Matthias Braun (MatzeB)

Changes

The threadlocal_address intrinsic is currently ignored/removed for
instruction selection by the SelectionDAGBuilder (see also
https://reviews.llvm.org/D125291 ).

However, being an Instruction means SelectionDAG will assign a register
to it and share the value across basic blocks. This sharing is
suboptimal in the "LocalExec" TLS model on x86 where it is cheaper to
just recompute the address. We saw a 0.5% regression in a codebase with
a lot of TLS usage (HHVM).

This introduces a new cheapToRecomputeTLSAddress target lowering
callback and removes the threadlocal_address intrinsic in
CodeGenPrepare to restore the efficient behavior from before the
introduction of the threadlocal_address intrinsic.

This fixes #87437


Full diff: https://github.com/llvm/llvm-project/pull/87844.diff

5 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/TargetLowering.h (+7)
  • (modified) llvm/lib/CodeGen/CodeGenPrepare.cpp (+12)
  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+10)
  • (modified) llvm/lib/Target/X86/X86ISelLowering.h (+2)
  • (added) llvm/test/CodeGen/X86/tls-multi-use.ll (+149)
diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index a4dc097446186a..0152e8506cdb75 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -3393,6 +3393,13 @@ class TargetLoweringBase {
     return nullptr;
   }
 
+  /// Returns true if thread local storage (TLS) addresses are so cheap to
+  /// re-compute that it is not worth keeping them in a register between basic
+  /// blocks.
+  virtual bool cheapToRecomputeTLSAddress(const GlobalVariable &) const {
+    return false;
+  }
+
   //===--------------------------------------------------------------------===//
   // Runtime Library hooks
   //
diff --git a/llvm/lib/CodeGen/CodeGenPrepare.cpp b/llvm/lib/CodeGen/CodeGenPrepare.cpp
index e657872c382848..32caf402844213 100644
--- a/llvm/lib/CodeGen/CodeGenPrepare.cpp
+++ b/llvm/lib/CodeGen/CodeGenPrepare.cpp
@@ -2528,6 +2528,18 @@ bool CodeGenPrepare::optimizeCallInst(CallInst *CI, ModifyDT &ModifiedDT) {
       return optimizeGatherScatterInst(II, II->getArgOperand(0));
     case Intrinsic::masked_scatter:
       return optimizeGatherScatterInst(II, II->getArgOperand(1));
+    case Intrinsic::threadlocal_address:
+      // SelectionDAGBuilder currently skips this intrinsic anyway; but removing
+      // it earlier means the addresses will not be kept in registers across
+      // basic blocks but recomputed. This is preferable on architectures where
+      // TLS is part of normal addressing modes.
+      GlobalVariable &GV = cast<GlobalVariable>(*II->getArgOperand(0));
+      if (TLI->cheapToRecomputeTLSAddress(GV)) {
+        replaceAllUsesWith(II, &GV, FreshBBs, IsHugeFunc);
+        II->eraseFromParent();
+        return true;
+      }
+      break;
     }
 
     SmallVector<Value *, 2> PtrOps;
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 6f65344215c020..ba99abcb9899b5 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -18913,6 +18913,16 @@ X86TargetLowering::LowerGlobalTLSAddress(SDValue Op, SelectionDAG &DAG) const {
   llvm_unreachable("TLS not implemented for this target.");
 }
 
+bool X86TargetLowering::cheapToRecomputeTLSAddress(
+    const GlobalVariable &GV) const {
+  if (Subtarget.is64Bit() && Subtarget.isTargetELF()) {
+    const TargetMachine &TM = getTargetMachine();
+    TLSModel::Model Model = TM.getTLSModel(&GV);
+    return Model == TLSModel::LocalExec;
+  }
+  return false;
+}
+
 /// Lower SRA_PARTS and friends, which return two i32 values
 /// and take a 2 x i32 value to shift plus a shift amount.
 /// TODO: Can this be moved to general expansion code?
diff --git a/llvm/lib/Target/X86/X86ISelLowering.h b/llvm/lib/Target/X86/X86ISelLowering.h
index 0a1e8ca4427314..e3480a84bd9581 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.h
+++ b/llvm/lib/Target/X86/X86ISelLowering.h
@@ -1661,6 +1661,8 @@ namespace llvm {
     SDValue LowerGlobalTLSAddress(SDValue Op, SelectionDAG &DAG) const;
     SDValue LowerExternalSymbol(SDValue Op, SelectionDAG &DAG) const;
 
+    bool cheapToRecomputeTLSAddress(const GlobalVariable &GV) const override;
+
     /// Creates target global address or external symbol nodes for calls or
     /// other uses.
     SDValue LowerGlobalOrExternal(SDValue Op, SelectionDAG &DAG,
diff --git a/llvm/test/CodeGen/X86/tls-multi-use.ll b/llvm/test/CodeGen/X86/tls-multi-use.ll
new file mode 100644
index 00000000000000..8a0ac027e0973e
--- /dev/null
+++ b/llvm/test/CodeGen/X86/tls-multi-use.ll
@@ -0,0 +1,149 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
+; RUN: llc -o - %s | FileCheck %s
+; RUN: llc -o - -relocation-model=pic %s | FileCheck --check-prefix=CHECK_PIC %s
+target triple = "x86_64--linux-gnu"
+
+@foo = dso_local thread_local global i32 0, align 4
+@foo_nonlocal = thread_local global i32 0, align 4
+
+declare i32 @rand()
+declare nonnull ptr @llvm.threadlocal.address.p0(ptr nonnull)
+
+define i32 @tls_multi_use(i32 %arg) {
+; CHECK-LABEL: tls_multi_use:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    pushq %rbx
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    .cfi_offset %rbx, -16
+; CHECK-NEXT:    movl %fs:foo@TPOFF, %ebx
+; CHECK-NEXT:    testl %edi, %edi
+; CHECK-NEXT:    movl %ebx, %eax
+; CHECK-NEXT:    jne .LBB0_2
+; CHECK-NEXT:  # %bb.1: # %if.then
+; CHECK-NEXT:    callq rand@PLT
+; CHECK-NEXT:    movl %fs:foo@TPOFF, %eax
+; CHECK-NEXT:  .LBB0_2: # %if.end
+; CHECK-NEXT:    addl %eax, %ebx
+; CHECK-NEXT:    movl %ebx, %eax
+; CHECK-NEXT:    popq %rbx
+; CHECK-NEXT:    .cfi_def_cfa_offset 8
+; CHECK-NEXT:    retq
+;
+; CHECK_PIC-LABEL: tls_multi_use:
+; CHECK_PIC:       # %bb.0: # %entry
+; CHECK_PIC-NEXT:    pushq %rbp
+; CHECK_PIC-NEXT:    .cfi_def_cfa_offset 16
+; CHECK_PIC-NEXT:    pushq %r14
+; CHECK_PIC-NEXT:    .cfi_def_cfa_offset 24
+; CHECK_PIC-NEXT:    pushq %rbx
+; CHECK_PIC-NEXT:    .cfi_def_cfa_offset 32
+; CHECK_PIC-NEXT:    .cfi_offset %rbx, -32
+; CHECK_PIC-NEXT:    .cfi_offset %r14, -24
+; CHECK_PIC-NEXT:    .cfi_offset %rbp, -16
+; CHECK_PIC-NEXT:    movl %edi, %ebp
+; CHECK_PIC-NEXT:    leaq .Lfoo$local@TLSLD(%rip), %rdi
+; CHECK_PIC-NEXT:    callq __tls_get_addr@PLT
+; CHECK_PIC-NEXT:    movl .Lfoo$local@DTPOFF(%rax), %ebx
+; CHECK_PIC-NEXT:    testl %ebp, %ebp
+; CHECK_PIC-NEXT:    movl %ebx, %ecx
+; CHECK_PIC-NEXT:    jne .LBB0_2
+; CHECK_PIC-NEXT:  # %bb.1: # %if.then
+; CHECK_PIC-NEXT:    leaq .Lfoo$local@DTPOFF(%rax), %r14
+; CHECK_PIC-NEXT:    callq rand@PLT
+; CHECK_PIC-NEXT:    movl (%r14), %ecx
+; CHECK_PIC-NEXT:  .LBB0_2: # %if.end
+; CHECK_PIC-NEXT:    addl %ecx, %ebx
+; CHECK_PIC-NEXT:    movl %ebx, %eax
+; CHECK_PIC-NEXT:    popq %rbx
+; CHECK_PIC-NEXT:    .cfi_def_cfa_offset 24
+; CHECK_PIC-NEXT:    popq %r14
+; CHECK_PIC-NEXT:    .cfi_def_cfa_offset 16
+; CHECK_PIC-NEXT:    popq %rbp
+; CHECK_PIC-NEXT:    .cfi_def_cfa_offset 8
+; CHECK_PIC-NEXT:    retq
+entry:
+  %addr = tail call align 4 ptr @llvm.threadlocal.address.p0(ptr align 4 @foo)
+  %load0 = load i32, ptr %addr, align 4
+  %cond = icmp eq i32 %arg, 0
+  br i1 %cond, label %if.then, label %if.end
+
+if.then:
+  tail call i32 @rand()
+  %load1 = load i32, ptr %addr, align 4
+  br label %if.end
+
+if.end:
+  %phi = phi i32 [ %load1, %if.then ], [ %load0, %entry ]
+  %add = add nsw i32 %load0, %phi
+  ret i32 %add
+}
+
+
+define i32 @tls_multi_use_nonlocal(i32 %arg) {
+; CHECK-LABEL: tls_multi_use_nonlocal:
+; CHECK:       # %bb.0: # %entry
+; CHECK-NEXT:    pushq %rbx
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    .cfi_offset %rbx, -16
+; CHECK-NEXT:    movl %fs:foo@TPOFF, %ebx
+; CHECK-NEXT:    testl %edi, %edi
+; CHECK-NEXT:    movl %ebx, %eax
+; CHECK-NEXT:    jne .LBB1_2
+; CHECK-NEXT:  # %bb.1: # %if.then
+; CHECK-NEXT:    callq rand@PLT
+; CHECK-NEXT:    movl %fs:foo@TPOFF, %eax
+; CHECK-NEXT:  .LBB1_2: # %if.end
+; CHECK-NEXT:    addl %eax, %ebx
+; CHECK-NEXT:    movl %ebx, %eax
+; CHECK-NEXT:    popq %rbx
+; CHECK-NEXT:    .cfi_def_cfa_offset 8
+; CHECK-NEXT:    retq
+;
+; CHECK_PIC-LABEL: tls_multi_use_nonlocal:
+; CHECK_PIC:       # %bb.0: # %entry
+; CHECK_PIC-NEXT:    pushq %rbp
+; CHECK_PIC-NEXT:    .cfi_def_cfa_offset 16
+; CHECK_PIC-NEXT:    pushq %r14
+; CHECK_PIC-NEXT:    .cfi_def_cfa_offset 24
+; CHECK_PIC-NEXT:    pushq %rbx
+; CHECK_PIC-NEXT:    .cfi_def_cfa_offset 32
+; CHECK_PIC-NEXT:    .cfi_offset %rbx, -32
+; CHECK_PIC-NEXT:    .cfi_offset %r14, -24
+; CHECK_PIC-NEXT:    .cfi_offset %rbp, -16
+; CHECK_PIC-NEXT:    movl %edi, %ebp
+; CHECK_PIC-NEXT:    leaq .Lfoo$local@TLSLD(%rip), %rdi
+; CHECK_PIC-NEXT:    callq __tls_get_addr@PLT
+; CHECK_PIC-NEXT:    movl .Lfoo$local@DTPOFF(%rax), %ebx
+; CHECK_PIC-NEXT:    testl %ebp, %ebp
+; CHECK_PIC-NEXT:    movl %ebx, %ecx
+; CHECK_PIC-NEXT:    jne .LBB1_2
+; CHECK_PIC-NEXT:  # %bb.1: # %if.then
+; CHECK_PIC-NEXT:    leaq .Lfoo$local@DTPOFF(%rax), %r14
+; CHECK_PIC-NEXT:    callq rand@PLT
+; CHECK_PIC-NEXT:    movl (%r14), %ecx
+; CHECK_PIC-NEXT:  .LBB1_2: # %if.end
+; CHECK_PIC-NEXT:    addl %ecx, %ebx
+; CHECK_PIC-NEXT:    movl %ebx, %eax
+; CHECK_PIC-NEXT:    popq %rbx
+; CHECK_PIC-NEXT:    .cfi_def_cfa_offset 24
+; CHECK_PIC-NEXT:    popq %r14
+; CHECK_PIC-NEXT:    .cfi_def_cfa_offset 16
+; CHECK_PIC-NEXT:    popq %rbp
+; CHECK_PIC-NEXT:    .cfi_def_cfa_offset 8
+; CHECK_PIC-NEXT:    retq
+entry:
+  %addr = tail call align 4 ptr @llvm.threadlocal.address.p0(ptr align 4 @foo)
+  %load0 = load i32, ptr %addr, align 4
+  %cond = icmp eq i32 %arg, 0
+  br i1 %cond, label %if.then, label %if.end
+
+if.then:
+  tail call i32 @rand()
+  %load1 = load i32, ptr %addr, align 4
+  br label %if.end
+
+if.end:
+  %phi = phi i32 [ %load1, %if.then ], [ %load0, %entry ]
+  %add = add nsw i32 %load0, %phi
+  ret i32 %add
+}

ChuanqiXu9 (Member) left a comment

I am not familiar with the code in the backend, but it looks good at a high level. Please wait a few days in case there are more comments.

@MatzeB MatzeB force-pushed the less_tls_spills branch from fc482ad to c25fd0b Compare April 8, 2024 17:39
jyknight (Member) commented Apr 8, 2024

Hm. I think this is an unfortunate way to fix the issue. Long-term, I'd like to get to the point where it is no longer valid to reference a TLS value directly in IR as if it were a simple address; instead, accesses MUST always go through a call. We aren't in that ideal state now, but this moves further away from it rather than towards it.

MatzeB (Contributor, Author) commented Apr 8, 2024

> Hm. I think this is an unfortunate way to fix the issue. Long-term, I'd like to get to the point where it is no longer valid to reference a TLS value directly in IR as if it were a simple address; instead, accesses MUST always go through a call. We aren't in that ideal state now, but this moves further away from it rather than towards it.

This isn't really changing the IR in the actual transform/optimization phases of LLVM...

I was thinking of this as yet another workaround where massaging the IR in CodeGenPrepare is far easier than doing things in SelectionDAG because of its "basic block at a time" design. As far as I can tell, it is fundamentally baked into SelectionDAG today that instructions read across basic blocks need registers assigned, and only Values not tied to instructions can be seen immediately. I initially tried fixing this in SelectionDAG, but that would have needed more fundamental changes there than seemed reasonable for a small improvement like this...

nikic (Contributor) commented Apr 9, 2024

I agree with @jyknight that we should avoid dropping the threadlocal.address intrinsics in this way. The long-term plan for TLS was to change TLS globals to return a token value, and make threadlocal.address accept that token and return the actual pointer. This change wouldn't be compatible with that, as we'd replace the pointer with the token.

You do need to address this in CGP, but I'd expect it to be in the same way as we do it for address calculations in general: By duplicating/sinking. Possibly this is something that should be directly integrated in the address mode logic (as this is ultimately about address modes), but if nothing else you can duplicate the threadlocal.address intrinsics separately.
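
To make the "duplicate the threadlocal.address intrinsics separately" option concrete, here is a minimal standalone sketch; the helper name is hypothetical, PHI users are skipped for brevity, and the merged patch integrates this with the address-mode sinking logic instead:

  // Give every user its own llvm.threadlocal.address call so SelectionDAG
  // never needs to carry the address in a register across basic blocks;
  // isel can then fold each copy into its single user's addressing mode.
  static bool duplicateThreadLocalAddressPerUse(IntrinsicInst *II) {
    if (II->hasOneUse())
      return false;
    bool Changed = false;
    for (Use &U : llvm::make_early_inc_range(II->uses())) {
      auto *UserInst = dyn_cast<Instruction>(U.getUser());
      // A PHI user would need the clone in the incoming block; omitted here.
      if (!UserInst || isa<PHINode>(UserInst))
        continue;
      Instruction *Clone = II->clone();
      Clone->insertBefore(UserInst);
      U.set(Clone);
      Changed = true;
    }
    if (II->use_empty())
      II->eraseFromParent();
    return Changed;
  }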

MatzeB (Contributor, Author) commented Apr 11, 2024

Reworked this change in terms of CodeGenPrepare's address-mode sinking logic and enabled the duplication of llvm.threadlocal.address for X86 TLSModel::LocalExec and TLSModel::InitialExec, where it improves code quality.
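
For context, a rough sketch of where such a check could sit in CGP's address-mode matcher — purely illustrative; the variable names and surrounding control flow are assumptions about code not shown here:

  // When the address being matched is produced by llvm.threadlocal.address,
  // only allow folding/duplication if the target says TLS accesses fit in
  // an addressing mode (e.g. x86 LocalExec/InitialExec).
  if (auto *II = dyn_cast<IntrinsicInst>(Addr))
    if (II->getIntrinsicID() == Intrinsic::threadlocal_address) {
      auto &GV = cast<GlobalValue>(*II->getArgOperand(0));
      if (!TLI.addressingModeSupportsTLS(GV))
        return false; // not profitable; keep the shared register
    }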

nikic (Contributor) left a comment

LGTM, but please give it a day in case there's more feedback.

@MatzeB MatzeB changed the title CodeGenPrepare: Remove threadlocal_address intrinsic when cheap to recompute. CodeGenPrepare: Add support for llvm.threadlocal.address address-mode sinking Apr 11, 2024
MatzeB added 3 commits April 12, 2024 16:07
@nhaehnle nhaehnle removed their request for review April 17, 2024 10:33
@MatzeB MatzeB merged commit 652bcf6 into llvm:main Apr 17, 2024