CodeGenPrepare: Add support for llvm.threadlocal.address address-mode sinking #87844
Conversation
I also submitted #87841 to verify the assumption that arg0 of `llvm.threadlocal.address` is a `GlobalVariable`.
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-backend-x86

Author: Matthias Braun (MatzeB)

Changes

The `threadlocal_address` intrinsic is currently ignored/removed for instruction selection by the `SelectionDAGBuilder` (see also https://reviews.llvm.org/D125291). However, being an Instruction means `SelectionDAG` will assign a register to it and share the value across basic blocks. This sharing is suboptimal in the "LocalExec" TLS model on x86, where it is cheaper to just recompute the address. We saw a 0.5% regression in a codebase with a lot of TLS usage (HHVM).

This introduces a new `cheapToRecomputeTLSAddress` target lowering callback and removes the `threadlocal_address` intrinsic in `CodeGenPrepare`, restoring the efficient behavior from before the intrinsic was introduced.

This fixes #87437

Full diff: https://github.com/llvm/llvm-project/pull/87844.diff

5 Files Affected:
diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index a4dc097446186a..0152e8506cdb75 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -3393,6 +3393,13 @@ class TargetLoweringBase {
return nullptr;
}
+ /// Returns true if thread local storage (TLS) addresses are so cheap to
+ /// re-compute that it is not worth keeping them in a register between basic
+ /// blocks.
+ virtual bool cheapToRecomputeTLSAddress(const GlobalVariable &) const {
+ return false;
+ }
+
//===--------------------------------------------------------------------===//
// Runtime Library hooks
//
diff --git a/llvm/lib/CodeGen/CodeGenPrepare.cpp b/llvm/lib/CodeGen/CodeGenPrepare.cpp
index e657872c382848..32caf402844213 100644
--- a/llvm/lib/CodeGen/CodeGenPrepare.cpp
+++ b/llvm/lib/CodeGen/CodeGenPrepare.cpp
@@ -2528,6 +2528,18 @@ bool CodeGenPrepare::optimizeCallInst(CallInst *CI, ModifyDT &ModifiedDT) {
return optimizeGatherScatterInst(II, II->getArgOperand(0));
case Intrinsic::masked_scatter:
return optimizeGatherScatterInst(II, II->getArgOperand(1));
+ case Intrinsic::threadlocal_address:
+ // SelectionDAGBuilder currently skips this intrinsic anyway; but removing
+  // it earlier means the addresses will not be kept in registers across
+  // basic blocks but recomputed. This is preferable on architectures where
+  // TLS is part of normal addressing modes.
+ GlobalVariable &GV = cast<GlobalVariable>(*II->getArgOperand(0));
+ if (TLI->cheapToRecomputeTLSAddress(GV)) {
+ replaceAllUsesWith(II, &GV, FreshBBs, IsHugeFunc);
+ II->eraseFromParent();
+ return true;
+ }
+ break;
}
SmallVector<Value *, 2> PtrOps;
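
To make the effect of the new `threadlocal_address` case concrete, here is a minimal IR sketch of the rewrite it performs when `cheapToRecomputeTLSAddress` returns true (`@foo` and the function bodies are illustrative, not taken from the patch):

```llvm
@foo = dso_local thread_local global i32 0, align 4

declare nonnull ptr @llvm.threadlocal.address.p0(ptr nonnull)

; Input to CodeGenPrepare: the address is materialized by the intrinsic.
define i32 @before() {
  %addr = call ptr @llvm.threadlocal.address.p0(ptr @foo)
  %v = load i32, ptr %addr, align 4
  ret i32 %v
}

; After the rewrite: the intrinsic call is erased and its uses are
; replaced with the global itself, so SelectionDAG is free to recompute
; the address (or fold it into an addressing mode) at every use.
define i32 @after() {
  %v = load i32, ptr @foo, align 4
  ret i32 %v
}
```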
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 6f65344215c020..ba99abcb9899b5 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -18913,6 +18913,16 @@ X86TargetLowering::LowerGlobalTLSAddress(SDValue Op, SelectionDAG &DAG) const {
llvm_unreachable("TLS not implemented for this target.");
}
+bool X86TargetLowering::cheapToRecomputeTLSAddress(
+ const GlobalVariable &GV) const {
+ if (Subtarget.is64Bit() && Subtarget.isTargetELF()) {
+ const TargetMachine &TM = getTargetMachine();
+ TLSModel::Model Model = TM.getTLSModel(&GV);
+ return Model == TLSModel::LocalExec;
+ }
+ return false;
+}
+
/// Lower SRA_PARTS and friends, which return two i32 values
/// and take a 2 x i32 value to shift plus a shift amount.
/// TODO: Can this be moved to general expansion code?
diff --git a/llvm/lib/Target/X86/X86ISelLowering.h b/llvm/lib/Target/X86/X86ISelLowering.h
index 0a1e8ca4427314..e3480a84bd9581 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.h
+++ b/llvm/lib/Target/X86/X86ISelLowering.h
@@ -1661,6 +1661,8 @@ namespace llvm {
SDValue LowerGlobalTLSAddress(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerExternalSymbol(SDValue Op, SelectionDAG &DAG) const;
+ bool cheapToRecomputeTLSAddress(const GlobalVariable &GV) const override;
+
/// Creates target global address or external symbol nodes for calls or
/// other uses.
SDValue LowerGlobalOrExternal(SDValue Op, SelectionDAG &DAG,
diff --git a/llvm/test/CodeGen/X86/tls-multi-use.ll b/llvm/test/CodeGen/X86/tls-multi-use.ll
new file mode 100644
index 00000000000000..8a0ac027e0973e
--- /dev/null
+++ b/llvm/test/CodeGen/X86/tls-multi-use.ll
@@ -0,0 +1,149 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
+; RUN: llc -o - %s | FileCheck %s
+; RUN: llc -o - -relocation-model=pic %s | FileCheck --check-prefix=CHECK_PIC %s
+target triple = "x86_64--linux-gnu"
+
+@foo = dso_local thread_local global i32 0, align 4
+@foo_nonlocal = thread_local global i32 0, align 4
+
+declare i32 @rand()
+declare nonnull ptr @llvm.threadlocal.address.p0(ptr nonnull)
+
+define i32 @tls_multi_use(i32 %arg) {
+; CHECK-LABEL: tls_multi_use:
+; CHECK: # %bb.0: # %entry
+; CHECK-NEXT: pushq %rbx
+; CHECK-NEXT: .cfi_def_cfa_offset 16
+; CHECK-NEXT: .cfi_offset %rbx, -16
+; CHECK-NEXT: movl %fs:foo@TPOFF, %ebx
+; CHECK-NEXT: testl %edi, %edi
+; CHECK-NEXT: movl %ebx, %eax
+; CHECK-NEXT: jne .LBB0_2
+; CHECK-NEXT: # %bb.1: # %if.then
+; CHECK-NEXT: callq rand@PLT
+; CHECK-NEXT: movl %fs:foo@TPOFF, %eax
+; CHECK-NEXT: .LBB0_2: # %if.end
+; CHECK-NEXT: addl %eax, %ebx
+; CHECK-NEXT: movl %ebx, %eax
+; CHECK-NEXT: popq %rbx
+; CHECK-NEXT: .cfi_def_cfa_offset 8
+; CHECK-NEXT: retq
+;
+; CHECK_PIC-LABEL: tls_multi_use:
+; CHECK_PIC: # %bb.0: # %entry
+; CHECK_PIC-NEXT: pushq %rbp
+; CHECK_PIC-NEXT: .cfi_def_cfa_offset 16
+; CHECK_PIC-NEXT: pushq %r14
+; CHECK_PIC-NEXT: .cfi_def_cfa_offset 24
+; CHECK_PIC-NEXT: pushq %rbx
+; CHECK_PIC-NEXT: .cfi_def_cfa_offset 32
+; CHECK_PIC-NEXT: .cfi_offset %rbx, -32
+; CHECK_PIC-NEXT: .cfi_offset %r14, -24
+; CHECK_PIC-NEXT: .cfi_offset %rbp, -16
+; CHECK_PIC-NEXT: movl %edi, %ebp
+; CHECK_PIC-NEXT: leaq .Lfoo$local@TLSLD(%rip), %rdi
+; CHECK_PIC-NEXT: callq __tls_get_addr@PLT
+; CHECK_PIC-NEXT: movl .Lfoo$local@DTPOFF(%rax), %ebx
+; CHECK_PIC-NEXT: testl %ebp, %ebp
+; CHECK_PIC-NEXT: movl %ebx, %ecx
+; CHECK_PIC-NEXT: jne .LBB0_2
+; CHECK_PIC-NEXT: # %bb.1: # %if.then
+; CHECK_PIC-NEXT: leaq .Lfoo$local@DTPOFF(%rax), %r14
+; CHECK_PIC-NEXT: callq rand@PLT
+; CHECK_PIC-NEXT: movl (%r14), %ecx
+; CHECK_PIC-NEXT: .LBB0_2: # %if.end
+; CHECK_PIC-NEXT: addl %ecx, %ebx
+; CHECK_PIC-NEXT: movl %ebx, %eax
+; CHECK_PIC-NEXT: popq %rbx
+; CHECK_PIC-NEXT: .cfi_def_cfa_offset 24
+; CHECK_PIC-NEXT: popq %r14
+; CHECK_PIC-NEXT: .cfi_def_cfa_offset 16
+; CHECK_PIC-NEXT: popq %rbp
+; CHECK_PIC-NEXT: .cfi_def_cfa_offset 8
+; CHECK_PIC-NEXT: retq
+entry:
+ %addr = tail call align 4 ptr @llvm.threadlocal.address.p0(ptr align 4 @foo)
+ %load0 = load i32, ptr %addr, align 4
+ %cond = icmp eq i32 %arg, 0
+ br i1 %cond, label %if.then, label %if.end
+
+if.then:
+ tail call i32 @rand()
+ %load1 = load i32, ptr %addr, align 4
+ br label %if.end
+
+if.end:
+ %phi = phi i32 [ %load1, %if.then ], [ %load0, %entry ]
+ %add = add nsw i32 %load0, %phi
+ ret i32 %add
+}
+
+
+define i32 @tls_multi_use_nonlocal(i32 %arg) {
+; CHECK-LABEL: tls_multi_use_nonlocal:
+; CHECK: # %bb.0: # %entry
+; CHECK-NEXT: pushq %rbx
+; CHECK-NEXT: .cfi_def_cfa_offset 16
+; CHECK-NEXT: .cfi_offset %rbx, -16
+; CHECK-NEXT: movl %fs:foo@TPOFF, %ebx
+; CHECK-NEXT: testl %edi, %edi
+; CHECK-NEXT: movl %ebx, %eax
+; CHECK-NEXT: jne .LBB1_2
+; CHECK-NEXT: # %bb.1: # %if.then
+; CHECK-NEXT: callq rand@PLT
+; CHECK-NEXT: movl %fs:foo@TPOFF, %eax
+; CHECK-NEXT: .LBB1_2: # %if.end
+; CHECK-NEXT: addl %eax, %ebx
+; CHECK-NEXT: movl %ebx, %eax
+; CHECK-NEXT: popq %rbx
+; CHECK-NEXT: .cfi_def_cfa_offset 8
+; CHECK-NEXT: retq
+;
+; CHECK_PIC-LABEL: tls_multi_use_nonlocal:
+; CHECK_PIC: # %bb.0: # %entry
+; CHECK_PIC-NEXT: pushq %rbp
+; CHECK_PIC-NEXT: .cfi_def_cfa_offset 16
+; CHECK_PIC-NEXT: pushq %r14
+; CHECK_PIC-NEXT: .cfi_def_cfa_offset 24
+; CHECK_PIC-NEXT: pushq %rbx
+; CHECK_PIC-NEXT: .cfi_def_cfa_offset 32
+; CHECK_PIC-NEXT: .cfi_offset %rbx, -32
+; CHECK_PIC-NEXT: .cfi_offset %r14, -24
+; CHECK_PIC-NEXT: .cfi_offset %rbp, -16
+; CHECK_PIC-NEXT: movl %edi, %ebp
+; CHECK_PIC-NEXT: leaq .Lfoo$local@TLSLD(%rip), %rdi
+; CHECK_PIC-NEXT: callq __tls_get_addr@PLT
+; CHECK_PIC-NEXT: movl .Lfoo$local@DTPOFF(%rax), %ebx
+; CHECK_PIC-NEXT: testl %ebp, %ebp
+; CHECK_PIC-NEXT: movl %ebx, %ecx
+; CHECK_PIC-NEXT: jne .LBB1_2
+; CHECK_PIC-NEXT: # %bb.1: # %if.then
+; CHECK_PIC-NEXT: leaq .Lfoo$local@DTPOFF(%rax), %r14
+; CHECK_PIC-NEXT: callq rand@PLT
+; CHECK_PIC-NEXT: movl (%r14), %ecx
+; CHECK_PIC-NEXT: .LBB1_2: # %if.end
+; CHECK_PIC-NEXT: addl %ecx, %ebx
+; CHECK_PIC-NEXT: movl %ebx, %eax
+; CHECK_PIC-NEXT: popq %rbx
+; CHECK_PIC-NEXT: .cfi_def_cfa_offset 24
+; CHECK_PIC-NEXT: popq %r14
+; CHECK_PIC-NEXT: .cfi_def_cfa_offset 16
+; CHECK_PIC-NEXT: popq %rbp
+; CHECK_PIC-NEXT: .cfi_def_cfa_offset 8
+; CHECK_PIC-NEXT: retq
+entry:
+ %addr = tail call align 4 ptr @llvm.threadlocal.address.p0(ptr align 4 @foo)
+ %load0 = load i32, ptr %addr, align 4
+ %cond = icmp eq i32 %arg, 0
+ br i1 %cond, label %if.then, label %if.end
+
+if.then:
+ tail call i32 @rand()
+ %load1 = load i32, ptr %addr, align 4
+ br label %if.end
+
+if.end:
+ %phi = phi i32 [ %load1, %if.then ], [ %load0, %entry ]
+ %add = add nsw i32 %load0, %phi
+ ret i32 %add
+}
ChuanqiXu9 left a comment:
I am not familiar with the code in the backend, but it looks good at a high level. Please wait a few days in case there are more comments.
Hm. I think this is an unfortunate way to fix the issue. Long-term, I'd like to get to the point where it is no longer valid to reference a TLS value directly in IR as if it were a simple address, but instead require that accesses MUST always go through a call. We aren't in that ideal state now, but this goes further away from it instead of further towards it.
This isn't really changing the IR in the actual transform/optimization phases of LLVM... I was thinking of this as just yet another hack/workaround where we massage the IR in `CodeGenPrepare` …
I agree with @jyknight that we should avoid dropping the `threadlocal.address` intrinsics in this way. The long-term plan for TLS was to change TLS globals to return a token value, and make `threadlocal.address` accept that token and return the actual pointer. This change wouldn't be compatible with that, as we'd replace the pointer with the token. You do need to address this in CGP, but I'd expect it to be in the same way as we do it for address calculations in general: by duplicating/sinking. Possibly this is something that should be directly integrated into the address-mode logic (as this is ultimately about address modes), but if nothing else you can duplicate the `threadlocal.address` intrinsics separately.
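
To illustrate the suggested duplication, here is a hedged IR sketch (mirroring the test above; not code from this patch) of the before/after shape:

```llvm
@foo = dso_local thread_local global i32 0, align 4

declare nonnull ptr @llvm.threadlocal.address.p0(ptr nonnull)
declare i32 @rand()

; Before: one intrinsic call whose result is live across the call to
; @rand, forcing SelectionDAG to keep the address in a register.
define i32 @before(i32 %arg) {
entry:
  %addr = call ptr @llvm.threadlocal.address.p0(ptr @foo)
  %load0 = load i32, ptr %addr, align 4
  %cond = icmp eq i32 %arg, 0
  br i1 %cond, label %if.then, label %if.end

if.then:
  call i32 @rand()
  %load1 = load i32, ptr %addr, align 4
  br label %if.end

if.end:
  %phi = phi i32 [ %load1, %if.then ], [ %load0, %entry ]
  %add = add nsw i32 %load0, %phi
  ret i32 %add
}

; After duplication: each using block recomputes the address, so nothing
; stays live across the call and each load can fold the TLS access into
; its addressing mode (e.g. %fs:foo@TPOFF in the LocalExec model).
define i32 @after(i32 %arg) {
entry:
  %addr0 = call ptr @llvm.threadlocal.address.p0(ptr @foo)
  %load0 = load i32, ptr %addr0, align 4
  %cond = icmp eq i32 %arg, 0
  br i1 %cond, label %if.then, label %if.end

if.then:
  call i32 @rand()
  %addr1 = call ptr @llvm.threadlocal.address.p0(ptr @foo)
  %load1 = load i32, ptr %addr1, align 4
  br label %if.end

if.end:
  %phi = phi i32 [ %load1, %if.then ], [ %load0, %entry ]
  %add = add nsw i32 %load0, %phi
  ret i32 %add
}
```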
Reworked this change in terms of duplicating/sinking the `llvm.threadlocal.address` intrinsic.
nikic left a comment:
LGTM, but please give it a day in case there's more feedback.
Depending on the TLS model, many thread-local accesses on x86 can be expressed by adding a `%fs:` segment register to an addressing mode. Even if there are multiple users of a `llvm.threadlocal.address` intrinsic, it is generally not worth sharing the value in a register; instead, the `%fs` access should be folded into multiple addressing modes. Hence this changes `CodeGenPrepare` to duplicate the `llvm.threadlocal.address` intrinsic as necessary.

This introduces a new `TargetLowering::addressingModeSupportsTLS` callback that allows targets to indicate whether TLS accesses can be part of an addressing mode.

This fixes a performance problem: the folding of TLS accesses into multiple addressing modes happened naturally before the introduction of the `llvm.threadlocal.address` intrinsic, but regressed because `SelectionDAG` keeps values in registers when they are accessed across basic blocks; `CodeGenPrepare` now duplicates the intrinsic to mitigate this. We see a ~0.5% recovery in a codebase with heavy TLS usage (HHVM).

This fixes #87437
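
For reference, a small sketch of the segment-register fold this enables on x86-64 (`@counter` is illustrative, and the assembly in the comments is an approximation of LocalExec lowering, not output from this patch):

```llvm
@counter = dso_local thread_local global i32 0, align 4

declare nonnull ptr @llvm.threadlocal.address.p0(ptr nonnull)

define void @bump() {
  ; The load and store below can each fold the TLS access into their
  ; addressing modes rather than keeping %p in a register, roughly:
  ;   movl %fs:counter@TPOFF, %eax
  ;   addl $1, %eax
  ;   movl %eax, %fs:counter@TPOFF
  %p = call ptr @llvm.threadlocal.address.p0(ptr @counter)
  %v = load i32, ptr %p, align 4
  %inc = add i32 %v, 1
  store i32 %inc, ptr %p, align 4
  ret void
}
```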