Missed optimization: Lowering struct materialization into cold branches #150120

@jeremy-rifkin

Clang generates suboptimal code for the following:

struct S {
    int& x;
    int& y;
    bool check() {
        return x < y;
    }
};

[[noreturn]] [[gnu::cold]] void bar(const S& s);

void foo(int a, int b) {
    S s{a, b};
    if(s.check()) [[unlikely]] { // very unlikely and cold
        bar(s);
    }
}

foo(int, int):
        sub     rsp, 24
        mov     dword ptr [rsp + 4], edi
        mov     dword ptr [rsp], esi
        lea     rax, [rsp + 4]
        mov     qword ptr [rsp + 8], rax
        mov     rax, rsp
        mov     qword ptr [rsp + 16], rax
        cmp     edi, esi
        jl      .LBB0_2
        add     rsp, 24
        ret
.LBB0_2:
        lea     rdi, [rsp + 8]
        call    bar(S const&)@PLT

Struct S must be materialized on the stack in order to call bar(), but that is only needed in the unlikely case that the cold branch is taken. Ideally the codegen would be the following:

foo(int, int):
        cmp     edi, esi
        jl      .LBB0_2
        ret
.LBB0_2:
        sub     rsp, 24
        ... copy edi/esi to the stack and make struct S ...
        call    bar(S const&)@PLT

MSVC generates something along these lines; GCC and Clang do not: https://godbolt.org/z/4axKfoe8x

I can't simply write if(a < b) or delay the construction of S until inside the branch. The specific use case that produces code like this is libassert, where an expression template is built from the user's condition, evaluated, and then inspected only if the assertion fails.
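To illustrate why the struct has to exist before the branch, here is a minimal sketch of the expression-template pattern. All names (`decomposer`, `lhs_capture`, `less_expr`, `describe_failure`, `check`) are hypothetical and this is not libassert's actual implementation; it only handles `<` on types that `std::to_string` accepts.

```cpp
#include <string>

// Hypothetical expression node: holds references to both operands,
// much like struct S in the reduced example.
template <typename L, typename R>
struct less_expr {
    const L& lhs;
    const R& rhs;
    bool evaluate() const { return lhs < rhs; }
};

// Hypothetical intermediate that has captured only the left operand.
template <typename L>
struct lhs_capture {
    const L& lhs;
    template <typename R>
    less_expr<L, R> operator<(const R& rhs) const { return {lhs, rhs}; }
};

// Hypothetical entry point: `decomposer{} << a < b` parses as
// `(decomposer{} << a) < b` because << binds tighter than <.
struct decomposer {
    template <typename L>
    lhs_capture<L> operator<<(const L& lhs) const { return {lhs}; }
};

// Cold path: inspects the operands through the references in the node.
template <typename E>
[[gnu::cold]] [[gnu::noinline]] std::string describe_failure(const E& e) {
    return std::to_string(e.lhs) + " < " + std::to_string(e.rhs);
}

// Returns an empty string when a < b holds, otherwise a description.
inline std::string check(int a, int b) {
    // A macro would rely on operator precedence; the parentheses here
    // are written out only to keep the sketch warning-free.
    auto expr = (decomposer{} << a) < b;
    if (!expr.evaluate()) [[unlikely]] {
        return describe_failure(expr);
    }
    return {};
}
```

The node is built before the condition is evaluated because the condition is evaluated through it, so the construction cannot be moved into the branch at the source level.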

Even if the code is written as follows, clang still generates suboptimal code:

void foo(int a, int b) {
    if(a < b) [[unlikely]] { // very unlikely and cold
        S s{a, b};
        bar(s);
    }
}

foo(int, int):
        sub     rsp, 24
        mov     dword ptr [rsp + 4], edi
        mov     dword ptr [rsp], esi
        cmp     edi, esi
        jl      .LBB0_2
        add     rsp, 24
        ret
.LBB0_2:
        lea     rax, [rsp + 4]
        mov     qword ptr [rsp + 8], rax
        mov     rax, rsp
        mov     qword ptr [rsp + 16], rax
        lea     rdi, [rsp + 8]
        call    bar(S const&)@PLT

This may be a tricky optimization to perform. However, given that even the hand-restructured version above is affected, I expect it would benefit a large amount of code.
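For reference, the requested transformation is roughly what manual outlining does at the source level. The sketch below is an illustration under assumptions, not a usable workaround for the libassert case: `foo_cold` and `foo_throws` are hypothetical helpers, and `bar` is given a throwing stand-in body here only so the example is self-contained.

```cpp
#include <stdexcept>

struct S {
    int& x;
    int& y;
    bool check() { return x < y; }
};

// Stand-in body for the issue's bar(); the real one is only declared.
[[noreturn]] [[gnu::cold]] void bar(const S&) {
    throw std::runtime_error("check failed");
}

// Hypothetical outlined cold path: a and b arrive by value, so S and the
// stack slots it refers to live in this function's frame, not in foo's.
[[noreturn]] [[gnu::cold]] [[gnu::noinline]] void foo_cold(int a, int b) {
    S s{a, b};
    bar(s);
}

void foo(int a, int b) {
    if (a < b) [[unlikely]] { // very unlikely and cold
        foo_cold(a, b);
    }
}

// Hypothetical helper used only to exercise the sketch.
bool foo_throws(int a, int b) {
    try {
        foo(a, b);
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}
```

Because the operands are passed by value into a non-inlined function, nothing in foo itself needs to have its address taken, which is the property the optimization would have to establish automatically.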
