Skip to content

[Integer division]Optimization missed in looped division #146483

Open
@bionukg

Description

@bionukg

I learned that, for any uint32_t i, i/3 is equivalent to (uint64_t(i)*0xAAAAAAAB)>>33. This is how modern compilers typically optimize integer division by constants.

To verify this equivalence, I've written the following code:

#include <iostream>
#include <cstdint>
#include <chrono>
//[[gnu::noinline]] 
uint32_t div3_v0(uint32_t a) { return a / 3; }
//[[gnu::noinline]] 
uint32_t div3_v1(uint32_t a) { return (static_cast<uint64_t>(a) * 0xAAAAAAAB) >> 33; }

uint64_t test0()
{
    uint64_t res0 = 0;
    for (uint64_t i = 0; i <= 0xFFFF'FFFF; ++i)
    {
        res0 += div3_v0(i);
    }
    return res0;
}

uint64_t test1()
{
    uint64_t res1 = 0;
    for (uint64_t i = 0; i <= 0xFFFF'FFFF; ++i)
    {
        res1 += div3_v1(i);
    }
    return res1;
}

int main()
{
    // Iterate through all possible uint32_t values
    for (uint64_t i = 0; i <= 0xFFFF'FFFF; ++i)
    {
        // Calculate result using regular division
        uint32_t division_result = div3_v0(i);

        // Calculate result using multiplication and right shift
        uint32_t multiplication_shift_result = div3_v1(i);

        // Verify if both results are equal
        if (division_result != multiplication_shift_result)
        {
            std::cout << "i = " << i << std::endl;
            std::cout << "i/3 = " << division_result << std::endl;
            std::cout << "(uint64_t(i)*0xAAAAAAAB)>>33 = " << multiplication_shift_result << std::endl;
            return 1;
        }
    }
    std::cout << "for all i in uint32_t, i / 3 == (uint64_t(i)*0xAAAAAAAB)>>33" << std::endl;
    
    // Test time cost
    auto tp0 = std::chrono::high_resolution_clock::now();
    uint64_t res0 = test0();
    auto tp1 = std::chrono::high_resolution_clock::now();

    auto tp2 = std::chrono::high_resolution_clock::now();
    uint64_t res1 = test1();
    auto tp3 = std::chrono::high_resolution_clock::now();
    
    std::cout << "res0 = " << res0 << std::endl;
    std::cout << "res1 = " << res1 << std::endl;
    std::cout << "duration0 = " << std::chrono::duration_cast<std::chrono::milliseconds>(tp1 - tp0).count() << " ms" << std::endl;
    std::cout << "duration1 = " << std::chrono::duration_cast<std::chrono::milliseconds>(tp3 - tp2).count() << " ms" << std::endl;

    return 0;
}

When compiled with -O3 on my system, the output is:

for all i in uint32_t, i / 3 == (uint64_t(i)*0xAAAAAAAB)>>33
res0 = 3074457343470774955
res1 = 3074457343470774955
duration0 = 1514 ms
duration1 = 1095 ms

The performance discrepancy between duration0 and duration1 is unexpected, as both functions should theoretically be optimized to the same machine code.

Upon examining the generated assembly code, I noticed that in test1, the multiplication static_cast<uint64_t>(i) * 0xAAAAAAAB has undergone strength reduction optimization and been converted into a sequence of addition operations. However, in test0, while the division was optimized into a multiplication, the subsequent multiplication was not further optimized into addition operations. This appears to be a missed optimization opportunity.

Additionally, even though div3_v0 and div3_v1 produce identical assembly code, they are treated as distinct functions with separate memory addresses. This is another missed optimization, as the compiler should recognize their semantic equivalence and merge them.

Compilation command used: clang++ -march=x86-64 -O3 -g -mno-sse -mno-avx -mno-avx512f

This issue can be reproduced with the following compilers:

  1. x86-64 clang 20.1.0 on godbolt.org
  2. On my WSL environment:
Ubuntu clang version 19.1.7 (++20250114103320+cd708029e0b2-1~exp1~20250114103432.75)
Target: x86_64-pc-linux-gnu

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions