Description
I learned that, for any uint32_t i, i/3 is equivalent to (uint64_t(i)*0xAAAAAAAB)>>33. This is how modern compilers typically optimize integer division by constants.
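For context, here is a short sketch of why the identity holds, as I understand it (the reasoning and the constant check below are my own addition, not taken from any reference):

// Why the magic constant works (my own back-of-the-envelope reasoning):
// 0xAAAAAAAB == (2^33 + 1) / 3, so
//   (uint64_t(i) * 0xAAAAAAAB) >> 33 == floor(i/3 + i / (3 * 2^33)).
// For i < 2^32 the error term i / (3 * 2^33) is below 1/6, and the fractional
// part of i/3 is at most 2/3, so the floor is never pushed to the next integer.
static_assert(0xAAAAAAABull == ((1ull << 33) + 1) / 3, "magic constant for division by 3");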
To verify this equivalence, I've written the following code:
#include <iostream>
#include <cstdint>
#include <chrono>
//[[gnu::noinline]]
uint32_t div3_v0(uint32_t a) { return a / 3; }
//[[gnu::noinline]]
uint32_t div3_v1(uint32_t a) { return (static_cast<uint64_t>(a) * 0xAAAAAAAB) >> 33; }
// Benchmark: sum i/3 over every uint32_t using the plain division.
uint64_t test0()
{
    uint64_t res0 = 0;
    for (uint64_t i = 0; i <= 0xFFFF'FFFF; ++i)
    {
        res0 += div3_v0(i);
    }
    return res0;
}
// Benchmark: sum i/3 over every uint32_t using the multiply-and-shift version.
uint64_t test1()
{
    uint64_t res1 = 0;
    for (uint64_t i = 0; i <= 0xFFFF'FFFF; ++i)
    {
        res1 += div3_v1(i);
    }
    return res1;
}
int main()
{
    // Iterate through all possible uint32_t values
    for (uint64_t i = 0; i <= 0xFFFF'FFFF; ++i)
    {
        // Calculate result using regular division
        uint32_t division_result = div3_v0(i);
        // Calculate result using multiplication and right shift
        uint32_t multiplication_shift_result = div3_v1(i);
        // Verify if both results are equal
        if (division_result != multiplication_shift_result)
        {
            std::cout << "i = " << i << std::endl;
            std::cout << "i/3 = " << division_result << std::endl;
            std::cout << "(uint64_t(i)*0xAAAAAAAB)>>33 = " << multiplication_shift_result << std::endl;
            return 1;
        }
    }
    std::cout << "for all i in uint32_t, i / 3 == (uint64_t(i)*0xAAAAAAAB)>>33" << std::endl;
    // Test time cost
    auto tp0 = std::chrono::high_resolution_clock::now();
    uint64_t res0 = test0();
    auto tp1 = std::chrono::high_resolution_clock::now();
    auto tp2 = std::chrono::high_resolution_clock::now();
    uint64_t res1 = test1();
    auto tp3 = std::chrono::high_resolution_clock::now();
    std::cout << "res0 = " << res0 << std::endl;
    std::cout << "res1 = " << res1 << std::endl;
    std::cout << "duration0 = " << std::chrono::duration_cast<std::chrono::milliseconds>(tp1 - tp0).count() << " ms" << std::endl;
    std::cout << "duration1 = " << std::chrono::duration_cast<std::chrono::milliseconds>(tp3 - tp2).count() << " ms" << std::endl;
    return 0;
}
When compiled with -O3 on my system, the output is:
for all i in uint32_t, i / 3 == (uint64_t(i)*0xAAAAAAAB)>>33
res0 = 3074457343470774955
res1 = 3074457343470774955
duration0 = 1514 ms
duration1 = 1095 ms
The performance discrepancy between duration0 and duration1 is unexpected, as both functions should theoretically be optimized to the same machine code.
Upon examining the generated assembly, I noticed that in test1 the multiplication static_cast<uint64_t>(i) * 0xAAAAAAAB has undergone strength reduction and been converted into a sequence of addition operations. However, in test0, while the division was optimized into a multiplication, that multiplication was not further reduced to additions. This appears to be a missed optimization opportunity.
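For illustration, the strength-reduced loop in test1 behaves roughly like the hand-written sketch below (the function name is mine, and the actual compiler output may differ in detail, e.g. unrolling):

// Rough C++ equivalent of the strength-reduced loop the compiler emits for test1.
uint64_t test1_strength_reduced()
{
    uint64_t res1 = 0;
    uint64_t product = 0; // invariant: product == i * 0xAAAAAAAB (no overflow, since i < 2^32)
    for (uint64_t i = 0; i <= 0xFFFF'FFFF; ++i)
    {
        res1 += product >> 33;  // same value as div3_v1(i)
        product += 0xAAAAAAAB;  // one addition per iteration replaces the multiplication
    }
    return res1;
}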
Additionally, even though div3_v0 and div3_v1 produce identical assembly code, they are treated as distinct functions with separate memory addresses. This is another missed optimization, as the compiler should recognize their semantic equivalence and merge them.
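The separate addresses can be observed directly; here is a minimal check I added (it assumes div3_v0 and div3_v1 from the code above):

// Returns false on my build: the two byte-identical functions are not folded
// into one body, so their addresses differ.
bool folded() { return &div3_v0 == &div3_v1; }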
Compilation command used: clang++ -march=x86-64 -O3 -g -mno-sse -mno-avx -mno-avx512f
This issue can be reproduced with the following compilers:
- x86-64 clang 20.1.0 on godbolt.org
- On my WSL environment:
  Ubuntu clang version 19.1.7 (++20250114103320+cd708029e0b2-1~exp1~20250114103432.75)
  Target: x86_64-pc-linux-gnu