[compiler-rt][ARM] Optimized mulsf3 and divsf3 #161546

statham-arm · 2025-10-01T16:24:26Z

This commit adds optimized assembly versions of single-precision float multiplication and division. Both functions are implemented in a style that can be assembled as either Arm or Thumb2; for multiplication, a separate implementation is provided for Thumb1. Also, extensive new tests are added for multiplication and division.

These implementations can be removed from the build by defining the cmake variable COMPILER_RT_ARM_OPTIMIZED_FP=OFF.

Outlying parts of the functionality which are not on the fast path, such as NaN handling and underflow, are handled in helper functions written in C. These can be shared between the Arm/Thumb2 and Thumb1 implementations, and also reused by other optimized assembly functions we hope to add in future.

statham-arm · 2025-10-01T16:24:43Z

This is the second PR in my planned series to upstream optimized AArch32 FP implementations, as discussed on Discourse in August. (Sorry for the delay.)

The first PR is #154093, which is replacing an existing assembly implementation with (we think) a better one. This one is adding new assembly implementations, for functions which don't have them already. The two PRs conflict, but benignly, in that they both add the same supporting C functions; whichever one lands first, I'll update the other one.

This PR is not quite in a committable state yet, because I'd like advice on what to do about the new tests. At the moment, they're using compareResultF to check the answers, which forgives differences of opinion in NaN handling. Our assembly routines have well specified NaN handling (designed to match the behavior of Arm's hardware FP), and a set of tests to check it. So when they're testing the new versions of the function, I'd like to make them check the output NaNs exactly.

But other architectures, and the existing C implementations in compiler-rt, can't be expected to pass those tests in their strict form. So those tests will have to be reverted to use compareResultF on any other architecture, or when the new config option COMPILER_RT_ARM_OPTIMIZED_FP=OFF is set. Any thoughts on the best thing to do about that?
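The difference between the two checking modes can be sketched in C. This is a minimal illustration, not the code in compiler-rt's test headers: `compare_lenient`, `compare_strict`, and the two bit-casting helpers are hypothetical names, and the lenient rule shown here (any NaN result is accepted when any NaN is expected) is an assumption about what "forgiving" means in practice.

```c
#include <stdint.h>
#include <string.h>

// Reinterpret a float as its 32-bit pattern (memcpy avoids aliasing problems).
uint32_t float_bits(float f) {
  uint32_t u;
  memcpy(&u, &f, sizeof u);
  return u;
}

// Build a float from a 32-bit pattern.
float bits_float(uint32_t u) {
  float f;
  memcpy(&f, &u, sizeof f);
  return f;
}

// A pattern is a NaN if the exponent is all ones and the mantissa is nonzero.
int is_nan_bits(uint32_t u) {
  return (u & 0x7f800000u) == 0x7f800000u && (u & 0x007fffffu) != 0;
}

// Hypothetical lenient check: returns 0 on success. Exact matches pass,
// and so does any NaN result when the expected value is also a NaN.
int compare_lenient(float result, uint32_t expected) {
  uint32_t got = float_bits(result);
  if (got == expected)
    return 0;
  if (is_nan_bits(got) && is_nan_bits(expected))
    return 0;
  return 1;
}

// Hypothetical strict check: returns 0 only if every bit matches,
// including the sign and payload of a NaN.
int compare_strict(float result, uint32_t expected) {
  return float_bits(result) != expected;
}
```

Under these definitions, a result of 0x7fc00001 checked against an expected 0x7fc00000 passes the lenient check but fails the strict one, which is the kind of difference a per-architecture switch would need to select between.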

aykevl

Can't give much of a review here, but superficially this looks fine to me.

aykevl · 2025-10-01T16:53:16Z

compiler-rt/lib/builtins/CMakeLists.txt

 )

+option(COMPILER_RT_ARM_OPTIMIZED_FP
+  "On 32-bit Arm, use optimized assembly implementations of FP arithmetic" ON)


I believe this is a code size vs speed tradeoff, right?
I think it would be a good idea to say that explicitly. (And IMHO if the new assembly routines are both smaller and faster they should just be replaced instead of having two options).

I've done that, with a "likely" in it to cover the fact that until we've gone through all of the available functions we won't know for sure whether all of them trade off size for speed.

(It's also difficult to judge, since when you compare assembly against C, the C is more likely to vary with compile options, so the answer might turn out to be "in this configuration but not that one".)

compiler-rt/lib/builtins/arm/thumb1/mulsf3.S

compnerd · 2025-10-01T17:54:26Z

compiler-rt/lib/builtins/arm/divsf3.S

+
+*/
+
+  .p2align 2  // make sure we start on a 32-bit boundary, even in Thumb


I think that changing this to 4-byte boundary is better than 32-bit boundary as it can be confusing when scanning over the comment and code.

compiler-rt/lib/builtins/arm/fnan2.c

This was Petr Hosek's comment on llvm#154093, but if we're doing that, we should do it consistently.

Now we should only test the extra NaN faithfulness in cases where it's provided by the library. Also tweaked the cmake setup to make it easier to add more assembly files later. Plus a missing piece of comment in fnan2.c.

statham-arm · 2025-10-27T11:28:22Z

Ping! I haven't come back to this for a few weeks because I've been busy with other things, but it's been a while since it had any attention. @compnerd, you left review comments at the start of the month; are you happy with the updated version?

compnerd · 2025-10-27T12:05:22Z

compiler-rt/lib/builtins/CMakeLists.txt

+    arm/mulsf3.S
+    arm/divsf3.S)
+  set_source_files_properties(${assembly_files}
+    PROPERTIES COMPILE_OPTIONS "-Wa,-mimplicit-it=always")


Might be nice to limit this to when the driver supports it. This is important if the build is using the raw assembler vs the compiler driver (i.e. -implicit-it=always vs -Xa -implicit-it=always)

Thanks, I hadn't thought of that. (I didn't know it was even possible to use llvm-mc directly in a cmake build.)

Do you happen to know how to check which assembler command-line syntax is in use? It looks to me as if cmake doesn't have check_asm_compiler_flag or CMAKE_ASM_COMPILER_FRONTEND_VARIANT (whereas it has both of those for cxx). I'm sure I can make up my own system if I have to, but if there's an existing one I haven't found, I'd prefer to use it, and surely you would too 🙂

You can use GNU as by passing CMAKE_ASM_COMPILER. I think that check_compiler_flag is the right tool.

I was delayed getting back to this, sorry. But I think the answer to this is that if you're trying to build the Arm assembler builtins with bare GNU as, you're doomed anyway.

The most immediate problem, when I tried it just now, was that invocations of as would fail with error messages of the form "Fatal error: bad defsym; format is --defsym name=value". That arises because cmake has put things like --defsym VISIBILITY_HIDDEN and --defsym _LARGEFILE_SOURCE on the as command line, which indeed violates the syntax of --defsym, which requires an explicit value in every definition.

But worse, those --defsym symbols are clearly the ones that would have been used with -D if cmake had thought it was using a compiler driver as its assembler. So it's expecting them to turn into macros that can be tested using #ifdef. And even if there hadn't been a command-line syntax error, that won't work – --defsym and -D don't mean the same thing, and in any case, bare as doesn't run cpp at all, so the #ifdefs in the source files will be syntax errors regardless of what was defined on the command line. Not to mention the #includes.

This is all true of the existing assembly language source files in lib/builtins, not just my new ones. I think there's no hope of getting any of them to build with any kind of bare assembler like as or llvm-mc; they need a compiler driver which supports -D and will run the preprocessor.

@compnerd, ping?

In case that comment wasn't clear enough, what I'm saying is: I don't think there is any Arm assembler which can consume these source files at all and does not support -Wa,-mimplicit-it=always, because bare GNU as or bare LLVM mc won't be able to handle the use of the C preprocessor. Therefore I don't think there's any need to add a check for whether that command-line option is accepted.

cpp+as doesn't guarantee that -Wa is supported IMO. In fact, I think that -mimplicit-it=always can just be passed without the -Wa, if you are claiming that no assembler could actually support this.

I'm still very confused about how to configure a test build of compiler-rt which will successfully assemble the Arm builtins without being a compiler driver that speaks -Wa,. You seem to be saying here that there's some other way to configure cmake to invoke both cpp and as? But I don't know what it is; surely CMAKE_ASM_COMPILER would need to be set to a single command, and I can't think what command would work.

> In fact, I think that -mimplicit-it=always can just be passed without the -Wa,

Huh, apparently that works in clang! TIL. Doesn't work in my nearest gcc, though – that still demands -Wa,. But I suppose at least that gives me a way to test a piece of cmake that checks which flag works.

OK, now I've put in a check. It turned out that no cmake command higher-level than the general-purpose try_compile would handle assembly language -- all the things like check_compiler_flag() had a list of supported languages not including ASM. So I had to write my own helper function.

I still haven't figured out how to test it against anything that isn't a compiler driver. But I'm testing for -mimplicit-it first, which means that compiling with clang the first test passes, and with gcc the first test fails and we fall back to -Wa,-mimplicit-it, so I've at least been able to check that these tests are doing something nontrivial.

@compnerd, ping?

compnerd · 2025-10-27T12:09:38Z

compiler-rt/lib/builtins/arm/fnorm2.c

+  values->b <<= 8;
+
+  // Test if a is denormal.
+  if (values->expa == 0) {


Future enhancement idea: extract the adjustment into a helper and share across the two values.
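One possible shape for such a helper, as a sketch only. It assumes that the per-operand adjustment shifts a denormal's mantissa up until its leading 1 reaches bit 31 and lowers the exponent to compensate. The struct layout, the signed exponent type, and the names `normalize_denormal` and `normalize_both` are illustrative assumptions, not the contents of fnorm2.c.

```c
#include <stdint.h>

// Hypothetical operand pair. Mantissas are assumed to be already shifted
// left by 8, so that a normalized value has its leading 1 at bit 31.
// Exponents are the biased exponent fields; 0 marks a denormal (or zero).
struct operands {
  uint32_t a, b;
  int32_t expa, expb;
};

// Count leading zero bits of a nonzero 32-bit value (portable loop).
unsigned clz32(uint32_t x) {
  unsigned n = 0;
  while (!(x & 0x80000000u)) {
    x <<= 1;
    n++;
  }
  return n;
}

// Shared adjustment for one operand: if it is a nonzero denormal, move its
// leading 1 to bit 31 and set the exponent to 1 minus the shift distance,
// so that mantissa and exponent still describe the same value.
void normalize_denormal(uint32_t *mant, int32_t *exp) {
  if (*exp != 0 || *mant == 0)
    return; // normal number, or zero: nothing to renormalize
  unsigned shift = clz32(*mant);
  *mant <<= shift;
  *exp = 1 - (int32_t)shift;
}

// Apply the same helper to both operands.
void normalize_both(struct operands *v) {
  normalize_denormal(&v->a, &v->expa);
  normalize_denormal(&v->b, &v->expb);
}
```

For the smallest denormal (mantissa field 1, which is 0x100 after the shift by 8), the helper shifts by 23 places and leaves the exponent at -22.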

compnerd · 2025-10-27T12:17:43Z

This feels generally ready to me. I've not spent time to really deeply think through the actual math behind the operations, but am trusting that ARM had tested this and that the existing math tests cover that.

Oops! I forgot that COMPILER_RT_ARM_OPTIMIZED_FP defaulted to on, so that this check would run even for builds that never even knew about it.

compnerd

Thanks!

This reverts commit f7e6521.

Reverts #161546

One of the buildbots reported a cmake error I don't understand, and which I didn't get in my own test builds:

```
CMake Error at /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/compiler-rt/cmake/Modules/CheckAssemblerFlag.cmake:23 (try_compile):
  COMPILE_DEFINITIONS specified on a srcdir type TRY_COMPILE
```

My best guess is that the thing I did in `CheckAssemblerFlag.cmake` only works on some versions of cmake. But I don't understand the problem well enough to fix it quickly, so I'm reverting the whole patch and will reland it later.

statham-arm · 2025-11-13T17:17:06Z

I had to revert this immediately because of a buildbot breakage. I didn't understand it at the time, because the error message was confusing, but now I do understand.

The new check_assembler_flag cmake function uses try_compile in a mode where it tries to compile a single source file rather than a whole project. But that mode of try_compile was introduced in cmake 3.25, and LLVM only enforces 3.20 or better. The Fuchsia buildbot is running on Ubuntu 22.04, which has 3.22.1.

llvm-ci · 2025-11-13T20:02:02Z

LLVM Buildbot has detected a new failure on builder llvm-clang-win-x-armv7l running on as-builder-1 while building compiler-rt at step 13 "test-check-compiler-rt-armv7-unknown-linux-gnueabihf".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/38/builds/6321

Here is the relevant piece of the build log for the reference

Step 13 (test-check-compiler-rt-armv7-unknown-linux-gnueabihf) failure: Test just built components: check-compiler-rt-armv7-unknown-linux-gnueabihf completed (failure)
******************** TEST 'Builtins-armhf-linux :: divsf3_test.c' FAILED ********************
Exit Code: 1

Command Output (stdout):
--
# RUN: at line 5
C:/buildbot/as-builder-1/x-armv7l/build/./bin/clang.exe   -gline-tables-only  --stdlib=libc++ --sysroot='c:/buildbot/fs/jetson-tk1-arm-ubuntu' -Wthread-safety -Wthread-safety-reference -Wthread-safety-beta  -fomit-frame-pointer -DCOMPILER_RT_ARMHF_TARGET -DCOMPILER_RT_HAS_FLOAT16  -fno-builtin -I C:/buildbot/as-builder-1/x-armv7l/llvm-project/compiler-rt\lib\builtins -nodefaultlibs C:\buildbot\as-builder-1\x-armv7l\llvm-project\compiler-rt\test\builtins\Unit\divsf3_test.c C:/buildbot/as-builder-1/x-armv7l/build/./lib/../lib/clang/22/lib/armv7-unknown-linux-gnueabihf\libclang_rt.builtins.a -lc -lm -o C:\buildbot\as-builder-1\x-armv7l\build\runtimes\runtimes-armv7-unknown-linux-gnueabihf-bins\compiler-rt\test\builtins\Unit\ARMHFLinuxConfig\Output\divsf3_test.c.tmp && "C:/Python313/python.exe" "C:/buildbot/as-builder-1/x-armv7l/llvm-project/llvm/utils/remote-exec.py" [email protected] C:\buildbot\as-builder-1\x-armv7l\build\runtimes\runtimes-armv7-unknown-linux-gnueabihf-bins\compiler-rt\test\builtins\Unit\ARMHFLinuxConfig\Output\divsf3_test.c.tmp
# executed command: C:/buildbot/as-builder-1/x-armv7l/build/./bin/clang.exe -gline-tables-only --stdlib=libc++ --sysroot=c:/buildbot/fs/jetson-tk1-arm-ubuntu -Wthread-safety -Wthread-safety-reference -Wthread-safety-beta -fomit-frame-pointer -DCOMPILER_RT_ARMHF_TARGET -DCOMPILER_RT_HAS_FLOAT16 -fno-builtin -I 'C:/buildbot/as-builder-1/x-armv7l/llvm-project/compiler-rt\lib\builtins' -nodefaultlibs 'C:\buildbot\as-builder-1\x-armv7l\llvm-project\compiler-rt\test\builtins\Unit\divsf3_test.c' 'C:/buildbot/as-builder-1/x-armv7l/build/./lib/../lib/clang/22/lib/armv7-unknown-linux-gnueabihf\libclang_rt.builtins.a' -lc -lm -o 'C:\buildbot\as-builder-1\x-armv7l\build\runtimes\runtimes-armv7-unknown-linux-gnueabihf-bins\compiler-rt\test\builtins\Unit\ARMHFLinuxConfig\Output\divsf3_test.c.tmp'
# executed command: C:/Python313/python.exe C:/buildbot/as-builder-1/x-armv7l/llvm-project/llvm/utils/remote-exec.py [email protected] 'C:\buildbot\as-builder-1\x-armv7l\build\runtimes\runtimes-armv7-unknown-linux-gnueabihf-bins\compiler-rt\test\builtins\Unit\ARMHFLinuxConfig\Output\divsf3_test.c.tmp'
# .---command stdout------------
# | error in test__divsf3(00000000, 80000002) = 00000000, expected 80000000
# | error in test__divsf3(00000000, 807fffff) = 00000000, expected 80000000
# | error in test__divsf3(00000000, 80800001) = 00000000, expected 80000000
# | error in test__divsf3(00000000, 81000000) = 00000000, expected 80000000
# | error in test__divsf3(00000000, c0400000) = 00000000, expected 80000000
# | error in test__divsf3(00000000, c0e00000) = 00000000, expected 80000000
# | error in test__divsf3(00000000, fe7fffff) = 00000000, expected 80000000
# | error in test__divsf3(00000000, ff000000) = 00000000, expected 80000000
# | error in test__divsf3(00000000, ff800000) = 00000000, expected 80000000
# | error in test__divsf3(00000001, 00000000) = 00000001, expected 7f800000
# | error in test__divsf3(00000001, 3e000000) = 00000001, expected 00000008
# | error in test__divsf3(00000001, 3f000000) = 00000001, expected 00000002
# | error in test__divsf3(00000001, 40000000) = 00000001, expected 00000000
# | error in test__divsf3(00000001, 7f7fffff) = 00000001, expected 00000000
# | error in test__divsf3(00000001, 7f800000) = 00000001, expected 00000000
# | error in test__divsf3(00000001, c0000000) = 00000001, expected 80000000
# | error in test__divsf3(00000001, ff7fffff) = 00000001, expected 80000000
# | error in test__divsf3(00000002, 80000000) = 00000002, expected ff800000
# | error in test__divsf3(00000002, ff800000) = 00000002, expected 80000000
# | error in test__divsf3(00000009, 41100000) = 00000009, expected 00000001
# | error in test__divsf3(00000009, c1100000) = 00000009, expected 80000001
# | error in test__divsf3(007ffff7, 3f7ffffe) = 007ffff7, expected 007ffff8
# | error in test__divsf3(007ffffe, 3f7ffffe) = 007ffffe, expected 007fffff
# | error in test__divsf3(007fffff, 00000000) = 007fffff, expected 7f800000
# | error in test__divsf3(007fffff, 3b000000) = 007fffff, expected 04fffffe
# | error in test__divsf3(007fffff, 3f000000) = 007fffff, expected 00fffffe
# | error in test__divsf3(007fffff, 3f800002) = 007fffff, expected 007ffffd
# | error in test__divsf3(007fffff, 7f800000) = 007fffff, expected 00000000
# | error in test__divsf3(007fffff, 80000000) = 007fffff, expected ff800000
# | error in test__divsf3(007fffff, bf800000) = 007fffff, expected 807fffff
# | error in test__divsf3(007fffff, ff800000) = 007fffff, expected 80000000
# | error in test__divsf3(00800000, 00000000) = 00800000, expected 7f800000
# | error in test__divsf3(00800000, 3f800001) = 00800000, expected 007fffff
# | error in test__divsf3(00800000, 7f800000) = 00800000, expected 00000000
# | error in test__divsf3(00800001, 3f800002) = 00800001, expected 007fffff
# | error in test__divsf3(00800001, 80000000) = 00800001, expected ff800000
# | error in test__divsf3(00800001, ff800000) = 00800001, expected 80000000
# | error in test__divsf3(00800002, 3f800006) = 00800002, expected 007ffffc
# | error in test__divsf3(00fffffe, 40000000) = 00fffffe, expected 007fffff
# | error in test__divsf3(00ffffff, 00000000) = 00ffffff, expected 7f800000
...

statham-arm · 2025-11-14T11:24:46Z

Hmmm, those failure logs look like an ABI mismatch to me: all the failing tests seem to have in common that the wrong return value is the same as the first input, which suggests that somewhere a caller and callee disagreed on whether to pass things in integer or float registers. But that buildbot looks like a tricky environment to reproduce!
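To see why an ABI disagreement would produce exactly this symptom, here is a toy model in C. It is not a reproduction of the buildbot failure: the `regs` struct and both function names are made up for illustration, and the callee's arithmetic is a placeholder. The only facts it relies on are the AAPCS conventions that the hard-float variant passes the first two float arguments in s0 and s1 and returns the result in s0, while the soft-float variant uses r0 and r1 and returns in r0.

```c
#include <stdint.h>

// Toy register file: a few integer registers and a few VFP registers.
struct regs {
  uint32_t r[4]; // r0..r3
  uint32_t s[4]; // s0..s3
};

// Stand-in for a routine built for the soft-float convention: it reads its
// operands from r0 and r1 and writes its result to r0. The addition is a
// placeholder for whatever the real routine computes.
void softfp_callee(struct regs *cpu) {
  cpu->r[0] = cpu->r[0] + cpu->r[1];
}

// Stand-in for a caller built for the hard-float convention: it places the
// operands in s0 and s1 and expects the result in s0.
uint32_t hardfp_caller(uint32_t a_bits, uint32_t b_bits) {
  struct regs cpu = {{0, 0, 0, 0}, {0, 0, 0, 0}};
  cpu.s[0] = a_bits;
  cpu.s[1] = b_bits;
  softfp_callee(&cpu); // never touches s0
  return cpu.s[0];     // so the "result" is the first input, unchanged
}
```

In this model `hardfp_caller(0x00000001, 0x3f000000)` returns 0x00000001 whatever the callee computes, which has the same shape as the log lines above.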

(Reland of #161546, fixing three build and test issues)

statham-arm requested review from aykevl, compnerd, petrhosek and smithp35 October 1, 2025 16:24

llvmbot added compiler-rt compiler-rt:builtins labels Oct 1, 2025

aykevl reviewed Oct 1, 2025

compnerd reviewed Oct 1, 2025

statham-arm added 6 commits October 2, 2025 13:51

Fix the Thumb1 build which I forgot to test

8c7228f

Use DEFINE_COMPILERRT_THUMB_FUNCTION in Thumb1

4edb28b

Tweak final return in fnan2 as suggested

c73dfea

Clarify comment about 4-byte boundary

b236372

Lowercase instruction mnemonics and shifter operands

7a24535

This was Petr Hosek's comment on llvm#154093, but if we're doing that, we should do it consistently.

Mention size/speed tradeoff in the cmake option help

40e3621

statham-arm added a commit to statham-arm/llvm-project that referenced this pull request Oct 2, 2025

Changes to fnan2 to be consistent with llvm#161546

a6f6263

Update build and test setup

66a3bcb

Now we should only test the extra NaN faithfulness in cases where it's provided by the library. Also tweaked the cmake setup to make it easier to add more assembly files later. Plus a missing piece of comment in fnan2.c.

compnerd approved these changes Oct 27, 2025

statham-arm added 2 commits November 6, 2025 11:55

Check for the right spelling of the implicit-it option

33232d8

Fix build failure on every non-Arm platform

ecfa7fc

Oops! I forgot that COMPILER_RT_ARM_OPTIMIZED_FP defaulted to on, so that this check would run even for builds that never even knew about it.

compnerd approved these changes Nov 13, 2025

statham-arm merged commit f7e6521 into llvm:main Nov 13, 2025
10 checks passed

statham-arm deleted the optimized-float-mul-div branch November 13, 2025 16:26

statham-arm added a commit that referenced this pull request Nov 13, 2025

Revert "[compiler-rt][ARM] Optimized mulsf3 and divsf3 (#161546)"

4b25214

This reverts commit f7e6521.

statham-arm mentioned this pull request Nov 13, 2025

Revert "[compiler-rt][ARM] Optimized mulsf3 and divsf3" #167906

Merged

statham-arm mentioned this pull request Nov 17, 2025

[compiler-rt][ARM] Optimized mulsf3 and divsf3 #168394

Merged

