@brandtbucher brandtbucher commented Jul 7, 2025

As the new comment says, a manual review of the code generated at -O3, -O2, and -Os suggests that -Os produces the best code for the JIT's use case. The perf impact is close to noise: slightly positive on x86-64 Linux and AArch64 macOS, neutral on AArch64 Linux, and slightly negative on x86-64 Windows. According to the stats, the size of JIT code is down by about 1-2%: https://github.com/faster-cpython/benchmarking-public/blob/main/results/bm-20250628-3.15.0a0-33054dd-JIT/README.md

Here's an example of how skipping tail-duplication removes an extra jump and a duplicate instruction from _POP_TOP (also shrinking its code body from 46 to 39 bytes, about 15%):

-    // 11: 75 04                         jne     0x17 <_JIT_ENTRY+0x17>
+    // 11: 75 0f                         jne     0x22 <_JIT_ENTRY+0x22>
     // 13: ff 0f                         decl    (%rdi)
-    // 15: 74 07                         je      0x1e <_JIT_ENTRY+0x1e>
-    // 17: 4d 8b 6c 24 40                movq    0x40(%r12), %r13
-    // 1c: eb 10                         jmp     0x2e <_JIT_CONTINUE>
-    // 1e: 50                            pushq   %rax
-    // 1f: ff 15 00 00 00 00             callq   *(%rip)                 # 0x25 <_JIT_ENTRY+0x25>
-    // 0000000000000021:  R_X86_64_GOTPCRELX   _Py_Dealloc-0x4
-    // 25: 48 83 c4 08                   addq    $0x8, %rsp
-    // 29: 4d 8b 6c 24 40                movq    0x40(%r12), %r13
-    const unsigned char code_body[46] = {
+    // 15: 75 0b                         jne     0x22 <_JIT_ENTRY+0x22>
+    // 17: 50                            pushq   %rax
+    // 18: ff 15 00 00 00 00             callq   *(%rip)                 # 0x1e <_JIT_ENTRY+0x1e>
+    // 000000000000001a:  R_X86_64_GOTPCRELX   _Py_Dealloc-0x4
+    // 1e: 48 83 c4 08                   addq    $0x8, %rsp
+    // 22: 4d 8b 6c 24 40                movq    0x40(%r12), %r13
+    const unsigned char code_body[39] = {
         0x49, 0x8b, 0x7d, 0xf8, 0x49, 0x83, 0xc5, 0xf8,
         0x4d, 0x89, 0x6c, 0x24, 0x40, 0x40, 0xf6, 0xc7,
-        0x01, 0x75, 0x04, 0xff, 0x0f, 0x74, 0x07, 0x4d,
-        0x8b, 0x6c, 0x24, 0x40, 0xeb, 0x10, 0x50, 0xff,
-        0x15, 0x00, 0x00, 0x00, 0x00, 0x48, 0x83, 0xc4,
-        0x08, 0x4d, 0x8b, 0x6c, 0x24, 0x40,
+        0x01, 0x75, 0x0f, 0xff, 0x0f, 0x75, 0x0b, 0x50,
+        0xff, 0x15, 0x00, 0x00, 0x00, 0x00, 0x48, 0x83,
+        0xc4, 0x08, 0x4d, 0x8b, 0x6c, 0x24, 0x40,
     };
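
As a sanity check on the listing above, the short-jump targets and the byte savings can be recomputed by hand. The sketch below assumes the usual x86-64 encoding of a two-byte short jump (one opcode byte plus a signed 8-bit displacement, relative to the address of the next instruction). The helper name `short_jump_target` is made up for illustration and is not part of the JIT tooling.

```python
def short_jump_target(address: int, displacement: int, length: int = 2) -> int:
    """Return the target of a short (rel8) jump located at `address`.

    The displacement byte is signed and relative to the address of the
    *next* instruction, which is `address + length`.
    """
    if displacement >= 0x80:  # reinterpret the raw byte as a signed value
        displacement -= 0x100
    return address + length + displacement


# Old stencil: "11: 75 04 jne 0x17", "15: 74 07 je 0x1e", "1c: eb 10 jmp 0x2e"
assert short_jump_target(0x11, 0x04) == 0x17
assert short_jump_target(0x15, 0x07) == 0x1E
assert short_jump_target(0x1C, 0x10) == 0x2E

# New stencil: "11: 75 0f jne 0x22" and "15: 75 0b jne 0x22"
assert short_jump_target(0x11, 0x0F) == 0x22
assert short_jump_target(0x15, 0x0B) == 0x22

# Sizes taken from code_body[46] (old) and code_body[39] (new).
old_size, new_size = 46, 39
print(f"{old_size - new_size} bytes saved")
```

Both conditional jumps in the new stencil land on the single remaining `movq 0x40(%r12), %r13`, which is why the duplicate copy and the `jmp` over the call are no longer needed.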

Full diff for the stencils here:

https://gist.github.com/brandtbucher/7340be56f2d2cf7061b5c9bf1c87939c

@brandtbucher brandtbucher self-assigned this Jul 7, 2025
@brandtbucher brandtbucher added performance Performance or resource usage skip news interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-JIT labels Jul 7, 2025
f"-I{CPYTHON / 'Python'}",
f"-I{CPYTHON / 'Tools' / 'jit'}",
"-O3",
# -O2 and -O3 include some optimizations that make sense for
Member

Did you investigate -Oz as well? The clang docs are fairly vague, but they say it reduces code size even further, so I'm curious if it's worth investigating as well.

Member Author

Nice idea! I'm definitely down to try benchmarking it after this lands.

I suspect it may be quite a bit slower, though. My understanding is that -Os does all of the meaningful performance optimizations except those that increase size, while -Oz will actually hurt performance in pursuit of the smallest possible machine code. Our goal is to be fast, of course, but in this particular case -Os is also just giving us better code (as a side effect of not aligning jumps or duplicating tails, etc.). So smaller isn't necessarily always better.
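
For anyone who wants to compare levels locally, one way to do it is to make the optimization flag a parameter of the argument list. This is only a sketch: `clang_args`, `OPT_LEVELS`, and the `CPYTHON` path below are hypothetical stand-ins, and the real build script passes more flags than the three shown here.

```python
from pathlib import Path

# Hypothetical checkout location; the real build script defines its own CPYTHON.
CPYTHON = Path("cpython")

# Optimization levels mentioned in this thread (all are standard clang flags).
OPT_LEVELS = ("-O2", "-O3", "-Os", "-Oz")


def clang_args(opt_level: str = "-Os") -> list[str]:
    """Build a partial clang argument list for one optimization level."""
    if opt_level not in OPT_LEVELS:
        raise ValueError(f"unsupported optimization level: {opt_level!r}")
    return [
        f"-I{CPYTHON / 'Python'}",
        f"-I{CPYTHON / 'Tools' / 'jit'}",
        opt_level,
    ]


for level in OPT_LEVELS:
    print(level, clang_args(level))
```

Building the stencils once per level and diffing the generated output is roughly the manual review described above.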

Member Author

Yeah, I'm not sure this is going to be a win. It basically turns off inlining for functions called more than once. For instance, _POP_TWO turns from this on -Os:

    // 0000000000000000 <_JIT_ENTRY>:
    // 0: 50                            pushq   %rax
    // 1: 49 8d 45 f8                   leaq    -0x8(%r13), %rax
    // 5: 49 8b 5d f0                   movq    -0x10(%r13), %rbx
    // 9: 49 8b 7d f8                   movq    -0x8(%r13), %rdi
    // d: 49 89 44 24 40                movq    %rax, 0x40(%r12)
    // 12: 40 f6 c7 01                   testb   $0x1, %dil
    // 16: 75 0a                         jne     0x22 <_JIT_ENTRY+0x22>
    // 18: ff 0f                         decl    (%rdi)
    // 1a: 75 06                         jne     0x22 <_JIT_ENTRY+0x22>
    // 1c: ff 15 00 00 00 00             callq   *(%rip)                 # 0x22 <_JIT_ENTRY+0x22>
    // 000000000000001e:  R_X86_64_GOTPCRELX   _Py_Dealloc-0x4
    // 22: 49 83 44 24 40 f8             addq    $-0x8, 0x40(%r12)
    // 28: f6 c3 01                      testb   $0x1, %bl
    // 2b: 75 0d                         jne     0x3a <_JIT_ENTRY+0x3a>
    // 2d: ff 0b                         decl    (%rbx)
    // 2f: 75 09                         jne     0x3a <_JIT_ENTRY+0x3a>
    // 31: 48 89 df                      movq    %rbx, %rdi
    // 34: ff 15 00 00 00 00             callq   *(%rip)                 # 0x3a <_JIT_ENTRY+0x3a>
    // 0000000000000036:  R_X86_64_GOTPCRELX   _Py_Dealloc-0x4
    // 3a: 4d 8b 6c 24 40                movq    0x40(%r12), %r13
    // 3f: 58                            popq    %rax

Into this on -Oz (outlining PyStackRef_CLOSE makes it 2 bytes shorter, but adds up to three additional jumps):

    // 0000000000000000 <_JIT_ENTRY>:
    // 0: 50                            pushq   %rax
    // 1: 49 8d 45 f8                   leaq    -0x8(%r13), %rax
    // 5: 49 8b 5d f0                   movq    -0x10(%r13), %rbx
    // 9: 49 8b 7d f8                   movq    -0x8(%r13), %rdi
    // d: 49 89 44 24 40                movq    %rax, 0x40(%r12)
    // 12: e8 16 00 00 00                callq   0x2d <PyStackRef_CLOSE>
    // 17: 49 83 44 24 40 f8             addq    $-0x8, 0x40(%r12)
    // 1d: 48 89 df                      movq    %rbx, %rdi
    // 20: e8 08 00 00 00                callq   0x2d <PyStackRef_CLOSE>
    // 25: 4d 8b 6c 24 40                movq    0x40(%r12), %r13
    // 2a: 58                            popq    %rax
    // 2b: eb 11                         jmp     0x3e <_JIT_CONTINUE>
    // 
    // 000000000000002d <PyStackRef_CLOSE>:
    // 2d: 40 f6 c7 01                   testb   $0x1, %dil
    // 31: 75 04                         jne     0x37 <PyStackRef_CLOSE+0xa>
    // 33: ff 0f                         decl    (%rdi)
    // 35: 74 01                         je      0x38 <PyStackRef_CLOSE+0xb>
    // 37: c3                            retq
    // 38: ff 25 00 00 00 00             jmpq    *(%rip)                 # 0x3e <_JIT_CONTINUE>
    // 000000000000003a:  R_X86_64_GOTPCRELX   _Py_Dealloc-0x4

I'll still try benchmarking it though. But I'll land this PR in the meantime since it's just a one-character change.
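
Both listings do the same per-reference work and differ only in where it lives. Here is a rough Python model of that logic as read off the assembly (test the low bit, decrement, call `_Py_Dealloc` on zero). It is an illustration, not CPython's implementation: the dictionary of refcounts, the `dealloc` callback, and the reading of the low bit as "skip the decrement" are assumptions made for the sketch.

```python
from typing import Callable, Dict

TAG_BIT = 0x1  # the bit checked by `testb $0x1, %dil` in the listings


def stackref_close(ref: int, refcounts: Dict[int, int],
                   dealloc: Callable[[int], None]) -> None:
    """Model the close sequence for one stack reference."""
    if ref & TAG_BIT:        # testb + jne: tagged reference, nothing to do
        return
    refcounts[ref] -= 1      # decl (%rdi)
    if refcounts[ref] == 0:  # only a zero count reaches the dealloc call
        dealloc(ref)         # stands in for _Py_Dealloc


def pop_two(top: int, second: int, refcounts: Dict[int, int],
            dealloc: Callable[[int], None]) -> None:
    """Close the top of the stack first, then the value below it."""
    stackref_close(top, refcounts, dealloc)
    stackref_close(second, refcounts, dealloc)


freed = []
counts = {0x1000: 1, 0x2000: 2}
pop_two(0x1000, 0x2000, counts, freed.append)
print(freed, counts)
```

With -Os the body of `stackref_close` is inlined twice into `_JIT_ENTRY`; with -Oz it becomes a separate function, so each close costs a call plus a return (or a tail jump to `_Py_Dealloc`) on top of the checks themselves.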

Member Author

Yep, -Oz is about 1-2% slower across the board.

@brandtbucher brandtbucher merged commit c49dc3b into python:main Jul 9, 2025
72 checks passed