GH-115802: Optimize JIT stencils for size #136393

Merged
merged 2 commits into from
Jul 9, 2025
10 changes: 9 additions & 1 deletion Tools/jit/_targets.py
@@ -137,7 +137,15 @@ async def _compile(
 f"-I{CPYTHON / 'Include' / 'internal' / 'mimalloc'}",
 f"-I{CPYTHON / 'Python'}",
 f"-I{CPYTHON / 'Tools' / 'jit'}",
-"-O3",
+# -O2 and -O3 include some optimizations that make sense for
Member

Did you investigate -Oz as well? The clang docs are fairly vague, but they say it reduces code size even further, so I'm curious if it's worth investigating as well.

Member Author

Nice idea! I'm definitely down to try benchmarking it after this lands.

I suspect it may be quite a bit slower, though. My understanding is that -Os does all of the meaningful performance optimizations except those that increase size, while -Oz will actually hurt performance in pursuit of the smallest possible machine code. Our goal is to be fast, of course, but in this particular case -Os is also just giving us better code (as a side-effect of not aligning jumps or duplicating tails, etc). So smaller isn't necessarily always better.
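
To check a claim like this on a standalone C file, one rough approach is to compile it at each optimization level and compare the resulting object files. The sketch below is only an illustration and is not part of `Tools/jit`: the helper names `clang_command` and `object_sizes` are hypothetical, it assumes a `clang` binary on `PATH`, and whole-file size is only a proxy for machine-code size.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# The optimization levels discussed in this thread (real clang flags).
OPT_LEVELS = ("-O2", "-O3", "-Os", "-Oz")


def clang_command(source: Path, output: Path, opt_level: str) -> list[str]:
    """Build a clang invocation that compiles `source` to an object file."""
    if opt_level not in OPT_LEVELS:
        raise ValueError(f"unexpected optimization level: {opt_level}")
    return ["clang", opt_level, "-c", str(source), "-o", str(output)]


def object_sizes(source: Path) -> dict[str, int]:
    """Compile `source` at each level and return object file sizes in bytes.

    Returns an empty dict when clang is not installed. The file size includes
    headers and symbol tables, so treat it as a rough comparison only.
    """
    if shutil.which("clang") is None:
        return {}
    sizes = {}
    with tempfile.TemporaryDirectory() as tmp:
        for level in OPT_LEVELS:
            output = Path(tmp) / f"out{level}.o"
            subprocess.run(clang_command(source, output, level), check=True)
            sizes[level] = output.stat().st_size
    return sizes
```

Size alone does not settle the question raised here, since the smaller build can still be slower; it only shows how much each level trades away.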

Member Author

Yeah, I'm not sure this is going to be a win. It basically turns off inlining for functions called more than once. For instance, _POP_TWO turns from this on -Os:

    // 0000000000000000 <_JIT_ENTRY>:
    // 0: 50                            pushq   %rax
    // 1: 49 8d 45 f8                   leaq    -0x8(%r13), %rax
    // 5: 49 8b 5d f0                   movq    -0x10(%r13), %rbx
    // 9: 49 8b 7d f8                   movq    -0x8(%r13), %rdi
    // d: 49 89 44 24 40                movq    %rax, 0x40(%r12)
    // 12: 40 f6 c7 01                   testb   $0x1, %dil
    // 16: 75 0a                         jne     0x22 <_JIT_ENTRY+0x22>
    // 18: ff 0f                         decl    (%rdi)
    // 1a: 75 06                         jne     0x22 <_JIT_ENTRY+0x22>
    // 1c: ff 15 00 00 00 00             callq   *(%rip)                 # 0x22 <_JIT_ENTRY+0x22>
    // 000000000000001e:  R_X86_64_GOTPCRELX   _Py_Dealloc-0x4
    // 22: 49 83 44 24 40 f8             addq    $-0x8, 0x40(%r12)
    // 28: f6 c3 01                      testb   $0x1, %bl
    // 2b: 75 0d                         jne     0x3a <_JIT_ENTRY+0x3a>
    // 2d: ff 0b                         decl    (%rbx)
    // 2f: 75 09                         jne     0x3a <_JIT_ENTRY+0x3a>
    // 31: 48 89 df                      movq    %rbx, %rdi
    // 34: ff 15 00 00 00 00             callq   *(%rip)                 # 0x3a <_JIT_ENTRY+0x3a>
    // 0000000000000036:  R_X86_64_GOTPCRELX   _Py_Dealloc-0x4
    // 3a: 4d 8b 6c 24 40                movq    0x40(%r12), %r13
    // 3f: 58                            popq    %rax

Into this on -Oz (outlining PyStackRef_CLOSE makes it 2 bytes shorter, but adds up to three additional jumps):

    // 0000000000000000 <_JIT_ENTRY>:
    // 0: 50                            pushq   %rax
    // 1: 49 8d 45 f8                   leaq    -0x8(%r13), %rax
    // 5: 49 8b 5d f0                   movq    -0x10(%r13), %rbx
    // 9: 49 8b 7d f8                   movq    -0x8(%r13), %rdi
    // d: 49 89 44 24 40                movq    %rax, 0x40(%r12)
    // 12: e8 16 00 00 00                callq   0x2d <PyStackRef_CLOSE>
    // 17: 49 83 44 24 40 f8             addq    $-0x8, 0x40(%r12)
    // 1d: 48 89 df                      movq    %rbx, %rdi
    // 20: e8 08 00 00 00                callq   0x2d <PyStackRef_CLOSE>
    // 25: 4d 8b 6c 24 40                movq    0x40(%r12), %r13
    // 2a: 58                            popq    %rax
    // 2b: eb 11                         jmp     0x3e <_JIT_CONTINUE>
    // 
    // 000000000000002d <PyStackRef_CLOSE>:
    // 2d: 40 f6 c7 01                   testb   $0x1, %dil
    // 31: 75 04                         jne     0x37 <PyStackRef_CLOSE+0xa>
    // 33: ff 0f                         decl    (%rdi)
    // 35: 74 01                         je      0x38 <PyStackRef_CLOSE+0xb>
    // 37: c3                            retq
    // 38: ff 25 00 00 00 00             jmpq    *(%rip)                 # 0x3e <_JIT_CONTINUE>
    // 000000000000003a:  R_X86_64_GOTPCRELX   _Py_Dealloc-0x4

I'll still try benchmarking it though. But I'll land this PR in the meantime since it's just a one-character change.
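
The byte counts in the two listings above can be checked mechanically. The following is a minimal sketch, assuming disassembly lines in the same format as shown here (an optional `//`, an offset, a colon, then space-separated hex bytes); the name `encoded_size` is a hypothetical helper, not something from the PR.

```python
import re

# One instruction line: optional "//", an offset, a colon, then hex bytes
# separated by single spaces. Label lines and relocation lines do not match.
INSTRUCTION = re.compile(r"^\s*(?://)?\s*([0-9a-f]+):\s+((?:[0-9a-f]{2} )+)")


def encoded_size(listing: str) -> int:
    """Sum the encoded instruction bytes in an objdump-style listing."""
    total = 0
    for line in listing.splitlines():
        match = INSTRUCTION.match(line)
        if match:
            total += len(match.group(2).split())
    return total


# The outlined helper from the -Oz listing, copied verbatim.
OUTLINED_CLOSE = """\
// 000000000000002d <PyStackRef_CLOSE>:
// 2d: 40 f6 c7 01                   testb   $0x1, %dil
// 31: 75 04                         jne     0x37 <PyStackRef_CLOSE+0xa>
// 33: ff 0f                         decl    (%rdi)
// 35: 74 01                         je      0x38 <PyStackRef_CLOSE+0xb>
// 37: c3                            retq
// 38: ff 25 00 00 00 00             jmpq    *(%rip)                 # 0x3e <_JIT_CONTINUE>
// 000000000000003a:  R_X86_64_GOTPCRELX   _Py_Dealloc-0x4
"""

print(encoded_size(OUTLINED_CLOSE))  # 17
```

Fed the two full listings, the same function should report 64 bytes for `-Os` and 62 bytes for `-Oz`, which matches the 2-byte difference mentioned above.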

Member Author

Yep, -Oz is about 1-2% slower across the board.

+# standalone functions, but not for snippets of code that are going
+# to be laid out end-to-end (like ours)... common examples include
+# passes like tail-duplication, or aligning jump targets with nops.
+# -Os is equivalent to -O2 with many of these problematic passes
+# disabled. Based on manual review, for *our* purposes it usually
+# generates better code than -O2 (and -O2 usually generates better
+# code than -O3). As a nice benefit, it uses less memory too:
+"-Os",
 "-S",
 # Shorten full absolute file paths in the generated code (like the
 # __FILE__ macro and assert failure messages) for reproducibility: