GH-115802: Optimize JIT stencils for size #136393

brandtbucher · 2025-07-07T18:52:43Z

As the new comment says, upon manual review of -O3, -O2, and -Os, it seems that -Os generates the best code for the JIT's use-case. Perf impact is close to noise, but slightly positive on x86-64 Linux and AArch64 macOS, neutral on AArch64 Linux, and slightly negative on x86-64 Windows. According to the stats, the size of JIT code is down by about 1-2%: https://github.com/faster-cpython/benchmarking-public/blob/main/results/bm-20250628-3.15.0a0-33054dd-JIT/README.md

Here's an example of how skipping tail-duplication removes an extra jump and a duplicate instruction from _POP_TOP (also reducing its size by about 15%, from 46 to 39 bytes):

-    // 11: 75 04                         jne     0x17 <_JIT_ENTRY+0x17>
+    // 11: 75 0f                         jne     0x22 <_JIT_ENTRY+0x22>
     // 13: ff 0f                         decl    (%rdi)
-    // 15: 74 07                         je      0x1e <_JIT_ENTRY+0x1e>
-    // 17: 4d 8b 6c 24 40                movq    0x40(%r12), %r13
-    // 1c: eb 10                         jmp     0x2e <_JIT_CONTINUE>
-    // 1e: 50                            pushq   %rax
-    // 1f: ff 15 00 00 00 00             callq   *(%rip)                 # 0x25 <_JIT_ENTRY+0x25>
-    // 0000000000000021:  R_X86_64_GOTPCRELX   _Py_Dealloc-0x4
-    // 25: 48 83 c4 08                   addq    $0x8, %rsp
-    // 29: 4d 8b 6c 24 40                movq    0x40(%r12), %r13
-    const unsigned char code_body[46] = {
+    // 15: 75 0b                         jne     0x22 <_JIT_ENTRY+0x22>
+    // 17: 50                            pushq   %rax
+    // 18: ff 15 00 00 00 00             callq   *(%rip)                 # 0x1e <_JIT_ENTRY+0x1e>
+    // 000000000000001a:  R_X86_64_GOTPCRELX   _Py_Dealloc-0x4
+    // 1e: 48 83 c4 08                   addq    $0x8, %rsp
+    // 22: 4d 8b 6c 24 40                movq    0x40(%r12), %r13
+    const unsigned char code_body[39] = {
         0x49, 0x8b, 0x7d, 0xf8, 0x49, 0x83, 0xc5, 0xf8,
         0x4d, 0x89, 0x6c, 0x24, 0x40, 0x40, 0xf6, 0xc7,
-        0x01, 0x75, 0x04, 0xff, 0x0f, 0x74, 0x07, 0x4d,
-        0x8b, 0x6c, 0x24, 0x40, 0xeb, 0x10, 0x50, 0xff,
-        0x15, 0x00, 0x00, 0x00, 0x00, 0x48, 0x83, 0xc4,
-        0x08, 0x4d, 0x8b, 0x6c, 0x24, 0x40,
+        0x01, 0x75, 0x0f, 0xff, 0x0f, 0x75, 0x0b, 0x50,
+        0xff, 0x15, 0x00, 0x00, 0x00, 0x00, 0x48, 0x83,
+        0xc4, 0x08, 0x4d, 0x8b, 0x6c, 0x24, 0x40,
     };
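As a quick check of the size change visible in the diff above (`code_body` goes from 46 to 39 bytes), the relative saving can be computed directly. This is only an illustrative sketch; `relative_saving` is a hypothetical helper, not part of the JIT tooling.

```python
# Sizes taken from the code_body declarations in the diff above.
OLD_SIZE = 46  # const unsigned char code_body[46] (before)
NEW_SIZE = 39  # const unsigned char code_body[39] (after)


def relative_saving(old: int, new: int) -> float:
    """Return the fraction of bytes saved when going from old to new."""
    return (old - new) / old


if __name__ == "__main__":
    # Hypothetical helper for illustration only.
    print(f"_POP_TOP shrinks by {relative_saving(OLD_SIZE, NEW_SIZE):.1%}")
```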

Full diff for the stencils here:

https://gist.github.com/brandtbucher/7340be56f2d2cf7061b5c9bf1c87939c

Issue: Improving JIT code quality #115802

savannahostrowski · 2025-07-08T02:27:12Z

Tools/jit/_targets.py

@@ -137,7 +137,15 @@ async def _compile(
            f"-I{CPYTHON / 'Include' / 'internal' / 'mimalloc'}",
            f"-I{CPYTHON / 'Python'}",
            f"-I{CPYTHON / 'Tools' / 'jit'}",
-            "-O3",
+            # -O2 and -O3 include some optimizations that make sense for


Did you investigate -Oz as well? The clang docs are fairly vague, but they say it reduces code size even further, so I'm curious if it's worth investigating as well.

brandtbucher · 2025-07-09

Nice idea! I'm definitely down to try benchmarking it after this lands.

I suspect it may be quite a bit slower, though. My understanding is that -Os does all of the meaningful performance optimizations except those that increase size, while -Oz will actually hurt performance in pursuit of the smallest possible machine code. Our goal is to be fast, of course, but in this particular case -Os is also just giving us better code (as a side-effect of not aligning jumps or duplicating tails, etc). So smaller isn't necessarily always better.
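One way to compare levels such as -Os and -Oz would be to make the optimization level a parameter of the flag list. The following is a minimal sketch under stated assumptions: `compile_args`, `ALLOWED_LEVELS`, and the `CPYTHON` value are hypothetical and do not reflect the actual contents of Tools/jit/_targets.py. Only the include paths and the `-O` flags themselves come from the diff and discussion above.

```python
from pathlib import Path

# Hypothetical checkout location, for illustration only.
CPYTHON = Path("cpython")

# Clang optimization levels mentioned in this discussion.
ALLOWED_LEVELS = ("-O2", "-O3", "-Os", "-Oz")


def compile_args(opt_level: str = "-Os") -> list[str]:
    """Build a partial clang argument list with a configurable -O level.

    Hypothetical helper: the real build script passes more flags than this.
    """
    if opt_level not in ALLOWED_LEVELS:
        raise ValueError(f"unexpected optimization level: {opt_level!r}")
    return [
        f"-I{CPYTHON / 'Include' / 'internal' / 'mimalloc'}",
        f"-I{CPYTHON / 'Python'}",
        f"-I{CPYTHON / 'Tools' / 'jit'}",
        opt_level,
    ]


if __name__ == "__main__":
    for level in ("-Os", "-Oz"):
        print(level, compile_args(level))
```

Keeping the level in one place would make a follow-up benchmark run a small change, though the default should stay whatever the measurements favor.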

brandtbucher · 2025-07-09

Yeah, I'm not sure this is going to be a win. It basically turns off inlining for functions called more than once. For instance, _POP_TWO goes from this on -Os:

    // 0000000000000000 <_JIT_ENTRY>:
    // 0: 50                            pushq   %rax
    // 1: 49 8d 45 f8                   leaq    -0x8(%r13), %rax
    // 5: 49 8b 5d f0                   movq    -0x10(%r13), %rbx
    // 9: 49 8b 7d f8                   movq    -0x8(%r13), %rdi
    // d: 49 89 44 24 40                movq    %rax, 0x40(%r12)
    // 12: 40 f6 c7 01                   testb   $0x1, %dil
    // 16: 75 0a                         jne     0x22 <_JIT_ENTRY+0x22>
    // 18: ff 0f                         decl    (%rdi)
    // 1a: 75 06                         jne     0x22 <_JIT_ENTRY+0x22>
    // 1c: ff 15 00 00 00 00             callq   *(%rip)                 # 0x22 <_JIT_ENTRY+0x22>
    // 000000000000001e:  R_X86_64_GOTPCRELX   _Py_Dealloc-0x4
    // 22: 49 83 44 24 40 f8             addq    $-0x8, 0x40(%r12)
    // 28: f6 c3 01                      testb   $0x1, %bl
    // 2b: 75 0d                         jne     0x3a <_JIT_ENTRY+0x3a>
    // 2d: ff 0b                         decl    (%rbx)
    // 2f: 75 09                         jne     0x3a <_JIT_ENTRY+0x3a>
    // 31: 48 89 df                      movq    %rbx, %rdi
    // 34: ff 15 00 00 00 00             callq   *(%rip)                 # 0x3a <_JIT_ENTRY+0x3a>
    // 0000000000000036:  R_X86_64_GOTPCRELX   _Py_Dealloc-0x4
    // 3a: 4d 8b 6c 24 40                movq    0x40(%r12), %r13
    // 3f: 58                            popq    %rax

Into this on -Oz (outlining PyStackRef_CLOSE makes it 2 bytes shorter, but adds up to three additional jumps):

    // 0000000000000000 <_JIT_ENTRY>:
    // 0: 50                            pushq   %rax
    // 1: 49 8d 45 f8                   leaq    -0x8(%r13), %rax
    // 5: 49 8b 5d f0                   movq    -0x10(%r13), %rbx
    // 9: 49 8b 7d f8                   movq    -0x8(%r13), %rdi
    // d: 49 89 44 24 40                movq    %rax, 0x40(%r12)
    // 12: e8 16 00 00 00                callq   0x2d <PyStackRef_CLOSE>
    // 17: 49 83 44 24 40 f8             addq    $-0x8, 0x40(%r12)
    // 1d: 48 89 df                      movq    %rbx, %rdi
    // 20: e8 08 00 00 00                callq   0x2d <PyStackRef_CLOSE>
    // 25: 4d 8b 6c 24 40                movq    0x40(%r12), %r13
    // 2a: 58                            popq    %rax
    // 2b: eb 11                         jmp     0x3e <_JIT_CONTINUE>
    //
    // 000000000000002d <PyStackRef_CLOSE>:
    // 2d: 40 f6 c7 01                   testb   $0x1, %dil
    // 31: 75 04                         jne     0x37 <PyStackRef_CLOSE+0xa>
    // 33: ff 0f                         decl    (%rdi)
    // 35: 74 01                         je      0x38 <PyStackRef_CLOSE+0xb>
    // 37: c3                            retq
    // 38: ff 25 00 00 00 00             jmpq    *(%rip)                 # 0x3e <_JIT_CONTINUE>
    // 000000000000003a:  R_X86_64_GOTPCRELX   _Py_Dealloc-0x4

I'll still try benchmarking it though. But I'll land this PR in the meantime since it's just a one-character change.
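The "2 bytes shorter" figure can be read off the listings: each stencil's size is the offset of its last instruction plus that instruction's length. Below is a rough sketch of that calculation. `stencil_size` is a hypothetical helper, and it assumes objdump-style comment lines of the form `// <offset>: <hex bytes> <mnemonic> ...`, treating leading two-digit hex tokens as instruction bytes.

```python
import re

# "// <hex offset>: <rest>"; symbol headers such as
# "// 0000000000000000 <_JIT_ENTRY>:" do not match this pattern.
LINE = re.compile(r"^\s*//\s*([0-9a-f]+):\s+(.*)$")
BYTE = re.compile(r"^[0-9a-f]{2}$")


def stencil_size(listing: str) -> int:
    """Return the end offset of the last instruction in a listing."""
    end = 0
    for line in listing.splitlines():
        match = LINE.match(line)
        if match is None:
            continue
        offset = int(match.group(1), 16)
        length = 0
        # Count leading byte tokens; stop at the mnemonic or relocation name.
        for token in match.group(2).split():
            if not BYTE.match(token):
                break
            length += 1
        end = max(end, offset + length)
    return end


# Final lines of the two _POP_TWO listings quoted above.
OS_TAIL = """
// 3a: 4d 8b 6c 24 40 movq 0x40(%r12), %r13
// 3f: 58 popq %rax
"""
OZ_TAIL = """
// 37: c3 retq
// 38: ff 25 00 00 00 00 jmpq *(%rip) # 0x3e <_JIT_CONTINUE>
// 000000000000003a: R_X86_64_GOTPCRELX _Py_Dealloc-0x4
"""

if __name__ == "__main__":
    print(stencil_size(OS_TAIL) - stencil_size(OZ_TAIL), "bytes saved by -Oz")
```

Only the tail of each listing is needed here, because the result depends on the highest end offset alone.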

brandtbucher · 2025-07-10

Yep, -Oz is about 1-2% slower across the board.

brandtbucher and others added 2 commits June 28, 2025 08:41

-Os

33054dd

Add comment explaining -Os

dbca451

brandtbucher self-assigned this Jul 7, 2025

brandtbucher requested a review from savannahostrowski as a code owner July 7, 2025 18:52

brandtbucher added the performance, skip news, interpreter-core, and topic-JIT labels Jul 7, 2025

bedevere-app bot added the awaiting core review label Jul 7, 2025

bedevere-app bot mentioned this pull request Jul 7, 2025

Improving JIT code quality #115802

Open

13 tasks

savannahostrowski reviewed Jul 8, 2025


brandtbucher merged commit c49dc3b into python:main Jul 9, 2025
72 checks passed

bedevere-app bot removed the awaiting core review label Jul 9, 2025

AndPuQing pushed a commit to AndPuQing/cpython that referenced this pull request Jul 11, 2025

pythonGH-115802: Optimize JIT stencils for size (pythonGH-136393)

21b1d77

Pranjal095 pushed a commit to Pranjal095/cpython that referenced this pull request Jul 12, 2025

pythonGH-115802: Optimize JIT stencils for size (pythonGH-136393)

2e4718c
