GH-115802: Optimize JIT stencils for size #136393
@@ -137,7 +137,15 @@ async def _compile(
f"-I{CPYTHON / 'Include' / 'internal' / 'mimalloc'}",
f"-I{CPYTHON / 'Python'}",
f"-I{CPYTHON / 'Tools' / 'jit'}",
"-O3",
# -O2 and -O3 include some optimizations that make sense for
Did you investigate -Oz as well? The clang docs are fairly vague, but they say it reduces code size even further, so I'm curious whether it's worth a look.
Nice idea! I'm definitely down to try benchmarking it after this lands.
I suspect it may be quite a bit slower, though. My understanding is that -Os does all of the meaningful performance optimizations except those that increase size, while -Oz will actually hurt performance in pursuit of the smallest possible machine code. Our goal is to be fast, of course, but in this particular case -Os is also just giving us better code (as a side-effect of not aligning jumps or duplicating tails, etc). So smaller isn't necessarily always better.
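For anyone who wants to try the comparison locally, the trade-off above comes down to a single entry in the clang argument list. Here is a minimal sketch, assuming a hypothetical helper named build_clang_args (it is not the real code in Tools/jit) and using only two of the include paths from the diff:

```python
from pathlib import Path

# Optimization levels discussed in this thread.
CANDIDATE_LEVELS = ("-O2", "-O3", "-Os", "-Oz")


def build_clang_args(cpython: Path, opt_level: str = "-Os") -> list[str]:
    """Hypothetical helper: return a clang argument list for one opt level.

    Only the flag-selection idea is illustrated here; the real build
    passes many more arguments.
    """
    if opt_level not in CANDIDATE_LEVELS:
        raise ValueError(f"unexpected optimization level: {opt_level!r}")
    return [
        f"-I{cpython / 'Python'}",
        f"-I{cpython / 'Tools' / 'jit'}",
        opt_level,
    ]


# One argument list per candidate level, ready for a side-by-side comparison.
for level in CANDIDATE_LEVELS:
    print(level, build_clang_args(Path("cpython"), level))
```

Keeping the level as a parameter makes it cheap to generate stencils for each candidate and compare them, which is roughly the manual review described in this PR.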
Yeah, I'm not sure this is going to be a win. It basically turns off inlining for functions called more than once. For instance, _POP_TWO turns from this on -Os:
// 0000000000000000 <_JIT_ENTRY>:
// 0: 50 pushq %rax
// 1: 49 8d 45 f8 leaq -0x8(%r13), %rax
// 5: 49 8b 5d f0 movq -0x10(%r13), %rbx
// 9: 49 8b 7d f8 movq -0x8(%r13), %rdi
// d: 49 89 44 24 40 movq %rax, 0x40(%r12)
// 12: 40 f6 c7 01 testb $0x1, %dil
// 16: 75 0a jne 0x22 <_JIT_ENTRY+0x22>
// 18: ff 0f decl (%rdi)
// 1a: 75 06 jne 0x22 <_JIT_ENTRY+0x22>
// 1c: ff 15 00 00 00 00 callq *(%rip) # 0x22 <_JIT_ENTRY+0x22>
// 000000000000001e: R_X86_64_GOTPCRELX _Py_Dealloc-0x4
// 22: 49 83 44 24 40 f8 addq $-0x8, 0x40(%r12)
// 28: f6 c3 01 testb $0x1, %bl
// 2b: 75 0d jne 0x3a <_JIT_ENTRY+0x3a>
// 2d: ff 0b decl (%rbx)
// 2f: 75 09 jne 0x3a <_JIT_ENTRY+0x3a>
// 31: 48 89 df movq %rbx, %rdi
// 34: ff 15 00 00 00 00 callq *(%rip) # 0x3a <_JIT_ENTRY+0x3a>
// 0000000000000036: R_X86_64_GOTPCRELX _Py_Dealloc-0x4
// 3a: 4d 8b 6c 24 40 movq 0x40(%r12), %r13
// 3f: 58 popq %rax
Into this on -Oz (outlining PyStackRef_CLOSE makes it 2 bytes shorter, but adds up to three additional jumps):
// 0000000000000000 <_JIT_ENTRY>:
// 0: 50 pushq %rax
// 1: 49 8d 45 f8 leaq -0x8(%r13), %rax
// 5: 49 8b 5d f0 movq -0x10(%r13), %rbx
// 9: 49 8b 7d f8 movq -0x8(%r13), %rdi
// d: 49 89 44 24 40 movq %rax, 0x40(%r12)
// 12: e8 16 00 00 00 callq 0x2d <PyStackRef_CLOSE>
// 17: 49 83 44 24 40 f8 addq $-0x8, 0x40(%r12)
// 1d: 48 89 df movq %rbx, %rdi
// 20: e8 08 00 00 00 callq 0x2d <PyStackRef_CLOSE>
// 25: 4d 8b 6c 24 40 movq 0x40(%r12), %r13
// 2a: 58 popq %rax
// 2b: eb 11 jmp 0x3e <_JIT_CONTINUE>
//
// 000000000000002d <PyStackRef_CLOSE>:
// 2d: 40 f6 c7 01 testb $0x1, %dil
// 31: 75 04 jne 0x37 <PyStackRef_CLOSE+0xa>
// 33: ff 0f decl (%rdi)
// 35: 74 01 je 0x38 <PyStackRef_CLOSE+0xb>
// 37: c3 retq
// 38: ff 25 00 00 00 00 jmpq *(%rip) # 0x3e <_JIT_CONTINUE>
// 000000000000003a: R_X86_64_GOTPCRELX _Py_Dealloc-0x4
I'll still try benchmarking it though. But I'll land this PR in the meantime since it's just a one-character change.
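The "2 bytes shorter" figure can be checked from the two listings. This is a small sketch that takes the offset and encoded length of the last instruction in each listing as its only inputs; the variable names are made up for illustration:

```python
# Size of a listing = offset of its last instruction + that instruction's length.

# -Os listing: last instruction is `popq %rax` at 0x3f, encoded in 1 byte (58).
os_size = 0x3F + 1

# -Oz listing: last instruction is `jmpq *(%rip)` at 0x38, encoded in 6 bytes
# (ff 25 00 00 00 00). This total includes the outlined PyStackRef_CLOSE body.
oz_size = 0x38 + 6

saved = os_size - oz_size
print(f"-Os: {os_size} bytes, -Oz: {oz_size} bytes, saved: {saved} bytes")
```

The saving is small because the -Oz total still includes the outlined PyStackRef_CLOSE body, and each close now goes through a callq to reach it.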
Yep, -Oz is about 1-2% slower across the board.
As the new comment says, upon manual review of -O3, -O2, and -Os, it seems that -Os generates the best code for the JIT's use-case. Perf impact is close to noise, but slightly positive on x86-64 Linux and AArch64 macOS, neutral on AArch64 Linux, and slightly negative on x86-64 Windows. According to the stats, the size of JIT code is down by about 1-2%: https://github.com/faster-cpython/benchmarking-public/blob/main/results/bm-20250628-3.15.0a0-33054dd-JIT/README.md
As one example, skipping tail-duplication removes an extra jump and a duplicate instruction from _POP_TOP (also reducing its size by 19%).
Full diff for the stencils here:
https://gist.github.com/brandtbucher/7340be56f2d2cf7061b5c9bf1c87939c