
gh-136459: Add perf trampoline support for macOS #136461


Open
canova wants to merge 9 commits into main from perf-trampoline-macos

Conversation


@canova canova commented Jul 9, 2025

This PR adds perf trampoline support for macOS. Previously the trampoline was Linux-only, but it doesn't actually need to be: with the minor changes in this PR, Linux and macOS can share the same implementation.

For example, here are before and after profiles for this PR, captured by samply (for the same dummy script):
Before / After

As you can see, the profile now has blue frames showing the real Python function names, which makes it a lot more useful for understanding what is going on. Without them, there are only native frames, which are really difficult to interpret.

This is my first PR in this project, so please let me know if there is anything I didn't do or should do. Thanks!


📚 Documentation preview 📚: https://cpython-previews--136461.org.readthedocs.build/


python-cla-bot bot commented Jul 9, 2025

All commit authors signed the Contributor License Agreement.

CLA signed


canova commented Jul 9, 2025

cc @pablogsal. I went ahead and submitted this PR for #136459 but please let me know what you think!

A running process may create a file in the ``/tmp`` directory, which contains entries
that can map a section of executable code to a name. This interface is described in the
profiling tool (such as `perf <https://perf.wiki.kernel.org/index.php/Main_Page>`_ or
`samply <https://github.com/mstange/samply/>`_). A running process may create a
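For context, the file described in this excerpt is the perf map interface (/tmp/perf-<pid>.map). CPython also exposes it to C extensions through the unstable perf-map API documented on the same page; a minimal sketch of writing one entry (the entry name and error handling here are illustrative assumptions):

#include <Python.h>

/* Sketch: announce a region of executable code in /tmp/perf-<pid>.map so that
 * perf or samply can show a readable name for samples that land in it. */
static int
announce_code_region(const void *code_addr, unsigned int code_size)
{
    /* Opens /tmp/perf-<pid>.map for this process; returns < 0 on failure. */
    if (PyUnstable_PerfMapState_Init() < 0) {
        return -1;
    }
    /* Appends one line of the form "<start-addr> <size> <name>". */
    return PyUnstable_WritePerfMapEntry(code_addr, code_size, "py::example_region");
}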
Author

I added a shameless plug for samply 😄 Disclaimer: I'm not the maintainer of the project, but the maintainer is my colleague. That doesn't change the fact that it's an awesome profiler, but I can revert this if you'd prefer not to include it :)

Member

I am happy with the plug, but these docs are going to need much more than this, then. If samply is the main way to use this on macOS, then we will need to update https://docs.python.org/3/howto/perf_profiling.html with full instructions for samply :)

@canova canova force-pushed the perf-trampoline-macos branch from f0887a1 to f663627 Compare July 9, 2025 10:25

/* These constants are defined inside <elf.h>, which we can't use outside of linux. */
#if !defined(__linux__)
# define EM_386 3
Contributor

I think we need to define the variables per platform, like this: https://github.com/python/cpython/blob/main/Python/perf_jit_trampoline.c#L126-L135

Not define all the possible variables.

Author

Sure, I just changed them to be defined depending on the platform. I actually copy-pasted this from the old code, but your suggestion is better.
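For illustration, the per-platform version might look roughly like this (a sketch; the exact set of constants the trampoline needs may differ):

#if !defined(__linux__)
/* <elf.h> is unavailable here, so provide only the constant this build needs. */
#  if defined(__x86_64__)
#    define EM_X86_64   62
#  elif defined(__aarch64__)
#    define EM_AARCH64 183
#  elif defined(__i386__)
#    define EM_386       3
#  elif defined(__arm__)
#    define EM_ARM      40
#  else
#    error "unsupported architecture for the perf trampoline"
#  endif
#endif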

Contributor

@Zheaoli Zheaoli left a comment

For now, the trampoline depends on mmap.

But there are reports about mmap performance issues on macOS.

I think we need a benchmark here.

FYI

  1. tmm1/node@3542e4e
  2. https://chromium-review.googlesource.com/c/v8/v8/+/6385655/7/src/diagnostics/perf-jit.cc
  3. https://bugzilla-dev.allizom.org/show_bug.cgi?id=1827214


pablogsal commented Jul 9, 2025

For now, the trampoline depends on mmap.

But there are reports about mmap performance issues on macOS.

I think we need a benchmark here.

FYI

  1. tmm1/node@3542e4e

  2. https://chromium-review.googlesource.com/c/v8/v8/+/6385655/7/src/diagnostics/perf-jit.cc

  3. https://bugzilla-dev.allizom.org/show_bug.cgi?id=1827214

Thanks for your comment, but I don't think we need any benchmark. We are using mmap all over the place; that's what the allocators, the JIT and many other things do. The trampoline also batches allocations in big chunks, so I think we are fine. If the problem is on samply's side, that's their challenge to figure out there, not here. If they need some sort of hook they should mention it here, but it's unlikely we will add anything that's not generic.

Edit: ah, I think you are referring to the mmap used to tell perf about the maps file, no? Sorry, I thought you were referring to the JIT trampoline compiler, which itself uses mmap to get the chunks.

@pablogsal
Member

@canova I am currently on vacation until EOW; I will answer more when I have some time later today or tomorrow, but I would love to get macOS support. I have some questions I think we need to answer first to ensure we provide a good UX if we are adding more profilers.

canova added 2 commits July 9, 2025 23:03
On macOS, we don't need to call mmap because samply has already detected
the file path during the earlier call to `open` (it interposes `open` with
a preloaded library), and because the mmap call can be slow.
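For background, the interposition mentioned in the commit message uses the standard dyld mechanism: a library injected with DYLD_INSERT_LIBRARIES registers a replacement in the __DATA,__interpose section. A rough sketch of such a hook, with hypothetical names and not samply's actual code:

#include <fcntl.h>
#include <stdarg.h>
#include <sys/types.h>

/* Replacement that records the opened path before delegating to the real open(). */
static int
my_open(const char *path, int flags, ...)
{
    va_list ap;
    va_start(ap, flags);
    mode_t mode = (mode_t)va_arg(ap, int);
    va_end(ap);
    /* ... note `path` (e.g. a jit-<pid>.dump file) for the profiler ... */
    return open(path, flags, mode);
}

/* dyld reads this section and redirects other images' calls to open() to my_open(). */
__attribute__((used, section("__DATA,__interpose")))
static struct { const void *replacement; const void *replacee; }
interpose_open = { (const void *)my_open, (const void *)open };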
@canova canova requested a review from diegorusso as a code owner July 9, 2025 23:36

canova commented Jul 9, 2025

@Zheaoli

For now, the trampoline depends on mmap.

But there are reports about mmap performance issues on macOS.

I think we need a benchmark here.

FYI

1. [tmm1/node@3542e4e](https://github.com/tmm1/node/commit/3542e4e2944f2fe1132ded1052e223f54c90e4bf)

2. https://chromium-review.googlesource.com/c/v8/v8/+/6385655/7/src/diagnostics/perf-jit.cc

3. https://bugzilla-dev.allizom.org/show_bug.cgi?id=1827214

Ah, I think you are right. I actually looked for this and only found one mmap inside the perf_trampoline.c file, and that was for a memory arena, so I thought this change was not needed and didn't think much of it afterwards. But looking at it again, I think it's this one that we shouldn't mmap:

    /*
     * Map the first page of the jitdump file
     *
     * This memory mapping serves as a signal to perf that this process
     * is generating JIT code. Perf scans /proc/.../maps looking for mapped
     * files that match the jitdump naming pattern.
     *
     * The mapping must be PROT_READ | PROT_EXEC to be detected by perf.
     */
    perf_jit_map_state.mapped_buffer = mmap(
        NULL,                   // Let kernel choose address
        page_size,              // Map one page
        PROT_READ | PROT_EXEC,  // Read and execute permissions (required by perf)
        MAP_PRIVATE,            // Private mapping
        fd,                     // File descriptor
        0                       // Offset 0 (first page)
    );

I updated the PR to remove this on macOS, let me know what you think!
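A minimal sketch of what skipping that mapping might look like (illustrative only; the actual change in the PR may differ):

#if !defined(__APPLE__)
    /* On Linux, perf discovers JIT activity by scanning /proc/<pid>/maps for a
     * mapped jitdump file, so the first page must be mapped PROT_READ | PROT_EXEC.
     * samply learns the jitdump path when the file is opened instead, so this
     * (potentially slow) mapping can be skipped on macOS. */
    perf_jit_map_state.mapped_buffer = mmap(
        NULL, page_size, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
#endif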

@pablogsal

@canova I am currently on vacation until EOW; I will answer more when I have some time later today or tomorrow, but I would love to get macOS support. I have some questions I think we need to answer first to ensure we provide a good UX if we are adding more profilers.

Thanks! I'm glad that you would like to support it on macOS. And I'm happy to discuss the details further and update the code later. (Enjoy your vacation!)

@@ -1,13 +1,21 @@
.text
#if defined(__APPLE__)
.globl __Py_trampoline_func_start
Member

Ah, Apple and their symbol choices...:S

@pablogsal
Member

@canova One thing I am concerned about is that for perf I had to add JIT support, because nothing in the wild compiles with frame pointers, which means that unless people compile Python themselves (and unfortunately take a big-ish hit) the frame-pointer version is suboptimal. What's the status of this for samply? Does it also support the perf JIT interface? Can you ping the samply maintainer here if the answer is nuanced?


mstange commented Jul 10, 2025

Samply maintainer here - I don't understand the question, I'm afraid. What do you mean by "add jit support", and what do you mean by "the perf jit interface"?

Samply supports unwinding with DWARF information if available, but only for regular code mapped from a binary / shared library, not for JIT code. When it encounters JIT code during unwinding, it falls back to using frame pointers for that frame. It does not make use of any JIT_CODE_UNWINDING_INFO records from the jitdump.


pablogsal commented Jul 10, 2025

Samply maintainer here - I don't understand the question, I'm afraid. What do you mean by "add jit support", and what do you mean by "the perf jit interface"?

Samply supports unwinding with DWARF information if available, but only for regular code mapped from a binary / shared library, not for JIT code. When it encounters JIT code during unwinding, it falls back to using frame pointers for that frame. It does not make use of any JIT_CODE_UNWINDING_INFO records from the jitdump.

Perf supports two modes to deal with JIT compilers, which is what we are leveraging to make it work with Python via the trampolines: the simple maps interface and the JIT (jitdump) interface.

We implement both. The reason we implement both is that the simple maps interface works as long as perf can unwind through the JIT frames. Perf uses libunwind or elfutils for this, and both choke when Python is not compiled with frame pointers. Check https://docs.python.org/3/howto/perf_profiling.html#how-to-obtain-the-best-results for more info.

Unfortunately, nobody compiles Python with frame pointers because it's very slow. See #96174. Not only that, even if you compile with frame pointers most of the time wheels and binary packages are not compiled with frame pointers so this is sort of not very useful in the wild unless you have control over your full ecosystem.

To deal with this we implement the much more complex JIT interface. This allows us to provide perf with DWARF for the trampolines, so we can pass the eh_frame values and the unwinding information. perf then uses these via some horrible code that creates a single ELF file per trampoline (yuk!) and then uses that to interpret the samples it took. The problem with this mode is that perf must dump the entire stack to disk to analyze it later. Unfortunately, the Python C stack is very big, so this is not only slow but almost all the time you need the MAX size of the dump. This ends with gigantic files and a much slower compilation. You can see our support for this in https://github.com/python/cpython/blob/main/Python/perf_jit_trampoline.c
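For reference, the jitdump file that this mode produces starts with a fixed header along these lines (a sketch based on the public jitdump specification; CPython's own struct and field names may differ):

#include <stdint.h>

/* Layout of the jitdump file header, per the perf jitdump specification. */
typedef struct {
    uint32_t magic;      /* 0x4A695444 ("JiTD"); also used to detect endianness */
    uint32_t version;    /* currently 1 */
    uint32_t total_size; /* size of this header in bytes */
    uint32_t elf_mach;   /* ELF machine of the emitted code, e.g. EM_X86_64 */
    uint32_t pad1;       /* reserved, must be 0 */
    uint32_t pid;        /* process id of the JIT runtime */
    uint64_t timestamp;  /* creation time (monotonic clock ticks) */
    uint64_t flags;      /* bitmask of JITDUMP_FLAGS_* */
} jitdump_header;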

I hope this gives some background for the question.

I really want to know if samply is going to be able to cope with the general case on Linux and macOS for all the architectures that we support.

I am asking this among other things because we already had problems with code that doesn't have frame pointers around the trampolines. See #130856.

@pablogsal
Member

Also, we probably need a buildbot and tests for this to ensure this doesn't break in the future


mstange commented Jul 10, 2025

You can see our support for this in https://github.com/python/cpython/blob/main/Python/perf_jit_trampoline.c

Thanks for the pointer! Wow, I feel slightly nauseous after reading the part about having to space out the trampolines by the size of the unwind info... Anyway:

samply supports both the perf.map file and jitdump. But as I said above, it does not respect the JIT unwind info. Instead, when finding the return address for a JIT frame, samply always uses framepointer unwinding. Looking at the actual assembly for the trampolines, I can see the following:

  • On aarch64, the trampoline actually has a frame pointer - the stp x29, x30, [sp, -16]!; mov x29, sp instructions make it so that the x29 ("fp") register is set to the location where the caller's framepointer and the return address are stored. So samply's approach should just work.
  • On x86_64, the trampoline does not set up a frame record. But it also doesn't clobber the rbp register. So: If the python C functions are compiled with framepointers, then unwinding can still resume, but the immediate caller of the JIT frame will be missing. And if the python C functions do not use framepointers, then unwinding will likely fail, because unwinding the caller using dwarf information usually requires an accurate rsp value, which samply won't have unless it starts respecting the jitdump unwind info.

For samply's current approach to work on x86_64, the x86_64 trampoline just needs to set up a framepointer: add a push rbp; mov rbp, rsp at the start and a pop rbp at the end, adjust the jitdump DWARF to match (for Linux perf), and that should be it, no? Having a framepointer in the trampoline doesn't require using framepointers when compiling the Python C code; it's a per-function decision. So IMO such a change would be very much orthogonal to #96174.

@pablogsal
Member

Thanks a lot @mstange, that makes perfect sense and really clarifies the situation!

If we’re going to add samply support, we need to ensure it works reliably on both aarch64 and x86_64 across both macOS and Linux. Based on your analysis, we should definitely add frame pointers to the x86_64 trampoline in this PR to ensure consistent behavior across architectures.

I can help set up some buildbot infrastructure to ensure all of this works properly with samply once the PR is ready for more comprehensive testing.

I also think we should have full end-to-end documentation that covers how to use this with Python, similar to our existing perf docs. We should probably make the docs more generic to cover both perf and samply workflows, since users will likely want guidance on both tools depending on their platform and preferences.

@canova would you be willing to add the frame pointer setup to the x86_64 trampoline and the DWARF as part of this PR? It seems like the right time to ensure samply works everywhere and we can advertise it properly.

@pablogsal
Member

Thanks for the pointer! Wow, I feel slightly nauseous after reading the part about having to space out the trampolines by the size of the unwind info...

I feel you: I was certainly nauseous when I had to debug that nonsense for hours and hours and even more when I had to implement the hack to "fix" it.


Zheaoli commented Jul 10, 2025

Edit: ah, I think you are referring to the mmap used to tell perf about the maps file, no? Sorry, I thought you were referring to the JIT trampoline compiler, which itself uses mmap to get the chunks.

Yes, I mean the maps file here. Sorry for the confusion.


Zheaoli commented Jul 10, 2025

I updated the PR to remove this on macOS, let me know what you think!

SGTM

@@ -1166,7 +1191,11 @@ static void perf_map_jit_write_entry(void *state, const void *code_addr,
ev.base.size = sizeof(ev) + (name_length+1) + size;
ev.base.time_stamp = get_current_monotonic_ticks();
ev.process_id = getpid();
#if defined(__APPLE__)
pthread_threadid_np(NULL, &ev.thread_id);
Member

Can we just use this implementation in here?

cpython/Include/object.h

Lines 193 to 255 in b44316a

_Py_ThreadId(void)
{
    uintptr_t tid;
#if defined(_MSC_VER) && defined(_M_X64)
    tid = __readgsqword(48);
#elif defined(_MSC_VER) && defined(_M_IX86)
    tid = __readfsdword(24);
#elif defined(_MSC_VER) && defined(_M_ARM64)
    tid = __getReg(18);
#elif defined(__MINGW32__) && defined(_M_X64)
    tid = __readgsqword(48);
#elif defined(__MINGW32__) && defined(_M_IX86)
    tid = __readfsdword(24);
#elif defined(__MINGW32__) && defined(_M_ARM64)
    tid = __getReg(18);
#elif defined(__i386__)
    __asm__("movl %%gs:0, %0" : "=r" (tid));  // 32-bit always uses GS
#elif defined(__MACH__) && defined(__x86_64__)
    __asm__("movq %%gs:0, %0" : "=r" (tid));  // x86_64 macOSX uses GS
#elif defined(__x86_64__)
    __asm__("movq %%fs:0, %0" : "=r" (tid));  // x86_64 Linux, BSD uses FS
#elif defined(__arm__) && __ARM_ARCH >= 7
    __asm__ ("mrc p15, 0, %0, c13, c0, 3\nbic %0, %0, #3" : "=r" (tid));
#elif defined(__aarch64__) && defined(__APPLE__)
    __asm__ ("mrs %0, tpidrro_el0" : "=r" (tid));
#elif defined(__aarch64__)
    __asm__ ("mrs %0, tpidr_el0" : "=r" (tid));
#elif defined(__powerpc64__)
#if defined(__clang__) && _Py__has_builtin(__builtin_thread_pointer)
    tid = (uintptr_t)__builtin_thread_pointer();
#else
    // r13 is reserved for use as system thread ID by the Power 64-bit ABI.
    register uintptr_t tp __asm__ ("r13");
    __asm__("" : "=r" (tp));
    tid = tp;
#endif
#elif defined(__powerpc__)
#if defined(__clang__) && _Py__has_builtin(__builtin_thread_pointer)
    tid = (uintptr_t)__builtin_thread_pointer();
#else
    // r2 is reserved for use as system thread ID by the Power 32-bit ABI.
    register uintptr_t tp __asm__ ("r2");
    __asm__ ("" : "=r" (tp));
    tid = tp;
#endif
#elif defined(__s390__) && defined(__GNUC__)
    // Both GCC and Clang have supported __builtin_thread_pointer
    // for s390 from long time ago.
    tid = (uintptr_t)__builtin_thread_pointer();
#elif defined(__riscv)
#if defined(__clang__) && _Py__has_builtin(__builtin_thread_pointer)
    tid = (uintptr_t)__builtin_thread_pointer();
#else
    // tp is Thread Pointer provided by the RISC-V ABI.
    __asm__ ("mv %0, tp" : "=r" (tid));
#endif
#else
    // Fallback to a portable implementation if we do not have a faster
    // platform-specific implementation.
    tid = _Py_GetThreadLocal_Addr();
#endif
    return tid;
}

Author

Thanks for the suggestion! It looks like _Py_ThreadId is defined only when Py_GIL_DISABLED is defined and Py_LIMITED_API is not defined:

#if defined(Py_GIL_DISABLED) && !defined(Py_LIMITED_API)

It doesn't compile because of that.

Member

Ah, I meant just the inline assembly, because it was added for low overhead in free threading; maybe you can just pick it for the Apple implementation.
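A rough sketch of that idea (illustrative; note that the register reads return a thread pointer, not the same value as pthread_threadid_np(), so whichever is chosen has to be used consistently in the jitdump records):

#include <pthread.h>
#include <stdint.h>

/* Sketch for macOS only: a cheap per-thread identifier, mirroring the Apple
 * branches of _Py_ThreadId(), with pthread_threadid_np() as the fallback. */
static uint64_t
get_thread_id(void)
{
    uint64_t tid;
#if defined(__aarch64__)
    __asm__ ("mrs %0, tpidrro_el0" : "=r" (tid));   /* thread pointer register */
#elif defined(__x86_64__)
    __asm__ ("movq %%gs:0, %0" : "=r" (tid));       /* TSD base via GS segment */
#else
    pthread_threadid_np(NULL, &tid);                /* Apple-specific call */
#endif
    return tid;
}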


canova commented Jul 10, 2025

@mstange @pablogsal Thanks for the investigation!

@canova would you be willing to add the frame pointer setup to the x86_64 trampoline and the DWARF as part of this PR? It seems like the right time to ensure samply works everywhere and we can advertise it properly.

Sure that sounds good to me, I'll update the PR soon with this and add some more tests.

Would you like me to update the documentation in this PR as well?


pablogsal commented Jul 10, 2025

@mstange @pablogsal Thanks for the investigation!

@canova would you be willing to add the frame pointer setup to the x86_64 trampoline and the DWARF as part of this PR? It seems like the right time to ensure samply works everywhere and we can advertise it properly.

Sure that sounds good to me, I'll update the PR soon with this and add some more tests.

Would you like me to update the documentation in this PR as well?

This should do the trick, but needs checking and a fix for CET:

diff --git a/Python/asm_trampoline.S b/Python/asm_trampoline.S
index 616752459ba..a14e68c0e81 100644
--- a/Python/asm_trampoline.S
+++ b/Python/asm_trampoline.S
@@ -12,9 +12,10 @@ _Py_trampoline_func_start:
 #if defined(__CET__) && (__CET__ & 1)
     endbr64
 #endif
-    sub    $8, %rsp
-    call    *%rcx
-    add    $8, %rsp
+    push   %rbp
+    mov    %rsp, %rbp
+    call   *%rcx
+    pop    %rbp
     ret
 #endif // __x86_64__
 #if defined(__aarch64__) && defined(__AARCH64EL__) && !defined(__ILP32__)
diff --git a/Python/perf_jit_trampoline.c b/Python/perf_jit_trampoline.c
index 2ca18c23593..671b56e0846 100644
--- a/Python/perf_jit_trampoline.c
+++ b/Python/perf_jit_trampoline.c
@@ -401,10 +401,12 @@ enum {
     DWRF_CFA_nop = 0x0,                    // No operation
     DWRF_CFA_offset_extended = 0x5,        // Extended offset instruction
     DWRF_CFA_def_cfa = 0xc,               // Define CFA rule
+    DWRF_CFA_def_cfa_register = 0xd,      // Define CFA register
     DWRF_CFA_def_cfa_offset = 0xe,        // Define CFA offset
     DWRF_CFA_offset_extended_sf = 0x11,   // Extended signed offset
     DWRF_CFA_advance_loc = 0x40,          // Advance location counter
-    DWRF_CFA_offset = 0x80                // Simple offset instruction
+    DWRF_CFA_offset = 0x80,               // Simple offset instruction
+    DWRF_CFA_restore = 0xc0               // Restore register
 };

 /* DWARF Exception Handling pointer encodings */
@@ -868,17 +870,22 @@ static void elf_init_ehframe(ELFObjectContext* ctx) {
          * conventions and register usage patterns.
          */
 #ifdef __x86_64__
-        /* x86_64 calling convention unwinding rules */
+        /* x86_64 calling convention unwinding rules with frame pointer */
 #  if defined(__CET__) && (__CET__ & 1)
-        DWRF_U8(DWRF_CFA_advance_loc | 8);    // Advance location by 8 bytes when CET protection is enabled
-#  else
-        DWRF_U8(DWRF_CFA_advance_loc | 4);    // Advance location by 4 bytes
+        DWRF_U8(DWRF_CFA_advance_loc | 4);    // Advance past endbr64 (4 bytes)
 #  endif
-        DWRF_U8(DWRF_CFA_def_cfa_offset);     // Redefine CFA offset
-        DWRF_UV(16);                          // New offset: SP + 16
-        DWRF_U8(DWRF_CFA_advance_loc | 6);    // Advance location by 6 bytes
-        DWRF_U8(DWRF_CFA_def_cfa_offset);     // Redefine CFA offset
-        DWRF_UV(8);                           // New offset: SP + 8
+        DWRF_U8(DWRF_CFA_advance_loc | 1);    // Advance past push %rbp (1 byte)
+        DWRF_U8(DWRF_CFA_def_cfa_offset);     // def_cfa_offset 16
+        DWRF_UV(16);
+        DWRF_U8(DWRF_CFA_offset | DWRF_REG_BP); // offset r6 at cfa-16
+        DWRF_UV(2);
+        DWRF_U8(DWRF_CFA_advance_loc | 3);    // Advance past mov %rsp,%rbp (3 bytes)
+        DWRF_U8(DWRF_CFA_def_cfa_register);   // def_cfa_register r6
+        DWRF_UV(DWRF_REG_BP);
+        DWRF_U8(DWRF_CFA_advance_loc | 3);    // Advance past call *%rcx (2 bytes) + pop %rbp (1 byte) = 3
+        DWRF_U8(DWRF_CFA_def_cfa);            // def_cfa r7 ofs 8
+        DWRF_UV(DWRF_REG_SP);
+        DWRF_UV(8);
 #elif defined(__aarch64__) && defined(__AARCH64EL__) && !defined(__ILP32__)
         /* AArch64 calling convention unwinding rules */
         DWRF_U8(DWRF_CFA_advance_loc | 1);        // Advance location by 1 instruction (stp x29, x30)

@pablogsal
Member

@canova I have made a PR to add the frame pointers and the DWARF, as it was slightly more tricky than I thought: #136500


canova commented Jul 10, 2025

@canova I have made a PR to add the frame pointers and the DWARF, as it was slightly more tricky than I thought: #136500

Ah cool, thank you! I can rebase this PR on top of yours.

@pablogsal
Member

@canova @mstange can you confirm that samply works as expected with #136500 ?

@pablogsal
Member

Would you like me to update the documentation in this PR as well?

Yep please.


canova commented Jul 10, 2025

@canova @mstange can you confirm that samply works as expected with #136500 ?

Just tested your patch using samply. I can verify that it fixes the stack walking! Here are before and after profiles:
Before your patch / After your patch

@pablogsal
Member

@canova @mstange do you know why the Python frames appear duplicated? Is that a samply bug?


mstange commented Jul 10, 2025

It's a samply bug, yes - or rather a workaround that's not necessary for Python because there's only a single native function for each Python function (just the trampoline itself). The duplication was initially added for JS JITs which can compile the same function multiple times with different JIT tiers, and it was useful to both have a single call node per JS function and to have a separate call node for each tier.


canova commented Jul 10, 2025

@pablogsal I just added two new commits for documentation and testing.

For documentation, I made the perf profiling page more generic and added a section for samply. Started small, so I can shape it along the way with the feedback I get.

For the tests, we already have some coverage from the previous perf tests, but now I added some samply tests as well. (They might fail on x86_64 since I haven't rebased on top of your patch yet; I'll do that once it's merged.) Let me know what you think about them. Currently the samply tests are skipped if samply is not installed. I think you were talking about setting up some buildbot tasks; would it be possible to install samply there?

@canova canova force-pushed the perf-trampoline-macos branch from e2252bf to 8b03dc1 Compare July 10, 2025 14:27
Member

@pablogsal pablogsal left a comment

@hugovk @AA-Turner I would love it if you could help us a bit with the docs. The previous docs assume that only the perf profiler exists and are centered on it, but it seems we want to move to a world where perf and other perf-like profilers can exist. I want this to be an excellent place to learn about this, so I would love it if you could guide @canova on how to ensure there is enough advice on using samply that users have a good experience and it doesn't feel out of place.

@@ -148,6 +149,26 @@ Instead, if we run the same experiment with ``perf`` support enabled we get:



Using ``samply`` profiler
Member

We are going to need a bit more here. For example, samply supports both perf modes, so we need clarification on when to use each and what the recommendations are, how to read the flamegraphs, etc.


Would it make sense to break these discussions out into a separate PR? It doesn't seem useful to delay landing trampoline support for this.

Member

@pablogsal pablogsal Jul 10, 2025

It doesn't seem useful to delay landing trampoline support for this.

Is there any rush? This will go into 3.15 anyway and that's going to be released October 2026. We still need to figure out the buildbot situation which will take some time...

Member

I am happy to separate this into a different PR, though

Author

Oh, I was hoping that we could maybe enable it on 3.14, considering that the code has been there since 3.12 and this is mostly putting lots of ifdefs here and there (minus the samply and documentation parts). I suspect that updating the documentation will take longer. But I'm not familiar with the release process.

Member

@pablogsal pablogsal Jul 10, 2025

Oh, I was hoping that we could maybe enable it on 3.14.

No way unfortunately as we are 3 betas past beta freeze. It's up to the release manager to decide (CC @hugovk) but we have a strict policy for this I am afraid and no new features can be added past beta freeze.

Member

@pablogsal pablogsal Jul 10, 2025

@hugovk Checking just in case, although I assume the answer is "no": would you consider adding this to 3.14, given that this is a new platform and the code is gated by ifdefs? This would allow people on macOS to profile their code using a native profiler, which would be very useful for investigating performance in Python+compiled code.

Member

Some context for this: this would allow people on macOS to profile free-threaded Python using samply, so maybe there is a case to allow it in 3.14, but I am still unsure. Up to you @hugovk.

@pablogsal pablogsal force-pushed the perf-trampoline-macos branch from e36f2af to 8b03dc1 Compare July 10, 2025 21:53