
gh-136459: Add perf trampoline support for macOS #136461


Open
canova wants to merge 9 commits into main from perf-trampoline-macos

Conversation


@canova canova commented Jul 9, 2025

This PR adds perf trampoline support for macOS. Previously the trampoline was Linux-only, but it doesn't actually need to be: with the minor changes in this PR, Linux and macOS can share the same implementation.

For example, here are before and after profiles for this PR, captured by samply (for the same dummy script):
Before / After

As you can see, the profile now has blue frames showing the real Python function names, which makes it a lot more useful for understanding what is going on. Without them, there are only native frames, which are really difficult to interpret.

This is my first PR in this project, so please let me know if there is anything I didn't do or should do. Thanks!


📚 Documentation preview 📚: https://cpython-previews--136461.org.readthedocs.build/


python-cla-bot bot commented Jul 9, 2025

All commit authors signed the Contributor License Agreement.

CLA signed


canova commented Jul 9, 2025

cc @pablogsal. I went ahead and submitted this PR for #136459 but please let me know what you think!

A running process may create a file in the ``/tmp`` directory, which contains entries
that can map a section of executable code to a name. This interface is described in the
profiling tool (such as `perf <https://perf.wiki.kernel.org/index.php/Main_Page>`_ or
`samply <https://github.com/mstange/samply/>`_). A running process may create a
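For context, the file described in this excerpt is the perf map interface (/tmp/perf-<pid>.map). CPython also exposes it to C extensions through the unstable perf-map API documented on the same page; a minimal sketch of writing one entry (the entry name and error handling here are illustrative assumptions):

#include <Python.h>

/* Sketch: announce a region of executable code in /tmp/perf-<pid>.map so that
 * perf or samply can show a readable name for samples that land in it. */
static int
announce_code_region(const void *code_addr, unsigned int code_size)
{
    /* Opens /tmp/perf-<pid>.map for this process; returns < 0 on failure. */
    if (PyUnstable_PerfMapState_Init() < 0) {
        return -1;
    }
    /* Appends one line of the form "<start-addr> <size> <name>". */
    return PyUnstable_WritePerfMapEntry(code_addr, code_size, "py::example_region");
}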
Author

I added a shameless plug for samply 😄 Disclaimer: I'm not the maintainer of the project, but the maintainer is my colleague. That doesn't change the fact that it's an awesome profiler, but I can revert this if you'd prefer not to include it :)

Member

I am happy with the plug, but these docs are going to need much more than this, then. If samply is the main way to use this on macOS, then we will need to update https://docs.python.org/3/howto/perf_profiling.html with full instructions for samply :)

@canova canova force-pushed the perf-trampoline-macos branch from f0887a1 to f663627 Compare July 9, 2025 10:25

/* These constants are defined inside <elf.h>, which we can't use outside of linux. */
#if !defined(__linux__)
# define EM_386 3
Contributor

I think we need to define the variables per platform, like this: https://github.com/python/cpython/blob/main/Python/perf_jit_trampoline.c#L126-L135

Not define all the possible variables.

Author

Sure, I just changed them to be defined depending on the platform. I actually copy-pasted this from the old code, but your suggestion is better.
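For illustration, the per-platform version might look roughly like this (a sketch; the exact set of constants the trampoline needs may differ):

#if !defined(__linux__)
/* <elf.h> is unavailable here, so provide only the constant this build needs. */
#  if defined(__x86_64__)
#    define EM_X86_64   62
#  elif defined(__aarch64__)
#    define EM_AARCH64 183
#  elif defined(__i386__)
#    define EM_386       3
#  elif defined(__arm__)
#    define EM_ARM      40
#  else
#    error "unsupported architecture for the perf trampoline"
#  endif
#endif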

Contributor

@Zheaoli Zheaoli left a comment

For now, the trampoline depends on mmap.

But there are reports about mmap performance issues on macOS.

I think we need a benchmark here.

FYI

  1. tmm1/node@3542e4e
  2. https://chromium-review.googlesource.com/c/v8/v8/+/6385655/7/src/diagnostics/perf-jit.cc
  3. https://bugzilla-dev.allizom.org/show_bug.cgi?id=1827214


pablogsal commented Jul 9, 2025

For now, the trampoline depends on mmap.

But there are reports about mmap performance issues on macOS.

I think we need a benchmark here.

FYI

  1. tmm1/node@3542e4e

  2. https://chromium-review.googlesource.com/c/v8/v8/+/6385655/7/src/diagnostics/perf-jit.cc

  3. https://bugzilla-dev.allizom.org/show_bug.cgi?id=1827214

Thanks for your comment, but I don't think we need any benchmark. We are using mmap all over the place; that's what the allocators, the JIT and many other things do. The trampoline also batches allocations in big chunks, so I think we are fine. If the problem is on samply's side, that's their challenge to figure out there, not here. If they need some sort of hook they should mention it here, but it's unlikely we will add anything that's not generic.

Edit: ah, I think you are referring to the mmap used to tell perf about the maps file, no? Sorry, I thought you were referring to the JIT trampoline compiler, which itself uses mmap to get the chunks.

@pablogsal
Member

@canova I am currently on vacation until EOW; I will answer more when I have some time later today or tomorrow, but I would love to get macOS support. I have some questions I think we need to answer first to ensure we provide a good UX if we are adding more profilers.

canova added 2 commits July 9, 2025 23:03
On macOS, we don't need to call mmap because samply has already detected
the file path during the earlier call to `open` (it interposes `open` with
a preloaded library), and because the mmap call can be slow.
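For background, the interposition mentioned in the commit message uses the standard dyld mechanism: a library injected with DYLD_INSERT_LIBRARIES registers a replacement in the __DATA,__interpose section. A rough sketch of such a hook, with hypothetical names and not samply's actual code:

#include <fcntl.h>
#include <stdarg.h>
#include <sys/types.h>

/* Replacement that records the opened path before delegating to the real open(). */
static int
my_open(const char *path, int flags, ...)
{
    va_list ap;
    va_start(ap, flags);
    mode_t mode = (mode_t)va_arg(ap, int);
    va_end(ap);
    /* ... note `path` (e.g. a jit-<pid>.dump file) for the profiler ... */
    return open(path, flags, mode);
}

/* dyld reads this section and redirects other images' calls to open() to my_open(). */
__attribute__((used, section("__DATA,__interpose")))
static struct { const void *replacement; const void *replacee; }
interpose_open = { (const void *)my_open, (const void *)open };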
@canova canova requested a review from diegorusso as a code owner July 9, 2025 23:36

canova commented Jul 9, 2025

@Zheaoli

For now, the trampoline depends on mmap.

But there are reports about mmap performance issues on macOS.

I think we need a benchmark here.

FYI

1. [tmm1/node@3542e4e](https://github.com/tmm1/node/commit/3542e4e2944f2fe1132ded1052e223f54c90e4bf)

2. https://chromium-review.googlesource.com/c/v8/v8/+/6385655/7/src/diagnostics/perf-jit.cc

3. https://bugzilla-dev.allizom.org/show_bug.cgi?id=1827214

Ah, I think you are right. I actually looked for this and only found one mmap inside the perf_trampoline.c file, and that was for a memory arena, so I thought this change was not needed and didn't think much of it afterwards. But looking at it again, I think it's this one that we shouldn't mmap:

    /*
     * Map the first page of the jitdump file
     *
     * This memory mapping serves as a signal to perf that this process
     * is generating JIT code. Perf scans /proc/.../maps looking for mapped
     * files that match the jitdump naming pattern.
     *
     * The mapping must be PROT_READ | PROT_EXEC to be detected by perf.
     */
    perf_jit_map_state.mapped_buffer = mmap(
        NULL,                   // Let kernel choose address
        page_size,              // Map one page
        PROT_READ | PROT_EXEC,  // Read and execute permissions (required by perf)
        MAP_PRIVATE,            // Private mapping
        fd,                     // File descriptor
        0                       // Offset 0 (first page)
    );

I updated the PR to remove this on macOS, let me know what you think!
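A minimal sketch of what skipping that mapping might look like (illustrative only; the actual change in the PR may differ):

#if !defined(__APPLE__)
    /* On Linux, perf discovers JIT activity by scanning /proc/<pid>/maps for a
     * mapped jitdump file, so the first page must be mapped PROT_READ | PROT_EXEC.
     * samply learns the jitdump path when the file is opened instead, so this
     * (potentially slow) mapping can be skipped on macOS. */
    perf_jit_map_state.mapped_buffer = mmap(
        NULL, page_size, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
#endif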

@pablogsal

@canova I am currently on vacation until EOW; I will answer more when I have some time later today or tomorrow, but I would love to get macOS support. I have some questions I think we need to answer first to ensure we provide a good UX if we are adding more profilers.

Thanks! I'm glad that you would like to support it on macOS. And I'm happy to discuss the details further and update the code later. (Enjoy your vacation!)

@@ -1,13 +1,21 @@
.text
#if defined(__APPLE__)
.globl __Py_trampoline_func_start
Member

Ah, Apple and their symbol choices...:S

@pablogsal
Member

@canova One thing I am concerned about is that for perf I had to add JIT support, because nothing in the wild compiles with frame pointers, which means that unless people compile Python themselves (and unfortunately take a big-ish hit) the frame-pointer version is suboptimal. What's the status of this for samply? Does it also support the perf JIT interface? Can you ping the samply maintainer here if the answer is nuanced?


mstange commented Jul 10, 2025

Samply maintainer here - I don't understand the question, I'm afraid. What do you mean by "add jit support", and what do you mean by "the perf jit interface"?

Samply supports unwinding with DWARF information if available, but only for regular code mapped from a binary / shared library, not for JIT code. When it encounters JIT code during unwinding, it falls back to using frame pointers for that frame. It does not make use of any JIT_CODE_UNWINDING_INFO records from the jitdump.


pablogsal commented Jul 10, 2025

Samply maintainer here - I don't understand the question, I'm afraid. What do you mean by "add jit support", and what do you mean by "the perf jit interface"?

Samply supports unwinding with DWARF information if available, but only for regular code mapped from a binary / shared library, not for JIT code. When it encounters JIT code during unwinding, it falls back to using frame pointers for that frame. It does not make use of any JIT_CODE_UNWINDING_INFO records from the jitdump.

Perf supports two modes to deal with JIT compilers, which is what we are leveraging to make it work with Python via the trampolines: the simple maps interface and the JIT (jitdump) interface.

We implement both. The reason we implement both is that the simple maps interface works as long as perf can unwind through the JIT frames. Perf uses libunwind or elfutils for this, and both choke when Python is not compiled with frame pointers. Check https://docs.python.org/3/howto/perf_profiling.html#how-to-obtain-the-best-results for more info.

Unfortunately, nobody compiles Python with frame pointers because it's very slow. See #96174. Not only that, even if you compile with frame pointers most of the time wheels and binary packages are not compiled with frame pointers so this is sort of not very useful in the wild unless you have control over your full ecosystem.

To deal with this we implement the much more complex JIT interface. This allows us to provide perf with DWARF for the trampolines, so we can pass the eh_frame values and the unwinding information. perf then uses these via some horrible code that creates a single ELF file per trampoline (yuk!) and then uses that to interpret the samples it took. The problem with this mode is that perf must dump the entire stack to disk to analyze it later. Unfortunately, the Python C stack is very big, so this is not only slow but almost all the time you need the MAX size of the dump. This ends with gigantic files and a much slower compilation. You can see our support for this in https://github.com/python/cpython/blob/main/Python/perf_jit_trampoline.c
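For reference, the jitdump file that this mode produces starts with a fixed header along these lines (a sketch based on the public jitdump specification; CPython's own struct and field names may differ):

#include <stdint.h>

/* Layout of the jitdump file header, per the perf jitdump specification. */
typedef struct {
    uint32_t magic;      /* 0x4A695444 ("JiTD"); also used to detect endianness */
    uint32_t version;    /* currently 1 */
    uint32_t total_size; /* size of this header in bytes */
    uint32_t elf_mach;   /* ELF machine of the emitted code, e.g. EM_X86_64 */
    uint32_t pad1;       /* reserved, must be 0 */
    uint32_t pid;        /* process id of the JIT runtime */
    uint64_t timestamp;  /* creation time (monotonic clock ticks) */
    uint64_t flags;      /* bitmask of JITDUMP_FLAGS_* */
} jitdump_header;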

I hope this gives some background for the question.

I really want to know if samply is going to be able to cope with the general case on Linux and macOS for all the architectures that we support.

I am asking this among other things because we already had problems with code that doesn't have frame pointers around the trampolines. See #130856.

@pablogsal
Member

Also, we probably need a buildbot and tests for this to ensure this doesn't break in the future


mstange commented Jul 10, 2025

You can see our support for this in https://github.com/python/cpython/blob/main/Python/perf_jit_trampoline.c

Thanks for the pointer! Wow, I feel slightly nauseous after reading the part about having to space out the trampolines by the size of the unwind info... Anyway:

samply supports both the perf.map file and jitdump. But as I said above, it does not respect the JIT unwind info. Instead, when finding the return address for a JIT frame, samply always uses framepointer unwinding. Looking at the actual assembly for the trampolines, I can see the following:

  • On aarch64, the trampoline actually has a frame pointer - the stp x29, x30, [sp, -16]!; mov x29, sp instructions make it so that the x29 ("fp") register is set to the location where the caller's framepointer and the return address are stored. So samply's approach should just work.
  • On x86_64, the trampoline does not set up a frame record. But it also doesn't clobber the rbp register. So: If the python C functions are compiled with framepointers, then unwinding can still resume, but the immediate caller of the JIT frame will be missing. And if the python C functions do not use framepointers, then unwinding will likely fail, because unwinding the caller using dwarf information usually requires an accurate rsp value, which samply won't have unless it starts respecting the jitdump unwind info.

For samply's current approach to work on x86_64, the x86_64 trampoline just needs to set up a framepointer: add a push rbp; mov rbp, rsp at the start and a pop rbp at the end, adjust the jitdump DWARF to match (for Linux perf), and that should be it, no? Having a framepointer in the trampoline doesn't require using framepointers when compiling the Python C code; it's a per-function decision. So IMO such a change would be very much orthogonal to #96174.

@pablogsal
Member

Thanks a lot @mstange, that makes perfect sense and really clarifies the situation!

If we’re going to add samply support, we need to ensure it works reliably on both aarch64 and x86_64 across both macOS and Linux. Based on your analysis, we should definitely add frame pointers to the x86_64 trampoline in this PR to ensure consistent behavior across architectures.

I can help set up some buildbot infrastructure to ensure all of this works properly with samply once the PR is ready for more comprehensive testing.

I also think we should have full end-to-end documentation that covers how to use this with Python, similar to our existing perf docs. We should probably make the docs more generic to cover both perf and samply workflows, since users will likely want guidance on both tools depending on their platform and preferences.

@canova would you be willing to add the frame pointer setup to the x86_64 trampoline and the DWARF as part of this PR? It seems like the right time to ensure samply works everywhere and we can advertise it properly.

@pablogsal
Member

Thanks for the pointer! Wow, I feel slightly nauseous after reading the part about having to space out the trampolines by the size of the unwind info...

I feel you: I was certainly nauseous when I had to debug that nonsense for hours and hours and even more when I had to implement the hack to "fix" it.


Zheaoli commented Jul 10, 2025

Edit: ah, I think you are referring to the mmap used to tell perf about the maps file, no? Sorry, I thought you were referring to the JIT trampoline compiler, which itself uses mmap to get the chunks.

Yes, I mean the maps file here. Sorry for the confusion.


Zheaoli commented Jul 10, 2025

I updated the PR to remove this on macOS, let me know what you think!

SGTM

@@ -1166,7 +1191,11 @@ static void perf_map_jit_write_entry(void *state, const void *code_addr,
ev.base.size = sizeof(ev) + (name_length+1) + size;
ev.base.time_stamp = get_current_monotonic_ticks();
ev.process_id = getpid();
#if defined(__APPLE__)
pthread_threadid_np(NULL, &ev.thread_id);
Member

Can we just use this implementation in here?

cpython/Include/object.h

Lines 193 to 255 in b44316a

_Py_ThreadId(void)
{
    uintptr_t tid;
#if defined(_MSC_VER) && defined(_M_X64)
    tid = __readgsqword(48);
#elif defined(_MSC_VER) && defined(_M_IX86)
    tid = __readfsdword(24);
#elif defined(_MSC_VER) && defined(_M_ARM64)
    tid = __getReg(18);
#elif defined(__MINGW32__) && defined(_M_X64)
    tid = __readgsqword(48);
#elif defined(__MINGW32__) && defined(_M_IX86)
    tid = __readfsdword(24);
#elif defined(__MINGW32__) && defined(_M_ARM64)
    tid = __getReg(18);
#elif defined(__i386__)
    __asm__("movl %%gs:0, %0" : "=r" (tid));  // 32-bit always uses GS
#elif defined(__MACH__) && defined(__x86_64__)
    __asm__("movq %%gs:0, %0" : "=r" (tid));  // x86_64 macOSX uses GS
#elif defined(__x86_64__)
    __asm__("movq %%fs:0, %0" : "=r" (tid));  // x86_64 Linux, BSD uses FS
#elif defined(__arm__) && __ARM_ARCH >= 7
    __asm__ ("mrc p15, 0, %0, c13, c0, 3\nbic %0, %0, #3" : "=r" (tid));
#elif defined(__aarch64__) && defined(__APPLE__)
    __asm__ ("mrs %0, tpidrro_el0" : "=r" (tid));
#elif defined(__aarch64__)
    __asm__ ("mrs %0, tpidr_el0" : "=r" (tid));
#elif defined(__powerpc64__)
#if defined(__clang__) && _Py__has_builtin(__builtin_thread_pointer)
    tid = (uintptr_t)__builtin_thread_pointer();
#else
    // r13 is reserved for use as system thread ID by the Power 64-bit ABI.
    register uintptr_t tp __asm__ ("r13");
    __asm__("" : "=r" (tp));
    tid = tp;
#endif
#elif defined(__powerpc__)
#if defined(__clang__) && _Py__has_builtin(__builtin_thread_pointer)
    tid = (uintptr_t)__builtin_thread_pointer();
#else
    // r2 is reserved for use as system thread ID by the Power 32-bit ABI.
    register uintptr_t tp __asm__ ("r2");
    __asm__ ("" : "=r" (tp));
    tid = tp;
#endif
#elif defined(__s390__) && defined(__GNUC__)
    // Both GCC and Clang have supported __builtin_thread_pointer
    // for s390 from long time ago.
    tid = (uintptr_t)__builtin_thread_pointer();
#elif defined(__riscv)
#if defined(__clang__) && _Py__has_builtin(__builtin_thread_pointer)
    tid = (uintptr_t)__builtin_thread_pointer();
#else
    // tp is Thread Pointer provided by the RISC-V ABI.
    __asm__ ("mv %0, tp" : "=r" (tid));
#endif
#else
    // Fallback to a portable implementation if we do not have a faster
    // platform-specific implementation.
    tid = _Py_GetThreadLocal_Addr();
#endif
    return tid;
}

Author

Thanks for the suggestion! It looks like _Py_ThreadId is defined only when Py_GIL_DISABLED is defined and Py_LIMITED_API is not defined:

#if defined(Py_GIL_DISABLED) && !defined(Py_LIMITED_API)

It doesn't compile because of that.

Member

Ah, I meant just the inline assembly, because it was added for low overhead in free threading; maybe you can just pick it for the Apple implementation.
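A rough sketch of that idea (illustrative; note that the register reads return a thread pointer, not the same value as pthread_threadid_np(), so whichever is chosen has to be used consistently in the jitdump records):

#include <pthread.h>
#include <stdint.h>

/* Sketch for macOS only: a cheap per-thread identifier, mirroring the Apple
 * branches of _Py_ThreadId(), with pthread_threadid_np() as the fallback. */
static uint64_t
get_thread_id(void)
{
    uint64_t tid;
#if defined(__aarch64__)
    __asm__ ("mrs %0, tpidrro_el0" : "=r" (tid));   /* thread pointer register */
#elif defined(__x86_64__)
    __asm__ ("movq %%gs:0, %0" : "=r" (tid));       /* TSD base via GS segment */
#else
    pthread_threadid_np(NULL, &tid);                /* Apple-specific call */
#endif
    return tid;
}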


canova commented Jul 10, 2025

@mstange @pablogsal Thanks for the investigation!

@canova would you be willing to add the frame pointer setup to the x86_64 trampoline and the DWARF as part of this PR? It seems like the right time to ensure samply works everywhere and we can advertise it properly.

Sure that sounds good to me, I'll update the PR soon with this and add some more tests.

Would you like me to update the documentation in this PR as well?


pablogsal commented Jul 10, 2025

@mstange @pablogsal Thanks for the investigation!

@canova would you be willing to add the frame pointer setup to the x86_64 trampoline and the DWARF as part of this PR? It seems like the right time to ensure samply works everywhere and we can advertise it properly.

Sure that sounds good to me, I'll update the PR soon with this and add some more tests.

Would you like me to update the documentation in this PR as well?

This should do the trick, but needs checking and a fix for CET:

diff --git a/Python/asm_trampoline.S b/Python/asm_trampoline.S
index 616752459ba..a14e68c0e81 100644
--- a/Python/asm_trampoline.S
+++ b/Python/asm_trampoline.S
@@ -12,9 +12,10 @@ _Py_trampoline_func_start:
 #if defined(__CET__) && (__CET__ & 1)
     endbr64
 #endif
-    sub    $8, %rsp
-    call    *%rcx
-    add    $8, %rsp
+    push   %rbp
+    mov    %rsp, %rbp
+    call   *%rcx
+    pop    %rbp
     ret
 #endif // __x86_64__
 #if defined(__aarch64__) && defined(__AARCH64EL__) && !defined(__ILP32__)
diff --git a/Python/perf_jit_trampoline.c b/Python/perf_jit_trampoline.c
index 2ca18c23593..671b56e0846 100644
--- a/Python/perf_jit_trampoline.c
+++ b/Python/perf_jit_trampoline.c
@@ -401,10 +401,12 @@ enum {
     DWRF_CFA_nop = 0x0,                    // No operation
     DWRF_CFA_offset_extended = 0x5,        // Extended offset instruction
     DWRF_CFA_def_cfa = 0xc,               // Define CFA rule
+    DWRF_CFA_def_cfa_register = 0xd,      // Define CFA register
     DWRF_CFA_def_cfa_offset = 0xe,        // Define CFA offset
     DWRF_CFA_offset_extended_sf = 0x11,   // Extended signed offset
     DWRF_CFA_advance_loc = 0x40,          // Advance location counter
-    DWRF_CFA_offset = 0x80                // Simple offset instruction
+    DWRF_CFA_offset = 0x80,               // Simple offset instruction
+    DWRF_CFA_restore = 0xc0               // Restore register
 };

 /* DWARF Exception Handling pointer encodings */
@@ -868,17 +870,22 @@ static void elf_init_ehframe(ELFObjectContext* ctx) {
          * conventions and register usage patterns.
          */
 #ifdef __x86_64__
-        /* x86_64 calling convention unwinding rules */
+        /* x86_64 calling convention unwinding rules with frame pointer */
 #  if defined(__CET__) && (__CET__ & 1)
-        DWRF_U8(DWRF_CFA_advance_loc | 8);    // Advance location by 8 bytes when CET protection is enabled
-#  else
-        DWRF_U8(DWRF_CFA_advance_loc | 4);    // Advance location by 4 bytes
+        DWRF_U8(DWRF_CFA_advance_loc | 4);    // Advance past endbr64 (4 bytes)
 #  endif
-        DWRF_U8(DWRF_CFA_def_cfa_offset);     // Redefine CFA offset
-        DWRF_UV(16);                          // New offset: SP + 16
-        DWRF_U8(DWRF_CFA_advance_loc | 6);    // Advance location by 6 bytes
-        DWRF_U8(DWRF_CFA_def_cfa_offset);     // Redefine CFA offset
-        DWRF_UV(8);                           // New offset: SP + 8
+        DWRF_U8(DWRF_CFA_advance_loc | 1);    // Advance past push %rbp (1 byte)
+        DWRF_U8(DWRF_CFA_def_cfa_offset);     // def_cfa_offset 16
+        DWRF_UV(16);
+        DWRF_U8(DWRF_CFA_offset | DWRF_REG_BP); // offset r6 at cfa-16
+        DWRF_UV(2);
+        DWRF_U8(DWRF_CFA_advance_loc | 3);    // Advance past mov %rsp,%rbp (3 bytes)
+        DWRF_U8(DWRF_CFA_def_cfa_register);   // def_cfa_register r6
+        DWRF_UV(DWRF_REG_BP);
+        DWRF_U8(DWRF_CFA_advance_loc | 3);    // Advance past call *%rcx (2 bytes) + pop %rbp (1 byte) = 3
+        DWRF_U8(DWRF_CFA_def_cfa);            // def_cfa r7 ofs 8
+        DWRF_UV(DWRF_REG_SP);
+        DWRF_UV(8);
 #elif defined(__aarch64__) && defined(__AARCH64EL__) && !defined(__ILP32__)
         /* AArch64 calling convention unwinding rules */
         DWRF_U8(DWRF_CFA_advance_loc | 1);        // Advance location by 1 instruction (stp x29, x30)

@pablogsal
Member

@canova I have made a PR to add the frame pointers and the DWARF, as it was slightly more tricky than I thought: #136500


canova commented Jul 10, 2025

@canova I have made a PR to add the frame pointers and the DWARF, as it was slightly more tricky than I thought: #136500

Ah cool, thank you! I can rebase this PR on top of yours.

@pablogsal
Member

@canova @mstange can you confirm that samply works as expected with #136500 ?

@pablogsal
Member

Would you like me to update the documentation in this PR as well?

Yep please.


canova commented Jul 10, 2025

@canova @mstange can you confirm that samply works as expected with #136500 ?

Just tested your patch using samply. I can verify that it fixes the stack walking! Here are before and after profiles:
Before your patch / After your patch

@pablogsal
Member

@canova @mstange do you know why the Python frames appear duplicated? Is that a samply bug?


mstange commented Jul 10, 2025

It's a samply bug, yes - or rather a workaround that's not necessary for Python because there's only a single native function for each Python function (just the trampoline itself). The duplication was initially added for JS JITs which can compile the same function multiple times with different JIT tiers, and it was useful to both have a single call node per JS function and to have a separate call node for each tier.


canova commented Jul 10, 2025

@pablogsal I just added two new commits for documentation and testing.

For documentation, I made the perf profiling page more generic and added a section for samply. Started small, so I can shape it along the way with the feedback I get.

For the tests, we already have some coverage from the previous perf tests, but now I added some samply tests as well. (They might fail on x86_64 since I haven't rebased on top of your patch yet; I'll do that once it's merged.) Let me know what you think about them. Currently the samply tests are skipped if samply is not installed. I think you were talking about setting up some buildbot tasks; would it be possible to install samply there?

@canova canova force-pushed the perf-trampoline-macos branch from e2252bf to 8b03dc1 Compare July 10, 2025 14:27
Member

@pablogsal pablogsal left a comment

@hugovk @AA-Turner I would love it if you could help us a bit with the docs. The previous docs assume that only the perf profiler exists and are centered on it, but it seems we want to move to a world where perf and other perf-like profilers can exist. I want this to be an excellent place to learn about this, so I would love it if you could guide @canova on how to ensure there is enough advice on using samply that users have a good experience and it doesn't feel out of place.

@@ -148,6 +149,26 @@ Instead, if we run the same experiment with ``perf`` support enabled we get:



Using ``samply`` profiler
Member

We are going to need a bit more here. For example, samply supports both perf modes, so we need clarification on when to use each and what the recommendations are, how to read the flamegraphs, etc.


Would it make sense to break these discussions out into a separate PR? It doesn't seem useful to delay landing trampoline support for this.

Member

@pablogsal pablogsal Jul 10, 2025

It doesn't seem useful to delay landing trampoline support for this.

Is there any rush? This will go into 3.15 anyway and that's going to be released October 2026. We still need to figure out the buildbot situation which will take some time...

Member

I am happy to separate this into a different PR, though

Author

Oh, I was hoping that we could maybe enable it on 3.14, considering that the code has been there since 3.12 and this is mostly putting lots of ifdefs here and there (minus the samply and documentation parts). I suspect that updating the documentation will take longer. But I'm not familiar with the release process.

Member

@pablogsal pablogsal Jul 10, 2025

Oh, I was hoping that we could maybe enable it on 3.14.

No way unfortunately as we are 3 betas past beta freeze. It's up to the release manager to decide (CC @hugovk) but we have a strict policy for this I am afraid and no new features can be added past beta freeze.

Member

@pablogsal pablogsal Jul 10, 2025

@hugovk Checking just in case, although I assume the answer is "no": would you consider adding this to 3.14, given that this is a new platform and the code is gated by ifdefs? This would allow people on macOS to profile their code using a native profiler, which would be very useful for investigating performance in Python+compiled code.

Member

Some context for this: this would allow people on macOS to profile free-threaded Python using samply, so maybe there is a case to allow it in 3.14, but I am still unsure. Up to you @hugovk.

@pablogsal pablogsal force-pushed the perf-trampoline-macos branch from e36f2af to 8b03dc1 Compare July 10, 2025 21:53