Apparently(?) this also needs to be attached to the function declarator and does not work as a function specifier: `static void *__preserve_none slowpath();` and not `__preserve_none static void *slowpath();` (unlike GCC attribute syntax, which tends to be fairly gung-ho about this sort of thing, sometimes with confusing results).
Yay to getting undocumented MSVC features disclosed if Microsoft thinks you’re important enough :/
Important enough, or benefits them directly? I have no good guesses how improving Python's performance would benefit them, but I would guess that's the real reason.
I guess there are some Python workloads on Azure; Microsoft provides a lot of data analysis and LLM tools as a service (not paid by CPU minutes). Saving CPU cycles there directly translates to financial savings.
I'm a bit out of the loop with this, but I hope it's not like that time with Python 3.14, when a geometric mean speedup of about 9-15% over the standard interpreter was claimed when built with Clang 19. It turned out the results were inflated due to a bug in LLVM 19 that prevented proper "tail duplication" optimization in the baseline interpreter's dispatch loop. Actual gains were approx. 4%.
Edit: Read through it and have come to the conclusion that the post is 100% OK and properly framed: He explicitly says his approach is to "sharing early and making a fool of myself," prioritizing transparency and rapid iteration over ironclad verification upfront.
One could make an argument that he should have cross-compiler checks, independent audits, or delayed announcements until results are bulletproof across all platforms. But given that he is 100% transparent with his thinking and how he works, it's all good in the hood.
Thanks :), that was indeed my intention. I think the previous 3.14 mistake was actually a good one in hindsight, because if I hadn't publicized our work early, I wouldn't have caught the attention of Nelson. Nelson also probably wouldn't have spent one month digging into the Clang 19 bug. This also means the bug wouldn't have been caught in the betas, and might've shipped with the actual release, which would have been way worse. So this was all a happy accident in hindsight that I'm grateful for, as it means overall CPython still benefited!
Also this time, I'm pretty confident because there are two perf improvements here: the dispatch logic, and the inlining. MSVC can actually convert switch-case interpreters to threaded code automatically if some conditions are met [1]. However, it does not seem to do that for the current CPython interpreter. In this case, I suspect the CPython interpreter loop is just too complicated to meet those conditions. The key point is also that we would be relying on MSVC again to do its magic, whereas this tail calling approach puts more control in the hands of the writers of the C code. The inlining is pretty much impossible to convince MSVC to do except with `__forceinline` or changing things to use macros [2]. However, we don't just mark every function as forceinline in CPython as it might negatively affect other compilers.
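For readers who haven't seen the two dispatch styles side by side, here is a minimal sketch on a toy stack machine. This is not CPython's code: the opcodes, `run_switch`, `run_tail`, the handler names, and the `MUSTTAIL`/`DISPATCH` macros are all made up for illustration. It assumes Clang's `musttail` attribute where available and falls back to an ordinary call elsewhere; the MSVC-specific spelling discussed in this thread is deliberately not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: use Clang's musttail when present; elsewhere this is a plain
   return and the compiler may or may not turn the call into a jump. */
#if defined(__clang__)
#  if __has_attribute(musttail)
#    define MUSTTAIL __attribute__((musttail))
#  endif
#endif
#ifndef MUSTTAIL
#  define MUSTTAIL
#endif

enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

/* Style 1: every opcode lives inside one large function. */
int run_switch(const int *pc) {
    int stack[16];
    int *sp = stack;
    assert(pc != NULL);
    for (;;) {
        switch (*pc) {
        case OP_PUSH: *sp++ = pc[1]; pc += 2; break;
        case OP_ADD:  sp[-2] += sp[-1]; sp--; pc += 1; break;
        case OP_MUL:  sp[-2] *= sp[-1]; sp--; pc += 1; break;
        case OP_HALT: return sp[-1];
        default:      return -1; /* unknown opcode */
        }
    }
}

/* Style 2: one small function per opcode, all with the same signature,
   each ending in a tail call to the next handler. */
typedef int (*handler_fn)(const int *pc, int *sp);

static int op_push(const int *pc, int *sp);
static int op_add(const int *pc, int *sp);
static int op_mul(const int *pc, int *sp);
static int op_halt(const int *pc, int *sp);

/* Order must match the enum above; opcodes are not validated here. */
static const handler_fn handlers[] = { op_push, op_add, op_mul, op_halt };

#define DISPATCH() MUSTTAIL return handlers[*pc](pc, sp)

static int op_push(const int *pc, int *sp) { *sp++ = pc[1]; pc += 2; DISPATCH(); }
static int op_add(const int *pc, int *sp)  { sp[-2] += sp[-1]; sp--; pc += 1; DISPATCH(); }
static int op_mul(const int *pc, int *sp)  { sp[-2] *= sp[-1]; sp--; pc += 1; DISPATCH(); }
static int op_halt(const int *pc, int *sp) { (void)pc; return sp[-1]; }

int run_tail(const int *pc) {
    int stack[16];
    assert(pc != NULL);
    return handlers[*pc](pc, stack); /* ordinary call; handlers chain from here */
}
```

Feeding both the program `{OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PUSH, 4, OP_MUL, OP_HALT}` computes (2 + 3) * 4. The point of style 2 is that each handler is a small function the compiler can reason about on its own, instead of one case in a giant loop.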
I've never seen this kind of benchmark graph before, and it looks really neat! How was this generated? What tool was used for the benchmarks?
(I actually spent most of Sep/Oct working on optimizing the Immer JS immutable update library, and used a benchmarking tool called `mitata`, so I was doing a lot of this same kind of work: https://github.com/immerjs/immer/pull/1183 . Would love to add some new tools to my repertoire here!)
It's in essence a histogram for the distribution, with smoothing, and mirrored on each side.
It looks nice, but is not without well-deserved opposition because 1) the use of smoothing can hide the actual distribution, 2) mirroring contains no extra information, while taking up space, and implying the extra space contains information, and 3) when shown vertically, too often causes people to exclaim it looks like a vulva.
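To make the "smoothed histogram, mirrored" description concrete, here is a rough sketch of what gets computed at each point along the axis. The names `kde_at`, `violin_at`, and `violin_slice` are hypothetical, and the Epanechnikov kernel is an assumption chosen because it needs no math library; plotting libraries commonly default to a Gaussian kernel instead.

```c
#include <stddef.h>

/* Epanechnikov kernel: a smooth bump that is zero outside (-1, 1). */
static double kernel(double u) {
    return (u > -1.0 && u < 1.0) ? 0.75 * (1.0 - u * u) : 0.0;
}

/* Kernel density estimate at x: the "smoothed histogram" height.
   h is the bandwidth; larger h means more smoothing. */
double kde_at(const double *samples, size_t n, double h, double x) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += kernel((x - samples[i]) / h);
    }
    return sum / ((double)n * h);
}

/* One slice of the violin at position x: the same density drawn on
   both sides of the centre line. */
typedef struct { double left, right; } violin_slice;

violin_slice violin_at(const double *samples, size_t n, double h, double x) {
    double d = kde_at(samples, n, h, x);
    violin_slice s = { -d, d };
    return s;
}
```

Note that `left` is always just `-right`, which is objection 2) above: the mirrored half is redundant. The bandwidth `h` is where objection 1) comes from, since a large value can smear distinct clusters together.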
Python’s goal is never really to be fast. If that were its goal, it would’ve had a JIT long ago instead of toying with optimizing the interpreter. Guido prioritized code simplicity over speed. A lot of speed improvements including the JIT (PEP 744 – JIT Compilation) came about after he stepped down.
Should probably mention that Guido ended up on the team working on a pretty credible JIT effort. Though Microsoft subsequently threw a wrench in it with layoffs. Not sure the status now.
This is (a) wildly over expectations for open source and (b) a massive pain to maintain, and (c) not even the biggest timewaster of python, which is the packaging "system".
TLDR: The tail-calling interpreter is slightly faster than computed goto.
> I used to believe the tailcalling interpreters get their speedup from better register use. While I still believe that now, I suspect that is not the main reason for speedups in CPython.
> My main guess now is that tail calling resets compiler heuristics to sane levels, so that compilers can do their jobs.
> Let me show an example, at the time of writing, CPython 3.15’s interpreter loop is around 12k lines of C code. That’s 12k lines in a single function for the switch-case and computed goto interpreter.
> […] In short, this overly large function breaks a lot of compiler heuristics.
> One of the most beneficial optimisations is inlining. In the past, we’ve found that compilers sometimes straight up refuse to inline even the simplest of functions in that 12k loc eval loop.
I think in the protobuf example the musttail did in fact benefit from better register use. All the functions are called with the same arguments, so there is no need to shuffle the registers. The same six register-passed arguments are reused from one function to the next.
Thanks for reading! For now, we maintain all 3 of the interpreters in CPython. We don't plan to remove the other interpreters anytime soon, probably never. If MSVC breaks the tail calling interpreter, we'll just go back to building and distributing the switch-case interpreter. Windows binaries will be slower again, but such is life :(.
Also the interpreter loop's dispatch is autogenerated and can be selected via configure flags. So there's almost no additional maintenance overhead. The main burden is the MSVC-specific changes we needed to get this working (amounting to a few hundred lines of code).
> Impact on debugging/profiling
I don't think there should be any, at least for Windows. Though I can't say for certain.
That makes sense, thanks for the detailed clarification. Having the switch-case interpreter as a fallback and keeping the dispatch autogenerated definitely reduces the long-term risk.
The Python interpreter core loop sounds like the perfect problem for AlphaEvolve. Or its open source equivalent OpenEvolve, if DeepMind doesn't want to speed up Python for the competition.
> By 1977[2][3] the phrase had entered American usage as slang for the cum shot in a pornographic film
https://en.wikipedia.org/wiki/Money_shot
[1]: https://github.com/faster-cpython/ideas/issues/183 [2]: https://github.com/python/cpython/issues/121263
In an HN discussion on the topic, medstrom at https://news.ycombinator.com/item?id=40766519 points to a half-violin plot at https://miro.medium.com/v2/1*J3Q4JKXa9WwJHtNaXRu-kQ.jpeg with the histogram on the left, and the half-violin on the right, which gives you a chance to see side-by-side presentation of the same data.
I'd have expected it to be hand rolled assembly for the major ISAs, with a C backup for less common ones.
How much energy has been wasted worldwide because of a relatively unoptimized interpreter?
Looks like it refers to this:
https://youtu.be/pUj32SF94Zw
(I wish it were a link in the article.)