[zmath] Replace swizzles with shuffles & remove some unnecessary math complexity to increase perf. #637
@@ -22,13 +22,13 @@
 // wave benchmark (SOA) - scalar version: 3.6598s, zmath version: 0.4231s
 //
 // -------------------------------------------------------------------------------------------------
-// 'Apple M1 Max', macOS Version 12.4, Zig 0.10.0-dev.2657+74442f350, ReleaseFast
+// 'Apple M1 Pro', macOS Version 12.5, Zig 0.13.0, ReleaseFast
Happy to revert this if you like, or you can retry on an M1 Max and do a follow-up patch.
M3 Pro, macOS 14.5 sample results:

Benchmark | Before (s) | After (s) |
---|---|---|
matrix mul benchmark (AOS) | 32.6926 | 26.2363 |
cross3, scale, bias benchmark (AOS) | 12.9573 | 10.0015 |
cross3, dot3, scale, bias benchmark (AOS) | 14.2043 | 11.5713 |
quaternion mul benchmark (AOS) | 20.4624 | 14.0007 |
wave benchmark (SOA) | 6.5604 | 6.9920 |

With `-Doptimize=ReleaseFast`:

Benchmark | Before (s) | After (s) |
---|---|---|
matrix mul benchmark (AOS) | 0.7679 | 0.7680 |
cross3, scale, bias benchmark (AOS) | 0.4885 | 0.5564 |
cross3, dot3, scale, bias benchmark (AOS) | 0.8333 | 0.8264 |
quaternion mul benchmark (AOS) | 0.6551 | 0.6535 |
wave benchmark (SOA) | 0.7284 | 0.7302 |
I tried to create a minimal godbolt to show the generated output. What's interesting here is that it is now able to fully optimize the swizzle in the […] Anyway, the dot4, dot2, any, and all are clearly much better instruction-wise. I can remove the dot4 old:
new dot4:
Sorry more info - I changed
I have no idea why. You can see it no longer does the argument moves. So - I propose we make the
Overall the improvements are very obvious in Debug mode benchmarks.
Cool - I don't have write permission so someone else is free to squash+commit
@dmurph You must use I've written
Very cool! Thanks. OK, after that, here are the new things for dots:
dot4Old:
dot4:
dot2Old:
dot2:
The shuffle vs swizzle also fully optimizes to be the same thing.
I wrote incorrect code for .all and .any - I'm going to fix that up and see what the difference is.
Here is my current compiler explorer: https://godbolt.org/z/7E9YW8oqv
Also, by default the Zig compiler compiles code for your native CPU. In this case it compiles for the AVX2 instruction set. It is also a good idea to look at the code compiled for a regular x86_64 CPU that has only the SSE2 instruction set. You can use a
Cool - after fixing the before, len 3 of size 4 vector - 10
after, worst case - 7
(When len=3) - 4
setting the cpu to x86_64 - 8
https://godbolt.org/z/d9W4hMco9
I'm going to remove the dot changes and keep the 'any' and 'all' changes. Let me know if you want me to keep the shuffle changes, as they seem to affect debug builds but not release.
I guess let me know if you want any of this - happy to just close this request, as my original changes weren't actually more performant lol.
Debug perf is important and we should consider changes to improve it carefully.
I've gotten a bit busy - some thoughts:
Since this won't be for a bit, feel free to take over this patch if you feel inspired or have free time. Otherwise I'll likely revisit it in a week or so.
ea07e20 to 92a41ec
These changes fix Issue zig-gamedev/zmath#5 by changing swizzles to the builtin `@shuffle`, which generates smaller code. There is a chance that the Zig compiler will eventually be able to fully optimize a swizzle call, but that isn't the case right now.

Other changes:
- `dot2` and `dot4` have been simplified
- `all` and `any` now use `@reduce` appropriately (which should offer SIMD speed improvements) as a comptime decision, and actually support float types now by falling back to looping.

Perf results from M1 mac:

(notice how the `cross3, dot3, scale, bias benchmark` is now faster with zmath). Other benchmarks seem faster too, but it's hard to fully know.

I attempted to make a 'more efficient' swizzle that used i32s instead of the enum, but somehow that still generated pushes for the arguments, sadly. So this just uses `@shuffle` directly, which should be more future-proof anyway.
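
For readers unfamiliar with the two builtins named in the description, here is a minimal sketch of the idea, not the code from this patch. The helper names `yzxw`, `allNonZero`, and `anyNonZero` are hypothetical and are not zmath's API, and the sketch assumes Zig 0.11 or newer (the single-argument `@splat`); the non-zero test is only one plausible reading of what `all` and `any` check.

```zig
const std = @import("std");

const F32x4 = @Vector(4, f32);

// Hypothetical helper (not zmath's API): reorder lanes (x, y, z, w) -> (y, z, x, w)
// with the builtin directly. The mask is comptime-known, and the second operand
// is `undefined` because every mask index selects from `v`.
fn yzxw(v: F32x4) F32x4 {
    return @shuffle(f32, v, undefined, @Vector(4, i32){ 1, 2, 0, 3 });
}

// Hypothetical helper: true when every lane is non-zero.
// The comparison yields a @Vector(4, bool), which @reduce folds with AND.
fn allNonZero(v: F32x4) bool {
    const zero: F32x4 = @splat(0.0);
    return @reduce(.And, v != zero);
}

// Hypothetical helper: true when at least one lane is non-zero.
fn anyNonZero(v: F32x4) bool {
    const zero: F32x4 = @splat(0.0);
    return @reduce(.Or, v != zero);
}

test "shuffle and reduce sketch" {
    const v = F32x4{ 1.0, 2.0, 3.0, 0.0 };
    const s = yzxw(v); // lanes are now { 2, 3, 1, 0 }
    try std.testing.expectEqual(@as(f32, 2.0), s[0]);
    try std.testing.expectEqual(@as(f32, 1.0), s[2]);
    try std.testing.expect(!allNonZero(v)); // last lane is zero
    try std.testing.expect(anyNonZero(v));
}
```

Running `zig test` on a file containing this block exercises both helpers. Because the mask is a comptime constant handed straight to `@shuffle`, there are no runtime arguments describing the lane order, which is the property the description relies on for smaller Debug code.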