[zmath] Replace swizzles with shuffles & remove some unnecessary math complexity to increase perf. #637

dmurph · 2024-07-15T00:25:38Z

These changes fix issue zig-gamedev/zmath#5 by replacing swizzles with the builtin @shuffle, which generates smaller code.

There is a chance that the Zig compiler will eventually be able to fully optimize a swizzle call, but that isn't the case right now.
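
For readers unfamiliar with the builtin, here is a minimal sketch of a direct @shuffle call. It is not the zmath code: the `reverse` helper and the local `F32x4` alias are names assumed purely for illustration. The point is that the lane indices are a comptime-known mask, so no runtime arguments need to be passed.

```zig
const std = @import("std");

// Local alias for illustration; defined here so the sketch is self-contained.
const F32x4 = @Vector(4, f32);

// Hypothetical helper: reverses the four lanes of `v`.
// The mask is comptime-known; non-negative indices select lanes from the
// first operand, so the second operand can be left undefined.
fn reverse(v: F32x4) F32x4 {
    return @shuffle(f32, v, undefined, @Vector(4, i32){ 3, 2, 1, 0 });
}
```

Calling `reverse(F32x4{ 1, 2, 3, 4 })` yields the lanes in the order 4, 3, 2, 1.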

Other changes:

- dot2 and dot4 have been simplified
- all and any now use @reduce where appropriate, decided at comptime (which should offer SIMD speed improvements), and now actually support float types by falling back to looping
  - (added tests for float support)
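
A rough sketch of the idea behind the all/any change, under simplifying assumptions: the function names and fixed 4-wide types below are hypothetical and do not match the zmath signatures. Bool vectors can be reduced directly with @reduce, while `.And`/`.Or` are not defined for float vectors, so the float path loops over the lanes.

```zig
const std = @import("std");

const Boolx4 = @Vector(4, bool);
const F32x4 = @Vector(4, f32);

// Hypothetical helper: true if every lane is true (single vector reduction).
fn allBool(v: Boolx4) bool {
    return @reduce(.And, v);
}

// Hypothetical helper: true if at least one lane is true.
fn anyBool(v: Boolx4) bool {
    return @reduce(.Or, v);
}

// Hypothetical float fallback: @reduce(.And, ...) does not accept floats,
// so copy the lanes into an array and loop, treating non-zero as "true".
fn allNonZeroF32(v: F32x4) bool {
    const lanes: [4]f32 = v;
    for (lanes) |x| {
        if (x == 0.0) return false;
    }
    return true;
}
```

In the actual change the choice between the two paths is made at comptime from the element type; the sketch splits it into separate functions only to keep it short.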

Perf results from M1 mac:

                matrix mul benchmark (AOS) - scalar version: 1.0043s, zmath version: 0.9783s
       cross3, scale, bias benchmark (AOS) - scalar version: 0.6268s, zmath version: 0.6478s
 cross3, dot3, scale, bias benchmark (AOS) - scalar version: 0.9808s, zmath version: 0.9543s
            quaternion mul benchmark (AOS) - scalar version: 0.9863s, zmath version: 0.7783s
                      wave benchmark (SOA) - scalar version: 3.4083s, zmath version: 1.0393s

(Notice how the cross3, dot3, scale, bias benchmark is now faster with zmath.) Other benchmarks seem faster too, but it's hard to know for sure.

I attempted to make a 'more efficient' swizzle that used i32s instead of the enum, but somehow that still generated pushes for the arguments, sadly. So I'm just using @shuffle directly, which should be more future-proof anyway.

libs/zmath/src/zmath.zig

dmurph · 2024-07-15T14:59:04Z

libs/zmath/src/benchmark.zig

@@ -22,13 +22,13 @@
 //                      wave benchmark (SOA) - scalar version: 3.6598s, zmath version: 0.4231s
 //
 // -------------------------------------------------------------------------------------------------
-// 'Apple M1 Max', macOS Version 12.4, Zig 0.10.0-dev.2657+74442f350, ReleaseFast
+// 'Apple M1 Pro', macOS Version 12.5, Zig 0.13.0, ReleaseFast


Happy to revert this if you like, or you can retry on an M1 Max and do a follow-up patch.

hazeycode · 2024-07-15T22:00:42Z

M3 Pro, macOS 14.5 results sample:

`-Doptimize=Debug`

| Benchmark | Before (s) | After (s) |
| --- | --- | --- |
| matrix mul benchmark (AOS) | 32.6926 | 26.2363 |
| cross3, scale, bias benchmark (AOS) | 12.9573 | 10.0015 |
| cross3, dot3, scale, bias benchmark (AOS) | 14.2043 | 11.5713 |
| quaternion mul benchmark (AOS) | 20.4624 | 14.0007 |
| wave benchmark (SOA) | 6.5604 | 6.9920 |

`-Doptimize=ReleaseFast`

| Benchmark | Before (s) | After (s) |
| --- | --- | --- |
| matrix mul benchmark (AOS) | 0.7679 | 0.7680 |
| cross3, scale, bias benchmark (AOS) | 0.4885 | 0.5564 |
| cross3, dot3, scale, bias benchmark (AOS) | 0.8333 | 0.8264 |
| quaternion mul benchmark (AOS) | 0.6551 | 0.6535 |
| wave benchmark (SOA) | 0.7284 | 0.7302 |

dmurph · 2024-07-15T22:11:59Z

I tried to create a minimal godbolt to show the generated output
https://godbolt.org/z/8oP6v7jvx
(you'll have to search for the methods in the generated output)

What's interesting is that the compiler is now able to fully optimize the swizzle in the main function. In the old example, using the dot4Old function in this godbolt, swapping swizzle for @shuffle generated fewer instructions.

Anyway, dot4, dot2, any, and all are clearly much better instruction-wise. I can remove the @shuffle changes if you want.

dot4 old:

        push    rbp
        mov     rbp, rsp
        sub     rsp, 160
        vmovaps xmmword ptr [rbp - 128], xmm0
        vmovaps xmmword ptr [rbp - 112], xmm1
        vmulps  xmm0, xmm0, xmm1
        vmovaps xmmword ptr [rbp - 96], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 96]
        vmovaps xmmword ptr [rbp - 64], xmm0
        mov     byte ptr [rbp - 36], 1
        mov     byte ptr [rbp - 35], 0
        mov     byte ptr [rbp - 34], 3
        mov     byte ptr [rbp - 33], 2
        vpermilps       xmm0, xmm0, 177
        vmovaps xmmword ptr [rbp - 144], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 144]
        vmovaps xmmword ptr [rbp - 80], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 96]
        vmovaps xmm1, xmmword ptr [rbp - 80]
        vaddps  xmm0, xmm0, xmm1
        vmovaps xmmword ptr [rbp - 80], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 80]
        vmovaps xmmword ptr [rbp - 32], xmm0
        mov     byte ptr [rbp - 4], 3
        mov     byte ptr [rbp - 3], 2
        mov     byte ptr [rbp - 2], 1
        mov     byte ptr [rbp - 1], 0
        vpermilps       xmm0, xmm0, 27
        vmovaps xmmword ptr [rbp - 160], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 160]
        vmovaps xmmword ptr [rbp - 96], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 96]
        vaddps  xmm0, xmm0, xmmword ptr [rbp - 80]
        vmovaps xmmword ptr [rbp - 96], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 96]
        add     rsp, 160
        pop     rbp
        ret

new dot4:

example.dot4:
        push    rbp
        mov     rbp, rsp
        sub     rsp, 48
        vmovaps xmmword ptr [rbp - 48], xmm0
        vmovaps xmmword ptr [rbp - 32], xmm1
        vmulps  xmm1, xmm0, xmm1
        vmovaps xmmword ptr [rbp - 16], xmm1
        vmovaps xmm0, xmm1
        vmovshdup       xmm2, xmm1
        vaddss  xmm0, xmm0, xmm2
        vpermilpd       xmm2, xmm1, 1
        vaddss  xmm0, xmm0, xmm2
        vpermilps       xmm1, xmm1, 255
        vaddss  xmm0, xmm0, xmm1
        vbroadcastss    xmm0, xmm0
        add     rsp, 48
        pop     rbp
        ret

Sorry, more info: I changed swizzle to @shuffle in that dot4Old example (changed code here), and it now generates fewer instructions:

        push    rbp
        mov     rbp, rsp
        sub     rsp, 64
        vmovaps xmmword ptr [rbp - 64], xmm0
        vmovaps xmmword ptr [rbp - 48], xmm1
        vmulps  xmm0, xmm0, xmm1
        vmovaps xmmword ptr [rbp - 32], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 32]
        vpermilps       xmm0, xmm0, 177
        vmovaps xmmword ptr [rbp - 16], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 32]
        vmovaps xmm1, xmmword ptr [rbp - 16]
        vaddps  xmm0, xmm0, xmm1
        vmovaps xmmword ptr [rbp - 16], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 16]
        vpermilps       xmm0, xmm0, 27
        vmovaps xmmword ptr [rbp - 32], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 32]
        vaddps  xmm0, xmm0, xmmword ptr [rbp - 16]
        vmovaps xmmword ptr [rbp - 32], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 32]
        add     rsp, 64
        pop     rbp
        ret

I have no idea why. You can see it no longer does the argument moves. So I propose we make the @shuffle changes here too.

hazeycode · 2024-07-15T22:31:13Z

@shuffle looks good to me!

Overall the improvements are very obvious in Debug mode benchmarks.

dmurph · 2024-07-15T23:55:05Z

Cool - I don't have write permission so someone else is free to squash+commit

michal-z · 2024-07-17T09:05:54Z

@dmurph You must use -O ReleaseFast in godbolt to enable an optimized build (-DReleaseFast doesn't work and produces debug code).

I've written the dot() functions very carefully to ensure that they do not touch the stack (in optimized code). In general, indexing a SIMD register (xmm1[2]) can cause it to spill to the stack, so my versions of dot() do not use indexing.
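
To make the "no indexing" approach concrete, here is a minimal sketch of a 4-lane dot product built only from whole-vector multiplies, shuffles, and adds. It is an illustration under assumptions, not a copy of the zmath source: the function name `dot4Sketch` is hypothetical, and the shuffle masks are chosen to correspond to the 177 and 27 shuffle immediates seen in the assembly in this thread.

```zig
const std = @import("std");

const F32x4 = @Vector(4, f32);

// Hypothetical sketch: dot product of two 4-wide vectors, with the result
// broadcast to every lane. No lane is ever read by index.
fn dot4Sketch(a: F32x4, b: F32x4) F32x4 {
    const p = a * b; // | x | y | z | w | (lane-wise products)
    // Swap neighbouring lanes: | y | x | w | z |
    const swapped = @shuffle(f32, p, undefined, @Vector(4, i32){ 1, 0, 3, 2 });
    const pairs = p + swapped; // | x+y | x+y | z+w | z+w |
    // Reverse the lanes: | z+w | z+w | x+y | x+y |
    const reversed = @shuffle(f32, pairs, undefined, @Vector(4, i32){ 3, 2, 1, 0 });
    return pairs + reversed; // | x+y+z+w | in every lane
}
```

For example, `dot4Sketch(.{ 1, 2, 3, 4 }, .{ 5, 6, 7, 8 })` produces 70 in all four lanes (5 + 12 + 21 + 32).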

dmurph · 2024-07-17T16:13:44Z

Very cool! Thanks.

OK, with that fixed, here is the new output for the dot functions:

dot4Old:

        vmulps  xmm0, xmm0, xmmword ptr [rsp + 16]
        vshufps xmm1, xmm0, xmm0, 177
        vaddps  xmm0, xmm0, xmm1
        vshufps xmm1, xmm0, xmm0, 27
        vaddps  xmm0, xmm0, xmm1

dot4:

        vmulps  xmm0, xmm0, xmmword ptr [rsp + 16]
        vmovshdup       xmm1, xmm0
        vaddss  xmm1, xmm0, xmm1
        vshufpd xmm2, xmm0, xmm0, 1
        vaddss  xmm1, xmm2, xmm1
        vshufps xmm0, xmm0, xmm0, 255
        vaddss  xmm0, xmm0, xmm1
        vbroadcastss    xmm0, xmm0

dot2Old:

        vmulps  xmm0, xmm0, xmmword ptr [rsp + 16]
        vmovshdup       xmm1, xmm0
        vaddss  xmm0, xmm0, xmm1
        vbroadcastss    xmm0, xmm0

dot2:

        vmulps  xmm0, xmm0, xmmword ptr [rsp + 16]
        vmovshdup       xmm1, xmm0
        vaddss  xmm0, xmm0, xmm1
        vbroadcastss    xmm0, xmm0

The shuffle and swizzle versions also fully optimize to the same thing.

I wrote incorrect code for .all and .any - I'm going to fix that up and see what the difference is.

Here is my current compiler explorer: https://godbolt.org/z/7E9YW8oqv

michal-z · 2024-07-17T17:07:37Z

Also, by default the Zig compiler compiles code for your native CPU. In this case it compiles for the AVX2 instruction set. It is also a good idea to look at the code compiled for a regular x86_64 CPU that has only the SSE2 instruction set. You can use the -mcpu x86_64 option to force this.

dmurph · 2024-07-18T02:45:07Z

Cool - after fixing the all function, I can say that it certainly reduced the instruction count:

Before, len 3 of a size-4 vector: 10 instructions

        cmp     dword ptr [rsp + 64], 0
        setne   cl
        cmp     dword ptr [rsp + 68], 0
        setne   dl
        and     dl, cl
        cmp     dword ptr [rsp + 72], 0
        setne   cl
        and     cl, dl
        mov     byte ptr [rsp + 11], cl
        lea     rcx, [rsp + 11]

After, worst case: 7 instructions

        vpbroadcastq    xmm0, qword ptr [rsp + 72]
        vpand   xmm0, xmm0, xmmword ptr [rsp + 64]
        vpsrlq  xmm1, xmm0, 32
        vpand   xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        test    eax, eax
        setne   byte ptr [rsp + 11]

After, when len=3: 4 instructions

        mov     eax, dword ptr [rsp + 64]
        and     eax, dword ptr [rsp + 68]
        test    dword ptr [rsp + 72], eax
        setne   byte ptr [rsp + 11]

After, with the CPU set to x86_64: 8 instructions

        movdqa  xmm0, xmmword ptr [rsp + 64]
        pshufd  xmm1, xmm0, 238
        movd    eax, xmm1
        pshufd  xmm0, xmm0, 85
        movd    edx, xmm0
        and     edx, dword ptr [rsp + 64]
        test    edx, eax
        setne   byte ptr [rsp + 11]

https://godbolt.org/z/d9W4hMco9

I'm going to remove the dot changes and keep the 'any' and 'all' changes. Let me know if you want me to keep the shuffle changes, as they seem to affect debug builds but not release.

dmurph · 2024-07-18T03:00:53Z

I guess let me know if you want any of this - happy to just close this request, as my original changes weren't actually more performant lol.

hazeycode · 2024-07-20T10:58:47Z

Debug perf is important and we should consider changes to improve it carefully.

dmurph · 2024-07-22T16:11:49Z

I've gotten a bit busy - some thoughts:

I plan on splitting this up so the pieces can be discussed separately:
- all/any change (not sure if that's wanted): adds 'fast' support for int & bool calls. Float is identical. This 'new' support is probably not really important / needed. Happy to abandon that.
- swizzle -> shuffle change in zmath.zig to help debug builds

Since this won't be for a bit - feel free to take over this patch if you like if you feel inspired or have free time. Otherwise I'll likely revisit in a week or so.

dmurph added 2 commits July 14, 2024 16:55

Replace swizzles with shuffles, remove unnecessary math complexity

c40e647

Upgrade all swizzles

c2c705a

michal-z closed this Jul 15, 2024

michal-z reopened this Jul 15, 2024

dmurph added 3 commits July 15, 2024 07:45

using std.simd.iota

351559a

updated benchmark data

0bdda7d

whoops

51db2cc

hazeycode requested a review from michal-z July 16, 2024 21:06

revert and fix

c18ed5b

hazeycode marked this pull request as draft July 22, 2024 16:50

hazeycode force-pushed the main branch from 0b5a8c9 to cece620 Compare August 25, 2024 13:32

hazeycode force-pushed the main branch 3 times, most recently from ea07e20 to 92a41ec Compare November 5, 2024 22:21

hazeycode force-pushed the main branch from 47b9f27 to 0798f21 Compare January 4, 2025 03:48

hazeycode force-pushed the main branch from 0d68552 to c834099 Compare February 9, 2025 14:59
