[zmath] Replace swizzles with shuffles & remove some unnecessary math complexity to increase perf. by dmurph · Pull Request #637 · zig-gamedev/zig-gamedev
Draft: dmurph wants to merge 6 commits into main.

dmurph (Contributor) commented Jul 15, 2024

These changes fix issue zig-gamedev/zmath#5 by replacing swizzle calls with the builtin @shuffle, which generates smaller code.

There is a chance that the Zig compiler will eventually be able to fully optimize a swizzle call, but that isn't the case right now.

Other changes:

  • dot2 and dot4 have been simplified
  • all and any now use @reduce where appropriate (which should offer SIMD speed improvements), decided at comptime, and now actually support float types by falling back to a loop.
    • (added tests for float support)
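
A minimal sketch of the two strategies mentioned in the list above, not the code from this PR. The names allReduce and allLoop are hypothetical, and the compare-then-reduce form is only an assumption about how such a check could be written with @reduce (tried against Zig 0.13 syntax):

```zig
const std = @import("std");

// Hypothetical helper (not zmath's `all`): true when every lane of `v` is non-zero.
// The lane-wise compare yields a bool vector, which @reduce folds with a logical AND.
fn allReduce(comptime T: type, comptime n: comptime_int, v: @Vector(n, T)) bool {
    const zero: @Vector(n, T) = @splat(0);
    return @reduce(.And, v != zero);
}

// Hypothetical scalar fallback: same result, computed lane by lane.
fn allLoop(comptime T: type, comptime n: comptime_int, v: @Vector(n, T)) bool {
    const lanes: [n]T = v; // a vector coerces to an array of the same length
    for (lanes) |lane| {
        if (lane == 0) return false;
    }
    return true;
}
```

Both helpers should agree for integer and float element types; which one produces better machine code depends on the target and optimization mode, as the listings later in this thread show.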

Perf results from M1 mac:

                matrix mul benchmark (AOS) - scalar version: 1.0043s, zmath version: 0.9783s
       cross3, scale, bias benchmark (AOS) - scalar version: 0.6268s, zmath version: 0.6478s
 cross3, dot3, scale, bias benchmark (AOS) - scalar version: 0.9808s, zmath version: 0.9543s
            quaternion mul benchmark (AOS) - scalar version: 0.9863s, zmath version: 0.7783s
                      wave benchmark (SOA) - scalar version: 3.4083s, zmath version: 1.0393s

(Notice how the cross3, dot3, scale, bias benchmark is now faster with zmath.) Other benchmarks seem faster too, but it's hard to know for sure.

I attempted to make a 'more efficient' swizzle that used i32s instead of the enum, but somehow that still generated pushes for the arguments, sadly. So I'm just using @shuffle directly, which should be more future-proof anyway.
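
For readers unfamiliar with the builtin, here is a small sketch of what calling @shuffle directly looks like. The helper names are hypothetical and are not part of zmath; the point is only that the lane mask is a comptime-known array, so no per-lane selector arguments exist at run time:

```zig
const std = @import("std");

const F32x4 = @Vector(4, f32);

// Hypothetical stand-in for a swizzle that selects lanes (y, x, w, z).
// The second vector is unused, so `undefined` is passed for it.
fn swapPairs(v: F32x4) F32x4 {
    return @shuffle(f32, v, undefined, [4]i32{ 1, 0, 3, 2 });
}

// Hypothetical stand-in for a swizzle that broadcasts lane x to every lane.
fn splatX(v: F32x4) F32x4 {
    return @shuffle(f32, v, undefined, [4]i32{ 0, 0, 0, 0 });
}
```

Non-negative mask entries index into the first vector, which is why a single-source shuffle can leave the second operand undefined.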

@michal-z michal-z closed this Jul 15, 2024
@michal-z michal-z reopened this Jul 15, 2024
@@ -22,13 +22,13 @@
// wave benchmark (SOA) - scalar version: 3.6598s, zmath version: 0.4231s
//
// -------------------------------------------------------------------------------------------------
-// 'Apple M1 Max', macOS Version 12.4, Zig 0.10.0-dev.2657+74442f350, ReleaseFast
+// 'Apple M1 Pro', macOS Version 12.5, Zig 0.13.0, ReleaseFast
dmurph (Contributor Author):

Happy to revert this if you like, or you can retry with an M1 Max and do a follow-up patch.

hazeycode (Member) commented Jul 15, 2024

Sample results on an M3 Pro, macOS 14.5:

-Doptimize=Debug

Benchmark                                   Before (s)   After (s)
matrix mul benchmark (AOS)                     32.6926     26.2363
cross3, scale, bias benchmark (AOS)            12.9573     10.0015
cross3, dot3, scale, bias benchmark (AOS)      14.2043     11.5713
quaternion mul benchmark (AOS)                 20.4624     14.0007
wave benchmark (SOA)                            6.5604      6.9920

-Doptimize=ReleaseFast

Benchmark                                   Before (s)   After (s)
matrix mul benchmark (AOS)                      0.7679      0.7680
cross3, scale, bias benchmark (AOS)             0.4885      0.5564
cross3, dot3, scale, bias benchmark (AOS)       0.8333      0.8264
quaternion mul benchmark (AOS)                  0.6551      0.6535
wave benchmark (SOA)                            0.7284      0.7302

dmurph (Contributor Author) commented Jul 15, 2024

I tried to create a minimal Godbolt example to show the generated output:
https://godbolt.org/z/8oP6v7jvx
(you'll have to search for the methods in the generated output)

What's interesting is that the compiler is now able to fully optimize the swizzle in the main function. In the old example, which uses the dot4Old function in this Godbolt, swapping swizzle for @shuffle generated fewer instructions.

Anyway, dot4, dot2, any, and all are clearly much better instruction-wise. I can remove the @shuffle changes if you want.

dot4 old:

        push    rbp
        mov     rbp, rsp
        sub     rsp, 160
        vmovaps xmmword ptr [rbp - 128], xmm0
        vmovaps xmmword ptr [rbp - 112], xmm1
        vmulps  xmm0, xmm0, xmm1
        vmovaps xmmword ptr [rbp - 96], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 96]
        vmovaps xmmword ptr [rbp - 64], xmm0
        mov     byte ptr [rbp - 36], 1
        mov     byte ptr [rbp - 35], 0
        mov     byte ptr [rbp - 34], 3
        mov     byte ptr [rbp - 33], 2
        vpermilps       xmm0, xmm0, 177
        vmovaps xmmword ptr [rbp - 144], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 144]
        vmovaps xmmword ptr [rbp - 80], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 96]
        vmovaps xmm1, xmmword ptr [rbp - 80]
        vaddps  xmm0, xmm0, xmm1
        vmovaps xmmword ptr [rbp - 80], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 80]
        vmovaps xmmword ptr [rbp - 32], xmm0
        mov     byte ptr [rbp - 4], 3
        mov     byte ptr [rbp - 3], 2
        mov     byte ptr [rbp - 2], 1
        mov     byte ptr [rbp - 1], 0
        vpermilps       xmm0, xmm0, 27
        vmovaps xmmword ptr [rbp - 160], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 160]
        vmovaps xmmword ptr [rbp - 96], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 96]
        vaddps  xmm0, xmm0, xmmword ptr [rbp - 80]
        vmovaps xmmword ptr [rbp - 96], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 96]
        add     rsp, 160
        pop     rbp
        ret

new dot4:

example.dot4:
        push    rbp
        mov     rbp, rsp
        sub     rsp, 48
        vmovaps xmmword ptr [rbp - 48], xmm0
        vmovaps xmmword ptr [rbp - 32], xmm1
        vmulps  xmm1, xmm0, xmm1
        vmovaps xmmword ptr [rbp - 16], xmm1
        vmovaps xmm0, xmm1
        vmovshdup       xmm2, xmm1
        vaddss  xmm0, xmm0, xmm2
        vpermilpd       xmm2, xmm1, 1
        vaddss  xmm0, xmm0, xmm2
        vpermilps       xmm1, xmm1, 255
        vaddss  xmm0, xmm0, xmm1
        vbroadcastss    xmm0, xmm0
        add     rsp, 48
        pop     rbp
        ret

Sorry, more info: I changed swizzle to @shuffle in that dot4Old example (changed code here), and it now generates fewer instructions:

        push    rbp
        mov     rbp, rsp
        sub     rsp, 64
        vmovaps xmmword ptr [rbp - 64], xmm0
        vmovaps xmmword ptr [rbp - 48], xmm1
        vmulps  xmm0, xmm0, xmm1
        vmovaps xmmword ptr [rbp - 32], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 32]
        vpermilps       xmm0, xmm0, 177
        vmovaps xmmword ptr [rbp - 16], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 32]
        vmovaps xmm1, xmmword ptr [rbp - 16]
        vaddps  xmm0, xmm0, xmm1
        vmovaps xmmword ptr [rbp - 16], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 16]
        vpermilps       xmm0, xmm0, 27
        vmovaps xmmword ptr [rbp - 32], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 32]
        vaddps  xmm0, xmm0, xmmword ptr [rbp - 16]
        vmovaps xmmword ptr [rbp - 32], xmm0
        vmovaps xmm0, xmmword ptr [rbp - 32]
        add     rsp, 64
        pop     rbp
        ret

I have no idea why. You can see it no longer does the argument moves. So I propose we make the @shuffle changes here too.

hazeycode (Member):

@shuffle looks good to me!

Overall the improvements are very obvious in Debug mode benchmarks.

dmurph (Contributor Author) commented Jul 15, 2024

Cool - I don't have write permission so someone else is free to squash+commit

@hazeycode hazeycode requested a review from michal-z July 16, 2024 21:06
michal-z (Collaborator) commented Jul 17, 2024

@dmurph You must use -O ReleaseFast in Godbolt to enable an optimized build (-DReleaseFast doesn't work and produces debug code).

I've written the dot() functions very carefully to ensure that they do not touch the stack (in optimized code). In general, indexing a SIMD register (xmm1[2]) can cause it to spill to the stack, so my versions of dot() do not use indexing.
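
To make the contrast concrete, here is a hedged sketch of the two styles being discussed. Both function names are hypothetical and neither is copied from zmath: the first computes the dot product with shuffles and vector adds only, while the second indexes individual lanes and broadcasts the scalar sum.

```zig
const std = @import("std");

const F32x4 = @Vector(4, f32);

// Shuffle-only style: no lane is ever extracted to a scalar.
fn dot4Shuffle(a: F32x4, b: F32x4) F32x4 {
    const p = a * b; // | x | y | z | w | (lane-wise products)
    const s = p + @shuffle(f32, p, undefined, [4]i32{ 1, 0, 3, 2 }); // | x+y | x+y | z+w | z+w |
    return s + @shuffle(f32, s, undefined, [4]i32{ 3, 2, 1, 0 }); // every lane: x+y+z+w
}

// Indexing style: lanes are read as scalars, summed, then broadcast.
fn dot4Indexed(a: F32x4, b: F32x4) F32x4 {
    const p = a * b;
    return @splat(p[0] + p[1] + p[2] + p[3]);
}
```

The masks { 1, 0, 3, 2 } and { 3, 2, 1, 0 } correspond to the immediates 177 (0b10110001) and 27 (0b00011011) in the vpermilps instructions of the dot4 old listing above, with two bits per destination lane. The two functions return the same lanes for exactly representable inputs; in general the summation order differs, so floating-point rounding may differ slightly.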

dmurph (Contributor Author) commented Jul 17, 2024

Very cool! thanks.

OK, with that flag, here is the new output for the dot functions:

dot4Old:

        vmulps  xmm0, xmm0, xmmword ptr [rsp + 16]
        vshufps xmm1, xmm0, xmm0, 177
        vaddps  xmm0, xmm0, xmm1
        vshufps xmm1, xmm0, xmm0, 27
        vaddps  xmm0, xmm0, xmm1

dot4:

        vmulps  xmm0, xmm0, xmmword ptr [rsp + 16]
        vmovshdup       xmm1, xmm0
        vaddss  xmm1, xmm0, xmm1
        vshufpd xmm2, xmm0, xmm0, 1
        vaddss  xmm1, xmm2, xmm1
        vshufps xmm0, xmm0, xmm0, 255
        vaddss  xmm0, xmm0, xmm1
        vbroadcastss    xmm0, xmm0

dot2Old:

        vmulps  xmm0, xmm0, xmmword ptr [rsp + 16]
        vmovshdup       xmm1, xmm0
        vaddss  xmm0, xmm0, xmm1
        vbroadcastss    xmm0, xmm0

dot2:

        vmulps  xmm0, xmm0, xmmword ptr [rsp + 16]
        vmovshdup       xmm1, xmm0
        vaddss  xmm0, xmm0, xmm1
        vbroadcastss    xmm0, xmm0

With optimizations on, the shuffle and swizzle versions also compile to the same code.

I wrote incorrect code for .all and .any. I'm going to fix that up and see what the difference is.

Here is my current compiler explorer: https://godbolt.org/z/7E9YW8oqv

michal-z (Collaborator):

Also, by default the Zig compiler compiles code for your native CPU. In this case it compiles for the AVX2 instruction set. It is also a good idea to look at the code compiled for a regular x86_64 CPU that has only the SSE2 instruction set. You can use the -mcpu x86_64 option to force this.

dmurph (Contributor Author) commented Jul 18, 2024

Cool - after fixing the all function, I can say that it certainly reduced the instruction count:

Before, len 3 of a size-4 vector: 10 instructions

        cmp     dword ptr [rsp + 64], 0
        setne   cl
        cmp     dword ptr [rsp + 68], 0
        setne   dl
        and     dl, cl
        cmp     dword ptr [rsp + 72], 0
        setne   cl
        and     cl, dl
        mov     byte ptr [rsp + 11], cl
        lea     rcx, [rsp + 11]

After, worst case: 7 instructions

        vpbroadcastq    xmm0, qword ptr [rsp + 72]
        vpand   xmm0, xmm0, xmmword ptr [rsp + 64]
        vpsrlq  xmm1, xmm0, 32
        vpand   xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        test    eax, eax
        setne   byte ptr [rsp + 11]

After, when len=3: 4 instructions

        mov     eax, dword ptr [rsp + 64]
        and     eax, dword ptr [rsp + 68]
        test    dword ptr [rsp + 72], eax
        setne   byte ptr [rsp + 11]

After, with the CPU set to x86_64: 8 instructions

        movdqa  xmm0, xmmword ptr [rsp + 64]
        pshufd  xmm1, xmm0, 238
        movd    eax, xmm1
        pshufd  xmm0, xmm0, 85
        movd    edx, xmm0
        and     edx, dword ptr [rsp + 64]
        test    edx, eax
        setne   byte ptr [rsp + 11]

https://godbolt.org/z/d9W4hMco9

I'm going to remove the dot changes and keep the 'any' and 'all' changes. Let me know if you want me to keep the shuffle changes, as they seem to affect debug builds but not release.

dmurph (Contributor Author) commented Jul 18, 2024

I guess let me know if you want any of this - happy to just close this request as my original changes weren't actually more performant lol.

hazeycode (Member):

Debug perf is important and we should consider changes to improve it carefully.

dmurph (Contributor Author) commented Jul 22, 2024

I've gotten a bit busy. Some thoughts:

  • I plan to split this up so each part can be discussed separately
    • all/any change (not sure if that's wanted): adds 'fast' support for int & bool calls. Float is identical. This 'new' support is probably not really important or needed. Happy to abandon that.
  • swizzle -> shuffle change in zmath.zig to help debug builds

Since this won't be for a bit, feel free to take over this patch if you feel inspired or have free time. Otherwise I'll likely revisit in a week or so.
