[MPS] Add native strided API for MPSNDArray starting with macOS 15 #128393

DenisVieriu97 · 2024-06-11T06:17:32Z

Add support for native strides in MPS starting with macOS Sequoia. This removes the additional gather and scatter operations previously needed to resolve the strides or storage offsets of tensors.

Summary of changes (starting with macOS 15):

- Add support for the MPS strided API (strides, storage offsets, etc.):
  - `initWithBuffer:offset:descriptor:`
  - `arrayViewWithCommandBuffer:descriptor:aliasing:`
  - `arrayViewWithShape:strides:`
  - `reshapeWithCommandBuffer:sourceArray:shape:destinationArray:`
- Add native support for NHWC convolutions (without incurring any extra copy from NCHW -> NHWC -> NCHW).
- Add support for strided output buffers (previously we would create a contiguous buffer

OSes older than macOS 15 will run the old gather/scatter code path to resolve strides/storage offsets.
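To make the difference between the two code paths concrete, here is a minimal pure-Python sketch of how a strided view maps onto flat storage. It is only an illustration of the indexing arithmetic, not the MPS implementation; the helper names `strided_offsets` and `gather_to_contiguous` are hypothetical.

```python
from itertools import product


def strided_offsets(shape, strides, storage_offset=0):
    """Yield the flat storage index of every view element, in row-major order."""
    for idx in product(*(range(n) for n in shape)):
        yield storage_offset + sum(i * s for i, s in zip(idx, strides))


def gather_to_contiguous(storage, shape, strides, storage_offset=0):
    """Analogue of the old path: copy the view's elements into a new contiguous list."""
    return [storage[o] for o in strided_offsets(shape, strides, storage_offset)]


# A 3x4 row-major buffer holding 0..11.
storage = list(range(12))

# View equivalent to x[1:, ::2]: shape (2, 2), strides (4, 2), storage offset 4.
view = dict(shape=(2, 2), strides=(4, 2), storage_offset=4)

# The gather path materializes a copy; a native strided path would instead
# hand shape, strides, and offset to the kernel and read `storage` in place.
gathered = gather_to_contiguous(storage, **view)
print(gathered)
```

The gather step costs one extra pass over the data per non-contiguous input, which is the overhead the description says the native strided API avoids.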

A couple of performance stats collected from torchbench, comparing macOS 15 vs macOS 14:

- test_train[functorch_maml_omniglot-mps]: 27% faster
- test_train[timm_vision_transformer-mps]: 12% faster
- test_train[hf_T5-mps]: 9.46% faster

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

pytorch-bot · 2024-06-11T06:17:36Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128393

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 132fb66 with merge base 63e5b09:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

pull / linux-focal-cuda11.8-py3.10-gcc9 / test (distributed, 2, 3, amz2023.linux.8xlarge.nvidia.gpu) (gh) (similar failure)
distributed/test_c10d_nccl.py::NCCLTraceTestDumpOnTimeout::test_timeout_dumps_timing_enabled_False
pull / linux-focal-cuda11.8-py3.10-gcc9 / test (distributed, 3, 3, amz2023.linux.8xlarge.nvidia.gpu) (gh) (similar failure)
distributed/test_c10d_nccl.py::NCCLTraceTestTimeoutDumpOnStuckRanks::test_timeout_dumps_on_stuck_ranks

This comment was automatically generated by Dr. CI and updates every 15 minutes.

skotapati · 2024-08-06T20:35:24Z

@pytorchbot merge

pytorch-bot · 2024-08-06T20:35:28Z

This PR needs to be approved by an authorized maintainer before merge.

skotapati · 2024-08-07T16:55:15Z

@pytorchbot merge

pytorch-bot · 2024-08-07T16:55:19Z

This PR needs to be approved by an authorized maintainer before merge.

albanD · 2024-08-08T14:08:43Z

test/test_mps.py

@@ -3543,6 +3547,116 @@ def test_slice(self):
        mps_slice4 = mps_x[1, :].to('cpu')
        self.assertEqual(cpu_slice4, mps_slice4)

+    def test_slice_reshape_view_api_test_1(self):


These tests seem to have a lot in common.

Are they not already covered by OpInfo-based testing?

If not, should they be refactored so that they have a lot less boilerplate and consistent output metadata checking?

While making them generic, you can use the parametrize decorator to run them on all dtypes (avoiding int-specific tests).

skotapati · 2024-08-13

These tests are needed in addition to the OpInfo tests. I'll parametrize them as you recommended.

skotapati · 2024-08-13

I've updated the tests, but I can't get parametrize to work. It keeps returning a missing argument error.
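As a rough illustration of the refactor being discussed, the sketch below runs one shared check across several element types using the standard library's `subTest`. It is a stand-in, not PyTorch's `parametrize` decorator: the `array` typecodes substitute for tensor dtypes so the example runs without torch, and `TestSliceView` and `_check_slice` are hypothetical names.

```python
import unittest
from array import array

# Stand-ins for tensor dtypes (signed ints of several widths, float, double).
TYPECODES = ["b", "i", "q", "f", "d"]


class TestSliceView(unittest.TestCase):
    def _check_slice(self, typecode):
        # Shared body: build data, take a strided slice, compare with a reference,
        # and check the output metadata (here, just the element type).
        data = array(typecode, range(12))
        sliced = data[1::2]
        expected = array(typecode, [1, 3, 5, 7, 9, 11])
        self.assertEqual(sliced, expected)
        self.assertEqual(sliced.typecode, typecode)

    def test_slice_all_dtypes(self):
        for typecode in TYPECODES:
            # Each typecode is reported as its own sub-test on failure.
            with self.subTest(typecode=typecode):
                self._check_slice(typecode)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSliceView)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

On the missing-argument error: one possible cause, offered here as a guess rather than a diagnosis of this PR, is that tests decorated with `parametrize` from `torch.testing._internal.common_utils` are only expanded once the test class is passed to `instantiate_parametrized_tests`; without that call the test method is invoked without its parameter.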

albanD · 2024-08-08T14:12:28Z

aten/src/ATen/native/mps/operations/Indexing.mm

@@ -608,17 +608,20 @@ Tensor index_select_mps(const Tensor& self, int64_t dim, const Tensor& index) {
      newCachedGraph->outputTensor_ = outputTensor;
    });

+    // MPS TODO: MPS Gather is failing with MPS strided API. Fallback to old gather.


There are a couple of TODOs lying around; should we open issues to track them?

DenisVieriu97 · 2024-08-09

I will update with the issue number.

albanD

Ok!

This essentially undoes large skips on everything but MacOS Sequoia to nn.modules made by #128393. Instead it uses existing `xfail`, but guards it on `_macos15_or_newer` boolean.

Before the change, if run on MacOS 14:
```
% python3 ../test/test_modules.py -v -k Hardswish 2>&1|tail -n3
Ran 57 tests in 0.053s
OK (skipped=32)
```
After:
```
% python3 ../test/test_modules.py -v -k Hardswish 2>&1|tail -n3
Ran 57 tests in 0.229s
OK (skipped=10, expected failures=2)
```
Pull Request resolved: #134858
Approved by: https://github.com/janeyx99

The issue reported in #135223 was already solved in #128393. This PR adds a regression test for it.

Fixes #135223
Pull Request resolved: #135440
Approved by: https://github.com/ezyang

…ytorch#128393)

Add support for native strides in MPS starting with macOS Sequoia. This will get rid of the additional gather and scatter operations needed to solve the strides or storage offsets of the tensors.

Summary of changes (starting with macOS 15):

- Add support for **MPS strided API** (strides/storage offsets etc):
  - [initWithBuffer:offset:descriptor:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/4391636-initwithbuffer?language=objc)
  - [arrayViewWithCommandBuffer:descriptor:aliasing:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/3114040-arrayviewwithcommandbuffer?language=objc)
  - [arrayViewWithShape:strides:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/4408694-arrayviewwithshape?language=objc)
  - [reshapeWithCommandBuffer:sourceArray:shape:destinationArray:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarrayidentity/4438557-reshapewithcommandbuffer?language=objc)
- Add native support for NHWC convolutions (without incurring any extra copy from NCHW -> NHWC -> NCHW).
- Add support for strided output buffers (previously we would create a contiguous buffer

OSes older than macOS 15 will run the old gather/scatter code path to solve strides/storage offsets.

---

Couple performance stats collected from torchbench comparing macOS 15 vs macOS 14:
```
- test_train[functorch_maml_omniglot-mps]: 27% faster
- test_train[timm_vision_transformer-mps]: 12% faster
- test_train[hf_T5-mps]: 9.46% faster
```
Pull Request resolved: pytorch#128393
Approved by: https://github.com/albanD

Co-authored-by: Siddharth Kotapati <skotapati@apple.com>

[MPS] Add regression test for `fft.fftfreq` (#135440)

The issue reported in #135223 was already solved in #128393. This PR adds a regression test for it.

Fixes #135223
Pull Request resolved: #135440
Approved by: https://github.com/ezyang

(cherry picked from commit 09287e3)

Co-authored-by: Roy Hvaara <roy@lightyear.no>

…`nn.Conv3d` (#141780)

When the input tensor to Conv3d is in the channels_last_3d memory format, the Conv3d op will generate incorrect output (see example image in #141471). This PR checks if the op is 3d, and then attempts to convert the input tensor to contiguous. Added a regression test that verifies the output by running the same op on the CPU.

I'm unsure if Conv3d supports the channels last memory format after #128393. If it does, we should consider updating the logic to utilize this as it would be more efficient. Perhaps @DenisVieriu97 knows or has more context?

Fixes #141471
Pull Request resolved: #141780
Approved by: https://github.com/malfet
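For readers unfamiliar with the layout involved, the following sketch computes the strides of a 5-D tensor in the default contiguous layout and in a channels-last layout, using plain Python. It only shows the stride arithmetic; the function names are hypothetical and nothing here reflects how the MPS convolution kernel consumes these strides.

```python
def contiguous_strides(shape):
    """Row-major strides: the last dimension varies fastest."""
    strides, step = [], 1
    for n in reversed(shape):
        strides.append(step)
        step *= n
    return tuple(reversed(strides))


def channels_last_strides(shape):
    """Strides for a logical (N, C, *spatial) shape stored physically as (N, *spatial, C)."""
    n, c, *spatial = shape
    physical = contiguous_strides((n, *spatial, c))
    # `physical` is ordered (N, *spatial, C); reorder it to the logical (N, C, *spatial).
    return (physical[0], physical[-1], *physical[1:-1])


shape = (2, 3, 4, 5, 6)  # logical N, C, D, H, W

default = contiguous_strides(shape)
channels_last_3d = channels_last_strides(shape)

print(default)           # (360, 120, 30, 6, 1)
print(channels_last_3d)  # (360, 1, 90, 18, 3)
```

Because the channel stride is 1 in the second layout, a kernel that assumes default strides reads the wrong elements, which is consistent with the fix above converting the input to contiguous before running the op.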

Caused by #128393, which changed the semantics of `needsGather` and resulted in silent correctness errors on MacOS-15+ if the output tensor is non-contiguous.

Fixes #145203
Pull Request resolved: #146085
Approved by: https://github.com/dcci
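The class of bug described here can be reproduced in miniature without any GPU code. The sketch below applies an in-place XOR to a strided view of a flat buffer in two ways: one wrongly treats the output as contiguous, the other writes through the view's strides. It is an illustration only; `xor_assuming_contiguous` and `xor_strided` are hypothetical names and do not correspond to the actual `needsGather` logic.

```python
def view_offsets(length, stride, storage_offset=0):
    """Flat storage indices of a 1-D strided view."""
    return [storage_offset + i * stride for i in range(length)]


def xor_assuming_contiguous(storage, length, stride, operand):
    """Buggy analogue: reads the view correctly but writes results contiguously."""
    values = [storage[o] for o in view_offsets(length, stride)]
    for i, v in enumerate(values):
        storage[i] = v ^ operand  # ignores the output's stride


def xor_strided(storage, length, stride, operand):
    """Correct analogue: writes each result back to the element it came from."""
    for o in view_offsets(length, stride):
        storage[o] ^= operand


# Emulate x[::2] ^= 1 on x = [1, 2, 3, 4, 5, 6].
wrong = [1, 2, 3, 4, 5, 6]
right = [1, 2, 3, 4, 5, 6]
xor_assuming_contiguous(wrong, length=3, stride=2, operand=1)
xor_strided(right, length=3, stride=2, operand=1)

print(wrong)  # [0, 2, 4, 4, 5, 6]
print(right)  # [0, 2, 2, 4, 4, 6]
```

No error is raised in the buggy variant, which matches the "silent correctness errors" wording: the only symptom is that the values land in the wrong positions.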

DenisVieriu97 requested a review from jhavukainen June 11, 2024 06:17

DenisVieriu97 requested review from kulinseth and malfet as code owners June 11, 2024 06:17

pytorch-bot bot added ciflow/mps Run MPS tests (subset of trunk) module: cpu CPU specific problem (e.g., perf, algorithm) release notes: mps Release notes category labels Jun 11, 2024

pytorchbot added the open source label Jun 11, 2024

DenisVieriu97 force-pushed the dev/denis/strided_mps_support branch from e3617e7 to 0064862 Compare June 11, 2024 13:18

colesbury added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jun 12, 2024

DenisVieriu97 force-pushed the dev/denis/strided_mps_support branch from 54c057f to a0a2517 Compare June 27, 2024 20:01

malfet reviewed Jul 9, 2024

aten/src/ATen/mps/MPSDevice.mm (outdated, resolved)

aten/src/ATen/native/mps/operations/RangeFactories.mm (outdated, resolved)

test/test_mps.py (resolved)

DenisVieriu97 force-pushed the dev/denis/strided_mps_support branch from a0a2517 to e7da1c1 Compare July 22, 2024 17:06

skotapati force-pushed the dev/denis/strided_mps_support branch from e7da1c1 to 80733f5 Compare July 23, 2024 18:36

skotapati requested a review from malfet July 23, 2024 18:58

This was referenced Jul 25, 2024

torch.layer_norm() gives wrong results on MPS if applied on a slice of tensor #131750

Closed

MPS gives incorrect result when torch.nn.functional.softplus follows moveaxis #131736

Open

qqaatw mentioned this pull request Jul 26, 2024

MPS and cpu method results are drastically different #131285

Closed

skotapati force-pushed the dev/denis/strided_mps_support branch 2 times, most recently from cc98e6a to beb4421 Compare July 31, 2024 22:35

DenisVieriu97 force-pushed the dev/denis/strided_mps_support branch 3 times, most recently from df6b189 to 1f343ba Compare August 1, 2024 01:11

skotapati requested a review from mruberry as a code owner August 6, 2024 18:02

albanD reviewed Aug 8, 2024

albanD approved these changes Aug 13, 2024

pytorchmergebot removed the merging label Aug 16, 2024

qqaatw mentioned this pull request Aug 17, 2024

[MPS] Gather sliced inputs to batch norm #133610

Closed

malfet added a commit that referenced this pull request Aug 30, 2024

[BE][MPS] Prefer xfail to skip

f8aeb1e

This essentially undoes large skips on everything but MacOS Sequoia to nn.modules made by #128393. Instead it uses existing `xfail`, but guards it on `_macos15_or_newer` boolean.

malfet mentioned this pull request Aug 30, 2024

[BE][MPS] Prefer xfail to skip #134858

Closed

This was referenced Sep 8, 2024

torch.fft.fftfreq behaves unexpectedly when run on MPS backend #135223

Closed

[MPS] Add regression test for fft.fftfreq #135440

Closed

pytorchbot mentioned this pull request Oct 2, 2024

[MPS] Add regression test for fft.fftfreq #137215

Merged

malfet mentioned this pull request Oct 17, 2024

Nightly introduced bug for GGUF in comfy? #137800

Closed

This was referenced Nov 25, 2024

MPS Regression when rendering LTXVideo (after pytorch2.4.1) #141471

Closed

[MPS] Convert channels_last_3d to contiguous for input tensor in nn.Conv3d #141780

Closed

malfet mentioned this pull request Dec 1, 2024

StrideAPI caused regression in channels-last logic #141836

Open

malfet mentioned this pull request Jan 30, 2025

Indexed ^= (XOR in-place) operation doesn't work as expected on MPS backend #145203

Closed

malfet added a commit that referenced this pull request Jan 30, 2025

[MPS] Fix regression in con-contig bitwise ops

ae6c1e5

Caused by #128393, which changed the semantics of `needsGather` and resulted in silent correctness errors on MacOS-15+ if the output tensor is non-contiguous. Fixes #145203

malfet mentioned this pull request Jan 30, 2025

[MPS] Fix regression in con-contig bitwise ops #146085

Closed

malfet mentioned this pull request Feb 20, 2025

clamp_ and clamp behave differently on MPS device. #147510

Open
