
[Inductor] Improve memory locality by iterating over y dimension before x #149339


Closed
wants to merge 9 commits

Conversation

blaine-rister
Contributor

@blaine-rister blaine-rister commented Mar 17, 2025

Feature

Fixes #148718 by reordering the tensor dims to (z, y, x).

As a bonus refactor, block pointers no longer need the reorder=True argument to self.active_range_trees(). Since this argument is no longer used anywhere, this PR simply deletes it rather than updating its logic for the new iteration order.
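
For concreteness, here is a minimal hand-written Triton sketch of a pointwise add using the new y-outer, x-inner order. This is only an illustration of that ordering, not Inductor's actual codegen; the kernel name, launch grid, and block sizes below are my own choices.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel_y_outer(in_ptr0, in_ptr1, out_ptr, ynumel, xnumel,
                       YBLOCK: tl.constexpr, XBLOCK: tl.constexpr):
    # Tile shape is [YBLOCK, XBLOCK]: x lives in the last (fastest) tile dimension,
    # so consecutive lanes touch consecutive addresses of a row-major tensor.
    yindex = tl.program_id(1) * YBLOCK + tl.arange(0, YBLOCK)[:, None]
    xindex = tl.program_id(0) * XBLOCK + tl.arange(0, XBLOCK)[None, :]
    mask = (yindex < ynumel) & (xindex < xnumel)
    offsets = yindex * xnumel + xindex  # row-major: stride_y = xnumel, stride_x = 1
    a = tl.load(in_ptr0 + offsets, mask=mask)
    b = tl.load(in_ptr1 + offsets, mask=mask)
    tl.store(out_ptr + offsets, a + b, mask=mask)


# Example launch on a contiguous (ynumel, xnumel) tensor.
ynumel, xnumel = 1024, 2048
a = torch.randn(ynumel, xnumel, device="cuda")
b = torch.randn_like(a)
out = torch.empty_like(a)
grid = (triton.cdiv(xnumel, 32), triton.cdiv(ynumel, 32))
add_kernel_y_outer[grid](a, b, out, ynumel, xnumel, YBLOCK=32, XBLOCK=32)
```

With the tile transposed to [XBLOCK, YBLOCK], the same offsets formula would make consecutive lanes stride by xnumel instead of 1, which is the pattern this PR moves away from.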

Perf impact

It looks like there's a decent perf bump on A100, with cudagraphs enabled. Granted, perf runs seem to have some noise between commits. (Workflow run.)

Training (all neutral or positive):
image

Inference (one positive, one very small negative):
image

As reported in #148718, this PR makes consecutive threads access consecutive memory addresses. This should theoretically give the GPU more opportunities to coalesce loads and stores. From Nvidia's kernel profiling guide:

Local memory is private storage for an executing thread and is not visible outside of that thread. It is intended for thread-local data like thread stacks and register spills. Local memory addresses are translated to global virtual addresses by the AGU unit. Local memory has the same latency as global memory. One difference between global and local memory is that local memory is arranged such that consecutive 32-bit words are accessed by consecutive thread IDs. Accesses are therefore fully coalesced as long as all threads in a warp access the same relative address (e.g., same index in an array variable, same member in a structure variable, etc.).

I couldn't find any information on how coalescing works for other kinds of memory, but the guide mentions it is also supported for accesses to the L2 cache.

The L2 Request Coalescer (LRC) processes incoming requests for L2 and tries to coalesce read requests before forwarding them to the L2 cache. It also serves programmatic multicast requests from the SM and supports compression for writes.

The answer to this Stack Overflow post also explains coalescing in a straightforward way. Inductor's current iteration order corresponds to the first (uncoalesced) example in that answer, while the order after this PR corresponds to the second (coalesced) example.
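
As a toy illustration of that difference (plain Python of my own, not Inductor output), here is the linear address each of the first eight consecutive threads would touch in a row-major tensor of shape (ynumel, xnumel) = (4, 8) under the two orders:

```python
ynumel, xnumel = 4, 8

def address(t, x_fastest):
    if x_fastest:  # order after this PR: x varies fastest across consecutive threads
        y, x = divmod(t, xnumel)
    else:          # previous order: y varies fastest across consecutive threads
        x, y = divmod(t, ynumel)
    return y * xnumel + x  # row-major linear address

print([address(t, x_fastest=False) for t in range(8)])  # [0, 8, 16, 24, 1, 9, 17, 25] -> strided
print([address(t, x_fastest=True) for t in range(8)])   # [0, 1, 2, 3, 4, 5, 6, 7] -> contiguous, coalescable
```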

Besides GPUs, this order of accessing data is highly advantageous for systems relying on DMAs, as those are designed to access contiguous spans of memory. This change improves the performance of an elementwise add kernel on an internal model, using internal hardware, by 1.76x. I will share the details with reviewers who are Meta employees via a private channel.

Test plan

  • Updated expected code on CI tests.
  • Added a new test checking the {x,y,z}index variables and block pointers on a 3D pointwise kernel (a rough sketch of this kind of check follows below).
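
A rough sketch of this kind of check, as my approximation rather than the test added in this PR: run_and_get_code is an existing Inductor test helper, and the yindex/xindex assertion assumes Inductor picks a 2D tiling for this permuted add, which depends on its heuristics.

```python
import torch
from torch._inductor.utils import run_and_get_code


def add(a, b):
    return a + b


a = torch.randn(8, 16, 32, device="cuda").permute(2, 1, 0)  # permuted view
b = torch.randn(a.shape, device="cuda")                     # contiguous tensor
result, codes = run_and_get_code(torch.compile(add), a, b)

torch.testing.assert_close(result, a + b)             # numerical correctness
assert "yindex" in codes[0] and "xindex" in codes[0]  # tiled index variables present
```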

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov


pytorch-bot bot commented Mar 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149339

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 Cancelled Jobs, 2 Unrelated Failures

As of commit 81b4fca with merge base 08a644a:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@blaine-rister blaine-rister changed the title [Inductor] Make indices contiguous by iterate over y dimension before x [Inductor] Make indices contiguous by iterating over y dimension before x Mar 17, 2025
@blaine-rister blaine-rister changed the title [Inductor] Make indices contiguous by iterating over y dimension before x [Inductor] Improve memory locality by iterating over y dimension before x Mar 17, 2025
@blaine-rister
Contributor Author

blaine-rister commented Mar 18, 2025

@blaine-rister blaine-rister marked this pull request as ready for review March 18, 2025 16:25
@shunting314
Contributor

Hmm, I think even for the current tile dimension [XBLOCK, YBLOCK], the memory access pattern is still coalesced, since Triton should be able to handle it.

Maybe check a bit whether the 7% speedup on TorchBench is real by looking into a few of the models with the most speedup.

Also, I'm wondering if triton-mtia is missing some optimizations, which would explain why we see a large perf improvement by swapping the tile dimension to [YBLOCK, XBLOCK].

@blaine-rister
Contributor Author

blaine-rister commented Mar 18, 2025

Hmm, I think even for the current tile dimension [XBLOCK, YBLOCK], the memory access pattern is still coalesced, since Triton should be able to handle it.

Maybe check a bit whether the 7% speedup on TorchBench is real by looking into a few of the models with the most speedup.

Also, I'm wondering if triton-mtia is missing some optimizations, which would explain why we see a large perf improvement by swapping the tile dimension to [YBLOCK, XBLOCK].

@shunting314 it seems like you're saying that the Triton compiler should be able to correct for less advantageous code from Inductor. Triton-MTIA does something like this locally to each block pointer, but it still requires transposing the data on chip after each load, or before each store. The problem is that if we change the order of one tensor without touching the others, the program would no longer be sound. Hence we end up loading/storing in one order and then swapping the data back on chip.

After this PR, we no longer need to transpose the data on chip because Inductor changes the order of all pointers globally. That feels like the kind of thing that Inductor is better suited to do than Triton.

Regardless of how much we could optimize the Triton compiler, I still think Inductor should provide the best code it can, all else (like the complexity of the solution) being equal. In this case, I don't think the fix makes Inductor any more complicated, so it seems like a win. Are there other factors to consider?

@shunting314
Contributor

it seems like you're saying that the Triton compiler should be able to correct for less advantageous code from Inductor.

I feel that we should not call the [XBLOCK, YBLOCK] tile less advantageous if Triton can still generate coalesced memory accesses. One way to verify would be to create a tiled pointwise kernel with relatively large input tensors and check the perf difference between the tile dimensions [XBLOCK, YBLOCK] and [YBLOCK, XBLOCK].

It's interesting to see that Triton-MTIA needs to transpose the data on chip. That may explain the perf difference for MTIA.

I think making the change in Inductor is fine as long as it does not slow down the GPU.

@blaine-rister
Contributor Author

it seems like you're saying that the Triton compiler should be able to correct for less advantageous code from Inductor.

I feel that we should not call the [XBLOCK, YBLOCK] tile less advantageous if Triton can still generate coalesced memory accesses. One way to verify would be to create a tiled pointwise kernel with relatively large input tensors and check the perf difference between the tile dimensions [XBLOCK, YBLOCK] and [YBLOCK, XBLOCK].

It's interesting to see that Triton-MTIA needs to transpose the data on chip. That may explain the perf difference for MTIA.

I think making the change in Inductor is fine as long as it does not slow down the GPU.

Sounds good. I'll try profiling a microbenchmark and see if there's a difference. It's possible that the GPU is smart or flexible enough to make [XBLOCK, YBLOCK] work well.

@blaine-rister
Contributor Author

blaine-rister commented Mar 19, 2025

Following the suggestions above, I did some microbenchmarks on an elementwise addition kernel. I tested on the following shapes and strides:

  • shape (2^15, 2^14), stride (2^15, 1)
  • shape (16, 16), stride (32, 1)
  • shape (1024, 16), stride (18, 1)
  • shape (64, 16), stride (288, 1)

I also disabled autotuning, so we would always have block sizes of (32, 32). This prevented the autotuner from selecting YBLOCK=1, in which case x first and y first are equivalent.
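
For reference, here is a sketch of the kind of setup used, shown for the shape (1024, 16) with stride (18, 1). It is not the exact script: the config tweaks that disable autotuning and pin the 32x32 block size are omitted, and it times with CUDA events as a stand-in for the Nsight profiling described below.

```python
import torch


def add(a, b):
    return a + b


compiled_add = torch.compile(add)

# Strided view with shape (1024, 16) and stride (18, 1).
base = torch.randn(1024 * 18, device="cuda")
a = base.as_strided((1024, 16), (18, 1))
b = torch.randn_like(a)

for _ in range(10):  # warm-up (includes compilation)
    compiled_add(a, b)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    compiled_add(a, b)
end.record()
torch.cuda.synchronize()
print(f"avg latency: {start.elapsed_time(end) / 100:.4f} ms")
```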

Here are the results from running Nvidia Nsight on an A100 GPU, with the default settings, with the latency averaged over ~100 iterations:
image

(For the largest size, Nsight didn't have enough precision to measure the difference in ms, so I calculated the speedup using cycles instead.)

On all the test cases, y first was slightly faster than x first. The difference was less noticeable for very large shapes. It was more pronounced for smaller shapes, and for views where the stride between rows was very large.

Since x first was not faster for any of these shapes, the change in this PR seems safe.

See the raw data and test script here.

@blaine-rister
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 20, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 2 jobs have failed, first few of them are: inductor-A100-perf-nightly / get-label-type / runner-determinator, inductor-A100-perf-nightly / cuda12.6-py3.10-gcc9-sm80

Details for Dev Infra team. Raised by workflow job.

@blaine-rister
Contributor Author

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 4 checks: pull / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, linux.2xlarge), pull / cuda12.4-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu), inductor-A100-perf-nightly / get-label-type / runner-determinator, inductor-A100-perf-nightly / cuda12.6-py3.10-gcc9-sm80

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

@blaine-rister
Contributor Author

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 4 checks: pull / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, linux.2xlarge), pull / cuda12.4-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu), inductor-A100-perf-nightly / get-label-type / runner-determinator, inductor-A100-perf-nightly / cuda12.6-py3.10-gcc9-sm80

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025

Pull Request resolved: pytorch#149339
Approved by: https://github.com/jansel
@github-actions github-actions bot deleted the brister/contiguous_index branch April 22, 2025 02:14
Successfully merging this pull request may close these issues.

[Inductor] Permuted memory access pattern for tiled pointwise kernels