
[Inductor] Improve memory locality by iterating over y dimension before x #149339


Closed
wants to merge 9 commits

Conversation

blaine-rister
Contributor

@blaine-rister blaine-rister commented Mar 17, 2025

Feature

Fixes #148718 by reordering the tensor dims to (z, y, x).

As a bonus refactor, block pointers no longer need the reorder=True argument to self.active_range_trees(). Since this argument is no longer used anywhere, this PR simply deletes it rather than updating its logic for the new iteration order.
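
For concreteness, here is a minimal hand-written Triton sketch of a pointwise add using the new y-outer, x-inner order. This is only an illustration of that ordering, not Inductor's actual codegen; the kernel name, launch grid, and block sizes below are my own choices.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel_y_outer(in_ptr0, in_ptr1, out_ptr, ynumel, xnumel,
                       YBLOCK: tl.constexpr, XBLOCK: tl.constexpr):
    # Tile shape is [YBLOCK, XBLOCK]: x lives in the last (fastest) tile dimension,
    # so consecutive lanes touch consecutive addresses of a row-major tensor.
    yindex = tl.program_id(1) * YBLOCK + tl.arange(0, YBLOCK)[:, None]
    xindex = tl.program_id(0) * XBLOCK + tl.arange(0, XBLOCK)[None, :]
    mask = (yindex < ynumel) & (xindex < xnumel)
    offsets = yindex * xnumel + xindex  # row-major: stride_y = xnumel, stride_x = 1
    a = tl.load(in_ptr0 + offsets, mask=mask)
    b = tl.load(in_ptr1 + offsets, mask=mask)
    tl.store(out_ptr + offsets, a + b, mask=mask)


# Example launch on a contiguous (ynumel, xnumel) tensor.
ynumel, xnumel = 1024, 2048
a = torch.randn(ynumel, xnumel, device="cuda")
b = torch.randn_like(a)
out = torch.empty_like(a)
grid = (triton.cdiv(xnumel, 32), triton.cdiv(ynumel, 32))
add_kernel_y_outer[grid](a, b, out, ynumel, xnumel, YBLOCK=32, XBLOCK=32)
```

With the tile transposed to [XBLOCK, YBLOCK], the same offsets formula would make consecutive lanes stride by xnumel instead of 1, which is the pattern this PR moves away from.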

Perf impact

It looks like there's a decent perf bump on A100, with cudagraphs enabled. Granted, perf runs seem to have some noise between commits. (Workflow run.)

Training (all neutral or positive):
image

Inference (one positive, one very small negative):
image

As reported in #148718, this PR makes consecutive threads access consecutive memory addresses. This should theoretically give the GPU more opportunities to coalesce loads and stores. From Nvidia's kernel profiling guide:

Local memory is private storage for an executing thread and is not visible outside of that thread. It is intended for thread-local data like thread stacks and register spills. Local memory addresses are translated to global virtual addresses by the AGU unit. Local memory has the same latency as global memory. One difference between global and local memory is that local memory is arranged such that consecutive 32-bit words are accessed by consecutive thread IDs. Accesses are therefore fully coalesced as long as all threads in a warp access the same relative address (e.g., same index in an array variable, same member in a structure variable, etc.).

I couldn't find any information on how coalescing works for other kinds of memory, but the guide mentions it is also supported for accesses to the L2 cache.

The L2 Request Coalescer (LRC) processes incoming requests for L2 and tries to coalesce read requests before forwarding them to the L2 cache. It also serves programmatic multicast requests from the SM and supports compression for writes.

The answer to this Stack Overflow post also explains coalescing in a straightforward way. Inductor's current iteration order corresponds to the first (uncoalesced) example in that answer, while the order after this PR corresponds to the second (coalesced) example.
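
As a toy illustration of that difference (plain Python of my own, not Inductor output), here is the linear address each of the first eight consecutive threads would touch in a row-major tensor of shape (ynumel, xnumel) = (4, 8) under the two orders:

```python
ynumel, xnumel = 4, 8

def address(t, x_fastest):
    if x_fastest:  # order after this PR: x varies fastest across consecutive threads
        y, x = divmod(t, xnumel)
    else:          # previous order: y varies fastest across consecutive threads
        x, y = divmod(t, ynumel)
    return y * xnumel + x  # row-major linear address

print([address(t, x_fastest=False) for t in range(8)])  # [0, 8, 16, 24, 1, 9, 17, 25] -> strided
print([address(t, x_fastest=True) for t in range(8)])   # [0, 1, 2, 3, 4, 5, 6, 7] -> contiguous, coalescable
```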

Besides GPUs, this order of accessing data is highly advantageous for systems relying on DMAs, as those are designed to access contiguous spans of memory. This change improves the performance of an elementwise add kernel on an internal model, using internal hardware, by 1.76x. I will share the details with reviewers who are Meta employees via a private channel.

Test plan

  • Updated expected code on CI tests.
  • Added a new test checking the {x,y,z}index variables and block pointers on a 3D pointwise kernel (a rough sketch of this kind of check follows below).
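
A rough sketch of this kind of check, as my approximation rather than the test added in this PR: run_and_get_code is an existing Inductor test helper, and the yindex/xindex assertion assumes Inductor picks a 2D tiling for this permuted add, which depends on its heuristics.

```python
import torch
from torch._inductor.utils import run_and_get_code


def add(a, b):
    return a + b


a = torch.randn(8, 16, 32, device="cuda").permute(2, 1, 0)  # permuted view
b = torch.randn(a.shape, device="cuda")                     # contiguous tensor
result, codes = run_and_get_code(torch.compile(add), a, b)

torch.testing.assert_close(result, a + b)             # numerical correctness
assert "yindex" in codes[0] and "xindex" in codes[0]  # tiled index variables present
```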

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov


pytorch-bot bot commented Mar 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149339

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 Cancelled Jobs, 2 Unrelated Failures

As of commit 81b4fca with merge base 08a644a:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@blaine-rister blaine-rister changed the title [Inductor] Make indices contiguous by iterate over y dimension before x [Inductor] Make indices contiguous by iterating over y dimension before x Mar 17, 2025
@blaine-rister blaine-rister changed the title [Inductor] Make indices contiguous by iterating over y dimension before x [Inductor] Improve memory locality by iterating over y dimension before x Mar 17, 2025
@blaine-rister
Contributor Author

blaine-rister commented Mar 18, 2025

@blaine-rister blaine-rister marked this pull request as ready for review March 18, 2025 16:25
@shunting314
Contributor

Hmm, I think even for the current tile dimension [XBLOCK, YBLOCK], the memory access pattern is still coalesced, since Triton should be able to handle it.

Maybe check a bit whether the 7% speedup on TorchBench is real by looking into a few of the models with the most speedup.

Also, I'm wondering if triton-mtia is missing some optimizations, which would explain why we see a large perf improvement by swapping the tile dimension to [YBLOCK, XBLOCK].

@blaine-rister
Contributor Author

blaine-rister commented Mar 18, 2025

Hmm, I think even for the current tile dimension [XBLOCK, YBLOCK], the memory access pattern is still coalesced, since Triton should be able to handle it.

Maybe check a bit whether the 7% speedup on TorchBench is real by looking into a few of the models with the most speedup.

Also, I'm wondering if triton-mtia is missing some optimizations, which would explain why we see a large perf improvement by swapping the tile dimension to [YBLOCK, XBLOCK].

@shunting314 it seems like you're saying that the Triton compiler should be able to correct for less advantageous code from Inductor. Triton-MTIA does something like this locally to each block pointer, but it still requires transposing the data on chip after each load, or before each store. The problem is that if we change the order of one tensor without touching the others, the program would no longer be sound. Hence we end up loading/storing in one order and then swapping the data back on chip.

After this PR, we no longer need to transpose the data on chip because Inductor changes the order of all pointers globally. That feels like the kind of thing that Inductor is better suited to do than Triton.

Regardless of how much we could optimize the Triton compiler, I still think Inductor should provide the best code it can, all else (like the complexity of the solution) being equal. In this case, I don't think the fix makes Inductor any more complicated, so it seems like a win. Are there other factors to consider?

@shunting314
Contributor

it seems like you're saying that the Triton compiler should be able to correct for less advantageous code from Inductor.

I feel that we should not call the [XBLOCK, YBLOCK] tile less advantageous if Triton can still generate coalesced memory accesses. One way to verify would be to create a tiled pointwise kernel with relatively large input tensors and check the perf difference between the tile dimensions [XBLOCK, YBLOCK] and [YBLOCK, XBLOCK].

It's interesting to see that Triton-MTIA needs to transpose the data on chip. That may explain the perf difference for MTIA.

I think making the change in Inductor is fine as long as it does not slow down the GPU.

@blaine-rister
Contributor Author

it seems like you're saying that the Triton compiler should be able to correct for less advantageous code from Inductor.

I feel that we should not call the [XBLOCK, YBLOCK] tile less advantageous if Triton can still generate coalesced memory accesses. One way to verify would be to create a tiled pointwise kernel with relatively large input tensors and check the perf difference between the tile dimensions [XBLOCK, YBLOCK] and [YBLOCK, XBLOCK].

It's interesting to see that Triton-MTIA needs to transpose the data on chip. That may explain the perf difference for MTIA.

I think making the change in Inductor is fine as long as it does not slow down the GPU.

Sounds good. I'll try profiling a microbenchmark and see if there's a difference. It's possible that the GPU is smart or flexible enough to make [XBLOCK, YBLOCK] work well.

@blaine-rister
Contributor Author

blaine-rister commented Mar 19, 2025

Following the suggestions above, I did some microbenchmarks on an elementwise addition kernel. I tested on the following shapes and strides:

  • shape (2^15, 2^14), stride (2^15, 1)
  • shape (16, 16), stride (32, 1)
  • shape (1024, 16), stride (18, 1)
  • shape (64, 16), stride (288, 1)

I also disabled autotuning, so we would always have block sizes of (32, 32). This prevented the autotuner from selecting YBLOCK=1, in which case x first and y first are equivalent.
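
For reference, here is a sketch of the kind of setup used, shown for the shape (1024, 16) with stride (18, 1). It is not the exact script: the config tweaks that disable autotuning and pin the 32x32 block size are omitted, and it times with CUDA events as a stand-in for the Nsight profiling described below.

```python
import torch


def add(a, b):
    return a + b


compiled_add = torch.compile(add)

# Strided view with shape (1024, 16) and stride (18, 1).
base = torch.randn(1024 * 18, device="cuda")
a = base.as_strided((1024, 16), (18, 1))
b = torch.randn_like(a)

for _ in range(10):  # warm-up (includes compilation)
    compiled_add(a, b)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    compiled_add(a, b)
end.record()
torch.cuda.synchronize()
print(f"avg latency: {start.elapsed_time(end) / 100:.4f} ms")
```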

Here are the results from running Nvidia Nsight on an A100 GPU, with the default settings, with the latency averaged over ~100 iterations:
image

(For the largest size, Nsight didn't have enough precision to measure the difference in ms, so I calculated the speedup using cycles instead.)

On all the test cases, y first was slightly faster than x first. The difference was less noticeable for very large shapes. It was more pronounced for smaller shapes, and for views where the stride between rows was very large.

Since x first was not faster for any of these shapes, the change in this PR seems safe.

See the raw data and test script here.

@blaine-rister
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 20, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 2 jobs have failed, first few of them are: inductor-A100-perf-nightly / get-label-type / runner-determinator, inductor-A100-perf-nightly / cuda12.6-py3.10-gcc9-sm80

Details for Dev Infra team. Raised by workflow job.

@blaine-rister
Contributor Author

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 4 checks: pull / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, linux.2xlarge), pull / cuda12.4-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu), inductor-A100-perf-nightly / get-label-type / runner-determinator, inductor-A100-perf-nightly / cuda12.6-py3.10-gcc9-sm80

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

@blaine-rister
Contributor Author

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 4 checks: pull / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, linux.2xlarge), pull / cuda12.4-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu), inductor-A100-perf-nightly / get-label-type / runner-determinator, inductor-A100-perf-nightly / cuda12.6-py3.10-gcc9-sm80

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025

Pull Request resolved: pytorch#149339
Approved by: https://github.com/jansel
@github-actions github-actions bot deleted the brister/contiguous_index branch April 22, 2025 02:14
Successfully merging this pull request may close these issues.

[Inductor] Permuted memory access pattern for tiled pointwise kernels