[Inductor] Improve memory locality by iterating over y dimension before x #149339
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149339
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 Cancelled Jobs, 2 Unrelated Failures as of commit 81b4fca with merge base 08a644a.
CANCELLED JOBS - The following jobs were cancelled. Please retry.
BROKEN TRUNK - The following jobs failed but were already failing on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Hmm, I think that even for the current tile dimension [XBLOCK, YBLOCK], the memory access pattern is still coalesced, since Triton should be able to handle it. It may be worth checking whether the 7% speedup on TorchBench is real by looking into a few of the models with the largest speedup. I'm also wondering whether triton-mtia is missing some optimizations, which would explain why we see a large perf improvement from swapping the tile dimension to [YBLOCK, XBLOCK].
@shunting314 it seems like you're saying that the Triton compiler should be able to correct for less advantageous code from Inductor. Triton-MTIA does something like this locally to each block pointer, but it still requires transposing the data on chip after each load, or before each store. The problem is that if we changed the order of one tensor without touching the others, the program would no longer be sound. Hence we end up loading/storing in one order and then swapping the data back on chip. After this PR, we no longer need to transpose the data on chip, because Inductor changes the order of all pointers globally. That feels like the kind of thing that Inductor is better suited to do than Triton. Regardless of how much we could optimize the Triton compiler, I still think Inductor should provide the best code it can, all else (like the complexity of the solution) being equal. In this case, I don't think the fix makes Inductor any more complicated, so it seems like a win. Are there other factors to consider?
I feel that we should not call the [XBLOCK, YBLOCK] tile less advantageous if Triton can still generate coalesced memory accesses. One way to verify would be to create a tiled pointwise kernel with relatively large input tensors and check the perf difference between tile dimensions [XBLOCK, YBLOCK] and [YBLOCK, XBLOCK]. It's interesting to see that Triton-MTIA needs to transpose the data on chip; that may explain the perf difference for MTIA. Making the change in Inductor looks fine to me as long as it does not slow down GPU.
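To make the comparison concrete, here is a hand-written Triton sketch of the two tile orders being discussed. This is an illustration only, not Inductor's actual generated code; the kernel name, parameter names, and the assumption of a row-major `[ynumel, xnumel]` tensor are made up for the example.

```python
import triton
import triton.language as tl

@triton.jit
def add_kernel_yx(in_ptr0, in_ptr1, out_ptr, ynumel, xnumel,
                  YBLOCK: tl.constexpr, XBLOCK: tl.constexpr):
    # Tile shape [YBLOCK, XBLOCK]: the last (fastest-varying) tile dimension
    # walks x, the stride-1 dimension of a row-major [ynumel, xnumel] tensor.
    yoffset = tl.program_id(1) * YBLOCK
    xoffset = tl.program_id(0) * XBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[:, None]
    xindex = xoffset + tl.arange(0, XBLOCK)[None, :]
    mask = (yindex < ynumel) & (xindex < xnumel)
    offsets = yindex * xnumel + xindex  # consecutive x -> consecutive addresses
    a = tl.load(in_ptr0 + offsets, mask=mask)
    b = tl.load(in_ptr1 + offsets, mask=mask)
    tl.store(out_ptr + offsets, a + b, mask=mask)
    # The [XBLOCK, YBLOCK] order is effectively the transpose of the above:
    #     xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    #     yindex = yoffset + tl.arange(0, YBLOCK)[None, :]
    # There the fastest-varying tile dimension steps by xnumel elements, so the
    # backend must either still coalesce the accesses itself (the GPU case
    # discussed above) or transpose the tile on chip (the Triton-MTIA case).

# Example launch for a [ynumel, xnumel] problem (block sizes are arbitrary):
#     grid = (triton.cdiv(xnumel, XBLOCK), triton.cdiv(ynumel, YBLOCK))
#     add_kernel_yx[grid](a, b, out, ynumel, xnumel, YBLOCK=32, XBLOCK=32)
```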
Sounds good. I'll try profiling a microbenchmark and see if there's a difference. It's possible that the GPU is smart or flexible enough to make [XBLOCK, YBLOCK] work well.
Following the suggestions above, I did some microbenchmarks on an elementwise addition kernel. I tested a range of shapes and strides, including views where the stride between rows is very large. I also disabled autotuning, so the block sizes were fixed across runs.
I profiled with Nvidia Nsight on an A100 GPU, with the default settings, averaging the latency over ~100 iterations. (For the largest size, Nsight didn't have enough precision to measure the difference in ms, so I calculated the speedup from cycles instead.)
On all the test cases, y first was slightly faster than x first. The difference was less noticeable for very large shapes, and more pronounced for smaller shapes and for views where the stride between rows was very large. Since x first was not faster for any of these shapes, the change in this PR seems safe.
See the raw data and test script here.
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed
Reason: 2 jobs have failed, first few of them are: inductor-A100-perf-nightly / get-label-type / runner-determinator, inductor-A100-perf-nightly / cuda12.6-py3.10-gcc9-sm80
Details for Dev Infra team: raised by workflow job.
@pytorchbot merge -i
Merge started
Your change will be merged while ignoring the following 4 checks: pull / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, linux.2xlarge), pull / cuda12.4-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu), inductor-A100-perf-nightly / get-label-type / runner-determinator, inductor-A100-perf-nightly / cuda12.6-py3.10-gcc9-sm80. Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
@pytorchbot merge -i
Merge started
Your change will be merged while ignoring the following 4 checks: pull / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, linux.2xlarge), pull / cuda12.4-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu), inductor-A100-perf-nightly / get-label-type / runner-determinator, inductor-A100-perf-nightly / cuda12.6-py3.10-gcc9-sm80. Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Feature
Fixes #148718 by reordering the tensor dims to `(z, y, x)`.
As a bonus refactor, block pointers no longer need the `reorder=True` argument to `self.active_range_trees()`. Since this argument is no longer used anywhere, this PR simply deletes it rather than updating its logic for the new iteration order.
Perf impact
It looks like there's a decent perf bump on A100, with cudagraphs enabled. Granted, perf runs seem to have some noise between commits. (Workflow run.)
Training (all neutral or positive):

Inference (one positive, one very small negative):

As reported in #148718, this PR makes consecutive threads access consecutive memory addresses. This should theoretically give the GPU more opportunities to coalesce loads and stores. From Nvidia's kernel profiling guide:

> Local memory is private storage for an executing thread and is not visible outside of that thread. It is intended for thread-local data like thread stacks and register spills. Local memory addresses are translated to global virtual addresses by the AGU unit. Local memory has the same latency as global memory. One difference between global and local memory is that local memory is arranged such that consecutive 32-bit words are accessed by consecutive thread IDs. Accesses are therefore fully coalesced as long as all threads in a warp access the same relative address (e.g., same index in an array variable, same member in a structure variable, etc.).

I couldn't find any information on how coalescing works for other kinds of memory, but the guide mentions it is also supported for accesses to the L2 cache.

> The L2 Request Coalescer (LRC) processes incoming requests for L2 and tries to coalesce read requests before forwarding them to the L2 cache. It also serves programmatic multicast requests from the SM and supports compression for writes.
The answer to this Stack Overflow post also explains coalescing in a straightforward way. Inductor's current iteration order corresponds to the first (uncoalesced) example in that answer, while the order after this PR corresponds to the second (coalesced) example.
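To make that comparison concrete, here is a toy, CPU-side sketch of which linear offsets a handful of adjacent lanes touch under each iteration order, assuming a row-major `[ynumel, xnumel]` layout (row stride `xnumel`, column stride 1); the sizes are made up.

```python
xnumel, ynumel = 8, 4  # toy sizes for illustration

# x varies fastest across adjacent lanes (the order after this PR):
x_fastest = [(lane // xnumel) * xnumel + (lane % xnumel) for lane in range(8)]

# y varies fastest across adjacent lanes (the previous order):
y_fastest = [(lane % ynumel) * xnumel + (lane // ynumel) for lane in range(8)]

print(x_fastest)  # [0, 1, 2, 3, 4, 5, 6, 7]      -> contiguous, easy to coalesce
print(y_fastest)  # [0, 8, 16, 24, 1, 9, 17, 25]  -> strided by a full row
```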
Besides GPUs, this order of accessing data is highly advantageous for systems relying on DMAs, as those are designed to access contiguous spans of memory. This change improves the performance of an elementwise add kernel on an internal model, using internal hardware, by 1.76x. I will share the details with reviewers who are Meta employees via a private channel.
Test plan
- Updated expected code on CI tests.
- Added a new test checking the {x,y,z} indices and block pointers on a 3D pointwise kernel. (A rough sketch of how generated code can be inspected follows below.)
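As a rough illustration of that kind of check (this is not the actual test added in the PR): `run_and_get_code` and `torch._inductor.config.triton.max_tiles` are existing Inductor utilities, but the input shapes here are made up, and whether a given input really produces a 3D tiled kernel depends on Inductor's tiling heuristics.

```python
import torch
import torch._inductor.config as inductor_config
from torch._inductor.utils import run_and_get_code

inductor_config.triton.max_tiles = 3  # allow a third (z) tile dimension

def f(a, b):
    return a + b

# Same shape, mismatched strides, so the pointwise kernel cannot be flattened to 1D.
a = torch.randn(16, 32, 64, device="cuda")
b = torch.randn(64, 16, 32, device="cuda").permute(1, 2, 0)

_, code = run_and_get_code(torch.compile(f), a, b)
print(code[0])  # inspect the emitted yindex/xindex order and block pointers by hand
```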
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov