[inductor] comprehensive padding #120758

shunting314 · 2024-02-28T01:00:19Z

Stack from ghstack (oldest at bottom):

-> [inductor] comprehensive padding #120758

This PR adds the ability to pad tensor strides during lowering. The goal is to give tensors with badly aligned shapes aligned strides, where possible, so that the GPU can access memory more efficiently.

When testing BlenderbotSmallForConditionalGeneration, I already see a 2.5ms speedup.
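
To make the idea of stride padding concrete, here is a minimal sketch in plain Python. It is not the PR's implementation: the function names and the 64-element alignment are assumptions chosen for illustration, and the real lowering logic in inductor may apply different alignments and thresholds.

```python
from typing import List, Sequence


def contiguous_strides(shape: Sequence[int]) -> List[int]:
    """Row-major strides (in elements) of a dense tensor with this shape."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides


def round_up(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple of `multiple`."""
    return (value + multiple - 1) // multiple * multiple


def padded_strides(shape: Sequence[int], align_elems: int = 64) -> List[int]:
    """Hypothetical helper: keep the innermost stride at 1 and round every
    outer stride up to a multiple of align_elems (an assumed alignment)."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = round_up(strides[i + 1] * shape[i + 1], align_elems)
    return strides


shape = (8, 1023)
print(contiguous_strides(shape))  # [1023, 1]
print(padded_strides(shape))      # [1024, 1]
```

With the padded layout, each row of the (8, 1023) tensor starts at a multiple of 64 elements, at the cost of one unused element per row.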

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang @jansel @Chillee @eellison

[ghstack-poisoned]

pytorch-bot · 2024-02-28T01:00:23Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/120758

✅ No Failures

As of commit cf4bccf with merge base 6ac8fe4 ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

This is not fully ready for review yet. I'm sending it out in case people want to give early feedback, and so that I can run the CI perf tests earlier. When testing BlenderbotSmallForConditionalGeneration, I already see a 2.5ms speedup. I feel the solution can be further improved to get more speedup. cc jansel Chillee eellison cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames [ghstack-poisoned]

ghstack-source-id: e077c0d Pull Request resolved: #120758

ghstack-source-id: a330c68 Pull Request resolved: #120758

eellison

Cool!

We might need some logic around not increasing memory too much. Not sure if this would actually be an issue in practice.
I think we still want to preserve strides for user-visible outputs.
We should make sure that we're still appropriately passing dense tensors to extern ops that require them. I think this will just work, but it's worth double-checking.
Not sure if there is an AOT Autograd related issue with changing strides. I don't think there is, but I remember it has special logic to deal with changing strides from inductor.
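
The memory and dense-tensor concerns above can be illustrated with a small, self-contained sketch. Everything here is hypothetical: the helper names and the 1.1x overhead limit are assumptions for illustration only, not checks taken from the PR.

```python
from math import prod
from typing import Sequence


def storage_elems(shape: Sequence[int], strides: Sequence[int]) -> int:
    """Number of elements the underlying buffer must hold for this layout."""
    if any(dim == 0 for dim in shape):
        return 0
    return 1 + sum((dim - 1) * stride for dim, stride in zip(shape, strides))


def is_contiguous(shape: Sequence[int], strides: Sequence[int]) -> bool:
    """True if the strides describe a dense row-major layout."""
    expected = 1
    for dim, stride in zip(reversed(shape), reversed(strides)):
        if dim != 1 and stride != expected:
            return False
        expected *= dim
    return True


def padding_overhead(shape: Sequence[int], padded: Sequence[int]) -> float:
    """Ratio of padded buffer size to the dense buffer size."""
    numel = prod(shape)
    if numel == 0:
        return 1.0
    return storage_elems(shape, padded) / numel


def accept_padding(shape, padded, max_overhead: float = 1.1) -> bool:
    """Hypothetical guard: reject padding that grows memory too much."""
    return padding_overhead(shape, padded) <= max_overhead


# Padding a long row by one element is cheap.
print(accept_padding((4, 1023), (1024, 1)))  # True
# Padding a 3-element row out to 64 elements is not.
print(accept_padding((1024, 3), (64, 1)))    # False
# A padded layout is no longer dense, which matters for extern ops.
print(is_contiguous((4, 1023), (1024, 1)))   # False
```

A guard of this kind would address the first review point, while a contiguity check of this kind is one way to decide whether a tensor needs to be made dense before it is handed to an extern op.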

ghstack-source-id: 2d04b61 Pull Request resolved: #120758

ghstack-source-id: 89b08d5 Pull Request resolved: #120758

ghstack-source-id: 689df19 Pull Request resolved: #120758

ghstack-source-id: ef7c77a Pull Request resolved: #120758

ghstack-source-id: 0632db8 Pull Request resolved: #120758

ghstack-source-id: 7d0d5b0 Pull Request resolved: #120758

ghstack-source-id: 6a9931e Pull Request resolved: #120758

shunting314 · 2024-04-12T01:06:04Z

All perf regressions have been fixed now. It turns out that the large regression in TB under the default setting was due to using a new triton pin. It seems that triton itself regressed; reverting the pin change fixes it. But this also means we are not using Hongtao's fix here: triton-lang/triton#3497 (since I need to use a relatively new triton to apply that PR). I'll try to dig more.

Hongtao's fix may take a while to get in (it is still under review), and our triton pin also needs some time to be updated (even if we had all the fixes in right now, we might still need to run Meta-internal tests, which takes time). So I'll see if I can apply Hongtao's fix to the triton pin that inductor currently uses and measure how it works together with comprehensive padding.

eellison · 2024-04-12T19:06:21Z

As a follow-up, it would be great to see if we could use this to replace some of the current pad_mm logic.

shunting314 · 2024-04-12T19:49:47Z

> As a follow up would be great to see if we could use this to replace some of the current pad_mm logic.

I think it depends on a few things:

We cannot pad weights, since they have FixedLayout.
For activations, I think this PR can help pad the strides. But we'll need to benchmark whether a padded tensor has matmul performance as good as a tensor with an aligned shape.

shunting314 · 2024-04-12T21:31:26Z

cc @jansel to take another look

ghstack-source-id: 8f684e6 Pull Request resolved: #120758

shunting314 · 2024-04-15T19:03:57Z

@pytorchbot merge

pytorchmergebot · 2024-04-15T19:05:39Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

This PR adds the ability to pad tensor strides during lowering. The goal is to make sure (if possible) tensors with bad shape can have aligned strides so GPU can access the memory more efficiently. By testing BlenderbotSmallForConditionalGeneration I already see 2.5ms speedup. Pull Request resolved: pytorch#120758 Approved by: https://github.com/jansel

[WIP][inductor] padding

d54f7bf

[ghstack-poisoned]

github-actions bot added module: inductor ciflow/inductor labels Feb 28, 2024

shunting314 added a commit that referenced this pull request Feb 28, 2024

[WIP][inductor] padding

a47d8be

ghstack-source-id: e077c0d Pull Request resolved: #120758

shunting314 mentioned this pull request Feb 28, 2024

[TESTING] enable comprehensive padding #120760

Closed

shunting314 added a commit that referenced this pull request Feb 28, 2024

[WIP][inductor] padding

cbc9281

ghstack-source-id: a330c68 Pull Request resolved: #120758

eellison reviewed Feb 28, 2024

View reviewed changes

shunting314 mentioned this pull request Feb 29, 2024

[inductor][eazy] fix a typo in test #120832

Closed

shunting314 added a commit that referenced this pull request Feb 29, 2024

[WIP][inductor] padding

e6cdb1b

ghstack-source-id: 2d04b61 Pull Request resolved: #120758

shunting314 added a commit that referenced this pull request Mar 5, 2024

[WIP][inductor] padding

2073960

ghstack-source-id: 89b08d5 Pull Request resolved: #120758

shunting314 added a commit that referenced this pull request Mar 6, 2024

[WIP][inductor] padding

54bf5b2

ghstack-source-id: 689df19 Pull Request resolved: #120758

shunting314 added a commit that referenced this pull request Mar 8, 2024

[WIP][inductor] padding

e27d88e

ghstack-source-id: ef7c77a Pull Request resolved: #120758

shunting314 added a commit that referenced this pull request Mar 12, 2024

[WIP][inductor] padding

cfdca19

ghstack-source-id: 0632db8 Pull Request resolved: #120758

shunting314 added a commit that referenced this pull request Mar 19, 2024

[WIP][inductor] padding

99b0929

ghstack-source-id: 7d0d5b0 Pull Request resolved: #120758

shunting314 added the topic: not user facing topic category label Apr 12, 2024

shunting314 requested review from eellison and jansel April 12, 2024 00:03

shunting314 added a commit that referenced this pull request Apr 12, 2024

[inductor] comprehensive padding

e5f2391

ghstack-source-id: 6a9931e Pull Request resolved: #120758

shunting314 mentioned this pull request Apr 12, 2024

[debug] apply hongtao's fix on the old pin #123895

Closed

jansel approved these changes Apr 12, 2024

View reviewed changes

shunting314 added a commit that referenced this pull request Apr 15, 2024

[inductor] comprehensive padding

f3ef5e0

ghstack-source-id: 8f684e6 Pull Request resolved: #120758

pytorchmergebot added the merging label Apr 15, 2024

pytorchmergebot closed this in fb6f627 Apr 15, 2024

pytorchmergebot added Merged and removed merging labels Apr 15, 2024

github-actions bot deleted the gh/shunting314/103/head branch May 16, 2024 01:53
