[inductor] comprehensive padding #120758

Closed
wants to merge 37 commits into from

Conversation

shunting314
Contributor

@shunting314 shunting314 commented Feb 28, 2024

Stack from ghstack (oldest at bottom):

This PR adds the ability to pad tensor strides during lowering. The goal is to ensure, where possible, that tensors with poorly aligned shapes get aligned strides so the GPU can access memory more efficiently.

Testing on BlenderbotSmallForConditionalGeneration already shows a 2.5 ms speedup.
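
To illustrate what padding strides means here, a minimal pure-Python sketch follows. It is not Inductor's implementation: `contiguous_strides`, `pad_strides`, and the `align` parameter are hypothetical names chosen for illustration, and strides are counted in elements rather than bytes.

```python
def contiguous_strides(shape):
    """Row-major strides, in elements, for a dense tensor of this shape."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides


def pad_strides(shape, align):
    """Round every non-innermost stride up to a multiple of `align` elements.

    Hypothetical helper; `align` is an assumed parameter, not a real config.
    """
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        dense = strides[i + 1] * shape[i + 1]
        strides[i] = -(-dense // align) * align  # ceiling to a multiple of align
    return strides


shape = (8, 1023)
print(contiguous_strides(shape))  # [1023, 1]
print(pad_strides(shape, 8))      # [1024, 1]
```

The logical shape stays `(8, 1023)`; only the distance between consecutive rows changes, so each row starts at an aligned offset.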

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang @jansel @Chillee @eellison

[ghstack-poisoned]

pytorch-bot bot commented Feb 28, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/120758

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit cf4bccf with merge base 6ac8fe4:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

This is not fully ready for review yet. I'm sending it out in case people want to give early feedback; it also lets me run the CI perf tests earlier.

Testing on BlenderbotSmallForConditionalGeneration already shows a 2.5 ms speedup. I think the solution can be improved further to get more speedup.

cc jansel Chillee eellison 


cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames

[ghstack-poisoned]
shunting314 added a commit that referenced this pull request Feb 28, 2024
ghstack-source-id: e077c0d
Pull Request resolved: #120758
shunting314 added a commit that referenced this pull request Feb 28, 2024
ghstack-source-id: a330c68
Pull Request resolved: #120758
Contributor

@eellison eellison left a comment

Cool!

  • We might need some logic to avoid increasing memory usage too much. Not sure whether this would actually be an issue in practice.
  • I think we still want to preserve strides for user-visible outputs.
  • We should make sure we're still passing dense tensors to extern ops that require them. I think this will just work, but it's worth double-checking.
  • Not sure whether there is an AOT Autograd issue with changing strides. I don't think there is, but I remember it has special logic to deal with Inductor changing strides.
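
The memory concern in the first point can be made concrete with a small sketch. This is only an illustration under stated assumptions: `storage_elems`, `should_pad`, and the `max_overhead` threshold are hypothetical and do not correspond to any existing Inductor API or config.

```python
def storage_elems(shape, strides):
    """Number of storage elements spanned by a tensor with these strides."""
    if any(n == 0 for n in shape):
        return 0
    return 1 + sum((n - 1) * st for n, st in zip(shape, strides))


def should_pad(shape, dense_strides, padded_strides, max_overhead=0.1):
    """Accept padding only if the relative storage growth stays under a threshold.

    `max_overhead` is an assumed knob for illustration.
    """
    dense = storage_elems(shape, dense_strides)
    padded = storage_elems(shape, padded_strides)
    return (padded - dense) / dense <= max_overhead


# Wide rows: rounding 1023 up to 1024 costs almost nothing.
print(should_pad((8, 1023), [1023, 1], [1024, 1]))
# Narrow rows: rounding 9 up to 16 grows storage by well over 10%.
print(should_pad((1024, 9), [9, 1], [16, 1]))
```

Padding is cheap when the innermost dimension is large relative to the alignment and expensive when it is small, which is where a guard like this would matter.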

shunting314 added a commit that referenced this pull request Feb 29, 2024
ghstack-source-id: 2d04b61
Pull Request resolved: #120758
shunting314 added a commit that referenced this pull request Mar 5, 2024
ghstack-source-id: 89b08d5
Pull Request resolved: #120758
shunting314 added a commit that referenced this pull request Mar 6, 2024
ghstack-source-id: 689df19
Pull Request resolved: #120758
shunting314 added a commit that referenced this pull request Mar 8, 2024
ghstack-source-id: ef7c77a
Pull Request resolved: #120758
shunting314 added a commit that referenced this pull request Mar 12, 2024
ghstack-source-id: 0632db8
Pull Request resolved: #120758
shunting314 added a commit that referenced this pull request Mar 19, 2024
ghstack-source-id: 7d0d5b0
Pull Request resolved: #120758
@shunting314 shunting314 added the "topic: not user facing" label Apr 12, 2024
@shunting314 shunting314 requested review from eellison and jansel April 12, 2024 00:03
shunting314 added a commit that referenced this pull request Apr 12, 2024
ghstack-source-id: 6a9931e
Pull Request resolved: #120758
@shunting314
Contributor Author

All perf regressions have been fixed now. It turns out that the large regression in the default setting from TB was due to using a new triton pin. It seems that triton itself regressed; reverting the pin change fixes it. But this also means we are not using Hongtao's fix here: triton-lang/triton#3497 (since I need a relatively new triton to apply that PR). I'll try to dig more.

Hongtao's fix may take a while to land (it is still under review), and our triton pin also needs some time to be updated (even if we had all the fixes in right now, we might still need to run Meta-internal tests, which takes time). So I'll see whether I can apply Hongtao's fix to the triton pin Inductor currently uses and measure how it works together with comprehensive padding.

@eellison
Contributor

As a follow-up, it would be great to see whether we could use this to replace some of the current pad_mm logic.

@shunting314
Contributor Author

> As a follow-up, it would be great to see whether we could use this to replace some of the current pad_mm logic.

I think it depends on a few things:

  1. We cannot pad weights since they have FixedLayout.
  2. For activations, I think this PR can help pad the strides. But we'll need to benchmark whether a padded tensor gets matmul performance as good as a tensor with an aligned shape.
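
The distinction in point 2 can be sketched in plain Python. This assumes pad_mm works by padding operand shapes with zeros, whereas stride padding keeps the logical shape and only moves where each row starts; `pad_shape` and `pad_stride` are hypothetical helpers, not PyTorch functions.

```python
def pad_shape(rows, align):
    """Shape padding: append zero columns until the column count is a multiple of align."""
    ncols = len(rows[0])
    padded_cols = -(-ncols // align) * align
    return [row + [0] * (padded_cols - ncols) for row in rows]


def pad_stride(rows, align):
    """Stride padding: same logical shape, but each row starts at an aligned offset."""
    ncols = len(rows[0])
    row_stride = -(-ncols // align) * align
    buf = [0] * (len(rows) * row_stride)
    for i, row in enumerate(rows):
        buf[i * row_stride : i * row_stride + ncols] = row
    return buf, (len(rows), ncols), (row_stride, 1)


a = [[1, 2, 3], [4, 5, 6]]
shape_padded = pad_shape(a, 4)          # logical shape becomes (2, 4)
buf, shape, strides = pad_stride(a, 4)  # logical shape stays (2, 3)
print(shape_padded)
print(buf, shape, strides)
```

With stride padding, a matmul kernel still sees the unaligned logical dimension (3 here), which is why the benchmark comparison against an aligned shape is needed.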

@shunting314
Contributor Author

cc @jansel to take another look

shunting314 added a commit that referenced this pull request Apr 15, 2024
ghstack-source-id: 8f684e6
Pull Request resolved: #120758
@shunting314
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
Pull Request resolved: pytorch#120758
Approved by: https://github.com/jansel
petrex pushed a commit to petrex/pytorch that referenced this pull request May 3, 2024
@github-actions github-actions bot deleted the gh/shunting314/103/head branch May 16, 2024 01:53
6 participants