[inductor] comprehensive padding #120758
See artifacts and rendered test results at hud.pytorch.org/pr/120758
No failures as of commit cf4bccf with merge base 6ac8fe4. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
This is not fully ready for review yet. I am sending it out in case people want to give early feedback, and so that I can run the CI perf tests earlier. Testing BlenderbotSmallForConditionalGeneration already shows a 2.5ms speedup, and I think the solution can be improved further to get more. cc jansel Chillee eellison voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames
Cool!
- We might need some logic to avoid increasing memory usage too much. I'm not sure whether this would actually be an issue in practice.
- I think we still want to preserve strides for user-visible outputs.
- We should make sure we still pass dense tensors to extern ops that require them. I think this will just work, but it's worth double-checking.
- I'm not sure whether there is an AOT Autograd issue with changing strides. I don't think there is, but I remember it has special logic to deal with Inductor changing strides.
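To make the memory concern in the first bullet concrete, here is a minimal sketch of a guard that accepts a padded layout only when the extra storage stays under a threshold. The helper names and the 10% default are hypothetical and are not taken from this PR or from Inductor.

```python
def storage_elems(shape, strides):
    """Number of elements a strided buffer must hold to cover every index."""
    if any(s == 0 for s in shape):
        return 0
    # Offset of the last element, plus one.
    return 1 + sum((s - 1) * st for s, st in zip(shape, strides))


def padding_is_affordable(shape, dense_strides, padded_strides, max_overhead=0.1):
    """Return True if padding grows storage by at most max_overhead (a fraction).

    max_overhead is an illustrative knob, not an Inductor config option.
    """
    dense = storage_elems(shape, dense_strides)
    padded = storage_elems(shape, padded_strides)
    if dense == 0:
        return True
    return (padded - dense) / dense <= max_overhead
```

For example, padding a `(8, 1023)` tensor from row stride 1023 to 1024 adds well under 1% of storage, while padding a `(1024, 3)` tensor from row stride 3 to 64 multiplies storage roughly twentyfold, so a guard like this would reject the second case.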
This PR adds the ability to pad tensor strides during lowering. The goal is to give tensors with badly aligned shapes aligned strides where possible, so the GPU can access memory more efficiently. Testing BlenderbotSmallForConditionalGeneration already shows a 2.5ms speedup. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang jansel Chillee eellison
Hongtao's fix is still under review and may take a while to land, and our triton pin also needs time to be updated (even if all the fixes were in right now, we might still need to run Meta internal tests, which takes time). So I'll see whether I can apply Hongtao's fix to the triton pin that Inductor currently uses and measure how it works together with comprehensive padding.
As a follow-up, it would be great to see whether we could use this to replace some of the current pad_mm logic.
I think it depends on a few things
cc @jansel to take another look
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
This PR adds the ability to pad tensor strides during lowering. The goal is to make sure (if possible) tensors with bad shape can have aligned strides so GPU can access the memory more efficiently. By testing BlenderbotSmallForConditionalGeneration I already see 2.5ms speedup. Pull Request resolved: pytorch#120758 Approved by: https://github.com/jansel
Stack from ghstack (oldest at bottom):
This PR adds the ability to pad tensor strides during lowering. The goal is to give tensors with badly aligned shapes aligned strides where possible, so the GPU can access memory more efficiently.
Testing BlenderbotSmallForConditionalGeneration already shows a 2.5ms speedup.
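As a rough illustration of what stride padding means, the sketch below rounds each non-innermost stride of a row-major layout up to a multiple of an alignment. This is not the code in this PR: the function names and the 128-byte default alignment are assumptions chosen for the example.

```python
def contiguous_strides(shape):
    """Row-major (dense) strides, in elements."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides


def pad_strides(shape, itemsize, align_bytes=128):
    """Row-major strides with every non-innermost stride rounded up to the alignment.

    align_bytes=128 is an illustrative default, not a value taken from Inductor.
    """
    align_elems = max(1, align_bytes // itemsize)
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        raw = strides[i + 1] * shape[i + 1]
        # Ceiling division, then scale back up to a multiple of align_elems.
        strides[i] = -(-raw // align_elems) * align_elems
    return strides


# A float16 tensor (2 bytes per element) with an awkward inner dimension.
print(contiguous_strides((8, 1023)))      # [1023, 1]
print(pad_strides((8, 1023), itemsize=2))  # [1024, 1]
```

With a dense layout each row of the `(8, 1023)` tensor starts at an offset that is not a multiple of 128 bytes. With the padded layout every row starts on an aligned boundary, at the cost of one unused element per row.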
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang @jansel @Chillee @eellison