[simple_fsdp][inductor_collectives] rewrite reorder_collectives, sink_waits_iterative #158062

Closed
wants to merge 18 commits into from

Contributor

@IvanKobzarev IvanKobzarev commented Jul 10, 2025

[ghstack-poisoned]

pytorch-bot bot commented Jul 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158062

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (3 Unrelated Failures)

As of commit e42de67 with merge base d5af0ec (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov

IvanKobzarev added a commit that referenced this pull request Jul 11, 2025
preserving peak memory

ghstack-source-id: cc1038b
Pull Request resolved: #158062
@IvanKobzarev
Contributor Author

@IvanKobzarev has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 11, 2025

Differential Revision: [D78159013](https://our.internmc.facebook.com/intern/diff/D78159013)

IvanKobzarev added a commit that referenced this pull request Jul 11, 2025
preserving peak memory

ghstack-source-id: 03ea238
Pull Request resolved: #158062
@IvanKobzarev IvanKobzarev changed the title from "DEBUG sink waits" to "[simple_fsdp][inductor_collectives] rewrite reorder_collectives, sink_waits_iterative" Jul 11, 2025
@IvanKobzarev IvanKobzarev added the topic: not user facing topic category label Jul 11, 2025
superiwan pushed a commit to superiwan/pytorch that referenced this pull request Jul 14, 2025
ghstack-source-id: 5b8d32d
Pull Request resolved: pytorch/pytorch#158062
@IvanKobzarev IvanKobzarev requested a review from wconstab July 14, 2025 14:36

# Dicts to keep track of "next" and "previous" as double-linked structure during grouping
_prev: dict[BaseSchedulerNode, Optional[BaseSchedulerNode]] = {}
_next: dict[BaseSchedulerNode, Optional[BaseSchedulerNode]] = {}
_prev: dict[Optional[BaseSchedulerNode], Optional[BaseSchedulerNode]] = {}
Contributor

What's the point of storing the 'prev' node of a None? Or was this just a workaround for linter complaints?

Contributor Author

Yes, only linter complaints.

Contributor

@wconstab wconstab left a comment

lgtm. thanks!!

Git-Hub-Chris pushed a commit to Git-Hub-Chris/PyTorch that referenced this pull request Jul 15, 2025
preserving peak memory

ghstack-source-id: 30187c8
Pull Request resolved: pytorch/pytorch#158062
IvanKobzarev added a commit that referenced this pull request Jul 15, 2025
preserving peak memory

ghstack-source-id: 0b68cca
Pull Request resolved: #158062
IvanKobzarev added a commit that referenced this pull request Jul 16, 2025
preserving peak memory

ghstack-source-id: 307e4f6
Pull Request resolved: #158062
IvanKobzarev added a commit that referenced this pull request Jul 16, 2025
preserving peak memory

ghstack-source-id: 57d656c
Pull Request resolved: #158062
IvanKobzarev added a commit that referenced this pull request Jul 16, 2025
preserving peak memory

ghstack-source-id: a8fe869
Pull Request resolved: #158062
IvanKobzarev added a commit that referenced this pull request Jul 16, 2025
preserving peak memory

ghstack-source-id: 1a5ee37
Pull Request resolved: #158062
@IvanKobzarev
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@IvanKobzarev
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

@pytorchmergebot
Collaborator

Merge failed

Reason: Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x 54576f90272e2713df746c57a5b2d881f5d5bd3e returned non-zero exit code 1

Auto-merging benchmarks/dynamo/pr_time_benchmarks/expected_results.csv
CONFLICT (content): Merge conflict in benchmarks/dynamo/pr_time_benchmarks/expected_results.csv
Auto-merging torch/_inductor/dependencies.py
error: could not apply 54576f90272... [inductor_collectives] rewrite reorder_collectives, sink_waits_iterative
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Details for Dev Infra team Raised by workflow job

IvanKobzarev added a commit that referenced this pull request Jul 17, 2025
preserving peak memory

ghstack-source-id: fe170c3
Pull Request resolved: #158062
@IvanKobzarev
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (default, 1, 2, linux.rocm.gpu.2)

Details for Dev Infra team Raised by workflow job

@IvanKobzarev
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Labels
ci-no-td (Do not run TD on this PR)
ciflow/inductor
ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
module: dynamo
module: inductor
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
Reverted
topic: not user facing (topic category)
4 participants