[1/n] refactor the ring attention implementation #155441

wanchaol · 2025-06-09T05:56:13Z

Stack from ghstack (oldest at bottom):

As titled. I'm working on a series of changes to make the ring attention
implementation and DTensor work better together. This PR specifically refactors the
current implementation to:

* remove dead/unused code
* restructure the functions to keep them organized
* remove error messages or make them better

cc @H-Huang @awgu @fegin @fduwjj @wz337 @wconstab @d4l3k

[ghstack-poisoned]

pytorch-bot · 2025-06-09T05:56:18Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/155441

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

VolumeLimitExceeded Issue for linux.2xlarge and linux.4xlarge

❌ 1 Cancelled Job

As of commit 6011e48 with merge base 0f56318:

CANCELLED JOB - The following job was cancelled. Please retry:

inductor-rocm / rocm-py3.10-inductor / test (inductor, 2, 2, linux.rocm.gpu.2) (gh)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[ghstack-poisoned]

pytorchmergebot · 2025-06-27T17:00:55Z

Starting merge as part of PR stack under #155442

This PR rewrites how load balancing and sharding work in the current context parallel implementation.

Why the changes? We should NOT expose another layer of "sharding" concept, as it would confuse users about how it differs from DTensor sharding. The current CP performs sharding weirdly simply because it mixes the concepts of load balancing and sharding. I think load balancing and sharding need to be decoupled into separate layers:

* The load balancing layer is responsible for reordering the input sequence so that the attention computation is evenly balanced across rows/ranks.
* Sharding is a separate layer after it: it simply takes the input reordered by the load balancer and shards it sequentially, exactly as DTensor shards a tensor.

In this PR:

* I removed the mixed usage of "Sharder" and "LoadBalancer", and simply generate round-robin indices when the mask is a causal mask.
* Use `distribute_tensor` to perform the sharding. We still keep the local shard instead of the DTensor objects to allow maximum compatibility with arbitrary model architectures, given that DTensor op coverage is not high enough.

One alternative design is to keep the LoadBalancer and make index generation and restoration part of the LoadBalancer protocol. I thought it through and think we might want to directly expose the load_balancing indices as an argument instead of a dedicated class interface, so I removed it here. More discussion on this is welcome.

Pull Request resolved: #155442
Approved by: https://github.com/XilunWu
ghstack dependencies: #155441
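To make the two-layer split described above concrete, here is a minimal pure-Python sketch. It is not the code from #155442: the helper names (`round_robin_indices`, `shard_sequentially`, `restore_indices`) are hypothetical, plain lists stand in for tensors, and the plain contiguous split only stands in for what the PR does with `distribute_tensor`. The round-robin scheme shown (pair chunk `i` with chunk `2 * world_size - 1 - i`) is one common way to balance a causal mask and is assumed here, not confirmed by the message.

```python
def round_robin_indices(seq_len, world_size):
    """Load-balancing layer (hypothetical helper): return a permutation of
    range(seq_len) such that a plain contiguous split gives every rank one
    early chunk and one late chunk of the sequence."""
    num_chunks = 2 * world_size
    if seq_len % num_chunks != 0:
        raise ValueError("seq_len must be divisible by 2 * world_size")
    chunk = seq_len // num_chunks
    chunks = [list(range(i * chunk, (i + 1) * chunk)) for i in range(num_chunks)]
    order = []
    for rank in range(world_size):
        order.extend(chunks[rank])                   # cheap (early) chunk
        order.extend(chunks[num_chunks - 1 - rank])  # expensive (late) chunk
    return order


def shard_sequentially(items, world_size):
    """Sharding layer (hypothetical helper): contiguous, even split with no
    knowledge of load balancing."""
    size = len(items) // world_size
    return [items[r * size:(r + 1) * size] for r in range(world_size)]


def restore_indices(indices):
    """Inverse permutation, used to put outputs back in the original order."""
    inverse = [0] * len(indices)
    for new_pos, old_pos in enumerate(indices):
        inverse[old_pos] = new_pos
    return inverse


if __name__ == "__main__":
    world_size = 2
    tokens = list("abcdefgh")
    order = round_robin_indices(len(tokens), world_size)
    reordered = [tokens[i] for i in order]
    print(shard_sequentially(reordered, world_size))
    inverse = restore_indices(order)
    print([reordered[inverse[j]] for j in range(len(tokens))])
```

Under a causal mask, the token at position `i` attends to `i + 1` keys, so with 8 tokens and 2 ranks each rank in this sketch ends up with the same total amount of attention work. The sharding step never needs to know that the input was reordered, which is the decoupling the message argues for.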

As titled. I'm working on a series of changes to make the ring attention implementation and DTensor work better together. This PR specifically refactors the current implementation to:

* remove dead/unused code
* restructure the functions to keep them organized
* remove error messages or make them better

ghstack-source-id: e6c861b
Pull-Request-resolved: pytorch/pytorch#155441

Update

e8ec841

[ghstack-poisoned]

pytorch-bot bot added the ciflow/inductor and oncall: distributed labels Jun 9, 2025

wanchaol mentioned this pull request Jun 9, 2025

[2/n] rewrite load balancing and sharding in context parallel #155442

Closed

pytorchbot added the open source label Jun 9, 2025

wanchaol added the topic: not user facing label Jun 9, 2025

Update

6011e48

[ghstack-poisoned]

wanchaol requested a review from fegin June 9, 2025 21:00

fegin approved these changes Jun 24, 2025

pytorchmergebot closed this in f7c7301 Jun 27, 2025

pytorchmergebot added the Merged label Jun 27, 2025

XilunWu mentioned this pull request Jul 7, 2025

[WIP][RFC] Compilable flex_attention + Context Parallel #157015

Open
