
Add a test for AsyncCollectiveTensor handling for maybe-view ops #152688


Closed
wants to merge 1 commit

Conversation

bdhirsh (Contributor) commented on May 2, 2025

pytorch-bot (bot) commented on May 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/152688

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 9cc4a79 with merge base a4a7716:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

```python
out2 = out1.to(dtype=torch.bfloat16)
# this tests that the call to .to() properly triggered a wait() on the AsyncCollectiveTensor
self.assertTrue(type(out2) is torch.Tensor)
self.assertEqual(
```
bdhirsh (Contributor, Author):

@kwen2501 I wasn't able to repro the actual silent correctness issue in this test when I backed out the original fix (I confirmed that we were calling out1.to(...) and doing a proper dtype conversion before issuing a proper wait_tensor() on the input, but the final output was still correct).

I did confirm that the earlier assert failed before the fix, though, which I think is a reasonable proxy - the output of calling .to() should give us a plain tensor, not an ACT, to represent the fact that we properly issued a sync.
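
A minimal, hypothetical sketch of the kind of check being described here (not the PR's actual test): it assumes a process group is already initialized, that the eager functional-collectives API (torch.distributed._functional_collectives) returns an AsyncCollectiveTensor before any wait is issued, and the helper name is made up.

```python
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol

def check_to_unwraps_act() -> None:
    # Assumes dist.init_process_group(...) has already run on every rank.
    x = torch.ones(4)
    out1 = funcol.all_reduce(x, "sum", dist.group.WORLD)
    # out1 is expected to be an AsyncCollectiveTensor: the collective has been
    # issued but not yet waited on.
    assert isinstance(out1, funcol.AsyncCollectiveTensor)

    out2 = out1.to(dtype=torch.bfloat16)
    # The proxy check: a maybe-view op like .to() should unwrap the ACT,
    # issue the wait, and hand back a plain torch.Tensor, not another ACT.
    assert type(out2) is torch.Tensor
    assert out2.dtype == torch.bfloat16
```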

kwen2501 (Contributor):

Hi @bdhirsh, I think assertEqual would eventually trigger a wait on out2 -- that's why you wouldn't see data corruption, with or without the fix.

In fact, the return of an ACT instead of a Tensor is the only problem (copied from #152534):

> The issue seems to be that for an AsyncCollectiveTensor t, invoking t.float() does not trigger the wait_tensor (which would have returned a regular torch.Tensor); instead it returns a new AsyncCollectiveTensor with garbage data.

If the fix makes a difference on self.assertTrue(type(out2) is torch.Tensor), then we are good.
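
A sketch of why the value comparison alone would not expose the bug, assuming, as stated above, that a compute op dispatched on an unwaited ACT forces the pending wait before any data is read; the helper name is hypothetical.

```python
import torch
import torch.distributed._functional_collectives as funcol

def value_check_hides_bug(out2: torch.Tensor, expected: torch.Tensor) -> bool:
    # Even if a buggy .to() left out2 as an unwaited AsyncCollectiveTensor,
    # the comparison below dispatches a compute op on it, which (per the
    # comment above) triggers the wait before values are read. The data then
    # compares equal with or without the fix; only type(out2) reveals the bug.
    if isinstance(out2, funcol.AsyncCollectiveTensor):
        print("bug: .to() returned an AsyncCollectiveTensor, not a plain Tensor")
    return torch.equal(out2, expected)
```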

bdhirsh (Contributor, Author):

> I think assertEqual would eventually trigger a wait on out2 -- that's why you wouldn't see data corruption, with or without the fix.

So we are agreed on this. What I'm surprised by, though, is that even if assertEqual issues a wait, that is still not early enough. If we issue the .to() dtype-conversion kernel before running wait_tensor, I would have imagined that the .to() kernel would read garbage data (the allgather output buffer) before synchronizing.
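
For contrast, the ordering being described can be spelled out explicitly. This is a sketch under the assumption that funcol.wait_tensor can be called manually on a collective result; the helper name is mine.

```python
import torch
import torch.distributed._functional_collectives as funcol

def convert_after_explicit_wait(out1: torch.Tensor) -> torch.Tensor:
    # Wait on the collective first, so the dtype-conversion kernel launched by
    # .to() only ever reads the completed all_gather output buffer.
    waited = funcol.wait_tensor(out1)
    return waited.to(dtype=torch.bfloat16)
```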

bdhirsh (Contributor, Author):

Either way, I agree that my torch.Tensor assertion is enough to convince myself that this test is OK.

kwen2501 (Contributor) commented on May 5, 2025:

Arg, you are right. After we made .to() return a torch.Tensor, we'd still need to decide whether that torch.Tensor should be a waited tensor or not.
If it is an unwaited tensor, it seems we would get into trouble as well, because the user has now lost the ACT handle and cannot call wait_tensor anymore.
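
To make that hazard concrete, a hypothetical sketch (not what the fix does): if .to() unwrapped the ACT without waiting, a caller holding only the result would have nothing left to synchronize on.

```python
import torch
import torch.distributed._functional_collectives as funcol

def hazard_if_unwrapped_without_wait(out1: funcol.AsyncCollectiveTensor) -> torch.Tensor:
    out2 = out1.to(dtype=torch.bfloat16)
    # If out2 were a plain but *unwaited* torch.Tensor here, a caller holding
    # only out2 would have lost the ACT handle and could no longer call
    # wait_tensor() before reading it. Hence the fix must both unwrap and wait.
    return out2
```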

kwen2501 (Contributor):

But wait, wouldn't the transition from ACT to Tensor already trigger a wait? Are you saying that this is not observed in your test?

kwen2501 (Contributor) left a review comment:

Thanks for adding the test!


kwen2501 (Contributor) commented on May 2, 2025:

Resolves #152534.

albanD removed their request for review on May 2, 2025 at 17:23
bdhirsh (Contributor, Author) commented on May 5, 2025:

@pytorchbot merge

pytorch-bot (bot) added the ciflow/trunk label (Trigger trunk jobs on your pull request) on May 5, 2025
pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Labels: ciflow/trunk (Trigger trunk jobs on your pull request), Merged, oncall: distributed (Add this issue/PR to distributed oncall triage queue), topic: not user facing (topic category)