Add accelerator API to RPC distributed examples: ddp_rpc, parameter_server, rnn #1371

Open
wants to merge 4 commits into
base: main

Conversation

jafraustro
Contributor

Add accelerator API to RPC distributed examples:

  • ddp_rpc
  • parameter_server
  • rnn

CC: @soumith

Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
- ddp_rpc
- parameter_server
- rnn

Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>

netlify bot commented Jul 14, 2025

Deploy Preview for pytorch-examples-preview canceled.

Name Link
🔨 Latest commit a84f91c
🔍 Latest deploy log https://app.netlify.com/projects/pytorch-examples-preview/deploys/68798280b39c080008fc743c

@jafraustro jafraustro marked this pull request as ready for review July 14, 2025 16:34
@soumith
Member

soumith commented Jul 15, 2025

failing CI

Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
@jafraustro
Contributor Author

I added numpy to the requirements.txt files.

@jafraustro jafraustro closed this Jul 15, 2025
@jafraustro jafraustro reopened this Jul 15, 2025
@soumith
Member

soumith commented Jul 16, 2025

still failing :D

- Added a function to verify minimum GPU count before execution.
- Updated HybridModel initialization to use rank instead of device.
- Ensured proper cleanup of the process group to avoid resource leaks.
- Added exit message if insufficient GPUs are detected.

Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
@jafraustro
Contributor Author

Hi @soumith,

The DDP step needs two GPUs.

Fix:

  • Added verify_min_gpu_count() function to check for sufficient GPU resources.
  • Updated the HybridModel class to use rank-based device assignment instead of generic device handling, improving device placement consistency across distributed processes.
  • Implemented proper cleanup by adding dist.destroy_process_group() calls for trainer processes.
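
For readers following along, the GPU-count check described in the first bullet could look roughly like the sketch below. This is not the code from the PR: only the name verify_min_gpu_count() comes from the comment above. The helper available_accelerator_count() and the device_count parameter are hypothetical, added here so the check can be exercised on a machine without PyTorch or GPUs. The sketch assumes a recent PyTorch that provides torch.accelerator and falls back to torch.cuda otherwise.

```python
def available_accelerator_count() -> int:
    """Hypothetical helper: number of visible accelerator devices, 0 if none."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to train on.
        return 0
    # Assumption: torch.accelerator is present in recent PyTorch releases;
    # older releases only expose the CUDA-specific API.
    accel = getattr(torch, "accelerator", None)
    if accel is not None:
        return accel.device_count() if accel.is_available() else 0
    return torch.cuda.device_count() if torch.cuda.is_available() else 0


def verify_min_gpu_count(min_gpus: int = 2, device_count=None) -> bool:
    """Return True when at least `min_gpus` accelerator devices are visible.

    `device_count` is a hypothetical override for testing; when it is None
    the count is queried from PyTorch.
    """
    if device_count is None:
        device_count = available_accelerator_count()
    return device_count >= min_gpus
```

A launcher script could call verify_min_gpu_count(2) before spawning the trainer processes and print a message and exit early when it returns False, which matches the "exit message if insufficient GPUs are detected" item in the commit message.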

3 participants