June 4, 2025 – June 11, 2025

Overview

15 Active pull requests
- 15 Merged pull requests
- 0 Open pull requests

11 Active issues
- 9 Closed issues
- 2 New issues

1 Release published by 1 person

v0.17.1 Patch Release
published Jun 9, 2025

15 Pull requests merged by 12 people

Don't break set_start_method
#7349 merged Jun 11, 2025
s/UlyssesPlus/Arctic Long Sequence Training (ALST)/
#7348 merged Jun 11, 2025
Update version after 0.17.1 release
#7345 merged Jun 10, 2025
Move pytest pinning from individual tests to requirements-dev.txt until fixed.
#7327 merged Jun 9, 2025
Fix docs that are rendering Incorrectly
#7344 merged Jun 9, 2025
Improve overflow handling in ZeRO
#6976 merged Jun 9, 2025
Update folder name
#7343 merged Jun 9, 2025
Fix LoRA arxiv reference
#7340 merged Jun 7, 2025
fixed: Modified the topkgating function and modified the test_moe file for testing
#7163 merged Jun 6, 2025
DeepNVMe update
#7215 merged Jun 6, 2025
fp16 optimizer timers fix - TypeError: 'NoneType' object is not callable
#7330 merged Jun 6, 2025
Fix issue with symint input
#7243 merged Jun 6, 2025
Fix pytest version to 8.3.5 in hpu-gaudi actions
#7337 merged Jun 5, 2025
Update config_utils.py
#7333 merged Jun 5, 2025
Improve Ulysses Plus Docs
#7335 merged Jun 5, 2025

9 Issues closed by 8 people

[BUG] Deepspeed may set multiprocessing set_start_method, breaking existing applications
#7347 closed Jun 11, 2025
[BUG] AutoTP: incorrect total train batch size when using the huggingface trainer API
#7298 closed Jun 10, 2025
Model Checkpoint docs are incorrectly rendered on deepspeed.readthedocs.io
#6747 closed Jun 9, 2025
[BUG] Zero2 offload overflow
#5241 closed Jun 9, 2025
[REQUEST]Do deepspeed's Pipline Engine support `ep >1 and pp >1 and tp>1` theoretically now?
#7336 closed Jun 7, 2025
Trainer.train(resume_from_checkpoint=...) fails when using auto tensor parallel
#7320 closed Jun 7, 2025
[REQUEST] Equivalent of FSDP ignore_params or ignore_modules for DeepSpeed Zero 3
#7271 closed Jun 6, 2025
[BUG]AssertionError: found no DeviceMesh from dtensor args for c10d.broadcast_.default!
#7338 closed Jun 6, 2025
[BUG] DeepSpeed accuracy issue for torch.compile if activation checkpoint function not compiler disabled
#6718 closed Jun 5, 2025

2 Issues opened by 2 people

DeepSpeed async_io requires libaio-0.3.112 or newer, breaks on libaio-0.3.111 (e.g., Fedora/EL9)
#7346 opened Jun 10, 2025
[REQUEST]Do deepspeed support megatron's `--sequence_parallel`？
#7339 opened Jun 6, 2025

20 Unresolved conversations

Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.

Create COMMITTERS_RESPONSIBILITY.md
#7300 commented on Jun 10, 2025 • 1 new comment
Enable torch.autocast with ZeRO
#6993 commented on Jun 6, 2025 • 0 new comments
Update names of CPU Adam/Adagrad/Lion params to better match torch/GPU ops.
#5382 commented on Jun 10, 2025 • 0 new comments
[BUG] DeepCompile in ZeRO-1 fails to do the forward pass
#7229 commented on Jun 11, 2025 • 0 new comments
[BUG] DeepCompile in ZeRO-3 fails to do the forward pass
#7228 commented on Jun 11, 2025 • 0 new comments
Test hangs when world_size=4 and reuse_dist_env=True on PyTorch >= 2.7.0
#7334 commented on Jun 11, 2025 • 0 new comments
[BUG] DeepCompile: MemoryProfiling error /pytorch/build/aten/src/ATen/RegisterCUDA.cpp:7280: SymIntArrayRef expected to contain only concrete integers
#7311 commented on Jun 11, 2025 • 0 new comments
nv-nightly CI test failure
#7140 commented on Jun 11, 2025 • 0 new comments
nv-torch-nightly-v100 CI test failure
#7195 commented on Jun 11, 2025 • 0 new comments
[BUG] The NCCL timed out while using the zero3 model. How can I solve this problem?
#5066 commented on Jun 10, 2025 • 0 new comments
[BUG] deepspeed zero2 training hangon and timeout after a fixed step
#7044 commented on Jun 10, 2025 • 0 new comments
[BUG] how to achieve hybrid data and pipeline parallelism?
#7280 commented on Jun 10, 2025 • 0 new comments
[REQUEST]I want to use CPU-based distributed approach to train a small recommendation model. Is there a demo available for me to refer ?
#7329 commented on Jun 10, 2025 • 0 new comments
[REQUEST] Support for XLA/TPU
#6901 commented on Jun 6, 2025 • 0 new comments
[BUG]"DeepSpeedZeRoOffload missing '_restore_from_bit16_weights' method when loading checkpoints"
#7272 commented on Jun 6, 2025 • 0 new comments
[BUG] `Assert Error: assert buffer.grad is not None` & `RuntimeError: element 1 of tensors does not require grad and does not have a grad_fn` During pipeline parallelism
#7270 commented on Jun 6, 2025 • 0 new comments
[BUG]Training
#7319 commented on Jun 6, 2025 • 0 new comments
[BUG] AttributeError: 'Linear' object has no attribute 'ds_grads_remaining'
#7203 commented on Jun 6, 2025 • 0 new comments
[BUG]clip gradient not working
#7220 commented on Jun 5, 2025 • 0 new comments
[REQUEST] How can I use AutoTP for training without huggingface Trainer and accelerate ?
#7179 commented on Jun 5, 2025 • 0 new comments
