Insights: deepspeedai/DeepSpeed
Overview
- 15 Merged pull requests
- 0 Open pull requests
- 9 Closed issues
- 2 New issues
1 Release published by 1 person
- v0.17.1 Patch Release (published Jun 9, 2025)
15 Pull requests merged by 12 people
- Don't break set_start_method (#7349, merged Jun 11, 2025)
- s/UlyssesPlus/Arctic Long Sequence Training (ALST)/ (#7348, merged Jun 11, 2025)
- Update version after 0.17.1 release (#7345, merged Jun 10, 2025)
- Move pytest pinning from individual tests to requirements-dev.txt until fixed. (#7327, merged Jun 9, 2025)
- Fix docs that are rendering Incorrectly (#7344, merged Jun 9, 2025)
- Improve overflow handling in ZeRO (#6976, merged Jun 9, 2025)
- Update folder name (#7343, merged Jun 9, 2025)
- Fix LoRA arxiv reference (#7340, merged Jun 7, 2025)
- fixed: Modified the topkgating function and modified the test_moe file for testing (#7163, merged Jun 6, 2025)
- DeepNVMe update (#7215, merged Jun 6, 2025)
- fp16 optimizer timers fix - TypeError: 'NoneType' object is not callable (#7330, merged Jun 6, 2025)
- Fix issue with symint input (#7243, merged Jun 6, 2025)
- Fix pytest version to 8.3.5 in hpu-gaudi actions (#7337, merged Jun 5, 2025)
- Update config_utils.py (#7333, merged Jun 5, 2025)
- Improve Ulysses Plus Docs (#7335, merged Jun 5, 2025)
9 Issues closed by 8 people
- [BUG] Deepspeed may set multiprocessing set_start_method, breaking existing applications (#7347, closed Jun 11, 2025)
- [BUG] AutoTP: incorrect total train batch size when using the huggingface trainer API (#7298, closed Jun 10, 2025)
- Model Checkpoint docs are incorrectly rendered on deepspeed.readthedocs.io (#6747, closed Jun 9, 2025)
- [BUG] Zero2 offload overflow (#5241, closed Jun 9, 2025)
- [REQUEST]Do deepspeed's Pipline Engine support `ep >1 and pp >1 and tp>1` theoretically now? (#7336, closed Jun 7, 2025)
- Trainer.train(resume_from_checkpoint=...) fails when using auto tensor parallel (#7320, closed Jun 7, 2025)
- [REQUEST] Equivalent of FSDP ignore_params or ignore_modules for DeepSpeed Zero 3 (#7271, closed Jun 6, 2025)
- [BUG]AssertionError: found no DeviceMesh from dtensor args for c10d.broadcast_.default! (#7338, closed Jun 6, 2025)
2 Issues opened by 2 people
- DeepSpeed async_io requires libaio-0.3.112 or newer, breaks on libaio-0.3.111 (e.g., Fedora/EL9) (#7346, opened Jun 10, 2025)
- [REQUEST]Do deepspeed support megatron's `--sequence_parallel`? (#7339, opened Jun 6, 2025)
20 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
- Create COMMITTERS_RESPONSIBILITY.md (#7300, commented on Jun 10, 2025 • 1 new comment)
- Enable torch.autocast with ZeRO (#6993, commented on Jun 6, 2025 • 0 new comments)
- Update names of CPU Adam/Adagrad/Lion params to better match torch/GPU ops. (#5382, commented on Jun 10, 2025 • 0 new comments)
- [BUG] DeepCompile in ZeRO-1 fails to do the forward pass (#7229, commented on Jun 11, 2025 • 0 new comments)
- [BUG] DeepCompile in ZeRO-3 fails to do the forward pass (#7228, commented on Jun 11, 2025 • 0 new comments)
- Test hangs when world_size=4 and reuse_dist_env=True on PyTorch >= 2.7.0 (#7334, commented on Jun 11, 2025 • 0 new comments)
- [BUG] DeepCompile: MemoryProfiling error /pytorch/build/aten/src/ATen/RegisterCUDA.cpp:7280: SymIntArrayRef expected to contain only concrete integers (#7311, commented on Jun 11, 2025 • 0 new comments)
- nv-nightly CI test failure (#7140, commented on Jun 11, 2025 • 0 new comments)
- nv-torch-nightly-v100 CI test failure (#7195, commented on Jun 11, 2025 • 0 new comments)
- [BUG] The NCCL timed out while using the zero3 model. How can I solve this problem? (#5066, commented on Jun 10, 2025 • 0 new comments)
- [BUG] deepspeed zero2 training hangon and timeout after a fixed step (#7044, commented on Jun 10, 2025 • 0 new comments)
- [BUG] how to achieve hybrid data and pipeline parallelism? (#7280, commented on Jun 10, 2025 • 0 new comments)
- [REQUEST]I want to use CPU-based distributed approach to train a small recommendation model. Is there a demo available for me to refer ? (#7329, commented on Jun 10, 2025 • 0 new comments)
- [REQUEST] Support for XLA/TPU (#6901, commented on Jun 6, 2025 • 0 new comments)
- [BUG]"DeepSpeedZeRoOffload missing '_restore_from_bit16_weights' method when loading checkpoints" (#7272, commented on Jun 6, 2025 • 0 new comments)
- [BUG] `Assert Error: assert buffer.grad is not None` & `RuntimeError: element 1 of tensors does not require grad and does not have a grad_fn` During pipeline parallelism (#7270, commented on Jun 6, 2025 • 0 new comments)
- [BUG]Training (#7319, commented on Jun 6, 2025 • 0 new comments)
- [BUG] AttributeError: 'Linear' object has no attribute 'ds_grads_remaining' (#7203, commented on Jun 6, 2025 • 0 new comments)
- [BUG]clip gradient not working (#7220, commented on Jun 5, 2025 • 0 new comments)
- [REQUEST] How can I use AutoTP for training without huggingface Trainer and accelerate ? (#7179, commented on Jun 5, 2025 • 0 new comments)