
Training on GPU crashes mid-run #391

Open
Hardy-Chung opened this issue Nov 29, 2024 · 4 comments

Comments


Hardy-Chung commented Nov 29, 2024

Training starts fine, but after a dozen or so epochs it crashes with an error.

Environment: CUDA 11.6

yaml: (config posted as a screenshot)

Training command:
python train.py --config ./configs/shwd_yolov8n.yaml --device_target GPU

Error message:

a_ops/gatherd.cu:35: GatherDKernel: block: [85,0,0], thread: [958,0,0] Assertion `j_read < input_shape.shape[dim]` failed.
/home/jenkins/agent-working-dir/workspace/Compile_GPU_X86_CentOS_Cuda11.6_PY39/mindspore/mindspore/ccsrc/plugin/device/gpu/kernel/cuda_impl/cuda_ops/gatherd.cu:35: GatherDKernel: block: [85,0,0], thread: [959,0,0] Assertion `j_read < input_shape.shape[dim]` failed.
[ERROR] RUNTIME_FRAMEWORK(64473,7f85fe328700,python):2024-12-02-12:36:05.918.774 [mindspore/ccsrc/runtime/graph_scheduler/actor/actor_common.cc:327] WaitRuntimePipelineFinish] Wait runtime pipeline finish and an error occurred: For `GatherD`, the cuda Kernel fails to run, the error number is 710, which means device-side assert triggered.

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/gpu/kernel/arrays/gatherd_gpu_kernel.cc:44 LaunchKernel

Traceback (most recent call last):
  File "/data/zhongzhijia/code/mindspore/mindyolo/train.py", line 330, in <module>
    train(args)
  File "/data/zhongzhijia/code/mindspore/mindyolo/train.py", line 285, in train
    trainer.train(
  File "/data/zhongzhijia/code/mindspore/mindyolo/mindyolo/utils/trainer_factory.py", line 170, in train
    run_context.loss, run_context.lr = self.train_step(imgs, labels, segments,
  File "/data/zhongzhijia/code/mindspore/mindyolo/mindyolo/utils/trainer_factory.py", line 366, in train_step
    loss, loss_item, _, grads_finite = self.train_step_fn(imgs, labels, True)
  File "/root/miniconda3/envs/mindspore/lib/python3.9/site-packages/mindspore/common/api.py", line 941, in staging_specialize
    out = _MindsporeFunctionExecutor(func, hash_obj, dyn_args, process_obj, jit_config)(*args, **kwargs)
  File "/root/miniconda3/envs/mindspore/lib/python3.9/site-packages/mindspore/common/api.py", line 185, in wrapper
    results = fn(*arg, **kwargs)
  File "/root/miniconda3/envs/mindspore/lib/python3.9/site-packages/mindspore/common/api.py", line 572, in __call__
    output = self._graph_executor(tuple(new_inputs), phase)
RuntimeError: For `GatherD`, the cuda Kernel fails to run, the error number is 710, which means device-side assert triggered.

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/gpu/kernel/arrays/gatherd_gpu_kernel.cc:44 LaunchKernel

[ERROR] DEVICE(64473,7f87b8ecc740,python):2024-12-02-12:37:11.361.608 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:209] SyncStream] cudaStreamSynchronize failed, ret[710], device-side assert triggered
[ERROR] ME(64473,7f87b8ecc740,python):2024-12-02-12:37:11.361.655 [mindspore/ccsrc/runtime/hardware/device_context_manager.cc:490] WaitTaskFinishOnDevice] SyncStream failed
[ERROR] DEVICE(64473,7f87b8ecc740,python):2024-12-02-12:37:11.465.838 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:200] DestroyStream] cudaStreamDestroy failed, ret[710], device-side assert triggered
[ERROR] DEVICE(64473,7f87b8ecc740,python):2024-12-02-12:37:11.465.856 [mindspore/ccsrc/plugin/device/gpu/hal/device/gpu_device_manager.cc:70] ReleaseDevice] Op Error: Failed to destroy CUDA stream. | Error Number: 0
[ERROR] DEVICE(64473,7f87b8ecc740,python):2024-12-02-12:37:11.466.409 [mindspore/ccsrc/plugin/device/gpu/hal/device/gpu_device_manager.cc:77] ReleaseDevice] cuDNN Error: Failed to destroy cuDNN handle | Error Number: 4 CUDNN_STATUS_INTERNAL_ERROR
[ERROR] DEVICE(64473,7f87b8ecc740,python):2024-12-02-12:37:11.467.787 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:55] FreeDeviceMem] cudaFree failed, ret[710], device-side assert triggered
[ERROR] PRE_ACT(64473,7f87b8ecc740,python):2024-12-02-12:37:11.467.795 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:937] operator()] Free device memory[0x7f8238000000] error.
[ERROR] DEVICE(64473,7f87b8ecc740,python):2024-12-02-12:37:11.467.801 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:55] FreeDeviceMem] cudaFree failed, ret[710], device-side assert triggered
[ERROR] PRE_ACT(64473,7f87b8ecc740,python):2024-12-02-12:37:11.467.806 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:937] operator()] Free device memory[0x7f8280000000] error.
[ERROR] DEVICE(64473,7f87b8ecc740,python):2024-12-02-12:37:11.467.811 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:55] FreeDeviceMem] cudaFree failed, ret[710], device-side assert triggered
[ERROR] PRE_ACT(64473,7f87b8ecc740,python):2024-12-02-12:37:11.467.817 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:937] operator()] Free device memory[0x7f8580000000] error.
[ERROR] DEVICE(64473,7f87b8ecc740,python):2024-12-02-12:37:11.467.835 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:55] FreeDeviceMem] cudaFree failed, ret[710], device-side assert triggered
[ERROR] PRE_ACT(64473,7f87b8ecc740,python):2024-12-02-12:37:11.467.840 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:937] operator()] Free device memory[0x7f84b8000000] error.
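For readers parsing the log above: the assertion `j_read < input_shape.shape[dim]` means `GatherD` received an index outside the input's extent along the gather dimension; bad label or index values in the data pipeline are a plausible (but unconfirmed here) cause. A minimal pure-Python sketch of the bounds check the kernel asserts, using a hypothetical `gather_1d` helper:

```python
# The CUDA assert `j_read < input_shape.shape[dim]` in gatherd.cu fires when
# an index fed to GatherD falls outside the input's extent along the gather
# dimension. A pure-Python sketch of the same check for the 1-D case:
def gather_1d(values, indices):
    for j in indices:
        # Same condition the kernel asserts: every index must address a
        # valid position, otherwise the gather reads out of bounds.
        if not (0 <= j < len(values)):
            raise IndexError(f"index {j} out of range for length {len(values)}")
    return [values[j] for j in indices]

print(gather_1d([10, 20, 30], [2, 0, 1]))  # valid indices: [30, 10, 20]

try:
    gather_1d([10, 20, 30], [3, 0, 1])     # 3 >= 3: same failure mode as the log
except IndexError as exc:
    print("caught:", exc)
```

On GPU the check runs inside the kernel, so the whole CUDA context is poisoned once it trips, which is why every later call (`cudaStreamSynchronize`, `cudaFree`, stream destruction) also reports error 710.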

m00nLi commented Nov 29, 2024

Try training a few more times; it may succeed.

Hardy-Chung (Author) replied:

> Try training a few more times; it may succeed.

It used to crash around epoch 10; the most recent run crashed at epoch 50 😂

yuedongli1 (Collaborator) commented:

mindyolo currently supports running only on Ascend hardware; other platforms (e.g. CPU/GPU) are unverified, so unexpected problems may occur.

m00nLi commented Dec 9, 2024

> mindyolo currently supports running only on Ascend hardware; other platforms (e.g. CPU/GPU) are unverified, so unexpected problems may occur.

We suggest stating clearly in the README that only Ascend hardware is supported and that CPU/GPU are not; we have already hit this pitfall and confirmed it ourselves.
