
Training on GPU crashes mid-run #391

Open
Hardy-Chung opened this issue Nov 29, 2024 · 4 comments

Comments


Hardy-Chung commented Nov 29, 2024

Training starts fine, but after a dozen or so epochs it crashes with an error.

Environment: CUDA 11.6

yaml: (config posted as a screenshot)

Training command:
python train.py --config ./configs/shwd_yolov8n.yaml --device_target GPU

Error message:

a_ops/gatherd.cu:35: GatherDKernel: block: [85,0,0], thread: [958,0,0] Assertion `j_read < input_shape.shape[dim]` failed.
/home/jenkins/agent-working-dir/workspace/Compile_GPU_X86_CentOS_Cuda11.6_PY39/mindspore/mindspore/ccsrc/plugin/device/gpu/kernel/cuda_impl/cuda_ops/gatherd.cu:35: GatherDKernel: block: [85,0,0], thread: [959,0,0] Assertion `j_read < input_shape.shape[dim]` failed.
[ERROR] RUNTIME_FRAMEWORK(64473,7f85fe328700,python):2024-12-02-12:36:05.918.774 [mindspore/ccsrc/runtime/graph_scheduler/actor/actor_common.cc:327] WaitRuntimePipelineFinish] Wait runtime pipeline finish and an error occurred: For `GatherD`, the cuda Kernel fails to run, the error number is 710, which means device-side assert triggered.

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/gpu/kernel/arrays/gatherd_gpu_kernel.cc:44 LaunchKernel

Traceback (most recent call last):
  File "/data/zhongzhijia/code/mindspore/mindyolo/train.py", line 330, in <module>
    train(args)
  File "/data/zhongzhijia/code/mindspore/mindyolo/train.py", line 285, in train
    trainer.train(
  File "/data/zhongzhijia/code/mindspore/mindyolo/mindyolo/utils/trainer_factory.py", line 170, in train
    run_context.loss, run_context.lr = self.train_step(imgs, labels, segments,
  File "/data/zhongzhijia/code/mindspore/mindyolo/mindyolo/utils/trainer_factory.py", line 366, in train_step
    loss, loss_item, _, grads_finite = self.train_step_fn(imgs, labels, True)
  File "/root/miniconda3/envs/mindspore/lib/python3.9/site-packages/mindspore/common/api.py", line 941, in staging_specialize
    out = _MindsporeFunctionExecutor(func, hash_obj, dyn_args, process_obj, jit_config)(*args, **kwargs)
  File "/root/miniconda3/envs/mindspore/lib/python3.9/site-packages/mindspore/common/api.py", line 185, in wrapper
    results = fn(*arg, **kwargs)
  File "/root/miniconda3/envs/mindspore/lib/python3.9/site-packages/mindspore/common/api.py", line 572, in __call__
    output = self._graph_executor(tuple(new_inputs), phase)
RuntimeError: For `GatherD`, the cuda Kernel fails to run, the error number is 710, which means device-side assert triggered.

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/gpu/kernel/arrays/gatherd_gpu_kernel.cc:44 LaunchKernel

[ERROR] DEVICE(64473,7f87b8ecc740,python):2024-12-02-12:37:11.361.608 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:209] SyncStream] cudaStreamSynchronize failed, ret[710], device-side assert triggered
[ERROR] ME(64473,7f87b8ecc740,python):2024-12-02-12:37:11.361.655 [mindspore/ccsrc/runtime/hardware/device_context_manager.cc:490] WaitTaskFinishOnDevice] SyncStream failed
[ERROR] DEVICE(64473,7f87b8ecc740,python):2024-12-02-12:37:11.465.838 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:200] DestroyStream] cudaStreamDestroy failed, ret[710], device-side assert triggered
[ERROR] DEVICE(64473,7f87b8ecc740,python):2024-12-02-12:37:11.465.856 [mindspore/ccsrc/plugin/device/gpu/hal/device/gpu_device_manager.cc:70] ReleaseDevice] Op Error: Failed to destroy CUDA stream. | Error Number: 0
[ERROR] DEVICE(64473,7f87b8ecc740,python):2024-12-02-12:37:11.466.409 [mindspore/ccsrc/plugin/device/gpu/hal/device/gpu_device_manager.cc:77] ReleaseDevice] cuDNN Error: Failed to destroy cuDNN handle | Error Number: 4 CUDNN_STATUS_INTERNAL_ERROR
[ERROR] DEVICE(64473,7f87b8ecc740,python):2024-12-02-12:37:11.467.787 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:55] FreeDeviceMem] cudaFree failed, ret[710], device-side assert triggered
[ERROR] PRE_ACT(64473,7f87b8ecc740,python):2024-12-02-12:37:11.467.795 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:937] operator()] Free device memory[0x7f8238000000] error.
[ERROR] DEVICE(64473,7f87b8ecc740,python):2024-12-02-12:37:11.467.801 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:55] FreeDeviceMem] cudaFree failed, ret[710], device-side assert triggered
[ERROR] PRE_ACT(64473,7f87b8ecc740,python):2024-12-02-12:37:11.467.806 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:937] operator()] Free device memory[0x7f8280000000] error.
[ERROR] DEVICE(64473,7f87b8ecc740,python):2024-12-02-12:37:11.467.811 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:55] FreeDeviceMem] cudaFree failed, ret[710], device-side assert triggered
[ERROR] PRE_ACT(64473,7f87b8ecc740,python):2024-12-02-12:37:11.467.817 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:937] operator()] Free device memory[0x7f8580000000] error.
[ERROR] DEVICE(64473,7f87b8ecc740,python):2024-12-02-12:37:11.467.835 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:55] FreeDeviceMem] cudaFree failed, ret[710], device-side assert triggered
[ERROR] PRE_ACT(64473,7f87b8ecc740,python):2024-12-02-12:37:11.467.840 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:937] operator()] Free device memory[0x7f84b8000000] error.
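For readers parsing the log above: the assertion `j_read < input_shape.shape[dim]` means `GatherD` received an index outside the input's extent along the gather dimension; bad label or index values in the data pipeline are a plausible (but unconfirmed here) cause. A minimal pure-Python sketch of the bounds check the kernel asserts, using a hypothetical `gather_1d` helper:

```python
# The CUDA assert `j_read < input_shape.shape[dim]` in gatherd.cu fires when
# an index fed to GatherD falls outside the input's extent along the gather
# dimension. A pure-Python sketch of the same check for the 1-D case:
def gather_1d(values, indices):
    for j in indices:
        # Same condition the kernel asserts: every index must address a
        # valid position, otherwise the gather reads out of bounds.
        if not (0 <= j < len(values)):
            raise IndexError(f"index {j} out of range for length {len(values)}")
    return [values[j] for j in indices]

print(gather_1d([10, 20, 30], [2, 0, 1]))  # valid indices: [30, 10, 20]

try:
    gather_1d([10, 20, 30], [3, 0, 1])     # 3 >= 3: same failure mode as the log
except IndexError as exc:
    print("caught:", exc)
```

On GPU the check runs inside the kernel, so the whole CUDA context is poisoned once it trips, which is why every later call (`cudaStreamSynchronize`, `cudaFree`, stream destruction) also reports error 710.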

m00nLi commented Nov 29, 2024

Try training a few more times; it may succeed.

Hardy-Chung (Author) replied:

> Try training a few more times; it may succeed.

It used to crash around epoch 10; the most recent run crashed at epoch 50 😂

yuedongli1 (Collaborator) commented:

mindyolo currently supports running only on Ascend hardware; other platforms (e.g. CPU/GPU) are unverified, so unexpected problems may occur.

m00nLi commented Dec 9, 2024

> mindyolo currently supports running only on Ascend hardware; other platforms (e.g. CPU/GPU) are unverified, so unexpected problems may occur.

We suggest stating clearly in the README that only Ascend hardware is supported and that CPU/GPU are not; we have already hit this pitfall and confirmed it ourselves.
