Training starts, but after ten-odd epochs it aborts with an error
Environment: CUDA 11.6
yaml:
Training command:
python train.py --config ./configs/shwd_yolov8n.yaml --device_target GPU
Error message:
a_ops/gatherd.cu:35: GatherDKernel: block: [85,0,0], thread: [958,0,0] Assertion `j_read < input_shape.shape[dim]` failed.
/home/jenkins/agent-working-dir/workspace/Compile_GPU_X86_CentOS_Cuda11.6_PY39/mindspore/mindspore/ccsrc/plugin/device/gpu/kernel/cuda_impl/cuda_ops/gatherd.cu:35: GatherDKernel: block: [85,0,0], thread: [959,0,0] Assertion `j_read < input_shape.shape[dim]` failed.
[ERROR] RUNTIME_FRAMEWORK(64473,7f85fe328700,python):2024-12-02-12:36:05.918.774 [mindspore/ccsrc/runtime/graph_scheduler/actor/actor_common.cc:327] WaitRuntimePipelineFinish] Wait runtime pipeline finish and an error occurred: For `GatherD`, the cuda Kernel fails to run, the error number is 710, which means device-side assert triggered.
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/gpu/kernel/arrays/gatherd_gpu_kernel.cc:44 LaunchKernel
Traceback (most recent call last):
  File "/data/zhongzhijia/code/mindspore/mindyolo/train.py", line 330, in <module>
    train(args)
  File "/data/zhongzhijia/code/mindspore/mindyolo/train.py", line 285, in train
    trainer.train(
  File "/data/zhongzhijia/code/mindspore/mindyolo/mindyolo/utils/trainer_factory.py", line 170, in train
    run_context.loss, run_context.lr = self.train_step(imgs, labels, segments,
  File "/data/zhongzhijia/code/mindspore/mindyolo/mindyolo/utils/trainer_factory.py", line 366, in train_step
    loss, loss_item, _, grads_finite = self.train_step_fn(imgs, labels, True)
  File "/root/miniconda3/envs/mindspore/lib/python3.9/site-packages/mindspore/common/api.py", line 941, in staging_specialize
    out = _MindsporeFunctionExecutor(func, hash_obj, dyn_args, process_obj, jit_config)(*args, **kwargs)
  File "/root/miniconda3/envs/mindspore/lib/python3.9/site-packages/mindspore/common/api.py", line 185, in wrapper
    results = fn(*arg, **kwargs)
  File "/root/miniconda3/envs/mindspore/lib/python3.9/site-packages/mindspore/common/api.py", line 572, in __call__
    output = self._graph_executor(tuple(new_inputs), phase)
RuntimeError: For `GatherD`, the cuda Kernel fails to run, the error number is 710, which means device-side assert triggered.
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/gpu/kernel/arrays/gatherd_gpu_kernel.cc:44 LaunchKernel
[ERROR] DEVICE(64473,7f87b8ecc740,python):2024-12-02-12:37:11.361.608 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:209] SyncStream] cudaStreamSynchronize failed, ret[710], device-side assert triggered
[ERROR] ME(64473,7f87b8ecc740,python):2024-12-02-12:37:11.361.655 [mindspore/ccsrc/runtime/hardware/device_context_manager.cc:490] WaitTaskFinishOnDevice] SyncStream failed
[ERROR] DEVICE(64473,7f87b8ecc740,python):2024-12-02-12:37:11.465.838 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:200] DestroyStream] cudaStreamDestroy failed, ret[710], device-side assert triggered
[ERROR] DEVICE(64473,7f87b8ecc740,python):2024-12-02-12:37:11.465.856 [mindspore/ccsrc/plugin/device/gpu/hal/device/gpu_device_manager.cc:70] ReleaseDevice] Op Error: Failed to destroy CUDA stream. | Error Number: 0
[ERROR] DEVICE(64473,7f87b8ecc740,python):2024-12-02-12:37:11.466.409 [mindspore/ccsrc/plugin/device/gpu/hal/device/gpu_device_manager.cc:77] ReleaseDevice] cuDNN Error: Failed to destroy cuDNN handle | Error Number: 4 CUDNN_STATUS_INTERNAL_ERROR
[ERROR] DEVICE(64473,7f87b8ecc740,python):2024-12-02-12:37:11.467.787 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:55] FreeDeviceMem] cudaFree failed, ret[710], device-side assert triggered
[ERROR] PRE_ACT(64473,7f87b8ecc740,python):2024-12-02-12:37:11.467.795 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:937] operator()] Free device memory[0x7f8238000000] error.
[ERROR] DEVICE(64473,7f87b8ecc740,python):2024-12-02-12:37:11.467.801 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:55] FreeDeviceMem] cudaFree failed, ret[710], device-side assert triggered
[ERROR] PRE_ACT(64473,7f87b8ecc740,python):2024-12-02-12:37:11.467.806 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:937] operator()] Free device memory[0x7f8280000000] error.
[ERROR] DEVICE(64473,7f87b8ecc740,python):2024-12-02-12:37:11.467.811 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:55] FreeDeviceMem] cudaFree failed, ret[710], device-side assert triggered
[ERROR] PRE_ACT(64473,7f87b8ecc740,python):2024-12-02-12:37:11.467.817 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:937] operator()] Free device memory[0x7f8580000000] error.
[ERROR] DEVICE(64473,7f87b8ecc740,python):2024-12-02-12:37:11.467.835 [mindspore/ccsrc/plugin/device/gpu/hal/device/cuda_driver.cc:55] FreeDeviceMem] cudaFree failed, ret[710], device-side assert triggered
[ERROR] PRE_ACT(64473,7f87b8ecc740,python):2024-12-02-12:37:11.467.840 [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:937] operator()] Free device memory[0x7f84b8000000] error.
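For context, the failed assertion `j_read < input_shape.shape[dim]` means an index fed to GatherD exceeded the size of the gathered dimension, so the likely root cause is an occasional out-of-range index produced somewhere in the GPU loss/assignment path. Below is a minimal sketch (not taken from mindyolo; all values hypothetical) that reproduces the same class of device-side assert with `mindspore.ops.gather_d` on GPU:

```python
# Hypothetical minimal reproduction of the GatherD device-side assert,
# assuming the root cause is an out-of-range index (not mindyolo code).
import numpy as np
import mindspore as ms
from mindspore import ops

ms.set_context(device_target="GPU")  # same target as the failing run

x = ms.Tensor(np.random.randn(2, 4).astype(np.float32))
# Along dim=1 the valid indices are 0..3; the value 4 below is out of range,
# which fires `j_read < input_shape.shape[dim]` inside gatherd.cu and
# surfaces as CUDA error 710 (device-side assert triggered).
bad_index = ms.Tensor(np.array([[0, 1], [3, 4]], dtype=np.int64))
out = ops.gather_d(x, 1, bad_index)
print(out)  # not reached on GPU; the CUDA context is already poisoned
```

A device-side assert poisons the whole CUDA context, which is why the later cudaStreamSynchronize/cudaFree/cuDNN calls in the log fail with the same 710 code.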
If you rerun the training a few times, it may eventually get through.
An earlier run crashed around epoch 10; the most recent one made it to epoch 50 before crashing 😂
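Since the crash point varies between runs, it might help to make CUDA launches synchronous so the assert is reported at the offending operator rather than at a later stream synchronize. A sketch under that assumption (CUDA_LAUNCH_BLOCKING is a standard CUDA runtime variable; `pynative_synchronize` only applies if the run is switched to PyNative mode, which is not how train.py runs by default):

```python
# Hypothetical debugging setup, not part of mindyolo's train.py:
# force synchronous CUDA launches so the device-side assert is raised
# at the exact operator call instead of at a later cudaStreamSynchronize.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # standard CUDA runtime variable

import mindspore as ms
# Only meaningful in PyNative mode; graph-mode runs ignore pynative_synchronize.
ms.set_context(mode=ms.PYNATIVE_MODE, pynative_synchronize=True)
```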
MindYOLO currently only supports running on Ascend hardware; other platforms (e.g. CPU/GPU) have not been verified, so some unexpected problems may occur.
Please state clearly in the README that only Ascend hardware is supported and that CPU/GPU are not; we have already hit this pitfall and confirmed it ourselves.