How can I print how much device memory is left during training? #352

Open
Living190711 opened this issue Sep 11, 2024 · 2 comments

Comments

@Living190711

Repository: https://github.com/mindspore-lab/mindyolo/tree/v0.3.0
Environment: ModelArts, mindspore_2.2.12-cann_7.0.1.1-py_3.9-euler_2.10.7-aarch64-snt3p
[image]

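Problem description:
With 1000+ images, batch_size=64, and num_parallel_workers=8, one epoch takes about 2.24 minutes to train. How can I print how much device memory I have left? Also, is a little over 2 minutes per epoch normal? If not, what can be done about it?

Please reply, thanks!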

The terminal log is shown below:
[image]
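
To put the reported epoch time in perspective, the figures from the post can be turned into per-step numbers. This is only a rough sketch: it assumes exactly 1000 images (the post says "1000+") and that the last partial batch is kept, and the helper name `epoch_stats` is made up for illustration.

```python
import math


def epoch_stats(num_images: int, batch_size: int, epoch_minutes: float):
    """Return (steps per epoch, seconds per step, images per second)."""
    steps = math.ceil(num_images / batch_size)  # last partial batch is kept
    epoch_seconds = epoch_minutes * 60
    return steps, epoch_seconds / steps, num_images / epoch_seconds


# Assumed inputs: 1000 images, batch size 64, 2.24 minutes per epoch.
steps, sec_per_step, img_per_sec = epoch_stats(1000, 64, 2.24)
print(f"{steps} steps, {sec_per_step:.1f} s/step, {img_per_sec:.1f} images/s")
```

Under these assumptions an epoch is 16 steps at roughly 8.4 seconds each. Comparing the per-step time of the first epoch with later epochs may help separate one-off startup cost from steady-state speed.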

@Living190711
Author

I want to print how much device memory is left so that I can increase the batch size, raise memory utilization, and thereby speed up training.
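
One way to check device memory from Python, sketched under assumptions: on Ascend machines the `npu-smi info` command-line tool reports per-device memory usage, so a script can capture its output while training runs. Whether `npu-smi` is on `PATH` inside this ModelArts image is an assumption, and the helper name `npu_memory_report` is hypothetical. The output layout differs between driver versions, so the sketch returns the raw text instead of parsing it.

```python
import shutil
import subprocess
from typing import Optional


def npu_memory_report(tool: str = "npu-smi") -> Optional[str]:
    """Return the raw text of `<tool> info`, or None if it cannot be run."""
    if shutil.which(tool) is None:  # tool not installed or not on PATH
        return None
    try:
        result = subprocess.run(
            [tool, "info"], capture_output=True, text=True, timeout=30, check=False
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout if result.returncode == 0 else None


if __name__ == "__main__":
    report = npu_memory_report()
    print(report if report is not None else "npu-smi is not available")
```

Running this from a second terminal during training shows how much memory the job currently occupies, which gives a starting point for choosing a larger `--per_batch_size`.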

@Living190711
Author

How do I resume training from a checkpoint? I could not find a related parameter here, so I would appreciate an answer. Thanks.
def get_parser_train(parents=None):
    parser = argparse.ArgumentParser(description="Train", parents=[parents] if parents else [])
    parser.add_argument("--task", type=str, default="detect", choices=["detect", "segment"])
    parser.add_argument("--device_target", type=str, default="Ascend", help="device target, Ascend/GPU/CPU")
    parser.add_argument("--save_dir", type=str, default="./runs", help="save dir")
    parser.add_argument("--device_per_servers", type=int, default=8, help="device number on a server")
    parser.add_argument("--log_level", type=str, default="INFO", help="log level to print")
    parser.add_argument("--is_parallel", type=ast.literal_eval, default=False, help="Distribute train or not")
    parser.add_argument("--ms_mode", type=int, default=0,
                        help="Running in GRAPH_MODE(0) or PYNATIVE_MODE(1) (default=0)")
    parser.add_argument("--ms_amp_level", type=str, default="O0", help="amp level, O0/O1/O2/O3")
    parser.add_argument("--keep_loss_fp32", type=ast.literal_eval, default=True,
                        help="Whether to maintain loss using fp32/O0-level calculation")
    parser.add_argument("--ms_loss_scaler", type=str, default="static", help="train loss scaler, static/dynamic/none")
    parser.add_argument("--ms_loss_scaler_value", type=float, default=1024.0, help="static loss scale value")
    parser.add_argument("--ms_jit", type=ast.literal_eval, default=True, help="use jit or not")
    parser.add_argument("--ms_enable_graph_kernel", type=ast.literal_eval, default=False,
                        help="use enable_graph_kernel or not")
    parser.add_argument("--ms_datasink", type=ast.literal_eval, default=False, help="Train with datasink.")
    parser.add_argument("--overflow_still_update", type=ast.literal_eval, default=True, help="overflow still update")
    parser.add_argument("--clip_grad", type=ast.literal_eval, default=False)
    parser.add_argument("--clip_grad_value", type=float, default=10.0)
    parser.add_argument("--ema", type=ast.literal_eval, default=True, help="ema")
    parser.add_argument("--weight", type=str, default="", help="initial weight path")
    parser.add_argument("--ema_weight", type=str, default="", help="initial ema weight path")
    parser.add_argument("--freeze", type=list, default=[], help="Freeze layers: backbone of yolov7=50, first3=0 1 2")
    parser.add_argument("--epochs", type=int, default=300, help="total train epochs")
    parser.add_argument("--per_batch_size", type=int, default=32, help="per batch size for each device")
    parser.add_argument("--img_size", type=list, default=640, help="train image sizes")
    parser.add_argument("--nbs", type=list, default=64, help="nbs")
    parser.add_argument("--accumulate", type=int, default=1,
                        help="grad accumulate step, recommended when batch-size is less than 64")
    parser.add_argument("--auto_accumulate", type=ast.literal_eval, default=False, help="auto accumulate")
    parser.add_argument("--log_interval", type=int, default=100, help="log interval")
    parser.add_argument("--single_cls", type=ast.literal_eval, default=False,
                        help="train multi-class data as single-class")
    parser.add_argument("--sync_bn", type=ast.literal_eval, default=False,
                        help="use SyncBatchNorm, only available in DDP mode")
    parser.add_argument("--keep_checkpoint_max", type=int, default=100)
    parser.add_argument("--run_eval", type=ast.literal_eval, default=False, help="Whether to run eval during training")
    parser.add_argument("--conf_thres", type=float, default=0.001, help="object confidence threshold for run_eval")
    parser.add_argument("--iou_thres", type=float, default=0.65, help="IOU threshold for NMS for run_eval")
    parser.add_argument("--conf_free", type=ast.literal_eval, default=False,
                        help="Whether the prediction result include conf")
    parser.add_argument("--rect", type=ast.literal_eval, default=False, help="rectangular training")
    parser.add_argument("--nms_time_limit", type=float, default=20.0, help="time limit for NMS")
    parser.add_argument("--recompute", type=ast.literal_eval, default=False, help="Recompute")
    parser.add_argument("--recompute_layers", type=int, default=0)
    parser.add_argument("--seed", type=int, default=2, help="set global seed")
    parser.add_argument("--summary", type=ast.literal_eval, default=True, help="collect train loss scaler or not")
    parser.add_argument("--profiler", type=ast.literal_eval, default=False, help="collect profiling data or not")
    parser.add_argument("--profiler_step_num", type=int, default=1, help="collect profiler data for how many steps.")
    parser.add_argument("--opencv_threads_num", type=int, default=2, help="set the number of threads for opencv")
    parser.add_argument("--strict_load", type=ast.literal_eval, default=True, help="strictly load the pretrain model")

    # args for ModelArts
    parser.add_argument("--enable_modelarts", type=ast.literal_eval, default=False, help="enable modelarts")
    parser.add_argument("--data_url", type=str, default="", help="ModelArts: obs path to dataset folder")
    parser.add_argument("--ckpt_url", type=str, default="", help="ModelArts: obs path to pretrain model checkpoint file")
    parser.add_argument("--multi_data_url", type=str, default="", help="ModelArts: list of obs paths to multi-dataset folders")
    parser.add_argument("--pretrain_url", type=str, default="", help="ModelArts: list of obs paths to multi-pretrain model files")
    parser.add_argument("--train_url", type=str, default="", help="ModelArts: obs path to output folder")
    parser.add_argument("--data_dir", type=str, default="/cache/data/",
                        help="ModelArts: local device path to dataset folder")
    parser.add_argument("--ckpt_dir", type=str, default="/cache/pretrain_ckpt/",
                        help="ModelArts: local device path to checkpoint folder")
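
The parser quoted above has no dedicated resume flag, but it does accept `--weight` as an initial weight path. A possible workaround, offered only as a sketch, is to restart training with `--weight` pointing at the most recent checkpoint. The assumptions here are that checkpoints are written as `.ckpt` files somewhere under `--save_dir`, and the helpers `latest_checkpoint` and `resume_args` are hypothetical names, not part of mindyolo. Nothing in this thread confirms that optimizer state, the learning-rate schedule, or the epoch counter are restored this way, so treat it as reloading weights and not as a true resume.

```python
from pathlib import Path
from typing import List, Optional


def latest_checkpoint(save_dir: str) -> Optional[Path]:
    """Return the most recently modified .ckpt file under save_dir, if any."""
    ckpts = sorted(Path(save_dir).rglob("*.ckpt"), key=lambda p: p.stat().st_mtime)
    return ckpts[-1] if ckpts else None


def resume_args(save_dir: str) -> List[str]:
    """Build extra command-line arguments that reload the latest checkpoint."""
    ckpt = latest_checkpoint(save_dir)
    if ckpt is None:
        return []  # nothing saved yet, start from scratch
    return ["--weight", str(ckpt)]


if __name__ == "__main__":
    # "./runs" is the default of --save_dir in the parser quoted above.
    print(resume_args("./runs"))
```

The returned list can be appended to the usual training command. If EMA weights were saved separately, `--ema_weight` could be passed in the same way.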
