Improved YOLOV7-TINY Network For Sea Bream Detection
Abstract: Accurate identification of underwater fish species is of great scientific and economic significance in aquaculture, as it can provide a scientific basis for aquaculture production and promote related research. However, the underwater environment is complex and affected by various factors such as light, water quality, and mutual occlusion among fish. As a result, underwater fish images are often not clear enough, which limits the accurate identification of underwater targets. In this paper, an improved YOLOV7-TINY model for sea bream detection is proposed. We employ FasterNet to replace the backbone network of the YOLOV7-TINY model, further reducing model parameters and computational complexity without compromising accuracy. By leveraging cascaded feature fusion in the backbone network, we effectively address the challenges posed by multi-scale datasets and insufficient information extraction. Additionally, the RESNETCBAM attention mechanism is incorporated into the feature maps at three different scales, allowing the network to better capture relevant information from complex underwater environments while minimizing unnecessary interference. Finally, the ECIOU loss function is adopted to optimize bounding-box adjustment and reduce the training time of the model.
Keywords: Cascaded feature fusion; Attention mechanism; Improved YOLOV7-TINY network; ECIOU.
have adopted this idea to construct object detection algorithms. Currently, popular object detection algorithms can be categorized into two types: two-stage and one-stage. The former is based on the principle of coarse positioning followed by fine classification, first identifying candidate regions containing objects and then performing classification; examples include R-CNN, Fast-RCNN, Faster-RCNN, and Mask-RCNN, which are relatively slower in detection speed than the latter. One-stage object detection algorithms directly predict object classification and localization through convolutional neural networks, achieving a better balance between accuracy and speed. Representative algorithms in this category include SSD and the YOLO series (YOLOV3, YOLOV4, YOLOV5, YOLOV6, YOLOV7). Zhao et al. proposed the YOLO-UOD underwater detection algorithm based on Yolov4-tiny. Li et al. introduced a triplet attention mechanism into YOLOV5 to improve underwater biological feature extraction. Zhai et al. added CBAM to Yolov5s to improve recognition accuracy and efficiency and introduced a multi-scale algorithm to enhance image contrast.
Due to the complex underwater environment, simply applying the above models to sea bream recognition still poses some problems:
(1) Underwater environments are affected by factors such as lighting and water quality, and the background of the captured images also poses difficulties for detection, leading to inaccurate detection results.
(2) During feature extraction and fusion, the models may not fully extract multi-scale information from underwater fish schools.
(3) The original model has long training times and is large in size, making it inconvenient to deploy on mobile devices.
Our work aims to address the aforementioned issues.

2. Methods

2.1. FasterNet

Because the backbone network of YOLOV7-TINY contains significant redundant computation, FasterNet is adopted to reduce redundant computation and memory access, which further lightens the model without compromising the accuracy of the original baseline.
Lightweight networks such as MobileNet, ShuffleNet, and GhostNet utilize depthwise convolution (DWConv) or group convolution (GConv) to extract spatial features. However, while reducing floating-point operations (FLOPs), as indicated by Equation (1), detection latency is also influenced by the number of floating-point operations per second (FLOPS), and these operators often suffer from the side effect of increased memory access.

$\mathrm{Latency} = \dfrac{\mathrm{FLOPs}}{\mathrm{FLOPS}}$ (1)

To achieve a high number of floating-point operations per second (FLOPS) while reducing FLOPs, partial convolution (PConv) is used to decrease memory access and computational redundancy. Essentially, the FLOPs of PConv are lower than those of regular convolution, while its FLOPS are higher than those of DWConv/GConv. In other words, PConv better utilizes the computational power of the device while remaining effective at extracting spatial features.
As shown in Figure 1, PConv applies a regular convolution to extract spatial features from only some of the input channels while keeping the rest unchanged. For contiguous or regular memory access, the first or last contiguous $c_p$ channels are taken as representatives of the entire feature map for computation. Without loss of generality, it is assumed that the input and output feature maps have the same number of channels.

Figure 1. PConv structure

The FLOPs (floating-point operations) of PConv can be expressed as $h \times w \times k^2 \times c_p^2$, where $h$ and $w$ represent the height and width of the feature map, $k$ denotes the size of the convolution kernel, and $c_p$ is the number of convolved channels; that is, the $c_{in}$ of conventional convolution is replaced by $c_p$ in PConv. In practical applications, $r = c_p / c = 1/4$ is typically assumed. Consequently, the FLOPs of PConv are only 1/16 of those of conventional convolution.
The memory access of PConv can be represented as $h \times w \times 2c_p + k^2 \times c_p^2$, which is approximately $h \times w \times 2c_p$. The memory access of PConv is therefore only a quarter of that of conventional convolution, so no additional memory access is required.
In order to fully and effectively utilize information from all channels, a pointwise convolution (PWConv) is added after the PConv. The FLOPs of the decoupled PConv and PWConv can be calculated as $h \times w \times (k^2 \times c_p^2 + c \times c_p)$. Compared with regular convolutions, this reduces the computational workload. Each FasterNet module consists of one PConv layer, two PWConv layers, batch normalization, and a ReLU activation function, as shown in Figure 2.

Figure 2. FasterNet Block
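To make the mechanism concrete, the following is a minimal PyTorch sketch of a PConv layer and a FasterNet-style block as described above. It is not the authors' implementation; the split ratio r = 1/4 follows the text, while the expansion factor of the pointwise convolutions and the residual shortcut are assumptions.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution: a regular k x k convolution is applied to the first
    c_p = r * c channels only; the remaining channels pass through untouched."""
    def __init__(self, channels: int, kernel_size: int = 3, ratio: float = 0.25):
        super().__init__()
        self.cp = max(1, int(channels * ratio))   # convolved channels (r = 1/4 here)
        self.conv = nn.Conv2d(self.cp, self.cp, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x[:, :self.cp], x[:, self.cp:]   # split along the channel axis
        return torch.cat([self.conv(x1), x2], dim=1)

class FasterNetBlock(nn.Module):
    """One PConv followed by two pointwise (1x1) convolutions with batch
    normalization and ReLU, matching the module composition described in the
    text; the expansion factor and the residual shortcut are assumptions."""
    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        hidden = channels * expansion
        self.pconv = PConv(channels)
        self.pwconv = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.pwconv(self.pconv(x))

if __name__ == "__main__":
    y = FasterNetBlock(64)(torch.randn(1, 64, 80, 80))
    print(y.shape)  # torch.Size([1, 64, 80, 80])
```

Because only $c_p = c/4$ channels enter the k x k convolution, its FLOPs scale with $c_p^2 = c^2/16$, which is where the 1/16 figure quoted above comes from.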
2.2. Fusion Block

The Fusion Block achieves more accurate object detection and localization by integrating feature maps from different levels. Specifically, it consists of two essential components: the feature fusion module and the multi-scale feature fusion module. The feature fusion module merges feature maps from different levels through a feature pyramid network, thereby enhancing the model's capability for object detection. The multi-scale feature fusion module, on the other hand, combines feature maps from different scales, enabling the model to better adapt to objects of varying sizes and proportions. The combination of these two modules leads to improved handling of complex environmental conditions and other challenges.
Specifically, the dimensionality of feature maps at the same scale is reduced using 1x1 convolutions. For larger-scale feature maps, dimensionality reduction is performed first with a 1x1 convolution, followed by downsampling with a 3x3 convolution with a stride of 2. For smaller-scale feature maps, upsampling is conducted using 2x2 transpose convolutions. Finally, the resulting feature maps from these three parts are concatenated, followed by dimensionality reduction using a 1x1 convolution again, as shown in Figure 3.
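As a rough illustration only (the exact channel widths and layer ordering inside the authors' Fusion Block are not spelled out here, so the choices below are assumptions), the three-scale fusion step described above can be sketched as:

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Cascaded feature fusion sketch: align three adjacent scales with 1x1
    convolutions, downsample the larger map (3x3, stride 2), upsample the
    smaller map (2x2 transpose convolution), concatenate, then reduce the
    concatenation with another 1x1 convolution."""
    def __init__(self, c_large: int, c_mid: int, c_small: int, c_out: int):
        super().__init__()
        self.reduce_large = nn.Conv2d(c_large, c_out, 1)           # 1x1 reduction
        self.down = nn.Conv2d(c_out, c_out, 3, stride=2, padding=1) # 3x3, stride 2
        self.reduce_mid = nn.Conv2d(c_mid, c_out, 1)                # 1x1 reduction
        self.up = nn.ConvTranspose2d(c_small, c_out, 2, stride=2)   # 2x2 transpose conv
        self.fuse = nn.Conv2d(3 * c_out, c_out, 1)                  # final 1x1 reduction

    def forward(self, f_large, f_mid, f_small):
        a = self.down(self.reduce_large(f_large))  # larger scale: reduce then downsample
        b = self.reduce_mid(f_mid)                 # same scale: reduce only
        c = self.up(f_small)                       # smaller scale: upsample by 2
        return self.fuse(torch.cat([a, b, c], dim=1))

if __name__ == "__main__":
    f1 = torch.randn(1, 128, 80, 80)   # larger-scale map
    f2 = torch.randn(1, 256, 40, 40)   # same-scale map
    f3 = torch.randn(1, 512, 20, 20)   # smaller-scale map
    out = FusionBlock(128, 256, 512, 256)(f1, f2, f3)
    print(out.shape)  # torch.Size([1, 256, 40, 40])
```

In this sketch the fused output keeps the spatial size of the middle scale, so it can be passed on to the corresponding detection branch.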
Figure 4. Spatial Attention Module
Figure 6. ResNetCBAM
2.4. Loss Function

Object detection can typically be divided into two stages: localization and classification. The accuracy of the localization stage is primarily influenced by regression loss functions, which has led to the emergence of various new regression loss functions.
To measure the similarity between predicted boxes and ground truth boxes and to select appropriate positive and negative samples, Intersection over Union (IoU) has become the most popular metric in bounding box regression. To further optimize the IoU metric, the IoU loss function was proposed.
However, the IoU loss function fails when there is no overlap between the predicted box and the ground truth box. To address these issues, many IoU-based evaluation schemes have been derived, which improve the shortcomings of the original IoU loss function from different perspectives and significantly enhance its robustness.
Among them, the Generalized Intersection over Union (GIoU), Distance Intersection over Union (DIoU), and Complete Intersection over Union (CIoU) loss functions are the most representative methods. They have made significant progress in the field of object detection, but there is still considerable room for optimization.
Among these methods, CIoU is considered the most effective boundary regression loss function because it considers three key geometric factors: overlap area, center point distance, and aspect ratio. CIoU uses IoU, the Euclidean distance between box centers, and the corresponding aspect ratio to measure the agreement between the predicted box and the ground truth box.
In the regression stage, it is not suitable for both the width and height of the predicted box to increase or decrease simultaneously, because the term αv cannot accurately represent the true difference in width and height. Therefore, when the model converges to a linear ratio between the width and height of the predicted box and the ground truth box, it may sometimes hinder the effective optimization of similarity. The EIOU_Loss function decomposes the aspect ratio factor αv of CIOU_Loss into separate penalties on the widths and heights of the predicted box and the ground truth box, thus addressing this problem of CIOU_Loss.
When only distant edges need to be adjusted, the calculation of EIOU_Loss can become slow and may fail to converge. To address this issue, a new enhanced loss function called ECIOU has been proposed, which facilitates adjustment of the predicted box and improves the box regression rate.
The foundation of ECIOU is the combination of the CIOU and EIOU loss functions. Initially, the aspect ratio of the predicted box is adjusted by CIOU until it converges to an appropriate range. Then, each side is finely tuned by EIOU until it converges to the correct value. ECIOU_Loss is
computed using Equation (5).

$\mathcal{R}_{CIoU} = \dfrac{\rho^2(\mathbf{b}, \mathbf{b}^{gt})}{c^2} + \alpha v$ (2)

$v = \dfrac{4}{\pi^2}\left(\arctan\dfrac{w^{gt}}{h^{gt}} - \arctan\dfrac{w}{h}\right)^2$ (3)

$\mathcal{L}_{CIoU} = 1 - IoU + \dfrac{\rho^2(\mathbf{b}, \mathbf{b}^{gt})}{c^2} + \alpha v$ (4)

$\mathcal{L}_{ECIoU} = 1 - IoU + \alpha v + \dfrac{\rho^2(\mathbf{b}^{gt}, \mathbf{b})}{c^2} + \dfrac{\rho^2(h^{gt}, h)}{c_h^2} + \dfrac{\rho^2(w^{gt}, w)}{c_w^2}$ (5)

where $\rho(\cdot)$ denotes the Euclidean distance, $\mathbf{b}$ and $\mathbf{b}^{gt}$ are the centers of the predicted and ground truth boxes, $w$, $h$, $w^{gt}$, and $h^{gt}$ are their widths and heights, and $c$, $c_w$, and $c_h$ are the diagonal length, width, and height of the smallest box enclosing both.
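A minimal PyTorch sketch of Equation (5) is given below. Boxes are assumed to be in (x1, y1, x2, y2) format, and the trade-off weight α is taken to be the usual CIoU definition α = v / ((1 − IoU) + v), which this excerpt does not restate, so treat that choice as an assumption rather than the authors' exact formulation.

```python
import math
import torch

def eciou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """ECIoU loss per Equation (5); pred and target have shape (N, 4) as (x1, y1, x2, y2)."""
    # Intersection area and IoU
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_g, h_g = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    iou = inter / (w_p * h_p + w_g * h_g - inter + eps)

    # Smallest enclosing box: width c_w, height c_h, squared diagonal c^2
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Squared center distance rho^2(b, b^gt)
    cx_p, cy_p = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx_g, cy_g = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    rho2 = (cx_p - cx_g) ** 2 + (cy_p - cy_g) ** 2

    # CIoU aspect-ratio term alpha * v, Equations (2)-(4)
    v = (4 / math.pi ** 2) * (torch.atan(w_g / (h_g + eps)) - torch.atan(w_p / (h_p + eps))) ** 2
    with torch.no_grad():
        alpha = v / ((1 - iou) + v + eps)   # assumed standard CIoU weighting

    # EIoU width/height terms plus center and IoU terms, Equation (5)
    loss = (1 - iou) + alpha * v + rho2 / c2 \
        + (w_g - w_p) ** 2 / (cw ** 2 + eps) \
        + (h_g - h_p) ** 2 / (ch ** 2 + eps)
    return loss.mean()
```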
2.5. The improved YOLOv7-TINY network architecture

The YOLO (You Only Look Once) network is a popular real-time object detection algorithm. Its main idea is to treat object detection as a single regression problem, directly predicting the positions and classes of bounding boxes in the neural network. Due to the simple and efficient design of the YOLO series, it has become one of the preferred algorithms for real-time object detection tasks.
Using YOLOv7-TINY as the baseline, we replaced its backbone network with FasterNet, aiming to reduce redundant computation without sacrificing accuracy. Subsequently, we employed the multi-scale BIFUSSION module to better integrate contextual information. Before the prediction head, we separately added the RESNETCBAM attention mechanism after the ELAN module, allowing the network to better capture and represent important features, thus enhancing the model's performance and generalization ability. The YOLOv7-TINY (baseline) and the improved ResNetCBAM-FUSSION-YOLO network architectures are illustrated in Figures 7 and 8, respectively.
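Since the attention module itself is only named here, the following is a hedged sketch of a standard CBAM block (channel attention followed by spatial attention, as in the cited CBAM work) wrapped in a residual connection, which is what the name ResNetCBAM and Figure 6 suggest; it should not be read as the authors' exact module.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))   # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))    # global max pooling branch
        return torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)                  # channel-wise average map
        mx = x.amax(dim=1, keepdim=True)                   # channel-wise max map
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class ResNetCBAM(nn.Module):
    """CBAM (channel then spatial attention) applied to a feature map, with a
    residual connection around the attention path (assumed from the name)."""
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        out = x * self.ca(x)
        out = out * self.sa(out)
        return x + out   # residual add, assumed

if __name__ == "__main__":
    print(ResNetCBAM(256)(torch.randn(1, 256, 40, 40)).shape)
```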
Figure 7. YOLOV7-TINY
Figure 8. ResNetCBAM-FUSSION-YOLO
3. Results and Analysis

3.1. Environment configuration

The model was trained on a LINUX operating system with the PyTorch 1.12.0 framework. The server configuration included an Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz, 188GB of memory, and an NVIDIA A100 PCIe GPU with 80GB of memory, running CUDA driver version 12.2. The input image size was set to 640x640, with a batch size of 16 and 200 training epochs. The learning rate was set to 0.01, momentum to 0.937, and the weight decay for stochastic gradient descent (SGD) to 0.0005. For training and testing, PyCharm was installed on a local WINDOWS system and communicated with the server via a remote connection. The original images were divided into training, validation, and test sets in an 8:1:1 ratio. Additionally, data augmentation was applied to the training set to enhance the robustness of the experiment.
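The optimizer settings listed above can be reproduced with a few lines of PyTorch; the model object below is only a stand-in, since the full network is defined by Figures 7 and 8 rather than given as code.

```python
import torch
import torch.nn as nn

# Training hyperparameters as stated in Section 3.1.
IMG_SIZE, BATCH_SIZE, EPOCHS = 640, 16, 200  # would parameterize the data loader and loop

# Stand-in module; in the experiments this would be the improved
# YOLOv7-TINY (ResNetCBAM-FUSSION-YOLO) network.
model = nn.Conv2d(3, 16, 3)

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,              # initial learning rate
    momentum=0.937,       # SGD momentum
    weight_decay=0.0005,  # weight decay
)
```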
3.2. Model evaluation criteria

In this experiment, a comprehensive set of metrics was adopted to evaluate the performance of fish object detection.
Precision (P) represents the proportion of true positives among the samples predicted as positive. The definition is shown in Equation (6):

$P = \dfrac{TP}{TP + FP}$ (6)

Recall (R) represents the proportion of correctly predicted positive samples among all actual positive samples. The calculation formula is as follows:

$R = \dfrac{TP}{TP + FN}$ (7)

AP (average precision) refers to the average accuracy in object detection. It combines the model's performance under different precision and recall conditions and reflects the balance between precision and recall. AP is calculated as the area under the precision–recall (PR) curve, as shown in Equation (8):

$AP = \int_{0}^{1} P(R)\,dR$ (8)

The mAP metric evaluates the overall performance of the model across all categories. It is calculated by averaging the AP values of the different categories, as shown in Equation (9):

$mAP = \dfrac{\sum_{i=1}^{n} AP(i)}{n}$ (9)
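For reference, Equations (6)–(9) can be computed directly; the sketch below uses toy counts, a toy precision–recall curve, and trapezoidal integration for the area under the PR curve (the paper does not specify the numerical integration scheme, so that choice is an assumption).

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Equations (6) and (7): P = TP/(TP+FP), R = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Equation (8): area under the precision-recall curve (trapezoidal rule)."""
    order = np.argsort(recall)
    return float(np.trapz(precision[order], recall[order]))

def mean_average_precision(ap_per_class: list) -> float:
    """Equation (9): mean of the per-class AP values."""
    return float(np.mean(ap_per_class))

if __name__ == "__main__":
    p, r = precision_recall(tp=86, fp=6, fn=14)      # toy counts, not the paper's data
    print(f"P={p:.3f}, R={r:.3f}")
    rec = np.linspace(0, 1, 11)
    prec = np.clip(1.0 - 0.3 * rec, 0, 1)            # toy PR curve
    print("AP =", average_precision(rec, prec))
    print("mAP =", mean_average_precision([0.92]))   # single-class case
```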
3.3. Experiment

3.3.1. Data Preparation

The data images were collected from Időkép, and 1430 images with a resolution of 1920x1080 were manually annotated using the LabelImg software. To ensure the diversity and completeness of the data, we selected images captured during multiple different time periods. Through manual selection, we ensured that the images have complete boundaries and the necessary clarity to guarantee the quality of the data. Data augmentation was applied to the training set to enhance the robustness of the experiment.

3.3.2. Analysis of the results of the ResNetCBAM-FUSSION-YOLO experiment

In this experiment, a batch size of 16 was used, meaning that 16 images were processed in each training iteration. The YOLOV7 network provides good visualization support: after model training is completed, test_batch_label.jpg is generated to show the ground-truth bounding boxes of a validation batch, and test_batch_pred.jpg is generated to show the predicted bounding boxes for that batch. The predicted results on the validation set during model training are shown in Figure 9.
3.3.3. Comparison of results for different IoU loss functions

According to Table 1, we employed four loss functions, namely SIOU, EIOU, CIOU, and ECIOU, on the same model and compared their training times. As indicated by the table, the training time for ECIOU is approximately 1.1 hours for 200 epochs, which is notably shorter than that of the other three methods and significantly enhances practical training efficiency. Additionally, as shown in Table 2, with the ECIOU loss function the ResNetCBAM-FUSSION-YOLO model achieves an mAP@0.5 that is 1.2%, 2.6%, and 1.1% higher than with the SIOU, CIOU, and EIOU loss functions, respectively, and an mAP@0.5:0.95 that is 3.3%, 3%, and 1.7% higher, respectively. Both precision and recall also perform best with the ECIOU loss function among the four. Consequently, we selected ECIOU as the loss function for the ResNetCBAM-FUSSION-YOLO network model.

Table 1. Loss function and training time

IOU      Training Time/h
SIOU     1.896
EIOU     1.61
CIOU     1.893
ECIOU    1.113

Table 2. Test results for different loss functions

MODEL    mAP@0.5/%   mAP@0.5:0.95/%   P/%    R/%
SIOU     91.4        49.5             91.7   85.9
CIOU     90          49.8             92.1   85.5
EIOU     91.5        51.1             92.1   85.7
ECIOU    92.6        52.8             94.1   86.2
In summary, the improved YOLOv7-TINY model achieves better detection accuracy than the original model, with the mAP@0.5 increasing from 90.6% to 92.6%, indicating that the proposed network model offers higher detection precision. The computational complexity is reduced by 16.7%, the number of parameters decreases by 11.5%, and the model training time is shortened by 49.2%. This reduction in hardware resource requirements facilitates model deployment on mobile devices and provides a reference for intelligent aquaculture applications.

3.3.4. Comparative Experiments with Various Mainstream Models

To demonstrate the superior performance of the proposed model on lightweight devices for the detection of fish schools, Table 3 compares the improved network proposed in this paper with mainstream object recognition networks. In Table 3, our model's P and R are both superior to those of the other algorithms listed, indicating good model performance. In terms of mAP@0.5, our method shows improvements of 3.5% and 9.3% over SSD and Faster-RCNN, respectively, and surpasses YOLOV3, YOLOV4, and YOLOV5s by 1.8%, 3.4%, and 2.3%, respectively. In terms of mAP@0.5:0.95, it surpasses SSD and Faster-RCNN by 2.3% and 6.2%, respectively, and YOLOV4 and YOLOV5s by 1% and 0.8%, respectively. Our model's computation and parameter counts are the smallest in the table, at only 11 GFLOPs and 5.32M, respectively, meaning that it consumes minimal hardware resources when deployed while maintaining high detection accuracy. Additionally, in terms of detection speed, our model achieves 85.9 FPS, surpassing the inference speeds of the other mainstream networks. Figure 10 compares mAP@0.5 and FPS for the different models.
Table 3. Comparative Experiment

MODEL         mAP@0.5/%   mAP@0.5:0.95/%   FPS     P/%    R/%    Parameters/M   GFLOPs/G
YOLOV5s       90.3        52               58.62   93.2   85.5   7.01           15.8
YOLOV4        89.2        51.8             40      92.6   84.4   9.11           20.6
YOLOV3        90.8        53.8             49.4    93.3   85.3   6.15           154.5
Faster-RCNN   83.3        46.6             38.7    67     85.3   28.47          470.1
SSD           89.1        50.5             47.1    90     83.8   26.15          31.4
Our model     92.6        52.8             85.9    94.1   86.2   5.32           11
Figure 10. Comparison of mAP@0.5 and FPS for the different models
3.3.6. ResNetCBAM-FUSSION-YOLO test results

After multiple iterations of deep learning network training and fine-tuning on the validation set, we selected the best models and loaded them into the computer to obtain inference results on the test set. Among them, we selected three models, namely ResNetCBAM-FUSSION-YOLO, YOLOV7, and YOLOV7-Tiny, to detect fish schools, as shown in Figures 11, 12, and 13. It can be observed that, compared to YOLOV7 and YOLOV7-Tiny, ResNetCBAM-FUSSION-YOLO can detect all fish schools in the images, regardless of multi-scale or dense regions, even under turbid water and lighting conditions, reducing the missed detection rate. Overall, the ResNetCBAM-FUSSION-YOLO network can rapidly, accurately, and comprehensively detect fish schools in harsh environments. This is of significant importance for understanding the growth status of aquaculture fish and achieving precise feeding.
populations that need to be detected in aquaculture and different scenarios of aquaculture environments. We aim to establish a digital fishery system by integrating sensors such as dissolved oxygen and pH meters, to achieve a more accurate and comprehensive assessment of fish growth conditions. The goal is to improve fisheries production efficiency and optimize resource utilization. Additionally, we will enhance the corresponding detection algorithms to find a better balance between speed and accuracy.
Overall, the proposed sea bream group object detection model demonstrates high efficiency and is suitable for deployment on mobile devices, providing strong technical support for aquaculture technology.

Acknowledgment

Funding: This work was supported in part by the National Natural Science Foundation of China under Grant 62175037, in part by the Huzhou Key R&D Program Agricultural “Double Strong” Special Project (No. 2022ZD2060), in part by the Zhejiang-French Digital Monitoring Lab for Aquatic Resources and Environment, Department of Science and Technology of Zhejiang Province, and in part by the Huzhou Key Laboratory of Waters Robotics Technology (2022-3), Huzhou Science and Technology Bureau.
References

[1] Xu Hai, Xie Hongtao, and Zhang Yongdong, "Advances in Visual Domain Generalization Techniques and Research," Journal of Guangzhou University (Natural Science Edition), vol. 21, no. 2, pp. 42–59, 2022.
[2] N. J. C. Strachan, P. Nesvadba, and A. R. Allen, "Fish species recognition by shape analysis of images," Pattern Recognition, vol. 23, no. 5, pp. 539–544, 1990, doi: 10.1016/0031-3203(90)90074-U.
[3] N. Castignolles, M. Cattoen, and M. Larinier, "Identification and counting of live fish by image analysis," presented at IS&T/SPIE 1994 International Symposium on Electronic Imaging: Science and Technology, S. A. Rajala and R. L. Stevenson, eds., San Jose, CA, March 1994, pp. 200–209, doi: 10.1117/12.171067.
[4] D.-J. Lee, R. B. Schoenberger, D. Shiozawa, X. Xu, and P. Zhan, "Contour matching for a fish recognition and migration-monitoring system," presented at Optics East, K. G. Harding, ed., Philadelphia, PA, December 2004, p. 37, doi: 10.1117/12.571789.
[5] Ding Shunrong and Xiao Ke, "Research on fish classification method based on particle swarm optimization SVM and multi-feature fusion," Chinese Journal of Agricultural Mechanization, vol. 41, no. 11, pp. 113–118, 170, 2020, doi: 10.13733/j.jcam.issn.2095-5553.2020.11.018.
[6] Yao Runlu, Gui Yongwen, and Huang Qiugui, "Freshwater fish species recognition based on machine vision," Journal of Microcomputers and Applications, vol. 36, no. 24, pp. 37–39, 2017, doi: 10.19358/j.issn.1674-7720.2017.24.011.
[7] P. Cisar, D. Bekkozhayeva, O. Movchan, M. Saberioon, and R. Schraml, "Computer vision based individual fish identification using skin dot pattern," Sci. Rep., vol. 11, no. 1, p. 16904, August 2021, doi: 10.1038/s41598-021-96476-4.
[8] R. B. Dala-Corte, J. B. Moschetta, and F. G. Becker, "Photo-identification as a technique for recognition of individual fish: a test with the freshwater armored catfish Rineloricaria aequalicuspis Reis & Cardoso, 2001 (Siluriformes: Loricariidae)," Neotrop. Ichthyol., vol. 14, no. 1, 2016, doi: 10.1590/1982-0224-20150074.
[9] Chen Feifen, "Research and application of water meter reading recognition based on deep learning," Master's thesis, Guilin University of Electronic Technology, 2022, doi: 10.27049/d.cnki.ggldc.2021.000020.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Commun. ACM, vol. 60, no. 6, pp. 84–90, May 2017, doi: 10.1145/3065386.
[11] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv, 2015. [Online]. Available: http://arxiv.org/abs/1409.1556
[12] W. Zaremba, I. Sutskever, and O. Vinyals, "Recurrent Neural Network Regularization," arXiv, 2015. [Online]. Available: http://arxiv.org/abs/1409.2329
[13] K. Greff, R. K. Srivastava, J. Koutnik, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A Search Space Odyssey," IEEE Trans. Neural Netw. Learning Syst., vol. 28, no. 10, pp. 2222–2232, October 2017, doi: 10.1109/TNNLS.2016.2582924.
[14] I. Goodfellow et al., "Generative Adversarial Nets."
[15] C. Szegedy et al., "Going deeper with convolutions," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, June 2015, pp. 1–9, doi: 10.1109/CVPR.2015.7298594.
[16] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, June 2016, pp. 770–778, doi: 10.1109/CVPR.2016.90.
[17] A. G. Howard et al., "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," arXiv, 2017. [Online]. Available: http://arxiv.org/abs/1704.04861
[18] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted Residuals and Linear Bottlenecks," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, June 2018, pp. 4510–4520, doi: 10.1109/CVPR.2018.00474.
[19] A. Howard et al., "Searching for MobileNetV3," in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), October 2019, pp. 1314–1324, doi: 10.1109/ICCV.2019.00140.
[20] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely Connected Convolutional Networks," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[21] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation."
[22] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, June 2017, doi: 10.1109/TPAMI.2016.2577031.
[23] K. He, G. Gkioxari, P. Dollar, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969. [Online]. Available: https://openaccess.thecvf.com/content_iccv_2017/html/He_Mask_R-CNN_ICCV_2017_paper.html
[24] W. Liu et al., "SSD: Single Shot MultiBox Detector," in Computer Vision – ECCV 2016, Lecture Notes in Computer Science, vol. 9905, Cham: Springer International Publishing, 2016, pp. 21–37, doi: 10.1007/978-3-319-46448-0_2.
[25] J. Redmon and A. Farhadi, "YOLOv3: An Incremental Improvement," arXiv, 2018, doi: 10.48550/arXiv.1804.02767.
[26] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "YOLOv4: Optimal Speed and Accuracy of Object Detection," arXiv, 2020, doi: 10.48550/arXiv.2004.10934.
[27] X. Zhu, S. Lyu, X. Wang, and Q. Zhao, "TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios," in 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, BC, Canada, October 2021, pp. 2778–2788, doi: 10.1109/ICCVW54120.2021.00312.
[28] C. Li et al., "YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications," arXiv, 2022, doi: 10.48550/arXiv.2209.02976.
[29] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, "YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors," in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, June 2023, pp. 7464–7475, doi: 10.1109/CVPR52729.2023.00721.
[30] S. Zhao, J. Zheng, S. Sun, and L. Zhang, "An Improved YOLO Algorithm for Fast and Accurate Underwater Object Detection," Symmetry, vol. 14, no. 8, art. no. 8, August 2022, doi: 10.3390/sym14081669.
[31] Y. Li, X. Bai, and C. Xia, "An Improved YOLOV5 Based on Triplet Attention and Prediction Head Optimization for Marine Organism Detection on Underwater Mobile Platforms," JMSE, vol. 10, no. 9, p. 1230, September 2022, doi: 10.3390/jmse10091230.
[32] X. Zhai, H. Wei, Y. He, Y. Shang, and C. Liu, "Underwater Sea Cucumber Identification Based on Improved YOLOv5," Applied Sciences, vol. 12, no. 18, p. 9105, September 2022, doi: 10.3390/app12189105.
[33] A. Markus, G. Kecskemeti, and A. Kertesz, "Flexible Representation of IoT Sensors for Cloud Simulators," in 2017 25th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), March 2017, pp. 199–203, doi: 10.1109/PDP.2017.87.
[34] J. Chen et al., "Run, Don't Walk: Chasing Higher FLOPS for Faster Neural Networks," arXiv. [Online]. Available: https://arxiv.org/abs/2303.03667v3
[35] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6848–6856. [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2018/html/Zhang_ShuffleNet_An_Extremely_CVPR_2018_paper.html
[36] K. Han, Y. Wang, Q. Tian, J. Guo, C. Xu, and C. Xu, "GhostNet: More Features From Cheap Operations," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 1580–1589. [Online]. Available: https://openaccess.thecvf.com/content_CVPR_2020/html/Han_GhostNet_More_Features_From_Cheap_Operations_CVPR_2020_paper.html
[37] X. Glorot, A. Bordes, and Y. Bengio, "Deep Sparse Rectifier Neural Networks," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, June 2011, pp. 315–323. [Online]. Available: https://proceedings.mlr.press/v15/glorot11a.html
[38] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, "CBAM: Convolutional Block Attention Module," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3–19. [Online]. Available: https://openaccess.thecvf.com/content_ECCV_2018/html/Sanghyun_Woo_Convolutional_Block_Attention_ECCV_2018_paper.html
[39] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533–536, October 1986, doi: 10.1038/323533a0.
[40] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, "Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 658–666. [Online]. Available: https://openaccess.thecvf.com/content_CVPR_2019/html/Rezatofighi_Generalized_Intersection_Over_Union_A_Metric_and_a_Loss_for_CVPR_2019_paper.html
[41] Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, and D. Ren, "Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, April 2020, doi: 10.1609/aaai.v34i07.6999.
[42] Z. Zheng et al., "Enhancing Geometric Factors in Model Learning and Inference for Object Detection and Instance Segmentation," IEEE Transactions on Cybernetics, vol. 52, no. 8, pp. 8574–8586, August 2022, doi: 10.1109/TCYB.2021.3095305.