Keywords: Coal-gangue detection; Swin Transformer; Improved path aggregation feature pyramid network; Task-aligned head; Object detection

Abstract: Coal-gangue object detection has attracted substantial attention because it is at the core of realizing vision-based intelligent and green coal separation. However, most existing studies have focused on laboratory datasets and prioritized lightweight models, which makes coal-gangue object detection challenging to adapt to the complex and harsh scenes of real production environments. Therefore, our project collected and labeled image datasets of coal and gangue under real production conditions at a coal preparation plant. We then designed a one-stage object detection model, named STATNet, following the "backbone-neck-head" architecture, with the aim of enhancing detection accuracy in industrial coal preparation scenarios. The proposed model utilizes the Swin Transformer as its backbone module to extract multi-scale features, an improved path aggregation feature pyramid network (iPAFPN) as its neck module to enrich feature fusion, and a task-aligned head (TAH) as its head module to mitigate conflicts and misalignments between the classification and localization tasks. Experimental results on a real-world industrial dataset demonstrate that the proposed STATNet model achieves an impressive AP50 of 89.27 %, significantly surpassing several state-of-the-art baseline models by 2.02 % to 5.58 %. Additionally, it exhibits stronger robustness in resisting image corruption and perturbation. These findings demonstrate its promising prospects in practical coal and gangue separation applications.

List of abbreviations: AI, Artificial intelligence; ANN, Artificial neural network; CNN, Convolutional neural network; CV, Computer vision; FLOPs, Floating-point operations; FPN, Feature pyramid network; FPS, Frames per second; GAP, Global average pooling; HOG, Histogram of oriented gradients; IoU, Intersection over union; iPAFPN, Improved path aggregation feature pyramid network; LN, Layer normalization; MLP, Multilayer perceptron; MSA, Multi-head self-attention; NLP, Natural language processing; RF, Random forest; SIFT, Scale-invariant feature transform; SSD, Single shot multibox detector; SVM, Support vector machines; TAH, Task-aligned head; TAP, Task alignment predictor; YOLO, You only look once.

* Corresponding author at: School of Chemical Engineering and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China.
E-mail address: heshengyu@cumt.edu.cn (H. Yu).
https://doi.org/10.1016/j.egyai.2024.100388
Such a situation is unacceptable because part of the coal may be misjudged as gangue and discharged from the system, resulting in substantial economic losses. Third, coal and gangue are often densely arranged and appear very similar to the conveyor belt background in practice, which interferes with the localization and classification tasks of the models during detection.

To fill the above research gap, we carefully designed a one-stage object detection model following the "backbone-neck-head" architecture, termed STATNet, with the aim of enhancing the performance of coal-gangue detection in real industrial coal beneficiation. The proposed STATNet model utilizes the Swin Transformer as the backbone, an improved Path Aggregation Feature Pyramid Network (iPAFPN) as the neck, and a Task-aligned Head (TAH) for classification and localization. The model incorporates cutting-edge algorithms for the three key modules (backbone, neck, and head) to maximize the accuracy of coal-gangue detection on a real-world industrial dataset. The contributions and innovations of this work are summarized as follows:

(1) Industrial dataset construction: Our research collected coal and gangue images from a real industrial production site and constructed an object dataset for the detection of coal and gangue. The dataset faithfully reflects the challenges of object detection for coal and gangue under the harsh environment of a coal preparation plant. This contribution significantly enriches and promotes vision-based coal separation.

(2) Innovative feature extraction: The present research is the first to introduce the Swin Transformer, instead of the commonly used CNN, as the backbone network for coal-gangue detection. This approach boosts the model's ability to extract global features of images, with a negligible increase in computational complexity.

(3) Enhanced feature fusion: We use an improved PAFPN as the neck module for efficient feature fusion to better adapt to the size characteristics of coal and gangue, focusing more on large and medium objects.

(4) Collaborative learning: In existing studies on coal and gangue detection, the classification and localization tasks of the heads are independent of each other and lack interaction. In this work, we applied TAH to facilitate collaborative learning between the two tasks, resulting in more accurate classification and localization predictions.

(5) Effective model integration: The proposed model combines the advantages of the three advanced modules (i.e., backbone, neck, head), and its detection accuracy for coal and gangue in real industrial scenarios outperforms the current mainstream models.

In summary, our proposed model employs advanced AI deep learning techniques for more efficient and cleaner utilization of coal resources, contributing to global efforts in reducing carbon emissions and promoting sustainable development.

Recent research on one-stage object detection has focused on the design and improvement of the backbone, neck, and head modules.

The backbone module is one of the core components of object detection models. Coal and gangue images are typically fed into hierarchical CNNs such as VGG [15], ResNet [18], Inception [32], and EfficientNet [33] to extract features at different scales. The shallow-level features include the edge and texture features of coal and gangue, while the deep-level features abstractly represent characteristics such as the shape of the objects. Although CNN-based backbones are predominant in the field of CV, their capability of modelling global information is restricted by local weight sharing. Additionally, the fixed receptive fields at specific levels make it challenging to handle inputs with various scales and resolutions. Therefore, they have restricted performance when dealing with significant variations in coal and gangue granularity [34]. Recently, Transformer architectures, originally designed for natural language processing (NLP) and large language model (such as ChatGPT) tasks, have proven equally effective for visual tasks [35,36]. They exhibit excellent spatial and global modeling capabilities owing to their self-attention mechanism [37,38]. For these reasons, our research project was inspired to apply the promising Swin Transformer as the backbone network for coal-gangue detection.

The neck module acts as a bridge between the backbone and the head; it accepts and integrates feature maps from different levels of the backbone to acquire richer feature information, thereby enhancing the model's recognition capabilities. Many prior studies have demonstrated that feature pyramid representations have become the cornerstone of neck modules for multi-scale feature processing [39]. One of the most representative neck modules, the feature pyramid network (FPN), constructs the feature pyramid through a top-down pathway with lateral connections, which enables the sharing of rich high-semantic information for both shallow and deep feature maps [40]. The PAFPN, which builds upon FPN by introducing a bottom-up pathway with lateral connections to further enhance feature fusion, can better capture multi-scale information. In this work, we further improve the original PAFPN (i.e., iPAFPN) by adding additional down-sampling operations after feature fusion to introduce deeper-scale feature maps, facilitating the detection of large and oversized coal and gangue objects.

The head module is another crucial component, responsible for the final detection, including object classification and bounding box regression. Current mainstream detection models, such as SSD and RetinaNet, typically employ convolutional layers, pooling layers, or fully connected layers to accomplish multi-scale detection tasks [41]. However, the classification and localization tasks are often implemented independently by two branches in parallel, which can potentially lead to misalignment of detection results [42]. In this work, we introduced a Task-Aligned Head (TAH) to enhance the interaction between the classification and localization tasks.

In addition to the design of the model architecture, data augmentation plays an important role in improving the accuracy of object detection models. It aims to generate diverse new images by transforming or processing the original images to expand the training dataset, thereby improving the model's generalization ability and robustness. Common data augmentation strategies include flipping, rotation, cropping, and brightness and contrast adjustments. Recently, Mosaic and Mixup data augmentation have been widely used in CV tasks. The former merges four randomly cropped small images into one large mosaic image, which is then used as a training sample to enhance the model's detection ability in complex backgrounds; the latter mixes two or more original samples to generate new training samples, thereby reducing the risk of overfitting [43]. They have the potential to improve the accuracy of coal-gangue object detection.
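To make the two strategies concrete, the following is a minimal NumPy sketch of Mixup and Mosaic applied to raw images; the function names, the Beta-distributed mixing ratio, and the nearest-neighbour tile resize are our illustrative choices, and bounding-box bookkeeping is omitted:

```python
import numpy as np

def mixup(img_a: np.ndarray, img_b: np.ndarray, alpha: float = 1.0):
    """Blend two (H, W, 3) images pixel-wise; the labels of both images are
    kept, with loss contributions weighted by lam and (1 - lam)."""
    lam = np.random.beta(alpha, alpha)   # mixing ratio from Beta(alpha, alpha)
    mixed = lam * img_a.astype(np.float32) + (1.0 - lam) * img_b.astype(np.float32)
    return mixed.astype(np.uint8), lam

def mosaic(imgs: list, size: int = 512) -> np.ndarray:
    """Stitch four images onto one size x size canvas, one per quadrant.
    In a full pipeline, boxes would be shifted/clipped into the new frame."""
    assert len(imgs) == 4
    half = size // 2
    canvas = np.zeros((size, size, 3), dtype=np.uint8)
    offsets = [(0, 0), (0, half), (half, 0), (half, half)]
    for img, (y, x) in zip(imgs, offsets):
        # naive nearest-neighbour resize of each tile to (half, half)
        ys = np.linspace(0, img.shape[0] - 1, half).astype(int)
        xs = np.linspace(0, img.shape[1] - 1, half).astype(int)
        canvas[y:y + half, x:x + half] = img[ys][:, xs]
    return canvas
```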
3. Methodology

Fig. 1 illustrates the overall structure of the proposed STATNet. It follows the common "backbone-neck-head" principle used in one-stage detectors [44]. The pre-processed images are first input into the Swin Transformer backbone network, undergoing four stages of resolution reduction (4x, 8x, 16x, and 32x down-sampling, respectively) to generate multi-scale feature maps. The feature maps from the first stage (S1) lack sufficient abstract semantic information; we therefore feed only the feature maps obtained at the second to fourth stages (S2-S4) to the iPAFPN module, saving memory footprint and improving computational efficiency. The feature information of different scales is transmitted and aggregated, resulting in a robust ability to handle objects of different sizes in complex contexts. In the final stage, the features obtained by the iPAFPN are collaboratively used to calculate the classification probabilities and adjust the localization bounding boxes through the TAH.
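The overall data flow can be summarized by the following schematic PyTorch sketch; the class and argument names are illustrative placeholders rather than identifiers from a released implementation:

```python
import torch
from torch import nn

class STATNetSketch(nn.Module):
    """Schematic 'backbone-neck-head' data flow; each stage is a placeholder."""
    def __init__(self, backbone: nn.Module, neck: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone, self.neck, self.head = backbone, neck, head

    def forward(self, images: torch.Tensor):
        # Swin backbone: four stages at 1/4, 1/8, 1/16, and 1/32 resolution
        s1, s2, s3, s4 = self.backbone(images)
        # only S2-S4 are passed on, saving memory as described above
        fused = self.neck([s2, s3, s4])   # iPAFPN multi-scale fusion
        return self.head(fused)           # TAH: class scores + boxes
```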
3.2. Backbone: Swin Transformer

Fig. 2 takes the small version of the Swin Transformer (Swin-S) as an example to demonstrate how it extracts hierarchical multi-scale features. First, the Patch Partition module divides the image into multiple patches (each consisting of 4 x 4 adjacent pixels), and these patches are then unfolded along the channel dimension. Through a linear layer, i.e., the Linear Embedding module, the channel dimensions are further transformed. Essentially, the Patch Partition and Linear Embedding modules jointly convert the RGB image of H x W x 3 dimensions into patches of H/4 x W/4 x C dimensions; C is set to 96 for Swin-S.
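In common Transformer implementations, these two modules are realized together as a single 4 x 4 convolution with stride 4 followed by layer normalization; a minimal sketch, assuming Swin-S with C = 96:

```python
import torch
from torch import nn

# Patch partition + linear embedding in one step: a 4x4 convolution with
# stride 4 maps (B, 3, H, W) to (B, C, H/4, W/4); C = 96 for Swin-S.
patch_embed = nn.Conv2d(in_channels=3, out_channels=96, kernel_size=4, stride=4)
norm = nn.LayerNorm(96)

x = torch.randn(1, 3, 224, 224)              # dummy RGB input
tokens = patch_embed(x)                      # (1, 96, 56, 56)
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 56*56, 96) token sequence
tokens = norm(tokens)
print(tokens.shape)                          # torch.Size([1, 3136, 96])
```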
Next, feature maps of different scales are generated by the four stages. As illustrated in Fig. 2(c), except for the first stage, the remaining three stages employ Patch Merging layers to achieve down-sampling, like pooling layers in CNNs, creating feature maps of different sizes.

Finally, multi-head self-attention (MSA) with regular windowing (W-MSA) and with shifted windowing (SW-MSA) configurations is executed within the Swin Transformer blocks to achieve long-range information interaction.

Fig. 2(d) shows the core blocks of the Swin Transformer. They are similar to the original Transformer module [45], but with one key difference: the MSA mechanism is replaced with W-MSA or SW-MSA. Specifically, as shown in Fig. 3, W-MSA divides the feature map into multiple non-overlapping windows; each feature within a window only computes self-attention with the other features residing in the same window. This significantly reduces the computational cost, allowing the transformer-based architecture to be used efficiently for visual tasks. However, W-MSA still has a limitation: information between windows is isolated, restricting the transmission of global information. Therefore, SW-MSA introduces a shifted-window design, enabling interactions between different windows. Partitioning shifted windows establishes connections between adjacent non-overlapping windows of the preceding layer, which can effectively enhance the model's performance in visual tasks [27].

The calculation process of two successive Swin Transformer blocks is expressed as:

\hat{z}^{l} = \text{W-MSA}(\text{LN}(z^{l-1})) + z^{l-1}   (1)

z^{l} = \text{MLP}(\text{LN}(\hat{z}^{l})) + \hat{z}^{l}   (2)

\hat{z}^{l+1} = \text{SW-MSA}(\text{LN}(z^{l})) + z^{l}   (3)

z^{l+1} = \text{MLP}(\text{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1}   (4)

where \hat{z}^{l} and z^{l} are the outputs of the (S)W-MSA module and of a 2-layer multilayer perceptron (MLP) with GELU non-linearity, respectively, and LN denotes layer normalization. More details of the Swin Transformer can be found in Ref. [46].
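The pre-norm residual structure of Eqs. (1)-(4) can be sketched as follows. To keep the snippet self-contained, plain multi-head attention stands in for the windowed (S)W-MSA; the window partitioning and cyclic shifting of the real blocks are omitted:

```python
import torch
from torch import nn

class SwinBlockSketch(nn.Module):
    """Pre-norm residual structure of Eqs. (1)-(2); a real Swin block uses
    windowed (W-MSA) or shifted-window (SW-MSA) attention instead of the
    plain multi-head attention used here as a stand-in."""
    def __init__(self, dim: int = 96, heads: int = 3, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(                      # 2-layer MLP with GELU
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = self.norm1(z)
        z_hat = self.attn(h, h, h, need_weights=False)[0] + z   # Eq. (1)/(3)
        return self.mlp(self.norm2(z_hat)) + z_hat              # Eq. (2)/(4)

blocks = nn.Sequential(SwinBlockSketch(), SwinBlockSketch())    # W-MSA, then SW-MSA
out = blocks(torch.randn(1, 3136, 96))                          # (B, tokens, dim)
```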
Fig. 2. (a) The overall structure of the small version of the Swin Transformer (Swin-S); (b) the Patch Partition and Linear Embedding modules; (c) the Patch Merging module; (d) two successive Swin Transformer blocks.

Fig. 3. Illustration of different self-attention mechanisms: (a) original self-attention (quadratic computational complexity with respect to image size); (b) W-MSA computation within a single window; (c) SW-MSA computation across windows.

3.3. Neck: improved path aggregation feature pyramid network

PAFPN has been proven to be an important module for improving the performance of object detection tasks. Fig. 1 shows that it facilitates the flow and integration of semantic information between different scales
through top-down, bottom-up, and lateral connections, thereby enriching multi-scale features. We improve PAFPN by further down-sampling at the top level of the original PAFPN output to strengthen the detection ability for large objects.

Fig. 4. The blocks of (a) top-down and (b) bottom-up path augmentation.

Fig. 4 shows the details of the top-down and bottom-up path augmentation used to generate N2 to N4 in PAFPN. Specifically, in the top-down pathway, the high-level feature map P_{i+1} is up-sampled by a factor of 2 using nearest-neighbor up-sampling to match the resolution of the adjacent lower-level feature map. The feature map from the backbone, S_i, which has the same resolution as the up-sampled feature map, undergoes a 1 x 1 convolution operation to ensure the consistency of channels. Then, the up-sampled high-level feature map and the adjusted lower-level feature map are element-wise summed to produce the fused feature map P_i. These operations are important for the propagation of semantic information from the high-level features to the low-level ones, enhancing the representation of the fused feature map. The bottom-up pathway, on the other hand, operates in the opposite direction. It reduces the resolution of the low-level feature map N_i by a factor of 2 using a 3 x 3 convolution with a stride of 2, and then performs the same lateral connection to generate the higher-level feature map N_{i+1}. For N5 and N6, we use max pooling with a stride of 2 to obtain more levels on top of the outputs. More details can be found in Refs. [40,47].
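A self-contained sketch of these pathways is given below; the channel counts, level sizes, and module names are illustrative and not the exact configuration of STATNet:

```python
import torch
import torch.nn.functional as F
from torch import nn

class IPAFPNSketch(nn.Module):
    """One top-down and one bottom-up pass over three backbone levels,
    with extra max-pooled levels N5/N6 appended, following Fig. 4."""
    def __init__(self, in_chs=(192, 384, 768), out_ch=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in in_chs)
        self.down = nn.ModuleList(nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)
                                  for _ in range(2))

    def forward(self, feats):                       # feats = [S2, S3, S4]
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # top-down: up-sample by 2 (nearest) and element-wise sum -> P2..P4
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], scale_factor=2, mode="nearest")
        # bottom-up: stride-2 3x3 conv plus lateral sum -> N2..N4
        outs = [laterals[0]]
        for i in range(len(laterals) - 1):
            outs.append(self.down[i](outs[-1]) + laterals[i + 1])
        # extra levels N5, N6 from stride-2 max pooling of the top output
        outs.append(F.max_pool2d(outs[-1], kernel_size=1, stride=2))
        outs.append(F.max_pool2d(outs[-1], kernel_size=1, stride=2))
        return outs                                 # [N2, N3, N4, N5, N6]

neck = IPAFPNSketch()
s2, s3, s4 = (torch.randn(1, c, s, s) for c, s in
              [(192, 64), (384, 32), (768, 16)])
print([n.shape[-1] for n in neck([s2, s3, s4])])    # [64, 32, 16, 8, 4]
```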
3.4. Head: task-aligned head (TAH)

Fig. 5 illustrates a schematic of the head module. It calculates task-interactive features and performs predictions through two Task Alignment Predictors (TAP).

Fig. 5. Illustration of the head module: (a) TAH; (b) TAP.

First, the features obtained from the neck module, X^{iPAFPN} \in \mathbb{R}^{H \times W \times C}, are fed into N consecutive convolution layers within a single branch, resulting in a group of stacked task-interaction features X_{1-N}^{inter}, which can be expressed as:

X_{k}^{inter} = \begin{cases} \delta(\text{conv}_{k}(X^{iPAFPN})), & k = 1 \\ \delta(\text{conv}_{k}(X_{k-1}^{inter})), & k > 1 \end{cases} \quad \forall k \in \{1, 2, \ldots, N\}   (5)

where conv_k and \delta denote the k-th convolution layer and the ReLU activation function, respectively; N is set to 6 in our experiments. This process not only aids the interaction between the classification and localization tasks, but also promotes the acquisition of rich multi-scale information with different receptive fields.

Then, a layer attention mechanism is used to dynamically compute the features of a specific task at the layer level to encourage task decomposition:

X_{k}^{task} = w_{k} \cdot X_{k}^{inter}, \quad \forall k \in \{1, 2, \ldots, N\}   (6)

where w_k denotes the k-th element of the layer attention weights w \in \mathbb{R}^{N}, which are calculated from the cross-layer task-interaction features X_{k}^{inter} [48]:

x^{inter} = \text{GAP}(\text{Cat}(X_{k}^{inter}))   (7)

w = \sigma(\text{FC}_{2}(\delta(\text{FC}_{1}(x^{inter}))))   (8)

where GAP and Cat denote the Global Average Pooling and Concatenate operations, respectively; \sigma is the Sigmoid activation function; FC_1 and FC_2 are two fully connected layers. This is similar to the channel-wise attention calculation in SENet [49]. Finally, the results of classification and localization are predicted from X^{task}:

Z^{task} = \text{conv}_{2}(\delta(\text{conv}_{1}(X^{task})))   (9)

where X^{task} is obtained by concatenating the X_{k}^{task}; conv_1 is a 1 x 1 convolution layer for dimension reduction, and conv_2 is a common 1 x 1 convolution layer. Z^{task} is then converted into classification scores P \in \mathbb{R}^{H \times W \times 2} or bounding box regressions B \in \mathbb{R}^{H \times W \times 4}.

During the prediction step, the distributions of P and B are further adjusted. Specifically, a probability spatial map M \in \mathbb{R}^{H \times W \times 1} is applied to adjust the classification prediction, while a spatial offset map O \in \mathbb{R}^{H \times W \times 8} is used to adjust the localization prediction:

P^{align} = \sqrt{P \times M}   (10)

B^{align}(i, j, c) = B(i + O(i, j, 2c), \; j + O(i, j, 2c + 1), \; c), \quad \forall c \in \{0, 1, 2, 3\}   (11)
where M and O are learned from the stack of task-interaction features X_{1-N}^{inter}, as further expressed by Eqs. (12) and (13); the index (i, j, c) represents the (i, j)-th spatial location in the c-th channel.

M = \sigma(\text{conv}_{2}(\delta(\text{conv}_{1}(X^{inter}))))   (12)

O = \text{conv}_{4}(\delta(\text{conv}_{3}(X^{inter})))   (13)

where conv_1 and conv_3 are 1 x 1 convolution layers for dimension reduction, and conv_2 and conv_4 are common 3 x 3 convolution layers.

For the loss functions, we use Quality Focal loss [50] for the classification task and GIoU loss [51] for the localization task. More details about the TAH can be found in Ref. [42].
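The following sketch assembles Eqs. (5)-(10) and (12) for the classification branch of a TAP; tensor sizes and reduction ratios are illustrative, and the localization branch, which analogously predicts the offset map O of Eqs. (11) and (13), is omitted:

```python
import torch
from torch import nn

class TAPSketch(nn.Module):
    """Classification-branch TAP: task-interaction stack (Eq. 5), layer
    attention (Eqs. 6-8), prediction (Eq. 9), and the spatial alignment
    map M (Eqs. 10, 12)."""
    def __init__(self, ch: int = 256, n: int = 6, num_classes: int = 2):
        super().__init__()
        self.inter = nn.ModuleList(
            nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
            for _ in range(n))                              # Eq. (5)
        self.fc1 = nn.Linear(n * ch, n * ch // 4)           # Eq. (8): FC1
        self.fc2 = nn.Linear(n * ch // 4, n)                # Eq. (8): FC2 -> w
        self.conv1 = nn.Conv2d(n * ch, ch, 1)               # Eq. (9): reduce
        self.conv2 = nn.Conv2d(ch, num_classes, 1)          # Eq. (9): predict
        self.conv_m = nn.Sequential(                        # Eq. (12): M
            nn.Conv2d(n * ch, ch // 4, 1), nn.ReLU(),
            nn.Conv2d(ch // 4, 1, 3, padding=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: neck features
        inter = []
        for conv in self.inter:                             # Eq. (5): stack
            x = conv(x)
            inter.append(x)
        cat = torch.cat(inter, dim=1)                       # Cat(X_k^inter)
        gap = cat.mean(dim=(2, 3))                          # Eq. (7): GAP
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(gap))))      # Eq. (8)
        task = torch.cat([w[:, k, None, None, None] * inter[k]
                          for k in range(len(inter))], dim=1)       # Eq. (6)
        p = torch.sigmoid(self.conv2(torch.relu(self.conv1(task)))) # Eq. (9)
        m = torch.sigmoid(self.conv_m(cat))                         # Eq. (12)
        return torch.sqrt(p * m)                                    # Eq. (10)

tap = TAPSketch()
p_align = tap(torch.randn(1, 256, 64, 64))                  # (1, 2, 64, 64)
```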
4. Experiments

All experiments were performed on a workstation equipped with an Nvidia GeForce RTX 3090 graphics card with 24 GB of VRAM. The models were constructed in a Python 3.8.16 environment using PyTorch 1.13.1, CUDA 11.7, and cuDNN 8.5.0.

4.1. Dataset description

The coal-gangue dataset was collected from a coal processing plant in Yongcheng, Henan Province, China. An industrial camera (model: MV-CA050-10GC, HIKVISION) was installed about 0.5 m above the hand-sorting conveyor. To ensure a constant light source during image collection, images were all acquired during normal nighttime production, supplemented with 285 W adjustable LED lights. Finally, a total of 1750 high-quality images of coal and gangue were captured. Coal and gangue
objects were annotated using an online open-source labeling tool, "Make Sense" (https://www.makesense.ai/).

Fig. 6 shows the annotations of coal and gangue samples. On the one hand, there are distinct differences in the characteristics of coal and gangue samples: coal surfaces are brighter and more reflective than gangue surfaces. On the other hand, coverage of moisture and coal slime on sample surfaces and overlapping between samples are common, posing challenges to detection. Therefore, a model that can address these challenges in complex environments is needed.

Table 1 presents the statistics of the dataset. We randomly split the dataset at a ratio of 8:1:1, producing 1400 images for the training set, 175 images for the validation set, and 175 images for the test set. It can be observed that the distributions of object category and size across the three sets are relatively consistent, which indicates that the data split is reasonable. It is noteworthy that the materials on the conveyor had been pre-sized using a screen, resulting in most coal and gangue materials having a size larger than 50 mm. Therefore, the dataset contains almost no small objects according to the object sizes defined by the COCO dataset [52]. In the subsequent evaluation process, we therefore do not discuss the evaluation of small object detection.

Fig. 7 illustrates the distribution of the number of objects per image, which generally follows a normal distribution. The number of objects per image is mostly concentrated around 15, indicating that the arrangement of objects is relatively dense. This demonstrates that object detection in real industrial production is significantly more challenging than in laboratory settings.

4.2. Implementation details

4.2.1. Model setting
To verify the superiority of the proposed STATNet model, we chose several mainstream object detection models as benchmarks for comparison: Faster RCNN [53], Cascade RCNN [54], SSD [11], RetinaNet, YOLOv3 [19], YOLOX [55], YOLOv7 [56], and YOLOv8. All models were trained for 100 epochs using a mini-batch size of 2. The learning rate was reduced by a factor of 10 at the 50-th and 80-th epochs. Common data augmentation included random flipping with a probability of 0.5. Inspired by the data augmentation in YOLOX [55], we further introduced the Mosaic and Mixup data augmentation strategies during the first 80 epochs of training to investigate their potential boost to our model performance. For model inference, we followed the settings used in Ref. [42], except that we resized the input images to 512 x 512. The remaining key parameters for each model are summarized in Table 2.

4.2.2. Model metrics
We utilized the common evaluation metrics of the COCO datasets to assess model accuracy. Among these metrics, we used AP50 (Average Precision @ intersection over union, IoU = 0.50) as the main performance measure. We also provide AP (AP @ IoU = 0.50:0.05:0.95), AP75 (AP @ IoU = 0.75), APm (AP for medium objects), and APl (AP for large objects) for further reference. Furthermore, we provide precision-recall (PR) curves to evaluate whether the model can maintain high precision while achieving high recall. Specifically, IoU is calculated by measuring the proportion of the intersection of the predicted bounding box and the ground-truth bounding box relative to their union:

\text{IoU} = \frac{\text{Area}(\text{Predicted} \cap \text{Ground Truth})}{\text{Area}(\text{Predicted} \cup \text{Ground Truth})}   (14)

AP at different IoUs is defined by calculating the area under the PR curve:

\text{AP} = \int_{0}^{1} P(R)\, dR   (15)

where P is the precision and R is the recall, which can be further computed as:

P = \frac{TP}{TP + FP}   (16)

R = \frac{TP}{TP + FN}   (17)

where TP, FP, and FN are true positive, false positive, and false negative samples, respectively.
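A minimal NumPy sketch of Eqs. (14) and (15); note that COCO-style evaluation samples the PR curve at fixed recall points rather than integrating it directly, so this is illustrative only:

```python
import numpy as np

def iou(box_a, box_b):
    """Eq. (14): boxes given as (x1, y1, x2, y2) in pixels."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def average_precision(precision, recall):
    """Eq. (15): area under the PR curve via trapezoidal integration."""
    order = np.argsort(recall)
    return float(np.trapz(np.asarray(precision)[order],
                          np.asarray(recall)[order]))

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ~ 0.143
```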
In terms of model complexity and speed, we considered three metrics: floating-point operations (FLOPs), which assess the computational complexity of the model; Params, which evaluates the physical memory usage of the model; and frames per second (FPS), which measures the inference speed of the model.

Furthermore, we conducted robustness tests on all models to verify their performance on images with different types of corruption [57].

4.2.3. Visualization of the training process of the proposed STATNet
Fig. 8 shows the training process of the proposed model. The left axis shows the losses of the classification and localization prediction tasks as well as the total loss during training, whereas the right axis displays the AP50 results on the validation set for each epoch. It can be observed that the model's performance reached initial convergence at around the 80-th epoch. Subsequently, after discontinuing the Mosaic and Mixup data augmentation and adjusting the learning rate at the 80-th epoch, the model's performance continued to improve and eventually converged at the 100-th epoch. This finding can be explained as follows: while Mosaic and Mixup enhance the robustness and generalization of the model, they distort the real distribution of the data. After discontinuing them, the images used for training revert to their original forms, allowing the model to continue converging. The convergence tendency with and without Mosaic and Mixup data augmentation is consistent with that in YOLOX [55].

4.3. Results and discussion

4.3.1. Ablation study
Ablation experiments were conducted to verify the effect of improving each module on the model performance. We developed our STATNet based on a popular and strong one-stage detector, RetinaNet-
Table 1
Statistics of the dataset.
Data set | Ratio | Images | Object category | Object size*
*Remark: According to the COCO dataset, the object sizes in an image are defined as follows. Large: greater than 96 x 96 pixels; Medium: between 32 x 32 and 96 x 96 pixels; Small: less than 32 x 32 pixels.
Table 2
Key training parameters for each model.
Model | Input size | Backbone | Optimizer | Learning rate | Momentum/Betas | Weight decay
R101. Specifically, we introduced Mosaic and Mixup data augmentation, replaced its backbone module, improved its neck module, and replaced its head module, step by step, to demonstrate the improvements brought about by the data augmentation and each new module in our STATNet model. The results are presented in Table 3, from which we can draw the following points:

Effect of Mosaic and Mixup: Our research first introduced the Mosaic and Mixup data augmentation during the training process. As seen from Table 3, we achieved an increase of 0.53 % in AP50 with identical model computational complexity, parameters, and inference speed. Therefore, we retained Mosaic and Mixup, two plug-and-play data augmentation methods, in the subsequent investigations in this work to enhance model accuracy.

Effect of Swin Transformer: Subsequently, we compared the capabilities of the RetinaNet model using the Swin Transformer versus ResNet101 as the backbone network. The Swin Transformer achieves higher detection accuracy, with an improvement of 0.53 % in terms of AP50. Additionally, the computational complexity and parameters of the two models show insignificant differences. As for model speed, the Swin Transformer exhibits a decrease in inference speed. This is because CNN-based networks like ResNet can be better accelerated by the cuDNN functions under the PyTorch framework, while Transformer-based architectures may require further kernel optimization to adapt to existing hardware conditions [46].

Fig. 9 employs AblationCAM to further visualize the decision-making process of different backbone networks. AblationCAM can generate gradient-free visual explanations to provide valuable insights into the model's decision-making process [58]. It is evident that the Swin Transformer backbone-based model produces more comprehensive and impressive activation regions, resulting in more trustworthy classification and localization decisions. This reflects the Swin Transformer's stronger capability of global feature learning, which makes it a
Table 4
Comparisons with other SOTA models.
Method | AP50 (%) | AP75 (%) | AP (%) | APm (%) | APl (%) | FLOPs (G) | Params (M) | FPS
importance. This is because lower model accuracy inevitably results in more serious misjudgment of coal and gangue, which leads to great loss of mineral resources, increases air pollution, and undermines the profit of coal enterprises. Taking a coal preparation plant with an annual production capacity of 3 million tons and a clean coal yield of 50 % as an example, and assuming that the price of the clean coal product is 800 CNY per ton [59], a mere 1 % loss in coal product would result in a significant economic loss of approximately 12 million CNY annually, as verified below. This rough estimation demonstrates that model accuracy is critical to maximizing economic benefits. Furthermore, hardware upgrades can improve the model's inference speed, enabling it to meet real-time detection requirements.
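The 12 million CNY figure follows directly from the stated assumptions:

3,000,000 t/yr x 50 % x 1 % = 15,000 t/yr of lost clean coal; 15,000 t/yr x 800 CNY/t = 12,000,000 CNY/yr.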
Additionally, Fig. 12 provides the PR curves for gangue and coal detection. It can be observed that the proposed STATNet model maintains higher precision while achieving higher recall rates. Another notable observation is that all models perform slightly worse in coal detection compared to gangue detection. This may be because the appearance of coal is more similar to the background (i.e., the black color of the conveyor belt), making it more challenging to detect than gangue. Nevertheless, the proposed model shows a more obvious
Fig. 11. Relationships between model accuracy and (a) computational complexity; (b) inference time.
Fig. 12. The PR-curve for (a) gangue detection; (b) coal detection.
[4] Yin J, Zhu J, Zhu H, Pan G, Zhu W, Zeng Q, et al. Intelligent photoelectric identification of coal and gangue - a review. Measurement 2024;233:114723. https://doi.org/10.1016/j.measurement.2024.114723.
[5] Zhang K, Yang X, Xu L, Thé J, Tan Z, Yu H. Enhancing coal-gangue object detection using GAN-based data augmentation strategy with dual attention mechanism. Energy 2024;287:129654. https://doi.org/10.1016/j.energy.2023.129654.
[6] Yang X, Zhang K, Ni C, Cao H, Thé J, Xie G, et al. Ash determination of coal flotation concentrate by analyzing froth image using a novel hybrid model based on deep learning algorithms and attention mechanism. Energy 2022;260:125027. https://doi.org/10.1016/j.energy.2022.125027.
[7] Yang J, Chang B, Zhang Y, Luo W, Ge S, Wu M. CNN coal and rock recognition method based on hyperspectral data. Int J Coal Sci Technol 2022;9(1):63. https://doi.org/10.1007/s40789-022-00516-x.
[8] Cheng G, Chen J, Wei Y, Chen S, Pan Z. A coal gangue identification method based on HOG combined with LBP features and improved support vector machine. Symmetry (Basel) 2023;15(1):202.
[9] Weimer D, Scholz-Reiter B, Shpitalni M. Design of deep convolutional neural network architectures for automated feature extraction in industrial inspection. CIRP Ann 2016;65(1):417-20. https://doi.org/10.1016/j.cirp.2016.04.072.
[10] Wang X. Deep learning in object recognition, detection, and segmentation. Found Trends Signal Process 2016;8(4):217-382.
[11] Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C-Y, et al. SSD: single shot multibox detector. In: Computer vision - ECCV 2016: 14th European conference, Amsterdam, The Netherlands, October 11-14, 2016, proceedings, part I. Springer; 2016. p. 21-37.
[12] Jiang P, Ergu D, Liu F, Cai Y, Ma B. A review of YOLO algorithm developments. Procedia Comput Sci 2022;199:1066-73.
[13] Zhang Y, Wang J, Yu Z, Zhao S, Bei G. Research on intelligent detection of coal gangue based on deep learning. Measurement 2022;198:111415. https://doi.org/10.1016/j.measurement.2022.111415.
[14] Zhang B, Zhang HB. Coal gangue detection method based on improved SSD algorithm. In: 2021 international conference on intelligent transportation, big data & smart city (ICITBS); 2021. p. 634-7.
[15] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556; 2014.
[16] Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, et al. MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861; 2017.
[17] Xue G, Li S, Hou P, Gao S, Tan R. Research on lightweight YOLO coal gangue detection algorithm based on ResNet18 backbone feature network. Internet of Things 2023;22:100762. https://doi.org/10.1016/j.iot.2023.100762.
[18] He K, Zhang X, Ren S, Sun J. Identity mappings in deep residual networks. In: Computer vision - ECCV 2016: 14th European conference, Amsterdam, The Netherlands, October 11-14, 2016, proceedings, part IV. Springer; 2016. p. 630-45.
[19] Redmon J, Farhadi A. YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767; 2018. https://doi.org/10.48550/arXiv.1804.02767.
[20] Lv Z, Wang W, Xu Z, Zhang K, Lv H. Cascade network for detection of coal and gangue in the production context. Powder Technol 2021;377:361-71. https://doi.org/10.1016/j.powtec.2020.08.088.
[21] Yan P, Wang W, Li G, Zhao Y, Wang J, Wen Z. A lightweight coal gangue detection method based on multispectral imaging and enhanced YOLOv8n. Microchem J 2024;199:110142. https://doi.org/10.1016/j.microc.2024.110142.
[22] Li D-Y, Wang G-F, Guo Y-C, Zhang Y, Wang S. An identification and positioning method for coal gangue based on lightweight mixed domain attention. Int J Coal Prep Util 2023;43(9):1542-60. https://doi.org/10.1080/19392699.2022.2119561.
[23] Yan P, Sun Q, Yin N, Hua L, Shang S, Zhang C. Detection of coal and gangue based on improved YOLOv5.1 which embedded scSE module. Measurement 2022;188:110530. https://doi.org/10.1016/j.measurement.2021.110530.
[24] Pan H, Shi Y, Lei X, Wang Z, Xin F. Fast identification model for coal and gangue based on the improved tiny YOLO v3. J Real-Time Image Process 2022;19(3):687-701. https://doi.org/10.1007/s11554-022-01215-1.
[25] Li M, He X, Yuan Y, Yang M. Multiple factors influence coal and gangue image recognition method and experimental research based on deep learning. Int J Coal Prep Util 2022:1-17. https://doi.org/10.1080/19392699.2022.2118260.
[26] Wei D, Li J, Li B, Wang X, Chen S, Wang X, et al. A fast recognition method for coal gangue image processing. Multimed Syst 2023;29(4):2323-35. https://doi.org/10.1007/s00530-023-01109-7.
[27] Wen X, Li B, Wang X, Li J, Wei D, Gao J, et al. A Swin transformer-functionalized lightweight YOLOv5s for real-time coal-gangue detection. J Real-Time Image Process 2023;20(3):47. https://doi.org/10.1007/s11554-023-01305-8.
[28] Liu Q, Li J, Li Y, Gao M. Recognition methods for coal and coal gangue based on deep learning. IEEE Access 2021;9:77599-610. https://doi.org/10.1109/ACCESS.2021.3081442.
[29] Guo Y, Zhang Y, Li F, Wang S, Cheng G. Research of coal and gangue identification and positioning method at mobile device. Int J Coal Prep Util 2023;43(4):691-707. https://doi.org/10.1080/19392699.2022.2072305.
[30] Yang D, Miao C, Li X, Liu Y, Wang Y, Zheng Y. Improved YOLOv7 network model for gangue selection robot for gangue and foreign matter detection in coal. Sensors 2023;23(11):5140.
[31] Lv Z, Wang W, Xu Z, Zhang K, Fan Y, Song Y. Fine-grained object detection method using attention mechanism and its application in coal-gangue detection. Appl Soft Comput 2021;113:107891. https://doi.org/10.1016/j.asoc.2021.107891.
[32] Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016. p. 2818-26.
[33] Tan M, Le Q. EfficientNet: rethinking model scaling for convolutional neural networks. In: Proceedings of the 36th international conference on machine learning. Proceedings of machine learning research: PMLR; 2019. p. 6105-14.
[34] Luan H, Xu H, Tang W, Tian Y, Zhang Q. Coal and gangue classification in actual environment of mines based on deep learning. Measurement 2023;211:112651. https://doi.org/10.1016/j.measurement.2023.112651.
[35] Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, et al. An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929; 2020.
[36] Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-end object detection with transformers. In: European conference on computer vision. Springer; 2020. p. 213-29.
[37] Zhang K, Thé J, Xie G, Yu H. Multi-step ahead forecasting of regional air quality using spatial-temporal deep neural networks: a case study of Huaihai Economic Zone. J Clean Prod 2020;277:123231. https://doi.org/10.1016/j.jclepro.2020.123231.
[38] Zhang K, Cao H, Thé J, Yu H. A hybrid model for multi-step coal price forecasting using decomposition technique and deep learning algorithms. Appl Energy 2022;306:118011. https://doi.org/10.1016/j.apenergy.2021.118011.
[39] Liu S, Huang D, Wang Y. 2019.
[40] Lin T-Y, Dollár P, Girshick R, He K, Hariharan B, Belongie S. Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 2117-25.
[41] Zaidi SSA, Ansari MS, Aslam A, Kanwal N, Asghar M, Lee B. A survey of modern deep learning based object detection models. Digit Signal Process 2022;126:103514. https://doi.org/10.1016/j.dsp.2022.103514.
[42] Feng C, Zhong Y, Gao Y, Scott MR, Huang W. TOOD: task-aligned one-stage object detection. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV). IEEE Computer Society; 2021. p. 3490-9.
[43] Kaur P, Khehra BS, Mavi EBS. Data augmentation for object detection: a review. In: 2021 IEEE international midwest symposium on circuits and systems (MWSCAS); 2021. p. 537-43.
[44] Li B, Li Y, Zhu X, Qu L, Wang S, Tian Y, et al. Substation rotational object detection based on multi-scale feature fusion and refinement. Energy AI 2023;14:100294. https://doi.org/10.1016/j.egyai.2023.100294.
[45] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Adv Neural Inf Process Syst 2017;30:5998-6008.
[46] Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, et al. Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision; 2021. p. 10012-22.
[47] Liu S, Qi L, Qin H, Shi J, Jia J. Path aggregation network for instance segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR); 2018. p. 8759-68.
[48] Zhang K, Yang X, Cao H, Thé J, Tan Z, Yu H. Multi-step forecast of PM2.5 and PM10 concentrations using convolutional neural network integrated with spatial-temporal attention and residual learning. Environ Int 2023;171:107691. https://doi.org/10.1016/j.envint.2022.107691.
[49] Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2018. p. 7132-41.
[50] Li X, Lv C, Wang W, Li G, Yang L, Yang J. Generalized focal loss: towards efficient representation learning for dense object detection. IEEE Trans Pattern Anal Mach Intell 2023;45(3):3139-53. https://doi.org/10.1109/TPAMI.2022.3180392.
[51] Rezatofighi H, Tsoi N, Gwak J, Sadeghian A, Reid I, Savarese S. Generalized intersection over union: a metric and a loss for bounding box regression. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2019. p. 658-66.
[52] Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, et al. Microsoft COCO: common objects in context. Cham: Springer International Publishing; 2014. p. 740-55.
[53] Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. Adv Neural Inf Process Syst 2015;28.
[54] Cai Z, Vasconcelos N. Cascade R-CNN: delving into high quality object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2018. p. 6154-62.
[55] Ge Z, Liu S, Wang F, Li Z, Sun J. YOLOX: exceeding YOLO series in 2021. arXiv preprint arXiv:2107.08430; 2021. https://doi.org/10.48550/arXiv.2107.08430.
[56] Wang C-Y, Bochkovskiy A, Liao H-YM. YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2023. p. 7464-75.
[57] Hendrycks D, Dietterich T. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261; 2019.
[58] Ramaswamy HG. Ablation-CAM: visual explanations for deep convolutional network via gradient-free localization. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision; 2020. p. 983-91.
[59] Zhu S, Chi Y, Gao K, Chen Y, Peng R. Analysis of influencing factors of thermal coal price. Energies 2022;15(15):5652.