Keywords: Coal-gangue detection; Swin Transformer; Improved path aggregation feature pyramid network; Task-aligned head; Object detection

Abstract: Coal-gangue object detection has attracted substantial attention because it is at the core of realizing vision-based intelligent and green coal separation. However, most existing studies have focused on laboratory datasets and prioritized lightweight models, which makes coal-gangue object detection challenging to adapt to the complex and harsh scenes of real production environments. Therefore, our project collected and labeled image datasets of coal and gangue under real production conditions at a coal preparation plant. We then designed a one-stage object detection model, named STATNet, following the "backbone-neck-head" architecture, with the aim of enhancing detection accuracy in industrial coal preparation scenarios. The proposed model utilizes the Swin Transformer as its backbone module to extract multi-scale features, an improved path aggregation feature pyramid network (iPAFPN) as its neck module to enrich feature fusion, and a task-aligned head (TAH) as its head module to mitigate conflicts and misalignments between the classification and localization tasks. Experimental results on a real-world industrial dataset demonstrate that the proposed STATNet model achieves an impressive AP50 of 89.27 %, significantly surpassing several state-of-the-art baseline models by 2.02 % to 5.58 %. Additionally, it exhibits stronger robustness in resisting image corruption and perturbation. These findings demonstrate its promising prospects in practical coal and gangue separation applications.

List of abbreviations: AI, Artificial intelligence; ANN, Artificial neural network; CNN, Convolutional neural network; CV, Computer vision; FLOPs, Floating-point operations; FPN, Feature pyramid network; FPS, Frames per second; GAP, Global average pooling; HOG, Histogram of oriented gradients; IoU, Intersection over union; iPAFPN, Improved path aggregation feature pyramid network; LN, Layer normalization; MLP, Multilayer perceptron; MSA, Multi-head self-attention; NLP, Natural language processing; RF, Random forest; SIFT, Scale-invariant feature transform; SSD, Single shot multibox detector; SVM, Support vector machines; TAH, Task-aligned head; TAP, Task alignment predictor; YOLO, You only look once.

* Corresponding author at: School of Chemical Engineering and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China.
E-mail address: heshengyu@cumt.edu.cn (H. Yu).
https://doi.org/10.1016/j.egyai.2024.100388
Such a situation is unacceptable because part of the coal may be misjudged as gangue and discharged from the system, resulting in substantial economic losses. Third, coal and gangue are often densely arranged and appear very similar to the conveyor belt background in practice, which interferes with the localization and classification tasks of the models during detection.

To fill the above research gap, we carefully designed a one-stage object detection model following the "backbone-neck-head" architecture, termed STATNet, with the aim of enhancing the performance of coal-gangue detection in real industrial coal beneficiation. The proposed STATNet model utilizes the Swin Transformer as the backbone, an improved Path Aggregation Feature Pyramid Network (iPAFPN) as the neck, and a Task-aligned Head (TAH) for classification and localization. The model incorporates cutting-edge algorithms for the three key modules (backbone, neck, and head) to maximize the accuracy of coal-gangue detection on a real-world industrial dataset. The contributions and innovations of this work are summarized as follows:

(1) Industrial dataset construction: Our research collected coal and gangue images from a real industrial production site and constructed an object dataset for the detection of coal and gangue. The dataset faithfully reflects the challenges of object detection for coal and gangue under the harsh environment of a coal preparation plant. This contribution significantly enriches and promotes vision-based coal separation.

(2) Innovative feature extraction: The present research is the first to introduce the Swin Transformer, instead of the commonly used CNN, as the backbone network for coal-gangue detection. This approach boosts the model's ability to extract global features of images, with a negligible increase in computational complexity.

(3) Enhanced feature fusion: We use an improved PAFPN as the neck module for efficient feature fusion to better adapt to the size characteristics of coal and gangue, focusing more on large and medium objects.

(4) Collaborative learning: In existing studies on coal and gangue detection, the classification and localization tasks of the heads are independent of each other and lack interaction. In this work, we applied TAH to facilitate collaborative learning between the two tasks, resulting in more accurate classification and localization predictions.

(5) Effective model integration: The proposed model combines the advantages of the three advanced modules (i.e., backbone, neck, head), and its detection accuracy for coal and gangue in real industrial scenarios outperforms the current mainstream models.

In summary, our proposed model employs advanced AI deep learning techniques for more efficient and cleaner utilization of coal resources, contributing to global efforts in reducing carbon emissions and promoting sustainable development.

Recent research on one-stage object detection has focused on the design and improvement of the backbone, neck, and head modules.

The backbone module is one of the core components of object detection models. Coal and gangue images are typically fed into hierarchical CNNs such as VGG [15], ResNet [18], Inception [32], and EfficientNet [33] to extract features at different scales. The shallow-level features include the edge and texture features of coal and gangue, while the deep-level features abstractly represent characteristics such as the shape of the objects. Although CNN-based backbones are predominant in the field of CV, their capability of modelling global information is restricted by local weight sharing. Additionally, the fixed receptive fields at specific levels make it challenging to handle inputs with various scales and resolutions. Therefore, they have restricted performance when dealing with significant variations in coal and gangue granularity [34]. Recently, Transformer architectures, originally designed for natural language processing (NLP) and large language model (such as ChatGPT) tasks, have proven equally effective for visual tasks [35,36]. They exhibit excellent spatial and global modeling capabilities owing to their self-attention mechanism [37,38]. For these reasons, our research project was inspired to apply the promising Swin Transformer as the backbone network for coal-gangue detection.

The neck module acts as a bridge between the backbone and the head; it accepts and integrates feature maps from different levels of the backbone to acquire richer feature information, thereby enhancing the model's recognition capabilities. Many prior studies have demonstrated that feature pyramid representations have become the cornerstone of neck modules for multi-scale feature processing [39]. One of the most representative neck modules, the feature pyramid network (FPN), constructs the feature pyramid through a top-down pathway with lateral connections, which enables the sharing of rich high-semantic information for both shallow and deep feature maps [40]. The PAFPN, which builds upon FPN by introducing a bottom-up pathway with lateral connections to further enhance feature fusion, can better capture multi-scale information. In this work, we further improve the original PAFPN (i.e., iPAFPN) by adding additional down-sampling operations after feature fusion to introduce deeper-scale feature maps, facilitating the detection of large and oversized coal and gangue objects.

The head module is another crucial component, responsible for the final detection, including object classification and bounding box regression. Current mainstream detection models, such as SSD and RetinaNet, typically employ convolutional layers, pooling layers, or fully connected layers to accomplish multi-scale detection tasks [41]. However, the classification and localization tasks are often implemented independently by two branches in parallel, which can potentially lead to misalignment of detection results [42]. In this work, we introduced a Task-Aligned Head (TAH) to enhance the interaction between the classification and localization tasks.

In addition to the design of the model architecture, data augmentation plays an important role in improving the accuracy of object detection models. It aims to generate diverse new images by transforming or processing the original images to expand the training dataset, thereby improving the model's generalization ability and robustness. Common data augmentation strategies include flipping, rotation, cropping, and brightness and contrast adjustments. Recently, Mosaic and Mixup data augmentation have been widely used in CV tasks. The former merges four randomly cropped small images into one large mosaic image, which is then used as a training sample to enhance the model's detection ability in complex backgrounds; the latter mixes two or more original samples to generate new training samples, thereby reducing the risk of overfitting [43]. They have the potential to improve the accuracy of coal-gangue object detection.
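To make the two strategies concrete, the following is a minimal NumPy sketch of Mixup and Mosaic applied to raw images; the function names, the Beta-distributed mixing ratio, and the nearest-neighbour tile resize are our illustrative choices, and bounding-box bookkeeping is omitted:

```python
import numpy as np

def mixup(img_a: np.ndarray, img_b: np.ndarray, alpha: float = 1.0):
    """Blend two (H, W, 3) images pixel-wise; the labels of both images are
    kept, with loss contributions weighted by lam and (1 - lam)."""
    lam = np.random.beta(alpha, alpha)   # mixing ratio from Beta(alpha, alpha)
    mixed = lam * img_a.astype(np.float32) + (1.0 - lam) * img_b.astype(np.float32)
    return mixed.astype(np.uint8), lam

def mosaic(imgs: list, size: int = 512) -> np.ndarray:
    """Stitch four images onto one size x size canvas, one per quadrant.
    In a full pipeline, boxes would be shifted/clipped into the new frame."""
    assert len(imgs) == 4
    half = size // 2
    canvas = np.zeros((size, size, 3), dtype=np.uint8)
    offsets = [(0, 0), (0, half), (half, 0), (half, half)]
    for img, (y, x) in zip(imgs, offsets):
        # naive nearest-neighbour resize of each tile to (half, half)
        ys = np.linspace(0, img.shape[0] - 1, half).astype(int)
        xs = np.linspace(0, img.shape[1] - 1, half).astype(int)
        canvas[y:y + half, x:x + half] = img[ys][:, xs]
    return canvas
```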
3. Methodology

Fig. 1 illustrates the overall structure of the proposed STATNet. It follows the common "backbone-neck-head" principle used in one-stage detectors [44]. The pre-processed images are first input into the Swin Transformer backbone network, undergoing four stages of resolution reduction (4x, 8x, 16x, and 32x down-sampling, respectively) to generate multi-scale feature maps. The feature maps from the first stage (S1) lack sufficient abstract semantic information; we therefore feed only the feature maps obtained at the second to fourth stages (S2-S4) to the iPAFPN module, saving memory footprint and improving computational efficiency. The feature information of different scales is transmitted and aggregated, resulting in a robust ability to handle objects of different sizes in complex contexts. In the final stage, the features obtained by the iPAFPN are collaboratively used to calculate the classification probabilities and adjust the localization bounding boxes through the TAH.
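The overall data flow can be summarized by the following schematic PyTorch sketch; the class and argument names are illustrative placeholders rather than identifiers from a released implementation:

```python
import torch
from torch import nn

class STATNetSketch(nn.Module):
    """Schematic 'backbone-neck-head' data flow; each stage is a placeholder."""
    def __init__(self, backbone: nn.Module, neck: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone, self.neck, self.head = backbone, neck, head

    def forward(self, images: torch.Tensor):
        # Swin backbone: four stages at 1/4, 1/8, 1/16, and 1/32 resolution
        s1, s2, s3, s4 = self.backbone(images)
        # only S2-S4 are passed on, saving memory as described above
        fused = self.neck([s2, s3, s4])   # iPAFPN multi-scale fusion
        return self.head(fused)           # TAH: class scores + boxes
```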
3.2. Backbone: Swin Transformer

Fig. 2 takes the small version of the Swin Transformer (Swin-S) as an example to demonstrate how it extracts hierarchical multi-scale features. First, the Patch Partition module divides the image into multiple patches (each consisting of 4 x 4 adjacent pixels), and these patches are then unfolded along the channel dimension. Through a linear layer, i.e., the Linear Embedding module, the channel dimensions are further transformed. Essentially, the Patch Partition and Linear Embedding modules jointly convert the RGB image of H x W x 3 dimensions into patches of H/4 x W/4 x C dimensions; C is set to 96 for Swin-S.
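In common Transformer implementations, these two modules are realized together as a single 4 x 4 convolution with stride 4 followed by layer normalization; a minimal sketch, assuming Swin-S with C = 96:

```python
import torch
from torch import nn

# Patch partition + linear embedding in one step: a 4x4 convolution with
# stride 4 maps (B, 3, H, W) to (B, C, H/4, W/4); C = 96 for Swin-S.
patch_embed = nn.Conv2d(in_channels=3, out_channels=96, kernel_size=4, stride=4)
norm = nn.LayerNorm(96)

x = torch.randn(1, 3, 224, 224)              # dummy RGB input
tokens = patch_embed(x)                      # (1, 96, 56, 56)
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 56*56, 96) token sequence
tokens = norm(tokens)
print(tokens.shape)                          # torch.Size([1, 3136, 96])
```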
Next, feature maps of different scales are generated by the four stages. As illustrated in Fig. 2(c), except for the first stage, the remaining three stages employ Patch Merging layers to achieve down-sampling, like pooling layers in CNNs, creating feature maps of different sizes.

Finally, multi-head self-attention (MSA) with regular windowing (W-MSA) and with shifted windowing (SW-MSA) configurations is executed within the Swin Transformer blocks to achieve long-range information interaction.

Fig. 2(d) shows the core blocks of the Swin Transformer. They are similar to the original Transformer module [45], but with one key difference: the MSA mechanism is replaced with W-MSA or SW-MSA. Specifically, as shown in Fig. 3, W-MSA divides the feature map into multiple non-overlapping windows; each feature within a window only computes self-attention with the other features residing in the same window. This significantly reduces the computational cost, allowing the transformer-based architecture to be used efficiently for visual tasks. However, W-MSA still has a limitation: information between windows is isolated, restricting the transmission of global information. Therefore, SW-MSA introduces a shifted-window design, enabling interactions between different windows. Partitioning shifted windows establishes connections between adjacent non-overlapping windows of the preceding layer, which can effectively enhance the model's performance in visual tasks [27].

The calculation process of two successive Swin Transformer blocks is expressed as:

\hat{z}^{l} = \text{W-MSA}(\text{LN}(z^{l-1})) + z^{l-1}   (1)

z^{l} = \text{MLP}(\text{LN}(\hat{z}^{l})) + \hat{z}^{l}   (2)

\hat{z}^{l+1} = \text{SW-MSA}(\text{LN}(z^{l})) + z^{l}   (3)

z^{l+1} = \text{MLP}(\text{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1}   (4)

where \hat{z}^{l} and z^{l} are the outputs of the (S)W-MSA module and of a 2-layer multilayer perceptron (MLP) with GELU non-linearity, respectively, and LN denotes layer normalization. More details of the Swin Transformer can be found in Ref. [46].
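The pre-norm residual structure of Eqs. (1)-(4) can be sketched as follows. To keep the snippet self-contained, plain multi-head attention stands in for the windowed (S)W-MSA; the window partitioning and cyclic shifting of the real blocks are omitted:

```python
import torch
from torch import nn

class SwinBlockSketch(nn.Module):
    """Pre-norm residual structure of Eqs. (1)-(2); a real Swin block uses
    windowed (W-MSA) or shifted-window (SW-MSA) attention instead of the
    plain multi-head attention used here as a stand-in."""
    def __init__(self, dim: int = 96, heads: int = 3, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(                      # 2-layer MLP with GELU
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = self.norm1(z)
        z_hat = self.attn(h, h, h, need_weights=False)[0] + z   # Eq. (1)/(3)
        return self.mlp(self.norm2(z_hat)) + z_hat              # Eq. (2)/(4)

blocks = nn.Sequential(SwinBlockSketch(), SwinBlockSketch())    # W-MSA, then SW-MSA
out = blocks(torch.randn(1, 3136, 96))                          # (B, tokens, dim)
```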
Fig. 2. (a) The overall structure of the small version of the Swin Transformer (Swin-S); (b) the Patch Partition and Linear Embedding modules; (c) the Patch Merging module; (d) two successive Swin Transformer blocks.

Fig. 3. Illustration of different self-attention mechanisms: (a) original self-attention (quadratic computational complexity with respect to image size); (b) W-MSA computation within a single window; (c) SW-MSA computation across windows.

3.3. Neck: improved path aggregation feature pyramid network

PAFPN has been proven to be an important module for improving the performance of object detection tasks. Fig. 1 shows that it facilitates the flow and integration of semantic information between different scales
through top-down, bottom-up, and lateral connections, thereby enriching multi-scale features. We improve PAFPN by further down-sampling at the top level of the original PAFPN output to strengthen the detection ability for large objects.

Fig. 4. The blocks of (a) top-down and (b) bottom-up path augmentation.

Fig. 4 shows the details of the top-down and bottom-up path augmentation used to generate N2 to N4 in PAFPN. Specifically, in the top-down pathway, the high-level feature map P_{i+1} is up-sampled by a factor of 2 using nearest-neighbor up-sampling to match the resolution of the adjacent lower-level feature map. The feature map from the backbone, S_i, which has the same resolution as the up-sampled feature map, undergoes a 1 x 1 convolution operation to ensure the consistency of channels. Then, the up-sampled high-level feature map and the adjusted lower-level feature map are element-wise summed to produce the fused feature map P_i. These operations are important for the propagation of semantic information from the high-level features to the low-level ones, enhancing the representation of the fused feature map. The bottom-up pathway, on the other hand, operates in the opposite direction. It reduces the resolution of the low-level feature map N_i by a factor of 2 using a 3 x 3 convolution with a stride of 2, and then performs the same lateral connection to generate the higher-level feature map N_{i+1}. For N5 and N6, we use max pooling with a stride of 2 to obtain more levels on top of the outputs. More details can be found in Refs. [40,47].
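A self-contained sketch of these pathways is given below; the channel counts, level sizes, and module names are illustrative and not the exact configuration of STATNet:

```python
import torch
import torch.nn.functional as F
from torch import nn

class IPAFPNSketch(nn.Module):
    """One top-down and one bottom-up pass over three backbone levels,
    with extra max-pooled levels N5/N6 appended, following Fig. 4."""
    def __init__(self, in_chs=(192, 384, 768), out_ch=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in in_chs)
        self.down = nn.ModuleList(nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)
                                  for _ in range(2))

    def forward(self, feats):                       # feats = [S2, S3, S4]
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # top-down: up-sample by 2 (nearest) and element-wise sum -> P2..P4
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], scale_factor=2, mode="nearest")
        # bottom-up: stride-2 3x3 conv plus lateral sum -> N2..N4
        outs = [laterals[0]]
        for i in range(len(laterals) - 1):
            outs.append(self.down[i](outs[-1]) + laterals[i + 1])
        # extra levels N5, N6 from stride-2 max pooling of the top output
        outs.append(F.max_pool2d(outs[-1], kernel_size=1, stride=2))
        outs.append(F.max_pool2d(outs[-1], kernel_size=1, stride=2))
        return outs                                 # [N2, N3, N4, N5, N6]

neck = IPAFPNSketch()
s2, s3, s4 = (torch.randn(1, c, s, s) for c, s in
              [(192, 64), (384, 32), (768, 16)])
print([n.shape[-1] for n in neck([s2, s3, s4])])    # [64, 32, 16, 8, 4]
```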
3.4. Head: task-aligned head (TAH)

Fig. 5 illustrates a schematic of the head module. It calculates task-interactive features and performs predictions through two Task Alignment Predictors (TAP).

Fig. 5. Illustration of the head module: (a) TAH; (b) TAP.

First, the features obtained from the neck module, X^{iPAFPN} \in \mathbb{R}^{H \times W \times C}, are fed into N consecutive convolution layers within a single branch, resulting in a group of stacked task-interaction features X_{1-N}^{inter}, which can be expressed as:

X_{k}^{inter} = \begin{cases} \delta(\text{conv}_{k}(X^{iPAFPN})), & k = 1 \\ \delta(\text{conv}_{k}(X_{k-1}^{inter})), & k > 1 \end{cases} \quad \forall k \in \{1, 2, \ldots, N\}   (5)

where conv_k and \delta denote the k-th convolution layer and the ReLU activation function, respectively; N is set to 6 in our experiments. This process not only aids the interaction between the classification and localization tasks, but also promotes the acquisition of rich multi-scale information with different receptive fields.

Then, a layer attention mechanism is used to dynamically compute the features of a specific task at the layer level to encourage task decomposition:

X_{k}^{task} = w_{k} \cdot X_{k}^{inter}, \quad \forall k \in \{1, 2, \ldots, N\}   (6)

where w_k denotes the k-th element of the layer attention weights w \in \mathbb{R}^{N}, which are calculated from the cross-layer task-interaction features X_{k}^{inter} [48]:

x^{inter} = \text{GAP}(\text{Cat}(X_{k}^{inter}))   (7)

w = \sigma(\text{FC}_{2}(\delta(\text{FC}_{1}(x^{inter}))))   (8)

where GAP and Cat denote the Global Average Pooling and Concatenate operations, respectively; \sigma is the Sigmoid activation function; FC_1 and FC_2 are two fully connected layers. This is similar to the channel-wise attention calculation in SENet [49]. Finally, the results of classification and localization are predicted from X^{task}:

Z^{task} = \text{conv}_{2}(\delta(\text{conv}_{1}(X^{task})))   (9)

where X^{task} is obtained by concatenating the X_{k}^{task}; conv_1 is a 1 x 1 convolution layer for dimension reduction, and conv_2 is a common 1 x 1 convolution layer. Z^{task} is then converted into classification scores P \in \mathbb{R}^{H \times W \times 2} or bounding box regressions B \in \mathbb{R}^{H \times W \times 4}.

During the prediction step, the distributions of P and B are further adjusted. Specifically, a probability spatial map M \in \mathbb{R}^{H \times W \times 1} is applied to adjust the classification prediction, while a spatial offset map O \in \mathbb{R}^{H \times W \times 8} is used to adjust the localization prediction:

P^{align} = \sqrt{P \times M}   (10)

B^{align}(i, j, c) = B(i + O(i, j, 2c), \; j + O(i, j, 2c + 1), \; c), \quad \forall c \in \{0, 1, 2, 3\}   (11)
where M and O are learned from the stack of task-interaction features X_{1-N}^{inter}, as further expressed by Eqs. (12) and (13); the index (i, j, c) represents the (i, j)-th spatial location in the c-th channel.

M = \sigma(\text{conv}_{2}(\delta(\text{conv}_{1}(X^{inter}))))   (12)

O = \text{conv}_{4}(\delta(\text{conv}_{3}(X^{inter})))   (13)

where conv_1 and conv_3 are 1 x 1 convolution layers for dimension reduction, and conv_2 and conv_4 are common 3 x 3 convolution layers.

For the loss functions, we use Quality Focal loss [50] for the classification task and GIoU loss [51] for the localization task. More details about the TAH can be found in Ref. [42].
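The following sketch assembles Eqs. (5)-(10) and (12) for the classification branch of a TAP; tensor sizes and reduction ratios are illustrative, and the localization branch, which analogously predicts the offset map O of Eqs. (11) and (13), is omitted:

```python
import torch
from torch import nn

class TAPSketch(nn.Module):
    """Classification-branch TAP: task-interaction stack (Eq. 5), layer
    attention (Eqs. 6-8), prediction (Eq. 9), and the spatial alignment
    map M (Eqs. 10, 12)."""
    def __init__(self, ch: int = 256, n: int = 6, num_classes: int = 2):
        super().__init__()
        self.inter = nn.ModuleList(
            nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
            for _ in range(n))                              # Eq. (5)
        self.fc1 = nn.Linear(n * ch, n * ch // 4)           # Eq. (8): FC1
        self.fc2 = nn.Linear(n * ch // 4, n)                # Eq. (8): FC2 -> w
        self.conv1 = nn.Conv2d(n * ch, ch, 1)               # Eq. (9): reduce
        self.conv2 = nn.Conv2d(ch, num_classes, 1)          # Eq. (9): predict
        self.conv_m = nn.Sequential(                        # Eq. (12): M
            nn.Conv2d(n * ch, ch // 4, 1), nn.ReLU(),
            nn.Conv2d(ch // 4, 1, 3, padding=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: neck features
        inter = []
        for conv in self.inter:                             # Eq. (5): stack
            x = conv(x)
            inter.append(x)
        cat = torch.cat(inter, dim=1)                       # Cat(X_k^inter)
        gap = cat.mean(dim=(2, 3))                          # Eq. (7): GAP
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(gap))))      # Eq. (8)
        task = torch.cat([w[:, k, None, None, None] * inter[k]
                          for k in range(len(inter))], dim=1)       # Eq. (6)
        p = torch.sigmoid(self.conv2(torch.relu(self.conv1(task)))) # Eq. (9)
        m = torch.sigmoid(self.conv_m(cat))                         # Eq. (12)
        return torch.sqrt(p * m)                                    # Eq. (10)

tap = TAPSketch()
p_align = tap(torch.randn(1, 256, 64, 64))                  # (1, 2, 64, 64)
```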
4. Experiments

All experiments were performed on a workstation equipped with an Nvidia GeForce RTX 3090 graphics card with 24 GB of VRAM. The models were constructed in a Python 3.8.16 environment using PyTorch 1.13.1, CUDA 11.7, and cuDNN 8.5.0.

4.1. Dataset description

The coal-gangue dataset was collected from a coal processing plant in Yongcheng, Henan Province, China. An industrial camera (model: MV-CA050-10GC, HIKVISION) was installed about 0.5 m above the hand-sorting conveyor. To ensure a constant light source during image collection, images were all acquired during normal nighttime production, supplemented with 285 W adjustable LED lights. Finally, a total of 1750 high-quality images of coal and gangue were captured. Coal and gangue
objects were annotated using an online open-source labeling tool, "Make Sense" (https://www.makesense.ai/).

Fig. 6 shows the annotations of coal and gangue samples. On the one hand, there are distinct differences in the characteristics of coal and gangue samples: coal surfaces are brighter and more reflective than gangue surfaces. On the other hand, coverage of moisture and coal slime on sample surfaces and overlapping between samples are common, posing challenges to detection. Therefore, a model that can address these challenges in complex environments is needed.

Table 1 presents the statistics of the dataset. We randomly split the dataset at a ratio of 8:1:1, producing 1400 images for the training set, 175 images for the validation set, and 175 images for the test set. It can be observed that the distributions of object category and size across the three sets are relatively consistent, which indicates that the data split is reasonable. It is noteworthy that the materials on the conveyor had been pre-sized using a screen, resulting in most coal and gangue materials having a size larger than 50 mm. Therefore, the dataset contains almost no small objects according to the object sizes defined by the COCO dataset [52]. In the subsequent evaluation process, we therefore do not discuss the evaluation of small object detection.

Fig. 7 illustrates the distribution of the number of objects per image, which generally follows a normal distribution. The number of objects per image is mostly concentrated around 15, indicating that the arrangement of objects is relatively dense. This demonstrates that object detection in real industrial production is significantly more challenging than in laboratory settings.

4.2. Implementation details

4.2.1. Model setting
To verify the superiority of the proposed STATNet model, we chose several mainstream object detection models as benchmarks for comparison: Faster RCNN [53], Cascade RCNN [54], SSD [11], RetinaNet, YOLOv3 [19], YOLOX [55], YOLOv7 [56], and YOLOv8. All models were trained for 100 epochs using a mini-batch size of 2. The learning rate was reduced by a factor of 10 at the 50-th and 80-th epochs. Common data augmentation included random flipping with a probability of 0.5. Inspired by the data augmentation in YOLOX [55], we further introduced the Mosaic and Mixup data augmentation strategies during the first 80 epochs of training to investigate their potential boost to our model performance. For model inference, we followed the settings used in Ref. [42], except that we resized the input images to 512 x 512. The remaining key parameters for each model are summarized in Table 2.

4.2.2. Model metrics
We utilized the common evaluation metrics of the COCO datasets to assess model accuracy. Among these metrics, we used AP50 (Average Precision @ intersection over union, IoU = 0.50) as the main performance measure. We also provide AP (AP @ IoU = 0.50:0.05:0.95), AP75 (AP @ IoU = 0.75), APm (AP for medium objects), and APl (AP for large objects) for further reference. Furthermore, we provide precision-recall (PR) curves to evaluate whether the model can maintain high precision while achieving high recall. Specifically, IoU is calculated by measuring the proportion of the intersection of the predicted bounding box and the ground-truth bounding box relative to their union:

\text{IoU} = \frac{\text{Area}(\text{Predicted} \cap \text{Ground Truth})}{\text{Area}(\text{Predicted} \cup \text{Ground Truth})}   (14)

AP at different IoUs is defined by calculating the area under the PR curve:

\text{AP} = \int_{0}^{1} P(R)\, dR   (15)

where P is the precision and R is the recall, which can be further computed as:

P = \frac{TP}{TP + FP}   (16)

R = \frac{TP}{TP + FN}   (17)

where TP, FP, and FN are true positive, false positive, and false negative samples, respectively.
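A minimal NumPy sketch of Eqs. (14) and (15); note that COCO-style evaluation samples the PR curve at fixed recall points rather than integrating it directly, so this is illustrative only:

```python
import numpy as np

def iou(box_a, box_b):
    """Eq. (14): boxes given as (x1, y1, x2, y2) in pixels."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def average_precision(precision, recall):
    """Eq. (15): area under the PR curve via trapezoidal integration."""
    order = np.argsort(recall)
    return float(np.trapz(np.asarray(precision)[order],
                          np.asarray(recall)[order]))

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ~ 0.143
```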
In terms of model complexity and speed, we considered three metrics: floating-point operations (FLOPs), which assess the computational complexity of the model; Params, which evaluates the physical memory usage of the model; and frames per second (FPS), which measures the inference speed of the model.

Furthermore, we conducted robustness tests on all models to verify their performance on images with different types of corruption [57].

4.2.3. Visualization of the training process of the proposed STATNet
Fig. 8 shows the training process of the proposed model. The left axis shows the losses of the classification and localization prediction tasks as well as the total loss during training, whereas the right axis displays the AP50 results on the validation set for each epoch. It can be observed that the model's performance reached initial convergence at around the 80-th epoch. Subsequently, after discontinuing the Mosaic and Mixup data augmentation and adjusting the learning rate at the 80-th epoch, the model's performance continued to improve and eventually converged at the 100-th epoch. This finding can be explained as follows: while Mosaic and Mixup enhance the robustness and generalization of the model, they distort the real distribution of the data. After discontinuing them, the images used for training revert to their original forms, allowing the model to continue converging. The convergence tendency with and without Mosaic and Mixup data augmentation is consistent with that in YOLOX [55].

4.3. Results and discussion

4.3.1. Ablation study
Ablation experiments were conducted to verify the effect of improving each module on the model performance. We developed our STATNet based on a popular and strong one-stage detector, RetinaNet-
Table 1
Statistics of the dataset.
Data set | Ratio | Images | Object category | Object size*
*Remark: According to the COCO dataset, the object sizes in an image are defined as follows. Large: greater than 96 x 96 pixels; Medium: between 32 x 32 and 96 x 96 pixels; Small: less than 32 x 32 pixels.
Table 2
Key training parameters for each model.
Model | Input size | Backbone | Optimizer | Learning rate | Momentum/Betas | Weight decay
R101. Specifically, we introduced Mosaic and Mixup data augmentation, replaced its backbone module, improved its neck module, and replaced its head module, step by step, to demonstrate the improvements brought about by the data augmentation and each new module in our STATNet model. The results are presented in Table 3, from which we can draw the following points:

Effect of Mosaic and Mixup: Our research first introduced the Mosaic and Mixup data augmentation during the training process. As seen from Table 3, we achieved an increase of 0.53 % in AP50 with identical model computational complexity, parameters, and inference speed. Therefore, we retained Mosaic and Mixup, two plug-and-play data augmentation methods, in the subsequent investigations in this work to enhance model accuracy.

Effect of Swin Transformer: Subsequently, we compared the capabilities of the RetinaNet model using the Swin Transformer versus ResNet101 as the backbone network. The Swin Transformer achieves higher detection accuracy, with an improvement of 0.53 % in terms of AP50. Additionally, the computational complexity and parameters of the two models show insignificant differences. As for model speed, the Swin Transformer exhibits a decrease in inference speed. This is because CNN-based networks like ResNet can be better accelerated by the cuDNN functions under the PyTorch framework, while Transformer-based architectures may require further kernel optimization to adapt to existing hardware conditions [46].

Fig. 9 employs AblationCAM to further visualize the decision-making process of different backbone networks. AblationCAM can generate gradient-free visual explanations to provide valuable insights into the model's decision-making process [58]. It is evident that the Swin Transformer backbone-based model produces more comprehensive and impressive activation regions, resulting in more trustworthy classification and localization decisions. This reflects the Swin Transformer's stronger capability of global feature learning, which makes it a
Table 4
Comparisons with other SOTA models.
Method | AP50 (%) | AP75 (%) | AP (%) | APm (%) | APl (%) | FLOPs (G) | Params (M) | FPS
importance. This is because lower model accuracy inevitably results in more serious misjudgment of coal and gangue, which leads to great loss of mineral resources, increases air pollution, and undermines the profit of coal enterprises. Taking a coal preparation plant with an annual production capacity of 3 million tons and a clean coal yield of 50 % as an example, and assuming that the price of the clean coal product is 800 CNY per ton [59], a mere 1 % loss in coal product would result in a significant economic loss of approximately 12 million CNY annually, as verified below. This rough estimation demonstrates that model accuracy is critical to maximizing economic benefits. Furthermore, hardware upgrades can improve the model's inference speed, enabling it to meet real-time detection requirements.
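The 12 million CNY figure follows directly from the stated assumptions:

3,000,000 t/yr x 50 % x 1 % = 15,000 t/yr of lost clean coal; 15,000 t/yr x 800 CNY/t = 12,000,000 CNY/yr.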
Additionally, Fig. 12 provides the PR curves for gangue and coal detection. It can be observed that the proposed STATNet model maintains higher precision while achieving higher recall rates. Another notable observation is that all models perform slightly worse in coal detection compared to gangue detection. This may be because the appearance of coal is more similar to the background (i.e., the black color of the conveyor belt), making it more challenging to detect than gangue. Nevertheless, the proposed model shows a more obvious
Fig. 11. Relationships between model accuracy and (a) computational complexity; (b) inference time.
Fig. 12. The PR-curve for (a) gangue detection; (b) coal detection.
[4] Yin J, Zhu J, Zhu H, Pan G, Zhu W, Zeng Q, et al. Intelligent photoelectric identification of coal and gangue - a review. Measurement 2024;233:114723. https://doi.org/10.1016/j.measurement.2024.114723.
[5] Zhang K, Yang X, Xu L, Thé J, Tan Z, Yu H. Enhancing coal-gangue object detection using GAN-based data augmentation strategy with dual attention mechanism. Energy 2024;287:129654. https://doi.org/10.1016/j.energy.2023.129654.
[6] Yang X, Zhang K, Ni C, Cao H, Thé J, Xie G, et al. Ash determination of coal flotation concentrate by analyzing froth image using a novel hybrid model based on deep learning algorithms and attention mechanism. Energy 2022;260:125027. https://doi.org/10.1016/j.energy.2022.125027.
[7] Yang J, Chang B, Zhang Y, Luo W, Ge S, Wu M. CNN coal and rock recognition method based on hyperspectral data. Int J Coal Sci Technol 2022;9(1):63. https://doi.org/10.1007/s40789-022-00516-x.
[8] Cheng G, Chen J, Wei Y, Chen S, Pan Z. A coal gangue identification method based on HOG combined with LBP features and improved support vector machine. Symmetry (Basel) 2023;15(1):202.
[9] Weimer D, Scholz-Reiter B, Shpitalni M. Design of deep convolutional neural network architectures for automated feature extraction in industrial inspection. CIRP Ann 2016;65(1):417-20. https://doi.org/10.1016/j.cirp.2016.04.072.
[10] Wang X. Deep learning in object recognition, detection, and segmentation. Found Trends Signal Process 2016;8(4):217-382.
[11] Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C-Y, et al. SSD: single shot multibox detector. In: Computer vision - ECCV 2016: 14th European conference, Amsterdam, The Netherlands, October 11-14, 2016, proceedings, part I. Springer; 2016. p. 21-37.
[12] Jiang P, Ergu D, Liu F, Cai Y, Ma B. A review of YOLO algorithm developments. Procedia Comput Sci 2022;199:1066-73.
[13] Zhang Y, Wang J, Yu Z, Zhao S, Bei G. Research on intelligent detection of coal gangue based on deep learning. Measurement 2022;198:111415. https://doi.org/10.1016/j.measurement.2022.111415.
[14] Zhang B, Zhang HB. Coal gangue detection method based on improved SSD algorithm. In: 2021 international conference on intelligent transportation, big data & smart city (ICITBS); 2021. p. 634-7.
[15] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556; 2014.
[16] Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, et al. MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861; 2017.
[17] Xue G, Li S, Hou P, Gao S, Tan R. Research on lightweight YOLO coal gangue detection algorithm based on ResNet18 backbone feature network. Internet of Things 2023;22:100762. https://doi.org/10.1016/j.iot.2023.100762.
[18] He K, Zhang X, Ren S, Sun J. Identity mappings in deep residual networks. In: Computer vision - ECCV 2016: 14th European conference, Amsterdam, The Netherlands, October 11-14, 2016, proceedings, part IV. Springer; 2016. p. 630-45.
[19] Redmon J, Farhadi A. YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767; 2018. https://doi.org/10.48550/arXiv.1804.02767.
[20] Lv Z, Wang W, Xu Z, Zhang K, Lv H. Cascade network for detection of coal and gangue in the production context. Powder Technol 2021;377:361-71. https://doi.org/10.1016/j.powtec.2020.08.088.
[21] Yan P, Wang W, Li G, Zhao Y, Wang J, Wen Z. A lightweight coal gangue detection method based on multispectral imaging and enhanced YOLOv8n. Microchem J 2024;199:110142. https://doi.org/10.1016/j.microc.2024.110142.
[22] Li D-Y, Wang G-F, Guo Y-C, Zhang Y, Wang S. An identification and positioning method for coal gangue based on lightweight mixed domain attention. Int J Coal Prep Util 2023;43(9):1542-60. https://doi.org/10.1080/19392699.2022.2119561.
[23] Yan P, Sun Q, Yin N, Hua L, Shang S, Zhang C. Detection of coal and gangue based on improved YOLOv5.1 which embedded scSE module. Measurement 2022;188:110530. https://doi.org/10.1016/j.measurement.2021.110530.
[24] Pan H, Shi Y, Lei X, Wang Z, Xin F. Fast identification model for coal and gangue based on the improved tiny YOLO v3. J Real-Time Image Process 2022;19(3):687-701. https://doi.org/10.1007/s11554-022-01215-1.
[25] Li M, He X, Yuan Y, Yang M. Multiple factors influence coal and gangue image recognition method and experimental research based on deep learning. Int J Coal Prep Util 2022:1-17. https://doi.org/10.1080/19392699.2022.2118260.
[26] Wei D, Li J, Li B, Wang X, Chen S, Wang X, et al. A fast recognition method for coal gangue image processing. Multimed Syst 2023;29(4):2323-35. https://doi.org/10.1007/s00530-023-01109-7.
[27] Wen X, Li B, Wang X, Li J, Wei D, Gao J, et al. A Swin transformer-functionalized lightweight YOLOv5s for real-time coal-gangue detection. J Real-Time Image Process 2023;20(3):47. https://doi.org/10.1007/s11554-023-01305-8.
[28] Liu Q, Li J, Li Y, Gao M. Recognition methods for coal and coal gangue based on deep learning. IEEE Access 2021;9:77599-610. https://doi.org/10.1109/ACCESS.2021.3081442.
[29] Guo Y, Zhang Y, Li F, Wang S, Cheng G. Research of coal and gangue identification and positioning method at mobile device. Int J Coal Prep Util 2023;43(4):691-707. https://doi.org/10.1080/19392699.2022.2072305.
[30] Yang D, Miao C, Li X, Liu Y, Wang Y, Zheng Y. Improved YOLOv7 network model for gangue selection robot for gangue and foreign matter detection in coal. Sensors 2023;23(11):5140.
[31] Lv Z, Wang W, Xu Z, Zhang K, Fan Y, Song Y. Fine-grained object detection method using attention mechanism and its application in coal-gangue detection. Appl Soft Comput 2021;113:107891. https://doi.org/10.1016/j.asoc.2021.107891.
[32] Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016. p. 2818-26.
[33] Tan M, Le Q. EfficientNet: rethinking model scaling for convolutional neural networks. In: Proceedings of the 36th international conference on machine learning. Proceedings of machine learning research: PMLR; 2019. p. 6105-14.
[34] Luan H, Xu H, Tang W, Tian Y, Zhang Q. Coal and gangue classification in actual environment of mines based on deep learning. Measurement 2023;211:112651. https://doi.org/10.1016/j.measurement.2023.112651.
[35] Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, et al. An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929; 2020.
[36] Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-end object detection with transformers. In: European conference on computer vision. Springer; 2020. p. 213-29.
[37] Zhang K, Thé J, Xie G, Yu H. Multi-step ahead forecasting of regional air quality using spatial-temporal deep neural networks: a case study of Huaihai Economic Zone. J Clean Prod 2020;277:123231. https://doi.org/10.1016/j.jclepro.2020.123231.
[38] Zhang K, Cao H, Thé J, Yu H. A hybrid model for multi-step coal price forecasting using decomposition technique and deep learning algorithms. Appl Energy 2022;306:118011. https://doi.org/10.1016/j.apenergy.2021.118011.
[39] Liu S, Huang D, Wang Y. 2019.
[40] Lin T-Y, Dollár P, Girshick R, He K, Hariharan B, Belongie S. Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 2117-25.
[41] Zaidi SSA, Ansari MS, Aslam A, Kanwal N, Asghar M, Lee B. A survey of modern deep learning based object detection models. Digit Signal Process 2022;126:103514. https://doi.org/10.1016/j.dsp.2022.103514.
[42] Feng C, Zhong Y, Gao Y, Scott MR, Huang W. TOOD: task-aligned one-stage object detection. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV). IEEE Computer Society; 2021. p. 3490-9.
[43] Kaur P, Khehra BS, Mavi EBS. Data augmentation for object detection: a review. In: 2021 IEEE international midwest symposium on circuits and systems (MWSCAS); 2021. p. 537-43.
[44] Li B, Li Y, Zhu X, Qu L, Wang S, Tian Y, et al. Substation rotational object detection based on multi-scale feature fusion and refinement. Energy AI 2023;14:100294. https://doi.org/10.1016/j.egyai.2023.100294.
[45] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Adv Neural Inf Process Syst 2017;30:5998-6008.
[46] Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, et al. Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision; 2021. p. 10012-22.
[47] Liu S, Qi L, Qin H, Shi J, Jia J. Path aggregation network for instance segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR); 2018. p. 8759-68.
[48] Zhang K, Yang X, Cao H, Thé J, Tan Z, Yu H. Multi-step forecast of PM2.5 and PM10 concentrations using convolutional neural network integrated with spatial-temporal attention and residual learning. Environ Int 2023;171:107691. https://doi.org/10.1016/j.envint.2022.107691.
[49] Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2018. p. 7132-41.
[50] Li X, Lv C, Wang W, Li G, Yang L, Yang J. Generalized focal loss: towards efficient representation learning for dense object detection. IEEE Trans Pattern Anal Mach Intell 2023;45(3):3139-53. https://doi.org/10.1109/TPAMI.2022.3180392.
[51] Rezatofighi H, Tsoi N, Gwak J, Sadeghian A, Reid I, Savarese S. Generalized intersection over union: a metric and a loss for bounding box regression. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2019. p. 658-66.
[52] Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, et al. Microsoft COCO: common objects in context. Cham: Springer International Publishing; 2014. p. 740-55.
[53] Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. Adv Neural Inf Process Syst 2015;28.
[54] Cai Z, Vasconcelos N. Cascade R-CNN: delving into high quality object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2018. p. 6154-62.
[55] Ge Z, Liu S, Wang F, Li Z, Sun J. YOLOX: exceeding YOLO series in 2021. arXiv preprint arXiv:2107.08430; 2021. https://doi.org/10.48550/arXiv.2107.08430.
[56] Wang C-Y, Bochkovskiy A, Liao H-YM. YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2023. p. 7464-75.
[57] Hendrycks D, Dietterich T. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261; 2019.
[58] Ramaswamy HG. Ablation-CAM: visual explanations for deep convolutional network via gradient-free localization. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision; 2020. p. 983-91.
[59] Zhu S, Chi Y, Gao K, Chen Y, Peng R. Analysis of influencing factors of thermal coal price. Energies 2022;15(15):5652.