Huimin Lu (Editor)

Artificial Intelligence and Robotics
Studies in Computational Intelligence
Volume 917
Series Editor
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
The series “Studies in Computational Intelligence” (SCI) publishes new developments and advances in the various areas of computational intelligence—quickly and
with a high quality. The intent is to cover the theory, applications, and design
methods of computational intelligence, as embedded in the fields of engineering,
computer science, physics and life sciences, as well as the methodologies behind
them. The series contains monographs, lecture notes and edited volumes in
computational intelligence spanning the areas of neural networks, connectionist
systems, genetic algorithms, evolutionary computation, artificial intelligence,
cellular automata, self-organizing systems, soft computing, fuzzy systems, and
hybrid intelligent systems. Of particular value to both the contributors and the
readership are the short publication timeframe and the world-wide distribution,
which enable both wide and rapid dissemination of research output.
Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago.
Artificial Intelligence and Robotics
Editor
Huimin Lu
Department of Mechanical and Control
Engineering
Kyushu Institute of Technology
Kitakyushu, Japan
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
Acknowledgements
This book was supported by the Leading Initiative for Excellent Young Researchers Program of the Ministry of Education, Culture, Sports, Science and Technology of Japan (16809746), the Strengthening Research Support Project of Kyushu Institute of Technology, and the Kitakyushu Convention & Visitors Association.
We would like to thank all authors for their contributions. The editors also wish
to thank the referees who carefully reviewed the papers and gave useful suggestions
and feedback to the authors. Finally, we would like to thank Prof. Hyoungseop
Kim, Prof. Manu Malek, Prof. Pin-Han Ho, Prof. Geoffrey C. Fox, Prof. Kaori
Yoshida, and all editors of Studies in Computational Intelligence for their cooperation in preparing the book.
About This Book
This edited book presents scientific results in the research fields of artificial intelligence and robotics. The main focus of this book is on new research ideas and results for the mathematical problems in robotic vision systems.
In August 2019, the 4th International Symposium on Artificial Intelligence and Robotics (ISAIR 2019) took place in Daegu, Korea. This conference is an annual event sponsored by the International Society for Artificial Intelligence and Robotics (ISAIR) and the International Association for Pattern Recognition (IAPR), and technically supported by the IEEE Computer Society Big Data STC and SPIE. ISAIR aims to promote research on future information technology and to bring together researchers from academia and industry to share and discuss ideas, problems, and solutions in various areas of information technology through a series of activities such as conferences, symposia, and special sessions.
ISAIR 2019 received over 280 papers from more than 12 countries. The present book comprises selected contributions from this conference. The chapters were chosen based on review scores submitted by the editors and underwent further rigorous rounds of review. This publication captures 19 of the most promising papers, and we eagerly await the important contributions that we know these authors will bring to the fields of artificial intelligence and robotics.
1 Introduction
The arrival of the information age has brought urban development into a digital, smart, and mobile stage. Reporting the major issues that cities face increasingly depends on information technologies such as smartphones and cameras [1, 2]. In China, most urban management systems now run efficiently: citizens can use mobile apps, the WeChat client, computers, and other terminals to upload images and fill out relevant information (i.e., location, main category, real situation, etc.) when they find and report urban issues [3, 4]. Urban management staff then delegate the issues to the related administrative departments according to experience and the provided information. However, with the continued expansion of cities, urban issues have been increasing so rapidly that manual distribution to the administrative departments can no longer meet the needs of daily work. Hence, a quick classification tool for urban issues is urgently needed to improve the efficiency of urban management and make it more intelligent. In this paper, the problem of image classification is transformed into an image recognition task, and we propose a deep learning approach to the classification of high-definition urban images to improve processing efficiency in the urban management system.
The extraction of image features has progressed from handcrafted features to deep learning classification methods, in which higher-level features are represented by combinations of lower-level features of the data. Traditional classification methods have been proposed from different perspectives, such as using Gabor wavelets [5], Gaussian Markov random fields (GMRF) [6], and the scale-invariant feature transform (SIFT) [7] to extract textural, remote sensing, and digital image features, and then using the support vector machine (SVM) [8], the Lagrangian support vector machine (LSVM) [9], and other traditional machine learning methods to classify the images. In 1962, Hubel and Wiesel [10] presented the concept of the receptive field. In 1980, Fukushima [11] proposed the neocognitron, the first CNN-style network. In 1989, LeCun et al. [12] applied the back-propagation algorithm to handwritten zip code recognition, which greatly promoted the development of convolutional neural networks. In 1998, LeCun et al. proposed the LeNet-5 network structure [13]. This basic design performs well on MNIST, CIFAR, and other datasets, especially on the ImageNet classification challenge [14, 15]. From 2012 to 2015, with the development of computer hardware and intensive research on convolutional neural networks, deeper and more complete network models kept breaking records on ImageNet, such as AlexNet by Krizhevsky et al. [15], VGGNet from Oxford University [16], Google's GoogLeNet [17], and Microsoft's ResNet [18].
Urban images are collected by citizens and outdoor workers who photograph urban issues with their own ordinary mobile phones. The collected images have drawbacks: complex background information, low resolution, and uneven brightness make it difficult to obtain exact results. Since urban images in the urban management system are collected manually, they must contain objects that carry key information and vital features. If those key objects can be detected in the complex background of the image, this is equivalent to extracting the critical features that can represent the image.
[Fig. 1: CBCNet pipeline — a convolutional detection network and a BP evaluation network map detected key objects (e.g., sharing bike: 80, car: 20, trash: 60) to an issue category such as misplacing of bike-sharing.]
2 Proposed Method
The detection part of the CBCNet model is designed for complex backgrounds and multi-scale images. We use 1 × 1 reduction layers followed by 3 × 3 convolutional layers to abstract representative data within the receptive field [22]. The full network is shown in Fig. 2.
Convolutional layers are employed to extract image features. Each layer convolves the image with an established kernel, adds weights and biases to the convolution value, and then applies an activation function. Because the convolutional kernels used in this paper are relatively small, the convolution operation is implemented by traversing matrices. The concrete flow of the detection network is: (1) resizing the image data to 448 × 448 as the input to the model; (2) setting the
leaky rectified linear activation, as shown in (1). The 1 × 1 reduction layers serve a dual purpose: most importantly, they act as dimension reduction modules that break the calculation bottleneck, without which the size of our network would be limited; this has a vital influence on the working efficiency of today's urban management systems. They allow us not only to deepen and widen the network without causing serious performance degradation, but also to accelerate network optimization. We find that such over-parameterization is sometimes useful.
$$ \varphi(x) = \begin{cases} x, & \text{if } x > 0 \\ 0.1x, & \text{otherwise} \end{cases} \qquad (1) $$
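As an illustration, the following is a minimal PyTorch-style sketch (an assumption on our part, since the original implementation is built on Darknet) of one detection block combining a 1 × 1 reduction layer, a 3 × 3 convolution, and the leaky rectified linear activation of Eq. (1); the channel widths are illustrative only.

```python
import torch
import torch.nn as nn

class ReductionConvBlock(nn.Module):
    """1x1 reduction followed by a 3x3 convolution, each with leaky ReLU (slope 0.1)."""
    def __init__(self, in_channels, reduce_channels, out_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, reduce_channels, kernel_size=1),            # 1x1 reduction layer
            nn.LeakyReLU(0.1),                                                  # phi(x) from Eq. (1)
            nn.Conv2d(reduce_channels, out_channels, kernel_size=3, padding=1), # 3x3 convolution
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        return self.block(x)

# Example: a 448x448 RGB patch passed through one block (channel sizes are hypothetical).
x = torch.randn(1, 3, 448, 448)
y = ReductionConvBlock(3, 16, 64)(x)
print(y.shape)  # torch.Size([1, 64, 448, 448])
```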
In this paper, we use the width w, the height h, and the center offsets x and y of the single key object contained in each urban image as the input vector. By alternating forward propagation of the inputs and back propagation of the error, the weights are gradually adjusted by the gradient descent method until the error reaches a minimum.
3 Training
We use an urban image dataset with annotated objects to train the detection network to extract information about the objects contained in urban images, as shown in Fig. 6. We use these data to retrain the CBCNet model, which is based on the YOLO model. The relationship between the image issue category and the annotated objects is used to train the evaluation network to obtain the final classification, after which the training of the entire network is complete. When training the detection model, the small number of samples in the training set easily causes over-fitting, so we expand the dataset through data augmentation by fine-tuning the rotation, exposure, saturation, and tone of the original images, as sketched below.
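A minimal sketch of such augmentation using Pillow (our choice for illustration; the original work does not specify an implementation, and the rotation/exposure/saturation ranges below are assumptions):

```python
import random
from PIL import Image, ImageEnhance

def augment(img: Image.Image) -> Image.Image:
    """Randomly perturb rotation, exposure (brightness), saturation, and tone."""
    img = img.rotate(random.uniform(-10, 10))                              # small random rotation
    img = ImageEnhance.Brightness(img).enhance(random.uniform(0.7, 1.3))   # exposure jitter
    img = ImageEnhance.Color(img).enhance(random.uniform(0.7, 1.3))        # saturation jitter
    img = ImageEnhance.Contrast(img).enhance(random.uniform(0.8, 1.2))     # rough stand-in for tone
    return img

# Example usage: expand the training set with several jittered copies per image.
# original = Image.open("urban_issue.jpg")
# augmented = [augment(original) for _ in range(5)]
```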
The detection part of the CBCNet model uses the sum-squared error as the loss function to optimize its parameters, i.e., the sum-squared error between the network output vector and the vector corresponding to the real image, as shown in formula (2), where ŷ_i and y_i denote the predicted data and the annotated data, respectively. In this network, object-position errors and category errors contribute differently to the loss value. Therefore, when calculating the loss, we set λ_coord = 6.2 to increase the contribution of position errors and λ_noobj = 0.4 to reduce the influence of grid cells that do not contain objects.
As shown in formula (3), a multi-part loss function is optimized when training the model, where x and y represent the offsets of the center of the bounding box from the grid cell boundary, and w and h represent the ratios of the bounding box's width and height to the original image.
[Fig. 6: Illustration of the detection network training process for CBCNet — the urban image is resized to 448 × 448, key objects are located by the convolutional neural network, and their scores are passed to the BP evaluation network.]
$$ SSE = \sum_{i=0}^{n} \lambda_i \left( y_i - \hat{y}_i \right)^2 \qquad (2) $$

$$
\begin{aligned}
Loss ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{i,j}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{i,j}^{obj} \left[ \left( \sqrt{w_i} - \sqrt{\hat{w}_i} \right)^2 + \left( \sqrt{h_i} - \sqrt{\hat{h}_i} \right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{i,j}^{obj} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{i,j}^{no\text{-}obj} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2 \qquad (3)
\end{aligned}
$$
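As a concrete illustration, the following is a simplified NumPy sketch of the multi-part loss in Eq. (3); the tensor layout, the single-box-per-cell simplification, and the variable names are our assumptions, not the authors' implementation.

```python
import numpy as np

def detection_loss(pred, target, obj_mask, lambda_coord=6.2, lambda_noobj=0.4):
    """Simplified multi-part loss of Eq. (3) for one image.

    pred, target: arrays of shape (S*S, 5 + num_classes) holding
                  [x, y, w, h, confidence, class probabilities...] per grid cell.
    obj_mask:     boolean array of shape (S*S,), True where a cell contains an object.
    """
    noobj_mask = ~obj_mask
    # Localization terms (center offsets, and square roots of width/height).
    xy_loss = np.sum((pred[obj_mask, 0:2] - target[obj_mask, 0:2]) ** 2)
    wh_loss = np.sum((np.sqrt(pred[obj_mask, 2:4]) - np.sqrt(target[obj_mask, 2:4])) ** 2)
    # Confidence terms for cells with and without objects.
    conf_obj = np.sum((pred[obj_mask, 4] - target[obj_mask, 4]) ** 2)
    conf_noobj = np.sum((pred[noobj_mask, 4] - target[noobj_mask, 4]) ** 2)
    # Class probability term, only for cells containing an object.
    cls_loss = np.sum((pred[obj_mask, 5:] - target[obj_mask, 5:]) ** 2)
    return (lambda_coord * (xy_loss + wh_loss) + conf_obj
            + lambda_noobj * conf_noobj + cls_loss)
```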
YOLO and Faster R-CNN are selected as baselines for classifying urban images. The experiments are based on the deep learning framework Darknet, in a hardware environment configured with Ubuntu 14.04, an M40 graphics card, and an Intel Xeon E5 CPU. We evaluate the model by calculating the classification accuracy of each category, as shown in formula (4), where TP is the number of true positives and S is the total number of samples of that category. Both the YOLO algorithm and the Faster R-CNN algorithm achieve good results. Among them, Faster R-CNN reaches 96.93%, about 0.6% higher than YOLO (Table 2).
$$ Mean\_pre = \frac{1}{n} \sum_{i=1}^{n} Precision_i \qquad (6) $$
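A minimal sketch of these per-category and mean precision metrics (the exact formula (4) is not reproduced above, so the per-category accuracy below, true positives divided by the category's sample count, is our assumption):

```python
import numpy as np

def per_category_accuracy(y_true, y_pred, category):
    """TP / S for one category: correctly classified samples over all samples of that category."""
    mask = (y_true == category)
    return float(np.mean(y_pred[mask] == category))

def mean_precision(y_true, y_pred, categories):
    """Mean_pre of Eq. (6): average the per-category scores over all n categories."""
    return float(np.mean([per_category_accuracy(y_true, y_pred, c) for c in categories]))

# Example usage with hypothetical labels:
# y_true = np.array([0, 0, 1, 1, 2]); y_pred = np.array([0, 1, 1, 1, 2])
# print(mean_precision(y_true, y_pred, categories=[0, 1, 2]))
```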
PASCAL VOC is a benchmark for visual recognition and object detection. It provides a standard image annotation dataset and evaluation system, containing a total of 9,963 labeled pictures divided into 20 categories, split into train/validation/test parts, with 24,640 annotated objects. In our experiment, we use the images containing only a single object as the training set and the others as the testing set, and compare with AlexNet, VGGNet16, and ResNet50 [24]. As shown in Table 5, the ResNet50 network achieves the highest classification accuracy with 73.8%, and our method is second, reaching 72.4%. Using CBCNet, the class 'cat' has the highest classification accuracy of 89.6%, and the class 'potted plant' has the lowest classification accuracy of 48.1%.
5 Conclusion
To handle the increasing number of urban management cases efficiently in urban computing, in this paper we have proposed a deep learning approach for classifying urban images with complex backgrounds, which is able not only to classify cases based on urban images but also to improve the efficiency of urban management. In the CBCNet classification framework, this method classifies issues by detecting information about key objects in an urban image, and compensates for the low accuracy of current image classification algorithms on complex-background images. We use urban images to test our method's classification performance from various aspects. The experimental data show that this method can classify urban images automatically, with 97.23% accuracy. In the same hardware environment, this method maintains good classification accuracy while guaranteeing training speed compared to AlexNet, VGGNet, and ResNet. In future work, we will further improve and open our dataset, and combine it with algorithms such as image segmentation and image denoising to improve the classification accuracy and efficiency of the urban management system.
Acknowledgements This study is supported by the National Natural Science Foundation of China
(Grant No. 61562013, 61906050), the Natural Science Foundation of Guangxi Province (CN)
(2017GXNSFDA198025), the Key Research and Development Program of Guangxi Province under
Grant (AB16380293, AD19245202), the Study Abroad Program for Graduate Student of Guilin
University of Electronic Technology (GDYX2018006).
References
1. Yang C (2014) Research on construction of digital intelligent city management system. Int J
Hybrid Inf Technol
2. Zheng Y, Mascolo C, Silva CT (2017) Guest editorial: urban computing. IEEE Trans Big Data
3(2):124–125
3. Zhang BX, Wang XC (2014) The reality of plight faced by city management. Urban Probl
05:79–84
4. Dai P, Jing C, Du M et al (2006) A method based on spatial analyst to detect hot spot of urban
component management events. In: IEEE international conference on spatial data mining and
geographical knowledge services, vol 8, no 10, pp 55–59, Jan 2006
5. Moghaddam HA, Saadatmand-Tarzjan M (2006) Gabor wavelet correlogram algorithm for
image indexing and retrieval. In: International conference on pattern recognition, vol 2, pp
925–928
6. Seetharaman K (2015) Image retrieval based on micro-level spatial structure features and
content analysis using full range Gaussian Markov random field model. Eng Appl Artif Intell
13(40):103–116
7. Lindeberg T (2012) Scale invariant feature transform. Scholarpedia 5(1):2012–2021
8. Adankon MM, Cheriet M (2009) Support vector machine. In: International conference on
intelligent networks and intelligent systems, pp 418–421
9. Hwang JP, Choi B, Hong IW et al (2013) Multiclass Lagrangian support vector machine. Neural
Comput Appl 3(4):703–710
10. Hubel DH, Wiesel TN (1962) Receptive fields, binocular interaction and functional architecture
in the cat’s visual cortex. J Physiol 160(1):151–160
11. Fukushima K (1980) Neocognitron: a self-organizing neural network model for a mechanism
of pattern recognition unaffected by shift in position. Biol Cybern 36(4):193–202
12. Lecun Y, Boser B, Denker J et al (1989) Backpropagation applied to handwritten zip code
recognition. Neural Comput 1(4):541–551
13. Lécun Y, Bottou L, Bengio Y et al (1998) Gradient-based learning applied to document
recognition. Proc IEEE 86(11):2278–2324
14. Russakovsky O et al (2015) ImageNet large scale visual recognition challenge. Int J Comput
Vis 115(3):211–252
15. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional
neural networks. In: International conference on neural information processing systems, pp
1097–1105
16. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image
recognition. Comput Sci 1409–1556
17. Szegedy C, Liu W, Jia Y et al (2015) Going deeper with convolutions. Comput Vis Pattern
Recogn 1(9):1409–4842
18. He K, Zhang X, Ren S et al (2016) Deep residual learning for image recognition. Comput Vis
Pattern Recogn 770–778
19. Girshick R, Donahue J, Darrell T et al (2014) Rich feature hierarchies for accurate object
detection and semantic segmentation. Comput Vis Pattern Recogn 11(23):580–587
20. Ren S, He K, Girshick R et al (2017) Faster R-CNN: towards real-time object detection with
region proposal networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137–1149
21. Cengil E, Cinar A et al (2017) A GPU-based convolutional neural network approach for image
classification. Intell Data Process Symp 16(17):1–6
22. Lin M, Chen Q, Yan S (2013) Network in network. Comput Sci 10(4)
23. Redmon J, Divvala S, Girshick R et al (2015) You only look once: unified, real-time object
detection. Comput Vis Pattern Recogn 27(30):779–788
24. Zhong G et al (2017) Reducing and stretching deep convolutional activation features for
accurate image classification. Cogn Comput 10(10):1–8
Building Label-Balanced Emotion
Corpus Based on Active Learning
for Text Emotion Classification
X. Shi · P. Liao
School of Mechanical Engineering, Nantong University, No. 9 Seyuan Road, Nantong, Jiangsu,
China
e-mail: sxfmch@163.com
P. Liao
e-mail: liao.p@ntu.edu.cn
X. Shi · X. Kang · F. Ren (B)
Faculty of Engineering, Tokushima University, 2-1 Minamijyousanjima-cho, Tokushima
770-8506, Japan
e-mail: ren@is.tokushima-u.ac.jp
X. Kang
e-mail: kang-xin@is.tokushima-u.ac.jp
1 Introduction
Emotions have been found useful for clarifying the mental thoughts underlying the huge number of social network messages and for solving an increasing number of real-world problems, such as public opinion analysis [1–3], disease diagnosis [4–6], stock trend prediction [7–9], and product review evaluation [10–12]. Correctly understanding the emotional information in social network messages helps the analysis of future trends in these fields and provides valuable guidance for timely decision making.
Emotion classification research focuses on the recognition of fine-grained human emotions [13, 14], which differs from sentiment classification into positive and negative polarities. There has been some variation in the definitions of human emotions across research fields. For example, Ekman [15] proposed six basic human emotion categories of Anger, Disgust, Fear, Happiness, Sadness, and Surprise for facial emotion recognition, while Ren et al. [24] proposed eight basic emotion categories of Anger, Joy, Sorrow, Anxiety, Hate, Expect, Surprise, and Love for language emotion classification. In this paper, we employ the eight basic emotion categories from [24] to study social network language emotion classification methods.
One of the most significant problems in learning emotion classifiers for natural language is that emotion labels are distributed in a highly biased manner in the raw data. This makes a training set with well-balanced emotion labels very difficult to build from real-world raw data and restricts the learned emotion classifiers in recognizing the less frequent emotion labels, such as anxiety and surprise. In this paper, we propose a novel method to actively select samples which are potentially indicative of the less frequent emotion labels from a huge number of raw social network messages, in order to build a balanced training set for emotion classification.
Specifically, given an existing training set with unbalanced emotion labels and a set of raw text samples, our algorithm first generates probabilistic predictions for all the raw samples and temporarily merges their probabilistic emotion predictions into the existing training set to construct a group of candidate training sets. Then, by explicitly evaluating the Kullback–Leibler divergence of the emotion label distributions in the different candidate training sets from the ideal uniform emotion distribution, the algorithm incrementally finds the most promising candidate training sets, and accordingly the samples with the most promising label-balancing property. Finally, the samples are given ground-truth emotion labels by human experts and can be merged into the training data permanently.
The rest of the paper is arranged as follows. We review the related work in Sect. 2.
In Sect. 3, we describe the algorithm for constructing a label-balanced training set for
text emotion classification. Section 4 describes the experimental setup and analyses the text emotion classification results. Our conclusions and future work are given in Sect. 5.
2 Related Work
In this part, we illustrate our platform in detail for constructing a well-balanced training set of emotional samples and for learning a multi-label emotion classifier. These correspond to an active learning model, as an extension of Kang et al.'s work [28], and a basic multi-label emotion classification model, respectively. In the active learning model, we extend Kang et al.'s work [28] by modifying the order of the sample selection procedure and by explicitly evaluating the label-balancing property of each raw sample to improve the label balance in training set construction. The detailed descriptions of the two models are as follows.
yk = ϕk (x). (1)
We extract bag-of-words features from the message texts with the Chinese morphological analysis engine THULAC, with low-frequency words and stop words filtered out. The resulting feature vector x indicates the observation of each word in a Weibo message. The hyper-parameters of the logistic regression classifiers, including the l1 and l2 penalties, the regularization strength, and the class weights for each classifier ϕ_k, are selected through 5-fold cross-validation on the training set, as sketched below.
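A minimal scikit-learn sketch of this per-emotion classifier setup (the feature extractor, parameter grid, and function names are illustrative assumptions; the original work does not specify a library):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# texts: list of already-segmented Weibo messages; labels_k: 0/1 labels for emotion k.
def train_emotion_classifier(texts, labels_k):
    pipeline = Pipeline([
        ("bow", CountVectorizer(min_df=2)),          # bag-of-words, low-frequency words dropped
        ("clf", LogisticRegression(solver="liblinear", max_iter=1000)),
    ])
    grid = {
        "clf__penalty": ["l1", "l2"],                # l1 / l2 penalty
        "clf__C": [0.01, 0.1, 1.0, 10.0],            # inverse regularization strength
        "clf__class_weight": [None, "balanced"],     # class weights
    }
    search = GridSearchCV(pipeline, grid, cv=5, scoring="f1")  # 5-fold cross validation
    return search.fit(texts, labels_k)               # one binary classifier phi_k per emotion
```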
As in the active learning model [28], text samples with the most significant informativeness and representativeness scores from the large set of raw data are greedily selected to be appended to the existing training set, which is then employed for updating the learning of the emotion classifiers ϕ_k. Different from the active learning model [28], we reorder the sample selection procedure by putting the complementariness criterion for inhibiting biased emotion labels after the other three sample selection criteria, i.e. the informativeness criterion, the representativeness criterion, and the diverseness criterion. This allows our approach to directly adjust the label-balancing property in the final output and put more weight on the complementariness criterion accordingly. Furthermore, we redesign the complementariness criterion to explicitly evaluate the Kullback–Leibler divergence between the emotion distribution in a temporary training set and the ideal uniform emotion distribution,
which evaluates the label-balancing property for the corresponding raw sample in a
more explicit manner. The derivations for each criterion are as follows.
The informativeness criterion evaluates the maximum of the probabilistic emotion prediction entropy over all emotion categories. The representativeness criterion evaluates how linguistically similar a raw sample is to the other raw samples by

$$ r(x) = -\frac{1}{|U|} \sum_{x' \in U} \sqrt{x \cdot x - 2\,x \cdot x' + x' \cdot x'}, \qquad (3) $$

in which we use U to indicate the set of all raw samples and employ the opposite of the Euclidean distance between two samples x and x′ to indicate their linguistic
similarity. By maximizing this criterion as in Algorithm I, our approach could find
the candidate samples which are linguistically representative of many other samples
in the raw set.
The diverseness criterion evaluates the minimum of the Euclidean distances from a raw sample to all the selected training samples by

$$ d(x) = \min_{x' \in X} \sqrt{x \cdot x - 2\,x \cdot x' + x' \cdot x'}, \qquad (4) $$
in which we use X to indicate the set of message samples in the training data. By maximizing this criterion as shown in Algorithm I, our approach can find the candidate samples which are linguistically distinct from the already selected training samples.
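The two geometric criteria of Eqs. (3) and (4) can be sketched in a few lines of NumPy (a vectorized illustration under the assumption that the bag-of-words features are stored as dense row vectors):

```python
import numpy as np

def pairwise_sq_dists(A, B):
    """Squared Euclidean distances x.x - 2 x.x' + x'.x' between rows of A and rows of B."""
    return (np.sum(A**2, axis=1)[:, None]
            - 2.0 * A @ B.T
            + np.sum(B**2, axis=1)[None, :])

def representativeness(U):
    """r(x) of Eq. (3): negative mean distance from each raw sample to all raw samples."""
    return -np.mean(np.sqrt(np.maximum(pairwise_sq_dists(U, U), 0.0)), axis=1)

def diverseness(U, X):
    """d(x) of Eq. (4): minimum distance from each raw sample to the selected training samples."""
    return np.min(np.sqrt(np.maximum(pairwise_sq_dists(U, X), 0.0)), axis=1)
```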
We propose a novel complementariness criterion for inhibiting the selection of biased emotion labels into the training set, by constructing a series of temporary training sets X ∪ {x}, each of which corresponds to taking a raw sample x ∈ U into the existing training set X, and by explicitly evaluating the Kullback–Leibler divergence between the emotion distribution p in a temporary training set and the ideal uniform emotion label distribution u ∼ unif{1, K}, in order to find the sample with the smallest Kullback–Leibler divergence c(x), i.e. the new training set with the most balanced emotion labels, by
$$ c(x) = -\sum_{k=1}^{K} p_k(x) \log \frac{u_k}{p_k(x)}, \qquad (5) $$
Fig. 1 In (a), the emotion labels in the training set are unbalanced owing to the shortage of Sorrow and Hate labels. In (b), the sample selection procedure for the complementariness criterion finds a candidate sample with a high probability of the emotion label Sorrow. In (c), the unlabeled sample and its emotion label Sorrow are added to the training set, rendering a more balanced emotion label distribution
$$ p_k(x) = \frac{\sum_{x' \in X \cup \{x\}} e_k(x')}{|X \cup \{x\}|}, \qquad (6) $$

$$ u_k = \frac{1}{K}, \qquad (7) $$
in which ek (x) is the probability of observing emotion label k in sample x. For
samples in the existing training set x ∈ X , the probability of observing emotion
label k is either 1.0 or 0.0 given the ground truth emotion label. For samples in the
raw message data x ∈ U, the probability of observing emotion label k is given by the prediction result e_k(x) = ϕ_k(x) from the logistic regression emotion classifier.
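A short NumPy sketch of the complementariness computation in Eqs. (5)–(7) (the variable names and the dense label-probability matrices are illustrative assumptions):

```python
import numpy as np

def complementariness(E_train, e_x):
    """c(x) of Eq. (5): KL divergence from the temporary label distribution to uniform.

    E_train: (num_train, K) matrix of 0/1 ground-truth label indicators for X.
    e_x:     (K,) vector of predicted label probabilities e_k(x) for one raw sample x.
    """
    K = e_x.shape[0]
    p = (E_train.sum(axis=0) + e_x) / (len(E_train) + 1)   # Eq. (6): label distribution of X u {x}
    u = np.full(K, 1.0 / K)                                 # Eq. (7): ideal uniform distribution
    p = np.clip(p, 1e-12, None)                             # avoid log(0)
    return float(-np.sum(p * np.log(u / p)))                # smaller value = more balanced set

# Selection step: pick the raw sample whose addition gives the most balanced training set.
# best = min(candidates, key=lambda x: complementariness(E_train, predict_probs(x)))
```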
Figure 1 shows a concrete example of the sample selection procedure based on
the complementariness criterion. With an emotion label distribution in the current
training set X as shown in Fig. 1a, the proposed criterion tends to select a new
sample x with probabilistic emotion labels as Fig. 1b from the raw message data U ,
which renders the temporary training set X ∪ {x} with a more balanced emotion label
distribution, i.e. a smaller Kullback–Labler Divergence to the ideal uniform emotion
label distribution, as shown in Fig. 1c.
9. x = argmin{c(x) | ∀x ∈ U_D}
10. Obtain emotion label e for x
11. Append (x, e) to X and delete x from U
12. End for
We carry out an emotion classification experiment on Weibo messages and evaluate the emotion classification results with respect to different active learning algorithms. The initial training, validation, and test sets consist of 864, 1005, and 1592 Weibo messages, respectively. Each Weibo message has been annotated by human experts with one or more emotion labels from Joy, Love, Expectation, Surprise, Hate, Sorrow, Anger, Anxiety, and Neutral. The number of labels for each emotion category has been kept roughly the same, at around 100, 100, and 184 for the training, validation, and test sets, respectively. This allows our approach to learn an unbiased emotion classifier at the very beginning of active learning, and to evaluate the emotion classification results on an even test distribution for each emotion category. The raw data have been randomly retrieved from the Weibo stream and arranged into separate sets by the retrieval day and hour, with each set consisting of several tens of thousands of Weibo messages.
We employ the validation set and 3 raw data sets to determine the selection parameter values λI, λR, and λD, and 6 raw data sets to determine the parameter value λC for Algorithm I. Table 1 shows the candidate values for each selection parameter. Specifically, the candidate values for the first three parameters are percentages specifying the selection ratios, and the candidate values for the last parameter specify the final output size. By incrementally updating the training set for each group of parameter values, we collect and compare the micro and macro average F1 emotion classification scores on the validation set, and find the optimal parameter values of 0.2 for λI, 0.5 for λR, 0.5 for λD, and 40 for λC.
1 The argpartition(F(X), n) function selects n elements x from X which have larger scores F(·) than
the other elements in X.
The experiment is conducted as follows. For each raw data set U, we first feed it together with the current training set X to Algorithm I to get an updated training set. Then, we train a series of emotion classifiers as in Eq. (1) based on each of these training sets, and evaluate the emotion classification results with these learned classifiers on the test set. Finally, the classification results are stacked together to demonstrate the improvement of text emotion classification with an increasing number of samples in the training sets, as shown in Fig. 2.
As the training set grows with the samples selected by Algorithm I, the learned logistic regression emotion classification model is incrementally improved, with steadily increasing micro Precision, Recall, and F1 scores for text emotion classification. Specifically, as the active learning procedure is carried out for 60 loops, the micro average scores of Precision, Recall, and F1 are improved by 7.53%, 7.36%, and 7.51%, respectively. The results suggest that our approach is effective in finding appropriate samples from the raw data set to significantly improve the learning of a multi-label text emotion classification model.
Next, we examine the effectiveness of our proposed complementariness criterion by comparing the text emotion classification results with those from a control experiment in which the sample selection procedure consists of only the first three criteria, i.e. the informativeness, the representativeness, and the diverseness. The selection parameters in the control experiment are kept the same as before, except that the second argument of argpartition for U_D is replaced by λC to keep the size of the sample selection results the same as in Algorithm I.
In Fig. 2, the micro average scores of Precision, Recall, and F1 from the proposed approach are consistently higher than those from the control experiment. The averaged gaps between the two approaches are 1.55% in Precision, 0.94% in Recall, and 1.30% in F1 scores. The comparison of the two experiments indicates that the proposed complementariness criterion is able to reorder the priority for sample selection in an effective manner, so that high-quality samples can be more easily found and added into the training set. Compared with the CIRD model proposed by Kang et al. [28], the increments of the micro average scores are 1.55%, 2.49%, and 1.97% for Precision, Recall, and F1, respectively.
Finally, we look into the distribution of emotion labels generated by the proposed approach and by the control experiment, to further analyze the label-balancing property of these algorithms. Figure 3 shows the growing number of emotion labels in the training set as an increasing number of samples are selected by the proposed active learning algorithm (Fig. 3a) or the controlling algorithm (Fig. 3b) and are annotated with the emotion tags by human experts.
Fig. 2 Increments of emotion classification with respect to the incrementally increased training data (panel a: F1 increments)
The proposed active learning algorithm has generated a series of training sets with more balanced emotion labels than those generated by the controlling algorithm. Specifically, the growth of the Neutral emotion label, which turns out to be the most frequent label in the raw data sets, has been significantly restrained in the selection procedure. Meanwhile, the other emotion labels grow much faster than those selected by the controlling algorithm, and the growth is especially significant for Anxiety, Joy, Hate, and Expect. These results suggest that, given raw data sets with highly biased label distributions, the proposed complementariness criterion is effective in selecting raw samples with the label-balancing property, which essentially restrains the number of samples with frequent labels and
increases the number of samples with infrequent labels in the training set, through
active learning.
5 Conclusion
Acknowledgements This research has been partially supported by the Ministry of Education,
Science, Sports and Culture of Japan, Grant-in-Aid for Scientific Research(A), 15H01712.
References
9. Das S, Chen M (2001) Yahoo! for amazon: extracting market sentiment from stock message
boards. In: Proceedings of the Asia Pacific finance association annual conference, vol 35
10. Desmet P (2003) A multilayered model of product emotions. Des J 6(2):4–13
11. Dave K, Lawrence S, Pennock DM (2003) Mining the peanut gallery: opinion extraction and
semantic classification of product reviews. In: Proceedings of the 12th international conference
on World Wide Web. ACM, pp 519–528
12. Hassenzahl M, Diefenbach S, Göritz A (2010) Needs, affect, and interactive products–facets
of user experience. Interact Comput 22(5):353–362
13. Wang L, Ren F, Miao D (2016) Multi-label emotion recognition of weblog sentence based on
Bayesian networks. IEEJ Trans Electr Electron Eng 11(2):178–184
14. Picard RW (1995) Affective computing
15. Ekman P (1992) An argument for basic emotions. Cogn Emot 6(3–4):169–200
16. Zhao YY, Qin B, Liu T (2010) Sentiment analysis. J Softw 21(8):1834–1848
17. Taboada M, Brooke J, Tofiloski M et al (2011) Lexicon-based methods for sentiment analysis.
Comput Linguist 37(2):267–307
18. Yang C, Lin KHY, Chen HH (2007) Building emotion lexicon from weblog corpora. In:
Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration
sessions. Association for Computational Linguistics, pp 133–136
19. Matsumoto K, Ren F (2011) Estimation of word emotions based on part of speech and positional
information. Comput Hum Behav 27(5):1553–1564
20. Picard RW, Vyzas E, Healey J (2001) Toward machine emotional intelligence: analysis of
affective physiological state. IEEE Trans Pattern Anal Mach Intell 23(10):1175–1191
21. Bhowmick PK (2009) Reader perspective emotion analysis in text through ensemble based multi-label classification framework. Comput Inf Sci 2(4):64
22. Tsoumakas G, Katakis I (2007) Multi-label classification: an overview. Int J Data Warehouse
Min (IJDWM) 3(3):1–13
23. Tsoumakas G, Vlahavas I (2007) Random k-labelsets: an ensemble method for multilabel
classification. In: European conference on machine learning. Springer, Berlin, Heidelberg, pp
406–417
24. Ren F, Kang X (2013) Employing hierarchical Bayesian networks in simple and complex
emotion topic analysis. Comput Speech Lang 27(4):943–968
25. Liu H, Lieberman H, Selker T (2003) A model of textual affect sensing using real-world
knowledge. In: Proceedings of the 8th international conference on Intelligent user interfaces.
ACM, pp 125–132
26. Reyes O, Morell C, Ventura S (2018) Effective active learning strategy for multi-label learning.
Neurocomputing 273:494–508
27. Li X, Guo Y (2013) Active learning with multi-label SVM classification. IJCAI, 1479–1485
28. Kang X, Wu Y, Ren F (2018) Progressively improving supervised emotion classification
through active learning. In: International conference on multi-disciplinary trends in artificial
intelligence. Springer, Cham, pp 49–57
Arbitrary Perspective Crowd Counting
via Local to Global Algorithm
Abstract Crowd counting is receiving more and more attention, and controlling the crowd size at large collective activities, such as the Olympic Games and the World Expo, is increasingly important. In this paper, we address the problem of crowd counting in crowded scenes. Our model accurately estimates the count of people in a crowded scene. Firstly, we propose a novel and simple convolutional neural network, called the Global Counting CNN (GCCNN). The GCCNN learns a mapping that transforms the appearance of image patches into estimated density maps. Secondly, the Local to Global Counting CNN (LGCCNN) calculates the density map from local to global: stitching the local patches constrains the final density map of the larger area, which makes up for the differences caused by different values in the perspective map and, in general, makes the final density map more accurate. We evaluate on public datasets: the WorldExpo'10 dataset, the ShanghaiTech dataset, the UCF_CC_50 dataset, and the UCSD dataset. The experiments show that our method achieves state-of-the-art results compared with other algorithms.
1 Introduction
Crowd counting has important social significance and market value. Managers can reasonably schedule manpower and material resources and optimize resource allocation by using the counts of people in regions of interest (ROI). For squares, passageways, and other public places, the results of crowd statistics provide a valuable early warning for public security problems. Therefore, crowd counting has become a key point in the field of video analysis and intelligent video surveillance. This involves estimating the number of people in the crowd and the crowd distribution over the entire region. (Chuanrui Hu and Kai Cheng contributed equally to this work.)
Traditional crowd counting algorithms share a common procedure. (1) Foreground segmentation: the foreground split cannot entirely separate people from the background, because people are sometimes still in high-density scenarios, such as a queue in front of a station ticket window. (2) Crowd feature extraction: owing to the perspective distortion of dense scenes, the brightness conditions, and the low resolution of the images, handcrafted features (e.g., the Scale-Invariant Feature Transform (SIFT) [13], the Histogram of Oriented Gradients (HOG) [3], and Local Binary Patterns (LBP) [17]) cannot fully express the characteristics of the crowd.
It is difficult to detect the number of people because of occlusion, and it is not wise to calculate the number by foreground segmentation due to the randomness of foreground segmentation. Some typical static crowd scenes from the WorldExpo'10 dataset [2] are shown in Fig. 1.
There has been significant recent progress in the field of crowd counting due to the successful development of deep learning (e.g., convolutional neural networks (ConvNets)) [2, 4, 12, 14, 15]. To the best of our knowledge, Cong et al. [2]
were the first to train a CNN model to learn such a mapping for the crowd counting problem; however, to obtain the crowd count, the output features still need to be fed to a ridge regressor. The MCNN [21] outputs an estimated density map and handles large scale variation, but the final estimated density map is distorted because its size is reduced. Recent research [2, 21] has shown that learned features perform better than traditional hand-crafted features. As illustrated in Fig. 2, in order to make up for the shortcomings of the recent research [2, 21], we propose convolutional neural network architectures that learn the regression function mapping the image appearance into a crowd density map. The number of people in the crowd scene is then calculated by integrating over the crowd density map.

Fig. 2 We define the crowd counting task as a regression problem in which a CNN model maps the appearance of an image to a crowd density map. The yellow box indicates that the training image patches are densely extracted from the whole image
The main contributions of this work can be summarized in three aspects.
– In Sect. 3.1, we propose a novel convolutional neural network architecture, named the Global Counting CNN (GCCNN), a fully convolutional network [12] that achieves an accurate regression of the crowd density map of image patches. We adopt a bilinear interpolation algorithm, and Fig. 4 shows that the final output feature map has the same size as the input patch.
– Due to the scale variation in crowd images, we introduce the Local to Global Counting CNN (LGCCNN) in Sect. 3.2, which calculates the final density map from local to global. The algorithm makes up for the differences caused by different values in the perspective map and thus makes the density map of the larger area more accurate.
– Our architecture has been evaluated on benchmark datasets and is shown to achieve state-of-the-art performance.
The rest of this paper is organized as follows: previous research on crowd counting is reviewed in Sect. 2. The proposed method and the overall structure of the two CNN models are detailed in Sect. 3. Experiments and comparisons of results are summarized in Sect. 4. Finally, we conclude the paper in Sect. 5.
2 Related Works
In recent years, the crowd counting methods in the literature can be divided into two categories: counting by detection and counting by regression.
Counting by detection [5, 10, 16, 20]. Many algorithms count people by detection. First, they use appearance and motion features to separate moving objects from the background over two consecutive frames of a video clip. Then these algorithms utilize handcrafted features (such as Haar wavelet features or edgelet features [20]) to detect the moving objects. However, these methods apply to video clips rather than still images, and the handcrafted features often suffer a decline in accuracy when the scene exhibits perspective distortion, severe overlapping, and varying illumination.
Counting by regression [1, 2, 7–9, 11, 14, 19, 21]. Counting by regression aims to learn a mapping between low-level features and the people count via a regression function, without foreground segmentation or pedestrian detection. It is more suitable for complex environments and more crowded instances such as pedestrians. Cong et al. [2] first trained a deep CNN model, which achieves good performance. However, they reported results obtained by feeding a ridge regressor with the output features of their CNN model, and the input patches of their CNN model are selected randomly, which does not handle large scale changes well. Our network diminishes the perspective distortion and estimates both the crowd count and the crowd density map.
3 Methodology
In this section, we state our notation and crowd counting methodology. Here, we treat the crowd counting problem as density map estimation.
Following [2], previous research defines the ground truth of the density map regression as a sum of Gaussian kernels centered on the locations of objects. Such a density map is more suitable for representing the density distribution of circle-like objects such as cells and bacteria, whereas the shape of a pedestrian in an ordinary surveillance camera is ellipse-like. We follow the method of [2].
Before generating the ground truth density map, we should consider the large scale variation due to perspective distortion; perspective normalization is necessary to describe the pedestrian scale. After we obtain the perspective map of each scene and a set of head-annotated images, in which all the heads are marked by dots, we can generate the ground truth density map Di for an image I, defined as a sum of Gaussian functions centered on each dot annotation. The crowd density map is generated as:
$$ D_i(p) = \frac{1}{Z} \sum_{P \in \mathbf{P}_i} \left( \mathcal{N}_h(p;\, P_h, \sigma_h) + \mathcal{N}_b(p;\, P_b, \Sigma) \right) \qquad (1) $$
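A simplified NumPy sketch of this ground truth generation (here only the head term is drawn, with a single isotropic Gaussian per annotated head; the perspective-dependent σ_h, the body Gaussian, and the normalization constant Z of Eq. (1) are simplified assumptions):

```python
import numpy as np

def density_map(shape, head_points, sigma=4.0):
    """Place a normalized 2-D Gaussian at every annotated head position.

    shape:       (H, W) of the image.
    head_points: list of (row, col) dot annotations.
    sigma:       Gaussian spread; in the full model it depends on the perspective map.
    """
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W]
    D = np.zeros(shape, dtype=np.float64)
    for (r, c) in head_points:
        g = np.exp(-((ys - r) ** 2 + (xs - c) ** 2) / (2.0 * sigma ** 2))
        D += g / g.sum()          # each person contributes exactly 1 to the integral
    return D

# The estimated crowd count is the integral (sum) over the density map:
# count = density_map((576, 720), annotations).sum()
```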
The CNN model learns a mapping

$$ D_{pred}^{(P)} = F(P \mid \theta_{net}), \qquad (2) $$

where θ_net is the set of parameters of the CNN model. For an image patch P, we obtain the estimated density map D_pred^(P). Thus, for a given unseen test image, our algorithm first densely extracts image patches over the image. Then our CNN model generates an estimated density map corresponding to each image patch. Finally, all the density maps are aggregated into a density map for the whole test image.
Let us introduce our first ConvNet structure, the Global Counting CNN Model (GCCNN), as illustrated in Fig. 4. Unlike image classification, crowd density estimation needs per-pixel predictions, so we naturally adopt a fully convolutional network. This also reduces overfitting, because a fully convolutional network has far fewer parameters than a network trained on an entire image. The structure consists of 6 convolutional layers and 2 pooling layers, specially designed to extract crowd features. The Conv1 layer has 3×3 filters with a depth of 64. The Conv2 layer has 3×3 filters with a depth of 128. A max pooling layer with a 2×2 kernel is used after Conv1 and after Conv2. The Conv3 layer has 3×3 filters with a depth of 256. The Conv4 and Conv5 layers are made of 1×1 filters with depths of 1000 and 400, respectively. The Conv6 layer is another 1×1 filter with a depth of 1.
Fig. 4 Our GCCNN structure, which takes the input patches with their associated ground truth density maps and ground truth crowd counts as input, and returns an estimated density map of the same size as the input patch
The output of these convolutional layers is upsampled to the size of the input image patch using bilinear interpolation to directly obtain the estimated crowd density map. Because of its good performance in CNNs, the Parametric Rectified Linear Unit (PReLU) [6] is adopted as the activation function (not shown in Fig. 2). As Eq. (2) points out, our CNN model learns a mapping from a set of features extracted from training image patches to an estimated crowd density map, so the GCCNN is trained to solve a regression problem. The Euclidean distance is used as the loss function.
$$ L_1(\theta_{net}) = \frac{1}{2N} \sum_{i=1}^{N} \left\| F(P_i \mid \theta_{net}) - D_{gt}^{P_i} \right\|^2 \qquad (3) $$

$$ L_2(\theta_{net}) = \frac{1}{2N} \sum_{i=1}^{N} \left\| C(P_i \mid \theta_{net}) - C_{gt}^{P_i} \right\|^2 \qquad (4) $$
where θ_net denotes the learned parameters of the CNN model, N is the number of training image patches, and P_i is the image patch fed to the CNN model. F(P_i | θ_net) and C(P_i | θ_net) stand for the estimated crowd density map and the estimated crowd count of the corresponding image patch, while D_gt^{P_i} and C_gt^{P_i} respectively represent the ground truth density map and the ground truth crowd count of that patch. Different from Zhang et al., the master loss task is L1(θ_net) and L2(θ_net) is treated as the auxiliary loss; we let the two loss functions pass through all previous layers together. The auxiliary loss task helps optimize the learning process, while the master loss task takes the most responsibility, and we add a weight to balance the auxiliary loss. The two loss tasks assist each other and are trained together to reach the optimum, as sketched below.
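The following PyTorch-style sketch illustrates the GCCNN described above (a sketch under our assumptions: the original implementation uses Caffe, the exact strides and paddings are not given in the text, and the count head is reduced here to a simple sum over the predicted density map):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.PReLU(), nn.MaxPool2d(2),    # Conv1 + pool
            nn.Conv2d(64, 128, 3, padding=1), nn.PReLU(), nn.MaxPool2d(2),  # Conv2 + pool
            nn.Conv2d(128, 256, 3, padding=1), nn.PReLU(),                  # Conv3
            nn.Conv2d(256, 1000, 1), nn.PReLU(),                            # Conv4 (1x1)
            nn.Conv2d(1000, 400, 1), nn.PReLU(),                            # Conv5 (1x1)
            nn.Conv2d(400, 1, 1),                                           # Conv6 (1x1) -> density
        )

    def forward(self, x):
        d = self.features(x)
        # Bilinear upsampling back to the input patch size (e.g., 72x72).
        return F.interpolate(d, size=x.shape[-2:], mode="bilinear", align_corners=False)

def losses(pred_density, gt_density):
    l1 = 0.5 * F.mse_loss(pred_density, gt_density)                         # Eq. (3), master loss
    l2 = 0.5 * F.mse_loss(pred_density.sum(dim=(1, 2, 3)),                  # Eq. (4), auxiliary loss
                          gt_density.sum(dim=(1, 2, 3)))
    return l1 + 0.1 * l2   # auxiliary loss down-weighted (the weight value is an assumption)
```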
After obtaining the parameters θ_net of the CNN model, how do we implement the prediction stage on an unseen target test image? First, we densely extract image patches. Then all the image patches are resized to 72×72 pixels. These input image patches, with their associated ground truth density maps and ground truth crowd counts as illustrated in Fig. 5, are passed through our CNN architecture, which returns an
estimated density map corresponding to the input image patch. Lastly, all the output estimated density maps are aggregated into a density map over the whole test image. Since the extracted image patches overlap, each location of the final estimated density map must be normalized by the number of patches that contribute to it, as in the sketch below.
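A minimal sketch of this dense patch-wise prediction and overlap normalization (the stride, patch-size handling, and model interface are assumptions for illustration):

```python
import numpy as np

def predict_full_image(image, predict_patch, patch=72, stride=36):
    """Slide a patch window over the image, accumulate patch density maps, and
    normalize each pixel by how many patches covered it."""
    H, W = image.shape[:2]
    density = np.zeros((H, W), dtype=np.float64)
    coverage = np.zeros((H, W), dtype=np.float64)
    for top in range(0, H - patch + 1, stride):
        for left in range(0, W - patch + 1, stride):
            crop = image[top:top + patch, left:left + patch]
            density[top:top + patch, left:left + patch] += predict_patch(crop)
            coverage[top:top + patch, left:left + patch] += 1.0
    density /= np.maximum(coverage, 1.0)   # overlap normalization
    return density, density.sum()          # full-image density map and estimated count
```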
Our counting-by-regression model uses the annotated perspective map of each scene to handle perspective distortion and scale variation. Owing to the perspective distortion in each image, the size of a pedestrian exhibits scale variation, and the features extracted from the same pedestrian at different scene depths have notably different values.
To go a step further and obtain a more accurate estimated crowd density map, we use the Local to Global Counting CNN Model (LGCCNN). We propose an algorithm for estimating a density map from local to global, which is specialized for perspective distortion and scale variation. The specially designed ConvNet structure is shown in Fig. 6. Our CNN model consists of three parallel CNN columns with the same structure (i.e., conv-pooling-conv-pooling) and the same filter sizes. The model takes different but related inputs, namely training image patches cropped from the training images. The patch of the first column is resized to 94×94 pixels, while the next two columns take the upper and lower parts of a complete patch. Each parallel CNN is in charge of learning features of its input patch for a different perspective value. The output feature maps of the last two CNN columns are then stitched together. Compared with the losses of the GCCNN, we add one more loss function, as shown in Eq. (7).
$$ L_1(\theta_{net}) = \frac{1}{2N} \sum_{i=1}^{N} \left\| F(P_i \mid \theta_{net}) - D_{gt}^{P_i} \right\|^2 \qquad (5) $$

$$ L_2(\theta_{net}) = \frac{1}{2N} \sum_{i=1}^{N} \left\| C(P_i \mid \theta_{net}) - C_{gt}^{P_i} \right\|^2 \qquad (6) $$

$$ L_3(\theta_{net}) = \frac{1}{2N} \sum_{i=1}^{N} \left\| F(P_i^1; P_i^2 \mid \theta_{net}) - F(P_i \mid \theta_{net}) \right\|^2 \qquad (7) $$
where θ_net denotes the learned parameters of the LGCCNN model, N is the number of training image patches, and P_i is the image patch fed to the CNN model. P_i^1 and P_i^2 are the corresponding upper and lower image patches of the complete image patch P_i, and F(P_i | θ_net) is the output feature map of the first CNN column. Notice that the upper and lower estimated density maps are calculated with different values of the perspective map. We constrain the final estimated density map on the larger region, which makes up for the differences caused by the different perspective values. In the end, this makes the estimated density map more accurate and provides a method for computing the crowd density map from local to global.
4 Experiments
We first evaluate our CNN model on the challenging WorldExpo'10 dataset [2]. The details of the WorldExpo'10 dataset are shown in Table 1. This dataset contains 1132 annotated video clips captured by 108 surveillance cameras; 1,127 one-minute-long video sequences are treated as the training set, and the testing set consists of 5 one-hour-long video sequences from different scenes, each containing 120 labeled frames.
Table 1 The attributes of the public datasets: NUM is the number of frames; Total is the number of labeled people; MAX is the maximum number of people in the ROI of a frame; MIN is the minimum number of people in the ROI of a frame; AVG is the average crowd count

Dataset               NUM           Total    MAX    MIN    AVG
UCSD                  2000          49885    46     11     25
UCF_CC_50             50            63974    1279   4543   1279
WorldExpo'10          4.44 million  199623   253    1      50
ShanghaiTech Part A   482           241677   3139   33     501
ShanghaiTech Part B   716           88488    578    9      123
We train our deep convolutional neural network on the basis of the Caffe library, with some modifications applied. An NVIDIA GTX TITAN X GPU is used. We use the standard Stochastic Gradient Descent (SGD) algorithm to optimize the ConvNet parameters, with a learning rate of 1e−7 and a momentum of 0.9.
To make the experimental results more intuitive, we use two evaluation criteria, the mean absolute error (MAE) and the mean squared error (MSE), defined as follows:
$$ MAE = \frac{1}{N} \sum_{i=1}^{N} \left| C_i - E_i \right| \qquad (8) $$

$$ MSE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( C_i - E_i \right)^2 } \qquad (9) $$
where N denotes the number of test images, C_i is the true number of pedestrians in the i-th test image, and E_i is the estimated number of pedestrians in the i-th test image. MAE reflects the actual accuracy of the estimates, while MSE reflects the robustness of the estimates. The lower the MAE and MSE, the more accurate and robust the counting result; a minimal sketch of both metrics follows.
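The sketch below implements Eqs. (8) and (9), taking MSE as the root of the mean squared error as written above:

```python
import numpy as np

def mae(counts_true, counts_pred):
    """Eq. (8): mean absolute error of the crowd counts."""
    return float(np.mean(np.abs(np.asarray(counts_true) - np.asarray(counts_pred))))

def mse(counts_true, counts_pred):
    """Eq. (9): root of the mean squared error of the crowd counts."""
    diff = np.asarray(counts_true) - np.asarray(counts_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

# Example usage with hypothetical counts:
# print(mae([46, 25, 11], [44, 27, 15]), mse([46, 25, 11], [44, 27, 15]))
```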
The dataset consists of 108 scenes. To train our GCCNN model, we selected 2600 images from 103 scenes in the dataset. We collected 200 patches of 72×72 pixels extracted all over each image, with their associated ground truth density maps and ground truth crowd counts: 100 positive patches centered on people and 100 negative patches centered on the ground. Then we performed data augmentation by flipping each patch randomly.
To train our LGCCNN model, we performed the same procedure as above to extract image patches of 94×94 pixels. For a given complete patch of 94×94 pixels, we obtain the upper and lower patches of 58×94 pixels by cropping the complete patch.
We selected 2600 images from the 103 scenes in the dataset as training images. First, we collected 200 patches of 94×94 pixels extracted all over each image, with their associated ground truth density maps and ground truth crowd counts: 100 positive patches centered on people and 100 negative patches centered on the ground. Then we performed data augmentation by flipping each patch randomly. After obtaining these patches, and in order to fit our CNN model, we collect the small local patches by cropping the global patches.
4.3 Results
Table 2 Quantitative results compared with other state-of-the-art methods on the WorldExpo'10 dataset

Method           Scene 1   Scene 2   Scene 3   Scene 4   Scene 5   Avg
LBP+RR [18]      13.6      58.9      37.1      21.8      23.4      31.0
Cong et al. [2]  9.8       14.1      14.3      22.2      3.7       12.9
MCNN [21]        3.4       20.6      12.9      13.0      8.1       11.6
GCCNN            7.5       22.6      15.7      16.0      6.2       13.6
LGCCNN           2.6       19.3      17.4      14.8      4.7       11.0
5 Conclusions
In this paper, we proposed two convolution neural network architectures. For our first
architecture, the GCCNN model can learn a mapping which transforms the appear-
ance of crowd image to the crowd density map effectively. Our second architecture,
the LGCCNN model which goes a step further, provide a method for crowd density
map was from local to global. The final estimated density map on the lager region
which makes up for the difference caused by the upper and lower image patch of
different perspective values. In the end, it makes the estimated density map more
accurate. The density map is generated in the output layer of network and the num-
ber of people is obtained by integral regression. We test our proposed method in the
Shanghaitech dataset, the WorldExpo’10 dataset, the UCF_CC_50 dataset and the
UCSD dataset. Moreover, the experimental results show the accuracy the robustness
of our method outperforms the state-of-the-art crowd counting method.
Fig. 7 Sample predictions of our LGCCNN model on the WorldExpo'10 dataset. The first column is the target test image, the second column is the ground truth density map corresponding to the target test image, and the third column shows the estimated density map
Acknowledgements This work is supported by the National Natural Science Foundation (NSF)
of China (No. 61572029), and the Anhui Provincial Natural Science Foundation of China (No.
1908085J25), and Open fund for Discipline Construction, Institute of Physical Science and Infor-
mation Technology, Anhui University.
References
1. Chan AB, Vasconcelos N (2012) Counting people with low-level features and bayesian regres-
sion. IEEE Trans Image Process 21(4):2160–2177
2. Cong Z, Li H, Wang X, Yang X (2015) Cross-scene crowd counting via deep convolutional
neural networks. In: IEEE conference on computer vision & pattern recognition
3. Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: Schmid C,
Soatto S, Tomasi C (eds) International conference on computer vision & pattern recognition
(CVPR ’05), vol 1. IEEE Computer Society, San Diego, United States, pp 886–893. https://
doi.org/10.1109/CVPR.2005.177, https://hal.inria.fr/inria-00548512
4. Ge S, Zhao S, Li C, Li J (2018) Low-resolution face recognition in the wild via selective
knowledge distillation. CoRR. http://arxiv.org/abs/1811.09998
5. Ge W, Collins RT (2009) Marked point processes for crowd counting. In: IEEE conference on
computer vision & pattern recognition
6. He K, Zhang X, Ren S, Jian S (2015) Delving deep into rectifiers: surpassing human-level
performance on imagenet classification
7. Hu Y, Chang H, Nian F, Yan W, Teng L (2016) Dense crowd counting from still images with
convolutional neural networks. J Visual Comm Image Represent 38(C):530–539
8. Ke C, Chen CL, Gong S, Tao X (2012) Feature mining for localised crowd counting. In: British
machine vision conference
9. Lempitsky VS, Zisserman A (2010) Learning to count objects in images. In: International
conference on neural information processing systems
10. Lin SF, Chen JY, Chao HX (2001) Estimation of number of people in crowded scenes using
perspective transformation. IEEE Trans Syst Man Cybernetics Part A Syst Humans 31(6):645–
654
11. Liu T, Tao D (2014) On the robustness and generalization of cauchy regression. In: IEEE
international conference on information science & technology
12. Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation.
In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3431–
3440
13. Lowe DG (1999) Object recognition from local scale-invariant features. In: Proceedings of
ICCV
14. Loy CC, Gong S, Xiang T (2014) From semi-supervised to transfer counting of crowds. In:
IEEE international conference on computer vision
15. Lu H, Wang D, Li Y, Li J, Li X, Kim H, Serikawa S, Humar I (2019) Conet: a cognitive ocean
network. CoRR. http://arxiv.org/abs/1901.06253
16. Min L, Zhang Z, Huang K, Tan T (2009) Estimating the number of people in crowded scenes by
mid based foreground segmentation and head-shoulder detection. In: International conference
on pattern recognition
17. Ojala T, Pietikinen M, Menp T (2002) Gray scale and rotation invariant texture classification
with local binary patterns. IEEE Trans Pattern Anal Machine Intell 24(7):971–987
18. Rodriguez M, Laptev I, Sivic J, Audibert JY (2011) Density-aware person detection and tracking
in crowds. In: International conference on computer vision
19. Sam DB, Surya S, Babu RV (2017) Switching convolutional neural network for crowd counting
20. Wu B, Nevatia R (2005) Detection of multiple, partially occluded humans in a single image
by bayesian combination of edgelet part detectors. In: Tenth IEEE international conference on
computer vision
21. Zhang Y, Zhou D, Chen S, Gao S, Yi M (2016) Single-image crowd counting via multi-column
convolutional neural network. In: Computer vision & pattern recognition
A Semi-supervised Learning Method
for Automatic Nuclei Segmentation
Using Generative Adversarial Networks
Chuanrui Hu, Kai Cheng, Jianhuo Shen, Jianfei Liu, and Teng Li
Chuanrui Hu and Kai Cheng: These authors contributed equally to this work.
1 Introduction
Potentially caused by image noise and cell motility during image acquisition,
nucleus shapes, sizes, and textures often exhibit significant variance. Nucleus touching
is also present in image regions with densely packed cells. Many methods based
on the characteristics of microscopy images have been proposed to address these issues,
such as gray level thresholding, watershed [19] and active contours [2, 11]. Although
the mathematical models in these methods are deliberately designed to fit image
characteristics, they are prone to being violated due to the complexity of microscopy
images. Deep learning based methods, on the other hand, can build models directly
from microscopy images; for example, Ronneberger et al. proposed the U-Net [15]
structure and Ba et al. [8] proposed deep multiple instance learning to segment
microscopy images. Accurate nuclei segmentation can be achieved by utilizing
a large number of labelled images. Unfortunately, labelling microscopy images is an
extremely tedious process, which is sometimes infeasible in real clinical practice.
This motivates us to develop a semi-supervised learning strategy to reduce the
dependence on the number of labelled images. The success of generative adversarial
networks (GANs) suggests that such nuclei similarity can be exploited by producing
confidence maps for unlabeled images to enlarge the training data. Based on this idea,
we developed a novel semi-supervised learning-based nuclei segmentation method,
whose contributions can be summarized as follows.
– A light neural network structure, called Light-Unet, is developed to identify nuclei
in microscopy images. Its multi-layer up-sampling network enhances the edge
information that is crucial in our nuclei image segmentation. Experiments
demonstrated that Light-Unet runs faster than existing networks such as U-Net.
– An adversarial framework is designed to improve nuclei segmentation accuracy
without additional computation in the inference process. We adopt adversarial
learning through an adversarial loss on the discriminator network. With this
adversarial loss, we maximize the probability of the predicted segmentation map
being considered as the ground truth distribution.
– For unlabeled images, we utilize a semi-supervised learning scheme that leverages
the discriminator network to assist the training of the segmentation network.
The rest of this paper consists of the following sections: We review the deep learning
methods for cell image segmentation and semi-supervised learning for cell image
segmentation in Sect. 2; The overall framework and the proposed method are detailed
in Sect. 3; Experiments and the comparisons of results are summarized in Sect. 4;
Finally, we conclude this paper in Sect. 5.
2 Related Works
In this section, we first review the related work of deep learning methods for biomed-
ical image segmentation in recent years, and then briefly summarize the related
semi-supervised methods for cell image segmentation.
Cell segmentation is an important task in biomedical image analysis and has attracted
a lot of research effort in recent years. Recent state-of-the-art methods for semantic
segmentation are mostly based on CNNs. Long et al. [9] proposed the fully convolutional
network (FCN), which first applied pixel-level classification with CNNs to the
segmentation task. However, it is not sensitive to details in the image: detailed
spatial information is obliterated by the repeated combination of max-pooling and
downsampling performed at every layer of standard convolutional neural networks
(CNNs). Noh et al. [12] proposed a multi-layer up-sampling network structure which
effectively enhances edge information. Motivated by their successes in regular
image semantic segmentation, deep learning based methods have become popular for
cell segmentation, such as [8, 14, 15], and they have shown more promising results
than traditional segmentation algorithms. Raza et al. [14] proposed MIMO-Net
to solve the segmentation problem. Ge et al. [5] use knowledge distillation
to compress deep neural networks. Ronneberger et al. proposed U-Net
for biomedical image segmentation; U-Net is a structure with skip connections
from the encoder features to the corresponding decoder activations, which allows the network
to localize and capture object details. There are also many other cell segmentation
networks based on U-Net.
For the task of cell image segmentation, pixel-level annotation is usually expensive
and requires professional knowledge. To reduce the heavy effort of labeling segmentation
ground truth, semi-supervised and weakly-supervised methods have been explored for
cell image segmentation in recent years, such as [1, 17, 20, 21].
In the weakly-supervised setting, the segmentation network can be trained with
bounding-box level labeling (Dai et al. [4]; Khoreva et al. [7]), without the need for
pixel-level labeled data. However, these weakly supervised approaches still require
a large amount of labeled data, and detailed boundary information is difficult
to capture with bounding box labels. Semi-supervised methods can leverage
unlabeled images for model learning. Su et al. [18] propose an interactive cell
segmentation method, guided by human interventions, that classifies feature-homogeneous
super-pixels into specific classes. Xu et al. [21] propose a CNN with
semi-supervised regularization to address neuron segmentation in 3D volumes
by introducing a regularization term into the loss function of the CNN, which
improves the performance.
Recently, the generative adversarial networks (GANs) originally proposed by
Goodfellow et al. [6] have been shown to be quite useful, and the idea has also been
applied to semi-supervised image segmentation. Luc et al. [10] proposed semantic
segmentation using an adversarial network, where the output of their discriminator is a vector
rather than a map. Souly et al. [16] proposed a method for semi-supervised semantic
segmentation using GANs, but the generated examples may not be sufficiently
close to real images to help the segmentation network.
In this paper, we propose a novel deep learning based semi-supervised method for
cell image segmentation. Our proposed semi-supervised algorithm based on GANs
considers the output of the discriminator as the supervisory signal and learns a confidence
map through the discriminator network as the instructor for semi-supervised
learning.
3 Methodology
Figure 2 shows the overview of our proposed method. Our method has two networks,
a discriminator network and a segmentation network. We denote the
segmentation network as S(·) and the discriminator network as D(·). Given an input
image of size H × W × 3, the segmentation network outputs a probability map
of size H × W × C, where C is equal to 2 in our work. The discriminator network
outputs a spatial probability map of size H × W × 1.
The first network is the discriminator network. A typical discriminator network
distinguishes ground truth label maps from the probability maps produced by the
segmentation network. Our discriminator network is inspired by the fully convolutional
network (FCN) [9]. Its input is either the segmentation map S(X_n) output by the
segmentation network or the ground truth label (which needs to be one-hot
encoded before training), and it outputs a spatial probability map. Each pixel value in
the discriminator's output map represents the probability that the pixel comes from the
segmentation map or from the ground truth label. This is in contrast to typical GANs,
which take a fixed-size image as input and output a single probability value that reflects the image
Fig. 2 A detailed description of our proposed algorithm. We optimize the fully convolutional
discriminator with loss L_D. We optimize the segmentation network by using three loss functions:
cross-entropy loss L_ce, adversarial loss L_adv and semi-supervised loss L_semi
and adversarial loss (L_adv) with the discriminator network. For the unlabeled data, we
train the segmentation network with the adversarial loss from the discriminator network and
the semi-supervised loss. We obtain the confidence map by passing the initial segmentation map
S(X_n) through the discriminator network. The confidence map acts like an instructor
that teaches the segmentation network to select confident regions in the probability
map S(X_n). We then compute a semi-supervised loss on these confident regions. Note
that the segmentation network is always supervised by the adversarial loss, because the
adversarial loss depends only on the discriminator network.
Fig. 3 The structure of our proposed Light-Unet. Each box corresponds to a multi-channel feature
map. The number of channels is denoted at the bottom of each box. Our Light-Unet fuses multi-layer
information to refine the spatial precision of the output
For the discriminator network, we follow the structure used in Radford et al. [13]. It
contains 5 convolutional layers with kernel size 4 × 4, stride 2 and channel numbers
{64, 128, 256, 512, 1}. Leaky-ReLU is the activation function applied after every
convolutional layer.
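The following is a minimal PyTorch sketch of such a fully convolutional discriminator, based only on the description above (five 4×4, stride-2 convolutions with channels {64, 128, 256, 512, 1} and Leaky-ReLU). The padding, the Leaky-ReLU slope, the final sigmoid, and the upsampling of the output back to H × W are assumptions that the text does not specify.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FCDiscriminator(nn.Module):
    """Fully convolutional discriminator: class-probability map in, confidence map out."""
    def __init__(self, in_channels=2, ndf=64):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv2d(in_channels, ndf, 4, stride=2, padding=1),
            nn.Conv2d(ndf, ndf * 2, 4, stride=2, padding=1),
            nn.Conv2d(ndf * 2, ndf * 4, 4, stride=2, padding=1),
            nn.Conv2d(ndf * 4, ndf * 8, 4, stride=2, padding=1),
        ])
        self.classifier = nn.Conv2d(ndf * 8, 1, 4, stride=2, padding=1)

    def forward(self, x):
        h, w = x.shape[2:]
        for conv in self.convs:
            x = F.leaky_relu(conv(x), negative_slope=0.2)
        x = self.classifier(x)
        # upsample back to the input resolution so every pixel gets a confidence value
        x = F.interpolate(x, size=(h, w), mode="bilinear", align_corners=False)
        return torch.sigmoid(x)

# example: a batch of two-channel probability maps S(X_n) of size 256 x 256
maps = torch.rand(4, 2, 256, 256)
print(FCDiscriminator()(maps).shape)   # torch.Size([4, 1, 256, 256])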
In order to train this network, following Goodfellow et al. [6], an adversarial loss is
employed. The original min-max objective of [6], referred to below as equation (1), is

min_G max_D V(D, G) = E_{x∼P_data}[log D(x)] + E_{z}[log(1 − D(G(z)))]    (1)
where x represents an original image from an unknown distribution P_data, z is the
input noise of the generator network G(·), and G(z) is the output map of the generator
network. D(·) and G(·) play a min-max game with this loss function. In
other words, in our setting we can transform equation (1) into the following form:
L_D(θ_dis) = − Σ_{h,w} [ (1 − y_n) log(1 − D(S(X_n)|θ_dis)^{(h,w)}) + y_n log(D(Y_n|θ_dis)^{(h,w)}) ]    (2)
where θ_dis denotes the discriminator network parameters and X_n is the n-th input image. When
y_n = 0, the input of the discriminator network is the predicted segmentation
map from the segmentation network. When y_n = 1, the input is sampled
from the ground truth label. Notice that the ground truth label image has only one channel,
so we need to one-hot encode it (convert the ground truth to the same number of channels as
the probability map S(X_n)) before training.
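As a concrete illustration, the per-pixel binary cross-entropy of equation (2) can be written in PyTorch roughly as follows, assuming the discriminator output is already a per-pixel probability map of shape N × 1 × H × W:

import torch
import torch.nn.functional as F

def discriminator_loss(d_out, is_ground_truth):
    """Eq. (2): per-pixel BCE with label 1 for ground-truth maps and 0 for predicted maps."""
    target = torch.full_like(d_out, 1.0 if is_ground_truth else 0.0)
    return F.binary_cross_entropy(d_out, target)

# usage: D(S(X_n)) is pushed towards 0 and D(one-hot Y_n) towards 1
# loss_d = discriminator_loss(D(S(x)), False) + discriminator_loss(D(one_hot(y)), True)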
L_seg(θ_seg) = L_ce(θ_seg) + λ_adv L_adv(θ_seg) + λ_semi L_semi(θ_seg)    (3)
where θ_seg denotes the segmentation network parameters, and L_ce, L_adv, L_semi represent the
cross-entropy loss, the adversarial loss, and the semi-supervised loss, respectively. There
are two hyper-parameters, λ_adv and λ_semi, which are used to balance the multi-task
training.
Training with labeled images. When we train with labeled images, we can use the
standard cross-entropy loss to optimize the network. The loss is defined as:
L_ce(θ_seg) = − Σ_{h,w} Σ_{c∈C} Y_n^{(h,w,c)} log(S(X_n|θ_seg)^{(h,w,c)})    (4)
where S(X_n) is the prediction result of the segmentation network for the n-th input
image.
With the standard cross-entropy loss L_ce we can make the predicted probability map
distribution closer to the ground truth label distribution.
We adopt adversarial learning by using an adversarial loss, defined as:

L_adv(θ_seg) = − Σ_{h,w} log(D(S(X_n|θ_seg))^{(h,w)})    (5)
With the adversarial loss L_adv, we can maximize the probability that the predicted
segmentation map S(X_n) is considered to come from the ground truth distribution.
Training with unlabeled images. When training with unlabeled images, L_ce obviously
cannot be applied. L_adv is still available, since it only relies on the
discriminator network. The discriminator network generates a spatial probability
map, the confidence map, which reflects the probability that the prediction is close
to the distribution of the ground truth map. We then binarize the confidence map by setting
a threshold T_semi. Finally, we take as the pseudo ground truth label the masked segmentation
prediction Ŷ_n = argmax(S(X_n)) combined with this binarized confidence map. The semi-
supervised loss is defined as:
L_semi(θ_seg) = − Σ_{h,w} Σ_{c∈C} I(D(S(X_n))^{(h,w)} > T_semi) · Ŷ_n^{(h,w,c)} log(S(X_n|θ_seg)^{(h,w,c)})    (6)
where I(·) is the indicator function and T_semi is the threshold that controls the confident
regions. During training, Ŷ_n^{(h,w,c)} has already been obtained, so this term can simply be
treated as a standard cross-entropy loss.
By utilizing the semi-supervised loss L_semi, we can enhance the segmentation
network's performance by using more unlabeled images.
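A compact PyTorch-style sketch of the masked semi-supervised loss in equation (6); the tensor shapes and the way the mask is applied are assumptions based on the description above:

import torch
import torch.nn.functional as F

def semi_supervised_loss(seg_logits, confidence_map, t_semi=0.2):
    """Eq. (6): self-training on the pixels the discriminator marks as confident.

    seg_logits:      N x C x H x W raw outputs of the segmentation network
    confidence_map:  N x 1 x H x W discriminator output for S(X_n)
    """
    with torch.no_grad():
        pseudo_label = seg_logits.argmax(dim=1)        # Y_hat_n = argmax(S(X_n))
        mask = confidence_map.squeeze(1) > t_semi      # I(D(S(X_n)) > T_semi)
    loss = F.cross_entropy(seg_logits, pseudo_label, reduction="none")
    return (loss * mask.float()).sum() / mask.float().sum().clamp(min=1.0)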
Our goal is to learn the parameters θ_dis of the discriminator network and θ_seg of
the segmentation network. The discriminator parameters θ_dis are only updated during
training on labeled images and are updated
by Adaptive Moment Estimation (Adam). θ_seg are the parameters of the segmentation
network; they are updated by Stochastic Gradient Descent (SGD).
4 Experiments
The proposed algorithm is implemented on the basis of the PyTorch library, with some
modifications applied. An NVIDIA GTX TITAN X GPU is used. The standard
Stochastic Gradient Descent (SGD) algorithm is applied to optimize the segmentation
network parameters, with a momentum of 0.9, batch size of 10 and weight decay
of 0.0004. The initial learning rate is set to 0.001. For training the discriminator
network, we adopt the Adam optimizer with a learning rate of 0.0001 and momentum set
to 0.999.
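A minimal sketch of the optimizer setup just described, assuming `segmentation_net` and `discriminator_net` are the two PyTorch modules; interpreting the 0.999 value as Adam's second-moment coefficient is our assumption.

import torch

segmentation_net = torch.nn.Conv2d(3, 2, 1)    # placeholder for the real Light-Unet
discriminator_net = torch.nn.Conv2d(2, 1, 1)   # placeholder for the real discriminator

seg_optimizer = torch.optim.SGD(
    segmentation_net.parameters(),
    lr=0.001, momentum=0.9, weight_decay=0.0004)

dis_optimizer = torch.optim.Adam(
    discriminator_net.parameters(),
    lr=0.0001, betas=(0.9, 0.999))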
4.2.1 Dataset
Our dataset is collected from the 2018 Data Science Bowl competition. This dataset
contains a large number of segmented nuclei images. The images were acquired
under a variety of conditions and vary in cell type, magnification, and imaging
modality (brightfield vs. fluorescence). There are three types of images; some of them are
shown in Fig. 4. The images are of different sizes, so preprocessing is required
to resize them to a uniform size of (256, 256). There are 670 images in the training set
and 65 images in the test set.
Some nuclei images may be annotated with errors; examples are
shown in Fig. 5. As can be seen from Fig. 5, some nuclei images exhibit the following
problems. The first and second columns show images with an obvious nucleus
that the manual segmentation misses; in the test phase, we find that
our predicted segmentation map can easily find such nuclei even though the human expert
missed them. The third column shows individual nuclei in the microscopy
image that the manual annotation merges into clusters.
In the end, we selected 40 images from the original test dataset as our test images.
It has been shown that a small dataset or a complicated network structure may lead to
serious over-fitting. Thus, in this paper, we adopt two data augmentation categories:
one is horizontal rotation and the other is random cropping. We first rescale images to the
size of 331 × 331 and then randomly select crops of size 256 × 256. Adding the
augmented data makes our algorithm more robust.
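Such an augmentation pipeline can be sketched with torchvision transforms as follows; interpreting the horizontal operation as a random horizontal flip is our assumption:

from torchvision import transforms

augment = transforms.Compose([
    transforms.Resize((331, 331)),        # rescale to 331 x 331
    transforms.RandomCrop((256, 256)),    # random 256 x 256 crop
    transforms.RandomHorizontalFlip(),    # horizontal augmentation
    transforms.ToTensor(),
])
# usage: tensor = augment(pil_image); in a real pipeline the image and its mask
# must be transformed with the same random parameters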
F1 = 2 · P · R / (P + R)    (7)

where P is the precision and R is the recall.
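For reference, a tiny sketch of computing precision, recall and F1 from true/false positive counts:

def f1_score(tp, fp, fn):
    """F1 = 2PR/(P+R) from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(f1_score(tp=80, fp=10, fn=20))   # ~0.842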
First, we report the experimental results of our proposed methods and comparisons
with other deep learning methods, shown in Table 2. Chen et al. [3] applied the
DeepLab-v2 network structure; we use it as a segmentation network to identify cell
nuclei in microscopy images. Ronneberger et al. [15] proposed U-Net to identify
cells successfully. Table 2 shows that our model gets a better result than DeepLab-
v2, with only a small precision loss compared with U-Net, while our model has much fewer
parameters than U-Net. Our Light-Unet results are obtained under the setting λ_adv = 0.002,
λ_semi = 0.1, T_semi = 0.2. The detailed comparison with U-Net is shown in Table 3.
In order to validate the semi-supervised scheme, we randomly select 1/32, 1/16,
and 1/8 of the images as labeled and use the rest of the training data as unlabeled. By using
the adversarial loss L_adv, the model achieves a gain of about 1.4–1.9% over the baseline model
in F1 score. This means our adversarial loss scheme helps the segmentation
network learn structural information from the ground truth label distribution. When
both the adversarial loss L_adv and the semi-supervised loss L_semi are used, the model
improves the F1 score by about 2.8–3.2% over our baseline model.
We also find that when we apply the DeepLab-v2 structure and select 1/8
of the dataset, we obtain the best SEG score of 78.1%. Does this mean that this algorithm is more
precise than the other methods? The segmentation result needs to be judged by combining both the SEG
and F1 scores to reach an accurate conclusion: it achieves the highest SEG score, but
its F1 score is only 65.6%, which means many nuclei in the image cannot be completely identified.
The segmentation task adopts a multi-task learning strategy, and we tune the two hyper-
parameters λ_adv and λ_semi. We use T_semi to control the sensitivity of the semi-
supervised learning. We show the results with different parameter settings in Table 4.
We first evaluate the effect of λ_adv. Note that without the λ_adv loss, the model achieves
a 75.5% F1 score and a 78.1% SEG score. When λ_adv is set to 0.002, the model
achieves a 1.2% improvement in F1 score and a 0.2% improvement in SEG score. When
λ_adv = 0.005, the model performance drops by about 1.5% in F1 score, which means
the adversarial loss weight is too large.
Second, we show comparisons under different values of λ_semi, with λ_adv equal to
0.002 and 1/16 of the data selected as labeled. We set λ_semi in {0.05, 0.1, 0.2}. With only
the L_adv loss, the baseline model gets a 70.2% F1 score and a 76.1% SEG score. In general,
when λ_semi = 0.1, the model achieves the best performance, with a 1.8% F1 score
improvement over training without the L_semi loss.
Lastly, we perform experiments with different values of T_semi, where we set λ_adv =
0.002, λ_semi = 0.1 and T_semi in {0.1, 0.2, 0.3, 0.4, 0.5}. For unlabeled images, the
higher T_semi is, the more we select only pseudo label maps whose structure is
close to the ground truth label distribution. When we set T_semi = 0, we trust
all the pixels in the predicted segmentation map of the unlabeled image, and the F1 score
drops to 69.5%. At the other extreme, T_semi is set to 1, which means we do not
trust any region of the predicted segmentation map. The algorithm
performs best when we set T_semi = 0.2, achieving a 72.0% F1 score and a 75.9% SEG
score. The nuclei segmentation results are shown in Fig. 6. We train Light-Unet
without the adversarial loss L_adv and the semi-supervised loss L_semi as our baseline model.
Fig. 6 Our nuclei segmentation results using 1/16 of the training data. In the first row the nuclei are
well separated; we call such an original image an 'easy' image. The third row has many clustered nuclei;
we call such an image a 'hard' image. For 'hard' images, we find that even the manual annotation
cannot split each individual nucleus well; sometimes, the predicted segmentation can
split clustered nuclei more effectively than the manual segmentation. Going a step further, we apply
watershed to the segmentation result images
5 Conclusions
support their supervised-learning scheme, despite the fact that manual labeling by
experts is both costly and time-consuming. In contrast, our semi-supervised
method only needs a small set of representative labelled images, which makes it more
acceptable for experts to establish training data. Moreover, our method is not limited
to a particular nucleus type, and it can be used as a general segmentation framework
on different types of microscopy images.
The primary cause of segmentation errors is the over-segmentation of
touching nuclei, which often happens in densely clustered nuclei regions. We are
currently trying to either embed cell edge confidence maps into the training process
[15] or apply watershed in post-processing to separate nuclei clusters. In the future, we
also need to determine the minimum number of labelled images required to achieve
clinically acceptable segmentation accuracy. Nevertheless, our semi-supervised learning
method can achieve high segmentation accuracy using a small set of labelled
images, illustrating the potential of our method as a clinical tool for high-throughput
microscopy image analysis.
Acknowledgements This work is supported by the National Natural Science Foundation (NSF)
of China (No. 61702001), and the Anhui Provincial Natural Science Foundation of China (No.
1908085J25), and Open fund for Discipline Construction, Institute of Physical Science and Infor-
mation Technology, Anhui University.
References
1. Arbelle A, Raviv TR (2018) Microscopy cell segmentation via adversarial neural networks.
In: 2018 IEEE 15th international symposium on biomedical imaging (ISBI 2018). IEEE, pp
645–648
2. Bamford P, Lovell B (1998) Unsupervised cell nucleus segmentation with active contours.
Signal Process 71(2):203–213
3. Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2018) Deeplab: semantic image
segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE
Trans Pattern Anal Mach Intell 40(4):834–848
4. Dai J, He K, Sun J (2015) Boxsup: exploiting bounding boxes to supervise convolutional
networks for semantic segmentation. In: Proceedings of the IEEE international conference on
computer vision, pp 1635–1643
5. Ge S, Zhao S, Li C, Li J (2018) Low-resolution face recognition in the wild via selective
knowledge distillation. CoRR abs/1811.09998. http://arxiv.org/abs/1811.09998
6. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio
Y (2014) Generative adversarial nets. In: Advances in neural information processing systems,
pp 2672–2680
7. Khoreva A, Benenson R, Hosang J, Hein M, Schiele B (2017) Simple does it: weakly supervised
instance and semantic segmentation. In: Proceedings of the IEEE conference on computer
vision and pattern recognition, pp 876–885
8. Kraus OZ, Ba JL, Frey BJ (2016) Classifying and segmenting microscopy images with deep
multiple instance learning. Bioinformatics 32(12):i52–i59
9. Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation.
In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3431–
3440
10. Luc P, Couprie C, Chintala S, Verbeek J (2016) Semantic segmentation using adversarial
networks. arXiv:1611.08408
11. Meijering E, Dzyubachyk O, Smal I (2012) Methods for cell and particle tracking. In: Methods
in enzymology, vol 504. Elsevier, pp 183–200
12. Noh H, Hong S, Han B (2015) Learning deconvolution network for semantic segmentation. In:
Proceedings of the IEEE international conference on computer vision, pp 1520–1528
13. Radford A, Metz L, Chintala S (2015) Unsupervised representation learning with deep convo-
lutional generative adversarial networks. arXiv:1511.06434
14. Raza SEA, Cheung L, Epstein D, Pelengaris S, Khan M, Rajpoot NM (2017) Mimo-net: a
multi-input multi-output convolutional neural network for cell segmentation in fluorescence
microscopy images. In: 2017 IEEE 14th international symposium on biomedical imaging (ISBI
2017). IEEE, pp 337–340
15. Ronneberger O, Fischer P, Brox T (2015) U-net: convolutional networks for biomedical image
segmentation. In: International conference on medical image computing and computer-assisted
intervention. Springer, pp 234–241
16. Souly N, Spampinato C, Shah M (2017) Semi and weakly supervised semantic segmentation
using generative adversarial network. arXiv:1703.09695
17. Su H, Yin Z, Huh S, Kanade T (2013) Cell segmentation in phase contrast microscopy images
via semi-supervised classification over optics-related features. Med Image Anal 17(7):746–765
18. Su H, Yin Z, Huh S, Kanade T, Zhu J (2016) Interactive cell segmentation based on active and
semi-supervised learning. IEEE Trans Med Imaging 35(3):762–777
19. Vincent L, Soille P (1991) Watersheds in digital spaces: an efficient algorithm based on immer-
sion simulations. IEEE Trans Pattern Anal Mach Intell 6:583–598
20. Wei Y, Liang X, Chen Y, Shen X, Cheng MM, Feng J, Zhao Y, Yan S (2017) STC: a simple to
complex framework for weakly-supervised semantic segmentation. IEEE Trans Pattern Anal
Mach Intell 39(11):2314–2320
21. Xu K, Su H, Zhu J, Guan JS, Zhang B (2016) Neuron segmentation based on cnn with semi-
supervised regularization. In: Proceedings of the IEEE conference on computer vision and
pattern recognition workshops, pp 20–28
Research on Image Encryption Based
on Wavelet Transform Integrating
with 2D Logistic
Abstract In order to protect image data from theft, it is necessary to encrypt it.
When considering a security algorithm for image information, its particularity must
be taken into account. Two-dimensional Logistic chaotic mapping is introduced
into the design of pixel gray value substitution, and two substitution ideas are adopted:
the low-frequency coefficient matrix obtained after wavelet decomposition is
scrambled first, and then the pixel values of the wavelet-transformed image are substituted.
The encryption system constructed from the chaotic dynamic model can still recover the
information well after the multimedia data has undergone some signal processing.
The combination of the lifting wavelet transform in the frequency domain and the
chaotic sequence in the spatial domain is used for image encryption, which yields a good
encryption effect and has practical prospects.
1 Introduction
Image information is one of the most important means for humans to express
information. The exchange and transmission of image data through public networks
is simple, fast and not subject to geographical restrictions, which can save a lot of
cost for data owners [1]. With the increasing frequency of information exchange
and the rapid growth of storage capacity, the analysis of huge volumes of image data
has become one of the main means for people to obtain information
[2]. Not all images can be disclosed, and some of them need to be kept secret during
transmission and reception, for example military facility drawings, weapons drawings
and satellite pictures [3]. In the process of image transmission, images are often attacked
artificially, including through information theft, data tampering and virus attacks [4]. Image
information security has a very broad meaning; when considering a security
algorithm, its particularity must be considered. Data owners need reliable image data
encryption technology to protect their own interests [5]. Traditional secrecy focuses
on restricting access to data, and data expansion is not considered when designing
algorithms [6]. Chaotic dynamic systems exhibit pseudo-randomness, determinism and
extreme sensitivity to initial conditions and system parameters; therefore, they can be used to
construct a very good information encryption system [7]. How to ensure the information
security of images has become a hot topic widely studied
by experts and scholars.

X. Yan (B)
Department of Computer, Jiangsu University of Science and Technology, Zhenjiang 212003, China
e-mail: just_yx@qq.com
X. Peng
School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212003, China
Image encryption essentially scrambles the pixels of an image according to a
prescribed algorithm or substitutes the pixel values. The information of the original
image is hidden, and what an interceptor sees during the sending and receiving
process is a disordered picture [8]. The party receiving the image can restore it
using the algorithm's key. In the field of image encryption, new
security algorithms are constantly being proposed and researched [9]. The encryption
system constructed from a chaotic dynamic model can still recover the information
well after the multimedia data has undergone some signal processing [10]. Before
chaotic encryption, the image data is transformed by the wavelet transform; if one of the
wavelet coefficients is changed, the change is reflected in all pixels through the inverse
wavelet transform. The problem of image information security has a
very broad meaning, and its particularity must be considered when designing a
security algorithm [11]. An encryption system based on a chaotic dynamic model can still
decrypt the information well after the image information has undergone
some signal processing [12]. A chaotic sequence generated by the two-dimensional Logistic
map is used to adjust the template of the wavelet transform coefficients and the chaotic
scrambling method, which yields encrypted images with high security.
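For illustration, here is a minimal Python sketch of generating a chaotic key sequence from a coupled two-dimensional Logistic map. The particular coupled form, the parameter values, and the wrap of the state back into [0, 1) are assumptions made for this sketch rather than the authors' specific design.

import numpy as np

def logistic_2d_sequence(x0, y0, length, mu1=2.93, mu2=3.17, g1=0.197, g2=0.139, burn_in=500):
    """Iterate one commonly used coupled 2D Logistic map and return the x-orbit.

    x_{n+1} = mu1 * x_n * (1 - x_n) + g1 * y_n^2
    y_{n+1} = mu2 * y_n * (1 - y_n) + g2 * (x_n^2 + x_n * y_n)

    Values are wrapped back into [0, 1) as a safeguard, and the first `burn_in`
    iterations are discarded so the orbit settles onto the attractor.
    """
    x, y = x0, y0
    xs = np.empty(length)
    for i in range(burn_in + length):
        x, y = ((mu1 * x * (1 - x) + g1 * y * y) % 1.0,
                (mu2 * y * (1 - y) + g2 * (x * x + x * y)) % 1.0)
        if i >= burn_in:
            xs[i - burn_in] = x
    return xs

# usage: derive a byte key stream for pixel-value substitution / scrambling
key_stream = (logistic_2d_sequence(0.234, 0.567, length=256 * 256) * 255).astype(np.uint8)
print(key_stream[:8])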
Encryption algorithms based on this hybrid approach are widely used because they fully
consider the characteristics of the image itself. Cryptographic coding mainly studies
methods of transforming information to protect it from being stolen, interpreted
and exploited by an adversary during transmission over a channel. Because
of the high sensitivity of chaotic maps to initial conditions and their randomness, the
generated chaotic sequences can be used to encrypt and decrypt information. A complete
cryptosystem also includes the sender and the receiver, the cryptographic algorithm
and the attacker of the cipher. The encryption system constructed from a chaotic
dynamic model can still recover the multimedia information well after it has undergone
some signal processing [13]. If the attacker tries to decrypt the image
data stream directly without cracking the key, decryption cannot succeed at all.
Table 1 Performance parameters of digital image structure before and after optimization
                      Before optimization   After optimization
Rows                  136                   112
Number of columns      97                    65
Monitoring points     346                   534
Table 2 Evaluation of denoising effects at different scales
Level   PSNR    MSE      Time
1       28.36    98.53   0.869
2       27.75    95.36   0.758
3       30.51   146.81   1.123
4       31.68   112.63   1.117
5       28.73   136.57   1.545
After optimization, the integration structure has been greatly improved, with fewer nodes and a better
monitoring area. The Logistic integrated reliability optimization simulation is compared in
Fig. 2.
The multi-scale product of the larger amplitude wavelet coefficients tends to be
larger, and the multi-scale product of the smaller amplitude wavelet coefficients
tends to be smaller. For wavelet decomposition of one-dimensional signals, the one-
dimensional multi-scale product at point t is:
ω_j = d_j / Σ_{j=1}^{p} d_j = (1 − e_j) / (p − Σ_{j=1}^{p} e_j)    (1)
w(t) = w_2 + (w_1 − w_2) · (T − t) / T    (5)
Then the direct sum representation of the orthogonal complement closed subspace
sequence is as follows:
e_j = −k Σ_{i=1}^{n} f_{ij} ln f_{ij}    (6)
Chaotic phenomena are sensitive to initial conditions: if the initial conditions differ
slightly or are slightly disturbed, the final state of the system produces a hugely
different binary sequence. When processing a signal with wavelets, the transform can be
carried out in either continuous or discrete form. This avoids the drawback of the
traditional manual analysis of fringe patterns, in which it is almost impossible to obtain the
whole-field data because of the huge amount of calculation required. The high-speed computing
ability of the computer is fully utilized, which greatly shortens the time needed for fringe
analysis. After segmenting the original signal, it is found that the wavelet transform has good
time localization performance. Simulations of the time and frequency signals
of the wavelet transform show that the signal segments in the low frequency part are
relatively blurred, while in the high frequency part the original signal segments are
very clear. There is no imaginary part in the result of the continuous wavelet transform,
because the transform uses a real-valued kernel. The wavelet transform
is a reversible, signal-preserving transform, and the information of
the original image or signal is completely retained in the wavelet coefficients.
The wavelet transform simply redistributes the energy of the original image or
signal. Chaotic systems exhibit unpredictability and non-decomposability, but have
a certain regularity. In order to reflect the convergence speed and global optimal fitness
value of the clustering and cluster-head election algorithm for gateway selection
optimization, the relationship between the reciprocal of the global optimal wavelet
parameters and the number of iterations of the digital image is compared, as shown
in Table 4.
In practical image processing applications, because the image signal itself
can be regarded as a two-dimensional signal, the image is decomposed with the two-
dimensional discrete wavelet transform. As the noise variance increases, the peak
signal-to-noise ratio of each algorithm decreases, and the peak signal-to-noise ratio of
the proposed algorithm decreases relatively slowly. When the noise level is constant,
the signal-to-noise ratio of the improved algorithm is significantly higher than that of the other
algorithms. Figure 4 shows the mean square error after denoising of the lena image for each
algorithm under different noise intensities. Figure 5 is a comparison of the mean square
error of the IoT image after noise reduction for each algorithm under different noise
intensities.
Fig. 4 Comparison of mean square error after denoising of each algorithm of lena image under
different noise intensities
Fig. 5 Comparison of mean square error after noise reduction of IoT images under different noise
intensities
Classic shrinkage functions are the hard threshold function and the soft threshold function.
Since the hard threshold function is discontinuous at the threshold, it easily causes
edge oscillation in the reconstructed image. The soft threshold function cannot reach
the true value of the wavelet coefficient, which easily causes the edges of the image
to be blurred. The input simulation parameters also include the node arrival rate, the service
rate, and the number of service stations per node. The specific input
parameters are shown in Table 5.
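For reference, a minimal NumPy sketch of the two classic shrinkage functions applied to wavelet detail coefficients; how the threshold is chosen (e.g. from the noise variance estimated per sub-band) is left open here.

import numpy as np

def hard_threshold(coeffs, t):
    """Hard thresholding: keep coefficients above t, zero the rest (discontinuous at t)."""
    return np.where(np.abs(coeffs) > t, coeffs, 0.0)

def soft_threshold(coeffs, t):
    """Soft thresholding: shrink magnitudes toward zero by t (continuous, but biased)."""
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - t, 0.0)

d = np.array([-3.0, -0.5, 0.2, 1.8, 4.1])
print(hard_threshold(d, 1.0))   # [-3.   0.   0.   1.8  4.1]
print(soft_threshold(d, 1.0))   # [-2.   0.   0.   0.8  3.1]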
According to the domain in which they operate, image encryption algorithms can be
divided into spatial domain and frequency domain encryption. Among them, spatial domain
encryption has become the most widely used because of its fast speed and
good encryption effect. The local variance method relies on the principle that, after wavelet
decomposition of a digital image, the noise contained in each high-frequency sub-band of each
decomposition layer is different, so the noise variance is calculated on
each high-frequency sub-band of each wavelet decomposition layer. The comparison
of the average end-to-end delay between the gateway and the source node is shown
in Fig. 6. The packet delivery rate between the gateway and the source node at different
rates of change of node position is compared in Fig. 7.
Fig. 6 Comparison of average end-to-end delay between gateway and source nodes
Fig. 7 Comparison of packet delivery rates between gateways and source nodes
Any real orthogonal wavelet decomposition filter bank can realize image decomposition
and synthesis, but not every decomposition can meet specific requirements.
The fast algorithm for two-dimensional discrete wavelet decomposition of a digital image
can be expressed as:
When encrypting, the generation of the key stream has a great impact
on the security of the encryption. The original image signal can be reconstructed from the
low frequency approximation coefficients and the three high frequency detail coefficients
of the wavelet decomposition. The reconstruction process can be expressed as:
In the low frequency part, the frequency resolution is higher, but the time resolution is
lower; in the high frequency part, in contrast, the frequency resolution is low, but the time
resolution is high. When the image edge is extended symmetrically, the reconstructed image edge
shows less distortion, which is conducive to obtaining a high quality reconstructed image.
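Since the decomposition and reconstruction formulas referred to above are not reproduced here, the following sketch shows the corresponding operations using the PyWavelets library; the Haar wavelet and the single decomposition level are arbitrary choices for illustration.

import numpy as np
import pywt

image = np.random.rand(256, 256)

# one-level 2D DWT: low-frequency approximation cA and the three detail sub-bands
cA, (cH, cV, cD) = pywt.dwt2(image, 'haar')

# ... in the scheme described above, cA would be scrambled / substituted
#     with a chaotic sequence before reconstruction ...

# inverse 2D DWT reconstructs the image from the four coefficient arrays
reconstructed = pywt.idwt2((cA, (cH, cV, cD)), 'haar')
print(np.allclose(image, reconstructed))   # True: the transform is lossless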
4 Conclusions
In this paper, an algorithm for generating chaotic encryption templates and scrambling
sequences from two-dimensional Logistic maps is established. Secondly, the
steps of image processing using two-level chaotic encryption are given. Finally, the
method of recovering the encrypted image is given. The key sequence is generated by
scrambling technology and the chaotic system, and the coefficients in the wavelet transform
domain are encrypted. Thus confidential image information is protected without
adding extra data. In the field of image processing, the discrete wavelet transform is often
used for image analysis and processing. After multi-level wavelet decomposition of
the image, the low-frequency components of the wavelet decomposition are extracted
for diffusion and reconstruction operations. Although the lifting wavelet can perform
lossless decryption for low-level decomposition, the decrypted image still suffers
from pixel loss when higher-level decomposition is performed. With the continuous
advancement of image encryption technology, more and more practical image
encryption technologies will emerge.
References
1. Bhatnagar G, Wu QMJ, Raman B (2013) Discrete fractional wavelet transform and its
application to multiple encryption. Inf Sci 223:297–316
2. Zheng P, Huang J (2013) Discrete wavelet transform and data expansion reduction in
homomorphic encrypted domain. IEEE Trans Image Process 22(6):2455–2468
3. Tedmori S, Alnajdawi N (2014) Image cryptographic algorithm based on the Haar wavelet
transform. Inf Sci 269(11):21–34
4. Asatryan D, Khalili M (2013) Colour spaces effects on improved discrete wavelet transform-
based digital image watermarking using Arnold transform map. IET Signal Process 7(3):177–
187
5. Hu Y, Xie X, Liu X et al (2017) Quantum multi-image encryption based on iteration arnold
transform with parameters and image correlation decomposition. Int J Theor Phys 56(7):2192–
2205
6. Mehra I, Fatima A, Nishchal NK (2018) Gyrator wavelet transform. IET Image Process
12(3):432–437
7. Chun-Lai L, Hong-Min L, Fu-Dong L et al (2018) Multiple-image encryption by using robust
chaotic map in wavelet transform domain. Optik 171:277–286
8. Tong XJ, Wang Z, Zhang M et al (2013) A new algorithm of the combination of image compres-
sion and encryption technology based on cross chaotic map. Nonlinear Dyn 72(1–2):229–241
9. Hamdi M, Rhouma R, Belghith S (2016) A selective compression-encryption of images based
on SPIHT coding and chirikov standard MAP. Signal Process 131:514–526
10. (2015) Multilevel encrypted text watermarking on medical images using spread-spectrum in
DWT domain. Wirel Pers Commun 83(3):2133–2150
11. Gong LH, He XT, Cheng S et al (2016) Quantum image encryption algorithm based on quantum
image XOR operations. Int J Theor Phys 55(7):3234–3250
12. Beltrán del Río L, Gómez A, José-Yacamán M (1991) Image processing in TEM using the
wavelet transform. Ultramicroscopy 38(3–4):319–324
13. Rajakumar K, Arivoli T (2016) Lossy image compression using multiwavelet transform for
wireless transmission. Wireless Pers Commun 87(2):315–333
14. Lu H, Li Y, Uemura T, Kim H, Serikawa S (2018) Low illumination underwater light field
images reconstruction using deep convolutional neural networks. Futur Gener Comput Syst
82:142–148
15. Wang M, Zhou S, Yang Z et al (2018) Image fusion based on wavelet transform and gray-level
features. J Mod Opt 1–10
16. Sophia PE, Anitha J (2017) Contextual medical image compression using normalized wavelet-
transform coefficients and prediction. IETE J Res 63(5):1–13
17. Serikawa S, Lu H (2014) Underwater image dehazing using joint trilateral filter. Comput Electr
Eng 40(1):41–50
18. Lu H, Li Y, Mu S, Wang D, Kim H, Serikawa S (2018) Motor anomaly detection for unmanned
aerial vehicles using reinforcement learning. IEEE Internet Things J 5(4):2315–2322
19. Lu H, Li Y, Chen M, Kim H, Serikawa S (2018) Brain Intelligence: go beyond artificial
intelligence. Mob Netw Appl 23:368–375
20. Lu H, Wang D, Li Y, Li J, Li X, Kim H, Serikawa S, Humar I (2019) CONet: a cognitive
ocean network. IEEE Wirel Commun, in press
High Level Video Event Modeling,
Recognition and Reasoning via Petri Net
Abstract A Petri net based framework is proposed for automatic high level video
event description, recognition and reasoning. In comparison with the
existing approaches reported in the literature, our work is characterized by a number
of novel features: (i) the high level video event modeling and recognition based on
Petri nets are fully automatic and capable of covering not only single video
events but also any number of multiple events; (ii) more variations of event paths can
be found and modeled using the proposed algorithms; (iii) the recognition results
are more accurate thanks to the automatically built high level event models. Experimental
results show that the proposed method outperforms the existing benchmark in terms
of recognition precision and recall. As an additional advantage, hidden variations of
events that humans can hardly identify can also be recognized.
1 Introduction
With the rapid development of artificial intelligence [1, 2], computerized video
content analysis is moving towards high level semantics based approaches, where
video event recognition and reasoning has remained one of the actively researched
topics over the past decades. To narrow the gap between low level visual features and
high level semantics, existing methods focus on two levels of video event analysis. The
low level is to recognize atomic actions. Research in this area is often key-frame
based: global and local features are extracted from key frames and semantic
concept classifiers are applied to capture crucial patterns for event recognition. There
are many ways to extract low level features, which have been successfully applied
in many areas [3–5]. Hasan et al. [6] propose a framework for continuous activity
learning using deep hybrid feature models and active learning. Samanta et al. [7]
use a three-dimensional facet model to detect and describe space-time interest points
in videos. Those methods can extract low level semantics for action recognition and
event analysis based on key actions or scenes, but they give little or no consideration
to temporal and logical relations among actions when they are used to classify
events. Several improvements have been proposed. Wang et al. [8] propose a new motion
feature to compute the relative motion between visual words and present approaches
to select informative features. Cui et al. [9] propose a novel unsupervised approach
for mining categories from action video sequences. They use pixel prototypes quantized
by spatially distributed dynamic pixels to represent the video data structure.
Abbasnejad et al. [10] present a model based on the combination of semantic and
temporal features extracted from video frames that is able to detect events with
unknown starting and ending locations.
The work in this paper focuses on the high level of video event recognition.
The high level is conducted on the results of the low level to recognize events
with complex action sequences. Veeraraghavan et al. [11] present semi super-
vised event learning algorithms. The models of events are represented as stochastic
context-free grammars. Kitani et al. [12] create a hierarchical Bayesian network
by combining stochastic context-free grammar and Bayesian network. They apply
the network on action sequences via deleted interpolation to recognize events. Shet
et al. [13] use a Prolog based reasoning engine to recognize events from a log of
primitive actions and predefined rules. Song et al. [14] present a multi-modal Markov
logic (ML) framework for recognizing complex events. Liu et al. [15] present an
interval-based Bayesian generative network approach to model complex activities
by constructing probabilistic interval-based networks with temporal dependencies
in complex activity recognition. Song et al. [16] present a framework for high-level
activity analysis which consists of multi-temporal analysis, multi-temporal percep-
tion layers, and late fusion. The method can handle temporal diversity of high-level
activities. Nawaz et al. [17] propose a framework for predictive and proactive complex
event reasoning which processes, integrates, and provides reasoning over complex
events using the logical and probabilistic reasoning approaches. Skarlatidis et al. [18]
present a system for recognizing human activity given a symbolic representation of
video content. They use a dialect of the Event Calculus for probabilistic reasoning.
Cavaliere et al. [19] employ semantic web technologies to encode video tracking and
classification data into ontological statements which allow the generation of a high-
level description of the scenario through activity detection. By semantic reasoning,
the system is able to connect the simple activities into more complex activities. Jorge
et al. [20] propose a predictive method based on a simple representation of trajecto-
ries of a person in the scene which allows a high level understanding of the global
human behavior. Their method does not need predefined models and rules to evaluate
behaviors.
Since Castel et al. [21] introduced Petri nets for high level representation of image
sequences, Petri nets and their variations have been widely used in modeling video events
for detection and recognition. The Petri net is a powerful event modeling tool that
supports the representation of high level events. While places denote different states
of objects inside videos, transitions represent switches of states that are usually caused
by primitive actions performed by the tracked objects. When such a model is used
to recognize an event, each tracked object is modeled as a token moving in the
Petri net model according to its action sequence. If any token reaches the end place of
the event model, the event is claimed to have happened. During the tracking process,
event reasoning can be done to predict which event has the biggest possibility of
happening.
Albanese et al. [22] propose an extended Probabilistic Petri Nets. They present the
PPN-MPS algorithm to find the minimal sub-videos that contain a given activity with
a probability above a certain threshold. Ghanem et al. [23, 24] address the advan-
tages of using Petri nets for event recognition. They propose a framework which
provides a graphical user interface for the user to define objects and primitive events,
and then expresses composite events using logical, temporal and spatial relations.
Lavee et al. [25] propose the Particle Filter Petri Net to model and recognize activities
in videos. They also propose a method to transform semantic descriptions of events
in formal ontology languages to Petri net event models [26]. The surveillance event
recognition framework they proposed uses a single Petri net for recognition of event
occurrences in video, which allows modeling of events having variances in duration
and predicting future events probabilistically [27]. Borzin et al. [28] present video
event interpretation approach using GSPN. Through adding marking analysis into a
GSPN model, their methods provide better scene understanding and next marking
state prediction using historic data. Najla et al. [29] present an approach to auto-
matically detect abnormal high-level events in a parking lot. A Petri net model is
used to describe and recognize high-level events or scenarios that incorporate simple
events with temporal and spatial relations. Hamidun et al. [30] translate the event
sequence in the crossing scenario to the PN model. The combined effects of spatial
and temporal information were analyzed using the steady state analysis built in the
model. They point out that modeling with Petri Nets also allows the development of
model in hierarchical structure. Szwed [31, 32] proposes Fuzzy Semantic Petri Nets
(FSPN) as a tool aimed at solving video event modeling and recognition problems.
Linear Temporal Logic is used as a language for events specification and FSPN is
used as a tool for recognition. SanMiguel et al. [33] use Petri nets in the long-term
layer, which is in charge of detecting events with a temporal relation among their
counterparts. They extend the basic PN structure to manage uncertainty obtained by
the sub-events.
Existing research focuses on event recognition; video event modeling is often
ignored and remains one of the unsolved research problems. Existing efforts are
primarily limited to manual modeling approaches, including knowledge-based or
A Petri net is a directed graph constructed with four essential elements: places, transitions,
arcs and tokens. While the first three elements are used to model the static structures,
tokens are designed to reflect the dynamic states of a Petri net. For a detailed introduction
to Petri nets, we refer to [38].
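To make the place/transition/token terminology concrete, here is a minimal Python sketch of a Petri net with token firing. It is an illustrative toy with invented names, not the event model defined below.

from dataclasses import dataclass, field

@dataclass
class Transition:
    name: str            # e.g. a primitive action such as "moving"
    src: str             # input place (state before the action)
    dst: str             # output place (state after the action)

@dataclass
class PetriNet:
    transitions: list
    marking: dict = field(default_factory=dict)   # place -> number of tokens

    def fire(self, name):
        """Move one token along the named transition if its input place holds a token."""
        t = next(tr for tr in self.transitions if tr.name == name)
        if self.marking.get(t.src, 0) > 0:
            self.marking[t.src] -= 1
            self.marking[t.dst] = self.marking.get(t.dst, 0) + 1

net = PetriNet(transitions=[Transition("t01_moving", "p0_unknown", "p1_moving")],
               marking={"p0_unknown": 1})
net.fire("t01_moving")
print(net.marking)   # {'p0_unknown': 0, 'p1_moving': 1}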
With places representing the possible states of tracked objects, transitions representing
the possible primitive actions of tracked objects, and tokens standing for tracked
objects, we can define a single event model (SE-Tree) as follows, based on the concept
of a classical Petri net.
Since there exist many uncertainties inside an event, an SE-Tree often needs to
model all its possible variations, and each such variation is referred to as an
instance. To provide efficient and effective coverage of all the possible uncertainties,
we introduce the concept of a path to describe the route of an event instance.
Definition 2 (Path) A path, path = <p_0, t_0, p_1, …, t_i, p_{i+1}, …, t_{m-1}, p_m>, is a sequence
of nodes which connects the source place p_0 to one of the end places p_m (p_m ∈ P_e).
An SE-Tree modelling all paths of an event has a tree structure. We choose a
tree structure rather than a net because it introduces less ambiguity and inaccuracy. Based
on the concept of the SE-Tree as described above, a Petri net based multi event model
is defined as follows.
The process of building a high level video event model is depicted in Fig. 1.
To build a high level event model, a certain number of video segments containing
specific high level events should be prepared as the training dataset. Here we suppose
that all objects with their primitive actions and states are recognized and the target events
are labeled. Our work focuses on the last two steps. The symbols used are explained
in Table 1.
The places and transitions of SET_k are created based on the following rules.
Rule 1: A process of creation will be fired for f_u in a video segment if and only if:
(1) ∃v (FGF(f_u, o_v, event) = e_k), and
(2) FGO(o_v, state) ≠ FGF(f_u, o_v, state), and
(3) ¬∃j (p_i = FGO(o_v, place) ∧ t_ij ∈ p_i• ∧ p_j ∈ t_ij• ∧ FGP(p_j, label) = FGF(f_u, o_v, state))
Fig. 1 The process of building high level video event models: SE-Trees are built from the labeled video segments and then combined into a ME-Tree
Condition (1) requires that a tracked object o_v is conducting e_k; (2) means that there
is a change of o_v's state in f_u; (3) denotes that there is no output transition t_ij of
p_i (the place where o_v currently stays) whose output place p_j has the same label as the
new state of o_v.
If a tracked object o_v is conducting event e_k, all its states and switches of states
will be modeled. The creation will first trigger the creation of a new place p_j which
describes the new state of the object inside f_u. After the new place is created, a new
transition t_ij will be created to connect the two places p_i and p_j, where p_i denotes the
old state and p_j denotes the new state of the object o_v. After the creation, the token
representing o_v will be moved from p_i to p_j, and the values of o_v's attributes
will be updated.
Rule 2: A place is marked as an end place in SET_k if and only if the place is the final
state of an object that conducted e_k.
In order to support the certainty reasoning of video events, the probability distribution
over the places in an event model needs to be learned to estimate the likelihood of
an event's occurrence. The number of tokens that once stayed in each place and the
number of tokens that fired each transition are counted. If a place is marked as
an end place, we also count the number of tokens that end in this place and in which
event each such token ends. The specific values of these numbers are learned for generating
the probability of the event's occurrence.
Based on the above rules and formulas, the algorithm for SE-Tree building can be
described as follows.
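The original algorithm listing is not reproduced here; the following is a hedged Python sketch of how an SE-Tree could be grown from labeled per-object state sequences in the spirit of Rules 1 and 2. All names and the simplified bookkeeping are our own assumptions.

class SETree:
    """A rooted tree of states: each node is a place, each edge a transition."""
    def __init__(self, root_state="unknown"):
        self.root = root_state
        self.children = {root_state: {}}   # place -> {new state -> child place}
        self.visit_count = {root_state: 0}
        self.end_count = {}

    def add_sequence(self, states):
        """Insert one object's state sequence (one event instance) as a path."""
        current = self.root
        self.visit_count[current] += 1
        for state in states:
            if state == self._label(current):
                continue                          # Rule 1(2): only state changes create nodes
            child = self.children[current].setdefault(state, f"{current}/{state}")
            self.children.setdefault(child, {})
            self.visit_count[child] = self.visit_count.get(child, 0) + 1
            current = child
        self.end_count[current] = self.end_count.get(current, 0) + 1   # Rule 2: end place

    @staticmethod
    def _label(place):
        return place.rsplit("/", 1)[-1]

tree = SETree()
tree.add_sequence(["moving", "browsing", "moving"])
tree.add_sequence(["moving", "moving", "inactive"])
print(tree.end_count)   # counts later used to estimate the probability attributes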
A ME-Tree can be built by combining and refining several given single event
models. Not all places in all SE-Trees are added to the ME-Tree: duplicated
places are merged into one place to simplify the combined model. Let p' be
the current place considered in SET_k and p_i be the corresponding place of p' in the MET; a
new place will only be created in the MET according to the following rule.
Rule 3: A process of creation will be fired for the MET if and only if:
(1) ∃t (t ∈ T_k ∧ t ∈ p'•), and
(2) ¬∃j (t_ij ∈ T ∧ p_j ∈ P ∧ t_ij ∈ p_i• ∧ t_ij ∈ •p_j ∧ FGP(p_j, label) = FGP(t•, label))
where

PT_ij = TT_ij / (TP_i − TEP_i)    (2)

PE_i = TEP_i / TP_i    (3)

PEE_ik = TEP_ik / TEP_i    (4)

TT_ij = Σ_{k=0}^{m−1} TT_ij^k    (5)

TP_i = Σ_{k=0}^{m−1} TP_i^k    (6)

TEP_i = Σ_{k=0}^{m−1} TEP_i^k    (7)

Σ_{j=0}^{n−1} PT_ij = 1    (8)

Σ_{k=0}^{m−1} PEE_ik = 1    (9)

Σ_{k=0}^{m−1} PP_ik = 1    (10)
A token in a ME-Tree denotes a tracked object. The events that tracked objects in a
video segment have conducted or will conduct can be recognized by observing the
distribution of tokens in the ME-Tree. A marking is a distribution of tokens over the
set of places in a ME-Tree. It changes along with the firing of transitions. A transition
is enabled if and only if there are tokens in the input place of the transition as defined
in rule 4.
Rule 4: t_ij is enabled if and only if M(p_i) ≥ 1 ∧ G(t_ij) = true.
M(p_i) is the number of tokens staying in p_i under marking M. G(t_ij) denotes the set of all the guard functions on transition t_ij for firing. Here, the main guard function for firing a transition is a state change caused by a primitive action of an object or a group of objects.
When a change of the vth object's state is detected in the current frame, the transition will be fired. The token representing the vth object will be moved from p_i to p_j with the firing of t_ij. After the firing of transition t_ij, a new marking is generated according to rule 5.
Rule 5: When t_ij is fired for the vth object in the uth frame, the marking M is replaced by a new marking M' produced according to Eqs. (11) and (12).

M'(p_i) = M(p_i) − 1    (11)

M'(p_j) = M(p_j) + 1    (12)
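Under Rules 4 and 5, recognition amounts to moving tokens over the ME-Tree whenever a guarded transition fires. A minimal sketch of the enabling test and of the marking update of Eqs. (11) and (12) is shown below; the guard is reduced here to a boolean flag and all names are illustrative.

```python
# Sketch of Rule 4 (enabling) and Rule 5 (marking update, Eqs. 11-12).
# marking[p] is the number of tokens currently in place p.

def enabled(marking, p_i, guard_holds):
    """Rule 4: t_ij is enabled iff p_i holds a token and its guard is true."""
    return marking.get(p_i, 0) >= 1 and guard_holds

def fire(marking, p_i, p_j):
    """Rule 5: move one token from p_i to p_j (Eqs. 11 and 12)."""
    new_marking = dict(marking)
    new_marking[p_i] = new_marking.get(p_i, 0) - 1   # Eq. (11)
    new_marking[p_j] = new_marking.get(p_j, 0) + 1   # Eq. (12)
    return new_marking

marking = {"unknown": 1}
if enabled(marking, "unknown", guard_holds=True):
    marking = fire(marking, "unknown", "moving")
print(marking)   # {'unknown': 0, 'moving': 1}
```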
We compare against the published work [27], which uses a manually built event model, and the published work [20], which uses a non-Petri-net model; both use the CAVIAR dataset as the experimental dataset. The CAVIAR dataset contains 52 video clips of a shopping center in Lisbon. The sequences contain 1500 frames on average, and the ground truth and labels are provided. Half of the videos in the dataset are randomly chosen as the training dataset, and the other half are used as the test dataset.
Seven different contexts are considered here, numbered and referred to as given in Table 2. The ground truth information, including the context and situation information provided by the CAVIAR dataset, is used to build an SE-Tree for each high level event: the context of each object is treated as a video event, and the situation information is used to define the contexts. The seven SE-Trees built for these events are shown in Fig. 2.
After all the SE-Trees are built from the training dataset, the algorithm for building the ME-Tree is called to create an ME-Tree covering all events involved. The ME-Tree built in this way is depicted in Fig. 3, where the source place and the end places are marked and the values of the attribute PT_ij of the transitions are labeled. Other details are omitted here.
Compared to the model depicted in [27], our model captures more hidden variations of events. For example, there is no direct unknown-to-shop-enter sequence for "shop enter" in the model presented in [27], whereas it is captured by our method. As a result, precision rates and recall rates for event recognition are both improved. The reason is that there is less manual intervention during the building process of event models using the proposed method, which can find hidden variations of events that are often ignored by humans.
As stated in Sect. 3.2, the complexity of the ME-Tree depends on two factors: the number of SE-Trees and the number of places and transitions in each SE-Tree.
Fig. 2 SE-Trees built for the seven events from states such as unknown, moving, inactive, browsing, shop enter, and shop exit (the source place and the end places are marked)
An ME-Tree combines and refines the SE-Trees. If the SE-Trees share similar or identical paths or sub-paths, the ME-Tree merges them, which decreases the scale of the model. Under this circumstance, however, event recognition becomes more difficult due to frequent conflicts.
If the SE-Trees contain many paths that differ from each other, the complexity of the ME-Tree increases, since few nodes can be combined, but event recognition becomes much easier. If those paths have unique actions that distinguish them from the others, then the remaining nodes of those paths after the unique actions can be eliminated according to Algorithm 2, which brings down the complexity of the model.
Fig. 3 The ME-Tree built from all SE-Trees, with the end places marked and the transition probabilities PT_ij labeled
However, if the unique actions appear at the end or near the end of a path, then there are not many nodes left to be eliminated.
Another limitation of the proposed framework is that it only considers events that involve one object. There are many events in reality that involve more than one object. To model this kind of event, the proposed framework needs to be extended.
A token is created and added to the ME-Tree for each object. According to the firing rules described in Sect. 3.3, tokens are moved from place to place. An instance of an event has occurred once a token reaches one of the end places.
Table 3 shows the recognition results of our proposed methods applied to the test
dataset.
As seen in Table 3, there exist some false negatives and false positives, especially for "browsing" and "windowshop". The main reason is that they are similar events, which leads to similar sub-paths of the two events in the model. In fact, it is also very hard for humans to distinguish the two events. When a conflict occurs, we choose the event with the largest probability according to our model building algorithm.
In order to compare to the results reported in [27] on the same test dataset, we conduct event recognition on the whole dataset (including the training dataset and the test dataset). Table 4 compares the recognition results of our proposed method to the results listed in [27].
Since our model captures more event paths, there is an improvement in terms of precision rate, recall rate, and overall accuracy. The results for "browsing" and "windowshop" are again the worst due to the similar paths in the model. According to the experimental results, we can conclude that the more unique actions the events have, the more accurate the results are. The reason is that fewer similar paths exist among those events in the model if each event has its own unique actions.
We also conduct a series of experiments to compare the sensitivity, specificity, and accuracy of the Petri-net based methods and the non-Petri-net based method [20]. The definitions of the three performance indicators can be found in [20].
As shown in Table 5, our method outperforms the other two methods, while both Petri-net based methods outperform the non-Petri-net based one. The main reason is that the method in [20] uses a pattern recognition approach that assigns an event to a whole trajectory instead of using a predefined, semantic model of the event. Hence, the semantic gap between the input and the output can be very large.

Table 5 Comparison of recognition performance

Performance   Non-Petri-net based method [20]   Petri-net based method [27]   Our method
Sensitivity   0.772                             0.864                         0.945
Specificity   0.962                             0.978                         0.991
Accuracy      0.935                             0.962                         0.984
The VIRAT Video Dataset release 1.0 is a collection of video clips recording people performing different actions in a parking lot. Ground truth information, including event and situation information, is not provided, so we labeled the video segments in the VIRAT dataset manually. 57 video clips from the training set are used as training data, and 102 video clips from the test set are used as test data.
We use the videos in the training dataset to build a model for each event. After the ME-Tree including all paths of all events has been built, the videos in the test dataset are used to conduct event recognition. As shown in Table 6, the precision rate and the recall rate are quite acceptable.
5 Conclusions
A framework for high level video event modeling, recognition and reasoning is proposed in this paper to facilitate video content analysis. Experimental results show that the proposed method can capture variants of paths inside a high level event and achieves improved performance in comparison with the benchmarks. The accuracy of event recognition based on the proposed automatic models is improved in comparison with that based on manual models.
With a simple extension of our method, it can be applied to build models of
multiple high level events that involve more than one object. In addition, the proposed
method provides an efficient tool for managing video content analysis and interpre-
tation in terms of high level events, leading to potential applications for high level
understanding of the video content by computers.
Acknowledgements The work reported in this paper was supported by the National Engineering
Laboratory of China for Big Data System Computing Technology. The work reported in this paper
was supported in part by the National Natural Science Foundation of China under grant no. 61836005
and no. 61672358, the Natural Science Foundation of Guangdong Province, China, under grant no.
2017A030310521, the Science and Technology Innovation Commission of Shenzhen, China, under
grant no. JCYJ20160422151736824.
References
9. Cui P, Wang F, Sun L-F, Zhang J-W, Yang S-Q (2012) A matrix-based approach to unsupervised
human action categorization. IEEE Trans Multimedia 14(1):102–110
10. Abbasnejad I, Sridharan S, Denman S, Fookes C, Lucey S (2016) Complex event detection using joint max margin and semantic features. In: Proceedings of the international conference on digital image computing—techniques and applications, Gold Coast, QLD, Australia
11. Veeraraghavan H, Papanikolopoulos NP (2009) Learning to recognize video-based spatiotem-
poral events. IEEE Trans Intell Transp Syst 10(4):628–638
12. Kitani KM, Sato Y, Sugimoto A (2005) Deleted Interpolation using a hierarchical bayesian
grammar network for recognizing human activity. In: Proceedings of the 2nd joint IEEE
international workshop on VS-PETS, pp 239–246. Beijing
13. Shet VD, Harwood D, Davis LS (2005) VidMAP: video monitoring of activity with prolog. In:
Proceedings of IEEE conference on advanced video and signal based surveillance (AVSS), pp
224–229
14. Song YC, Kautz H, Allen J, Swift M, Li Y, Luo J (2013) A Markov logic framework for
recognizing complex events from multimodal data. In: Proceedings of the ACM on international
conference on multimodal interaction (ICMI), pp 141–148
15. Liu L, Wang S, Hu B, Qiong Q, Wen J, Rosenblum DS (2018) Learning structures of
interval-based Bayesian networks in probabilistic generative model for human complex activity
recognition. Pattern Recogn 81:545–561
16. Song D, Kim C, Park S-K (2018) A multi-temporal framework for high-level activity analysis:
violent event detection in visual surveillance. Inf Sci 447:83–103
17. Nawaz F, Janjua NK, Hussain OK (2019) PERCEPTUS: predictive complex event processing
and reasoning for IoT-enabled supply chain. Knowl-Based Syst 180:133–146
18. Skarlatidis A, Artikis A, Filippou J, Paliouras G (2015) A probabilistic logic programming
event calculus. Theory Pract Logic Program 15(2):213–245
19. Cavaliere D, Loia V, Saggese A, Senatore S, Vento M (2019) A human-like description of
scene events for a proper UAV-based video content analysis. Knowl-Based Syst 178:163–175
20. Azorin-Lopez J, Saval-Calvo M, Fuster-Guillo A, Garcia-Rodriguez J (2016) A novel predic-
tion method for early recognition of global human behaviour in image sequences. Neural
Process Lett 43:363–387
21. Castel C, Chaudron L, Tessier (1996). What is going on? A high-level interpretation of a
sequence of images. In: Proceedings of the ECCV workshop on conceptual descriptions from
images. Cambridge, U.K
22. Albanese M, Chellappa R, Moscato V, Antonio Picariello VS, Subrahmanian PT, Udrea O
(2008) A constrained probabilistic Petri net framework for human activity detection in video.
IEEE Trans Multimedia 10(6):982–996
23. Ghanem N, DeMenthon D, Doermann D, Davis L (2004) Representation and recognition of
events in surveillance video using Petri nets. In: Proceedings of the 2004 IEEE computer society
conference on computer vision and pattern recognition workshops (CVPRW’04)
24. Ghanem N (2007) Petri net models for event recognition in surveillance videos. Doctor thesis,
University of Maryland
25. Lavee G, Rudzsky M, Rivlin E (2013) Propagating certainty in Petri nets for activity recognition.
IEEE Trans Circuits Syst Video Technol 23(2):337–348
26. Lavee G, Borzin A, Rivlin E, Rudzsky M (2007) Building Petri nets from video event ontologies.
In: Proceedings of the international symposium on visual computing, Part I, LNCS, vol 4841,
pp 442–451
27. Lavee G, Rudzsky M, Rivlin E, Borzin A (2010) Video event modeling and recognition in
generalized stochastic Petri nets. IEEE Trans Circuits Syst Video Technol 20(1):102–118
28. Borzin A, Rivlin E, Rudzsky M (2007) Surveillance event interpretation using generalized
stochastic petri nets. In: Proceedings of the 8th international workshop on image analysis for
multimedia interactive services (WIAMIS’07)
29. Ghrab NB, Boukhriss RR, Fendri E, Hammami M (2018) Abnormal high-level event
recognition in parking lot. Adv Intell Syst Comput 736:389–398
30. Hamidun R, Kordi NE, Endut IR, Ishak SZ, Yusoff MFM (2015) Estimation of illegal crossing
accident risk using stochastic petri nets. J Eng Sci Technol 10:81–93
31. Szwed P (2016) Modeling and recognition of video events with fuzzy semantic petri nets.
Skulimowski AMJ, Kacprzyk J (eds) Knowledge, information and creativity support systems:
recent trends, advances and solutions, pp 507–518
32. Szwed P (2014) Video event recognition with fuzzy semantic petri nets. Gruca A et al. (eds)
Man-machine interactions, vol 3, pp 431–439
33. SanMiguel JC, Martínez JM (2012) A semantic-based probabilistic approach for real-time
video event recognition. Comput Vis Image Underst 116:937–952
34. Liu L, Wang S, Guoxin S, Bin H, Peng Y, Xiong Q, Wen J (2017) A framework of mining
semantic-based probabilistic event relations for complex activity recognition. Inf Sci 418–
419:13–33
35. Kardas K, Cicekli N (2017) SVAS: surveillance video analysis system. Expert Syst Appl
89:343–361
36. Acampora G, Foggia P, Saggese A, Vento M (2015) A hierarchical neuro-fuzzy architecture
for human behavior analysis. Inf Sci 310:130–148
37. Caruccio L, Polese G, Tortora G, Iannone D (2019) EDCAR: a knowledge representation
framework to enhance automatic video surveillance. Expert Syst Appl 131:190–207
38. Murata T (1989) Petri nets: properties, analysis and applications. Proc IEEE 77(4):541–580
39. CAVIAR. http://groups.inf.ed.ac.uk/vision/CAVIAR/CAVIARDATA1/
40. VIRAT video dataset release 1.0. http://midas.kitware.com
Question Generalization in Conversation
1 Introduction
2 Related Work
In recent years, Deep Learning (DL) has surmounted many complex tasks in the field
of natural language processing. It has reduced the difficulty of the research direction
that was considered difficult to break through in the past.
In 2014, Sutskever et al. proposed a paper based on neural network machine trans-
lation [3]. This new paradigm was quickly applied to the automatic generation of
chat responses, but it still could not reproduce the interaction of human communi-
cation. Subsequently, in order to solve a potential problem of the encoder-decoder in the traditional model, namely that information was compressed into fixed-length vectors and could not cope with long sentences, Bahdanau et al. proposed the attention mechanism [7] so that each word of each sentence in the new model had its own context. In 2015, Mukherjee trained sequence-to-sequence models on a large number of conversational datasets and found that they were able to extract knowledge from a specific domain dataset and could perform simple common sense reasoning.
Later, in view of the suggestion of attention, Yao et al. proposed using the three RNN
structures of Encoder RNN, Intention RNN and Decoder RNN to train the dialogue
system, and used the two mechanisms of attention and intention to improve the effect
of multi-round dialogue [6]. Since the traditional seq2seq model usually tended to generate safe and generic replies, such as "I don't know" or "I am ok", Li et al. proposed a new objective function, MMI, to optimize the seq2seq model [4], making
the generated replies more diverse. In 2016, Li et al. proposed a method of using
reinforcement learning to optimize the Seq2Seq parameters based on MMI [8] and
achieved a better response effect. In 2018, Liu et al. added a distribution weight [9]
to the objective function of Seq2Seq, which achieved an effect beyond MMI [8] and
was able to remove safe and uninformative responses more effectively.
The work of the above scholars all focused on how to improve the quality of
machine reply, but in the process of human-computer interaction, asking questions
is also a very important step, and it can improve people’s interest in dialogue so that
the conversation will not be boring. Based on these considerations, we shift the focus of our work to question generation in the dialogue generation model.
3 Method
Figure 1 shows the framework of the combination in our model. At the decoding
layer, we first filter the results of model decoding and prioritize the Output of the
Questions from Beam Search (OQBS) with the trigger probability of λ. The output
sentence (not question) is then used as input to determine whether the sentence can
be converted to a WH-type question by the Named Entity Recognizer (WH-NER). If no entity in the sentence can be identified, the sentence is passed to the next method, which judges whether it can be successfully converted into a General Question by Grammatical Analysis (GQGA) with the trigger probability of μ. In the case of
determining that it can be successfully converted into a general question, we introduce
an additional conversion technique to judge whether the existing General Question
can be converted into a WH-type question (GQ-WH) with the trigger probability of
1 − μ.
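The decoding-layer pipeline can be read as a cascade of probabilistic triggers. The sketch below only illustrates that control flow; the functions is_question, to_wh_question and to_general_question are hypothetical stand-ins for OQBS, WH-NER and GQGA, and the default arguments are placeholders.

```python
import random

# Sketch of the probability-triggered cascade at the decoding/output layer.
# is_question / to_wh_question / to_general_question are hypothetical stubs
# standing in for OQBS, WH-NER and GQGA respectively.

def generate_reply(nbest, lam=0.3, mu=0.5,
                   is_question=lambda s: s.endswith("?"),
                   to_wh_question=lambda s: None,       # None if no entity found
                   to_general_question=lambda s: s + " ?"):
    best = nbest[0]
    # OQBS: with probability lambda, prefer a question already in the n-best list
    if random.random() < lam:
        for cand in nbest:
            if is_question(cand):
                return cand
    if is_question(best):
        return best
    # WH-NER: try to turn the declarative reply into a WH-type question
    wh = to_wh_question(best)
    if wh is not None:
        return wh
    # GQGA: with probability mu, convert to a general question; with 1 - mu,
    # a further conversion to a WH-type question (GQ-WH) would be attempted here
    if random.random() < mu:
        return to_general_question(best)
    return best

print(generate_reply(["i like this movie", "do you like it ?"]))
```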
As mentioned earlier, the sentences in dialogue differ from those in traditional Question-Asking [10–16] task datasets, which are more targeted and normative as training data. Given this limitation, we propose three artificial construction methods to generate different types of questions on a sequence-to-sequence model based chat-robot. The overall method consists of three parts: (i) Output existing Questions from Beam Search (OQBS), (ii) Declarative sentence to WH-type question (WH-NER), and (iii) Declarative sentence to the General Question (GQGA).
• Output existing Questions from Beam search:
A. Question identification
B. Question selection
With the help of the question words, a question can be selected from the response candidates. At the decoding layer, we use beam search [17] to generate the N-best list of replies with the highest response probabilities. In order to select the most probable question from the candidates with a beam size of n, we use the above question identification method to determine the question. Let H = (h_1, h_2, …, h_n) be the candidate replies of the input sentence S, and O = (o_1, o_2, …, o_n) be the probability set corresponding to the sentences in H. Mathematically, for an input S, the newly generated probability sequence outputs the maximum-probability sentence as:
R = q · P(arg max(O_i) | o_1, o_2, …, o_n, H) | A_original    (1)
where q represents the probability of the method being triggered, and A_original represents the original statement to be output. When the trigger probability does not take effect, the original sentence is output.
We model the candidate set according to the probability distribution as:

u = Q_i(h_1, h_2, …, h_n), i_start = n    (3)

Here i_start is the list subscript at which the search starts. Q_i is a function for finding a question, and the result of the search is assigned to:

u = i,  if 1 ≤ i ≤ n
u = 0,  if i = null    (4)
In the original output list, O_n is the maximum output probability of the candidate set. The probability of the found question, O_u, is reset as follows:

O_u = O_n + 1    (5)
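Concretely, Eqs. (3)–(5) re-rank the n-best list so that the first question found is assigned the maximum probability plus one. A small sketch is given below; the candidate data and the question test are illustrative, and the search order is simplified.

```python
# Sketch of the question-prioritizing re-ranking over the beam-search
# n-best list (Eqs. 3-5). Candidates and the question test are illustrative.

def prioritize_question(candidates, probs, is_question=lambda s: s.endswith("?")):
    # Eq. (3)/(4): locate the first question in the candidate list
    u = 0
    for i, cand in enumerate(candidates, start=1):
        if is_question(cand):
            u = i
            break
    if u == 0:                      # no question found: keep original ranking
        return candidates[probs.index(max(probs))]
    # Eq. (5): boost the found question above the current maximum probability
    new_probs = list(probs)
    new_probs[u - 1] = max(probs) + 1
    return candidates[new_probs.index(max(new_probs))]

cands = ["i do not know", "what do you mean ?", "it is fine"]
print(prioritize_question(cands, [0.42, 0.31, 0.27]))   # "what do you mean ?"
```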
Stanford NER [18] is also known as the CRF Classifier (conditional random field classifier). It provides a general implementation of a linear-chain conditional random field (CRF) sequence model. In the following, we use Stanford NER to identify Organization, Person, and Location named entities in the vocabulary, and then match the three question types What, Who, and Where to each of them, respectively. Figure 2 shows the correspondence between named entities and question types.
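A simplified sketch of this mapping step is shown below. It assumes a generic tagger output of (token, entity-type) pairs rather than binding to the actual Stanford NER interface, and the substitution rule is deliberately reduced to a direct word replacement.

```python
# Sketch of the WH-NER step: replace the first recognized named entity with
# the matching WH word (ORGANIZATION -> What, PERSON -> Who, LOCATION -> Where).
# The tagged input is a stand-in; in the paper Stanford NER provides the tags.

WH_MAP = {"ORGANIZATION": "What", "PERSON": "Who", "LOCATION": "Where"}

def to_wh_question(tokens_with_tags):
    """tokens_with_tags: list of (token, entity_tag) pairs."""
    for idx, (tok, tag) in enumerate(tokens_with_tags):
        if tag in WH_MAP:
            words = [t for t, _ in tokens_with_tags]
            words[idx] = WH_MAP[tag]          # substitute the entity by a WH word
            return " ".join(words) + " ?"
    return None                                # no entity: fall through to GQGA

tagged = [("he", "O"), ("works", "O"), ("in", "O"), ("london", "LOCATION")]
print(to_wh_question(tagged))   # "he works in Where ?"
```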
B. Interlaced combination
Fig. 2 Correspondence between named entities and question types
According to the basic structural unit of the general question, which is composed of an auxiliary verb, a subject, and a predicate verb, the sentence S_t is first divided into tokens T_i = (T_1, T_2, T_3, …, T_i), where T_i refers to one word of the sentence S_t, and the tag and position information of each token are recorded by Python's NLTK parser. The first auxiliary verb A_verb or modal verb M_verb found in the sentence is advanced to the front of the subject word T_sub, where S_t = (A_verb, T_sub, …, T_i | M_verb, T_sub, …, T_i). If the declarative sentence does not contain an auxiliary verb or modal verb, 'do (does, did)' is added before the subject word to form a new general question, S_t = (Do, T_sub, …, T_i). It is worth noting that, in order to ensure grammatical correctness, the subject word and the object word in the sentence also need to be transformed, where S_t = (Do, T_sub-transform, …, T_i). For example, 'I' needs to be transformed into 'you' (Fig. 4).
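The transformation just described can be sketched in a few lines of Python. Instead of the NLTK tagger used in the paper, a hand-written list of auxiliary and modal verbs and a tiny pronoun-swap table are used here so that the snippet is self-contained; both lists are illustrative and far from complete.

```python
# Sketch of GQGA: move the first auxiliary/modal verb before the subject, or
# insert "do" when none exists, and swap first/second person words.
# The word lists below are simplified stand-ins for the NLTK-based tagging.

AUX_MODAL = {"am", "is", "are", "was", "were", "can", "could", "will",
             "would", "shall", "should", "may", "might", "must",
             "have", "has", "had"}
PERSON_SWAP = {"i": "you", "my": "your", "am": "are", "you": "I", "your": "my"}

def to_general_question(sentence):
    tokens = sentence.strip().rstrip(".").split()
    swapped = [PERSON_SWAP.get(t.lower(), t) for t in tokens]
    for idx, tok in enumerate(tokens):
        if tok.lower() in AUX_MODAL and idx > 0:
            # advance the auxiliary/modal verb to the front of the subject
            aux = swapped.pop(idx)
            return " ".join([aux.capitalize()] + swapped) + " ?"
    # no auxiliary or modal verb: prepend "Do" to form the question
    return "Do " + " ".join(swapped) + " ?"

print(to_general_question("I can play the piano."))   # "Can you play the piano ?"
print(to_general_question("You like this movie."))    # "Do I like this movie ?"
```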
Fig. 4 Example of converting a declarative sentence to a General Question when a modal verb is found and when 'Do' is inserted into the sentence
In this section, we first describe the experimental setting we use for the evaluations. Secondly, we compare the question generation results obtained by our method with the baseline on the OpenSubtitles1 dataset. The experimental results show that our method can not only increase the richness of the responses generated by the original model, but also effectively increase the number of dialogue rounds.
Datasets: Dialogue generation requires high-quality initial inputs to be fed to the agent. We conduct experiments on OpenSubtitles1, which contains a large number of movie subtitles and is one of the benchmark datasets in dialogue generation. The advantage of using OpenSubtitles is that its conversations are continuous conversations within a scene. We randomly select 2.5 million pairs for training, 1000 instances for validation and 500 instances for testing.
Baseline Dialogue Generation (NMT-Keras): Our baseline conversation generation
model is Google Neural Machine Translation with Keras2 (Theano and Tensorflow).
This is a versatile translation model composed of an attentional recurrent neural network NMT model and a Transformer NMT model.
Dialogue Generation with Question Generation (NMT-Q): Based on the original
Neural Machine Translation with Keras, we propose our new dialogue generation
system with question generation at the decoding layer and output layer.
Parameter setting: In the decoding layer, we use beam search with a beam size of 10 to generate a response to a given input message. For the NMT-Q model, we first generate 10-best lists; if the sentence with the highest generation probability is not a question, we assign the first question found in the candidate list a probability equal to the maximum generation probability plus one. The trigger probability of OQBS is set to 0.3. In the output layer, the trigger probability for WH-NER is 0.3. If the conversion in WH-NER fails, GQGA is triggered with a probability of 0.5; otherwise the original sentence is output.
1 https://www.opensubtitles.org/en/search/subs.
2 https://github.com/lvapeab/nmt-keras.
Table 2 The average number of dialogue turns from original NMT model and proposed question
generation model
Model Average number of dialogue turns
NMT-Keras 1.918
NMT-Q 2.388
Table 3 The NMT-Keras and NMT-Keras dialogue results and the NMT-Keras and NMT-Q
dialogue results are shown separately
NMT-Keras (A) and NMT-Keras (B) NMT-Keras (A) and NMT-Q (B)
INPUT: their reckless disregard in cairo INPUT: their reckless disregard in cairo
brought on this violence today! brought on this violence today!
A: I don’t know what to do with it A: I don’t know what to do with it
B: I don’t know, but I don’t know B: You know what?
A: I don’t know, but I don’t know A: what you’re talking about
B: I don’t know, but I don’t know B: You know what I mean?
A: I don’t know, but I don’t know A: Do you know what I can do?
It is noteworthy that, starting from the second sentence produced by the robot, once a question is generated, the subsequent sentences change accordingly. In contrast to the original model, which produces a large number of highly repetitive responses in the conversation, our approach alleviates this drawback to some extent.
Next, we report the Diversity scores of the two models, NMT-Q and NMT-Keras. As shown in Fig. 5, our proposed model achieves a Unigram score of 0.028 and a Bigram score of 0.103, both surpassing the baseline model. The results demonstrate that, with the proposed interface methods and the designed beam search algorithm, the overall increase in sentence diversity results in more versatile responses from the machine. This also indirectly indicates that a beam search mechanism relying directly on maximum likelihood estimation is not conducive to solving specific problems in natural language processing, but merely falls back on the most common responses in the corpus.
Fig. 5 Diversity scores (type-token ratios) for the standard NMT-Keras model and the NMT-Q model
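The diversity score used here is a type-token ratio over the generated responses. A minimal sketch of the unigram and bigram versions is given below; the sample responses are illustrative only.

```python
# Type-token ratio diversity over generated responses: distinct unigrams
# (or bigrams) divided by the total number of unigrams (or bigrams).

def diversity(responses, n=1):
    ngrams = []
    for resp in responses:
        toks = resp.split()
        ngrams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

replies = ["i don t know", "you know what ?", "do you know what i can do ?"]
print(round(diversity(replies, 1), 3), round(diversity(replies, 2), 3))
```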
In this paper, we propose two types of question transformers and a search mechanism that improves beam search so as to increase the number of questions produced by a standard sequence-to-sequence model, targeting different conversion contexts. At the same time, we use a probability-triggered interleaving combination mechanism to control the chat-bot so that it actively and reasonably proposes different types of questions. Finally, we use the evaluation metrics of dialogue response generation systems to show that our proposed method can effectively improve the number of dialogue rounds and the diversity of responses. The method we propose can be applied to any dialogue model as an external interface. However, current chat bots cannot reach the level of human intelligence, and simply using the probabilistic trigger combination mechanism can sometimes be too rigid. In future work, we will use reinforcement learning to optimize the probability trigger mechanism so that the system can automatically determine the conversion context in a dialogue and propose questions more intelligently.
References
Abstract The huge volume of IoT data generated by emerging IoT end devices has triggered the prosperous development of Fog computing in the past years, mainly due to its real-time requirements. Fog computing aims at forming the idle edge devices in the vicinity of IoT end devices into instantaneous small-scale Fog networks (Fogs), so as to provide one-hop services that satisfy the real-time requirement. Since Fogs may consist of only wireless nodes, only wired nodes, or both, it is important to map IoT tasks with diverse QoS requirements to appropriate types of Fogs, in order to optimize the overall Fog performance in terms of OPEX cost and transmission latency. Regarding this, we propose an integer linear programming (ILP) model to optimally map the IoT tasks to different Fogs and/or the Cloud, taking into consideration the task mobility and real-time requirements. Numerical results show that the real-time and mobility requirements have a significant impact on the OPEX cost of the integrated Cloud-Fog (iCloudFog) framework.
1 Introduction
The current datacenters in the Cloud are fixed and distant from the IoT end devices and thus fall short of providing real-time services and supporting task mobility when confronting the emerging IoT tasks characterized by velocity, volume and variety. Regarding this, Fog computing was coined with the motivation of provisioning services by edge devices in their vicinity, so as to minimize the transmission latency for real-time IoT tasks. In the past years, great progress has been made in Fog computing in various aspects, such as Fog planning [1–5], energy consumption, Fog applications in various fields, and emerging Fog computing with 5G [6].
Specifically, the authors in [7] proposed an integrated Cloud-Fog computing framework and described the major challenges and solutions. In [8, 9], the authors proposed an optimized Fog planning model aiming at minimizing the overall network delay and the number of tasks sent to the Cloud, taking into account the different attributes of end devices, Fog nodes and links, and the optimal installation locations of Fog nodes. The authors in [10] applied Fog computing to the traffic system by compressing high frame-rate video streams at Fog nodes. In [11], the authors proposed to merge multi-access Edge computing with 5G networks, which was said to improve the QoS and utilize the mobile backhaul and core networks more efficiently. The authors in [12] provided a theoretical analysis to optimize the power consumption arising from content caching and dissemination in hot spots, fronthaul links, and rural areas. The use of backhaul/fronthaul links was traded off against the efficiency of content distribution. In [13], the authors designed an architecture that addressed some of the major challenges for the convergence of Network Function Virtualization (NFV), 5G/Mobile Edge Computing (MEC), IoT and Fog computing.
Most of the existing literature focuses on Fog planning, energy consumption, applications of Fog computing, convergence of Fog and advanced 5G technologies, etc. Very few works have addressed the important issue of mapping the huge number of IoT tasks to the integrated Cloud and Fog networks (iCloudFog). Regarding this, we set our goal as optimally mapping IoT tasks to dynamic Fogs and/or the Cloud. Specifically, we propose an integer linear programming (ILP) model with the objective of minimizing the overall OPEX cost and transmission delay. In the proposed model, we take into account both the diverse QoS requirements of IoT tasks, such as real-time and mobility requirements, and the network attributes, such as the resource availability status of different Fogs.
The rest of this paper is structured as follows. In Sect. 2, the integrated Cloud-Fog
(iCloudFog) architecture together with the characteristics of IoT tasks are introduced.
In Sect. 3, we introduce the proposed ILP model in details. The numerical results of
the proposed ILP model based on AMPL are given and analyzed in Sect. 4. Section 5
concludes this paper.
Fig. 1 iCloudFog framework. FT-3: three Fog types; IT-8: eight IoT tasks; WL: wireless Fog; WD:
wired Fog; HB: hybrid Fog
requirements on real-time and/or mobility. We assume that wireless and hybrid Fogs can handle IoT tasks with real-time and mobility requirements, while wired Fogs cannot. For the OPEX, we mainly consider the cost of acquiring resources from different Fog types. For example, the cost of using wireless Fog links is the highest, followed by the hybrid Fog links and the wired Fog links. Since the wired and hybrid Fogs are directly connected with the Cloud in the top layer, the cost of using Cloud links is higher than that of using any Fog type. For the proposed ILP model, the sets, parameters, objectives and constraints are given as follows. Note that we consider two objectives under two different situations.
• it2c_i: Binary variable, which is one if IoT task i is handled by the Cloud due to insufficient computing resources in the Fogs, i.e., it2c_i = 1; zero otherwise, i ∈ IT.
(3) Objective Functions
• Minimize (Total_OPEX):

Σ_{f∈FT, i∈IT} st_{f,i} · OpEx_IT2FT_f + Σ_{f∈FT, i∈IT} st_{f,i} · it2c_i · OpEx_FT2C_f    (1)

• Maximize (Success_Tasks):

Σ_{f∈FT, i∈IT} st_{f,i}    (2)
Our Objective is to minimize the OPEX while maximizing the number of IoT
tasks successfully served under the iCloudFog framework as shown in Eqs. (1) and
(2). The first objective in (1), say Total_OPEX, aims at minimizing the OPEX due
to using Fog links and Cloud links. The second objective in (2), say Success_Tasks,
aims at maximizing the total number of IoT tasks that are successfully served.
• Weighted Sum Optimization
(4) Constraints

Σ_{f∈FT} st_{f,i} ≤ 1, ∀i ∈ IT    (4)

Σ_{i∈IT} st_{f,i} · ITC_i · (1 − it2c_i) ≤ FTC_f, ∀f ∈ FT    (7)

Σ_{i∈IT} st_{f,i} · ITS_i · (1 − it2c_i) ≤ FTS_f, ∀f ∈ FT    (8)

Σ_{i∈IT} st_{f,i} · (1 − it2c_i) ≤ FTTQ_f, ∀f ∈ FT    (10)
Constraint (4) ensures that any IoT task can only be served by one Fog type at
most. Constraint (5) ensures that when an IoT task requires real-time service, the Fog
type serving it must have the ability to handle real-time tasks. Constraint (6) ensures
that if an IoT task is a real-time task, it can only be served by Fogs but cannot be
forwarded to the cloud. Vice versa, if it is not a real-time task, it can be uploaded to
the Cloud for processing. Constraints (7) and (8) ensure that for any Fog type, the
sum of demands on computing/storage resource of all IoT tasks served by it cannot
exceed its total amount of computing/storage resource. Constraint (9) ensures that
for any IoT task with mobility requirement, it can only be served by Fogs which can
provide mobility. For IoT tasks with no mobility requirement, they can be served by
any of the Fogs or Cloud. Constraint (10) ensures that for any Fog type, the total
number of IoT tasks that are successfully served must be no larger than the total
number of IoT task demands.
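As an illustration of how such a task-to-Fog assignment model could be set up, the sketch below uses the PuLP library. It covers only the assignment variables st_{f,i}, the Fog-link OPEX term, and constraints of the type in (4), (7) and (8); the Cloud-offloading variables and the real-time/mobility constraints are omitted for brevity, and every cost and capacity number is a made-up placeholder rather than a value from the paper.

```python
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

# Illustrative data: 3 Fog types, 4 IoT tasks (all numbers are placeholders).
FT, IT = ["WL", "WD", "HB"], [0, 1, 2, 3]
opex = {"WL": 5, "WD": 1, "HB": 3}          # OpEx_IT2FT_f
itc = {0: 2, 1: 4, 2: 1, 3: 3}              # ITC_i: computing demand
its = {0: 1, 1: 2, 2: 1, 3: 2}              # ITS_i: storage demand
ftc = {"WL": 4, "WD": 8, "HB": 6}           # FTC_f: computing capacity
fts = {"WL": 3, "WD": 5, "HB": 4}           # FTS_f: storage capacity

prob = LpProblem("iot_to_fog_mapping", LpMinimize)
st = LpVariable.dicts("st", [(f, i) for f in FT for i in IT], cat=LpBinary)

# Simplified Eq. (1): OPEX of serving tasks through Fog links only
prob += lpSum(st[(f, i)] * opex[f] for f in FT for i in IT)

for i in IT:                                 # Eq. (4): at most one Fog per task
    prob += lpSum(st[(f, i)] for f in FT) <= 1
for f in FT:                                 # Eqs. (7)/(8): capacity limits
    prob += lpSum(st[(f, i)] * itc[i] for i in IT) <= ftc[f]
    prob += lpSum(st[(f, i)] * its[i] for i in IT) <= fts[f]
for i in IT:                                 # serve every task in this toy example
    prob += lpSum(st[(f, i)] for f in FT) == 1

prob.solve()
print({(f, i): int(st[(f, i)].value()) for f in FT for i in IT if st[(f, i)].value()})
```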
4 Numerical Evaluation
Fig. 2 Value setting of the ILP model. OPEX_wd2c/OPEX_hb2c: the OPEX cost of traversing wired/hybrid Fog-to-Cloud links. OPEX_it2wl/OPEX_it2wd/OPEX_it2hb: the OPEX cost of provisioning IoT tasks via wireless/wired/hybrid Fog links
Figures 3 and 4 show the total OPEX cost under the objective 1 in Eq. 1 by assuming
all the 32 IoT tasks have been served successfully. In Fig. 3, we assume two cases
differing in the numbers of IoT tasks with mobility (MB), i.e., MB = 0 and 5,
respectively. We can observe that the total OPEX costs of both cases increase with
increasing number of IoT tasks requiring real-time (RT) services. Similarly, in Fig. 4,
we assume two cases differing in the number of IoT tasks with real-time requirement
(RT), i.e., RT = 0 and 5, respectively. We can observe that the total OPEX costs
of both cases increase with the increasing number of IoT tasks requiring mobility
services.
Figures 5, 6, and 7 show the numerical results under the weighted objective 3 in
Eq. 3 which aims at minimizing the OPEX cost and meanwhile maximizing the total
number of IoT tasks successfully served. Figure 5 shows the impact of the number
of IoT tasks successfully served on the total OPEX cost. We can observe that the
Fig. 5 a Total OPEX cost under objective 3 (Eq. 3). MB: # of mobile IoT tasks. b The range of α
and the impact of α on the objectives
Fig. 7 Total OPEX versus # of IoT tasks served under objective 3 in Eq. (3)
total OPEX increases with the increasing number of successfully served IoT tasks, which is reasonable. Figure 6a shows the total OPEX cost with increasing RTs under different MBs, say MB = 0 and 5, respectively, when the weight value α is set to 86. We can observe that when the number of RT tasks reaches 7 and 10 for MB = 5 and 0, respectively, the total OPEX cost decreases drastically. This is attributed to the fact that the number of IoT tasks successfully served drops drastically because of the lack of resources in Fogs that can support real-time and/or mobility services. Figure 7 shows the OPEX cost and the total number of IoT tasks
that are successfully served by different Fog types under different RT and MB values.
We can observe that the WD Fogs are used the most frequently, followed by the HB
Fogs and the WL Fogs.
5 Conclusion
Acknowledgements This study was supported in part by the BK21 Plus project (SW Human
Resource Development Program for Supporting Smart Life) funded by the Ministry of Educa-
tion, School of Computer Science and Engineering, Kyungpook National University, Korea
(21A20131600005) and in part by the National Research Foundation of Korea (NRF) Grant funded by the Korean government (Grant No. 2018R1D1A1B07051118).
References
1. Zhang D et al (2019) Model and algorithms for the planning of fog computing networks. IEEE
Internet Things J 6(2):3873–3884
2. Yousefpour A et al (2019) FogPlan: a lightweight QoS-aware dynamic fog service provisioning framework. IEEE Internet Things J 6(3):5080–5096
3. Li Y, Jiang Y, Tian D, Hu L, Lu H, Yuan Z (2019) AI-enabled emotion communication. IEEE
Netw 33(6):15–21
4. Lu H, Liu G, Li Y, Kim H, Serikawa S (2019) Cognitive Internet of vehicles for automatic
driving. IEEE Netw 33(3):65–73
5. Lu H, Wang D, Li Y, Li J, Li X, Kim H, Serikawa S, Humar I (2019) CONet: a cognitive ocean
network. IEEE Wirel Commun 26(3):90–96
6. Sodhro A et al (2018) 5G-based transmission power control mechanism in Fog computing for
internet of things devices. Sustainability 10(4):1258
7. Peng L et al (2018) Toward integrated Cloud-Fog networks for efficient IoT provisioning: Key
challenges and solutions. Future Gener Comput Syst 88:606–613
8. Haider F (2018) On the planning and design problem of fog networks. Carleton University
9. Deng R et al. (2015) Towards power consumption-delay tradeoff by workload allocation in
cloud-fog computing. In: IEEE international conference on communications (ICC), London
10. Liu J et al (2018) Secure intelligent traffic light control using fog computing. Future Gener
Comput Syst 78(2):817–824
11. Taleb T et al (2017) On multi-access edge computing: a survey of the emerging 5G network
edge cloud architecture and orchestration. IEEE Commun Surv Tutor 19(3):1657–1681
12. Lien SY et al (2018) Energy-optimal edge content cache and dissemination: designs for practical
network deployment. IEEE Commun Mag 56(5):88–93
13. Van Lingen F et al (2017) The unavoidable convergence of NFV, 5G, and fog: a model-driven
approach to bridge cloud and edge. IEEE Commun Mag 55(8):28–35
Context-Aware Based Discriminative
Siamese Neural Network for Face
Verification
Qiang Zhou, Tao Lu, Yanduo Zhang, Zixiang Xiong, Hui Chen,
and Yuntao Wu
Abstract Although face recognition and verification algorithms have achieved great success under controlled conditions in recent years, in real-world uncontrolled application scenarios there remains a fundamental challenge: how to guarantee the discriminative ability of features extracted from varying inputs for the face verification task. Aiming at this problem, we propose a context-aware based discriminative Siamese neural network for face verification. In fact, the structure of a facial image is more stable than hairstyle changes and worn jewelry. Firstly, we use a context-aware module to anchor facial structure information by filtering out irrelevant information. To improve discrimination, we develop a Siamese network including two symmetrical branch subnetworks to learn discriminative features from labeled triad training data. The experimental results on the LFW face dataset outperform some state-of-the-art face verification methods.
Q. Zhou · T. Lu (B)
Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology, Wuhan, China
e-mail: lutxyl@gmail.com
Q. Zhou
e-mail: zq315653752@icloud.com
Y. Zhang · Y. Wu
School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan, China
e-mail: zhangyanduo@hotmail.com
Y. Wu
e-mail: ytwu@sina.com
Z. Xiong
Department of Electrical and Computer Engineering, Texas A&M University,
College Station, USA
e-mail: zx@ece.tamu.edu
H. Chen
Simshine-WIT Joint Laboratory of Perceptual Intelligence,
Wuhan Institute of Technology, Wuhan, China
e-mail: hui.chen@simshine.cn
1 Introduction
from facial images. First, we use the context-aware module to extract structure information while filtering out unrelated information. Then, we use the improved Siamese network as the backbone and the cross entropy loss function to guide the training. Our method achieves better performance, with more discriminative features for face verification, than some state-of-the-art methods. To summarize, the contributions of this paper are highlighted as follows:
(1) We firstly propose a context-aware Siamese network for face verification. The proposed context-aware module can adaptively focus on the core region of facial images while removing irrelevant features. The landmark detecting process automatically aligns the facial image into a fixed grid.
(2) We propose to use an improved discriminative Siamese network to learn semantically robust features. Furthermore, we use a metric to measure the discriminative ability of features, which can also be used in other related vision-based tasks.
The proposed CDSN consists of four parts: the input part, the context-aware module, the enhanced Siamese network, and the output part. The architecture of CDSN is shown in Fig. 1.
Fig. 1 Architecture of CDSN. a Input images; b context-aware module, whose role is to find the most informative area of the input face image; c enhanced Siamese neural network, in which the two branches of CDSN have the same structure and share the same weights; d output labels
Fig. 2 Process of context-aware module. From left to right: Input facial image, landmark detector,
the 68 key points of facial image, location of boundary lines from landmarks, the cropped rectangle
according to the four edge points, output of facial image
main organs of the facial image. We assume a minimum rectangle that crops out the face content. The process of the context-aware module is shown in Fig. 2. Here, we define facial image samples {I_i}_{i=1}^{N} ∈ R^{m×n}, where i is the index of the samples, N is the number of all samples, and m and n indicate the height and width of the facial image. The purpose of the context-aware module is to shrink the image I_i into the main part of the facial image X_i ∈ R^{u×v}, where u and v are the height and width of the main content of the facial image. In this paper, we use three steps to detect the content of facial images. First, we define the face landmark detection function D as:
Second, from the above 68 points, we need to find four points that determine the boundary lines. These four boundary points are defined as:
The intersection of the four boundary lines yields a rectangle. We use this rectangle's upper left corner point E and lower right corner point F to represent it; the coordinates of E and F are easily obtained as E(x_a, y_c) and F(x_c, y_d). Finally, considering possibly inaccurate detection of key points, we enlarge the resulting rectangle slightly by adding a constant w, adjusting the rectangle to E(x_a − w, y_c − w) and F(x_c + w, y_d + w). In our implementation, the cropped image size L is determined by L = max(d_v, d_h), where d_v is the vertical distance between the upper and lower landmark points, d_v = y_d − y_c + 2w, and d_h is the horizontal distance between the left and right landmark points, d_h = x_c − x_a + 2w. Once the crop size L is determined, we crop the facial region centered on the landmark point of the nose. For convenience, the cropped images are resized to a fixed size of 72 × 72 pixels.
When the content of the facial image X_i is ready, we define a triplet (x^(i), x^(j), y), where x^(i) ∈ X_i and x^(j) ∈ X_j represent two images from the training samples and y is the label. If images x^(i) and x^(j) are from the same person, then the label y = 1; otherwise y = 0. Different from the shallow Siamese network [5], we propose a deeper Siamese neural network to enhance its discrimination. We use the cross entropy loss as the loss function of the network; the overall loss function is defined as:
loss = Σ_{i=1}^{N} −(y log ŷ + (1 − y) log(1 − ŷ))    (3)
where N represents the number of training sample pairs, y represents the real label, and ŷ represents the predicted label. Since we pair the training samples and assign labels, face verification becomes a classification problem. Specifically, given a pair of face images as input, the feature vectors generated by the Siamese network are fed into the same Sigmoid unit. Here, an output of y = 1 means that the two images are recognized as the same person, while y = 0 means different persons. The Sigmoid unit is defined as:
ŷ = σ( Σ_{k=1}^{K} ω_k · (f(x^(i))_k − f(x^(j))_k)² / (f(x^(i))_k + f(x^(j))_k) + b )    (4)
where (f(x^(i))_k − f(x^(j))_k)² / (f(x^(i))_k + f(x^(j))_k) is called the χ² similarity, ŷ represents the predicted label valued in [0, 1], f(x^(i))_k and f(x^(j))_k represent the kth dimension of the feature vectors extracted from images x^(i) and x^(j), ω_k indicates the initial fixed weight vector of the network, b indicates the initial fixed bias of the network, and σ(t) is defined as:
σ(t) = 1 / (1 + e^(−t))    (5)

ŷ = 1 / (1 + exp(−(Σ_{k=1}^{K} ω_k · (f(x^(i))_k − f(x^(j))_k)² / (f(x^(i))_k + f(x^(j))_k) + b)))    (6)
The value of ŷ lies in [0, 1]. When the value of ŷ is closer to 1, the identities of the two faces are the same; when the value of ŷ is closer to 0, the identities of the persons in the two faces are different.
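To make Eqs. (4)–(6) concrete, the following sketch computes the verification score ŷ from two branch feature vectors. The weights ω and bias b are assumed to be already learned; random values are used here purely as placeholders.

```python
import numpy as np

def verification_score(f1, f2, omega, b, eps=1e-8):
    """Eqs. (4)-(6): sigmoid over the weighted chi-square similarity of the
    two K-dimensional feature vectors produced by the Siamese branches."""
    chi2 = (f1 - f2) ** 2 / (f1 + f2 + eps)   # chi-square similarity per dimension
    t = np.dot(omega, chi2) + b
    return 1.0 / (1.0 + np.exp(-t))           # sigma(t), Eq. (5)

K = 128
rng = np.random.default_rng(0)
f_a, f_b = rng.random(K), rng.random(K)        # placeholder branch features
omega, b = rng.normal(size=K), 0.0             # placeholder learned parameters
y_hat = verification_score(f_a, f_b, omega, b)
print(round(float(y_hat), 3))                  # closer to 1 => same identity
```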
3 Experiments
In order to verify the effectiveness of the context-aware module, we test the CDSN with different settings of the control parameter w. Firstly, we set w = 0 to visualize the distribution of facial images. 80 images (10 persons with 8 different images each) from the LFW database are randomly selected to form the testing dataset. Images with the original size and with the context-aware size are compared to show the role of the context-aware module. We train two LCNN [24] models with the above two settings and extract features of the testing samples with the trained models. As shown in Fig. 3, the testing samples are mapped to 2-D vectors using T-SNE [17].
Obviously, the context-aware module focuses on the main organs of facial images; without the background, it plays an important role in clustering the samples into a more discriminative distribution.
On the other hand, we adjust the control parameter w to test the performance of the context-aware module. As we know, the value of w controls the size of the key area of the facial image. Since point E(x_a − w, y_c − w) and point F(x_b + w, y_d + w) control the size of the cropped rectangle, it is specified that when w = +∞, the original image is selected. When w > 0, the rectangle expands; when w < 0, the rectangle shrinks. We use the cropped images to train and test CDSN. From Fig. 4, when w = 3, the face information contained in the cropped rectangle is the most relevant for CDSN, and it achieves its best performance. Moreover, too large a w introduces more unrelated information, while too small a w may discard some important information.
Fig. 3 Visualization of facial features in a 2-D vector space. Here, the different colors represent different persons. Samples processed by the context-aware module have a compact distribution in the feature space; the context-aware module squeezes out discriminative information by focusing on the main organs of the facial image
Fig. 4 The value of w controls the performance of CDSN. Obviously, a suitable value captures the most accurate information for the verification task
In this part, we test the architecture of CDSN. With a stable context-aware module, we cascade the enhanced Siamese neural network to solve the face verification task. Compared with a classic 5-layer convolutional network, CDSN uses more layers for extracting discriminative features. As shown in Fig. 5, the results indicate that the deeper network architecture outperforms the shallow one. The main reason is that increasing the number of network layers increases the nonlinearity of the network. This gives the network better learning ability and further enhances the discriminative ability of the features. Inevitably, when the network becomes deeper and deeper, the gradients become unstable and tend to vanish, which results in worse performance.
In order to test the verification accuracy, we evaluate the above algorithms on a real-world database. Here, we randomly select 300 pairs of positive samples (the same person, denoted as (x_i, y_i, 1), i = 1, 2, …, n) from our real-world face database, which contains 3300 samples, and another 300 pairs of negative samples, written as (x_j, y_j, 0), j = 1, 2, …, n. As shown in Fig. 6, two different metric models (Euclidean distance and cosine distance) are used to show the accuracy of the different methods.
No matter which metric is selected, CDSN yields the best performance on the real-world database.
In order to show the discrimination of features, in this paper we define a metric to measure the discriminative ability of different algorithms. We directly use a score between 0 and 1 to express the similarity of two samples in the semantic space. That is, for the same person the score tends to 1; otherwise the score tends to 0. However, the cosine similarity lies in [−1, 1], so we normalize the similarity scores into [0, 1] by the following equation:
T_i^(l) = 1/2 + cos(f(x_i)_k, f(y_i)_k)/2    (7)
where T_i^(l) represents the normalized result in [0, 1], l represents the label, which takes only the value 0 or 1, f(x_i)_k and f(y_i)_k represent the feature vectors of images x_i and y_i extracted by the network, and cos(f(x_i)_k, f(y_i)_k) represents the similarity of the extracted features, whose value range is [−1, 1]. In order to describe more clearly the ability of each model to distinguish whether two facial images belong to the same person, we introduce a metric named "score" as:
score = (1/n) (Σ_{i=1}^{n} T_i^(1) − Σ_{i=1}^{n} T_i^(0)), i = 1, 2, …, n    (8)
Obviously, the larger the score, the stronger the discriminative ability of the feature in distinguishing whether two face images belong to the same person. The results are displayed in Fig. 7.
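A short numpy sketch of Eqs. (7) and (8) is given below; the positive and negative pair features are generated randomly here purely as placeholders for the network outputs.

```python
import numpy as np

def normalized_cos(f_x, f_y):
    """Eq. (7): map cosine similarity from [-1, 1] into [0, 1]."""
    cos = np.dot(f_x, f_y) / (np.linalg.norm(f_x) * np.linalg.norm(f_y))
    return 0.5 + cos / 2.0

def discrimination_score(pos_pairs, neg_pairs):
    """Eq. (8): mean gap between positive-pair and negative-pair scores."""
    t_pos = [normalized_cos(x, y) for x, y in pos_pairs]
    t_neg = [normalized_cos(x, y) for x, y in neg_pairs]
    return (sum(t_pos) - sum(t_neg)) / len(pos_pairs)

rng = np.random.default_rng(1)
pos = [(f, f + 0.05 * rng.normal(size=64)) for f in rng.random((300, 64))]
neg = [(rng.random(64), rng.random(64)) for _ in range(300)]
print(round(discrimination_score(pos, neg), 3))
```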
Fig. 7 The discrimination metric of different algorithms on 600 pairs of real-world facial images; different colors represent different models, and CDSN clearly has the best discriminative score
From these findings, we confirm the role of the proposed context-aware Siamese
neural network.
4 Conclusion
Acknowledgments This work is supported by the National Natural Science Foundation of China
(61502354, 61771353), Central Support Local Projects of China (2018ZYYD059), the Natural
Science Foundation of Hubei Province of China (2014CFA130, 2015CFB451), Scientific Research
Foundation of Wuhan Institute of Technology (K201713), The 10th Graduate Education Inno-
vation Fund of Wuhan Institute of Technology (CX2018213), 22020 Hubei Province High-value
Intellectual Property Cultivation Project, Wuhan City Enterprise Technology Innovation Project
(202001602011971), the National Natural Science Foundation of China (62072350).
References
1. Ahonen T, Hadid A, Pietikainen M (2006) Face description with local binary patterns: appli-
cation to face recognition. IEEE Trans Pattern Anal Mach Intell 12:2037–2041
2. Baker S, Kanade T (2000) Hallucinating faces. In: 2000 the fourth international conference on
automatic face and gesture recognition (FG’2000), pp 83–88
3. Chen D, Cao X, Wen F, Sun J (2013) Blessing of dimensionality: high-dimensional feature
and its efficient compression for face verification. In: Proceedings of the IEEE conference on
computer vision and pattern recognition, pp 3025–3032
4. Chen S, Liu Y, Gao X, Han Z (2018) Mobilefacenets: efficient CNNS for accurate real-time
face verification on mobile devices. In: Chinese conference on biometric recognition. Springer,
Berlin, pp 428–438
5. Chopra S, Hadsell R, LeCun Y (2005) Learning a similarity metric discriminatively, with
application to face verification. In: Null. IEEE, pp 539–546
6. Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection
7. Deng J, Guo J, Xue N, Zafeiriou S (2018) ArcFace: additive angular margin loss for deep face recognition. arXiv:1801.07698
8. Guo Y, Lei Z, Hu Y, He X, Gao J (2016) MS-Celeb-1M: a dataset and benchmark for large-scale face recognition
9. Hu J, Lu J, Tan YP (2014) Discriminative deep metric learning for face verification in the
wild. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp
1875–1882
10. Huang GB, Mattar M, Berg T, Learned-Miller E (2008) Labeled faces in the wild: a database
forstudying face recognition in unconstrained environments. In: Workshop on faces in’Real-
Life’Images: detection, alignment, and recognition
11. Koch G, Zemel R, Salakhutdinov R (2015) Siamese neural networks for one-shot image recog-
nition. In: ICML deep learning workshop, vol 2
12. Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional
neural networks. In: Advances in neural information processing systems, pp 1097–1105
13. Lee K, Chung Y, Byun H (2002) Svm-based face verification with feature set of small size.
Electron Lett 38(15):787–789
14. Lu H, Li Y, Chen M, Kim H, Serikawa S (2018) Brain intelligence: go beyond artificial intel-
ligence. Mob Netw Appl 23(2):368–375
15. Lu H, Li Y, Uemura T, Kim H, Serikawa S (2018) Low illumination underwater light field
images reconstruction using deep convolutional neural networks. Futur Gener Comput Syst
82:142–148
16. Lu J, Plataniotis KN, Venetsanopoulos AN, Li SZ (2006) Ensemble-based discriminant learning
with boosting for face recognition. IEEE Trans Neural Netw 17(1):166–178
17. Maaten Lvd, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9(Nov):2579–
2605
18. Nair V, Hinton GE (2010) Rectified linear units improve restricted boltzmann machines. In:
Proceedings of the 27th international conference on machine learning (ICML-10), pp 807–814
19. Schroff F, Kalenichenko, D, Philbin, J (2015) Facenet: a unified embedding for face recogni-
tion and clustering. In: Proceedings of the IEEE conference on computer vision and pattern
recognition, pp 815–823
20. Serikawa S, Lu H (2014) Underwater image dehazing using joint trilateral filter. Comput Electr
Eng 40(1):41–50
21. Taigman Y, Yang M, Ranzato M, Wolf L (2014) Deepface: closing the gap to human-level
performance in face verification. In: Proceedings of the IEEE conference on computer vision
and pattern recognition, pp 1701–1708
22. Wang M, Deng W (2018) Deep face recognition: a survey. arXiv preprint arXiv:1804.06655
23. Wright J, Yang AY, Ganesh A, Sastry SS, Ma Y (2009) Robust face recognition via sparse
representation. IEEE Trans Pattern Anal Mach Intell 31(2):210–227
24. Wu X, He R, Sun Z, Tan T (2018) A light cnn for deep face representation with noisy labels.
IEEE Trans Inf Forensics Secur 13(11):2884–2896
25. Zhang K, Zhang Z, Li Z, Yu Q (2016) Joint face detection and alignment using multitask
cascaded convolutional networks. IEEE Signal Process Lett 23(10):1499–1503
Object-Level Matching for Multi-source
Image Using Improved Dictionary
Learning Algorithm
1 Introduction
With the increase of multi-source sensors and multi-source data, we have entered an era of remote sensing big data, and new methods of processing and understanding the information behind the big data are urgently needed. Object-level matching in
multi-source images, one of the fundamental techniques for mining the big data, is
the basis of fusing and utilizing multi-source information. In practical applications, indirect matching is mainly used: firstly, the object feature information is used to
judge the object’s category, and then the matching is conducted according to whether
the judged object’s category is consistent. However, the main disadvantages of this
method are as follows: the accuracy of matching is overwhelmingly dependent on
the completeness of the identification databases and the accuracy of the algorithm;
the object feature is first extracted for judging the category of object and then the
matching result is acquired by the category information, which increases the loss of
information. Besides, the most studied multi-source image fusion is to achieve image
fusion in the pixel-level to obtain more detailed information, which is convenient for
further analysis, processing and understanding of images, but it has high require-
ments for temporal aligning and spatial aligning of input images, and has a large
amount of calculations and poor real-time performance. Considering the diversity
of multi-sensor platform and the heterogeneity of images, we intend to analyze the
feature information of multi-source images by extracting and comparing the feature
information to determine whether the images originate from the same object and
construct a mapping relationship between objects in multi-source images.
However, there is little corresponding research, owing to the high dimensionality of the feature information and the large differences between the feature information of heterogeneous sources. To overcome this limitation, our motivations are as follows and are shown in Fig. 1: feature extraction and similarity measurement are usually adopted for object matching. The objects in images can be mapped into feature representations in a common feature space, and similarity can be measured by calculating the distance between the feature representations. For multi-source images, however, the existing data drift problem prevents the implementation of the similarity measure. So, how to build a unified representation of the multi-source images that eliminates the data drift problem is crucial.
Fig. 1 Motivation of the proposed method: features extracted from multi-source images (e.g., a multi-spectral image as source 1) are mapped into a common feature space of dimension n + m, where the data drift between the sources must be eliminated to obtain a unified representation
In this paper, we skip object recognition and directly use the feature representations of multi-source images to perform matching. As discussed above, we introduce the dictionary learning method into object-level matching and use it to unify the multi-source images in the same description space. The problem of object-level matching for multi-source images can be divided into two stages: the representation learning phase and the similarity measure phase. In the representation learning stage, a unified dictionary of multi-source images is generated by an improved dictionary learning algorithm, which further utilizes the label information in the objective function while considering the representation ability of the dictionary, so that the sparse coefficients of matching objects are as close as possible and the sparse coefficients of non-matching objects are as different as possible. Furthermore, a distance metric can be learned based on the sparse coefficients generated by the unified dictionary. Specifically, a matching discriminative neural network is constructed to learn the relation between the sparse coefficients and the label information in a supervised way. Finally, we use this distance metric to judge whether the object information acquired by the two sources comes from the same object, and thus realize multi-source image object matching. The framework of our improved dictionary learning algorithm is shown in Fig. 2.
Fig. 2 Framework of the improved dictionary learning algorithm: in the representation learning stage, samples are sparsely coded over the generated dictionary set {d1, …, dk} to obtain coefficients X1, X2, X3; in the similarity measure stage, the pairwise differences |X1 − X2|, |X1 − X3|, |X2 − X3| are fed to a discriminative model that outputs 1 for matched objects and 0 otherwise
2 Related Work
3 Algorithm Structure
The algorithm structure of this paper includes two stages: representation learning and discriminative matching. As shown in Fig. 2, the input image data Y1, Y2, Y3 are characterized as sparse coefficients X1, X2, X3 in the same space through a unified dictionary, and the "distance" between any two coefficients is measured by a discriminative model that outputs a number between 0 and 1. If the final output is 1, the two objects are matched; if it is 0, they are not matched.
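To make the discriminative matching stage concrete, the following minimal Python sketch (our own illustration, not the authors' code) trains a small classifier on the element-wise absolute difference of two sparse coefficient vectors; the coefficient arrays and labels are random placeholders standing in for the outputs of the representation learning stage.

import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical inputs: sparse coefficients of object pairs over the unified
# dictionary (one coefficient vector per row) and 0/1 match labels.
rng = np.random.default_rng(0)
X_a = rng.random((500, 64))       # coefficients of objects from source 1
X_b = rng.random((500, 64))       # coefficients of objects from source 2
labels = rng.integers(0, 2, 500)  # 1 = same object, 0 = different objects

# The "distance" features are the element-wise absolute differences |X1 - X2|.
pair_features = np.abs(X_a - X_b)

# A small neural network maps the difference vector to a match probability.
matcher = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
matcher.fit(pair_features, labels)

# Probability close to 1 means the two coefficients describe the same object.
print(matcher.predict_proba(np.abs(X_a[:3] - X_b[:3]))[:, 1])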
Dictionary learning should achieve the following learning objectives: a complete dictionary is obtained, the representation coefficients are as sparse as possible, and the reconstruction error is controlled within a certain range. The mathematical expression is shown in formula (1):
\langle D, X \rangle = \arg\min_{D,X} \sum_{i=1}^{N} \left( \|y_i - D x_i\|_2^2 + \alpha \|x_i\|_1 \right) \qquad (1)
Among them, the first term requires the linear combination of the dictionary and the sparse coefficients to restore the sample set as closely as possible, thereby controlling the reconstruction error; the second term uses the L1-norm to control the sparseness of the sparse coefficients; and the hyper-parameter α balances the weight of the reconstruction error against the degree of sparsity.
It can be seen from formula (1) that dictionary learning generates the dictionary D and the sparse coefficients X through unsupervised learning. The generation of the dictionary is the basis of the algorithm, and the quality of the dictionary determines the performance of the dictionary learning algorithm. The optimization of formula (1) is a non-convex problem in D and X jointly, but fixing one parameter and optimizing the other is a convex optimization problem. Therefore, dictionary learning is generally solved in two alternating steps: first, the dictionary is fixed and the sparse coefficients are solved; then the sparse coefficients are fixed and the dictionary is updated; the iteration is repeated until convergence. The difference between dictionary learning algorithms lies in the method of solving the sparse coefficients and the method of updating the dictionary.
Solving the sparse coefficients with a fixed dictionary can be proved to be an NP-hard problem [8], which can be solved approximately by Orthogonal Matching Pursuit (OMP) [9] and Basis Pursuit (BP) [10]. For updating the dictionary with fixed sparse coefficients, the commonly used methods are the Method of Optimal Directions (MOD) [11] and the FOCUSS method [12].
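As an illustration of this two-step alternation (not the implementation used in this paper), the sketch below pairs OMP sparse coding with an MOD-style least-squares dictionary update; the data matrix Y and all parameter values are placeholders.

import numpy as np
from sklearn.linear_model import orthogonal_mp

def dictionary_learning(Y, n_atoms=64, n_nonzero=5, n_iter=20, seed=0):
    """Alternate OMP sparse coding and an MOD-style dictionary update.

    Y: signal matrix of shape (d, N), one column per training sample.
    """
    rng = np.random.default_rng(seed)
    d, N = Y.shape
    D = rng.standard_normal((d, n_atoms))
    D /= np.linalg.norm(D, axis=0)               # unit-norm atoms
    for _ in range(n_iter):
        # Step 1: fix D, solve the sparse coefficients column by column (OMP).
        X = orthogonal_mp(D, Y, n_nonzero_coefs=n_nonzero)
        # Step 2: fix X, update D by least squares (Method of Optimal Directions).
        D = Y @ np.linalg.pinv(X)
        D /= np.linalg.norm(D, axis=0) + 1e-12    # re-normalize the atoms
    return D, X

# Toy usage with random data standing in for image features.
Y = np.random.default_rng(1).standard_normal((100, 400))
D, X = dictionary_learning(Y)
print(np.linalg.norm(Y - D @ X) / np.linalg.norm(Y))  # relative reconstruction error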
When the dictionary elements are updated, the K-SVD algorithm updates each dictionary atom together with the corresponding sparse coefficients, which helps to avoid falling into a local optimum:
\|Y - DX\|_F^2 = \left\| Y - \sum_{j=1}^{K} d_j x_T^j \right\|_F^2 = \left\| \left( Y - \sum_{j \neq k} d_j x_T^j \right) - d_k x_T^k \right\|_F^2 = \left\| E_k - d_k x_T^k \right\|_F^2 \qquad (2)
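For reference, one K-SVD-style update of atom d_k and its coefficient row, following formula (2), can be sketched as below (our illustrative rendering of the standard K-SVD step, not the authors' code).

import numpy as np

def ksvd_atom_update(Y, D, X, k):
    """Update dictionary atom k and its coefficient row via a rank-1 SVD of E_k."""
    using_k = np.flatnonzero(X[k, :])          # samples whose code uses atom k
    if using_k.size == 0:
        return D, X
    # Error without the contribution of atom k, restricted to those samples.
    E_k = Y[:, using_k] - D @ X[:, using_k] + np.outer(D[:, k], X[k, using_k])
    U, s, Vt = np.linalg.svd(E_k, full_matrices=False)
    D[:, k] = U[:, 0]                          # new unit-norm atom
    X[k, using_k] = s[0] * Vt[0, :]            # new coefficient row entries
    return D, X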
In order to give the sparse coefficients a certain discriminative ability, the sparse coefficients of matching objects should be as similar as possible, while the sparse coefficients of non-matching objects should be as different as possible. Based on the objective function of dictionary learning, we add label information to obtain the objective function shown in formula (3):
\langle D, X \rangle = \arg\min_{D,X} \sum_{i=1}^{N} \left( \|y_i - D x_i\|_2^2 + \alpha \|x_i\|_1 \right) + \frac{\beta}{2} \sum_{i,j}^{N} \|x_i - x_j\|_2^2 M_{ij}
= \arg\min_{D,X} \sum_{i=1}^{N} \left( \|y_i - D x_i\|_2^2 + \alpha \|x_i\|_1 \right) + \beta \left( \mathrm{Tr}(X^T X D) - \mathrm{Tr}(X^T X M) \right)
= \arg\min_{D,X} \sum_{i=1}^{N} \left( \|y_i - D x_i\|_2^2 + \alpha \|x_i\|_1 \right) + \beta \, \mathrm{Tr}(X^T X L) \qquad (3)
where α and β are hyper-parameters used to control the weight of the error represented by each term, and the third term is a restriction term that gives the sparse coefficients classification ability. As shown in formula (4), when the object pair (y_i, y_j) is labelled as a matching object, M_{ij} = 1; when the object pair (y_i, y_j) is labelled as a non-matching object, M_{ij} = -1; in other cases, such as the same object, M_{ij} = 0. D = diag{d_1, ..., d_N} (here denoting the degree matrix rather than the dictionary) is a diagonal matrix whose diagonal elements are the column sums of the matrix M, d_i = \sum_{j=1}^{N} M_{ij}, and L = D - M.
M_{ij} = \begin{cases} +1, & \text{if } (y_i, y_j) \in S \\ -1, & \text{if } (y_i, y_j) \in D \\ 0, & \text{otherwise} \end{cases} \qquad (4)
Like formula (1), the objective function in formula (3) is a non-convex optimization problem in the parameters D and X. It is necessary to fix one parameter to solve for the other, and thus the optimization of the objective function can be converted into iterative optimization of the following two formulas:
L(x_i) = \arg\min_{x_i} \|y_i - D x_i\|_2^2 + \alpha \|x_i\|_1 + \beta \left( 2 x_i^T X L_i - x_i^T x_i L_{ii} \right) \qquad (5)

L(D) = \arg\min_{D} \sum_{i=1}^{N} \|y_i - D x_i\|_2^2 = \arg\min_{D} \|Y - D X\|_F^2 \qquad (6)
Among them, formula (5) optimizes the sparse coefficients X for a fixed dictionary D, and formula (6) optimizes the dictionary for fixed sparse coefficients. According to formula (2), Eq. (6) can be optimized with K-SVD, by updating the elements of the dictionary and the corresponding sparse coefficients one by one. For formula (5), the objective function is not continuously differentiable due to the L1-norm, but a method for solving this problem is given in [13]. The "feature-sign search algorithm" proposed there transforms the objective function into a standard unconstrained quadratic programming (QP) problem by iteratively generating the sign vector θ of the corresponding sparse coefficient. The gradient of formula (5) is therefore calculated as shown in formula (7):
\frac{\partial L(x_i)}{\partial x_i} = 2 D^T (D x_i - y_i) + 2 \beta X L_i + \alpha \theta \qquad (7)
Setting the result of formula (7) to zero, the sparse coefficient can be obtained as shown in formula (8):
x_i^* = \left( D^T D + \beta L_{ii} I \right)^{-1} \left( D^T y_i - \frac{1}{2} \alpha \theta \right) \qquad (8)
The experimental part of this paper uses public remote sensing image datasets, including two types of image data: multi-spectral images and panchromatic images. The two types of images are preprocessed, and the same regions or the same objects are respectively cropped to form multi-source image matching datasets. A partial example of the multi-source image matching datasets is shown in Fig. 3.
We made the multi-source image matching datasets based on the remote sensing image datasets, in which the panchromatic images and the multi-spectral images each contain 1824 images; 1600 of them are selected as the training set for generating the common dictionary and the matching discriminative model, and the remaining 224 images are used as the test set.
Fig. 3 Multi-source remote sensing image dataset (the first line is a multi-spectral image and the second line is a panchromatic image)
Figures 4 and 5 show the reconstruction of panchromatic and multi-spectral images in the training set and test set. It can be seen from Fig. 4 that the reconstruction of the training set by the dictionary is very close to the original images and is difficult to distinguish with the naked eye. In Fig. 5, the reconstruction results on the test set show a certain gap with the original images: the contour lines are basically restored, but the details are generally lost and blurred. From these images, plain dictionary learning and the proposed method perform similarly in reconstructing the training set and the test set, indicating that the two dictionaries are similar in terms of representation ability. This is consistent with the starting point of this paper: maintaining the representation ability of the dictionary while adding a certain discriminative ability.
Fig. 4 Remote sensing image dataset training set reconstructed image comparison chart. a Compar-
ison of multi-spectral image reconstruction (from top to bottom: original image, dictionary learning
reconstructed image and reconstructed image of our method). b Contrast map of panchromatic
image reconstruction (from top to bottom: original image, dictionary learning reconstructed image
and reconstructed image of our method)
Fig. 5 Test set of remote sensing image dataset reconstructed image comparison chart. a Compar-
ison of multi-spectral image reconstruction (from top to bottom: original image, dictionary learning
reconstructed image and reconstructed image of our method). b Contrast map of panchromatic
image reconstruction (from top to bottom: original image, dictionary learning reconstructed image
and reconstructed image of our method)
Table 1 shows the accuracy of labelled matching pairs obtained by plain dictionary learning and by the proposed approach on the multi-source image dataset. It can be seen from the results that the matching strategy of this paper achieves a high matching accuracy on the datasets, and the accuracy of the proposed approach on the test set is improved by more than 8% compared with plain dictionary learning.
While we have thus far considered multi-source image object matching as a binary
classification problem, our end goal is to use it for location. This application can be
Table 1 Accuracy of labelled matching pairs in multi-source image dataset (%) (p = 224)
Method                 Accuracy (%)
Dictionary learning    84.17
Proposed approach      92.86
Fig. 6 Multi-source remote sensing image dictionary reconstruction DSTL dataset comparison
chart. a Comparison of multi-spectral image reconstruction (from top to bottom: original image,
reconstructed image). b Comparison of visible light image reconstruction (from top to bottom:
original image, reconstructed image)
5 Summary
In this paper, we propose a new method for multi-source image object matching, which unifies the representation of multi-source images by improving the dictionary learning algorithm. We use a neural network to learn the distance metric between objects, and then apply this metric to two self-made datasets; the validity of the algorithm and its high matching accuracy are verified. Finally, the method is shown to have good applicability in zero-shot learning.
Although the proposed object matching approach achieves better performance, there are still some shortcomings that we cannot neglect. When facing more different types of images, the training time of the model becomes long and the accuracy of object matching is affected. For zero-shot learning, it is hard to distinguish the details of the object. Therefore, how to overcome these limitations is one of our future focuses.
References
1 Introduction
X. Liu · H. Wang
Qingdao University of Science and Technology, No. 99 Songling Road, Qingdao, China
e-mail: nina.xf.liu@hotmail.com
M. Fu (B) · B. Zheng
Ocean University of China, No. 238 Songling Road, Qingdao, China
e-mail: fumin@ouc.edu.com
Remote sensing earth observation technology has been widely used in many different fields, such as mining, astronomy, chemical imaging, agriculture, environmental science, wildland fire tracking, and biological threat detection [2].
The main purpose of HSI classification is to classify the pixels in the image so as to realize the automatic classification of objects [3]. However, due to the limitations of the weather and terrain environments in which hyperspectral imaging instruments work, the information contained in an HSI can be polluted by noise or completely submerged. Especially in the shadow areas of an HSI, insufficient illumination and the refraction and scattering of light reduce the effectiveness of detection, recognition and classification of objects, and the shadow in HSIs is one of the main difficulties in information mining. Dynamic Stochastic Resonance (DSR) has been shown to be a good enhancement for shaded areas in gray and color images. Based on this, a DSR technique is proposed to enhance the shaded areas of HSIs in order to facilitate the detection and recognition of objects in these areas.
Stochastic Resonance (SR) theory was put forward by Benzi et al. in 1981 [4] to explain the alternation of glacial and warm climatic periods in paleometeorology. SR enhances a signal through the interaction of a non-linear system, a weak driving signal and noise. Although SR theory has a relatively short history, scholars have confirmed the existence of the SR phenomenon in meteorological, biological and circuit systems [5]. As a typical theory for effectively utilizing noise energy, SR has been applied to cochlear implant design, detector enhancement [6], image enhancement [7] and signal processing [8]. It has even been applied to unexpected areas, such as mechanical fault detection [9] and the human vestibular system [10].
The remainder of this paper is organized as follows: Sect. 2 overviews the main
theory of DSR; the proposed method is introduced in Sect. 3 in detail; Sect. 4 presents
the experimental results and discussion; the conclusion is contained in Sect. 5.
2 Overview of DSR
In signal processing and related research, interference noise usually troubles researchers: it distorts the signal and greatly reduces its validity and correctness. However, recent studies on stochastic resonance show that the noise existing in a weak signal can actually be used to amplify that signal. In other words, the noise in a weak signal can play an important role in enhancing the signal and improving the signal-to-noise ratio.
The classical stochastic resonance theory can be described by the motion of an overdamped particle between bistable potential wells [11]. Suppose that a particle with mass m and friction coefficient γ moves in a bistable potential well determined by U(x) under the over-damped condition, and that the particle is affected by noise ξ(t) and a periodic driving signal f(t); then its Langevin equation of motion is

m \frac{d^2 x(t)}{dt^2} + \gamma \frac{dx(t)}{dt} = -\frac{dU(x)}{dx} + f(t) + \xi(t) \qquad (1)

which, after neglecting the inertial term under the over-damped condition, reduces to

\frac{dx(t)}{dt} = -\frac{dU(x)}{dx} + f(t) + \xi(t) \qquad (2)
Fig. 1 The bistable potential well function U(x)
where U(x) is a bistable fourth-order potential well function with mirror symmetry, as shown in Fig. 1:

U(x) = -\frac{a}{2} x^2 + \frac{b}{4} x^4 \qquad (3)

with a and b being the parameters of the bistable system. The potential well has two stable states located at x_{\pm} = \pm\sqrt{a/b} and one unstable state at x_s = 0. Between the stable states and the unstable state there is a potential barrier with a height of \Delta U = a^2/(4b).
Based on the theory above, the Langevin equation describes the state of a particle moving back and forth between the two potential wells under the combined action of the periodic driving signal and the noise. When there is no periodic driving signal, the particle oscillates around a stable state with small amplitude, and the statistical variance of the oscillation amplitude is proportional to the noise intensity. Under the action of the periodic driving signal, the symmetrical bistable potential well sways asymmetrically with the amplitude of the periodic signal, i.e. the potential wells on the two sides become uneven, and the particle begins to oscillate more strongly; however, because the driving signal amplitude is relatively small, the particle cannot cross the barrier, and the periodic signal alone is not enough to make the particle jump periodically between the two potential wells. Adding noise of appropriate intensity can help the particle jump from one potential well to the other following the frequency of the periodic signal. The originally chaotic noise thus acts on the bistable system and produces statistical regularities in the motion of the output particle.
In order to apply stochastic resonance theory to HSI processing, based on the above model, we substitute Eq. (3) into Eq. (2) and obtain:

\frac{dx(t)}{dt} = [a x - b x^3] + f(t) + \xi(t) \qquad (4)
Based on the understanding of stochastic resonance, f(t) is the input signal and ξ(t) is the noise of the signal. When processing an HSI, we can regard the shadow in the HSI as a kind of noise; because of its influence, the image under the shadow becomes very weak. Therefore, the three conditions of DSR (a. a bistable system, b. a weak driving signal, c. external noise) are satisfied, and the HSI can be processed by DSR.
We use the shadow area in the HSI as the input of the DSR system, so that:

\frac{dx(t)}{dt} = [a x - b x^3] + Input \qquad (6)
In order to make the system suitable for image processing, we rewrite the above differential equation in difference form:

x(n + 1) = x(n) + \Delta t \, [a x(n) - b x^3(n) + Input] \qquad (7)

where n is the iteration number of dynamic stochastic resonance and \Delta t is the iteration step. The initial value is x(0) = 0.
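A minimal numpy sketch of this iteration, under the difference form assumed above (parameter defaults follow the values used later in Sect. 4), is:

import numpy as np

def dsr_enhance(signal, a=2.0, b=1.0, dt=0.01, n_iter=24):
    """Iterate the bistable DSR system with the weak (shadow) signal as input."""
    x = np.zeros_like(signal, dtype=float)     # initial state x(0) = 0
    for _ in range(n_iter):
        x = x + dt * (a * x - b * x ** 3 + signal)
    return x

# Example: enhance one 1D spectral vector of a shadow pixel (toy data).
spectrum = np.random.default_rng(0).random(150) * 0.05   # weak shadow spectrum
enhanced = dsr_enhance(spectrum)
print(enhanced.max())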
As mentioned above in Sect. 2, DSR can enhance the signal intensity and signal-to-
noise ratio. In HSIs, the shadow area could be taken as a weak signal contaminated
by noise. Therefore, DSR could help to enhance the shadow area in HSIs.
In this paper, we propose to enhance the shadow area in HSIs by DSR in the
spatial and the spectral dimensions respectively, and then the fused image could be
classified by the support vector machine (SVM). The flow chart of the proposed
method is shown in Fig. 2.
Firstly, since only the shadow areas in HSIs need to be enhanced, we use a shadow extraction mask to extract the shadow areas of the original hyperspectral data and obtain the shadow data in three-dimensional (3D) tensor form.
Secondly, DSR is performed in the spatial and the spectral dimensions respectively
to enhance the shadow areas.
• DSR on the spatial dimension: the extracted shadow data are divided into several 2D images according to the bands. In order to compute Eq. (7) more conveniently, each 2D shadow-area image is transformed into a 1D row vector; after the DSR iteration of Eq. (7), the enhanced 1D vector is reshaped back into a 2D matrix. The processed shadow data in the form of a 3D tensor are then obtained by sequential arrangement.
• DSR on the spectral dimension: in the shadow data, the 1D spectral information of each pixel is extracted and enhanced by Eq. (7). After the spectral data of all the pixels in the shadow areas are processed, the shadow data in the form of a 3D tensor are obtained.
Thirdly, the enhanced shadow areas should be fused with the non-shadow areas
in HSIs to obtain the enhanced image.
Finally, SVM could be used to classify the enhanced HSI.
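The overall flow can be sketched roughly as follows (our own simplified illustration: the cube hsi, the boolean shadow_mask, the per-pixel labels and the dsr_enhance helper from Sect. 2 are assumed inputs, and only the spectral-dimension enhancement is shown; the spatial-dimension enhancement would be applied band by band in the same way):

import numpy as np
from sklearn.svm import SVC

def enhance_and_classify(hsi, shadow_mask, labels, dsr_enhance):
    """Enhance shadow pixels of a (rows, cols, bands) cube and classify per pixel."""
    enhanced = hsi.astype(float).copy()
    rows, cols, bands = hsi.shape
    # Spectral-dimension DSR: enhance the 1D spectrum of every shadow pixel.
    for r, c in zip(*np.nonzero(shadow_mask)):
        enhanced[r, c, :] = dsr_enhance(hsi[r, c, :])
    # Fusion: non-shadow pixels keep their original values (already copied).
    X = enhanced.reshape(-1, bands)
    y = labels.reshape(-1)
    clf = SVC(kernel='rbf').fit(X, y)     # pixel-wise SVM classification
    return clf.score(X, y)                # training accuracy, for illustration only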
4 Experiment
It can be seen from Eq. (7) that DSR mainly involves four parameters a, b, Δt and n. Among them, a and b are the parameters of the bistable system; they mainly affect the positions of the potential wells and the height of the potential barrier. Δt is the iteration step of the DSR processing and n is the iteration number; they mainly affect the gain of the DSR processing. Because there is no mature theoretical support for the selection of these parameters in current research on dynamic stochastic resonance, in this paper we try to find their optimal values for DSR processing from the perspective of spectral-dimension processing.
To determine the optimal values of a and b, we first fixed Δt to 0.01 and n to 24, and selected the spectral data of road pixels in the shadow of the HYDICE HSI for DSR processing. Different values of a and b were then used to obtain different enhanced spectral data, and the Pearson correlation coefficients between these data and the standard road spectrum in the HYDICE HSI were calculated. The Pearson correlation coefficient is defined in Eq. (8):
r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \qquad (8)
Fig. 4 The influence of the values of a and b on Pearson correlation coefficient. a 3D view. b 2D
view
where X and Y are two 1D vectors of length n, and \bar{X} and \bar{Y} represent the averages of X and Y. Equation (8) shows that the closer the Pearson correlation coefficient is to 1, the higher the correlation between the two data vectors; conversely, the closer the Pearson correlation coefficient is to 0, the weaker the correlation. The influence of the values of a and b on the Pearson correlation coefficient is shown in Fig. 4.
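The parameter selection described above amounts to a grid search that maximizes the Pearson correlation with a reference spectrum; a minimal sketch (our illustration, with placeholder grids and the dsr_enhance helper assumed) is:

import numpy as np

def pearson_r(x, y):
    x, y = x - x.mean(), y - y.mean()
    return (x @ y) / (np.sqrt((x @ x) * (y @ y)) + 1e-12)

def search_ab(shadow_spectrum, reference_spectrum, dsr_enhance,
              a_grid=np.linspace(0.5, 5, 10), b_grid=np.linspace(0.5, 5, 10)):
    best = (None, None, -1.0)
    for a in a_grid:
        for b in b_grid:
            enhanced = dsr_enhance(shadow_spectrum, a=a, b=b, dt=0.01, n_iter=24)
            r = pearson_r(enhanced, reference_spectrum)
            if r > best[2]:
                best = (a, b, r)
    return best   # (a, b, highest Pearson correlation coefficient)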
According to Sect. 3, the shadows of the HYDICE HSI can be enhanced in the spatial
and the spectral dimensions respectively and some of the results are illustrated in
Fig. 5 and Fig. 6.
It can be seen from Figs. 5 and 6 that DSR can enhance both the spatial and the spectral data in the shadow area. The spatial-dimension DSR processing can enhance the contrast of the shadow area and make some details under the shadow more prominent. The spectral-dimension DSR processing can greatly strengthen the originally weak spectral data and make the spectral characteristics of the pixels under the shadow clearer. From these two aspects, DSR can significantly enhance the shadow area of an HSI.
In recent years, machine learning and artificial intelligence [12] have developed continuously and have been applied to image processing [13, 14], recognition [15], anomaly detection [16] and other fields. SVM is a classical method based on supervised learning. In the experiment, we used SVM to classify the enhanced HSI. In order to compare and verify the enhancement effect of DSR, we chose two region of interest (ROI) selection methods for the SVM classification.
Fig. 5 1st band of HYDICE HSI before and after DSR enhancement. a Before DSR enhancement.
b After DSR enhancement
Fig. 6 Spectral data of a pixel in the shadow area (strength versus bands)
One contains eight classes of samples (ROI1): road, road under shadow, grass, tree, shadow, target 1, target 2, target 3; the other contains seven classes of samples (ROI2): road and road under shadow (merged), grass, tree, shadow, target 1, target 2 and target 3. The two kinds of ROIs were used to classify the original HSI, the spatial-dimension DSR-enhanced HSI and the spectral-dimension DSR-enhanced HSI.
To evaluate the improvement of the classification, the overall accuracy (OA) in percentage is applied. For P classes C_i (i = 1, ..., P), if a_{ij} is the number of test samples that actually belong to class C_i and are classified into class C_j (j = 1, ..., P), then

OA = \frac{1}{M} \sum_{i=1}^{P} a_{ii} \qquad (9)

where M is the total number of test samples, P is the number of classes, and a_{ii} is the number of samples of class C_i that are classified correctly.
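For reference, OA can be computed directly from a confusion matrix, as in the short sketch below (our illustration with made-up numbers):

import numpy as np

def overall_accuracy(confusion):
    """confusion[i, j] = number of samples of class C_i classified as C_j."""
    confusion = np.asarray(confusion)
    return np.trace(confusion) / confusion.sum()

# Toy 3-class example: 90 + 80 + 85 correct out of 300 samples.
cm = np.array([[90, 5, 5],
               [10, 80, 10],
               [5, 10, 85]])
print(overall_accuracy(cm))   # 0.85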
The classification results are shown in Figs. 7 and 8. The accuracy comparison of
classification is listed in Table 1.
Because the SVM classifier uses only the per-pixel spectral information of the HSI and ignores spatial context, the improvement in classification accuracy after processing is not large. Nevertheless, the above results show that both the spatial-dimension and the spectral-dimension DSR have a certain effect on the enhancement of the HSI shadow areas and on the improvement of the HSI classification accuracy.
Fig. 7 Classification based on ROI1. a Original HSI. b DSR on spatial dimension. c DSR on
spectral dimension
Fig. 8 Classification based on ROI2. a Original HSI. b DSR on spatial dimension. c DSR on
spectral dimension
5 Conclusion
In this paper, DSR theory is introduced into HSI shadow region enhancement for the first time, and it is applied to process the shadow regions of an HSI from two aspects: the spatial dimension and the spectral dimension. In the classification after enhancement, compared with classification of the original image, the proposed method helps to improve the classification accuracy.
The promising results encourage us to further combine DSR with the convolutional
neural networks (CNN) method which can explore both the spatial and spectral
features of HSIs.
References
1. Zhang J, Liu H, Wei Z (2018) Regularized variational dynamic stochastic resonance method
for enhancement of dark and low-contrast image. Comput Math Appl 76(4):774–787. https://
doi.org/10.1016/j.camwa.2018.05.018
Abstract Recently, images have become more and more important as carriers of information, and the demand for image inpainting is increasing. We present an approach for image inpainting in this paper. The completion model contains one generator and two discriminators. The generator has an autoencoder architecture with skip connections, and the discriminators are simple convolutional neural network architectures. The Wasserstein GAN loss is used to ensure the stable training of our model. We also give the algorithm for training our model in this paper.
1 Introduction
Image inpainting [4] aims to fill missing regions of a damaged image with plausibly
synthesized contents. This technology can be widely applied in many fields like
medical image restoration, ancient books restoration, and PhotoShop processing.
The key problem to be solved in image inpainting is how to make the inpainted
image looks real and semantic coherent. To solve this problem, total variation (TV)
based approaches [1, 17] and PatchMatch (PM) based method [3] get great success
in filling small missing regions, deep learning (DL) based methods [6, 11, 14, 15]
are widely adopted to the inpainting task with large missing regions and have made
great progress. However, the problem of blurring the boundary between the inpainted
regions and the original regions still exists. Beyond that, how to ensure the semantic
correctness of the inpainted regions is also one of the difficulties in the task of image
inpainting.
In this paper, we train a model based on convolutional neural networks to fill large-scale missing regions. Our model consists of one generator, which generates content to fill the missing regions, and two discriminators, which judge whether the inpainted image is visually and semantically plausible. The architecture of the generator is similar to an autoencoder [10] with an encoder and a decoder. Different from the original autoencoder, our model combines skip connections [9] with the autoencoder architecture. We use skip connections in the generator, which help the model use the lower layers of the network to enhance the prediction ability of the decoding process and prevent the vanishing gradients caused by a deep neural network. Similar to the architecture proposed by Iizuka et al. [11], we use a double-discriminator architecture: a global discriminator and a local discriminator. The difference is that we use the Wasserstein GAN [2] loss to train our model, which ensures stable training. We demonstrate that the proposed model is capable of generating realistic and semantically coherent images when inpainting.
The rest of this paper is structured as follows. In Sect. 2, we discuss multiple image inpainting methods. This is followed in Sect. 3 by details of our model, and the training algorithm of our method is proposed in Sect. 4. Conclusions and future work are given in Sect. 5.
2 Related Work
Deep neural networks can learn high-level semantic information of images, and CNNs are effective tools for image processing [12].
More and more researchers use deep learning for image inpainting tasks. One of the first methods based on neural networks is the Context Encoder [15], proposed by Pathak et al. After this, many methods [11, 14, 19, 20] appeared and achieved great success in image inpainting. These methods can generate realistic contents to fill the missing regions. However, they sometimes create blurry textures inconsistent with the original regions of the image and do not perform well when complex structures are missing. Although the existing methods can repair a damaged image well and generate plausible results, the problem of blurring still exists and is worth studying.
3 Methodology
This section details the model framework for image inpainting. Given an image with a missing hole, the goal is to fill the missing region of the image so that the entire image is semantically coherent and realistic. Figure 1 shows our model, which consists of one generator and two discriminators.
3.2.1 Generator
To complete the task of image inpainting, the generator of our model uses an autoencoder architecture with an encoder and a decoder. The incomplete image is fed into the encoder and encoded into a code; the code is then fed into the decoder and decoded into the inpainted image. In order to avoid the vanishing gradients caused by deep neural networks, skip connections [9] are added between the encoder and the decoder. The skip connections make sure that the decoding stage can utilize the output of the low-level encoding stage at the corresponding resolution, supplementing the decoder with part of the structural feature information lost during the encoder downsampling phase; they also enhance the structure prediction capability of the generator. We use a multiple convolution layer architecture in the encoder, and the structure of the decoder is symmetric to the encoder. We also use dilated convolution layers instead of a fully connected layer between the encoder and the decoder.
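A toy PyTorch sketch of such a generator is given below (our own simplified illustration, not the authors' network; the layer sizes are arbitrary), with one skip connection and a dilated-convolution bottleneck.

import torch
import torch.nn as nn

class InpaintGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: two downsampling convolution blocks.
        self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU())
        # Dilated convolutions replace a fully connected bottleneck.
        self.dilated = nn.Sequential(
            nn.Conv2d(64, 64, 3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=4, dilation=4), nn.ReLU())
        # Decoder: symmetric upsampling blocks.
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU())
        # The skip connection concatenates enc1 features with dec1 features.
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        d1 = self.dec1(self.dilated(e2))
        return self.dec2(torch.cat([d1, e1], dim=1))

# A 128x128 masked RGB image goes in, an inpainted image of the same size comes out.
out = InpaintGenerator()(torch.zeros(1, 3, 128, 128))
print(out.shape)   # torch.Size([1, 3, 128, 128])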
3.2.2 Discriminator
The job of the generator is to fill the missing regions of the image. However, one generator alone is not powerful enough to produce realistic results. In order to enhance the ability of the generator, we use discriminators as binary classifiers to judge whether an image comes from the real image dataset or was created by the generator. In this paper, we use a double-discriminator architecture with a local discriminator and a global discriminator, both of which are CNN-based architectures. The local discriminator mainly identifies whether the generated regions are semantically accurate and enhances the generating ability of the generator. The global discriminator identifies the degree of coherence between the new regions (generated by the generator) and the old regions (original regions), and helps the generator produce more semantically consistent results.
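Both discriminators can be sketched as plain convolutional critics; in the toy version below (again our simplification) the two differ only in what they receive: the local one sees the generated region and the global one the whole image. No sigmoid is applied because the Wasserstein loss uses raw scores.

import torch
import torch.nn as nn

def make_discriminator(in_channels=3):
    """A small CNN that maps an image (or patch) to a single realness score."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1))

global_d = make_discriminator()   # sees the whole inpainted image
local_d = make_discriminator()    # sees only the generated region
print(global_d(torch.zeros(1, 3, 128, 128)).shape)   # torch.Size([1, 1])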
In this section, we propose the objective function of our model. We first introduce the reconstruction loss L_r; the generator is trained by minimizing L_r. In this paper, we use the L2-norm loss instead of the L1-norm loss, mainly because the L2-norm penalizes outliers and is suitable for the task of image inpainting. The reconstruction loss is defined as follows:

L_r = \| G(x_m) - x \|_2^2 \qquad (2)
Due to the use of double discriminators, we apply the Wasserstein GAN loss [2] instead of the original GAN loss [8] in our model, because the Wasserstein loss ensures stable training. The Wasserstein GAN loss is defined as follows:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[D(x)] - \mathbb{E}_{z \sim p_z(z)}[D(G(z))] \qquad (3)
The global discriminator loss and the local discriminator loss are defined analogously, where L_global and L_local represent the losses of the global discriminator and the local discriminator respectively, D_g and D_l are the functions of the global and local discriminators, x_c is the whole image containing the generated regions, m_c is the region generated by the generator, and x and r are the real image and the real region from the data distribution. Overall, the total loss function combines the reconstruction loss with the two discriminator losses, where λ_1 and λ_2 are the weights that balance the effects of the different losses.
4 Training Algorithm
In this paragraph, we introduce the algorithm of training our model. We use the
mini-batch method to mask the images from the dataset in each iteration. Firstly,
we sample a mini-batch of images x from training data and mask them with random
hole. Then we get a mini-batch of masked images z, real regions before being masked
r and masks m. z = x m where represents element-wise multiplication. Then
we train the generator s times with Lr loss. After training the generator, we fix the
generator and train discriminators t times with Lglobal and Llocal . Finally, we train the
156 J. Xu et al.
joint model with joint loss L. Input z into the model and output the predicted images
c. Combining the masked regions of c with z, we get the final inpainting images
xi = z + c (1 − m).
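The procedure can be sketched roughly as a single joint update step (a simplified, assumption-laden illustration: gen, d_global, d_local, the optimizers and the loss weights are placeholders, and weight clipping is used as the simplest way to keep the Wasserstein critics Lipschitz).

import torch

def train_step(gen, d_global, d_local, x, m, opt_g, opt_d, lam1=1e-3, lam2=1e-3):
    """One joint update on a batch x with masks m (1 = known pixel, 0 = hole)."""
    z = x * m                                   # masked input images
    c = gen(z)                                  # predicted (inpainted) images
    x_hat = z + c * (1 - m)                     # keep known pixels, fill the hole

    # Critic (discriminator) update with the Wasserstein objective.
    opt_d.zero_grad()
    d_loss = (d_global(x_hat.detach()).mean() - d_global(x).mean()
              + d_local((x_hat * (1 - m)).detach()).mean()
              - d_local(x * (1 - m)).mean())
    d_loss.backward()
    opt_d.step()
    for critic in (d_global, d_local):          # weight clipping keeps the critics bounded
        for p in critic.parameters():
            p.data.clamp_(-0.01, 0.01)

    # Generator update: reconstruction loss plus the two adversarial terms.
    opt_g.zero_grad()
    rec = ((c - x) ** 2).mean()
    g_loss = rec - lam1 * d_global(x_hat).mean() - lam2 * d_local(x_hat * (1 - m)).mean()
    g_loss.backward()
    opt_g.step()
    return g_loss.item(), d_loss.item()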
Acknowledgements This work was supported in part by the National Natural Science Foundation
of China under grant nos. 61872313; the Key Research Projects in Education Informatization
in Jiangsu Province under grant 20180012; by the Postgraduate Research and Practice Innovation
Program of Jiangsu Province under grant KYCX 18_2366; and by Yangzhou Science and Technology
under grant YZ201728,YZ2018209; and by Yangzhou University Jiangdu High-end Equipment
Engineering Technology Research Institute Open Project under grant YDJD201707.
References
1 Introduction
With the rise of agricultural Internet of Things [1], big data [2] and artificial intel-
ligence technology [3], the intelligent management of agricultural production has
attracted more and more attention of experts in this field. Among them, rice growth
prediction is the key link of agricultural precision management [4, 5]. If a predic-
tion model can be established, the corresponding rice growth trend can be predicted
according to the input environmental parameters before actual production, and then
the final yield can be estimated, which will have a positive significance in enhancing
the potential of rice field production and guiding farming.
Rice growth and development is a complex process of interaction between vari-
eties and environmental factors, so the establishment of its prediction model is also
a non-linear and complex problem. At present, there are two different ideas in the
field of rice growth prediction: the crop growth model initiated by DeWit [6] in the
Netherlands has its own system. They subdivide the crop production system into
four different stages, and point out that the laws affecting the growth of each stage
are different. They build models around each stage, including HLCROS, BACROS,
SUCROS and WODOST, which are based on statistical theory. The law of crop
growth does express the general law of crop growth, but the comprehensive model
[7, 8] is complex and difficult which is grasped by ordinary people. The other is
the model of rice growth prediction based on data analysis [9–11], which exca-
vates the hidden relationship between rice yield and environmental factors such as
temperature, light and water. However, it often ignores the different characteristics
of influencing factors and indicators of crop growth in each growth period, and its
accuracy is limited.
On the basis of previous studies on the influence of environmental factors on rice growth, this paper analyzes in detail the growth characteristics of rice at different growth stages and gives a definition of rice growth that considers both the growth cycle and key growth indicators. Rice growth is a comprehensive index, a numerical representation of the growth results within a certain time interval. In this paper, the relationship between environmental factors and rice growth is obtained by using an improved Elman neural network, and the growth trend is expressed by the growth value, which provides a new idea for rice growth prediction.
The organization of the paper is as follows. In Sect. 2, we introduce the overall framework of the rice growth prediction model. In Sect. 3, we describe the definition and mathematical description of rice growth. Section 4 describes the structure of the sample sets in the model, and Sect. 5 proposes the improved Elman neural network algorithm. In Sect. 6 the model is trained, Sect. 7 shows the results of the model training, and in Sect. 8 we summarize the conclusions of the rice prediction model.
This prediction system is based on the Agricultural Internet of Things [12, 13]. Two kinds of data are collected, through sensor networks and manual observation: one is the environmental data along the time axis, such as light, water level, temperature and humidity, and the other is the growth index data of rice in each growth cycle. The neural network located on the central server uses these as training samples to obtain the weights of the prediction network at all levels, and it can be retrained in real time as follow-up samples accumulate, so as to obtain a more accurate prediction model. When the growth trend needs to be predicted, the system can combine the data perceived in the earlier stages of the paddy field in the current year to predict the growth trend of the paddy field layer by layer for the later stages, so as to predict the growth value of rice; the environmental quantities of the later stages can even be adjusted manually to analyze the change of the growth trend (Fig. 1).
Fig. 1 Overall framework of the rice growth prediction system: light, temperature, water level and nitrogen content sensors collect environmental data, which together with the rice growth evaluation index data are used to train the Elman neural network on the server; the trained model is then applied for prediction at the workstation
In order to conveniently evaluate the growth of rice at different growth stages, the paper defines the descriptive quantity of rice growth as the growth amount R. Studies by Zhang et al. [14] show that a suitable leaf area index (LAI) is the basis of high rice yield; the change of leaf area affects and restricts the change of tiller number, and the accumulation of dry matter is closely related to these two factors, which together affect the growth of rice.
Generally speaking, the growth of rice can be divided into six periods: the returning green period, tillering period, jointing and booting period, heading and flowering period, filling period and maturing period [15–17]. The physiological characteristics of rice plants are different at different growth stages, and each cycle has its own unique physiological characteristics. Therefore, it is necessary to analyze the characteristics of rice at the different growth stages in detail. For example, in the returning green period people are mainly concerned about the survival rate; in the tillering period, the number of effective tillers; in the heading and flowering period, mainly the seed setting rate after flowering; and in the filling period, the accumulation of dry matter.
Based on this analysis, we divide the different indicators affecting the growth of each period into two categories:
An inheritance indicator is a cumulative measure that acts on all growth cycles of rice. For example, dry matter accumulation is an inheritance indicator; it is expressed as ihd_{i,j}, the jth inheritance indicator of the ith growth cycle.
An independence indicator is valid only in a certain growth cycle. For example, the rate of ear formation is an independence indicator of the jointing and booting period. It is expressed as idd_{i,j}, where i is the ith growth cycle and j is the jth observation indicator of the ith growth cycle.
In agricultural references, the ARIMA model [18] and the ORYZA2000 model [19] have been used to simulate the effects of the different physiological characteristics of plant organs on the final yield potential [20]. According to the characteristics of the rice growth stages, two reference quantities were selected for each growth stage as its most important and unique indicator quantities. The selected physiological character indicators for each growth cycle are listed in Table 1.
The growth situation of rice depends to some extent on the upper limit of the
index of yield factors. The growth trend of each growth period is interlinked in the
process of rice growth. The growth trend of any period can not be ignored. It directly
or indirectly determines the final yield of rice.
Growth is a comprehensive index. According to the division of growth cycle, a
numerical value is used to characterize the growth trend in a certain time interval.
The higher the value is in the time interval, the better the growth trend will be in this
stage.
The paper divides the evaluation layers according to the agronomic growth periods: one growth period is one evaluation layer, and there are six evaluation layers in total. The evaluation elements of the different evaluation layers are not identical. The evaluation element matrix of each period is defined as follows:
G = \begin{pmatrix} a_{1,1} & \cdots & a_{1,j} \\ \vdots & \ddots & \vdots \\ a_{i,1} & \cdots & a_{i,j} \end{pmatrix}, \quad i \in [1, 6],\ j \in [1, n], \quad a_{i,j} = \begin{cases} idd_{i,j} & \text{Independence indicator} \\ ihd_{i,j} & \text{Inheritance indicator} \end{cases} \qquad (1)
In the formula, i and j are positive integers: i is the stage of the current rice growth period, j is the serial number of the rice growth evaluation index parameter, and a_{i,j} is the value of the jth rice evaluation index in the ith growth stage.
If w_{j,i} is the weight of the jth rice evaluation index parameter in the ith growth stage, then the weight matrix corresponding to the evaluation indexes in each cycle is
W = \begin{pmatrix} w_{1,1} & \cdots & w_{i,1} \\ \vdots & \ddots & \vdots \\ w_{j,1} & \cdots & w_{j,i} \end{pmatrix}, \quad i \in [1, 6],\ j \in [1, n] \qquad (2)
Then the product of the factor matrix G and the weight matrix W corresponding to the rice growth evaluation is

GW = \begin{pmatrix} a_{1,1} & \cdots & a_{1,j} \\ \vdots & \ddots & \vdots \\ a_{i,1} & \cdots & a_{i,j} \end{pmatrix} \begin{pmatrix} w_{1,1} & \cdots & w_{i,1} \\ \vdots & \ddots & \vdots \\ w_{j,1} & \cdots & w_{j,i} \end{pmatrix} \qquad (3)

whose principal diagonal elements are the growth values R_1, ..., R_i of the corresponding periods.
For formal neatness, we take the principal diagonal of this product and multiply it by an all-ones column vector E of i rows to obtain the growth matrix R:

R = \begin{pmatrix} R_1 & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & R_i \end{pmatrix} \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} = \begin{pmatrix} R_1 \\ \vdots \\ R_i \end{pmatrix} \qquad (4)
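As a small worked example of formulas (1)–(4) with made-up numbers, the growth value of each period is simply the weighted sum of its own indicator values, i.e. the corresponding diagonal element of GW:

import numpy as np

# Toy evaluation element matrix G: 2 growth periods x 3 indicators each.
G = np.array([[0.8, 0.6, 0.7],
              [0.5, 0.9, 0.4]])
# Weight matrix W: column i holds the indicator weights of period i.
W = np.array([[0.5, 0.2],
              [0.3, 0.5],
              [0.2, 0.3]])

R = np.diag(G @ W)        # growth value R_i of each period (diagonal of GW)
print(R)                  # [0.72 0.67]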
In the prediction model, we define two data sets. The first data set contains the environmental data of each growth period, which come from the real-time collection of the sensor nodes distributed in the paddy fields. The other data set is the collection of growth data indicators corresponding to each growth period; a few of these come from sensors specially installed in the field, but most come from actual manual statistics and measurements in the field. We store the data in standard XML form.
<RECORD>
<Period>PeriodVar</Period>
<GrowingDays>DayVar</GrowingDays>
<AvgTemperature>TempVar</AvgTemperature>
<DaylightHour>LightHourVar</DaylightHour>
<AvgWindSpeed>WindSpeedVar</AvgWindSpeed>
<WaterHeight>WaterHeightVar</WaterHeight>
<NitrogenContent>NitrogenVar</NitrogenContent>
</RECORD>
Definition 3 The growth evaluation indicators corresponding to the six growth cycles of rice are not identical, so it is necessary to distinguish the different growth stages, as shown in Table 1: Q = {ihd_{i,1}, ihd_{i,2}, ihd_{i,3}, idd_{i,1}, idd_{i,2}}.
Take the format of the evaluation indicator data set of the returning green period as an example, as follows:
<RECORD>
<id> IDevaluation </id>
In this evaluation indicator data set, there are two types of indicators, three of
which are inherited indicators, as follows:
The remaining two indicators are the independence indicators for each stage, as
follows:
The difficulty of establishing a rice growth prediction model lies in how to select a suitable training method for the samples. There are many neural network methods for predicting growth trends [21, 22], such as BP [23], RBF [24] and GRNN [25]. The BP neural network was considered first: it is a very classical feed-forward neural network with a simple structure, an easy-to-understand algorithm and good versatility. Through forward propagation of information and back-propagation of errors, it approaches the final result, but it easily falls into local extreme points during training, and when the network structure is too large, the "over-fitting" phenomenon easily appears. Rice growth is a dynamic process over time, so choosing a dynamic neural network better suits the needs of this study. The Elman recurrent neural network [26–28] is selected here: by adding an internal feedback, the network obtains the ability to adapt to time-varying characteristics. The forward feedback data are optimized and screened by a genetic algorithm, so as to accelerate the convergence of the network.
The evaluation methods of the growth periods are different, and the growth trend of the previous cycle directly affects the growth trend of the next cycle. In the model, this means that six growth period models need to be trained successively; the growth prediction models of the returning green period, tillering period, jointing and booting period, heading and flowering period, filling period and maturing period are generated respectively, recorded as M_ReturningGreen, M_Tillering, M_JointingBooting, M_HeadingFlowering, M_Filling and M_Maturing.
In the first stage there is no previous cycle, so it is not affected by the growth trend of a previous cycle. Therefore, the six staged models can be divided into two categories according to their input–output structure: M_ReturningGreen forms one category, and the rest, M_Tillering, M_JointingBooting, M_HeadingFlowering, M_Filling and M_Maturing, form the other category. Firstly, as shown in Fig. 2, a growth model is established and trained, and the weight files of the environmental factors for the evaluation indexes of growth in the returning green period are obtained. In Fig. 2, considering the correlation between the inheritance indexes and the input layer, an internal feedback of the inheritance indexes screened by the genetic algorithm is added.
As shown in Fig. 3, with n = 6 and i ∈ [2, 6], the weight files of the multiple environmental factors for the other growth stages are obtained.
The weight files above cannot directly give the value of growth, so we need to know the relationship between the growth value and its evaluation indexes. The weight of each index on growth is given dynamically by agricultural experts. For example, in the heading and flowering period, after consulting relevant experts, the weight of the seed setting rate is generally about 0.27, while the leaf area is not very sensitive in this period and accounts for about 0.15. Using these empirical averages, the corresponding growth value R_i can be obtained by formula (5):

R_i = \sum_{j=1}^{n} \lambda_{ij} A_{ij}, \quad i \in [1, 6],\ j \in [1, 5] \qquad (5)
Algorithm 1 Improved genetic algorithm optimization of the Elman neural network for the rice growth prediction model
Input: training and forecast data Data; population size Size; maximum number of generations Maxgen; minimum target error MinError; numbers of nodes in the input layer, hidden layer, context (acceptance) layer and output layer In, H1, H2, Om
Output: initial values of the weights and thresholds of the Elman neural network in the rice growth prediction model W1, W2, W3, θ1, θ2
1. Randomly generate the first generation (Gen = 1) of chromosomes encoding the network weights and thresholds.
2. Decode each chromosome in the population and substitute it into the Elman neural network; compute its fitness from the prediction error (yd is the actual output of the network, y is the expected output; a small constant avoids the denominator being 0).
3. Selection.
4. Adaptive crossover.
5. Adaptive mutation.
6. For k = 1 : Size, decode the chromosomes of the new population and substitute them into the Elman neural network.
7. Compare the fitness values and record the individual Fitness_max with the highest fitness value.
8. Repeat steps 3–7 until Gen > Maxgen or the error falls below MinError.
9. Decode the best individual and get the output W1, W2, W3, θ1, θ2.
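A highly simplified Python sketch of this idea is given below (our illustration only: a tiny Elman-style forward pass in numpy and a bare-bones genetic loop that keeps the fittest randomly initialized weight vectors; the real algorithm also uses adaptive crossover and adaptive mutation as listed above).

import numpy as np

def elman_forward(weights, X, n_in, n_hidden):
    """Run a minimal Elman network over a sequence X (one row per time step)."""
    W1 = weights[:n_in * n_hidden].reshape(n_in, n_hidden)                                   # input -> hidden
    W2 = weights[n_in * n_hidden:(n_in + n_hidden) * n_hidden].reshape(n_hidden, n_hidden)   # context -> hidden
    W3 = weights[(n_in + n_hidden) * n_hidden:].reshape(n_hidden)                            # hidden -> output
    context = np.zeros(n_hidden)
    outputs = []
    for x in X:
        hidden = np.tanh(x @ W1 + context @ W2)
        context = hidden                      # the context layer copies the hidden layer
        outputs.append(hidden @ W3)
    return np.array(outputs)

def ga_init_weights(X, y, n_in, n_hidden, size=30, maxgen=50, seed=0):
    """Keep the fittest random weight vectors as the Elman network's initial values."""
    rng = np.random.default_rng(seed)
    dim = (n_in + n_hidden) * n_hidden + n_hidden
    pop = rng.standard_normal((size, dim))
    def fitness(w):
        # Reciprocal of the prediction error; 1e-6 avoids a zero denominator.
        return 1.0 / (np.abs(elman_forward(w, X, n_in, n_hidden) - y).sum() + 1e-6)
    for _ in range(maxgen):
        scores = np.array([fitness(w) for w in pop])
        parents = pop[np.argsort(scores)][-size // 2:]                   # selection
        children = parents + 0.1 * rng.standard_normal(parents.shape)    # mutation only, for brevity
        pop = np.vstack([parents, children])
    return pop[np.argmax([fitness(w) for w in pop])]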
6 Model Training
(1) Data normalization

X_{di} = \frac{X_i - X_{\min}}{X_{\max} - X_{\min}} \qquad (6)

In the formula, X_{di} is the processed data, X_{max} is the maximum value and X_{min} is the minimum value.
(2) Setting of input and output parameters
The input and output of each model are shown in Table 2 as follows.
The relationship between Ri and Mi is one-to-one. i is the number of growth
stages, i ∈ [1, 6]. Taking the grain filling stage as an example, the growth of rice
in the fifth growth cycle of rice is predicted by the rice growth prediction model
(MFilling ).
(3) Number of Hidden Layers and Nodes
The speed and accuracy of learning are affected by the number of hidden layers and nodes. If there are too many nodes, the training time increases greatly and over-fitting may occur; on the contrary, with too few nodes the network carries relatively little information and the training effect is not ideal. The basic principle of determining the hidden layer is to use a network structure as compact as possible while meeting the accuracy requirements [32]. Therefore, the number of hidden layers and nodes must be within a reasonable range, so that the training accuracy is high and the number of training iterations is reduced as much as possible. The number of nodes is related to the numbers of input and output nodes as well as the complexity of the problem. Here, the following method is used: first, an empirical formula gives a rough range, and then the best value is obtained through experiments.
P = L + \sqrt{M + N} \qquad (7)
In the formula, P, M and N are the numbers of nodes in the hidden layer, input layer and output layer respectively, and L is a positive integer with L ∈ [1, 10]. When calculating the number of hidden layer nodes of M_ReturningGreen, we set the number of hidden layer nodes to 5, 6, …, 12, 13 in turn, run the simulation experiment, observe the effect, and finally determine the appropriate number of hidden layer nodes.
Taking the samples of the returning green period as an example, the experiment was carried out in MATLAB, with the number of training steps set to 4000 and the target error set to 0.00001.
Table 3 shows that the convergence accuracy of the network is best when the number of hidden layer nodes is 8, so 8 nodes are selected for this model. Similarly, the five other growth stage models, such as the tillering stage model, are calculated by the same method, and the results show that 9 is best for them. Therefore, the number of hidden layer nodes of the five later growth stage models, such as the tillering stage, is 9, while that of the returning green stage model is 8.
(4) Number of nodes in the context (acceptance) layer
According to the characteristics of the Elman network, the number of nodes in the context layer should be the same as that in the hidden layer. The number of context layer nodes corresponding to each model is shown in Table 4.
Table 3 Number of hidden layer nodes and total errors in neural network simulation
Number of nodes in hidden layer Total error Number of nodes in hidden layer Total error
4 9.48e−04 9 4.37e−04
5 8.64e−04 10 9.82e−05
6 7.23e−04 11 2.27e−04
7 2.56e−04 12 1.37e−04
8 7.73e−05 13 2.42e−04
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# dataset and env1_inputdata are assumed to be pandas objects prepared earlier;
# an MLP regressor from scikit-learn is assumed here to stand in for the elided constructor.
env1_outputdata = dataset['Ingrowth']
output_env1 = env1_outputdata.values
input_env1 = env1_inputdata.values
X_train, X_test, Y_train, Y_test = train_test_split(
    input_env1, output_env1, random_state=32)
clf = MLPRegressor(hidden_layer_sizes=(12,),
                   learning_rate_init=1e-5, max_iter=10000,
                   random_state=32).fit(X_train, Y_train)
sc = clf.score(X_test, Y_test)   # R^2 score on the test set
print(sc)
7 Test Result
In order to illustrate the validity of the improved Elman neural network in rice growth
prediction model, the following three models were used to predict rice growth.
(1) Traditional BP Neural Network
(2) Standard Elman Neural Network
(3) Improved Elman Neural Network.
The prediction results of the three kinds of neural networks are shown in Fig. 5. The predicted values of the traditional BP neural network have a low degree of fit with the actual data and some inaccurate points exist. The predicted values of the standard Elman neural network are improved in accuracy, but there are still individual error points with a reversed trend, which indicates that the standard Elman neural network can fall into local extreme points. The Elman network improved by the genetic algorithm fits very well, and the error between the predicted values and the real values is smaller. It can be seen that the improved Elman neural network gives better predictions in the rice growth prediction model.
In order to verify the performance of the improved Elman neural network, the standard Elman neural network and the improved Elman neural network were trained respectively and their convergence curves compared, as shown in Fig. 6.
The precision of the traditional BP neural network, the standard Elman neural network and the improved Elman neural network in rice growth prediction was compared and analyzed by means of the mean square error (MSE) and the mean absolute error (MAE), as shown in Table 5.
as shown in Table 5.
Table 5 Comparisons of
Neural network type Average MSE Average MAE
algorithmic errors
BP neural network 0.01710 0.15423
Elman neural network 0.00682 0.08440
Improved Elman neural 0.00212 0.02603
network
From the point of view of error analysis, as shown in Table 5, compared with the traditional BP and standard Elman neural networks, the improved algorithm has a smaller error range, fewer points with large prediction errors and smaller fluctuations, which greatly improves the prediction accuracy. From the point of view of convergence speed, as shown in Fig. 6, the improved algorithm gradually stabilizes after about 150 steps, while the standard Elman neural network takes about 250 steps to converge, which shows that the convergence of the improved Elman neural network is faster. Therefore, the prediction effect of the improved Elman neural network in the rice growth prediction model is better than that of the standard Elman neural network and the traditional BP neural network.
8 Conclusion
He Zhou is a postgraduate student at Yangzhou University. His main research interests are Arti-
ficial Intelligence and Internet of things.
Abstract In recent years, researchers have gradually focused on single-image super-resolution for large scale factors. A single image contains scarce high-frequency detail, which is insufficient to reconstruct a high-resolution image. To address this problem, we propose a multi-scale progressive image super-resolution reconstruction network (MSPN) based on the asymmetric Laplacian pyramid structure. Our proposed network allows us to separate the difficult problem into several subproblems for better performance. Specifically, we propose an improved multi-scale feature extraction block (MSFB) to widen our proposed network and achieve deeper and more effective exploitation of feature information. Moreover, weight normalization is applied in MSFB to tackle the vanishing and exploding gradient problems and to accelerate the convergence of training. In addition, we introduce a pyramid pooling layer into the upsampling module to further enhance the image reconstruction performance by aggregating local and global context information. Extensive evaluations on benchmark datasets show that our proposed algorithm achieves strong performance against state-of-the-art methods in terms of accuracy and visual quality.
1 Introduction
proposed network MSPN and describes the details of the network architecture. We present the evaluation results and compare them with other state-of-the-art algorithms quantitatively and qualitatively in Sect. 4. Section 5 concludes this paper.
2 Related Works
Image SR has been developed for more than a decade, and researchers have proposed numerous innovative methods, including traditional image SR algorithms [16, 27, 29], such as interpolation-based, reconstruction-model-based and traditional machine learning based methods, and deep learning based image SR algorithms [5–14, 17–19]. Among them, SRCNN [5] proposed by Dong et al. first introduced deep learning into image SR reconstruction, and LapSRN [12, 13] proposed by Lai et al. first introduced the progressive reconstruction structure into image SR.
In machine learning, the input data are usually assumed to be independent and identically distributed (IID), but this assumption is difficult to maintain in a deep neural network: in a multi-layer network, the parameter update of each layer changes the distribution of the inputs to the layer above it, and as more layers are stacked this distribution shift grows sharply, forcing high-level layers to constantly readjust to the updated low-level representations.
To tackle this problem, normalization methods with various transformations are applied in deep neural networks. They make the input data of each layer approximately satisfy the IID assumption and limit the output of each layer to a certain range in preparation for succeeding operations. Common normalization methods include batch normalization (BN) [20], layer normalization (LN), weight normalization (WN) [21], and group normalization.
Many previous works have found that BN is not suitable for training image SR networks, because of its mini-batch dependency, strong regularization side effects, and the different formulations it requires during training and testing. However, as the depth of neural networks increases, the lack of any normalization makes the networks difficult to train.
Yu et al. [22] proposed that adopting WN into networks achieved higher accuracy
than BN or without any normalization during the training or testing process. By
normalizing the weight vector, WN speeds up the convergence of training, improves
the image reconstruction performance and enables networks to be trained at higher
learning rates. So, we adopt weight normalization into our proposed network to
improve the stability of the training process.
In deep neural networks, the size of the receptive fields represents the amount of
contextual information it contains. Moreover, as the depth of networks increases, the
size of the receptive fields also increases. However, in image processing, the size of the
receptive fields is usually insufficient to receive all the global context information,
resulting in poor image quality. In order to address this problem, there exist two
common strategies: one is to use the global average pooling method, but this method
may cause networks to lose the spatial relationship between context information and
to blur the reconstructed image; the other is to smoothly concatenate features from
different levels, which can reduce the information loss among different subregions.
Zhao et al. [23] proposed the pyramid scene parsing network (PSPNet) for
semantic segmentation. In order to incorporate suitable global features, they proposed
pyramid pooling layer, which can aggregate the context information from different
subregions and exploit the capability of global information. Pyramid pooling was
commonly used in scene parsing task, while Park et al. [24] introduced it into EDSR
for image SR task. They proposed EDSR-PP, an improved version of EDSR by
applying four pyramid scales (1 × 1, 2 × 2, 3 × 3 and 4 × 4) into the upsampling
module, and obtained better performance. Therefore, we adopt pyramid pooling layer
to enhance the capability of exploiting the global context information, and to produce
SR images with higher quality.
3 Our Method
where $f_{l}^{i}$ denotes the operations of the $i$th pyramid module. Specifically, the input of the first pyramid module is $I_{LR}$.
After extracting hierarchical features with the pyramid modules, we stack a convolution layer to output the residual image $I_{Res}$, which can be obtained by

$$I_{Res} = f_{conv}\bigl(I_{out}^{i}\bigr), \quad (2)$$
$$I_{SR} = f_{up}(I_{LR}) + I_{Res}, \quad (3)$$

where $f_{up}$ denotes the operation of an upsampling kernel; here we use the bicubic operation.
The architecture of our proposed pyramid module is shown in Fig. 2. Our pyramid
module consists of multi-scale compression blocks (MSCBs), global residual
learning and the upsampling module.
Global residual learning is introduced to obtain feature maps before the upsampling module, and to further improve the information flow, since there are several MSCBs in one pyramid module. The mathematical formulation of global residual learning is

$$F_{GF} = F_{LF} + I_{out}^{i-1}, \quad (4)$$

where $I_{out}^{i-1}$ denotes the output of the preceding pyramid module. In the $i$-th pyramid module, $I_{out}^{i-1}$ is regarded as the initial input $F_{CB}^{0}$. After extracting low-level features $F_{LF}$ in LR space, we produce the final feature maps $F_{GF}$ through global residual learning. It should be noted that global residual learning can further encourage the flow of information and gradients, and help high-level layers acquire more effective features.
Multi-scale compression block (MSCB) is utilized to further exploit the hier-
archical features and to enhance the discriminative representations. As shown in
Fig. 3, it consists of multi-scale feature extraction blocks (MSFBs), compression
layer and local dense feature fusion. MSFBs are designed for extracting multi-scale
features, and MSCB fully utilizes these features to greatly improve the quality of SR
image. However, directly using multi-scale features will cause enormous computa-
tional complexity. Therefore, we employ a 1 × 1 compression layer to adaptively
fuse feature information from all the MSFBs, and to extract useful features before
being passed to the next MSCB. Besides, we find that adding a compression layer before the image reconstruction module can improve the compactness of the network.
Local dense feature fusion is applied to fuse features from all preceding MSFBs
and to extract local and global features. After extracting hierarchical features in LR
space, we stack a compression layer to achieve local dense feature fusion and to
reduce the computational complexity. $F_{CB}^{i-1}$ and $F_{CB}^{i}$ represent the input and output of the $i$th MSCB respectively, and local dense feature fusion can be represented as

$$F_{CB}^{i} = C_{FF}\bigl(\bigl[F_{CB}^{i-1}, F_{FB}^{0}, F_{FB}^{1}, \ldots, F_{FB}^{N}\bigr]\bigr), \quad (5)$$
Fig. 4 Comparison of the commonly-used image reconstruction structure and our modified
structure
Fig. 5 The illustrations of pyramid pooling layer. In our proposed network, we remove the 6 × 6
convolution layer
The pyramid pooling layer of [23] uses four pyramid scales (1 × 1, 2 × 2, 3 × 3 and 6 × 6). However, we remove the 6 × 6 pyramid scale from the pyramid pooling layer to reduce the computational complexity. In our proposed upsampling module, we perform multi-scale context information exploitation with three pyramid scales (1 × 1, 2 × 2 and 3 × 3).
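As a hedged illustration of this idea, a PyTorch sketch of a pyramid pooling layer with the three scales 1 × 1, 2 × 2 and 3 × 3 might look as follows; the channel arithmetic and the fusion convolution are our own illustrative choices, not the exact configuration of MSPN.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    # Pyramid pooling with 1x1, 2x2 and 3x3 scales; channel sizes and the
    # final fusion convolution are illustrative, not MSPN's exact setup.
    def __init__(self, channels, scales=(1, 2, 3)):
        super().__init__()
        self.scales = scales
        self.reduce = nn.ModuleList(
            [nn.Conv2d(channels, channels // len(scales), kernel_size=1)
             for _ in scales])
        total = channels + (channels // len(scales)) * len(scales)
        self.fuse = nn.Conv2d(total, channels, kernel_size=3, padding=1)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [x]
        for scale, conv in zip(self.scales, self.reduce):
            pooled = F.adaptive_avg_pool2d(x, output_size=scale)   # context of one subregion size
            feats.append(F.interpolate(conv(pooled), size=(h, w),
                                       mode='bilinear', align_corners=False))
        return self.fuse(torch.cat(feats, dim=1))                  # aggregate local and global context

y = PyramidPooling(64)(torch.randn(1, 64, 32, 32))
print(y.shape)   # torch.Size([1, 64, 32, 32])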
The contents of different images differ greatly in size and position, and the color
and structure in an image also vary considerably. Hence, it is difficult to select the
appropriate kernel size for a convolution layer. For image classification and detection tasks, Szegedy et al. [25] first proposed GoogLeNet with the inception module. Different
from previous works, it improves the performance by introducing a new network
topology instead of stacking more layers.
As illustrated in Fig. 6, the inception module consists of three different filters
and maximum pooling operation. Through different convolution layers, GoogLeNet
can obtain distinctive feature information and improve the image characterization.
Li et al. [26] adaptively introduced the inception module to extract different scale
features for better image reconstruction performance. Furthermore, their method
performed deeper and more effective exploitation for feature information in LR
space by stacking the inception module.
Motivated by the method of applying inception module [25] into image SR
network [26], we propose a dual-branch multi-scale feature extraction block (MSFB),
as shown in Fig. 7. It forms various convolution kernel sizes by combining different
kernels in two branches, and operates the feature maps on all the convolution layers.
Besides, it achieves the feature sharing between two branches and concatenates all
the outputs into deep feature maps. This approach can widen the network for better
image characterization and feature utilization.
MSFB uses the dual-branch network to exploit the feature information by different
convolution kernels. Let $F_{FB}^{i-1}$ and $F_{FB}^{i}$ be the input and output of the $i$th MSFB respectively. The mathematical formulation of MSFB is

$$A_1 = \sigma\bigl(F_{WN}\bigl(W_{3\times3} * F_{FB}^{i-1}\bigr)\bigr), \quad (6)$$
$$B_1 = \sigma\bigl(F_{WN}\bigl(W_{5\times5} * F_{FB}^{i-1}\bigr)\bigr), \quad (7)$$

where $\sigma$ denotes the ReLU activation function, and $W_{3\times3}$ and $W_{5\times5}$ are the weights of the 3 × 3 and 5 × 5 convolution layers; the bias terms are omitted for simplicity. $[\cdot,\cdot]$ denotes the concatenation operation. Besides, we apply local residual learning to improve the efficiency of the network, which can be obtained by

$$F_{FB}^{i} = F_{FB}^{i-1} + F_{concat}. \quad (11)$$
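Equations (8)–(10), which define the cross-branch feature sharing, are not reproduced above, so the following PyTorch sketch only approximates the described idea: two weight-normalized 3 × 3 and 5 × 5 branches whose outputs are shared, concatenated, compressed by a 1 × 1 convolution, and added back through local residual learning. It illustrates the block style rather than the exact MSFB definition.

import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

class DualBranchBlock(nn.Module):
    # Approximate dual-branch multi-scale block; the exact wiring of
    # Eqs. (8)-(10) is not shown in the text and is guessed here.
    def __init__(self, channels):
        super().__init__()
        self.conv3_1 = weight_norm(nn.Conv2d(channels, channels, 3, padding=1))
        self.conv5_1 = weight_norm(nn.Conv2d(channels, channels, 5, padding=2))
        self.conv3_2 = weight_norm(nn.Conv2d(2 * channels, channels, 3, padding=1))
        self.conv5_2 = weight_norm(nn.Conv2d(2 * channels, channels, 5, padding=2))
        self.compress = nn.Conv2d(2 * channels, channels, 1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        a1 = self.relu(self.conv3_1(x))           # Eq. (6)
        b1 = self.relu(self.conv5_1(x))           # Eq. (7)
        shared = torch.cat([a1, b1], dim=1)       # feature sharing between the two branches
        a2 = self.relu(self.conv3_2(shared))
        b2 = self.relu(self.conv5_2(shared))
        f_concat = self.compress(torch.cat([a2, b2], dim=1))
        return x + f_concat                       # Eq. (11), local residual learning

out = DualBranchBlock(64)(torch.randn(1, 64, 48, 48))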
$$y = \mathbf{w} \cdot \mathbf{x} + b, \quad (12)$$
$$\mathbf{w} = \frac{g}{\|\mathbf{v}\|}\,\mathbf{v}, \quad (13)$$

where $\mathbf{v}$ is a $k$-dimensional vector, $g$ is a scalar, and $\|\mathbf{v}\|$ is the Euclidean norm of the vector $\mathbf{v}$. With this decomposition, $\|\mathbf{w}\| = g$ independently of the parameter $\mathbf{v}$, so the Euclidean norm of the weight vector is fixed and a regularization effect is obtained. This weight decomposition method makes the training of the network parameters more robust.
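As a brief, hedged illustration of Eq. (13) in practice, PyTorch provides a weight_norm utility that performs exactly this $g\,\mathbf{v}/\|\mathbf{v}\|$ reparameterization of a layer's weight; the layer sizes below are arbitrary.

import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

# Wrap a convolution so that its weight is reparameterized as w = g * v / ||v||
conv = weight_norm(nn.Conv2d(64, 64, kernel_size=3, padding=1))

# The original weight is replaced by two trainable tensors:
# weight_g (the magnitudes g, one per output channel by default) and weight_v (the direction v)
print(conv.weight_g.shape)
print(conv.weight_v.shape)

y = conv(torch.randn(1, 64, 32, 32))   # the forward pass recomputes w from g and v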
4 Experiment
4.1 Datasets
We use the DIV2K dataset [15] as the training set, which contains 1000 high-resolution images divided into three parts: 800 training images, 100 validation images and 100 test images. For testing and benchmarking, we choose Set5 [28], Set14 [29], BSDS100
[30], Urban100 [31] and Manga109 [32] datasets, to verify the feasibility of our
proposed network and to evaluate the performance of our image SR reconstruction
results.
As in most previous networks, all our training and testing are conducted on the luminance channel, and we choose scale factors of 2×, 4× and 8× in this paper for training and testing.
For training our proposed network, we use RGB LR image patches of size 64 × 64 as input. We randomly sample the LR patches and augment them in three ways: (1) randomly scaling the images with a ratio between 0.5 and 1.0; (2) randomly flipping the images horizontally or vertically; (3) randomly rotating them by 90°, 180° or 270°, as sketched below.
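A small, hedged sketch of these three augmentations in Python; the interpolation mode and the flip probabilities are illustrative assumptions, since the text does not specify them.

import random
from PIL import Image

def augment(patch):
    # (1) random scaling with a ratio between 0.5 and 1.0
    ratio = random.uniform(0.5, 1.0)
    w, h = patch.size
    patch = patch.resize((max(1, int(w * ratio)), max(1, int(h * ratio))), Image.BICUBIC)
    # (2) random horizontal or vertical flip
    if random.random() < 0.5:
        patch = patch.transpose(Image.FLIP_LEFT_RIGHT)
    if random.random() < 0.5:
        patch = patch.transpose(Image.FLIP_TOP_BOTTOM)
    # (3) random rotation by 90, 180 or 270 degrees
    angle = random.choice([0, 90, 180, 270])
    if angle:
        patch = patch.rotate(angle, expand=True)
    return patch

augmented = augment(Image.new('RGB', (64, 64)))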
For 8 × scale factor, we build a three-stage progressive network and train the
network with the curriculum-learning method. Each multi-scale compression block
In Table 1, the comparison among the original network proSR, proSR with WN,
proSR with pyramid pooling layer and our proposed MSPN is shown. It demon-
strates how weight normalization and pyramid pooling layer can improve the network
performance for 2 ×, 4 × and 8 × scale factors.
(1) Weight normalization
We add weight normalization before each ReLU operation in the multi-scale feature extraction block, which achieves better performance than adding batch normalization or using no normalization at all. Furthermore, weight normalization provides more memory headroom, allowing the network to be deepened appropriately without additional computational complexity.
In Table 1, we can find that proSR with weight normalization gains better perfor-
mance than the original proSR. This is mainly because the reparameterization
of the weight vectors can alleviate the gradient problem and accelerate the
optimization convergence.
Table 1 Comparison of proSR, with WN, with pyramid pooling layer (PPL), and MSPN for 2×, 4× and 8× scale factors (bold indicates the best performance)
Methods   Set5 PSNR/SSIM   Set14 PSNR/SSIM   BSDS100 PSNR/SSIM   Urban100 PSNR/SSIM
proSR 2× 37.95/0.9613 33.58/0.9126 32.10/0.9045 32.03/0.9341
With WN 38.02/0.9642 33.68/0.9133 32.17/0.9074 32.41/0.9355
With PPL 38.02/0.9655 33.68/0.9145 32.15/0.9056 32.45/0.9364
MSPN (our) 38.06/0.9687 33.71/0.9264 32.21/0.9089 32.67/0.9388
proSR 4× 32.28/0.9045 28.57/0.7813 27.55/0.7344 26.19/0.7961
With WN 32.39/0.9051 28.65/0.7846 27.60/0.7458 26.43/0.8022
With PPL 32.46/0.9067 28.67/0.7853 27.60/0.7463 26.48/0.8043
MSPN (our) 32.50/0.9086 28.75/0.7902 27.64/0.7487 26.64/0.8077
proSR 8× 27.16/0.7832 24.83/0.6454 24.76/0.5935 22.56/0.6287
With WN 27.29/0.7923 24.93/0.6589 24.80/0.6023 22.67/0.6323
With PPL 27.38/0.7948 24.91/0.6475 24.81/0.6098 22.71/0.6356
MSPN (our) 27.45/0.7995 25.02/0.6523 24.86/0.6178 22.84/0.6432
Table 2 Comparison between the result of our proposed MSPN and other methods (bold italic
indicates the best performance and bold indicates the second best performance.)
Methods   Scale   Set5 PSNR/SSIM   Set14 PSNR/SSIM   BSDS100 PSNR/SSIM   Urban100 PSNR/SSIM   Manga109 PSNR/SSIM
Bicubic 2× 33.68/0.9291 30.28/0.8684 29.58/0.8435 26.88/0.8439 30.83/0.9334
A+ 36.58/0.9540 32.43/0.9060 31.23/0.8871 29.23/0.8955 35.41/0.9652
SRCNN 36.69/0.9553 32.39/0.9063 31.36/0.8881 29.51/0.8989 35.85/0.9676
VDSR 37.53/0.9589 33.05/0.9124 31.91/0.8966 30.77/0.9156 37.22/0.9730
DRCN 37.63/0.9586 33.07/0.9110 31.85/0.8947 30.74/0.9144 37.61/0.9218
DRRN 37.74/0.9593 33.23/0.9134 32.06/0.8971 31.23/0.9190 37.61/0.9734
LapSRN 37.52/0.9587 33.09/0.9128 31.40/0.8950 31.81/0.8952 37.27/0.9740
EDSR 38.11/0.9601 33.92/0.9195 32.32/0.9012 32.33/0.9015 39.02/0.9767
MSRN 38.06/0.9602 33.74/0.9167 32.24/0.9013 32.24/0.9014 38.82/0.9866
CARN 37.66/0.9585 33.48/0.9162 31.92/0.8961 30.98/0.9291 −/−
proSR 37.95/0.9655 33.58/0.9106 32.10/0.9097 32.03/0.9373 38.44/0.9865
MSPN(our) 38.06/0.9689 33.75/0.9214 32.21/0.9126 32.67/0.9398 38.70/0.9883
Bicubic 4× 28.41/0.8020 26.02/0.3943 25.95/0.6567 23.11/0.6602 24.87/0.7826
A+ 30.32/0.8561 27.41/0.7445 26.80/0.7010 24.30/0.7205 27.02/0.8440
SRCNN 30.47/0.8568 27.55/0.7443 26.93/0.6994 24.55/0.7234 27.60/0.8545
VDSR 31.33/0.8790 28.04/0.7618 27.34/0.7166 25.17/0.7532 28.83/0.8810
DRCN 31.52/0.8806 28.10/0.7620 27.21/0.7151 25.17/0.7540 28.86/0.8806
DRRN 31.55/0.8877 28.19/0.7714 27.33/0.7278 25.44/0.7627 29.13/0.8920
LapSRN 31.53/0.8810 28.11/0.7628 27.32/0.7280 25.21/0.7550 29.02/0.8840
EDSR 32.42/0.8958 28.74/0.7866 27.73/0.7420 26.60/0.8033 31.05/0.9438
MSRN 32.04/0.8896 28.50/0.7749 27.50/0.7271 26.02/0.7889 30.17/0.9033
CARN 32.03/0.8924 28.52/0.7760 27.47/0.7323 25.88/0.7788 −/−
proSR 32.38/0.9075 28.57/0.7898 27.55/0.7334 26.19/0.7945 30.58/0.9146
MSPN (our) 32.50/0.9143 28.76/0.7904 27.64/0.7478 26.64/0.8099 30.95/0.9187
Bicubic 8× 24.40/0.6485 23.10/0.5663 23.66/0.5482 20.72/0.5164 21.45/0.6135
A+ 25.53/0.6648 23.98/0.5530 24.21/0.5155 21.33/0.5190 22.37/0.6450
SRCNN 25.32/0.6571 23.76/0.5914 24.13/0.5665 21.29/0.5438 22.44/0.6784
VDSR 25.93/0.7243 24.24/0.6172 24.50/0.5832 21.66/0.5704 22.83/0.6763
DRCN 25.73/0.6743 24.21/0.5515 24.48/0.5170 21.66/0.5288 23.30/0.6687
LapSRN 26.15/0.7382 24.35/0.6210 24.51/0.5864 21.77/0.5811 23.41/0.7071
EDSR 27.09/0.7810 24.94/0.6431 24.81/0.5983 22.50/0.6237 24.70/0.7834
MSRN 26.59/0.7253 24.90/0.5962 24.72/0.5401 22.34/0.5967 24.27/0.7510
CARN 26.74/0.7667 24.83/0.6356 24.73/0.5911 22.25/0.6044 −/−
proSR 27.16/0.7832 24.83/0.6454 24.76/0.5935 22.56/0.6287 24.67/0.7832
MSPN (our) 27.45/0.7995 25.02/0.6523 24.86/0.6178 22.84/0.6432 24.98/0.7945
(Fig. 8 panels: Ground-truth HR, LapSRN, proSR, MSRN, Ours; Manga109 image 'YumeiroCooking')
Fig. 8 Visual comparison for 4 × SR on the BSDS100, Urban100 and Manga109 datasets
image ‘img_033’, other algorithms usually produce blurring artifacts and twisted
lines in images, especially for the striped or grid-like contents.
What’s worse, the generated lines are completely inconsistent with the direction
of the original lines. However, MSPN can solve these problems and recover more
real details and distinct features.
Besides, for image ‘TotteokiNoABC’ in Manga109 dataset, we can find that some
methods suffer from blurring artifacts and over-smoothed edges around the letters.
By contrast, MSPN shows great abilities in recovering more accurate information
and clear outline, which are closer to the real images (Fig. 9).
In general, our proposed method demonstrates better performance both quantitatively and visually, for both 4× and 8× scale factors. Our improved feature extraction module achieves deeper and more effective information exploitation in LR images, and the image reconstruction module makes full use of feature information and spatial relationships to reconstruct high-quality SR images that are more faithful to the real images.
(Fig. 9 panels: Ground-truth HR, LapSRN, proSR, MSRN, Ours; Manga109 image 'TotteokiNoABC')
Fig. 9 Visual comparison for 8 × SR on the BSDS100, Urban100 and Manga109 datasets
5 Conclusion
References
1. Shi WZ, Caballero J, Ledig C, Zhuang XH, Bai WJ, Bhatia K, de Marvao AMSM, Dawes T,
O’Regan D, Rueckert D (2013) Cardiac image super-resolution with global correspondence
using multi-atlas patchmatch. Proceedings of MICCAI, pp 9–16
2. Gunturk BK, Altunbasak Y, Mersereau RM (2004) Super-resolution reconstruction of
compressed video using transform-domain statistics. IEEE Trans Image Process 13:33–43
3. Zou WWW, Yuen PC (2012) Very low resolution face recognition problem. IEEE Trans Image
Process 21:327–340
4. Yıldırım D, Güngör O (2012) A novel image fusion method using ikonos satellite images. J
Geodesy Geoinform 427–429:1593–1596
5. Dong C, Loy CC, He K, Tang X (2014) Learning a deep convolutional network for image
super-resolution. Eur Conf Comput Vis 8692:184–199
6. Kim J, Kwon Lee J, Mu Lee K (2016) Accurate image super-resolution using very deep
convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern
recognition, pp 1646–1654
7. Kim J, Kwon Lee J, Mu Lee K (2016) Deeply-recursive convolutional network for image super-
resolution. In: Proceedings of the IEEE conference on computer vision and pattern recognition,
pp 1637–1645
8. Tai Y, Yang J, Liu X (2017) Image super-resolution via deep recursive residual network. In:
Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2790–2798
9. Tong T, Li G, Liu XJ, Gao QQ (2017) Image super-resolution using dense skip connections. In:
Proceedings of the IEEE international conference on computer vision, IEEE computer society,
pp 4809–4817
10. Lim B, Son S, Kim H, Nah S, Lee KM (2017) Enhanced deep residual networks for single
image super-resolution. In: The IEEE conference on computer vision and pattern recognition
workshops, vol 1, pp 1132–1140
11. Zhang YL, Tian YP, Kong Y, Zhong BN, Fu Y (2018) Residual dense network for image
super-resolution. In: The IEEE/CVF conference on computer vision and pattern recognition,
pp 2472–2481
12. Lai WS, Huang JB, Ahuja N, Yang MH (2017) Deep Laplacian pyramid networks for fast and
accurate super-resolution. In: IEEE conference on computer vision and pattern recognition, pp
5835–5843
13. Lai WS, Huang JB, Ahuja N, Yang MH (2018) Fast and accurate image super-resolution with
deep Laplacian pyramid networks. IEEE Trans Pattern Anal Mach Intell 99
14. Wang YF, Perazzi F, Mcwilliams B, et al (2018) A fully progressive approach to single-image
super-resolution. In: Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition workshops, pp 977–986
15. Agustsson E, Timofte R (2017) Ntire 2017 challenge on single image super-resolution: dataset
and study. In: The IEEE conference on computer vision and pattern recognition workshops,
vol 3, p 2
16. Keys RG (1981) Cubic convolution interpolation for digital image processing. IEEE Trans
Acoust Speech Signal Process 37:1153–1160
17. Shi W, Caballero J, Huszár F, Totz J, Aitken AP, Bishop R, Rueckert D, Wang Z (2016) Real-
time single image and video super-resolution using an efficient sub-pixel convolutional neural
network. In: Proceedings of the IEEE conference on computer vision and pattern recognition,
pp 1874–1883
18. Dong C, Loy CC, Tang X (2016) Accelerating the super-resolution convolutional neural
network. In: European conference on computer vision. Springer, pp 391–407
19. Ledig C, Theis L, Huszár F, Caballero J, Cunningham A, Acosta A, Aitken A, Tejani A,
Totz J, Wang Z, et al (2017) Photo-realistic single image super-resolution using a generative
adversarial network. In: The IEEE conference on computer vision and pattern recognition, pp
105–114
20. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by
reducing internal covariate shift. In: International conference on machine learning, pp 448–456
21. Salimans T, Kingma DP (2016) Weight normalization: a simple reparameterization to accelerate
training of deep neural networks. Adv Neural Inform Process Syst 901–909
22. Yu JH, Fan YC, Yang JC et al (2018) Wide activation for efficient and accurate image super-
resolution. In: IEEE conference on computer vision and pattern recognition
23. Zhao H, Shi J, Qi X, Wang X, Jia J (2017) Pyramid scene parsing network. In: The IEEE
conference on computer vision and pattern recognition, pp 2881–2890
24. Park D, Kim K, Chun SY (2018) Efficient module based single image super resolution for
multiple problems. Proceedings of CVPRW, pp 995–1003
25. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich
A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer
vision and pattern recognition, pp 1–9
26. Li JC, Fang FM, Mei KF, Zhang GX (2018) Multi-scale residual network for image super-
resolution. In: European conference on computer vision. Springer, pp 527–542
27. Huang G, Liu Z, Weinberger KQ, van der Maaten L (2017) Densely connected convolutional
networks. Proceedings of CVPR, pp 2261–2269
28. Bevilacqua M, Roumy A, Guillemot C, Alberi-Morel ML (2012) Low-complexity single-image
super-resolution based on nonnegative neighbor embedding. In: Proceedings of the 23rd British
machine vision conference
29. Zeyde R, Elad M, Protter M (2010) On single image scale-up using sparse representations. In:
International conference on curves and surfaces. Springer, pp 711–730
30. Arbelaez P, Maire M, Fowlkes C, Malik J (2011) Contour detection and hierarchical image
segmentation. IEEE Trans Pattern Anal Mach Intell 33:898–916
31. Huang JB, Singh A, Ahuja N (2015) Single image super-resolution from transformed self-
exemplars. In: Proceedings of the IEEE conference on computer vision and pattern recognition,
pp 5197–5206
32. Matsui Y, Ito K, Aramaki Y, Fujimoto A, Ogawa T, Yamasaki T, Aizawa K (2017) Sketch-based
manga retrieval using manga109 dataset. Multimedia Tools Appl 76:21811–21838
33. Timofte R, De Smet V, Van Gool L (2014) A+: adjusted anchored neighborhood regression for
fast super-resolution. In: Asian conference on computer vision. Springer, pp 111–126
34. Ahn N, Kang B, Sohn KA (2018) Fast, accurate, and lightweight super-resolution with
cascading residual network. In: European conference on computer vision. Springer, pp 256–272
35. Ahn N, Kang B, Sohn KA (2018) Image super-resolution via progressive cascading residual
network. In: 2018 IEEE/CVF conference on computer vision and pattern recognition
workshops, pp 904–912
Track Related Bursty Topics in Weibo
Yuecheng Yu, Yu Gu, Ying Cai, Daoyue Jing, and Dongsheng Wang
Abstract Weibo has become an important means for people to share, disseminate
and obtain information in real time. BTM can effectively discover bursty topics in
Weibo, but cannot track the related bursty topics. Based on the time series of Weibo,
the binary word pair is used for topic modeling and the bursty topics in the Weibo are
extracted to form a new topic set. Then, the similarity of topics in adjacent time periods is calculated with the KL divergence, and the related bursty topics are tracked. The experimental results show that the method can effectively segment the
time series of Weibo topics, and realize the discovery and tracking of related topics
in Weibo.
1 Introduction
Social media has become the main way of information acquisition and information
dissemination in the past decade [1]. A variety of social media platforms provide
convenience for people to share, communicate, and collaborate on information [2]. As a social networking platform, Weibo has gradually become the preferred medium for people to express their opinions and report unexpected incidents [3]. Since information spreads very fast on Weibo, many sudden topics are often first revealed there [4]. Thus, the detection and discovery of sudden topics in Weibo has become a new research hotspot [5].
It should be noted that the detected bursty topic in Weibo tends to evolve over time.
Thus, it is not only necessary to discover sudden topics in time, but also to effectively
evaluate the evolution of topics in Weibo. The traditional topic evolution method is
mainly used for processing long texts with clear context, such as news corpora [6–9]. Weibo content is also composed of a series of texts, and text-based topic
2 Related Work
In the BTM model, the topic is modeled using a binary phrase as the basic unit [16].
In the Weibo scenario, let $B = \{b_1, \ldots, b_{N_B}\}$ be a short-text collection containing $N_B$ biterms and $K$ topics over $W$ unique words. Let $b_i = (w_{i,1}, w_{i,2})$ denote a biterm (word pair), and let $z \in [1, K]$ be the topic indicator variable specifying which topic a biterm is assigned to. The word distribution over topics is $\phi$, a $K \times W$ matrix whose $k$th row $\phi_k$ is a $W$-dimensional multinomial distribution. $\theta$ is a $K$-dimensional multinomial distribution representing the prevalence of topics in the collection. The model parameters are estimated with the Collapsed Gibbs algorithm [17], which realizes the discovery of bursty topics in Weibo.
Assume that $n_b^t$ is the number of times that the biterm $b$ appears in the microblogs published in time slice $t$, and let $\bar{n}_b^t$ be the average of $n_b^t$ over the previous $S$ slices, that is, $\bar{n}_b^t = \frac{1}{S}\sum_{s=1}^{S} n_b^{t-s}$. Let $\eta_b^t$ be the probability that the topic of the biterm $b$ is the same as the topic of the bursty topic; it can then be estimated according to formula (1), where $\varepsilon$ is a relatively small positive number used to avoid a zero probability.

$$\eta_b^t = \frac{\max\bigl(n_b^t - \bar{n}_b^t,\ \varepsilon\bigr)}{n_b^t} \quad (1)$$
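A small, hedged illustration of formula (1): given per-slice counts of one biterm, the burstiness probability can be computed as below (the counts and the window length S are placeholders).

import numpy as np

def burst_probability(counts, t, S=3, eps=1e-6):
    # eta_b^t = max(n_b^t - mean(n_b^{t-1..t-S}), eps) / n_b^t
    n_t = counts[t]
    n_bar = np.mean(counts[t - S:t])     # average over the S previous time slices
    return max(n_t - n_bar, eps) / n_t

counts = np.array([12, 15, 14, 90])      # a sudden jump in the last slice
print(burst_probability(counts, t=3))    # a value close to 1 marks a likely bursty biterm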
Specifically, let hz denote a topic selector, which is used to indicate whether the
topic is the same as the detected bursty topic. In our method, the value of hz is
sampled from the Bernoulli distribution. The topics with $h_z = 1$ are defined as related topics and are represented by the set $E_z = \{z \mid h_z = 1,\ z = 1, \ldots, K\}$.
To illustrate whether the topic of the biterm word pair is the related topic, a binary
indicator variable Y i is introduced. If Y i = 1, then the topic of the word pair is the
related topic. Otherwise, the topic of the word pair is not the related topic.
The algorithm of our method can be described as follows:
(1) For each biterm $b_i \in B$ of each document, the generated document (biterm set) can be obtained by sampling $\eta \sim \mathrm{Beta}(\gamma_0, \gamma_1)$ and the topic selector $h_z \sim \mathrm{Bern}(\eta)$, where $\vec{h} = \{h_z\}_{z=1}^{K}$; then, according to $\theta_i \sim \mathrm{Dir}(\alpha)$, the parameter of the document-topic distribution $\theta_i$ can be obtained.
(2) For each topic $k \in \{1, 2, \ldots, K\}$: if the topic is not a related topic, sample the background word distribution parameter $\phi_0 \sim \mathrm{Dir}(\beta)$; otherwise, sample the topic-word multinomial distribution parameter $\phi_k \sim \mathrm{Dir}(\beta)$.
(3) For each biterm $b_i = \langle w_{i,1}, w_{i,2}\rangle \in B$, sample the category label $Y_i \sim \mathrm{Bern}(\eta_{b_i})$:
Similar to BTM, model parameters are still estimated using the Collapsed Gibbs
algorithm [17]. In order to track the evolution of bursty topics, the bursty topics that have been detected and the topics captured by the evolution tracking model are combined into a new topic set, and then the Weibo content in the topic set is re-segmented by
time slice. Let $sub_1 = \{w_{1,1}, \ldots, w_{1,n}\}$ and $sub_2 = \{w_{2,1}, \ldots, w_{2,n}\}$ be subtopics in two adjacent time slices, let $P(i)$ be the probability of the $i$-th word in subtopic $sub_1$, and let $Q(i)$ be the probability of the $i$-th word in subtopic $sub_2$. The KL distance between the two topics can then be calculated according to formula (2), and a newly evolved topic can be detected [18].

$$D(P\|Q) = \sum_i P(i)\,\ln\frac{P(i)}{Q(i)} \quad (2)$$
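As a hedged numeric illustration of formula (2), the KL distance between the word distributions of two subtopics (the probabilities below are made up) can be computed as follows.

import numpy as np

def kl_distance(p, q, eps=1e-12):
    # D(P || Q) = sum_i P(i) * ln(P(i) / Q(i)); eps guards against zero probabilities
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Word distributions of two subtopics in adjacent time slices (illustrative numbers)
p = [0.50, 0.30, 0.15, 0.05]
q = [0.40, 0.35, 0.15, 0.10]
print(kl_distance(p, q))   # a small value means the subtopics are closely related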
4 Experimental Analysis
In order to verify the effectiveness of the proposed method in tracking the evolu-
tion of microblogging topic, we use the relevant corpus of the microblogging topic
“#Chongqing Wanzhou Bus Falling River#” as the experimental data set. Table 1
describes the related topics in each time slice, and fully describes the entire evolution
of the topic.
Figure 1 is the KL distance between different topics in adjacent time slices. As
shown in Fig. 1, the KL distance between adjacent time slices visually reflects the
evolution and transition between different topics in Table 1.
When the bursty topic “Chongqing Wanzhou bus crashed into the river” just
appeared, the focus of the topic was mainly on the visual description of the bus crash
accident. When the topic progressed to the second time slice, Weibo users began to pay attention to a new focus. As shown in Fig. 1, since the topics in these two time slices differ greatly, the corresponding KL values also differ greatly. In the second and third time slices, although the specific topics discussed were different, the discussion still revolved around the bus crash. From time slices 3 to 6, the KL value varies greatly; in fact, the focus of the discussion shifted from the initial mistaken belief that a car driving against traffic caused the bus to fall into the river to attention on the car's driver. When the content of the Weibo topic changes smoothly, the corresponding KL values remain relatively close.
5 Conclusion
Based on the traditional text probability model BTM, the method of tracking related
bursty topics in Weibo is proposed in this paper. With the help of the relevance of
Weibo content in time, the sudden topics discovered by BTM model are recombined
into a new topic set. Then, by calculating the similarity of the burst topics on the
adjacent time slices, the effective tracking of the related bursty topics is realized.
This method is helpful for observing the hot topics that Weibo users pay attention to and for predicting the trend of public opinion in real time. It also provides effective support for guiding public opinion and for tracing the sequence of hot events.
Acknowledgements This work is supported by National Natural Science Foundation of China No.
61702234, Science and Technology Support Project (Social Development) in Jiangsu Province No.
BE2014692. Science and Technology Support plan (Social Development) in Zhenjiang City No.
SH2015018.
References
7. Jensen S, Liu XZ, Yu YG (2016) Generation of topic evolution trees from heterogeneous
bibliographic networks. J Inf 4(2):606–621
8. Jo Y, Hopcroft JE, Lagoze C (2011) The web of topics: discovering the topology of topic
evolution in a corpus. WWW 2011-session: spatio-temporal analysis. ACM, Hyderabad, India,
pp. 257–266 (2011)
9. Lin C, Lin C, Li J et al (2012) Generating event storylines from microblogs. In: Proceedings
of the 21st ACM international conference on information and knowledge management. ACM,
New York, pp 175–184. https://doi.org/10.1145/2396761.2396787
10. Wei X, Bin Z, Genlin J (2016) Microblog topic evolution algorithm based on forwarding
relationship. Comput Sci 43(2):79–100
11. Mei Q, Zhai CX (2005) Discovering evolutionary theme patterns from text: An exploration of
temporal text mining. In: Proceedings of the eleventh ACM SIGK-DD international conference
on knowledge discovery in data mining. ACM, New York, pp 198–207. https://doi.org/10.1145/
1081870.1081895
12. Yanli H, Liang Z, Weiming Z (2012) A method of modeling and analysis of topic evolution.
Acta Autom Sinica 38(10):1690–1697
13. Ying F, Heyan H, Xin X (2014) topic evolution analysis for dynamic topic numbers. Chin J Inf
Sci 28(3):142–149
14. Jayashri M, Chitra P (2012) Topic clustering and topic evolution based on temporal parameters.
In: International conference on recent trends in information technology, IEEE, Chennai, India,
pp 559–564 (2012)
15. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
16. Yan X, Guo J, Lan Y, Cheng X-Q (2013) A biterm topic model for short texts. In: Proceedings
of the 22nd international conference on world wide web. Rio de Janeiro, Brazil, pp 1445–1456
(2013)
17. Zhao B, Xu W, Ji GL (2016) Discovering topic evolution topology in a microblog corpus. In:
Third international conference on advanced cloud and big data. CBD, China, pp 7–14 (2016)
18. Griffiths TL, Steyvers M (2004) Finding scientific topics. Natl Acad Sci 101(1):5228–5235
Terrain Classification Algorithm
for Lunar Rover Based on Visual
Convolutional Neural Network
Abstract With the development of society, lunar exploration capability has come to mark the level of aerospace science and technology, and it has received more and more attention. We propose a method that combines vision-based convolutional neural networks with ensemble learning to classify the current terrain from pictures taken by the onboard camera of the lunar rover. According to the classification result, the lunar rover can independently select a better path, avoiding the unnecessary trouble caused by the delay of Earth-Moon communication. The overall accuracy of our classification is 80%, and some classes have higher precision. It is expected that the classification results will help the decision making of path planning.
1 Introduction
2 Related Work
Google proposed NASNet [9], which was trained on 500 GPUs. The top-1 and top-5 accuracies in the ImageNet competition increased from 57.1 and 76.3% to 82.7 and 96.2%, respectively. It can be seen that the accuracy of image classification based on convolutional neural networks is constantly improving.
MIT's Karl Iagnemma classified the terrain into three categories [10]: rock, sand and beach grass, which gave good results. They achieved higher accuracy for sand and beach grass, but lower accuracy in distinguishing rocks.
Lauro Ojeda of the University of Michigan also did some research on terrain classification and terrain description [11]. They used a fully connected neural network with only one hidden layer to classify the terrain into five categories: gravel, grass, sand, pavement, and dirt. The final average highest accuracy reached 78.4%.
The first layer can be written as $H^{(1)} = g^{(1)}(W^{(1)}x + b)$, where $H^{(1)}$ is the output of the first layer, $g^{(1)}$ is a nonlinear activation function, $W^{(1)}$ is the weight matrix whose values need to be trained, $b$ is the bias, which also needs to be trained, and $x$ is the input vector. The second layer takes $H^{(1)}$, the output of the first layer, as its input, so the output of the $n$th layer of the network can be expressed as $H^{(n)} = g^{(n)}(W^{(n)}H^{(n-1)} + b^{(n)})$.
In theory, a neural network with more than two layers can represent an arbitrary function. From a practical point of view, however, training a deep network requires far fewer parameters than training a shallow network of equal capacity: the shallower layers have already extracted the basic features, and the higher layers only need to combine these basic features to obtain more complex features. This is similar to modularization in industrial production [19, 20].
We obtained lunar surface data of the Chang'e III mission from the network. After cleaning, cropping and labeling, we obtained 4,801 valid labeled sample images of size 784 × 576 × 3. According to the slipping effect of the lunar rover in different situations, the 4,801 sample images were divided into four categories: soft gravel topography, compacted soil topography, rocky terrain and concave land. Among them, compacted soil topography is the optimal travel choice, while soft gravel topography has a large slip ratio; rocky terrain and concave terrain should be avoided as much as possible. The following four sets of figures show the four types of sample features (Figs. 1, 2, 3 and 4).
The data types and quantities are as follows (Table 1).
3.2 Model
We used the classic AlexNet as a basis for improvement [7]. We divided the samples into three groups and used three AlexNets of the same structure to extract features and classify them. This is equivalent to fitting a classification standard to each model and obtaining a preliminary classification of the data from each one. Following the idea of ensemble learning, the preliminary results of the three models are then combined by voting. The voting weight of each model's result is non-linear; we fit it with a simple fully connected neural network. The model is as follows (Fig. 5).
The following is a detailed description of the model:
The first layer is the input layer: a picture of size 784 × 576 × 3.
The second layer is a convolutional layer. The input is convolved with 96 convolution kernels of size 11 × 11 × 3, with a stride of 4 and 96 bias terms. Here $p_{i,j}$ is the pixel value at row $i$ and column $j$, $D$ is the depth of the convolution kernel, $F$ is the size of the convolution kernel, $w_{d,m,n}$ is the weight of the convolution kernel at row $m$ and column $n$ of depth slice $d$, and $w_b$ is the bias term. The ReLU activation function is applied after the convolution calculation is complete:
f (x) = max(0, x)
$x$ is the input. After the activation result is obtained, the Local Response Normalization operation is performed:

$$p_{x,y}^{i} = a_{x,y}^{i} \Big/ \Bigl(k + \alpha \sum_{j=\max(0,\; i-n/2)}^{\min(N-1,\; i+n/2)} \bigl(a_{x,y}^{j}\bigr)^{2}\Bigr)^{\beta}$$

where $a_{x,y}^{i}$ denotes the output of the $i$-th convolution kernel, after the ReLU activation, at position row $x$, column $y$; $n$ is the number of adjacent kernel maps considered at the same position; $N$ is the number of convolution kernels; and $k = 2$, $\alpha = 10^{-4}$, $n = 5$, $\beta = 0.75$. After obtaining the convolution result, maximum pooling is performed on it:

$$H = D_{\max}^{\lambda,\tau}(C) = \operatorname{maxdown}_{\lambda,\tau}(C)$$

where $C$ is the input convolution region, sampled in blocks of size $\lambda \times \tau$; the maximum value in each block represents that block. Here $\lambda = \tau = 3$ and the stride is 2, which gives the max-pooled result.
The third layer is a convolutional layer. The results of the second layer are convolved with 384 convolution kernels of size 3 × 3 × 256, with a stride of 1 and 384 corresponding bias terms. The result is then passed through the ReLU activation function.
The fourth layer is a convolutional layer. 384 convolution kernels of size 3 × 3 × 384, with a stride of 1 and 384 corresponding bias terms, are used to convolve the results of the third layer. The result is then passed through the ReLU activation function.
The fifth layer is a convolutional layer. 256 convolution kernels of size 3 × 3 × 384, with a stride of 1 and 256 corresponding bias terms, convolve the result of the fourth layer. The result is then passed through the ReLU activation function.
The sixth layer is a pooling layer, which max-pools the result of the fifth layer.
The seventh layer is the fully connected layer, and the pooling result of the sixth
layer is expanded into 4096 neurons for linear calculation:
f = Wx + b
W is the weight of the model obtained through training, b is the bias term, and
x is the output result of the previous layer. After the result is obtained, a dropout
[21, 22] operation with a probability of 0.5 is performed. Dropout is an effective
regularization technique. The basic idea is to improve the generalization of neural
networks by preventing feature detectors from working together.
Fig. 6 The process of dropout randomly discarding nodes to form different networks
$z_{i,j} = x^{\mathsf T} W_{\cdot\cdot ij} + b_{ij}$, with $W \in \mathbb{R}^{d \times m \times k}$. This is because the Maxout activation function is a locally linear function, and dropout works better with linear activation functions.
The eighth layer is a fully connected layer: the result of the seventh layer is fed into 4096 neurons, a linear calculation is performed, and then a dropout operation with a probability of 0.5 is applied to obtain the result.
The ninth layer is the Softmax layer, which uses the Softmax function

$$\operatorname{softmax}(x_1, x_2, \ldots, x_n)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}},$$

where $n$ is the number of classes and each Softmax value is the probability of the corresponding class; the Softmax values obviously sum to 1. The Softmax function generates a label distribution over the 4 categories. The cross-entropy loss is

$$\mathrm{loss} = -\sum_{i=1}^{n} p(x_i)\log q(x_i),$$

where $p(x_i)$ is the true probability of $x_i$ and $q(x_i)$ is the probability of $x_i$ computed by the model. Cross entropy measures the distance between two distributions: the closer the actual distribution is to the predicted distribution, the smaller the cross entropy.
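For illustration, a minimal NumPy sketch of the Softmax and cross-entropy computations described above; the logits and the one-hot label are placeholders.

import numpy as np

def softmax(x):
    # Convert logits to class probabilities (numerically stabilized)
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cross_entropy(p_true, q_pred, eps=1e-12):
    # loss = -sum_i p(x_i) * log q(x_i)
    return float(-np.sum(p_true * np.log(q_pred + eps)))

logits = np.array([2.0, 0.5, -1.0, 0.1])    # scores for the four terrain classes
probs = softmax(logits)
label = np.array([1.0, 0.0, 0.0, 0.0])      # one-hot ground truth
print(probs, cross_entropy(label, probs))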
Finally, we solve this model with Adam-optimized gradient descent. When the parameters converge, we obtain the preliminary model.
The tenth layer is a simple fully connected layer with 500 neurons, making a
simple nonlinear transformation of the results from the previous models.
The eleventh layer is also a simple fully connected layer with only four neurons,
making a simple nonlinear transformation of the results from the tenth layer.
The twelfth layer is a Softmax layer that is similar to the previous Softmax layer
function, giving the final probability of each type.
The entire model is shown below (Fig. 7).
3.3 Train
We use one thread to read file names into a file-name queue, and decoder threads read the images into a memory queue via the file-name queue. This speeds up training: the CPU does not need to wait for I/O operations, and the GPU can keep performing matrix operations without waiting for the data to be decoded on the CPU.
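The described pipeline is the classic queue-runner pattern; a roughly equivalent, hedged sketch using the tf.data API is given below. The file pattern and batch size are placeholders, and all images are assumed to share the 784 × 576 × 3 size so they can be batched directly.

import tensorflow as tf

def decode(path):
    # Read and decode one JPEG sample image from disk
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    return tf.image.convert_image_dtype(image, tf.float32)

dataset = (tf.data.Dataset.list_files('lunar_samples/*.jpg', shuffle=True)
           .map(decode, num_parallel_calls=tf.data.AUTOTUNE)   # decode in parallel on the CPU
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))                        # keep the GPU fed with ready batches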
4.1 Accuracy
We use accuracy to judge the results of the model. First, based on the network structure in the following figure, we compute the loss function value and then calculate the accuracy at the current loss value (Fig. 9).
First, we read the data in bulk through a multi-threaded reader. Multiple images as
shown below are used as input to the model (Fig. 10).
Then we noticed that both the soft sand and the compacted soil samples contained interfering rocks. We initially tried to classify the samples into six categories: soft sand with more rocks, soft sand with fewer rocks, compacted soil with more rocks, compacted soil with fewer rocks, rocky terrain and concave land.
Fig. 11 The loss (left) and the accuracy (right) on training data with 2600 epochs
We divided the marked data into three parts. We take two of them and put each
picture into the network and train it. The accuracy rate is as follows (Table 2).
In terms of model training, Model-1-1 and Model-1-2 are obtained by performing
1300 epochs and 2600 epochs on the same data. We observe the loss function graph
and the accuracy graph during the training process (Fig. 11).
We found that the model had already converged after 1300 training epochs, so no extra training is needed. The processes of training Model-2-1 and Model-2-2 are the same.
In terms of accuracy, we found that the model classifies new data poorly, which means that it has very low generalization ability. Analyzing the reasons, we find that in the classification results the model's judgment of the number of rocks is fuzzy. The essential reason is that, in the samples, the annotator's judgment of the rocks is also fuzzy. We believe this boundary ambiguity should be removed, so we decided to divide the samples into four categories: soft sand, compacted soil, rocky terrain and concave land (Table 3).
We divided the marked data into three parts. Put each picture into the network
and train 1300 epochs. The accuracy rate is as follows (Fig. 12).
The loss function value and accuracy of each training batch are as follows (Fig. 13).
Fig. 12 The loss (left) and the accuracy (right) on training data with 1300 epochs
Fig. 13 The loss (left) and the accuracy (right) of the entire model
We found that the model converges with higher accuracy and lower loss function values.
Finally, we use the idea of ensemble learning to fuse the three models: we train a simple neural network on the results of the three models and take its output as the final output. This process is similar to voting, but the weight of each vote is non-linear and is itself a neural network that needs to be trained.
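A hedged sketch of this fusion step in tf.keras: the class-probability vectors of the three base models are concatenated and passed through a small fully connected network that learns the non-linear vote. The 500-unit hidden layer mirrors the tenth layer described earlier; everything else is illustrative.

import numpy as np
import tensorflow as tf

# Fusion network: concatenated class probabilities of three models -> 4 terrain classes
fusion = tf.keras.Sequential([
    tf.keras.Input(shape=(3 * 4,)),
    tf.keras.layers.Dense(500, activation='relu'),
    tf.keras.layers.Dense(4, activation='softmax'),
])
fusion.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# p would be the concatenated (N, 12) probability outputs of the three base models
# on the same N images; a random placeholder stands in for it here.
p = np.random.rand(8, 12).astype('float32')
print(fusion(p).shape)   # (8, 4) fused class probabilities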
The loss function value and accuracy of each training batch are as follows. Obviously, as the number of training batches increases, the loss function value decreases and the accuracy rises; the model has basically converged after 12,000 iterations. Finally, the average accuracy stabilizes at 80%, whereas the average accuracy of the previous methods was 78.4% [10, 11]. The fused overall model is therefore more accurate than any single model (Table 4).
Here, Train is the accuracy of the model on its training split, Val is the accuracy on its validation split, and All is the accuracy on the overall sample.
For the compacted soil, the recognition rate of our model reached 87.96%, the
recognition rate of soft sand was 79.1%, the recognition rate of rock topography
reached 71.2%, and the recognition rate of concave land reached 33.4%. We believe
that the reason for the poor recognition rate of concave land compared to other terrains
is that there are only 368 concave topographic maps in our sample. Conversely, there
are 2,765 compacted soils with the highest recognition accuracy in our sample,
indicating that the model requires a large number of samples to learn, in order to
truly learn to extract features and correctly classify the terrain.
There are four possible situations in the prediction results: true positive, false positive, true negative and false negative, abbreviated TP, FP, TN and FN respectively, each representing the corresponding number of samples. Obviously,

$$\mathrm{FPR} = \frac{FP}{TN + FP}$$
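A hedged sketch of computing the false positive rate of the formula above, treating one terrain class against the rest; the labels below are placeholders.

import numpy as np

def false_positive_rate(y_true, y_pred, positive_class):
    # FPR = FP / (TN + FP) for a one-vs-rest view of a single class
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fp = np.sum((y_pred == positive_class) & (y_true != positive_class))
    tn = np.sum((y_pred != positive_class) & (y_true != positive_class))
    return fp / (tn + fp)

# Placeholder labels: 0 = soft sand, 1 = compacted soil, 2 = rocky, 3 = concave
y_true = [0, 1, 1, 2, 3, 1, 0, 2]
y_pred = [0, 1, 0, 2, 1, 1, 0, 3]
print(false_positive_rate(y_true, y_pred, positive_class=1))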
The original data of this study came from the Chang'e III's camera. We apply convolutional neural networks and ensemble learning to the identification of lunar surface terrain, and propose a multi-network ensemble learning model that predicts the terrain categories within the field of view with good accuracy. Although neural networks require far less computation and hardware support for feedforward inference than for backpropagation training, the current consensus is that deeper networks with more parameters tend to give better results, and this is still demanding in terms of computation speed, hardware and power supply for machines in many special working environments (such as lunar rovers). Fortunately, network design no longer relies only on human experience: the development of AutoML [27] has allowed machines to design neural networks that are small yet comparable to those designed by human experts, such as NASNet [9], PNASNet [28], MnasNet [29], and so on. In future research, we will also move toward this fully automated design of neural network structures.
References
1. Serikawa S, Lu H (2014) Underwater image dehazing using joint trilateral filter. Comput Electr
Eng 40(1):41–50
2. Lu H, Li Y, Mu S, Wang D, Kim H, Serikawa S (2018) Motor anomaly detection for unmanned
aerial vehicles using reinforcement learning. IEEE Internet Things J 5(4):2315–2322
3. Lu H, Li Y, Chen M, Kim H, Serikawa S (2018) Brain intelligence: go beyond artificial
intelligence. Mob Netw Appl 23:368–375
4. Lu H, Wang D, Li Y, Li J, Li X, Kim H, Serikawa S, Humar I (2019) CONet: a cognitive ocean
network. IEEE Wirel Commun. In Press
5. Lu H, Li Y, Uemura T, Kim H, Serikawa S (2018) Low illumination underwater light field
images reconstruction using deep convolutional neural networks. Fut Gen Comput Syst 82:142–
148
6. Deng J, Dong W, Socher R et al (2009) ImageNet: a large-scale hierarchical image database. In:
2009 IEEE computer society conference on computer vision and pattern recognition (CVPR
2009), 20–25 June 2009, Miami, Florida, USA. IEEE
7. Krizhevsky A, Sutskever I, Hinton G (2012) ImageNet classification with deep convolutional
neural networks. NIPS. Curran Associates Inc.
8. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image
recognition. Comput Sci
9. Zoph B, Vasudevan V, Shlens J et al (2017) Learning transferable architectures for scalable
image recognition
10. Brooks CA, Iagnemma K (2012) Self-supervised terrain classification for planetary surface
exploration rovers. J Field Robot 29(3):445–468
11. Ojeda L, Borenstein J, Witus G et al (2006) Terrain characterization and classification with a
mobile robot. J Robot 23(2):103–122
12. Hubel D, Wiesel T (1959) Receptive fields of single neurones in the cat’s striate cortex. J
Physiol 143(3):574–591
13. Hubel D, Wiesel T (1962) Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J Physiol 160(1):106–154
14. LeCun Y, Boser B, Denker J, Henderson D, Howard R, Hubbard W, Jackel L (1989) Backpropagation applied to handwritten zip code recognition. Neural Comput 1(4):541–551
15. Behnke S (2003) Hierarchical neural networks for image interpretation. Springer, Berlin,
Heidelberg
16. Simard P, Steinkraus D, Platt J (2003) Best practices for convolutional neural networks applied
to visual document analysis. ICDAR 2:958
17. Ahn SM (2016) Deep learning architectures and applications. J Intel Info Syst 22(2):127–142
18. Ogiela L, Ogiela MR (2012) Advances in cognitive information systems. Cognit Syst Monogr
17:1–18
19. Seide F, Li G, Yu D (2011) Conversational speech transcription using context-dependent
deep neural networks. In: Interspeech
20. Zeiler MD, Fergus R (2014) Visualizing and understanding convolutional networks. In:
Computer vision–ECCV 2014, pp 818–833
21. Hinton GE, Srivastava N, Krizhevsky A et al (2012) Improving neural networks by preventing
co-adaptation of feature detectors. Comput Sci 3(4):212–223
22. Wan L, Zeiler M, Zhang SX et al (2013) Regularization of neural networks using dropconnect.
In: Proceedings ICML, pp 2095–2103
23. Goodfellow IJ, Wardefarley D, Mirza M et al (2013) Maxout networks. In: International
conference on machine learning
24. Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: International
conference on learning representations
25. Fawcett T (2006) An introduction to ROC analysis. Pattern Recogn Lett 27(8):861–874
26. Hand DJ, Till R (2001) A simple generalisation of the area under the ROC curve for multiple
class classification problems. Mach Learn 45(2):171–186
27. Kaul A, Maheshwary S, Pudi V et al (2017) AutoLearn—automated feature generation and
selection. In: International conference on data mining, pp 217–226
28. Liu C, Zoph B, Neumann M et al (2018) Progressive neural architecture search. In: European
conference on computer vision, pp 19–35
29. Tan M, Chen B, Pang R et al (2018) MnasNet: platform-aware neural architecture search for
mobile. Comput Vis Pattern Recogn 2820–2828
An Automatic Evaluation Method
for Modal Logic Combination Formula
1 Introduction
Modern modal logic appeared in the 1910s. With the diversified development of modal
logic research, applications beyond philosophical analysis have become increasingly
important directions of study, including mathematics [1], the social sciences [2],
economics [3], engineering [4], and computer science [5, 6]. Modern modal logic is
applied in the formal representation of domain knowledge in intelligent systems, in
knowledge communication and reasoning, in the description and verification of
protocols, and in the design of knowledge-based decision support systems [7]. XYZ/E,
a sequential logic language based on modal logic, has been successfully applied to
hardware systems [8], to the behavioral description and verification of hybrid systems
[9], and to the interpretation of program semantics. The model checking technique
proposed by Clarke and Emerson [10] uses modal logic formulas to describe the
properties of a system. In multi-agent systems, describing knowledge and belief with
modal logic resolves the ambiguity of intentional concepts such as "belief" and "wish"
in first-order logic, giving clearer expressive power and more effective forms of
reasoning [11] that can be applied to a given set of tasks and agents. When tasks are
assigned to different agents, the corresponding policies must be applied, and the choice
and representation of a strategy can be formalized as the truth value of a compound
formula in modal logic. Modal logic has thus moved from philosophy into applications
across many disciplines.
Compound formula evaluation is an important branch of modern modal logic. With
manual evaluation based on axioms and deduction rules, the effort of computing the
truth values of a compound formula remains manageable for a relatively simple state
space. For a possible-world model with a complex state space, however, all possible
worlds involved in the propositional formula must be considered, and the truth values
of the propositions differ across reachability relations, so manual evaluation of
compound formulas becomes difficult and error-prone.
To improve the efficiency and correctness of modal logic compound formula
evaluation, this paper proposes to implement automatic evaluation of compound
formulas of modern modal logic, in every possible world, based on Haskell functional
programming. Haskell functional programming differs from conventional imperative
programming: it abstracts computation as expression evaluation, and its expressions
are built from pure mathematical functions, which correspond closely to the formulas
of modal logic and lend themselves to the formal expression of axioms and deduction
rules. First, we build a model of modal propositional logic. Then, for each possible
world in the model, we fix the assignment of all primitive propositions according to
the model's valuation function, and the truth-value definitions determine the truth
value of the compound formula in each possible world. We then propose an automatic
compound formula evaluation algorithm based on Haskell functional programming.
Finally, we demonstrate the application of the method with an experiment.
The rest of the paper is organized as follows. Section 2 discusses the related concepts
of modal logic compound formulas and common evaluation methods. Section 3 gives
definitions of concepts such as propositions and possible worlds in modal logic,
introduces the model of modal propositional logic, and gives the relevant theorems
and deduction rules. Section 4 introduces the automatic evaluation algorithm for modal
logic compound formulas. Section 5 discusses the application of the method on a
specific case model. The final section gives the conclusion and future work.
2 Related Work
The description of knowledge and formal reasoning are important issues in the study
of intelligent knowledge systems. In traditional modal logic, the truth values of a
compound formula are computed either by manual deduction or by the resolution
method. The manual deduction method proceeds as follows. Step 1: define a non-empty
context set K containing all possible situations. If the situation has a temporal structure
such as "always" or "every", the context set contains descriptions of time; if it has a
modal structure such as "necessary" or "possible", the context is taken to cover all
possible situations. Step 2: formalize the relevant strategy or knowledge as primitive
propositions, and superimpose modal operators and logical connectives on the
primitive propositions to represent the corresponding modal propositions. Step 3:
estimate the truth value of the compound proposition in each possible world based on
a specific context k taken from the context set K. The disadvantage of this method is
that the computation becomes difficult to control when the truth values of a compound
formula must be calculated by hand over a relatively complicated state space.
The semantic tableau method proposed by Beth and Hintikka extends the formula
construction set and is generally applicable to formula reasoning in logic systems. On
the basis of the classical semantic tableau, Liu Quan and Sun Jigui applied non-classical
logic automatic reasoning methods that reinterpret symbols, extend symbols, modify
the closure rules, change the tree extension method, add edge information, and use
dual trees, improving the efficiency of machine inference on non-classical logic
formulas in the tableau [12].
Zhang Jian proposed the translation method for modal logic reasoning [13]: the modal
logic formula is translated into a classical logic formula according to certain rules, and
a traditional theorem prover is then used for reasoning. This method theoretically
preserves the decidability of propositional modal logic.
In the research on modal logic formula derivation, Liu Lei, Wang Qiang, and Lü Shuai
proposed extending the tableau method of classical modal logic to FPML and gave a
reduction strategy based on FPML tableau rules and fuzzy assertion sets [14]. On this
basis, the definitions of inconsistency and inconsistency estimates in FPML are given.
Finally, the tableau-based FPML consistency detection method TFPML and the CID
of the fuzzy assertion set are given, which demonstrates the validity and correctness
of the method for modal logic formula derivation.
When Zhou Juan and Li Chao studied the modal inference problem, deductive
reasoning in modal logic was first transformed by the necessary formal methods into
Łukasiewicz multi-valued logic, and the multi-valued logic was then converted into
Boolean logic [15]. The results show that, compared with other methods, the inference
engine is universal and computationally simple, and avoids the unreasonable
application of inference rules in modal logic.
In addition, another common method is the resolution method. Sun Jigui and Liu
Xuhua proposed label-based modal resolution reasoning [16], which overcomes the exces-
Modal logic is a complex subject. This section introduces the definitions of modal
logic, modal operators, and modal propositions, and models them according to the
relevant knowledge points in preparation for the algorithm of Sect. 4.
The concept of the possible world comes from G. W. Leibniz's idea of non-contradictory
possibility [20]: as long as a combination of states of affairs does not lead to a logical
contradiction, that combination of states of affairs is possible. In this view, Leibniz
proposed that a world is a combination of possible things: the real world is the
combination of all possible things that actually exist (the richest combination), and
different combinations of things form many possible worlds. A possible world is
therefore a combination of possible things used to express modal assertions.
There are three types of modal words in modal logic: the first type expresses the
"state" of things, such as "inevitable" and "possible"; the second type expresses the
"attitude" of an agent, such as "know" and "believe"; the third type expresses the
"tense" of a process, such as "future" and "always". The combination of modal words
and primitive propositions forms new modal propositions.
Propositional form: M_a p, where a is the cognitive agent, M is a cognitive modal word,
and p is an arbitrary proposition.
A context refers to the state of a person or thing. The contexts of possible worlds are
closely related to the type of modal word: if the modal word expresses tense, such as
"future" or "always", the context is a description of a moment in time.
A model M is defined as follows:
1. a non-empty context set K;
2. a binary relation R on K, called the reachability relation;
3. a valuation function V which, for each context k ∈ K, assigns to each primitive
proposition p a truth value V_k(p).
The collection of all contexts is called the context set (represented by K). The truth
value of a modal proposition depends not only on the currently given context but also
on other contexts, according to the specific meaning of the modal word. This means
that a valuation function assigning truth values according to a specific context k (taken
from the context set K) is used, instead of simply assigning an absolute truth value to
the formula by hand.
The reachability relation is a binary relation (represented by R) on the set K; a context
k′ that is relevant to evaluation in a context k is said to be reachable from k.
Suppose the propositional forms include p and ¬p, and write "necessarily" as □ and
"possibly" as ♦; then:
Axiom 1: ¬□¬p ↔ ♦p
Axiom 2: □p ↔ ¬♦¬p
These two axioms express the relationship between the "necessary" operator and the
"possible" operator, which can be seen intuitively from their modal readings. Through
these axioms the two operators can be converted into each other freely, and they serve
as the basis for deriving the evaluation of modal logic compound formulas.
Let M be a model over the set of possible worlds W, with R and V the reachability
relation and the valuation function respectively. For the truth value of p in a world
w ∈ W, the relevant rules are as follows:
1. V_{M,w}(p) = V_w(p), if p is a primitive proposition;
2. V_{M,w}(¬p) = true, if and only if V_{M,w}(p) = false;
3. V_{M,w}(□p) = true, if and only if for every w′ ∈ W with wRw′, V_{M,w′}(p) = true;
4. V_{M,w}(♦p) = true, if and only if there exists w′ ∈ W with wRw′ such that
V_{M,w′}(p) = true.
Rule 3 above defines the "necessary" modal operator □, meaning that the primitive
proposition p is true in all reachable worlds; rule 4 defines the "possible" modal
operator ♦, meaning that p is true in at least one reachable world.
In the research process of this paper, we fill the model with specific data; the filled
model M+ is as follows:
1. a non-empty set of possible worlds W = {w1, w2, …, wn} containing n possible
worlds;
2. a binary relation R on W, R = {(w1, w2), (w2, w2), …, (wi, wj), …, (wk, wn)},
called the reachability relation set, where (wi, wj) means that wi can reach wj, and
i = j means that the possible world can reach itself;
3. a valuation function V which, for each world w ∈ W, assigns each primitive
proposition p a truth value V_w(p). Once the model is established, we fix the
valuation V, that is, the truth value of the proposition in every possible world:
V_{wi}(p) = true and V_{wj}(p) = false indicate that the primitive proposition is true
in the possible world wi and false in wj.
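For concreteness, the filled model M+ can be written down directly as data. The
following sketch (in Python, for illustration only; the paper itself works in Haskell)
represents the reachability relation as a list of world-name pairs and the valuation as a
list of (world, truth value) pairs, in the same format as the experimental input used in
the verification later in the paper:

# Illustrative data representation of the filled model M+ (not the paper's code).
# Reachability relation R: a pair (wi, wj) means that world wi can reach world wj.
model = [("W1", "W2"), ("W2", "W2"), ("W2", "W3")]
# Valuation V: (world, truth value of the primitive proposition p in that world).
proposition = [("W1", True), ("W2", True), ("W3", False)]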
We will discuss the truth values of □p and ♦p, and of formulas in which modal
operators are nested, such as ♦□p, in each possible world. We need an efficient
algorithm that quickly and accurately obtains the truth values of modal logic compound
propositions as the number of superimposed modal operators grows and the complexity
of the world state space increases.
Algorithm 1 computes the truth value, in a target possible world, of a proposition
under the possible operator. First, a for loop traverses the reachability relation of the
model to find the worlds reachable from the target possible world. Using Haskell
functional programming as the tool, we abstract this as a function filtrate. The signature
filtrate::String->Model->[String] indicates that the function takes a string and a value
of type Model as input and returns a list of strings; the String is the target possible
world and the Model is the set of reachability pairs. Next, a double loop is used. The
inner loop finds the truth value of the proposition in a given possible world; the
corresponding function findbool::String->Proposition->Bool takes a string and a
collection of type Proposition as input and returns a Bool, the truth value of the
proposition in the possible world named by the String. The outer loop traverses the
possible worlds, collects all their truth values, and judges the overall truth value
according to the definition. The function posvfun::Proposition->Model->String->Bool
takes the Proposition pairs, the Model pairs, and the String naming the target possible
world as inputs, judges the truth value, and outputs a value of type Bool.
Algorithm 2 is similar to Algorithm 1, and the filtrate and findbool functions are the
same. When traversing the possible worlds, the function necvfun::Proposition->Model->
String->Bool is designed according to the difference between the possible operator
and the necessity operator in how the truth value is determined: although the input and
output types are the same, the procedure differs. In the outer loop of the double loop,
Algorithm 2 counts the number of reachable worlds in which the proposition is true,
whereas Algorithm 1 only determines whether some reachable world makes it true.
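The paper implements these functions in Haskell; the Python sketch below (illustrative
only, reusing the data layout of the model above) mirrors the described behaviour:
filtrate collects the worlds reachable from the target world, findbool looks up the truth
value of p in a given world, and posvfun and necvfun evaluate ♦p and □p by testing
whether p holds in some, respectively every, reachable world.

def filtrate(world, model):
    # Traverse the reachability pairs and collect the worlds reachable from `world`.
    return [suc for pre, suc in model if pre == world]

def findbool(world, proposition):
    # Truth value of the primitive proposition p in the given possible world.
    return dict(proposition)[world]

def posvfun(proposition, model, world):
    # V_{M,w}(possibly p): true iff p is true in at least one reachable world.
    return any(findbool(w, proposition) for w in filtrate(world, model))

def necvfun(proposition, model, world):
    # V_{M,w}(necessarily p): true iff p is true in every reachable world.
    return all(findbool(w, proposition) for w in filtrate(world, model))

For example, posvfun(proposition, model, "W1") evaluates to True for the model M+
above, since W2 is reachable from W1 and p is true there; this matches the verification
result reported in the experiment section.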
Theorem 1 Let W = {w1, w2, …, wn}, R = {(w1, w2), …, (wi, wi), …, (wk, wn)}, and
V_{M,wi}(p) = true. Then V_{M,wi}(♦p) = true.
Theorem 1 states that when a possible world is self-reachable and the primitive
proposition is true in that world, the proposition under the action of the possible
operator is also true in that world.
Algorithm 1: POSOPERATOR(WN, PRO, MODEL)
Input: WN, PRO, MODEL
Output: Bool
1 List ← ∅
2 for ∀(pre, suc) ∈ MODEL do
3   if pre == WN then
4     List ← List ∪ {suc}
5   end
6 end
7 flag ← false
8 for ∀(Pw, Bo) ∈ PRO do
9   for ∀i ∈ List do
10     if i == Pw and Bo == true then
11       flag ← true
12     end
13     if flag == true then
14       return true
15     end
16   end
17 end
18 return false
Algorithm 2: NECOPERATOR(WN, PRO, MODEL)
Input: WN, PRO, MODEL
Output: Bool
1 List ← ∅
2 for ∀(pre, suc) ∈ MODEL do
3   if pre == WN then
4     List ← List ∪ {suc}
5   end
6 end
7 flag ← 0
8 for ∀(Pw, Bo) ∈ PRO do
9   for ∀i ∈ List do
10     if i == Pw and Bo == true then
11       flag ← flag + 1
12     end
13   end
14 end
15 if flag == length of List then
16   return true
17 end
18 return false
To verify the correctness of the algorithm and the theorem proposed in the previous
section, we first construct a modal logic model with multiple possible worlds and, on
this basis, verify the validity and accuracy of the proposed algorithm. Following the
experimental model of Sect. 3.3, the model M+ is given, in which the world w2 is
self-reachable, and the correctness of Theorem 1 is demonstrated by the automatic
evaluation of the formula under the possible operator in the world w2. The model is
as follows (Fig. 1).
Verification: in w1,
Input: model = [("W1", "W2"), ("W2", "W2"), ("W2", "W3")];
proposition = [("W1", True), ("W2", True), ("W3", False)];
Output: posvfun proposition model "W1" = True;
necvfun proposition model "W1" = True;
Because w1Rw2 and V_{M,w2}(p) = true,
V_{M,w1}(♦p) = true;
and because w2 is the only world reachable from w1,
V_{M,w1}(□p) = true.
Verification result: the experimental result is correct.
Verification: in w2,
Input: model = [("W1", "W2"), ("W2", "W2"), ("W2", "W3")];
proposition = [("W1", True), ("W2", True), ("W3", False)];
Comparing the experimental results with the verification results, we can see clearly
that the modal logic compound formula evaluation algorithm based on Haskell
functional programming is accurate and efficient. In the model M+, the possible world
w2 is self-reachable and the primitive proposition p is true at w2, so it can be deduced
that the proposition under the action of the possible operator is true in that possible
world, and the experimental results confirm the correctness of this theorem. In
addition, we used the algorithm to compute truth values under overlapping (nested)
modal operators, and the results were likewise correct and efficient. In modal
propositional logic the propositional forms include not only □p and ♦p but also □¬p,
□♦p, ♦□p, and so on, and these can be evaluated with the same algorithm. Taking ♦□p
as an example: to compute the truth value of ♦□p in a possible world wi, we first
compute the truth value of □p in each possible world, treat these values as a whole,
and then apply posvfun to compute the truth value of the proposition under the possible
operator in wi. Once the above algorithm is mastered, truth values can be computed
on larger and more complex models M, and compound overlapping of modal operators
can be handled simply by calling the algorithm multiple times.
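As a small illustration of this nesting (using the sketch functions defined earlier in this
section; this is not the paper's code), ♦□p at w1 can be obtained by first evaluating □p
in every possible world and then feeding those values back into posvfun as a new
valuation:

worlds = ["W1", "W2", "W3"]
# Step 1: evaluate "necessarily p" in every possible world of M+.
box_p = [(w, necvfun(proposition, model, w)) for w in worlds]
# Step 2: treat these truth values as a new valuation and evaluate
# "possibly (necessarily p)" in world W1.
result = posvfun(box_p, model, "W1")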
Acknowledgments This work was supported by the National Natural Science Foundation of China
under Grant 61872313, the key research projects in education informatization in Jiangsu province
(20180012), in part by the Postgraduate Research and Practice Innovation Program of Jiangsu
Province under Grant KYCX18 2366, and in part by the Yangzhou Science and Technology under
Grant YZ2017288, YZ2018076, YZ2018209, and Yangzhou University Jiangdu Highend Equip-
ment Engineering Technology Research Institute Open Project under Grant YDJD201707, and
Jiangsu Practice Innovation Training Program for College Students under Grant 201811117029Z.
References
14. Liu L, Wang Q, Lü S (2017) Tableau method of fuzzy propositional modal logic. J Harbin Eng
Univ 2017(6)
15. Zhou J, Li C (2015) Modal logic inference engine based on Lucasiwitz multi-valued calculus.
J Hubei Univ Natl Nat Sci Edn 3:285–289
16. Jigui S, Xuhua L (1996) Marking modal resolution reasoning. J Softw A00:156–162
17. Weimin P, Tuyun C (1997) A new modal resolution. Chin J Comput 20(8):711–717
18. Robinson JA (1965) A machine-oriented logic based on the resolution principle. J ACM 12
19. Cerro LFD (1982) A simple deduction method for modal logic. Inf Process Lett 14(2):49–51
20. Changle Z (2001) Introduction to cognitive logic. Tsinghua University, Beijing, pp 7–9
Research on CS-Based Channel
Estimation Algorithm for UWB
Communications
Wentao Fu, Xingbo Dong, Xiyan Sun, Yuanfa Ji, Suqing Yan,
and Jianguo Song
1 Introduction
Ultra-wideband (UWB) technology has been widely used in various fields such as
through-wall radar, indoor positioning, and high-speed wireless LAN [1–6]. UWB uses
nanosecond or sub-nanosecond narrow pulses with bandwidths up to several GHz.
According to the Shannon-Nyquist sampling theorem, the sampling rate must be at
least twice the bandwidth to sample and receive the signal without distortion; for a
0.7 ns UWB pulse, the sampling rate therefore needs to be as high as several tens of
GHz. Such a high sampling rate places heavy demands on the hardware technology,
which increases both the workload and the cost.
The quality of the channel estimation technique [7] determines the performance of a
UWB system, but channel estimation places high demands on the sampled values.
How to estimate the channel accurately at a low sampling rate is therefore an urgent
problem to be solved.
Compressed sensing is a technique for extracting and reconstructing sparse signals at
a low sampling rate [8, 9], breaking the constraint of the traditional Shannon sampling
theorem. The idea of applying compressed sensing to UWB channel estimation has
therefore attracted wide attention from scholars all over the world. Reference [10]
focuses on the design of observation matrices and over-complete dictionaries; to avoid
amplifying noise, a filter module is added at the transmitting end, and the channel is
reconstructed with the DS algorithm, the basis pursuit denoising algorithm, and the
Orthogonal Matching Pursuit (OMP) algorithm. References [11, 12] combine Bayesian
theory with compressed sensing, adaptively set hyperparameters for the reconstructed
vector, and use a maximized marginal likelihood algorithm for filtering and
reconstruction. In [13], the UWB channel is sampled at a sub-Nyquist rate and the
OMP algorithm is used to estimate the sparse ultra-wideband channel. All of the above
algorithms assume that the sparsity is known. In a real UWB transmission system,
however, the number of multipath components of the channel is random and
unpredictable. The reconstruction algorithm in [14, 15] uses the sparsity adaptive
matching pursuit (SAMP) algorithm to reconstruct the UWB channel with unknown
sparsity, but its iteration stops when the residual falls below a fixed threshold, and
since the noise is random there is no fixed, reasonable termination threshold.
For the problem of UWB channel reconstruction with unknown sparsity, this paper
proposes an ASSP algorithm, a sparse reconstruction algorithm based on subspace
pursuit. First, the UWB channel estimation model is established; the sparsity is then
adjusted according to the minimum residual, and an empirical rule for the initial setting
of the sparsity is given. Finally, simulations are designed to verify the reconstruction
performance and channel estimation performance of the proposed ASSP algorithm.
The rest of this paper is organized as follows. Section 2 discusses related work.
Section 3 introduces the proposed algorithm; this section mainly describes the adaptive
sparsity subspace algorithm, including the UWB channel model, the principle of
adaptive sparsity, and the steps of the improved algorithm. Section 4 presents the
evaluation of our experimental results for the proposed ASSP algorithm, and Sect. 5
presents our conclusions.
2 Related Work
With the advent of the information age, the contradiction between data volume and
data processing capability has intensified. Compressed sensing theory was proposed
to resolve this contradiction: it compresses and samples a sparse signal of length N
into a signal of length M (M < N) and processes the latter. The mathematical model is

y = Φx   (1)
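As a minimal numerical illustration of Eq. (1) (a sketch, not the authors' code; the
dimensions follow the simulation parameters used later, N = 300, M = 70, sparsity 20),
a sparse signal is compressed by a random Gaussian observation matrix Φ:

import numpy as np

rng = np.random.default_rng(0)
N, M, K = 300, 70, 20                    # signal length, number of measurements, sparsity
x = np.zeros(N)
x[rng.choice(N, size=K, replace=False)] = rng.standard_normal(K)   # K-sparse signal
Phi = rng.standard_normal((M, N)) / np.sqrt(M)                     # random Gaussian observation matrix
y = Phi @ x                                                        # compressed measurements, Eq. (1)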
2.2 SP Algorithm
The difference between the SP algorithm and the OMP algorithm lies mainly in how
the set of reconstruction atoms is generated. The OMP algorithm and its improved
variants select one optimal atom per iteration and add it to the atom index set, one
atom per iteration, until K iterations have been performed, and then reconstruct the
channel from the resulting atom set. The SP algorithm instead selects K optimal atoms
in each iteration and gradually refines the selection by continuously updating the atom
set until the requirement is met. A backtracking idea is used in the refinement process,
so the optimal atom set can be found systematically. The algorithm can reconstruct
any sparse signal provided the measurement matrix satisfies the RIP principle, and it
is widely used in practice. The implementation steps are as follows.
Input parameters: observation vector y, sensing matrix A, sparsity K.
Output parameter: K-sparse approximation of h.
Initialization: the atom index set T^0 is set to the indices of the first K largest values
of u = |A^T y|. The residual y_r^0 is updated with the initialized atom index set T^0:

y_r^0 = y − A_{T^0} (A_{T^0}^{T} A_{T^0})^{−1} A_{T^0}^{T} y   (3)
x(t) = \sum_{i=−∞}^{+∞} b_i p(t − i T_p) = \sum_{i=−∞}^{+∞} \sum_{j=0}^{N_s−1} b_{ij} p(t − i T_p − j T_s)   (8)
where p(t) is the transmitted pulse, T_p is the period of each packet, T_s is the frame
period, b_{ij} is the i-th data symbol of the j-th data packet, and N_s is the number of
data symbols in one data packet. Referring to the IEEE 802.15.4a channel model, the discrete
UWB channel model in a complex multipath environment is [24]:
h(t) = \sum_{i=0}^{L−1} α_i δ(t − τ_i)   (9)

r(t) = x(t) ∗ h(t) + n(t) = \sum_{l=0}^{L−1} α_l x(t − τ_l) + n(t)   (10)
r = xh + n   (11)

y = Φr = Φxh   (12)

y = Ah   (13)
The channel h of the UWB system is a sparse channel, and the mathematical models
of Eqs. (13) and (1) are identical. The UWB channel estimation problem can therefore
be converted into the problem of recovering the sparse channel h from the observation
vector by a compressed sensing reconstruction algorithm, provided A satisfies the RIP
principle.
For the sparsity of the input signal, this paper adapts the estimate by comparing the
old and new residuals, so that the signal can be reconstructed without requiring the
sparsity as prior knowledge. The main idea is to set an initial pre-estimated sparsity s
and select the s best atoms from the observation matrix. The size of s is n/10, and the
sparsity of the signal to be reconstructed is generally greater than s. The vector y is
projected onto the selected atoms to obtain x_p, the indices of the s largest values in
x_p are selected, the residual y_r_n is updated, and the new residual is compared with
the previous one: if it is smaller, s is increased by one and the loop continues; if it is
larger, the loop terminates. The basic principle is to increase s continuously and to
decide, by comparing the old and new residuals, whether s has reached the sparsity K:
the closer s is to K, the smaller the residual, and only when s reaches K is the
reconstruction residual smallest. Note that the pre-estimated sparsity must be smaller
than the actual signal sparsity; to reduce the number of iterations, the channel sparsity
ratio of [8] is used and the initial value is set to one tenth of the signal length. The
specific adaptive process is shown in Fig. 1.
Fig. 1 Flow chart of adaptive sparsity estimation (least-squares estimate and residual update
y_r_n; if y_r_n < y_r, set y_r = y_r_n and s = s + 1, otherwise stop with the adaptive sparsity
estimate k = s − 1)
The traditional SP algorithm needs the prior signal sparsity K to reconstruct the signal
accurately with a small residual. The specific steps of the adaptive sparsity subspace
pursuit algorithm are as follows. The model is reconstructed according to Eq. (13),
and the core of the improved algorithm is:
Input parameters: observation vector y, sensing matrix A.
Output parameters: estimated vector ĥ and residual y_r.
Step 1: Data initialization: y_r = y, atom index set T^0 = [], current iteration number
t = 1, initial atom index set size s.
Step 2: Compute the absolute values of the inner products of the residual y_r with the
columns of the sensing matrix, and store in T_add the indices corresponding to the s
largest absolute values, in descending order, as in Eq. (15).
Step 3: Construct the atom support index set C by merging T_add with the atom index
set of the previous iteration.
Step 4: Calculate the estimated vector by least squares from the updated atom index
set:

ĥ = (A_{C^t}^{T} A_{C^t})^{−1} A_{C^t}^{T} y   (17)

Step 5: Update the atom index set: find the index set corresponding to the first s atoms
with the largest absolute values and store it in T^t:

T^t = max(ĥ, E, s)   (18)

Step 6: Update the residual using all atoms of the atom set updated in Step 5.
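A compact NumPy sketch of the procedure described above is given below. It is an
illustration under stated assumptions rather than the authors' implementation:
sp_iteration runs subspace pursuit with a fixed sparsity s (Steps 1–6), and assp grows
s from the pre-estimate n/10 while the residual keeps decreasing, returning the estimate
with the minimum residual; details such as the inner iteration limit are choices of this
sketch.

import numpy as np

def sp_iteration(A, y, s, max_iter=20):
    # Subspace pursuit with fixed sparsity s: returns (sparse estimate, residual norm).
    support = np.argsort(np.abs(A.T @ y))[-s:]                         # initial atom index set T0
    h_s = np.linalg.lstsq(A[:, support], y, rcond=None)[0]
    r = y - A[:, support] @ h_s
    for _ in range(max_iter):
        cand = np.union1d(support, np.argsort(np.abs(A.T @ r))[-s:])   # merged candidate set C
        h_c = np.linalg.lstsq(A[:, cand], y, rcond=None)[0]            # least-squares estimate on C
        new_support = cand[np.argsort(np.abs(h_c))[-s:]]               # keep the s largest atoms
        h_n = np.linalg.lstsq(A[:, new_support], y, rcond=None)[0]
        r_new = y - A[:, new_support] @ h_n
        if np.linalg.norm(r_new) >= np.linalg.norm(r):                 # no further refinement
            break
        support, h_s, r = new_support, h_n, r_new
    h_hat = np.zeros(A.shape[1])
    h_hat[support] = h_s
    return h_hat, np.linalg.norm(r)

def assp(A, y):
    # Adaptive sparsity: start from s = n/10 and increase s while the residual keeps shrinking.
    n = A.shape[1]
    s = max(1, n // 10)                                                # pre-estimated sparsity
    h_best, r_best = sp_iteration(A, y, s)
    while s < A.shape[0]:
        h_new, r_new = sp_iteration(A, y, s + 1)
        if r_new >= r_best:                                            # minimum residual reached at k = s
            break
        s, h_best, r_best = s + 1, h_new, r_new
    return h_best

A typical call is h_hat = assp(A, y), with A the sensing matrix of Eq. (13).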
4 Performance Analysis
The simulation parameters are set as follows: the training sequence is an identity
matrix of size 300, the number of simulated channel taps is 300, the number of
measurements M is 70, and the sparsity is 20. The channel estimate obtained with the
ASSP algorithm is compared with the original channel in Fig. 4. It can be seen from
the figure that when the sampling rate is reduced to about one quarter of the traditional
sampling rate, the residual of the estimated channel is 0.2331. The sampling rate is
thus greatly reduced within an acceptable residual tolerance, which lowers the technical
difficulty of the hardware implementation.
Comparing the estimated channel parameters with the original channel parameters,
the proposed ASSP algorithm essentially achieves accurate reconstruction of the
channel data. To further measure the performance of the channel estimation algorithm,
it is evaluated by the normalized mean square error, expressed as:
MSE = \frac{1}{N} \sum_{i=1}^{N} (ĥ_i − h_i)^2   (20)
where ĥ i is the estimated parameter of the i-th path, and h i is the original parameter
of the i-th path.
For a channel with N = 256 taps, sparsity 30, and M = 140 measurements, the
simulated normalized mean square error curves of SP, SAMP, and ASSP versus SNR
are shown in Fig. 5. The simulation uses the IEEE 802.15.4a channel model as the
simulation environment. To ensure that the observation matrix is uncorrelated with
the training sequence and satisfies the RIP property required in compressed sensing,
the measurement matrix is a random Gaussian matrix A ∈ R^{M×N}. The channel noise
is additive white Gaussian noise. The simulation results show that, under the same
measurement matrix and noise, the normalized mean square error performance of the
ASSP algorithm proposed in this paper is better than that of the traditional compressed
sensing SP and SAMP algorithms. The performance of the SAMP, SP, and ASSP
algorithms improves as the signal-to-noise ratio increases: the training sequence
provides more prior knowledge for channel estimation, and the mean square error
decreases gradually until the signal-to-noise ratio reaches 30 dB.

Fig. 6 Comparison of MSE performance of the ASSP algorithm under different numbers of
measurements
The normalized mean square error of the channel reconstructed with the ASSP
algorithm for different numbers of observations is shown in Fig. 6. Under different
observation conditions, the mean square error of the ASSP algorithm decreases with
increasing signal-to-noise ratio. It is also easy to see that, in the low-SNR regime, the
normalized mean square error decreases as the number of observations increases:
more observations provide richer prior information for channel estimation, and the
channel estimation accuracy is correspondingly higher.
5 Conclusion
To address the difficulty of sampling UWB signals, this paper uses a compressed
sensing sparse reconstruction algorithm to estimate the UWB channel. A sparse
reconstruction algorithm, ASSP, is proposed on the basis of the SP reconstruction
algorithm. The algorithm does not require the sparsity as a prior condition; it initializes
the sparsity to one tenth of n and iteratively generates new residuals by gradually
increasing the sparsity. By comparing old and new residuals, the minimum residual is
sought, thereby adaptively estimating the signal sparsity; the prior sparsity K is traded
for additional reconstruction time and computation. Simulation
experiments show that the reconstruction algorithm can reconstruct the channel with
high precision, which is of practical guiding significance for engineering applications.
Acknowledgements This work has been supported by the following units and projects. They are
the National Key R&D Program of China (2018YFB0505103), the National Natural Science
Foundation of China (61561016, 61861008), Department of Science and Technology of Guangxi
Zhuang Autonomous Region (AC16380014, AA17202048, AA17202033), Sichuan Science and
Technology Plan Project (17ZDYF1495), Innovation Project of Guet Graduate Education
(2018YJCX22), Guilin Science and Technology Bureau Project (20160202, 20170216), the basic
ability promotion project of young and middle-aged teachers in Universities of Guangxi province
(ky2016YB164), and research on blind estimation of signal parameters for DSSS Communication
(2019YCXS024).
References
1. Landolsi MA (2015) Signal design for improved multiple access capacity in DS-UWB
communication. Kluwer Academic Publishers
2. Serikawa S, Lu H (2014) Underwater image dehazing using joint trilateral filter. Comput Electr
Eng 40(1):41–50
3. Lu H, Li Y, Mu S, Wang D, Kim H, Serikawa S (2018) Motor anomaly detection for unmanned
aerial vehicles using reinforcement learning. IEEE Internet of Things J 5(4):2315–2322
4. Lu H, Li Y, Chen M, Kim H, Serikawa S (2018) Brain Intelligence: go beyond artificial
intelligence. Mobile Netw Appl 23:368–375
5. Lu H, Wang D, Li Y, Li J, Li X, Kim H, Serikawa S, Humar I (2019) CONet: a cognitive ocean
network. IEEE Wireless Communications, in press
6. Lu H, Li Y, Uemura T, Kim H, Serikawa S (2018) Low illumination underwater light field
images reconstruction using deep convolutional neural networks. Future Generat Comput Syst
82:142–148
7. Islam SMR, Kwak KS (2013) Preamble-based improved channel estimation for multiband
UWB system in presence of interferences. Telecomm Syst 52(1):1–14
8. Cheng X, Wang M, Guan YL (2015) Ultra wideband channel estimation: A Bayesian compres-
sive sensing strategy based on statistical sparsity. IEEE Trans Veh Technol 64(5):1819–1832
9. Ping W, Huailin R, Fuhua F (2014) UWB multi-path channel estimation based on CS-
CoSaAMP algorithm. Comput Eng Appl 50(4):227–230
10. Huanan Y, Shuxu G (2012) Research on CS-based channel estimation methods for UWB
communications. J Electron Inf Technol 34(6):1452–1456
11. Weidong W, Junan Y (2013) Ultra wide-band communication channel estimation based on
Bayesian compressed sensing. J Circuits Syst 18(1):168–176
12. Mehmet R, Serhat K, Rpan HA (2015) Bayesian compressive sensing for ultra-wideband
channel estimation: algorithm and performance analysis. Kluwer Academic Publishers
13. Cohen KM, Attias C, Farbman B et al (2014) Channel estimation in UWB channels using
compressed sensing. In: IEEE international conference on acoustics, speech and signal
processing. IEEE, pp 1966–1970
14. Fuhua F, Huailin R (2014) Non-convex compressive sensing ultra-wide band channel estimation
method in low SNR conditions. Acta Electronica Sinica 42(2):353–359
15. Yanfen W, Xiaoyu C, Yanjing S (2017) Sparsity adaptive algorithm for ultra-wideband channel
estimation. J Univ Electron Sci Technol China 46(3):498–504
16. Candès EJ, Romberg JK, Tao T (2010) Stable signal recovery from incomplete and inaccurate
measurements. Commun Pure Appl Math 59(8):1207–1223
17. Xianyu Z, Yulin L, Kai W (2010) Ultra wide-band channel estimation and signal detection
through compressed sensing. J Xi'an Jiaotong Univ 44(2):88–91
18. Tropp JA, Gilbert AC (2007) Signal recovery from random measurements via orthogonal
matching pursuit. IEEE Trans Inf Theory 53(12):4655–4666
19. Schwab H (2007) Signal recovery from incomplete and inaccurate measurements via regu-
larized orthogonal matching pursuit. Submitted for publication. IEEE J Select Topics Signal
Process 4(2):310–316
20. Needell D, Vershynin R (2010) Signal recovery from incomplete and inaccurate measurements
via regularized orthogonal matching pursuit. IEEE J Select Topics Signal Process 4(2):310–316
21. Davenport MA, Needell D, Wakin MB (2013) Signal space CoSaMP for sparse recovery with
redundant dictionaries. IEEE Trans Inf Theory 59(10):6820–6829
22. Zhang L (2015) Image adaptive reconstruction based on compressive sensing via CoSaMP.
Appl Mech Mater 631–632:436–440
23. Dai W, Milenkovic O (2009) Subspace pursuit for compressive sensing signal reconstruction.
IEEE Trans Inf Theory 55(5):2230–2249
24. Lei S, Zheng Z, Liang T (2012) Ultra wideband channel estimation based on kalman filter
compressed sensing. Trans Beijing Instit Technol 32(2):64–67+77
A New Unambiguous Acquisition
Algorithm for BOC(n, n) Signals
Xiyan Sun, Qing Zhou, Yuanfa Ji, Suqing Yan, and Wentao Fu
Abstract The autocorrelation function of a BOC-modulated signal has more than one
peak, which can cause the receiver to acquire a wrong peak. This paper therefore
proposes a new unambiguous acquisition algorithm for BOC(n, n) signals. The local
BOC signal is divided into two branch signals; one branch signal is correlated with the
received BOC signal, and the resulting correlation function is shifted by one quarter
and by three quarters of a chip, respectively, and then reconstructed. Simulation
analysis shows that the algorithm weakens the interference of the secondary peaks
with a small amount of calculation, improving both acquisition sensitivity and
de-blurring performance.
1 Introduction
To ensure that the various modern GNSS (Global Navigation Satellite System) systems
[1, 2] can operate in different frequency bands, the spectrum-splitting characteristic of
BOC (Binary Offset Carrier) modulation [3] has become an opportunity to use it for
navigation satellite signals. In addition, because BOC offers higher positioning
accuracy than traditional BPSK (binary phase shift keying), various BOC modulations
have become an important component of modern GNSS systems and the main
candidates for their further development. Nevertheless, the multi-peak nature of the
autocorrelation function of the BOC signal causes severe ambiguity in the receiver's
baseband signal processing, which can introduce a positioning bias that cannot be
ignored. Eliminating the ambiguity of the correlation peaks is therefore a key research
issue.
The solutions proposed so far mainly include the following. The BPSK-LIKE
algorithm [4] is further divided into a single-sideband method [5, 6] and a double-
sideband method; in [7], the BPSK-LIKE algorithm moves the local pseudocodes to
the two sides of the spectrum before acquisition. Although BPSK-LIKE removes the
ambiguity of BOC and has a simple structure, the resulting BPSK-like correlation peak
is wider than that of the BOC signal. The SCPC [8, 9] algorithm imitates quadrature
demodulation carrier stripping and uses a pair of mutually orthogonal local BOC codes
to remove the influence of the subcarrier on BOC signal acquisition, with results
similar to those obtained with the BPSK-LIKE technique. The essence of the ASPeCT
[10, 11] algorithm is that the received signal is correlated with the local BOC code and
with the PRN code, and the two correlation functions are then used for reconstruction
to eliminate the influence of the subcarrier. This method successfully suppresses the
influence of the subcarrier and improves acquisition accuracy, but it is computationally
intensive and works only for the BOC(n, n) group. The direct BOC signal processing
method [12] improves the peak-to-peak ratio of the primary and secondary peaks;
although simple to implement, it cannot remove the ambiguity. Reference [13] also
obtains side-peak elimination by constructing a local auxiliary signal, but only for
high-order BOC signals. A variety of acquisition and tracking techniques [14–18] have
been proposed for the ambiguity of the BOC acquisition process.
In this article, a new improved algorithm is proposed. First, the local BOC signal is
split, and the odd-branch signal is correlated with the received signal; the correlation
function is then shifted and recombined to eliminate the side peaks (alternatively, the
even-branch signal can be correlated with the received signal, shifted, and recombined
to the same end). This paper takes the odd-branch signal as the example. Simulations
show that the algorithm effectively eliminates the secondary peaks of the correlation
function and narrows the main peak span to three quarters of a chip, yielding a sharp,
narrow correlation peak and better acquisition performance.
2 Ambiguity Problem
This article studies the BOC(n, n) group of signals used by most modern GNSS
navigation systems. The BOC code is defined as the product of the PRN code and the
subcarrier, denoted Sboc(t):

Sboc(t) = c(t) sc(t)   (1)
where c(t) is the PRN (pseudo-random noise) code and sc(t) is the binary subcarrier.
This article uses the BOC(1, 1) signal for description; the other BOC(n, n) signals have
the same characteristics. Figure 1 shows the autocorrelation function of BOC(1, 1). It
can be seen from the figure that the BOC signal has secondary peaks, which cause
ambiguity during acquisition.

Fig. 1 Autocorrelation functions of BOCs(1, 1) and BOCc(1, 1) (normalized amplitude versus
code phase)
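To make the ambiguity concrete numerically, the following sketch (illustrative
Python/NumPy, not taken from the paper; the random PRN code is arbitrary, while the
40 samples per chip match the sampling setup used later) generates a BOCs(1, 1)
spreading waveform and its autocorrelation, whose negative side peaks near ±0.5 chip
are the source of the acquisition ambiguity:

import numpy as np

rng = np.random.default_rng(1)
n_chips, spc = 1023, 40                                 # PRN length and samples per chip
prn = rng.choice([-1.0, 1.0], size=n_chips)             # illustrative PRN code
subc = np.r_[np.ones(spc // 2), -np.ones(spc // 2)]     # one square-wave subcarrier period per chip
boc = np.repeat(prn, spc) * np.tile(subc, n_chips)      # BOCs(1,1) spreading waveform

lags = np.arange(-spc, spc + 1)                         # lags over +/- one chip
acf = np.array([np.dot(boc, np.roll(boc, k)) for k in lags]) / boc.size
# acf has its main peak at zero lag and negative secondary peaks near +/- half a chip.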
3 Proposed Method
This paper proposes a new unambiguous acquisition algorithm to solve the ambiguity
problem. First, the BOC signal is split. The pseudo-random code can be modeled as

c(t) = \sum_{i=−∞}^{∞} C_i P_{T_C}(t − i T_C)   (2)
where C_i is the chip value, C_i ∈ {−1, 1}; P_{T_C} is a rectangular pulse; and T_C is the
width of a chip. The local subcarrier is expressed as
sc(t) = \sum_{j=0}^{N−1} d_j P_{T_{sc}}(t − j T_{sc})   (3)

Sboc(t) = \sum_{i=−∞}^{∞} \sum_{j=0}^{N−1} C_i d_j P_{T_{sc}}(t − i T_C − j T_{sc})   (4)
The separation process of the BOC(1, 1) signal is shown in Figs. 2 and 3, in which
Ce(t) represents the odd-branch signal and Co(t) the even-branch signal; their
mathematical expressions are defined accordingly.
r(t) = P_S × C(t − τ) × D(t − τ) × SC(t − τ) × cos(2π(f_{IF} + f_D)t) + n(t)   (8)

where P_S is the power of the input signal, D(t) is the navigation data, τ is the code
delay, and n(t) is the noise term [19].
The input signal is mixed with the local carrier and multiplied by the odd-branch
signal to obtain

r_{e2}(t) = r(t) [cos(2π(f_{IF} + f_D)t) + j sin(2π(f_{IF} + f_D)t)] C_e(t − \frac{3T_C}{4}) + n_{e2}   (11)
The algorithm block diagram is shown in Fig. 6. Following the flow in the figure, the
carrier is first stripped off, and the subcarrier-modulated local PRN sequence is divided
into two branch signals, using the subcarrier pulse length as the reference. This paper
focuses on the shifting of the odd-branch correlation. The correlation between the
branch signal and the input signal described above is computed; the correlation
function is then shifted left by 1/4 chip and right by 3/4 chip, yielding two new
functions. The two new functions are added, subtracted, and their moduli subtracted,
which finally eliminates the side peaks of the autocorrelation function.
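Continuing the numerical sketch from Sect. 2 (an illustration only, not the paper's
implementation: the full autocorrelation acf stands in here for the odd-branch
correlation, and the combination simply mirrors the sum and difference terms that
appear later in Eq. (23)), the quarter-chip and three-quarter-chip shifts and their
recombination can be written as:

r_left = np.roll(acf, -spc // 4)       # correlation shifted left by a quarter chip
r_right = np.roll(acf, 3 * spc // 4)   # correlation shifted right by three quarters of a chip
r_sum = r_left + r_right               # sum of the two shifted correlations (cf. r_{e1+e2})
r_diff = r_left - r_right              # difference of the two shifted correlations (cf. r_{e1-e2})
detect = r_diff ** 2 + r_sum ** 2      # per-lag analogue of the statistic in Eq. (23) with M = 1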
The sampling frequency is set to 40.92 MHz, so one spreading chip corresponds to 40
sampling points, and the code phase is set to the 601st sampling point, that is, the 16th
chip. Figure 7 shows the acquisition result for the BOC(1, 1) signal.
On the MATLAB platform, the same parameters as in Fig. 7 are used. As shown in
Fig. 8, for the BOCs(1, 1) signal the two-dimensional acquisition maps of the ASPeCT
and SCPC methods still contain secondary peaks, whereas the proposed algorithm
completely eliminates them and yields a very narrow correlation peak. In Fig. 9, the
algorithm is likewise superior to the SCPC algorithm for the BOCc(1, 1) signal.
The proposed algorithm judges whether the BOC signal has been correctly acquired
by comparing the detection statistic with the detection threshold set by the decision
device. If the detection statistic exceeds the threshold, the signal is correctly acquired;
otherwise it is not. The traditional non-coherent detection statistic is
D = \sum_{j=1}^{M} (I_j^2 + Q_j^2)   (14)
In (14),

I_j = T_s \sqrt{C/N_0} \mathrm{sinc}(π f_D T_s) R(τ) cos(φ) + N_{i,j}
Q_j = T_s \sqrt{C/N_0} \mathrm{sinc}(π f_D T_s) R(τ) sin(φ) + N_{q,j}   (15)
It can be seen from Eqs. (16) and (17) that as τ decreases toward zero, the detection
statistic D is maximized. Assuming P_D(x) is the probability density, the detection
probability P_d is
P_d = \int_{V}^{+∞} P_D(x) dx   (18)
According to the reconstruction rule, r_{e1−e2} and r_{e1+e2} are given by Eqs. (21)
and (22). Substituting Eqs. (21) and (22) into Eq. (14), the new detection statistic D_1
can be expressed as
D_1 = \sum_{j=1}^{M} ((r_{e1−e2})^2 + (r_{e1+e2})^2)   (23)
Assuming the probability density is P_{D1}(x), the detection probability P_{d1} is
defined analogously to Eq. (18):

P_{d1} = \int_{V}^{+∞} P_{D1}(x) dx
In Fig. 10, for the sin-BOC(1, 1) signal, the detection performance of this algorithm is
higher than that of SCPC and BPSK-LIKE. In Fig. 11, for the cos-BOC(1, 1) signal,
the detection performance of the proposed algorithm is not much different from that
of the BPSK-LIKE algorithm, but it is higher than that of the SCPC algorithm.
It can be seen from Table 1 that the ASPeCT algorithm needs to perform two
correlations and two squarings, and the computational load of the SCPC method is the
same as that of ASPeCT, whereas the proposed algorithm only needs one correlation,
two shifts, and two squarings, so its complexity is lower.
5 Conclusion
This paper proposes a new improved method based on correlation shifting; the
theoretical analysis and simulation mainly use the BOC(1, 1) signal, but the method
also applies to the other BOC(n, n) signals. Compared with the BPSK-LIKE and
SCPC methods, the proposed method removes the ambiguity more effectively and has
higher acquisition sensitivity. The unambiguous acquisition method proposed in this
paper is therefore a good choice for modern GNSS receivers, and it also provides a
reference for the unambiguous acquisition of other BOC groups.
Acknowledgements This work has been supported by the following units and projects. They are
the National Key R&D Program of China (2018YFB0505103), the National Natural Science Foun-
dation of China (61561016, 61861008), Department of Science and Technology of Guangxi Zhuang
Autonomous Region (AC16380014, AA17202048, AA17202033), Sichuan Science and Tech-
nology Plan Project (17ZDYF1495), Guilin Science and Technology Bureau Project (20160202,
20170216), the basic ability promotion project of young and middle-aged teachers in Universities
of Guangxi province (ky2016YB164), the graduate education innovation program funding project
of Guilin University of Electronic Technology (2019YCXS024).
References