IMINT Target Acquisition Using Deep Learning
Abstract: Detecting targets in a high-resolution remote sensing image is one of the classical problems of computer vision and is often described as a difficult task. This paper presents how such computer vision tasks can be carried out with Deep Learning technology under the constraint of a small training dataset, by preprocessing the image dataset, using a pretrained Convolutional Neural Network (CNN), and developing the right training process. The Faster R-CNN method is used for the object detection task. With practical use in mind, this work detects 2 classes (airplanes and storage tanks). The dataset used for training is a combination of an existing dataset and images collected of military aircraft. As a result, the analysis of large volumes of IMINT data becomes faster and the required human labor is reduced to a minimum. The software recognizes targets in large images collected by satellites, reconnaissance UAVs, or aircraft on ISR missions, with an average accuracy of 90% for the airplane class.
Index Terms: Imagery Intelligence; Remote Sensing; ISR; Deep Learning; Object Detection; Convolutional Neural Network; CNN; Faster R-CNN; Computer Vision.
In addition to the UC Merced dataset, I collected other satellite images, mostly of military transport airplanes, using the GIS (Geographic Information System) software Global Mapper [16]. I searched for air bases around the world and downloaded high-resolution images of different sizes with a spatial resolution of 0.3 m, as shown in Figure 3. The total number of images in the dataset is 188 images of different sizes, split into 2 categories.

Figure 3: Example of images from the dataset collected using Global Mapper

3.1.2. Images preprocessing

The most time-consuming part of this work is the image preprocessing, which I carried out in several steps.

3. Extracting the target images from the image dataset: As a third step, all the labeled targets in the images of the dataset obtained after step 2 are extracted by cropping. I did this step in order to first train the CNN as a classifier; I believe this initializes the CNN weights so that the Faster R-CNN training gives a better result (the total output is 1353 images). But in order to feed the CNN, the images must be resized to m × m while keeping the same aspect ratio, hence the need for the next step.

4. Image resizing: The CNN input is a square image (for our CNN it is 227 × 227 with RGB channels), so I resized all the cropped images from step 3. The idea of the resize method is to pad the shorter side with black (i.e., zero-intensity) pixels and output a square image, so that we do not lose the aspect ratio and do not get a distorted target for training. After that, the output is resized to 227 × 227 pixels (see the sketch after this list).

5. Create image datastore: This fifth step prepares the data for training. I create an image datastore with all the images from the previous step together with their respective classification labels. After that, I split every category into 3 parts: a Training Dataset representing 80% of the data, a Validation Dataset (to follow the improvement of the network during training) with 10%, and a Test Dataset (to measure the accuracy) with 10%.
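To make steps 4 and 5 concrete, the following MATLAB sketch shows one possible implementation of the zero-padding resize and of the labeled datastore with its 80/10/10 split. The folder layout and the helper name letterboxResize are illustrative assumptions; the 227 × 227 target size and the split ratios come from the text above.

```matlab
% Step 5 (sketch): datastore labeled from folder names, split 80/10/10.
% Assumed layout (hypothetical paths): one subfolder per class,
% e.g. dataset/airplane and dataset/storage_tank.
imds = imageDatastore('dataset', ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
% 80% training, 10% validation, remainder (10%) test.
[imdsTrain, imdsVal, imdsTest] = splitEachLabel(imds, 0.8, 0.1, 'randomized');

% Step 4 (sketch): pad the shorter side with zero-intensity (black) pixels
% to get a square image, preserving the aspect ratio, then resize.
function out = letterboxResize(img, targetSize)
    [h, w, c] = size(img);
    side = max(h, w);
    padded = zeros(side, side, c, 'like', img);    % black square canvas
    rowOff = floor((side - h) / 2);
    colOff = floor((side - w) / 2);
    padded(rowOff+1:rowOff+h, colOff+1:colOff+w, :) = img;
    out = imresize(padded, [targetSize targetSize]); % e.g. 227 x 227
end
```

The cropped targets from step 3 would be passed through letterboxResize(img, 227) before being written to the datastore folders.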
3.2. Architecture of the Network

3.2.1. CNN

For the CNN, I used the transfer learning technique on a pre-trained network, AlexNet [5] (described in section 2). The advantage of this approach is that the network has already learned a rich set of image features applicable to a wide range of images, whereas training such a network from scratch can take too long because of the heavy computation required. To perform the transfer learning, I deleted the last 3 layers trained for classification (the fully-connected layer, the softmax layer, and the classification layer), because their output covered 1000 classes while our output is only 2 classes. I also changed the first layer, the input layer, adding a data augmentation method that applies a random vertical flip to the input image on every iteration of the mini-batch during training. This data augmentation, together with the image preprocessing explained in subsection 3.1.2, helps the Faster R-CNN detect objects with rotational invariance, compensating for a limitation of the object detection method explained in subsection 3.2.2 below.
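A minimal sketch of this layer surgery, assuming the MATLAB 2017b environment described below; note that the built-in 'randfliplr' option of imageInputLayer in that release flips images left-right, so reproducing the vertical flip described above exactly may require a custom augmentation:

```matlab
% Transfer learning (sketch): start from pretrained AlexNet, drop the
% input layer and the last 3 classification layers, and rebuild the ends.
net = alexnet;                          % needs the AlexNet support package
layersTransfer = net.Layers(2:end-3);   % keep the learned feature layers

layers = [
    imageInputLayer([227 227 3], ...
        'DataAugmentation', 'randfliplr')  % random flip each mini-batch
    layersTransfer
    fullyConnectedLayer(2)              % 2 classes: airplane, storage tank
    softmaxLayer
    classificationLayer];
```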
The network was trained on a laptop with a single Intel Core i5 CPU and 8 GB of RAM. The operating system was Windows 10, and the implementation environment was MATLAB 2017b, using the Training Dataset and the Validation Dataset described in step 5 of the image preprocessing.
The CNN is trained using SGDM (Stochastic Gradient Descent with Momentum) with a batch size of 128, a momentum of 0.9, and a learning rate of 10⁻⁴. The training required 8 epochs (shuffling the dataset on every epoch) and took around 385 minutes.
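This training configuration maps directly onto MATLAB's trainingOptions; the sketch below assumes the datastores from step 5 are named imdsTrain, imdsVal, and imdsTest as in the earlier snippet:

```matlab
% CNN training (sketch): SGDM, batch 128, momentum 0.9, LR 1e-4, 8 epochs.
opts = trainingOptions('sgdm', ...
    'MiniBatchSize', 128, ...
    'Momentum', 0.9, ...
    'InitialLearnRate', 1e-4, ...
    'MaxEpochs', 8, ...
    'Shuffle', 'every-epoch', ...       % reshuffle the dataset each epoch
    'ValidationData', imdsVal);         % follow improvement while training

trainedNet = trainNetwork(imdsTrain, layers, opts);

% Accuracy on the held-out test set (compare with Table 1).
predicted = classify(trainedNet, imdsTest);
accuracy  = mean(predicted == imdsTest.Labels);
```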
After running the trained network on the test dataset, I obtained a mean diagonal accuracy of 99.98% on the 2 trained classes, as shown in the confusion matrix in Table 1.

              Airplane    Storage Tank
Airplane      0.9999      2.7189E-06
Storage Tank  4.63E-04    0.9995

Table 1: Confusion matrix of the trained CNN

This trained classification network will then be used for the object detection method.
3.2.2. Object detection method

For the object detection method, the Faster R-CNN described in section 2.4 is used because of its highly accurate results, as shown in section 2.5. Faster R-CNN does not have rotation invariance implemented [9], yet objects in RS imagery need to be handled in a rotation-invariant way because of the random orientation of the targets in the input image. This issue was overcome by the data augmentation in the input layer and by the image preprocessing. The dataset created in step 2, described in subsection 3.1.2, is used.

The Faster R-CNN has 2 CNNs working together. One network is the pretrained CNN described in section 2 above, transformed into a Fast R-CNN (see subsection 2.2) by adding a regression network that outputs the localization box represented by 4 values [x y height width]. The other one is the RPN, which outputs the RoIs. This network shares its weights with the previous CNN, but the last layers concerned with classification are replaced by an RoI output that feeds the Fast R-CNN for classification [9]. The Faster R-CNN goes through a 4-step alternating training [9]: First, train the RPN initialized from the pretrained CNN. Second, train a separate detection network with Fast R-CNN using the proposals generated in the previous step, also initialized from the pretrained CNN. Third, fix the convolutional layers and fine-tune the layers unique to the RPN, initialized from the detector of the second step. Fourth, fix the convolutional layers and fine-tune the fully-connected layers of the Fast R-CNN.

The chosen learning options of the Faster R-CNN are almost the same for the 4 steps: SGDM with a batch size of 256, a learning-rate drop factor of 0.5 every 3 epochs, and shuffling of the image dataset on every epoch. The initial learning rate is 10⁻⁵ for the first 2 steps and 10⁻⁶ for the remaining steps (because those steps fine-tune the previous ones). The number of epochs is fixed at 10 for the 1st and 3rd steps and at 12 for the 2nd and 4th steps.

The positive IoU range (Intersection over Union of the anchor with the ground-truth box; more details in reference [13]) is fixed at [0.6 1] (object) and the negative IoU range at [0 0.3] (not object).

All the parameters were chosen after many practical trials to get a better result. The training took around 72 hours on the same laptop environment previously mentioned.
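In the MATLAB release mentioned above, the whole 4-step alternating training is wrapped in a single toolbox call that accepts one trainingOptions object per step plus the overlap ranges. The sketch below assumes trainingData is a table with an image-file column and one bounding-box column per class; some releases constrain MiniBatchSize for this trainer, so the value of 256 from the text may need adjusting:

```matlab
% Faster R-CNN 4-step alternating training (sketch).
stepOpts = @(lr, epochs) trainingOptions('sgdm', ...
    'MiniBatchSize', 256, ...
    'InitialLearnRate', lr, ...
    'LearnRateSchedule', 'piecewise', ...
    'LearnRateDropFactor', 0.5, ...      % halve the learning rate...
    'LearnRateDropPeriod', 3, ...        % ...every 3 epochs
    'MaxEpochs', epochs, ...
    'Shuffle', 'every-epoch');

% Steps 1-2 start at 1e-5; steps 3-4 fine-tune at 1e-6.
options = [stepOpts(1e-5, 10), stepOpts(1e-5, 12), ...
           stepOpts(1e-6, 10), stepOpts(1e-6, 12)];

detector = trainFasterRCNNObjectDetector(trainingData, layers, options, ...
    'PositiveOverlapRange', [0.6 1], ...  % anchor IoU counted as object
    'NegativeOverlapRange', [0 0.3]);     % anchor IoU counted as background

% Running the trained detector on a new RS image ('scene.png' hypothetical).
[bboxes, scores, labels] = detect(detector, imread('scene.png'));
```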
3.2.3. Evaluation Method

For the target detection performance, four commonly used criteria were computed: FPR, MR, Accuracy (AC), and Error Ratio (ER). These criteria are defined as follows:

FPR = (Number of falsely detected targets / Number of detected targets) × 100%

MR = (Number of missing targets / Number of targets) × 100%

AC = (Number of detected targets / Number of targets) × 100%

ER = FPR + MR
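These four criteria translate directly into a small MATLAB helper; the function and argument names below are illustrative, not from the original work:

```matlab
% Detection criteria (sketch): direct transcription of the definitions above.
function [FPR, MR, AC, ER] = detectionCriteria(nFalse, nDetected, ...
                                               nMissing, nTargets)
    FPR = nFalse    / nDetected * 100;  % falsely detected / detected (%)
    MR  = nMissing  / nTargets  * 100;  % missing targets / all targets (%)
    AC  = nDetected / nTargets  * 100;  % detected targets / all targets (%)
    ER  = FPR + MR;                     % combined error ratio (%)
end
```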
4. CONCLUSION

The goal of this work is to show the ability of object detection using DL technology, through the Faster R-CNN method, to find military targets in a high-resolution RS image. To implement this method, a series of processing steps must be applied to the dataset and to the chosen pre-trained CNN because of the small number of images in the dataset. I demonstrated the feasibility of this type of work on a single-CPU laptop.

Two object classes (airplanes, storage tanks) are used as an example. In the future, the software can be trained to detect more classes (ships, armored vehicles, bridges, deployed troops…). There will be no limitation as long as we can collect an appropriate dataset of remote sensing images (including EO, IR, SAR…).
5. REFERENCES

[1] K. Fukushima, "Neocognitron: A hierarchical neural network capable of visual pattern recognition," Neural Networks, vol. 1, no. 2, pp. 119-130, 1988.

[2] D. Marr and E. Hildreth, "Theory of edge detection," Proceedings of the Royal Society of London, Series B, vol. 207, no. 1167, pp. 187-217, 1980.

[3] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440-1448, 2015.

[4] R. Girshick, J. Donahue, T. Darrell and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.

[5] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.

[6] ImageNet, "Large Scale Visual Recognition Challenge (ILSVRC)," Stanford Vision Lab. [Online]. Available: http://www.image-net.org/challenges/LSVRC/.

[7] J. Redmon, S. Divvala, R. Girshick and A. Farhadi, "You Only Look Once: Unified, real-time object detection," Computer Vision and Pattern Recognition (cs.CV), 2016.

[8] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu and A. C. Berg, "SSD: Single shot multibox detector," in European Conference on Computer Vision, Amsterdam, Netherlands, 2016.

[9] S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, June 2017.

[10] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers and A. W. M. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, vol. 104, no. 2, pp. 154-171, 2013.

[11] C. M. Bishop, Pattern Recognition and Machine Learning. Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006.

[12] I. Goodfellow, Y. Bengio and A. Courville, Deep Learning. MIT Press, 2016.

[13] C. L. Zitnick and P. Dollár, "Edge boxes: Locating object proposals from edges," in European Conference on Computer Vision, pp. 391-405, 2014.

[14] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn and A. Zisserman, "The PASCAL Visual Object Classes Challenge 2007." [Online]. Available: http://host.robots.ox.ac.uk/pascal/VOC/voc2007/.

[15] Y. Yang and S. Newsam, "Bag-of-visual-words and spatial extensions for land-use classification," in 18th ACM SIGSPATIAL Int. Symposium on Advances in Geographic Information Systems, San Jose, CA, USA, June 2010.

[16] Blue Marble Geographics, "Global Mapper." [Online]. Available: http://www.bluemarblegeo.com/products/global-mapper.php.

[17] MathWorks, "MATLAB." [Online]. Available: https://www.mathworks.com/products/matlab.html.

[18] D. Rumelhart, G. Hinton and R. Williams, "Learning representations by back-propagating errors," Cognitive Modeling, vol. 5, no. 3, 1986.

[19] R. Szeliski, Computer Vision: Algorithms and Applications. Springer, 2010.

[20] F.-F. Li, A. Karpathy and J. Johnson, "Convolutional Neural Networks for Visual Recognition," Stanford University CS231n lecture, 2016.