
Geospatial Data Analytics: A Machine Learning Perspective

Saneev Kumar Das, Meenakshi Kandpal, Sujit Bebortta

Department of Computer Science and Engineering, College of Engineering and Technology, Bhubaneswar, India, 751003

Abstract

Geographic information systems (GIS) have proven effective in many domains but still leave room for further exploration. In this paper, a novel architecture for the convergence of machine learning and GIS technology is presented. The enabling technologies behind the implementation are defined and described, and relevant work on combining machine learning with GIS technology is summarized. Three datasets are considered to realize the proposed framework in practice, viz., Indian census data, Zomato restaurants data and COVID-19 data for India. The proficiency of the proposed framework is demonstrated through the results generated in this paper. Furthermore, the limitations of the proposed framework and ways to tackle the associated challenges are presented. Various geospatial visualization operations, along with statistical plots, are shown to illustrate the overlay analysis, point-cluster generation, heatmap analysis and spatial clustering approaches. Wherever required, a Microsoft Bing map is provided to illustrate the risk analysis performed along with the zonal classification. Finally, concluding remarks are summarized, together with recent techniques such as deep learning and the serverless computing paradigm that could improve the proposed framework and reduce its limitations.
Keywords: Geographic Information Systems (GIS), Machine Learning, Spatial Clustering, Overlay Analysis, Point-Cluster Analysis, Spatial Data Infrastructure (SDI)

Email address: saneevdas.061995@gmail.com (Sujit Bebortta)

Preprint submitted to Journal of LATEX Templates July 24, 2020

Electronic copy available at: https://ssrn.com/abstract=3599656

1. Introduction

The essence of geospatial data analytics lies in the convergence of two widely used domains of computer science, viz., geographic information systems (GIS) and machine learning (ML). The scope of this paper is the merger of these two technologies, which improves the visualization of geospatial data that is high-dimensional, hyperspectral and spatio-temporal in nature. Data analytics can handle and visualize diverse categories of data. Data mining, the extraction of knowledge from data, adds predictive capability and thereby makes the data more useful. It is essential to understand what the data under consideration signifies, so the data to be analyzed must be visualized properly. Many statistical approaches exist for visualizing data, but reproducing such visualizations computationally is a demanding task. Furthermore, data with few dimensions are easy to handle and visualize, whereas high-dimensional data are difficult to visualize, process and analyze. This paper focuses on the visualization of geospatial data with the assistance of machine learning algorithms, which are discussed in subsequent sections.

The emergence of novel technologies has led to unexpected growth in the domain of geospatial data analytics. The underlying technologies include GIS, remote sensing, machine learning and data visualization, among others. The wide range of applications in this domain requires technologies capable of analyzing such computationally intensive data. Over the last decades, a few eminent researchers have used GIS technology to analyze the application of non-parametric techniques to modeling environmental data [1, 2, 3]. Research in GIS technology is especially widespread in the domains of healthcare, environment, crowd-sourcing, crime analysis, disaster management and interpolation of census data. Over the last decade, GIS tools focused primarily on capturing data, storing it and displaying it as a map [4]. Nowadays, tools such as QGIS, ArcGIS and Maptitude provide a rich set of features for geospatial data analytics. The ability of GIS technology to integrate heterogeneous data, together with its visualization strategies for analyzing geospatial data, has driven its popularity [5]. Today, machine learning algorithms such as artificial neural networks (ANNs) and support vector machines (SVMs) are indispensable tools for geospatial data analysis [6].

In this paper, a three-tiered framework to converge machine learning and GIS technology is proposed. The datasets considered to realize the proposed framework in practice include:

• Indian census data,

• Zomato restaurants data for Delhi, India, and

• COVID-19 pandemic data for India, organized state-wise.

The implementations carried out over the above-mentioned datasets illustrate the relevance of GIS technology, in convergence with machine learning, in the fields of economy, business, and epidemiology or healthcare.

1.1. Motivations

The growing popularity of GIS technology has led to multiple applications in the field of computer engineering. Data is becoming indispensable, and spatial data therefore also needs proper visualization mechanisms. GIS technology aids in visualizing spatial data with geographical significance: data in which location information is highly relevant can be visualized with the aid of GIS operations. The prime motivation behind this paper is to practically realize the convergence of machine learning with GIS technology so as to add predictive capability to the model. Many present-day datasets require predictive analysis, and machine learning aids in analyzing and visualizing such datasets. Although the proficiency of machine learning is well known, this paper aims to acquaint researchers with the relevance of machine learning in the field of GIS technology.

1.2. Contributions

• The enabling technologies behind the proposed framework are discussed.

• Relevant works in this domain are summarized section by section.

• A framework converging machine learning with GIS technology is proposed.

• The proposed framework is realized by implementing it over three relevant datasets, treated as case studies.

• Details of the experimentation and the corresponding results are presented.

• The limitations and future research directions are summarized at the end, presenting the challenges faced by the proposed architecture and how it can be improved.



2. Enabling Technologies

This section explains the back-end technologies in detail. The implementation of the proposed framework uses GIS technology and machine learning, among others; each is explored below.

2.1. GIS and Spatial Data Infrastructure

A GIS is a system capable of capturing, storing, analyzing and displaying geospatial data, spatial relationships and various patterns. A spatial data infrastructure (SDI) comprises the technologies, standards, policies and resources needed to capture, process and preserve spatial data [7]. The architecture of SDI is presented in Figure 1. An SDI framework distinguishes three types of clients, viz.,

• Thick Client: It refers to clients who work with standalone GIS applications.

• Thin Client: It refers to users who work with browser-based web applications.

• Mobile Client: It refers to a group of thin clients working on locked operating systems.

GIS technology is of utmost importance in visualizing spatial data infrastructure. Remote sensing is combined with GIS to include satellite imagery for a wide range of applications. The data formats used in GIS are the raster format and the vector format. The raster format consists of grids of pixels, as used for satellite imagery, whereas the vector format consists of polygons and lines used to display streets, borders, etc. [8]. GIS is a boon for cartographers, who design maps for various scenarios. Geospatial data needs to be properly described through entities, attributes and relationships.

[Figure 1 labels: GIS Data, Policies, Metadata, Standards, Human Resources, Geo-Database, Spatial Data Infrastructure, Thick Client]

Figure 1: Architecture of SDI with types of clients involved within.

2.2. Operations in GIS

Multiple operations exist in GIS applications for performing geospatial analysis. Some of the relevant operations are as follows:

• Overlay Analysis: It is an operation in which two or more layers are overlaid [9]. Various points or polygons can be put together for geospatial analysis. The different layers generated while plotting data points over a map need to be overlaid, and this operation is called overlay analysis. The data points can also be labeled during this operation to explain what is overlaid on the map. Various types of overlay analysis exist, such as weighted overlay and map overlay.

• Heatmap Analysis: The density of data points in a particular region is analyzed by generating heatmaps. A heatmap uses pseudo-color codes to distinguish regions according to density, with color intensity gradually decreasing as the concentration of data points reduces [10]. Kernel density interpolation, a technique that uses various kernel functions, can also be used to generate a heatmap.

• Point-Cluster Analysis: This analysis is performed when a large number of data points are spread all over the map, which makes the analysis too complex. The map in such cases becomes hard to read, and thus point-cluster analysis is performed. In this operation, clusters are generated based on distance metrics and neighborhood, and each cluster is drawn as a single symbol representing the number of data points within it. This reduces the number of points displayed on the map.
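
The heatmap operation described above can be illustrated with a small kernel density sketch. This is not the paper's implementation, which relies on GIS tooling; it assumes planar (projected) coordinates and a Gaussian kernel, and the names `kernel_density_grid`, `cell_size` and `bandwidth` are hypothetical and chosen only for illustration.

```python
import math

def kernel_density_grid(points, x_range, y_range, cell_size, bandwidth):
    """Return a 2-D list of Gaussian kernel density values on a regular grid.

    points    : list of (x, y) tuples (assumption: projected, planar units)
    x_range   : (x_min, x_max) extent of the grid
    y_range   : (y_min, y_max) extent of the grid
    cell_size : side length of a square grid cell
    bandwidth : standard deviation of the Gaussian kernel (illustrative knob)
    """
    n_cols = int(math.ceil((x_range[1] - x_range[0]) / cell_size))
    n_rows = int(math.ceil((y_range[1] - y_range[0]) / cell_size))
    # Normalisation of a 2-D Gaussian kernel averaged over all points.
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * max(len(points), 1))
    grid = []
    for r in range(n_rows):
        cy = y_range[0] + (r + 0.5) * cell_size      # cell-centre y
        row = []
        for c in range(n_cols):
            cx = x_range[0] + (c + 0.5) * cell_size  # cell-centre x
            total = 0.0
            for px, py in points:
                d2 = (cx - px) ** 2 + (cy - py) ** 2
                total += math.exp(-d2 / (2.0 * bandwidth ** 2))
            row.append(norm * total)
        grid.append(row)
    return grid

# Made-up sample: three nearby points and one isolated point.
sample_points = [(1.5, 1.5), (1.6, 1.4), (1.4, 1.6), (4.5, 4.5)]
density = kernel_density_grid(sample_points, (0.0, 5.0), (0.0, 5.0),
                              cell_size=1.0, bandwidth=0.5)
```

Each cell value can then be mapped to a pseudo-color ramp; the cell containing the three nearby points receives the highest density, and the value falls off with distance from the points, which matches the behaviour described for heatmaps.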

2.3. Machine Learning

Machine learning is a sub-field of artificial intelligence in which, on providing a specific problem as input, a program is obtained as the output [11]. Machine learning covers many operations and techniques, such as regression, classification, association, clustering, ensembling, feature extraction, dimensionality reduction, principal component analysis and maximum likelihood estimation. These fall under four learning types, viz., unsupervised learning, supervised learning, semi-supervised learning and reinforcement learning. The major task of machine learning is to provide predictive capability to the model on which a specific algorithm is applied. Regression is also referred to as numeric prediction. Classification separates instances into classes based on features and labels. Clustering is an unsupervised learning approach in which clusters are generated based on the similarity of features; it may also be called unsupervised classification. Association is also an unsupervised learning approach, in which common patterns are analyzed.

2.4. Convergence of GIS with Machine Learning


GIS applications offer multiple operations and packages through which machine learning can be applied and visualized within GIS implementations. Spatio-temporal data consists of both spatial and temporal instances. Predictive analysis becomes possible by integrating machine learning with GIS technology. Moreover, deep learning, a sub-field of machine learning, aids in classifying spatial instances using convolutional neural networks (CNNs) and temporal instances using recurrent neural networks (RNNs) or long short-term memory networks (LSTMs). In the context of this paper, the datasets considered for implementation are unlabeled, and thus spatial clustering is an efficient technique for performing predictive analysis over geospatial data.

2.5. Spatial Clustering in GIS


In GIS, spatial clustering is based on different parameters, and the algorithms that realize it include:

• K-means Clustering: Spatial k-means clustering is an iterative approach that clusters data points based either on centroid values or on the Euclidean distance [12]. For data in raster format, centroids are chosen as the clustering parameter, whereas for data in vector format, the Euclidean distance is calculated for each data point to perform the spatial clustering. In this paper, the implementation is performed over vector data, and thus the distance parameter is calculated as follows:

$$ s(x, y) = \sqrt{\sum_{i=1}^{n} (y_i - x_i)^2} \tag{1} $$

where x and y are the two points being compared, x_i and y_i are their i-th coordinates, n is the number of coordinates (n = 2 for planar data), and s(x, y) is the distance between x and y.

Further, the number of clusters is a user-defined parameter, and clusters with different cluster IDs are generated as per the user specification.

• DBSCAN Clustering: Another approach to spatial clustering is density-based spatial clustering of applications with noise (DBSCAN). The parameters that need to be specified include the following:

– eps: It stands for "epsilon". This parameter is the maximum distance between two data points for them to be treated as neighbors within the same cluster [12].

– MinPts: This parameter is the minimum number of data points required within a cluster [12].

– Clusters: This parameter allows the user to specify the desired number of clusters, based on which a cluster ID is generated for each cluster.

DBSCAN clustering significantly improves on the k-means approach since it can detect clusters of arbitrary shape [12].
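
The two clustering approaches above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the GIS tool used in the paper: points are planar coordinate tuples, `kmeans` seeds its centroids with the first k points (a simplification made for reproducibility), and `dbscan` counts a point as its own neighbor when comparing against `min_pts`. The function names and the sample data are hypothetical.

```python
import math

def euclidean(p, q):
    """Eq. (1): square root of the summed squared coordinate differences."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

def kmeans(points, k, iterations=20):
    """Lloyd-style k-means with deterministic seeding (first k points)."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

def dbscan(points, eps, min_pts):
    """Plain DBSCAN: one cluster ID per point, with -1 marking noise."""
    labels = [None] * len(points)  # None means "not yet visited"

    def neighbours(i):
        return [j for j in range(len(points))
                if euclidean(points[i], points[j]) <= eps]

    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reclaimed as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point: keep expanding
    return labels

# Made-up sample: two tight groups and one isolated point.
blobs = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
km_labels, km_centroids = kmeans(blobs, k=2)
db_labels = dbscan(blobs + [(10.0, 0.0)], eps=0.5, min_pts=3)
```

Note the difference in inputs: `kmeans` needs the number of clusters up front and assigns every point to some cluster, whereas this `dbscan` sketch derives the number of clusters from `eps` and `min_pts` and labels isolated points as noise.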



3. Literature Survey

3.1. Remote Sensing and GIS Technology

Heipke [13] summarized state-of-the-art technological advancements in crowd-sourcing spatial data with geographical significance. The author defined crowd-sourcing as a geospatial mapping method that relies on Web 2.0 technology and informal social networks. The primary back-end technologies required to crowd-source geospatial data were discussed, and the author addressed some of the challenges associated with crowd-sourcing geospatial data. The paper also provided future research directions and discussed the challenges of the implementation strategies in use at the time. Crowd-sourcing in OSM and HD Traffic systems was presented as case studies.

Reddy et al. [14] used remote sensing and GIS technology to model disturbance regimes and biological richness over the Similipal Biosphere Reserve (SBR), situated in the state of Orissa, India. The authors computed disturbance indices by considering diverse parameters causing disturbance, such as road proximity, interspersion, fragmentation, segmentation, juxtaposition and porosity. Future research directions towards the conservation of biodiversity were also provided. The observations reported in the paper were as follows:

• Savannah and dry deciduous forests of SBR showed high disturbance.

• The gradient analysis showed that 70.01% of the area was under low disturbance, 15.24% under medium disturbance and 12.75% under high disturbance.

• The biological richness of semi-evergreen forests was reflected in 67.01% of the area.



3.2. Convergence of Machine Learning with GIS Technology

Pradhan [15] compared the performance of three well-known machine learning strategies, viz., the decision tree (DT) approach, the support vector machine (SVM) algorithm and adaptive neuro-fuzzy inference systems (ANFIS). These algorithms were applied to landslide susceptibility mapping for the Penang Hill area in Malaysia. The author considered 113 landslide locations collected through aerial photography and field surveys. The study area comprised 340,608 pixels, of which 8,403 contained landslides. The dataset was split into a training set comprising 50% of the data and a validation set comprising the remaining 50%. The processed images were passed to GIS technology to visualize map representations of the input parameters of the landslide susceptibility data. A total of 15 maps generated using GIS technology were analyzed against specified performance metrics and validated using the actual landslide locations. The results are based on the success rates of the three machine learning techniques over different instances of landslide susceptibility maps. Both success-rate and prediction-rate curves were analyzed, and the ANFIS method was found to be more accurate than the other two algorithms. Finally, the author established the viability of the experimentation and determined various risk zones over the landslide susceptibility maps.

Lee et al. [16] used machine learning techniques to address the elimination and aggregation of building layers by classifying buildings at a large scale into three classes: 0 (eliminated), 1 (retained) and 2 (aggregated). The data was retrieved from the National Geographic Information Institute. The classification algorithms used were decision tree, naive Bayes, k-nearest neighbor and support vector machine. The reported accuracies were 88.96% for the decision tree, 79.50% for naive Bayes, 88.27% for k-nearest neighbor and 87.57% for the support vector machine, so the decision tree classifier achieved the highest accuracy. The authors suggested that the classified buildings could serve as a preparatory step for building generalization operations such as simplification and aggregation. The confusion matrices and F-measures were tabulated for each classifier to describe, per algorithm, the previously defined classes of buildings.
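
To make the evaluation vocabulary concrete, the following sketch computes a confusion matrix, overall accuracy and per-class F-measure for a three-class problem of the kind described above. The labels are made up purely for illustration and are not data from Lee et al. [16]; the function names are hypothetical.

```python
def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Share of instances on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0

def per_class_f1(matrix):
    """F-measure (harmonic mean of precision and recall) for each class."""
    scores = []
    for i in range(len(matrix)):
        tp = matrix[i][i]
        fp = sum(matrix[r][i] for r in range(len(matrix))) - tp
        fn = sum(matrix[i]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

# Illustrative labels only: 0 = eliminated, 1 = retained, 2 = aggregated.
actual = [0, 0, 1, 1, 1, 2, 2, 2]
predicted = [0, 1, 1, 1, 2, 2, 2, 0]
cm = confusion_matrix(actual, predicted, classes=[0, 1, 2])
acc = accuracy(cm)
f1 = per_class_f1(cm)
```

Reporting the F-measure per class alongside accuracy is useful when the classes are imbalanced, since a classifier can reach high accuracy while performing poorly on a rare class.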

Lary et al. [17] highlighted the role of machine learning in dealing with issues faced in GIS technology and remote sensing. Non-parametric analysis through regression and classification was presented to show how machine learning improves the functionality of GIS and remote sensing applications. The authors illustrated several clustering techniques and dust sources over airborne particulates and salt floats data, respectively, as case studies. The use of genetic programming in the fields of GIS and remote sensing was illustrated through the results presented in the paper. Dealing with spatial big data was presented as a challenge, since machine learning, predictive analysis and knowledge extraction over such data become demanding tasks.

Furlanello et al. [18] discussed the proficiency of machine learning techniques in GIS technology for epidemiological analysis. The authors used the Geographic Resources Analysis Support System (GRASS) tool, the R programming language and PostgreSQL to model and analyze geospatial epidemiological data. The paper presented risk mapping for tick-borne diseases such as Lyme borreliosis and tick-borne encephalitis (TBE) over the Trentino region of the Italian Alps. The authors also discussed the convergence of machine learning with GIS technology for landscape epidemiological analysis. The paper focused primarily on the random forest classifier for the analysis. Various relevant plots and risk-analyzed, zonally classified maps were presented as results. Further, the prediction algorithm considered multiple climatic factors, which were sampled and used in the analysis.

Pourghasemi et al. [19] experimented with support vector machine classifiers and GIS technology to produce landslide susceptibility maps for Kalaleh Township in Golestan Province, Iran. The kernels used to classify and produce the landslide susceptibility maps were as follows:

• Linear kernel classifier,

• Polynomial degree-2 kernel classifier,

• Polynomial degree-3 kernel classifier,

• Polynomial degree-4 kernel classifier,

• Radial basis function (RBF) kernel classifier, and

• Sigmoid kernel classifier.

Aerial imagery and field surveys were used to extract a total of 82 landslide locations as the base data, of which 61 instances were used as training data and the remaining 21 as validation data. The conditioning factors selected were as follows:

• Altitude,

• Slope degree,

• Plan curvature,

• Slope aspect,

• Tangential curvature,

• Profile curvature,

• Lithology,

• Surface area ratio,

• Distance from faults,

• Land use,

• Distance from roadways,

• Distance from rivers,

• Stream power index, and

• Topographic wetness index.

Using the ArcGIS software package, landslide susceptibility maps were generated after the conditioning factors were processed with support vector machines. Validation over the dataset yielded success rates ranging from 79% to 87%; the prediction rates were highest for the RBF and polynomial degree-3 kernel classifiers, at 85% and 83% respectively.
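
The kernel families listed above can be written out explicitly. The sketch below uses the common textbook forms of the linear, polynomial, RBF and sigmoid kernels; the parameter names `gamma` and `coef0` and their default values are assumptions made for illustration, since the summary above does not state the settings used by Pourghasemi et al. [19].

```python
import math

def dot(x, z):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x, z))

def linear_kernel(x, z):
    return dot(x, z)

def polynomial_kernel(x, z, degree, gamma=1.0, coef0=1.0):
    # degree = 2, 3 or 4 gives the three polynomial variants in the list.
    return (gamma * dot(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(x, z, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(x, z) + coef0)

# Two made-up feature vectors (e.g. two scaled conditioning factors).
x = (1.0, 2.0)
z = (3.0, 0.5)
kernel_values = {
    "linear": linear_kernel(x, z),
    "poly2": polynomial_kernel(x, z, degree=2),
    "poly3": polynomial_kernel(x, z, degree=3),
    "poly4": polynomial_kernel(x, z, degree=4),
    "rbf": rbf_kernel(x, z),
    "sigmoid": sigmoid_kernel(x, z),
}
```

An SVM never needs the mapped feature space itself, only these pairwise kernel values, which is why swapping the kernel changes the shape of the decision boundary without changing the training procedure.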

Bui et al. [20] analyzed malaria occurrences based on socio-physical factors over the Daknong region of Vietnam with the aid of remote sensing, GIS technology and machine learning classification. The authors performed accuracy assessments through receiver operating characteristic (ROC) curves and paired t-tests. The paper focused on ensemble models and found the random subspace model to be the best fit. The motive behind the research was to perform vulnerability mapping over malaria data so that control measures could be implemented effectively based on the maps. The factors considered to influence the spread were elevation, slope, aspect, rainfall, temperature, normalized difference vegetation index (NDVI), land use, distance from rivers, distance from roads and distance from residential areas. Various performance metrics for different classifiers and ensemble methods were discussed. Results included six susceptibility maps generated using the ArcGIS software package.
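
As a brief illustration of ROC-based accuracy assessment, the area under the ROC curve (AUC) can be computed directly from its rank interpretation: the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below is a generic illustration with made-up scores, not the evaluation code or data of Bui et al. [20]; the function name `roc_auc` is hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    labels : list of 0/1 ground-truth values (1 = occurrence)
    scores : list of predicted susceptibility scores, same length
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up example: 1 = occurrence observed, 0 = no occurrence.
sample_labels = [1, 1, 0, 1, 0, 0]
sample_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(sample_labels, sample_scores)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why the measure is convenient for comparing susceptibility models that output continuous scores.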



Brown [21] proposed a novel framework for crime analysis called ReCAP, the Regional Crime Analysis Program. The model combined two well-known technologies, data fusion and data mining, for analyzing high-dimensional crime data. The proposed framework serves as a case study for the crime analysis task. The capabilities of ReCAP include crime hotspot detection, crime hotsheets based on past criminal records, map-oriented searches, control charting that alerts users, time charting to reveal temporal patterns, summary reports, cluster analysis and detailed inter-modular searching. Combined with GIS technology, these capabilities yield an effective crime analysis tool through which crime can easily be mapped, along with multiple supporting facilities.

Fandino et al. [22] proposed ways to analyze crime data with machine learning at the back end. The prime focus was on exploring crime patterns using the R programming language. The paper adopted data mining strategies and performed district-wise histogram analysis over the datasets, and it sought to visualize patterns and trends of crime. The proposed methodology demonstrated data mining as an approach for visualizing spatial crime datasets using machine learning algorithms.

Flaxman et al. [23] proposed a forecasting solution for spatio-temporal crime events. The paper used advanced techniques such as reproducing kernel Hilbert spaces (RKHS) to approximate Gaussian processes with autoregressive smoothing kernels. The proposed procedure focused on improving the performance of two widely used crime analytics methods, namely kernel density estimation (KDE) and the self-exciting point process (SEPP). After the model hyperparameters were trained and cross-validated, the performance of KDE and SEPP was shown to be improved.



3.3. Deep Learning Approaches in GIS

Murayama [24] discussed the core concepts of deep learning, a sub-field of machine learning, in the context of predicting and simulating urban growth, urban expansion and urban evolution. Socio-economic factors such as market behavior, population density, transport infrastructure and government land policies were selected as the driving attributes. Dealing with big data is a challenge when machine learning is the back-end technology. The author therefore advocated deep learning, which is capable of handling spatial big data, for predictive analysis over multi-spectral, high-dimensional geospatial big data. The author regarded urban expansion and evolution data as computationally intensive and suggested using deep learning for predictive analysis over such data.

Castelluccio et al. [25] examined convolutional neural networks (CNNs) for semantic classification of remotely sensed imagery. The authors considered two recent architectures, viz., CaffeNet and GoogLeNet, with three different modalities. The paper focused on using pre-trained and fine-tuned networks rather than conventional training approaches, in order to reduce design time and tackle overfitting. The authors performed experiments over three remote sensing datasets to validate their proposed framework: UC-Merced, WHU-RS and Coffee Scene. Finally, the authors demonstrated the efficacy of pre-trained CNNs for semantic classification of remotely sensed imagery.

Scott et al. [26] investigated the use of deep convolutional neural networks (DCNNs) for land-cover classification using high-resolution remote sensing data. Two complementary methodologies were proposed in this letter, viz., transfer learning (TL) and a novel data augmentation technique. The authors stated that TL enables DCNNs to be bootstrapped while preserving previously extracted features, and that data augmentation improves the robustness of DCNNs. The authors used the UC-Merced dataset to realize their proposed solution. The accuracies achieved with the different architectures were high and can be summarized as follows:

• CaffeNet - 97.8 ± 2.3%

• GoogLeNet - 97.6 ± 2.6%

• ResNet - 98.5 ± 1.4%

Li et al. [27] proposed an anomaly detection architecture based on CNNs. The developed architecture rests on the following points:

• Utilization of the considered data with labeled samples.

• Generation of pixel pairs to enlarge the sample size, since CNNs require large training data.

Using the differences found between the generated pixel pairs and the considered samples, the proposed framework trained a multi-layer CNN. The authors calculated similarity scores for each pixel and averaged them to generate the anomaly detection output. The letter reported that the proposed framework outperformed the Reed-Xiaoli and representation-based detectors in detecting anomalies. The experiments were performed on data from the AVIRIS sensor.

Rao et al. [28] introduced two multi-modal learning techniques to model the relationship between visual imagery captured by an autonomous underwater vehicle and acoustic bathymetry data previously obtained through remote sensing. A multi-tier architecture was presented that learns the joint distribution of the visual imagery and bathymetry modalities. The authors proposed a gated feature learning model as an extension to the existing framework. This extension enabled the model to cluster the visual imagery with the aid of a single parameter, i.e., ocean depth information. The experiments demonstrated that multi-modal learning significantly improved the accuracy of semantic classification. Further, the extension allowed clustering to be performed on both modalities together or on each individually. The resulting semantic mapping was suggested as a way to support mission planning through image-based or class-based queries.

Zhang et al. [29] proposed a novel deep learning-based model which performed network-based predictive mapping of sparse spatio-temporal events. This paper used a gated localized diffusion network, abbreviated GLDNet, to represent the network-structured data as a graph. Furthermore, the paper distinguished two mechanisms for spatial and temporal data, where a gated network handled the temporal data and GLDNet handled the spatial data. The article employed a weighted loss function and finally achieved a 10% mean hit rate and a 20% coverage level, which was reported to outperform the benchmark mean hit rate of 12% and coverage level of 25% over the crime forecasting case of South Chicago, USA.

Duan et al. [30] focused on feature extraction from crime datasets with the assistance of deep neural networks. The proposed model, abbreviated STCN for Spatio-Temporal Crime Network, applied a convolutional neural network to the hyperspectral high-dimensional crime datasets. The model forecasted crime risks based on the datasets provided. The considered datasets were the felony dataset and the 311 dataset from New York City, over which the model was tested, obtaining an F1 score of 88% and an AUC of 92%. These results suggest that deep neural networks can outperform data mining, conventional machine learning and other techniques on this task.



4. Proposed Framework

Due to the extensive applicability of GIS technology in diverse fields, three case studies were considered for implementation. Our proposed framework is three-tiered and consists of the following layers, viz.,

• Geospatial Data Pre-processing Layer

• Geospatial Visualization Layer

• Statistical and Predictive Analytics Layer

The proposed framework, with the above-mentioned layers in detail, is visualized in figure 2.

4.1. Geospatial Data Pre-processing Layer


In this layer of the proposed framework, datasets with geographical significance are collected. Certain preliminary operations are performed on each dataset to make it suitable for analysis. Three datasets are considered viz.,

• Indian Census Dataset

• Zomato Dataset

• COVID-19 Dataset for India

The datasets are thoroughly analyzed and the following operations have been performed to preprocess them.

• Handling Missing Values : Many available datasets are incomplete, with some instances missing, so handling missing values is an important task. Depending on the data type of the instances, missing values can be replaced with zero, N/A or another placeholder. The task is indispensable because missing values that persist in the dataset can distort the geospatial visualization and reduce the accuracy of predictive analytics. For unspecified data instances, a common value is employed to ensure that the dataset is complete and ready for geospatial visualization.



• Checking the Geographical coordinates : Geographical coordinates are mandatory for GIS-based operations. They comprise the latitude and longitude values associated with each feature of the dataset. For the COVID-19 data of India, the state-wise geographical coordinates are manually inserted to make the data geographically significant. Without geographical coordinates, data points cannot be plotted meaningfully over a map. Thus, this operation makes the data suitable for geospatial analysis.

• Generating shape files for the data : Individual datasets need to be converted into ".shp" (shapefile) or ".csv" format so that the data can be imported into the GIS application, in the latter case as delimited text. Further, the datasets need to be well-formatted, with each column holding the values of one particular attribute. This operation is necessary because it allows the data to be imported into the application for performing geospatial analysis.
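The three pre-processing operations above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the procedure followed in this paper: the field names, the `CENTROIDS` lookup table, its approximate coordinates and the sample records are all hypothetical.

```python
import csv
import io

# Hypothetical state-level centroids as (latitude, longitude); the values are
# rough illustrations, not the coordinates used in the paper.
CENTROIDS = {
    "Odisha": (20.95, 85.10),
    "Kerala": (10.85, 76.27),
}

def preprocess(rows, numeric_fields, region_field="State"):
    """Fill missing values and attach coordinates to each record."""
    cleaned = []
    for row in rows:
        record = dict(row)
        for key, value in list(record.items()):
            if value is None or value.strip() == "":
                # Numeric gaps become "0", text gaps become "N/A".
                record[key] = "0" if key in numeric_fields else "N/A"
        lat, lon = CENTROIDS.get(record[region_field], ("N/A", "N/A"))
        record["Latitude"], record["Longitude"] = lat, lon
        cleaned.append(record)
    return cleaned

def to_csv(records):
    """Serialize records as delimited text that a GIS tool can import."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

# Made-up sample records with gaps in the numeric fields.
raw = [
    {"State": "Odisha", "Confirmed": "118", "Death": ""},
    {"State": "Kerala", "Confirmed": "", "Death": "4"},
]
records = preprocess(raw, numeric_fields={"Confirmed", "Death"})
csv_text = to_csv(records)
```

The resulting text has one column per attribute plus `Latitude` and `Longitude`, which is the shape expected when a file is imported as delimited text with X mapped to longitude and Y mapped to latitude.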

4.2. Geospatial Visualization Layer

This is the core layer of the proposed framework. It applies geospatial operations to the pre-processed datasets. The corresponding geospatial operations include:

• Map-Overlay Analysis : This is one of the basic geospatial operations carried out to perform geospatial analysis. Each row of a dataset represents a feature with its attributes, and these features are plotted as data points in the GIS application. The data points can then be modified as convenient for analysis. Further, OpenStreetMap (OSM), Google Maps, Google satellite maps or similar maps can be visualized together with the data points. The data points are finally overlaid on the corresponding map, or in some cases on a combination of maps, which constitutes the map-overlay operation. The data points can further be labeled so that each one is identified by a particular parameter.

• Point-Cluster Analysis : This is a symbological operation performed over the map-overlay to signify the geographical topography, i.e., the structure in which the data points are spread over the map. When many data points fall in a particular region, the structure becomes difficult to analyze, and thus for densely plotted maps this operation aids in visualizing the topography. Nearby points are grouped into clusters and the clusters are symbolized based on distance metrics, which in turn reduces the number of plotted symbols and makes the topography clear to visualize.

• Heatmap Analysis : This operation is employed to analyze location-based hotspots. For any dataset, there exist some major attributes based on which hotspots can be classified or zonal analysis can be done. The operation uses pseudo-color codes as color ramps that vary in intensity. Regions that are dense with data points, or dense with respect to a particular attribute, are represented with high color intensities that gradually decrease as the density reduces. Hotspots can thus be detected as regions with higher color intensities. Further, this geospatial operation allows proper analysis of the major attribute as well as the density of data points.

• Machine Learning Integration : The convergence of machine learning with GIS technology is realized through this operation, where multiple machine learning algorithms can be used for analysis in a geospatial environment. The datasets considered are unlabeled and thus spatial clustering techniques are well suited for performing geospatial analysis. The DBSCAN and K-means clustering algorithms are employed over the maps. Both algorithms spatially cluster the data points based on similar features and allocate a cluster ID to each data point. The parameters of the two clustering techniques differ, as discussed in section 2. This operation allows data points to be classified into different groups based on features and distance metrics.
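To make the cluster-ID allocation concrete, the following is a minimal pure-Python sketch of textbook DBSCAN. It is an illustration only and is not the QGIS implementation used in this work; treating the coordinates as planar and using Euclidean distance are simplifying assumptions, and the sample points are made up. In this textbook formulation, the number of clusters is an outcome of `eps` and `min_pts` rather than an input.

```python
import math

def region_query(points, index, eps):
    """Indices of all points within eps of points[index] (itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[index], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one cluster ID per point; -1 marks noise."""
    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reachable from a core point
            if labels[j] is not None:
                continue  # already assigned; border points are not expanded
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point: keep growing
    return labels

# Two tight groups and one isolated point (made-up planar coordinates).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

Here the two groups receive IDs 0 and 1 and the isolated point is labeled as noise, mirroring how each data point on the map carries a cluster ID after spatial clustering.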



4.3. Statistical and Predictive Analytics Layer

This layer of the proposed framework deals with analysis based on statistical parameters. For each dataset, various charts have been generated to signify diverse trends found in the dataset. Further, maps with zonal classifications in diverse colors are presented for relevant datasets to signify the predictive analytics. The datasets considered to practically realize the proposed framework are relevant in their respective domains, and the statistical analysis performed over each dataset improves the visual analysis of its various parameters. Clustered-column charts, scatter plots, line charts and zone-classified maps are some of the analyses presented during the realization of the proposed framework.

4.4. Proposed Workflow

The proposed workflow epitomizes the sequence in which operations are carried out to practically realize the proposed framework. The diagrammatic representation of the workflow is presented in figure 3. The initial task, the collection of geospatial datasets, is followed by a sequence of three indispensable tasks. The first is the handling of missing data by adding specific values in place of missing instances. Next, latitude and longitude values are added where required, and finally the sequence ends with the generation of GIS-importable files. After the importable files are generated, two sub-sequences are formed: the former specifies the geospatial operations carried out over the importable files, and the latter the statistical plot generation activities performed over them. Following the import of the geospatial data, four further sequences are generated, specifying overlay generation, point-cluster mapping, heatmap analysis and spatial clustering. The sequence numbers in the workflow diagram are shown in circles in a phase-wise fashion.



[Figure 2 (diagram). Geospatial Data Pre-processing Layer: Handling Missing Values, Checking the Geographical Coordinates, Generating Shape Files for the Data; datasets: Indian Census Dataset, Zomato Dataset, COVID-19 Dataset. Geospatial Visualization Layer: Map-Overlay Analysis, Point-Cluster Analysis, Heatmap Analysis, Machine Learning Integration. Statistical and Predictive Analytics Layer.]

Figure 2: 3-tiered proposed framework for geospatial analytical implementation.



[Figure 3 (diagram). Geospatial Data Collection, followed by (1) Handling Missing Data, (2) Adding Geographical Coordinates (if required) and (3) Generation of ".csv" or ".shp" files. Step 3 branches into (3.1) Geospatial Data Import in GIS and (3.2) Plotting Statistical Charts. Step 3.1 leads to (3.1.1) Overlay Generation, (3.1.2) Point-Cluster Mapping, (3.1.3) Heatmap Analysis and (3.1.4) Spatial Clustering.]

Figure 3: Proposed workflow diagram to practically realize the 3-tiered proposed architecture.



5. Experimentation

The paper applies the convergence of GIS technology and machine learning in diverse application areas. The proposed framework is validated through three different case studies. The case studies, conducted over diverse datasets, present the indispensability of GIS technology in performing economic, business-oriented as well as epidemiological analyses.

5.1. CASE-STUDY 1
This case-study epitomizes the practical realization of the proposed framework over the Indian census data for the years 2001 and 2011 in a district-wise fashion.

5.1.1. Dataset Specification


The dataset named "Indian census data with geospatial indexing" is considered to perform geospatial analysis [31]. The dataset comprises district-wise populations for the years 2001 and 2011, represented in 432 rows and 6 columns, i.e., 2,592 data values in total. The geospatial coordinates have been calculated by creating the borders of maps for each district and then finding the mean center over a two-dimensional plane. The attributes in the dataset considered are as follows:

• State : This column gives the name of the state in India; entries repeat because each state contains several uniquely named districts.

• District : This attribute specifies the name of each district in India under its state, with a unique value for each instance.

• Latitude : This is the geographical parameter computed by the author of the dataset from the mean centers. It specifies the Northing coordinate of each district.

• Longitude : It is similar to the Latitude parameter except that it specifies the Easting coordinate of the corresponding district.



• Population in 2001 : This attribute contains unique numeric values specifying the district-wise population in the year 2001 as collected from the Indian census data of 2001.

• Population in 2011 : It is similar to the Population in 2001 attribute but specifies the district-wise population in 2011.
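The mean-center calculation mentioned in the dataset description can be illustrated as follows. This is a minimal sketch that assumes the mean center is the arithmetic mean of the boundary vertices on a plane; the exact procedure used by the dataset author is not specified in the source, and the square outline below is hypothetical. Note that the mean of the vertices generally differs from the area-weighted centroid of an irregular polygon.

```python
def mean_center(vertices):
    """Arithmetic mean of (x, y) vertices on a two-dimensional plane."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return sum(xs) / len(xs), sum(ys) / len(ys)

# Hypothetical district outline given as (longitude, latitude) vertices.
boundary = [(85.0, 20.0), (86.0, 20.0), (86.0, 21.0), (85.0, 21.0)]
longitude, latitude = mean_center(boundary)
```

For this outline, the Easting (longitude) of the mean center is 85.5 and the Northing (latitude) is 20.5.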

5.1.2. Methodology


In order to practically realize the essence of the proposed framework, certain geospatial operations in convergence with machine learning need to be performed. The step-wise methodology followed to geospatially visualize the considered dataset is as follows:

• STEP-1 : The data is checked to be in "comma separated values" format. The geospatial coordinates are checked for each row in the dataset. Since there are no missing values in the considered dataset, no further data preprocessing is required.

• STEP-2 : The dataset is imported into the GIS application as "delimited text" by specifying the X value to be Longitude and the Y value to be Latitude. A layer specifying data points over a two-dimensional plane is generated.

• STEP-3 : A new layer specifying the pre-built map is added. OpenStreetMap (OSM) is added as the map layer. Then, the data points are overlaid on the OSM and the overlay analysis is performed by labeling each data point with its population value in 2011.

• STEP-4 : The layer specifying the data points is symbolized with the aid of point-cluster analysis by specifying the number of data points that should be present in a cluster. Here, the clusters are generated using a specific symbol named "topo pop capital".

• STEP-5 : The data points layer is reset and then the heatmap symbology is selected. The heatmap is generated after selecting a color ramp and specifying the attribute to be "Population in 2011". The output is a map with varying color intensities.

• STEP-6 : The heatmap is selected as the layer and the "Processing Toolbox" of the GIS application is opened to select the "Vector Analysis" procedure. The DBSCAN clustering method is selected with the heatmap as the layer and the number of clusters specified to be 7. The "MinPts" and "eps" values are left at their defaults. The output is the heatmap with clustered points and an individual cluster ID corresponding to each data point.

• STEP-7 : Corresponding statistical analysis is done over the dataset to present relevant plots by specifying the parameters for the X and Y axes. Trend analysis for the relevant attributes is also provided. Microsoft Bing powered maps with relevant legends are also visualized.
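The point-cluster symbology of STEP-4 can be pictured with the following sketch. It is a simplified stand-in, not the renderer of the GIS application: points are grouped greedily around a seed point using a distance tolerance (an assumption made here for illustration), and each group is reduced to one symbol carrying the number of points it contains. The coordinates and the `tolerance` value are hypothetical.

```python
import math

def point_clusters(points, tolerance):
    """Greedily group points lying within `tolerance` of a cluster's seed point."""
    clusters = []  # each cluster: {"seed": (x, y), "members": [...]}
    for point in points:
        for cluster in clusters:
            if math.dist(cluster["seed"], point) <= tolerance:
                cluster["members"].append(point)
                break
        else:
            # No existing cluster is close enough: start a new one.
            clusters.append({"seed": point, "members": [point]})
    return clusters

def cluster_symbols(clusters):
    """One symbol per cluster: mean position plus the point count to display."""
    symbols = []
    for cluster in clusters:
        members = cluster["members"]
        x = sum(p[0] for p in members) / len(members)
        y = sum(p[1] for p in members) / len(members)
        symbols.append((x, y, len(members)))
    return symbols

# Made-up (longitude, latitude) points: three close together, one far away.
points = [(77.20, 28.60), (77.21, 28.61), (77.22, 28.60), (72.88, 19.08)]
symbols = cluster_symbols(point_clusters(points, tolerance=0.05))
```

The four points collapse into two symbols with counts 3 and 1, which is how a densely plotted map becomes easier to read.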

5.2. CASE-STUDY 2

A well-known application in the food delivery and rating sector worldwide is "Zomato". It is a mobile phone based application through which users can order food of nearly all segments online, and a valet is allotted to deliver the ordered food. Various restaurants worldwide are registered with this business, and they are rated and voted on by the users of the application based on food quality, packaging quality and so on.

5.2.1. Dataset Specification


The dataset is open-source and consists of the list of restaurants registered with Zomato along with various attributes [32]. Since the dataset is high-dimensional and very large, it has been reduced by retaining only the instances for the Delhi province in India. The resulting dataset consists of 5,473 rows and 21 columns, i.e., a total of 114,933 data values. The attributes in the reduced dataset are as follows:



• Restaurant ID : This is a unique field consisting of the IDs allotted by Zomato to each restaurant in the Delhi province, India.

• Restaurant Name : The name of each restaurant in the various locations of Delhi is listed row-wise under this field.

• Country Code : The country code for India is "1" and this attribute is the same for all instances in the dataset.

• City : This attribute is also the same for all instances, with the city name "Delhi", since the original dataset has been filtered on this specific attribute.

• Address : The complete address of each restaurant is listed under this column.

• Locality : This column presents the locality in which the restaurant lies, which helps to summarize the registered restaurants in a particular locality.

• Locality Verbose : This attribute is similar to the Locality attribute except that it contains the province name as well.

• Longitude : The longitude of each restaurant in the Delhi province of India is listed under this column.

• Latitude : The latitude corresponding to the location of each restaurant is listed in this column.

• Cuisines : The primary cuisine type of each restaurant is specified in this column, which aids in categorizing restaurants based on cuisine types.

• Average Cost for two : Every restaurant has a price value specifying the average cost for two persons. This column gives that average cost for each restaurant.



• Currency : The currency type is the same in all rows, i.e., "Indian Rupees", since the Delhi province of India is selected as the area of study.

• Has Table booking : This column classifies restaurants by whether they offer table booking, using the binary values "Yes" or "No".

• Has Online delivery : Restaurants offering online delivery are marked with "Yes" and those not offering it are marked with "No" in this column.

• Is delivering now : To allow users to check whether a restaurant of their choice is currently able to deliver, this attribute specifies "Yes" or "No" values for each instance.

• Switch to order menu : In the application interface, there exists an option to view the order menu of the restaurant. This attribute classifies the restaurants based on the availability of this option.

• Price range : The price range of each restaurant is classified on a scale of 1 to 4 in this column.

• Aggregate rating : Based on the user ratings, an aggregate is calculated for each restaurant which signifies its popularity, and the aggregate ratings are listed under this column.

• Rating color : Users are provided with the option of assigning rating colors to the restaurants they have ordered food from. Based on the rating colors, the restaurants are classified under this column.

• Rating text : The rating text showing the satisfaction level of the customers is shown for each restaurant under this parameter.

• Votes : Users provide votes to the restaurants based on satisfaction, food quality, taste and so on, and this attribute specifies the total votes received by each restaurant.



5.2.2. Methodology
The step-wise methodology adhered to for geospatial visualization of the considered dataset is as follows:

• STEP-1 : The data is first checked for missing values. Since it is high-dimensional data, manually entering values is not recommended. The geospatial coordinates are further checked to ensure that no coordinate is misplaced.

• STEP-2 : As the data is imported into the GIS analysis tool, the X coordinate is specified to be the longitude values and the Y coordinate to be the latitude values.

• STEP-3 : The data points are plotted over the OSM with labels specifying the restaurant name for each data point.

• STEP-4 : The point-cluster symbology is applied, with clusters represented by cluster numbers within red circles, to visualize the topography associated with the map.

• STEP-5 : Heatmap analysis is performed to signify regions dense with restaurants in various localities. Nearly eight localities are highly dense with restaurants, which is epitomized by deeper color intensities.

• STEP-6 : Spatial k-means clustering is selected under the "Processing Toolbox" to cluster the data points into 5 clusters based on Euclidean distance, and a cluster ID is allocated to each data point.

• STEP-7 : The performed statistical analysis provided various trends in the dataset like "Votes versus Restaurant Name", "Restaurant Count versus Rating Color", "Restaurant Count versus Rating Text" and "Restaurant Count versus Has Online Delivery".
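The spatial k-means step of STEP-6 can be sketched with Lloyd's algorithm and Euclidean distance. This is an illustrative sketch, not the implementation inside the GIS application: the deterministic farthest-first seeding in `farthest_first` is an assumption chosen so that the example is reproducible, and the sample points are made up. The case study used 5 clusters; the toy data below uses 2.

```python
import math

def farthest_first(points, k):
    """Deterministic seeding: start at the first point, then repeatedly
    add the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids

def kmeans(points, k, max_iterations=100):
    """Lloyd's algorithm with Euclidean distance; returns (labels, centroids)."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point takes the ID of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Two well-separated groups of made-up planar coordinates.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2)
```

Each entry of `labels` plays the role of the cluster ID shown next to a data point on the map, and `centroids` holds the final cluster centers.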

5.3. CASE-STUDY 3

The spread of the worldwide coronavirus disease pandemic, abbreviated COVID-19, is selected as the area of study for geospatial visualization. The state-wise data of COVID-19 in India is collected from the authorized government portal [33].

5.3.1. Dataset Specification


The data, collected and modified as on 29 April, 2020, consists of 33 rows and 6 columns, i.e., a total of 198 data values. The geographical coordinate values were missing and were manually inserted in the dataset for each row. The attributes after modification are as follows:

• Name of State/UT : This attribute specifies a total of 33 regions in India, each designated as a state or a union territory.

• Total Confirmed cases (Including 111 foreign Nationals) : This parameter gives the total number of patients infected with COVID-19 in a region-wise fashion. The counts include 111 foreign nationals.

• Cured/Discharged/Migrated : The counts of patients who have been cured, discharged or migrated are presented under this column in a region-wise fashion.

• Death : This attribute specifies the region-wise fatalities caused by the spread of COVID-19.

• Latitude : The latitude of each region, based on its centroid, is given under this column.

• Longitude : The corresponding longitude of each region is given under this column.

5.3.2. Methodology
The methodology followed to perform geospatial analysis is as follows:

• STEP-1 : Since the collected data had no geographical coordinates, the latitude and longitude values were added to the dataset for each state/union territory as specified.



• STEP-2 : The processed data is imported into the GIS application as delimited text by specifying the X value to be longitude and the Y value to be latitude.

• STEP-3 : The features are plotted as data points over the OSM, initially with labels of region names to specify the regions affected by COVID-19.

• STEP-4 : Further, overlay analysis is performed by placing a pie-chart over each data point. The pie-charts are generated from three attributes: "Total Confirmed Cases", "Cured/Discharged/Migrated" and "Death".

• STEP-5 : Heatmap analysis is performed to epitomize the hotspots of COVID-19 in India, classified based on the number of infected cases.

• STEP-6 : K-means clustering is performed over the heatmap with 5 clusters. The cluster IDs, assigned based on Euclidean distance, are symbolized over each data point.

• STEP-7 : The statistical analysis performed over this dataset is relevant for informing public authorities so that they can take proper control measures. Two Microsoft Bing powered maps are generated to present the risk analysis and zonal classification performed over the data. Various significant plots are also generated for the relevant attributes in the dataset.
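The idea behind a heatmap weighted by an attribute such as the number of confirmed cases (STEP-5) can be sketched as a weighted kernel density estimate. This is a hedged illustration only: the quartic kernel, the `radius` value, the planar coordinates and the weights below are assumptions, since the renderer settings used in the GIS application are not reported here.

```python
import math

def quartic_kernel(distance, radius):
    """Quartic (biweight) kernel: 1 at the point, falling to 0 at `radius`."""
    if distance >= radius:
        return 0.0
    return (1.0 - (distance / radius) ** 2) ** 2

def heat_value(location, points, weights, radius):
    """Unnormalized weighted kernel density at `location`."""
    return sum(
        w * quartic_kernel(math.dist(location, p), radius)
        for p, w in zip(points, weights)
    )

# Two made-up regions; the weights stand in for confirmed-case counts.
points = [(0.0, 0.0), (1.0, 0.0)]
weights = [100, 10]

near_hotspot = heat_value((0.0, 0.0), points, weights, radius=2.0)
near_minor = heat_value((1.0, 0.0), points, weights, radius=2.0)
far_away = heat_value((5.0, 5.0), points, weights, radius=2.0)
```

Locations near the heavily weighted point receive the largest values, which a color ramp would then render with the deepest intensity, while locations beyond the radius of every point receive zero.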

5.4. Assisting Tools and Hardware Specifications

• The tool used for visualization and spatial clustering is "QGIS 3.10 A Coruña", which is an open-source GIS software package.

• The software package used to realize the statistical visualizations and generate the Microsoft Bing powered maps is Microsoft Excel 2019.

• The hardware specifications are 8 GB RAM, Intel Core i5-8265U CPU, 256 GB Solid State Drive (SSD) and 1 TB Hard Disk Drive (HDD), running the Windows 10 Home Single Language (2019) operating system.



6. Results and Discussions

6.1. CASE-STUDY 1 - Indian Census Data (2001, 2011)

6.1.1. Geospatial Visualization


The geospatial visualization techniques applied in this case-study demonstrate the applicability of GIS technology in determining certain economic parameters of the country, especially population diversity in our case. The map-overlay analysis performed over the Indian census data is visualized geospatially. Figure 4 visualizes the data points plotted over OpenStreetMap with labels giving the district-wise population in 2011. Further, the point-cluster symbology is used to visualize the topography of the data points. The point-cluster analysis performed over the considered data is represented through black dotted circles as clusters and yellow circles as points. Figure 5 epitomizes the performed point-cluster analysis. Further, the heatmap symbology is visualized to represent the population density in India, based on the 2011 population, through pseudo color codes. The color intensity gradually reduces as the district-wise population count reduces. The region with the highest population is depicted through a deep color intensity. The heatmap symbology is visualized in figure 6. Figure 7 is the visual representation of the DBSCAN clustering performed over the heatmap. The DBSCAN clustering is performed by specifying 7 clusters with specified "MinPts" and "eps" values. The cluster IDs corresponding to each data point are presented as black numbers over the heatmap.



Figure 4: Map-overlay analysis with labels of 2011 population in a district-wise fashion.

Figure 5: Point-cluster symbology of Indian census data to show the overall topography.



Figure 6: Symbological heatmap generated using pseudo color codes to signify highly dense
regions over the map.

Figure 7: DBSCAN clustering with 7 clusters performed over the heatmap with Cluster IDs
represented for each data point.



6.1.2. Statistical Visualization
Figure 8 is a clustered column chart epitomizing the district-wise population of India in 2001. The corresponding chart for the district-wise population in 2011 is presented in figure 9. A comparison between the populations in 2001 and 2011, showing the district-wise rise in population all over India, is visualized in figure 10, where the population in 2001 is signified by the blue color code and the population in 2011 by the orange color code. Further, a Microsoft Bing powered map with zonal classifications based on population density in 2001 is presented in figure 11. The different color intensities, as represented in the associated legend in figure 11, present the regions with the highest to lowest population densities in India as of 2001.
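The district-wise comparison between 2001 and 2011 can also be expressed numerically as a decadal growth percentage. The sketch below is an illustration only; the district names and population figures are hypothetical and are not taken from the census dataset.

```python
def decadal_growth(pop_2001, pop_2011):
    """Percentage change in population between the two census years."""
    return 100 * (pop_2011 - pop_2001) / pop_2001

# Hypothetical districts mapped to (population in 2001, population in 2011).
districts = {
    "District A": (1000, 1200),
    "District B": (2000, 2100),
}
growth = {name: decadal_growth(p01, p11) for name, (p01, p11) in districts.items()}
```

A district that grows from 1,000 to 1,200 inhabitants shows a 20% rise, while one that grows from 2,000 to 2,100 shows a 5% rise, which is the kind of contrast the paired columns of the comparison chart convey visually.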

Figure 8: Chart presenting the population in 2001 in a district-wise fashion in India.



Figure 9: Chart showing population in 2011 in a district-wise fashion in India.

Figure 10: Chart showing the comparison between population in 2001 and 2011 represented
in blue and orange colors respectively.



Figure 11: Microsoft Bing powered map with zonal classification generated based on population of various states with various colors and associated legend.

6.2. CASE-STUDY 2 - Zomato Restaurants in New Delhi province of India

6.2.1. Geospatial Visualization


The diverse geospatial operations presented through this case study signify the relevance of GIS technology in industries as well as business sectors. The dataset of the well-known Zomato food delivery and rating application considered in this case-study has undergone geospatial visualization operations over the Delhi province of India. A large number of restaurants exist in the selected area of study and thus the map-overlay analysis performed over the data is visualized as multiple data points, with labels of restaurant names depicted around each data point based on the geographical coordinates. The map-overlay operation is visualized in figure 12. Further, the point-cluster symbology signifying the geographic topography is presented in figure 13, with big red circles containing cluster numbers representing the clusters and small gray circles epitomizing the points. The corresponding heatmap symbology, created with false-color code intensities to signify the regions which are densely populated with restaurants in the Delhi province of India, is provided in figure 14. Finally, spatial k-means clustering is performed over the map overlay with labels of restaurant names. The spatial k-means clustering is performed by calculating the Euclidean distance for each data point and specifying the number of clusters to be five. The k-means clustering with the cluster ID of each corresponding data point is shown in figure 15.

Figure 12: Map overlay analysis performed to show the data points with labels of restaurant
names in Delhi province of India.



Figure 13: Point-cluster symbology to present the topography of the generated map by cluster
numbers presented in big red circles and points epitomized by small gray circles.

Figure 14: Heatmap symbology analysis generated to show highly dense regions of Delhi,
India where maximum restaurants lie using pseudo color intensities.



Figure 15: Spatial k-means clustering performed over the map-overlay with labels of restaurant
names and 5 different cluster IDs allotted to each restaurant registered under Zomato in Delhi,
India.



6.2.2. Statistical Visualization
Diverse plots signifying the various parameters relevant to the analysis of the Zomato data are presented in this section. Figure 16 presents a chart which classifies the restaurants in the Delhi province of India based on the votes provided by Zomato users. Further, the clustered column chart presenting the restaurant classification based on the rating colors provided by Zomato users in Delhi is presented in figure 17. The restaurants are also classified based on rating texts, as shown in figure 18. Finally, a clustered column chart presenting the count of restaurants with options for online delivery is presented in figure 19. The parameters shown through these plots signify the trends observed during the analysis of the considered dataset over the specified region.

Figure 16: Chart presenting the classification of restaurants based on votes provided by
Zomato users in Delhi, India.



Figure 17: Clustered column chart showing the number of restaurants in Delhi, India
for each rating color used in the Zomato application.



Figure 18: Clustered column chart showing the number of restaurants in Delhi, India classified
by the rating text provided in the Zomato application.

Figure 19: Clustered column chart showing the restaurants in the Delhi province of India that
provide online delivery options through the Zomato application.



6.3. CASE-STUDY 3 - COVID-19 pandemic spread in India

6.3.1. Geospatial Visualization


The recent COVID-19 outbreak rapidly became a pandemic and caused havoc
among the general public. This section presents geospatial visualizations of
the effect of COVID-19 in India. The states and union territories affected by
the virus are labeled according to their geographical coordinates, and the
map-overlay analysis is presented in figure 20. Further, for each data point
representing a state or a union territory, a pie chart of total victims,
cured/discharged patients and fatalities is drawn over the map overlay. The
resulting state-wise pie charts over the map of India are shown in figure 21.
The heatmap symbology indicates the regions most vulnerable to COVID-19 using
pseudo-color codes whose intensity decreases as the number of victims
decreases. The heatmap plotted over the map overlay is shown in figure 22.
Finally, spatial k-means clustering with five clusters is performed over the
heatmap. The resulting cluster ID of each data point is shown in figure 23.

Figure 20: Map overlay analysis to show COVID-19 affected regions in India with labels of
different state names.



Figure 21: Pie-charts showing total cases, cured cases and fatalities in a state-wise fashion
over the map-overlay of India.

Figure 22: Heatmap symbology showing the regions with high numbers of infected patients
in various pseudo-color codes.



Figure 23: Spatial K-means clustering performed over the generated heatmap with cluster
IDs 0, 1, 2, 3 and 4 corresponding to various regions.

6.3.2. Statistical Visualization


The statistical measures relevant to visualizing the spread of COVID-19 in
India are presented in this section. Figure 24 presents a clustered column
chart of the total number of victims, the number of patients cured and the
number of fatalities for each state. The charts in this section show that the
state of Maharashtra is the most affected, with 9,318 infected patients.
Further, figure 25 presents a scatter plot of the rates of cured patients and
fatalities out of the total number of victims, with blue denoting cured cases
and orange denoting deaths. Two Microsoft Bing powered maps visualize the
state-wise infected patient counts and the state-wise death counts due to
COVID-19. The map in figure 26 shows the region with the highest infected
count in red and the remaining zones in various colors. The zonal
classification based on the death counts of the states is presented in figure
27. Together, figures 26 and 27 present the risk analysis performed over the
map of India, with COVID-19 vulnerable zones shown in distinct colors.
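The rates plotted in the scatter plot can be read as simple shares of the total confirmed cases. A minimal sketch follows, assuming that is how the rates are defined; the function name `outcome_rates` and the sample counts are illustrative placeholders rather than values from the dataset.

```python
def outcome_rates(total_cases, cured, deaths):
    """Return (cured rate, fatality rate) as fractions of total confirmed cases."""
    if total_cases <= 0:
        return 0.0, 0.0  # avoid division by zero for regions with no cases
    return cured / total_cases, deaths / total_cases


# Illustrative placeholder counts, not figures from the paper's dataset.
cured_rate, fatality_rate = outcome_rates(total_cases=1000, cured=250, deaths=30)
```

Applying the function to every state yields one (cured rate, fatality rate) pair per state, which corresponds to the two point series of the scatter plot.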



Figure 24: Clustered column chart presenting the total cases, cured patients and fatalities
caused by COVID-19 in a state-wise fashion.

Figure 25: Scatter-plot to signify the rates of cured/discharged patients and fatalities due to
COVID-19 in India.



Figure 26: Microsoft Bing map created to present the number of infected cases due to COVID-
19 in India.

Figure 27: Microsoft Bing map generated to show the state-wise zonal classification based on
death counts due to the COVID-19 pandemic in India.



7. Challenges and Future Scope

This paper presented a framework converging GIS technology and machine
learning for three datasets presented as case studies. The proposed framework
has certain limitations, which in turn suggest future research directions.
The challenges can be summarized as follows:

• Data volumes are growing rapidly, and visualizing them on local engines
is becoming increasingly complex. Spatial data that are too cumbersome to be
processed on local engines are called spatial big data (SBD) [34].
One major challenge for the proposed framework is the handling of
SBD.

• The accuracy of machine learning can diminish significantly as the size of
the data increases. Thus, machine learning in convergence with GIS
technology is not sufficient to handle high-dimensional data.

• Remote sensing applications deal with satellite imagery datasets, and
the proposed framework cannot handle such data because classical machine
learning lacks suitable techniques for imagery
data.

• The processing of geospatial data in a GIS application, as suggested by
the proposed framework, does not provide real-time visualization of the
performed implementations.

• Epidemiological analysis requires a framework that continuously updates
the data and refreshes the visualization, but the proposed
framework is static and lacks dynamic updating of the
growing data.

• Statistical visualization can be improved significantly by employing certain
machine learning algorithms, but the considered datasets are unlabeled and
thus spatial clustering is the only option for visualization. The
paper therefore presents statistical visualization performed manually over
significant attributes.

Techniques exist in the domain of computer engineering to tackle the
above-mentioned challenges. Some of these approaches are presented here as
future research directions and can be summarized as follows:

• A computing paradigm called "Cloud GIS" can provide a way to practically
realize real-time visualization of the performed implementations
[35]. In this paradigm, the implemented layers need to be inserted into
the PostGIS database and then uploaded to the cloud framework to obtain
dynamic visualization. The generated output comprises web-based links
corresponding to three types of clients, viz., thick client, thin
client and mobile client [36].

• Since data processing over the cloud requires considerable effort, a future
extension could perform the integration on the "serverless computing"
paradigm, where developer effort is greatly reduced [37].
Platforms such as Microsoft Azure Functions, Google Cloud Functions
and AWS Lambda can be integrated to shorten the
implementation time.

• Deep learning is a sub-field of machine learning that is currently popular
because of the accuracy it provides compared to traditional machine
learning algorithms. Convolutional Neural Networks (CNNs), Recurrent
Neural Networks (RNNs), Long Short-Term Memory Networks (LSTMs)
and Deep Belief Networks (DBNs), or a hybrid network, can be used as the
driving algorithms. This approach can tackle the challenges posed by
SBD as well as satellite imagery classification in remote sensing. Further,
for high-dimensional, hyperspectral and spatio-temporal data this approach
can help provide predictive capability.
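To make the serverless direction above more concrete, the following is a minimal sketch of a function that turns point records into a GeoJSON FeatureCollection for a web map client. It is an illustration under stated assumptions, not part of the proposed framework: the `handler(event, context)` entry point follows the style of an AWS Lambda Python handler, while the `records` event key, the `lon` and `lat` field names and the HTTP-style response dictionary are hypothetical choices.

```python
import json


def to_feature_collection(records):
    """Convert rows with 'lon'/'lat' keys into a GeoJSON FeatureCollection dict."""
    features = []
    for row in records:
        # Every field other than the coordinates becomes a feature property.
        properties = {k: v for k, v in row.items() if k not in ("lon", "lat")}
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [row["lon"], row["lat"]]},
            "properties": properties,
        })
    return {"type": "FeatureCollection", "features": features}


def handler(event, context):
    """Hypothetical entry point in the style of an AWS Lambda Python handler."""
    records = event.get("records", [])  # 'records' is an assumed event key
    body = json.dumps(to_feature_collection(records))
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/geo+json"},
        "body": body,
    }
```

Keeping the conversion in a plain function separate from the handler means the same logic could be reused on another platform by replacing only the thin entry point.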



8. Concluding Remarks

This paper presented the applicability of machine learning in the domain
of GIS technology through implementations over the corresponding
datasets. The paper proposed a framework that converges machine learning
with GIS technology to perform geospatial analysis operations on three
specific datasets. Machine learning, a sub-field of artificial intelligence,
provides algorithms for regression analysis, classification,
association and clustering over specific datasets. Further, spatial clustering
is a machine learning technique in which unlabeled data are grouped into a
number of clusters based on certain parameters. The considered datasets
were Indian census data, Zomato restaurants data and COVID-19 data in
India. The experimentation conducted to practically realize the proposed
framework was explained in detail. Further, the supporting results were
presented in two sections, viz., geospatial visualization and statistical
visualization. The geospatial visualization results comprised overlay analysis,
heatmap analysis, point-cluster generation and spatial clustering.
Map overlays with detailed labels were provided for all the considered
datasets. The heatmaps presented in this paper can be improved
by applying kernel density estimation over the map overlay. The
integration of the cloud computing paradigm should ease real-time
visualization of the GIS-based implementation. Further, the statistical
visualization illustrated various trends found in the considered datasets.
The challenges associated with the proposed framework were described, and
corresponding techniques to tackle them were presented as future research
directions. This paper also discussed relevant works in the domain of GIS
technology, remote sensing and the convergence of GIS technology with machine
learning and deep learning. The operations performed over the considered
datasets to visualize the data geographically were explained
in detail.
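As a pointer for the kernel density estimation mentioned above as a heatmap improvement, the following is a minimal sketch of a two-dimensional estimate. The Gaussian kernel, the single fixed bandwidth and the optional per-point weights (for example, case counts) are assumptions chosen for illustration; the function name `gaussian_kde_2d` is hypothetical.

```python
import math


def gaussian_kde_2d(points, bandwidth, weights=None):
    """Return f(x, y), a Gaussian kernel density estimate over 2-D points."""
    if weights is None:
        weights = [1.0] * len(points)  # unweighted: every point counts equally
    total = sum(weights)
    # Normalisation of a 2-D Gaussian kernel with bandwidth h: 1 / (2 * pi * h^2).
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * total)

    def density(x, y):
        acc = 0.0
        for (px, py), w in zip(points, weights):
            sq = (x - px) ** 2 + (y - py) ** 2
            acc += w * math.exp(-sq / (2.0 * bandwidth ** 2))
        return norm * acc

    return density
```

Evaluating the returned function on a regular grid of map coordinates gives a smooth intensity surface that can then be colored with a pseudo-color ramp. A larger bandwidth produces a smoother, more spread-out surface.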



References

[1] C. J. Bradshaw, M. Purvis, R. Raykov, Q. Zhou, L. S. Davis, Predicting
patterns in spatial ecology using neural networks: Modelling colonisation
of New Zealand fur seals, in: International Symposium on Environmental
Software Systems, Springer, 1999, pp. 57–65.

[2] T.-S. Chon, Y. S. Park, K. H. Moon, E. Y. Cha, Patternizing communities
by using an artificial neural network, Ecological modelling 90 (1) (1996)
69–78.

[3] S. L. Özesmi, U. Özesmi, An artificial neural network approach to spatial
habitat modelling with interspecific interaction, Ecological modelling
116 (1) (1999) 15–31.

[4] S. Openshaw, Some suggestions concerning the development of artificial
intelligence tools for spatial modelling and analysis in GIS, The annals of
regional science 26 (1) (1992) 35–51.

[5] M. Neteler, M. H. Bowman, M. Landa, M. Metz, GRASS GIS: A multi-purpose
open source GIS, Environmental Modelling & Software 31 (2012) 124–130.

[6] M. Kanevski, A. Pozdnukhov, V. Timonin, Machine learning algorithms
for geospatial data. Applications and software tools.

[7] R. Groot, Spatial data infrastructure (SDI) for sustainable land
management, ITC journal 3 (4) (1997) 287–294.

[8] T. G. Wade, J. D. Wickham, M. S. Nash, A. C. Neale, K. H. Riitters,
K. B. Jones, A comparison of vector and raster GIS methods for calculating
landscape metrics used in environmental assessments, Photogrammetric
Engineering & Remote Sensing 69 (12) (2003) 1399–1405.



[9] D. Kumar, R. Singh, R. Kaur, Spatial data analysis, in: Spatial Information
Technology for Sustainable Development Goals, Springer, 2019, pp. 101–113.

[10] J. P. D. Ordóñez, A. d. J. P. Viloria, P. W. Rojas, Comparison of spatial
clustering techniques for location privacy, in: 2019 IEEE Latin-American
Conference on Communications (LATINCOM), IEEE, 2019, pp. 1–6.

[11] E. Alpaydin, Introduction to machine learning, MIT press, 2020.

[12] C. Zhou, D. Frankowski, P. Ludford, S. Shekhar, L. Terveen, Discovering
personally meaningful places: An interactive clustering approach, ACM
Transactions on Information Systems (TOIS) 25 (3) (2007) 12–es.

[13] C. Heipke, Crowdsourcing geospatial data, ISPRS Journal of
Photogrammetry and Remote Sensing 65 (6) (2010) 550–557.

[14] C. S. Reddy, C. Pattanaik, M. Murthy, P. Roy, Spatial modeling for
biological richness analysis in Similipal Biosphere Reserve, Orissa, India,
Journal of Bioresource Conservation and Management (2008) 285–298.

[15] B. Pradhan, A comparative study on the predictive ability of the decision
tree, support vector machine and neuro-fuzzy models in landslide
susceptibility mapping using GIS, Computers & Geosciences 51 (2013) 350–365.

[16] J. Lee, H. Jang, J. Yang, K. Yu, Machine learning classification of buildings
for map generalization, ISPRS International Journal of Geo-Information
6 (10) (2017) 309.

[17] D. J. Lary, A. H. Alavi, A. H. Gandomi, A. L. Walker, Machine learning
in geosciences and remote sensing, Geoscience Frontiers 7 (1) (2016) 3–10.

[18] C. Furlanello, M. Neteler, S. Merler, S. Menegon, S. Fontanari, A. Donini,
A. Rizzoli, C. Chemini, GIS and the random forest predictor: Integration
in R for tick-borne disease risk assessment, in: Proceedings of DSC, 2003,
p. 2.



[19] H. R. Pourghasemi, A. G. Jirandeh, B. Pradhan, C. Xu, C. Gokceoglu,
Landslide susceptibility mapping using support vector machine and GIS at
the Golestan Province, Iran, Journal of Earth System Science 122 (2) (2013)
349–369.

[20] Q.-T. Bui, Q.-H. Nguyen, V. M. Pham, M. H. Pham, A. T. Tran,
Understanding spatial variations of malaria in Vietnam using remotely sensed
data integrated into GIS and machine learning classifiers, Geocarto
International 34 (12) (2019) 1300–1314.

[21] D. E. Brown, The regional crime analysis program (ReCAP): a framework for
mining data to catch criminals, in: SMC’98 Conference Proceedings. 1998
IEEE International Conference on Systems, Man, and Cybernetics (Cat.
No. 98CH36218), Vol. 3, IEEE, 1998, pp. 2848–2853.

[22] P. M. Fandino, J. B. Tan Jr, Crime analytics: Exploring analysis of crimes
through R programming language, Science 132 (2019) 696–705.

[23] S. Flaxman, M. Chirico, P. Pereira, C. Loeffler, et al., Scalable
high-resolution forecasting of sparse spatiotemporal events with kernel
methods: a winning solution to the NIJ “real-time crime forecasting
challenge”, The Annals of Applied Statistics 13 (4) (2019) 2564–2585.

[24] I.-P. D. Y. Murayama, T. A.-G. Hao, Machine learning in geoscience.

[25] M. Castelluccio, G. Poggi, C. Sansone, L. Verdoliva, Training convolutional
neural networks for semantic classification of remote sensing imagery, in:
2017 Joint Urban Remote Sensing Event (JURSE), IEEE, 2017, pp. 1–4.

[26] G. J. Scott, M. R. England, W. A. Starms, R. A. Marcum, C. H. Davis,
Training deep convolutional neural networks for land–cover classification
of high-resolution imagery, IEEE Geoscience and Remote Sensing Letters
14 (4) (2017) 549–553.



[27] W. Li, G. Wu, Q. Du, Transferred deep learning for anomaly detection
in hyperspectral imagery, IEEE Geoscience and Remote Sensing Letters
14 (5) (2017) 597–601.

[28] D. Rao, M. De Deuge, N. Nourani-Vatani, S. B. Williams, O. Pizarro,
Multimodal learning and inference from visual and remotely sensed data,
The International Journal of Robotics Research 36 (1) (2017) 24–43.

[29] Y. Zhang, T. Cheng, Graph deep learning model for network-based
predictive hotspot mapping of sparse spatio-temporal events, Computers,
Environment and Urban Systems 79 (2020) 101403.

[30] L. Duan, T. Hu, E. Cheng, J. Zhu, C. Gao, Deep convolutional neural
networks for spatiotemporal crime prediction, in: Proceedings of the
International Conference on Information and Knowledge Engineering (IKE), The
Steering Committee of The World Congress in Computer Science,
Computer . . . , 2017, pp. 61–67.

[31] S. Kumar, Indian census data with geospatial indexing, retrieved from,
https://www.kaggle.com/sirpunch/indian-census-data-with-geospatial-indexing
on 25 Sept, 2019 (2017).

[32] S. Mehta, Zomato restaurants data, retrieved from,
https://www.kaggle.com/shrutimehta/zomato-restaurants-data on 27
Sept, 2019 (2018).

[33] myGOV, Covid19 statewise status, retrieved from,
https://www.mygov.in/corona-data/covid19-statewise-status on 29
April, 2020 (2020).

[34] S. Shekhar, M. R. Evans, V. Gunturi, K. Yang, D. C. Cugler, Benchmarking
spatial big data, in: Specifying Big Data Benchmarks, Springer, 2012, pp.
81–93.

[35] J. D. Blower, GIS in the cloud: implementing a web map service on Google
App Engine, in: Proceedings of the 1st International Conference and
Exhibition on Computing for Geospatial Research & Application, 2010, pp.
1–4.

[36] K. Evangelidis, K. Ntouros, S. Makridis, C. Papatheodorou, Geospatial
services in the cloud, Computers & Geosciences 63 (2014) 116–122.

[37] S. Bebortta, S. K. Das, M. Kandpal, R. K. Barik, H. Dubey, Geospatial
serverless computing: Architectures, tools and future directions, ISPRS
Int. J. Geo-Inf. 9 (5).

