SSRN Id3599656
Perspective
Abstract
Geographic information systems (GIS) have proven efficient in many domains,
but their potential has not yet been fully explored. This paper presents a novel
architecture for the convergence of machine learning and GIS technology. The
enabling technologies behind the implementation are defined, and related work
on combining machine learning with GIS technology is summarized. Three
datasets, viz., Indian census data, Zomato restaurants data and COVID-19 data
for India, are used to realize the proposed framework in practice. The
proficiency of the framework is demonstrated through the results generated in
this paper. Furthermore, the limitations of the framework and ways to tackle
them are presented. Geospatial visualization operations and statistical plots
illustrate the overlay analysis, point-cluster generation, heatmap analysis and
spatial clustering approaches. Wherever required, a Microsoft Bing map is
provided to illustrate the risk analysis along with zonal classification.
Finally, the concluding remarks point to recent techniques such as deep
learning and the serverless computing paradigm as ways to improve the proposed
framework and reduce its limitations.
Keywords: Geographic Information Systems (GIS), Machine Learning, Spatial
1. Introduction
The essence of geospatial data analytics lies in the convergence of two
currently popular domains of computer science, viz., geographic information
systems (GIS) and machine learning (ML). The scope of the paper is the merger
of the two technologies to improve the visualization of geospatial data, which
is high-dimensional, hyperspectral and spatio-temporal in nature. Data
analytics can handle and visualize diverse categories of data. Data mining, the
extraction of knowledge from data, adds predictive capability and so makes the
data more useful. It is essential to understand what the data under
consideration signifies; thus, the data to be analyzed needs to be visualized
properly. Statistics offers multiple approaches to visualizing data, but
reproducing them on a computer is a demanding task. Furthermore, data with few
dimensions is easy to handle and visualize, whereas high-dimensional data is
difficult to visualize, process and analyze. This paper focuses on the
visualization of geospatial data with the assistance of machine learning
algorithms, which are discussed in subsequent sections.
The implementations carried out over the above-mentioned datasets illustrate
the relevance of GIS technology, in convergence with machine learning, to the
fields of economy, business and epidemiology or healthcare.
1.1. Motivations
The growing popularity of GIS technology has led to multiple applications in
the field of computer engineering. Data has become indispensable, and spatial
data therefore also needs proper visualization mechanisms. GIS technology aids
in visualizing spatial data with geographical significance. Data in which
location information is highly relevant can be visualized with the aid of GIS
operations. Further, the prime motivation behind the paper is to
1.2. Contributions
• The limitations and future research directions are summarized at the end
to present the challenges faced by the proposed architecture and how it
can be improved.
• Thick Client : It signifies clients that work with standalone GIS
applications.
[Figure residue: diagram labels Metadata, Standards, Human Resources, Geo-Database, Thick Client]
where x and y are the input features, n is the total number of data points
and s(x, y) denotes the 2-D distance between x and y.
– eps : It stands for "epsilon". This parameter is the neighborhood
radius: the maximum distance between two data points for them to be
treated as neighbors, so points farther apart than eps can fall in
different clusters [12].
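The role of eps can be made concrete with a minimal, pure-Python sketch of DBSCAN. This is an illustration under stated assumptions, not the implementation used in this paper: the function names, the `min_samples` parameter name and the sample points are hypothetical, and the distance is plain Euclidean distance on 2-D coordinates.

```python
import math


def region_query(points, i, eps):
    """Return indices of all points within distance eps of points[i] (i included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise.

    eps is the neighborhood radius; min_samples (assumed name) is the number
    of neighbors, the point itself included, needed to make a core point.
    """
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
        cluster_id += 1
    return labels


# Hypothetical sample: two tight groups and one isolated point.
sample_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(sample_points, eps=1.5, min_samples=3)
```

With these sample points the two groups receive separate cluster ids and the isolated point is marked as noise; enlarging eps enough would merge everything into one cluster, which is why the parameter has to be tuned to the spatial scale of the data.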
Reddy et al. [14], with the assistance of remote sensing and GIS technology,
modeled disturbance regimes and biological richness over the Similipal
Biosphere Reserve (SBR) in the Indian state of Orissa. The authors computed
disturbance indices from parameters that cause disturbance, such as road
proximity, interspersion, fragmentation, segmentation, juxtaposition and
porosity. Future research directions towards the conservation of biodiversity
were also provided. The observations reported were as follows:
• Savannah-type and dry deciduous forests of the SBR showed high
disturbance.
• The gradient analysis indicated that 70.01% of the area was under low
rates of disturbance, 15.24% under medium rates and 12.75% under high
rates.
Lee et al. [16] used machine learning techniques for the elimination and
aggregation of building layers, classifying buildings over a large scale into
three classes: 0 (eliminated), 1 (retained) and 2 (aggregated). The data was
retrieved from the National Geographic Information Institute. The
classification algorithms used included decision tree, naive Bayes, k-nearest
neighbor and support vector machines. This paper also presented the
corresponding accuracy obtained over the considered
Lary et al. [17] highlighted the role of machine learning in dealing with
issues faced in GIS technology and remote sensing. Non-parametric regression
and classification analyses were presented to show how machine learning
improves the functionality of GIS and remote sensing applications. As case
studies, the authors illustrated several clustering techniques and dust
sources over airborne particulates and salt floats data respectively. The use
of genetic programming in GIS and remote sensing was illustrated through
different results presented in the paper. Handling spatial big data was
presented as a challenge, since machine learning, predictive analysis and
knowledge extraction over such data become demanding tasks.
Furlanello et al. [18] discussed the proficiency of machine learning techniques
in the field of GIS technology for epidemiological analysis. The authors used
the geographic resources analysis support system (GRASS) tool, the R
programming language and PostgreSQL to model and analyze geospatial
epidemiological data. This paper presented a risk mapping for tick-borne
diseases such as Lyme borreliosis and tick-borne encephalitis (TBE) over the
Trentino region of the Italian Alps. The authors also discussed the
convergence of machine learning with GIS technology for landscape
epidemiological analysis. The paper focused primarily on the random forest
classifier model to perform the analysis. Various relevant plots and risk
analyzed zonal classified
Aerial imagery and field surveys were used to extract a total of 82 landslide
locations as the base data, of which 61 instances were used as training data
and the remaining 21 as validation data. The conditioning factors selected
were as follows:
• Altitude,
• Slope degree,
• Slope aspect,
• Tangential curvature,
• Profile curvature,
• Lithology,
• Land Use,
Using the ArcGIS software package, landslide susceptibility maps were generated
by applying support vector machines to the conditioning factors. Validation
over the dataset gave success rates ranging from 79% to 87%; prediction rates
were highest for the RBF kernel classifier and the degree-3 polynomial kernel
classifier, at 85% and 83% respectively.
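For readers unfamiliar with the two kernels named above, the following minimal sketch shows the similarity functions a support vector machine would evaluate between two feature vectors. It is not the cited authors' code; the default values of `gamma` and `coef0` are assumptions chosen only for illustration.

```python
import math


def rbf_kernel(u, v, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * ||u - v||^2).

    gamma=0.5 is an assumed illustrative value.
    """
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def poly_kernel(u, v, degree=3, gamma=1.0, coef0=1.0):
    """Polynomial kernel: (gamma * <u, v> + coef0) ** degree.

    degree=3 matches the degree-3 kernel mentioned in the text; gamma and
    coef0 are assumed illustrative values.
    """
    dot = sum(a * b for a, b in zip(u, v))
    return (gamma * dot + coef0) ** degree
```

The RBF kernel depends only on the distance between two samples, while the polynomial kernel depends on their inner product, which is one reason the two can rank the same conditioning factors differently.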
Bui et al. [20] analyzed malaria occurrences based on socio-physical factors
over the Daknong region of Vietnam with the aid of remote sensing, GIS
technology and machine learning classification. The authors assessed accuracy
through receiver operating characteristic (ROC) curves and paired t-tests.
This paper focused on ensemble models and found the random subspace model to
be the best fit. The motive behind the research was to perform vulnerability
mapping over malaria data so that control measures could be implemented
effectively based on the mappings. The factors selected as influencing the
spread were elevation, slope, aspect, rainfall, temperature, normalized
difference vegetation index (NDVI), land use, distance from rivers, distance
from roads and distance from residential areas. Various performance metrics
for different classifiers and ensemble methods were discussed in this paper.
Results included six susceptibility maps generated using the ArcGIS software
package.
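As a reminder of what an ROC-based accuracy assessment measures, the sketch below computes the area under the ROC curve (AUC) as the fraction of positive/negative pairs that a classifier ranks correctly. This is a generic illustration, not the cited authors' procedure; the function name and the example labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for a positive instance (e.g. an occurrence), 0 for a negative.
    scores: classifier outputs, higher meaning "more likely positive".
    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical scores: three of the four positive/negative pairs are ordered correctly.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is what makes it convenient for comparing classifiers and ensemble methods on the same data.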
Fandino et al. [22] proposed ways to analyze crime data using machine learning
at the backend, with a focus on exploring crime patterns using the R
programming language. The paper adopted data mining strategies and performed
district-wise histogram analysis over the datasets. The paper also attempted
to visualize patterns and trends of crime. The proposed methodology showed
that data mining, combined with some machine learning algorithms, can be used
to visualize spatial crime datasets.
Scott et al. [26] investigated the use of deep convolutional neural networks
(DCNNs) for classifying land cover from high-resolution remote sensing data.
Two complementary methodologies were proposed in this letter, viz., transfer
learning (TL) and a novel data augmentation technique. The authors stated that
TL enables the bootstrapping of DCNNs with extracted features well-preserved
and data augmen-
Zhang et al. [29] proposed a novel deep learning-based model for network-based
predictive mapping of sparse spatio-temporal events. This paper used a gated
localized diffusion network, abbreviated GLDNet, to represent a graph of the
network-structured data. Furthermore, the paper separated the mechanisms for
spatial and temporal data: a gated network modeled the temporal data and
GLDNet the spatial data. The article employed a weighted loss function and
reported a mean hit rate of 10% and a coverage level of 20%, against a
benchmark mean hit rate of 12% and coverage level of 25%, on the crime
forecasting dataset for South Chicago, USA; this is presented as outperforming
the benchmark.
Duan et al. [30] focused on feature extraction from crime datasets with the
assistance of deep neural networks. The proposed model, abbreviated STCN for
Spatio-Temporal Crime Network, applied a convolutional neural network to
hyperspectral, high-dimensional crime datasets. The model forecasted crime
risks based on the datasets provided. The model was tested on the felony
dataset and the 311 dataset from New York City, obtaining an F1 score of 88%
and an AUC of 92%. These results suggest that deep neural networks can
outperform data mining and conventional machine learning techniques on such
tasks.
• Zomato Dataset
The datasets are thoroughly analyzed and the following operations have been
performed to preprocess them.
• Generating shape files for the data : Each dataset needs to be
converted into ".shp" or ".csv" format so that it can be imported into
the GIS application (a ".csv" file is imported as delimited text).
Further, the datasets need to be well-formatted, with the values in each
column all describing one attribute. This operation is necessary because
it allows the data to be imported into the application for geospatial
analysis.
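The ".csv" route can be sketched as follows. This is a minimal illustration assuming one row per feature with explicit `latitude` and `longitude` columns; the function name, the column names and the sample rows are hypothetical and are not taken from the datasets used in this paper.

```python
import csv
import io


def rows_to_delimited_text(rows, fieldnames):
    """Serialize rows (dicts) as comma-separated text with a header line.

    Every row must provide the assumed 'latitude' and 'longitude' columns so
    that a GIS application can place it when importing the delimited text.
    """
    for row in rows:
        for column in ("latitude", "longitude"):
            if row.get(column) in (None, ""):
                raise ValueError(f"row {row!r} has no {column}")
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample rows; names and coordinates are illustrative only.
sample_rows = [
    {"name": "Site A", "latitude": 28.61, "longitude": 77.21},
    {"name": "Site B", "latitude": 19.08, "longitude": 72.88},
]
csv_text = rows_to_delimited_text(sample_rows, ["name", "latitude", "longitude"])
```

Writing `csv_text` to a file with a ".csv" extension gives a delimited-text file in which each column holds a single attribute, which is the property the import step relies on.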
This is the principal section of the proposed framework. This layer applies
geospatial operations to the pre-processed datasets. The geospatial operations
include:
This section of the proposed framework deals with analysis based on
statistical parameters. For each dataset, various charts have been generated
to show the trends found in the dataset. Further, for relevant datasets, maps
with zonal classifications in different colors are presented to convey the
predictive analytics. The statistical analysis performed over each dataset
improves the visual analysis of its parameters. Clustered-column charts,
scatter plots, line charts and zone-classified maps are among the analyses
presented during the realization of the proposed framework.
The proposed workflow describes the sequence in which operations are carried
out to realize the proposed framework in practice. The workflow is shown
diagrammatically in figure 3. The initial task, the collection of geospatial
datasets, is followed by a sequence of three indispensable tasks. The first is
the handling of missing data by adding specific values in place of missing
instances. Next, latitude and longitude values are added, and finally the
GIS-importable files are generated. After the importable files are generated,
the workflow splits into two sub-sequences: the former covers the geospatial
operations carried out over the importable files, and the latter covers the
generation of statistical plots from the same files. Following the import of
the geospatial data, four further sequences are formed, covering overlay
generation, point-cluster mapping, heatmap analysis and spatial clustering.
The sequence numbers in the workflow diagram are shown in circles, phase by
phase.
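The first two preprocessing tasks of the workflow can be sketched as below. This is a minimal illustration under stated assumptions: the function names, the column names, the per-column default values and the coordinate lookup table are all hypothetical, and the paper does not prescribe how the replacement values are chosen.

```python
def fill_missing(rows, defaults):
    """Return copies of rows with empty or None values replaced by defaults.

    defaults maps a column name to the value used in place of a missing
    entry (an assumed, per-column choice).
    """
    filled = []
    for row in rows:
        new_row = dict(row)
        for column, default in defaults.items():
            if new_row.get(column) in (None, ""):
                new_row[column] = default
        filled.append(new_row)
    return filled


def add_coordinates(rows, lookup, key):
    """Attach latitude and longitude to each row.

    lookup maps the value of row[key] (e.g. a district name) to a
    (latitude, longitude) pair; the table itself is assumed to exist.
    """
    located = []
    for row in rows:
        latitude, longitude = lookup[row[key]]
        located.append({**row, "latitude": latitude, "longitude": longitude})
    return located


# Hypothetical input: one row has a missing population value.
raw_rows = [
    {"district": "D1", "population": ""},
    {"district": "D2", "population": 120},
]
coordinates = {"D1": (10.0, 20.0), "D2": (11.0, 21.0)}
prepared = add_coordinates(fill_missing(raw_rows, {"population": 0}), coordinates, key="district")
```

Both helpers return new rows instead of modifying their input, so the raw dataset stays available for the statistical branch of the workflow.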
Figure 3: Proposed workflow diagram to practically realize the 3-tiered proposed architecture.
The paper applies the convergence of GIS technology and machine learning to
diverse application areas. The proposed framework is validated through three
different case studies. The case studies, conducted over diverse datasets,
show the importance of GIS technology in performing economic,
business-oriented and epidemiological analyses.
5.1. CASE-STUDY 1
This case-study presents the practical realization of the proposed framework
over district-wise Indian census data for the years 2001 and 2011.
• State : This column gives the name of the Indian state. State names
repeat because each row corresponds to a unique district.
• District : This attribute specifies the name of each district in India
within its state, with a unique value for each instance.
• STEP-1 : The data is checked to be in "comma separated values" format.
The geospatial coordinates are checked for each row in the dataset.
Since there are no missing values in the considered dataset, data
preprocessing is not required.
• STEP-4 : The data points layer is symbolized using point-cluster
analysis by specifying the number of data points that should be present
in a cluster. Here, the clusters are drawn using a specific symbol named
"topo pop capital".
• STEP-5 : The data points layer is reset and then the heatmap symbology
is selected to be operated on. The heatmap is generated after selecting a
5.2. CASE-STUDY 2
"Zomato" is a well-known application in the food delivery and rating sector
worldwide. It is a mobile phone based application through which users can
order food of nearly all segments online, after which a valet is assigned to
deliver the order. Restaurants worldwide are registered with this business and
are rated and voted on by the users of the application based on food quality,
packaging quality and so on.
• Country Code : The country code for India is "1" and this attribute is
the same for all instances in the dataset.
• City : This attribute is also the same for all instances, namely
"Delhi", since the original dataset has been filtered to this city only.
• Locality : This column gives the locality in which the restaurant lies,
which helps to summarize the registered restaurants in a particular
locality.
• Longitude : The geographical coordinate, i.e., the longitude, of each
restaurant in Delhi, India is given in this column.
• Average Cost for two : Every restaurant has a price value to specify the
average cost for two persons. The average cost is represented individually
for each restaurant in this column.
• Has Online delivery : Restaurants having online delivery are marked
with "Yes" and those not providing options for online delivery are marked
with "No" in this column.
• Price range : The price range of each restaurant is given as a class
from 1 to 4 in this column.
• Rating color : Users can assign a rating color to the restaurants they
have ordered food from. The restaurants are classified by these rating
colors under this column.
• Rating text : The rating text showing the satisfaction level of the cus-
tomers is shown for each restaurant under this parameter.
• Votes : Users vote for restaurants based on satisfaction, food quality,
taste and so on, and this attribute specifies the aggregate votes
received by each restaurant.
• STEP-1 : The data is first checked for missing values. Since it is
high-dimensional data, manually entering values is not recommended. The
geospatial coordinates are then checked to ensure that no coordinate is
misplaced.
• STEP-2 : When the data is imported into the GIS analysis tool, the X
coordinate is set to the longitude values and the Y coordinate to the
latitude values.
• STEP-3 : The data points are plotted over the OSM, with each data point
labeled with its restaurant name.
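The coordinate checks in the steps above can be sketched as follows. This is a minimal illustration, not the procedure used in this paper: the function names are hypothetical, the `Latitude` field name is assumed by analogy with the `Longitude` column, and the bounding box for Delhi is a rough assumed extent used only to flag suspicious points such as swapped latitude and longitude.

```python
# Rough, assumed bounding box for Delhi: (min_lon, min_lat, max_lon, max_lat).
DELHI_BBOX = (76.8, 28.4, 77.4, 28.9)


def to_xy(row, lon_field="Longitude", lat_field="Latitude"):
    """Map a row to (X, Y) with X = longitude and Y = latitude.

    Raises ValueError if either value is outside the valid geographic range.
    """
    lon = float(row[lon_field])
    lat = float(row[lat_field])
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude out of range: {lon}")
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude out of range: {lat}")
    return (lon, lat)


def find_misplaced(rows, bbox):
    """Return the rows whose point falls outside the bounding box."""
    min_lon, min_lat, max_lon, max_lat = bbox
    misplaced = []
    for row in rows:
        x, y = to_xy(row)
        if not (min_lon <= x <= max_lon and min_lat <= y <= max_lat):
            misplaced.append(row)
    return misplaced
```

A row whose latitude and longitude were swapped still passes the range check for a city such as Delhi, which is why the second check against an expected extent is useful before plotting.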
5.3. CASE-STUDY 3
The spread of the worldwide pandemic coronavirus disease, abbreviated COVID-19,
is selected as the area of study for geospatial visualization. The state-
• Death : This attribute specifies the region-wise fatalities caused by
the spread of COVID-19.
5.3.2. Methodology
The methodology followed to perform geospatial analysis is as follows:
• STEP-3 : Various features are first plotted as data points over the OSM,
labeled with region names, to show the regions affected by COVID-19.
• The tool used for visualization and spatial clustering is "QGIS 3.10 A
Coruña", which is an open-source GIS software package.
• The software package used to realize the statistical visualizations and gen-
erate Microsoft Bing powered maps is Microsoft Excel 2019.
Figure 5: Point-cluster symbology of Indian census data to show the overall topography.
Figure 7: DBSCAN clustering with 7 clusters performed over the heatmap with Cluster IDs
represented for each data point.
Figure 10: Chart showing the comparison between population in 2001 and 2011 represented
in blue and orange colors respectively.
Figure 12: Map overlay analysis performed to show the data points with labels of restaurant
names in Delhi, India.
Figure 14: Heatmap symbology analysis generated to show highly dense regions of Delhi,
India where maximum restaurants lie using pseudo color intensities.
Figure 16: Chart presenting the classification of restaurants based on votes provided by
Zomato users in Delhi, India.
Figure 19: Clustered column chart presenting restaurants in Delhi, India which
provide online delivery options through the Zomato application.
Figure 20: Map overlay analysis to show COVID-19 affected regions in India with labels of
different state names.
Figure 22: Heatmap symbology showing the regions with highly infected patients in various
pseudo color codes.
Figure 25: Scatter-plot to signify the rates of cured/discharged patients and fatalities due to
COVID-19 in India.
Figure 27: Microsoft Bing map generated to show state-wise zonal classification based on
death counts due to the COVID-19 pandemic in India.
This paper presented a framework converging GIS technology and machine
learning, realized over three datasets presented as case studies. The proposed
framework has certain limitations, which also indicate future research
directions for tackling them. The challenges can be summarized as follows:
• Since data processing over the cloud requires considerable effort, a
future extension is to integrate the framework with the "serverless
computing paradigm", where developer effort is reduced to a large
extent [37]. Platforms such as Microsoft Azure Functions, Google Cloud
Functions and AWS Lambda can be integrated to shorten the implementation
time.
References
[7] R. Groot, Spatial data infrastructure (SDI) for sustainable land
management, ITC Journal 3 (4) (1997) 287–294.
[16] J. Lee, H. Jang, J. Yang, K. Yu, Machine learning classification of buildings
for map generalization, ISPRS International Journal of Geo-Information
6 (10) (2017) 309.
[21] D. E. Brown, The regional crime analysis program (ReCAP): a framework for
mining data to catch criminals, in: SMC'98 Conference Proceedings. 1998
IEEE International Conference on Systems, Man, and Cybernetics (Cat.
No. 98CH36218), Vol. 3, IEEE, 1998, pp. 2848–2853.
[29] Y. Zhang, T. Cheng, Graph deep learning model for network-based
predictive hotspot mapping of sparse spatio-temporal events, Computers,
Environment and Urban Systems 79 (2020) 101403.
[30] L. Duan, T. Hu, E. Cheng, J. Zhu, C. Gao, Deep convolutional neural
networks for spatiotemporal crime prediction, in: Proceedings of the
International Conference on Information and Knowledge Engineering (IKE), The
Steering Committee of The World Congress in Computer Science, Computer
. . . , 2017, pp. 61–67.
[31] S. Kumar, Indian census data with geospatial indexing, retrieved from
https://www.kaggle.com/sirpunch/indian-census-data-with-geospatial-indexing
on 25 Sept, 2019 (2017).
[35] J. D. Blower, Gis in the cloud: implementing a web map service on google
app engine, in: Proceedings of the 1st International Conference and Ex-