Covid-19 Data Analysis: Brian Mendes
e-ISSN: 2278-0661,p-ISSN: 2278-8727, Volume 24, Issue 2, Ser. IV (Mar. –Apr. 2022), PP 11-23
www.iosrjournals.org
Abstract: Data analysis is the process of bringing order and structure to collected data. It turns data into
information that teams can use. Data visualization is the process of putting data into a chart, graph, or other visual
format that helps inform analysis and interpretation. Analysis and visualization of datasets have long been
helpful for purposes ranging from improving customer experience to shaping business plans, all of which
require analysis of the underlying data.
In 2020, the world saw a paradigm shift across many industries, businesses, the climate, and human life itself
due to the COVID-19 pandemic. Governments and many private organizations need to know the damage caused
by the pandemic for reasons ranging from public welfare to business strategy. These assessments are
important for the growth and robustness of the national economy.
To calculate and analyze the effects, we need data on the damage. This data is scattered in clusters across
the internet; it has to be collected and merged into a dataset.
Even when data is amassed into datasets, sorting it and extracting meaning from it is still an enormous task. The
data can be simplified and visualized using Python libraries such as Matplotlib, NumPy, and pandas.
The main goal of this project is to use these Python tools to simplify, analyze, visualize, and predict different
aspects under the banner “Impact of COVID-19 on industries, climate and population.”
Key Words: Data Analysis, NumPy, Data Visualization, Pandas, Dataset.
---------------------------------------------------------------------------------------------------------------------------------------
Date of Submission: 12-04-2022 Date of Acceptance: 28-04-2022
---------------------------------------------------------------------------------------------------------------------------------------
I. Introduction
Problem Definition: Our project helps visualize the changes brought about by the pandemic at the
industrial, climatic, and public levels by comparing historical data (before 2020) with data from
2020-21. We also plan to implement a machine learning module to interpret post-pandemic stock prices
and the performance of different industries based on current data.
by USD 1.56 billion. Oil plummeted to an 18-year low of $22 per barrel in March, and Foreign Portfolio
Investors (FPIs) have withdrawn large amounts from India, about USD 571.4 million. While lower oil prices
will shrink the current account deficit, reverse capital flows will expand it. The rupee is continuously depreciating.
MSMEs will undergo a severe cash crunch. The crisis also witnessed a horrifying mass exodus of the floating
population of migrants on foot amidst the countrywide lockdown.
II. Objectives
To analyze the impact of the pandemic across the globe in fields such as public health, the economy,
and the climate.
To visualize the data using the visualization tools available in Python.
To clean the data to improve data quality and overall productivity.
To use machine learning to predict the stock market.
The trained LSTM model will help us visualize how the stock market has been affected by the pandemic
and, based on past results, predict where the market is heading.
Around 80% of people with COVID-19 recover without specialist treatment. These people may experience
mild, flu-like symptoms. However, one in six people may experience severe symptoms, such as trouble
breathing.
Globally, as of April 25, 2021, there were 145,216,414 confirmed cases of COVID-19, including 3,079,390
deaths, reported to WHO. As of April 21, 2021, a total of 899,936,102 vaccine doses had been administered.
India is now witnessing the third wave of infections, with a significant surge in COVID-19 cases and a
greater mortality rate. Double and triple mutant viruses are spreading in many Indian states.
Mutation happens when the virus makes copies of itself with changes from the original strain; these mutated
viruses are also called variants of the original virus.
India had 2,682,751 active cases, 14,085,110 discharged cases and 192,311 confirmed deaths as of April 25,
2021. India started its first vaccination drive on January 16, 2021 and, according to the Ministry of Health, as of April
25, 2021, a total of 140,916,417 people had been vaccinated. From May 1, 2021 the vaccination drive would
cover everyone aged 18 and above.
In an AutoRegressive model the forecasts correspond to a linear combination of past values of the variable. In a
Moving Average model, the forecasts correspond to a linear combination of past forecast errors.
Basically, the ARIMA models combine these two approaches. Since they require the time series to be stationary,
differencing (Integrating) the time series may be a necessary step, i.e., considering the time series of the
differences instead of the original one.
The SARIMA model (Seasonal ARIMA) extends the ARIMA by adding a linear combination of seasonal past
values and/or forecast errors.
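The differencing step described above can be illustrated with a short sketch. The helper name `difference` and the sample values are assumptions made for illustration only; when an ARIMA model is fitted with a library, this step is normally handled by the library itself.

```python
def difference(series, lag=1):
    """Return the series of differences y[t] - y[t - lag].

    lag=1 is ordinary differencing (the "I" in ARIMA);
    lag=m is the seasonal differencing used by SARIMA with period m.
    """
    if lag < 1 or lag >= len(series):
        raise ValueError("lag must be between 1 and len(series) - 1")
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Illustrative values only: a series with a steady upward trend.
trend = [10, 12, 14, 16, 18]
first_diff = difference(trend)            # the trend is removed
seasonal_diff = difference(trend, lag=2)  # difference against two steps back
```

The differenced series has no trend left, which is the property that the AR and MA parts of the model rely on.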
2. Exponential Smoothing:
Exponential smoothing is one of the most successful classical forecasting methods. In its basic form it is called
simple exponential smoothing and its forecasts are given by:
Ŷ(t+h|t) = αy(t) + α(1-α)y(t-1) + α(1-α)²y(t-2) + …
with 0<α<1.
We can see that forecasts are equal to a weighted average of past observations and the corresponding weights
decrease exponentially as we go back in time.
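As a minimal sketch of this weighting, the code below computes the weights directly and also produces the forecast in the equivalent recursive form. The function names and sample observations are illustrative, and initialising the level with the first observation is an assumption (one common convention), not something the text specifies.

```python
def ses_weights(alpha, n):
    """Weight alpha * (1 - alpha)**j for the j-th most recent observation."""
    return [alpha * (1 - alpha) ** j for j in range(n)]


def ses_forecast(series, alpha):
    """Forecast from simple exponential smoothing, in recursive form.

    The level starts at the first observation (assumed convention) and is
    updated as level = alpha * y + (1 - alpha) * level for each new y.
    """
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level


# Illustrative observations, oldest first.
observations = [10.0, 12.0, 11.0, 13.0]
forecast = ses_forecast(observations, alpha=0.5)
weights = ses_weights(0.5, 3)  # weights for the three most recent values
```

A larger smoothing parameter puts more weight on recent observations, while a smaller one spreads the weight further back in time.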
3. LSTM (RNN):
Long Short-Term Memory (LSTM) networks are a special kind of recurrent neural network (RNN) capable of learning
order dependence, including long-term dependencies, in sequence prediction problems. This behavior is required in
complex problem domains such as machine translation and speech recognition. LSTMs are a complex area of deep
learning.
Problem Statement:
In this project we examined the question ‘What does the data say about the Covid-19 situation in India?’. Using the available
data, we arrived at some observations and conclusions. This analysis mainly focuses on:
DOI: 10.9790/0661-2402041123
It showcases a series of plots played as a video over several weeks, indicating the differences
in numbers using heat maps in Matplotlib.
Climate Screen:
This module highlights the climatic changes brought about by the pandemic.
The two parameters we chose to analyze are:
• Air Quality Index
• Temperature comparison of Mumbai between the years 2019 and 2020.
Economy Screen:
• This module shows the performance of sectors and companies in terms of average annual turnover in the form of
comparison graphs for the years 2017, 2018, 2019 and 2020.
• It also allows the user to type the names of two countries to plot a comparison graph of the unemployment rates
of those countries.
• Stock price prediction using a stacked LSTM for Tata Motors (can be done for any company).
System Requirements:
Hardware requirements:
Processor: Pentium(R) Dual Core CPU
RAM: 2 GB
Software requirements:
Operating system: Windows 7/8/10
Environment: Streamlit and Jupyter Notebook
Python Version: 3.7+
The following libraries and modules are required for project implementation:
o NumPy
o Streamlit
o Pandas
o Matplotlib
o Plotly
o scikit-learn
o TensorFlow
o Folium
o pandas-datareader
o datetime (Python standard library)
Solution Methodology
Climate Screen:
The data used is static and is read into a pandas DataFrame.
For the first graph, the data is split into six DataFrames, one for each of six cities. The AQI indices,
selected from a drop-down list in the Streamlit app, are then passed as parameters to the plot function.
For the second graph, the data is cleaned using the dropna and fillna functions, split into two time
intervals (2019 and 2020) and visualized in the form of a comparison bar chart.
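The cleaning and splitting steps for the second graph might look roughly like the sketch below. The column names, the toy readings and the choice to fill gaps with the column mean are all assumptions made for illustration; the text does not state which fill strategy was used.

```python
import pandas as pd

# Toy temperature readings (placeholders, not the project's dataset).
raw = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-04-01", "2019-04-02", "2019-05-01",
        "2020-04-01", "2020-04-02", "2020-05-01", None,
    ]),
    "temperature": [30.0, None, 32.0, 29.0, 31.0, None, None],
})

# dropna: discard rows that have no usable date.
clean = raw.dropna(subset=["date"]).copy()

# fillna: replace missing readings with the column mean (assumed strategy).
clean["temperature"] = clean["temperature"].fillna(clean["temperature"].mean())

# Split into the two time intervals and line them up month by month.
clean["year"] = clean["date"].dt.year
clean["month"] = clean["date"].dt.month
comparison = clean.pivot_table(
    index="month", columns="year", values="temperature", aggfunc="mean"
)
# comparison.plot(kind="bar") would draw the comparison bar chart.
```

Keeping one column per year in `comparison` means a single bar-chart call can place the 2019 and 2020 bars side by side for each month.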
Economy Screen:
For the first graph, the data is split into four time intervals (2017, 2018, 2019 and 2020), grouped by
sector, and the mean annual turnover is calculated using the NumPy mean function.
For the second graph, the same procedure is followed, except that the data for
the years 2019 and 2020 (already split out earlier) is grouped by company.
For the third graph, NIFTY and BSE index data is gathered from Yahoo Finance and plotted using
Matplotlib methods.
The data for the last graph is gathered from worldbank.org, and the unemployment rates are plotted in the
form of a line graph.
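The grouping used for the first two graphs can be sketched as follows. The column names and turnover figures are placeholders, and the sketch uses the pandas `mean` aggregation, which gives the same arithmetic mean as the NumPy function mentioned above.

```python
import pandas as pd

# Placeholder turnover records (illustrative values only).
turnover = pd.DataFrame({
    "year": [2019, 2019, 2019, 2020, 2020, 2020],
    "sector": ["IT", "IT", "Auto", "IT", "IT", "Auto"],
    "company": ["A", "B", "C", "A", "B", "C"],
    "turnover": [100.0, 200.0, 50.0, 120.0, 240.0, 30.0],
})

# First graph: mean annual turnover per sector, one column per year.
sector_mean = (
    turnover.groupby(["sector", "year"])["turnover"].mean().unstack("year")
)

# Second graph: same procedure, grouped by company for 2019 and 2020 only.
recent = turnover[turnover["year"].isin([2019, 2020])]
company_mean = (
    recent.groupby(["company", "year"])["turnover"].mean().unstack("year")
)
```

Each resulting table has one row per sector or company and one column per year, which is the shape a grouped comparison graph needs.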
C. Experimental Results
Graphical user interface (GUI):
Climate
Public Health:
Economy
Stock Prediction
V. Methodology
First, we obtained the list of the top 500 companies from Wikipedia and arranged them by GICS sector, which
includes Information Technology, Financials, Health Care, etc.
Then we added stock data from Google Finance, where we again obtained the list of S&P 500 companies, and
did some data cleaning to eliminate unwanted data.
Then we started examining the data by the number of stocks people are selling and comparing it with the historical
data. We also added market cap data and the percentage change of stock prices for better visualization; for this we
created separate DataFrames in Jupyter Notebook.
To calculate the total change in the market cap of the S&P 500, in billions, we used the formula:
sum(dataframe1[market_cap1]-dataframe2[market_cap])/10**9
We ranked companies by percentage change of stock price and also computed the percentage change of stock prices by
sector, which was later visualized using a box plot.
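A minimal sketch of these calculations is given below. The ticker symbols, column names and figures are hypothetical, and treating the change as "after minus before" is an assumption, since the text does not say which DataFrame holds which date.

```python
import pandas as pd

# Hypothetical snapshots before and after the market drop (toy values).
before = pd.DataFrame({
    "symbol": ["AAA", "BBB", "CCC"],
    "sector": ["IT", "IT", "Energy"],
    "price": [100.0, 50.0, 80.0],
    "market_cap": [3.0e9, 2.0e9, 5.0e9],
}).set_index("symbol")

after = pd.DataFrame({
    "symbol": ["AAA", "BBB", "CCC"],
    "price": [110.0, 40.0, 60.0],
    "market_cap": [3.3e9, 1.6e9, 3.75e9],
}).set_index("symbol")

# Total change in market cap, expressed in billions.
total_change_billion = (after["market_cap"] - before["market_cap"]).sum() / 10**9

# Percentage change of the stock price per company, ranked best to worst.
pct_change = (after["price"] - before["price"]) / before["price"] * 100
ranking = pct_change.sort_values(ascending=False)

# Mean percentage change per sector, ranked the same way.
sector_change = pct_change.groupby(before["sector"]).mean().sort_values(ascending=False)
```

Indexing both tables by the same symbol lets pandas align the rows automatically, so the subtraction compares each company with itself.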
First, we obtained a dataset from Kaggle that provided the AQI of different cities.
We created DataFrames for six major cities: Ahmedabad, Delhi, Mumbai, Chennai, Hyderabad and
Kolkata.
Then we cleaned the data to make sure no blank columns or redundant data were left.
We then used a simple line chart to analyze the AQI levels across the cities and noticed a drop when the
Covid-19 pandemic hit.
We also compared the temperatures of 2019 and 2020; we interpreted the difference as a sign that the pandemic
reduced greenhouse gas emissions and helped keep the temperature in check.
First, we used the JHU dataset of Covid-19 confirmed cases, deaths, recoveries, etc. and made a separate DataFrame for
each of these categories.
Then we cleaned the data to avoid errors and thereby improve data quality.
We sorted the countries in descending order of confirmed cases, so the USA, which was
reporting the most cases (as of April 2021), was ranked first, followed by India, Brazil, etc.
Then we used various plotting methods such as line graphs and bar graphs, and finally plotted the data on a world map.
We used the Folium library in Python to plot the confirmed cases and deaths on a world map, which helps to
visualize the impact of Covid-19 across the planet.
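The ranking step can be sketched as below. The rows mimic the province-level layout of the JHU tables, but the country names and counts are placeholders, and the `total` column is a hypothetical stand-in for the latest cumulative count.

```python
import pandas as pd

# Placeholder rows, one per province or country (not real case counts).
confirmed = pd.DataFrame({
    "Province/State": [None, "Region 1", "Region 2", None],
    "Country/Region": ["Country A", "Country B", "Country B", "Country C"],
    "total": [500, 300, 400, 600],
})

# Sum province-level rows per country, then sort in descending order.
by_country = (
    confirmed.groupby("Country/Region")["total"]
    .sum()
    .sort_values(ascending=False)
)
top_countries = by_country.head(3)
```

Summing per country first matters because some countries are reported province by province; sorting the raw rows would rank individual provinces instead. Each entry of `by_country` could then be drawn as a marker on a Folium map, sized by its count.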
First, we had to gather the data; for that purpose we used Tiingo, which provides historical stock market
prices. We chose Tata Motors for the prediction.
Then we split the dataset into 65% training data and 35% testing data.
Since we are using an LSTM model, we had to convert the array of values into a dataset matrix.
We trained the model for 100 epochs and compiled it with mean squared error as the loss.
Then we calculated the RMSE performance metric and configured the model to predict the next 10 days on
a graph.
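The data preparation around the model can be sketched as follows. The function names, the window length of 3 and the placeholder price series are assumptions for illustration, and the LSTM itself is left out: a simple stand-in function takes the place of the trained model's one-step prediction.

```python
import numpy as np


def split_series(values, train_fraction=0.65):
    """Chronological split: first 65% for training, the rest for testing."""
    cut = int(len(values) * train_fraction)
    return values[:cut], values[cut:]


def create_dataset(values, time_step):
    """Build the dataset matrix: each row of X holds time_step consecutive
    values and y holds the value that follows them."""
    X, y = [], []
    for i in range(len(values) - time_step):
        X.append(values[i:i + time_step])
        y.append(values[i + time_step])
    return np.array(X), np.array(y)


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


def forecast_next(values, time_step, n_days, predict_one):
    """Roll the window forward n_days, feeding each prediction back in."""
    window = list(values[-time_step:])
    forecasts = []
    for _ in range(n_days):
        next_value = float(predict_one(np.array(window)))
        forecasts.append(next_value)
        window = window[1:] + [next_value]
    return forecasts


prices = np.arange(1.0, 21.0)  # placeholder series, not real stock prices
train, test = split_series(prices)
X_train, y_train = create_dataset(train, time_step=3)

# LSTM layers expect input shaped (samples, time steps, features).
X_train_lstm = X_train.reshape(X_train.shape[0], X_train.shape[1], 1)

# Stand-in for the trained model: predicts "last value plus one".
next_10 = forecast_next(prices, 3, 10, lambda window: window[-1] + 1.0)
```

Splitting chronologically rather than at random keeps future prices out of the training set, and feeding each prediction back into the window is what allows a one-step model to forecast ten days ahead.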
In Fig 14, the blue line indicates the past performance and the yellow line denotes the
prediction that the LSTM model produced for the next 10 days.
VI. Conclusion
Hence, we were successful in visualizing the three major areas affected by the pandemic,
namely Climate, Public Health and Economy. This analysis can be useful for the government when carrying
out vaccination drives and imposing stricter restrictions in adversely affected areas. The economic
analysis can help companies understand the losses or profits they are making and adjust their
marketing strategies accordingly. The climatic analysis helps us understand the difference brought about by halting
industrial activity (resulting in much lower air pollution overall).
By recognizing and rectifying errors and minimizing non-value-added chores, data analytics can help
improve the user experience. Furthermore, data analytics can assist with automated data cleansing and data
quality improvement, benefiting both customers and enterprises.
The goal of data visualization is to make sense of the data and put it to good use for
the organization. Data is intricate, and it gains value when it is visualized. Without visualization, it is difficult to
explain data discoveries quickly and detect patterns, let alone draw insights and engage with the
data.
Data scientists can find patterns and flaws without visualization. It is, nevertheless, vital to convey
data discoveries and extract key information from them. Interactive data visualization tools make all the
difference in this case.
The present pandemic is a recent and pertinent example. Data scientists can examine the data and
draw conclusions, but data visualization helps experts stay informed and calm amid such an
abundance of data.
o Data visualization enhances the effect of your messaging for your target audiences and displays data
analysis findings in the most persuasive way possible. It unites the organization's messaging systems across all
organizations and fields.
o Visualization allows you to understand large volumes of data at a glance and more efficiently.
It aids understanding of the data in order to assess its impact on the business, and it visually
communicates that knowledge to internal and external audiences. Decisions cannot be made in a vacuum:
decision-makers can use available data and insights to improve decision-making. Unbiased, error-free data
gives them access to the right kind of information, and visualization depicts that information and keeps it
relevant.
B. SWOC Analysis
STRENGTHS: The Covid-19 data visualization helps anyone on the internet who wants to know more
about the impact of the pandemic on health, the economy and the weather. The project offers a wide range of
visualization tools, which were made easy to build with the help of Streamlit.
The machine learning model also works as intended. While we cannot accurately predict the market's movement,
we can analyze past performance and give a predicted output.
WEAKNESS:
• Machine Learning module is not integrated with the Streamlit app.
• Realtime data is not available for temperature and AQI.
• Data is not stored into a database.
• Results are not stored into a database.
• On interaction with any of the widgets, the Streamlit app re-runs the entire script.
• To prevent the script from being re-run every time, caching was partially implemented for reading data but not
for the graphs, as it raised many errors.
OPPORTUNITIES: The project can grow further; the LSTM pipeline can be extended
to collect real-time data and perform the training automatically with the help of Streamlit. A separate database
can also be integrated with the project to store past data and visualizations. As the pandemic is
coming to an end, people can now view the entire dataset of how Covid-19 affected human life and industries.
CHALLENGES: Since the pandemic caused countries to go into complete lockdown, it was difficult
to get help for the project, as peers and professors had to be contacted by email. The integration with the
Streamlit app was a tedious task, as many graphs were displayed incorrectly or broke the interface
layout.
The machine learning model initially produced an unacceptably high loss, which was fixed by adding
new columns and cleaning the data further.
C. Future Scope
Features to be added:
Machine Learning module can be integrated with the Streamlit app.
Realtime data analysis for temperature and AQI.
Data can be stored into a database.
Results can be stored into a database.
Caching can be done for the graphs so that on interaction with any of the widgets, the entire script is
not re-run and only the part of the script which has changed will run.
Brian Mendes. “Covid-19 Data Analysis.” IOSR Journal of Computer Engineering (IOSR-JCE),
24(2), 2022, pp. 11-23.