
RAINFALL PREDICTION USING

MACHINE LEARNING TECHNIQUES


A Project Report submitted in partial fulfillment
of the requirements for the award of the Degree of
BACHELOR OF TECHNOLOGY
Submitted By

K.RENUKA DEVI (Regd.No: 20NE1A0572)

A.PADMA SREE (Regd.No: 20NE1A05B7)

N.BALA MARY SOWMYA (Regd.No:20NE1A05B4)

K.SANTOSH DUTT (Regd.No: 20NE1A0580)

Under The Esteemed Guidance Of


Mrs.SHAMMI SHAIK B.Tech, M.Tech
Asst. Professor

Department of Computer Science & Engineering


TIRUMALA ENGINEERING COLLEGE
(Approved by AICTE & Affiliated to JNTU, KAKINADA, Accredited by NAAC
& NBA) Jonnalagadda, Narasaraopet, GUNTUR (Dt.), A.P.
2020-2024
TIRUMALA ENGINEERING COLLEGE
(Approved by AICTE & Affiliated to JNTU KAKINADA, Accredited by
NAAC & NBA) Jonnalagadda, Narasaraopet-522601, Guntur (Dist) A.P

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

CERTIFICATE

This is to certify that the project report entitled “RAINFALL PREDICTION USING
MACHINE LEARNING TECHNIQUES” is the bonafide work carried out by
K.RENUKA DEVI (20NE1A0572), A.PADMA SREE (20NE1A05B7), N.BALA
MARY SOWMYA (20NE1A05B4) and K.SANTOSH DUTT (20NE1A0580) in partial
fulfillment of the requirements for the award of the “Bachelor of Technology” degree in
the Department of CSE from J.N.T.U. KAKINADA during the year 2023-2024, under our
guidance and supervision, and is worthy of acceptance as per the requirements of the university.

Project Guide Head of the Department


Mrs.SHAMMI SHAIK B.Tech, M.Tech Dr.N.Gopala Krishna M.Tech, Ph.D, MISTE

Project Coordinator External Examiner


Mr. S. Anil Kumar M. Tech
ACKNOWLEDGEMENT

We wish to express our thanks to the various people who are responsible for the
completion of the project. We are extremely thankful to our beloved chairman Sri. Bolla Brahma
Naidu and our secretary Sri. R. Satyanarayana, who took keen interest in our every effort throughout
this course. We owe our gratitude to our principal Dr. Y. V. Narayana M.E, Ph.D, FIETE for his
kind attention and valuable guidance throughout the course.

We express our deeply felt gratitude to our H.O.D Dr. N. Gopala Krishna, M.Tech, Ph.D,
MISTE and to Mr. S. Anil Kumar M.Tech, coordinator of the project, for extending their
encouragement. Their profound knowledge and willingness have been a constant source of
inspiration for us throughout the project work.

We wish to express our sincere and deep sense of gratitude to our guide, Mrs. Shammi Shaik
B.Tech, M.Tech, for her significant suggestions and help in every respect to accomplish the
project work. Her persistent encouragement, everlasting patience and keen interest in
discussions have benefited us to an extent that cannot be expressed in words. We also thank
our college management for providing excellent lab facilities for the completion of the
project within our campus.

We extend our sincere thanks to all other teaching and non-teaching staff of the Department of
CSE for their cooperation and encouragement during our B.Tech course.

We have no words to acknowledge the warm affection, constant inspiration and


encouragement that we received from our parents.

We affectionately acknowledge the encouragement received from our friends and all those
involved in giving valuable suggestions and clarifying our doubts, which really helped us in
successfully completing our project.

By

K.RENUKA DEVI (20NE1A0572)

A.PADMA SREE (20NE1A05B7)

N.BALA MARY SOWMYA (20NE1A05B4)

K.SANTOSH DUTT (20NE1A0580)


ABSTRACT

India is an agricultural country and its economy is largely based upon crop
productivity and rainfall. For analyzing crop productivity, rainfall prediction
is required and is necessary for all farmers. Rainfall prediction is the application of
science and technology to predict the state of the atmosphere. It is important to
determine rainfall accurately for effective use of water resources, crop
productivity and the pre-planning of water structures. Rainfall can be predicted
using different data mining techniques, which are used to estimate rainfall
numerically. This report focuses on some of the popular data mining
algorithms for rainfall prediction. Random Forest, the K-Nearest Neighbor
algorithm, Logistic Regression and Decision Tree are some of the algorithms that
have been used. From their comparison, we can analyze which method gives better
accuracy for rainfall prediction.
TABLE OF CONTENTS

ABSTRACT

1 INTRODUCTION
  1.1 Objective of the project
    1.1.1 Necessity
    1.1.2 Software development method
    1.1.3 Layout of the document
  1.2 Overview of the designed project

2 LITERATURE SURVEY
  2.1 Literature survey

3 AIM AND SCOPE OF THE PRESENT INVESTIGATION
  3.1 Project proposal
    3.1.1 Mission
    3.1.2 Goal
  3.2 Scope of the project
  3.3 Overview of the project
  3.4 Existing system
    3.4.1 Disadvantages
  3.5 Preparing the dataset
  3.6 Proposed system
    3.6.1 Exploratory data analysis of rainfall prediction
    3.6.2 Data cleaning
    3.6.3 Data collection
    3.6.4 Building the classification model
    3.6.5 Advantages
  3.7 Flow chart

4 EXPERIMENTAL OR MATERIALS AND METHODS; ALGORITHMS USED
  4.1 System study
    4.1.1 System requirement specifications
  4.2 System specifications
    4.2.1 Machine learning overview
  4.3 Steps to download & install Python
    4.3.1 IDE installation for Python
    4.3.2 Python file creation
  4.4 Python libraries needed
    4.4.1 NumPy library
    4.4.2 Pandas library
    4.4.3 Matplotlib library
    4.4.4 Seaborn library
  4.5 Modules
  4.6 UML diagrams
    4.6.1 Use case diagram
    4.6.2 Class diagram
    4.6.3 Activity diagram
    4.6.4 Sequence diagram
    4.6.5 Data flow diagram
  4.7 Module details
    4.7.1 Data pre-processing
    4.7.2 Data validation / cleaning / preparing process
    4.7.3 Exploratory data analysis of visualization
    4.7.4 Comparing algorithms with prediction in the form of the best accuracy result
    4.7.5 Algorithms and techniques

5 RESULTS AND DISCUSSION, PERFORMANCE ANALYSIS
  5.1 Performance analysis
  5.2 Discussion

6 SUMMARY AND CONCLUSION
  6.1 Summary
  6.2 Conclusion
  6.3 Future work

APPENDIX: SOURCE CODE WITH OUTPUT SCREENS

REFERENCES
CHAPTER-1
INTRODUCTION
1.INTRODUCTION

1.1 OBJECTIVE OF THE PROJECT:

The goal is to develop a machine learning model for rainfall prediction that can


potentially replace updatable supervised machine learning classification models, by comparing
supervised algorithms and reporting results in the form of the best accuracy.

1.1.1 Necessity:
This prediction helps in forecasting rainfall, which in turn helps in
improving crop productivity and in predicting the state of the atmosphere in agricultural
countries. These models are very easy to use. They work accurately and very smoothly in
different scenarios. They reduce effort and workload and increase efficiency in work. In
terms of time value, they are worthwhile.

1.1.2 Software development method:


In software development, different process models are
followed, such as the Waterfall model, Iterative model, Spiral model, V-model and Big
Bang model. We used the Waterfall model in this application, and we tried to use test-case
and use-case software approaches.

1.1.3 Layout of the document:


This documentation starts with a formal introduction. After the
introduction, the analysis and design of the project are described. The analysis and design of the
project have many parts, such as the project proposal, mission, goal, target audience and
environment. Use cases and test cases are in Chapter 2 and Chapter 3, respectively. Finally,
this documentation finishes with the results and conclusion.

1.2 OVERVIEW OF THE DESIGNED PROJECT:
At first, we take the dataset from our resource; then we perform data
pre-processing and visualization methods for cleaning and visualizing the dataset respectively.
We then apply the machine learning algorithms on the dataset and plot the confusion matrix
for each technique. Finally, we compare those models, draw the ROC curve for the best
performing model and also plot the classification report for that model.

CHAPTER-2
LITERATURE SURVEY

2.LITERATURE SURVEY

A literature review is a body of text that aims to review the critical points of current
knowledge on, and/or methodological approaches to, a particular topic. It is a secondary
source that discusses published information in a particular subject area, sometimes
within a certain time period. Its ultimate goal is to bring the reader up to date with the
current literature on a topic; it forms the basis for another goal, such as future research
that may be needed in the area, precedes a research proposal and may be just a simple
summary of sources. Usually, it has an organizational pattern and combines both summary
and synthesis.
A summary is a recap of the important information in a source, while a synthesis is a
re-organization, a reshuffling, of that information. It might give a new interpretation of old
material, combine new with old interpretations or trace the intellectual progression of the
field, including major debates. Depending on the situation, the literature review may
evaluate the sources and advise the reader on the most pertinent or relevant of them.
Review of Literature Survey
[1] Statistical investigation shows that the nature of ISMR (Indian summer monsoon
rainfall) cannot be precisely predicted by statistics or statistical data alone. Hence, this
study demonstrates the use of three techniques: fuzzification, entropy and artificial
neural networks (ANN). Based on this approach, a new method for forecasting ISMR
time series has been developed to address the nature of ISMR. The model has been
validated and supported by the study and research data, with statistical examination of
various data and comparative studies showing the performance of the proposed technique.

[2] The primary aim of this work is to demonstrate the benefits of machine learning
algorithms, as well as their greater degree of intelligence compared with current rainfall
forecasting techniques. The authors analyze and compare the current approach (a Markov
chain extended with rainfall prediction) with the forecasts of six well-known machine
learning methods: Genetic Programming, Support Vector Regression, Radial Basis
Function networks, M5 rules, M5 model trees and k-Nearest Neighbors. To enable a more
detailed assessment, they conducted a rainfall study using data from 42 cities.

[3] RF (Random Forest) was used to predict whether it would rain on a given day, while
SVM was used to predict the amount of rain on a rainy day. The capability of the hybrid
model was demonstrated by downscaling daily rainfall at three stations in the eastern part
of Malaysia. The hybrid models were also found to reproduce the full variability, the
number of consecutive wet days, the 95th percentile of monthly rainfall and the
distribution of the observed rainfall.

[4] In India, farming is the backbone of the economy, and rain is significant for crops.
These days, weather is a major issue, and weather forecasting gives information for
rainfall estimation and crop protection. Many strategies have been developed to recognize
rainfall, and machine learning algorithms are significant in predicting it.

[5] Weather forecasting predicts the state of the atmosphere at a future time. Weather is
determined using many kinds of variables, but of these only the most important features
are used in weather forecasts; choosing such features depends a great deal on the time
horizon chosen. Statistical modeling is used to incorporate forecasting, machine learning
applications, data exchange and feature analysis.

[6] In contrast with places where rainfall data is available, it takes a long time to build a
reliable water survey for places where it is not. Multilayer neural networks are intended
to be an excellent instrument for predicting the rainy season, and the rainfall sequence
here was confirmed using a multilayer perceptron neural network. Measurements such as
MSE (mean squared error) and NMSE (normalized mean squared error), and the
arrangement of datasets for short-term planning, are clear in the comparison of different
networks, such as AdaNaive and AdaSVM.

[7] In this paper, Artificial Neural Network (ANN) technology is used to develop a
weather forecasting method to identify rainfall using Indian rainfall data. A Feed
Forward Neural Network (FFNN) was trained using the Backpropagation Algorithm.
The performance of the two models is assessed based on iteration analysis, Mean Square
Error (MSE) and Magnitude of Relative Error (MRE). The report also gives future
directions for rainfall forecasting.

[8] This paper features rainfall analysis approaches using machine learning. The
main motivation behind this work is to protect against the effects of floods. The
program can be used by ordinary residents or the public authorities to anticipate
what will happen before a flood; the flood map can then provide them with the
necessary support, by relocating people or taking other important measures.

CHAPTER-3
AIM AND SCOPE OF
THE PRESENT
INVESTIGATION

3.AIM AND SCOPE OF THE PRESENT INVESTIGATION

3.1 PROJECT PROPOSAL:

The project proposal is a set of documents that describes


the project: how the software works, the steps to complete the entire project, and the
software requirements and analysis for the project. In our project proposal we cover all
these steps, along with risks, rewards and other project dependencies.

3.1.1 Mission:
To compare several machine learning models, such as logistic
regression, random forest, KNN and decision tree, plotting the confusion matrix for each
model after cleaning the dataset, so that we can easily find the best model among
them. After finding the best model, we draw the ROC curve and classification report
for that best-fit model to predict rainfall, which is very essential for farmers.

3.1.2 Goal:
The goal is to develop a machine learning model for predicting the
rainfall.

3.2 SCOPE OF THE PROJECT:


The scope of this report is to implement and investigate how different
supervised binary classification methods impact rainfall prediction. The model evaluation
techniques used in this project are limited to precision, sensitivity (recall) and F1-score.

3.3 OVERVIEW OF THE PROJECT:


The overview of the project is to provide the best machine learning
algorithm to the user, so that the user can directly know whether rainfall will occur
or not through this best model.

3.4 EXISTING SYSTEM:
Agriculture is the strength of the Indian economy, and farmers
depend upon the monsoon for their cultivation. Good crop productivity needs
good soil, fertilizer and a good climate, so weather forecasting is a very important
requirement for each farmer. Due to sudden changes in climate/weather, people
suffer economically and physically. Weather prediction is one of the
challenging problems in the current state of the art. The main motivation of this
work is to predict the weather using various data mining techniques, such as
classification, clustering, decision trees and neural networks. Weather-related
information is also called meteorological data. The most commonly used weather
parameters here are rainfall, wind speed, temperature and cold.

3.4.1 Disadvantages:
The biggest disadvantage of this approach is that it fails when it comes
to long-term estimation.

3.5 PREPARING THE DATASET:

This dataset, taken from Kaggle, contains 145,460 records of extracted features, with
RainTomorrow as the target column containing two values.

3.6 PROPOSED SYSTEM:

3.6.1 Exploratory Data Analysis of Rainfall Prediction


Multiple datasets from different sources would be combined to form a
generalized dataset, and then different machine learning algorithms would be
applied to extract patterns and to obtain results with maximum accuracy.

3.6.2 Data Cleaning


In this section of the report we will load the data, check it for cleanliness,
and then trim and clean the given dataset for analysis. We make sure to
document each step carefully and justify the cleaning decisions.

3.6.3 Data collection


The dataset collected for prediction is split into a Training
set and a Test set. Generally, we split the dataset into a Training set and a Test set.
The data model created using machine learning algorithms is fit
on the Training set, and based on the resulting accuracy, prediction
on the Test set is done.
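A minimal sketch of this split, assuming the Kaggle weatherAUS.csv file with a
RainTomorrow target column (the file name and split ratio are illustrative, not taken
from the report):

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the rainfall dataset (file name is an assumption).
df = pd.read_csv("weatherAUS.csv")

X = df.drop(columns=["RainTomorrow"])   # feature columns
y = df["RainTomorrow"]                  # target column: will it rain tomorrow?

# Hold out 20% of the records as the Test set (ratio is illustrative).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)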

3.6.4 Building the classification model


For predicting the rainfall, an ML prediction model is
effective because of the following reasons:
 It provides better results in classification problems.
 It is strong in preprocessing outliers, irrelevant variables, and a mix of
continuous, categorical and discrete variables.
 It produces an out-of-bag error estimate which has proven to be unbiased in
many tests, and it is relatively easy to tune.

Fig : Architecture of the proposed model (rainfall dataset → data processing →
training and test datasets → classification algorithm → ML model)

3.6.5 Advantages:
 Performance and accuracy of the algorithms can be calculated and
compared.
 Numerical Weather Prediction
 Statistical Weather Prediction
 Synoptic Weather Prediction

3.7 FLOW CHART:

Fig : FLOW CHART

CHAPTER-4
EXPERIMENTAL
OR MATERIALS
AND METHODS
ALGORITHMS USED

4.EXPERIMENTAL OR MATERIALS
AND METHODS ALGORITHMS USED

4.1 SYSTEM STUDY:


To develop this model, we use modern technologies, namely
Machine Learning using Python, for predicting rainfall.

4.1.1 System requirement specifications:


a) Hardware requirements:
 Processor : Intel
 RAM : 2GB
 Hard Disk : 80GB
b) Software requirements:
 OS : Windows
 Framework : Flask
 Technology : Machine Learning using Python
 Web Browser : Chrome, Microsoft Edge
 Code editor : Visual Studio Code, Google Colab,
Anaconda or Jupyter notebook.

4.2 SYSTEM SPECIFICATIONS:


4.2.1 Machine Learning Overview:
Machine learning is a field of study that looks at using
computational algorithms to turn empirical data into usable models. The
machine learning field grew out of the traditional statistics and artificial
intelligence communities. Through business processes, immense
amounts of data have been and will be collected. This has provided an
opportunity to re-invigorate statistical and computational approaches to
auto-generate useful models from data. Machine learning algorithms can be
used to (a) gather understanding of the phenomenon that produced the
data under study, (b) abstract that understanding into the form of a model,
(c) predict future values of the phenomenon using the above-generated model,
and (d) detect anomalous behavior exhibited by the phenomenon under observation.

4.3 STEPS TO DOWNLOAD & INSTALL PYTHON:


Download the latest version of the Python executable installer
(https://www.python.org/downloads/). Check the pip list, where pip is the
package installer for Python. Now upgrade pip and setuptools using the
commands:

pip install --upgrade pip
pip install --upgrade setuptools

4.3.1 IDE INSTALLATION FOR PYTHON


IDE stands for Integrated Development Environment. It
is a GUI (Graphical User Interface) where programmers write their
code and produce the final products. A widely used IDE for Python is
PyCharm, so download the latest version of PyCharm and install the
software (https://www.jetbrains.com/pycharm/download/).

4.3.2 PYTHON FILE CREATION


GO TO FILE MENU > CREATE > NEW > PYTHON FILE
> (name your Python file as “RAINFALLPREDICTION”)
> SAVE

4.4 PYTHON LIBRARIES NEEDED


There are many libraries in Python. Of these, we only use a few main libraries:
NUMPY LIBRARY
PANDAS LIBRARY
MATPLOTLIB LIBRARY
SEABORN LIBRARY

14
4.4.1 NUMPY LIBRARY
NumPy is an open-source numerical Python library. NumPy
contains multi-dimensional array and matrix data structures. It can be
utilized to perform a number of mathematical operations on arrays, such
as trigonometric, statistical and algebraic routines like mean, mode and
standard deviation.

Installation- (https://numpy.org/install/)

pip install numpy

Here we mainly use arrays, to find the mean and standard deviation.
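For instance, a minimal sketch of those NumPy calls (the values are illustrative):

import numpy as np

# Illustrative daily rainfall amounts in millimetres.
rainfall_mm = np.array([2.4, 0.0, 7.6, 1.2, 0.0])

print(rainfall_mm.mean())  # arithmetic mean of the array
print(rainfall_mm.std())   # standard deviation of the array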

4.4.2 PANDAS LIBRARY


Pandas is a high-level data manipulation tool developed by
Wes McKinney. It is built on the Numpy package and its key data structure
is called the DataFrame. DataFrames allow you to store and manipulate
tabular data in rows of observations and columns of variables. There are
several ways to create a DataFrame.
Installation- (https://pandas.pydata.org/getting_started.html)

pip install pandas

Here we use pandas for reading the CSV files, for grouping the data and for
cleaning the data using some operations.
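A minimal sketch of those pandas operations, assuming the Kaggle weatherAUS.csv
file (column names other than RainTomorrow are assumptions):

import pandas as pd

# Read the rainfall CSV file into a DataFrame.
df = pd.read_csv("weatherAUS.csv")

# Group the records and summarize rainfall per group.
print(df.groupby("Location")["Rainfall"].mean())

# A simple cleaning operation: drop rows where the target is missing.
df = df.dropna(subset=["RainTomorrow"])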

4.4.3 MATPLOTLIB LIBRARY


Matplotlib is a comprehensive library for creating static,
animated, and interactive visualizations in Python. Matplotlib makes easy
things easy and hard things possible. You can use interactive figures that
can zoom, pan and update.
Installation- (https://matplotlib.org/users/installing.html)

pip install matplotlib

15
Here we use pyplot mainly for plotting graphs.
matplotlib.pyplot is a collection of functions that make matplotlib work
like MATLAB. Each pyplot function makes some change to a figure: e.g.,
creates a figure, creates a plotting area in a figure, plots some lines in a
plotting area, decorates the plot with labels, etc.
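A minimal sketch of that pyplot pattern (the data values are illustrative):

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]   # illustrative categories
rain_mm = [120, 95, 60, 30]             # illustrative monthly totals

plt.plot(months, rain_mm)               # plot a line in the current figure
plt.xlabel("Month")                     # decorate the plot with labels
plt.ylabel("Rainfall (mm)")
plt.title("Monthly rainfall")
plt.show()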

4.4.4 SEABORN LIBRARY


The Seaborn package was developed based on the Matplotlib
library. It is used to create more attractive and informative statistical
graphics. While Seaborn is a different package, it can also be used to
improve the attractiveness of Matplotlib graphics.
Installation-(https://seaborn.pydata.org/installing.html)

pip install seaborn
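A minimal sketch of a typical Seaborn statistical graphic for this project (the file
name is an assumption):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("weatherAUS.csv")

# Count plot of the two target classes: an informative first look at balance.
sns.countplot(x="RainTomorrow", data=df)
plt.show()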

4.5 MODULES:
A modular design reduces complexity, facilitates change (a critical
aspect of software maintainability) and results in easier implementation by
encouraging parallel development of different parts of the system. Software with
effective modularity is easier to develop because functions may be
compartmentalized and interfaces are simplified. Software architecture embodies
modularity; that is, software is divided into separately named and addressable
components, called modules, that are integrated to satisfy problem requirements.
Modularity is the single attribute of software that allows a program to be
intellectually manageable. The five important criteria that enable us to evaluate a
design method with respect to its ability to define an effective modular design are:
modular decomposability, modular composability, modular understandability,
modular continuity and modular protection.

Fig : SYSTEM ARCHITECTURE

4.6 UML DIAGRAMS

4.6.1 Use Case Diagram

Fig : USE CASE DIAGRAM

Use case diagrams are considered for high-level requirement analysis of a
system. When the requirements of a system are analyzed, the functionalities are
captured in use cases. So, we can say that use cases are nothing but the system
functionalities written in an organized manner.

4.6.2 Class Diagram

Fig : CLASS DIAGRAM

A class diagram is basically a graphical representation of the static view of the system
and represents different aspects of the application, so a collection of class diagrams
represents the whole system. The name of the class diagram should be meaningful and
describe the aspect of the system. Each element and their relationships should be
identified in advance; the responsibility (attributes and methods) of each class should be
clearly identified; and for each class, a minimum number of properties should be
specified, because unnecessary properties will make the diagram complicated. Use notes
whenever required to describe some aspect of the diagram, and at the end of the
drawing it should be understandable to the developer/coder. Finally, before making the
final version, the diagram should be drawn on plain paper and reworked as many times
as possible to make it correct.

4.6.3 Activity Diagram

Fig : ACTIVITY DIAGRAM

An activity is a particular operation of the system. Activity diagrams are not only used
for visualizing the dynamic nature of a system, but they are also used to construct the
executable system by using forward and reverse engineering techniques. The only
thing missing in an activity diagram is the message part: it does not show any message
flow from one activity to another. An activity diagram is sometimes considered a
flow chart; although the diagram looks like a flow chart, it is not. It shows
different flows, such as parallel, branched, concurrent and single.

4.6.4 Sequence Diagram

Sequence diagrams model the flow of logic within your system in a visual manner,
enabling you both to document and validate your logic, and are commonly used for
both analysis and design purposes. Sequence diagrams are the most popular UML
artifact for dynamic modeling, which focuses on identifying the behavior within
your system. Other dynamic modeling techniques include activity diagramming,
communication diagramming, timing diagramming and interaction overview
diagramming. Sequence diagrams, along with class diagrams and physical data
models, are arguably the most important design-level models for modern business
application development.

4.6.5 Data Flow Diagrams

In software engineering, a DFD (data flow diagram) can be drawn to represent a
system at different levels of abstraction. Higher-level DFDs are partitioned into
lower levels, exposing more information and functional elements. Levels in a DFD
are numbered 0, 1, 2 or beyond. Here, we will see mainly three levels in the data
flow diagram: 0-level DFD, 1-level DFD and 2-level DFD. Data Flow Diagrams
(DFDs) are graphical representations of a system that illustrate the flow of data
within the system.

Fig : LEVEL-0 DATA FLOW DIAGRAM

Fig : LEVEL-1 DATA FLOW DIAGRAM

Fig : LEVEL-2 DATA FLOW DIAGRAM

4.7 MODULE DETAILS:
4.7.1 Data Pre-processing
Validation techniques in machine learning are used to get
the error rate of the Machine Learning (ML) model, which can be
considered close to the true error rate on the dataset's population. If the data volume
is large enough to be representative of the population, you may not need
the validation techniques. However, in real-world scenarios, we work with
samples of data that may not be a true representative of the population of
the given dataset. We find the missing values, duplicate values and the
description of each data type, whether it is a float variable or an integer. A sample
of data is used to provide an unbiased evaluation of a model fit on the
training dataset while tuning the model's hyperparameters.
The evaluation becomes more biased as skill on the validation dataset is
incorporated into the model configuration. The validation set is used to
evaluate a given model, but this is for frequent evaluation: machine
learning engineers use this data to fine-tune the model's hyperparameters.
Data collection, data analysis, and the process of addressing
data content, quality, and structure can add up to a time-consuming to-do
list. During the process of data identification, it helps to understand your
data and its properties; this knowledge will help you choose which
algorithm to use to build your model.
We perform a number of different data cleaning tasks using Python's Pandas
library, focusing on probably the biggest data cleaning task: missing
values. Handling them well lets us clean data more quickly and spend
less time cleaning and more time exploring and modeling.
Some missing values are just simple random mistakes. Other times,
there can be a deeper reason why data is missing. It is important to
understand these different types of missing data from a statistics point of
view. Here are some typical reasons why data is missing.

1. User forgot to fill in a field.
2. Data was lost while transferring manually from a legacy database.
3. There was a programming error.
4. Users chose not to fill out a field tied to their beliefs about how the
results would be used or interpreted.

Variable identification with uni-variate, bi-variate and multi-variate analysis proceeds through the following steps (a code sketch follows the list):


 import libraries for access and functional purpose
 Read the given dataset
 General Properties of Analyzing the given dataset
 Display the given dataset in the form of data frame
 Data Cleaning
 Show columns
 Removing unnecessary spaces in column names
 Exploratory data analysis
 Plot histograms for columns
 Plot boxplots for columns
 Excluding non numeric columns from correlation calculation
 Crosstab for RainTomorrow and RainToday
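A minimal sketch of the first few steps, assuming the Kaggle weatherAUS.csv file (the
file name is an assumption; RainTomorrow and RainToday come from the dataset):

import pandas as pd

# Read the given dataset and display its general properties.
df = pd.read_csv("weatherAUS.csv")
print(df.shape)
print(df.dtypes)
print(df.head())

# Data cleaning: show columns and remove unnecessary spaces in column names.
print(df.columns)
df.columns = df.columns.str.strip()

# Crosstab for RainTomorrow and RainToday.
print(pd.crosstab(df["RainTomorrow"], df["RainToday"]))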

4.7.2 Data Validation/ Cleaning/Preparing Process


Importing the library packages and loading the given dataset,
we analyze variable identification by data shape and data type, and
evaluate the missing values and duplicate values. A validation
dataset is a sample of data held back from training your model that is
used to give an estimate of model skill while tuning the model; there are
procedures that you can use to make the best use of validation and test
datasets when evaluating your models. Data cleaning/preparing includes
renaming the given dataset, dropping columns, etc., in order to
analyze the uni-variate, bi-variate and multi-variate processes.
The steps and techniques for data cleaning will vary from dataset to
dataset. The primary goal of data cleaning is to detect and remove errors
and anomalies to increase the value of data in analytics and decision
making.
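A minimal sketch of this validation and cleaning step (the file name, the renamed
column and the dropped column are illustrative assumptions):

import pandas as pd

df = pd.read_csv("weatherAUS.csv")

# Variable identification: shape, data types, missing and duplicate values.
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())      # missing values per column
print(df.duplicated().sum())  # number of duplicate rows

# Cleaning/preparing: rename a column and drop one that is not needed.
df = df.rename(columns={"RISK_MM": "RainfallAmount"})  # illustrative rename
df = df.drop(columns=["Date"])                         # illustrative drop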
MODULE DIAGRAM

GIVEN INPUT, EXPECTED OUTPUT


input: data
output: removing noisy data

4.7.3 Exploratory data analysis of visualization


Data visualization is an important skill in applied statistics
and machine learning. Statistics does indeed focus on quantitative
descriptions and estimations of data. Data visualization provides an
important suite of tools for gaining a qualitative understanding. This can
be helpful when exploring and getting to know a dataset, and can help
with identifying patterns, corrupt data, outliers, and much more. With a
little domain knowledge, data visualizations can be used to express and
demonstrate key relationships in plots and charts that are more visceral
to stakeholders than measures of association or significance. Data
visualization and exploratory data analysis are whole fields in themselves,
and we recommend a deeper dive into some of the books mentioned at
the end.
Sometimes data does not make sense until you can look at it in a visual form,
such as with charts and plots. Being able to quickly visualize data
samples is an important skill both in applied statistics and in applied
machine learning. Below are the main types of plots you will need to know
when visualizing data in Python, and how to use them to better understand
your own data (a code sketch follows the list).
 How to chart time series data with line plots and categorical
quantities with bar charts.
 How to summarize data distributions with histograms and box plots.
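A minimal sketch of these plot types (column names such as Rainfall and MinTemp
are assumptions about the dataset):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("weatherAUS.csv")

# Histograms and box plots summarize the distribution of numeric columns.
df[["Rainfall", "MinTemp"]].hist()
plt.show()
df[["Rainfall", "MinTemp"]].plot(kind="box")
plt.show()

# Bar chart of a categorical quantity: how often it rains tomorrow.
df["RainTomorrow"].value_counts().plot(kind="bar")
plt.show()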

MODULE DIAGRAM

GIVEN INPUT, EXPECTED OUTPUT


input: data
output: visualized data

4.7.4 Comparing algorithms with prediction in the form of the best accuracy
result

It is important to compare the performance of multiple
different machine learning algorithms consistently, and here we
create a test harness to compare multiple different machine learning
algorithms in Python with scikit-learn. You can use this test harness as a
template for your own machine learning problems and add more and
different algorithms to compare. Each model will have different
performance characteristics. Using resampling methods like cross-
validation, you can get an estimate of how accurate each model may be
on unseen data. You need to be able to use these estimates to choose one or
two of the best models from the suite of models that you have created. When
you have a new dataset, it is a good idea to visualize the data using different
techniques in order to look at the data from different perspectives. You
should use a number of different ways of looking at the estimated
accuracy of your machine learning algorithms in order to choose the one
or two to finalize. A way to do this is to use different visualization
methods to show the average accuracy, variance and other properties of
the distribution of model accuracies.
In the next section you will discover exactly how you can do that in
Python with scikit-learn. The key to a fair comparison of machine
learning algorithms is ensuring that each algorithm is evaluated in the
same way on the same data, and you can achieve this by forcing each
algorithm to be evaluated on a consistent test harness.
Pre-processing refers to the transformations applied to our data before
feeding it to the algorithm. Data pre-processing is a technique that is used
to convert the raw data into a clean dataset. To achieve better results
from the applied model in machine learning, the data has to be in a
proper format. Some machine learning models need information in a
specified format; for example, the Random Forest algorithm does not
support null values, so to execute the random forest algorithm, null
values have to be managed in the original raw dataset. Another aspect
is that the dataset should be formatted in such a way that more than one
machine learning or deep learning algorithm can be executed on the
given dataset.
In the example below these 4 different algorithms are compared:
 Logistic Regression
 Random Forest
 Decision Tree
 K Nearest Neighbor

The K-fold cross-validation procedure is used to evaluate each algorithm,
importantly configured with the same random seed to ensure that the
same splits of the training data are performed and that each algorithm is
evaluated in precisely the same way. Before comparing the algorithms, we
build the machine learning models using the scikit-learn libraries.
From this package we use preprocessing, the linear model with the
logistic regression method, cross-validation with the KFold method, the
ensemble module with the random forest method and the tree module with
the decision tree classifier. Additionally, we split the data into a train set
and a test set, and predict the result by comparing accuracy.
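A minimal sketch of such a harness, assuming a prepared numeric feature matrix X
and binary target y (variable names, fold count and seed are illustrative):

from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

models = [
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
    ("Random Forest", RandomForestClassifier()),
    ("Decision Tree", DecisionTreeClassifier()),
    ("K Nearest Neighbor", KNeighborsClassifier()),
]

# The same folds and the same seed for every algorithm keep the comparison fair.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
for name, model in models:
    scores = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
    print(name, "mean accuracy:", scores.mean())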

False Positives (FP): the actual class is no but the predicted class is yes.
E.g., the actual data says it did not rain the next day, but the model
predicted that it would rain.
False Negatives (FN): the actual class is yes but the predicted class is no.
E.g., the actual data says it rained the next day, but the model predicted
that it would not rain.
True Positives (TP): correctly predicted positive values, which means the
value of the actual class is yes and the value of the predicted class is also
yes. E.g., the actual data says it rained the next day, and the model
predicted the same thing.
True Negatives (TN): correctly predicted negative values, which means the
value of the actual class is no and the value of the predicted class is also
no. E.g., the actual data says it did not rain the next day, and the model
predicted the same thing.

Prediction result by accuracy:
The decision tree algorithm predicts a value by recursively splitting on the
independent predictors. For this task, the output of the algorithm is a
classified (categorical) variable. The higher-accuracy predicting result is
the decision tree model, found by comparing the best accuracy.

True Positive Rate (TPR) = TP / (TP + FN)

False Positive Rate (FPR) = FP / (FP + TN)

Accuracy: the proportion of the total number of predictions that are
correct; in other words, how often the model correctly predicts rainy
and non-rainy days overall.

Accuracy calculation:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Accuracy is the most intuitive performance measure: it is simply the
ratio of correctly predicted observations to the total observations. One may
think that, if we have high accuracy, then our model is best. Accuracy
is a great measure, but only when you have symmetric datasets where the
counts of false positives and false negatives are almost the same.

Precision: the proportion of positive predictions that are actually
correct. Precision = TP / (TP + FP)
Precision is the ratio of correctly predicted positive observations to the
total predicted positive observations. The question this metric answers
is: of all the days predicted as rainy, how many were actually rainy?
High precision relates to a low false positive rate. We got a precision
of 0.788, which is pretty good.

28
Recall: the proportion of positive observed values correctly predicted.
(The proportion of actual rainy days that the model correctly predicts.)
Recall = TP / (TP + FN)
Recall (sensitivity) is the ratio of correctly predicted positive
observations to all observations in the actual class "yes".

F1 Score is the weighted average of Precision and Recall; therefore, this
score takes both false positives and false negatives into account.
Intuitively it is not as easy to understand as accuracy, but F1 is usually
more useful than accuracy, especially if you have an uneven class
distribution. Accuracy works best if false positives and false negatives
have similar costs; if the costs of false positives and false negatives are
very different, it is better to look at both Precision and Recall.

General Formula:
F- Measure = 2TP / (2TP + FP + FN)
F1-Score Formula:
F1 Score = 2*(Recall * Precision) / (Recall + Precision)
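A minimal sketch of computing these metrics with scikit-learn, assuming true labels
y_test and model predictions y_pred with Yes/No values:

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Confusion matrix layout: rows are actual classes, columns are predicted.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred,
                                  labels=["No", "Yes"]).ravel()

print(accuracy_score(y_test, y_pred))                    # (TP+TN)/(TP+TN+FP+FN)
print(precision_score(y_test, y_pred, pos_label="Yes"))  # TP/(TP+FP)
print(recall_score(y_test, y_pred, pos_label="Yes"))     # TP/(TP+FN)
print(f1_score(y_test, y_pred, pos_label="Yes"))         # 2*(P*R)/(P+R)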

4.7.5 ALGORITHM AND TECHNIQUES

Algorithm Explanation

In machine learning and statistics, classification is a supervised learning
approach in which the computer program learns from the data input given
to it and then uses this learning to classify new observations. The dataset
may simply be bi-class (like identifying whether a person is male or
female, or whether an email is spam or non-spam) or it may be multi-class.
Some examples of classification problems are speech recognition,
handwriting recognition, biometric identification and document
classification. In supervised learning, algorithms learn from labeled
data. After understanding the data, the algorithm determines which label
should be given to new data, based on patterns, by associating the patterns
with the unlabeled new data.

Used Python Packages:


Scikit-learn:
 In Python, scikit-learn is a machine learning package which includes a lot

of ML algorithms.

 Here, we are using some of its modules like

train_test_split, DecisionTreeClassifier, LogisticRegression and
accuracy_score.
Numpy:
 It is a numeric Python module which provides fast maths

functions for calculations.

 It is used to read data into NumPy arrays and for manipulation purposes.

Pandas:
 Used to read and write different files.

 Data manipulation can be done easily with data frames.

Matplotlib:
 Data visualization is a useful way to help with identifying

patterns in a given dataset.

 Graphs and charts can be produced easily from data frames.

30
Logistic Regression:
It is a statistical method for analyzing a data set in which there are one or
more independent variables that determine an outcome. The outcome is
measured with a dichotomous variable (in which there are only two
possible outcomes). The goal of logistic regression is to find the best
fitting model to describe the relationship between the dichotomous
characteristic of interest (dependent variable = response or outcome
variable) and a set of independent (predictor or explanatory) variables.
Logistic regression is a Machine Learning classification algorithm that is
used to predict the probability of a categorical dependent variable. In
logistic regression, the dependent variable is a binary variable that
contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.).

In other words, the logistic regression model predicts P(Y=1) as a
function of X.

Logistic regression assumptions (a code sketch follows the list):

 Binary logistic regression requires the dependent variable to be binary.

 For a binary regression, the factor level 1 of the dependent
variable should represent the desired outcome.
 Only the meaningful variables should be included.
 The independent variables should be independent of each other;
that is, the model should have little or no multicollinearity.
 The independent variables are linearly related to the log odds.
 Logistic regression requires quite large sample sizes.
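A minimal sketch of fitting this model, assuming preprocessed numeric features X
and a binary RainTomorrow target y (the split ratio is illustrative):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the logistic model: it learns P(rain tomorrow) as a function of X.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))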

MODULE DIAGRAM

Fig : LOGISTIC REGRESSION

GIVEN INPUT, EXPECTED OUTPUT


input: data
output: getting accuracy

Random Forest Classifier:


Random forests or random decision forests are an ensemble learning
method for classification, regression and other tasks that operates by
constructing a multitude of decision trees at training time and outputting
the class that is the mode of the classes (classification) or the mean prediction
(regression) of the individual trees. Random decision forests correct for
decision trees' habit of overfitting to their training set. Random forest is
a type of supervised machine learning algorithm based on ensemble
learning. Ensemble learning is a type of learning where you join different
types of algorithms, or the same algorithm multiple times, to form a more
powerful prediction model. The random forest algorithm combines
multiple algorithms of the same type, i.e. multiple decision trees, resulting
in a forest of trees, hence the name "Random Forest". The random forest
algorithm can be used for both regression and classification tasks.
The following are the basic steps involved in performing the random
forest algorithm (a code sketch follows below):

 Pick N random records from the dataset.


 Build a decision tree based on these N records.
 Choose the number of trees you want in your algorithm and
repeat steps 1 and 2.
In the case of a regression problem, for a new record, each tree in the forest
predicts a value for Y (the output). The final value can be calculated by
taking the average of all the values predicted by all the trees in the forest. In
the case of a classification problem, each tree in the forest predicts the
category to which the new record belongs. Finally, the new record is
assigned to the category that wins the majority vote.
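A minimal sketch of those steps via scikit-learn, assuming prepared training and
test splits (the tree count is illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# n_estimators picks how many trees to build; each tree is trained on a
# bootstrap sample of records, mirroring steps 1-3 above.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# For classification, the forest assigns each record the majority-vote class.
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))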

Fig : RANDOM FOREST CLASSIFIER

GIVEN INPUT, EXPECTED OUTPUT


input: data
output: getting accuracy

K-Nearest Neighbor
K-Nearest Neighbor is one of the simplest machine learning algorithms,
based on the supervised learning technique. It assumes similarity
between the new case/data and the available cases, and puts the new case into
the category that is most similar to the available categories. It stores all
the available data and classifies a new data point based on similarity.
This means that when new data appears, it can be easily classified into a
well-suited category by using the K-NN algorithm. The K-NN algorithm can
be used for regression as well as for classification, but mostly it is used for
classification problems.
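A minimal sketch, assuming the same prepared splits (k = 5 is illustrative; scaling
is added because K-NN is distance-based):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Distance-based methods work best with features on a common scale.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Classify each new point by the majority class of its 5 nearest neighbours.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train_s, y_train)
print("accuracy:", model.score(X_test_s, y_test))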

Fig : K-NEAREST NEIGHBOR

GIVEN INPUT, EXPECTED OUTPUT


input: data
output: getting accuracy

Decision Tree :
It is one of the most powerful and popular algorithms. The decision-tree
algorithm falls under the category of supervised learning algorithms. It
works for both continuous as well as categorical output variables.
Assumptions of the decision tree:
 At the beginning, we consider the whole training set as the root.
 Attributes are assumed to be categorical for information
gain and continuous for the Gini index.
 On the basis of attribute values, records are distributed recursively.
 We use statistical methods for ordering attributes as the root or as
internal nodes.

A decision tree builds classification or regression models in the form of a
tree structure. It breaks down a dataset into smaller and smaller subsets
while, at the same time, an associated decision tree is incrementally
developed. A decision node has two or more branches, and a leaf node
represents a classification or decision. The topmost decision node in a
tree, which corresponds to the best predictor, is called the root node.
Decision trees can handle both categorical and numerical data. A decision
tree utilizes an if-then rule set which is mutually exclusive and exhaustive
for classification. The rules are learned sequentially, using the training data
one rule at a time. Each time a rule is learned, the tuples covered by the
rules are removed. This process is continued on the training set until a
termination condition is met. The tree is constructed in a top-down recursive
divide-and-conquer manner. All the attributes should be categorical;
otherwise, they should be discretized in advance. Attributes at the top of
the tree have more impact in the classification, and they are identified
using the information gain concept. A decision tree can easily be over-
fitted, generating too many branches, and may reflect anomalies due to
noise or outliers.
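A minimal sketch, using entropy (information gain) as the split criterion; the depth
limit is illustrative and guards against the overfitting noted above:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# criterion="entropy" orders attributes by information gain; max_depth
# limits branching so the tree does not overfit noise or outliers.
model = DecisionTreeClassifier(criterion="entropy", max_depth=8,
                               random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))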

Fig : DECISION TREE CLASSIFIER

GIVEN INPUT, EXPECTED OUTPUT


input: data
output: getting accuracy

CHAPTER-5
RESULTS AND
DISCUSSION,
PERFORMANCE
ANALYSIS

5.RESULTS AND DISCUSSION, PERFORMANCE ANALYSIS

5.1 PERFORMANCE ANALYSIS:


Website performance optimization, the focal point of
technologically superior website designs, is the primary factor dictating whether
users actually see the rainfall prediction. Unimpressive website performance kills
the experience when the torture of waiting for slow web pages to load frustrates
visitors into seeking alternatives; impatience is a digital virtue. The ML algorithms
used in our project will give the most accurate result to the user for rainfall
prediction.
Countless research papers and benchmarks prove that optimizing a
site's speed is one of the most affordable and highest-ROI investments.
Lightning-fast page load speed amplifies visitor engagement and
retention, and boosts sales. Instantaneous website response leads to higher
conversion rates, and every 1-second delay in page load decreases customer
satisfaction by 16 percent, page views by 11 percent and conversion rates by 7
percent, according to a recent Aberdeen Group study.

Algorithm                  Accuracy

Logistic Regression        0.8
Random Forest              1.0
K Nearest Neighbors        0.9
Decision Tree Classifier   1.0

Table : Algorithms Accuracy

5.2 DISCUSSION:
While discussions provide avenues for exploration and
discovery, leading a discussion can be anxiety-producing: discussions are, by their
nature, unpredictable, and require us as instructors to surrender a certain degree of
control over the flow of information. Fortunately, careful planning can help us
ensure that discussions are lively without being chaotic and exploratory without
losing focus. When planning a discussion, it is helpful to consider not only
cognitive, but also social/emotional, and physical factors that can either foster or
inhibit the productive exchange of ideas.

CHAPTER-6
SUMMARY AND
CONCLUSION

6.SUMMARY AND CONCLUSION

6.1 SUMMARY:

The objective of this project is to predict rainfall. This online


rainfall prediction system will help farmers by analyzing crop
productivity, pre-planning water structures and estimating whether it will be rainy or not.

6.2 CONCLUSION:
This project presented a machine learning approach for predicting rainfall
using four ML algorithms: Logistic Regression, Random Forest Classifier,
Decision Tree and KNN, comparing the four algorithms and choosing the best
approach for rainfall prediction. The project provides a study of
different types of methodologies used to forecast and predict rainfall, and of
issues that can arise when applying different approaches to forecasting
rainfall.
Because of the nonlinear relationships in rainfall datasets, the ability to learn from
the past makes machine learning a superior solution among the available approaches.
The future work of the project would be the improvement of the architecture for light
rain and other weather scenarios. We can also develop a model for small changes in
climate in the future. An algorithm for testing a daily-basis dataset instead of an
accumulated dataset could be of paramount importance for further research.

6.3 FUTURE WORK:


• Rainfall prediction connected to the cloud.
• Creating a pickle file and deployment using Flask.
• Developing a website using the pickle file.
• Optimizing the work to implement it in an Artificial Intelligence environment.

APPENDIX:

SOURCE CODE WITH OUTPUT SCREENS

1. Importing packages and loading data set

2. Data cleaning

3. Exploratory data analysis

4. Plotting histograms for columns

5. Bar plotting for columns

6. Excluding non-numeric columns from the correlation calculation

7.Logistic Regression

8.Random Forest Classifier

9.K-Nearest Neighbor

10.Decision Tree

11. Model evaluation
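A minimal sketch of this evaluation step, assuming a fitted best model and held-out
test data with Yes/No labels (variable names are illustrative):

import matplotlib.pyplot as plt
from sklearn.metrics import (classification_report, confusion_matrix,
                             RocCurveDisplay)

y_pred = model.predict(X_test)

# Confusion matrix and per-class precision/recall/F1 for the best model.
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# ROC curve for the best performing model.
RocCurveDisplay.from_estimator(model, X_test, y_test, pos_label="Yes")
plt.show()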

REFERENCES:

[1] Singh, P., 2018. Indian summer monsoon rainfall (ISMR) forecasting using time
series data: a fuzzy-entropy-neuro based expert system. Geoscience Frontiers, 9(4),
pp.1243-1257.

[2] Cramer, S., Kampouridis, M., Freitas, A. and Alexandridis, A., 2017. An extensive
evaluation of seven machine learning methods for rainfall prediction in weather
derivatives. Expert Systems with Applications, 85, pp.169-181.

[3] Pour, S., Shahid, S. and Chung, E., 2016. A hybrid model for statistical
downscaling of daily rainfall. Procedia Engineering, 154, pp.1424-1430.

[4] Manjunath N, Muralidhar B R, Sachin Kumar S, Vamshi K and Savitha P, 2021.
Rainfall prediction using machine learning and deep learning techniques. [online]
irjet.net. Available at: <https://www.irjet.net/archives/v8/i8/irjet-v8i850.pdf>
[Accessed 20 January 2022].

[5] Tanvi Patil and Kamal Shah, 2021. Weather forecasting analysis using linear
and logistic regression algorithm. [online] irjet.net. Available at:
<https://www.irjet.net/archives/v8/i6/irjet-v8i6454.pdf> [Accessed 20 January 2022].

[6] N. Divya Prabha and P. Radha, 2019. Prediction of weather and rainfall
forecasting using classification techniques. [online] irjet.net. Available at:
<https://www.irjet.net/archives/v6/i2/irjet-v6i2154.pdf> [Accessed 20 January 2022].

[7] Waghmare, D., 2021. Machine learning technique for rainfall prediction.
International Journal for Research in Applied Science and Engineering Technology,
9(VI), pp.594-600.

[8] Yashas Athreya, Vaishali B V, Sagar K and Srinidhi H R, 2021. Flood prediction
and rainfall analysis using machine learning. [online] irjet.net. Available at:
<https://www.irjet.net/archives/v8/i7/irjet-v8i7432.pdf> [Accessed 20 January 2022].
