Rainfall
CERTIFICATE
This is to certify that the project report entitled “RAINFALL PREDICTION USING
MACHINE LEARNING TECHNIQUES” is the bonafide work carried out by
K.RENUKA DEVI (20NE1A0572), A.PADMA SREE (20NE1A05B7), N.BALA
MARY SOWMYA (20NE1A05B4), K.SANTOSH DUTT (20NE1A0580) in partial
fulfillment of the requirements for the award of the “Bachelor of Technology” degree in
the Department of CSE from J.N.T.U. KAKINADA during the year 2023-2024, under our
guidance and supervision, and is worthy of acceptance per the requirements of the university.
We wish to express our thanks to the various personalities who are responsible for the
completion of the project. We are extremely thankful to our beloved chairman Sri. Bolla Brahma
Naidu and our secretary Sri. R. Satyanarayana, who took keen interest in our every effort throughout
this course. We owe our gratitude to our principal Dr. Y. V. Narayana, M.E, Ph.D, FIETE, for his
kind attention and valuable guidance throughout the course.
We express our deeply felt gratitude to our H.O.D Dr. N. Gopala Krishna, M.Tech, Ph.D,
MISTE, and Mr. S. Anil Kumar, M.Tech, coordinator of the project, for extending their
encouragement. Their profound knowledge and willingness have been a constant source of
inspiration for us throughout the project work.
We wish to express our sincere, deep sense of gratitude to Mrs. Shammi Shaik, B.Tech,
M.Tech, for significant suggestions and help in every respect to accomplish the project work. Her
persistent encouragement, everlasting patience and keen interest in discussions have benefited us
to an extent that cannot be spanned by words. We also thank our college management for providing
excellent lab facilities for completion of the project within our campus.
We extend our sincere thanks to all other teaching and non-teaching staff of the Department of CSE for
their cooperation and encouragement during our B.Tech course.
We affectionately acknowledge the encouragement received from our friends and those
involved in giving valuable suggestions and clarifying our doubts, which really helped us in
successfully completing our project.
India is an agricultural country and its economy is largely based upon crop
productivity and rainfall. For analyzing crop productivity, rainfall prediction
is required and necessary for all farmers. Rainfall prediction is the application of
science and technology to predict the state of the atmosphere. It is important to
determine the rainfall accurately for effective use of water resources, crop
productivity and pre-planning of water structures. Different data mining
techniques can be used to predict rainfall and to estimate
the rainfall numerically. This paper focuses on some of the popular data mining
algorithms for rainfall prediction. Random Forest, the K-Nearest Neighbor
algorithm, Logistic Regression and Decision Tree are some of the algorithms that have
been used. From this comparison, we can analyze which method gives better
accuracy for rainfall prediction.
TABLE OF CONTENTS
1 INTRODUCTION 1
1.1 Objective of the project 1
1.1.1 Necessity 1
1.1.2 Software Development Method 1
1.1.3 Layout of the document 1
1.2 Overview of the designed project 2
2 LITERATURE SURVEY 3
4.5 Modules 16
6.1 Summary 41
6.2 Conclusion 41
6.3 Future Work 41
APPENDIX
SOURCE CODE WITH OUTPUT SCREENS 42-51
REFERENCES 52
CHAPTER-1
INTRODUCTION
1.INTRODUCTION
1.1.1 Necessity:
This prediction helps in forecasting the rainfall, which helps in
improving crop productivity and in predicting the state of the atmosphere in agricultural
countries. These models are very easy to use. They can work accurately and very smoothly in
different scenarios. They reduce effort and workload and increase efficiency in work. In
terms of time value, they are worthwhile.
1.2 OVERVIEW OF THE DESIGNED PROJECT:
At first, we take the dataset from our source; then we perform data
pre-processing and visualization methods for cleaning and visualizing the dataset respectively.
We apply the machine learning algorithms on the dataset and then plot the confusion matrix
of each technique. At last, we compare those models, draw the ROC curve for the best
performing model, and also produce a classification report for that model.
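The workflow above can be sketched end to end. This is only a minimal illustration on a small synthetic dataset; the column names (Humidity3pm, Pressure9am, RainTomorrow) are assumptions standing in for the real rainfall CSV, not taken from the report's actual data.

```python
# Sketch of the project pipeline: load data, fit a model, then compute the
# confusion matrix and ROC-AUC. The data here is synthetic, for illustration.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.RandomState(0)
# Hypothetical features standing in for the rainfall dataset's columns.
df = pd.DataFrame({
    "Humidity3pm": rng.uniform(20, 100, 500),
    "Pressure9am": rng.uniform(990, 1040, 500),
})
# Synthetic target: rain is more likely when humidity is high.
df["RainTomorrow"] = (df["Humidity3pm"] + rng.normal(0, 10, 500) > 70).astype(int)

X = df[["Humidity3pm", "Pressure9am"]]
y = df["RainTomorrow"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
cm = confusion_matrix(y_test, model.predict(X_test))       # confusion matrix
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])  # ROC area
print(cm)
print(round(auc, 3))
```

In the actual project the same steps would be repeated for each of the four classifiers before comparing them.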
CHAPTER-2
LITERATURE SURVEY
2.LITERATURE SURVEY
A literature review is a body of text that aims to review the critical points of current
knowledge on, and/or methodological approaches to, a particular topic. It draws on secondary
sources and discusses published information in a particular subject area, sometimes restricted to
information within a certain time period. Its ultimate goal is
to bring the reader up to date with the current literature on a topic. It forms the basis for
another goal, such as future research that may be needed in the area, precedes a
research proposal, and may be just a simple summary of sources. Usually, it has an
organizational pattern and combines both summary and synthesis.
A summary is a recap of important information from the source, while a synthesis is a
re-organization or reshuffling of that information. It might give a new interpretation of old
material or combine new with old interpretations, or it might trace the intellectual
progression of the field, including major debates. Depending on the situation, the
literature review may evaluate the sources and advise the reader on the most pertinent
or relevant of them.
Review of Literature Survey
[1] Statistical analysis shows the nature of ISMR, which cannot be accurately
predicted by statistics or statistical data alone. Hence, this work demonstrates the use
of three techniques: fuzzy sets, entropy, and artificial neural networks (ANN). Based
on these, a new method for forecasting ISMR time series has been developed to
address the nature of ISMR. The model has been validated and supported by experimental
and research data, with statistical examination of different datasets and comparative
studies showing the performance of the proposed technique.
[2] The primary contribution of this work is to demonstrate the advantages of machine
learning algorithms, as well as the greater capability of intelligent systems compared with
current rainfall forecasting techniques. The authors analyze and compare the current practice
(a Markov chain extended with rainfall research) with the forecasts of six of the most notable
machine learning methods: genetic programming, support vector regression, radial basis
networks, M5 rules, and M5 model trees. To enable a more detailed assessment, they
conducted a rainfall study using data from 42 cities.
[3] RF was used to predict whether it would rain on a given day, while SVM
was used to predict the amount of rain on a rainy day. The capability of the hybrid model was
demonstrated by downscaling day-by-day rainfall at three sites in the
eastern part of Malaysia. The hybrid models were also found to replicate the full
variability, the number of consecutive days, the 95th percentile of the monthly rainfall, and the
distribution of the observed rainfall.
[4] In India, farming is the backbone of the economy, and rainfall is significant for crops.
These days, climate is a major issue. Weather forecasting provides information for rainfall
estimation and crop protection. Many strategies have been developed to recognize
rainfall, and machine learning algorithms are significant in predicting rainfall.
[5] Weather at a future point in time is determined using many kinds of variables; of
these, only the most significant features are used in weather forecasting.
Choosing such features depends a great deal upon the time horizon chosen. Statistical
modelling is used to integrate forecasting, machine learning applications, data exchange,
and characteristic analysis.
[6] Compared with other locations, where rainfall data is not available it takes
a long time to build up a reliable water survey. Developing multilayer neural
networks is intended to be an excellent instrument for predicting the
rainy season. The rainfall sequence was confirmed using a multilayer perceptron
neural network. Measurements such as MSE (mean squared error), NMSE (normalized
mean squared error), and the arrangement of datasets for short-term planning are clear
in the comparison of different networks, such as AdaNaive and AdaSVM.
[7] In this paper, Artificial Neural Network (ANN) technology is used to develop a
weather forecasting method to identify rainfall using Indian rainfall data.
To this end, a Feed Forward Neural Network (FFNN) was used with the
Backpropagation Algorithm. Performance of the two models is evaluated based on
iteration analysis, Mean Square Error (MSE) and Magnitude of Relative Error
(MRE). The report also provides a future guide to rainfall forecasting.
[8] This paper highlights rainfall analysis using machine learning. The
main motivation behind this program is to guard against the impacts of
floods. The program can be used by ordinary residents or the public authorities to
anticipate what will occur before a flood; the flood map can then
furnish them with the necessary support by relocating people or taking other relevant measures.
CHAPTER-3
AIM AND SCOPE OF
THE PRESENT
INVESTIGATION
3.AIM AND SCOPE OF THE PRESENT INVESTIGATION
3.1.1 Mission:
To compare several machine learning models such as logistic
regression, random forest, KNN and decision tree, plotting a confusion matrix for each
model after cleaning the dataset, so that we can easily find the best model among
them. After finding the best model, we draw the ROC curve and the classification report
for that best-fit model to predict rainfall, which is very essential for farmers.
3.1.2 Goal:
The goal is to develop a machine learning model for predicting
rainfall.
3.4 EXISTING SYSTEM:
Agriculture is the strength of the Indian economy. Farmers
depend only upon the monsoon for their cultivation. Good crop productivity needs
good soil, fertilizer and also a good climate. Weather forecasting is a very important
requirement for each farmer. Due to sudden changes in climate/weather,
people suffer economically and physically. Weather prediction is one of the
challenging problems in the current state of the art. The main motivation of this work is to predict the
weather using various data mining techniques, such as classification, clustering,
decision trees and also neural networks. Weather-related information is also called
meteorological data. Here the most commonly used weather parameters are
rainfall, wind speed, temperature and cold.
3.4.1 Disadvantages:
The biggest disadvantage of this approach is that it fails when it comes
to long-term estimation.
This dataset contains 145,460 records of features extracted from Kaggle, and it
has RainTomorrow as the target column, containing 2 values.
Document the steps carefully and justify the cleaning decisions.
Fig: Proposed system workflow (Rainfall Dataset, Training Algorithm, Classification ML Model)
3.6.5 Advantages:
Performance and accuracy of the algorithms can be calculated and
compared.
Numerical Weather Prediction
Statistical Weather Prediction
Synoptic Weather Prediction
CHAPTER-4
EXPERIMENTAL
OR MATERIALS
AND METHODS
ALGORITHMS USED
4.EXPERIMENTAL OR MATERIALS
AND METHODS ALGORITHMS USED
underlying phenomena in the form of a model, (c) predict future values of a
phenomenon using the above-generated model, and (d) detect anomalous behavior
exhibited by a phenomenon under observation.
4.4.1 NUMPY LIBRARY
NumPy is an open-source numerical Python library. NumPy
contains multi-dimensional array and matrix data structures. It can be
utilized to perform a number of mathematical operations on arrays, such
as trigonometric, statistical, and algebraic routines like mean, mode and
standard deviation.
Installation- (https://numpy.org/install/)
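A minimal sketch of the NumPy operations just described (array creation plus statistical and trigonometric routines); the rainfall values here are made up for illustration.

```python
# NumPy sketch: build an array, reshape it into a matrix, and apply
# statistical and trigonometric routines.
import numpy as np

rain_mm = np.array([0.0, 2.4, 11.2, 0.6, 5.8, 0.0])
matrix = rain_mm.reshape(2, 3)      # multi-dimensional array (2 x 3 matrix)

mean = np.mean(rain_mm)             # statistical routine: mean
std = np.std(rain_mm)               # statistical routine: standard deviation
sines = np.sin(rain_mm)             # trigonometric routine
print(matrix.shape, round(mean, 2), round(std, 2))
```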
Here we use pandas for reading the CSV files, for grouping the data, and for
cleaning the data using some operations.
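Those three pandas tasks (reading a CSV, grouping, cleaning) can be sketched as below; the CSV content and column names are made up, not the project's actual dataset.

```python
# pandas sketch: read a CSV, group by a column, and drop incomplete rows.
import io
import pandas as pd

csv_text = """Location,MinTemp,RainToday
Albury,13.4,No
Albury,7.4,Yes
Cobar,,No
"""
df = pd.read_csv(io.StringIO(csv_text))            # reading the CSV file
by_loc = df.groupby("Location")["MinTemp"].mean()  # grouping the data
cleaned = df.dropna(subset=["MinTemp"])            # cleaning: drop missing rows
print(len(df), len(cleaned))
```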
Here we use pyplot mainly for plotting graphs.
matplotlib.pyplot is a collection of functions that make matplotlib work
like MATLAB. Each pyplot function makes some change to a figure: e.g.,
creates a figure, creates a plotting area in a figure, plots some lines in a
plotting area, decorates the plot with labels, etc.
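A minimal pyplot sketch mirroring that description: each call changes the current figure (create a figure, plot a line, decorate it with labels). The daily rainfall numbers are invented for illustration.

```python
# pyplot sketch: each function call modifies the current figure.
import matplotlib
matplotlib.use("Agg")            # non-interactive backend, no window needed
import matplotlib.pyplot as plt

days = [1, 2, 3, 4]
rain_mm = [0.0, 5.2, 1.1, 7.8]

plt.figure()                     # creates a figure
plt.plot(days, rain_mm)          # plots a line in the plotting area
plt.xlabel("Day")                # decorates the plot with labels
plt.ylabel("Rainfall (mm)")
plt.title("Daily rainfall")
plt.savefig("rainfall.png")      # writes the chart to a file
```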
4.5 MODULES:
A modular design reduces complexity, facilitates change (a critical
aspect of software maintainability), and results in easier implementation by
encouraging parallel development of different parts of the system. Software with
effective modularity is easier to develop because functions may be
compartmentalized and interfaces are simplified. Software architecture embodies
modularity; that is, software is divided into separately named and addressable
components called modules that are integrated to satisfy problem requirements.
Modularity is the single attribute of software that allows a program to be
intellectually manageable. The five important criteria that enable us to evaluate a
design method with respect to its ability to define an effective modular design are:
modular decomposability, modular composability, modular understandability,
modular continuity and modular protection.
Fig : SYSTEM ARCHITECTURE
Use case diagrams are considered for high-level requirement analysis of a
system. When the requirements of a system are analyzed, the functionalities are
captured in use cases. So, it can be said that use cases are nothing but the system
functionalities written in an organized manner.
A class diagram is basically a graphical representation of the static view of the system
and represents different aspects of the application, so a collection of class diagrams
represents the whole system. The name of the class diagram should be meaningful and
describe the aspect of the system. Each element and their relationships should be
identified in advance. The responsibility (attributes and methods) of each class should be
clearly identified, and for each class the minimum number of properties should be specified,
because unnecessary properties will make the diagram complicated. Use notes
whenever required to describe some aspect of the diagram; at the end of the
drawing it should be understandable to the developer/coder. Finally, before making the
final version, the diagram should be drawn on plain paper and reworked as many times
as possible to make it correct.
4.6.3 Activity Diagram
An activity is a particular operation of the system. Activity diagrams are not only used for
visualizing the dynamic nature of a system, but are also used to construct the
executable system by using forward and reverse engineering techniques. The only
thing missing in an activity diagram is the message part: it does not show any message
flow from one activity to another. An activity diagram is sometimes considered to be a
flowchart. Although the diagram looks like a flowchart, it is not; it shows
different flows such as parallel, branched, concurrent and single.
Sequence diagrams model the flow of logic within your system in a visual manner,
enabling you both to document and validate your logic, and are commonly
used for both analysis and design purposes; they are among the most useful techniques
for modeling the behavior within your system. Data flow diagrams (DFDs)
are numbered 0, 1, 2 or beyond. Here, we will see mainly 3 levels in the data
flow diagram, which are: 0-level DFD, 1-level DFD, and 2-level DFD.
Fig : LEVEL-1 DATA FLOW DIAGRAM
4.7 MODULE DETAILS:
4.7.1 Data Pre-processing
Validation techniques in machine learning are used to get
the error rate of the Machine Learning (ML) model, which can be
considered close to the true error rate on the dataset. If the data volume
is large enough to be representative of the population, you may not need
the validation techniques. However, in real-world scenarios, we work with
samples of data that may not be truly representative of the population of the
given dataset. We find the missing values and duplicate values, and check the
data type of each column, i.e. whether it is a float variable or an integer. A sample
of data is used to provide an unbiased evaluation of a model fit on the
training dataset while tuning model hyperparameters.
The evaluation becomes more biased as skill on the validation dataset is
incorporated into the model configuration. The validation set is used to
evaluate a given model, but this is for frequent evaluation: machine
learning engineers use this data to fine-tune the model
hyperparameters. Data collection, data analysis, and the process of addressing
data content, quality, and structure can add up to a time-consuming to-do
list. The process of data identification helps to understand your
data and its properties; this knowledge will help you choose which
algorithm to use to build your model.
We perform a number of different data cleaning tasks using the Python Pandas
library; specifically, we focus on probably the biggest data cleaning task,
missing values, so that we are able to clean data more quickly, spend
less time cleaning, and spend more time exploring and modeling.
Some missing values are just simple random mistakes; other times,
there can be a deeper reason why data is missing. It’s important to
understand these different types of missing data from a statistics point of
view. Here are some typical reasons why data is missing:
1. User forgot to fill in a field.
2. Data was lost while transferring manually from a legacy database.
3. There was a programming error.
4. Users chose not to fill out a field tied to their beliefs about how the
results would be used or interpreted.
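The missing-value tasks discussed above can be sketched with pandas on a tiny made-up frame; the column names here are illustrative only, not the project's real columns.

```python
# Missing-value sketch: count NaNs, inspect dtypes, then impute or drop.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Rainfall":  [0.6, np.nan, 3.6, 1.2],
    "WindSpeed": [44.0, 24.0, np.nan, 30.0],
})
print(df.isnull().sum())          # count missing values per column
print(df.dtypes)                  # data type of each column (float here)
filled = df.fillna(df.mean())     # impute missing cells with the column mean
dropped = df.dropna()             # or simply drop incomplete rows
```

Whether to impute or drop depends on why the data is missing, as the reasons listed above suggest.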
We analyze the data using univariate, bivariate and multivariate processes.
The steps and techniques for data cleaning will vary from dataset to
dataset. The primary goal of data cleaning is to detect and remove errors
and anomalies to increase the value of data in analytics and decision
making.
MODULE DIAGRAM
Knowing how to visualize data is an important skill both in applied statistics and in applied
machine learning. You will discover the many types of plots that you will
need to know when visualizing data in Python, and how to use them to
better understand your own data:
How to chart time series data with line plots and categorical
quantities with bar charts.
How to summarize data distributions with histograms and box plots.
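The two distribution plots listed above (histogram and box plot) can be sketched as follows; the rainfall values are synthetic, drawn from a skewed distribution purely for illustration.

```python
# Distribution-plot sketch: histogram and box plot of synthetic rainfall data.
import matplotlib
matplotlib.use("Agg")                  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.RandomState(1)
rainfall = rng.gamma(shape=2.0, scale=3.0, size=300)   # skewed, rain-like values

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(rainfall, bins=20)            # histogram of the distribution
ax1.set_title("Histogram")
ax2.boxplot(rainfall)                  # box plot summarising the same data
ax2.set_title("Box plot")
fig.savefig("distribution.png")
```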
MODULE DIAGRAM
We use a range of techniques in order to look at the data from different perspectives. You
should use a number of different ways of looking at the estimated
accuracy of your machine learning algorithms in order to choose the one
or two to finalize. A way to do this is to use different visualization
methods to show the average accuracy, variance and other properties of
the distribution of model accuracies.
In the next section you will discover exactly how you can do that in
Python with scikit-learn. The key to a fair comparison of machine
learning algorithms is ensuring that each algorithm is evaluated in the
same way on the same data; this can be achieved by forcing each
algorithm to be evaluated on a consistent test harness.
Pre-processing refers to the transformations applied to our data before
feeding it to the algorithm. Data preprocessing is a technique that is used
to convert the raw data into a clean data set. To achieve better results
from the applied model in machine learning, the data has to be
in a proper format. Some machine learning models need
information in a specified format; for example, the Random Forest algorithm
does not support null values, so to execute the random forest
algorithm, null values have to be managed in the original raw data set.
Another aspect is that the data set should be formatted in such a way that
more than one machine learning or deep learning algorithm can be
executed on the given dataset.
In the example below these 4 different algorithms are compared:
Logistic Regression
Random Forest
Decision Tree
K Nearest Neighbor
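The comparison of these four algorithms can be sketched as below. This is a sketch only: it uses a synthetic stand-in dataset from `make_classification` rather than the rainfall data, and evaluates every model with the same KFold splits for a fair comparison.

```python
# Compare the four classifiers with identical cross-validation splits.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "K Nearest Neighbor": KNeighborsClassifier(),
}
# Same KFold (same random seed) for every model -> the same data splits.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = {name: cross_val_score(m, X, y, cv=cv).mean()
           for name, m in models.items()}
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```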
Each algorithm is, importantly, configured with the same random seed to ensure that the
same splits of the training data are performed and that each algorithm is
evaluated in precisely the same way. Before comparing the algorithms, we
build a machine learning model using the installed Scikit-Learn library.
With this package we perform preprocessing, fit a linear model with the
logistic regression method, cross-validate with the KFold method, fit an ensemble
with the random forest method and a tree with the decision tree classifier.
Additionally, we split the data into a train set and a test set, and predict the result by
comparing accuracy.
Prediction result by accuracy:
The Decision Tree algorithm uses the independent
predictors to predict a value; the output of the
algorithm needs to be categorical (classified) variable data. Comparing
the best accuracies, the decision tree model gives the higher prediction
accuracy.
Accuracy calculation:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Accuracy is the most intuitive performance measure; it is simply the
ratio of correctly predicted observations to the total observations. One may
think that, if we have high accuracy, then our model is the best. Accuracy
is a great measure, but only when you have symmetric datasets where the
numbers of false positives and false negatives are almost the same.
Recall: the proportion of positive observed values correctly predicted
(the proportion of actual defaulters that the model will correctly predict).
Recall = TP / (TP + FN)
Recall (sensitivity) is the ratio of correctly predicted positive
observations to all observations in the actual class (yes).
General Formula:
F- Measure = 2TP / (2TP + FP + FN)
F1-Score Formula:
F1 Score = 2*(Recall * Precision) / (Recall + Precision)
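The formulas above can be checked with a worked example; the confusion-matrix counts (TP=40, TN=45, FP=5, FN=10) are made up for illustration.

```python
# Worked example of the accuracy, recall, precision and F1 formulas above.
TP, TN, FP, FN = 40, 45, 5, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)           # (40+45)/100 = 0.85
recall = TP / (TP + FN)                              # 40/50 = 0.8
precision = TP / (TP + FP)                           # 40/45 ~= 0.889
f1 = 2 * (recall * precision) / (recall + precision)
# Equivalent general form: F-Measure = 2TP / (2TP + FP + FN)
f1_alt = 2 * TP / (2 * TP + FP + FN)
print(accuracy, recall, round(f1, 3))
```

The two F1 expressions are algebraically identical, which the example confirms numerically.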
Algorithm Explanation
Some examples of classification problems are: speech recognition,
handwriting recognition, biometric identification, document
classification, etc. In supervised learning, algorithms learn from labeled
data. After understanding the data, the algorithm determines which label
should be given to new data, based on patterns, by associating the patterns
with the unlabeled new data.
Scikit-learn: provides implementations of ML algorithms.
Pandas:
Used to read and write different files.
Matplotlib:
Data visualization is a useful way to help identify patterns in the data.
Logistic Regression:
It is a statistical method for analyzing a data set in which there are one or
more independent variables that determine an outcome. The outcome is
measured with a dichotomous variable (in which there are only two
possible outcomes). The goal of logistic regression is to find the best
fitting model to describe the relationship between the dichotomous
characteristic of interest (dependent variable = response or outcome
variable) and a set of independent (predictor or explanatory) variables.
Logistic regression is a Machine Learning classification algorithm that is
used to predict the probability of a categorical dependent variable. In
logistic regression, the dependent variable is a binary variable that
contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.).
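A minimal sketch of the idea above: a dichotomous (0/1) outcome predicted from an independent variable. The humidity-based example is invented; it is not the project's actual feature set.

```python
# Logistic regression sketch: predict a binary rain/no-rain outcome from
# a single independent variable (synthetic humidity values).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
humidity = rng.uniform(0, 100, 200).reshape(-1, 1)   # independent variable
rain = (humidity.ravel() > 60).astype(int)           # dichotomous outcome (0/1)

model = LogisticRegression().fit(humidity, rain)
proba = model.predict_proba([[90.0]])[0, 1]          # P(rain | humidity = 90)
print(round(proba, 3))
```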
MODULE DIAGRAM
Fig : LOGISTIC REGRESSION
Random forests correct for decision trees’ habit of overfitting to their training set. Random forest is
a type of supervised machine learning algorithm based on ensemble
learning. Ensemble learning is a type of learning where you join different
types of algorithms, or the same algorithm multiple times, to form a more
powerful prediction model. The random forest algorithm combines
multiple algorithms of the same type, i.e. multiple decision trees, resulting
in a forest of trees; hence the name "Random Forest". The random forest
algorithm can be used for both regression and classification tasks.
The following are the basic steps involved in performing the random
forest algorithm:
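Those steps (build many decision trees on random samples, then combine their votes) can be sketched as below on a synthetic stand-in dataset; the parameter values are illustrative.

```python
# Random forest sketch: an ensemble of decision trees that vote together.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# n_estimators = number of decision trees combined into the "forest".
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
print(len(forest.estimators_))    # 100 individual decision trees
print(forest.score(X, y))         # training accuracy of the ensemble
```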
Fig : RANDOM FOREST CLASSIFIER
K-Nearest Neighbor
K-Nearest Neighbor is one of the simplest machine learning algorithms,
based on the supervised learning technique. It assumes similarity
between the new case/data and the available cases, and puts the new case into
the category that is most similar to the available categories. It stores all
the available data and classifies a new data point based on similarity.
This means that when new data appears, it can be easily classified into a
well-suited category by using the K-NN algorithm. The K-NN algorithm can be
used for regression as well as for classification, but mostly it is used for
classification problems.
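A minimal K-NN sketch of that idea: a new point is assigned the category most common among its k nearest stored cases. The two-cluster toy data is invented for illustration.

```python
# K-NN sketch: classify a new point by majority vote of its 3 nearest cases.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Stored cases: two well-separated clusters, labelled 0 and 1.
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
label = knn.predict([[8.5, 8.5]])[0]   # nearest 3 neighbours are all class 1
print(label)
```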
Fig : K-NEAREST NEIGHBOR
Decision Tree:
It is one of the most powerful and popular algorithms. The decision-tree
algorithm falls under the category of supervised learning algorithms. It
works for both continuous and categorical output variables.
Assumptions of the decision tree:
At the beginning, we consider the whole training set as the root.
Attributes are assumed to be categorical for information
gain; for the Gini index, attributes are assumed to be continuous.
On the basis of attribute values, records are distributed recursively.
We use statistical methods for ordering attributes as the root or an
internal node.
A decision tree builds classification or regression models in the form of a
tree structure. It breaks down a data set into smaller and smaller subsets
while, at the same time, an associated decision tree is incrementally
developed. A decision node has two or more branches, and a leaf node
represents a classification or decision. The topmost decision node in a
tree, which corresponds to the best predictor, is called the root node. Decision
trees can handle both categorical and numerical data. The decision tree
utilizes an if-then rule set which is mutually exclusive and exhaustive for
classification. The rules are learned sequentially using the training data,
one at a time. Each time a rule is learned, the tuples covered by the rules
are removed. This process continues on the training set until a
termination condition is met. The tree is constructed in a top-down recursive divide-
and-conquer manner. All the attributes should be categorical; otherwise,
they should be discretized in advance. Attributes at the top of the tree
have more impact on the classification, and they are identified
using the information gain concept. A decision tree can easily be over-
fitted, generating too many branches, and may reflect anomalies due to
noise or outliers.
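The if-then rules described above can be made visible with a small sketch; the humidity data is synthetic, and limiting the depth is one way to curb the overfitting just mentioned.

```python
# Decision tree sketch: fit a shallow tree and print its learned if-then rules.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.RandomState(0)
humidity = rng.uniform(0, 100, 200).reshape(-1, 1)
rain = (humidity.ravel() > 60).astype(int)           # synthetic rain labels

# max_depth limits branching, guarding against the overfitting noted above.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(humidity, rain)
rules = export_text(tree, feature_names=["Humidity"])
print(rules)                                         # the learned if-then splits
```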
Fig : DECISION TREE CLASSIFIER
CHAPTER-5
RESULTS AND
DISCUSSION,
PERFORMANCE
ANALYSIS
5.RESULTS AND DISCUSSION, PERFORMANCE ANALYSIS
Algorithm Accuracy
5.2 DISCUSSION:
While discussions provide avenues for exploration and
discovery, leading a discussion can be anxiety-producing: discussions are, by their
nature, unpredictable, and require us as instructors to surrender a certain degree of
control over the flow of information. Fortunately, careful planning can help us
ensure that discussions are lively without being chaotic and exploratory without
losing focus. When planning a discussion, it is helpful to consider not only
cognitive, but also social/emotional, and physical factors that can either foster or
inhibit the productive exchange of ideas.
CHAPTER-6
SUMMARY AND
CONCLUSION
6.SUMMARY AND CONCLUSION
6.1 SUMMARY:
6.2 CONCLUSION:
This project presented a machine learning approach for predicting rainfall
using 4 ML algorithms: Logistic Regression, Random Forest Classifier,
Decision Tree and KNN, comparing the 4 algorithms and choosing the best
approach for rainfall prediction. This project provides a study of
different types of methodologies used to forecast and predict rainfall, and of
issues that could be found when applying different approaches to forecasting
rainfall.
Because of the nonlinear relationships in rainfall datasets and the ability to learn from
the past, machine learning makes a superior solution compared to the other approaches available.
The future work of the project would be the improvement of the architecture for light
and other weather scenarios. Also, a model can be developed for small changes in
climate in the future. An algorithm for testing a daily-basis dataset instead of an
accumulated dataset could be of paramount importance for further research.
APPENDIX:
2. Data cleaning
3. Exploratory data analysis
5. Bar plotting for columns
6. Exclude non-numeric columns from correlation calculation
7. Logistic Regression
9. K-Nearest Neighbor
10. Decision Tree
11. Model Evaluation
REFERENCES:
[1] Singh, P., 2018. Indian summer monsoon rainfall (ISMR) forecasting using time
series data: a fuzzy-entropy-neuro based expert system. Geoscience Frontiers, 9(4),
pp.1243-1257.
[2] Cramer, S., Kampouridis, M., Freitas, A. and Alexandridis, A., 2017. An extensive
evaluation of seven machine learning methods for rainfall prediction in weather
derivatives. Expert Systems with Applications, 85, pp.169-181.
[3] Pour, S., Shahid, S. and Chung, E., 2016. A hybrid model for statistical
downscaling of daily rainfall. Procedia Engineering, 154, pp.1424-1430.
[5] Tanvi Patil and Dr. Kamal Shah, 2021. Weather forecasting analysis using linear
and logistic regression algorithm. [online] irjet.net. Available at:
<https://www.irjet.net/archives/v8/i6/irjet-v8i6454.pdf> [Accessed 20 January 2022].
[6] N. Divya Prabha and P. Radha, 2019. Prediction of weather and rainfall
forecasting using classification techniques. [online] irjet.net. Available at:
<https://www.irjet.net/archives/v6/i2/irjet-v6i2154.pdf> [Accessed 20 January 2022].
[7] International Journal for Research in Applied Science and Engineering Technology,
9(VI), pp.594-600.
[8] Yashas Athreya, Vaishali B V, Sagar K and Srinidhi H R, 2021. Flood prediction and
rainfall analysis using machine learning. [online] irjet.net. Available at:
<https://www.irjet.net/archives/v8/i7/irjet-v8i7432.pdf> [Accessed 20 January 2022].