ML Lab Manual


Department of Computer Engineering Course: Laboratory Practice-III

Write-up (4) | Correctness of Program (4) | Documentation of Program (4) | Viva (4) | Timely Completion (4) | Total (20) | Dated Sign of Subject Teacher

Expected Date of Completion:.............................................................................Actual Date of Completion:

—--------------------------------------------------------------------------------------
Group B
Assignment No : 7

—---------------------------------------------------------------------------------------

Title of the Assignment: Predict the price of the Uber ride from a given pickup
point to the agreed drop-off location. Perform following tasks:
1. Pre-process the dataset.
2. Identify outliers.
3. Check the correlation.
4. Implement linear regression and random forest regression models.
5. Evaluate the models and compare their respective scores like R2, RMSE, etc.

Dataset Description: The project is about the world's largest taxi company, Uber Inc. In this
project, we are looking to predict the fare for future transactions. Uber serves lakhs of
customers daily, so it becomes very important to manage the data properly in order to come up
with new business ideas and get the best results. In particular, it is important to estimate
the fare prices accurately.

Link for Dataset: https://www.kaggle.com/datasets/yasserh/uber-fares-dataset

Objective of the Assignment:


Students should be able to preprocess the dataset, identify outliers, check correlation, and
implement linear regression and random forest regression models, then evaluate them with
scores such as R2 and RMSE.

Prerequisite:
1. Basic knowledge of Python
2. Concept of preprocessing data
3. Basic knowledge of Data Science and Big Data Analytics.


Contents of the Theory:

1. Data Preprocessing
2. Linear regression
3. Random forest regression models
4. Box Plot
5. Outliers
6. Haversine
7. Matplotlib
8. Mean Squared Error

Data Preprocessing:

Data preprocessing is the process of preparing raw data and making it suitable for a machine
learning model. It is the first and crucial step while creating a machine learning model.

When creating a machine learning project, we do not always come across clean and formatted
data. Before doing any operation on data, it is necessary to clean it and put it in a formatted
way; for this, we use the data preprocessing task.

Why do we need Data Preprocessing?

Real-world data generally contains noise and missing values, and may be in an unusable format
that cannot be used directly by machine learning models. Data preprocessing is the required
task of cleaning the data and making it suitable for a machine learning model, which also
increases the accuracy and efficiency of the model.

It involves the steps below; a minimal Python sketch for this assignment follows the list:

● Getting the dataset

● Importing libraries

● Importing datasets

● Finding Missing Data

● Encoding Categorical Data

● Splitting dataset into training and test set

● Feature scaling
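Below is that sketch. It is a hedged illustration only: the file name uber.csv and the column
names (fare_amount, pickup_datetime, passenger_count, pickup/dropoff coordinates) are
assumptions based on the Kaggle dataset description, not code taken from this manual.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("uber.csv")                                   # assumed file name
df = df.drop(columns=["Unnamed: 0", "key"], errors="ignore")   # drop index-like columns if present
df["pickup_datetime"] = pd.to_datetime(df["pickup_datetime"], errors="coerce")
df = df.dropna()                                               # handle missing values

# Simple feature engineering from the timestamp
df["hour"] = df["pickup_datetime"].dt.hour
df["weekday"] = df["pickup_datetime"].dt.weekday

features = ["passenger_count", "hour", "weekday",
            "pickup_longitude", "pickup_latitude",
            "dropoff_longitude", "dropoff_latitude"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["fare_amount"], test_size=0.2, random_state=42)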

Linear Regression:

Linear regression is one of the easiest and most popular machine learning algorithms. It is a
statistical method used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.

The linear regression algorithm shows a linear relationship between a dependent (y) variable
and one or more independent (x) variables, hence the name linear regression. Since linear
regression shows a linear relationship, it finds how the value of the dependent variable
changes according to the value of the independent variable.

The linear regression model provides a sloped straight line representing the relationship
between the variables.
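A minimal sketch of fitting a linear regression with scikit-learn is shown below; X_train,
X_test, y_train and y_test are assumed to come from a split such as the one sketched in the
preprocessing section above.

from sklearn.linear_model import LinearRegression

# Fit y = w0 + w1*x1 + ... + wn*xn on the training split
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Predict fares for unseen rides
y_pred_lr = lin_reg.predict(X_test)
print(lin_reg.coef_, lin_reg.intercept_)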

Random Forest Regression Models:

Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both classification and regression problems in ML. It is based on
the concept of ensemble learning, which is a process of combining multiple classifiers to solve
a complex problem and to improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision trees
on various subsets of the given dataset and takes the average to improve the predictive
accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the
prediction from each tree and, based on the majority vote of predictions, predicts the final
output.

A greater number of trees in the forest leads to higher accuracy and prevents the problem of
overfitting.
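A matching sketch with scikit-learn's RandomForestRegressor follows; the hyperparameter values
are illustrative assumptions, and the training split is the one assumed above.

from sklearn.ensemble import RandomForestRegressor

# An ensemble of 100 trees; each tree sees a bootstrap sample of the training data
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)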

Boxplot:

Boxplots are a measure of how well data is distributed across a data set. They divide the data
set into three quartiles. The graph represents the minimum, maximum, median, first quartile,
and third quartile of the data set. A boxplot is also useful for comparing the distribution of
data across data sets by drawing a boxplot for each of them.

R provides a boxplot() function to create a boxplot. The syntax of the boxplot() function is:

boxplot(x, data, notch, varwidth, names, main)

Here,

S.No.  Parameter  Description
1      x          It is a vector or a formula.
2      data       It is the data frame.
3      notch      It is a logical value, set to true to draw a notch.
4      varwidth   It is also a logical value, set to true to draw the width of the box proportional to the sample size.
5      names      It is the group of labels that will be printed under each boxplot.
6      main       It is used to give a title to the graph.
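Since the assignment code for this lab is in Python, an equivalent boxplot can be drawn with
pandas/matplotlib; the sketch below assumes the df dataframe and fare_amount column from the
preprocessing sketch above.

import matplotlib.pyplot as plt

# Boxplot of the fare column; points beyond the whiskers are candidate outliers
df.boxplot(column="fare_amount")
plt.title("Fare amount")
plt.show()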

Outliers:
As the name suggests, "outliers" refer to the data points that exist outside of what is to beexpected. The
major thing about the outliers is what you do with them. If you are going toanalyze any task to analyze
data sets, you will always have some assumptions based onhow this data is generated. If you nd some
data points that are likely to contain some form of error, then these are de nitely outliers, and
depending on the context, you want toovercome those errors. The data mining process involves the
analysis and prediction of datathat the data holds. In 1969, Grubbs introduced the rst de nition of
outliers.
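One common way to flag such points, consistent with the quartiles shown by a boxplot, is the
IQR rule; the sketch below again assumes the df dataframe and fare_amount column used earlier.

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1 = df["fare_amount"].quantile(0.25)
q3 = df["fare_amount"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df = df[(df["fare_amount"] >= lower) & (df["fare_amount"] <= upper)]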

Global Outliers

Global outliers are also called point outliers. Global outliers are taken as the simplest form of outliers.
When data points deviate from all the rest of the data points in a given data set, it is known as the global
outlier. In most cases, all the outlier detection procedures are targeted to determine the global outliers.
The green data point is the global outlier.


Collective Outliers

In a given set of data, when a group of data points deviates from the rest of the data set, it
is called a collective outlier. Here, the individual data objects may not be outliers, but
considered as a whole they may behave as outliers. To identify the types of different outliers,
you need background information about the relationship between the behaviour shown by different
data objects. For example, in an intrusion detection system, a packet sent from one system to
another is taken as normal behaviour; however, if this happens from various computers
simultaneously (as in a denial-of-service attack), it is considered abnormal behaviour, and as
a whole the corresponding data points are called collective outliers. In the original figure,
the green data points as a whole represent the collective outlier.

Contextual Outliers

As the name suggests, "Contextual" means this outlier introduced within a context. For example, in the
speech recognition technique, the single background noise. Contextual outliers are also known as
Conditional outliers. These types of outliers happen if a data object deviates from the other data points
because of any speci c condition in a given data set. As we know, there are two types of attributes of
objects of data: contextual attributes and behavioral attributes. Contextual outlier analysis enables the
users to examine outliers in different contexts and conditions, which can be useful in various applications.
For example, A temperature reading of 45 degrees Celsius may behave as an outlier in a rainy season.
Still, it will behave like a normal data point in the context of a summer season. In thegiven diagram, a
green dot representing the low-temperature value in June is a contextual outlier since the same value in
December is not an outlier.

Haversine:

The Haversine formula calculates the shortest distance between two points on a
sphere using their latitudes and longitudes measured along the surface. It is
important for use in navigation.
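A small Python implementation of the formula, which can be used to turn the pickup/dropoff
coordinates into a trip-distance feature, is sketched below; it is an illustration, not code
taken from this manual.

import numpy as np

def haversine(lon1, lat1, lon2, lat2, radius_km=6371.0):
    """Great-circle distance in kilometres between two (longitude, latitude) points."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))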

Matplotlib:

Matplotlib is an amazing visualization library in Python for 2D plots of arrays.


Matplotlib is a multi-platform data visualization library built on NumPy arrays and
designed to work with the broader SciPy stack. It was introduced by John Hunter in the
year 2002.

One of the greatest benefits of visualization is that it allows us visual access to
huge amounts of data in easily digestible visuals. Matplotlib consists of several plots
like line, bar, scatter, histogram etc.

Mean Squared Error:-

The Mean Squared Error (MSE) or Mean Squared Deviation (MSD) of an estimator measures the
average of the squared errors, i.e. the average squared difference between the estimated values
and the true values. It is a risk function corresponding to the expected value of the squared
error loss. It is always non-negative, and values close to zero are better. The MSE is the
second moment of the error (about the origin) and thus incorporates both the variance of the
estimator and its bias.
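For n predictions, MSE = (1/n) * sum((y_i - y_hat_i)^2), and RMSE is its square root. The
sketch below shows how this assignment's metrics could be computed with scikit-learn for the
two models, assuming the y_pred_lr and y_pred_rf arrays from the earlier sketches.

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

for name, y_pred in [("Linear regression", y_pred_lr), ("Random forest", y_pred_rf)]:
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)              # RMSE is the square root of MSE
    r2 = r2_score(y_test, y_pred)
    print(f"{name}: RMSE={rmse:.3f}, R2={r2:.3f}")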


Code and Output:


Conclusion:

In this way, we have explored the concept of correlation and implemented linear regression and
random forest regression models.

Assignment Questions:

1. What is data preprocessing?


2. Define Outliers?
3. What is Linear Regression?
4. What is Random Forest Algorithm?
5. Explain: pandas, numpy?


Write-up (4) | Correctness of Program (4) | Documentation of Program (4) | Viva (4) | Timely Completion (4) | Total (20) | Dated Sign of Subject Teacher

Expected Date of Completion:.............................................................................Actual Date of Completion:

—--------------------------------------------------------------------------------------
Group B
Assignment No : 8

—---------------------------------------------------------------------------------------

Title of the Assignment: Classify the email using the binary classification method.
Email Spam detection has two states:
a) Normal State – Not Spam,
b) Abnormal State – Spam.
Use K-Nearest Neighbors and Support Vector Machine for classification. Analyze
their performance.

Dataset Description: The csv file contains 5172 rows, one row for each email. There are 3002
columns. The first column indicates the email name; the name has been set with numbers rather
than recipients' names to protect privacy. The last column has the labels for prediction:
1 for spam, 0 for not spam. The remaining 3000 columns are the 3000 most common words in all
the emails, after excluding non-alphabetical characters/words. For each row, the count of each
word (column) in that email (row) is stored in the respective cell. Thus, information regarding
all 5172 emails is stored in a compact dataframe rather than as separate text files.

Link: https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv

Objective of the Assignment:

Students should be able to classify emails using binary classification and implement email
spam detection using the K-Nearest Neighbors and Support Vector Machine algorithms, and analyze
their performance.

Prerequisite:
1. Basic knowledge of Python
2. Concept of K-Nearest Neighbors and Support Vector Machine for classification.


Contents of the Theory:

1. Data Preprocessing
2. Binary Classification
3. K-Nearest Neighbours
4. Support Vector Machine
5. Train, Test and Split Procedure

Data Preprocessing:

Data preprocessing is the process of preparing raw data and making it suitable for a machine
learning model. It is the first and crucial step while creating a machine learning model.

When creating a machine learning project, we do not always come across clean and formatted
data. Before doing any operation on data, it is necessary to clean it and put it in a formatted
way; for this, we use the data preprocessing task.

Why do we need Data Preprocessing?

Real-world data generally contains noise and missing values, and may be in an unusable format
that cannot be used directly by machine learning models. Data preprocessing is the required
task of cleaning the data and making it suitable for a machine learning model, which also
increases the accuracy and efficiency of the model.

It involves the steps below:

● Getting the dataset

● Importing libraries

● Importing datasets

● Finding Missing Data

● Encoding Categorical Data

● Splitting dataset into training and test set

● Feature scaling
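A hedged sketch of the classification pipeline this assignment asks for is shown below. The
file name emails.csv and the "Email No." / "Prediction" column names are assumptions based on
the dataset description above and should be checked against the actual download.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

df = pd.read_csv("emails.csv")                    # assumed file name
X = df.drop(columns=["Email No.", "Prediction"])  # word-count features
y = df["Prediction"]                              # 1 = spam, 0 = not spam

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Compare KNN and SVM on the same split
for name, model in [("KNN", KNeighborsClassifier(n_neighbors=5)), ("SVM", SVC(kernel="rbf"))]:
    model.fit(X_train, y_train)
    print(name, "accuracy:", accuracy_score(y_test, model.predict(X_test)))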


Code and Output:


Write-up (4) | Correctness of Program (4) | Documentation of Program (4) | Viva (4) | Timely Completion (4) | Total (20) | Dated Sign of Subject Teacher

Expected Date of Completion:.............................................................................Actual Date of Completion:

—--------------------------------------------------------------------------------------
Group B
Assignment No : 9

—---------------------------------------------------------------------------------------

Title of the Assignment: Given a bank customer, build a neural network-based


classifier that can determine whether they will leave or not in the next 6 months

Dataset Description: The case study is from an open-source dataset from Kaggle.
The dataset contains 10,000 sample points with 14 distinct features such as
Customer ID, Credit Score, Geography, Gender, Age, Tenure, Balance, etc.

Link for Dataset: https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling

Perform the following steps:


1. Read the dataset.
2. Distinguish the feature and target set and divide the data set into
training and test sets.
3. Normalize the train and test data.
4. Initialize and build the model. Identify the points of improvement and
implement the same.
5. Print the accuracy score and confusion matrix (5 points).

Objective of the Assignment:


Students should be able to distinguish the feature and target sets, divide the data set into
training and test sets, normalize them, and build the model on that basis.

Prerequisite:
1. Basic knowledge of Python
2. Concept of Confusion Matrix

Contents of the Theory:

1. Artificial Neural Network


2. Keras
3. TensorFlow
4. Normalization
5. Confusion Matrix

Artificial Neural Network:

The term "Arti cial Neural Network" is derived from Biological neural networks that develop the
structure of a human brain. Similar to the human brain that has neurons interconnected to one another,
arti cial neural networks also have neurons that are interconnected to one another in various layers of the
networks. These neurons are known as nodes.

The given gure illustrates the typical diagram of Biological Neural Network.

The typical artificial neural network looks something like the given figure.

Dendrites from the biological neural network represent inputs in artificial neural networks,
the cell nucleus represents nodes, synapses represent weights, and the axon represents the
output.

Relationship between biological neural network and artificial neural network:

Biological Neural Network    Artificial Neural Network
Dendrites                    Inputs
Cell nucleus                 Nodes
Synapse                      Weights
Axon                         Output

An artificial neural network, in the field of artificial intelligence, attempts to mimic the
network of neurons that makes up a human brain, so that computers have an option to understand
things and make decisions in a human-like manner. The artificial neural network is designed by
programming computers to behave simply like interconnected brain cells.

The human brain contains on the order of 86 billion neurons, each with somewhere between 1,000
and 100,000 connection points. In the human brain, data is stored in a distributed manner, and
we can extract more than one piece of this data in parallel from our memory when necessary. We
can say that the human brain is made up of incredibly amazing parallel processors.

We can understand the artificial neural network with the example of a digital logic gate that
takes an input and gives an output. Consider an "OR" gate, which takes two inputs: if one or
both inputs are "On," then we get "On" as output; if both inputs are "Off," then we get "Off"
as output. Here the output depends only on the input. Our brain does not perform the same task:
the output-to-input relationship keeps changing because the neurons in our brain are "learning."

The architecture of an artificial neural network:

To understand the architecture of an artificial neural network, we have to understand what a
neural network consists of. A neural network consists of a large number of artificial neurons,
termed units, arranged in a sequence of layers. Let us look at the various types of layers
available in an artificial neural network.

An artificial neural network primarily consists of three layers:

Input Layer:

As the name suggests, it accepts inputs in several different formats provided by the programmer.

Hidden Layer:

The hidden layer sits between the input and output layers. It performs all the calculations to
find hidden features and patterns.

Output Layer:

The input goes through a series of transformations using the hidden layer, which finally
results in the output that is conveyed using this layer.

The artificial neural network takes the inputs, computes the weighted sum of the inputs, and
adds a bias. This computation is represented in the form of a transfer function.

The weighted total is passed as an input to an activation function to produce the output.
Activation functions choose whether a node should fire or not; only those that fire make it to
the output layer. There are distinctive activation functions available that can be applied
depending on the sort of task we are performing.

Keras:

Keras is an open-source, high-level neural network library written in Python that is capable of
running on Theano, TensorFlow, or CNTK. It was developed by Francois Chollet, a Google
engineer. It is user-friendly, extensible, and modular, facilitating faster experimentation
with deep neural networks. It supports not only Convolutional Networks and Recurrent Networks
individually but also their combination.

Keras does not handle low-level computations itself; it relies on a backend library. The
backend library acts as a high-level API wrapper for the low-level API, which lets Keras run on
TensorFlow, CNTK, or Theano.

Keras had over 4,800 contributors at launch, and its user base has since grown to roughly
250,000 developers, roughly doubling every year. Big companies like Microsoft, Google, NVIDIA,
and Amazon have actively contributed to its development, and it is used by popular firms like
Netflix, Uber, Google, Expedia, etc.
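As a hedged illustration for this assignment, a small Keras Sequential model for the churn
classifier might look like the sketch below. The layer sizes and training settings are
assumptions, and X_train / y_train are assumed to be the normalized feature matrix and the
churn labels prepared earlier.

from tensorflow import keras
from tensorflow.keras import layers

# A small feed-forward network: input -> two hidden layers -> sigmoid output
model = keras.Sequential([
    layers.Dense(16, activation="relu", input_shape=(X_train.shape[1],)),
    layers.Dense(8, activation="relu"),
    layers.Dense(1, activation="sigmoid"),    # probability that the customer leaves
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=20, batch_size=32, validation_split=0.1)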


TensorFlow:

TensorFlow is a Google product and one of the most popular deep learning tools, widely used in
machine learning and deep neural network research. It was released on 9 November 2015 under the
Apache License 2.0. It is built so that it can easily run on multiple CPUs and GPUs as well as
on mobile operating systems, and it provides wrappers in several languages such as Java, C++,
and Python.

Normalization:

Normalization is a scaling technique in machine learning, applied during data preparation, that
changes the values of numeric columns in the dataset to a common scale. It is not necessary for
all datasets; it is required only when the features of the machine learning model have
different ranges.

Mathematically, we can calculate normalization with the formula:

Xn = (X - Xminimum) / (Xmaximum - Xminimum)

where

● Xn = normalized value

● Xmaximum = maximum value of the feature

● Xminimum = minimum value of the feature


Example: Let's assume we have a dataset with known maximum and minimum values of a feature. To
normalize the data, values are shifted and rescaled so that their range varies between 0 and 1.
This technique is also known as Min-Max scaling. In this scaling technique, the feature values
change as follows:

Case 1 - If the value of X is the minimum, the numerator is 0, and hence the normalized value
is also 0:

Xn = (Xminimum - Xminimum) / (Xmaximum - Xminimum) = 0

Case 2 - If the value of X is the maximum, the numerator equals the denominator, and hence the
normalized value is 1:

Xn = (Xmaximum - Xminimum) / (Xmaximum - Xminimum) = 1

Case 3 - If the value of X is neither the maximum nor the minimum, the normalized value lies
between 0 and 1.

Hence, normalization can be defined as a scaling method where values are shifted and rescaled
so that they lie between 0 and 1; in other words, it is the Min-Max scaling technique.
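In code, this Min-Max rescaling is usually done with scikit-learn's MinMaxScaler rather than by
hand; a minimal sketch follows, where the small array of ages is made up purely for
illustration.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

ages = np.array([[18.0], [35.0], [60.0], [80.0]])   # toy feature values, made up
scaler = MinMaxScaler()
print(scaler.fit_transform(ages))   # 18 -> 0.0, 80 -> 1.0, the rest in between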

Normalization techniques in Machine Learning

Although there are many feature normalization techniques in machine learning, a few of them are
the most frequently used. These are as follows:


● Min-Max Scaling: This technique is also referred to simply as scaling. As discussed above,
the Min-Max scaling method shifts and rescales the values of the attributes so that they end up
ranging between 0 and 1.

● Standardization scaling: Standardization scaling is also known as Z-score normalization, in
which values are centered around the mean with a unit standard deviation; the mean of the
attribute becomes zero and the resulting distribution has a unit standard deviation.
Mathematically, we calculate the standardized value by subtracting the mean from the feature
value and dividing by the standard deviation.

Hence, standardization can be expressed as:

X' = (X - µ) / σ

Here, µ represents the mean of the feature values, and σ represents their standard deviation.

However, unlike the Min-Max scaling technique, feature values are not restricted to a specific
range in the standardization technique.

This technique is helpful for machine learning algorithms that use distance measures, such as
KNN, K-means clustering, and Principal Component Analysis. Further, it is appropriate when the
model assumes that the data is normally distributed.
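A matching sketch for Z-score standardization uses scikit-learn's StandardScaler; X_train and
X_test are assumed to be the training and test feature matrices, and the scaler is fitted on
the training data only.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)   # learns mean and std from the training set
X_test_std = scaler.transform(X_test)         # reuses the training statistics on the test set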

When to use Normalization or Standardization?

Which is suitable for our machine learning model, normalization or standardization? This is a
common source of confusion among data scientists and machine learning engineers. Although both
techniques serve a similar purpose, the choice between normalization and standardization
depends on your problem and the algorithm you are using.

1. Normalization is a transformation technique that helps to improve the performance and
accuracy of your model. Normalization is useful when you do not know the feature distribution
exactly, i.e. when the feature distribution of the data does not follow a Gaussian (bell curve)
distribution. Normalization squeezes values into a bounded range, so if you have outliers in
the data, the normalized values will be affected by them.

Further, it is useful for algorithms that do not assume any particular distribution of the
data, such as KNN and artificial neural networks.

2. Standardization is useful when you know the feature distribution of the data, in other
words, when your data follows a Gaussian distribution (although this does not have to be
strictly true). Unlike normalization, standardization does not have a bounding range, so
outliers in your data have less effect on standardization.

Further, it is useful for techniques that assume variables on comparable scales, such as linear
regression, logistic regression, and linear discriminant analysis.

Example: Let's consider a dataset with two attributes, age and salary, where age ranges from 0
to 80 years and income varies from 0 to 75,000 dollars or more, so income values are roughly
1,000 times larger than age values. As a result, the ranges of these two attributes are very
different from one another.

Because of its larger values, the income attribute will organically influence the result more
when we undertake further analysis, such as multivariate linear regression. However, this does
not necessarily imply that it is a better predictor. As a result, we normalize the data so that
all of the variables are on the same scale.

Further, normalization is also helpful in predicting credit risk scores, where it is applied to
all numeric data except the class column; one approach uses the tanh transformation, which
converts all numeric features into values in the range 0 to 1.


Confusion Matrix:

The confusion matrix is a matrix used to determine the performance of classification models for
a given set of test data. It can only be determined if the true values for the test data are
known. The matrix itself is easy to understand, but the related terminology may be confusing.
Since it shows the errors in the model's performance in the form of a matrix, it is also known
as an error matrix. Some features of the confusion matrix are given below:

● For 2 prediction classes, the matrix is a 2x2 table; for 3 classes, it is a 3x3 table, and so
on.

● The matrix is divided into two dimensions, predicted values and actual values, along with the
total number of predictions.

● Predicted values are the values predicted by the model, and actual values are the true values
for the given observations.

● It looks like the table below:

                   Actual: Yes        Actual: No
Predicted: Yes     True Positive      False Positive
Predicted: No      False Negative     True Negative

The table above has the following cases:

● True Negative: the model predicted No, and the actual value was also No.

● True Positive: the model predicted Yes, and the actual value was also Yes.

● False Negative: the model predicted No, but the actual value was Yes. It is also called a
Type-II error.

● False Positive: the model predicted Yes, but the actual value was No. It is also called a
Type-I error.

Need for Confusion Matrix in Machine learning

● It evaluates the performance of classification models when they make predictions on test
data, and tells how good our classification model is.

● It tells us not only the errors made by the classifier but also the type of error, i.e.
whether it is a Type-I or Type-II error.

● With the help of the confusion matrix, we can calculate different parameters for the model,
such as accuracy, precision, etc.

Example: We can understand the confusion matrix using an example.

Suppose we are trying to create a model that predicts whether or not a person has a particular
disease. The confusion matrix for this is given as:

From the above example, we can conclude that:


● The table is given for a two-class classifier, which has two predictions, "Yes" and "No."
Here, Yes means the patient has the disease, and No means the patient does not have the
disease.

● The classifier has made a total of 100 predictions. Out of 100 predictions, 89 are correct
predictions and 11 are incorrect predictions.

● The model predicted "Yes" 32 times and "No" 68 times, whereas the actual "Yes" occurred 27
times and the actual "No" 73 times.

Calculations using Confusion Matrix:

We can perform various calculations for the model, such as the model's accuracy, using this
matrix. These calculations are given below:

● Classification Accuracy: one of the important parameters for classification problems. It
defines how often the model predicts the correct output, and is the ratio of the number of
correct predictions to the total number of predictions made by the classifier:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

● Misclassification rate: also termed the error rate, it defines how often the model gives
wrong predictions. It is the ratio of the number of incorrect predictions to the total number
of predictions made by the classifier:

Error rate = (FP + FN) / (TP + TN + FP + FN)

● Precision: the number of correct positive outputs provided by the model, i.e. out of all the
positive classes predicted by the model, how many were actually positive:

Precision = TP / (TP + FP)


● Recall: out of the total actual positive classes, how many our model predicted correctly. The
recall should be as high as possible:

Recall = TP / (TP + FN)

● F-measure: if one model has low precision and high recall, or vice versa, it is difficult to
compare the models, so we use the F-score, which evaluates recall and precision at the same
time. The F-score is maximum when recall equals precision:

F-measure = 2 * (Precision * Recall) / (Precision + Recall)

Other important terms used with the confusion matrix:

● Null Error rate: it defines how often our model would be incorrect if it always predicted the
majority class. As per the accuracy paradox, it is said that "the best classifier has a higher
error rate than the null error rate."

● ROC Curve: the ROC is a graph displaying a classifier's performance for all possible
thresholds. The graph is plotted with the true positive rate on the Y-axis and the false
positive rate on the X-axis.
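These quantities can be computed directly with scikit-learn; a minimal sketch, assuming y_test
and a fitted classifier's predictions y_pred are available:

from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

cm = confusion_matrix(y_test, y_pred)   # rows: actual classes, columns: predicted classes
print(cm)
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))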

Code and Output:


Conclusion:

In this way, we built a neural network-based classifier that can determine whether a bank
customer will leave in the next 6 months.


Assignment Questions:

1) What is Normalization?

2) What is Standardization?
3) Explain Confusion Matrix ?

4) Define the following: Classification Accuracy, Misclassification Rate, Precision.

5) One Example of Confusion Matrix?


Write-up (4) | Correctness of Program (4) | Documentation of Program (4) | Viva (4) | Timely Completion (4) | Total (20) | Dated Sign of Subject Teacher

Expected Date of Completion:.............................................................................Actual Date of Completion:

—--------------------------------------------------------------------------------------
Group B
Assignment No : 10

—---------------------------------------------------------------------------------------

Title of the Assignment: Implement K-Nearest Neighbors algorithm on diabetes.csv dataset.


Compute confusion matrix, accuracy, error rate, precision and recall on the given dataset.

Dataset Description: We will try to build a machine learning model to accurately predict
whether or not the patients in the dataset have diabetes. The dataset consists of several
medical predictor variables and one target variable, Outcome. Predictor variables include the
number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

Link for Dataset: Diabetes predication system with KNN algorithm | Kaggle

Objective of the Assignment:


Students should be able to preprocess the dataset, identify outliers, check correlation, and
implement the KNN algorithm and random forest classification models, then evaluate them with
metrics such as confusion_matrix, accuracy_score, mean_squared_error, r2_score, roc_auc_score,
roc_curve, etc.

Prerequisite:
1. Basic knowledge of Python
2. Concept of Confusion Matrix
3. Concept of roc_auc curve.
4. Concept of Random Forest and KNN algorithms


k-Nearest Neighbors (k-NN) is a supervised machine learning model. Supervised learning is when
a model learns from data that is already labeled. A supervised learning model takes in a set of
input objects and output values. The model then trains on that data to learn how to map the
inputs to the desired outputs, so it can make predictions on unseen data.

k-NN models work by taking a data point and looking at the 'k' closest labeled data points. The
data point is then assigned the label of the majority of the 'k' closest points. For example,
if k = 5, and 3 of the closest points are 'green' and 2 are 'red', then the data point in
question would be labeled 'green', since 'green' is the majority.

Scikit-learn is a machine learning library for Python. In this tutorial, we will build a k-NN
model using scikit-learn to predict whether a patient has diabetes.
Reading in the training data

For our k-NN model, the first step is to read in the data we will use as input. For this
example, we are using the diabetes dataset. To start, we will use Pandas to read in the data. I
will not go into detail on Pandas, but it is a library you should become familiar with if
you're looking to dive further into data science and machine learning.

import pandas as pd

# read in the data using pandas
df = pd.read_csv('data/diabetes_data.csv')

# check that the data has been read in properly
df.head()

Next, let's see how much data we have. We will call the 'shape' attribute on our dataframe to
see how many rows and columns there are in our data. The rows indicate the number of patients
and the columns indicate the number of features (age, weight, etc.) in the dataset for each
patient.

# check the number of rows and columns in the dataset
df.shape

Output: (768, 9)

We can see that we have 768 rows of data (potential diabetes patients) and 9 columns (8 input
features and 1 target output).

Split up the dataset into inputs and targets

Now let's split up our dataset into inputs (X) and our target (y). Our input will be every
column except 'diabetes', because 'diabetes' is what we will be attempting to predict.
Therefore, 'diabetes' will be our target.

We will use pandas' 'drop' function to drop the column 'diabetes' from our dataframe and store
it in the variable 'X'. This will be our input.

# create a dataframe with all training data except the target column
X = df.drop(columns=['diabetes'])

# check that the target variable has been removed
X.head()

We will insert the 'diabetes' column of our dataset into our target variable (y).

# separate the target values
y = df['diabetes'].values

# view the first 5 target values
y[0:5]


Split the dataset into train and test data

Now we will split the dataset into training data and testing data. The training data is the
data that the model will learn from. The testing data is the data we will use to see how well
the model performs on unseen data.

Scikit-learn has a function called 'train_test_split' that makes it easy for us to split our
dataset into training and testing data.

from sklearn.model_selection import train_test_split

# split the dataset into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)

'train_test_split' takes in 5 parameters. The first two parameters are the input and target
data we split up earlier. Next, we set 'test_size' to 0.2. This means that 20% of all the data
will be used for testing, which leaves 80% of the data as training data for the model to learn
from. Setting 'random_state' to 1 ensures that we get the same split each time, so we can
reproduce our results.

Setting 'stratify' to y makes our training split represent the proportion of each value in the
y variable. For example, in our dataset, if 25% of patients have diabetes and 75% don't have
diabetes, setting 'stratify' to y will ensure that the random split has 25% of patients with
diabetes and 75% of patients without diabetes.

Building and training the model

Next, we have to build the model. Here is the code:

from sklearn.neighbors import KNeighborsClassifier

# create a KNN classifier
knn = KNeighborsClassifier(n_neighbors=3)

# fit the classifier to the training data
knn.fit(X_train, y_train)

First, we create a new k-NN classifier and set 'n_neighbors' to 3. To recap, this means that if
at least 2 out of the 3 nearest points to a new data point are patients without diabetes, then
the new data point will be labeled 'no diabetes', and vice versa. In other words, a new data
point is labeled by the majority of the 3 nearest points.


We have set 'n_neighbors' to 3 as a starting point. We will go into more detail below on how to
better select a value for 'n_neighbors' so that the model can improve its performance.


Next, we need to train the model. In order to train our new model, we will use the 'fit'
function and pass in our training data as parameters to fit our model to the training data.

Testing the model

Once the model is trained, we can use the 'predict' function on our model to make predictions
on our test data. As seen when inspecting 'y' earlier, 0 indicates that the patient does not
have diabetes and 1 indicates that the patient does have diabetes. To save space, we will only
print the first 5 predictions of our test set.

# show the first 5 model predictions on the test data
knn.predict(X_test)[0:5]

We can see that the model predicted 'no diabetes' for the first 4 patients in the test set and
'has diabetes' for the 5th patient.

Now let's see how accurate our model is on the full test set. To do this, we will use the
'score' function and pass in our test input and target data to see how well our model's
predictions match up to the actual results.

# check the accuracy of our model on the test data
knn.score(X_test, y_test)

Our model has an accuracy of approximately 66.88%. It's a good start, but we will see how we
can increase model performance below.

Congrats! You have now built an amazing k-NN model!

k-Fold Cross-Validation

Cross-validation is when the dataset is randomly split up into 'k' groups. One of the groups is
used as the test set and the rest are used as the training set. The model is trained on the
training set and scored on the test set. Then the process is repeated until each unique group
has been used as the test set.


For example, for 5-fold cross-validation, the dataset would be split into 5 groups, and the
model would be trained and tested 5 separate times, so each group would get a chance to be the
test set. This can be seen in the figure below.

[Figure: k-fold cross-validation (image credit)]

The train-test split method we used earlier is called 'holdout'. Cross-validation is better
than the holdout method because the holdout score depends on how the data happens to be split
into train and test sets. Cross-validation gives the model an opportunity to be tested on
multiple splits, so we get a better idea of how the model will perform on unseen data.

In order to train and test our model using cross-validation, we will use the 'cross_val_score'
function with a cross-validation value of 5. 'cross_val_score' takes in our k-NN model and our
data as parameters. Then it splits our data into 5 groups and fits and scores our data 5
separate times, recording the accuracy score in an array each time. We will save the accuracy
scores in the 'cv_scores' variable.

To find the average of the 5 scores, we will use numpy's mean function, passing in 'cv_scores'.
Numpy is a useful math library in Python.


from sklearn.model_selection import
cross_val_score import numpy as np#create a
new KNN model
knn_cv = KNeighborsClassifier(n_neighbors=3)#train model with cv of 5
cv_scores = cross_val_score(knn_cv, X, y, cv=5)#print each cv score (accuracy) and
average them print(cv_scores)

Using cross-validation, our mean score is about 71.36%. This is a more accurate

representation of how our model will perform on unseen data than our earlier testing

using the holdout method.


Hypertuning model parameters using GridSearchCV

When we built our initial k-NN model, we set the parameter 'n_neighbors' to 3 as a starting
point with no real logic behind that choice.

Hypertuning parameters is when you go through a process to find the optimal parameters for your
model to improve accuracy. In our case, we will use GridSearchCV to find the optimal value for
'n_neighbors'.

GridSearchCV works by training our model multiple times on a range of parameters that we
specify. That way, we can test our model with each parameter and figure out the optimal values
to get the best accuracy results.

For our model, we will specify a range of values for 'n_neighbors' in order to see which value
works best. To do this, we will create a dictionary, setting 'n_neighbors' as the key and using
numpy to create an array of values from 1 to 24.

Our new model using grid search will take in a new k-NN classifier, our param_grid, and a
cross-validation value of 5 in order to find the optimal value for 'n_neighbors'.
from sklearn.model_selection import GridSearchCV

# create a new knn model
knn2 = KNeighborsClassifier()

# create a dictionary of all values we want to test for n_neighbors
param_grid = {'n_neighbors': np.arange(1, 25)}

# use grid search to test all values for n_neighbors
knn_gscv = GridSearchCV(knn2, param_grid, cv=5)

# fit the model to the data
knn_gscv.fit(X, y)

After training, we can check which of the values for 'n_neighbors' that we tested performed the
best. To do this, we will call 'best_params_' on our model.

# check the top performing n_neighbors value
knn_gscv.best_params_

We can see that 14 is the optimal value for 'n_neighbors'. We can use the 'best_score_'
attribute to check the accuracy of our model when 'n_neighbors' is 14. 'best_score_' outputs
the mean accuracy of the scores obtained through cross-validation.

# check the mean score for the top performing value of n_neighbors
knn_gscv.best_score_


By using grid search to find the optimal parameter for our model, we have improved our

model accuracy by over 4%!
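The assignment also asks for the confusion matrix, error rate, precision, and recall; a hedged
sketch of how these could be computed with the knn model and the test split from this
walkthrough is shown below.

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

y_pred = knn.predict(X_test)
print(confusion_matrix(y_test, y_pred))    # rows: actual, columns: predicted
acc = accuracy_score(y_test, y_pred)
print("Accuracy  :", acc)
print("Error rate:", 1 - acc)              # misclassification rate
print("Precision :", precision_score(y_test, y_pred))
print("Recall    :", recall_score(y_test, y_pred))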

Code :-


Conclusion:

In this way, we implemented the K-Nearest Neighbors algorithm on the diabetes dataset and
computed the confusion matrix, accuracy, error rate, precision, and recall.


Write-up (4) | Correctness of Program (4) | Documentation of Program (4) | Viva (4) | Timely Completion (4) | Total (20) | Dated Sign of Subject Teacher

Expected Date of Completion:.............................................................................Actual Date of Completion:

—--------------------------------------------------------------------------------------
Group B
Assignment No : 11

—---------------------------------------------------------------------------------------

Title of the Assignment: Implement K-Means clustering/ hierarchical clustering on


sales_data_sample.csv dataset. Determine the number of clusters using the elbow method.

Dataset Description: The data includes the following features:


1. Customer ID
2. Customer Gender
3. Customer Age
4. Annual Income of the customer (in Thousand Dollars)
5. Spending score of the customer (based on customer behavior and spending nature)

Objective of the Assignment:


Students should be able to understand how to use unsupervised learning to segment data into
different clusters or groups, and to use those groups to draw conclusions or make predictions.

Prerequisite:

1. Knowledge of Python
2. Unsupervised learning
3. Clustering
4. Elbow method

Clustering algorithms try to find natural clusters in data; the various aspects of how the
algorithms cluster data can be tuned and modified. Clustering is based on the principle that
items within the same cluster must be similar to each other. The data is grouped in such a way
that related elements are close to each other.


Diverse and different types of data are subdivided into smaller groups.

Uses of Clustering
Marketing:

In the field of marketing, clustering can be used to identify various


customer groups with existing customer data. Based on that,
customers can be provided with discounts, offers, promo codes etc.

Real Estate:

Clustering can be used to understand and divide various property


locations based on value and importance. Clustering algorithms can
process through the data and identify various groups of property on
the basis of probable price.

BookStore and Library management:

Libraries and bookstores can use clustering to better manage the book database. With proper
book ordering, better operations can be implemented.

Document Analysis:

Often, we need to group together various research texts and documents according to similarity,
and in such cases we don't have any labels. Manually labelling large amounts of data is also
not possible. Using clustering, the algorithm can process the text and group it into different
themes.


These are some of the interesting use cases of clustering.

K-Means Clustering
K-Means clustering is an unsupervised machine learning algorithm that divides the given data
into the given number of clusters. Here, "K" is the number of predefined clusters that need to
be created.

It is a centroid based algorithm in which each cluster is associated


with a centroid. The main idea is to reduce the distance between the
data points and their respective cluster centroid.

The algorithm takes raw unlabelled data as an input and divides the
dataset into clusters and the process is repeated until the best
clusters are found.

K-Means is very easy and simple to implement. It is highly scalable,


can be applied to both small and large datasets. There is, however, a
problem with choosing the number of clusters or K. Also, with the
increase in dimensions, stability decreases. But, overall K Means is a
simple and robust algorithm that makes clustering very easy.

# Importing the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D

%matplotlib inline

The necessary libraries are imported.

#Reading the excel file


data=pd.read_excel("Mall_Customers.xlsx")

The data is read. I will share a link to the entire code and excel data
at the end of the article.

The data has 200 entries, that is, data from 200 customers.

data.head()

So let us have a look at the data.



data.corr()

The data seems to be interesting. Let us look at the data distribution.

Annual Income Distribution:

# Distribution of Annual Income
plt.figure(figsize=(10, 6))
sns.set(style='whitegrid')
sns.distplot(data['Annual Income (k$)'])
plt.title('Distribution of Annual Income (k$)', fontsize=20)
plt.xlabel('Range of Annual Income (k$)')
plt.ylabel('Count')


Most of the annual income falls between 50K to 85K.

Age Distribution:

# Distribution of Age
plt.figure(figsize=(10, 6))
sns.set(style='whitegrid')
sns.distplot(data['Age'])
plt.title('Distribution of Age', fontsize=20)
plt.xlabel('Range of Age')
plt.ylabel('Count')


There are customers of a wide variety of ages.

Spending Score Distribution:

# Distribution of Spending Score
plt.figure(figsize=(10, 6))
sns.set(style='whitegrid')
sns.distplot(data['Spending Score (1-100)'])
plt.title('Distribution of Spending Score (1-100)', fontsize=20)
plt.xlabel('Range of Spending Score (1-100)')
plt.ylabel('Count')


The maximum spending score is in the range of 40 to 60.

Gender Analysis:
genders = data.Gender.value_counts()
sns.set_style("darkgrid")
plt.figure(figsize=(10, 4))
sns.barplot(x=genders.index, y=genders.values)
plt.show()

More female customers than male.

I have made more visualizations. Do have a look at the GitHub link at the end to understand the
data analysis and overall data exploration.

Clustering based on 2 features

First, we work with two features only, annual income and spending score.

# We take just the Annual Income and Spending Score
df1 = data[["CustomerID", "Gender", "Age", "Annual Income (k$)", "Spending Score (1-100)"]]
X = df1[["Annual Income (k$)", "Spending Score (1-100)"]]

# The input data
X.head()

# Scatterplot of the input data
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)', data=X, s=60)
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Spending Score (1-100) vs Annual Income (k$)')
plt.show()

The data does seem to hold some patterns.


# Importing KMeans from sklearn
from sklearn.cluster import KMeans

Now we calculate the Within Cluster Sum of Squared Errors (WSS) for
different values of k. Next, we choose the k for which WSS first starts
to diminish. This value of K gives us the best number of clusters to
make from the raw data.

wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters=i)
    km.fit(X)
    wcss.append(km.inertia_)

# The elbow curve
plt.figure(figsize=(12, 6))
plt.plot(range(1, 11), wcss, linewidth=2, color="red", marker="8")
plt.xlabel("K Value")
plt.xticks(np.arange(1, 11, 1))
plt.ylabel("WCSS")
plt.show()

The plot:


This is known as the elbow graph, the x-axis being the number of
clusters, the number of clusters is taken at the elbow joint point. This
point is the point where making clusters is most relevant as here the
value of WCSS suddenly stops decreasing. Here in the graph, after 5
the drop is minimal, so we take 5 to be the number of clusters.

# Taking 5 clusters
km1 = KMeans(n_clusters=5)

# Fitting the input data
km1.fit(X)

# Predicting the labels of the input data
y = km1.predict(X)

# Adding the labels to a column named label
df1["label"] = y

# The new dataframe with the clustering done
df1.head()

The labels added to the data.


# Scatterplot of the clusters
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)', hue="label",
                palette=['green', 'orange', 'brown', 'dodgerblue', 'red'],
                legend='full', data=df1, s=60)
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Spending Score (1-100) vs Annual Income (k$)')
plt.show()

We can clearly see that 5 different clusters have been formed from the data. The red cluster is
the customers with the least income and least spending score; similarly, the blue cluster is
the customers with the most income and most spending score.
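The assignment title also mentions hierarchical clustering; a hedged sketch of agglomerative
clustering and a dendrogram on the same two features X is shown below. The choice of 5 clusters
and Ward linkage is an assumption made to match the elbow result above.

import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering

# Dendrogram to eyeball a sensible number of clusters
plt.figure(figsize=(12, 6))
sch.dendrogram(sch.linkage(X, method="ward"))
plt.title("Dendrogram")
plt.xlabel("Customers")
plt.ylabel("Euclidean distance")
plt.show()

# Agglomerative clustering with 5 clusters
hc = AgglomerativeClustering(n_clusters=5, linkage="ward")
labels_hc = hc.fit_predict(X)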

k-Means Clustering on the basis of 3D data



Now we shall be working with three features. Apart from the spending score and annual income of the customers, we shall also take in their age.

#Taking the features
X2 = df2[["Age","Annual Income (k$)","Spending Score (1-100)"]]


#Now we calculate the Within Cluster Sum of Squared Errors (WSS) for different values of k
wcss = []
for k in range(1,11):
    kmeans = KMeans(n_clusters=k, init="k-means++")
    kmeans.fit(X2)
    wcss.append(kmeans.inertia_)

plt.figure(figsize=(12,6))
plt.plot(range(1,11), wcss, linewidth=2, color="red", marker="8")
plt.xlabel("K Value")
plt.xticks(np.arange(1,11,1))
plt.ylabel("WCSS")
plt.show()

The WCSS curve.

Here we can assume that K=5 will be a good value.

#We choose the k at which the WCSS starts to level off
km2 = KMeans(n_clusters=5)
y2 = km2.fit_predict(X2)

df2["label"] = y2
#The data with
labels df2.head()


The data:

Now we plot it.

#3D plot, as we did the clustering on the basis of 3 input features
fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df2.Age[df2.label == 0], df2["Annual Income (k$)"][df2.label == 0],
           df2["Spending Score (1-100)"][df2.label == 0], c='purple', s=60)
ax.scatter(df2.Age[df2.label == 1], df2["Annual Income (k$)"][df2.label == 1],
           df2["Spending Score (1-100)"][df2.label == 1], c='red', s=60)
ax.scatter(df2.Age[df2.label == 2], df2["Annual Income (k$)"][df2.label == 2],
           df2["Spending Score (1-100)"][df2.label == 2], c='blue', s=60)
ax.scatter(df2.Age[df2.label == 3], df2["Annual Income (k$)"][df2.label == 3],
           df2["Spending Score (1-100)"][df2.label == 3], c='green', s=60)
ax.scatter(df2.Age[df2.label == 4], df2["Annual Income (k$)"][df2.label == 4],
           df2["Spending Score (1-100)"][df2.label == 4], c='yellow', s=60)
ax.view_init(35, 185)
plt.xlabel("Age")
plt.ylabel("Annual Income (k$)")
ax.set_zlabel('Spending Score (1-100)')
plt.show()

The output:


What we get is a 3D plot. Now, if we want to know the customer IDs in each cluster, we can do that too.

cust1 = df2[df2["label"]==1]
print('Number of customer in 1st group=', len(cust1))
print('They are -', cust1["CustomerID"].values)
print(" ")
cust2 = df2[df2["label"]==2]
print('Number of customer in 2nd group=', len(cust2))
print('They are -', cust2["CustomerID"].values)
print(" ")
cust3 = df2[df2["label"]==0]
print('Number of customer in 3rd group=', len(cust3))
print('They are -', cust3["CustomerID"].values)
print(" ")
cust4 = df2[df2["label"]==3]
print('Number of customer in 4th group=', len(cust4))
print('They are -', cust4["CustomerID"].values)
print(" ")
cust5 = df2[df2["label"]==4]
print('Number of customer in 5th group=', len(cust5))
print('They are -', cust5["CustomerID"].values)
print(" ")

The output we get:

Number of customer in 1st group= 24
They are - [129 131 135 137 139 141 145 147 149 151 153 155 157 159 161 163 165 167 169 171 173 175 177 179]
--------------------------------------------------
Number of customer in 2nd group= 29
They are - [ 47 51 55 56 57 60 67 72 77 78 80 82 84 86 90 93 94 97 99 102 105 108 113 118 119 120 122 123 127]
--------------------------------------------------
Number of customer in 3rd group= 28
They are - [124 126 128 130 132 134 136 138 140 142 144 146 148 150 152 154 156 158 160 162 164 166 168 170 172 174 176 178]
--------------------------------------------------
Number of customer in 4th group= 22
They are - [ 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 46]
--------------------------------------------------
Number of customer in 5th group= 12
They are - [ 3 7 9 11 13 15 23 25 31 33 35 37]
--------------------------------------------------

Code and Output:


Conclusion:
We used K-Means clustering to segment the customer data. K-Means worked well here: almost all the clusters have a similar density, and the algorithm is fast and efficient in terms of computational cost.
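As a quick, hedged check of the "similar density" observation (the exact counts depend on the clustering run), the cluster sizes can be listed directly:

# Number of customers assigned to each of the 5 clusters
# (counts can vary slightly between KMeans runs)
print(df2["label"].value_counts().sort_index())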


Assignment No : 12

MINI PROJECT 2

Problem Statement: Build a machine learning model that predicts whether a passenger survived the Titanic shipwreck, using passenger data (i.e. name, age, gender, socio-economic class, etc.).

Importing the Libraries

# linear algebra
import numpy as np

# data processing
import pandas as pd

# data visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

# Algorithms
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC

Getting the Data


test_df = pd.read_csv("test.csv")
train_df = pd.read_csv("train.csv")

Data Exploration/Analysis
train_df.info()


The training set has 891 examples and 11 features + the target variable (survived). 2 of the features are floats, 5 are integers and 5 are objects. Below I have listed the features with a short description:

survival: Survival
PassengerId: Unique Id of a passenger
pclass: Ticket class
sex: Sex
Age: Age in years
sibsp: # of siblings / spouses aboard the Titanic
parch: # of parents / children aboard the Titanic
ticket: Ticket number
fare: Passenger fare
cabin: Cabin number
embarked: Port of Embarkation

train_df.describe()


Above we can see that 38% of the training set survived the Titanic. We can also see that the passenger ages range from 0.4 to 80. On top of that, we can already detect some features that contain missing values, like the 'Age' feature.

train_df.head(8)

From the table above, we can note a few things. First of all, we need to convert a lot of features into numeric ones later on, so that the machine learning algorithms can process them. Furthermore, we can see that the features have widely different ranges, which we will need to convert into roughly the same scale. We can also spot some more features that contain missing values (NaN = not a number), which we need to deal with.


Let's take a more detailed look at what data is actually missing:

total = train_df.isnull().sum().sort_values(ascending=False)
percent_1 = train_df.isnull().sum()/train_df.isnull().count()*100
percent_2 = (round(percent_1, 1)).sort_values(ascending=False)
missing_data = pd.concat([total, percent_2], axis=1, keys=['Total', '%'])
missing_data.head(5)

The Embarked feature has only 2 missing values, which can easily be filled. It will be much more tricky to deal with the 'Age' feature, which has 177 missing values. The 'Cabin' feature needs further investigation, but it looks like we might want to drop it from the dataset, since 77% of it is missing.

train_df.columns.values

Above you can see the 11 features + the target variable (survived). What features could contribute to a high survival rate?

To me it would make sense if everything except 'PassengerId', 'Ticket' and 'Name' were correlated with a high survival rate.
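Before digging into the individual plots, a quick hedged sanity check of this intuition (just illustrative, not part of the original analysis) is to group the survival rate by one of these features:

# Survival rate grouped by sex - a first look at whether the feature matters
train_df[['Sex', 'Survived']].groupby('Sex', as_index=False).mean()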

1. Age and Sex:

survived = 'survived'
not_survived = 'not survived'
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
women = train_df[train_df['Sex']=='female']
men = train_df[train_df['Sex']=='male']
ax = sns.distplot(women[women['Survived']==1].Age.dropna(), bins=18, label=survived, ax=axes[0], kde=False)
ax = sns.distplot(women[women['Survived']==0].Age.dropna(), bins=40, label=not_survived, ax=axes[0], kde=False)
ax.legend()
ax.set_title('Female')
ax = sns.distplot(men[men['Survived']==1].Age.dropna(), bins=18, label=survived, ax=axes[1], kde=False)
ax = sns.distplot(men[men['Survived']==0].Age.dropna(), bins=40, label=not_survived, ax=axes[1], kde=False)
ax.legend()
_ = ax.set_title('Male')

You can see that men have a high probability of survival when they are between 18 and 30 years old, which is also a little bit true for women, but not fully. For women the survival chances are higher between 14 and 40.

For men the probability of survival is very low between the ages of 5 and 18, but that isn't true for women. Another thing to note is that infants also have a slightly higher probability of survival.

Since there seem to be certain ages which have increased odds of survival, and because I want every feature to be roughly on the same scale, I will create age groups later on.

3. Embarked, Pclass and Sex:


FacetGrid = sns.FacetGrid(train_df, row='Embarked', size=4.5, aspect=1.6)
FacetGrid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette=None, order=None, hue_order=None)
FacetGrid.add_legend()


Embarked seems to be correlated with survival, depending on the gender.

Women on port Q and on port S have a higher chance of survival. The inverse is true if they are at port C. Men have a high survival probability if they are on port C, but a low probability if they are on port Q or S.

Pclass also seems to be correlated with survival. We will generate another plot of it below.

4. Pclass:
sns.barplot(x='Pclass', y='Survived', data=train_df)


Here we see clearly that Pclass is contributing to a person's chance of survival, especially if this person is in class 1. We will create another Pclass plot below.

grid = sns.FacetGrid(train_df, col='Survived', row='Pclass', size=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend();

The plot above confirms our assumption about pclass 1, but we can also spot a high probability that a person in pclass 3 will not survive.


5. SibSp and Parch:

SibSp and Parch would make more sense as a combined feature that shows the total number of relatives a person has on the Titanic. I will create it below, and also a feature that shows if someone is not alone.

data = [train_df, test_df]
for dataset in data:
    dataset['relatives'] = dataset['SibSp'] + dataset['Parch']
    dataset.loc[dataset['relatives'] > 0, 'not_alone'] = 0
    dataset.loc[dataset['relatives'] == 0, 'not_alone'] = 1
    dataset['not_alone'] = dataset['not_alone'].astype(int)

train_df['not_alone'].value_counts()

axes = sns.factorplot('relatives', 'Survived', data=train_df)

Here we can see that you had a high probability of survival with 1 to 3 relatives, but a lower one if you had fewer than 1 or more than 3 (except for some cases with 6 relatives).
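To put numbers on this (a hedged, illustrative check that is not in the original write-up), the survival rate can also be grouped by the new 'relatives' column:

# Average survival rate per number of relatives on board
train_df[['relatives', 'Survived']].groupby('relatives', as_index=False).mean()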

Data Preprocessing

First, I will drop 'PassengerId' from the train set, because it does not contribute to a person's survival probability. I will not drop it from the test set, since it is required there for the submission.

train_df = train_df.drop(['PassengerId'], axis=1)

Missing Data:

Cabin:
As a reminder, we have to deal with Cabin (687), Embarked (2) and Age (177) missing values. First I thought we would have to delete the 'Cabin' variable, but then I found something interesting. A cabin number looks like 'C123' and the letter refers to the deck. Therefore we're going to extract these and create a new feature that contains a person's deck. Afterwards we will convert the feature into a numeric variable. The missing values will be converted to zero. In the picture below you can see the actual decks of the Titanic, ranging from A to G.

import re
deck = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "U": 8}
data = [train_df, test_df]

for dataset in data:
    dataset['Cabin'] = dataset['Cabin'].fillna("U0")
    dataset['Deck'] = dataset['Cabin'].map(lambda x: re.compile("([a-zA-Z]+)").search(x).group())
    dataset['Deck'] = dataset['Deck'].map(deck)
    dataset['Deck'] = dataset['Deck'].fillna(0)
    dataset['Deck'] = dataset['Deck'].astype(int)

# we can now drop the cabin feature
train_df = train_df.drop(['Cabin'], axis=1)
test_df = test_df.drop(['Cabin'], axis=1)

Age:
Now we can tackle the issue with the Age feature's missing values. I will create an array that contains random numbers, which are computed based on the mean age value with regard to the standard deviation and is_null.

data = [train_df, test_df]
for dataset in data:
    mean = train_df["Age"].mean()
    std = test_df["Age"].std()
    is_null = dataset["Age"].isnull().sum()
    # compute random numbers between mean - std and mean + std
    rand_age = np.random.randint(mean - std, mean + std, size=is_null)
    # fill NaN values in the Age column with the random values generated
    age_slice = dataset["Age"].copy()
    age_slice[np.isnan(age_slice)] = rand_age
    dataset["Age"] = age_slice
    dataset["Age"] = dataset["Age"].astype(int)

train_df["Age"].isnull().sum()

Embarked:

Since the Embarked feature has only 2 missing values, we will just fill these with the most common one.

train_df['Embarked'].describe()


common_value = 'S'
data = [train_df, test_df]

for dataset in data:
    dataset['Embarked'] = dataset['Embarked'].fillna(common_value)

Converting Features:

train_df.info()

Above you can see that 'Fare' is a float and we have to deal with 4 categorical features: Name, Sex, Ticket and Embarked. Let's investigate and transform them one after another.

Fare:
Converting "Fare" from float to int64, using the "astype()" function pandas provides:
data = [train_df, test_df]

for dataset in data:
    dataset['Fare'] = dataset['Fare'].fillna(0)
    dataset['Fare'] = dataset['Fare'].astype(int)

Name:
We will use the Name feature to extract the titles from the Name, so that we can build a new feature out of that.

data = [train_df, test_df]
titles = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}

for dataset in data:

    dataset['Title'] = dataset.Name.str.extract(' ([A-Za-z]+)\.', expand=False)
    # replace rare titles with a more common title or as Rare
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr',
                                                 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
    # convert titles into numbers
    dataset['Title'] = dataset['Title'].map(titles)
    # filling NaN with 0, to be safe
    dataset['Title'] = dataset['Title'].fillna(0)

train_df

Sex:
Convert the 'Sex' feature into numeric.

genders = {"male": 0, "female": 1}
data = [train_df, test_df]

for dataset in data:
    dataset['Sex'] = dataset['Sex'].map(genders)

Ticket:

train_df['Ticket'].describe()

Since the Ticket attribute has 681 unique tickets, it will be a bit tricky to convert them into useful categories. So we will drop it from the dataset.

train_df = train_df.drop(['Ticket'], axis=1)
test_df = test_df.drop(['Ticket'], axis=1)

Embarked:
Convert the 'Embarked' feature into numeric.

ports = {"S": 0, "C": 1, "Q": 2}
data = [train_df, test_df]

for dataset in data:
    dataset['Embarked'] = dataset['Embarked'].map(ports)

Creating Categories:

We will now create categories within the following features:

Age:
Now we need to convert the 'Age' feature. First we will convert it from float into integer. Then we will create the new 'AgeGroup' variable, by categorizing every age into a group. Note that it is important to pay attention to how you form these groups, since you don't want, for example, 80% of your data to fall into group 1.

data = [train_df, test_df]
for dataset in data:
    dataset['Age'] = dataset['Age'].astype(int)
    dataset.loc[ dataset['Age'] <= 11, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 11) & (dataset['Age'] <= 18), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 18) & (dataset['Age'] <= 22), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 22) & (dataset['Age'] <= 27), 'Age'] = 3
    dataset.loc[(dataset['Age'] > 27) & (dataset['Age'] <= 33), 'Age'] = 4
    dataset.loc[(dataset['Age'] > 33) & (dataset['Age'] <= 40), 'Age'] = 5
    dataset.loc[(dataset['Age'] > 40) & (dataset['Age'] <= 66), 'Age'] = 6
    dataset.loc[ dataset['Age'] > 66, 'Age'] = 6

Fare:
For the 'Fare' feature, we need to do the same as with the 'Age' feature. But it isn't that easy, because if we cut the range of the fare values into a few equally big categories, 80% of the values would fall into the first category. Fortunately, we can use the pandas "qcut()" function to see how we can form the categories.

train_df.head(10)
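As a hedged illustration of that idea (not a line from the original notebook, and note that 'Fare' was already cast to int above, so the exact edges differ slightly from the original float fares), qcut can be used to inspect roughly equal-sized fare bins:

# Inspect quantile-based fare bins; cut-offs like 7.91 / 14.454 / 31 used below
# come from this kind of quantile inspection on the raw fares
pd.qcut(train_df['Fare'], q=4, duplicates='drop').value_counts().sort_index()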


data = [train_df, test_df]

for dataset in data:
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
    dataset.loc[(dataset['Fare'] > 31) & (dataset['Fare'] <= 99), 'Fare'] = 3
    dataset.loc[(dataset['Fare'] > 99) & (dataset['Fare'] <= 250), 'Fare'] = 4
    dataset.loc[ dataset['Fare'] > 250, 'Fare'] = 5
    dataset['Fare'] = dataset['Fare'].astype(int)

Creating new Features

I will add two new features to the dataset, which I compute out of other features.

1. Age times Class

data = [train_df, test_df]
for dataset in data:
    dataset['Age_Class'] = dataset['Age'] * dataset['Pclass']

2. Fare per Person

for dataset in data:
    dataset['Fare_Per_Person'] = dataset['Fare']/(dataset['relatives']+1)
    dataset['Fare_Per_Person'] = dataset['Fare_Per_Person'].astype(int)

# Let's take a last look at the training set, before we start training the models.
train_df.head(10)


Building Machine Learning Models

Now we will train several Machine Learning models and compare their results. Note that because the dataset does not provide labels for its test set, we need to use the predictions on the training set to compare the algorithms with each other. Later on, we will use cross validation.

X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("PassengerId", axis=1).copy()

Stochastic Gradient Descent (SGD):


sgd = linear_model.SGDClassifier(max_iter=5, tol=None)
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)

sgd.score(X_train, Y_train)

acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)

Random Forest:
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)

Y_prediction = random_forest.predict(X_test)

random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)

Logistic Regression:

logreg = LogisticRegression()
logreg.fit(X_train, Y_train)

Y_pred = logreg.predict(X_test)

acc_log = round(logreg.score(X_train, Y_train) * 100, 2)


K Nearest Neighbor:

# KNN
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)

Gaussian Naive Bayes:

gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)

Perceptron:
perceptron = Perceptron(max_iter=5)
perceptron.fit(X_train, Y_train)

Y_pred = perceptron.predict(X_test)

acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)

Linear Support Vector Machine:


linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)

Y_pred = linear_svc.predict(X_test)

acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)

Decision Tree:

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)

Which is the best Model ?


results = pd.DataFrame({
'Model': ['Support Vector Machines', 'KNN', 'Logistic
Regression', 'Random Forest', 'Naive Bayes',
'Perceptron', 'Stochastic Gradient Decent',
'Decision Tree'],
'Score': [acc_linear_svc, acc_knn, acc_log,
acc_random_forest, acc_gaussian, acc_perceptron,
acc_sgd, acc_decision_tree]})
result_df = results.sort_values(by='Score', ascending=False)
result_df = result_df.set_index('Score')
result_df.head(9)


As we can see, the Random Forest classifier goes to first place. But first, let us check how the random forest performs when we use cross validation.

K-Fold Cross Validation:

K-Fold Cross Validation randomly splits the training data into K subsets called folds. Let's imagine we would split our data into 4 folds (K = 4). Our random forest model would be trained and evaluated 4 times, using a different fold for evaluation every time, while being trained on the remaining 3 folds.

The image below shows the process, using 4 folds (K = 4). Every row represents one training + evaluation process. In the first row, the model gets trained on the first, second and third subset and evaluated on the fourth. In the second row, the model gets trained on the second, third and fourth subset and evaluated on the first. K-Fold Cross Validation repeats this process until every fold has acted once as an evaluation fold.
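Purely as an illustrative sketch of the splitting scheme described above (using sklearn's KFold directly; the actual evaluation below uses cross_val_score with 10 folds):

from sklearn.model_selection import KFold

# Show how 4-fold cross validation partitions the training data
kf = KFold(n_splits=4, shuffle=True, random_state=1)
for fold, (train_idx, val_idx) in enumerate(kf.split(X_train), start=1):
    print(f"Fold {fold}: train on {len(train_idx)} rows, evaluate on {len(val_idx)} rows")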


The result of our K-Fold Cross Validation example would be an array that contains 4 different scores. We then need to compute the mean and the standard deviation of these scores.

The code below performs K-Fold Cross Validation on our random forest model, using 10 folds (K = 10). Therefore it outputs an array with 10 different scores.

from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(rf, X_train, Y_train, cv=10, scoring="accuracy")
print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard Deviation:", scores.std())

This looks much more realistic than before. Our model has an average accuracy of 82% with a standard deviation of 4%. The standard deviation shows us how precise the estimates are.

This means in our case that the accuracy of our model can differ by +/- 4%.

I think the accuracy is still really good, and since random forest is an easy-to-use model, we will try to increase its performance even further in the following section.

Random Forest

What is Random Forest ?

Random Forest is a supervised learning algorithm. As you can already see from its name, it creates a forest and makes it somehow random. The "forest" it builds is an ensemble of Decision Trees, most of the time trained with the "bagging" method. The general idea of the bagging method is that a combination of learning models increases the overall result.

To say it in simple words: Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.
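A minimal sketch of that bagging idea, using sklearn's generic BaggingClassifier around decision trees (this is only an illustration of the concept, not part of the original pipeline):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging: train many decision trees on bootstrap samples and combine their votes
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=1)
bagged_trees.fit(X_train, Y_train)
print(round(bagged_trees.score(X_train, Y_train) * 100, 2), "%")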


One big advantage of random forest is that it can be used for both classification and regression problems, which form the majority of current machine learning systems. With a few exceptions, a random-forest classifier has all the hyperparameters of a decision-tree classifier and also all the hyperparameters of a bagging classifier, to control the ensemble itself.

The random-forest algorithm brings extra randomness into the model when it is growing the trees. Instead of searching for the best feature while splitting a node, it searches for the best feature among a random subset of features. This process creates a wide diversity, which generally results in a better model. Therefore, when you are growing a tree in a random forest, only a random subset of the features is considered for splitting a node. You can even make trees more random by using random thresholds for each feature on top of that, rather than searching for the best possible thresholds (like a normal decision tree does).
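To make the two sources of randomness concrete, here is a hedged sketch (again not from the original notebook): max_features limits the random feature subset tried at each split, and sklearn's ExtraTreesClassifier additionally draws random split thresholds.

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# Random feature subsets at each split (sqrt of the feature count is a common choice)
rf_subset = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=1)

# Extra randomness: random split thresholds on top of random feature subsets
extra_random = ExtraTreesClassifier(n_estimators=100, max_features="sqrt", random_state=1)

for name, model in [("random forest", rf_subset), ("extra trees", extra_random)]:
    model.fit(X_train, Y_train)
    print(name, round(model.score(X_train, Y_train) * 100, 2), "%")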

Below you can see how a random forest with two trees would look:

Feature Importance

Another great quality of random forest is that it makes it very easy to measure the relative importance of each feature. Sklearn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest). It computes this score automatically for each feature after training and scales the results so that the sum of all importances is equal to 1. We will access this below:

importances = pd.DataFrame({'feature': X_train.columns, 'importance': np.round(random_forest.feature_importances_, 3)})
importances = importances.sort_values('importance', ascending=False).set_index('feature')
importances.head(15)

importances.plot.bar()

Conclusion:

not_alone and Parch don't play a significant role in our random forest classifier's prediction process. Because of that I will drop them from the dataset and train the classifier again. We could also remove more or fewer features, but this would need a more detailed investigation of the features' effect on our model. For now I think it's fine to remove only not_alone and Parch.

train_df = train_df.drop("not_alone", axis=1)
test_df = test_df.drop("not_alone", axis=1)

train_df = train_df.drop("Parch", axis=1)
test_df = test_df.drop("Parch", axis=1)

Training random forest again:


# Random Forest
random_forest = RandomForestClassifier(n_estimators=100, oob_score=True)
random_forest.fit(X_train, Y_train)
Y_prediction = random_forest.predict(X_test)

random_forest.score(X_train, Y_train)

acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
print(round(acc_random_forest, 2), "%")

92.82%

Our random forest model predicts as well as it did before. A general rule is that the more features you have, the more likely your model will suffer from overfitting, and vice versa. But I think our data looks fine for now and doesn't have too many features.

There is also another way to evaluate a random-forest classifier, which is probably much more accurate than the score we used before. What I am talking about is using the out-of-bag samples to estimate the generalization accuracy. I will not go into details here about how it works. Just note that the out-of-bag estimate is as accurate as using a test set of the same size as the training set. Therefore, using the out-of-bag error estimate removes the need for a set-aside test set.

print("oob score:", round(random_forest.oob_score_, 4)*100, "%")

oob score: 81.82 %

Now we can start tuning the hyperparameters of random forest.

Hyperparameter Tuning

Below you can see the code of the hyperparameter tuning for the parameters criterion, min_samples_leaf, min_samples_split and n_estimators.

I put this code into a markdown cell and not into a code cell, because it takes a long time to run it. Directly underneath it, I put a screenshot of the gridsearch's output.

param_grid = {"criterion": ["gini", "entropy"],
              "min_samples_leaf": [1, 5, 10, 25, 50, 70],
              "min_samples_split": [2, 4, 10, 12, 16, 18, 25, 35],
              "n_estimators": [100, 400, 700, 1000, 1500]}

from sklearn.model_selection import GridSearchCV, cross_val_score

rf = RandomForestClassifier(n_estimators=100, max_features='auto', oob_score=True, random_state=1, n_jobs=-1)
clf = GridSearchCV(estimator=rf, param_grid=param_grid, n_jobs=-1)
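The original write-up only shows a screenshot of the grid-search output; as a hedged sketch, running and inspecting it would typically look like this (the actual best parameters depend on the run and can take a long time to compute):

# Fit the grid search (slow: it trains one forest per parameter combination and CV fold)
clf.fit(X_train, Y_train)

# Best parameter combination and its cross-validated score
print(clf.best_params_)
print(round(clf.best_score_ * 100, 2), "%")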

Test new Parameters:


# Random Forest
random_forest = RandomForestClassifier(criterion = "gini",
min_samples_leaf = 1,
min_samples_split = 10,
n_estimators=100,
max_features='auto',
oob_score=True,
random_state=1,
n_jobs=-1)

random_forest.fit(X_train, Y_train)
Y_prediction = random_forest.predict(X_test)

random_forest.score(X_train, Y_train)

print("oob score:", round(random_forest.oob_score_, 4)*100, "%")

oob score: 83.05 %


Now that we have a proper model, we can start evaluating its performance in a more accurate way. Previously we only used accuracy and the oob score, which is just another form of accuracy. The problem is that it is more complicated to evaluate a classification model than a regression model. We will talk about this in the following section.

Further Evaluation

Confusion Matrix:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
predictions = cross_val_predict(random_forest, X_train, Y_train, cv=3)
confusion_matrix(Y_train, predictions)

The first row is about the not-survived predictions: 493 passengers were correctly classified as not survived (called true negatives) and 56 were wrongly classified as survived (false positives).

The second row is about the survived predictions: 93 passengers were wrongly classified as not survived (false negatives) and 249 were correctly classified as survived (true positives).

A confusion matrix gives you a lot of information about how well your model does, but there's a way to get even more, like computing the classifier's precision.

Precision and Recall:


from sklearn.metrics import precision_score, recall_score

print("Precision:", precision_score(Y_train, predictions))


print("Recall:",recall_score(Y_train, predictions))

Precision: 0.801948051948
Recall: 0.722222222222


Our model predicts a passenger's survival correctly about 80% of the time (precision). The recall tells us that it predicted the survival of about 72% of the people who actually survived.

F-Score

You can combine precision and recall into one score, which is called the F-score. The F-score is computed with the harmonic mean of precision and recall. Note that it assigns much more weight to low values. As a result, the classifier will only get a high F-score if both recall and precision are high.
from sklearn.metrics import f1_score
f1_score(Y_train, predictions)

0.7599999999999
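As a quick sanity check of the harmonic-mean definition (using the precision and recall values printed above; the exact digits depend on the run):

# F1 as the harmonic mean of the precision and recall shown earlier
p, r = 0.8019, 0.7222
print(2 * p * r / (p + r))   # approximately 0.76, matching f1_score(Y_train, predictions)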

There we have it, a 76% F-score. The score is not that high, because we have a recall of 72%. But unfortunately the F-score is not perfect, because it favors classifiers that have a similar precision and recall. This is a problem, because you sometimes want a high precision and sometimes a high recall. The thing is that an increasing precision sometimes results in a decreasing recall, and vice versa (depending on the threshold). This is called the precision/recall tradeoff. We will discuss this in the following section.

Precision Recall Curve

For each person the Random Forest algorithm has to classify, it computes a probability based on a function, and it classifies the person as survived (when the score is bigger than the threshold) or as not survived (when the score is smaller than the threshold). That's why the threshold plays an important part.

We will plot the precision and recall against the threshold using matplotlib:

from sklearn.metrics import precision_recall_curve

# getting the probabilities of our predictions
y_scores = random_forest.predict_proba(X_train)
y_scores = y_scores[:,1]

precision, recall, threshold = precision_recall_curve(Y_train, y_scores)

def plot_precision_and_recall(precision, recall, threshold):
    plt.plot(threshold, precision[:-1], "r-", label="precision", linewidth=5)
    plt.plot(threshold, recall[:-1], "b", label="recall", linewidth=5)
    plt.xlabel("threshold", fontsize=19)
    plt.legend(loc="upper right", fontsize=19)
    plt.ylim([0, 1])

plt.figure(figsize=(14, 7))
plot_precision_and_recall(precision, recall, threshold)
plt.show()

Above you can clearly see that the recall is falling off rapidly at a precision of around 85%. Because of that you may want to select the precision/recall tradeoff before that point, maybe at around 75%.

You are now able to choose a threshold that gives you the best precision/recall tradeoff for your current machine learning problem. If you want, for example, a precision of 80%, you can easily look at the plot and see that you would need a threshold of around 0.4. Then you could make predictions with exactly that threshold and would get the desired precision.
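For illustration only (the 0.4 cut-off is just the example value read off the plot above, not a tuned choice), applying such a custom threshold to the predicted probabilities could look like this:

# Classify with a custom probability threshold instead of the default 0.5
custom_threshold = 0.4
y_pred_custom = (y_scores >= custom_threshold).astype(int)
print("Precision:", precision_score(Y_train, y_pred_custom))
print("Recall:", recall_score(Y_train, y_pred_custom))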

Another way is to plot the precision and recall against each other:


def plot_precision_vs_recall(precision, recall):
    plt.plot(recall, precision, "g--", linewidth=2.5)
    plt.ylabel("precision", fontsize=19)
    plt.xlabel("recall", fontsize=19)
    plt.axis([0, 1.5, 0, 1.5])

plt.figure(figsize=(14, 7))
plot_precision_vs_recall(precision, recall)
plt.show()


ROC AUC Curve

Another way to evaluate and compare your binary classifier is provided by the ROC AUC Curve. This curve plots the true positive rate (also called recall) against the false positive rate (the ratio of incorrectly classified negative instances), instead of plotting precision versus recall.

from sklearn.metrics import roc_curve

# compute the true positive rate and the false positive rate
false_positive_rate, true_positive_rate, thresholds = roc_curve(Y_train, y_scores)

# plotting them against each other
def plot_roc_curve(false_positive_rate, true_positive_rate, label=None):
    plt.plot(false_positive_rate, true_positive_rate, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'r', linewidth=4)
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate (FPR)', fontsize=16)
    plt.ylabel('True Positive Rate (TPR)', fontsize=16)

plt.figure(figsize=(14, 7))
plot_roc_curve(false_positive_rate, true_positive_rate)
plt.show()


The red line in the middle represents a purely random classifier (e.g. a coin flip), and therefore your classifier should be as far away from it as possible. Our Random Forest model seems to do a good job.

Of course we also have a tradeoff here, because the classifier produces more false positives the higher the true positive rate is.

ROC AUC Score

The ROC AUC Score is the corresponding score to the ROC AUC Curve. It is simply computed by measuring the area under the curve, which is called AUC.

A classifier that is 100% correct would have a ROC AUC Score of 1, and a completely random classifier would have a score of 0.5.

from sklearn.metrics import roc_auc_score
r_a_score = roc_auc_score(Y_train, y_scores)
print("ROC-AUC-Score:", r_a_score)

ROC-AUC-Score: 0.945067587

